I know this is one of those "How long is a piece of string?" questions
but I'm curious as to the order of magnitude of indexing performance.

http://lucene.apache.org/java/docs/benchmarks.html

seems to indicate that about 100-120 docs/s is pretty good for average-sized
documents (say, an email or something). Or is that ludicrously out of
date for 2.3.x?

Simon

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


  • Grant Ingersoll at Jun 3, 2008 at 12:28 pm
    Of course it depends on analysis, etc., but my experience has been at
    least 2x faster than those numbers, and up to 4-5x faster depending on
    the docs. You can use the contrib/benchmark package to try it for
    yourself, of course!
    --------------------------
    Grant Ingersoll
    http://www.lucidimagination.com

    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ
  • Konstantyn Smirnov at Jun 6, 2008 at 9:01 am
    my 2 cents

    My indexing module handles documents with ~15 fields, most of which must
    be indexed and stored. Using the GermanAnalyzer, I saw the following times:

    10 MB ~ 3400 docs --> 6-8 sec
    70 MB ~ 50000 docs --> 65 sec

    So that gives me roughly 500-760 docs/s.
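
The arithmetic behind that range can be checked with a short sketch. It is written in Python only for brevity, and the function name `docs_per_second` is hypothetical. Taken literally, the timings above work out to about 425-770 docs/s, so the quoted range reads as a rounded estimate.

```python
def docs_per_second(docs: int, seconds: float) -> float:
    """Return indexing throughput in documents per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return docs / seconds


# Figures quoted in the post: (label, documents, elapsed seconds)
runs = [
    ("10 MB batch, 6 s", 3400, 6),
    ("10 MB batch, 8 s", 3400, 8),
    ("70 MB batch, 65 s", 50000, 65),
]

for label, docs, seconds in runs:
    print(f"{label}: {docs_per_second(docs, seconds):.0f} docs/s")
# prints:
# 10 MB batch, 6 s: 567 docs/s
# 10 MB batch, 8 s: 425 docs/s
# 70 MB batch, 65 s: 769 docs/s
```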
  • Otis Gospodnetic at Jun 3, 2008 at 7:42 pm
    There is really no "typical". I'm playing with Hadoop (HDFS) and Solr at the moment, for example, and I'm seeing an indexing rate of about 70 docs/second. However, the bottleneck there is not indexing; it is reading data from HDFS (over the network).


    I've also seen 500+ docs/second.

    It depends on many factors:
    how fast you can read your data source, how complex your analysis is, the size of the documents and the number of fields, whether fields are stored or only indexed, the IndexWriter settings for segment merging and memory usage, and, of course, the hardware.
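
One way to tell which of these factors dominates is to time the read phase and the index phase separately. The following is a minimal sketch in Python, not Lucene code: `profile_indexing`, `slow_source`, and `fake_index` are hypothetical names, and the sleep merely stands in for a slow data source such as a network read.

```python
import time
from typing import Callable, Dict, Iterable, Iterator

_DONE = object()  # sentinel marking the end of the source


def profile_indexing(source: Iterable[str],
                     index_doc: Callable[[str], None]) -> Dict[str, float]:
    """Time the read phase and the index phase separately."""
    read_s = 0.0
    index_s = 0.0
    docs = 0
    it = iter(source)
    while True:
        t0 = time.perf_counter()
        doc = next(it, _DONE)      # time spent waiting on the data source
        t1 = time.perf_counter()
        read_s += t1 - t0
        if doc is _DONE:
            break
        index_doc(doc)             # time spent analyzing and indexing
        index_s += time.perf_counter() - t1
        docs += 1
    total = read_s + index_s
    return {
        "docs": docs,
        "read_s": read_s,
        "index_s": index_s,
        "docs_per_s": docs / total if total > 0 else 0.0,
    }


def slow_source(n: int, delay_s: float) -> Iterator[str]:
    """Hypothetical stand-in for a slow reader (for example, a network fetch)."""
    for i in range(n):
        time.sleep(delay_s)
        yield f"document number {i}"


def fake_index(doc: str) -> None:
    """Hypothetical stand-in for analysis plus adding the document to an index."""
    doc.lower().split()


result = profile_indexing(slow_source(20, 0.005), fake_index)
print(f"{result['docs']} docs, read {result['read_s']:.3f}s, "
      f"index {result['index_s']:.3f}s, {result['docs_per_s']:.0f} docs/s")
```

If the read time dwarfs the index time, as it does with this artificial source, tuning the analyzer or the IndexWriter will not raise the docs/second figure much.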

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

    ----- Original Message ----
    From: Simon Wistow <simon@thegestalt.org>
    To: Lucene <java-user@lucene.apache.org>
    Sent: Monday, June 2, 2008 7:40:52 PM
    Subject: Typical Indexing performance

  • Marcelo Ochoa at Jun 3, 2008 at 7:57 pm
    Hi:
    Here is my latest test of the Oracle-Lucene integration (Lucene 2.3.2
    binary dist. / Oracle 11g):
    http://marceloochoa.blogspot.com/2008/06/new-binary-release-of-lucene-oracle.html
    It was tested against Spanish Wikipedia dumps using the Wikipedia Analyzer/Tokenizer.
    There are separate times for the upload process and for the indexing process.
    The upload process parses the Wikipedia XML dumps and inserts them
    into the Oracle XMLDB repository, which transforms them into an
    object-relational structure:
    http://marceloochoa.blogspot.com/2007/12/uploading-wikipedia-dumps-to-oracle.html
    The indexing process creates a Lucene Domain Index with
    something like this:

    create index pages_lidx_all on pages p (value(p))
    indextype is Lucene.LuceneIndex
    parameters('PopulateIndex:false;SyncMode:Deferred;LogLevel:WARNING;Analyzer:org.apache.lucene.analysis.SpanishWikipediaAnalyzer;ExtraCols:extractValue(object_value,''/page/title'')
    "title",extractValue(object_value,''/page/revision/comment'')
    "comment",extract(object_value,''/page/revision/text/text()'')
    "text",extractValue(object_value,''/page/revision/timestamp'')
    "revisionDate";FormatCols:revisionDate(day);IncludeMasterColumn:false;LobStorageParameters:PCTVERSION
    0 ENABLE STORAGE IN ROW CHUNK 32768 CACHE READS
    FILESYSTEM_LIKE_LOGGING');

    This indexes the title, comment, text and timestamp XML nodes and the
    Oracle ROWID as separate Lucene fields.
    Best regards, Marcelo.


    --
    Marcelo F. Ochoa
    http://marceloochoa.blogspot.com/
    http://marcelo.ochoa.googlepages.com/home
    ______________
    Do you Know DBPrism? Look @ DB Prism's Web Site
    http://www.dbprism.com.ar/index.html
    More info?
    Chapter 17 of the book "Programming the Oracle Database using Java &
    Web Services"
    http://www.amazon.com/gp/product/1555583296/
    Chapter 21 of the book "Professional XML Databases" - Wrox Press
    http://www.amazon.com/gp/product/1861003587/
    Chapter 8 of the book "Oracle & Open Source" - O'Reilly
    http://www.oreilly.com/catalog/oracleopen/
