FAQ
I was trying to build a Lucene index (Lucene 2.0, JDK 5) with
approximately 150,000 documents, each containing about 25 fields. After
indexing about 45,000 documents, the program crashed. It was running as
a batch job and did not log the cause of the crash. To identify the
problem, I restarted the job about 50 documents before the crash point.
At this point, the program first tries to delete each document if it is
already present in the index and then adds it. As soon as I start the
program, it aborts with a StackOverflowError while calling the
indexreader.deleteDocuments(new Term()) method (even for a document
that was indexed earlier). Here is a partial stack trace:

Exception in thread "main" java.lang.StackOverflowError
at java.lang.ref.Reference.<init>(WeakReference.java:40)
at java.lang.ThreadLocal$ThreadLocalMap$Entry.<init>(ThreadLocal.java:235)
at java.lang.ThreadLocal$ThreadLocalMap.getAfterMiss(ThreadLocal.java:375)
at java.lang.ThreadLocal$ThreadLocalMap.get(ThreadLocal.java:347)
at java.lang.ThreadLocal$ThreadLocalMap.access$000(ThreadLocal.java:225)
at java.lang.ThreadLocal.get(ThreadLocal.java:127)
at org.apache.lucene.index.TermInfosReader.getEnum(TermInfosReader.java:79)
at org.apache.lucene.index.TermInfosReader.get(TermInfosReader.java:139)
at org.apache.lucene.index.SegmentTermDocs.seek(SegmentTermDocs.java:50)
at org.apache.lucene.index.MultiTermDocs.termDocs(MultiReader.java:392)
at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:348)
at org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)

The last line [at
org.apache.lucene.index.MultiTermDocs.next(MultiReader.java:349)]
repeats another 1010 times before the program crashes.

I understand that without the actual index or the documents, it's
nearly impossible to narrow down the cause of the error. However, can
you please point to any theoretical reason why
org.apache.lucene.index.MultiTermDocs.next will go into an infinite
loop?
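
For reference, here is a minimal sketch of the failing call, assuming the
Lucene 2.0 API. The index path and the field/value in the Term are
placeholders; the real code passes the document's unique key.

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;

    // Minimal sketch (assumed index location and "id" key field)
    public class DeleteByTerm {
        public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open("/path/to/index");
            try {
                // delete every document whose "id" field equals the given key
                int deleted = reader.deleteDocuments(new Term("id", "some-key"));
                System.out.println("deleted " + deleted + " document(s)");
            } finally {
                reader.close();   // closing the reader commits the deletions
            }
        }
    }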

  • Yonik Seeley at Nov 27, 2006 at 6:17 pm

    MultiTermDocs.next() is a recursive function. From what I can see of
    it, though, it shouldn't recurse deeper than the number of segments in
    the index.
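
    As an illustration (not the actual Lucene source), here is a self-contained
    toy model of that recursive pattern: when the current segment's enumerator
    is exhausted, next() advances to the following segment and recurses, so the
    recursion depth is bounded by the number of segments. With thousands of
    segments holding no postings for the term (on top of an already-deep call
    stack), that can overflow the thread stack:

        // Toy model only -- not the Lucene implementation.
        public class MultiNextDemo {
            private final int[] docsPerSegment; // postings remaining in each "segment"
            private int segment = 0;

            MultiNextDemo(int[] docsPerSegment) { this.docsPerSegment = docsPerSegment; }

            boolean next() {
                if (segment < docsPerSegment.length && docsPerSegment[segment] > 0) {
                    docsPerSegment[segment]--;   // consume one posting from the current segment
                    return true;
                } else if (segment < docsPerSegment.length) {
                    segment++;                   // current segment exhausted: move on...
                    return next();               // ...at the cost of one stack frame per empty segment
                }
                return false;
            }

            public static void main(String[] args) {
                // a million empty "segments" reliably overflows a default thread stack
                System.out.println(new MultiNextDemo(new int[1000000]).next());
            }
        }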

    How many segments do you have in your index? What IndexWriter
    settings have you changed (mergeFactor, maxMergeDocs, etc)?

    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

  • Suman Ghosh at Nov 27, 2006 at 6:43 pm
    Here are the values:

    mergeFactor=10
    maxMergeDocs=100000
    minMergeDocs=100

    And I see your point. At the time of the crash, I have over 5000
    segments. I'll try a more conservative number and rebuild the
    index.
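
    For reference, those three values map onto the Lucene 2.0 IndexWriter
    setters roughly as below (the index path and analyzer are placeholders;
    minMergeDocs is the older name for what setMaxBufferedDocs(int) controls,
    i.e. how many documents are buffered in RAM before a new on-disk segment
    is written):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;

        // Sketch of the writer configuration, assuming Lucene 2.0
        public class WriterSettings {
            public static void main(String[] args) throws Exception {
                IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), false);
                try {
                    writer.setMergeFactor(10);        // mergeFactor
                    writer.setMaxMergeDocs(100000);   // maxMergeDocs
                    writer.setMaxBufferedDocs(100);   // minMergeDocs under its newer name
                } finally {
                    writer.close();
                }
            }
        }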

  • Yonik Seeley at Nov 27, 2006 at 6:51 pm

    Although I don't see how those settings can produce 5000 segments,
    I've developed a non-recursive patch you might want to try:
    https://issues.apache.org/jira/browse/LUCENE-729

    The patch is to the Lucene trunk (current devel version), so if you
    want to stick with Lucene 2.0, you might have to patch by hand.
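
    Not the actual LUCENE-729 patch, but as an illustration of the kind of
    rewrite it describes, the earlier toy model can be made iterative so that
    stack depth no longer grows with the number of segments:

        // Toy model only -- the earlier example rewritten without recursion.
        public class MultiNextIterative {
            private final int[] docsPerSegment;
            private int segment = 0;

            MultiNextIterative(int[] docsPerSegment) { this.docsPerSegment = docsPerSegment; }

            boolean next() {
                while (segment < docsPerSegment.length) {
                    if (docsPerSegment[segment] > 0) {
                        docsPerSegment[segment]--;   // consume one posting
                        return true;
                    }
                    segment++;   // exhausted segment: advance without an extra stack frame
                }
                return false;
            }

            public static void main(String[] args) {
                // a million empty "segments" is now handled without a StackOverflowError
                System.out.println(new MultiNextIterative(new int[1000000]).next());
            }
        }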


    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

  • Suman Ghosh at Nov 27, 2006 at 6:59 pm
    Yonik,

    Thanks for the pointer. I'll try the nightly build once the change is committed.

    Suman
  • Michael McCandless at Nov 27, 2006 at 9:37 pm

    Suman, I'd really like to understand how you're getting so many
    segments in your index. Is this (getting 5000 segments) easy to
    reproduce? Are you closing/reopening your writer every so often (e.g.,
    to delete documents or something)?

    Mike

  • Suman Ghosh at Nov 28, 2006 at 3:25 am
    Mike,
    I've not tried it yet, but I think the problem can be reproduced.
    However, it'll take a few hours to reach that threshold since my code
    also needs to extract text from some very large PDF documents to store
    in the index.

    I'll post the pseudo-code of my code tomorrow. Maybe that'll help
    point to mistakes I'm making in the logic.

    Suman
  • Suman Ghosh at Nov 28, 2006 at 5:34 pm
    Mike,

    Below is the pseudo-code of the application. A few implementation
    points to understand it:

    - We have a home-grown threadpool class that allows us to index
    multiple documents in parallel. We usually submit 200 jobs to the
    pool (usually 2-3 worker threads per pool). Once these jobs are
    finished, we submit the next set of jobs.
    - All metadata for a document comes from an Oracle database. We
    retrieve the metadata in the form of an XML document.
    - The indexing routine is designed with incremental indexing in
    mind. We intend to perform a full index build once and continue
    with incremental indexing from that point onwards (on
    average, 200-300 documents modified/added each day).

    Here is the pseudo-code. Please feel free to point out any implementation
    issues that might cause the problem.

    ====================BEGIN===========================
    Initialization:
        get database connection
        get threadpool instance

    IndexBuilder:
        for (;;) {
            get next 200 documents (from database) to be indexed. Values
            returned are a key for the document and the metadata XML

            exit if no more documents available

            // first remove the documents (to be updated) from the
            // index instead of deleting and inserting them one after
            // another
            get IndexReader instance
            for all these documents {
                use reader.deleteDocuments(new Term("KEY", document key))
            }
            finally close the IndexReader instance

            // Now add these documents to the index
            get IndexWriter instance and
                set MergeFactor = 10
                set MaxMergeDocs = 100000
                set MaxFieldLength = 500000

            for all these documents {
                add a job to the threadpool with the IndexWriter instance
                and the document metadata
            }
            and wait till the jobs are finished

            finally close the IndexWriter instance
        } // end for

        get IndexWriter instance and
            optimize index
        finally close the IndexWriter instance

    Housekeeping:
        finally close threadpool and database connection

    Threadpool job:
        read individual metadata from the XML and construct a Lucene
        Document object

        Determine if there is an associated file for the document
        (usually PDF/Word/Excel/PPT). If so, extract the text from that
        document and put it in a field called FULLTEXT for specific
        searching.

        Use the IndexWriter instance (supplied with the job) to add
        the document to the Lucene index
    ====================END===========================

    Thanks,

    Suman
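
    A compact, single-threaded Java sketch of the batch loop above, assuming
    the Lucene 2.0 API (the thread pool, the Oracle access and the PDF/Word
    text extraction are stubbed out as hypothetical helpers; field names
    follow the pseudo-code):

        import java.util.ArrayList;
        import java.util.List;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.Term;

        public class IndexBuilder {

            // Hypothetical stand-ins for the database fetch and the text extraction.
            static List<String[]> nextBatch(int size) { return new ArrayList<String[]>(); }
            static String extractFullText(String key) { return ""; }

            public static void main(String[] args) throws Exception {
                String indexDir = "/path/to/index";   // assumed location

                for (;;) {
                    List<String[]> batch = nextBatch(200);   // each entry: {key, metadataXml}
                    if (batch.isEmpty()) break;

                    // 1) Bulk-delete the batch from the index via the unique KEY field.
                    IndexReader reader = IndexReader.open(indexDir);
                    try {
                        for (String[] row : batch) {
                            reader.deleteDocuments(new Term("KEY", row[0]));
                        }
                    } finally {
                        reader.close();
                    }

                    // 2) Re-add the batch with a writer configured as in the pseudo-code.
                    IndexWriter writer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
                    try {
                        writer.setMergeFactor(10);
                        writer.setMaxMergeDocs(100000);
                        writer.setMaxFieldLength(500000);
                        for (String[] row : batch) {
                            Document doc = new Document();
                            doc.add(new Field("KEY", row[0], Field.Store.YES, Field.Index.UN_TOKENIZED));
                            doc.add(new Field("FULLTEXT", extractFullText(row[0]),
                                              Field.Store.NO, Field.Index.TOKENIZED));
                            writer.addDocument(doc);
                        }
                    } finally {
                        writer.close();
                    }
                }

                // 3) Optimize once, at the very end.
                IndexWriter optimizer = new IndexWriter(indexDir, new StandardAnalyzer(), false);
                try {
                    optimizer.optimize();
                } finally {
                    optimizer.close();
                }
            }
        }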

  • Michael McCandless at Nov 28, 2006 at 6:47 pm
    This looks correct to me. It's good that you are doing the deletes
    "in bulk" up front for each batch of documents. So I guess you
    hit the error (and the 5000 segment files) while processing batches
    of 200 docs (since you only optimize at the very end)?

    Do you search this index while it's building, or only at the
    end (after the optimize)?

    Mike

  • Suman Ghosh at Nov 28, 2006 at 8:52 pm
    The search functionality must be available during the index build. Since a
    relatively small number of documents is affected, and we plan to perform
    the build during a period that the last two years of site access data show
    to be relatively quiet, we hope that the build process will not cause any
    search issues.

    If you have any advice on a better approach to incremental builds and
    index optimization, I'd really appreciate it if you could share it.

    Thanks

    Suman
  • Michael McCandless at Nov 28, 2006 at 10:42 pm

    OK, a few more questions/notes :)

    Is your index mounted on a local filesystem (with searchers running
    in the same JVM, or in a different JVM on the same machine)? Or is the
    index on a remote filesystem like NFS? If it's NFS, you should look at
    this issue:

    http://issues.apache.org/jira/browse/LUCENE-673

    What's your policy on reopening the readers (so they see the latest
    changes to the index)? You should probably take care not to re-open
    "at a bad time" (e.g., after the deletes are done but before the new
    docs are added), or that searcher will be missing those 200 docs
    until it next re-opens...

    Actually, I'm still confused: at what point do you see an index with
    5000 segments? Is it during your initial full build of the index? In
    that case, you are not doing any deletes and you are using a single
    IndexWriter (with multiple threads), just calling addDocument many
    times?

    Oh, I see: are you using that loop to build your initial index as well?

    In that case, you re-open the writer every 200 docs. One simple
    thing you could do instead is: if this is the first full build of your
    index, deletes should not be necessary, so you can skip the deletes
    and use a single writer instance (i.e., don't close/re-open it every 200
    docs). Likely this will work around the 5000+ segment issue without
    requiring you to run the trunk (still, I'd like to know whether
    upgrading to the trunk fixes the 5000+ segments).
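
    A minimal sketch of that suggestion, assuming Lucene 2.0 and an initial
    build that starts from an empty index (the database fetch and document
    construction are hypothetical stand-ins):

        import java.util.ArrayList;
        import java.util.List;

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexWriter;

        public class InitialFullBuild {

            // Hypothetical stand-ins for the database fetch and document construction.
            static List<String[]> nextBatch(int size) { return new ArrayList<String[]>(); }
            static Document buildLuceneDocument(String[] row) { return new Document(); }

            public static void main(String[] args) throws Exception {
                // create=true: start from an empty index, so no deletes are needed
                IndexWriter writer = new IndexWriter("/path/to/index", new StandardAnalyzer(), true);
                try {
                    writer.setMergeFactor(10);
                    writer.setMaxMergeDocs(100000);
                    for (;;) {
                        List<String[]> batch = nextBatch(200);   // each entry: {key, metadataXml}
                        if (batch.isEmpty()) break;
                        for (String[] row : batch) {
                            writer.addDocument(buildLuceneDocument(row));
                        }
                    }
                    writer.optimize();   // once, at the very end
                } finally {
                    writer.close();      // a single close avoids the segment buildup from frequent reopens
                }
            }
        }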

    Since you don't have many docs changing per day, once you get your
    full index built (and get through the 5000+ segments issue) I think
    you won't hit that issue again since you optimize after adding a few
    hundred docs.

    Mike

  • Yonik Seeley at Nov 28, 2006 at 5:41 pm

    Actually, in previous versions of Lucene, it *was* possible to get way
    too many first-level segments because of the wonky logic used when the
    IndexWriter was closed. That has been fixed in the trunk with the new
    merge policy, and you will never see more than mergeFactor first-level
    segments.


    -Yonik
    http://incubator.apache.org/solr Solr, the open-source Lucene search server

  • Michael McCandless at Nov 28, 2006 at 6:43 pm

    Ahhh, OK. Suman, it seems likely that this is what you are hitting.
    Since you are planning to try the nightly build of Lucene with the
    fix for LUCENE-729 (now committed), can you watch and see whether
    you still get 5000+ segments? It would be nice to be sure that this is
    in fact the cause.

    Mike


Discussion Overview
group: java-user
category: lucene
posted: Nov 27, 2006 at 5:10 pm
active: Nov 28, 2006 at 10:42 pm
posts: 13
users: 3
website: lucene.apache.org
