Any problems with a failed IndexWriter optimize call?
Hello! I am using Lucene 1.4.3

I'm building a Lucene index that will have about 25 million documents
when it is done.
I'm adding 250,000 documents at a time.

Currently there are about 1.2 million documents in there, and I ran into
a problem. After I had added a batch of 250,000, I got a
'java.lang.OutOfMemoryError' thrown by writer.optimize(); (a standard
IndexWriter).

The exception caused my program to quit, so it never called
'writer.close();'

First, with it dying in the middle of an optimize(), is there any chance
my index is corrupted?

Second, I know I can remove the /tmp/lucene*.lock file to clear the
lock so I can add more documents, but is it safe to do that?

I've since figured out that I can pass -Xmx to the 'java' program to
increase the maximum heap size.
It was using the default of 64MB; I plan on increasing that to 175MB to
start with.
That should solve the memory problems (I can allocate more if necessary
down the line).

Lastly, when I go back, open it again, add another 250,000, and then
call optimize() again, will the failed previous optimize hurt the index
at all?
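
A minimal sketch of the batch loop being described, with the missing
cleanup added: writer.close() in a finally block, so the write lock is
released even when optimize() throws. The index path and class name are
hypothetical, and this assumes the Lucene 1.4.x API:

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;

    public class BatchIndexer {
        public static void main(String[] args) throws Exception {
            // false = open the existing index rather than create a new one
            IndexWriter writer = new IndexWriter("/path/to/index",
                                                 new StandardAnalyzer(), false);
            try {
                // ... writer.addDocument(doc) for each of the 250,000 docs ...
                writer.optimize();  // the step that threw the OutOfMemoryError
            } finally {
                writer.close();     // releases the write lock even on failure
            }
        }
    }

Run it with the larger heap discussed above, e.g. java -Xmx175m
BatchIndexer. And if a crashed process does leave a stale lock behind,
Lucene 1.4 also provides the static IndexReader.unlock(Directory) method
as an API-level alternative to deleting /tmp/lucene*.lock by hand.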




  • Chris Hostetter at Aug 1, 2005 at 6:35 am
    If I remember correctly, what you'll find when you remove the lock file
    is that your index is still usable, and from the perspective of new
    IndexWriters/IndexReaders it's in the same state it was in prior to the
    call to optimize; but from the perspective of an external observer, the
    index directory will contain a bunch of garbage files from the aborted
    optimize.

    At my work, we've taken the "safe" attitude that if you get an OOM
    exception, you should assume your index is corrupted and rebuild from
    scratch for safety -- but I think it's safe to clean up the garbage
    files manually.

    Which brings up something I meant to ask a while back: has anyone
    written any index-cleaning code? Something that locks an index (using
    the public API) and then inspects the files (using the API, or using
    low-level knowledge of the file structure) to generate a list of
    'garbage' files in the index directory that should be safely deletable?

    (I considered writing this a few months ago, but then our "play it
    safe, treat it as corrupt" policy came out, and it wasn't all that
    necessary for me.)

    It seems like it might be a handy addition to the sandbox.
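
    A rough sketch of what such a cleaner might look like against the
    Lucene 1.4.x API and file layout -- an illustration of the idea, not an
    existing tool. It takes the same "write.lock" that IndexWriter uses,
    reads the segments file (low-level knowledge; the parsing below assumes
    the 1.4 segments format), and flags any file whose base name is not a
    live segment. A real tool would also want to respect the commit lock:

        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.store.InputStream;
        import org.apache.lucene.store.Lock;
        import java.util.HashSet;
        import java.util.Set;

        public class IndexJanitor {
            public static void main(String[] args) throws Exception {
                Directory dir = FSDirectory.getDirectory(args[0], false);
                Lock lock = dir.makeLock("write.lock"); // IndexWriter's lock
                if (!lock.obtain())
                    throw new Exception("index is locked by another process");
                try {
                    // Collect segment names referenced by the segments file.
                    Set live = new HashSet();
                    InputStream in = dir.openFile("segments");
                    try {
                        int format = in.readInt();
                        if (format < 0) {   // 1.4 format: version + counter
                            in.readLong();  // version
                            in.readInt();   // name counter
                        }                   // pre-1.4: 'format' was the counter
                        int n = in.readInt();
                        for (int i = 0; i < n; i++) {
                            live.add(in.readString()); // segment name, e.g. "_a"
                            in.readInt();              // docCount, unused here
                        }
                    } finally {
                        in.close();
                    }
                    // Anything else whose base name isn't live is a candidate.
                    String[] files = dir.list();
                    for (int i = 0; i < files.length; i++) {
                        String f = files[i];
                        if (f.equals("segments") || f.equals("deletable"))
                            continue;
                        int dot = f.indexOf('.');
                        String base = (dot < 0) ? f : f.substring(0, dot);
                        if (!live.contains(base))
                            System.out.println("garbage candidate: " + f);
                    }
                } finally {
                    lock.release();
                }
            }
        }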





    -Hoss


  • Yonik Seeley at Aug 1, 2005 at 7:42 pm
    If all segments were flushed to disk (no adds since the last time the
    index writer was opened), then it seems like the index should be fine.

    The big question I have is what happens when there are in-memory
    segments at the time of an OOM exception during an optimize? Is data
    loss possible?

    -Yonik
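
    One defensive pattern that sidesteps this question (a suggestion, not
    something from the thread): close and reopen the writer before the
    risky optimize, so every buffered segment is flushed to disk first; an
    OOM during the optimize can then at worst leave garbage files behind
    rather than lose buffered documents. Variable names are hypothetical:

        writer.close();  // flush any buffered in-memory segments to disk
        writer = new IndexWriter("/path/to/index", analyzer, false);
        writer.optimize();
        writer.close();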

  • Robert Schultz at Aug 1, 2005 at 8:12 pm
    I am going to play it safe.
    I'm going to wipe the index files and start over (I've only put about
    2 days of processing time into it so far).

    This time with a maximum heap of over 512MB.

  • Dan Armbrust at Aug 1, 2005 at 8:34 pm
    May I suggest:

    Don't call optimize. You don't need it. Here is my approach:

    Keep each one of your 250,000-document indexes separate - run your
    batch, build the index, and then just close it. Don't try to optimize
    it. Put each 250,000-document batch into a different folder.

    Now, when you have finished building your entire index, you will have
    a bunch of separate unoptimized Lucene indexes. Open up a new, blank
    index and merge all of the others into it. The end result will be a
    single large (already optimized) index.


    This approach has several benefits (the merge step is sketched below):
    - You can keep the indexing parameters tuned for speed without
      running into out-of-file-handles issues.
    - If a failure occurs, you only have to redo that batch, not restart
      the entire process.
    - You avoid the unnecessary I/O of constantly rewriting your data
      with optimize() calls.
    - You can very easily break up the indexing across multiple machines.
    - If a failure occurs while merging the indexes together, you don't
      lose anything, since you are only reading the existing indexes; you
      know they will all still be valid.

    I actually wrote a wrapper for Lucene that does all of this under the
    covers. At some point, I should get it released open source :)
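
    A minimal sketch of the merge step described above, assuming the
    Lucene 1.4.x API; the class name and directory layout are
    hypothetical. addIndexes() optimizes the target index as part of the
    merge, which is why the result comes out already optimized:

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        public class MergeBatches {
            public static void main(String[] args) throws Exception {
                // args[0] = target dir, args[1..] = per-batch index dirs
                IndexWriter writer =
                    new IndexWriter(args[0], new StandardAnalyzer(), true);
                try {
                    Directory[] batches = new Directory[args.length - 1];
                    for (int i = 1; i < args.length; i++)
                        batches[i - 1] = FSDirectory.getDirectory(args[i], false);
                    writer.addIndexes(batches); // merges and optimizes
                } finally {
                    writer.close();
                }
            }
        }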

    Dan

    --
    ****************************
    Daniel Armbrust
    Biomedical Informatics
    Mayo Clinic Rochester
    daniel.armbrust(at)mayo.edu
    http://informatics.mayo.edu/


  • Tony Schwartz at Aug 1, 2005 at 1:15 pm
    Your index should be fine. You could use Luke if you want to remove
    any dangling files not in use. Just run optimize again to fix it all
    up... You might want to allocate more memory than 175MB, though.
    Depending on your document sizes and how efficiently and fast you want
    Lucene to index the data, you will want to give it plenty of memory to
    work with.
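
    For reference, the 1.4-era knobs that trade heap for indexing speed (a
    summary, not from the post above; these are public fields on
    IndexWriter, if I recall the 1.4 API correctly):

        writer.mergeFactor = 10;    // segments merged at once; higher means
                                    // faster indexing but more open files
        writer.minMergeDocs = 1000; // docs buffered in RAM before a new
                                    // on-disk segment is written; higher
                                    // uses more heap but indexes faster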

    Tony Schwartz
    tony@simpleobjects.com




