Hi,

We are trying to index a large collection of PDF documents, with sizes
varying from a few KB to a few GB. We use Lucene 2.3.2 with JDK 1.6.0_01
(with PDFBox for text extraction), on Windows as well as CentOS Linux.
We set the java -Xms and -Xmx options, both at 1080m, even though we
have 4 GB of RAM on Windows and 32 GB on Linux, with sufficient swap
space.

With just one thread, indexing completes, though it takes time. To
speed it up, we tried a multi-threaded approach with one IndexWriter per
thread; after all the threads finish indexing, the indexes are merged.
With about 100 sample files and 10 threads, the program works well and
does speed up. But when we run it on a document collection of about
25 GB, a couple of threads simply hang while the rest complete their
indexing. The program never exits gracefully, and the threads that
appear to have died prevent the final index merge from taking place.
The program has to be terminated manually.

We tried both SimpleAnalyzer and StandardAnalyzer, with similar
results.

Any useful tips / solutions welcome.

Thanks in advance,
Sithu Sudarsan
Graduate Research Assistant, UALR
& Visiting Researcher, CDRH/OSEL

sithu.sudarsan@fda.hhs.gov
sdsudarsan@ualr.edu


  • Glen Newton at Oct 23, 2008 at 4:57 pm
    You might want to look at my indexing of 6.4 million PDF articles,
    full-text and metadata. It resulted in an 83 GB index and took 20.5
    hours to run. It uses multiple writers and is massively multithreaded.

    More info here:
    http://zzzoot.blogspot.com/2008/04/lucene-indexing-performance-benchmarks.html
    Check out the notes at the bottom for details.

    In order to make threading/queues much easier and more robust, you
    want to use: java.util.concurrent.ThreadPoolExecutor
    http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/ThreadPoolExecutor.html

    Even with these, I've also had problems like you describe. One thing
    I've found is that you need to shut the ThreadPoolExecutor down
    correctly, something like:

        threadPoolExecutor.shutdown();
        // Poll until every queued and running task has finished.
        while (!threadPoolExecutor.isTerminated()) {
            try {
                Thread.sleep(shutdownDelay);
            } catch (InterruptedException ie) {
                System.out.println(" interrupted");
            }
        }

    You also need to simplify your threading to reduce the possibility
    of deadlock.
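Putting those pieces together, here is a minimal self-contained sketch of the pattern (class and method names are illustrative, not from Glen's code; awaitTermination is a built-in alternative to the hand-rolled sleep loop above):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class IndexingPool {
    // Submit a batch of indexing tasks, then shut the pool down cleanly:
    // stop accepting new work, wait for the queue to drain, and only then
    // proceed (e.g. to a final index merge). Returns true if everything
    // finished within the timeout.
    static boolean runAndDrain(int tasks) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < tasks; i++) {
            final int docId = i;
            // Stand-in for "extract text from one PDF and index it".
            pool.submit(() -> System.out.println("indexed doc " + docId));
        }
        pool.shutdown();                                   // phase 1: no new work
        return pool.awaitTermination(1, TimeUnit.MINUTES); // phase 2: drain
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAndDrain(10) ? "all tasks finished" : "timed out");
    }
}
```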

    I hope this is useful.

    -Glen

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Mark Miller at Oct 23, 2008 at 5:31 pm
    It sounds like you might have some thread synchronization issues outside
    of Lucene. To simplify things a bit, you might try just using one
    IndexWriter. If I remember right, the IndexWriter is now pretty
    efficient, and there isn't much need to index to smaller indexes and
    then merge. There is a lot of juggling to get wrong with that approach.

    - Mark
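Since a single IndexWriter is safe to call from multiple threads, the shape Mark describes is simply N workers feeding one shared writer. A sketch of that pattern, with an AtomicInteger standing in for the real writer so it runs without Lucene on the classpath (all names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedWriterSketch {
    // N threads all "adding documents" to one shared object; with a real
    // IndexWriter each increment would be a writer.addDocument(doc) call.
    static int indexAll(int threads, int docsPerThread) throws InterruptedException {
        AtomicInteger docsIndexed = new AtomicInteger();
        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            Thread w = new Thread(() -> {
                for (int i = 0; i < docsPerThread; i++) {
                    docsIndexed.incrementAndGet(); // writer.addDocument(doc)
                }
            });
            workers.add(w);
            w.start();
        }
        for (Thread w : workers) {
            w.join(); // one index, so there is no merge step afterwards
        }
        return docsIndexed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(indexAll(8, 100)); // prints 800
    }
}
```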

  • Glen Newton at Oct 23, 2008 at 5:49 pm

    2008/10/23 Mark Miller <markrmiller@gmail.com>:
    It sounds like you might have some thread synchronization issues outside of
    Lucene. To simplify things a bit, you might try just using one IndexWriter.
    If I remember right, the IndexWriter is now pretty efficient, and there
    isn't much need to index to smaller indexes and then merge. There is a lot
    of juggling to get wrong with that approach.
    While I agree it is easier to have a single IndexWriter, if you have
    multiple cores you will get significant speed-ups with multiple
    IndexWriters, even with the impact of merging at the end.
    #IndexWriters = # physical cores is a reasonable rule of thumb.

    General speed-up estimate: # cores * 0.6 - 0.8 over single IndexWriter
    YMMV

    When I get around to it, I'll re-run my tests varying the # of
    IndexWriters & post.

    -Glen
  • Mark Miller at Oct 23, 2008 at 7:11 pm

    Glen Newton wrote:
    While I agree it is easier to have a single IndexWriter, if you have
    multiple cores you will get significant speed-ups with multiple
    IndexWriters, even with the impact of merging at the end.
    #IndexWriters = # physical cores is a reasonable rule of thumb.

    General speed-up estimate: # cores * 0.6 - 0.8 over single IndexWriter
    YMMV

    When I get around to it, I'll re-run my tests varying the # of
    IndexWriters & post.

    -Glen
    Hey Mr. McCandless, what's up with that? Can IndexWriter be made as
    efficient as using multiple writers? Where do you suppose the hold-up
    is? The number of threads doing merges? Sync contention? I hate the
    idea of multiple IndexWriters/Readers being more efficient than a
    single instance. In an ideal Lucene world, a single instance would
    hide the complexity and use the number of threads needed to match
    multiple-instance performance.
    - Mark

  • Michael McCandless at Oct 23, 2008 at 7:46 pm

    Mark Miller wrote:

    Hey Mr. McCandless, what's up with that? Can IndexWriter be made as
    efficient as using multiple writers? Where do you suppose the hold-up
    is? The number of threads doing merges? Sync contention? I hate the
    idea of multiple IndexWriters/Readers being more efficient than a
    single instance. In an ideal Lucene world, a single instance would
    hide the complexity and use the number of threads needed to match
    multiple-instance performance.
    Honestly this surprises me: I would expect a single IndexWriter with
    multiple threads to be as fast as (or faster than, considering the
    extra merge time at the end) multiple IndexWriters.

    IndexWriter's concurrency has improved a lot lately, with
    ConcurrentMergeScheduler. The only serious operation that is not
    concurrent is flushing the RAM buffer as a new segment; but in a
    well-tuned indexing process (large RAM buffer) the time spent there
    should be quite small, especially with a fast IO system.

    Actually, addIndexes is also not concurrent in that if multiple
    threads call it, only one can run at once. But normally you would
    call it with all the indices you want to add, and then the merging is
    concurrent.

    Glen, in your single IndexWriter test, is it possible there was
    accidental thread contention during document preparation or analysis?

    I do agree that we should strive to have enough concurrency in
    IndexWriter and IndexReader so that you don't get any real benefit by
    using separate instances. E.g., in 2.4.0 you can now open read-only
    IndexReaders, and on Unix you can use NIOFSDirectory, both of which
    should go a long way towards fixing IndexReader's concurrency issues.

    Mike

  • Glen Newton at Oct 23, 2008 at 7:58 pm
    2008/10/23 Michael McCandless <lucene@mikemccandless.com>:
    Glen, in your single IndexWriter test, is it possible there was accidental
    thread contention during document preparation or analysis?
    I don't think there is. I've been refining this for quite a while, and
    have done a lot of analysis and hand-checking of the threading stuff.

    I do use multiple threads for document creation: this is where much of
    the speed-up happens (at least in my case where I have a large indexed
    field for the full-text of an article: the parsing becomes a
    significant part of the process).
    I do agree that we should strive to have enough concurrency in IndexWriter
    and IndexReader so that you don't get any real benefit by using separate
    instances. Eg in 2.4.0 you can now open read-only IndexReaders, and on Unix
    you can use NIOFSDirectory, both of which should go a long ways towards
    fixing IndexReader's concurrency issue.
    My original tests were in the Spring with 2.3.1. I am planning on
    doing the new tests with 2.4 for indexing, as well as re-doing my
    concurrent query tests[1] and concurrent multiple reader tests[2]
    using the features you describe. I am sure the results will be quite
    different...

    BTW the files I am indexing were originally PDFs, but were batch
    converted to text and stored compressed on the filesystem, so except
    for GUnzipping them there is no other overhead.

    [1]http://zzzoot.blogspot.com/2008/06/simultaneous-threaded-query-lucene.html
    [2]http://zzzoot.blogspot.com/2008/06/lucene-concurrent-search-performance.html

    -glen
  • Michael McCandless at Oct 23, 2008 at 9:02 pm

    Glen Newton wrote:

    Glen, in your single IndexWriter test, is it possible there was
    accidental thread contention during document preparation or analysis?
    I don't think there is. I've been refining this for quite a while, and
    have done a lot of analysis and hand-checking of the threading stuff.
    OK.

    For your multiple-index-writer test, how much time is spent building
    the N indices vs merging them in the end?
    I do use multiple threads for document creation: this is where much of
    the speed-up happens (at least in my case where I have a large indexed
    field for the full-text of an article: the parsing becomes a
    significant part of the process).
    So in the single-index-writer vs multiple-index-writer tests, this
    part (64 threads that construct document objects) is unchanged, right?

    How do you rate limit the 64 threads? (I.e., slow them down when they
    get too far ahead of indexing.)

    If you only process documents with the 64 threads (but not index
    them), what percentage of the total time is that? I'd like to tease
    out "building documents" vs "indexing" times.
    My original tests were in the Spring with 2.3.1. I am planning on
    doing the new tests with 2.4 for indexing, as well as re-doing my
    concurrent query tests[1] and concurrent multiple reader tests[2]
    using the features you describe. I am sure the results will be quite
    different...
    Also, for the indexing tests, make sure you run with autoCommit=false.
    BTW the files I am indexing were originally PDFs, but were batch
    converted to text and stored compressed on the filesystem, so except
    for GUnzipping them there is no other overhead.
    But I'm confused: why do you need 64 threads to build up the
    documents? Gunzipping should be very low CPU cost. Are you
    pre-analyzing the fields on your documents?

    Mike

  • Sudarsan, Sithu D. at Oct 24, 2008 at 2:02 pm
    Hi Glen, Mike, Grant & Mark

    Thank you for the quick responses.

    1. Yes, I'm looking at ThreadPoolExecutor now, and looking for sample
    code to improve our multi-threaded code.

    2. We'll first try using as many IndexWriters as the number of cores
    (2 CPUs x 4 cores = 8).

    3. Yes, PDFBox exceptions have been checked independently. We have a
    prototype module that checks for PDF files containing errors;
    generally they are few, less than 1% of the total number of files.
    The PDFs have all been OCRed. Any file that throws an exception is
    quarantined in a separate folder for further analysis, so we can have
    a look at the document itself.

    4. We've tried using a larger JVM heap by specifying -Xms1800m and
    -Xmx1800m, but the JVM runs out of memory; only -Xms1080m and
    -Xmx1080m seems stable. That is strange, as we have 32 GB of RAM and
    34 GB of swap space, and typically no other application is running.
    However, our CentOS installation is 32-bit; the Ungava project seems
    to be using 64-bit.

    5. kill -QUIT on Linux does dump a stack trace, but after a few
    threads it hangs. Don't know why; we need to look into that.

    Meanwhile, we're seriously looking for ThreadPoolExecutor sample
    source code. It looks like we need to use unbounded queues.
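For what it's worth, an unbounded queue is not the only option: a ThreadPoolExecutor with a bounded queue and CallerRunsPolicy throttles submission automatically when the workers fall behind, which also addresses the rate-limiting question raised earlier in the thread. A sketch (class name, pool size, and queue size are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class BoundedIndexingPool {
    static int runJobs(int jobs) throws InterruptedException {
        AtomicInteger done = new AtomicInteger();
        // 8 workers, at most 32 queued tasks. When the queue fills up,
        // CallerRunsPolicy makes the submitting thread run the task
        // itself, which slows submission to match indexing speed.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                8, 8, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(32),
                new ThreadPoolExecutor.CallerRunsPolicy());
        for (int i = 0; i < jobs; i++) {
            // Stand-in for "extract text from one PDF and index it".
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return done.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runJobs(1000)); // prints 1000
    }
}
```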

    Really appreciate your inputs and will keep you posted on what we get.

    Now working on the code for ThreadPoolExecutor.

    Thanks and regards,
    Sithu Sudarsan
    Graduate Research Assistant, UALR
    & Visiting Researcher, CDRH/OSEL

    sithu.sudarsan@fda.hhs.gov
    sdsudarsan@ualr.edu

    -----Original Message-----
    From: Michael McCandless
    Sent: Thursday, October 23, 2008 5:01 PM
    To: java-user@lucene.apache.org; Glen Newton
    Subject: Re: Multi -threaded indexing of large number of PDF documents


  • Toke Eskildsen at Oct 24, 2008 at 2:43 pm

    On Fri, 2008-10-24 at 16:01 +0200, Sudarsan, Sithu D. wrote:
    4. We've tried using larger JVM space by defining -Xms1800m and
    -Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seems
    stable. That is strange as we have 32 GB of RAM and 34GB swap space.
    Typically no other application is running. However, the CentOS version
    is 32 bit. The Ungava project seems to be using 64 bit.
    The <2 GB heap limit for Java is a known problem under Windows. I
    don't know about CentOS, but from your description it seems the
    problem exists on that platform too. In any case, you'll never get
    above 4 GB for Java when you're running a 32-bit JVM. Might I ask why
    you're not using 64-bit on a 32 GB machine?
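A quick way to confirm which JVM you are actually running is to print its system properties (sun.arch.data.model is a Sun/HotSpot-specific property and may be unset on other JVMs; the class name is illustrative):

```java
public class JvmBitness {
    public static void main(String[] args) {
        // On Sun/HotSpot JVMs this is "32" or "64"; os.arch reports the
        // architecture the JVM was built for (e.g. i386 vs amd64).
        System.out.println("data model: " + System.getProperty("sun.arch.data.model"));
        System.out.println("os.arch:    " + System.getProperty("os.arch"));
        // The heap actually granted, regardless of what -Xmx asked for.
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.println("max heap:   " + maxMb + " MB");
    }
}
```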


  • Sudarsan, Sithu D. at Oct 24, 2008 at 2:52 pm
    There have been some earlier messages about the memory consumption of
    Lucene Documents under 64-bit JVMs (double that of 32-bit). We expect
    the index to grow very large, and we may end up maintaining more than
    one index, with different analyzers, for the same data set; hence we
    are concerned about index size as well. If there are ways to overcome
    this, we're game for the 64-bit version as well :-)

    Any ideas,


    Thanks and regards,
    Sithu Sudarsan
    Graduate Research Assistant, UALR
    & Visiting Researcher, CDRH/OSEL

    sithu.sudarsan@fda.hhs.gov
    sdsudarsan@ualr.edu

    -----Original Message-----
    From: Toke Eskildsen
    Sent: Friday, October 24, 2008 10:43 AM
    To: java-user@lucene.apache.org
    Subject: RE: Multi -threaded indexing of large number of PDF documents
    On Fri, 2008-10-24 at 16:01 +0200, Sudarsan, Sithu D. wrote:
    4. We've tried using larger JVM space by defining -Xms1800m and
    -Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m seems
    stable. That is strange as we have 32 GB of RAM and 34GB swap space.
    Typically no other application is running. However, the CentOS version
    is 32 bit. The Ungava project seems to be using 64 bit.

    The <2GB limit for Java is a known problem under Windows. I don't know
    about CentOS, but from your description it seems that the problem exists
    on that platform too. Anyway, you'll never get above 4GB for Java when
    you're running 32bit. Might I ask why you're not using 64bit for a 32GB
    machine?


  • Toke Eskildsen at Oct 24, 2008 at 4:01 pm

    Sudarsan, Sithu D. [Sithu.Sudarsan@fda.hhs.gov] wrote:
    There have been some earlier messages about the memory consumption
    issue for Lucene Documents on 64-bit JVMs (roughly double that of
    32-bit).

    All pointers are doubled, yes. While not a doubling in total RAM
    consumption, it does add substantial overhead.
    We expect the index to grow very large, and we may end up maintaining
    more than one with different analyzers for the same data set. Hence we are
    concerned about the index size as well. If there are ways to overcome
    it, we're game for 64 bit version as well :-)

    Fair enough. We've chosen to use 64bit on our 16 and 32GB machines
    and have never looked back, but our initial requirements called for ~7GB
    for each JVM, so we didn't have a choice at the time.
    Any ideas?

    Solaris should be capable of giving you ~3.5GB for JVMs with 32bit.

  • Sudarsan, Sithu D. at Nov 14, 2008 at 4:59 pm
    Hi All,

    Based on your valuable inputs, we tried a few experiments with the
    number of threads. The observation is that if the number of worker
    threads is one less than the number of cores ('main' runs as a
    separate thread, so counting 'main' the number of threads equals the
    number of cores), indexing performance reaches its optimum, with
    maximum CPU utilization. We have tried this on Windows XP as well as
    CentOS.

    As regards the number of documents to be indexed and the size of the
    documents, there seems to be some correlation between the two, but we
    are yet to ascertain it.

    At this point, we are writing to a single IndexWriter. We have not yet
    compared this against using multiple writers and merging them to
    benchmark the performance.

    Sincerely,
    Sithu D Sudarsan

    sithu.sudarsan@fda.hhs.gov
    sdsudarsan@ualr.edu

  • Michael McCandless at Oct 24, 2008 at 4:09 pm

    Sudarsan, Sithu D. wrote:
    Hi Glen, Mike, Grant & Mark

    Thank you for the quick responses.

    1. Yes, I'm looking now at ThreadPoolExecutor, and looking for sample
    code to improve the multi-threaded code.

    2. We'll try using as many IndexWriters as the number of cores first
    (which is 2 CPUs x 4 cores = 8).

    You could also try multiple threads against a single IndexWriter.
    It's simpler, and you don't have to merge indices in the end. It'd be
    great if you could post back on net throughput because I'd really like
    to understand if there is some sort of thread issue sharing a single
    IndexWriter.
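A minimal sketch of that single-IndexWriter approach with java.util.concurrent, along the lines Glen suggested earlier in the thread. An AtomicInteger stands in for the writer here so the sketch runs without Lucene on the classpath; with Lucene, IndexWriter.addDocument may be called from multiple threads.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Many worker threads, one shared sink. With Lucene you would share a
// single IndexWriter the same way -- addDocument() is safe to call
// concurrently, so no per-thread indices need merging afterwards.
public class SharedWriterSketch {
    public static int indexAll(int docs, int threads) throws InterruptedException {
        final AtomicInteger written = new AtomicInteger();  // stand-in for the IndexWriter
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < docs; i++) {
            pool.submit(new Runnable() {
                public void run() {
                    // extract/build the Document here, then:
                    written.incrementAndGet();              // writer.addDocument(doc)
                }
            });
        }
        pool.shutdown();                                    // stop accepting tasks, drain queue
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return written.get();
    }
}
```

Note the explicit shutdown()/awaitTermination() pair at the end, which Glen also flagged as necessary to avoid the executor keeping the process alive.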
    3. Yes, PDFBox exceptions have been independently checked. We have a
    prototype module to check PDF files that contain errors. Generally
    they are few, less than 1% of the total number of files. The PDFs
    have all been OCRed. Also, if any file throws an exception, it is
    quarantined in a separate folder for further analysis, to have a look
    at the document itself.

    4. We've tried using a larger JVM space by defining -Xms1800m and
    -Xmx1800m, but it runs out of memory. Only -Xms1080m and -Xmx1080m
    seem stable. That is strange, as we have 32 GB of RAM and 34 GB of
    swap space. Typically no other application is running. However, the
    CentOS version is 32 bit. The Ungava project seems to be using 64 bit.

    5. The -QUIT option on Linux does produce a stack trace, but after a
    few threads it hangs. Don't know why. Need to look at that.

    Can you post the stack traces that you did see? (Do you think those
    threads are hung?)

    Mike

  • Grant Ingersoll at Oct 23, 2008 at 7:11 pm
    Can you describe your process a bit more? Are you measuring just the
    Lucene part or the whole ingestion part as well? If it's the latter,
    how do you know the issue is in Lucene? PDF extraction is annoying at
    best and highly problematic at its worst. Not saying it isn't Lucene,
    but I've seen PDFBox and other extractors fail a lot more than I've
    seen Lucene fail.

    Are there any exceptions that you are seeing anywhere in your log files?

    If you do have extraction as part of the process, what happens if you
    separate out extraction from indexing? Does it fail when you just
    index raw text in this manner?
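One way to separate extraction from indexing, with the quarantine behaviour described elsewhere in this thread, might look like the following. This is a sketch using only the JDK: the class and method names are invented, and the extractor callback is where a PDFBox call (e.g. PDFTextStripper.getText) would go.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.concurrent.Callable;

// Runs text extraction separately from indexing; files whose extractor
// throws are moved aside instead of stalling the indexing threads.
public class Quarantine {
    public static String extractOrQuarantine(Path pdf, Path quarantineDir,
                                             Callable<String> extractor) throws IOException {
        try {
            return extractor.call();          // e.g. PDFBox text extraction
        } catch (Exception e) {
            Files.createDirectories(quarantineDir);
            Files.move(pdf, quarantineDir.resolve(pdf.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
            return null;                      // caller skips indexing this file
        }
    }
}
```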

    Cheers,
    Grant

    On Oct 23, 2008, at 12:16 PM, Sudarsan, Sithu D. wrote:


    Hi,

    We are trying to index large collection of PDF documents, sizes varying
    from few KB to few GB. Lucene 2.3.2 with jdk 1.6.0_01 (with PDFBox for
    text extraction) and on Windows as well as CentOS Linux. Used java -Xms
    and -Xmx options, both at 1080m, even though we have 4GB on Windows and
    32 GB on Linux with sufficient swap space.

    With just one thread, though it takes time, the indexing happens. To
    speed up, we tried multi-threaded approach with one Indexwriter for each
    thread. After all the threads finish their indexing, they are merged.
    With about 100 sample files and 10 threads, the program works pretty
    well and it does speed up. But, when we run on document collection of
    about 25GB, couple of threads just hang, while the rest have completed
    their indexing. The program never gracefully exits, and the threads that
    seem to have died ensure that the final index merging does not take
    place. The program needs to be manually terminated.

    Tried both with simple analyzer as well as standard analyzer, with
    similar results.

    Any useful tips / solutions welcome.

    Thanks in advance,
    Sithu Sudarsan
    Graduate Research Assistant, UALR
    & Visiting Researcher, CDRH/OSEL

    sithu.sudarsan@fda.hhs.gov
    sdsudarsan@ualr.edu
    --------------------------
    Grant Ingersoll
    Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
    http://www.lucenebootcamp.com


    Lucene Helpful Hints:
    http://wiki.apache.org/lucene-java/BasicsOfPerformance
    http://wiki.apache.org/lucene-java/LuceneFAQ

  • Michael McCandless at Oct 23, 2008 at 7:19 pm
    Also, could you kill your process with -QUIT (on Linux; maybe there is
    something analogous on Windows?) when you see the threads hanging?
    That will give a stack dump for every thread.
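For example (MyPdfIndexer is a placeholder for your main class; jps and jstack ship with the JDK):

```shell
# SIGQUIT asks the JVM for a full thread dump without terminating it;
# the dump goes to the JVM's stdout/console.
PID=$(jps -l 2>/dev/null | awk '/MyPdfIndexer/ {print $1}')
if [ -n "$PID" ]; then
  kill -QUIT "$PID"
  # jstack captures the same dump to a file for later inspection
  jstack "$PID" > threads.txt
fi
```

Hung threads typically show up in the dump as stuck in the same stack frame across repeated dumps taken a few seconds apart.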

    Mike

    Grant Ingersoll wrote:
    Can you describe your process a bit more? Are you measuring just
    the Lucene part or the whole ingestion part as well? If it's the
    latter, how do you know the issue is in Lucene? PDF extraction is
    annoying at best and highly problematic at its worst. Not saying it
    isn't Lucene, but I've seen PDFBox and other extractors fail a lot
    more than I've seen Lucene fail.

    Are there any exceptions that you are seeing anywhere in your log
    files?

    If you do have extraction as part of the process, what happens if
    you separate out extraction from indexing? Does it fail when you
    just index raw text in this manner?

    Cheers,
    Grant


Discussion Overview
group: java-user
category: lucene
posted: Oct 23, '08 at 4:17p
active: Nov 14, '08 at 4:59p
posts: 16
users: 6
website: lucene.apache.org
