Appropriate disk optimization for large index?
Hi! I'm using Lucene 2.3.2 to store a relatively large index of HTML
documents. I'm storing ~150 million documents, taking up 150 GB of space.

I index the HTML text, but I store only the primary-key information that
lets me retrieve the original document later. So my stored documents are
small, but I still have to store the index itself, and I imagine that's
what takes up almost all of the space.

Users search only the HTML in their own collection, so I partition the data
into multiple indexes (~2000) such that a given user only has to search one
of the 2000 indexes to reach their documents. I also have queries that span
all 2000 indexes.

So I have 2000 indexes full of small documents, but a relatively large
total index size.

My question is what sort of disk to buy. Using "dstat", I've determined
that the disk is clearly the bottleneck. Nearly all the time I spend
indexing "chunks" of documents and committing them to disk is spent waiting
on I/O operations. I spawn multiple threads to access the various index
writers so as to minimize I/O wait time, but disk always ends up being the
problem.

Currently, I've got 7200rpm SATA drives (RAID 0), but I've also got 15k SAS
drives (RAID 0 as well) on hand.

Specifically, what's the access pattern of Lucene when it comes to indexing
documents, merging segments, and eventually optimizing the index (given
what I've mentioned about document count and document size)?

Am I better off with a drive that has a faster seek time, or do I need to
optimize for sustained throughput? How does the way Lucene lays out indexes
on disk affect this?

If it helps, my merge factor is 50, and I use the compound file format
because I run out of file descriptors otherwise.
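
Those two settings map onto the Lucene 2.3 IndexWriter API roughly as
follows (a minimal sketch; the index path and analyzer choice are
illustrative):

    import java.io.IOException;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.FSDirectory;

    public class WriterSetup {
        public static void main(String[] args) throws IOException {
            // Open one of the ~2000 per-user indexes (path is illustrative).
            FSDirectory dir = FSDirectory.getDirectory("/data/indexes/user-0042");
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
            writer.setMergeFactor(50);       // merge 50 segments at a time
            writer.setUseCompoundFile(true); // one .cfs per segment: far fewer open files
            writer.close();
        }
    }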

Thanks for your help,
Matt


  • Otis Gospodnetic at Aug 18, 2008 at 6:47 pm
    Matt,

    One important bit that you didn't mention is your RAM buffer setting (setRAMBufferSizeMB / setMaxBufferedDocs). If it's too low you will see lots of IO. Increasing it means less IO but more JVM heap needed. Is your disk IO caused by searches or by indexing only?
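
    For concreteness, given an open org.apache.lucene.index.IndexWriter
    named writer, the Lucene 2.3 knobs look roughly like this (a sketch;
    the 256 MB figure is only an example):

        // Flush a new segment only after ~256 MB of buffered documents:
        // fewer, larger flushes mean less disk I/O but a bigger JVM heap.
        writer.setRAMBufferSizeMB(256.0);

        // Or trigger flushes by document count instead of RAM usage:
        // writer.setMaxBufferedDocs(100000);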

    Otis
    --
    Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


  • Mattspitz at Aug 18, 2008 at 7:30 pm
    So, my indexing is done in "rounds", where I pull a bunch of documents from
    the database, index them, and flush them to disk. I manually call "flush()"
    because I need to ensure that what's on disk is consistent with what I've
    pulled from the database.

    On each round, then, I flush to disk. I set the buffer so that it doesn't
    flush any segments until I manually call flush(), so I incur I/O only once
    per "round".
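
    A minimal sketch of that pattern on 2.3.2 (IndexWriter.DISABLE_AUTO_FLUSH
    is the real constant; the database-fetching helper is hypothetical):

        import java.util.List;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.index.IndexWriter;

        // Turn off both flush triggers so no segment is written mid-round.
        writer.setRAMBufferSizeMB(IndexWriter.DISABLE_AUTO_FLUSH);
        writer.setMaxBufferedDocs(IndexWriter.DISABLE_AUTO_FLUSH);

        List<Document> round = fetchRoundFromDatabase(); // hypothetical helper
        for (Document doc : round) {
            writer.addDocument(doc);
        }
        writer.flush(); // one burst of I/O per round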

    Thanks for your help,
    Matt


  • Michael McCandless at Aug 18, 2008 at 8:55 pm

    Make sure once you upgrade to 2.4 (or trunk) that you switch to
    commit() instead of flush() because flush() doesn't sync the index
    files, so if the hardware or OS crashes your index will not match
    what's in the DB (and/or may become corrupt).
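
    In code, the end of each round on 2.4 would look roughly like this
    (a sketch, given the same writer):

        // Lucene 2.4+: commit() flushes buffered docs and syncs the index
        // files, so once it returns, the on-disk index is consistent with
        // what was pulled from the DB. 2.3.2's flush() writes but does not
        // sync, so a crash can leave the index stale or corrupt.
        writer.commit();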

    I'm not sure which of seek time vs throughput is best to optimize in
    your IO system. On flushing a segment you'd likely want the fastest
    throughput, assuming the filesystem is able to assign many adjacent
    blocks to the files being flushed. During merging (and optimize) I
    think seek time is most important, because Lucene reads from 50 (your
    mergeFactor) files at once and then writes to one or two files. But,
    this (at least normal merging) is typically done concurrently with
    adding documents, so the time consumed may not matter in the net
    runtime of the overall indexing process. When a flush happens during
    a merge, seek time is likely most important.

    Mike

  • Mattspitz at Aug 18, 2008 at 9:18 pm
    Mike-

    Are the index files synced on writer.close()?

    Thank you so much for your help. I think the seek time is the issue,
    especially considering the high merge factor and the fact that the segments
    are scattered all over the disk.

    Will a faster disk cache affect optimizing and merging? I don't really
    have a sense for which parts of the segments are kept in memory during a
    merge. It doesn't make sense to me that Lucene would pull all of the
    segments into memory to merge them, but I don't really know how it works.

    Thank you so much,
    Matt

  • Michael McCandless at Aug 18, 2008 at 9:24 pm

    mattspitz wrote:
    > Are the index files synced on writer.close()?

    No, they aren't. Not until 2.4 (trunk).

    > I think the seek time is the issue, especially considering the high
    > merge factor and the fact that the segments are scattered all over
    > the disk.

    You're welcome! I agree: optimizing seek time seems likely to be the
    biggest win.

    > Will a faster disk cache affect optimizing and merging? I don't
    > really have a sense for which parts of the segments are kept in
    > memory during a merge.

    Segments aren't kept in memory during merging... it's more like a
    cursor that sweeps through each of the files for the 50 segments being
    merged. Lucene does buffer its reads, so we read a chunk into RAM and
    then pull bits off that chunk. And the OS does readahead. But
    otherwise it's all on disk, and we make a single sweep through each of
    the segments being merged.

    So I wouldn't expect the disk cache's performance to impact Lucene
    during merging or flushing.
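
    To illustrate that access pattern (this is not Lucene's actual code),
    here is a k-way merge that reads every input sequentially through a
    small buffer and writes a single output, so RAM use stays constant no
    matter how large the segments are:

        import java.io.*;
        import java.util.*;

        public class KWayMerge {
            // A cursor over one sorted input file: the current line plus
            // a buffered reader sweeping sequentially through the file.
            static class Cursor implements Comparable<Cursor> {
                String line;
                final BufferedReader in;
                Cursor(File f) throws IOException {
                    in = new BufferedReader(new FileReader(f), 64 * 1024);
                    line = in.readLine();
                }
                boolean advance() throws IOException {
                    return (line = in.readLine()) != null;
                }
                public int compareTo(Cursor o) { return line.compareTo(o.line); }
            }

            // Merge k sorted files in one sequential sweep of each input.
            public static void merge(File[] inputs, File output) throws IOException {
                PriorityQueue<Cursor> heap = new PriorityQueue<Cursor>();
                for (File f : inputs) {
                    Cursor c = new Cursor(f);
                    if (c.line != null) heap.add(c); else c.in.close();
                }
                BufferedWriter out = new BufferedWriter(new FileWriter(output), 64 * 1024);
                while (!heap.isEmpty()) {
                    Cursor c = heap.poll();
                    out.write(c.line);
                    out.newLine();
                    if (c.advance()) heap.add(c); // only the buffers live in RAM
                    else c.in.close();
                }
                out.close();
            }
        }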

    Mike

  • Mattspitz at Aug 18, 2008 at 9:28 pm
    Thanks for your replies!

    Is there no way to ensure consistency on the disk with 2.3.2?

    This is a little off-topic, but is it worth upgrading to 2.4 right now if
    I've got a very stable system already implemented with 2.3.2? I don't
    really want to introduce oddities because I'm using an "unfinished" version
    of Lucene. Is there a rough date for 2.4's release? I poked around the
    website and couldn't find one.

    Thanks,
    Matt


  • Michael McCandless at Aug 18, 2008 at 9:37 pm

    mattspitz wrote:
    > Is there no way to ensure consistency on the disk with 2.3.2?

    Unfortunately no.

    > Is it worth upgrading to 2.4 right now if I've got a very stable
    > system already implemented with 2.3.2? Is there a rough date for
    > 2.4's release?

    The trunk should be a drop-in replacement for 2.3.2, API-wise. But it
    is not yet released, and there have been a lot of changes, so it's
    entirely possible there are bugs lurking.

    Unfortunately no date yet, but I agree we should do a release soon.
    It's been a good while (7 months) since 2.3.0 was released, and there
    are a number of improvements in 2.4. We talked a while back on the
    dev list about doing releases more frequently. I'll start a thread on
    the dev list to see what people think...

    Mike

  • Mattspitz at Aug 18, 2008 at 9:39 pm
    Mmmkay. I think I'll wait, then.

    Thank you so much for your help. I really appreciate it.

    Also, I really dig Lucene, so thanks for your hard work!

    -Matt

