suggestion for a CustomDirectory
Here is a use case:
- my Lucene application is running under W2K
- I have (just) a gigabyte of RAM
- my index is quite big, let's say 1.7 GB (with a .tis of 31 MB and a .tii of
479 KB)

Using RAMDirectory is impossible; FSDirectory works but is quite slow.

Would it be possible to create a custom Directory which would load the
entire .tis file and create all the Term objects at startup? Obviously this
would require changing a lot of classes, because access to the Term
objects would no longer need the .tii file.
Since Term text is compressed in the .tis file, creating all the Terms would
require more memory than the size of the .tis. However, in most cases the
application would be faster because:
- tree access to the Terms (currently this is only the case for the Terms in the .tii)
- no need to create up to 127 temporary Term objects (with creation of
Strings and so on)
- reduced garbage collection

What do you think of this? Does it seem useful?
Would it be complicated to achieve, or would it require too many changes to
the existing sources?
I had a brief look at this part of the code but I have no clear idea of how
to do it nicely. Any suggestions?
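
To make the idea more concrete, here is a rough sketch of the kind of preloaded
term table I have in mind (PreloadedTermTable is a hypothetical class, not
something that exists in Lucene): it walks the whole term dictionary once at
startup and then answers lookups with a binary search in memory.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    // Hypothetical sketch: read every Term once at startup and serve later
    // lookups from a sorted in-memory array instead of the .tii/.tis files.
    public class PreloadedTermTable {
      private final Term[] terms;

      public PreloadedTermTable(IndexReader reader) throws IOException {
        List all = new ArrayList();
        TermEnum te = reader.terms();        // sequential walk of the term dictionary
        try {
          while (te.next()) {
            all.add(te.term());              // terms come back in sorted order
          }
        } finally {
          te.close();
        }
        terms = (Term[]) all.toArray(new Term[all.size()]);
      }

      // Binary search over the sorted array; returns -1 if the term is absent.
      public int indexOf(Term t) {
        int lo = 0, hi = terms.length - 1;
        while (lo <= hi) {
          int mid = (lo + hi) / 2;
          int cmp = terms[mid].compareTo(t);
          if (cmp < 0) lo = mid + 1;
          else if (cmp > 0) hi = mid - 1;
          else return mid;
        }
        return -1;
      }
    }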

Julien








  • Erik Hatcher at Dec 4, 2003 at 2:56 pm

    On Thursday, December 4, 2003, at 09:45 AM, Julien Nioche wrote:
    Using RAMDirectory is impossible; FSDirectory works but is quite slow.
    I'm curious.... why is FSDirectory slow? And how are you measuring
    performance?

    Erik


  • Julien Nioche at Dec 4, 2003 at 3:16 pm
    It is (relatively) slow because I send very large boolean queries. This may
    differ from the general use of Lucene, where people search for a few terms
    and want to access only the first n documents.
    Profiling my apps shows that Term access and creation consume a lot of
    time. There was a discussion about this issue when Dmitry proposed a patch
    to limit the creation of temporary Term objects. In my case this is even more
    noticeable because of the size of the queries.
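
    To give an idea of the shape of these queries, here is a made-up example
    (the field name and word list are placeholders) built directly with the
    BooleanQuery API:

        import org.apache.lucene.index.Term;
        import org.apache.lucene.search.BooleanQuery;
        import org.apache.lucene.search.Query;
        import org.apache.lucene.search.TermQuery;

        // Made-up example: a large disjunction of term queries, the kind of query
        // where term lookup starts to dominate. With this API, add(query, required,
        // prohibited) with both flags false means an optional (OR) clause.
        public class BigQueryExample {
          public static Query build(String field, String[] words) {
            BooleanQuery big = new BooleanQuery();
            for (int i = 0; i < words.length; i++) {
              big.add(new TermQuery(new Term(field, words[i])), false, false);
            }
            return big;    // each clause causes a term lookup at search time
          }
        }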

  • Scott ganyo at Dec 4, 2003 at 3:29 pm
    I'm no expert in NIO, but I've heard a lot of people claim that NIO's
    improvements over traditional I/O can be significant. I don't know if
    it would work for your case (does it have to map the entire file into
    memory?), but you might try the NIODirectory implementation that
    Francesco Bellomi sent to the list on July 6. Chances are that it
    would need a little updating at this point, but I doubt it would be
    much work.
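
    Just to illustrate what the mapping amounts to, here is plain JDK 1.4
    java.nio (this is not the NIODirectory code itself, just the underlying
    idea):

        import java.io.File;
        import java.io.RandomAccessFile;
        import java.nio.MappedByteBuffer;
        import java.nio.channels.FileChannel;

        // Map an index file read-only into memory and read bytes through the
        // buffer, instead of issuing explicit seek()/read() calls.
        public class MapFileDemo {
          public static void main(String[] args) throws Exception {
            File f = new File(args[0]);                 // e.g. a .tis file
            RandomAccessFile raf = new RandomAccessFile(f, "r");
            FileChannel channel = raf.getChannel();
            MappedByteBuffer buf =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            System.out.println("mapped " + buf.capacity() + " bytes, first byte = "
                + buf.get(0));                          // random access, no seek
            channel.close();
            raf.close();
          }
        }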

    Scott
    The reasonable man adapts himself to the world; the unreasonable one
    persists in trying to adapt the world to himself. Therefore all
    progress depends on the unreasonable man. - George Bernard Shaw


  • Doug Cutting at Dec 4, 2003 at 6:29 pm

    Julien Nioche wrote:
    However in most cases the
    application would be faster because:
    - tree access to the Terms (currently this is only the case for the Terms in the .tii)
    - no need to create up to 127 temporary Term objects (with creation of
    Strings and so on)
    - reduced garbage collection
    The .tii is already read into memory when the index is opened. So the
    only savings would be the creation of (on average) 64 temporary Term
    objects per query. Do you have any evidence that this is a substantial
    part of the computation? I'd be surprised if it was. To find out, you
    could write a program which compares the time it takes to call docFreq()
    on a set of terms (allocating the 64 temporary Terms) to what it takes
    to perform queries (doing the rest of the work). I'll bet that the
    first is substantially faster: most of the work of executing a query is
    processing the .frq and .prx files. These are bigger than the RAM on
    your machine, and so cannot be cached. Thus you'll always be doing some
    disk i/o, which will likely dominate real performance.
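
    Something along these lines would do (untested sketch; the index path,
    field name and term list are placeholders):

        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.queryParser.QueryParser;
        import org.apache.lucene.search.Hits;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.search.Query;

        // Untested sketch: time the term lookups alone (docFreq) versus running
        // the full query, to see which part actually dominates.
        public class LookupVsQuery {
          public static void main(String[] args) throws Exception {
            IndexReader reader = IndexReader.open(args[0]);    // path to the index
            String[] words = { "foo", "bar", "baz" };          // placeholder terms

            long t0 = System.currentTimeMillis();
            for (int i = 0; i < words.length; i++) {
              reader.docFreq(new Term("contents", words[i]));  // lookup only
            }
            long lookupTime = System.currentTimeMillis() - t0;

            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = QueryParser.parse("foo bar baz", "contents",
                                            new StandardAnalyzer());
            long t1 = System.currentTimeMillis();
            Hits hits = searcher.search(query);                // lookup + .frq/.prx
            long queryTime = System.currentTimeMillis() - t1;

            System.out.println("docFreq: " + lookupTime + " ms, query: " + queryTime
                + " ms, hits: " + hits.length());
            searcher.close();
          }
        }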

    Doug


  • Julien Nioche at Dec 5, 2003 at 2:11 pm
    Thank you for your answer, Doug.

    Profiling my application indicates that a lot of time is spent on the
    creation of temporary Term objects.

    This is at least true for PhraseQuery weighting, as shown in the profiling
    figures below (leading dots indicate call depth; each line gives % of time -
    time in ms - invocations - method):

    .41.2% - 473240 ms - 2802 inv. - org.apache.lucene.search.PhraseQuery$PhraseWeight.scorer
    ..40.4% - 464202 ms - 7440 inv. - org.apache.lucene.index.IndexReader.termPositions
    ...40.1% - 460378 ms - 7440 inv. - org.apache.lucene.index.SegmentTermDocs.seek
    ....40.0% - 459297 ms - 7440 inv. - org.apache.lucene.index.TermInfosReader.get
    .....39.1% - 448370 ms - 7440 inv. - org.apache.lucene.index.TermInfosReader.scanEnum
    .......34.4% - 394578 ms - 484790 inv. - org.apache.lucene.index.SegmentTermEnum.next
    .........25.8% - 296435 ms - 484790 inv. - org.apache.lucene.index.SegmentTermEnum.readTerm
    .........3.5% - 40565 ms - 969580 inv. - org.apache.lucene.store.InputStream.readVLong
    .........1.8% - 21147 ms - 484790 inv. - org.apache.lucene.store.InputStream.readVInt

    This is method time only; it doesn't take into account the time required to
    garbage-collect all those temporary objects.

    I'll test other applications I've made to confirm this.

    Scott, I tried NIODirectory and provided some benchmarks for it on the list with my
    apps. It improves the overall performance a little, but it would be interesting
    if we could choose which files to map into memory.

  • Doug Cutting at Dec 5, 2003 at 6:13 pm

    Julien Nioche wrote:
    Profiling my application indicates that a lot of time is spent on the
    creation of temporary Term objects.
    It does indeed look like term lookup is using a lot of your time. I
    don't see the Term constructor showing up as significant in your
    profile, so it looks to me like it could be just the cost of parsing the
    data, not the allocation/GC stuff. I've found that allocation of
    temporary objects doesn't really cost much with modern garbage
    collectors. The biggest cost of allocating objects is sometimes just
    the constructor.

    What sort of queries are you making against what sort of an index? It
    looks like you're probably making large queries with lots of
    low-frequency terms, in order for term lookup to be such a large factor.
    You might try sorting the terms in the query. If subsequent lookups
    are nearby in the TermInfo file then it won't have to scan as much.
    Could that help? Also, is your index optimized? An optimized index
    will drastically reduce the term lookup costs.

    If all these fail, try reducing TermInfosWriter.INDEX_INTERVAL. You'll
    have to re-create your indexes each time you change this constant. You
    might try a value like 16. This would keep the number of terms held in
    memory from being too huge (1 in 16 terms), but would reduce the average
    number scanned from 64 to 8, which would be substantial. Tell me how
    this works. If it makes a big difference, then perhaps we should make
    this parameter more easily changeable.
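
    To illustrate the sorting idea, a rough (untested) sketch: sort the terms by
    field and text before looking them up, so consecutive lookups land close
    together in the term dictionary and the scan from the nearest index entry
    stays short.

        import java.io.IOException;
        import java.util.Arrays;
        import java.util.Comparator;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;

        // Untested sketch: look terms up in sorted order. With an index interval
        // of 128 each lookup scans ~64 entries on average; with 16 it scans ~8.
        public class SortedLookup {
          private static final Comparator BY_FIELD_THEN_TEXT = new Comparator() {
            public int compare(Object a, Object b) {
              Term ta = (Term) a, tb = (Term) b;
              int c = ta.field().compareTo(tb.field());
              return c != 0 ? c : ta.text().compareTo(tb.text());
            }
          };

          public static int[] docFreqs(IndexReader reader, Term[] terms)
              throws IOException {
            Term[] sorted = (Term[]) terms.clone();
            Arrays.sort(sorted, BY_FIELD_THEN_TEXT);
            int[] freqs = new int[sorted.length];
            for (int i = 0; i < sorted.length; i++) {
              freqs[i] = reader.docFreq(sorted[i]);
            }
            return freqs;
          }
        }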

    Doug


  • Fp235-5 at Dec 5, 2003 at 8:53 pm
    Hello Doug,

    I can send you an example of the queries I'm building. They can be very large...
    Indexes are always optimized.
    All Term or Phrase Queries inside a BooleanQuery are sorted, and indeed it speeds
    things up a little. However, sorting the terms inside a PhraseQuery is quite
    limited (though possible if order does not matter). If I had a single BooleanQuery
    (let's say OR), ordering the Terms would help a lot, but unfortunately the
    queries I send are made of nested Booleans up to 3 or 4 levels deep.

    I also found that disabling the idf by using a custom Similarity object
    improves speed a little.
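
    Roughly, the idea is a Similarity whose idf factor is constant. A minimal
    sketch, written against the pluggable Similarity API (class and method names
    may differ a bit depending on the Lucene version you run):

        import org.apache.lucene.search.DefaultSimilarity;

        // Sketch: neutralise the idf factor so rare terms are not weighted higher.
        public class NoIdfSimilarity extends DefaultSimilarity {
          public float idf(int docFreq, int numDocs) {
            return 1.0f;    // ignore document frequency entirely
          }
        }

        // usage: searcher.setSimilarity(new NoIdfSimilarity());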

    If I understand correctly, reducing TermInfosWriter.INDEX_INTERVAL would create a
    bigger .tii file, and thus more Term objects would be available in memory. I'll
    try this to see what impact it has on the performance of my app.

    By "creation of temporary Term objects" I meant the whole process of finding a
    given Term (i.e. parsing, creation, comparison). Dmitry's patch improved this
    part a lot and in my case reduced the overall time by 10-15%. Sadly it has never
    been included in the sources, and it could have been useful for all kinds of users.

    The idea behind the CustomDirectory is to kill two birds with one stone:
    1/ escape an all-or-nothing approach (all on FS or all in RAM) by putting often-used
    information in memory and choosing the approach at reading time;
    2/ avoid useless creation/destruction of objects and improve access to Term
    objects (which do not have to be accessed sequentially).

    Thank you very much Doug for suggesting the use of INDEX_INTERVAL! I'll try it
    on Monday.

    Have a good weekend, everybody.

    Julien


  • Doug Cutting at Dec 5, 2003 at 10:02 pm

    fp235-5 wrote:
    Dmitry's patch improved this
    part a lot and in my case reduced the overall time by 10-15%. Sadly it has never
    been included in the sources, and it could have been useful for all kinds of users.
    If you read the thread associated with that patch you'll see that it
    made other things slower, because it removed another optimization. So
    while it would have helped some folks, it would have hurt others.

    Doug


  • Dmitry Serebrennikov at Dec 5, 2003 at 11:34 pm

    Doug Cutting wrote:

    If you read the thread associated with that patch you'll see that it
    made other things slower, because it removed another optimization. So
    while it would have helped some folks, it would have hurt others.
    I think that the removal of that other optimization was more due to my
    not understanding it well enough than to it being an integral part of
    reducing Term creation. It's been a while though, so I don't know for
    sure any more. My recollection is that both could be accommodated with
    some work. In our environment/application we do not hit this other case,
    and so we have been happily running with the reduced Term object creation
    ever since that thread.

    I think for these GC effects to really start to show up one has to have
    all of the following:
    - extensive TermEnum scanning (as in non-startsWith wildcard
    queries, or with other application code)
    - multi-CPU machine
    - large number of concurrent queries
    - large number of terms in the indexes

    From what I remember of my tests, I found that scanning the term enums
    for matching terms caused lots of Term objects to be created and
    discarded, which required excessive GC, which in turn had severe performance
    consequences on multi-CPU machines (because the CPUs produce garbage in
    parallel but the GC can only use one CPU to collect it). The multi-CPU issue
    is supposed to be fixed in Java 1.4, but I have not experimented with it
    yet, so I don't know whether that is really effective.
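
    The pattern I mean is essentially this kind of loop (illustrative only):
    every step of the enumeration materialises a fresh Term that becomes garbage
    almost immediately.

        import java.io.IOException;
        import org.apache.lucene.index.IndexReader;
        import org.apache.lucene.index.Term;
        import org.apache.lucene.index.TermEnum;

        // Illustrative scan: count the terms of a field whose text contains a
        // substring (a non-prefix wildcard match that cannot use the term index
        // and must walk the enum, creating one Term per entry).
        public class ScanTerms {
          public static int count(IndexReader reader, String field, String fragment)
              throws IOException {
            int n = 0;
            TermEnum te = reader.terms(new Term(field, ""));  // seek to field start
            try {
              do {
                Term t = te.term();
                if (t == null || !t.field().equals(field)) break;
                if (t.text().indexOf(fragment) != -1) n++;    // "*foo*"-style match
              } while (te.next());
            } finally {
              te.close();
            }
            return n;
          }
        }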

    Dmitry

Discussion Overview
group: java-dev
categories: lucene
posted: Dec 4, '03 at 2:39p
active: Dec 5, '03 at 11:34p
posts: 10
users: 5
website: lucene.apache.org
