Hi All,

I have the following problem: we get an OutOfMemoryException when
searching indexes that are 20 - 40 GB in size and contain 10 - 15
million docs.
Our searches run a query that matches all the documents, but we
DO NOT fetch all the results - we fetch only 100 of them. We also sort
using the Sort class, and we really need the results to be sorted
on a field that is chosen arbitrarily by the user.
So my questions are:
1) Does Lucene have any restrictions on the size of index it can
search?
2) Is there a way to estimate beforehand how much RAM Lucene will use
for a certain query? I mean, what exactly does this memory usage
depend on - the index size, the number of docs in the index, the size of those docs...
3) Is there a way to control the RAM used? For example, to not
exceed 1 GB of memory when searching?
4) Is there a special approach for working with such big indexes (we
expect 60 - 80 GB indexes in the near future)?


Best Regards,
Ivan

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

  • Erick Erickson at Apr 6, 2007 at 2:31 pm
    I can only shed a little light on a couple of points, see below.
    On 4/6/07, Ivan Vasilev wrote:

    Hi All,

    I have the following problem - we have OutOfMemoryException when
    searching on the indexes that are of size 20 - 40 GB and contain 10 - 15
    million docs.
    When we make searches we perform query that match all the results but we
    DO NOT fetch all the results - we fetch 100 of them. We also make
    sorting by using the class Sort and we really need result to be sorted
    on a field that is randomly defined by the user.
    So my questions are:

    The problem I suspect is the sorting. As I understand, Lucene
    builds internal caches for sorting and I suspect that this is the root
    of your problem. You can test this by trying your problem queries
    without sorting.

    How much memory are you giving the JVM?
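
    One quick way to answer the heap question is to ask the JVM itself. The
    sketch below uses only java.lang.Runtime; the class name HeapReport is
    made up for illustration, and it has to run inside (or with the same
    -Xmx setting as) the search application to say anything useful.

```java
// Prints the heap limits of the JVM it runs in (illustrative helper, not part of Lucene).
final class HeapReport {
    /** Converts a byte count to whole mebibytes, rounding down. */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory(): the most heap the JVM will try to use (roughly the -Xmx value).
        System.out.println("max heap   MiB: " + toMiB(rt.maxMemory()));
        // totalMemory(): heap currently reserved by the JVM.
        System.out.println("total heap MiB: " + toMiB(rt.totalMemory()));
        // Reserved minus free gives an approximation of what is in use right now.
        System.out.println("used heap  MiB: " + toMiB(rt.totalMemory() - rt.freeMemory()));
    }
}
```

    Comparing the "used" figure before and after the first sorted search,
    and again for the same query without a Sort, is one way to see whether
    sorting accounts for the growth.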


    1) Have Lucene some restrictions on index size on which it can perform
    searches?

    No theoretical ones that I know of, but practical ones at times. As
    you are finding.

    2) Is there some approach to estimate beforehand the RAM that will use
    Lucene for certain query? I mean on what exactly depends this memory
    usage - on index size, on docs stored in the index, on size of this
    docs...


    I'd like to know this myself. Hint, hint, hint....

  • Chris Hostetter at Apr 6, 2007 at 6:06 pm
    : The problem I suspect is the sorting. As I understand, Lucene
    : builds internal caches for sorting and I suspect that this is the root
    : of your problem. You can test this by trying your problem queries
    : without sorting.

    if Sorting really is the cause of your problems, you may want to try out
    this patch...

    https://issues.apache.org/jira/browse/LUCENE-769

    ...it *may* be advantageous in situations where memory is your most
    constrained resource, and you are willing to sacrifice speed for sorting
    ... it looks promising to me, but there haven't been any convincing
    use cases/benchmarks of people finding it beneficial (other than the
    original contributor)

    if you do try it, please post your comments in the issue.



    -Hoss


  • Bublic Online at Apr 6, 2007 at 7:04 pm
    Hi Ivan, Chris and all!

    I'm the contributor of LUCENE-769, and I recommend it too :)
    OutOfMemory errors were one of my main reasons for writing it.

    Regards,
    Artem Vasiliev
  • Nilesh Bansal at Apr 8, 2007 at 5:03 am
    This seems like a very useful patch. Our application searches over 50
    million docs in a 40 GB index. We only have simple conjunctive queries
    on a single field. Currently, the command-line search program that
    prints the top 10 results requires at least 200 MB of memory. Our web
    application, which searches the same index, crashes with OOM when there
    are more than 10-12 concurrent requests (heap size set to 3 GB). Will
    this patch help in such a situation?

    It seems that there are some issues with this patch and that this is the
    reason it is not yet in the main source tree. Can someone please
    summarize the downsides of this approach? It would be
    really good if Lucene had it in the main source tree with a flag to turn
    this feature ON or OFF.

    Bublic, can you tell me what exactly I need to do if I want to use this patch?

    thanks
    Nilesh

    --
    Nilesh Bansal.
    http://queens.db.toronto.edu/~nilesh/

  • Artem at Apr 8, 2007 at 5:33 pm
    Hello Nilesh,

    Sunday, April 8, 2007, 9:03:06 AM, you wrote:

    NB> This seems like a very useful patch. Our application searches over 50
    NB> million doc in a 40GB index. We only have simple conjunctive queries
    NB> on a single field. Currently, the command line search program that
    NB> prints top-10 results requires at least 200mb memory. Our web
    NB> application, that searches the same index crashes with OOM when there
    NB> are more than 10-12 concurrent requests (heap size set to 3GB). Will
    NB> this patch help in such a situation?

    I must note that my patch only helps in Lucene OOM situations related to
    _sorted_ queries. If that is your case, then I think yes, it will help.

    In my app the index is currently not so big, only 1 million docs. With the patch applied,
    a sample query returning the first 30 of 120,000 sorted results made memory consumption
    rise from 18 MB to 20 MB according to jconsole.

    NB> It seems that there are some issues with this patch and that was the
    NB> reason it is not yet in the main source tree. Can someone please
    NB> summarize what are the downsides of using such an approach. It will be
    NB> really good if Lucene had it in main source tree and a flag to turn ON
    NB> or OFF this feature.

    First, there's a performance cost (for the second and subsequent queries with the
    same IndexSearcher). In the default implementation, all the indexed values of the sort
    field are cached during the first sorted search. This takes memory and time,
    but subsequent queries run fast as long as there is still some memory left. My implementation
    doesn't cache field values; it loads them from the respective documents on the fly,
    so it's slower but takes less memory. The query mentioned above took about 3 s (with
    rather small sort field values, about 20-100 chars).
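
    The trade-off can be illustrated with a small stand-alone simulation. It
    is not Lucene code and not the code of the patch: FakeIndex, CachedSorter
    and OnTheFlySorter are hypothetical names, the "index" is a plain array,
    and the assumption that each hit's value is loaded once per query is
    made only to keep the sketch simple.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SortStrategies {
    /** Stand-in for an index: one stored sort value per document id. */
    static final class FakeIndex {
        private final String[] values;
        long loads = 0; // counts simulated reads from the index

        FakeIndex(String[] values) { this.values = values; }
        int maxDoc() { return values.length; }
        String load(int docId) { loads++; return values[docId]; }
    }

    /** Reads the value of every document on first use, then reuses the cache. */
    static final class CachedSorter {
        private final FakeIndex index;
        private String[] cache; // stays in memory as long as the sorter lives

        CachedSorter(FakeIndex index) { this.index = index; }

        List<Integer> sort(List<Integer> hits) {
            if (cache == null) {
                cache = new String[index.maxDoc()];
                for (int doc = 0; doc < cache.length; doc++) {
                    cache[doc] = index.load(doc);
                }
            }
            List<Integer> sorted = new ArrayList<>(hits);
            sorted.sort((a, b) -> cache[a].compareTo(cache[b]));
            return sorted;
        }
    }

    /** Reads only the values of the current hits and forgets them afterwards. */
    static final class OnTheFlySorter {
        private final FakeIndex index;

        OnTheFlySorter(FakeIndex index) { this.index = index; }

        List<Integer> sort(List<Integer> hits) {
            Map<Integer, String> perQuery = new HashMap<>(); // discarded on return
            for (int doc : hits) {
                perQuery.put(doc, index.load(doc));
            }
            List<Integer> sorted = new ArrayList<>(hits);
            sorted.sort((a, b) -> perQuery.get(a).compareTo(perQuery.get(b)));
            return sorted;
        }
    }

    public static void main(String[] args) {
        String[] values = new String[1000];
        for (int i = 0; i < values.length; i++) {
            values[i] = String.format("%04d", 999 - i);
        }
        List<Integer> hits = Arrays.asList(10, 20, 30);

        FakeIndex a = new FakeIndex(values);
        new CachedSorter(a).sort(hits);
        FakeIndex b = new FakeIndex(values);
        new OnTheFlySorter(b).sort(hits);

        System.out.println("cached loads:     " + a.loads); // 1000
        System.out.println("on-the-fly loads: " + b.loads); // 3
    }
}
```

    In this model the cached sorter pays for every document in the index
    once and keeps the values, while the on-the-fly sorter pays for every
    hit on every query and keeps nothing.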
    There's also a limitation: my implementation requires the sort field to be
    "stored" in the index (Field.Store.YES on the Field passed to doc.add()).

    NB> Bublic, can you tell me what exactly I need to do if I want to use this patch?

    You can add the StoredFieldSortFactory class source file to your sources and
    then call StoredFieldSortFactory.create(sortFieldName, sortDescending) to get a
    Sort object for the sorted query.
    The StoredFieldSortFactory source file can be extracted from the LUCENE-769 patch or
    from the sharehound sources: http://sharehound.cvs.sourceforge.net/*checkout*/sharehound/jNetCrawler/src/java/org/apache/lucene/search/StoredFieldSortFactory.java

    Regards,
    Artem

    --
    Best regards,
    Artem mailto:abublic@gmail.com


  • Nilesh Bansal at Apr 8, 2007 at 6:59 pm

    On 4/8/07, Artem wrote:
    > I must note that my patch only helps in lucene-OOM situations related to
    > _sorted_ queries. If this is your case than I think yes it will help.

    Probably a newbie question, but can you please explain what sorted
    queries mean? Is a simple keyword search a sorted query?

    --
    Nilesh Bansal.
    http://queens.db.toronto.edu/~nilesh/

  • Erick Erickson at Apr 8, 2007 at 8:53 pm
    It *is* a bit confusing, since every search is sorted, kinda....

    Practically, a sorted query is one where you call one of the search
    methods (on, say, Searcher) with a Sort object, which sorts
    on one or more of the fields in your index (which fields are
    used is specified by the (array of) SortField objects in the Sort).

    Searches that do NOT have a Sort object default to using
    relevance ranking, which is not nearly so memory-intensive. This is,
    after all, one float or so....

    The difference is that the values of the fields referenced in the Sort object
    have to be read into memory and compared against all the other
    values, and in aggregate they may be quite large memory-wise.
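
    To get a feeling for "quite large", here is a back-of-the-envelope
    estimate. It is only a rough model: it assumes the sort cache holds one
    4-byte slot per document in the index, plus the distinct string values
    for a string field, and the 2 bytes per character and 40 bytes of
    per-string overhead are guesses about the JVM, not measured figures. The
    class and method names are made up for this sketch.

```java
// Rough size model for a per-field sort cache (assumptions, not Lucene internals).
final class SortCacheEstimate {
    /** One 4-byte slot (int or float) per document in the index. */
    static long numericCacheBytes(long maxDoc) {
        return 4L * maxDoc;
    }

    /**
     * One 4-byte ordinal per document plus the distinct values themselves.
     * perStringOverhead is an assumed fixed cost per String object.
     */
    static long stringCacheBytes(long maxDoc, long distinctValues,
                                 long avgChars, long perStringOverhead) {
        long ordinals = 4L * maxDoc;
        long strings = distinctValues * (2L * avgChars + perStringOverhead);
        return ordinals + strings;
    }

    public static void main(String[] args) {
        long maxDoc = 15_000_000L; // upper end of the index sizes in the question
        // Numeric field: 15M * 4 bytes.
        System.out.println("numeric field: "
                + numericCacheBytes(maxDoc) / 1_000_000 + " MB");   // 60 MB
        // String field, every value distinct, 50 chars on average (assumed).
        System.out.println("string field:  "
                + stringCacheBytes(maxDoc, maxDoc, 50, 40) / 1_000_000 + " MB"); // 2160 MB
    }
}
```

    Under these assumptions the estimate depends on the number of documents
    and on the values of the sort field, not on how many results are
    fetched, and a user who can pick any field to sort on can trigger one
    such cache per field.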

    Erick
  • Artem at Apr 9, 2007 at 5:03 pm
    Hello Nilesh,

    Sunday, April 8, 2007, 10:58:32 PM, you wrote:

    [talkin' about LUCENE-769]
    > I must note that my patch only helps in lucene-OOM situations related to
    > _sorted_ queries. If this is your case than I think yes it will help.
    NB> Probably a newbie question, but can you please explain what sorted
    NB> queries mean? Is simple keyword search a sorted query?

    That's simple: if the results presented on screen are sorted by that keyword, it's
    a sorted query :)
    Another test is your system's code. The sorted queries I mean are calls to
    IndexSearcher.search(query, sort).

    --
    Best regards,
    Artem mailto:abublic@gmail.com


  • Ivan Vasilev at Apr 23, 2007 at 7:09 pm
    Hi All,
    THANK YOU FOR YOUR HELP :)
    I posted this problem to the forum, but unfortunately I had no chance to
    work on it last week...
    So now I have tested Artem's patch, but the results show:
    1) It is very slow compared with running without the patch.
    2) There are no big differences in memory usage (so far I have tested
    only with relatively small indexes - less than 1 GB and less than 1 million
    docs - because with the 20-40 GB indexes I had to wait more
    than 5 minutes, which is practically useless).

    So I doubt whether I am using the patch correctly. I do just what is
    described in Artem's letter:

    AV> You can include StoredFieldSortFactory class source file into your sources and
    AV> then use StoredFieldSortFactory.create(sortFieldName, sortDescending) to get
    AV> Sort object for sorting query.
    AV> StoredFieldSortFactory source file can be extracted from LUCENE-769 patch or
    AV> from sharehound sources: http://sharehound.cvs.sourceforge.net/*checkout*/sharehound/jNetCrawler/src/java/org/apache/lucene/search/StoredFieldSortFactory.java


    What I am wondering about is that the comments on the patch
    (https://issues.apache.org/jira/browse/LUCENE-769) say that
    the patch solves the problem by using WeakHashMap, but
    the downloaded StoredFieldSortFactory.java file does not actually use
    WeakHashMap. Another thing: the comments in the LUCENE-769 issue
    mention classes like WeakDocumentsCache and
    DocCachingIndexReader, but I did not find them in the Lucene source code
    or as classes in StoredFieldSortFactory.java. So my questions are:
    1. Is it enough to include the file StoredFieldSortFactory.java in the
    source code, or are there other classes that I have to download and
    include?
    2. Do I have to use this DocCachingIndexReader instead of the Reader that I
    currently use in cases where I expect an OOMException and will use this patch?

    Thanks to all once again :),
    Ivan

  • Artem Vasiliev at Apr 24, 2007 at 1:17 pm
    Hello Ivan!

    It's so sad to me that you had bad results with that patch. :)

    The discussion in the ticket is out of date. The patch initially consisted of
    several classes and used WeakHashMap, but then it evolved into what it is now: a single
    StoredFieldSortFactory class. I use it in my sharehound app in pretty much
    the same form as it is in Jira currently, and it does show good results for
    me.

    In your sample searches,
    - how many results do you have?
    - how long does the sorted search take to execute?
    - what is the average size of a sorted field?
    - what is the CPU, and how much CPU and memory do you give to the application?

    I get page 1 (the first 100 items) of a sorted list of 10000 items in 0.3 s to 3 s
    (for a date column it depends precisely on whether the sort is ascending or
    descending; I don't know why that is). My index is about 1 million docs and 1 GB;
    the sorted fields are rather small (numbers, dates and strings of maybe 50
    characters on average). The machine looks quite beefy to me: an Intel Core Duo with
    500 MB given to the application.

    Regards,
    Artem
  • Artem Vasiliev at Apr 24, 2007 at 1:27 pm
    Ahhh, you said in your original post that your search matches _all_ the
    results... Yup, my patch will not help much in this case. After all, all the
    values have to be read to be compared while sorting! :)

    The LUCENE-769 patch helps only if the result set is significantly smaller than the full
    index size.
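
    Why fetching only 100 results does not avoid this can be seen in a
    generic top-N selection. The sketch below is a textbook bounded-heap
    technique, not the code of Lucene or of the patch; TopNSketch and its
    members are hypothetical names. Only n candidates are kept at a time,
    yet the sort value of every hit is still read once.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

final class TopNSketch {
    /** Number of sort values examined by the most recent topN call. */
    static long valuesRead = 0;

    /** Returns the n smallest values among the hits, in ascending order. */
    static List<String> topN(List<String> hitValues, int n) {
        valuesRead = 0;
        // Max-heap of at most n candidates; the worst candidate sits on top.
        PriorityQueue<String> heap = new PriorityQueue<>(Collections.reverseOrder());
        for (String value : hitValues) {
            valuesRead++;        // every hit's value has to be looked at
            heap.offer(value);
            if (heap.size() > n) {
                heap.poll();     // evict the current worst candidate
            }
        }
        List<String> result = new ArrayList<>(heap);
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        List<String> hits = Arrays.asList("pear", "apple", "fig", "kiwi", "banana");
        System.out.println(topN(hits, 2)); // [apple, banana]
        System.out.println(valuesRead);    // 5
    }
}
```

    With a query that matches the whole index, the number of values read
    equals the number of documents, which is consistent with the slow
    results reported above.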

    Regards,
    Artem
    On 4/24/07, Artem Vasiliev wrote:

    Hello Ivan!

    It's sad to hear that you had bad results with that patch. :)

    The discussion in the ticket is out of date - the patch was initially spread over
    several classes and used WeakHashMap, but then it evolved into what it is now - one
    StoredFieldSortFactory class. I use it in my sharehound app in pretty much
    the same form as it is in Jira currently, and it does show good results for
    me.

    In your sample searches,
    - how many results do you have?
    - how long does the sorted search take?
    - what is the average size of a sorted field?
    - what is the CPU, and how much CPU and memory do you give to the
    application?

    I get page 1 (the first 100 items) of a sorted list of 10000 items in 0.3 s to
    3 s (for the date column the exact time depends on whether the sort is ascending or
    descending - I don't know why). My index is about 1 mln docs and 1 GB;
    the sorted fields are rather small (numbers, dates and strings of maybe 50
    symbols on average). The machine looks quite beefy to me - Intel Core Duo with
    500 MB given to the application.

    Regards,
    Artem
    On 4/23/07, Ivan Vasilev wrote:

    Hi All,
    THANK YOU FOR YOUR HELP :)
    I put this problem in the forum, but unfortunately I had no chance to work
    on it last week...
    So now I tested Artem's patch, but the results show:
    1) speed is very slow compared with the usage without the patch
    2) There are no big differences in memory usage (so far I tested
    only with relatively small indexes - less than 1 GB and less than 1 mln
    docs - because with 20-40 GB indexes I had to wait more
    than 5 mins, which is practically useless).

  • Artem Vasiliev at Apr 24, 2007 at 1:30 pm
    Hi Ivan!

    By the way, maybe forbidding the sorted search when there are too many
    results is an option? That is what I did in my case.

    Regards,
    Artem.
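
A minimal sketch of such a guard, under stated assumptions: the application already knows the total hit count (for example from an unsorted first pass), and the threshold is an arbitrary value to be tuned. SortGuard and allowsSort are hypothetical names, not part of Lucene or of the patch.

```java
/** Hypothetical guard: sort only when the hit count is small enough. */
final class SortGuard {
    private final int maxSortableHits;

    SortGuard(int maxSortableHits) {
        if (maxSortableHits < 0) {
            throw new IllegalArgumentException("limit must be >= 0");
        }
        this.maxSortableHits = maxSortableHits;
    }

    /** True if a user-requested sort should be honoured for this result size. */
    boolean allowsSort(int totalHits) {
        return totalHits <= maxSortableHits;
    }

    public static void main(String[] args) {
        SortGuard guard = new SortGuard(100_000); // threshold is an arbitrary example value
        int totalHits = 10_944_918;               // assumed to come from an unsorted first pass
        if (guard.allowsSort(totalHits)) {
            System.out.println("run the sorted search");
        } else {
            System.out.println("too many hits: keep relevance order and ask the user to narrow the query");
        }
    }
}
```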
  • Ivan Vasilev at Apr 25, 2007 at 11:01 am
    Hi Artem,

    Thank you very much for your mails :)
    First I have to tell you that your patch works perfectly even with
    very big indexes - 40 GB (you can see the results below).
    The reason I had bad test results last time is that I made a small
    change (but I cannot understand why this change caused a problem - in my
    opinion it should not have such a big effect on performance).
    The change is that I added a new method to the class
    StoredFieldSortFactory. It is the same as the create(String sortFieldName,
    boolean sortDescending) method, but instead of wrapping the SortField it
    returns it directly, and in my class I wrap this object in a Sort.
    Here is the code:

    public static SortField createSortField(String sortFieldName,
            boolean sortDescending) {
        return new SortField(sortFieldName, instance, sortDescending);
    }

    I do this because we have to support sorting on multiple fields: I
    obtain all SortField objects in a loop and then create a Sort out of them:

    Sort sort = new Sort(sortFields);

    In my tests with very bad results (searches took more than 5 minutes)
    I always sorted BY ONE FIELD ONLY (that is, the array sortFields
    always had length 1).
    But I still used the constructor Sort(SortField[]) and not
    Sort(SortField), as your code originally does in the method
    StoredFieldSortFactory.create(..).
    Do you think this is the reason for the poor performance?

    If so, COULD YOU PLEASE TELL ME how to use your patch for sorting on
    multiple stored fields?

    Here are the test results of your patch with different indexes (the tests
    use the code just as you recommend - with your create(..) method, which
    uses the constructor Sort(SortField)):

    - CPU - Intel Core2Duo; max memory allowed to the process that does the
    searching - 1 GB (not all of it used)
    **********************************************************************************
    - index size 3.3 GB, about 486 410 documents (all the test searches
    include all documents);
    ____________________________________________________________________________________________

    - field size - it is the file name and varies - in my opinion 15 - 30 chars
    on average.
    - search time (ASC) - 1.312 s, memory usage - 71 MB
    - search time (DSC) - 1.281 s, memory usage - 71 MB

    - field size - it is the abs path name and varies - in my opinion 60 - 90
    chars on average.
    - search time (ASC) - 1.344 s, memory usage - 71 MB
    - search time (DSC) - 1.328 s, memory usage - 71 MB

    - field size - it is the file size and varies - in my opinion 3 - 7 chars
    on average.
    - search time (ASC) - 1.313 s, memory usage - 71 MB
    - search time (DSC) - 1.312 s, memory usage - 71 MB

    **********************************************************************************
    - index size 21.4 GB, about 376 999 documents (all the test searches
    include all documents);
    ____________________________________________________________________________________________

    - field size - it is the file name and varies - in my opinion 15 - 30 chars
    on average.
    - search time (ASC) - 0.875 s, memory usage - 371 MB
    - search time (DSC) - 0.828 s, memory usage - 371 MB

    - field size - it is the abs path name and varies - in my opinion 60 - 90
    chars on average.
    - search time (ASC) - 0.844 s, memory usage - 371 MB
    - search time (DSC) - 0.813 s, memory usage - 371 MB

    - field size - it is the file size and varies - in my opinion 3 - 7 chars
    on average.
    - search time (ASC) - 0.813 s, memory usage - 371 MB
    - search time (DSC) - 0.797 s, memory usage - 371 MB

    **********************************************************************************
    - index size 42.9 GB, about 10 944 918 documents (all the test
    searches include all documents);
    ____________________________________________________________________________________________

    - field size - it is the file name and varies - in my opinion 15 - 30 chars
    on average.
    - search time (ASC) - 21.905 s, memory usage - 625 MB
    - search time (DSC) - 21.781 s, memory usage - 625 MB

    - field size - it is the abs path name and varies - in my opinion 60 - 90
    chars on average.
    - search time (ASC) - 21.874 s, memory usage - 625 MB
    - search time (DSC) - 21.749 s, memory usage - 625 MB

    - field size - it is the file size and varies - in my opinion 3 - 7 chars
    on average.
    - search time (ASC) - 21.687 s, memory usage - 625 MB
    - search time (DSC) - 21.812 s, memory usage - 625 MB


    THANK YOU VERY MUCH,
    Ivan




  • Artem at Apr 25, 2007 at 7:28 pm
    Hello Ivan,

    That was cool news! Thanks! :) The timings are surprisingly good: 10 mln docs
    sorted in 20 s.. cool! It also looks like the sorting algorithm employed by
    Lucene is quite memory-economical.

    Not supporting multiple fields is in fact another limitation of my patch. I
    don't need it, so I didn't implement it :) To implement it you would
    probably have to do it manually: employ a FieldSelector that fetches that
    bunch of fields; change the compare(ScoreDoc scoreDoc1, ScoreDoc scoreDoc2)
    method so that it compares docs by a bunch of fields instead of a single
    field (there should also be an array of Asc/Desc flags somewhere, which
    makes this more complicated); that's it.

    I don't understand yet why Sort(SortField[] fields) didn't give the same
    result when fields.length == 1.. Probably we should dig into the Lucene code
    to find out.
    In the case of several fields I can imagine why this approach would be less
    effective: at least N*2 Document reads (by StoredFieldComparator.sortValue)
    will be needed to compare 2 documents (N is the length of the fields array).
    One read with an appropriate FieldSelector is likely to perform better.

    Anyway, I do think StoredFieldSortFactory's approach could be successfully
    applied to multiple fields, but I'm not going to implement it yet. Maybe you?
    :)

    Regards,
    Artem
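
The multi-field comparison Artem describes could look roughly like the sketch below. It is plain Java with hypothetical names (MultiFieldComparator is not part of the patch), and it assumes that the sort-field values of each document have already been fetched in a single read and are held as a String array in sort-priority order. Wiring this into Lucene's compare(ScoreDoc, ScoreDoc) and a FieldSelector is left out.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical sketch of comparing two documents by several stored-field values. */
final class MultiFieldComparator implements Comparator<String[]> {
    private final boolean[] descending; // one flag per sort field

    MultiFieldComparator(boolean[] descending) {
        this.descending = descending.clone();
    }

    /** Each array holds the sort-field values of one document, in sort-priority order. */
    @Override
    public int compare(String[] values1, String[] values2) {
        for (int i = 0; i < descending.length; i++) {
            int c = values1[i].compareTo(values2[i]);
            if (c != 0) {
                return descending[i] ? -c : c; // first differing field decides
            }
        }
        return 0; // equal on every sort field
    }

    public static void main(String[] args) {
        String[][] docs = {
            {"b.txt", "200"}, {"a.txt", "100"}, {"a.txt", "300"}
        };
        // Name ascending, then size descending (compared as strings in this sketch).
        Arrays.sort(docs, new MultiFieldComparator(new boolean[] {false, true}));
        for (String[] d : docs) {
            System.out.println(Arrays.toString(d));
        }
    }
}
```

Because both documents' values arrive as complete arrays, one comparison needs one read per document rather than one read per field.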






    --
    Best regards,
    Artem mailto:abublic@gmail.com


  • Otis Gospodnetic at Apr 6, 2007 at 3:20 pm
    Ivane,

    Sorts will eat your memory, and how much they use depends on what you store in them - ints, Strings, floats...
    A profiler like JProfiler will tell you what's going on, who's eating your memory.

    Otis
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    Simpy -- http://www.simpy.com/ - Tag - Search - Share
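
To make the dependence on field type concrete, here is a back-of-envelope estimate. It assumes, as far as I understand the sort caches of Lucene at that time, one 4-byte slot per document for an int or float field, and for a String field one int ordinal per document plus one String per distinct value. The per-String overhead constant is a guess that varies by JVM, and SortCacheEstimate is a hypothetical helper, not a Lucene API.

```java
/** Back-of-envelope estimate of per-field sort cache size; all constants are assumptions. */
final class SortCacheEstimate {
    static final long BYTES_PER_INT = 4;
    static final long ASSUMED_STRING_OVERHEAD = 40; // rough per-String overhead, varies by JVM

    /** One 4-byte slot per document, as for an int or float sort field. */
    static long numericFieldBytes(long maxDoc) {
        return maxDoc * BYTES_PER_INT;
    }

    /** One int ordinal per document plus one String per distinct value. */
    static long stringFieldBytes(long maxDoc, long uniqueTerms, long avgChars) {
        long ordinals = maxDoc * BYTES_PER_INT;
        long terms = uniqueTerms * (ASSUMED_STRING_OVERHEAD + 2 * avgChars); // 2 bytes per char
        return ordinals + terms;
    }

    public static void main(String[] args) {
        long maxDoc = 15_000_000L; // upper end of the index sizes in the question
        long mb = 1024 * 1024;
        System.out.println("int field:    " + numericFieldBytes(maxDoc) / mb + " MB");
        // Worst case for a path-like field: every document has a distinct 75-char value.
        System.out.println("string field: " + stringFieldBytes(maxDoc, maxDoc, 75) / mb + " MB");
    }
}
```

Under these assumptions an int field on 15 million documents costs roughly 57 MB, while a fully distinct 75-character string field comes to roughly 2.7 GB, which would not fit in a 1 GB heap.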

  • Otis Gospodnetic at Apr 6, 2007 at 6:39 pm
    Craig,

    This just shows you that the JVM ran out of memory while running that particular method, and does not necessarily mean that that method is what's consuming your RAM.
    Run your app and, if you are using Java 1.5/1.6, run jmap against that java process and tell it to show you how much memory objects are consuming. Sort it by the appropriate column and you'll get your top memory hogs.

    Otis
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
    Simpy -- http://www.simpy.com/ - Tag - Search - Share
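
jmap looks at the process from outside (jmap -histo <pid> prints a per-class histogram). As a rough complement from inside the application, heap usage can be read before and after a search with the standard java.lang.management API. This is a different technique from the one Otis suggests, and HeapProbe is a hypothetical helper; garbage collection between the two readings makes the number approximate.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;

/** Hypothetical helper: report heap growth around a piece of work, such as one search. */
final class HeapProbe {
    private static final MemoryMXBean MEMORY = ManagementFactory.getMemoryMXBean();

    /** Heap bytes currently in use, as reported by the JVM. */
    static long usedHeapBytes() {
        return MEMORY.getHeapMemoryUsage().getUsed();
    }

    /** Runs the task and returns the change in used heap; GC activity makes this approximate. */
    static long heapDeltaBytes(Runnable task) {
        long before = usedHeapBytes();
        task.run();
        return usedHeapBytes() - before;
    }

    public static void main(String[] args) {
        final int[][] holder = new int[1][];
        // The allocation stands in for a sorted search that fills a cache.
        long delta = heapDeltaBytes(() -> holder[0] = new int[10_000_000]);
        System.out.println("heap delta: " + delta / (1024 * 1024)
                + " MB, array length " + holder[0].length);
    }
}
```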

    ----- Original Message ----
    From: Craig W Conway <craigwconway@yahoo.com>
    To: java-user@lucene.apache.org
    Sent: Friday, April 6, 2007 1:10:36 PM
    Subject: Re: Out of memory exception for big indexes

    Would it be fair to say that you can expect OutOfMemory errors if you run complex queries, i.e. sorts, boosts, weights...?

    My query looks like this:

    +(pathNodeId_2976569:1^5.0 pathNodeId_2976969:1 pathNodeId_2976255:1 pathNodeId_2976571:1) +(pathClassId:1 pathClassId:346 pathClassId:314) -id:369


    My OutOfMemory error occurs like so:

    java.lang.OutOfMemoryError: Java heap space
    Dumping heap to java_pid4512.hprof ...
    Heap dump file created [71421503 bytes in 2.640 secs]
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.lucene.index.MultiReader.norms(MultiReader.java:173)
    at org.apache.lucene.search.TermQuery$TermWeight.scorer(TermQuery.java:69)
    at org.apache.lucene.search.BooleanQuery$BooleanWeight2.scorer(BooleanQuery.java:355)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:130)
    at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:100)
    at org.apache.lucene.search.Hits.getMoreDocs(Hits.java:66)
    at org.apache.lucene.search.Hits.<init>(Searcher.java:45)
    at org.apache.lucene.search.Searcher.search(Searcher.java:37)

    References:

    http://www.opensubscriber.com/message/java-user@lucene.apache.org/1961376.html
    http://www.opensubscriber.com/message/java-user@lucene.apache.org/6362024.html



Discussion Overview
group: java-user @
categories: lucene
posted: Apr 6, '07 at 11:10a
active: Apr 25, '07 at 7:28p
posts: 18
users: 6
website: lucene.apache.org
