Question regarding sorting and memory consumption in lucene
Hello, I've read a lot of threads now on memory consumption and sorting,
and I think I have a pretty good understanding of how things work, but I
could still use some input here.

We currently have a system consisting of 6 different Lucene indexes (all
have the same structure, so you could say it is a form of sharding). We
use this approach because we want to be able to give users access to
different indexes (but not necessarily all of them).

(We are planning to move to a Solr-based system, but for now we would like
to solve this issue with our current Lucene-based system.)

The thing is, the indexes are rather big (ranging from 5 GB to 20 GB and
10-30 million entries per index).
We keep one searcher object open per index, and when an index changes (new
documents are added in batches several times a day), we update the
corresponding searcher object.
In the warmup procedure we did a couple of searches and things worked fine,
BUT I realized that our application returns hits sorted by date by default,
while our warmup procedure ran non-sorted queries... so the first searches
done by a user after an update were still slow (obviously).

To cope, I changed the warmup procedure to include a sorted search, and now
the user will not notice slow queries. Good!
But the problem at hand is that we are now running into memory problems (and
I understand that sorting does consume a lot of memory...). Is there any
"best practice" way to deal with this? The field we sort on is an
untokenized string field representing the date, typically "2008-10-10". I
am aware that sorting on string fields consumes a lot of memory, so should
we change this field to something different? Would that help with the
memory problems?
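
For reference, the warmup now looks roughly like this (a simplified
sketch, not our exact code; the index path and the currentSearcher field
are placeholders):

  import org.apache.lucene.search.*;

  // Warm the new searcher with a date-sorted query before swapping it in,
  // so the FieldCache for the sort field is populated up front.
  IndexSearcher warm = new IndexSearcher("/path/to/index");
  Sort byDate = new Sort(new SortField("date", SortField.STRING, true));
  warm.search(new MatchAllDocsQuery(), null, 10, byDate); // forces cache load
  IndexSearcher old = currentSearcher;
  currentSearcher = warm;   // swap in the warmed searcher
  old.close();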

As a side note / curiosity question: does it matter whether we use the
search method returning Hits versus the search method returning
TopFieldDocs? (We are not iterating over the results in any way when this
memory issue occurs.)
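
To be concrete, the two calls I mean are (Lucene 2.x signatures):

  Hits hits = searcher.search(query, new Sort("date"));
  TopFieldDocs docs = searcher.search(query, null, 50, new Sort("date"));

(As far as I understand, both go through the same FieldCache when sorting,
so the sort-cache cost should be identical; Hits just adds its own lazy
document cache on top. Corrections welcome.)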

Thanks in advance for any guidance we may get.

Best regards,
Aleksander M. Stensby



--
Aleksander M. Stensby
Senior Software Developer
Integrasco A/S


  • Aleksander M. Stensby at Oct 10, 2008 at 12:52 pm
    I'll follow up on my own question...
    Let's say that we have 4 years of data, meaning there will be roughly
    4 * 365 = 1460 unique terms for our sort field.
    For one index with, say, 30 million docs, should the cache use approx.
    100 MB, or am I wrong? And thus for 6 indexes we would need approx.
    600 MB for the caches? (Plus an additional 100 MB every time we warm a
    new searcher and swap it out...) As far as string versus int or long
    goes, I don't really see any big gain in changing it, since 1460 * 10
    bytes of extra memory doesn't really make much difference. Or?

    I guess we should consider reducing the index size, or at least only
    allowing sorted search on a subset of the index (or a pruned version of
    the index...)? Would that be better for us?
    But then again, I assume there are much larger Lucene-based indexes out
    there than ours, and you guys must have some solution to this issue,
    right? :)

    best regards,
    Aleksander
  • Mark harwood at Oct 10, 2008 at 1:20 pm
    Assuming content is added in chronological order, and with no updates to
    existing docs, couldn't you rely on the internal Lucene document id to
    give a chronological sort order?
    That would require no memory cache at all when sorting.

    Querying across multiple indexes simultaneously may, however, present an
    added complication...
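
    Something like this, if I remember the API right (untested sketch):

      // Sort by internal document id, newest (= most recently added) first.
      // No FieldCache needed - index order is the sort order.
      Sort newestFirst = new Sort(new SortField(null, SortField.DOC, true));
      TopFieldDocs docs = searcher.search(query, null, 50, newestFirst);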
  • Aleksander M. Stensby at Oct 10, 2008 at 2:26 pm
    Unfortunately no, since the documents that are added may come from a new
    "source" containing old documents as well... :/
    I tried deploying our web application without any searcher objects, and
    it consumes roughly ~200 MB of memory in Tomcat.
    With 6 searchers, the same application manages to consume over 2.5 GB of
    memory when warming... :(
    I might have done something super-idiotic in the way I handle searching,
    but I seriously cannot see what that might be...

    But I assume that people deal with much larger indexes than this, right?

    cheers,
    Aleksander
  • Ganesh at Oct 13, 2008 at 12:44 pm
    Hello Mark,

    I am also facing the same sorting issue.
    In my case there will only be addition and deletion of data [no
    modification of existing records]. Can I rely on the indexed order for
    sorting?

    Is "SortField.FIELD_DOC" the one that helps to sort in indexed order?

    Regards
    Ganesh
  • Mark harwood at Oct 10, 2008 at 2:44 pm
    I think you have your memory cost calculation wrong.
    The cost is field size (10 bytes?) times the number of documents, NOT the
    number of unique terms.
    The cache is essentially an array of size reader.maxDoc() which is
    indexed directly by docId to retrieve field values.

    You are right in needing to factor in the cost of keeping one active
    cache while warming up a new one, so that effectively doubles the RAM
    requirement.
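
    For what it's worth, you can see the structure Lucene builds by asking
    the FieldCache directly (2.x API, from memory):

      // For a String sort field Lucene builds a StringIndex:
      //   order:  int[reader.maxDoc()] - one ordinal per document
      //   lookup: String[]             - the unique terms (your ~1460 dates)
      FieldCache.StringIndex idx =
          FieldCache.DEFAULT.getStringIndex(reader, "date");
      System.out.println(idx.order.length + " docs, "
          + idx.lookup.length + " unique terms");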
  • Aleksander M. Stensby at Oct 10, 2008 at 2:59 pm
    Yes, I understand that, and I did mean the number of documents, but I
    read in the javadoc that:

    "For String fields, the cache is larger: in addition to the above array,
    the value of every term in the field is kept in memory. If there are many
    unique terms in the field, this could be quite large."

    and in one of the mails on the mailing list I read:
    "So if your field is an Int your talking numDocs*32 bits for your
    cache. For a Long field its numDocs*64. For a String field Lucene caches
    a String array with every unique term and then an int array indexing
    into the term array."

    But yes, I understand that in our case the memory drain comes not so much
    from the String field itself as from the sheer number of documents.
    Still, if we have 30 000 000 docs * 10 bytes = ~300 MB (OK, I was a bit
    off there... :p)
    If we had a long field instead, it would be 30 000 000 * 8 bytes =
    ~240 MB or something like that? Hmm...

    But does that mean I will just have to reduce my index (at least for the
    sorting)?
    No other option?


    Cheers, and thanks for your help!
    - Aleks
  • Mark harwood at Oct 10, 2008 at 3:08 pm
    Update: the statement "...cost is field size (10 bytes ?) times number of
    documents" is wrong.
    What you actually have is the cost of the unique strings (estimated at
    10 * 1460 - effectively nothing), BUT you have to add the cost of the
    array of object references to those strings, so:

    30m x 8 bytes on 64-bit Java = 240 MB
    or
    30m x 4 bytes on 32-bit = 120 MB

    ...which is where the bulk of the cost comes in.

    How about using a field cache of "short", which is effectively:

    new short[reader.maxDoc()]
    or
    2 bytes * 30 million = 60 MB.

    Each short can represent one of up to 65536 values - at one value per
    day, enough for a date range of 179 years.
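
    Concretely, something like this (an untested sketch - "dateDays" is a
    made-up field name, and it assumes your Lucene version has
    SortField.SHORT; otherwise a custom comparator would be needed):

      // Index time: encode the date as days since 1970-01-01.
      // 2008-10-10 is roughly day 14160, well within a signed short's
      // 32767 max (good for dates up to ~2059).
      long millis = new java.text.SimpleDateFormat("yyyy-MM-dd")
          .parse("2008-10-10").getTime();
      short days = (short) (millis / (24L * 60 * 60 * 1000));
      doc.add(new Field("dateDays", String.valueOf(days),
                        Field.Store.NO, Field.Index.UN_TOKENIZED));

      // Search time: sort on the short field - 2 bytes per doc in the cache.
      Sort byDate = new Sort(new SortField("dateDays", SortField.SHORT, true));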
  • Aleksander M. Stensby at Oct 10, 2008 at 3:46 pm
    That's a really good idea, Mark! :)
    Thanks! Will try to see if I can make a quick change with your suggestion.
    (Too bad "quick" isn't really a word in my vocabulary when it's 6 o'clock
    on a Friday... :( )
    Guess it'll be a looong night... :(

    Cheers,
    Aleks
  • Robert Stewart at Oct 10, 2008 at 4:02 pm
    I have had a similar problem. What I do is load all the date field values
    at index startup and convert the dates (timestamps) to seconds since
    1970-01-01 (i.e. Unix time). Then I pre-sort that array using a very fast
    O(n) distribution sort and keep an array of integers which is the
    pre-sorted permutation of all documents in the index, so that for
    docid=N, perm[N] = sorted order. Then it just takes enumerating the
    docids in the results (from a bit array) to get the sorted order of the
    results. Our index is approx. 38 million docs. Sorting by date takes
    around 20 ms.
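
    In rough Java the idea looks like this (a sketch of the approach, not
    our production code; "dateDays" is a hypothetical int field of days
    since epoch):

      // Startup: build rank[docId] = position of the doc in date order,
      // using a counting (distribution) sort - O(maxDoc + #distinct days).
      int[] day = FieldCache.DEFAULT.getInts(reader, "dateDays");
      int maxDay = 0;
      for (int i = 0; i < day.length; i++) maxDay = Math.max(maxDay, day[i]);
      int[] count = new int[maxDay + 2];
      for (int i = 0; i < day.length; i++) count[day[i] + 1]++;
      for (int d = 1; d < count.length; d++) count[d] += count[d - 1];
      int[] rank = new int[day.length];
      for (int i = 0; i < day.length; i++) rank[i] = count[day[i]]++;

      // Query time: collect (rank[doc], doc) for each hit in the bit array
      // and emit in rank order - no per-query comparisons on field values.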
  • Mark harwood at Oct 10, 2008 at 4:23 pm
    Actually, looking at this a little deeper, maybe Lucene could/should
    automatically be doing this "short" optimisation?

    Given a comparatively small set of unique terms (as in your example) it
    seems feasible that FieldCacheImpl could allocate a
    short[reader.maxDoc()] array rather than an int[reader.maxDoc()] array to
    represent values for sorting.
    That halves the memory required and could support fields with up to 65
    thousand unique terms. It looks like towards the end of the code, where
    FieldCacheImpl creates a StringIndex, it is in a position to recognise
    that this optimisation could be made.

    Anyone more familiar with the internal workings of the Sort API care to
    comment on whether that is an option, or have I missed something?
    I've not really poked around in this part of the code too much.
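
    In the meantime a custom comparator can get the same effect from user
    code - along these lines (untested sketch against the 2.x Sort API;
    assumes fewer than 32k unique terms):

      SortComparatorSource shortOrds = new SortComparatorSource() {
        public ScoreDocComparator newComparator(final IndexReader reader,
            final String field) throws IOException {
          // Walk the terms in order and record a 2-byte ordinal per doc.
          final short[] ords = new short[reader.maxDoc()];
          TermEnum terms = reader.terms(new Term(field, ""));
          TermDocs docs = reader.termDocs();
          try {
            short ord = 0;
            while (terms.term() != null
                && terms.term().field().equals(field)) {
              ord++;                              // 0 = doc has no value
              docs.seek(terms);
              while (docs.next()) ords[docs.doc()] = ord;
              if (!terms.next()) break;
            }
          } finally { terms.close(); docs.close(); }
          return new ScoreDocComparator() {
            public int compare(ScoreDoc a, ScoreDoc b) {
              return ords[a.doc] - ords[b.doc];
            }
            public Comparable sortValue(ScoreDoc d) {
              return new Short(ords[d.doc]);
            }
            public int sortType() { return SortField.CUSTOM; }
          };
        }
      };
      // Usage:
      //   searcher.search(query, null, 50,
      //       new Sort(new SortField("date", shortOrds)));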

    Cheers,
    Mark




    ----- Original Message ----
    From: Aleksander M. Stensby <[email protected]>
    To: [email protected]
    Sent: Friday, 10 October, 2008 16:45:02
    Subject: Re: Question regarding sorting and memory consumption in lucene

    That's a really good idea Mark! :)
    Thanks! Will try to see if can make a quick change with your suggestion.
    (Too bad quick isn't really a word in my vocabulary when it's 6 o'clock on
    a Friday :(
    Guess it'll be a looong night.. :(

    Cheers,
    Aleks
    On Fri, 10 Oct 2008 17:07:31 +0200, mark harwood wrote:

    Update: The statement "...cost is field size (10 bytes ?) times number
    of documents" is wrong.
    What you actually have is the cost of the unique strings (estimated at
    10 * 1460 -effectively nothing) BUT you have to add the cost of the
    array of object references to those strings so

    30m x 8 bytes on 64bit java = 240mb
    or
    30m x 4 bytes on 32bit = 120mb

    ....which is where the bulk of the cost comes in.

    How about using a field cache of "short" which is effectively:

    new short[reader.maxDoc]
    or
    2bytes * 30 million = 60 meg.

    Each short could represent up to 65536 values - capable of representing
    a date range of 179 years.





    ----- Original Message ----
    From: mark harwood <[email protected]>
    To: [email protected]
    Sent: Friday, 10 October, 2008 15:43:35
    Subject: Re: Question regarding sorting and memory consumption in lucene

    I think you have your memory cost calculation wrong.
    The cost is field size (10 bytes ?) times number of documents NOT number
    of unique terms.
    The cache is essentially an array of size reader.maxDoc() which is
    indexed directly into on docId to retrieve field values.

    You are right in needing to factor in the cost of keeping one active
    cache while busy warming-up a new one so that effectively doubles the
    RAM requirements.






    ----- Original Message ----
    From: Aleksander M. Stensby <[email protected]>
    To: [email protected]
    Sent: Friday, 10 October, 2008 15:25:29
    Subject: Re: Question regarding sorting and memory consumption in lucene

    Unfortunately no, since the documents that are added may come from a new
    "source" containing old documents as well... :/
    I tried deploying our web application without any searcher objects and it
    consumes basically ~200mb of memory in tomcat.
    With 6 searchers the same application manages to consume over 2.5 GB of
    memory when warming... :(
    I might have done some super-idiotic logic in the way I handle searching,
    but I can seriously not see what that might be...

    But I assume that people deal with much larger indexes than this, right?

    cheers,
    Aleksander


    On Fri, 10 Oct 2008 15:18:46 +0200, mark harwood wrote:
    Assuming content is added in chronological order and with no updates to
    existing docs, couldn't you rely on the internal Lucene document id to
    give a chronological sort order?
    That would require no memory cache at all when sorting.

    Querying across multiple indexes simultaneously, however, may present an
    added complication...
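
    A sketch of that cache-free approach using the Sort API of the same
    Lucene version (the query, field names and index path are placeholders):

    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Sort;
    import org.apache.lucene.search.SortField;
    import org.apache.lucene.search.TermQuery;

    public class IndexOrderSortDemo
    {
        public static void main(String[] args) throws Exception
        {
            IndexSearcher searcher = new IndexSearcher("/indexes/myTestIndex");
            Query query = new TermQuery(new Term("text", "lucene"));

            // Oldest first: plain index order, no field cache allocated.
            Hits oldestFirst = searcher.search(query, Sort.INDEXORDER);

            // Newest first: reverse document order, still cache-free.
            Hits newestFirst = searcher.search(query,
                    new Sort(new SortField(null, SortField.DOC, true)));

            System.out.println(oldestFirst.length() + " hits, oldest first");
            System.out.println(newestFirst.length() + " hits, newest first");
        }
    }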



  • Chris Hostetter at Oct 14, 2008 at 11:57 pm
    : Actually looking at this a little deeper maybe Lucene could/should
    : automatically be doing this "short" optimisation here?

    At the moment it can't; the arrays in StringIndex are public.

    The other thing that would be a bit tricky is the initialization ... I
    can't think of any easy way to know in advance how many terms there are
    before iterating over all the terms, so you'd have to assume one and
    then, if you're wrong, copy to the other -- not sure how expensive that
    copy would be.

    It's a little more feasible for custom clients to do when they know in
    advance how many terms they've got -- but some of the existing
    FieldCacheImpl code could probably be refactored to make it easier on
    people.
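
    One way for a custom client to "know in advance" is to pay one extra pass
    over the TermEnum before allocating anything; a sketch (the class and
    method names are made up):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermEnum;

    public class UniqueTermCounter
    {
        /**
         * Counts the unique terms of a field so the right array width can be
         * chosen up front: <= 255 fits a byte[], <= 65535 a short[], else int[].
         */
        public static int count(IndexReader reader, String field) throws IOException
        {
            TermEnum termEnum = reader.terms(new Term(field, ""));
            int count = 0;
            try
            {
                do
                {
                    Term term = termEnum.term();
                    if (term == null || term.field() != field)
                        break;
                    count++;
                } while (termEnum.next());
            } finally
            {
                termEnum.close();
            }
            return count;
        }
    }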



    -Hoss


  • Mark Harwood at Oct 15, 2008 at 6:40 am
    Yes, StringIndex's public fields make life awkward. Re initialization - I did think you could try using arrays of byte arrays. The first 256 terms can be addressed using just one byte array; on encountering a 257th term, an extra byte array is allocated. References to terms then require indexing into 2 byte arrays and bit-shifting the 2nd byte to produce a combined short which can address up to 65k terms held in a term pool.

    When sorting, a fast comparison of 2 values can avoid always indexing into all byte arrays and shifting to produce a number. Simply comparing entries from the most significant byte array first can reveal a difference in order; if equal, comparing bytes from the next most significant byte array is required, and so on.

    Not sure how this would perform compared to simply upgrading whole byte arrays to shorts to ints as you go.
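
    A rough sketch of those two ideas together - lazily allocating the second
    byte array and comparing most significant bytes first (class and method
    names are made up):

    public class TwoLevelOrdinals
    {
        private final byte[] lo; // least significant 8 bits of each doc's term ordinal
        private byte[] hi;       // most significant 8 bits; only allocated on the 257th term

        public TwoLevelOrdinals(int maxDoc)
        {
            lo = new byte[maxDoc];
        }

        public void set(int doc, int ordinal) // ordinal in 0..65535
        {
            lo[doc] = (byte) ordinal;
            if (ordinal > 0xFF && hi == null)
                hi = new byte[lo.length]; // pay for the second array only when needed
            if (hi != null)
                hi[doc] = (byte) (ordinal >>> 8);
        }

        /**
         * Compares two docs' ordinals without combining bytes into a short:
         * if the high bytes differ, the low bytes are never consulted.
         */
        public int compare(int doc1, int doc2)
        {
            if (hi != null)
            {
                int h = (hi[doc1] & 0xFF) - (hi[doc2] & 0xFF);
                if (h != 0)
                    return h;
            }
            return (lo[doc1] & 0xFF) - (lo[doc2] & 0xFF);
        }
    }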

    Cheers,
    Mark

  • Mark harwood at Oct 15, 2008 at 4:56 pm
    Further to our discussion - see below a class that measures the added construction cost and memory savings of an optimised field value cache for a given index.
    The optimisation here is the initial use of byte arrays, then shorts, then ints as more unique terms emerge.
    I imagine the majority of "faceting" fields and, to a lesser extent, sorting fields (e.g. dates) have <= 65k unique terms and therefore stand to benefit from this.


    Cheers
    Mark


    ===========
    Begin code.......



    package lucene.sort;

    import java.io.IOException;
    import java.text.NumberFormat;
    import java.util.Collection;
    import java.util.Iterator;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermDocs;
    import org.apache.lucene.index.TermEnum;
    import org.apache.lucene.index.IndexReader.FieldOption;

    /**
     * Test to measure the cost of dynamically upgrading a field cache from a
     * byte array to a short array to an int array, depending on the term
     * distribution of the index content.
     * Currently tests all fields in an index, but it would probably be better
     * to measure a sensible subset of fields, i.e. those that are likely to
     * be cached.
     *
     * @author MAHarwood
     */
    public class BenchmarkOptimisedFieldCacheConstruction
    {
        static long totalExtraCachingCostMilliseconds = 0;
        static long totalRamBytesSaving = 0;
        private static int shortRange = ((int) Short.MAX_VALUE + (int) Math.abs(Short.MIN_VALUE));
        private static int byteRange = ((int) Byte.MAX_VALUE + (int) Math.abs(Byte.MIN_VALUE));
        static NumberFormat nf = NumberFormat.getIntegerInstance();

        public static void main(String[] args) throws Exception
        {
            nf.setGroupingUsed(true);

            //! Change this to analyse your choice of index
            IndexReader reader = IndexReader.open("/indexes/myTestIndex");
            int numDocs = reader.maxDoc();
            // Change the above value to fake the number of docs in the index
            // (thereby increasing the size of the arrays manipulated in this test)
            // int numDocs = 30 * 1000 * 1000;

            Collection fields = reader.getFieldNames(FieldOption.INDEXED);
            for (Iterator iterator = fields.iterator(); iterator.hasNext();)
            {
                String fieldName = (String) iterator.next();
                measureOptimisedCachingCost(reader, fieldName, numDocs);
            }
            System.out.println("Caching all terms in this index in an optimised form would cost an extra "
                    + totalExtraCachingCostMilliseconds + " millis but save "
                    + nf.format(totalRamBytesSaving) + " bytes RAM");
        }

        private static void measureOptimisedCachingCost(IndexReader reader, String field, int numDocs) throws IOException
        {
            TermDocs termDocs = reader.termDocs();
            TermEnum termEnum = reader.terms(new Term(field, ""));
            int t = 0; // current term number

            String[] mterms = new String[reader.maxDoc() + 1];

            // an entry for documents that have no terms in this field
            // should a document with no terms be at top or bottom?
            // this puts them at the top - if it is changed, FieldDocSortedHitQueue
            // needs to change as well.
            mterms[t++] = null;

            byte byteRefs[] = new byte[numDocs]; // 8 bits per doc used to refer into the term pool
            short shortRefs[] = null;
            int intRefs[] = null;
            long totalConvertTimeForField = 0;

            try
            {
                do
                {
                    Term term = termEnum.term();
                    if (term == null || term.field() != field)
                        break;
                    // store term text
                    // we expect that there is at most one term per document
                    if (t >= mterms.length)
                        throw new RuntimeException("there are more terms than "
                                + "documents in field \"" + field
                                + "\", but it's impossible to sort on "
                                + "tokenized fields");
                    mterms[t] = term.text();

                    termDocs.seek(termEnum);
                    while (termDocs.next())
                    {
                        int doc = termDocs.doc();
                        if (intRefs != null)
                        {
                            intRefs[doc] = t;
                        } else if (shortRefs != null)
                        {
                            // adjust number to make optimal use of the negative
                            // range of values that can be stored
                            shortRefs[doc] = (short) ((short) t - Short.MAX_VALUE);
                            int storedT = shortRefs[doc] + Short.MAX_VALUE;
                            if (storedT != t)
                            {
                                System.err.println(storedT + "!=" + t);
                            }
                        } else
                        {
                            // adjust number to make optimal use of the negative
                            // range of values that can be stored
                            byteRefs[doc] = (byte) ((byte) t - Byte.MAX_VALUE);
                        }
                    }
                    t++;
                    if ((byteRefs != null) && (shortRefs == null))
                    {
                        // more terms than can be accessed using a byte - move to shorts
                        if (t >= byteRange)
                        {
                            long millis = System.currentTimeMillis();
                            shortRefs = new short[numDocs];
                            short adjust = (Short.MAX_VALUE - (short) Byte.MAX_VALUE);
                            for (int i = 0; i < byteRefs.length; i++)
                            {
                                shortRefs[i] = (short) ((short) ((short) byteRefs[i]) - adjust);
                            }
                            long millisDiff = System.currentTimeMillis() - millis;
                            byteRefs = null;
                            totalConvertTimeForField += millisDiff;
                        }
                    } else
                    {
                        if (intRefs == null)
                        {
                            if (t >= shortRange)
                            {
                                // more terms than can be accessed using shorts - move to ints
                                long millis = System.currentTimeMillis();
                                intRefs = new int[numDocs];
                                int adjust = Short.MAX_VALUE;
                                for (int i = 0; i < shortRefs.length; i++)
                                {
                                    intRefs[i] = (int) shortRefs[i] + adjust;
                                }
                                long millisDiff = System.currentTimeMillis() - millis;
                                totalConvertTimeForField += millisDiff;
                                shortRefs = null;
                            }
                        }
                    }
                } while (termEnum.next());
            } finally
            {
                termDocs.close();
                termEnum.close();
            }
            if (intRefs != null)
            {
                long ramBytesSaving = 0;
                totalRamBytesSaving += ramBytesSaving;
                System.out.println("Field " + field + " added cache load cost of "
                        + totalConvertTimeForField
                        + " millis with no RAM saving over current FieldCacheImpl");
            } else
            {
                if (shortRefs != null)
                {
                    long ramBytesSaving = numDocs * 2; // short[] vs int[]: 2 bytes saved per doc
                    totalRamBytesSaving += ramBytesSaving;
                    System.out.println("Field " + field + " added cache load cost of "
                            + totalConvertTimeForField + " millis but saved "
                            + nf.format(ramBytesSaving)
                            + " bytes RAM over current FieldCacheImpl");
                } else
                {
                    long ramBytesSaving = numDocs * 3; // byte[] vs int[]: 3 bytes saved per doc
                    totalRamBytesSaving += ramBytesSaving;
                    System.out.println("Field " + field + " added cache load cost of "
                            + totalConvertTimeForField + " millis but saved "
                            + nf.format(ramBytesSaving)
                            + " bytes RAM over current FieldCacheImpl");
                }
            }
            totalExtraCachingCostMilliseconds += totalConvertTimeForField;
        }
    }





