FAQ
Hello all,

My indexing is growing by 1 million records per day and the memory
consumption of the searcher object is quite high.

There are different opinion in the groups. Few suggest to use single
database and few to use sharding. My Database has 10 million records now and
it might go till 30 million or more. I plan to shard the index. but
Multisearcher will give me benifit.

Regards
Ganesh


Send instant messages to your online friends http://in.messenger.yahoo.com

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Search Discussions

  • Anshum at Oct 3, 2008 at 4:19 pm
    Hi Ganesh,

    I have experimented with sharded indexes and they seem to benefit me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document based
    (which I guess it should be as otherwise you would have to build a complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically) seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,

    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records now and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ganesh at Oct 6, 2008 at 4:37 am
    Hello Anshum,

    My index is growing 1 million documents per day. Initially i planned to have
    a single database but the sorting of one or more fields consumes more RAM.
    Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and my
    app could get 1 GB of RAM. Search speed is fine but i need to display the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more after
    the 7 days. After 30 days the first index may get deleted. I need to keep
    the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB becomes
    easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message -----
    From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher

    Hi Ganesh,

    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically) seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,

    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send instant messages to your online friends http://in.messenger.yahoo.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Anshum at Oct 6, 2008 at 6:48 pm
    Hi Ganesh,
    About the memory consumption while sorting, it would end up using similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an entry in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your case and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that the
    data in each individual table is controlled (but remember there could be
    other bigger hassles like the one mentioned above about moving data between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,

    My index is growing 1 million documents per day. Initially i planned to
    have a single database but the sorting of one or more fields consumes more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and my
    app could get 1 GB of RAM. Search speed is fine but i need to display the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more
    after the 7 days. After 30 days the first index may get deleted. I need to
    keep the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,
    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically) seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,
    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ganesh at Oct 7, 2008 at 10:01 am
    Hello Anusham,

    My intention is to shard the index after every 7 days (week). After 30 days,
    (4th week) the first DB may get deleted. At any point of time i will be
    maitaining 3 to 4 DB.

    I want to know the pros and cons of the MultiSearcher or Index sharding
    approach. Any web links would be helpful.

    Regards
    Ganesh

    ----- Original Message -----
    From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 12:18 AM
    Subject: Re: Single searcher vs Multi Searcher

    Hi Ganesh,
    About the memory consumption while sorting, it would end up using similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an entry
    in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your case
    and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that the
    data in each individual table is controlled (but remember there could be
    other bigger hassles like the one mentioned above about moving data
    between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,

    My index is growing 1 million documents per day. Initially i planned to
    have a single database but the sorting of one or more fields consumes
    more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and
    my
    app could get 1 GB of RAM. Search speed is fine but i need to display the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more
    after the 7 days. After 30 days the first index may get deleted. I need
    to
    keep the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,
    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically)
    seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done
    the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,
    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records
    now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send instant messages to your online friends http://in.messenger.yahoo.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Anshum at Oct 7, 2008 at 12:26 pm
    There were all of the links I could find:
    http://findmeajob.wordpress.com/2007/07/31/lucene-singlesearcher-vs-multisearcher/
    http://archives.devshed.com/forums/java-118/how-do-lucene-ids-work-with-multireader-and-multisearcher-993682.html

    It has been only out of personal experimentation.
    Maintaining 4 indexes and having a multisearcher for the same sounds ok to
    me, though it might(most probably would) use up more time (depending on your
    term/doc distribution).
    I have tried multisearchers for indexes that are sized at around 18Gs with
    over 10 million records in them and they seem to be working fine. Memory
    utilization hasn't really been my problem though maintenance has been
    required to be taken care of if I had record updates as well i.e.
    Lets assume we have 4 indexes {1,2,3,4} with 4 being the oldest one. Now a
    document got updated and so you would want to move it to the freshest index
    viz. index 1. In this case you would also have to delete this document from
    index 4 (after figuring out that it exists in 4) else this doc would show up
    as 2 hits using a multisearcher.
    This is the document movement that I was talking about.
    Talking only about online resource utilization ( as what I just spoke about
    is generally an offline process), you would have to spend extra resource to
    merge the hits from the different indexes.
    The other overhead in case of a multisearcher would be of creating those
    many searchers - for each index(though shouldn't be much, just though would
    point out).
    I guess you should try it as speed of search is not realy all that important
    to you as compared to running it on a singe box within the memory
    limitation.



    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Tue, Oct 7, 2008 at 3:29 PM, Ganesh wrote:

    Hello Anusham,

    My intention is to shard the index after every 7 days (week). After 30
    days, (4th week) the first DB may get deleted. At any point of time i will
    be maitaining 3 to 4 DB.

    I want to know the pros and cons of the MultiSearcher or Index sharding
    approach. Any web links would be helpful.

    Regards
    Ganesh

    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 12:18 AM

    Subject: Re: Single searcher vs Multi Searcher


    Hi Ganesh,
    About the memory consumption while sorting, it would end up using similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an entry
    in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your case
    and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that the
    data in each individual table is controlled (but remember there could be
    other bigger hassles like the one mentioned above about moving data
    between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,
    My index is growing 1 million documents per day. Initially i planned to
    have a single database but the sorting of one or more fields consumes
    more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and
    my
    app could get 1 GB of RAM. Search speed is fine but i need to display the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more
    after the 7 days. After 30 days the first index may get deleted. I need
    to
    keep the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,
    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically)
    seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done
    the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,
    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records
    now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ganesh at Oct 8, 2008 at 12:10 pm
    Hello Anshum,

    In my case i have to add /modify records to the current index database and
    there will be only delete in older index DB. I will not have a situitation
    of updating the older records.

    Please brief me about your requirement and how you configured your database.
    Whether you have any incremental updates??
    What have gained by sharding the index.

    Regards
    Ganesh


    ----- Original Message -----
    From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 5:55 PM
    Subject: Re: Single searcher vs Multi Searcher

    There were all of the links I could find:
    http://findmeajob.wordpress.com/2007/07/31/lucene-singlesearcher-vs-multisearcher/
    http://archives.devshed.com/forums/java-118/how-do-lucene-ids-work-with-multireader-and-multisearcher-993682.html

    It has been only out of personal experimentation.
    Maintaining 4 indexes and having a multisearcher for the same sounds ok to
    me, though it might(most probably would) use up more time (depending on
    your
    term/doc distribution).
    I have tried multisearchers for indexes that are sized at around 18Gs with
    over 10 million records in them and they seem to be working fine. Memory
    utilization hasn't really been my problem though maintenance has been
    required to be taken care of if I had record updates as well i.e.
    Lets assume we have 4 indexes {1,2,3,4} with 4 being the oldest one. Now a
    document got updated and so you would want to move it to the freshest
    index
    viz. index 1. In this case you would also have to delete this document
    from
    index 4 (after figuring out that it exists in 4) else this doc would show
    up
    as 2 hits using a multisearcher.
    This is the document movement that I was talking about.
    Talking only about online resource utilization ( as what I just spoke
    about
    is generally an offline process), you would have to spend extra resource
    to
    merge the hits from the different indexes.
    The other overhead in case of a multisearcher would be of creating those
    many searchers - for each index(though shouldn't be much, just though
    would
    point out).
    I guess you should try it as speed of search is not realy all that
    important
    to you as compared to running it on a singe box within the memory
    limitation.



    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Tue, Oct 7, 2008 at 3:29 PM, Ganesh wrote:

    Hello Anusham,

    My intention is to shard the index after every 7 days (week). After 30
    days, (4th week) the first DB may get deleted. At any point of time i
    will
    be maitaining 3 to 4 DB.

    I want to know the pros and cons of the MultiSearcher or Index sharding
    approach. Any web links would be helpful.

    Regards
    Ganesh

    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 12:18 AM

    Subject: Re: Single searcher vs Multi Searcher


    Hi Ganesh,
    About the memory consumption while sorting, it would end up using
    similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a
    parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and
    then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really
    tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as
    you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an entry
    in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your case
    and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that
    the
    data in each individual table is controlled (but remember there could be
    other bigger hassles like the one mentioned above about moving data
    between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,
    My index is growing 1 million documents per day. Initially i planned to
    have a single database but the sorting of one or more fields consumes
    more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and
    my
    app could get 1 GB of RAM. Search speed is fine but i need to display
    the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more
    after the 7 days. After 30 days the first index may get deleted. I need
    to
    keep the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,
    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document
    based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically)
    seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done
    the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,
    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records
    now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send instant messages to your online friends http://in.messenger.yahoo.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Anshum at Oct 10, 2008 at 4:50 am
    Hi Ganesh,

    Your situation seems pretty straight.
    I did not really split my database (storage), just that while indexing, I
    indexed the data into 'n' indexes (basis a particular field).
    Talking about implementation and gain, I have tried using multi searcher as
    well as parallel multisearcher and I seem to be running searches faster (my
    priority being a better time complexity as compared to space complexity) in
    case of a parallelmultisearcher.
    This helps me perform my search in a little over 1/n times the regular
    search time.
    Also, there are times when I don't want to search on all the indexes and so
    I end up only searching a select few indexes. I would advice you to shard
    the indexes keeping in mind the search criteria as well so that, if possible
    you search for lesser data as opposed to searching for everything.
    Also, I did incrementally update my index and so deleted the older doc from
    the older index and added it to a new incremental index (which was my
    concern as this takes a little time assuming you have a lot of updates as
    well).

    All said and done, this is an implementation of doc based index sharding
    i.e. everything for a particular document goes to a particular index. I also
    tried splitting on terms (so that when I search for A and B, I know which
    indexes store all docs (and other info) for the terms A and B. But this
    would mean a lot of building things from scratch so I couldnt get to really
    try it. There are a few distributed lucene implementations (though they are
    not that good) you could have a look at them.

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Wed, Oct 8, 2008 at 5:39 PM, Ganesh wrote:

    Hello Anshum,

    In my case i have to add /modify records to the current index database and
    there will be only delete in older index DB. I will not have a situitation
    of updating the older records.

    Please brief me about your requirement and how you configured your
    database.
    Whether you have any incremental updates??
    What have gained by sharding the index.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 5:55 PM

    Subject: Re: Single searcher vs Multi Searcher


    There were all of the links I could find:
    http://findmeajob.wordpress.com/2007/07/31/lucene-singlesearcher-vs-multisearcher/

    http://archives.devshed.com/forums/java-118/how-do-lucene-ids-work-with-multireader-and-multisearcher-993682.html

    It has been only out of personal experimentation.
    Maintaining 4 indexes and having a multisearcher for the same sounds ok to
    me, though it might(most probably would) use up more time (depending on
    your
    term/doc distribution).
    I have tried multisearchers for indexes that are sized at around 18Gs with
    over 10 million records in them and they seem to be working fine. Memory
    utilization hasn't really been my problem though maintenance has been
    required to be taken care of if I had record updates as well i.e.
    Lets assume we have 4 indexes {1,2,3,4} with 4 being the oldest one. Now a
    document got updated and so you would want to move it to the freshest
    index
    viz. index 1. In this case you would also have to delete this document
    from
    index 4 (after figuring out that it exists in 4) else this doc would show
    up
    as 2 hits using a multisearcher.
    This is the document movement that I was talking about.
    Talking only about online resource utilization ( as what I just spoke
    about
    is generally an offline process), you would have to spend extra resource
    to
    merge the hits from the different indexes.
    The other overhead in case of a multisearcher would be of creating those
    many searchers - for each index(though shouldn't be much, just though
    would
    point out).
    I guess you should try it as speed of search is not realy all that
    important
    to you as compared to running it on a singe box within the memory
    limitation.



    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Tue, Oct 7, 2008 at 3:29 PM, Ganesh wrote:

    Hello Anusham,
    My intention is to shard the index after every 7 days (week). After 30
    days, (4th week) the first DB may get deleted. At any point of time i
    will
    be maitaining 3 to 4 DB.

    I want to know the pros and cons of the MultiSearcher or Index sharding
    approach. Any web links would be helpful.

    Regards
    Ganesh

    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 12:18 AM

    Subject: Re: Single searcher vs Multi Searcher


    Hi Ganesh,
    About the memory consumption while sorting, it would end up using
    similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a
    parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and
    then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really
    tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as
    you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an entry
    in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your case
    and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that
    the
    data in each individual table is controlled (but remember there could be
    other bigger hassles like the one mentioned above about moving data
    between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,
    My index is growing 1 million documents per day. Initially i planned to
    have a single database but the sorting of one or more fields consumes
    more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product and
    my
    app could get 1 GB of RAM. Search speed is fine but i need to display
    the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one more
    after the 7 days. After 30 days the first index may get deleted. I need
    to
    keep the documents in the index DB for 30 days. My Index DB is in HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,

    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard the
    indexes?
    2. How do you plan to split the index? Is it going to be document
    based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically)
    seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if done
    the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh wrote:

    Hello all,

    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use single
    database and few to use sharding. My Database has 10 million records
    now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org




    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
  • Ganesh at Oct 13, 2008 at 12:37 pm
    Hello Anshum,

    My criteria to shard would be date. I am planning to maintain 10 days of
    data in one DB. I think the memory required for single searcher and multi
    searcher would be more or less the same. In your case you may perform search
    on one DB but you might have all searcher objects of all DBs opened and
    warmed up and this will require same amount of memory as single searcher
    would do.

    Let me conclude:
    The benifits of going for sharding the database
    1) increase in search time
    2) increase in indexing speed
    3) easy maintance.

    Any other benifits?

    Regards
    Ganesh

    ----- Original Message -----
    From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 10, 2008 10:19 AM
    Subject: Re: Single searcher vs Multi Searcher

    Hi Ganesh,

    Your situation seems pretty straight.
    I did not really split my database (storage), just that while indexing, I
    indexed the data into 'n' indexes (basis a particular field).
    Talking about implementation and gain, I have tried using multi searcher
    as
    well as parallel multisearcher and I seem to be running searches faster
    (my
    priority being a better time complexity as compared to space complexity)
    in
    case of a parallelmultisearcher.
    This helps me perform my search in a little over 1/n times the regular
    search time.
    Also, there are times when I don't want to search on all the indexes and
    so
    I end up only searching a select few indexes. I would advice you to shard
    the indexes keeping in mind the search criteria as well so that, if
    possible
    you search for lesser data as opposed to searching for everything.
    Also, I did incrementally update my index and so deleted the older doc
    from
    the older index and added it to a new incremental index (which was my
    concern as this takes a little time assuming you have a lot of updates as
    well).

    All said and done, this is an implementation of doc based index sharding
    i.e. everything for a particular document goes to a particular index. I
    also
    tried splitting on terms (so that when I search for A and B, I know which
    indexes store all docs (and other info) for the terms A and B. But this
    would mean a lot of building things from scratch so I couldnt get to
    really
    try it. There are a few distributed lucene implementations (though they
    are
    not that good) you could have a look at them.

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............

    On Wed, Oct 8, 2008 at 5:39 PM, Ganesh wrote:

    Hello Anshum,

    In my case i have to add /modify records to the current index database
    and
    there will be only delete in older index DB. I will not have a
    situitation
    of updating the older records.

    Please brief me about your requirement and how you configured your
    database.
    Whether you have any incremental updates??
    What have gained by sharding the index.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 5:55 PM

    Subject: Re: Single searcher vs Multi Searcher


    There were all of the links I could find:
    http://findmeajob.wordpress.com/2007/07/31/lucene-singlesearcher-vs-multisearcher/

    http://archives.devshed.com/forums/java-118/how-do-lucene-ids-work-with-multireader-and-multisearcher-993682.html

    It has been only out of personal experimentation.
    Maintaining 4 indexes and having a multisearcher for the same sounds ok
    to
    me, though it might(most probably would) use up more time (depending on
    your
    term/doc distribution).
    I have tried multisearchers for indexes that are sized at around 18Gs
    with
    over 10 million records in them and they seem to be working fine. Memory
    utilization hasn't really been my problem though maintenance has been
    required to be taken care of if I had record updates as well i.e.
    Lets assume we have 4 indexes {1,2,3,4} with 4 being the oldest one. Now
    a
    document got updated and so you would want to move it to the freshest
    index
    viz. index 1. In this case you would also have to delete this document
    from
    index 4 (after figuring out that it exists in 4) else this doc would
    show
    up
    as 2 hits using a multisearcher.
    This is the document movement that I was talking about.
    Talking only about online resource utilization ( as what I just spoke
    about
    is generally an offline process), you would have to spend extra resource
    to
    merge the hits from the different indexes.
    The other overhead in case of a multisearcher would be of creating those
    many searchers - for each index(though shouldn't be much, just though
    would
    point out).
    I guess you should try it as speed of search is not realy all that
    important
    to you as compared to running it on a singe box within the memory
    limitation.



    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Tue, Oct 7, 2008 at 3:29 PM, Ganesh wrote:

    Hello Anusham,
    My intention is to shard the index after every 7 days (week). After 30
    days, (4th week) the first DB may get deleted. At any point of time i
    will
    be maitaining 3 to 4 DB.

    I want to know the pros and cons of the MultiSearcher or Index sharding
    approach. Any web links would be helpful.

    Regards
    Ganesh

    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Tuesday, October 07, 2008 12:18 AM

    Subject: Re: Single searcher vs Multi Searcher


    Hi Ganesh,
    About the memory consumption while sorting, it would end up using
    similar
    amounts, perhaps even more.. like in the case of regular parallel
    programming algorithms (hoping that you intend to search using a
    parallel
    multi searcher). Would you have to query particular indexes only for a
    particular search or would you be searching over all the indexes and
    then
    follow it up by merger (which the parallel multi searcher would do
    efficiently).?
    Also, I guess 30 indexes would be a little too many, haven't really
    tried
    out those many indexes for a multisearcher.
    As far as maintenance of DB is concerned, it might be easy as long as
    you
    don't have any document updates, in which case you'd have to shift the
    documents from one DB/index to another (which includes creating an
    entry
    in
    the latest index/DB and deleting the record from the older DB).
    I guess you'd have to pilot it, in case memory is an issue in your
    case
    and
    not speed, you could try a regular multisearcher instead of a parallel
    multisearcher.
    I guess when you say maintenance of the DB gets easier, you mean that
    the
    data in each individual table is controlled (but remember there could
    be
    other bigger hassles like the one mentioned above about moving data
    between
    indexes/DB).

    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me. The
    distinction is yours to draw............


    On Mon, Oct 6, 2008 at 10:06 AM, Ganesh wrote:

    Hello Anshum,
    My index is growing 1 million documents per day. Initially i planned
    to
    have a single database but the sorting of one or more fields consumes
    more
    RAM. Whether sharding the index would also consume the same.

    My application should co-exist with other application of my product
    and
    my
    app could get 1 GB of RAM. Search speed is fine but i need to display
    the
    result in the sorted order.

    I thought to keep 7 days of documents in one index and create one
    more
    after the 7 days. After 30 days the first index may get deleted. I
    need
    to
    keep the documents in the index DB for 30 days. My Index DB is in
    HDD.

    I want to the pros and cons of sharding. I think maintance of the DB
    becomes easier.

    It would be very much helpful, if you share some of your thoughts.

    Regards
    Ganesh


    ----- Original Message ----- From: "Anshum" <anshumg@gmail.com>
    To: <java-user@lucene.apache.org>
    Sent: Friday, October 03, 2008 9:48 PM
    Subject: Re: Single searcher vs Multi Searcher



    Hi Ganesh,

    I have experimented with sharded indexes and they seem to benefit
    me(atleast
    in my case). I would like to know a few things before I answer your
    question:
    1. Do you have a reasonable criteria ( a calculated one) to shard
    the
    indexes?
    2. How do you plan to split the index? Is it going to be document
    based
    (which I guess it should be as otherwise you would have to build a
    complete
    distributed system)
    3. Do you plan to put your indexes on the RAM or on (physically)
    seperate
    HDDs?

    Though all said and done, sharded indexes are a good approach, if
    done
    the
    right way.
    --
    Anshum Gupta
    Naukri Labs!
    http://ai-cafe.blogspot.com

    The facts expressed here belong to everybody, the opinions to me.
    The
    distinction is yours to draw............


    On Fri, Oct 3, 2008 at 3:01 PM, Ganesh <emailgane@yahoo.co.in>
    wrote:

    Hello all,

    My indexing is growing by 1 million records per day and the memory
    consumption of the searcher object is quite high.

    There are different opinion in the groups. Few suggest to use
    single
    database and few to use sharding. My Database has 10 million
    records
    now
    and
    it might go till 30 million or more. I plan to shard the index. but
    Multisearcher will give me benifit.

    Regards
    Ganesh


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org




    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org


    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

    Send instant messages to your online friends
    http://in.messenger.yahoo.com
    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org
    Send instant messages to your online friends http://in.messenger.yahoo.com

    ---------------------------------------------------------------------
    To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
    For additional commands, e-mail: java-user-help@lucene.apache.org

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupjava-user @
categorieslucene
postedOct 3, '08 at 9:32a
activeOct 13, '08 at 12:37p
posts9
users2
websitelucene.apache.org

2 users in discussion

Ganesh: 5 posts Anshum: 4 posts

People

Translate

site design / logo © 2022 Grokbase