Hi All,

I am trying to build a distributed system to build and serve Lucene indexes.
I came across the Distributed Lucene project:
http://wiki.apache.org/hadoop/DistributedLucene
https://issues.apache.org/jira/browse/HADOOP-3394

and have a couple of questions. It would be really helpful if someone could
provide some insights.

1) Is this code production ready?
2) Does anyone have performance data for this project?
3) It allows searches and updates/deletes to be performed at the same time.
How well will the system perform if there are frequent updates? Will it
handle the combined search and update load easily, or would it be better to
rebuild or update the indexes on separate machines and then deploy them back
to the machines that serve the indexes?

Basically, I am trying to choose between two approaches:

1) Use Hadoop to build and/or update Lucene indexes, then deploy them on a
separate cluster that takes care of load balancing, fault tolerance, etc.
There is a package in Hadoop contrib that does this, so I can use that code.

2) Use and/or modify the Distributed Lucene code.

I am expecting daily updates to our index, so I am not sure whether the
Distributed Lucene code (which allows searches and updates on the same
indexes) will be able to handle the combined search and update load
efficiently.

Any suggestions?

Thanks,
Tarandeep


  • Aaron Kimball at Jun 2, 2009 at 12:40 am
    According to the JIRA issue mentioned, it doesn't seem to be production
    ready. The author claims it is "a prototype... not ready for inclusion". The
    patch was posted over a year ago, and there's been no further work or
    discussion. You might want to email the author Mark Butler (
    https://issues.apache.org/jira/secure/ViewProfile.jspa?name=butlermh)
    directly.

    This probably renders the performance data question moot.

    In general, if you're using Hadoop and HDFS to serve some content, updates
    will need to be performed by rewriting a whole index. So frequent updates
    are going to be troublesome. Likely what will happen is that updates will
    need to be batched up, rewritten to new index files, and then those will be
    installed in place of the outdated ones. I haven't read their design doc,
    though, so they might do something different, but since HDFS doesn't allow
    for modification of closed files, it'll be challenging to be more clever
    than that.
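The batch-and-swap approach Aaron describes can be sketched with plain JDK file operations. This is illustrative only, not Katta or Hadoop contrib code: the `IndexPublisher` class and directory names are made up, and a real deployment would first copy the freshly built index out of HDFS onto the serving machine.

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of "rebuild, then install in place of the outdated index":
// each rebuild lands in its own versioned directory, and a `current`
// symlink is flipped atomically, so searchers always resolve either
// the complete old index or the complete new one, never a partial one.
public class IndexPublisher {

    public static Path publish(Path base, String version) throws IOException {
        Path versionDir = base.resolve("index-" + version);
        Files.createDirectories(versionDir);
        // stand-in for the freshly built Lucene index files
        Files.writeString(versionDir.resolve("segments.txt"), version);

        Path current = base.resolve("current");
        Path tmpLink = base.resolve("current.tmp");
        Files.deleteIfExists(tmpLink);
        Files.createSymbolicLink(tmpLink, versionDir.getFileName());
        // renaming a symlink over the old one is atomic on POSIX
        Files.move(tmpLink, current,
                StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
        return current;
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("indexes");
        publish(base, "v1");
        publish(base, "v2");
        // searchers following `current` now see only the v2 index
        System.out.println(Files.readString(
                base.resolve("current").resolve("segments.txt")));
    }
}
```

Old version directories can be garbage-collected once no searcher still has them open; the same idea works per machine or per shard.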

    You might want to go with option (1) and investigate using something like
    memcached, etc, to manage interactive query load.

    - Aaron
  • Stefan Groschupf at Jun 2, 2009 at 3:05 pm
    Hi,
    you might want to checkout:
    http://katta.sourceforge.net/

    Stefan

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Hadoop training and consulting
    http://www.scaleunlimited.com
    http://www.101tec.com


  • Tarandeep Singh at Jun 2, 2009 at 4:23 pm
    thanks all for your replies. I am checking Katta...

    -Tarandeep
  • Ted Dunning at Jun 2, 2009 at 6:09 pm
    Just a quick plug for Katta. We use it extensively (and have been sending
    back some patches).

    See www.deepdyve.com for a test drive.

    At my previous job, we had utter fits working with SOLR using sharded
    retrieval. Katta is designed to address the sharding problem very well and
    we have been very happy. Our extensions have been to adapt Katta so that it
    is a general sharding and replication engine that supports general queries.
    For some things we use a modified Lucene, for other things, we use our own
    code. Katta handles that really well.


    --
    Ted Dunning, CTO
    DeepDyve

    111 West Evelyn Ave. Ste. 202
    Sunnyvale, CA 94086
    http://www.deepdyve.com
    858-414-0013 (m)
    408-773-0220 (fax)
  • Devajyoti Sarkar at Jun 3, 2009 at 2:46 am
    Can Katta be used on an EC2 cluster?

    The reason why I ask this is that it appears to use ZooKeeper which ideally
    needs to have a dedicated drive. That may not be possible in a shared
    environment. Is this a non-issue with respect to Katta?
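For what it's worth, the dedicated-drive advice maps to ZooKeeper's `dataLogDir` setting: it is the transaction log that wants its own device, and on shared or virtualized disks you can simply point it at the same volume and accept the latency cost. The keys below are standard zoo.cfg options; the paths are illustrative only:

```
# zoo.cfg (illustrative paths)
tickTime=2000
clientPort=2181
# snapshot directory
dataDir=/var/zookeeper/data
# transaction log -- ideally a dedicated device; on EC2-style shared
# disks this can point at the same volume as dataDir
dataLogDir=/var/zookeeper/log
```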

    I would appreciate any input in this regard.

    Dev

  • Paco NATHAN at Jun 3, 2009 at 3:45 am
Our Katta cluster at ShareThis runs on EC2, managed using RightScale
templates (thanks @mikebabineau).

  • Devajyoti Sarkar at Jun 3, 2009 at 4:00 am
    Thanks a lot.

    Dev

Discussion Overview
group: common-user @ hadoop
posted: Jun 1, 2009 at 4:55 PM
active: Jun 3, 2009 at 4:00 AM
posts: 8
users: 6
website: hadoop.apache.org...
irc: #hadoop