FAQ
Does it make sense to build large Lucene indexes on the Hadoop DFS? I
would think there would be performance concerns. Has anyone had
experience with this?

Thanks

-John


  • Dennis Kubes at Nov 19, 2006 at 6:40 pm
    You would build the indexes on Hadoop but then move them to local file
    systems for searching. You wouldn't want to perform searches using the DFS.
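
    A minimal sketch of that copy-then-search step, assuming current
    Hadoop and Lucene APIs rather than the 2006-era ones; the paths and
    class name are hypothetical:

        import java.nio.file.Paths;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.lucene.index.DirectoryReader;
        import org.apache.lucene.search.IndexSearcher;
        import org.apache.lucene.store.FSDirectory;

        public class PullIndexLocal {
          public static void main(String[] args) throws Exception {
            // Pull the index built on the cluster down to local disk.
            FileSystem dfs = FileSystem.get(new Configuration());
            dfs.copyToLocalFile(new Path("/indexes/part-00000"),  // hypothetical DFS path
                                new Path("/var/search/index"));   // hypothetical local path

            // Search the local copy rather than the DFS-resident one.
            FSDirectory dir = FSDirectory.open(Paths.get("/var/search/index"));
            IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
            // ... run queries against searcher ...
          }
        }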

    Dennis

    John Wang wrote:
    Does it make sense to build large Lucene indexes on the Hadoop DFS? I
    would think there would be performance concerns. Has anyone had
    experience with this?

    Thanks

    -John
  • Howard chen at Nov 20, 2006 at 6:57 am

    On 11/20/06, Dennis Kubes wrote:
    You would build the indexes on Hadoop but then move them to local file
    systems for searching. You wouldn't want to perform searches using the DFS.

    Why not?

    Using Hadoop, we can use clusters of servers to answer a single
    query; isn't putting the index on the DFS the only way to do that?
  • Doug Cutting at Nov 20, 2006 at 7:03 pm

    Dennis Kubes wrote:
    You would build the indexes on Hadoop but then move them to local file
    systems for searching. You wouldn't want to perform searches using the
    DFS.

    Creating Lucene indexes directly in DFS would be pretty slow. Nutch
    creates them locally, then copies them to DFS to avoid this.

    One could create a Lucene Directory implementation optimized for
    updates, where new files are written locally, and only flushed to DFS
    when the Directory is closed. When updating, Lucene creates and reads
    lots of files that might not last very long, so there's little point in
    replicating them on the network. For many applications, that should be
    considerably faster than either updating indexes directly in HDFS, or
    copying the entire index locally, modifying it, then copying it back.
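
    A minimal sketch of that idea, written against the modern Lucene
    FilterDirectory class (which postdates this thread); the class name
    is hypothetical, and a real implementation would also need to handle
    failures partway through the flush:

        import java.io.File;
        import java.io.IOException;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FilterDirectory;

        // Delegate all writes to a local Directory; ship only the files
        // that survive until close(), since short-lived merge files never
        // need to cross the network.
        public class LocalThenDfsDirectory extends FilterDirectory {
          private final File localRoot;
          private final FileSystem dfs;
          private final Path dfsRoot;

          public LocalThenDfsDirectory(Directory local, File localRoot,
                                       FileSystem dfs, Path dfsRoot) {
            super(local);
            this.localRoot = localRoot;
            this.dfs = dfs;
            this.dfsRoot = dfsRoot;
          }

          @Override
          public void close() throws IOException {
            // Whatever the local directory still lists is the finished
            // index; flush those files to DFS, then close the delegate.
            for (String name : in.listAll()) {
              dfs.copyFromLocalFile(
                  new Path(new File(localRoot, name).getAbsolutePath()),
                  new Path(dfsRoot, name));
            }
            super.close();
          }
        }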

    Lucene search works from HDFS-resident indexes, but it is slow, especially
    if the indexes were created on a different node than the one searching
    them. (HDFS tries to write one replica of each block locally on the
    node where it is created.)

    Doug
  • Dennis Kubes at Nov 20, 2006 at 11:51 pm
    I should have been more specific. Create the indexes using MapReduce,
    then store them on the DFS using the indexer job. To have clusters of
    servers answer a single query, we have found the best practice to be
    splitting the index and associated databases into smaller pieces and
    keeping those pieces on local file systems, fronted by distributed
    search servers. A search website then uses the search servers to
    answer the query. An example of this setup can be found in the
    NutchHadoopTutorial on the Nutch wiki.
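
    If memory of the 0.8-era tutorial serves (treat the exact command and
    file format as assumptions), each node served its local piece with
    something like

        bin/nutch server 9999 /local/search/part1

    and the search front end located the servers through a
    conf/search-servers.txt listing one "host port" pair per line:

        node1 9999
        node2 9999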

    Dennis

    Doug Cutting wrote:
    Dennis Kubes wrote:
    You would build the indexes on Hadoop but then move them to local
    file systems for searching. You wouldn't want to perform searches
    using the DFS.

    Creating Lucene indexes directly in DFS would be pretty slow. Nutch
    creates them locally, then copies them to DFS to avoid this.

    One could create a Lucene Directory implementation optimized for
    updates, where new files are written locally, and only flushed to DFS
    when the Directory is closed. When updating, Lucene creates and reads
    lots of files that might not last very long, so there's little point
    in replicating them on the network. For many applications, that
    should be considerably faster than either updating indexes directly in
    HDFS, or copying the entire index locally, modifying it, then copying
    it back.

    Lucene search works from HDFS-resident indexes, but it is slow,
    especially if the indexes were created on a different node than the
    one searching them. (HDFS tries to write one replica of each block
    locally on the node where it is created.)

    Doug
