Hi Ning, I am also looking at different approaches to indexing with Hadoop.
I could build the index into HDFS using the contrib package for Hadoop, but
since HDFS is not designed for random access, what would be the other
recommended ways to move the indexes to the local file system?

Also, what would be the best approach to begin with? Should we look into
Katta or Solr integrations?

Thanks in advance.

Ning Li-5 wrote:
I'm missing why you would ever want the Lucene index in HDFS for
The Lucene indexes are written to HDFS, but that does not mean you
conduct search on the indexes stored in HDFS directly. HDFS is not
designed for random access. Usually the indexes are copied to the
nodes where search will be served. With
http://issues.apache.org/jira/browse/HADOOP-4801, however, it may
become feasible to search on HDFS directly.
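
The copy step described above could be scripted roughly as follows. This is
only a minimal sketch, assuming the `hadoop` command-line client is on the
PATH of the search node. The function names and the example paths are
hypothetical and are not part of Hadoop or contrib/index.

```python
import subprocess


def build_copy_command(hdfs_index_dir, local_dir):
    """Return the argv for copying one index directory out of HDFS.

    Relies on the standard `hadoop fs -copyToLocal <src> <localdst>` command.
    """
    return ["hadoop", "fs", "-copyToLocal", hdfs_index_dir, local_dir]


def fetch_index(hdfs_index_dir, local_dir):
    """Copy the index to local disk; raises CalledProcessError on failure."""
    subprocess.run(build_copy_command(hdfs_index_dir, local_dir), check=True)
    return local_dir


# Hypothetical paths, shown only to illustrate the call:
# fetch_index("/indexes/shard-0", "/data/search/shard-0")
```

Search is then served from the local copy, so Lucene's random reads hit local
disk rather than HDFS.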


On Mon, Mar 16, 2009 at 4:52 PM, Ian Soboroff wrote:

Does anyone have stats on how multiple readers on an optimized Lucene
index in HDFS compare with a ParallelMultiReader (or whatever it's
called) over RPC on a local filesystem?

I'm missing why you would ever want the Lucene index in HDFS for


Ning Li <ning.li.00@gmail.com> writes:
I should have pointed out that Nutch index build and contrib/index
target different applications. The latter is for applications that
simply want to build a Lucene index from a set of documents - e.g. no
link analysis.

As to writing Lucene indexes, both work the same way - write the final
results to local file system and then copy to HDFS. In contrib/index,
the intermediate results are in memory and not written to HDFS.

Hope it clarifies things.


On Mon, Mar 16, 2009 at 2:57 PM, Ian Soboroff <ian.soboroff@nist.gov> wrote:
I understand why you would index in the reduce phase, because the
text gets shuffled to be next to the document. However, when you index
in the map phase, don't you just have to reindex later?

The main point to the OP is that HDFS is a bad FS for writing Lucene
indexes because of how Lucene works. The simple approach is to write
your index outside of HDFS in the reduce phase, and then merge the
indexes from each reducer manually.
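
To make the "merge the indexes from each reducer" step concrete, here is a toy
model in plain Python. It is not Lucene code: each per-reducer index is
represented as a hypothetical `(num_docs, postings)` pair, and merging appends
the indexes one after another while shifting document numbers by a running
base. With real Lucene indexes one would more likely rely on the `IndexWriter`
merge facilities (the `addIndexes` family) than on anything hand-written.

```python
def merge_reducer_indexes(reducer_indexes):
    """Merge per-reducer toy indexes into a single toy index.

    Each element is (num_docs, postings), where postings maps a term to a
    sorted list of local document numbers in range(num_docs).
    Returns (total_docs, merged_postings).
    """
    merged = {}
    base = 0  # document-number offset of the index currently being appended
    for num_docs, postings in reducer_indexes:
        for term, local_ids in postings.items():
            merged.setdefault(term, []).extend(base + i for i in local_ids)
        base += num_docs
    return base, merged


part0 = (2, {"hadoop": [0], "lucene": [0, 1]})
part1 = (3, {"lucene": [2], "index": [0, 1]})
total_docs, merged = merge_reducer_indexes([part0, part1])
```

Because each reducer writes its own index, no two writers ever touch the same
files; the only coordination needed is this final merge.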


Ning Li <ning.li.00@gmail.com> writes:
Or you can check out the index contrib. The difference between the two is:
- In Nutch's indexing map/reduce job, indexes are built in the
reduce phase. Afterwards, they are merged into a smaller number of
shards if necessary. The last time I checked, the merge process does
not use map/reduce.
- In contrib/index, small indexes are built in the map phase. They
are merged into the desired number of shards in the reduce phase. In
addition, they can be merged into existing shards.
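
The contrib/index flow described in the second item can be pictured with a toy
model in plain Python. This is a sketch under stated assumptions, not the
contrib/index implementation: the inverted index is a simple dict, and
`shard_of` is a hypothetical routing rule (a stable hash of the document id)
standing in for whatever distribution policy the job is configured with.

```python
from collections import defaultdict
import zlib


def map_build_small_index(docs):
    """Map phase: build a small in-memory index from one input split.

    docs is a list of (doc_id, text) pairs; the result maps term -> set of ids.
    """
    index = defaultdict(set)
    for doc_id, text in docs:
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def shard_of(doc_id, num_shards):
    """Hypothetical routing rule: a stable hash of the doc id picks the shard."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def reduce_merge_into_shards(small_indexes, num_shards):
    """Reduce phase: merge the small indexes into the desired number of shards."""
    shards = [defaultdict(set) for _ in range(num_shards)]
    for small in small_indexes:
        for term, doc_ids in small.items():
            for doc_id in doc_ids:
                shards[shard_of(doc_id, num_shards)][term].add(doc_id)
    return [dict(shard) for shard in shards]


splits = [
    [("a", "hadoop lucene"), ("b", "lucene index")],
    [("c", "hadoop index")],
]
small_indexes = [map_build_small_index(split) for split in splits]
shards = reduce_merge_into_shards(small_indexes, 2)
```

Each document ends up in exactly one shard, and the number of shards is chosen
independently of the number of map tasks.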


On Fri, Mar 13, 2009 at 1:34 AM, 王红宝 wrote:
You can look at the Nutch code.

2009/3/13 Mark Kerzner <markkerzner@gmail.com>

How do I allow multiple nodes to write to the same index file in

Thank you,
View this message in context: http://www.nabble.com/Creating-Lucene-index-in-Hadoop-tp22490120p25780366.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

Discussion Overview
group: common-user @
posted: Mar 13, '09 at 4:38a
active: Oct 7, '09 at 4:10p


