Hi,

I am trying to create Lucene indexes using the "contrib/index/hadoop-0.19.1-index.jar" provided with Hadoop.
Since it can be run as a map-reduce job, I expected it to process large data sets very fast.
It processes small amounts of data (< 5 MB) very quickly.

Now 5 GB of input data is provided, and the fun starts :)

It runs out of memory. I increased the "mapred.child.java.opts" parameter in "hadoop-default.xml" to -Xmx1000m.
The processing went smoothly for 1.5 hours, completing 30% of the job.
Then the master node hung.
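
The same property can also be set per job instead of in hadoop-default.xml; a minimal sketch, assuming the 0.19-era JobConf API (the class name is illustrative):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class IndexJobLauncher {                    // illustrative name
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(IndexJobLauncher.class);
            // Same knob as in hadoop-default.xml, but scoped to this job only.
            conf.set("mapred.child.java.opts", "-Xmx1000m");
            // ... set mapper/reducer classes and input/output paths here ...
            JobClient.runJob(conf);
        }
    }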

Is there any way to get "contrib/index/hadoop-0.19.1-index.jar" going?
Is there a memory leak in the jar?

Can you suggest some alternatives?

Thanks,
- Bhushan

  • Ted Dunning at Jul 9, 2009 at 3:38 pm
    You don't mention what size cluster you have, but we use a relatively small
    cluster and index hundreds of GB in an hour to a few hours (depending on the
    content and the size of the cluster). So your results are anomalous.

    However, we wrote our own indexer. The way it works is that documents are
    given randomized shard numbers and all of the documents for a single shard
    are indexed by a reduce function that produces a Lucene index on local disk
    and copies it to the central system. This is a very common idiom and I am
    sure that we essentially copied sample code from somewhere (likely from the
    Katta distribution). We use a relatively large number of shards and don't
    try to merge the shard indexes after indexing.
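
    Roughly, the idiom looks something like this (class names, shard count, and
    paths are made up; the old 0.19 mapred API and a Lucene 3.0-style IndexWriter
    are assumptions, not our actual code):

        // Map: tag each document with a random shard id.
        // Reduce: build one Lucene index per shard on local disk, then copy it out.
        import java.io.File;
        import java.io.IOException;
        import java.util.Iterator;
        import java.util.Random;

        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapred.*;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.document.Document;
        import org.apache.lucene.document.Field;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.store.FSDirectory;
        import org.apache.lucene.util.Version;

        public class ShardIndexer {
            static final int NUM_SHARDS = 8;   // made-up number; tune per corpus

            public static class ShardMapper extends MapReduceBase
                    implements Mapper<LongWritable, Text, IntWritable, Text> {
                private final Random random = new Random();

                public void map(LongWritable offset, Text doc,
                                OutputCollector<IntWritable, Text> out,
                                Reporter reporter) throws IOException {
                    // Randomized shard numbers keep shard sizes roughly even.
                    out.collect(new IntWritable(random.nextInt(NUM_SHARDS)), doc);
                }
            }

            public static class ShardReducer extends MapReduceBase
                    implements Reducer<IntWritable, Text, IntWritable, Text> {
                private JobConf conf;

                public void configure(JobConf job) { this.conf = job; }

                public void reduce(IntWritable shard, Iterator<Text> docs,
                                   OutputCollector<IntWritable, Text> out,
                                   Reporter reporter) throws IOException {
                    // Build the shard index on local disk.
                    File local = new File("/tmp/shard-" + shard.get());
                    IndexWriter writer = new IndexWriter(FSDirectory.open(local),
                            new StandardAnalyzer(Version.LUCENE_30), true,
                            IndexWriter.MaxFieldLength.UNLIMITED);
                    while (docs.hasNext()) {
                        Document d = new Document();
                        d.add(new Field("body", docs.next().toString(),
                                Field.Store.NO, Field.Index.ANALYZED));
                        writer.addDocument(d);
                        reporter.progress();   // keep the task from timing out
                    }
                    writer.close();
                    // Copy the finished shard index to the central location (HDFS here).
                    FileSystem.get(conf).copyFromLocalFile(new Path(local.getPath()),
                            new Path("/indexes/shard-" + shard.get()));
                }
            }
        }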

    I haven't looked at the index contrib code lately so I can't comment
    specifically on that. It is quite possible that it isn't widely used.

    My guess is that you have something very simple going awry.

    Can you say more about your cluster size, type of machine, operating system,
    what your average document size is, whether you are trying to merge indexes,
    and whether you are using the RAMDirectory or the FSDirectory?
  • Ken Krugler at Jul 9, 2009 at 4:18 pm

    FWIW, at Krugle a major performance bottleneck with index creation
    was the merge time. We had expected newer versions of Lucene to
    dramatically reduce the time required to do merges, but that didn't
    happen, even with a few stabs at tuning various parameters.

    So, if you can avoid merging by using a reasonable number of
    shards, that can significantly reduce the total time required to
    build the index.

    And as for the actual index generation, with the Bixo project we're
    using some Katta-derived code...it's a Cascading Scheme (called
    IndexScheme) that generates an index. Pretty straightforward, other
    than needing to call Hadoop's reporter via a thread so the task
    doesn't time out during a long Lucene optimize.
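
    Roughly, that keep-alive looks like this (a made-up helper, not the actual
    IndexScheme code; the 30-second interval is arbitrary):

        import org.apache.hadoop.mapred.Reporter;

        /** Pings the Reporter from a background thread so a long
         *  IndexWriter.optimize() doesn't trip the task timeout. */
        public class KeepAliveThread extends Thread {
            private final Reporter reporter;
            private volatile boolean done = false;

            public KeepAliveThread(Reporter reporter) {
                this.reporter = reporter;
                setDaemon(true);
            }

            public void run() {
                while (!done) {
                    reporter.progress();              // resets the inactivity timer
                    try { Thread.sleep(30000L); }     // any interval < mapred.task.timeout
                    catch (InterruptedException e) { return; }
                }
            }

            public void finish() { done = true; interrupt(); }
        }

        // In the reducer, around the long optimize:
        //   KeepAliveThread keepAlive = new KeepAliveThread(reporter);
        //   keepAlive.start();
        //   writer.optimize();
        //   keepAlive.finish();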

    We wind up with one index (shard) per reducer, so by controlling the
    number of reducers we can vary the shard count, down to a minimum
    count == the number of slaves in the processing cluster.
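
    E.g., with the old mapred API that's just the reduce-task count (the 16 below
    is only an example, and the class name is made up):

        import org.apache.hadoop.mapred.JobConf;

        public class ShardCountSetup {                 // made-up name
            public static void main(String[] args) {
                JobConf conf = new JobConf(ShardCountSetup.class);
                // One index (shard) per reducer, so reducers == shard count.
                conf.setNumReduceTasks(16);
            }
        }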

    -- Ken

    --
    Ken Krugler
    +1 530-210-6378
  • Ted Dunning at Jul 9, 2009 at 4:58 pm
    Exactly as we do.

    Also, I find that with a collection large enough to care about speed, we
    have many more shards than we have reducers, so parallelism in indexing is
    nearly perfect.
    On Thu, Jul 9, 2009 at 9:13 AM, Ken Krugler wrote:


    We wind up with one index (shard) per reducer, so by controlling the number
    of reducers we can vary the shard count, down to a minimum count == the
    number of slaves in the processing cluster.



    --
    Ted Dunning, CTO
    DeepDyve
  • Ning Li at Jul 11, 2009 at 5:09 pm
    HADOOP-5491 should solve your memory problem with the index contrib.

    You can check out Katta if you also want shard management.

    Cheers,
    Ning

Discussion Overview
group: common-user
categories: hadoop
posted: Jul 9, '09 at 3:27p
active: Jul 11, '09 at 5:09p
posts: 5
users: 4
website: hadoop.apache.org...
irc: #hadoop
