I've got a Hadoop process whose output is a MapFile. With one reducer this is
very slow (the map is large), but with 150 reducers (on a cluster of 80 nodes)
it runs quickly. The problem is that it then produces 150 output files as
well. In a subsequent process I need to perform lookups in this map - what is
the recommended way to do this, given that I may not know the number of
MapFiles produced or their names? Is there a cleaner solution than listing the
contents of the directory containing all of the MapFiles and querying each in
sequence?
--
View this message in context: http://www.nabble.com/Performing-a-Lookup-in-Multiple-MapFiles--tp20565940p20565940.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

  • Lohit at Nov 18, 2008 at 8:02 pm
    Hi Dan,

    You could do one of a few things to get around this.
    1. In a subsequent step you could merge all your MapFile outputs into one file. This works if your combined MapFile output is small.
    2. Otherwise, you can use the same partition function Hadoop used to compute the partition ID. The partition ID tells you which of the 150 output files your key is in.
    E.g., if the partition ID is 23, the output file to look in is part-00023 in the generated output.
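Option 1 hinges on the fact that a MapFile must be written in sorted key order, so combining many sorted outputs is a k-way merge, not a simple concatenation. Here is a minimal sketch of that merge logic in plain Java (the `KWayMerge` class is hypothetical, and the actual `MapFile.Reader`/`MapFile.Writer` plumbing is omitted):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class KWayMerge {

    // Merge several individually sorted key lists into one sorted list --
    // the same shape of work needed to combine the sorted entries of many
    // MapFiles into a single MapFile, since MapFile.Writer requires keys
    // to arrive in sorted order.
    public static List<String> merge(List<List<String>> sortedInputs) {
        // Heap entries are {inputIndex, positionInInput}, ordered by the
        // key each entry currently points at.
        PriorityQueue<int[]> heap = new PriorityQueue<>(
                Comparator.comparing((int[] e) -> sortedInputs.get(e[0]).get(e[1])));
        for (int i = 0; i < sortedInputs.size(); i++) {
            if (!sortedInputs.get(i).isEmpty()) {
                heap.add(new int[] { i, 0 });
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            // Take the smallest current key, then advance that input.
            int[] e = heap.poll();
            List<String> source = sortedInputs.get(e[0]);
            merged.add(source.get(e[1]));
            if (e[1] + 1 < source.size()) {
                heap.add(new int[] { e[0], e[1] + 1 });
            }
        }
        return merged;
    }
}
```

In the real job, each input would be a MapFile.Reader iterated with next(), and the merged stream would be appended to a single MapFile.Writer.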

    You can use your own Partitioner class (make sure you use the same one for your first job as well as the second) or reuse the one already used by Hadoop. http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/mapred/Partitioner.html has the details.

    I think http://hadoop.apache.org/core/docs/r0.18.2/api/org/apache/hadoop/examples/SleepJob.html shows an example of its usage (look for SleepJob.java).
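For option 2, the lookup side only has to reproduce the partitioner's arithmetic. The sketch below mirrors the logic of Hadoop's default HashPartitioner (nonnegative hashCode modulo the number of reduce tasks) and the usual part-NNNNN naming of reduce outputs; the `MapFileLocator` class itself is hypothetical, and if the job used a custom Partitioner you must reuse that class's getPartition logic instead:

```java
public class MapFileLocator {

    // Mirror of HashPartitioner's logic: mask off the sign bit so the
    // result is nonnegative, then take it modulo the reducer count.
    // Assumes the lookup key hashes the same way the job's key type did.
    public static int partition(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Name of the reducer output (the MapFile directory) that holds the
    // key, e.g. "part-00023" for partition 23.
    public static String partFileFor(Object key, int numReduceTasks) {
        return String.format("part-%05d", partition(key, numReduceTasks));
    }
}
```

With 150 reducers, a lookup would then open a single MapFile.Reader on outputDir/partFileFor(key, 150) instead of probing all 150 files.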

    -Lohit




    ----- Original Message ----
    From: Dan Benjamin <dbenjam@amazon.com>
    To: core-user@hadoop.apache.org
    Sent: Tuesday, November 18, 2008 10:53:47 AM
    Subject: Performing a Lookup in Multiple MapFiles?



Discussion Overview
group: common-user
categories: hadoop
posted: Nov 18, '08 at 6:54p
active: Nov 18, '08 at 8:02p
posts: 2
users: 2
website: hadoop.apache.org...
irc: #hadoop

2 users in discussion: Lohit (1 post), Dan Benjamin (1 post)
