Hello everyone,

I have written a custom partitioner for partitioning datasets. I want to partition two datasets with the same partitioner and then, in the next MapReduce job, have each mapper handle the same partition from both sources and perform an operation such as a join. How can I ensure that one mapper gets the splits that correspond to the same partition from both sources?

Any help would be highly appreciated.
Alex


  • Alex Loddengaard at Jul 5, 2010 at 6:16 pm
    Hi there,

    Unfortunately, you can't control which mapper gets which data. The
    InputSplit -> map task assignment is random. You could, however, do the
    join in the reducer, by using your join key as the intermediate key.

    Does that make sense?

    Alex
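
The reduce-side join suggested above can be sketched without any framework. This is only an illustration of the idea, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `reduce_side_join`) and the source tags "A" and "B" are hypothetical, and the in-memory `shuffle` merely stands in for the grouping that MapReduce performs between map and reduce. Each mapper tags its records with the dataset they came from and emits the join key as the intermediate key, so all records sharing a key meet in one reduce call.

```python
from collections import defaultdict
from itertools import product


def map_phase(records, source_tag):
    """Emit (join_key, (source_tag, value)) for every input record."""
    for key, value in records:
        yield key, (source_tag, value)


def shuffle(pairs):
    """Group tagged values by key (stand-in for the framework's shuffle)."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(key, tagged_values):
    """Inner join: pair each value from source A with each value from source B."""
    left = [value for tag, value in tagged_values if tag == "A"]
    right = [value for tag, value in tagged_values if tag == "B"]
    for l, r in product(left, right):
        yield key, l, r


def reduce_side_join(dataset_a, dataset_b):
    """Run map, shuffle and reduce over two lists of (key, value) records."""
    pairs = list(map_phase(dataset_a, "A")) + list(map_phase(dataset_b, "B"))
    joined = []
    for key, tagged in sorted(shuffle(pairs).items()):
        joined.extend(reduce_phase(key, tagged))
    return joined
```

Calling `reduce_side_join([(1, "ann"), (2, "bob")], [(1, "book"), (1, "pen")])` returns `[(1, "ann", "book"), (1, "ann", "pen")]`; key 2 has no match in the second dataset, so it is dropped, as in an inner join.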

  • Denim Live at Jul 6, 2010 at 9:38 am
    Hi,
    Yes, it makes sense to do the join on the reduce side, but I want to do it
    the other way round, on the map side. One option, which someone from
    Cloudera suggested, is something like this: "write out all the partition
    numbers (one per line) to a
    single file, then use the NLineInputFormat to make each line its own map
    task. Then in your mapper itself, you will get in a key of "0" or "1" or "2"
    etc. Then explicitly open /dataset1/part-(n) and /dataset2/part-(n) in your
    mapper."

    This is one option. Any other suggestions are welcome.
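
The partition-number approach quoted above can be sketched in plain Python, with local files standing in for HDFS. Everything here is an assumption made for illustration: the modulo `partition_of` function stands in for the custom partitioner, the `part-<n>` file names and tab-separated record layout are hypothetical, and `map_task` imitates a mapper whose input is one line holding a partition number. (With NLineInputFormat, the line text normally arrives as the map input value, with the byte offset as the key.) The scheme only works if both datasets were written with the same partitioner and the same number of partitions.

```python
import os
import tempfile
from collections import defaultdict


def partition_of(key, num_partitions):
    """Hypothetical shared partitioner: integer key modulo partition count."""
    return key % num_partitions


def write_partitions(records, out_dir, num_partitions):
    """Stand-in for the first job: write each record to out_dir/part-<n>."""
    os.makedirs(out_dir, exist_ok=True)
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append((key, value))
    for n in range(num_partitions):
        with open(os.path.join(out_dir, f"part-{n}"), "w") as f:
            for key, value in buckets[n]:
                f.write(f"{key}\t{value}\n")


def read_part(path):
    """Yield (int key, value) pairs from a tab-separated part file."""
    with open(path) as f:
        for line in f:
            key, value = line.rstrip("\n").split("\t", 1)
            yield int(key), value


def map_task(partition_line, dir1, dir2):
    """Stand-in for one mapper: join part-<n> of both datasets in memory."""
    n = int(partition_line.strip())
    # Load this partition of the first dataset into a hash table.
    table = defaultdict(list)
    for key, value in read_part(os.path.join(dir1, f"part-{n}")):
        table[key].append(value)
    # Stream the matching partition of the second dataset and probe the table.
    for key, value in read_part(os.path.join(dir2, f"part-{n}")):
        for left in table.get(key, []):
            yield key, left, value
```

Because equal keys always land in the same partition number in both datasets, each `map_task` call sees every possible match for its keys and never needs data from another partition. The hash-table join assumes one partition of the first dataset fits in memory; if both part files are sorted by key, a merge join would avoid that requirement.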






Discussion Overview
group: mapreduce-user @
category: hadoop
posted: Jul 3, '10 at 4:30p
active: Jul 6, '10 at 9:38a
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
