Hello everyone,


I have written a custom partitioner for partitioning datasets. I want to
partition two datasets using the same partitioner and then, in the next
MapReduce job, have each mapper handle the same partition from both sources
and perform some operation such as a join. How can I ensure that one mapper
gets the splits that correspond to the same partition from both sources?


Any help would be highly appreciated.

  • Aaron Kimball at Jul 5, 2010 at 7:53 am
    One possibility: write out all the partition numbers (one per line) to a
    single file, then use the NLineInputFormat to make each line its own map
    task. Then in your mapper itself, you will get an input value of "0" or "1"
    or "2" etc. (the key is just the line's byte offset). Then explicitly open
    /dataset1/part-(n) and /dataset2/part-(n) in your mapper.
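As a rough illustration of what each map task would do with its one line of input, here is a minimal sketch in plain Java. It is not Hadoop code: local files stand in for HDFS, and the `part-%05d` file naming, the tab-separated `key<TAB>value` record layout, and the names `PartitionJoinSketch`, `partName`, and `joinPartition` are all assumptions made for the example. In a real job the same logic would sit in the mapper's `map` method, with the line text read from the input value (NLineInputFormat passes the line as the value and its byte offset as the key).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Sketch only: local files stand in for HDFS; naming and record layout are assumed.
class PartitionJoinSketch {

    // Builds the (assumed) file name for partition n, e.g. 3 -> "part-00003".
    static String partName(int n) {
        return String.format(Locale.ROOT, "part-%05d", n);
    }

    // What one map task would do with an input line such as "3":
    // open the same-numbered part file in both datasets and join on the key.
    static List<String> joinPartition(Path dataset1, Path dataset2, String line)
            throws IOException {
        int n = Integer.parseInt(line.trim());
        Path left = dataset1.resolve(partName(n));
        Path right = dataset2.resolve(partName(n));

        // Load the left side into memory, keyed by the first tab-separated field.
        Map<String, List<String>> leftByKey = new HashMap<>();
        for (String rec : Files.readAllLines(left)) {
            String[] kv = rec.split("\t", 2);
            if (kv.length < 2) continue; // skip malformed records
            leftByKey.computeIfAbsent(kv[0], k -> new ArrayList<>()).add(kv[1]);
        }

        // Scan the right side and emit one joined record per matching pair.
        List<String> joined = new ArrayList<>();
        for (String rec : Files.readAllLines(right)) {
            String[] kv = rec.split("\t", 2);
            if (kv.length < 2) continue;
            List<String> matches = leftByKey.getOrDefault(kv[0], Collections.<String>emptyList());
            for (String leftValue : matches) {
                joined.add(kv[0] + "\t" + leftValue + "\t" + kv[1]);
            }
        }
        return joined;
    }
}
```

Because both datasets were written with the same partitioner and the same number of partitions, every key in `/dataset1/part-(n)` can only have its match in `/dataset2/part-(n)`, which is why a per-partition join like this is sufficient. The in-memory hash map is a simplification that assumes one partition of the left dataset fits in memory.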

    If you wanted to be more clever, it might be possible to subclass
    MultiFileInputFormat to group together both datasets "file-number-wise" when
    generating splits, but I don't have specific guidance here.

    - Aaron
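The "file-number-wise" grouping mentioned in the second suggestion could look roughly like the sketch below. This is only the grouping step in plain Java, not an InputFormat subclass: it does not call any Hadoop API, and the class name `FileNumberGrouping`, its methods, and the assumed `part-NNNNN` / `part-r-NNNNN` file names are hypothetical. A custom split generator would take each resulting group and turn it into one split containing both files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: groups file paths from two datasets by their partition number.
class FileNumberGrouping {

    // Assumed naming: "part-00003" or "part-r-00003" at the end of the path.
    private static final Pattern PART = Pattern.compile("part-(?:[mr]-)?(\\d+)$");

    // Returns the partition number encoded in the path, or -1 for non-part files.
    static int partitionOf(String path) {
        Matcher m = PART.matcher(path);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // Maps each partition number to the files from both datasets that carry it.
    static Map<Integer, List<String>> group(List<String> dataset1, List<String> dataset2) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        addAll(groups, dataset1);
        addAll(groups, dataset2);
        return groups;
    }

    private static void addAll(Map<Integer, List<String>> groups, List<String> paths) {
        for (String p : paths) {
            int n = partitionOf(p);
            if (n < 0) continue; // ignore anything that is not a part file
            groups.computeIfAbsent(n, k -> new ArrayList<>()).add(p);
        }
    }
}
```

For example, grouping `/dataset1/part-00000`, `/dataset1/part-00001` with `/dataset2/part-00000`, `/dataset2/part-00001` yields two groups, one per partition number, each holding one file from each dataset.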

  • Abc xyz at Jul 5, 2010 at 8:17 am
    Thanks Aaron. The first option sounds good.
    How can I make sure the partition numbers get written to a single file while
    each partition is written to its own separate file? I mean, after the custom
    partitioner, an identity reducer would write the part-xxxxx file for each
    partition, but how can all the reducers write their partition numbers into
    one single file? Should I do it manually?
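The thread does not answer this question. One manual route, offered only as a sketch: the reducers do not need to write the index file at all, because the driver already knows how many reducers (and therefore partitions) the partitioning job used, so it can write the numbers 0 to n-1 itself once that job finishes. The example below uses a local file in place of HDFS, and the names `PartitionIndexWriter`, `indexLines`, and `writeIndex` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch only: writes one partition number per line, for use as NLineInputFormat input.
class PartitionIndexWriter {

    // Produces the lines "0", "1", ..., "numPartitions - 1".
    static List<String> indexLines(int numPartitions) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            lines.add(Integer.toString(i));
        }
        return lines;
    }

    // Writes the index file; numPartitions should match the reducer count
    // used by the job that produced the part files.
    static void writeIndex(Path file, int numPartitions) throws IOException {
        Files.write(file, indexLines(numPartitions), StandardCharsets.UTF_8);
    }
}
```

Calling `writeIndex(file, 3)` produces a three-line file containing 0, 1, and 2, which is the shape of input the first suggestion expects.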

    ________________________________

    From: Aaron Kimball <aaron@cloudera.com>
    To: common-user@hadoop.apache.org
    Sent: Mon, July 5, 2010 8:51:44 AM
    Subject: Re: Partitioned Datasets Map/Reduce

Discussion Overview
group: common-user @
categories: hadoop
posted: Jul 3, '10 at 4:35p
active: Jul 5, '10 at 8:17a
posts: 3
users: 2
website: hadoop.apache.org...
irc: #hadoop
