FAQ
Hi all,

I had a small doubt regarding the reduce module. What I understand is that
after the shuffle / sort phase , all the records with the same key value
goes into a reduce function. If thats the case, what is the attribute of the
Writable key which ensures that all the keys go to the same reduce ?

I am working on a reduce side Join where I need to tag all the keys with a
bit which might vary but still want all those records to go into same
reduce. In Hadoop the Definitive Guide, pg. 235 they are using TextPair for
the key. But I dont understand how the keys with different tag information
goes into the same reduce.

Matthew

Search Discussions

  • Shi Yu at Oct 18, 2010 at 11:43 pm
    How many tags you have? If you have several number of tags, you'd better
    create a Vector class to hold those tags. And define sum function to
    increment the values of tags. Then the value class should be your new
    Vector class. That's better and more decent than the Textpair approach.

    Shi
    On 2010-10-18 5:19, Matthew John wrote:
    Hi all,

    I had a small doubt regarding the reduce module. What I understand is that
    after the shuffle / sort phase , all the records with the same key value
    goes into a reduce function. If thats the case, what is the attribute of the
    Writable key which ensures that all the keys go to the same reduce ?

    I am working on a reduce side Join where I need to tag all the keys with a
    bit which might vary but still want all those records to go into same
    reduce. In Hadoop the Definitive Guide, pg. 235 they are using TextPair for
    the key. But I dont understand how the keys with different tag information
    goes into the same reduce.

    Matthew
  • Brad Tofel at Oct 19, 2010 at 12:41 am
    The "Partitioner" implementation used with your job should define which
    reduce target receives a given map output key.

    I don't know if an existing Partitioner implementation exists which
    meets your needs, but it's not a very complex interface to develop, if
    nothing existing works for you.

    Brad
    On 10/18/2010 04:43 PM, Shi Yu wrote:
    How many tags you have? If you have several number of tags, you'd
    better create a Vector class to hold those tags. And define sum
    function to increment the values of tags. Then the value class should
    be your new Vector class. That's better and more decent than the
    Textpair approach.

    Shi
    On 2010-10-18 5:19, Matthew John wrote:
    Hi all,

    I had a small doubt regarding the reduce module. What I understand is
    that
    after the shuffle / sort phase , all the records with the same key value
    goes into a reduce function. If thats the case, what is the attribute
    of the
    Writable key which ensures that all the keys go to the same reduce ?

    I am working on a reduce side Join where I need to tag all the keys
    with a
    bit which might vary but still want all those records to go into same
    reduce. In Hadoop the Definitive Guide, pg. 235 they are using
    TextPair for
    the key. But I dont understand how the keys with different tag
    information
    goes into the same reduce.

    Matthew
  • Brad Tofel at Oct 19, 2010 at 12:58 am
    Whoops, just re-read your message, and see you may be asking about
    targeting a reduce callback function, not a reduce task..

    If that's the case, I'm not sure I understand what your "bit/tag" is
    for, and what you're trying to do with it. Can you provide a concrete
    example (not necessarily code) of some keys which need to group together?

    Is there a way to embed the "bit" within the value, so keys are always
    common?

    If you really need to fake out the system so different keys arrive in
    the same reduce, you might be able to do it with a combination of:

    org.apache.hadoop.mapreduce.Job

    .setSortComparatorClass()
    .setGroupingComparatorClass()
    .setPartitionerClass()

    Brad
    On 10/18/2010 05:41 PM, Brad Tofel wrote:
    The "Partitioner" implementation used with your job should define
    which reduce target receives a given map output key.

    I don't know if an existing Partitioner implementation exists which
    meets your needs, but it's not a very complex interface to develop, if
    nothing existing works for you.

    Brad
    On 10/18/2010 04:43 PM, Shi Yu wrote:
    How many tags you have? If you have several number of tags, you'd
    better create a Vector class to hold those tags. And define sum
    function to increment the values of tags. Then the value class should
    be your new Vector class. That's better and more decent than the
    Textpair approach.

    Shi
    On 2010-10-18 5:19, Matthew John wrote:
    Hi all,

    I had a small doubt regarding the reduce module. What I understand
    is that
    after the shuffle / sort phase , all the records with the same key
    value
    goes into a reduce function. If thats the case, what is the
    attribute of the
    Writable key which ensures that all the keys go to the same reduce ?

    I am working on a reduce side Join where I need to tag all the keys
    with a
    bit which might vary but still want all those records to go into same
    reduce. In Hadoop the Definitive Guide, pg. 235 they are using
    TextPair for
    the key. But I dont understand how the keys with different tag
    information
    goes into the same reduce.

    Matthew
  • Ed at Oct 19, 2010 at 1:06 pm
    Keys are partitioned among the reducers using a partition function which is
    specified in the aptly named Partitioner class. By default, Hadoop will
    hash the key (and probably mods the hash by the number of reducers) to
    determine which reducer to send your key to (I say probably because I
    haven't looked at the actual code). What this means for you is that if you
    set a custom bit in the key field, keys with different bits are not
    guaranteed to go to the same reducers even if they rest of key is the same.

    For example

    Key1 = (DataX+BitA) --> Reducer1
    Key2 = (DataX+BitB) --> Reducer2

    What you want is for any key with the same Data to go to the same reducer
    regardless of the bit value. To do this you need to write your own
    partitioner class and set your job to use that class using

    job.setPartitionerClass(MyCustomPartitioner.class)

    Your custom partitioner will need to break apart your key and only hash on
    the DataX part of it.

    The partitioner class is really easy to override and will look something
    like this:

    public class MyCustomPartitioner extends Partitioner<Key, Value> {
    public int getPartition(Key key, Value value, int numPartitions){
    //split my key so that the bit flag is removed
    //take the modified key and mod it by numPartitions
    return the result
    }
    }

    Of course Key and Value would be whatever Key and Value class you're using.

    Hope that helps.

    ~Ed


    On Mon, Oct 18, 2010 at 8:58 PM, Brad Tofel wrote:

    Whoops, just re-read your message, and see you may be asking about
    targeting a reduce callback function, not a reduce task..

    If that's the case, I'm not sure I understand what your "bit/tag" is for,
    and what you're trying to do with it. Can you provide a concrete example
    (not necessarily code) of some keys which need to group together?

    Is there a way to embed the "bit" within the value, so keys are always
    common?

    If you really need to fake out the system so different keys arrive in the
    same reduce, you might be able to do it with a combination of:

    org.apache.hadoop.mapreduce.Job

    .setSortComparatorClass()
    .setGroupingComparatorClass()
    .setPartitionerClass()

    Brad

    On 10/18/2010 05:41 PM, Brad Tofel wrote:

    The "Partitioner" implementation used with your job should define which
    reduce target receives a given map output key.

    I don't know if an existing Partitioner implementation exists which meets
    your needs, but it's not a very complex interface to develop, if nothing
    existing works for you.

    Brad
    On 10/18/2010 04:43 PM, Shi Yu wrote:

    How many tags you have? If you have several number of tags, you'd better
    create a Vector class to hold those tags. And define sum function to
    increment the values of tags. Then the value class should be your new Vector
    class. That's better and more decent than the Textpair approach.

    Shi
    On 2010-10-18 5:19, Matthew John wrote:

    Hi all,

    I had a small doubt regarding the reduce module. What I understand is
    that
    after the shuffle / sort phase , all the records with the same key value
    goes into a reduce function. If thats the case, what is the attribute of
    the
    Writable key which ensures that all the keys go to the same reduce ?

    I am working on a reduce side Join where I need to tag all the keys with
    a
    bit which might vary but still want all those records to go into same
    reduce. In Hadoop the Definitive Guide, pg. 235 they are using TextPair
    for
    the key. But I dont understand how the keys with different tag
    information
    goes into the same reduce.

    Matthew

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupcommon-user @
categorieshadoop
postedOct 18, '10 at 10:20a
activeOct 19, '10 at 1:06p
posts5
users4
websitehadoop.apache.org...
irc#hadoop

People

Translate

site design / logo © 2022 Grokbase