Hi all,

I wrote a custom key class (implements WritableComparable) and implemented
the compareTo() method inside this class. Everything works fine when I run
the m/r job with 1 reduce task (via setNumReduceTasks). Keys are sorted
correctly in the output files.

But when I increase the number of reduce tasks, keys don't get aggregated
properly; the same keys seem to end up in separate output files
(output/part-00000, output/part-00001, etc.). Shouldn't this be impossible,
given that right before reduce() gets called, all (k,v) pairs from all map
outputs with the same 'k' are grouped together, and the reduce function just
iterates over the values (v1, v2, etc.)?

Do I need to implement anything else inside my custom key class other than
compareTo? I also tried implementing equals() but that didn't help either.
Then I came across setOutputKeyComparator(). So I added a custom Comparator
class inside the key class and tried setting this on the JobConf object. But
that didn't work either. What could be wrong?

Cheers,

--
Harish Mallipeddi
circos.com : poundbang.in/blog/


  • Zhang, jian at Apr 11, 2008 at 11:17 am
    Hi,

    Please read the excerpt below: you need to implement a Partitioner.
    The Partitioner controls which key is sent to which reducer, so if you want all records for a unique key to arrive at a single reducer, you need to implement a Partitioner, and your compareTo() function should also work properly.
    [WIKI]
    Partitioner

    Partitioner partitions the key space.

    Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction.

    HashPartitioner is the default Partitioner.
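
    The rule described above can be sketched in code. The formula below is the one Hadoop's default HashPartitioner applies (mask off the hash's sign bit, then take it modulo the number of reduce tasks); the class name and main() are illustrative only, and in a real job this logic would live in a class implementing org.apache.hadoop.mapred.Partitioner, registered via JobConf.setPartitionerClass():

    ```java
    // Sketch of the default partitioning rule (HashPartitioner).
    public class PartitionDemo {
        // Mask to non-negative, then mod by the number of reduce tasks.
        static int getPartition(Object key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // Equal keys always map to the same partition (same reducer);
            // keys whose hashCode() values differ may not.
            System.out.println(getPartition("someKey", 4));
            System.out.println(getPartition("someKey", 4));
        }
    }
    ```

    Because the partition is derived from hashCode(), a custom key whose hashCode() is inconsistent with its compareTo()/equals() will scatter logically-equal keys across reducers.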



    Best Regards

    Jian Zhang


    -----Original Message-----
    From: Harish Mallipeddi
    Sent: April 11, 2008, 19:06
    To: core-user@hadoop.apache.org
    Subject: Problem with key aggregation when number of reduce tasks is more than 1

  • Pete Wyckoff at Apr 11, 2008 at 6:44 pm
  • Harish Mallipeddi at Apr 12, 2008 at 6:19 pm
    Hey thanks a lot. That's basically what I needed.

    2008/4/11 Zhang, jian <jzhang@freewheel.tv>:


    --
    Harish Mallipeddi
    circos.com : poundbang.in/blog/
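  For readers finding this thread later: with the default HashPartitioner you often don't need a custom Partitioner at all; it is usually enough to override hashCode() on the key, consistently with equals() and compareTo(). Below is a minimal, hypothetical key class sketch (the names CustomKey, name, and id are invented for illustration). It uses only java.io types; a real Hadoop key would declare implements org.apache.hadoop.io.WritableComparable<CustomKey> instead of Comparable:

  ```java
  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;

  // Hypothetical custom key. The crucial point: hashCode() and equals()
  // must be consistent with compareTo(), because the default
  // HashPartitioner routes each record to a reducer by key.hashCode().
  public class CustomKey implements Comparable<CustomKey> {
      private String name;
      private int id;

      public CustomKey() {}                       // needed for deserialization
      public CustomKey(String name, int id) { this.name = name; this.id = id; }

      // Writable-style serialization, as write()/readFields() would do.
      public void write(DataOutput out) throws IOException {
          out.writeUTF(name);
          out.writeInt(id);
      }
      public void readFields(DataInput in) throws IOException {
          name = in.readUTF();
          id = in.readInt();
      }

      @Override public int compareTo(CustomKey o) {
          int c = name.compareTo(o.name);
          return (c != 0) ? c : Integer.compare(id, o.id);
      }
      // Equal keys MUST produce equal hash codes, or HashPartitioner
      // sends them to different reducers.
      @Override public int hashCode() { return name.hashCode() * 31 + id; }
      @Override public boolean equals(Object o) {
          return o instanceof CustomKey && compareTo((CustomKey) o) == 0;
      }
  }
  ```

  With hashCode() defined this way, two keys that compare equal always land in the same partition, which is exactly the aggregation behaviour the original question expected.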
  • Adrian Woodhead at Apr 11, 2008 at 11:24 am
    I've noticed that the mailing lists archives seem to be broken here:

    http://hadoop.apache.org/mail/core-user/

    I get a 403 forbidden. Any idea what's going on?

    Regards,

    Adrian
  • Nathan Fiedler at Apr 11, 2008 at 5:23 pm
    Yes, it's been like that for days. Hopefully someone in Apache can fix
    it. In the meantime, you can use the Nabble site:
    http://www.nabble.com/Hadoop-core-user-f30590.html

    n

Discussion Overview
group: common-user
categories: hadoop
posted: Apr 11, '08 at 11:06a
active: Apr 12, '08 at 6:19p
posts: 6
users: 5
website: hadoop.apache.org...
irc: #hadoop
