Hi everyone,

I'm pretty new to Hadoop and generally avoid Java wherever I can, so
I'm getting started with Hadoop streaming and a Python mapper and reducer.
From what I read in the MapReduce tutorial, the mapper and reducer can be
plugged into Hadoop via the "-mapper" and "-reducer" options on job start.
I was wondering what the input for the reducer would look like, so I ran a
Hadoop job using my own mapper but /bin/cat as the reducer. As you can see,
the output of the job is ordered, but the keys haven't been combined:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 107488
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95560
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95562

I would have expected something like:

{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95560, 95562, 107488

My understanding from the tutorial was that this grouping is part of the
shuffle and sort phase. Or do I need to use a combiner to get that done?
Does Hadoop streaming even do this, or do I need to use a native Java class?

Best,
Moritz


  • Amareshwari Sri Ramadasu at Jul 14, 2010 at 8:09 am
    In streaming, the grouped values are given to the reducer as individual <key, value> pairs again, so you don't see a key with a list of values.
    I think it is done that way to be symmetrical with the mapper, though I don't know the exact reason.

    Thanks
    Amareshwari
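
[Editorial note: the behavior described above can be illustrated with a small sketch, not part of the original thread. A streaming reducer reads plain sorted `key<TAB>value` lines from stdin, one pair per line, and has to detect key boundaries itself. A minimal Python version of that loop, assuming streaming's default tab separator:]

```python
import sys

def reduce_stream(lines):
    """Collect consecutive values per key from sorted 'key\tvalue' lines."""
    last_key, values = None, []
    for line in lines:
        # Each streaming input line is "key<TAB>value"; keep everything
        # after the first tab as the value.
        key, _, value = line.rstrip("\n").partition("\t")
        if key == last_key:
            values.append(value)
        else:
            if last_key is not None:
                yield last_key, values
            last_key, values = key, [value]
    if last_key is not None:
        yield last_key, values          # flush the final key

if __name__ == "__main__":
    for key, values in reduce_stream(sys.stdin):
        print(key + "\t" + ",".join(values))
```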

  • Moritz Krog at Jul 14, 2010 at 8:17 am
    First of all thanks for the quick answer :)

    Is there any way to configure the job so that I get the key ->
    value list? I specifically need exactly this behavior; it's crucial to
    what I want to do with Hadoop.
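
[Editorial note: streaming itself never hands the reducer a value list, but since the shuffle/sort phase delivers the lines sorted by key, the reducer can rebuild the list itself. A sketch using `itertools.groupby`, again assuming the default tab separator:]

```python
import sys
from itertools import groupby

def parse(lines):
    # Split each "key<TAB>value" line into a (key, value) pair.
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value

def grouped(lines):
    # groupby only merges *consecutive* equal keys, which is exactly
    # what the sorted streaming input guarantees.
    for key, pairs in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, [v for _, v in pairs]

if __name__ == "__main__":
    for key, values in grouped(sys.stdin):
        print(key + "\t" + ",".join(values))
```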

  • Alex Kozlov at Jul 14, 2010 at 8:52 am
    You can use the following Perl script as a reducer:

    ===
    #!/usr/bin/perl
    use strict;
    use warnings;

    $, = "\t";    # output field separator

    my ($lastkey, @values);
    while (<>) {
        chomp;    # drop the trailing newline so joined values stay on one line
        my ($key, $value) = split(/\t/, $_, 2);
        if (defined($lastkey) and $lastkey eq $key) {
            push @values, $value;
        } else {
            print $lastkey, join(",", @values), "\n" if defined($lastkey);
            $lastkey = $key;
            @values  = ($value);
        }
    }

    # flush the final key's values
    print $lastkey, join(",", @values), "\n" if defined($lastkey) and @values > 0;
    ===

    Alex K

  • Moritz Krog at Jul 14, 2010 at 9:20 am
    Does that Perl script also work when I use multiple reducer tasks?

    Anyway, this isn't really what I was looking for, because I intended to
    use my own reducer. On top of that, I also need to run the intermediate
    data through the reducer more than once. I was just hoping there is some
    way to make streaming output the intermediate data as k -> list(v).
    I could of course work in iterations, using the Perl reducer in the
    first iteration and its results in later iterations... but that sounds
    like a lot of unnecessary work.
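
[Editorial note on the multiple-reducer question: the default partitioner hashes each key, so every value for a given key is routed to the same reduce task, and per-key grouping logic keeps working with any number of reducers. A toy sketch of that guarantee; `partition` is a hypothetical stand-in, not Hadoop's actual HashPartitioner:]

```python
def partition(key, num_reducers):
    # Same key always hashes to the same bucket within one run,
    # mimicking how a hash partitioner routes keys to reduce tasks.
    return hash(key) % num_reducers

records = [("a", 1), ("b", 2), ("a", 3)]
buckets = {}
for key, value in records:
    buckets.setdefault(partition(key, 4), []).append((key, value))

# Every record for key "a" now sits in exactly one bucket.
```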

Discussion Overview
group: common-user
categories: hadoop
posted: Jul 14, '10 at 7:36a
active: Jul 14, '10 at 9:20a
posts: 5
users: 3
website: hadoop.apache.org...
irc: #hadoop
