I'm pretty new to Hadoop and generally avoid Java wherever I can, so
I'm getting started with Hadoop Streaming and a Python mapper and reducer.
From what I read in the MapReduce tutorial, the mapper and reducer can be plugged
into Hadoop via the "-mapper" and "-reducer" options on job start. I was
wondering what the input for the reducer would look like, so I ran a Hadoop
job using my own mapper but /bin/cat as reducer. As you can see, the output
of the job is ordered, but the keys haven't been combined:
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 107488
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95560
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95562
I would have expected something like:
{'lastname': 'Adhikari', 'firstnames': 'P', 'suffix': None, 'type':
'person'} 95560, 95562, 107488
My understanding from the tutorial was that this reduction is part of the
shuffle and sort phase. Or do I need to use a combiner to get that done?
Does Hadoop Streaming even do this, or do I need to use a native Java class?
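In case it helps, here is a sketch of the Python reducer I was planning to write
instead of /bin/cat. It assumes (and this is exactly what I'm unsure about) that
streaming delivers tab-separated key/value lines already sorted by key, so that
equal keys arrive adjacently and `itertools.groupby` can collect them; the
function name and the comma-joined output format are just my own choices:

```python
import sys
from itertools import groupby


def reduce_lines(lines):
    """Group sorted 'key\tvalue' lines and emit 'key\tv1, v2, ...'.

    Assumes the framework has already sorted the lines by key, so
    equal keys are adjacent and no in-memory dict is needed.
    """
    out = []
    # Split each line into (key, value) at the first tab.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        values = ", ".join(v for _, v in group)
        out.append("%s\t%s" % (key, values))
    return out


if __name__ == "__main__":
    for line in reduce_lines(sys.stdin):
        print(line)
```

But if the grouping I expected above doesn't happen in streaming, I suppose
this approach would silently produce one output line per input line.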
Best,
Moritz