FAQ
Hi,

I am new to Hadoop. Looking at the documentation, I figured out how to
write map and reduce functions, but now I'm stuck... How do we work with
the output file produced by the reducer? For example, the word count
example produces a file with words as keys and the number of occurrences
of each word as the values. Now, let's say I want to get the total
number of words by analyzing the output file... how am I supposed to do
it?

thx,
Sebastien Rainville


  • Jeroen Verhagen at Aug 15, 2007 at 12:16 pm
    Hi Sebastien,
    I asked a similar question some time ago and haven't had any response
    so far, so I hope you will get one.

    Regarding your particular question: assuming each line in the output
    files contains exactly one word, counting the number of lines in the
    output files would give the answer you're looking for.

    But if you're looking for the count of a particular word, I doubt
    that scanning through the output files for a line that starts with
    that word is a very efficient solution.
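
    Something like the following would do the line counting, reading the
    files straight from HDFS. This is only a sketch, assuming the job used
    the default TextOutputFormat and that the output directory is passed
    as the first argument; the class name is made up:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CountOutputLines {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        long lines = 0;
        // args[0] is the job's output directory on HDFS
        for (FileStatus status : fs.listStatus(new Path(args[0]))) {
          if (!status.getPath().getName().startsWith("part-")) {
            continue; // skip anything that isn't a reduce output file
          }
          BufferedReader in = new BufferedReader(
              new InputStreamReader(fs.open(status.getPath())));
          try {
            while (in.readLine() != null) {
              lines++;
            }
          } finally {
            in.close();
          }
        }
        System.out.println("output lines (one word per line): " + lines);
      }
    }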

    --

    regards,

    Jeroen
  • Calvin Yu at Aug 15, 2007 at 1:54 pm
    The manual way is to copy the output part files to your local
    filesystem using 'hadoop fs -copyToLocal'. You could also write code
    to read that data directly from HDFS.

    What I do is set the reduce output to be in SequenceFile format, and
    then create a SequenceFile.Reader to read the part files from HDFS.
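
    For example, a minimal sketch, assuming the job wrote <Text,
    IntWritable> pairs through SequenceFileOutputFormat; the class name
    and path argument are just examples:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadWordCounts {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is one reduce output file, e.g. wordcount-out/part-00000
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text word = new Text();
        IntWritable count = new IntWritable();
        long total = 0;
        try {
          while (reader.next(word, count)) {
            total += count.get(); // sum the per-word counts
          }
        } finally {
          reader.close();
        }
        System.out.println("total word occurrences: " + total);
      }
    }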

    Calvin

  • Doug Cutting at Aug 15, 2007 at 4:18 pm

    For global counts you can use counters:

    http://lucene.apache.org/hadoop/api/org/apache/hadoop/mapred/RunningJob.html#getCounters()

    The framework includes a counter for the number of output records, which
    is what you want in this case, so you don't even need to add a counter
    for that.
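
    For example, a driver along these lines. This is only a sketch
    against the old org.apache.hadoop.mapred API; the counter group and
    name strings are an assumption and shifted between early releases:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        // conf.setMapperClass(...) and conf.setReducerClass(...) as usual
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        RunningJob job = JobClient.runJob(conf); // blocks until completion
        Counters counters = job.getCounters();
        // One reduce output record per distinct word; the group/name
        // strings below are an assumption for early Hadoop releases.
        long outputRecords = counters.findCounter(
            "org.apache.hadoop.mapred.Task$Counter",
            "REDUCE_OUTPUT_RECORDS").getCounter();
        System.out.println("reduce output records: " + outputRecords);
      }
    }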

    For more complex summary statistics, if your output is very large, then
    it might be appropriate to run another MapReduce job over the output
    just to compute these.
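
    For instance, a second job whose mapper sends every count to a single
    key and whose reducer sums them. A sketch only, assuming the first
    job's output is plain "word<TAB>count" text and using the old
    org.apache.hadoop.mapred API; all class names here are made up:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TotalWords {
      // Maps each "word<TAB>count" line to one shared key, so a single
      // reducer sees every per-word count.
      public static class TotalMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, LongWritable> {
        private static final Text TOTAL = new Text("total");
        public void map(LongWritable offset, Text line,
            OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          String[] fields = line.toString().split("\t");
          out.collect(TOTAL, new LongWritable(Long.parseLong(fields[1])));
        }
      }

      // Sums all the per-word counts into one grand total.
      public static class TotalReducer extends MapReduceBase
          implements Reducer<Text, LongWritable, Text, LongWritable> {
        public void reduce(Text key, Iterator<LongWritable> values,
            OutputCollector<Text, LongWritable> out, Reporter reporter)
            throws IOException {
          long sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          out.collect(key, new LongWritable(sum));
        }
      }
    }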

    Doug

Discussion Overview
group: common-user
categories: hadoop
posted: Aug 14, '07 at 3:09p
active: Aug 15, '07 at 4:18p
posts: 4
users: 4
website: hadoop.apache.org...
irc: #hadoop
