Multiple outputs and getmerge?
I've written an MR job with multiple outputs. The "normal" output goes
to files named part-XXXXX, and my secondary output records go to files
I've chosen to name "ExceptionDocuments" (which are therefore named
"ExceptionDocuments-m-XXXXX").

I'd like to pull merged copies of these files to my local filesystem
(two separate merged files, one containing the "normal" output and one
containing the ExceptionDocuments output). But since Hadoop writes
both of these outputs to files in the same directory, when I issue
"hadoop dfs -getmerge", I get a single file that contains both
outputs.

To get around this, I have to move files around on HDFS so that my
different outputs are in different directories.

Is this the best/only way to deal with this? It would be better if
Hadoop offered the option of writing different outputs to different
output directories, or if getmerge accepted a file prefix to select
the files to be merged.

Thanks!
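
For illustration, here is a minimal sketch of a prefix-selective merge
using the HDFS FileSystem client API. This is an assumed client-side
workaround, not an existing getmerge option; the paths and the
output-file prefix are taken from the question above.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class PrefixGetmerge {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Glob only the part files whose names start with the chosen prefix.
        FileStatus[] parts =
            fs.globStatus(new Path("/path/to/output/ExceptionDocuments-m-*"));
        OutputStream merged = new FileOutputStream("/tmp/exceptions-merged");
        for (FileStatus part : parts) {
            FSDataInputStream in = fs.open(part.getPath());
            IOUtils.copyBytes(in, merged, conf, false); // false: leave 'merged' open
            in.close();
        }
        merged.close();
    }
}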

  • Todd Lipcon at Apr 21, 2009 at 5:07 pm

    On Mon, Apr 20, 2009 at 1:14 PM, Stuart White wrote:
    Is this the best/only way to deal with this? It would be better if
    Hadoop offered the option of writing different outputs to different
    output directories, or if getmerge accepted a file prefix to select
    the files to be merged.
    Hi Stuart,

    Would dfs -cat do what you need? e.g:

    ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* >
    /tmp/exceptions-merged

    -Todd
  • Stuart White at Apr 21, 2009 at 6:45 pm

    On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon wrote:
    Would dfs -cat do what you need? e.g:

    ./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\* >
    /tmp/exceptions-merged
    Yes, that would work. Thanks for the suggestion.
  • Koji Noguchi at Apr 21, 2009 at 6:01 pm
    Stuart,

    I once used MultipleOutputFormat and created
    (mapred.work.output.dir)/type1/part-_____
    (mapred.work.output.dir)/type2/part-_____
    ...

    And JobTracker took care of the renaming to
    (mapred.output.dir)/type{1,2}/part-______

    Would that work for you?

    Koji

  • Stuart White at Apr 21, 2009 at 6:46 pm

    On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi wrote:
    I once used MultipleOutputFormat and created
    (mapred.work.output.dir)/type1/part-_____
    (mapred.work.output.dir)/type2/part-_____
    ...

    And JobTracker took care of the renaming to
    (mapred.output.dir)/type{1,2}/part-______

    Would that work for you?
    Can you please explain this in more detail? It looks like you're
    using MultipleOutputFormat for *both* of your outputs? So you simply
    don't use the OutputCollector passed as a param to Mapper#map()?
  • Koji Noguchi at Apr 21, 2009 at 8:55 pm
    Something along the lines of

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

    public class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
        // Route each record to a file under a subdirectory named after its
        // key, e.g. <output>/<key>/part-00000.
        @Override
        protected String generateFileNameForKeyValue(Text key, Text value, String name) {
            return new Path(key.toString(), name).toString();
        }
    }

    would create a directory per key.
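
    To wire this in, the driver just needs to point the job at the custom
    format. A sketch using the old mapred API from this thread (the driver
    class and paths are hypothetical; mapper/reducer setup is omitted):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJobDriver.class);
            FileInputFormat.setInputPaths(conf, new Path("/path/to/input"));
            FileOutputFormat.setOutputPath(conf, new Path("/path/to/output"));
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(Text.class);
            conf.setOutputFormat(MyOutputFormat.class); // the subclass above
            JobClient.runJob(conf);
        }
    }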

    If you just want to keep your side-effect files separate, get your
    working directory via
    FileOutputFormat.getWorkOutputPath(...)
    or $mapred_work_output_dir,

    then dfs -mkdir <workdir>/NewDir and put the secondary files there.
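
    As a concrete sketch of that side-effect-file approach (the class and
    method names here are hypothetical): anything written under the work
    output path is promoted into mapred.output.dir when the task commits.

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobConf;

    public class SideFiles {
        // Open <workdir>/NewDir/<name> for writing from inside a task; on a
        // successful commit it lands under mapred.output.dir/NewDir/.
        public static FSDataOutputStream openSideFile(JobConf conf, String name)
                throws IOException {
            Path dir = new Path(FileOutputFormat.getWorkOutputPath(conf), "NewDir");
            FileSystem fs = dir.getFileSystem(conf);
            fs.mkdirs(dir);
            return fs.create(new Path(dir, name));
        }
    }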

    Explained in

    http://hadoop.apache.org/core/docs/r0.18.3/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath(org.apache.hadoop.mapred.JobConf)


    Koji


