FAQ
Hi, I am new to Hadoop, so maybe I am missing something obvious. I have
written a small MapReduce program that runs two jobs. I want the output of
the first job to serve as the input to the second job. Here is what my
driver code looks like:

public int run(String[] args) throws Exception {
    Configuration conf = getConf();

    Job jobOne = new Job(conf, "Job One");
    jobOne.setJarByClass(CountCitations.class);

    Path in = new Path(args[0]);
    Path out1 = new Path("jobOneOutput");

    FileInputFormat.setInputPaths(jobOne, in);
    FileOutputFormat.setOutputPath(jobOne, out1);

    jobOne.setInputFormatClass(TextInputFormat.class);
    jobOne.setOutputFormatClass(SequenceFileOutputFormat.class);

    jobOne.setMapperClass(Map.class);
    jobOne.setReducerClass(Reduce.class);

    jobOne.setMapOutputKeyClass(LongWritable.class);
    jobOne.setMapOutputValueClass(Text.class);

    jobOne.setOutputKeyClass(LongWritable.class);
    jobOne.setOutputValueClass(Text.class);

    // Abort if the first job fails; otherwise the second job would
    // read a missing or partial output directory.
    if (!jobOne.waitForCompletion(true)) {
        return 1;
    }

    Job jobTwo = new Job(conf, "Job Two");
    jobTwo.setJarByClass(MyJob.class);

    jobTwo.setInputFormatClass(SequenceFileInputFormat.class);
    jobTwo.setOutputFormatClass(TextOutputFormat.class);

    // The entire output directory of job one is the input to job two.
    FileInputFormat.setInputPaths(jobTwo, out1);
    FileOutputFormat.setOutputPath(jobTwo, new Path(args[1]));

    jobTwo.setMapperClass(MapCounts.class);
    jobTwo.setReducerClass(ReduceCounts.class);

    jobTwo.setMapOutputKeyClass(LongWritable.class);
    jobTwo.setMapOutputValueClass(Text.class);

    jobTwo.setOutputKeyClass(LongWritable.class);
    jobTwo.setOutputValueClass(Text.class);

    // Return a status rather than calling System.exit() here, so the
    // method stays usable from ToolRunner.run().
    return jobTwo.waitForCompletion(true) ? 0 : 1;
}

The output path created by the first job is a directory, and it is the file
in that directory, with a name like part-r-00000, that I want to feed as
input into the second job. I am running in pseudo-distributed mode, so I
know that file name will be the same every run. But in true distributed
mode that file name will be different for each node. Moreover, in
distributed mode don't I want a uniform view of that output, which will be
spread across my cluster? Is there something wrong in my code? Or can
someone point me to some examples that do this?

Thanks

- John
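
[Editor's note: the control flow the driver above is after, run the first job, stop if it fails, then run the second, can be sketched without any Hadoop dependency. The `runChain` helper and the `BooleanSupplier` job stand-ins below are illustrative inventions, not Hadoop API; in a real driver each supplier would wrap a `job.waitForCompletion(true)` call.]

```java
import java.util.function.BooleanSupplier;

public class ChainSketch {
    // Runs the "jobs" in order; the first failure aborts the chain,
    // mirroring: if (!jobOne.waitForCompletion(true)) return 1;
    static int runChain(BooleanSupplier... jobs) {
        for (BooleanSupplier job : jobs) {
            if (!job.getAsBoolean()) {
                return 1; // a failed job stops the pipeline
            }
        }
        return 0; // all jobs completed successfully
    }

    public static void main(String[] args) {
        System.out.println(runChain(() -> true, () -> true));  // both jobs succeed
        System.out.println(runChain(() -> false, () -> true)); // first job fails
    }
}
```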


  • Harsh J at Mar 3, 2011 at 4:19 am
    Hello,
    On Thu, Mar 3, 2011 at 7:51 AM, John Sanda wrote:
    > The output path created from the first job is a directory, and it the file
    > in that directory that has a name like part-r-0000 that I want to feed as
    > input into the second job. I am running in pseudo-distributed mode so I know
    > that that file name is going to be the same every run. But in a true
    > distributed mode that file name will be different for each node. More over,

    The default filenames of many OutputFormats start with "part" and are
    not node-dependent. You will get filenames in out1 from part-r-00000
    up through part-r-{(num. of reduce tasks for your job) - 1}.

    > when in distributed mode don't I want a uniform view of that output file
    > which will be spread across my cluster? Is there something wrong in my code?
    > Or can someone point me to some examples that do this?

    I do not understand what you mean by a uniform view. Using a directory
    as the input for a job is perfectly acceptable and the normal thing to do
    in file-based MapReduce. The directory forms the whole input, with files
    containing small "parts" of it. I do not see anything grossly wrong in
    the code you provided.

    --
    Harsh J
    www.harshj.com
  • John Sanda at Mar 3, 2011 at 4:48 am
    Thanks for the response. What I meant by a uniform view is that I would
    be able to avoid having to reference each individual part-r-xxxxx file.
    It wasn't immediately clear to me that the directory could be the input
    path. That tells me the problem(s) must be somewhere in my MR code. Thanks!
    --

    - John
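
[Editor's note: two conventions are worth knowing when feeding one job's output directory into another job. Reduce output files are named part-r-00000, part-r-00001, and so on, one per reduce task; and FileInputFormat's default path filter skips files whose names begin with "_" or ".", so markers like _SUCCESS and CRC sidecar files are not read as input. A plain-Java sketch of both conventions; the helper names below are ours, not Hadoop's.]

```java
public class OutputDirConventions {
    // Reduce-side output files are numbered with five zero-padded
    // digits: part-r-00000, part-r-00001, ...
    static String partFileName(int reduceTask) {
        return String.format("part-r-%05d", reduceTask);
    }

    // Mirrors FileInputFormat's default hidden-file filter: names
    // starting with "_" or "." are not treated as job input.
    static boolean isInputFile(String name) {
        return !name.startsWith("_") && !name.startsWith(".");
    }

    public static void main(String[] args) {
        System.out.println(partFileName(0));          // first reducer's file
        System.out.println(isInputFile("_SUCCESS"));  // marker file is skipped
    }
}
```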

Discussion Overview

group: mapreduce-user
categories: hadoop
posted: Mar 3, '11 at 2:22a
active: Mar 3, '11 at 4:48a
posts: 3
users: 2 (John Sanda: 2 posts, Harsh J: 1 post)
website: hadoop.apache.org...
irc: #hadoop
