Hi,

I have a problem concerning the TeraSort benchmark.
I am running the version that ships with hadoop-0.21.0, and if I use it as described (i.e. TeraGen, then TeraSort, then TeraValidate), everything works fine.

However, for some tests I need to run, I added a simple job between TeraGen and TeraSort that does nothing but copy the input. I included its code below.

If I run this Copy job after TeraGen, TeraSort partitions the input in such a way that most tuples go to the last reducer.
For example, if I run TeraSort with 500 MB of input and 20 reducers, I get the following distribution:
- Reducers 0-18 process ~10,000 tuples each
- Reducer 19 processes ~5,000,000 tuples
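
To put a number on that skew, here is a minimal sketch that works only from the approximate counts above. It is not part of TeraSort or Hadoop; the class name ReducerSkew and its helper methods are made up for illustration, and the counts are the rounded figures from this post rather than exact counter values.

```java
import java.util.Arrays;

// Illustrative helper (hypothetical, not part of Hadoop): quantifies how
// unevenly tuples are spread across reducers.
public class ReducerSkew {

    /** Total number of tuples across all reducers. */
    public static long total(long[] counts) {
        return Arrays.stream(counts).sum();
    }

    /** Fraction of all tuples handled by the reducer at the given index. */
    public static double share(long[] counts, int reducer) {
        return (double) counts[reducer] / total(counts);
    }

    /** Rounded counts from the post: reducers 0-18 get ~10,000, reducer 19 gets ~5,000,000. */
    public static long[] reportedCounts() {
        long[] counts = new long[20];
        Arrays.fill(counts, 10_000L);
        counts[19] = 5_000_000L;
        return counts;
    }

    public static void main(String[] args) {
        long[] counts = reportedCounts();
        long total = total(counts);
        System.out.println("total tuples: " + total);
        System.out.println("even split per reducer: " + total / counts.length);
        System.out.println("share of last reducer: " + share(counts, counts.length - 1));
    }
}
```

With these rounded figures the last reducer handles roughly 96% of all tuples, whereas an even split would give each of the 20 reducers about 5% of the input.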

Can anyone reproduce this behavior? I would really appreciate any help!

David


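import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.examples.terasort.TeraInputFormat;
import org.apache.hadoop.examples.terasort.TeraOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Cluster;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class Copy extends Configured implements Tool {

    public int run(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        Job job = Job.getInstance(new Cluster(getConf()), getConf());

        // Read the TeraGen output from args[0]
        Path inputDirOld = new Path(args[0]);
        TeraInputFormat.addInputPath(job, inputDirOld);
        job.setInputFormatClass(TeraInputFormat.class);

        job.setJobName("Copy");
        job.setJarByClass(Copy.class);
        // No mapper or reducer is set, so the framework defaults are used
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        // Write the copied records to args[1] in TeraSort's record format
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputFormatClass(TeraOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Copy(), args);
        System.exit(res);
    }
}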

Group: common-user
Category: hadoop
Posted: Feb 27, '11 at 10:27p
Active: Feb 28, '11 at 10:30a
Author: David Saile
