interesting to see some top output for the various runs. It would also be
interesting to profile the Java stuff in both Pipes mode and non-Pipes mode.
- Doug
On Nov 8, 2007 7:00 PM, Owen O'Malley wrote:
On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:
Can you provide more details of your test?
Sure, I guess I should have been more specific to start with. *grin*
The data was generated with:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter \
  -conf gridmix-text.xml \
  -outFormat org.apache.hadoop.mapred.TextOutputFormat \
  /gridmix/data/sort/text
contents of gridmix-text.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<configuration>
<property>
<name>test.randomtextwrite.total_bytes</name>
<value>429496729600</value>
</property>
<property>
<name>test.randomtextwrite.min_words_key</name>
<value>1</value>
</property>
<property>
<name>test.randomtextwrite.max_words_key</name>
<value>10</value>
</property>
<property>
<name>test.randomtextwrite.min_words_value</name>
<value>0</value>
</property>
<property>
<name>test.randomtextwrite.max_words_value</name>
<value>200</value>
</property>
</configuration>
And then ran the sort as:
Java:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
  -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -outFormat org.apache.hadoop.mapred.TextOutputFormat \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  /gridmix/data/sort/text/part-*0 java-out
Pipes:
bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
  -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
  -program /gridmix/programs/pipes-sort -reduces 78 \
  -jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
  -writer org.apache.hadoop.mapred.TextOutputFormat
Streaming:
bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
  -input /gridmix/data/sort/text/part-*0 -output stream-out \
  -mapper cat -reducer cat \
  -numReduceTasks 78
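Because the streaming job uses cat as both mapper and reducer, all of the sorting is done by the framework between the two stages. A rough local approximation of that data flow, assuming tab-separated key/value lines and a single reducer, might look like the sketch below. The sample records and variable names are hypothetical, and a local sort only stands in for the framework's shuffle.

```shell
# Hypothetical sample records: key<TAB>value, one per line.
sample=$(mktemp)
printf 'pear\t1\napple\t2\nfig\t3\n' > "$sample"

# mapper (cat) -> sort by key (stand-in for the shuffle) -> reducer (cat)
tab=$(printf '\t')
sorted=$(cat "$sample" | LC_ALL=C sort -t "$tab" -k1,1 | cat)
printf '%s\n' "$sorted"

# Key of the first record after sorting.
first_key=$(printf '%s\n' "$sorted" | head -n 1 | cut -f1)
rm -f "$sample"
```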
Note that these are the commands I used, although they generate 400 GB of
data and then sort only 10% of it. Clearly, it is a bit faster to just
generate 40 GB and sort all of it. I'm just going to run the bigger
sort in the next couple of days.
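The 10% figure follows from the part-*0 glob, which only matches output files whose names end in 0. The sketch below illustrates this on a scratch directory, assuming 100 output files named part-00000 through part-00099 (the real file count is not stated in this thread). It also computes the total_bytes value that would correspond to 40 GB, assuming the same binary units as 429496729600 (which is 400 * 2^30).

```shell
# Hypothetical scratch directory standing in for /gridmix/data/sort/text.
dir=$(mktemp -d)
i=0
while [ "$i" -lt 100 ]; do
  : > "$dir/$(printf 'part-%05d' "$i")"
  i=$((i + 1))
done

# Count all files, then only those matched by the part-*0 glob.
total=$(ls "$dir" | wc -l | tr -d ' ')
matched=$(ls "$dir"/part-*0 | wc -l | tr -d ' ')
echo "part-*0 matches $matched of $total files"

# Candidate test.randomtextwrite.total_bytes value for a 40 GB run.
small_total_bytes=$((40 * 1024 * 1024 * 1024))
echo "$small_total_bytes"

rm -rf "$dir"
```

With a 40 GB data set the sort input could then be the whole output directory rather than the part-*0 subset.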
In particular what was the Java Map-reduce program that you ran? Was it
src/examples/org/apache/hadoop/examples/Sort.java ?
Yes
Also, I can't find anything called "RandomTextWriter" in the source
tarball, can you point me to it?
It is in the example directory of 0.15 too. The only remaining piece
is the pipes sort program and I'll upload that to HADOOP-2127.
-- Owen