Thanks, Owen. Did it look like the system was CPU bound? It would be
interesting to see some top output for the various runs. It would also be
interesting to profile the Java stuff in both Pipes mode and non-Pipes mode.

- Doug
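
One rough way to answer the CPU-bound question without eyeballing top is to sample the aggregate "cpu" line of /proc/stat twice and compare. The sketch below is only an illustration, assuming a Linux node; the function names and the two sample lines are made up and are not measurements from these runs.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer jiffies."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # Guest time is already counted inside user time, so it is skipped.
    return [int(x) for x in fields[1:9]]


def busy_fraction(before, after):
    """Fraction of CPU time between two samples not spent idle or in iowait."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle and iowait columns
    return (total - idle) / total if total else 0.0


# Made-up samples, standing in for two reads taken a few seconds apart.
sample_before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
sample_after = "cpu  1800 0 600 8050 550 0 0 0 0 0"

busy = busy_fraction(sample_before, sample_after)
print(f"busy fraction: {busy:.2f}")
```

On a live node, the two lines would come from reading the first line of /proc/stat before and after a short sleep. A busy fraction near 1.0 during the map or reduce phase would suggest the tasks are CPU bound, while a large iowait share would point at disk or network instead.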
On Nov 8, 2007 7:00 PM, Owen O'Malley wrote:

On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:

Can you provide more details of your test?
Sure, I guess I should have been more specific to start with. *grin*

The data was generated with:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter -conf gridmix-text.xml \
-outFormat org.apache.hadoop.mapred.TextOutputFormat /gridmix/data/sort/text
contents of gridmix-text.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<configuration>

<property>
<name>test.randomtextwrite.total_bytes</name>
<value>429496729600</value>
</property>

<property>
<name>test.randomtextwrite.min_words_key</name>
<value>1</value>
</property>

<property>
<name>test.randomtextwrite.max_words_key</name>
<value>10</value>
</property>

<property>
<name>test.randomtextwrite.min_words_value</name>
<value>0</value>
</property>

<property>
<name>test.randomtextwrite.max_words_value</name>
<value>200</value>
</property>

</configuration>
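
To sanity-check a configuration like this before a long run, it can help to read the properties back and convert the byte count into a familiar unit. The following is a minimal sketch using only the Python standard library; the helper name `load_properties` is hypothetical, and the XML is the same set of properties as in gridmix-text.xml, with the stylesheet line left out because it does not affect parsing.

```python
import xml.etree.ElementTree as ET

# Same property names and values as gridmix-text.xml in this message.
CONF_XML = """<?xml version="1.0"?>
<configuration>
<property>
<name>test.randomtextwrite.total_bytes</name>
<value>429496729600</value>
</property>
<property>
<name>test.randomtextwrite.min_words_key</name>
<value>1</value>
</property>
<property>
<name>test.randomtextwrite.max_words_key</name>
<value>10</value>
</property>
<property>
<name>test.randomtextwrite.min_words_value</name>
<value>0</value>
</property>
<property>
<name>test.randomtextwrite.max_words_value</name>
<value>200</value>
</property>
</configuration>
"""


def load_properties(xml_text):
    """Return a {name: value} dict from a Hadoop-style configuration file."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


props = load_properties(CONF_XML)
total_bytes = int(props["test.randomtextwrite.total_bytes"])
total_gib = total_bytes / 2**30  # bytes to GiB
print(f"{total_bytes} bytes = {total_gib:.0f} GiB")
```

The byte count works out to exactly 400 * 2^30, which matches the "400gb" figure mentioned later in the message.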

And then ran the sort as:

Java:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
-inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
-outFormat org.apache.hadoop.mapred.TextOutputFormat \
-outKey org.apache.hadoop.io.Text -outValue org.apache.hadoop.io.Text \
/gridmix/data/sort/text/part-*0 java-out

Pipes:
bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
-inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
-program /gridmix/programs/pipes-sort -reduces 78 \
-jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
-writer org.apache.hadoop.mapred.TextOutputFormat

Streaming:
bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
-input /gridmix/data/sort/text/part-*0 -output stream-out -mapper cat -reducer cat \
-numReduceTasks 78

Note that these are the commands I used, although they generate 400 GB of
data and then sort only 10% of it. Clearly, it would be a bit faster to just
generate 40 GB and sort all of it. I'm going to run the bigger sort in the
next couple of days.
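
The 10% figure follows from the `part-*0` glob in the sort commands, which only matches output files whose names end in 0. A small sketch of that selection is below; the count of 100 part files is a hypothetical stand-in, since the actual number of files in the run is not stated here, and the estimate assumes the part files are of roughly equal size.

```python
from fnmatch import fnmatch

# Hypothetical listing: 100 output files named part-00000 .. part-00099.
# The real number of part files from the run is not given in the thread.
part_files = [f"part-{i:05d}" for i in range(100)]

# Same pattern as the -input / input-path argument of the sort commands.
selected = [name for name in part_files if fnmatch(name, "part-*0")]

generated_gb = 400  # size of the generated data set
sorted_gb = len(selected) * generated_gb / len(part_files)

print(f"{len(selected)} of {len(part_files)} files selected")
print(f"approximately {sorted_gb:.0f} GB sorted out of {generated_gb} GB")
```

Any file count that is a multiple of 10 gives the same one-in-ten ratio, which is why generating 40 GB and sorting all of it would be an equivalent test.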

In particular, what was the Java Map-reduce program that you ran? Was it
src/examples/org/apache/hadoop/examples/Sort.java?
Yes.
Also, I can't find anything called "RandomTextWriter" in the source
tarball, can you point me to it?
It is in the examples directory of 0.15 too. The only remaining piece is
the pipes sort program, and I'll upload that to HADOOP-2127.

-- Owen

Discussion Overview
group: common-user @
category: hadoop
posted: Nov 9, '07 at 1:03a
active: Nov 9, '07 at 8:15a
posts: 14
users: 5
website: hadoop.apache.org...
irc: #hadoop
