I set up a little benchmark on a 39 node cluster to sort 40gb of
random text data (generated by RandomTextWriter using key length:
1-10 words and value length: 0-200 words, data uncompressed). The
runtimes in minutes are:

Java: 4:22
C++ (Pipes): 3:50
Streaming: 4:44

I was surprised to find that Pipes outperformed Java, even with the
extra process. I suspect it was because of the buffering between the
input and output of Pipes.
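The buffering hypothesis can be illustrated with plain JDK streams. This is a hypothetical sketch of the effect, not the actual Pipes transport code; the class name and buffer size are made up for illustration:

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Sketch: writing one record at a time to a real pipe costs a syscall per
// write; a BufferedOutputStream coalesces small writes into large transfers.
public class BufferingSketch {
    public static void main(String[] args) throws IOException {
        byte[] record = "key\tvalue\n".getBytes("UTF-8"); // 10 bytes per record
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Buffer with a 64 KB window, as an interprocess transport might.
        OutputStream out = new BufferedOutputStream(sink, 64 * 1024);
        for (int i = 0; i < 1000; i++) {
            out.write(record); // accumulates in the buffer, no transfer yet
        }
        out.flush(); // a few large transfers instead of 1000 small ones
        System.out.println(sink.size()); // 10 bytes * 1000 records
    }
}
```

The record data that reaches the sink is identical either way; buffering only amortizes the per-write overhead, which is where a benchmark dominated by small key/value records could gain time.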

-- Owen


  • Aaron Kimball at Nov 9, 2007 at 1:11 am
    Neat benchmark. I've been meaning to do exactly that myself. And that is
    a surprise about Pipes!

    Thanks for the data
    - Aaron

  • Milind A Bhandarkar at Nov 9, 2007 at 1:16 am
    When you figure it out, could you please suggest an optimization for streaming?

    Does Pipes deserialize and serialize data for the identity mappers, or just pass it through? (Streaming converts input to text, afaik.)

    - milind


    ----- Original Message -----
    From: Owen O'Malley <oom@yahoo-inc.com>
    To: hadoop-user@lucene.apache.org <hadoop-user@lucene.apache.org>
    Sent: Thu Nov 08 17:03:01 2007
    Subject: sort speeds under java, c++, and streaming

  • Owen O'Malley at Nov 9, 2007 at 3:11 am

    On Nov 8, 2007, at 5:14 PM, Milind A Bhandarkar wrote:

    Does Pipes deserialize and serialize data for the identity
    mappers, or just pass it through? (Streaming converts input to
    text, afaik.)

    Pipes serializes the objects to bytes and sends them to the C++
    program. The C++ program gets them as C++ strings, which are
    effectively byte arrays. Pipes does not do the conversion to Java
    strings that Streaming does, so Pipes can support arbitrary
    Writable objects. Hopefully in the future, we can change the
    map/reduce API to provide access to the raw bytes in the mapper and
    reducer as an option. In that case, Pipes would not need to
    serialize at all.

    -- Owen
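The raw-bytes vs. text distinction Owen describes can be sketched with plain JDK classes. This is an illustration of the principle, not Hadoop's actual code paths:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: Pipes hands the C++ side the serialized bytes unchanged, while
// Streaming must decode every record to text first. Text decoding is lossy
// for byte sequences that are not valid UTF-8, which is why Streaming
// cannot carry arbitrary Writable payloads.
public class PassThroughSketch {
    public static void main(String[] args) {
        byte[] raw = {(byte) 0x00, (byte) 0xFF, 'a', 'b'}; // arbitrary serialized bytes
        // Pipes-style pass-through: the receiver sees the bytes unchanged.
        byte[] pipesView = Arrays.copyOf(raw, raw.length);
        System.out.println(Arrays.equals(raw, pipesView)); // true

        // Streaming-style: decode to text, then re-encode.
        String text = new String(raw, StandardCharsets.UTF_8);
        byte[] roundTrip = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(raw, roundTrip)); // false: 0xFF was replaced
    }
}
```

The invalid byte 0xFF becomes the U+FFFD replacement character during decoding, so the round trip does not reproduce the original record.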
  • Joydeep Sen Sarma at Nov 9, 2007 at 1:36 am
    Doesn't the sorting and merging all still happen in Java-land?

  • Owen O'Malley at Nov 9, 2007 at 3:02 am

    On Nov 8, 2007, at 5:35 PM, Joydeep Sen Sarma wrote:

    Doesn't the sorting and merging all still happen in Java-land?

    Yes, that is why it surprised me.

    -- Owen
  • Joydeep Sen Sarma at Nov 9, 2007 at 7:29 am
    What about the partitioner or combiner - C++ or Java?

    Just wondering - there's got to be something that got subtracted.

  • Owen O'Malley at Nov 9, 2007 at 8:13 am

    On Nov 8, 2007, at 11:28 PM, Joydeep Sen Sarma wrote:

    What about the partitioner or combiner - C++ or Java?

    Both the partitioners were in Java. There was no combiner.

    Just wondering - there's got to be something that got subtracted.

    I found what was happening. The Java sort application was setting the
    number of maps, so it was getting 400 maps instead of 320. Forcing it
    to 320 maps using "-m 320", I get numbers basically the same as Pipes:

    Java: 3:59, 4:03, 4:06

    -- Owen
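The arithmetic behind the fix is worth spelling out: with the same input, the map count sets how much data each map task processes (assuming the 40gb figure is GiB and splits are even):

```java
// Per-map input size at 320 maps vs. the 400 maps the Java sort was using.
public class SplitMath {
    public static void main(String[] args) {
        long totalBytes = 40L * 1024 * 1024 * 1024; // 40 GiB of input
        System.out.println(totalBytes / 320 / (1024 * 1024)); // MiB per map at 320 maps
        System.out.println(totalBytes / 400 / (1024 * 1024)); // MiB per map at 400 maps
    }
}
```

So the Java run had been scheduling 80 extra, smaller map tasks, and equalizing the split size brought its times in line with Pipes.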
  • Doug Judd at Nov 9, 2007 at 1:40 am
    Hi Owen,

    Can you provide more details of your test? In particular, what was the
    Java map/reduce program that you ran? Was it
    src/examples/org/apache/hadoop/examples/Sort.java? Also, I can't find
    anything called "RandomTextWriter" in the source tarball; can you point
    me to it? Thanks.

    - Doug
  • Owen O'Malley at Nov 9, 2007 at 3:01 am

    On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:

    Can you provide more details of your test?

    Sure, I guess I should have been more specific to start with. *grin*

    The data was generated with:

    bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter \
      -conf gridmix-text.xml \
      -outFormat org.apache.hadoop.mapred.TextOutputFormat \
      /gridmix/data/sort/text
    Contents of gridmix-text.xml:

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

    <configuration>

      <property>
        <name>test.randomtextwrite.total_bytes</name>
        <value>429496729600</value>
      </property>

      <property>
        <name>test.randomtextwrite.min_words_key</name>
        <value>1</value>
      </property>

      <property>
        <name>test.randomtextwrite.max_words_key</name>
        <value>10</value>
      </property>

      <property>
        <name>test.randomtextwrite.min_words_value</name>
        <value>0</value>
      </property>

      <property>
        <name>test.randomtextwrite.max_words_value</name>
        <value>200</value>
      </property>

    </configuration>
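A quick sanity check on the total_bytes value in the config, which works out to exactly 400 GiB:

```java
// 429496729600 bytes expressed in GiB (2^30 bytes).
public class TotalBytes {
    public static void main(String[] args) {
        long totalBytes = 429496729600L;
        System.out.println(totalBytes / (1024L * 1024 * 1024)); // GiB
    }
}
```

That matches Owen's note further down that the writer generates 400gb, of which only a tenth is then sorted.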

    And then ran the sort as:

    Java:
    bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
      -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
      -outFormat org.apache.hadoop.mapred.TextOutputFormat \
      -outKey org.apache.hadoop.io.Text \
      -outValue org.apache.hadoop.io.Text \
      /gridmix/data/sort/text/part-*0 java-out

    Pipes:
    bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
      -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
      -program /gridmix/programs/pipes-sort -reduces 78 \
      -jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
      -writer org.apache.hadoop.mapred.TextOutputFormat

    Streaming:
    bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
      -input /gridmix/data/sort/text/part-*0 -output stream-out \
      -mapper cat -reducer cat \
      -numReduceTasks 78

    Note that these are the commands I used, although they generate 400gb
    of data and then only sort 10% of it. Clearly, it would be a bit faster
    to just generate 40gb and sort all of it; I'm going to run the bigger
    sort in the next couple of days.
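The 10% figure comes from the part-*0 input glob, which matches only part files whose names end in 0, i.e. roughly every tenth file. A sketch of the selection, assuming 400 sequentially numbered part files (the count is illustrative, not from the thread):

```java
// Count the part files a part-*0 glob would select out of 400.
public class GlobSketch {
    public static void main(String[] args) {
        int matched = 0;
        for (int i = 0; i < 400; i++) {
            String name = String.format("part-%05d", i); // part-00000, part-00001, ...
            if (name.endsWith("0")) { // what the part-*0 glob matches
                matched++;
            }
        }
        System.out.println(matched); // every tenth file
    }
}
```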
    In particular what was the Java map/reduce program that you ran? Was it
    src/examples/org/apache/hadoop/examples/Sort.java?

    Yes.

    Also, I can't find anything called "RandomTextWriter" in the source
    tarball, can you point me to it?

    It is in the examples directory of 0.15 too. The only remaining piece
    is the pipes sort program, and I'll upload that to HADOOP-2127.

    -- Owen
  • Doug Judd at Nov 9, 2007 at 4:39 am
    Thanks, Owen. Did it look like the system was CPU bound? It would be
    interesting to see some top output for the various runs. It would also be
    interesting to profile the Java stuff in both Pipes mode and non-Pipes mode.

    - Doug
  • Owen O'Malley at Nov 9, 2007 at 8:15 am

    On Nov 8, 2007, at 8:39 PM, Doug Judd wrote:

    Thanks, Owen. Did it look like the system was CPU bound?

    I looked while the Java one was running and it was working a couple of
    the CPUs pretty hard. (I was only running with the default 2 tasks/node,
    which is really low given these are nice 8-CPU machines.)

    I should also mention that I was using a 500 node HDFS cluster that is
    a superset of the 39 node + 1 job tracker map/reduce cluster, so most
    of the HDFS reads and writes were outside of the map/reduce cluster.

    It would be interesting to see some top output for the various runs. It
    would also be interesting to profile the Java stuff in both Pipes mode
    and non-Pipes mode.

    What I'm doing is putting together a somewhat representative workload
    to look at increasing utilization, so at some point I'll deep dive into
    the details, but the first pass will be looking at the top-level issues.

    -- Owen
  • Milind A Bhandarkar at Nov 9, 2007 at 3:46 am
    One more thing about your original numbers.

    Are they repeatable ?

    - milind

  • Owen O'Malley at Nov 9, 2007 at 7:31 am

    On Nov 8, 2007, at 7:45 PM, Milind A Bhandarkar wrote:

    Are they repeatable ?
    I ran 3 more runs of each and it looks pretty stable:

    Java: 4:15, 4:06, 4:11
    Pipes: 4:03, 3:51, 4:03
    Streaming: 4:58, 4:59, 5:00

    -- Owen
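Averaging those repeat runs makes the stability easier to see (mm:ss converted to seconds, truncated integer average):

```java
// Mean runtime of the three repeat runs reported above, per framework.
public class RunAverages {
    static int secs(String mmss) {
        String[] p = mmss.split(":");
        return Integer.parseInt(p[0]) * 60 + Integer.parseInt(p[1]);
    }

    public static void main(String[] args) {
        String[][] runs = {
            {"Java", "4:15", "4:06", "4:11"},
            {"Pipes", "4:03", "3:51", "4:03"},
            {"Streaming", "4:58", "4:59", "5:00"},
        };
        for (String[] r : runs) {
            int avg = (secs(r[1]) + secs(r[2]) + secs(r[3])) / 3;
            System.out.println(r[0] + ": " + avg / 60 + ":"
                + String.format("%02d", avg % 60));
        }
    }
}
```

With 320 maps on the Java side (see Owen's earlier follow-up), Java and Pipes are effectively tied, with Streaming about a minute behind.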

Discussion Overview
group: common-user
categories: hadoop
posted: Nov 9, '07 at 1:03a
active: Nov 9, '07 at 8:15a
posts: 14
users: 5
website: hadoop.apache.org...
irc: #hadoop
