Hello all.

I've just tested MapReduce in C++ against the Java version.

I ran the WordCount example included in the 0.14.1 release on a single-node
Hadoop cluster (a Pentium D with 2GB of RAM).
There were 2 input files (one 4.5MB file + one 36MB file).
I also took the Combiner out of the Java WordCount job, since there
was no Combiner used in the C++ version.
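
For reference, here is roughly how the C++ version is wired up. This is a
minimal sketch after the wordcount-simple.cc example that ships with the
release (class and header names follow that example); the two-argument
TemplateFactory registers only a mapper and a reducer, so no combiner runs
on the C++ side:

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    class WordCountMap: public HadoopPipes::Mapper {
    public:
      WordCountMap(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        // Emit <word, "1"> for each whitespace-separated token in the line.
        std::vector<std::string> words =
            HadoopUtils::splitString(context.getInputValue(), " ");
        for (unsigned int i = 0; i < words.size(); ++i) {
          context.emit(words[i], "1");
        }
      }
    };

    class WordCountReduce: public HadoopPipes::Reducer {
    public:
      WordCountReduce(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        // Sum all the counts seen for one word.
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

    int main(int argc, char *argv[]) {
      // Only mapper and reducer are registered; no combiner.
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>());
    }

The binary is submitted with the Pipes driver, along the lines of
"bin/hadoop pipes -input <in> -output <out> -program <path-to-binary>".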

The result is... as many of you have guessed, the Java version won the race
big time. It was about 4 times quicker.

Here are the detailed results.


                               Java Version   C++ Version   Ratio (Java : C++)
Total Time Taken                    89            364            1 : 4
Longest Time Taken for Map          41             83            1 : 2
Longest Time Taken for Reduce       58            264            1 : 4.5

Any guesses or ideas on how to improve the performance of C++ MapReduce?

Taeho

--
Taeho Kang [tkang.blogspot.com]


  • Owen O'Malley at Sep 13, 2007 at 5:23 pm

    On Sep 13, 2007, at 2:20 AM, Taeho Kang wrote:

    > I ran the WordCount example included in the 0.14.1 release on a
    > single-node Hadoop cluster (a Pentium D with 2GB of RAM).

    Thanks for running the benchmark. I'm afraid that with such a small
    cluster and data size you are getting swamped in the start-up costs.
    I have not done enough benchmarking of the C++ bindings.

    > There were 2 input files (one 4.5MB file + one 36MB file).
    > I also took the Combiner out of the Java WordCount job, since there
    > was no Combiner used in the C++ version.

    Actually, the wordcount-part.cc example does have a combiner. You
    would want to remove the partitioner from that example that forces
    every key to partition 0, however. *smile* In retrospect, the bad
    partitioner wasn't a good idea as an example. I should move it to a
    test case.
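
    For anyone who hasn't looked at the example, the partitioner in
    question looks roughly like this (a sketch after wordcount-part.cc,
    reusing the WordCountMap and WordCountReduce classes from the sketch
    earlier in the thread; note that the fourth TemplateFactory slot
    reuses the reducer as the combiner):

        // Sketch after wordcount-part.cc (names as in that example);
        // WordCountMap and WordCountReduce are as in the earlier sketch.
        class WordCountPartitioner: public HadoopPipes::Partitioner {
        public:
          WordCountPartitioner(HadoopPipes::TaskContext& context) {}
          virtual int partition(const std::string& key, int numOfReduces) {
            return 0;  // every key goes to partition 0: one reduce does all the work
          }
        };

        int main(int argc, char *argv[]) {
          // Third template slot: partitioner; fourth: combiner (the
          // reducer class reused).
          return HadoopPipes::runTask(
              HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                           WordCountPartitioner,
                                           WordCountReduce>());
        }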
    > The result is... as many of you have guessed, the Java version won
    > the race big time. It was about 4 times quicker.

    I'll write a sort benchmark for C++ so that we can run a reasonably
    large program. Note that for simple programs, the C++ version is by
    definition slower, since Pipes runs the C++ binary as a subprocess
    underneath a Java mapper and reducer.

    -- Owen
  • Taeho Kang at Sep 14, 2007 at 1:13 am
    Thanks for your answers and clarifications.
    I will try to do some more benchmark testing with more nodes and keep you
    guys posted.


    --
    Taeho Kang [tkang.blogspot.com]
    Software Engineer, NHN Corporation, Korea
