When you figure it out, could you please suggest an optimization for streaming?
Does Pipes deserialize and serialize data for the identity mappers, or does it just pass it through? (Streaming converts input to text, AFAIK.)
- milind
----- Original Message -----
From: Owen O'Malley <oom@yahoo-inc.com>
To: hadoop-user@lucene.apache.org <hadoop-user@lucene.apache.org>
Sent: Thu Nov 08 17:03:01 2007
Subject: sort speeds under java, c++, and streaming
I set up a little benchmark on a 39-node cluster to sort 40 GB of
random text data (generated by RandomTextWriter using key length:
1-10 words and value length: 0-200 words, data uncompressed). The
runtimes (minutes:seconds) are:
Java: 4:22
C++ (Pipes): 3:50
Streaming: 4:44
I was surprised to find that Pipes outperformed Java, even with the
extra process. I suspect it was because of the buffering between the
input and output of Pipes.
-- Owen
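
As a rough illustration of the buffering idea raised above, here is a minimal sketch of an identity mapper script that copies its input to its output in large binary chunks instead of line by line. It is not taken from the thread and says nothing about how Streaming or Pipes work internally. The function name identity_map and the 1 MiB buffer size are hypothetical choices for illustration, and whether script-side buffering accounts for any of the measured difference is untested here.

```python
import io
import sys

# Hypothetical buffer size chosen for illustration; not tuned or measured.
BUFFER_SIZE = 1 << 20  # 1 MiB


def identity_map(src, dst, buffer_size=BUFFER_SIZE):
    """Copy every byte from src to dst unchanged, reading in large chunks.

    src and dst are binary file-like objects. Returns the number of
    bytes copied.
    """
    total = 0
    while True:
        chunk = src.read(buffer_size)
        if not chunk:  # empty bytes object means end of input
            break
        dst.write(chunk)
        total += len(chunk)
    dst.flush()
    return total
```

A mapper script built on this would call identity_map(sys.stdin.buffer, sys.stdout.buffer), which bypasses Python's text decoding and per-line iteration. Because an identity mapper never inspects the records, it does not need to split on newlines or tabs at all.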