Sam Ritchie
Jan 7, 2012 at 3:11 pm
Ah, that's it -- can you remove JavaSerialization from your job
configuration? I recently added support for Kryo, which is far more
efficient than JavaSerialization and effectively replaces it.
It looks like the explicit value of "io.serializations" in your cluster
configuration is overriding the value I set up from within Cascalog. Try
changing your list of serializations to this:
cascading.tuple.hadoop.BytesSerialization,
cascading.tuple.hadoop.TupleSerialization,
org.apache.hadoop.io.serializer.WritableSerialization,
cascalog.hadoop.ClojureKryoSerialization
You'll see big improvements, in this case and in any other case where
JavaSerialization was grabbing tuples.
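For reference, a minimal sketch of forcing that "io.serializations" value from
within a query (this assumes cascalog.api's with-job-conf macro and uses
placeholder tap paths; adjust for your actual setup):

    ;; Sketch only: the tap paths below are placeholders, not your real job.
    (use 'cascalog.api)

    (with-job-conf
      {"io.serializations"
       (str "cascading.tuple.hadoop.BytesSerialization,"
            "cascading.tuple.hadoop.TupleSerialization,"
            "org.apache.hadoop.io.serializer.WritableSerialization,"
            "cascalog.hadoop.ClojureKryoSerialization")}
      (?<- (hfs-seqfile "/path/to/output")
           [?pair]
           ((hfs-seqfile "/path/to/input") _ ?pair)))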
On Sat, Jan 7, 2012 at 2:30 AM, Artem Boytsov wrote:
Amazingly, if I replace the vector of longs with its textual representation
simply by (str/join " " vec), the difference between "map input bytes" and
"map output bytes" becomes only x3 (instead of x40!!!!), and the whole
mapreduce is finished about 10 times faster.
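(A sketch of that workaround as a custom operation -- the stringify-pair name
is just illustrative, assuming Cascalog's defmapop and the same input/output
taps as the original query:)

    (use 'cascalog.api)
    (require '[clojure.string :as str])

    ;; Hypothetical op: join the two-element vector into a string like "2 3".
    (defmapop stringify-pair [pair]
      (str/join " " pair))

    (?<- output [?pair-str]
         (input _ ?pair)
         (stringify-pair ?pair :> ?pair-str))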
Looking at my screenshot, when using a vector, 2GB of map input becomes
90GB of map output. When I substitute the vector with a string, 15GB of map
input becomes 48GB of map output.
I am at a loss for words.
Any ideas?
Artem.
On Sat, Jan 7, 2012 at 12:19 AM, Artem Boytsov wrote:
I realized that map output bytes are probably measured before compression,
though a 40x compression ratio for this data still doesn't seem plausible.
I dumped the data in a readable format (e.g. "-2147160804 [-1860497744
681993524]"), and even that file is only about 3x larger than the original,
binary, compressed version.
Artem.
On Sat, Jan 7, 2012 at 12:07 AM, Artem Boytsov wrote:
Hello, guys,
I'm seeing a very weird map output blowup which I find very hard to
explain. The intermediate map output bytes are 40x the map input bytes for
an extremely simple query:
(?<- output [?pair] (input _ ?pair))
Both output and input files are hfs-seqfile.
The input has two fields: java.lang.Long and
clojure.lang.PersistentVector of java.lang.Long (always of size 2), for
example:
1 [ 2, 3 ]
The query simply seeks to throw out the first field and dedupe the
second.
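(For completeness, the whole job looks roughly like this -- the HDFS paths are
placeholders:)

    (use 'cascalog.api)

    ;; Placeholder paths; the real taps come from hfs-seqfile as described.
    (let [input  (hfs-seqfile "/data/pairs-in")    ; fields: long id, pair vector
          output (hfs-seqfile "/data/pairs-out")]
      ;; Keep only ?pair; Cascalog deduplicates the output tuples implicitly.
      (?<- output [?pair] (input _ ?pair)))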
Compression is turned on using the LZO codec. Map output compression is
turned on as well.
Serializers specified: cascading.tuple.hadoop.BytesSerialization,
cascading.tuple.hadoop.TupleSerialization,
org.apache.hadoop.io.serializer.WritableSerialization,
org.apache.hadoop.io.serializer.JavaSerialization
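(The compression side of that configuration, expressed as job-conf properties --
the property names are the Hadoop 0.20/1.x ones and the codec class comes from
the hadoop-lzo package, so double-check them against the actual cluster config:)

    (use 'cascalog.api)

    ;; Sketch of the compression settings described above; verify the exact
    ;; property names and codec class against your Hadoop/LZO install.
    (with-job-conf
      {"mapred.output.compress"              "true"
       "mapred.output.compression.codec"     "com.hadoop.compression.lzo.LzoCodec"
       "mapred.compress.map.output"          "true"
       "mapred.map.output.compression.codec" "com.hadoop.compression.lzo.LzoCodec"}
      (?<- output [?pair] (input _ ?pair)))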
Libraries used:
ls -1 lib|grep casca
cascading-core-1.2.4.jar
cascading.kryo-0.1.5.jar
cascalog-1.8.5-20120102.070300-13.jar
Please take a look at the attached screenshot:
- 175M input records at ~2GB, which is ~10 bytes per record. Sounds
reasonable.
- Map output records are the same (correct), but look at the map output
bytes: 90GB (!!!)
How can that be? The map output should be just ?pair; the map input is a
long + ?pair. I am really baffled. Even if you converted the records to their
decimal ASCII representation, it wouldn't be 40x. Could it be a Cascading issue?
Thank you so much for your help!
Artem.
--
Sam Ritchie, Twitter Inc
703.662.1337
@sritchie09
(Too brief? Here's why!
http://emailcharter.org)