Hi!
I've noticed that Hadoop streaming has big problems handling long lines.
In my particular case, a reducer process that emits very long output lines takes
a very long time to run and sometimes crashes with various seemingly random
errors, a Java OutOfMemoryError being the most benign.
(This is measurable: a reducer outputting 10,000 lines of 32,000 bytes each takes
~11 minutes to run, while a reducer outputting 10 lines of 32,000,000 bytes each
takes ~110 minutes. That is the same total output volume of ~320 MB, running
roughly 10x slower when packed into long lines. A minimal repro sketch is below.)
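Something like the following stand-in reducer should reproduce the setup. It is
only a sketch, not my actual job: the LINES/WIDTH environment variables and the
class name are made up for illustration. It ignores its input and just emits a
fixed number of lines of a fixed width, so line length is the only variable when
run as the -reducer of a streaming job:

    // Hypothetical stand-in reducer for timing long vs. short output lines.
    // Compare e.g. LINES=10000 WIDTH=32000 against LINES=10 WIDTH=32000000.
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Arrays;

    public class LongLineReducer {
        public static void main(String[] args) throws Exception {
            int lines = Integer.parseInt(System.getenv().getOrDefault("LINES", "10000"));
            int width = Integer.parseInt(System.getenv().getOrDefault("WIDTH", "32000"));

            // Drain stdin so the streaming framework does not block on a full pipe.
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            while (in.readLine() != null) { /* discard input */ }

            // Emit LINES lines of WIDTH bytes each.
            char[] payload = new char[width];
            Arrays.fill(payload, 'x');
            String line = new String(payload);
            for (int i = 0; i < lines; i++) {
                System.out.println(line);
            }
            System.out.flush();
        }
    }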
So my questions are:
-) Are the sorts used by Hadoop stable?
-) How does Hadoop decide which input lines go to which reducer? I've seen
situations where 19 reducers processed 20M input lines while a single reducer
had to process 80M input lines (my current understanding of the default
partitioning is sketched below; please correct me if it's wrong).
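As far as I understand it, the default HashPartitioner routes each key to
partition (hash(key) & Integer.MAX_VALUE) % numReduceTasks, so all lines with
the same key land on one reducer and a single very frequent key can overload it.
Here is a small sketch of that rule; the key names and counts are made up for
illustration, not my real data:

    // Sketch of the default hash-partitioning rule as I understand it.
    import java.util.LinkedHashMap;
    import java.util.Map;

    public class PartitionSkewDemo {
        // Default HashPartitioner rule: hash of the key modulo the reducer count.
        static int partitionFor(String key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            int numReduceTasks = 20;

            // Made-up key distribution: one dominant key plus 19 smaller ones.
            Map<String, Long> keyCounts = new LinkedHashMap<>();
            keyCounts.put("hotKey", 80_000_000L);
            for (int i = 0; i < 19; i++) {
                keyCounts.put("key" + i, 20_000_000L);
            }

            // All lines of a key go to one partition, so a heavy key makes
            // its reducer heavy regardless of how the other keys spread out.
            long[] linesPerReducer = new long[numReduceTasks];
            for (Map.Entry<String, Long> e : keyCounts.entrySet()) {
                linesPerReducer[partitionFor(e.getKey(), numReduceTasks)] += e.getValue();
            }
            for (int p = 0; p < numReduceTasks; p++) {
                System.out.printf("reducer %2d: %d input lines%n", p, linesPerReducer[p]);
            }
        }
    }

If that is indeed how the allocation works, is there a recommended way for a
streaming job to spread such a skewed key distribution more evenly?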
TIA,
Andreas