I'm creating multiple sequence files as the output of a large MR job (with
SequenceFileOutputFormat). As expected, the keys in these sequence files
are nicely ordered, since the reduce step does that for us. However, when I
create an MR job to insert this data from the sequence files into HBase, the
sorted keys pose a problem: all mappers start writing to the same HBase
region, since the keys are ordered and Hadoop splits each file into parts
starting at the beginning.
I randomized the file names, which helps a little, but there is still a good
chance that large parts of the key space are inserted into the same region,
causing slowdowns.

Is there a way to randomize the keys in these sequence files? I can simply
put a random value before the key (like "%RND-keyname"), but I'm wondering
if there is a less dirty method, like a random partitioner class ;-)
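
One common pattern here is key salting: instead of a purely random prefix, prepend a deterministic bucket derived from the key itself, so the prefix can be recomputed (and stripped) at read time and writes spread across pre-split regions. The sketch below is only an illustration of that idea, not Hadoop/HBase API code; the bucket count `NUM_BUCKETS` and the `salt`/`unsalt` helpers are hypothetical names, and the assumption is that the HBase table is pre-split on the two-digit prefixes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class SaltedKeys {
    // Assumption: the HBase table is pre-split into this many regions,
    // one per two-digit salt prefix ("00-", "01-", ...).
    static final int NUM_BUCKETS = 16;

    // Prepend a deterministic salt so lexicographically adjacent keys
    // land in different buckets (and thus different regions).
    static String salt(String key) {
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
        return String.format("%02d-%s", bucket, key);
    }

    // Strip the "NN-" prefix to recover the original key at read time.
    static String unsalt(String salted) {
        return salted.substring(3);
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("alpha", "bravo", "charlie", "delta", "echo");
        Set<Integer> bucketsUsed = new TreeSet<>();
        for (String k : keys) {
            String s = salt(k);
            bucketsUsed.add(Integer.parseInt(s.substring(0, 2)));
            System.out.println(k + " -> " + s);
        }
        System.out.println("buckets used: " + bucketsUsed.size());
    }
}
```

Because the salt is a function of the key, a reader who knows the scheme can locate any row without scanning, which a purely random prefix would not allow.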


group: mapreduce-user
posted: Dec 27, '10 at 4:44p
author: Eric (1 post)
