Hi all,
piggybank.storage.MultiStorage allows storing the Pig output into
different directories, taken from a given field in a relation, so that
the output is partitioned by the unique values of that field.
This is just what I need for my use-case. However, I have about 50,000
unique values in the partitioning field. It seems that MultiStorage
will run one reducer per unique value, i.e., per output directory.
Obviously, this takes a long time.
Is there a better way of doing it?
I could group by the partitioning field and write a post-processing
script that goes through the Pig output and writes each line to a
different file, one per unique value. It would be simple, but I'd
prefer to do it all in Pig for consistency.
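
To make the fallback concrete, here is a minimal sketch of the kind of post-processing script I mean, in Python. It assumes the Pig output is tab-delimited text with the partitioning field in the first column, and that the field values are safe to use as directory names. The function name split_by_field and the part-00000 file name are placeholders of mine, not anything Pig or MultiStorage defines.

```python
import itertools
import os


def split_by_field(lines, out_dir, field_index=0, delimiter="\t"):
    """Append each line to out_dir/<field value>/part-00000 (hypothetical layout)."""

    def key(line):
        # The partition value is one delimited field of the line.
        return line.rstrip("\n").split(delimiter)[field_index]

    # groupby only merges consecutive lines with the same key, so grouped
    # input means one open file at a time. A key that shows up again later
    # is still handled, because the file is opened in append mode.
    for value, group in itertools.groupby(lines, key=key):
        target_dir = os.path.join(out_dir, value)
        os.makedirs(target_dir, exist_ok=True)
        with open(os.path.join(target_dir, "part-00000"), "a") as out:
            for line in group:
                out.write(line if line.endswith("\n") else line + "\n")
```

Fed the grouped output line by line (for example from sys.stdin), this never holds more than one output file open, which matters with roughly 50,000 distinct values. Because it appends, it should be run against an empty output directory.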
Thanks,
Thomas