Grokbase Groups Pig user June 2011
Hi all,

piggybank.storage.MultiStorage allows storing the Pig output into
different directories, taken from a given field in a relation, so that
the output is partitioned by the unique values of that field.

This is just what I need for my use case. However, I have about 50,000
unique values in the partitioning field. It seems that MultiStorage
will run one reducer per unique value, i.e., per output directory.
Obviously, this takes a long time.

Is there a better way of doing it?

I could group by the partitioning field and write a post-processing
script to go through the Pig output and write each line to a different
file. It would be simple, but I'd prefer to do it all in Pig for
consistency.

Thanks,
Thomas
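
For reference, the post-processing fallback described above could look roughly like the following. This is only a sketch under assumptions not stated in the post: the Pig output is tab-delimited text (the PigStorage default), the partitioning field is the first column, its values are safe to use as directory names, and the lines arrive grouped by that field. The function name split_by_key and the file name part-00000 are made up for illustration.

```python
import itertools
import os
import tempfile


def split_by_key(lines, out_dir, key_index=0, sep="\t"):
    """Write each line to out_dir/<key>/part-00000, one directory per key.

    Assumes lines arrive grouped by key, so only one output file is open
    at a time. Append mode keeps the result correct even if a key shows
    up in more than one run of consecutive lines.
    """
    def key_of(line):
        return line.rstrip("\n").split(sep)[key_index]

    counts = {}
    for key, run in itertools.groupby(lines, key=key_of):
        key_dir = os.path.join(out_dir, key)
        os.makedirs(key_dir, exist_ok=True)
        with open(os.path.join(key_dir, "part-00000"), "a") as out:
            for line in run:
                # Normalize the line ending so files always end in a newline.
                out.write(line if line.endswith("\n") else line + "\n")
                counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    sample = ["us\talice\n", "us\tbob\n", "de\tcarol\n"]
    with tempfile.TemporaryDirectory() as tmp:
        print(split_by_key(sample, tmp))
```

Because the input is grouped, the script never needs more than one open file handle, which matters with tens of thousands of distinct values.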

  • Daniel Dai at Jun 16, 2011 at 6:01 pm
    Try a custom partitioner:
    http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby

    Daniel
  • Jameson Li at Jun 17, 2011 at 9:53 am
    I have the same question as Thomas Kappler.
    It would be helpful if someone could explain in more detail the
    'custom partitioner' that Daniel Dai mentioned.
    I think the docs at 'piglatin_ref2.html#partitionby' are too brief.


  • Thomas Kappler at Jun 17, 2011 at 10:18 am

    On Thu, Jun 16, 2011 at 20:00, Daniel Dai wrote:
    > Try custom partitioner:
    > http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#partitionby

    AFAIK "partition by" maps to Hadoop partitioning, which decides
    which keys go to which reducer. That is a different problem.

    Hadoop In Action chapter 7.2 addresses partitioning into multiple
    output files, and highlights this difference. The book shows a custom
    implementation of MultipleOutputFormat as a solution.

    Thomas
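
To illustrate the distinction drawn in this reply, here is a small sketch of what a hash-style partitioner decides. It is not Pig or Hadoop code: zlib.crc32 is only a deterministic stand-in for the key's hash code, and the key names and reducer count are invented. The formula follows the usual "hash masked to non-negative, modulo the number of reducers" pattern.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick a reducer index for a key, in the style of a hash partitioner.

    zlib.crc32 is only a deterministic stand-in for the key's hash code.
    """
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def assign(keys, num_reducers):
    """Map reducer index -> sorted list of distinct keys routed to it."""
    buckets = defaultdict(set)
    for key in keys:
        buckets[partition(key, num_reducers)].add(key)
    return {r: sorted(ks) for r, ks in buckets.items()}


if __name__ == "__main__":
    keys = ["key%05d" % i for i in range(50000)]
    routed = assign(keys, 20)
    # Each reducer receives many distinct keys; the partitioner says nothing
    # about which output file or directory a record ends up in.
    for reducer in sorted(routed):
        print(reducer, len(routed[reducer]))
```

The partitioner only chooses a reducer for each key. Splitting the records into one output directory per value still has to happen in the storage or output-format layer.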

  • Xiaomeng Wan at Jun 17, 2011 at 4:44 pm
    We used to take the first character of the partition field and
    run MultiStorage on that.

    Shawn
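
The bucketing idea in this reply can be sketched as follows. It is an illustration only, not the poster's actual script: the function names and the "_empty" bucket for empty values are hypothetical, and the sample values are invented.

```python
from collections import defaultdict


def bucket_of(value):
    """Coarse bucket for a partition value: its first character.

    Empty values go to a hypothetical "_empty" bucket.
    """
    return value[0] if value else "_empty"


def group_into_buckets(values):
    """Map bucket -> sorted list of distinct partition values in that bucket."""
    buckets = defaultdict(set)
    for value in values:
        buckets[bucket_of(value)].add(value)
    return {b: sorted(vs) for b, vs in buckets.items()}


if __name__ == "__main__":
    values = ["apple", "avocado", "banana", "blueberry", "cherry"]
    for bucket, members in sorted(group_into_buckets(values).items()):
        print(bucket, members)
```

The trade-off is that each output directory then holds several distinct values, so whatever reads a directory still has to filter on the full field.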

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 16, '11 at 7:39a
active: Jun 17, '11 at 4:44p
posts: 5
users: 4
website: pig.apache.org
