I've been thinking more about the GROUP BY problem I talked about the other
day, specifically the fact that Pig does a bad job of dealing with relations
that were already GROUPed, and simply re-GROUPs them.

I was thinking that perhaps one way to deal with this would be to support
the concept of joins against local code which implements the Map<K,V>
interface.
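
For illustration, a minimal sketch of the kind of UDF I have in mind (the
LocalMapJoin name and the hard-coded map are hypothetical, just to show the
shape of it):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    // Hypothetical sketch: a UDF that answers lookups from a local
    // Map<K,V>, approximating a map-side join without shipping a
    // second relation through Hadoop.
    public class LocalMapJoin extends EvalFunc<Tuple> {
        private final Map<String, String> side = new HashMap<String, String>();

        public LocalMapJoin() {
            // In practice this would be loaded from a small local
            // resource rather than hard-coded.
            side.put("example-key", "example-value");
        }

        @Override
        public Tuple exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            String key = (String) input.get(0);
            String match = side.get(key);  // the local Map<K,V> lookup
            if (match == null) {
                return null;               // inner-join semantics
            }
            Tuple out = TupleFactory.getInstance().newTuple(2);
            out.set(0, key);
            out.set(1, match);
            return out;
        }
    }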

The way replicated joins work right now is roughly this:

Fragment replicate join is a special type of join that works well if one or
more relations are small enough to fit into main memory. In such cases, Pig
can perform a very efficient join because all of the Hadoop work is done on
the map side. In this type of join the large relation is followed by one or
more small relations. The small relations must be small enough to fit into
main memory; if they don't, the process fails and an error is generated.

This could be similar to replicated joins, except that you just page through
one side of the relation, since it's already sorted/grouped and set up for
that partition.

Any pre-sorted/grouped dataset which needs to be continually joined against
would benefit from this optimization.
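
To make the paging concrete: since both sides arrive sorted on the key, a
single forward pass over each side is enough. A minimal sketch in plain Java
(unique keys assumed for brevity):

    import java.util.Arrays;
    import java.util.Iterator;

    // Minimal sketch of the "page through one side" idea: one forward
    // pass over two key-sorted streams, holding nothing in memory.
    public class SortedMergeSketch {
        public static void main(String[] args) {
            Iterator<String> left = Arrays.asList("a", "b", "d").iterator();
            Iterator<String> right = Arrays.asList("b", "c", "d").iterator();

            String l = left.hasNext() ? left.next() : null;
            String r = right.hasNext() ? right.next() : null;
            while (l != null && r != null) {
                int cmp = l.compareTo(r);
                if (cmp == 0) {
                    System.out.println("match: " + l);  // emit joined pair
                    l = left.hasNext() ? left.next() : null;
                    r = right.hasNext() ? right.next() : null;
                } else if (cmp < 0) {
                    l = left.hasNext() ? left.next() : null;  // advance smaller side
                } else {
                    r = right.hasNext() ? right.next() : null;
                }
            }
        }
    }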

Kevin

--

Founder/CEO Spinn3r.com

Location: San Francisco, CA
Skype: burtonator

Skype-in: (415) 871-0687


  • Daniel Dai at Sep 5, 2011 at 5:00 am
    The easiest way to achieve this is to write a UDF and do the second
    cogroup yourself. Change your script:

    foo_grouped = GROUP foo BY col_a;
    both_grouped = GROUP foo_grouped BY $0, bar BY col_a;

    =>

    STORE foo_grouped INTO 'foo_grouped';
    bar_grouped = GROUP bar BY col_a;
    both_grouped = FOREACH bar_grouped GENERATE MERGE_UDF(*);

    In MERGE_UDF, you will need to:
    1. Find the index of the reducer. One way is through
    PigMapReduce.sJobConf.get("mapred.task.id"); there might be a better way.
    2. Find the part file of 'foo_grouped' corresponding to that index.
    3. If the part file is small enough, you can load it into memory and
    simulate a fragment-replicated cogroup; otherwise, you can do a merge
    cogroup (see the sketch below).
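
    A rough Java skeleton of MERGE_UDF along those lines (the part-file
    naming and the size threshold are illustrative assumptions that
    depend on your Hadoop/Pig versions, not settled API):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TaskAttemptID;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Tuple;

    public class MergeUdf extends EvalFunc<DataBag> {
        @Override
        public DataBag exec(Tuple input) throws IOException {
            // 1. Find the index of this reducer from the task attempt id.
            Configuration conf = PigMapReduce.sJobConf;
            String attempt = conf.get("mapred.task.id");
            int reducer = TaskAttemptID.forName(attempt).getTaskID().getId();

            // 2. Locate the part file of 'foo_grouped' written by the
            //    same partition (naming varies across Hadoop versions).
            Path part = new Path(String.format("foo_grouped/part-r-%05d", reducer));
            FileSystem fs = part.getFileSystem(conf);
            long size = fs.getFileStatus(part).getLen();

            // 3. Small enough: load it into memory and simulate a
            //    fragment-replicated cogroup; otherwise stream a merge
            //    cogroup over the sorted part file.
            if (size < 64L * 1024 * 1024) {  // arbitrary threshold
                // ... load the part file into a Map and probe it ...
            } else {
                // ... open a reader and merge on the sorted group key ...
            }
            return null;  // sketch only
        }
    }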

    Daniel
