Grokbase Groups Pig user August 2010
I need to sort the DataBags that are input to my UDF after a COGROUP.
I am currently sorting them in memory but it is not going to scale in
the long term.

Is there a way to control the way that Pig sorts them (e.g. as you can
with a WritableComparable in raw map/reduce) prior to passing them in
so that I don't have to respill them to disk?

Thanks for any info,
Anthony


  • Thejas M Nair at Aug 17, 2010 at 8:18 pm
    Pig will use the sort column of the bag as a secondary sort key for the MR
    job.
    In the case of cogroup, though, it does that only for the first bag. If
    you have a large bag and a small one, you can position them in the Pig query
    so that the secondary sort is used on the large one.


    This is what I tried (pig svn trunk version)-
    grunt> l1 = load 'x' as (a,b);
    grunt> l2 = load 'y' as (a,b);
    grunt> cg = cogroup l1 by a, l2 by a;
    grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2;}

    -- In the following explain output, note that there is no POSort for o1, and
    it says "Secondary sort: true".

    grunt> explain f
    ..
    ..
    #--------------------------------------------------
    # Map Reduce Plan
    #--------------------------------------------------
    MapReduce node 1-1018
    Map Plan
    Union[tuple] - 1-1019
    ---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
    Project[bytearray][0] - 1-1002

    ---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
    ---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
    Project[bytearray][0] - 1-1004

    ---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
    Reduce Plan
    f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
    ---f: New For Each(false,false,false)[bag] - 1-1014

    Project[bytearray][0] - 1-1005

    RelationToExpressionProject[bag][*] - 1-1009
    ---Project[tuple][1] - 1-1006
    RelationToExpressionProject[bag][*] - 1-1013
    ---o2: POSort[bag]() - 1-1012
    Project[bytearray][1] - 1-1011

    ---Project[tuple][2] - 1-1010
    ---cg: Package[tuple]{bytearray} - 1-1000--------
    Global sort: false
    Secondary sort: true
    ----------------
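
    The repositioning suggestion can be sketched as follows. This is only an
    illustration, assuming (hypothetically) that 'y' is the larger input: l2 is
    listed first in the cogroup so that its nested order is the one eligible
    for the secondary sort. Whether the optimization actually applies depends
    on the Pig version, so the explain output should be checked for a POSort
    on o2.

    ```pig
    -- Assumption: 'y' (l2) is the larger relation, so it goes first in the cogroup.
    l1 = load 'x' as (a, b);
    l2 = load 'y' as (a, b);
    cg = cogroup l2 by a, l1 by a;
    f = foreach cg {
        o2 = order l2 by b;  -- first bag: candidate for the secondary sort
        o1 = order l1 by b;  -- second bag: expect a POSort in the reduce plan
        generate group, o2, o1;
    }
    -- Look for "Secondary sort: true" and for which bag still has a POSort.
    explain f;
    ```

    If the plan matches the behaviour described above, only the smaller bag
    (o1) is sorted by a POSort on the reduce side.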




  • Dmitriy Ryaboy at Aug 17, 2010 at 8:32 pm
    Thejas, is that part of the new secondary sort optimization work that's in
    trunk, or was this in 0.7?

    -D
  • Thejas M Nair at Aug 17, 2010 at 11:19 pm
    I just checked the query plan with 0.7; it also has this optimization.
    -Thejas



Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 17, '10 at 7:00p
active: Aug 17, '10 at 11:19p
posts: 4
users: 3
website: pig.apache.org
