Grokbase Groups Pig user August 2010
Thejas, is that part of the new secondary sort optimization work that's in
trunk, or was this in 0.7?

On Tue, Aug 17, 2010 at 1:17 PM, Thejas M Nair wrote:

Pig will use the sort column of the bag as a secondary sort key for the MR
job. In the case of cogroup, though, it only does that for the first bag. If
you have a large bag and a small one, you can position them in the Pig script
so that secondary sort is used on the large one.

This is what I tried (Pig svn trunk version):
grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
grunt> cg = cogroup l1 by a, l2 by a;
grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b);
generate group, o1, o2;}

-- In the following explain output, note that there is no POSort for o1, and
it says "Secondary sort: true".

grunt> explain f
# Map Reduce Plan
MapReduce node 1-1018
Map Plan
Union[tuple] - 1-1019
---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
Project[bytearray][0] - 1-1002

rage) - 1-997
---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
Project[bytearray][0] - 1-1004

rage) - 1-998--------
Reduce Plan
f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
---f: New For Each(false,false,false)[bag] - 1-1014

Project[bytearray][0] - 1-1005

RelationToExpressionProject[bag][*] - 1-1009
---Project[tuple][1] - 1-1006
RelationToExpressionProject[bag][*] - 1-1013
---o2: POSort[bag]() - 1-1012
Project[bytearray][1] - 1-1011

---Project[tuple][2] - 1-1010
---cg: Package[tuple]{bytearray} - 1-1000--------
Global sort: false
Secondary sort: true
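
To make the positioning advice concrete, here is a minimal sketch of the same
script with the large relation listed first in the cogroup. The relation names
`big` and `small` and the paths 'big_input' and 'small_input' are hypothetical
and only for illustration. It assumes, as described above, that only the first
bag gets the secondary sort.

```pig
-- Hypothetical inputs: 'big_input' holds the large relation,
-- 'small_input' the small one.
big = load 'big_input' as (a, b);
small = load 'small_input' as (a, b);

-- List the large relation first: per the description above, only the
-- first bag of a cogroup is ordered via secondary sort.
cg = cogroup big by a, small by a;

f = foreach cg {
    sorted_big = order big by b;      -- expected to use secondary sort
    sorted_small = order small by b;  -- expected to remain a POSort in the reducer
    generate group, sorted_big, sorted_small;
};

explain f;
```

If this behaves like the run above, the explain output should show a POSort
only for sorted_small and end with "Secondary sort: true". It is worth checking
the plan on your own Pig version rather than assuming it.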

On 8/17/10 11:59 AM, "Anthony Urso" wrote:

I need to sort the DataBags that are input to my UDF after a COGROUP.
I am currently sorting them in memory but it is not going to scale in
the long term.

Is there a way to control the way that Pig sorts them (e.g. as you can
with a WritableComparable in raw map/reduce) prior to passing them in
so that I don't have to respill them to disk?

Thanks for any info,
