Grokbase Groups Pig user August 2010
I just checked the query plan with 0.7; it also has this optimization.
-Thejas


On 8/17/10 1:32 PM, "Dmitriy Ryaboy" wrote:

Thejas, is that part of the new secondary sort optimization work that's in
trunk, or was this in 0.7?

-D
On Tue, Aug 17, 2010 at 1:17 PM, Thejas M Nair wrote:

Pig will use the sort column of the bag as a secondary sort key for the MR
job. In the case of cogroup, though, it only does that for the first bag. If
you have a large bag and a small one, you can position them in the Pig query
so that secondary sort is used on the large one.
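
For readers unfamiliar with the term, here is a minimal plain-Python sketch of the general secondary sort idea: the shuffle sorts on a composite of the group key and the sort column, but groups on the group key alone, so each bag reaches the reducer already ordered. This is only a toy model under that assumption, not Pig's actual implementation; the function name `shuffle_with_secondary_sort` and the sample records are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_with_secondary_sort(records):
    """Toy model of a shuffle that uses a secondary sort key.

    `records` is an iterable of (a, b) pairs, where `a` is the group key
    and `b` is the column the bag should be ordered by. Yields
    (a, [b, ...]) pairs in which every list is already ordered by b.
    """
    # Sort on the composite key (a, b), as the shuffle would.
    ordered = sorted(records, key=itemgetter(0, 1))
    # Group on `a` only, so each group comes out in `b` order
    # without a separate per-bag sort on the reduce side.
    for key, rows in groupby(ordered, key=itemgetter(0)):
        yield key, [b for _, b in rows]


# Hypothetical sample data standing in for relation l1 with fields (a, b).
l1 = [("k1", 3), ("k2", 9), ("k1", 1), ("k1", 2), ("k2", 4)]
result = list(shuffle_with_secondary_sort(l1))
```

In this model the ordering work is done once by the shuffle sort, which is why the reduce-side sort operator becomes unnecessary for the bag that benefits from it.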


This is what I tried (pig svn trunk version)-
grunt> l1 = load 'x' as (a,b);
grunt> l2 = load 'y' as (a,b);
grunt> cg = cogroup l1 by a, l2 by a;
grunt> f = foreach cg { o1 = order l1 by (b); o2 = order l2 by (b); generate group, o1, o2;}

-- In the following explain output, note that there is no POSort for o1,
-- and it says "Secondary sort: true".

grunt> explain f
..
..
#--------------------------------------------------
# Map Reduce Plan
#--------------------------------------------------
MapReduce node 1-1018
Map Plan
Union[tuple] - 1-1019
---cg: Local Rearrange[tuple]{tuple}(false) - 1-1001
Project[bytearray][0] - 1-1002

---l1: Load(file:///Users/tejas/pig_intbagcnt/trunk/x:org.apache.pig.builtin.PigStorage) - 1-997
---cg: Local Rearrange[tuple]{bytearray}(false) - 1-1003
Project[bytearray][0] - 1-1004

---l2: Load(file:///Users/tejas/pig_intbagcnt/trunk/y:org.apache.pig.builtin.PigStorage) - 1-998--------
Reduce Plan
f: Store(fakefile:org.apache.pig.builtin.PigStorage) - 1-1015
---f: New For Each(false,false,false)[bag] - 1-1014

Project[bytearray][0] - 1-1005

RelationToExpressionProject[bag][*] - 1-1009
---Project[tuple][1] - 1-1006
RelationToExpressionProject[bag][*] - 1-1013
---o2: POSort[bag]() - 1-1012
Project[bytearray][1] - 1-1011

---Project[tuple][2] - 1-1010
---cg: Package[tuple]{bytearray} - 1-1000--------
Global sort: false
Secondary sort: true
----------------




On 8/17/10 11:59 AM, "Anthony Urso" wrote:

I need to sort the DataBags that are input to my UDF after a COGROUP.
I am currently sorting them in memory but it is not going to scale in
the long term.

Is there a way to control the way that Pig sorts them (e.g. as you can
with a WritableComparable in raw map/reduce) prior to passing them in
so that I don't have to respill them to disk?

Thanks for any info,
Anthony
