Grokbase Groups: Pig user, March 2011
Replicated join will only work if the rightmost relation in the join is small enough to fit in available memory, so it will not work for all data sets. But in this case one of your relations has only one record, so it should fit in memory.
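For reference, a minimal sketch of a replicated join in Pig Latin (the relation, file, and field names here are made up for illustration):

BIG_REL = LOAD 'big_data.tsv' USING PigStorage('\t') AS (key:chararray, val:int);
SMALL_REL = LOAD 'small_data.tsv' USING PigStorage('\t') AS (key:chararray, tag:chararray);
-- USING 'replicated' loads the rightmost relation (SMALL_REL) into memory on each
-- map task, so the join runs map-side without a reduce phase
JOINED = JOIN BIG_REL BY key, SMALL_REL BY key USING 'replicated';
DUMP JOINED;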

The cogroup in your query might be running into a memory issue that may have been fixed in more recent versions of Pig.

-Thejas



On 3/14/11 4:41 PM, "Paltheru, Srikanth" wrote:

I tried using replicated join in Pig 0.5 and it does not work. The feature I am trying to use is supported in the 0.5 version as well; it just works for some datasets and doesn't for others.

From: Dmitriy Ryaboy
Sent: Monday, March 14, 2011 5:39 PM
To: user@pig.apache.org
Cc: Thejas M Nair; Paltheru, Srikanth
Subject: Re: Problems with Join in pig

Uh, no, I am wrong. They are on Hadoop 0.20; Hadoop 0.18 was Pig 0.4.



Yeah, Srikanth, you guys should just upgrade. 0.5 to 0.6 is relatively painless; the jump to 0.7-0.8 is harder, but worth it.



D

On Mon, Mar 14, 2011 at 5:37 PM, Dmitriy Ryaboy wrote:
If they are on 0.5, that means they have bigger problems: they are on Hadoop 18.



D



On Mon, Mar 14, 2011 at 5:29 PM, Thejas M Nair wrote:
Fragment-replicate join will also produce an efficient query plan for this use case - http://pig.apache.org/docs/r0.8.0/piglatin_ref1.html#Replicated+Joins . It is available in 0.5 as well.
-Thejas




On 3/14/11 3:20 PM, "Paltheru, Srikanth" wrote:

I am using Pig version 0.5. We don't have plans to upgrade it to a newer version. But the problem I have is that the script runs for some files (both larger and smaller than the ones mentioned) but not for this particular one. I get a "GC overhead limit exceeded" error.
Thanks
Sri


-----Original Message-----
From: Thejas M Nair
Sent: Monday, March 14, 2011 4:18 PM
To: user@pig.apache.org; Paltheru, Srikanth
Subject: Re: Problems with Join in pig

What version of Pig are you using? There have been some memory utilization fixes in 0.8. For this use case, you can also use the new scalar feature in 0.8 - http://pig.apache.org/docs/r0.8.0/piglatin_ref2.html#Casting+Relations+to+Scalars . That query plan will be more efficient.
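A minimal sketch of that scalar feature, with made-up relation and field names, on a shape similar to this use case (a single-row relation whose fields are read directly in another FOREACH, so no join or cogroup is needed):

HITS = LOAD 'hits.tsv' USING PigStorage('\t') AS (visid:long, hit_time_gmt:int);
GROUPED_ALL = GROUP HITS ALL;
-- MIN_MAX has exactly one row
MIN_MAX = FOREACH GROUPED_ALL GENERATE MIN(HITS.hit_time_gmt) AS min_t, MAX(HITS.hit_time_gmt) AS max_t;
-- in Pig 0.8+, fields of a single-row relation can be referenced as scalars
PER_HIT = FOREACH HITS GENERATE visid, hit_time_gmt, MIN_MAX.min_t, MIN_MAX.max_t;
DUMP PER_HIT;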

You might want to build a new version of Pig from the svn 0.8 branch, because there have been some bug fixes after the release -

svn co http://svn.apache.org/repos/asf/pig/branches/branch-0.8
cd branch-0.8
ant

-Thejas

On 3/14/11 1:40 PM, "Paltheru, Srikanth" wrote:

The following Pig script runs fine without the 2 GB memory setting (see the PIG_OPTS line below), but fails with the memory setting. I am not sure what's happening. It's a simple operation of joining one tuple (of 1 row) with the other tuple.
Here is what I am trying to do:

1. grouping all SELECT_HIT_TIME_DATA into a single tuple by doing a GROUP ALL.
2. getting the min and max of that set and putting it into MIN_HIT_DATA. This is a tuple with a single row.
3. then grouping SELECT_MAX_VISIT_TIME_DATA by visid,
4. then generating DUMMY_KEY for every row, along with MAX of start time.
5. then trying to join the single tuple from step 2 with all tuples generated in step 4 to get a min time and a max time.

Code:
Shell prompt:
## setting heap size to 2 GB
PIG_OPTS="$PIG_OPTS -Dmapred.child.java.opts=-Xmx2048m"
export PIG_OPTS

Pig/Grunt

RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgesp360/wdgesp360_2011-01-05*.tsv.gz' USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY $6 <= 0;
SELECT_CAST_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, (int)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (chararray)$5 AS truncated_hit;
SELECT_DATA = FILTER SELECT_CAST_DATA BY truncated_hit == 'N';

-- MIN AND MAX_HIT_TIME_GMT FOR THE FILE/SUITE
SELECT_HIT_TIME_DATA = FOREACH SELECT_DATA GENERATE (int)hit_time_gmt;
GROUPED_ALL_DATA = GROUP SELECT_HIT_TIME_DATA ALL PARALLEL 100;
MIN_HIT_DATA = FOREACH GROUPED_ALL_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, MIN(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MIN_HIT_TIME_GMT, MAX(SELECT_HIT_TIME_DATA.hit_time_gmt) AS MAX_HIT_TIME_GMT;

-- MAX_VISIT_START_TIME BY VISITOR_ID
SELECT_MAX_VISIT_TIME_DATA = FOREACH SELECT_DATA GENERATE visid_high, visid_low, visit_start_time_gmt;
GROUP_BY_VISID_MAX_VISIT_TIME_DATA = GROUP SELECT_MAX_VISIT_TIME_DATA BY (visid_high, visid_low) PARALLEL 100;
MAX_VISIT_TIME = FOREACH GROUP_BY_VISID_MAX_VISIT_TIME_DATA GENERATE 'DUMMYKEY' AS DUMMY_KEY, FLATTEN(group.visid_high) AS visid_high, FLATTEN(group.visid_low) AS visid_low, MAX(SELECT_MAX_VISIT_TIME_DATA.visit_start_time_gmt) AS MAX_VISIT_START_TIME;

JOINED_MAX_VISIT_TIME_DATA = COGROUP MAX_VISIT_TIME BY DUMMY_KEY OUTER, MIN_HIT_DATA BY DUMMY_KEY OUTER PARALLEL 100;
MIN_MAX_VISIT_HIT_TIME = FOREACH JOINED_MAX_VISIT_TIME_DATA GENERATE FLATTEN(MAX_VISIT_TIME.visid_high), FLATTEN(MAX_VISIT_TIME.visid_low), FLATTEN(MAX_VISIT_TIME.MAX_VISIT_START_TIME), FLATTEN(MIN_HIT_DATA.MIN_HIT_TIME_GMT), FLATTEN(MIN_HIT_DATA.MAX_HIT_TIME_GMT);
DUMP MIN_MAX_VISIT_HIT_TIME;


Can anyone please guide me through this problem?
Thanks
Sri
