Grokbase Groups Pig user March 2011
My Pig script takes a long time (more than 10 minutes) to process fewer than 100 rows of data, even though we have 105 map and reduce nodes. I am wondering whether we can use a FILTER clause on a grouped data set. I realized that I am filtering and grouping by the same key multiple times, which directly hurts performance, so I am trying to find a way to filter the data after it has been grouped. Below is sample code for what I want to achieve, followed by the original code, which has multiple GROUP BY and FILTER clauses and takes a lot of time. I am trying to eliminate the yellow-colored statements. We are using Pig version 0.5. Any input on performance optimization is greatly appreciated.

RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz' USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5=='N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE (long)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (int)$9 AS mobile_id, (int)$17 AS page_event;
GROUP_BY_VISID_DATA = GROUP SELECT_DATA BY (visid_high,visid_low) PARALLEL 100;
METRICS_DATA = FOREACH GROUP_BY_VISID_DATA
{
FILTER_PV_DATA = FILTER GROUP_BY_VISID_DATA BY SELECT_DATA::page_event == 0;
FILTER_WIRELESS_PV_DATA = FILTER GROUP_BY_VISID_DATA BY SELECT_DATA::page_event == 0 AND SELECT_DATA::mobile_id > 0;
GENERATE FLATTEN(group.visid_high) AS visid_high,FLATTEN(group.visid_low) AS visid_low, FLATTEN(COUNT(SELECT_DATA)) AS PAGE_VIEW_COUNT,FLATTEN(COUNT(SELECT_DATA)) AS PAGE_VIEW_COUNT;
};
DUMP METRICS_DATA;
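
One way the sample above could be made to work is sketched here. In a nested FOREACH, the FILTER is applied to the bag inside each group (here SELECT_DATA), not to the grouped relation, and each COUNT is taken over its own filtered bag. This is only a sketch under assumptions: it has not been run against Pig 0.5, and the output alias WIRELESS_PV_COUNT is borrowed from the original code further down.

```pig
-- Sketch only. Assumes SELECT_DATA is defined exactly as in the script above.
GROUP_BY_VISID_DATA = GROUP SELECT_DATA BY (visid_high, visid_low) PARALLEL 100;

METRICS_DATA = FOREACH GROUP_BY_VISID_DATA
{
    -- Filter the per-group bag, not the grouped relation.
    FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
    FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
    -- Count each filtered bag under its own alias.
    GENERATE group.visid_high AS visid_high,
             group.visid_low AS visid_low,
             COUNT(FILTER_PV_DATA) AS PAGE_VIEW_COUNT,
             COUNT(FILTER_WIRELESS_PV_DATA) AS WIRELESS_PV_COUNT;
};

DUMP METRICS_DATA;
```

Note that this emits one row per (visid_high, visid_low) pair, including visitors with no matching page views, whereas the original code only produces rows for visitors that pass each filter.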

Original Code:

RAW_DATA = LOAD '/omniture_test_qa/cleansed_output_1/2011/01/05/wdgafmfamily/wdgafmfamily*.tsv.gz' USING PigStorage('\t');
FILTER_EXCLUDES_DATA = FILTER RAW_DATA BY (int)$6 <= 0 AND (chararray)$5=='N';
SELECT_DATA = FOREACH FILTER_EXCLUDES_DATA GENERATE (long)$0 AS hit_time_gmt, (long)$2 AS visid_high, (long)$3 AS visid_low, (int)$9 AS mobile_id, (int)$17 AS page_event;

--PV COUNT
FILTER_PV_DATA = FILTER SELECT_DATA BY page_event == 0;
SELECT_PV_DATA = FOREACH FILTER_PV_DATA GENERATE visid_high,visid_low;
GROUP_BY_VISID_SWID_DATA = GROUP SELECT_PV_DATA BY (visid_high,visid_low) PARALLEL 100;
PAGE_VIEWS = FOREACH GROUP_BY_VISID_SWID_DATA GENERATE FLATTEN(group.visid_high) AS visid_high,FLATTEN(group.visid_low) AS visid_low, FLATTEN(COUNT(SELECT_PV_DATA)) AS PAGE_VIEW_COUNT;

--WIRELESS PVS COUNT
FILTER_WIRELESS_PV_DATA = FILTER SELECT_DATA BY page_event == 0 AND mobile_id > 0;
SELECT_WIRELESS_PV_DATA = FOREACH FILTER_WIRELESS_PV_DATA GENERATE visid_high,visid_low;
GROUP_BY_VISID_SWID_WIRELESS_PV_DATA = GROUP SELECT_WIRELESS_PV_DATA BY (visid_high,visid_low) PARALLEL 100;
WIRELESS_PVS = FOREACH GROUP_BY_VISID_SWID_WIRELESS_PV_DATA GENERATE FLATTEN(group.visid_high) AS visid_high,FLATTEN(group.visid_low) AS visid_low, FLATTEN(COUNT(SELECT_WIRELESS_PV_DATA)) AS WIRELESS_PV_COUNT;
COGROUPED_DAILY_METRICS_DATA = COGROUP PAGE_VIEWS BY (visid_high,visid_low) OUTER,WIRELESS_PVS BY (visid_high,visid_low) OUTER;
DUMP COGROUPED_DAILY_METRICS_DATA;
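
The two GROUP statements and the COGROUP above could possibly be collapsed into a single grouping by turning each filter condition into a 0/1 flag and summing the flags per visitor. The sketch below is untested on Pig 0.5; the aliases FLAGGED_DATA, GROUPED_FLAGGED_DATA, DAILY_METRICS_DATA, is_pv and is_wireless_pv are hypothetical names introduced only for illustration.

```pig
-- Sketch only. Assumes SELECT_DATA is defined exactly as in the original code above.
-- Each filter condition becomes a 0/1 flag computed per row.
FLAGGED_DATA = FOREACH SELECT_DATA GENERATE
    visid_high,
    visid_low,
    (page_event == 0 ? 1 : 0) AS is_pv,
    ((page_event == 0 AND mobile_id > 0) ? 1 : 0) AS is_wireless_pv;

-- A single GROUP replaces the two GROUPs and the COGROUP.
GROUPED_FLAGGED_DATA = GROUP FLAGGED_DATA BY (visid_high, visid_low) PARALLEL 100;

-- Summing the flags gives the same counts as filtering and then counting.
DAILY_METRICS_DATA = FOREACH GROUPED_FLAGGED_DATA GENERATE
    group.visid_high AS visid_high,
    group.visid_low AS visid_low,
    SUM(FLAGGED_DATA.is_pv) AS PAGE_VIEW_COUNT,
    SUM(FLAGGED_DATA.is_wireless_pv) AS WIRELESS_PV_COUNT;

DUMP DAILY_METRICS_DATA;
```

Because the FOREACH after the GROUP uses only SUM over simple projections, Pig may be able to apply the combiner here, which is not the case when a nested FILTER is present. Visitors with no page views still appear in the output with a sum of 0.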

Thanks
Sri

Discussion Overview
group: user @
categories: pig, hadoop
posted: Mar 16, '11 at 7:12p
active: Mar 16, '11 at 7:12p
posts: 1
users: 1
website: pig.apache.org

1 user in discussion

Paltheru, Srikanth: 1 post
