Groups are processed on the reduce side, with the combiner
pre-aggregating on the map side. A single map by default gets a
fixed-size chunk of input (a split), not all of the records for a key.
The only way I can see a huge map happening is if you have one really
huge record somewhere.
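A quick way to check, as a sketch using the aliases from the script
below: EXPLAIN prints the map, combine, and reduce plans, so you can
see whether the combiner actually fires (COUNT and SUM are algebraic,
so it should):

EXPLAIN grouped_urls_by_site;

Also note that PARALLEL only spreads distinct keys across reducers;
all records for one hot site still go to a single reducer, never to a
single map.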

Olga

-----Original Message-----
From: Corbin Hoenes
Sent: Thursday, May 06, 2010 2:31 PM
To: pig-user@hadoop.apache.org
Subject: Re: SpillableMemoryManager - low memory handler called

Wondering, when we do a group like this:

grouped_urls_by_site = GROUP all_urls BY site;

if a certain site has a lot of urls, would they all have to be
processed by the same mapper (i.e., a single key)? Could this account
for why we have 8GB in one map and not much in the others?
On May 6, 2010, at 3:24 PM, Olga Natkovich wrote:

Looks like attachments are not coming through. Here is the script from
Corbin inline.

One thing you might want to try is to switch your cogroups to a skewed
join and see if that solves the issue:

http://hadoop.apache.org/pig/docs/r0.6.0/piglatin_ref1.html#Skewed+Joins
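For example, a sketch only, using the aliases from Corbin's script
below (skewed join is a plain two-way join, so the OUTER semantics and
the IsEmpty/MAX aggregation from the COGROUP version would need
rework):

joined_stats_for_ratio = JOIN light_viewstatsbyurl BY (site, url),
    light_clickstatsbyurl BY (site, url) USING "skewed" PARALLEL 180;
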
Olga

--------------------------------------------topurl.pig-------------------------------------------------------------
set job.name 'Generate topurl reports for $out_file1';

%default dir_prefix '../..'
%default storage 'BinStorage()'
%default tynt_udfs 'tynt-udfs.jar'
%default topN '20'
/* default to a 30-day time period so that the alltime report requires
   14*30 = 420 minimum page views */
%default timeperiod '30'
%default min_page_views_per_day '14'

register $dir_prefix/udfs/target/$tynt_udfs;
register $dir_prefix/udfs/lib/piggybank.jar;

---------------------summarize address bar stats-----------------------------------
addbar_stats = LOAD '$in_file1/addbarstats' USING $storage AS
    (site:chararray, url:chararray, guid:chararray, cnt:long);
grouped_addbar_by_url = GROUP addbar_stats BY (site, url) PARALLEL 180;
addbar_stats_by_url = FOREACH grouped_addbar_by_url GENERATE
    FLATTEN(group) AS (site, url), COUNT(addbar_stats) AS addbarcnt,
    SUM(addbar_stats.cnt) AS addbarvisits;
STORE addbar_stats_by_url INTO '$out_file1/addbarstatsbyurl' USING $storage;

grouped_addbar_stats_by_site = GROUP addbar_stats_by_url BY site PARALLEL 180;
addbar_stats_by_site = FOREACH grouped_addbar_stats_by_site GENERATE
    group AS site, SUM(addbar_stats_by_url.addbarcnt) AS addbarcnt,
    SUM(addbar_stats_by_url.addbarvisits) AS addbarvisits;
STORE addbar_stats_by_site INTO '$out_file1/addbarstatsbysite' USING $storage;

----------------------calculate ratio------------------------------------------
clickstatsbyurl = LOAD '$in_file1/clickstatsbyurl' USING $storage AS
    (site:chararray, url:chararray, cnt:long, tracecnt:long, tcnt:long,
     pcnt:long, wcnt:long, utracecnt:long, utcnt:long, upcnt:long,
     uwcnt:long);
viewstatsbyurl = LOAD '$in_file1/viewstatsbyurl' USING $storage AS
    (site:chararray, url:chararray, title:chararray, cnt:long, etcnt:long,
     et1cnt:long, et2cnt:long, et3cnt:long, et6cnt:long, et7cnt:long);

light_clickstatsbyurl = FOREACH clickstatsbyurl GENERATE site, url, cnt;
light_viewstatsbyurl_noisy = FOREACH viewstatsbyurl GENERATE site, url,
    title, cnt, etcnt;

light_viewstatsbyurl = FILTER light_viewstatsbyurl_noisy BY url != '-';
--light_addbarstatsbyurl = FOREACH addbar_stats_by_url GENERATE site, url, addbarvisits;
--joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url) INNER,
--    light_clickstatsbyurl BY (site, url) OUTER,
--    light_addbarstatsbyurl BY (site, url) OUTER;
--flattened_stats_for_ratio = FOREACH joined_stats_for_ratio GENERATE
--    FLATTEN(light_viewstatsbyurl) AS (site, url, title, cnt, etcnt),
--    (IsEmpty(light_clickstatsbyurl) ? 0 : MAX(light_clickstatsbyurl.cnt)) AS clickcnt,
--    (IsEmpty(light_addbarstatsbyurl) ? 0 : MAX(light_addbarstatsbyurl.addbarvisits)) AS addbarcnt;

joined_stats_for_ratio = COGROUP light_viewstatsbyurl BY (site, url) INNER,
    light_clickstatsbyurl BY (site, url) OUTER;
flattened_stats_for_ratio = FOREACH joined_stats_for_ratio GENERATE
    FLATTEN(light_viewstatsbyurl) AS (site, url, title, cnt, etcnt),
    (IsEmpty(light_clickstatsbyurl) ? 0 : MAX(light_clickstatsbyurl.cnt)) AS clickcnt;

ratio_by_url = FOREACH flattened_stats_for_ratio
{
    generated_traffic = clickcnt + etcnt;
    total_traffic = cnt;
    ti = ((float)generated_traffic) / ((float)total_traffic);
    -- a ratio above 1 (more generated than total traffic) is suspect,
    -- so flag it by negating it
    GENERATE site, url, title, ((ti > 1) ? (-ti) : ti) AS ratio,
        generated_traffic AS gviews, total_traffic AS views;
}

------------------------combined with #copies----------------------------------------
copystatsbyurl = LOAD '$in_file1/copystatsbyurl' USING $storage AS
    (site:chararray, url:chararray, lcnt:long, scnt:long, icnt:long,
     acnt:long);
light_copystatsbyurl = FOREACH copystatsbyurl GENERATE site, url,
    lcnt + scnt + icnt AS cnt;

all_stats_by_url = COGROUP ratio_by_url BY (site, url) INNER,
    light_copystatsbyurl BY (site, url) OUTER PARALLEL 62;
all_urls = FOREACH all_stats_by_url GENERATE
    FLATTEN(ratio_by_url) AS (site, url, title, ratio, gviews, views),
    (IsEmpty(light_copystatsbyurl) ? 0 : MAX(light_copystatsbyurl.cnt)) AS copies;

grouped_urls_by_site = GROUP all_urls BY site;

top_ratios = FOREACH grouped_urls_by_site
{
    filtered_by_minpageviews = FILTER all_urls BY
        views > ($min_page_views_per_day * $timeperiod);
    order_by_ratio = ORDER filtered_by_minpageviews BY ratio DESC;
    top_by_ratio = LIMIT order_by_ratio $topN;
    GENERATE group AS site, top_by_ratio.(url, title, ratio, gviews,
        views, copies) AS tops;
}

top_gviews = FOREACH grouped_urls_by_site
{
    order_by_gviews = ORDER all_urls BY gviews DESC;
    top_by_gviews = LIMIT order_by_gviews $topN;
    GENERATE group AS site, top_by_gviews.(url, title, ratio, gviews,
        views, copies) AS tops;
}

top_views = FOREACH grouped_urls_by_site
{
    order_by_views = ORDER all_urls BY views DESC;
    top_by_views = LIMIT order_by_views $topN;
    GENERATE group AS site, top_by_views.(url, title, ratio, gviews,
        views, copies) AS tops;
}

top_copies = FOREACH grouped_urls_by_site
{
    order_by_copies = ORDER all_urls BY copies DESC;
    top_by_copies = LIMIT order_by_copies $topN;
    GENERATE group AS site, top_by_copies.(url, title, ratio, gviews,
        views, copies) AS tops;
}

grouped_tops = JOIN top_ratios BY site, top_gviews BY site,
    top_views BY site, top_copies BY site;

top_urls = FOREACH grouped_tops GENERATE top_ratios::site AS site,
    top_ratios::tops, top_gviews::tops, top_views::tops, top_copies::tops;
STORE top_urls INTO '$out_file1/topurls' USING $storage;



-----Original Message-----
From: Corbin Hoenes
Sent: Thursday, May 06, 2010 11:57 AM
To: Olga Natkovich
Subject: Re: SpillableMemoryManager - low memory handler called

I have attached the script... please let me know if you have more
questions.

On May 6, 2010, at 12:36 PM, Olga Natkovich wrote:

This is just a warning saying that your job is spilling to disk.
Please, if you can, post the script that is causing this issue. In
0.6.0 we moved a large chunk of the code away from using
SpillableMemoryManager, but it is still used in some places. More
changes are coming in 0.7.0 as well.
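
If the spills themselves are what is slowing you down, one thing to
try, as a sketch only (the -Xmx value is illustrative, and PIG_OPTS
assumes the stock bin/pig wrapper, which forwards it to the JVM), is
a larger task heap, since your log shows a max of only ~350MB:

PIG_OPTS="-Dmapred.child.java.opts=-Xmx1024m" pig topurl.pig

Pig's own spill thresholds (pig.spill.size.threshold and
pig.spill.gc.activation.size) can also be tuned, though the defaults
are usually reasonable.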

Olga

-----Original Message-----
From: Corbin Hoenes
Sent: Thursday, May 06, 2010 11:31 AM
To: pig-user@hadoop.apache.org
Subject: Re: SpillableMemoryManager - low memory handler called

0.6

Sent from my iPhone

On May 6, 2010, at 12:16 PM, "Olga Natkovich" <olgan@yahoo-inc.com>
wrote:
Which version of Pig are you using?

-----Original Message-----
From: Corbin Hoenes
Sent: Thursday, May 06, 2010 10:29 AM
To: pig-user@hadoop.apache.org
Subject: SpillableMemoryManager - low memory handler called

Hi Piggers - Seeing an issue with a particular script where our job
is
taking 6hrs 42min to complete.

Syslogs are showing loads of these:
INFO : org.apache.pig.impl.util.SpillableMemoryManager - low memory
handler called (Usage threshold exceeded) init = 5439488(5312K)
used = 283443200(276800K) committed = 357957632(349568K)
max = 357957632(349568K)
INFO : org.apache.pig.impl.util.SpillableMemoryManager - low memory
handler called (Usage threshold exceeded) init = 5439488(5312K)
used = 267128840(260868K) committed = 357957632(349568K)
max = 357957632(349568K)
One interesting thing is that it's the map phase that is slow, and one
of the mappers is getting 8GB of input while the other 2000 or so
mappers are getting MBs to hundreds of MBs of data.

Any ideas where I can start looking?
