Grokbase Groups Pig user October 2010
I'm trying to count N-gram occurrences as a percentage of total
tuples, and I'm running into a problem that I assume has a simple
solution I'm not thinking of. My script basically looks like:

log = LOAD blah AS (session_id:chararray, text:chararray...);
ngramed = FOREACH log GENERATE flatten(
org.apache.pig.tutorial.NGramGenerator(text) ) AS ngram;
grpd = GROUP ngramed BY ngram;
freq = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS count,
COUNT(ngramed) / X AS percent;
STORE freq INTO 'ngrams';

I'm trying to figure out how I can calculate X so that it represents
the total number of tuples in log. I could "GROUP ALL log" and get a
count of that, but how do I reference it in my FOREACH statement?

Thanks for any help anyone can provide.

-Mark

  • Dmitriy Ryaboy at Oct 8, 2010 at 10:15 pm
    In Pig 0.8, you can generate a single-row relation and later refer to it
    as a scalar:

    counts = foreach (group ngramed all) generate COUNT(ngramed) as total;

    percents = foreach grpd generate group as ngram, COUNT(ngramed) as count,
    COUNT(ngramed) / (double) counts.total as percent;

    In earlier versions, the solution is to do a replicated join on a constant
    (ugly, I know):
    counts = foreach (group ngramed all) generate COUNT(ngramed) as total;
    grpd = join grpd by 1, counts by 1 using "replicated";
    percents = foreach grpd generate grpd::group as ngram, COUNT(grpd::ngramed)
    as count, COUNT(grpd::ngramed) / (double) counts::total as percent;

    Untested, may break :)

  • Mark Stetzer at Oct 11, 2010 at 7:04 pm
    I was afraid I'd have to do a join on a constant (using Pig 0.6 at the
    moment). That works wonderfully. Thanks!
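
For checking a Pig run against a small sample, the arithmetic discussed in
this thread (one count per n-gram, divided by a single global total) can be
sketched in plain Python. This is only an illustration under stated
assumptions: the function name ngram_percentages and the sample data are
hypothetical, and the input is assumed to be an already-flattened list of
n-grams, as NGramGenerator plus flatten would produce.

```python
from collections import Counter


def ngram_percentages(ngrams):
    """Map each n-gram to (count, fraction of all n-grams)."""
    counts = Counter(ngrams)        # per-n-gram count, like GROUP ... BY ngram
    total = sum(counts.values())    # one global total, like GROUP ... ALL
    if total == 0:
        return {}                   # no input, nothing to divide by
    # "/" is true division in Python 3, so the fraction is a float.
    # Dividing two integer counts with integer division would truncate to 0.
    return {ng: (c, c / total) for ng, c in counts.items()}


# Hypothetical sample: four n-grams, one of which appears twice.
sample = ["a b", "b c", "a b", "c d"]
result = ngram_percentages(sample)
```

The fractions sum to 1 across all n-grams, which is a quick sanity check to
apply to the stored Pig output as well.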

Posted: Oct 8, 2010 at 7:47 pm
Last active: Oct 11, 2010 at 7:04 pm