I'm trying to compute each N-gram's occurrence count as a percentage of
the total number of tuples, and I'm running into a problem that I assume
has a simple solution I'm not thinking of. My script looks roughly like this:

    log = LOAD blah AS (session_id:chararray, text:chararray...);
    ngramed = FOREACH log GENERATE
        FLATTEN(org.apache.pig.tutorial.NGramGenerator(text)) AS ngram;
    grpd = GROUP ngramed BY ngram;
    -- X is a placeholder for the total I don't know how to obtain;
    -- the cast is there so the division isn't done on two longs
    freq = FOREACH grpd GENERATE group AS ngram, COUNT(ngramed) AS count,
        (double) COUNT(ngramed) / X AS percent;
    STORE freq INTO 'ngrams';

I'm trying to figure out how I can calculate X so that it represents
the total number of tuples in log. I could "GROUP ALL log" and get a
count of that, but how do I reference it in my FOREACH statement?
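
For reference, the count I have in mind would be something like the
sketch below. The aliases all_log, total and n are just names I made up
for illustration. As far as I can tell this gives me a one-row relation
holding the count, not a value I can simply drop in for X:

    -- collapse the whole relation into a single group
    all_log = GROUP log ALL;
    -- COUNT skips tuples whose first field is null; COUNT_STAR
    -- would include them, if that distinction matters here
    total = FOREACH all_log GENERATE COUNT(log) AS n;
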
Thanks for any help anyone can provide.