Grokbase Groups Pig user June 2011
Hi,

I'm looking to perform a sum normalization (divide each score by the sum of
all scores in my data) with Pig.

1) My first problem is that I can't find a good way to do that.
Any suggestions?

I have a solution, but I'm not really proud of it...
------------------------------------------------------------------------------
score_list = LOAD 'scores' USING PigStorage(';')
    AS (word: chararray, score: double);

score_list_ = FOREACH score_list GENERATE
    word,
    score,
    0 AS joinField;

group_score = GROUP score_list ALL;
sum_score = FOREACH group_score GENERATE
    0 AS joinField,
    SUM(score_list.score) AS scoreTotal;

score_with_sum = JOIN score_list_ BY joinField, sum_score BY joinField;
out = FOREACH score_with_sum GENERATE word, (score / scoreTotal);
DUMP out;
------------------------------------------------------------------------------

2) Secondly, I think there is a strange bug.
Given the code above, if I put only "GENERATE word" at the end (without
the scores), it goes into some kind of infinite loop (repeating
"Spilling map output: record full = true"... in the log).


thanks,

tristan


  • Daniel Dai at Jun 14, 2011 at 5:39 pm
    Take a look at Pig scalars:
    http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars

    Try this query:
    score_list = LOAD 'scores' USING PigStorage(';')
        AS (word: chararray, score: double);

    score_list_ = FOREACH score_list GENERATE
        word,
        score,
        0 AS joinField;

    group_score = GROUP score_list ALL;
    sum_score = FOREACH group_score GENERATE
        0 AS joinField,
        SUM(score_list.score) AS scoreTotal;

    out = FOREACH score_list_ GENERATE word, (score / sum_score.scoreTotal);
    DUMP out;
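
    As a minimal sketch (not part of the original reply): once the total is
    read as a scalar, the constant join key and the intermediate relation
    that carried it are probably no longer needed. This assumes a Pig version
    with the scalar feature from the documentation linked above and the same
    'scores' input; the alias normalized_score is a hypothetical name chosen
    for illustration.

    ```pig
    -- Load (word, score) pairs separated by ';'.
    score_list = LOAD 'scores' USING PigStorage(';')
        AS (word: chararray, score: double);

    -- Put every row into one group and sum the scores.
    group_score = GROUP score_list ALL;
    sum_score = FOREACH group_score GENERATE
        SUM(score_list.score) AS scoreTotal;

    -- sum_score holds exactly one row, so sum_score.scoreTotal can be used
    -- as a scalar inside the FOREACH. normalized_score is an assumed alias.
    out = FOREACH score_list GENERATE
        word,
        (score / sum_score.scoreTotal) AS normalized_score;
    DUMP out;
    ```

    The scalar form only works when the referenced relation has a single
    row, which GROUP ... ALL followed by an aggregate guarantees here.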

    For the bug you found, would you mind opening a Jira ticket?

    Thanks,
    Daniel
    On 06/14/2011 06:58 AM, Tristan Croiset wrote:
  • Tristan Croiset at Jun 14, 2011 at 6:44 pm
    2011/6/14 Daniel Dai <jianyong@yahoo-inc.com>
    > Take a look at Pig scalars:
    > http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#Casting+Relations+to+Scalars

    Thanks! That's indeed what I needed.

    > For the bug you found, would you mind opening a Jira ticket?

    Sure.

    Best,

    tristan


