Grokbase Groups Pig user January 2010
FAQ
I have a question on how to handle data that I would usually store in an
array, or into a normalized child table in a database. The input data
is a set of key/value pairs where one key can be associated with
multiple values (0 to n).

Here is a sample dataset with bucket being the multi value key:

family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32
family=sports,channel=baseball,timeframe=today,gender=M,bucket=12,bucket=27,bucket=32,bucket=54
family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32
family=events,channel=outdoor,timeframe=weekend,gender=F,bucket=13,bucket=27,bucket=32

What I am trying to calculate is a group count on
family,channel,timeframe and bucket, where the results would be:

(sports,baseball,today,12),2
(sports,baseball,today,27),2
(sports,baseball,today,32),2
(sports,baseball,today,54),1
(events,outdoor,weekend,13),2
(events,outdoor,weekend,27),2
(events,outdoor,weekend,32),2

One approach would seem to be to store the bucket values in a separate
relation and join them back using a surrogate key generated when reading
the data in. Something like:

A = (12345,sports,baseball,today,M)
B = (32,12345)(27,12345)(12,12345)

C = JOIN A by $0, B by $1;

D = GROUP C by (family,channel,timeframe,bucket)

I am sure this method would work, but it requires generating a
map/reduce-friendly surrogate key on which to join the data. Is there a
more direct way to do this in Pig? Also, is it possible to load more
than one relation at a time (i.e., split the data between two relations)
with the LOAD statement?
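For concreteness, the parsing step for such records (fixed key=value
fields plus a repeated "bucket" key) could be sketched in plain Java;
the class and method names here are illustrative only:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Parse one key=value record, collecting the repeated "bucket" key
// into a list while keeping the single-valued keys as plain fields.
public class RecordParser {
    public static Map<String, Object> parse(String line) {
        Map<String, Object> fields = new LinkedHashMap<>();
        List<String> buckets = new ArrayList<>();
        for (String pair : line.split(",")) {
            int eq = pair.indexOf('=');
            String key = pair.substring(0, eq);
            String value = pair.substring(eq + 1);
            if (key.equals("bucket")) {
                buckets.add(value);      // multi-valued key: 0..n occurrences
            } else {
                fields.put(key, value);  // single-valued key
            }
        }
        fields.put("bucket", buckets);
        return fields;
    }
}
```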

Thanks,
Scott


  • Jeff Zhang at Jan 21, 2010 at 4:40 pm
    Hi Scott,

    It seems that the number of buckets per record is arbitrary, so I suggest
    you write your own LoadFunc to load the buckets into a DataBag. Here is the
    Pig script:

    A = LOAD 'input' USING YourLoadFunc() AS
        (family, channel, timeframe, gender, b:{t:(bucket)});
    B = FOREACH A GENERATE family, channel, timeframe, gender, FLATTEN(b) AS bucket;
    C = GROUP B BY (family, channel, timeframe, gender, bucket);
    D = FOREACH C GENERATE group, COUNT($1);

    Hope it helps.
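
    (A note on the bag's shape: the schema b:{t:(bucket)} means each bucket
    value is its own single-field tuple inside the bag, so FLATTEN(b) emits
    one output row per bucket. A plain-Java sketch of that shaping step,
    with Pig's DataBag/Tuple modeled as nested lists and the class name
    hypothetical:)

```java
import java.util.ArrayList;
import java.util.List;

// Shape a list of bucket values into a "bag of single-field tuples",
// modeled here as a list of one-element lists: ["12","27","32"] becomes
// {(12),(27),(32)}. With this shape, FLATTEN yields one row per value.
public class BucketBag {
    public static List<List<String>> toBag(List<String> buckets) {
        List<List<String>> bag = new ArrayList<>();
        for (String b : buckets) {
            bag.add(List.of(b));  // one single-field tuple per bucket value
        }
        return bag;
    }
}
```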



    --
    Best Regards

    Jeff Zhang
  • Scott Kester at Jan 25, 2010 at 9:22 pm
    Thanks Jeff, you have pointed me in the right direction, but I'm not there yet. I wrote a load function as you suggested to load the bucket values into a DataBag, but the flatten/group step is not generating the results I am after. The groups only seem to include the first element of the bucket tuple.

    I interpreted your code example as loading an inner bag b of tuples t, and I assumed 'bucket' was the variable number of values in tuple t. Here is real data loaded that way. Note I have more fields than in my original example, but the bag/tuple is the last element:

    (www,home,,,,Hidden1:1,693,ar,us,,2009,693,US:AR,US,{(10001)})
    (www,home,,,,PdSearch,535,oh,us,,2009,527,US:IN,US,{(10001,10051,10040)})
    (www,home,,,,WXPartner3,632,il,us,,2009,632,US:IL,US,{(10001,10048,10038)})
    (www,home,,,,HeaderSpon:1,,,,,2009,506,US:MA,US,{()})
    (www,fcst,trvl,btrav,,PageSpon,618,tx,us,,2009,618,US:TX,US,{(10001)})
    (www,fcst,trvl,btrav,36hr,PageCounter,515,oh,us,,2009,515,US:OH,US,{(10001,70002,70121,70012,10051,10040)})

    As you can see, some rows have 0 elements in the tuple, and others several. What I am trying to get is a 'cross' on a per row basis such that a row with 3 'bucket' elements is converted to 3 rows, 1 for each of the bucket elements.

    So rows like

    (A,B,C,D,{(1,2,3)})
    (E,F,G,H,{(1,2,4,5)})

    would convert to

    (A,B,C,D,1)
    (A,B,C,D,2)
    (A,B,C,D,3)
    (E,F,G,H,1)
    (E,F,G,H,2)
    (E,F,G,H,4)
    (E,F,G,H,5)

    And I then do a group on those rows. Any ideas on how to achieve that transformation?
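
    The per-row expansion described above, modeled in plain Java with rows
    as string lists (names hypothetical), looks like:

```java
import java.util.ArrayList;
import java.util.List;

// Expand one row of fixed fields plus its bucket values into one output
// row per bucket value, e.g. (A,B,C,D) with buckets [1,2,3] becomes
// (A,B,C,D,1), (A,B,C,D,2), (A,B,C,D,3).
public class RowExpander {
    public static List<List<String>> expand(List<String> fields, List<String> buckets) {
        List<List<String>> out = new ArrayList<>();
        for (String b : buckets) {
            List<String> row = new ArrayList<>(fields);  // copy the fixed fields
            row.add(b);                                  // append one bucket value
            out.add(row);
        }
        return out;
    }
}
```

    In Pig, FLATTEN produces this expansion only when b is a bag of
    single-field tuples; a bag holding one multi-field tuple, as in the
    data shown above, flattens into a single row instead.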

    Thanks,
    Scott


