Thanks Jeff, you have pointed me in the right direction, but I'm not there yet. I wrote a load function as you suggested to load the bucket values into a DataBag, but the flatten/group step is not generating the results I am after. The groups only seem to include the first element of the bucket tuple.
I interpreted your code example as loading an inner bag b containing tuples t, and I assumed 'bucket' referred to the variable number of values in tuple t. So here is real data loaded in that way. Note that I have more values than in my original example, but the bag/tuple is still the last element:
As you can see, some rows have 0 elements in the tuple and others several. What I am trying to get is a 'cross' on a per-row basis, such that a row with 3 'bucket' elements is converted to 3 rows, one for each of the bucket elements.
So a row like
would convert to
And I then do a group on those rows. Any ideas on how to achieve that transformation?
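For context, FLATTEN behaves differently on bags and on tuples, which may explain the symptom: FLATTEN on a tuple only unpacks its fields into extra columns of the same row, while FLATTEN on a bag emits one output row per bag element. A sketch of the intended per-row cross, assuming the loader declares b as a bag of one-field tuples (the names here are illustrative):

```pig
-- Assumption: b is a bag of one-field tuples, e.g. b = {(32),(27),(12)}.
B = FOREACH A GENERATE family, channel, timeframe, gender,
                       FLATTEN(b) AS bucket;
-- With b = {(32),(27),(12)}, this one input row becomes 3 output rows,
-- one per bucket value. If the loader instead returned b as a single
-- tuple (32,27,12), FLATTEN would unpack it into columns of one row,
-- which would be consistent with only the first element showing up in
-- the later GROUP.
```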
On 1/21/10 11:40 AM, "Jeff Zhang" wrote:
It seems that the number of buckets per record is arbitrary, so I suggest
you write your own LoadFunc to load the buckets into a DataBag. This is the Pig script:
A = LOAD 'input' USING YourLoadFunc() AS
B = FOREACH A GENERATE family,channel,timeframe,gender, FLATTEN(b) AS
C = GROUP B BY (family,channel,timeframe,gender, bucket);
D = FOREACH C GENERATE group,COUNT($1);
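Written out with the AS clauses filled in (the field names come from the GROUP statement above; the types and the bag layout are assumptions on my part):

```pig
A = LOAD 'input' USING YourLoadFunc()
    AS (family:chararray, channel:chararray, timeframe:chararray,
        gender:chararray, b:bag{t:tuple(bucket:int)});
B = FOREACH A GENERATE family, channel, timeframe, gender,
                       FLATTEN(b) AS bucket;
C = GROUP B BY (family, channel, timeframe, gender, bucket);
-- In the FOREACH over C, the bag of grouped rows is $1 (also
-- addressable by its alias B), so COUNT(B) is equivalent to COUNT($1).
D = FOREACH C GENERATE group, COUNT(B);
```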
Hope it helps you
On Thu, Jan 21, 2010 at 5:38 AM, Scott wrote:
I have a question on how to handle data that I would usually store in an
array, or into a normalized child table in a database. The input data is a
set of key/value pairs where one key can be associated with multiple values
(0 to n).
Here is a sample dataset with bucket being the multi value key:
What I am trying to calculate is a group count on family,channel,timeframe
and bucket, where the results would be:
One approach would seem to be to store the bucket values in a separate
relation and join using a surrogate key created when reading the data in.
A = (12345,sports,baseball,today,M)
B = (32,12345)(27,12345)(12,12345)
C = JOIN A by $0, B by $1;
D = GROUP C by (family,channel,timeframe,bucket);
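Spelled out as full Pig statements, the surrogate-key approach might look like this (the file names, loader classes, and the generated id field are hypothetical):

```pig
-- A: one row per record, keyed by a surrogate id generated at load time.
A = LOAD 'records' USING RecordLoadFunc()
    AS (id:long, family:chararray, channel:chararray,
        timeframe:chararray, gender:chararray);
-- B: one row per (bucket, id) pair, sharing the same generated id.
B = LOAD 'buckets' USING BucketLoadFunc()
    AS (bucket:int, id:long);
C = JOIN A BY id, B BY id;
D = GROUP C BY (family, channel, timeframe, bucket);
E = FOREACH D GENERATE group, COUNT(C);
```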
I am sure this method would work, but it requires generating a map/reduce-friendly
surrogate key on which to join the data. Is there a more direct way
to do this in Pig? Also, is it possible to load more than one relation at a
time (splitting the data between two relations) with the LOAD statement?