OK... I've done it :P Thanks for your help. I did it through a JOIN, with the help of a new key field (consisting of txUser and txEpoch) that I use later to identify unique fields for GROUPing.
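A minimal sketch of what such a JOIN-then-GROUP script might look like, not the exact script used. It assumes the LOAD statements from the original question, takes the key to be (txInstance, txEpoch) (called txUser and txEpoch above), and uses <= rather than < so that the user2 row with equal epochs is kept, as in the desired output. The aliases joined, candidates, grouped, ordered, latest and closest are made up for illustration.

```pig
-- Same inputs as in the original question.
up = LOAD '/up.log' USING PigStorage(',') AS (upEpoch:long, upInstance:chararray, upKeyword:chararray);
tx = LOAD '/tx.log' USING PigStorage(',') AS (txEpoch:long, txInstance:chararray, txKeyword:chararray);

-- Pair every tx record with every up record of the same user.
joined = JOIN tx BY txInstance, up BY upInstance;

-- Keep only up records that are not newer than the tx record
-- (assumption: <= instead of <, so equal epochs still match).
candidates = FILTER joined BY upEpoch <= txEpoch;

-- Regroup by the key that identifies one tx record.
grouped = GROUP candidates BY (txInstance, txEpoch);

-- Within each group, keep the candidate with the highest upEpoch.
closest = FOREACH grouped {
    ordered = ORDER candidates BY upEpoch DESC;
    latest = LIMIT ordered 1;
    GENERATE FLATTEN(latest);
}
```

On the sample UP/TX data from this thread, this should produce 7,user1,wow,5,user1,sam2 and 9,user2,pop,9,user2,flin. Grouping on the (txInstance, txEpoch) pair assumes a user has at most one tx record per epoch; otherwise txKeyword would need to be part of the key as well.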
Sincerely,
Marek M.
________________________________________
From: Marek Miglinski [[email protected]]
Sent: Wednesday, September 14, 2011 9:52 AM
To: [email protected]
Subject: RE: Dumb question guys
Thanks for your reply,
I can't use JOIN and I will explain why. So here I have data...
UP:
9,user1,sam1
5,user1,sam2
3,user1,sam3
9,user2,flin
TX:
7,user1,wow
9,user2,pop
I need to join tx with up by user and closest epoch (first field). If I do JOIN I will get (JOIN BY user):
7,user1,wow,9,user1,sam1
7,user1,wow,5,user1,sam2
7,user1,wow,3,user1,sam3
9,user2,pop,9,user2,flin
Now, I can't filter the records properly in FOREACH, because each row is processed on its own and I can't tell whether the current input row is the one I need, ok?
So I do COGROUP and get:
{(7,user1,wow)}, {(9,user1,sam1), (5,user1,sam2), (3,user1,sam3)}
{(9,user2,pop)}, {(9,user2,flin)}
Now I can FILTER, ORDER and LIMIT through FOREACH because I have all data in one row:
recordExtract = FOREACH recordGroup {
    recordFiltered = FILTER up BY upEpoch < tx.txEpoch;
    recordOrdered = ORDER recordFiltered BY upEpoch DESC;
    recordLimited = LIMIT recordOrdered 1;
    GENERATE recordLimited;
}
So if I can reference tx.txEpoch properly, I will get the desired result:
7,user1,wow,5,user1,sam2 (upEpoch 5 is closest to txEpoch 7)
9,user2,pop,9,user2,flin (upEpoch 9 is closest to txEpoch 9)
Do you have any clues?
________________________________________
From: Xiaomeng Wan [[email protected]]
Sent: Tuesday, September 13, 2011 11:26 PM
To: [email protected]
Subject: Re: Dumb question guys
tx is a bag; you cannot use it that way unless it is a scalar. I'm not
sure about the logic here, but it looks like you should use a join rather
than a cogroup:
recordGroup = join up BY upInstance, tx BY txInstance;
recordFiltered = FILTER recordGroup BY upEpoch < txEpoch;
Shawn
On Tue, Sep 13, 2011 at 11:54 AM, Marek Miglinski wrote:
Hey all, 4 hours of true torture, hope you will help me (the task is easy)
up = LOAD '/up.log' USING PigStorage(',') AS (upEpoch:long, upInstance:chararray, upKeyword:chararray);
tx = LOAD '/tx.log' USING PigStorage(',') AS (txEpoch:long, txInstance:chararray, txKeyword:chararray);
recordGroup = COGROUP up BY (upInstance), tx BY (txInstance);
recordExtract = FOREACH recordGroup {
    recordFiltered = FILTER up BY upEpoch < tx.txEpoch;
    recordLimited = LIMIT recordFiltered 1;
    GENERATE recordLimited;
}
How do I point Pig to the txEpoch field of my tx input (from recordGroup)? None of tx::txEpoch, tx.txEpoch, txEpoch or recordGroup::tx.txEpoch works...
Always the same, with tx::txEpoch - "ERROR 1000: Error during parsing. Invalid alias: tx::txEpoch in {upEpoch: long,upInstance: chararray,upKeyword: chararray}"
Or with tx.txEpoch (I know it takes tx = LOAD as a source, but I need recordGroup::tx.txEpoch!) - "ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (1314835200050,99,sam), 2nd :(1314835200079,99,flin)"