Grokbase Groups Pig user August 2010
Christian Decker at Aug 28, 2010 at 6:13 pm
The title might be a bit misleading but I hope you can help me.
I have some data (let's say a Web Log file) and I want to be able to compare
multiple items with each other. For example I want to know what items are
popular in certain user groups, which means that I want to find items which
got many successive hits from users from that group in a short period of
time.
Until now I have only worked on rows in an isolated manner, that is, items
could be filtered or modified without any knowledge of other records. This
now requires considering multiple records, and I have no clue as to how to
approach this problem.

Any suggestions?

Regards,
Chris


  • Thejas M Nair at Aug 29, 2010 at 1:22 am
    Can you give the multiple rows an id and use that? In your example, can
    you assign a user-group id to each type of user (or maybe a map with
    attributes, if a user can belong to multiple groups), and then process
    using that attribute or id?
    (I might not have understood the problem correctly; an example of input
    and output data might help.)
    -Thejas
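
    A minimal sketch of the idea above, under assumptions that are not in
    the thread: the log has an item column, a second tab-separated file maps
    each user to a group id, and the LOGS and GROUPS parameters and all
    relation names are hypothetical. It ignores the time aspect and only
    shows how tagging rows with a group id lets GROUP bring related records
    together.

    ```pig
    -- Hypothetical inputs: a web log and a user-to-group mapping.
    logs        = LOAD '$LOGS'   AS (user:chararray, item:chararray, timestamp:long);
    memberships = LOAD '$GROUPS' AS (user:chararray, user_grp:chararray);

    -- Tag every hit with the group id of the user who made it.
    -- A user listed under several groups yields one joined row per group.
    tagged = JOIN logs BY user, memberships BY user;

    -- Collect all hits for the same (group, item) pair into one bag.
    by_item = GROUP tagged BY (memberships::user_grp, logs::item);

    -- Reduce each bag to a hit count per (group, item).
    hits = FOREACH by_item GENERATE FLATTEN(group) AS (user_grp, item),
                                    COUNT(tagged) AS n;
    ```

    Listing a user once per group in the mapping file is one way to handle
    users that belong to several groups without needing a map column.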


  • Mridul Muralidharan at Aug 29, 2010 at 4:09 pm
    Taking a guess, you could group the records on your criterion and then
    apply a condition to each group.

    Something simple like:

    a) group by usergroup (might be too expensive: the number of records
    across all timestamps for the users in a group might be large).

    b) group by (usergroup, timestamp / window): manageable, but less
    accurate, because this loses accuracy near the window boundaries (see
    below).

    There may be other, more sensible variations depending on your input.

    Something like the snippet below. Note that if users clicked at the 9th
    and 11th minute, a 10-minute window puts the two clicks into different
    buckets and misses that they belong together. So typically you would
    adjust accordingly for that error, or replicate the input, or do
    something more complicated than this simple snippet :-)

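    -- Window length: 10 minutes, assuming timestamp is in seconds.
    %default WINDOW '60 * 10'

    A = LOAD '$MY_INPUT' AS (user:chararray, user_grp:chararray, timestamp:long);

    -- B = GROUP A BY user_grp PARALLEL $PARALLELISM;
    -- WINDOW expands textually, so it is parenthesized here; without the
    -- parentheses the key would be (timestamp / 60) * 10.
    B = GROUP A BY (user_grp, timestamp / ($WINDOW)) PARALLEL $PARALLELISM;
    C = FILTER B BY COUNT(A) > $THRESHOLD;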




    Of course, I hope I am not misunderstanding your query entirely!


    Regards,
    Mridul
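
    One possible reading of "replicate input" for the bucket-boundary
    problem described above is to use overlapping windows. The sketch below
    is an assumption, not the poster's method: it adds a hypothetical item
    column, picks arbitrary defaults for WINDOW, THRESHOLD and PARALLELISM,
    and starts a new window every half window length, so each hit is
    counted in exactly two windows.

    ```pig
    -- Hypothetical defaults: 10-minute window (seconds), hit threshold, reducers.
    %default WINDOW '600'
    %default THRESHOLD '100'
    %default PARALLELISM '10'

    A = LOAD '$MY_INPUT' AS (user:chararray, user_grp:chararray,
                             item:chararray, timestamp:long);

    -- Cut time into half-window slots; window k covers slots k and k + 1.
    S = FOREACH A GENERATE user_grp, item, timestamp / ($WINDOW / 2) AS slot;

    -- Replicate the input: a hit in slot s belongs to windows s - 1 and s.
    W1 = FOREACH S GENERATE user_grp, item, slot AS win;
    W2 = FOREACH S GENERATE user_grp, item, slot - 1 AS win;
    W  = UNION W1, W2;

    -- Count hits per (group, item, window) and keep the busy ones.
    G = GROUP W BY (user_grp, item, win) PARALLEL $PARALLELISM;
    C = FOREACH G GENERATE FLATTEN(group) AS (user_grp, item, win),
                           COUNT(W) AS hits;
    P = FILTER C BY hits > $THRESHOLD;
    ```

    With these defaults, clicks at the 9th and 11th minute fall into slots 1
    and 2 and are counted together in window 1, which spans minutes 5 to 15.
    More generally, two hits less than half a window apart always share at
    least one window. The price is twice the data going into the GROUP, and
    a sustained burst can show up in two adjacent windows, so the output may
    need de-duplicating per (user_grp, item).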


Discussion Overview
group: user @
categories: pig, hadoop
posted: Aug 28, '10 at 6:13p
active: Aug 29, '10 at 4:09p
posts: 3
users: 3
website: pig.apache.org
