Hi!

I am new to PIG, so pardon my naïve question. I have data like this:

(A,1)
(A,5)
(B,4)
(C,22)
(C,10)

I need to calculate the maximum value for each distinct value of the 1st column:

(A,5)
(B,4)
(C,22)

Is there a good way to do it in PIG? The only way I see is to
first group by the first column and calculate the max value per group. After that,
join with the original data, adding a max value column:

(A,1,5)
(A,5,5)
(B,4,4)
(C,22,22)
(C,10,22)

Then I need to filter, dropping rows where the 3rd column is bigger than
the 2nd.
Finally, I will need to do DISTINCT to remove duplicate max values. It
all sounds quite complex computationally and I was wondering if there is a
better way...

Sincerely,
Vadim


--
"Hated by fools, and fools to hate, be this my motto and my fate"
(Jonathan Swift)

  • Alan Gates at Dec 31, 2008 at 10:09 pm
    If I understand correctly, what you want is this:

    A = load 'yourfile' as (firstcol, secondcol);
    B = group A by firstcol;
    C = foreach B generate group, MAX($1.secondcol);

    This will collect like values in the first column and find the
    maximum value in the second column for each group.

    Alan.
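For readers who want to see what the GROUP plus MAX script computes, here is a minimal Python sketch of the same logic, run on the sample data from the question. It is only an illustration and not Pig code; the function name `max_per_key` is made up for this example.

```python
def max_per_key(rows):
    """Return {key: max value} for an iterable of (key, value) pairs."""
    best = {}
    for key, value in rows:
        # Keep the larger of the stored value and the new one.
        if key not in best or value > best[key]:
            best[key] = value
    return best


rows = [("A", 1), ("A", 5), ("B", 4), ("C", 22), ("C", 10)]
result = sorted(max_per_key(rows).items())
print(result)  # [('A', 5), ('B', 4), ('C', 22)]
```

Each row is looked at once, which matches the idea that no join, filter or DISTINCT step is needed for this version of the problem.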

  • Vadim Zaliva at Dec 31, 2008 at 10:52 pm

    Thanks!

    Vadim

    --
    "Perfection is achieved not when there is nothing left to add, but
    when there is nothing left to take away." (Antoine de Saint-Exupery)
  • Vadim Zaliva at Jan 3, 2009 at 12:18 am
    Perhaps my example was not very good. Let me rephrase it:

    Having data like this:

    (A,x,1)
    (A,u,5)
    (A,y,5)
    (B,z,4)
    (C,g,22)
    (C,h,10)

    I need to calculate:

    (A,u,5)
    (B,z,4)
    (C,g,22)

    So, for each first column value I need to keep only one row,
    with the max. value in the 3rd column.

    I came up with something like:

    A = load 'file' as (first, second, third);
    B = FOREACH A GENERATE first, third;
    C = GROUP B by first;
    -- max of the third column per group
    D = FOREACH C GENERATE group AS first, MAX($1.third) as max;
    E = JOIN D by first, A by first;
    F = FILTER E by third == max;
    -- "first" exists on both sides of the join, so qualify it
    G = FOREACH F GENERATE A::first, second, third;

    At this point I will get:

    (A,u,5)
    (A,y,5)
    (B,z,4)
    (C,g,22)

    So I need something like DISTINCT G BY first, third but PIG does not
    have it.

    Any good way around it?

    Vadim
  • Ted Dunning at Jan 3, 2009 at 4:15 am
    I think that what you need is a custom max function that operates on pairs.

    Then you can group by the first field and keep the maximum pair.

    Something like this:

    A = load 'file' as (first, second, third);
    B = GROUP A by first;
    C = FOREACH B GENERATE group, PairwiseMax(B);


    PairwiseMax should accept a bunch of pairs and keep the one that has the
    largest second element. This should be relatively trivial to write in Java,
    but I think it would be difficult in Pig.
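The logic such a PairwiseMax function would need can be sketched in a few lines of Python. This is a hedged illustration of the idea only, not a Pig UDF; the name `pairwise_max` and the tie-breaking rule (the first row seen wins) are assumptions, since the thread does not say which of two tied rows should be kept.

```python
def pairwise_max(rows):
    """Keep, for each first field, the whole row with the largest third field.

    Ties go to the row seen first (an assumption made for this sketch).
    """
    best = {}
    for row in rows:
        key, third = row[0], row[2]
        # Strict ">" means an equal value never replaces the stored row.
        if key not in best or third > best[key][2]:
            best[key] = row
    return [best[key] for key in sorted(best)]


rows = [("A", "x", 1), ("A", "u", 5), ("A", "y", 5),
        ("B", "z", 4), ("C", "g", 22), ("C", "h", 10)]
print(pairwise_max(rows))
# [('A', 'u', 5), ('B', 'z', 4), ('C', 'g', 22)]
```

Because the whole row is carried along with the maximum, the tied rows (A,u,5) and (A,y,5) collapse to one without a separate DISTINCT step.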

    --
    Ted Dunning, CTO
    DeepDyve
    4600 Bohannon Drive, Suite 220
    Menlo Park, CA 94025
    www.deepdyve.com
    650-324-0110, ext. 738
    858-414-0013 (m)
  • Vadim Zaliva at Jan 3, 2009 at 7:13 am
    I can certainly write a custom function, but I was curious how this type
    of problem could be solved using PIG only.

    If I were to write a custom function, I would not do it as you suggest.
    Your approach will not work very well on large data sets. I would write
    a custom function which outputs the first record and skips all subsequent ones
    with a matching set of fields. Thus even a very large data set could
    be sorted using the map/reduce framework first, and then
    processed by such a function, which only needs to keep in memory one
    record (or rather the matching field(s) of the last record).
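The sort-then-skip idea described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `first_per_key` is a made-up name, and the call to `sorted` merely stands in for the sort that the map/reduce framework would perform.

```python
def first_per_key(sorted_rows):
    """Yield the first row of each run of rows sharing the same first field.

    Assumes equal keys are adjacent in the input. Only the previous key
    is kept in memory, never the whole group.
    """
    previous_key = object()  # sentinel that equals no real key
    for row in sorted_rows:
        if row[0] != previous_key:
            yield row
            previous_key = row[0]


rows = [("A", "x", 1), ("A", "u", 5), ("A", "y", 5),
        ("B", "z", 4), ("C", "g", 22), ("C", "h", 10)]
# Stand-in for the framework's sort: first field ascending,
# third field descending, so the max row leads each run.
ordered = sorted(rows, key=lambda r: (r[0], -r[2]))
print(list(first_per_key(ordered)))
# [('A', 'u', 5), ('B', 'z', 4), ('C', 'g', 22)]
```

The filter itself is a single pass with constant memory; the cost has moved into the sort that precedes it.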

    Still, I am curious to see if anybody could suggest a PIG-only solution
    to the problem. I was thinking of another approach: grouping by the
    first field, sorting the rows within each group, and then taking the first one.
    Unfortunately this would not work either: FOREACH allows nested ORDER,
    but not LIMIT :(

    Sincerely,
    Vadim

  • Ted Dunning at Jan 3, 2009 at 7:41 pm
    As you like, but you still need to sort or compare the results to get what
    you want. Either way, the reduce function will have to grovel through all
    of the records in the group. With sorting, you pay the price of ordering
    all of the records. With max selection, you only need one comparison per
    record rather than log n.
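The cost difference between sorting and max selection can be observed with a small experiment. The following Python sketch counts comparisons with a wrapper class; `Counted` and `count_comparisons` are names invented for this illustration, and the exact count for sorting depends on the input and on the sort implementation.

```python
import random


class Counted:
    """Wraps a value and counts every comparison made on it."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __lt__(self, other):
        Counted.comparisons += 1
        return self.value < other.value

    def __gt__(self, other):
        Counted.comparisons += 1
        return self.value > other.value


def count_comparisons(operation, values):
    """Run operation on freshly wrapped values; return the comparison count."""
    Counted.comparisons = 0
    operation([Counted(v) for v in values])
    return Counted.comparisons


random.seed(0)  # fixed seed so the run is repeatable
values = random.sample(range(100000), 1000)
max_cost = count_comparisons(max, values)      # one per record after the first
sort_cost = count_comparisons(sorted, values)  # grows roughly as n log n
print(max_cost, sort_cost)
```

For 1000 shuffled values, selecting the maximum takes 999 comparisons, while sorting takes several times as many.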
  • Vadim Zaliva at Jan 4, 2009 at 1:52 am
    Assuming that I want to write the function as you suggested, I do not
    see under what UDF category it falls (from this document):

    http://wiki.apache.org/pig/UDFManual

    It is close to "Aggregate Functions" but they must return a scalar
    value.

    If I were to write it the way I suggested, "Filter Functions" may seem
    applicable, assuming that I can keep state between invocations and it
    is guaranteed that the same instance will be used to process the whole data
    set. But if the data is split into chunks and the function is applied to them
    independently, this is not going to work.

    So, either way, I am stuck! :)

    The only way I see is to split my PIG script into 2 parts and save
    the intermediate values. Then I can invoke a custom Hadoop map/reduce task.
    After its completion, the second part of my PIG script could pick up
    the results and continue.

    I think this is very clumsy. The problem I am trying to solve seems to
    be pretty trivial and common. I think PIG should have a way to solve
    it. One of the following modifications of the PIG language would solve my
    problem:

    1. Allowing LIMIT as a nested operation in FOREACH (in addition to ORDER
    and the others which are currently allowed)
    2. Extending the DISTINCT operation with a "BY" clause, allowing users to
    specify a list of fields.

    Has anybody else besides me raised such suggestions? Any chance of
    seeing them as part of the language anytime soon?

    Sincerely,
    Vadim
  • Benjamin Reed at Jan 5, 2009 at 6:00 pm
    Vadim,

    Why don't you write a function that takes a bag and returns a bag? I'm not sure why it bothers you whether or not the function will be considered an aggregate function. In PiggyBank there are functions that take bags and return bags.

    If I understand your problem correctly, you want a function that takes tuples grouped by the first field and returns the tuple with the highest third field from the group. That is a very simple function to write as an algebraic function, and it will be very efficient.
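What makes such a function algebraic is that the maximum of a group can be computed from the maxima of its parts, so partial results can be combined before the final step. A minimal Python sketch of that property follows. The three stage names loosely follow the initial/intermediate/final split that Pig's algebraic functions use, as far as the editor understands it; the Python functions themselves are invented for this illustration and are not Pig's API.

```python
def better(a, b):
    """Return whichever row has the larger third field; a wins ties."""
    return b if b[2] > a[2] else a


def initial(row):
    # A single row is its own partial result.
    return row


def intermediate(partials):
    # Combine partial results; works on any non-empty subset of a group.
    result = partials[0]
    for row in partials[1:]:
        result = better(result, row)
    return result


def final(partials):
    # The last step is the same reduction over the remaining partials.
    return intermediate(partials)


group_a = [("A", "x", 1), ("A", "u", 5), ("A", "y", 5)]
# Pretend the group was split across two map tasks.
chunk1, chunk2 = group_a[:2], group_a[2:]
partial1 = intermediate([initial(r) for r in chunk1])
partial2 = intermediate([initial(r) for r in chunk2])
print(final([partial1, partial2]))  # ('A', 'u', 5)
```

Only one row per chunk travels to the final step, which is why this formulation can be efficient even for very large groups.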

    ben
    ________________________________________
    From: Vadim Zaliva [lord@codeminders.com]
    Sent: Saturday, January 03, 2009 5:52 PM
    To: pig-user@hadoop.apache.org
    Subject: Re: novice user

  • Vadim Zaliva at Jan 5, 2009 at 6:10 pm

    If this is possible I will gladly do this. I was confused because the
    documentation states:

    "An aggregate function is an eval function that takes a bag and
    returns a scalar value"

    So I was not sure whether I can return a bag from a UDF. I will try
    that approach. Thanks!

    Vadim


  • Benjamin Reed at Jan 5, 2009 at 10:28 pm
    Sorry, I've been reading my mail queue LIFO. Ted is exactly right; he just has a small typo:

    C = FOREACH B GENERATE group, PairwiseMax(A);

    PairwiseMax is trivial to write and it is exactly the reason we have UDFs.

    ben
Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 31, '08 at 9:52p
active: Jan 5, '09 at 10:28p
posts: 11
users: 5
website: pig.apache.org
