Grokbase Groups Pig user April 2011
Hi

I have the following input relation:
Name Score
Jack 25
Jimmy 30
Sam 20
Hick 35
Tampa 22

My goal is to rank the tuples by score.

Pig script:

sample_data = LOAD 'sample.txt' USING PigStorage() AS (name:chararray, score:int);
sample_data_group = GROUP sample_data BY score;
sample_data_count = FOREACH sample_data_group GENERATE group AS score, COUNT(sample_data.name) AS countVal;
sample_data_order = ORDER sample_data_count BY score DESC;
sample_data_group_all = GROUP sample_data_order all;
sample_data_project = FOREACH sample_data_group_all GENERATE FLATTEN(myUDF.Rank(sample_data_order));
dump sample_data_project;

Can someone please point me to a UDF example that takes a relation as input
and iterates over all of its tuples? I plan to iterate over the tuples and
assign a rank to each of them based on the score value.

Is there any other way to generate rank?

Thanks much.

Arun
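
As one way to picture the kind of UDF asked about above, here is a minimal Python sketch of a function that receives a whole bag and iterates over its tuples. It assumes the bag arrives as a list of (name, score) tuples, which is roughly how Pig's Jython UDF support presents bags. The name rank_bag is hypothetical, and a real Pig UDF would also need an output schema declaration, which is omitted here.

```python
def rank_bag(bag):
    """Return (name, score, rank) tuples, highest score first.

    `bag` is assumed to be a list of (name, score) tuples, i.e. the
    single bag produced by a GROUP ... ALL. Tied scores get consecutive
    ranks in sorted order: this is a plain row number, not a dense rank.
    """
    # Sort here instead of trusting the order of tuples inside the bag.
    ordered = sorted(bag, key=lambda t: t[1], reverse=True)
    # enumerate supplies the running rank counter.
    return [(name, score, rank)
            for rank, (name, score) in enumerate(ordered, start=1)]


if __name__ == "__main__":
    # The sample relation from the question.
    sample = [("Jack", 25), ("Jimmy", 30), ("Sam", 20),
              ("Hick", 35), ("Tampa", 22)]
    for row in rank_bag(sample):
        print("%s\t%d\t%d" % row)
```

Because the sketch holds the whole bag in a list, it only suits relations small enough to fit in memory on one machine.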


  • Jacob Perkins at Apr 27, 2011 at 2:19 am
    The question is, do you need the entire relation all at once to assign a
    rank? If so then map-reduce may not be the answer. If not, why not just
    run the UDF on each tuple of the relation, one at a time, with a
    projection?

    If you need some global information, such as the max and min score, then
    you might look at the MAX and MIN operations. They do require a GROUP
    ALL but are algebraic so it's not actually going to bring all the data
    to one machine as it otherwise would.

    --jacob
    @thedatachef
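
To illustrate why an algebraic function does not need all the data on one machine, here is a minimal Python sketch. It is not Pig's actual implementation: the function names and the way the scores are split into chunks are hypothetical. Each chunk is reduced to a small (min, max) pair, and only those pairs are combined at the end.

```python
from functools import reduce


def partial_min_max(scores):
    """Reduce one chunk of scores to a (min, max) pair."""
    return (min(scores), max(scores))


def combine(a, b):
    """Merge two (min, max) pairs into one."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def global_min_max(chunks):
    """Combine per-chunk results; only small pairs cross chunk boundaries."""
    partials = [partial_min_max(c) for c in chunks if c]
    if not partials:
        raise ValueError("no scores to summarise")
    return reduce(combine, partials)


if __name__ == "__main__":
    # Pretend the five sample scores were split across two tasks.
    chunks = [[25, 30, 20], [35, 22]]
    print(global_min_max(chunks))
```

For the sample scores this yields a minimum of 20 and a maximum of 35, however the scores are divided between chunks.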

  • Arun A K at Apr 27, 2011 at 2:44 am
    Thanks Jacob for the response.

    If I run the UDF on each tuple, how can I preserve the state of the rank
    variable? I mean the UDF won't be able to save the rank value between calls,
    right? Correct me if I am wrong in assuming that the UDF would be
    invoked once for each tuple.

    What I am looking for in my output is an additional column indicating the
    rank. Something like:

    Hick 35 1
    Jimmy 30 2
    Jack 25 3
    Tampa 22 4
    Sam 20 5

    Thanks.

    Arun

  • Jacob Perkins at Apr 27, 2011 at 2:55 am
    What you've indicated does require access to the whole relation at once,
    or at least a way of incrementing a counter and assigning its value to
    each tuple. This kind of shared/synchronized state isn't possible with
    Pig at the moment, as far as I know.

    --jacob
    @thedatachef
  • Arun A K at Apr 27, 2011 at 3:50 am
    Thanks Jacob.

    I wonder if it is possible to get the rank of each record, or say its row
    number, using Pig. Or do I need an external driver, like a shell script,
    that augments the sorted output from Pig with a rank?

    Thanks
    Arun


  • Dexin Wang at Apr 27, 2011 at 4:15 am
    If the whole set is not that big, sorting in shell might be the easiest. I've done that with result sets of millions of records.
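
The post-processing step suggested here could look something like the following, sketched in Python rather than shell. It assumes the Pig output has already been sorted by score and is tab-separated (PigStorage's default delimiter). The name add_rank is hypothetical.

```python
def add_rank(lines, start=1):
    """Yield each input line with a running rank appended as a last column.

    `lines` is any iterable of text lines that are already in rank order.
    Lines are handled one at a time, so memory use does not grow with
    the size of the input.
    """
    rank = start
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines without consuming a rank
        yield "%s\t%d" % (line, rank)
        rank += 1


if __name__ == "__main__":
    # Stand-in for the sorted output of the Pig job.
    sorted_output = ["Hick\t35\n", "Jimmy\t30\n", "Jack\t25\n",
                     "Tampa\t22\n", "Sam\t20\n"]
    for out in add_rank(sorted_output):
        print(out)
```

Passing an open file or sys.stdin in place of the list would let the same generator number real output without loading it all at once.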


Discussion Overview
group: user @
categories: pig, hadoop
posted: Apr 27, '11 at 2:08a
active: Apr 27, '11 at 4:15a
posts: 6
users: 3
website: pig.apache.org
