FAQ: count total number of tuples in a bag?
Excuse me if I have missed an important part of the Pig documentation and am asking a trivial question here :) What is the best way to find the total number of tuples (rows) in a bag of loaded data? For example, after "a = LOAD 'sth' AS (key, value); b = GROUP a BY key; c = FOREACH b GENERATE key;" I want to know how many tuples were loaded into 'a' and how many are left in 'c'. One way might be to use a UDF, but does Pig have built-in support for this kind of counting?

Thanks,

Michael

  • Dmitriy Ryaboy at Feb 24, 2010 at 12:10 am
    c = FOREACH b GENERATE group as key, COUNT(a);

    will give you the number of rows in a per key.

    a_all = group a ALL;
    a_count = FOREACH a_all GENERATE COUNT(a);

    will give you the total number of rows in a.
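
    Putting the two pieces together for the original example, a minimal sketch might look like the following. The aliases c_all, a_count and c_count and the field name total are illustrative assumptions, not names from the question.

    ```pig
    a = LOAD 'sth' AS (key, value);
    b = GROUP a BY key;
    -- after GROUP, the grouping field is named 'group', so rename it to key
    c = FOREACH b GENERATE group AS key;

    -- total number of tuples loaded into a
    a_all = GROUP a ALL;
    a_count = FOREACH a_all GENERATE COUNT(a) AS total;

    -- total number of tuples left in c (one per distinct key)
    c_all = GROUP c ALL;
    c_count = FOREACH c_all GENERATE COUNT(c) AS total;

    DUMP a_count;
    DUMP c_count;
    ```

    As far as I know, the current Pig documentation says COUNT skips tuples whose first field is null, and COUNT_STAR is the variant that includes them.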

    Does that answer your question?


  • Jiang licht at Feb 24, 2010 at 12:28 am
    Thanks Dmitriy. That's not quite what I want. I want something like what you can do in SQL: get the total count of tuples (or some other value of interest) and then use it like a variable. (Sorry, I don't know whether "variable" is the right term in Pig, but Pig passes command-line parameters as variables, right?) Such a variable would be convenient for quickly calculating statistics in Pig scripts. I also realize that Pig may not be meant to use variables this way, so this might just be a misconception on my part...

    Thanks,

    Michael


  • Jeff Zhang at Feb 24, 2010 at 2:32 am
    One way I can think of is to store the total number of tuples in a
    specified place, and then load it in your UDF when you want to use it.

    a_all = group a ALL;
    a_count = FOREACH a_all GENERATE COUNT(a);
    store a_count into 'your_store_place';
    .....................

    d = foreach c generate YourUDF($0);
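
    If the count is only needed inside the same script, a different technique that avoids both the intermediate store and a custom UDF is to attach the single-row count to every tuple with CROSS. A sketch under that assumption follows. The names n, total, c_with_total and fraction are hypothetical, and c follows the per-key count from Dmitriy's reply.

    ```pig
    a = LOAD 'sth' AS (key, value);
    b = GROUP a BY key;
    c = FOREACH b GENERATE group AS key, COUNT(a) AS n;

    a_all = GROUP a ALL;
    a_count = FOREACH a_all GENERATE COUNT(a) AS total;

    -- a_count holds a single tuple, so CROSS simply appends total to each row of c
    c_with_total = CROSS c, a_count;

    -- share of all rows that belongs to each key
    d = FOREACH c_with_total GENERATE c::key, (double) c::n / (double) a_count::total AS fraction;
    ```

    The trade-off is an extra CROSS in the plan, which stays cheap only because one side has a single row.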



    --
    Best Regards

    Jeff Zhang
  • Jiang licht at Feb 24, 2010 at 6:18 am
    If there are handy variables to carry values here and there, that'd be helpful :)

    Thanks,

    Michael

  • Jiang licht at Feb 24, 2010 at 6:39 am
    I guess I'd take back some of those thoughts, considering that Pig is specifically designed to produce m/r jobs. Command-line parameters and those specified by %declare won't change their values during the life of the whole job (which may consist of multiple m/r tasks). Variables whose values change from time to time do not fit the m/r scheme, which suits applications in which data, once created, is usually read-only. But one suggestion would be to allow variables that can be assigned a value only once and carry that value from the point of assignment to the end of the program; that is, once a variable is assigned a value, it becomes immutable. Of course, even this would create some difficulty, e.g. for optimization, since it may add extra data dependencies...
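
    For what it's worth, later Pig releases (0.8 onward, as far as I know) let a single-row relation be used as a scalar inside an expression, which is close to the assign-once variable described above. A minimal sketch, assuming such a release; a_count, total, n and fractions are illustrative names.

    ```pig
    a = LOAD 'sth' AS (key, value);
    b = GROUP a BY key;
    c = FOREACH b GENERATE group AS key, COUNT(a) AS n;

    a_all = GROUP a ALL;
    a_count = FOREACH a_all GENERATE COUNT(a) AS total;

    -- a_count.total is read as a scalar; Pig reports an error at run time
    -- if a_count turns out to hold more than one tuple
    fractions = FOREACH c GENERATE key, (double) n / (double) a_count.total AS fraction;
    DUMP fractions;
    ```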


    Michael


Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 23, '10 at 11:55p
active: Feb 24, '10 at 6:39a
posts: 6
users: 3
website: pig.apache.org
