Grokbase Groups Pig user October 2010
I've seen a few threads about counters, PigStats, Elephant-Bird's stats
utility class, etc.

http://www.mail-archive.com/pig-user@hadoop.apache.org/msg00900.html
http://www.mail-archive.com/user%40pig.apache.org/msg00034.html

Has any progress been made on this or to provide a comprehensive
stats/counter mechanism?

What I'm looking to do is three-fold:

1) Get stats on the number of records that are filtered out when using the
FILTER operation
2) Get stats on the number of records dropped/not loaded in a LOAD function
(and actual copies of the records/rows from the file for later evaluation)
3) Output my own stats from a Pig job (without resorting to writing my own
UDF and pushing things into PigStats using the Elephant-Bird utility)

If any of this is possible, it would be great to see some examples or
documentation. I would hate to go to raw Hadoop MR code just to get to
counters.

Thanks,

Josh

  • Dmitriy Ryaboy at Oct 17, 2010 at 5:45 pm
    No on filters (though every MR job tells you the number of records ingested
    and the number returned, and as of 0.8 it also tells you which relations
    were being produced in the job -- so you can sort of back into that).
    EB sort of gives you 2): most of the loaders in there give you the number of
    malformed records, though they do not store the bad records anywhere.
    I am not sure what you mean by 3) -- you can just increment counters:

    PigStatusReporter.getInstance().getCounter(myEnum).increment(1L);

    (Watch out for a null reporter while you are still on the client side.)

    -D
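
A minimal sketch of the null-guarded increment described above. To keep it compilable without Pig or Hadoop on the classpath, `Reporter` and `Counter` here are hypothetical stand-ins for Pig's `PigStatusReporter` and Hadoop's `Counter`; `SafeCounters` and `LoadStats` are illustrative names that do not come from the thread. It assumes that either the reporter or the counter it hands back may be null on the client side.

```java
// Hypothetical stand-ins so the sketch compiles on its own. In a real UDF
// these roles would be played by Pig's PigStatusReporter and Hadoop's Counter.
interface Counter {
    void increment(long delta);
}

interface Reporter {
    // Assumed to return null when no task context is available (client side).
    Counter getCounter(Enum<?> name);
}

final class SafeCounters {
    // Hypothetical counter name, for illustration only.
    enum LoadStats { MALFORMED_RECORDS }

    private SafeCounters() {}

    // Increments the named counter if it is reachable. Returns false when the
    // reporter or the counter is null, so callers can skip the update quietly.
    static boolean increment(Reporter reporter, Enum<?> name, long delta) {
        if (reporter == null) {
            return false;
        }
        Counter counter = reporter.getCounter(name);
        if (counter == null) {
            return false;
        }
        counter.increment(delta);
        return true;
    }
}
```

Inside a UDF the call would look roughly like `SafeCounters.increment(reporter, LoadStats.MALFORMED_RECORDS, 1L)`, with `reporter` obtained from the real Pig class rather than the stand-in.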

  • Josh Devins at Oct 18, 2010 at 9:56 am
    Thanks, I will explore the stats in MR mode a bit once I'm on 0.8/trunk.

    I will also have a look at wrapping some of the standard loaders to get
    better stats out of them. Is this of interest to anyone else? Should I
    submit back to PiggyBank?

    This syntax of counters.PigStatusReporter, is that documented somewhere? Is
    it only on 0.8/trunk? What other variables do we have access to in the
    "native" Pig script other than "counters"?

    Josh
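
The "back into it" idea from the earlier reply amounts to simple arithmetic on the record counts a job reports. The sketch below assumes a job that does nothing but FILTER between its input and output; operators that change cardinality (FLATTEN, JOIN, GROUP and so on) would make the difference meaningless. `FilterStats` and `droppedRecords` are hypothetical names, and the counts are assumed to be read from the job's reported statistics by whatever means is available.

```java
final class FilterStats {
    private FilterStats() {}

    // Records removed by a filter-only job: records read minus records written.
    // Rejects negative counts and outputs larger than inputs, since either
    // means the job did more than filter or the numbers were misread.
    static long droppedRecords(long inputRecords, long outputRecords) {
        if (inputRecords < 0 || outputRecords < 0) {
            throw new IllegalArgumentException("record counts must be non-negative");
        }
        if (outputRecords > inputRecords) {
            throw new IllegalArgumentException("output exceeds input; not a filter-only job");
        }
        return inputRecords - outputRecords;
    }
}
```

For example, a job that reports 1,000,000 records read and 912,345 written would give `droppedRecords(1_000_000L, 912_345L)` equal to 87,655.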

  • Josh Devins at Oct 18, 2010 at 9:58 am
    Ah, sorry, I just saw that this should read:

    PigStatusReporter.getInstance(), and there is no special counters
    keyword/variable. However, is this common in Pig: being able to call
    static methods directly from within a Pig script?

    Thanks,

    Josh

  • Dmitriy Ryaboy at Oct 18, 2010 at 6:15 pm
    The code snippet I wrote was for use inside a UDF; it is not Pig Latin.
    To get at things like counters when running Pig code, you would have to
    write a Java driver program that uses the new API from
    https://issues.apache.org/jira/browse/PIG-1478 and
    https://issues.apache.org/jira/browse/PIG-1333

    -Dmitriy
