Exception Handling in Pig Scripts
Pig user mailing list, January 2011
Hello,

I have a Pig script that uses the piggybank to calculate date differences.
Sometimes, when I get a weird date or a wrongly formatted value in the input,
the script throws an error and aborts.

Is there a way I could trap these errors and move on without stopping the
execution?

Thanks

PS: I'm using CDH2 with Pig 0.5

  • Dmitriy Ryaboy at Jan 11, 2011 at 6:11 am
    Create a UDF that verifies the format, and go through a filtering step
    first.
    If you would like to save the malformed records so you can look at them
    later, you can use the SPLIT operator to route the good records to your
    regular workflow, and the bad records somewhere on HDFS.
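
    For concreteness, a minimal sketch of that routing (IsValidDate is a
    hypothetical filter UDF you would write yourself):

    raw = LOAD 'input' AS (start_date:chararray, end_date:chararray);
    SPLIT raw INTO
        good IF (IsValidDate(start_date) AND IsValidDate(end_date)),
        bad  IF (NOT (IsValidDate(start_date) AND IsValidDate(end_date)));
    STORE bad INTO 'bad_records';  -- park the rejects on HDFS for inspection
    -- ...continue the regular workflow with 'good'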

    -D
  • Hadoop n00b at Jan 11, 2011 at 8:47 am
    Thanks, I sometimes get a date like 0001-01-01. This would be a valid date
    format, but when I try to get the seconds between this and another date, say
    2011-01-01, I get an error that the value is too large to fit into an int,
    and the process stops. Do we have something like ifError(x-y, null, x-y)? Or
    would I have to implement this as a UDF?
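
    (The numbers back that up: roughly 2,010 years is about 6.3 x 10^10
    seconds, far beyond the int maximum of about 2.1 x 10^9.)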

    Thanks
  • Dmitriy Ryaboy at Jan 11, 2011 at 6:41 pm
    Right now error handling is controlled by the UDFs themselves, and there is
    no way to direct it externally.
    You can make an ErrorHandlingUDF that takes a UDF spec, invokes it, traps
    errors, and then applies the specified error handling behavior... that's a
    bit ugly, though.
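
    A rough sketch of what that wrapper might look like (hypothetical code;
    the wrapped UDF's class name is passed as a constructor argument, which
    Pig lets you do through DEFINE):

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    public class ErrorHandlingUDF extends EvalFunc<Object> {
        private final EvalFunc<?> wrapped;

        public ErrorHandlingUDF(String udfClassName) throws Exception {
            // instantiate the UDF named in the spec
            wrapped = (EvalFunc<?>) Class.forName(udfClassName).newInstance();
        }

        @Override
        public Object exec(Tuple input) throws IOException {
            try {
                return wrapped.exec(input);
            } catch (Exception e) {
                return null;  // trap the error, emit null, keep going
            }
        }
    }

    In the script you would then write something like (class name made up):
    DEFINE SafeDateDiff ErrorHandlingUDF('com.example.DateDiff');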

    There is a problem with trapping general exceptions, of course: if they
    happen 0.000001% of the time, you can probably just ignore them, but if
    they happen in half your dataset, you want the job to tell you something is
    wrong. So this stuff gets non-trivial. If anyone wants to propose a design
    to solve this general problem, I think that would be a welcome addition.

    D
  • Santhosh Srinivasan at Jan 15, 2011 at 2:19 am
    Sorry about the late response.

    Hadoop n00b is proposing a language extension for error handling, similar to the mechanisms in other well-known languages like C++, Java, etc.

    For now, can't the error semantics be handled by the UDF? For exceptional scenarios you could throw an ExecException with the right details. The physical operator that handles the execution of UDFs traps it for you and propagates the error back to the client. You can take a look at any of the built-in UDFs to see how Pig handles it internally.
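
    For illustration, a hedged sketch of that pattern (class name and date
    format are made up; only the message-only ExecException constructor is
    assumed here):

    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.backend.executionengine.ExecException;
    import org.apache.pig.data.Tuple;

    public class DateDiffSeconds extends EvalFunc<Long> {
        @Override
        public Long exec(Tuple input) throws IOException {
            if (input == null || input.size() != 2) {
                throw new ExecException("Expected two dates, got: " + input);
            }
            try {
                SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
                long millis = f.parse((String) input.get(1)).getTime()
                        - f.parse((String) input.get(0)).getTime();
                return millis / 1000L;  // a long avoids the int overflow
            } catch (ParseException e) {
                ExecException ee = new ExecException("Bad date: " + e.getMessage());
                ee.initCause(e);
                throw ee;  // Pig's physical operator traps and reports this
            }
        }
    }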

    Santhosh

  • Ashutosh Chauhan at Jan 15, 2011 at 3:32 pm
    Santhosh,

    The way you are proposing it, the error will kill the Pig script. I think
    what the user wants is to ignore a few "bad records", process the rest, and
    get results. The problem here is how to let the user tell Pig what counts
    as a "bad record", and how to let them specify the percentage of bad
    records at which Pig should fail the script.

    Ashutosh
  • Santhosh Srinivasan at Jan 16, 2011 at 7:54 am
    Thanks for the clarification Ashutosh.

    Implementing this in the user realm is tricky, as Dmitriy states. Sensitivity to error thresholds will require support from the system. We could provide a taxonomy of records (good, bad, incomplete, etc.) to let users classify each record. The system can then track counts of each record type to facilitate the computation of thresholds. The last part is to allow users to specify thresholds and appropriate actions (interrupt, exit, continue, etc.). A possible mechanism to realize this is the ErrorHandlingUDF described by Dmitriy.
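
    To illustrate the threshold part (a plain-Java sketch, not an existing
    Pig API; all names are made up):

    import java.io.IOException;

    // Tolerates occasional bad records but fails once too large a
    // fraction of a reasonable sample has gone bad.
    public class ThresholdErrorHandler {
        private long total = 0;
        private long bad = 0;
        private final double maxBadFraction;  // e.g. 0.01 for 1%
        private final long minSample;         // don't judge tiny samples

        public ThresholdErrorHandler(double maxBadFraction, long minSample) {
            this.maxBadFraction = maxBadFraction;
            this.minSample = minSample;
        }

        public void onRecord() {
            total++;  // call once per record processed
        }

        public void onError(Exception cause) throws IOException {
            bad++;
            if (total >= minSample && (double) bad / total > maxBadFraction) {
                throw new IOException(bad + " of " + total + " records bad", cause);
            }
            // under the threshold: swallow the error and continue
        }
    }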

    Santhosh

  • Dmitriy Ryaboy at Jan 16, 2011 at 8:23 am
    I was thinking about this...

    We add an optional ON_ERROR clause to operators, which allows a user to
    specify error handling. The error handler would be a udf that would
    implement an interface along these lines:

    public interface ErrorHandler {
        public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
            throws IOException;
    }

    I think it makes sense not to make this a static method, so that users can
    keep required state and, for example, have the handler throw its own
    IOException if it's been invoked too many times.
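
    Usage might look something like this (purely hypothetical syntax and
    handler class; nothing like this is implemented):

    DEFINE TolerantHandler com.example.CountingErrorHandler('0.01');
    B = FOREACH A GENERATE DateDiff(start_date, end_date) ON_ERROR TolerantHandler;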

    D

  • Hadoop n00b at Jan 17, 2011 at 12:57 pm
    Good discussion!

    Ashutosh is close to what I am looking for. Instead of the whole script
    erroring out, if I could ignore the bad record, or use a default value and
    proceed, the script could run unattended rather than needing someone to
    monitor it for errors.

    And when you are working with thousands of records, figuring out which one
    went wrong is another task in itself.

    Just my thoughts.

    Cheers!

Discussion Overview
group: user@pig.apache.org
categories: pig, hadoop
posted: Jan 11, '11 at 5:58a
active: Jan 17, '11 at 12:57p
posts: 9
users: 4
website: pig.apache.org
