FAQ
Hi folks!

I'm brand new to this list, so apologies if this is an inappropriate
newbie question, or is otherwise incorrect, but here goes.

I'm working with a bunch of pig scripts, and we're adding new ones
almost daily. They are getting more and more complex. The problem is
exacerbated by the proliferation of magic numbers throughout them. As a
software engineer, these are driving me nuts! The code is quite brittle.
There seems to be no way to centralize logic or even values.

For a simple example:
filtered_stuff = FILTER stuff by record_type == 23;

I'd prefer:
filtered_stuff = FILTER stuff by record_type == RECORD_TYPE_ALPHA;

Where RECORD_TYPE_ALPHA is defined in some other file that the pig
script consumes.

Sounds rather like the old C-style header files would be in order...

Am I missing something obvious here? How do you all handle this
problem? (We're using Pig 0.6 and are just starting to transition to Pig 0.7.)

Thanks! --- Eric Wadsworth

  • Saurav Datta at Sep 29, 2010 at 5:06 pm
    Hi Eric,

    As I understand it, you would like to define the value used in the
    filter at run time, with that value read from a file.
    Am I correct?

    Regards,
    Saurav
  • Eric Wadsworth at Sep 29, 2010 at 5:14 pm
    Saurav,

    Not that limited, but yes. Another example is in order. Say I have
    something like this:
    projected_data = FOREACH data GENERATE com.example.udfs.foo(7, 37,
    'https', fields#'bar') as bat;

    This sort of thing would be vastly better:
    projected_data = FOREACH data GENERATE
    com.example.udfs.foo(FOO_COMMAND_CODE, MAX_FIELD_LENGTH, SCHEME,
    fields#'bar') as bat;

    I know Pig isn't a real programming language, so maybe I'm asking for too
    much. But it's so brittle, and as we increase the number of pig
    scripts, the odds of a change breaking a bunch of stuff go up
    sharply.

    --- Eric Wadsworth
  • Aniket Mokashi at Sep 29, 2010 at 5:16 pm
    http://wiki.apache.org/pig/ParameterSubstitution
    http://hadoop.apache.org/pig/docs/r0.3.0/piglatin.html

    Also, in Pig 0.8 RECORD_TYPE_ALPHA can take its value at run time from
    an alias (like filtered_stuff_threshold) used as a scalar.
    https://issues.apache.org/jira/browse/PIG-1434

    Thanks,
    Aniket

    -----Original Message-----
    From: Saurav Datta
    Sent: Wednesday, September 29, 2010 1:06 PM
    To: pig-user@hadoop.apache.org
    Subject: Re: Magic numbers in my pig scripts

  • Saurav Datta at Sep 29, 2010 at 5:25 pm
    Same here, I was getting to parameter substitution, with the values read
    from a parameter file.

    Here is how you reference the parameters year, month and date in the
    script:
    A = load '/INPUTDIR/$year/$month/$date/input_test.dat' using
    PigStorage(' ') as (field1, field2, field3);

    Here is how you invoke the pig script (in local mode here):
    pig -param_file param_file.cfg -x local testParamFile.pig


    And below are the contents of param_file.cfg, in the same
    directory:
    year='2010'
    month='09'
    date='19'

    We are using Pig 0.7.0.
    Let me know if this helps.

    Regards,
    Saurav
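
To make the idea concrete for the RECORD_TYPE_ALPHA case, here is a minimal Java sketch of what `$name` substitution amounts to: read name=value lines, then replace each `$name` in the script text before it runs. This is a toy illustration, not Pig's actual preprocessor; the class and method names are hypothetical, and stripping the single quotes around values is an assumption of the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Toy illustration of $name substitution; NOT Pig's real preprocessor. */
final class ParamSubstitutionSketch {

    // Matches $name, where name starts with a letter or underscore.
    private static final Pattern PARAM =
        Pattern.compile("\\$([A-Za-z_][A-Za-z0-9_]*)");

    /** Parses name=value lines, skipping blank lines and lines starting with '#'. */
    static Map<String, String> parseParams(String fileContents) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        for (String rawLine : fileContents.split("\n")) {
            String line = rawLine.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq < 0) {
                throw new IllegalArgumentException("Not a name=value line: " + line);
            }
            String value = line.substring(eq + 1).trim();
            // Assumption of this sketch: surrounding single quotes are dropped.
            if (value.length() >= 2 && value.startsWith("'") && value.endsWith("'")) {
                value = value.substring(1, value.length() - 1);
            }
            params.put(line.substring(0, eq).trim(), value);
        }
        return params;
    }

    /** Replaces every $name in the script text; fails on an undefined name. */
    static String expand(String script, Map<String, String> params) {
        Matcher m = PARAM.matcher(script);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = params.get(m.group(1));
            if (value == null) {
                throw new IllegalArgumentException("Undefined parameter: " + m.group(1));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

With a parameter file containing the line RECORD_TYPE_ALPHA=23, expanding `FILTER stuff BY record_type == $RECORD_TYPE_ALPHA;` yields the literal 23 in the statement, so the value lives in one file shared by every script that is run with it.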
  • Matthew Smith at Sep 29, 2010 at 6:12 pm
    Maybe this is off topic, but I handled it in Java code with a parameter
    array.

    In MAIN (or UI, Input, etc.):

    String[] params = new String[2];

    params[0] = "date";
    params[1] = "filter_regex";
    runScript(params, server, inputPath, outputPath);

    In runScript(String[] params, PigServer server, String inputPath, String
    outputPath):

    // Needs: import org.apache.pig.PigServer;
    // Each Pig Latin statement is built as a string and registered on the server instance.
    server.registerQuery("data = LOAD '" + inputPath + "' USING "
        + "PigStorage('|') AS (date:chararray, comment:chararray);");
    server.registerQuery("filtered = FILTER data BY date == '" + params[0]
        + "' AND comment == '" + params[1] + "';");
    ....


    Just a thought...

    Matt
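
Applied to the magic-number example from the original post, the same approach might look like the sketch below: the values live in one Java class and the statement text is generated from them. The class names, the helper method and the constant are hypothetical; only the statement shape comes from the thread. The resulting string is what would be handed to registerQuery on a PigServer instance, which is left out so the sketch stays self-contained.

```java
/** Hypothetical central home for values the scripts currently hard-code. */
final class RecordTypes {
    static final int RECORD_TYPE_ALPHA = 23;

    private RecordTypes() {
        // Constants holder; not meant to be instantiated.
    }
}

/** Builds Pig Latin statement text from named constants. */
final class FilterStatements {

    private FilterStatements() {
    }

    /**
     * Returns a FILTER statement comparing record_type to the given value.
     * The caller would pass the result to registerQuery on its PigServer.
     */
    static String filterByRecordType(String outAlias, String inAlias, int recordType) {
        return outAlias + " = FILTER " + inAlias
            + " BY record_type == " + recordType + ";";
    }
}
```

Calling `FilterStatements.filterByRecordType("filtered_stuff", "stuff", RecordTypes.RECORD_TYPE_ALPHA)` produces the original statement with 23 filled in, and a change to the constant reaches every generated script at once. The trade-off is that the scripts stop being plain .pig files.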

  • Eric Wadsworth at Sep 29, 2010 at 10:33 pm
    Piggers,

    Parameter substitution isn't really what I'm needing. After some
    discussion with my co-workers, it looks like the best feature would
    really be sort of a pre-processor. Basically, insert a line in your pig
    script that would "include" another pig script, right there. Then that
    other pig script could contain defines, code, whatever. This would allow
    us to build a hierarchy of scripts, where we could tweak some defines at
    the top level, and the results would be consumed by the lower levels.

    --- Eric Wadsworth
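
The "include" idea can be sketched outside Pig as a small pre-processing step that splices one script into another before the result is handed to Pig. The following is a minimal sketch under stated assumptions: the `#include` directive is made up for this illustration (it is not Pig syntax), the scripts are held in an in-memory map rather than read from disk, and the class name is hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;

/** Toy include pre-processor; the "#include" directive is hypothetical. */
final class IncludeSketch {

    private static final String DIRECTIVE = "#include ";

    private IncludeSketch() {
    }

    /** Returns the named script with every #include line replaced by that script's text. */
    static String expand(String name, Map<String, String> scripts) {
        StringBuilder out = new StringBuilder();
        expandInto(name, scripts, new ArrayDeque<String>(), out);
        return out.toString();
    }

    private static void expandInto(String name, Map<String, String> scripts,
                                   Deque<String> inProgress, StringBuilder out) {
        // Guard against a script that (directly or indirectly) includes itself.
        if (inProgress.contains(name)) {
            throw new IllegalStateException("Circular include: " + name);
        }
        String text = scripts.get(name);
        if (text == null) {
            throw new IllegalArgumentException("No such script: " + name);
        }
        inProgress.push(name);
        for (String line : text.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith(DIRECTIVE)) {
                // Splice the included script in place of the directive line.
                String included = trimmed.substring(DIRECTIVE.length()).trim();
                expandInto(included, scripts, inProgress, out);
            } else {
                out.append(line).append('\n');
            }
        }
        inProgress.pop();
    }
}
```

A top-level script can then begin with a line such as `#include constants.pig`, and nested includes give the hierarchy described above: definitions at the top level flow into everything spliced in below them.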
  • Thejas M Nair at Sep 29, 2010 at 11:00 pm
    Support for functions, as part of the Turing-complete Pig effort, should help (it is in the early design stages):
    http://wiki.apache.org/pig/TuringCompletePig

    -Thejas


  • Dmitriy Ryaboy at Sep 30, 2010 at 8:31 pm
    Eric, check out piglet:
    http://github.com/iconara/piglet



Discussion Overview
group: user @
categories: pig, hadoop
posted: Sep 29, '10 at 5:01p
active: Sep 30, '10 at 8:31p
posts: 9
users: 6
website: pig.apache.org
