I have a list of regex patterns that I would like to run against a data
set, and if a record matches a particular pattern in the list, tag it
with the predefined tag for that pattern.
Has this been done, or is it available somewhere?
I've not written any UDFs, and although I'm not against doing so, I
probably don't have the time to write one at this point.

If this isn't available somewhere I can work around this roadblock, but
it would be awesome if someone has cooked up this functionality
somewhere.
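
For readers with the same question, a minimal sketch of such a tagger in
plain Java (the language Pig UDFs are usually written in) might look like
the following. RegexTagger, add and tag are hypothetical names, not part
of Pig or Piggybank. To call this from a Pig script the logic would still
need to be wrapped in a class extending org.apache.pig.EvalFunc, which is
left out here so the example stays self-contained.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Pig or Piggybank.
class RegexTagger {
    // Insertion-ordered map from tag name to its compiled pattern.
    private final Map<String, Pattern> patterns = new LinkedHashMap<>();

    // Register one regex under a tag; returns this so calls can be chained.
    // Registering the same tag twice replaces the earlier pattern.
    RegexTagger add(String tag, String regex) {
        patterns.put(tag, Pattern.compile(regex));
        return this;
    }

    // Return the tags whose pattern matches anywhere in the input,
    // in the order the patterns were registered.
    List<String> tag(String input) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : patterns.entrySet()) {
            if (entry.getValue().matcher(input).find()) {
                tags.add(entry.getKey());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        RegexTagger tagger = new RegexTagger()
            .add("dogs", "terrier|retriever|pit bull")
            .add("birds", "par(akeet|rot)");
        System.out.println(tagger.tag("terrier parakeet"));
        System.out.println(tagger.tag("hello"));
    }
}
```

Each pattern is compiled once when it is registered, so the per-record
cost is one find() call per pattern. That is fine for a short list but
grows linearly with the number of patterns.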

-----Original Message-----
From: Anze
Sent: Monday, December 06, 2010 3:09 PM
To: user@pig.apache.org
Subject: Re: Easy question...difference between this::form and this.form?


Sorry to hijack your question, Jonathan, but while we are at it... :)

Is there a way to tell Pig NOT to add "base_alias::"? Almost half my
code consists of FOREACH... GENERATE statements that just remove these
prefixes.

Thanks,

Anze
On Monday 06 December 2010, Daniel Dai wrote:
After a join, cross, or foreach with flatten, Pig automatically adds the
"base_alias::" prefix. All other cases use ".".

Daniel

Jonathan Coveney wrote:
It's very hard to search for this among the docs because it's so
generic, so I thought I'd ask... I'm sure the answer is painfully easy.

Taking a look at this code that I found online, for example:

--
-- Read in a bag of tuples (timeseries for this example) and divide the
-- numeric column by its maximum.
--
%default DATABAG 'data/timeseries.tsv'

data = LOAD '$DATABAG' AS (month:chararray, count:int);
accumulate = GROUP data ALL;
calc_max = FOREACH accumulate GENERATE FLATTEN(data),
    MAX(data.count) AS max_count;
normalize = FOREACH calc_max GENERATE data::month AS month,
    data::count AS count,
    (float)data::count / (float)max_count AS normed_count;
DUMP normalize;

What purpose does data::month serve versus data.count?

Thanks

  • Zach Bailey at Dec 6, 2010 at 8:38 pm
    Does the UDF have to support regular expressions? If not, I have adapted the Aho-Corasick algorithm [1] to do something similar to what you're asking for. It works as follows:


    1.) Initialize the Aho-Corasick UDF with a list of tokens to search for, and a result to output when that token is found:


    define AC_MATCHER com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');


    2.) Apply AC_MATCHER to each tuple:


    strings = LOAD 'myfile.txt' as (string:chararray);
    tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


    The tagged_strings will then contain the original line along with a bag of matches. For instance if we had the following in myfile.txt:


    terrier parakeet
    hello
    goodbye
    tabby
    pit bull


    After running the commands in #2, tagged_strings would look like this (pardon the ad-hoc notation):


    { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
    { string: 'hello', tags: {} }
    { string: 'goodbye', tags: {} }
    { string: 'tabby', tags: { 'cats' } }
    { string: 'pit bull', tags: { 'dogs' } }
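
The configuration string and output above can be mimicked with a naive
sketch in plain Java. It is not the AHO_CORASICK UDF itself and does not
use the Aho-Corasick algorithm: it checks each token with
String.contains, which is simpler but slower when there are many tokens.
TokenTagger and its methods are hypothetical names, and the parser
assumes a well-formed 'tag=[a|b|c];...' string.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for illustration; not the UDF from this thread.
class TokenTagger {
    // Insertion-ordered map from search token to the tag it produces.
    private final Map<String, String> tokenToTag = new LinkedHashMap<>();

    // Parse a spec such as "dogs=[terrier|pit bull];cats=[tabby]".
    // Assumes every group is well formed: name, '=', bracketed token list.
    TokenTagger(String spec) {
        for (String group : spec.split(";")) {
            int eq = group.indexOf('=');
            String tag = group.substring(0, eq).trim();
            String body = group.substring(eq + 1).trim();
            // Drop the surrounding '[' and ']'.
            body = body.substring(1, body.length() - 1);
            for (String token : body.split("\\|")) {
                tokenToTag.put(token.trim(), tag);
            }
        }
    }

    // Return the distinct tags of all tokens that occur in the line.
    // Naive substring search, one pass over the tokens per line.
    Set<String> tag(String line) {
        Set<String> tags = new LinkedHashSet<>();
        for (Map.Entry<String, String> entry : tokenToTag.entrySet()) {
            if (line.contains(entry.getKey())) {
                tags.add(entry.getValue());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        TokenTagger tagger = new TokenTagger(
            "dogs=[terrier|retriever|pit bull];"
            + "cats=[tabby|mainecoon|tuxedo];"
            + "birds=[parakeet|parrot|cuckoo]");
        for (String line : new String[] {
                "terrier parakeet", "hello", "goodbye", "tabby", "pit bull"}) {
            System.out.println(line + " -> " + tagger.tag(line));
        }
    }
}
```

Returning a set means a tag appears once even if several of its tokens
occur in the same line. Aho-Corasick finds all tokens in a single scan
of the line, which is why it suits large token lists better than this
loop.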


    If this is something you'd be interested in using/extending I can put it up on GitHub for your forking pleasure.

    Cheers,
    Zach

  • Brian Adams at Dec 6, 2010 at 9:15 pm
    That is an interesting approach. I like it. Not ideal, but I think it could work for what I am doing.

    In general I think this would be useful to the community and you should put it on GitHub.
    By all means, I would love to use this.

    I think I could extend/fork this for my needs.

    Thank you Zach!

  • Zach Bailey at Dec 6, 2010 at 9:46 pm
    Great. Let me clean up the code a bit and I'd be happy to post it. I'm definitely open to some alternatives in terms of how this UDF would be initialized, whether it is via a file sitting on HDFS, etc. The current initialization scheme is admittedly crude but was simple to code and works for us for now.

    Cheers,
    Zach

  • Brian Adams at Dec 6, 2010 at 9:56 pm
    No problem.
    Sounds good. And no worries about messy code. We are all well aware that code often lacks elegance when you are just trying to get it out the door.

  • Zach Bailey at Dec 6, 2010 at 10:25 pm
    Here you go:


    https://github.com/znbailey/Dataclip-Piggybank


    The UDF you'll be interested in is here:


    https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java


    I would recommend grabbing the entire repo, as that UDF depends on the repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick.


    Enjoy,
    Zach

  • Dmitriy Ryaboy at Dec 7, 2010 at 2:26 am
    Zach,
    Do you mind contributing that directly to the Piggybank's upcoming home,
    https://github.com/wilbur/Piggybank ?

    D

  • Zach Bailey at Dec 7, 2010 at 7:34 pm
    Dmitriy,


    I'm happy to contribute those UDF classes to that GitHub repo. Are there instructions anywhere on how I should go about doing so? My main concerns are:


    * how to get repo access (should I fork and send a pull request?)
    * style/format/naming restrictions or suggestions (Java code format such as checkstyle; should the UDF names be upper case, camel case, etc.?)
    * Java package restrictions or suggestions (can the UDFs stay in com.dataclip.piggybank or should they be repackaged elsewhere?)
    * how to handle repackaged code/libraries (one of my UDFs depends on a repackaged implementation of the Aho-Corasick algorithm)
    * Pig version compatibility (the repo has 0.6.1, mine are written against 0.7.0)

    Thanks,
    Zach

    On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

    Zach,
    Do you mind contributing that directly to the Piggybank's upcoming home,
    https://github.com/wilbur/Piggybank ?

    D

    On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey wrote:

    Here you go:


    https://github.com/znbailey/Dataclip-Piggybank


    The UDF you'll be interested in is here:



    https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java


    I would recommend grabbing the entire repo as that UDF depends on the
    repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick


    Enjoy,
    Zach

    On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:

    No problem.
    Sounds good. And no worry about messy code. We are all well aware that
    code often elegance when you are just trying to get it out the door.
    -----Original Message-----
    From: Zach Bailey
    Sent: Monday, December 06, 2010 4:46 PM
    To: user@pig.apache.org
    Subject: Re: Regex Match Tagger UDF?


    Great. Let me clean up the code a bit and I'd be happy to post it. I'm
    definitely open to some alternatives in terms of how this UDF would be
    initialized, whether it is via a file sitting on HDFS, etc. The current
    initialization scheme is admittedly crude but was simple to code and works
    for us for now.
    Cheers,
    Zach


    On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

    That is an interesting approach. I like it. Not ideal, but I think it
    could work for what I am doing.
    In general I think that is useful to the community and you should
    github it.
    By all means, I would love to use this.

    I think I could extend/fork this for my need.

    Thank you Zach!

    -----Original Message-----
    From: Zach Bailey
    Sent: Monday, December 06, 2010 3:38 PM
    To: user@pig.apache.org
    Subject: Re: Regex Match Tagger UDF?


    Does the UDF have to support regular expressions? If not, I have
    adapted the Aho-Corasick algorithm [1] to do something similar to what
    you're asking for. It works as follows:

    1.) Initialize the Aho-Corasick UDF with a list of tokens to search
    for, and a result to output when that token is found:

    define AC_MATCHER
    com.my.piggybank.AHO_CORASICK('dogs=[terrier|retriever|pit
    bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]')


    2.) apply the AC_MATCHER to a tuple


    strings = LOAD 'myfile.txt' as (string:chararray); tagged_strings =
    FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


    The tagged_strings will then contain the original line along with a
    bag of matches. For instance if we had the following in myfile.txt:

    terrier parakeet
    hello
    goodbye
    tabby
    pit bull


    after running the commands in #2 tagged_strings would look like
    (pardon the ad-hoc notation):

    { string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
    { string: 'hello', tags: {} }
    { string: 'goodbye', tags: {} }
    { string: 'tabby', tags: { 'cats' } }
    { string: 'pit bull', tags: { 'dogs' } }


    If this is something you'd be interested in using/extending I can put
    it up on github for your forking pleasure.
    Cheers,
    Zach
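
The tagging behaviour described in this message can be sketched in plain Java. This is not the AHO_CORASICK UDF from the thread: it swaps in naive substring matching (`String.contains`) for the Aho-Corasick automaton, and the class name `TokenTagger` and its methods are made up for illustration. The only thing taken from the message is the `tag=[token|token];...` init-string format from step 1.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not the UDF discussed in the thread.
final class TokenTagger {
    // tag -> tokens, kept in the order the tags appear in the spec
    private final Map<String, List<String>> tokensByTag = new LinkedHashMap<>();

    // Parses a spec such as "dogs=[terrier|pit bull];cats=[tabby]".
    static TokenTagger fromSpec(String spec) {
        TokenTagger tagger = new TokenTagger();
        for (String entry : spec.split(";")) {
            int eq = entry.indexOf('=');
            if (eq < 0) {
                continue; // skip entries without a "tag=" part
            }
            String tag = entry.substring(0, eq).trim();
            String body = entry.substring(eq + 1).trim();
            // Strip the surrounding square brackets if present.
            if (body.startsWith("[") && body.endsWith("]")) {
                body = body.substring(1, body.length() - 1);
            }
            List<String> tokens = new ArrayList<>();
            for (String token : body.split("\\|")) {
                String trimmed = token.trim();
                if (!trimmed.isEmpty()) {
                    tokens.add(trimmed);
                }
            }
            tagger.tokensByTag.put(tag, tokens);
        }
        return tagger;
    }

    // Returns every tag that has at least one token occurring in the line.
    // Naive O(tags * tokens) scan; Aho-Corasick would do this in one pass.
    List<String> tag(String line) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : tokensByTag.entrySet()) {
            for (String token : e.getValue()) {
                if (line.contains(token)) {
                    tags.add(e.getKey());
                    break; // one hit per tag is enough
                }
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        TokenTagger tagger = fromSpec(
            "dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]");
        String[] lines = {"terrier parakeet", "hello", "goodbye", "tabby", "pit bull"};
        for (String line : lines) {
            System.out.println(line + " -> " + tagger.tag(line));
        }
    }
}
```

Run against the sample lines from myfile.txt, this yields the same tags as the ad-hoc output above (for example `[dogs, birds]` for "terrier parakeet" and an empty list for "hello"). Wrapping it in a Pig UDF and returning a bag instead of a list is left out here.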


    On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

    I have a list of regex patterns that I would like to run against a
    data set, and if it matches a particular pattern in the list, tag
    it with the predefined tag for that pattern.
    Has this been done, or is it available somewhere?
    I've not written any UDF's, and although I'm not against doing so,
    I probably don't have the time to write one at this point.

    If this isn't available somewhere I can work around this roadblock,
    but it would be awesome if someone has cooked up this functionality
    somewhere.
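
For the regex case asked about here, the core logic is small enough to sketch in plain Java with `java.util.regex`. Everything named below (`RegexTagger`, `add`, `tag`, and the two sample patterns) is hypothetical and only illustrates the idea of "one predefined tag per pattern"; it is not an existing Pig or Piggybank function, and the Pig UDF wrapper around it is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: maps each tag to one precompiled regex.
final class RegexTagger {
    // tag -> compiled pattern, in insertion order
    private final Map<String, Pattern> patternsByTag = new LinkedHashMap<>();

    // Registers a pattern under a tag; compiling once avoids recompiling per record.
    RegexTagger add(String tag, String regex) {
        patternsByTag.put(tag, Pattern.compile(regex));
        return this;
    }

    // Returns the tags of all patterns found anywhere in the input.
    List<String> tag(String input) {
        List<String> tags = new ArrayList<>();
        for (Map.Entry<String, Pattern> e : patternsByTag.entrySet()) {
            if (e.getValue().matcher(input).find()) {
                tags.add(e.getKey());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        RegexTagger tagger = new RegexTagger()
            .add("ip", "\\b\\d{1,3}(\\.\\d{1,3}){3}\\b") // rough IPv4 shape, sample pattern only
            .add("error", "(?i)\\berror\\b");            // case-insensitive word "error"
        System.out.println(tagger.tag("ERROR from 10.0.0.1"));
        System.out.println(tagger.tag("all good"));
    }
}
```

The linear scan over patterns is the simple choice; with a very large pattern list, a multi-pattern matcher would likely scale better.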

    -----Original Message-----
    From: Anze
    Sent: Monday, December 06, 2010 3:09 PM
    To: user@pig.apache.org
    Subject: Re: Easy question...difference between this::form and
    this.form?


    Sorry to hijack your question, Jonathan, but while we are at it...
    :)

    Is there a way to tell Pig NOT to add "base_alias::"? Almost half
    my code consists of FOREACH... GENERATE that just remove these
    prefixes.
    Thanks,

    Anze
    On Monday 06 December 2010, Daniel Dai wrote:

    After join, cross, foreach flatten, Pig will automatically add
    "base_alias::" prefix. All other cases use "."

    Daniel

    Jonathan Coveney wrote:
    It's very hard to search for this among the docs because it's so generic,
    so I thought I'd ask... I'm sure the answer is painfully easy.

    Taking a look at this code that I found online, for example

    --
    -- Read in a bag of tuples (timeseries for this example) and divide the
    -- numeric column by its maximum.
    --
    %default DATABAG 'data/timeseries.tsv'

    data = LOAD '$DATABAG' AS (month:chararray, count:int);
    accumulate = GROUP data ALL;
    calc_max = FOREACH accumulate GENERATE FLATTEN(data),
        MAX(data.count) AS max_count;
    normalize = FOREACH calc_max GENERATE data::month AS month,
        data::count AS count, (float)data::count / (float)max_count AS normed_count;
    DUMP normalize;

    What purpose does data::month serve versus data.count?

    Thanks

  • Dmitriy Ryaboy at Dec 8, 2010 at 12:59 am
    All good questions. I'll put all of this into a readme in the project, and
    on the Pig wiki.
    Thanks for your willingness to contribute!

    0) all contributions should have the apache license
    1) fork and make a pull request. The docs I will write up will include
    something along the lines of "by sending a pull request you implicitly
    confirm that you have the right to release this code under the apache 2.0
    license"
    2) I like camel-cased UDFs and generally follow the standard Sun code
    conventions, though I prefer a two-space indentation. One of the pain points
    for folks contributing to the main piggybank has been an overabundance of
    requirements; I think for wild-west piggybank, we will be a lot more
    lenient. Which has its costs, granted...
    3) no restrictions, though a more generic package would be cool. LinkedIn
    already contributed stuff under com.linkedin so there's precedent. If folks
    feel strongly about the implicit attribution, I am cool with that.
    4) assuming they are apache, just change the ant build file. Ivy preferred
    over checking in jars.
    5) If your UDF is not a Load/Store func, the interface is the same, so it
    doesn't matter. Most likely when I pull in the real piggybank, we'll just
    change version compatibility to 8.

    -D
    On Tue, Dec 7, 2010 at 11:34 AM, Zach Bailey wrote:


    Dmitriy,


    I'm happy to contribute those UDF classes to that Github repo. Are there
    instructions anywhere on how I should go about doing so? Of main concern
    are:


    * how to get repo access (should I fork and do a pull request?),
    * style/format/naming restrictions/suggestions (java code format -
    checkstyle, should the UDFs be upper cased, camel cased, etc.)
    * java package restrictions/suggestions (can the UDFs stay in
    com.dataclip.piggybank or should they be repackaged elsewhere)
    * how to handle repackaged code/libraries (one of my UDFs depends on a
    repackaged implementation of the Aho-Corasick algorithm)
    * pig version compatibility (the repo has 0.6.1, mine are written against
    0.7.0)

    Thanks,
    Zach

    On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

    Zach,
    Do you mind contributing that directly to the Piggybank's upcoming home,
    https://github.com/wilbur/Piggybank ?

    D

    On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <zach.bailey@dataclip.com> wrote:

    Here you go:


    https://github.com/znbailey/Dataclip-Piggybank


    The UDF you'll be interested in is here:


    https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java

    I would recommend grabbing the entire repo as that UDF depends on the
    repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick


    Enjoy,
    Zach

Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 6, '10 at 8:26p
active: Dec 8, '10 at 12:59a
posts: 9
users: 3
website: pig.apache.org
