I am trying to do what seems like it should be a simple task using Pig and a UDF I have written, but I can't figure out the syntax to get it working.


At a high level, I have a UDF that takes a number of strings and checks whether any of them occur in some other strings. I wrote a UDF called CONTAINS_ANY that I would like to initialize with the "needles" and then use Pig to distribute the search via Hadoop.


The problem is that I can't figure out the correct syntax to initialize the UDF with the "needles" and then use the UDF later once it has been initialized. I have tried the following syntax:


define CONTAINS_STRINGS com.my.piggybank.CONTAINS_ANY('string1|string2');


and then invoking it with


filtered = FILTER data BY CONTAINS_STRINGS(haystack);


but this ends up re-initializing the UDF with the strings from the haystack, which is not what I want.


Essentially, I want to write a UDF that works like the built-in MATCHES operator, so I can say something like:


filtered = FILTER data by haystack CONTAINS_STRINGS('string1|string2');


but so far I have been unable to find any useful or relevant documentation on how to accomplish this.


Thanks a lot for any pointers or help anyone can give.


Best,
Zach


  • Daniel Dai at Nov 30, 2010 at 9:41 pm
    Pig always instantiates the UDF using the constructor parameters given in
    the "define" statement. CONTAINS_STRINGS(haystack) only passes haystack to
    CONTAINS_STRINGS.exec(). It will not re-initialize the UDF.

    Daniel
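A minimal sketch of the split Daniel describes: the "define" argument reaches the constructor once, and each record's haystack reaches exec(). This is a dependency-free illustration, not the poster's actual code. A real Pig filter UDF would extend org.apache.pig.FilterFunc and its exec would receive a Tuple; the class name ContainsAny and the substring-matching logic are assumptions made for the example.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for com.my.piggybank.CONTAINS_ANY, written without
// Pig classes so it compiles on its own.
class ContainsAny {
    private final List<String> needles;

    // Receives the string from the define statement, e.g. 'string1|string2'.
    ContainsAny(String pipeSeparatedNeedles) {
        // Pattern.quote makes split treat "|" as a literal, not regex alternation.
        this.needles = Arrays.asList(pipeSeparatedNeedles.split(Pattern.quote("|")));
    }

    // Receives only the per-record value; the needles are never replaced here.
    boolean exec(String haystack) {
        if (haystack == null) {
            return false;
        }
        for (String needle : needles) {
            if (haystack.contains(needle)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, the original pair of statements (define CONTAINS_STRINGS ... ('string1|string2'); followed by FILTER data BY CONTAINS_STRINGS(haystack);) keeps the needles fixed across all records.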

  • Zach Bailey at Nov 30, 2010 at 9:46 pm
    Thanks, Daniel. Of course, you are right. It turns out I had a bug elsewhere in my UDF that was making me think this was not working correctly. After fixing that bug, the "define ..." statement works fine.


    Using the following works great:


    define INITIALIZED_UDF com.my.udfs.UDF(constructor_params)

    Thanks,
    Zach

  • Sheeba George at Dec 2, 2010 at 8:48 am
    Hi Daniel
    I have a related question. My UDF has a constructor that takes 2 parameters:

    public TopUDF(int top, int type) {
        m_cnt = top;
        m_type = type;
    }

    But when I try to instantiate it using the statement below, I get an error.
    Am I doing something wrong?

    define AggregateOthers com.ebay.ewa2.pig.load.TopUDF(3,0);
    Thanks
    Sheeba
  • Mridul Muralidharan at Dec 2, 2010 at 9:16 am
    As of now, UDFs are limited to String constructor parameters only, so the values have to be passed as quoted strings and parsed inside the constructor.


    Regards,
    Mridul
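One way to work within the String-only restriction, sketched under assumptions: the constructor accepts Strings and converts them to int itself. The field names m_cnt and m_type come from Sheeba's snippet; the getters, the trim() calls, and the omission of the Pig base class (a real UDF would extend something like org.apache.pig.EvalFunc) are choices made only to keep the example self-contained.

```java
// Hypothetical String-based variant of the TopUDF constructor from the question.
class TopUDF {
    private final int m_cnt;
    private final int m_type;

    // Both arguments arrive as Strings and are parsed here.
    // Integer.parseInt throws NumberFormatException on non-numeric input.
    TopUDF(String top, String type) {
        m_cnt = Integer.parseInt(top.trim());
        m_type = Integer.parseInt(type.trim());
    }

    // Accessors added only so the parsed values can be inspected.
    int getCount() {
        return m_cnt;
    }

    int getType() {
        return m_type;
    }
}
```

The matching statement would then quote each argument: define AggregateOthers com.ebay.ewa2.pig.load.TopUDF('3', '0');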

Discussion Overview
group: user @
categories: pig, hadoop
posted: Nov 30, '10 at 7:49p
active: Dec 2, '10 at 9:16a
posts: 5
users: 4
website: pig.apache.org
