I am trying to do what seems like it should be a simple task using Pig and a UDF I have written, but I can't figure out the syntax to get it working.
At a high level, I have a set of strings (the "needles") that I want to look for inside some other strings (the "haystacks"). So I wrote a UDF called CONTAINS_ANY that I would like to initialize with the needles, and then use Pig to distribute the search via Hadoop.
The problem is that I can't figure out the correct syntax to initialize the UDF with the needles and then use it later, once it has been initialized. I have tried the following syntax:
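To make the intent concrete, here is a minimal sketch of the matching behavior I am after. It is plain Java with no Pig dependency, and it is not the actual UDF: the class name ContainsAnySketch and the method matches are made up for illustration. In a real Pig UDF this logic would presumably sit in a class extending org.apache.pig.FilterFunc, with the check performed in exec.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the UDF, used only to illustrate the idea.
// Assumption: the needles arrive once, as a single '|'-delimited string.
class ContainsAnySketch {
    private final List<String> needles = new ArrayList<>();

    // Initialize once with the needles, e.g. "string1|string2".
    ContainsAnySketch(String delimitedNeedles) {
        // '|' is a regex metacharacter, so it must be escaped for split().
        for (String needle : delimitedNeedles.split("\\|")) {
            if (!needle.isEmpty()) {
                needles.add(needle);
            }
        }
    }

    // Returns true if the haystack contains at least one needle.
    boolean matches(String haystack) {
        if (haystack == null) {
            return false;
        }
        for (String needle : needles) {
            if (haystack.contains(needle)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ContainsAnySketch udf = new ContainsAnySketch("string1|string2");
        System.out.println(udf.matches("xx string2 yy")); // true
        System.out.println(udf.matches("nothing here"));  // false
    }
}
```

The point of the sketch is the split between the two steps: the needles are supplied once at construction time, and each haystack value is supplied per call afterwards.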
define CONTAINS_STRINGS com.my.piggybank.CONTAINS_ANY('string1|string2');
and then invoking it with:
filtered = FILTER data BY CONTAINS_STRINGS(haystack);
but this ends up re-initializing the UDF with the strings from the haystack, which is not what I want.
Essentially, I want to be able to write a UDF that works like the built-in MATCHES operator, so I can say something like:
filtered = FILTER data by haystack CONTAINS_STRINGS('string1|string2');
but so far I have been unable to find any useful or relevant documentation on how to accomplish this.
Thanks a lot for any pointers or help anyone can give.