All good questions. I'll put all of this into a readme in the project, and
on the Pig wiki.
Thanks for your willingness to contribute!

0) All contributions should carry the Apache license.
1) Fork and make a pull request. The docs I will write up will include
something along the lines of "by sending a pull request you implicitly
confirm that you have the right to release this code under the Apache 2.0
license".
2) I like camel-cased UDFs and generally follow the standard Sun code
conventions, though I prefer a two-space indentation. One of the pain points
for folks contributing to the main piggybank has been an overabundance of
requirements; I think for wild-west piggybank, we will be a lot more
lenient. Which has its costs, granted...
3) no restrictions, though a more generic package would be cool. LinkedIn
already contributed stuff under com.linkedin so there's precedent. If folks
feel strongly about the implicit attribution, I am cool with that.
4) Assuming they are Apache-licensed, just change the ant build file. Ivy is
preferred over checking in jars.
5) If your UDF is not a Load/Store func, the interface is the same, so it
doesn't matter. Most likely when I pull in the real piggybank, we'll just
change version compatibility to 0.8.

-D
On Tue, Dec 7, 2010 at 11:34 AM, Zach Bailey wrote:


Dmitriy,


I'm happy to contribute those UDF classes to that Github repo. Are there
instructions anywhere on how I should go about doing so? Of main concern
are:


* how to get repo access (should I fork and do a pull request?),
* style/format/naming restrictions/suggestions (java code format -
checkstyle, should the UDFs be upper cased, camel cased, etc.)
* java package restrictions/suggestions (can the UDFs stay in
com.dataclip.piggybank or should they be repackaged elsewhere)
* how to handle repackaged code/libraries (one of my UDFs depends on a
repackaged implementation of the Aho-Corasick algorithm)
* pig version compatibility (the repo has 0.6.1, mine are written against
0.7.0)

Thanks,
Zach

On Monday, December 6, 2010 at 9:26 PM, Dmitriy Ryaboy wrote:

Zach,
Do you mind contributing that directly to the Piggybank's upcoming home,
https://github.com/wilbur/Piggybank ?

D

On Mon, Dec 6, 2010 at 2:25 PM, Zach Bailey <zach.bailey@dataclip.com> wrote:

Here you go:


https://github.com/znbailey/Dataclip-Piggybank


The UDF you'll be interested in is here:


https://github.com/znbailey/Dataclip-Piggybank/blob/master/src/java/com/dataclip/piggybank/AHO_CORASICK.java

I would recommend grabbing the entire repo as that UDF depends on the
repackaged version of Aho-Corasick in org/arabidopsis/ahocorasick


Enjoy,
Zach

On Monday, December 6, 2010 at 4:55 PM, Brian Adams wrote:

No problem.
Sounds good. And no worry about messy code. We are all well aware that
code often lacks elegance when you are just trying to get it out the door.
-----Original Message-----
From: Zach Bailey
Sent: Monday, December 06, 2010 4:46 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


Great. Let me clean up the code a bit and I'd be happy to post it. I'm
definitely open to some alternatives in terms of how this UDF would be
initialized, whether it is via a file sitting on HDFS, etc. The current
initialization scheme is admittedly crude but was simple to code and works
for us for now.
Cheers,
Zach


On Monday, December 6, 2010 at 4:15 PM, Brian Adams wrote:

That is an interesting approach. I like it. Not ideal, but I think it
could work for what I am doing.
In general I think that is useful to the community and you should
github it.
By all means, I would love to use this.

I think I could extend/fork this for my need.

Thank you Zach!

-----Original Message-----
From: Zach Bailey
Sent: Monday, December 06, 2010 3:38 PM
To: user@pig.apache.org
Subject: Re: Regex Match Tagger UDF?


Does the UDF have to support regular expressions? If not, I have
adapted the Aho-Corasick algorithm [1] to do something similar to what
you're asking for. It works as follows:

1.) Initialize the Aho-Corasick UDF with a list of tokens to search
for, and a result to output when that token is found:

define AC_MATCHER com.my.piggybank.AHO_CORASICK(
  'dogs=[terrier|retriever|pit bull];cats=[tabby|mainecoon|tuxedo];birds=[parakeet|parrot|cuckoo]');

2.) apply the AC_MATCHER to a tuple


strings = LOAD 'myfile.txt' as (string:chararray);
tagged_strings = FOREACH strings GENERATE string, AC_MATCHER(string) as tags;


The tagged_strings will then contain the original line along with a
bag of matches. For instance if we had the following in myfile.txt:

terrier parakeet
hello
goodbye
tabby
pit bull


after running the commands in #2 tagged_strings would look like
(pardon the ad-hoc notation):

{ string: 'terrier parakeet', tags: { 'dogs', 'birds' } }
{ string: 'hello', tags: {} }
{ string: 'goodbye', tags: {} }
{ string: 'tabby', tags: { 'cats' } }
{ string: 'pit bull', tags: { 'dogs' } }
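
For anyone who wants to see the tagging behaviour without pulling the repo, here is a rough standalone Java sketch of the idea described above. It is not the Dataclip UDF: the class name TagMatcherSketch and its methods are made up for illustration, it has no Pig dependency, and it swaps Aho-Corasick for a naive String.contains loop, so it only shows the spec format and the expected tags, not the efficient multi-pattern search.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the tagging behaviour described in this thread.
 * Assumption: tokens are matched as plain substrings with String.contains,
 * NOT with Aho-Corasick, and nothing here touches the Pig API.
 */
final class TagMatcherSketch {
  // token -> tag, e.g. "terrier" -> "dogs"
  private final Map<String, String> tokenToTag = new LinkedHashMap<>();

  /** Parses a spec such as "dogs=[terrier|pit bull];cats=[tabby]". */
  TagMatcherSketch(String spec) {
    for (String group : spec.split(";")) {
      if (group.trim().isEmpty()) {
        continue; // tolerate a trailing ';'
      }
      int eq = group.indexOf('=');
      if (eq < 0) {
        throw new IllegalArgumentException("Missing '=' in: " + group);
      }
      String tag = group.substring(0, eq).trim();
      String list = group.substring(eq + 1).trim();
      if (list.length() < 2 || !list.startsWith("[") || !list.endsWith("]")) {
        throw new IllegalArgumentException("Expected [a|b|c] in: " + group);
      }
      // Strip the surrounding brackets, then split on the literal '|'.
      String inner = list.substring(1, list.length() - 1);
      for (String raw : inner.split("\\|")) {
        String token = raw.trim();
        if (!token.isEmpty()) {
          tokenToTag.put(token, tag);
        }
      }
    }
  }

  /** Returns the tags whose tokens occur in the line; empty if none match. */
  Set<String> tag(String line) {
    Set<String> tags = new LinkedHashSet<>();
    for (Map.Entry<String, String> entry : tokenToTag.entrySet()) {
      if (line.contains(entry.getKey())) {
        tags.add(entry.getValue());
      }
    }
    return tags;
  }

  public static void main(String[] args) {
    TagMatcherSketch matcher = new TagMatcherSketch(
        "dogs=[terrier|retriever|pit bull];"
            + "cats=[tabby|mainecoon|tuxedo];"
            + "birds=[parakeet|parrot|cuckoo]");
    String[] lines = {"terrier parakeet", "hello", "goodbye", "tabby", "pit bull"};
    for (String line : lines) {
      System.out.println(line + " -> " + matcher.tag(line));
    }
  }
}
```

Running main over the five sample lines should give the same tags as the ad-hoc result above. The naive loop costs one scan per token, which is exactly what Aho-Corasick avoids by matching all tokens in a single pass over the input.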


If this is something you'd be interested in using/extending I can put
it up on github for your forking pleasure.
Cheers,
Zach


On Monday, December 6, 2010 at 3:25 PM, Brian Adams wrote:

I have a list of regex patterns that I would like to run against a
data set, and if it matches a particular pattern in the list, tag
it with the predefined tag for that pattern.
Has this been done, or is it available somewhere?
I've not written any UDFs, and although I'm not against doing so,
I probably don't have the time to write one at this point.

If this isn't available somewhere I can work around this roadblock,
but it would be awesome if someone has cooked up this functionality
somewhere.

-----Original Message-----
From: Anze
Sent: Monday, December 06, 2010 3:09 PM
To: user@pig.apache.org
Subject: Re: Easy question...difference between this::form and this.form?


Sorry to hijack your question, Jonathan, but while we are at it... :)

Is there a way to tell Pig NOT to add "base_alias::"? Almost half
my code consists of FOREACH... GENERATE that just remove these
prefixes.
Thanks,

Anze
On Monday 06 December 2010, Daniel Dai wrote:

After join, cross, foreach flatten, Pig will automatically add
"base_alias::" prefix. All other cases use "."

Daniel

Jonathan Coveney wrote:
It's very hard to search for this among the docs because it's so generic,
so I thought I'd ask... I'm sure the answer is painfully easy.
Taking a look at this code that I found online, for example

--
-- Read in a bag of tuples (timeseries for this example) and divide the
-- numeric column by its maximum.
--
%default DATABAG 'data/timeseries.tsv'

data = LOAD '$DATABAG' AS (month:chararray, count:int);
accumulate = GROUP data ALL;
calc_max = FOREACH accumulate GENERATE FLATTEN(data), MAX(data.count) AS max_count;
normalize = FOREACH calc_max GENERATE data::month AS month, data::count AS count,
  (float)data::count / (float)max_count AS normed_count;
DUMP normalize;

What purpose does data::month serve versus data.count?

Thanks









