at Jul 9, 2011 at 6:05 pm
On Fri, Jul 8, 2011 at 2:48 PM, Grant Ingersoll wrote:
But are you keeping member variables or do you put everything in the
Anything that you want to remember needs to be put in the context.
PIG makes sure that the constructor is called with the same arguments on
front-end and backend. In addition, for loaders and storage, setContext API
is invoked with the same context on frontend and backend.
Anything else you need to put in the context. If something is derived from
constructor arguments, you don't need to put it into the context, e.g.
not sure if I understood the question correctly, but PIG does not transfer
your object, so what you store in the member variables does not matter.
On Jul 8, 2011, at 3:21 PM, Raghu Angadi wrote:
yes. that is exactly how HBaseStorage uses context.
On Fri, Jul 8, 2011 at 10:19 AM, Jeremy Hanna <
In CassandraStorage, we had been using some load/store URL specific
information (keyspace, column family names) to make the
UDFContext.properties key unique, but with what Grant said was in the
we just wrote a patch to instead use the udf context signatures for
keys when setting and getting those property values. Is that the way to
then? I'm setting those as member variables and then using them later.
public void setUDFContextSignature(String signature) {
    this.loadSignature = signature;
}

/* StoreFunc methods */
public void setStoreFuncUDFContextSignature(String signature) {
    this.storeSignature = signature;
}
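
As an illustration of the signature-as-key idea described above, here is a minimal sketch. It does not call Pig: the static `shared` Properties object is a hypothetical stand-in for the shared UDF context, and `FakeStorage`, `remember`, and `recall` are made-up names. Only the setter mirrors the one quoted in this message.

```java
import java.util.Properties;

class SignatureKeyDemo {
    // Hypothetical stand-in for the shared UDF context: one Properties
    // object visible to every load/store instance in the script.
    static final Properties shared = new Properties();

    // Hypothetical storage class; only the key-building logic matters here.
    static class FakeStorage {
        private String loadSignature;

        // Mirrors the setter in the thread: remember the signature we are given.
        void setUDFContextSignature(String signature) {
            this.loadSignature = signature;
        }

        // Prefix every property name with the signature so that two
        // instances never write to the same key.
        private String key(String name) {
            return loadSignature + "." + name;
        }

        void remember(String name, String value) {
            shared.setProperty(key(name), value);
        }

        String recall(String name) {
            return shared.getProperty(key(name));
        }
    }

    public static void main(String[] args) {
        FakeStorage a = new FakeStorage();
        FakeStorage b = new FakeStorage();
        a.setUDFContextSignature("sig-a");
        b.setUDFContextSignature("sig-b");
        a.remember("schema", "name,age");
        b.remember("schema", "id,score");
        // Each instance reads back its own value despite the shared store.
        System.out.println(a.recall("schema"));
        System.out.println(b.recall("schema"));
    }
}
```

The point of the sketch is that the member variable only holds the signature, which is handed to the instance again on the backend; the values that must survive go into the shared store under a signature-scoped key.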
On Jul 8, 2011, at 7:24 AM, Grant Ingersoll wrote:
What is the guidance here on using member variables when implementing
UDFs and passing properties? That is, what are the semantics for using
to store properties for a UDF instance? The docs talk a lot about
sure that no side effects happen from multiple calls to a UDF instance,
it is not clear whether that means it's doing things like changing the
Location for a given instance of a UDF or just calling it multiple
PigStorage suggests not (since it keeps a member var location), but the
UDFContext docs suggest that one keep all state in the UDFContext under
another case where this has reared its head in an improper
On Jul 7, 2011, at 3:24 AM, Jeremy Hanna wrote:
On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna <
On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:
I think this is the same problem we were having earlier: http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4
One workaround is to use defines to explicitly create different
instances of your UDF, and use them separately.. it's ugly but it
I tried doing something like:
define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
This still does not work since you can't distinguish the two. The way
thinking of doing this is to let the user pass in some unique string as a
substitute for context:
define ToCassandraBag1 ToCassandraBag('1');
define ToCassandraBag2 ToCassandraBag('2');
Ah yes. I had misunderstood. Thanks for the clarification. Now the
pig docs also make more sense in the Passing Configurations to UDFs
"The UDF can pass its constructor arguments, or some other identifying
strings. This allows each instantiation of the UDF to have a different
properties object thus avoiding name space collisions between
of the UDF."
and the HBaseStorage example was helpful to see that in action.
Thanks both to Raghu and Dmitriy.
inside the UDF, you would use this arg to make a 'contextString' (see
HBaseStorage.java for example use) to store any state.
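
A rough sketch of that idea, for illustration only. Pig's `UDFContext.getUDFProperties(Class, String[])` is what hands out a separate properties object per class and argument list; the code below does not call it and instead imitates it with a plain map so the example runs on its own. `ContextStringDemo`, `propertiesFor`, and `ToBag` are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

class ContextStringDemo {
    // Hypothetical stand-in for the UDF context: one Properties object
    // per (class name, constructor arguments) key.
    static final Map<String, Properties> context = new HashMap<>();

    static Properties propertiesFor(Class<?> c, String[] args) {
        String key = c.getName() + Arrays.toString(args);
        return context.computeIfAbsent(key, k -> new Properties());
    }

    // Hypothetical UDF that takes a unique string in its constructor,
    // as in: define ToCassandraBag1 ToCassandraBag('1');
    static class ToBag {
        private final String[] ctorArgs;

        ToBag(String id) {
            this.ctorArgs = new String[] { id };
        }

        // All state goes through the properties selected by the ctor args.
        Properties props() {
            return propertiesFor(ToBag.class, ctorArgs);
        }
    }

    public static void main(String[] args) {
        ToBag first = new ToBag("1");
        ToBag second = new ToBag("2");
        first.props().setProperty("fields", "name,age");
        second.props().setProperty("fields", "id,score");

        // A new instance built with the same argument (as happens on the
        // backend) finds the state stored by the earlier instance.
        ToBag firstAgain = new ToBag("1");
        System.out.println(firstAgain.props().getProperty("fields"));
        System.out.println(second.props().getProperty("fields"));
    }
}
```

Because the key is derived from the constructor arguments, nothing about the key itself has to be stored: any instance constructed with the same arguments can rebuild it.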
ideally UDFs shouldn't have to do this. They should have the same
info that is available for loaders and storage.
at the top and then using each one only once. That still produces
error. I guess in this case we'll just have to require the field
entered into the UDF and it won't introspect them. Ah well. Would
to be able to use it but I don't really see another way around this
the shared UDF context.
On Wed, Jul 6, 2011 at 9:42 AM, Jeremy Hanna <
We have a UDF that introspects the output schema and gets the
names there and uses them in the exec method.
A simple example is found here:
It takes the relation's aliases and uses them in the output so
user doesn't have to specify them. However we've been noticing that
have more than one ToCassandraBag call in a pig script, sometimes
run at the same time and the key is the same in the UDF context:
cassandra.input_field_schema. So we think there is an issue there
out of bounds exceptions when running the script, but when running
one at a time, it doesn't do that).
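
To make the suspected collision concrete, here is a small sketch under stated assumptions: a single Properties object stands in for the shared UDF context, and `recordSchema` and `fieldNames` are hypothetical helpers, not the actual ToCassandraBag code. Only the key name is taken from this message.

```java
import java.util.Properties;

class SharedKeyCollisionDemo {
    // The fixed key named in the message; every call site uses the same one.
    static final String KEY = "cassandra.input_field_schema";

    // Hypothetical front-end step: a call site records its field names.
    static void recordSchema(Properties shared, String fieldNames) {
        shared.setProperty(KEY, fieldNames);
    }

    // Hypothetical back-end step: read the field names back for exec.
    static String[] fieldNames(Properties shared) {
        return shared.getProperty(KEY).split(",");
    }

    public static void main(String[] args) {
        Properties shared = new Properties();
        recordSchema(shared, "rowkey,name,age"); // first call site
        recordSchema(shared, "rowkey,score");    // second call site overwrites it

        // The first call site expects three names but now sees two, so
        // indexing by its own tuple positions can run past the end.
        String[] names = fieldNames(shared);
        System.out.println(names.length);
    }
}
```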
Is there a right way to do this parameter passing so that we don't
these errors when multiple calls exist?
We thought of using the schema hash code as a suffix (e.g.
cassandra.input_field_schema.12344321) but we don't have access to
schema in the exec method.
We thought of having the first parameter of the input tuple be a
name that the script specifies, like 'filename.relationalias' as a
convention to make them unique to the file. However in the
don't have access to the input tuple, just the schema itself, so it
get that value in there.
Any ideas on how to make this so it doesn't stomp on each other
the pig script? Is there a best way to do that?