Grokbase Groups: Pig user, July 2011
We have a UDF that introspects the output schema, gets the field names there, and uses them in the exec method.

The UDF is found here: https://github.com/jeromatron/pygmalion/blob/master/udf/src/main/java/org/pygmalion/udf/ToCassandraBag.java

A simple example is found here: https://github.com/jeromatron/pygmalion/blob/master/scripts/from_to_cassandra_bag_example.pig

It takes the relation's aliases and uses them in the output so that the user doesn't have to specify them. However, we've been noticing that if you have more than one ToCassandraBag call in a pig script, sometimes they run at the same time and use the same key in the UDF context: cassandra.input_field_schema. So we think there is an issue there (we get array-out-of-bounds exceptions when running the script, but not when running the statements one at a time in grunt).
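
Roughly, the shape of the UDF is something like this (a simplified sketch of the pattern only, not the actual ToCassandraBag source; see the link above for the real code):

import java.io.IOException;
import java.util.Properties;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.BagFactory;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;
import org.apache.pig.impl.logicalLayer.schema.Schema;
import org.apache.pig.impl.util.UDFContext;

// Simplified, hypothetical sketch of the pattern in question. outputSchema()
// runs on the front end and stashes the input field names under a fixed key;
// exec() runs on the back end and reads them back. Because every instance
// uses the same key, two uses of the UDF in one script overwrite each other's
// field list.
public class ToBagSketch extends EvalFunc<DataBag> {

    private static final String FIELD_KEY = "cassandra.input_field_schema";

    @Override
    public Schema outputSchema(Schema input) {
        StringBuilder names = new StringBuilder();
        for (Schema.FieldSchema field : input.getFields()) {
            if (names.length() > 0) {
                names.append(',');
            }
            names.append(field.alias);
        }
        Properties props =
                UDFContext.getUDFContext().getUDFProperties(this.getClass());
        props.setProperty(FIELD_KEY, names.toString()); // shared key: collision
        return input; // the real UDF builds a proper output schema here
    }

    @Override
    public DataBag exec(Tuple input) throws IOException {
        Properties props =
                UDFContext.getUDFContext().getUDFProperties(this.getClass());
        String[] names = props.getProperty(FIELD_KEY).split(",");
        DataBag bag = BagFactory.getInstance().newDefaultBag();
        for (int i = 0; i < input.size(); i++) {
            Tuple pair = TupleFactory.getInstance().newTuple(2);
            // If another instance stored a shorter field list, names[i] is the
            // ArrayIndexOutOfBoundsException described above.
            pair.set(0, names[i]);
            pair.set(1, input.get(i));
            bag.add(pair);
        }
        return bag;
    }
}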

Is there a right way to do this parameter passing so that we don't get these errors when multiple calls exist?

We thought of using the schema hash code as a suffix (e.g. cassandra.input_field_schema.12344321) but we don't have access to the schema in the exec method.

We thought of having the first parameter of the input tuple be a unique name that the script specifies, like 'filename.relationalias', as a convention to make them unique to the file. However, in outputSchema we don't have access to the input tuple, just the schema itself, so we couldn't get that value in there.

Any ideas on how to make this work so the calls don't stomp on each other within the pig script? Is there a best way to do that?

Thanks!

Jeremy


  • Dmitriy Ryaboy at Jul 6, 2011 at 5:48 pm
    I think this is the same problem we were having earlier:
    http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4

    One workaround is to use defines to explicitly create different
    instances of your UDF and use them separately. It's ugly, but it
    works.

    D
  • Jeremy Hanna at Jul 7, 2011 at 2:23 am

    On Jul 6, 2011, at 12:47 PM, Dmitriy Ryaboy wrote:

    I think this is the same problem we were having earlier:
    http://hadoop.markmail.org/thread/kgxhdgw6zdmadch4

    One workaround is to use defines to explicitly create different
    instances of your UDF and use them separately. It's ugly, but it
    works.
    Thanks Dmitriy.

    I tried doing something like:
    define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
    define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();

    at the top and then using each one only once. That still produces the same error. I guess in this case we'll just have to require that the field names be entered into the UDF and not introspect them. Ah well. It would be nice to be able to use it, but I don't really see another way around this bug with the shared UDF context.
  • Raghu Angadi at Jul 7, 2011 at 4:21 am
    On Wed, Jul 6, 2011 at 7:20 PM, Jeremy Hanna wrote:

    I tried doing something like:
    define ToCassandraBag1 org.pygmalion.udf.ToCassandraBag();
    define ToCassandraBag2 org.pygmalion.udf.ToCassandraBag();
    This still does not work, since you can't distinguish the two. The way I was
    thinking of doing this is to let the user pass in some unique string as a
    substitute for the context:

    define ToCassandraBag1 ToCassandraBag('1');
    define ToCassandraBag2 ToCassandraBag('2');

    Inside the UDF, you would use this arg to make a 'contextString' (see
    HBaseStorage.java for example use) and store any state under it.
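
    An untested sketch of this approach, assuming an EvalFunc along the lines of ToCassandraBag (the class, field, and property names here are hypothetical):

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.logicalLayer.schema.Schema;
    import org.apache.pig.impl.util.UDFContext;

    // The string passed to the constructor in the DEFINE keys a per-instance
    // Properties object, so instances no longer share state.
    public class ToBagWithContext extends EvalFunc<Tuple> {

        private static final String FIELD_KEY = "input_field_schema";
        private final String contextString;

        public ToBagWithContext(String contextString) {
            // Pig re-creates the UDF on the back end with the same constructor
            // arguments, so this value is available in both places.
            this.contextString = contextString;
        }

        private Properties instanceProperties() {
            // Keyed by (class, contextString): each DEFINE'd instance gets its
            // own Properties object inside the shared UDFContext.
            return UDFContext.getUDFContext().getUDFProperties(
                    this.getClass(), new String[] { contextString });
        }

        @Override
        public Schema outputSchema(Schema input) {
            StringBuilder names = new StringBuilder();
            for (Schema.FieldSchema field : input.getFields()) {
                if (names.length() > 0) {
                    names.append(',');
                }
                names.append(field.alias);
            }
            instanceProperties().setProperty(FIELD_KEY, names.toString());
            return input; // real code would build the actual output schema
        }

        @Override
        public Tuple exec(Tuple input) throws IOException {
            String storedNames = instanceProperties().getProperty(FIELD_KEY);
            // ... build the output from storedNames and the input tuple ...
            return input;
        }
    }

    In the script this pairs with the DEFINEs above: each DEFINE passes a different string, so each use gets its own Properties object on both the front end and the back end.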

    Ideally UDFs shouldn't have to do this. They should have the same context
    info that is available to loaders and storers.

    Raghu.

  • Jeremy Hanna at Jul 7, 2011 at 7:25 am

    On Jul 6, 2011, at 11:10 PM, Raghu Angadi wrote:
    This still does not work, since you can't distinguish the two. The way I was
    thinking of doing this is to let the user pass in some unique string as a
    substitute for the context:

    define ToCassandraBag1 ToCassandraBag('1');
    define ToCassandraBag2 ToCassandraBag('2');
    Ah yes. I had misunderstood. Thanks for the clarification. Now the Pig docs also make more sense in the Passing Configurations to UDFs section:
    http://pig.apache.org/docs/r0.8.1/udf.html#Passing+Configurations+to+UDFs
    It says:
    "The UDF can pass its constructor arguments, or some other identifying strings. This allows each instantiation of the UDF to have a different properties object thus avoiding name space collisions between instantiations of the UDF."
    and the HBaseStorage example was helpful to see that in action.

    Thanks both to Raghu and Dmitriy.
  • Grant Ingersoll at Jul 8, 2011 at 12:24 pm
    What is the guidance here on using member variables when implementing UDFs and passing properties? That is, what are the semantics for using them to store properties for a UDF instance? The docs talk a lot about making sure that no side effects happen from multiple calls to a UDF instance, but it is not clear whether that means things like changing the Location for a given instance of a UDF, or just calling it multiple times. PigStorage suggests member variables are fine (since it keeps a location member variable), but the UDFContext docs suggest that one keep all state in the UDFContext under an appropriate signature.

    See also https://issues.apache.org/jira/browse/CASSANDRA-2869 for another case where this has reared its head in an improper implementation.

    -Grant
    --------------------------
    Grant Ingersoll
  • Jeremy Hanna at Jul 8, 2011 at 5:20 pm
    In CassandraStorage, we had been using some load/store-URL-specific information (keyspace and column family names) to make the UDFContext properties key unique. Based on what Grant pointed out in the docs, we just wrote a patch to instead use the UDF context signatures for those keys when setting and getting those property values. Is that the way to go, then? I'm setting the signatures as member variables and then using them later:

    @Override
    public void setUDFContextSignature(String signature)
    {
        this.loadSignature = signature;
    }

    /* StoreFunc methods */
    public void setStoreFuncUDFContextSignature(String signature)
    {
        this.storeSignature = signature;
    }
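
    For reference, the general pattern this hooks into looks roughly like this (an illustrative sketch with hypothetical names, not the actual CassandraStorage patch):

    import java.util.Properties;

    import org.apache.pig.impl.util.UDFContext;

    // The signature Pig hands to setUDFContextSignature() or
    // setStoreFuncUDFContextSignature() is kept in a member variable and later
    // used to key this instance's Properties, instead of keyspace/column
    // family strings.
    public class SignatureKeyedState {

        private String signature;

        public void setUDFContextSignature(String signature) {
            this.signature = signature;
        }

        // Front end (e.g. from setLocation/setStoreLocation): remember
        // something the back end will need.
        public void remember(String key, String value) {
            Properties props = UDFContext.getUDFContext().getUDFProperties(
                    SignatureKeyedState.class, new String[] { signature });
            props.setProperty(key, value);
        }

        // Back end (e.g. from prepareToRead/prepareToWrite): read it back. A
        // different signature yields a different Properties object, so two
        // uses of the loader or storer in one script no longer collide.
        public String recall(String key) {
            Properties props = UDFContext.getUDFContext().getUDFProperties(
                    SignatureKeyedState.class, new String[] { signature });
            return props.getProperty(key);
        }
    }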


  • Raghu Angadi at Jul 8, 2011 at 7:22 pm
    Yes, that is exactly how HBaseStorage uses the context.

  • Grant Ingersoll at Jul 8, 2011 at 9:48 pm
    But are you keeping member variables or do you put everything in the context?

    On Jul 8, 2011, at 3:21 PM, Raghu Angadi wrote:

    Yes, that is exactly how HBaseStorage uses the context.

    --------------------------
    Grant Ingersoll
  • Raghu Angadi at Jul 9, 2011 at 6:05 pm

    On Fri, Jul 8, 2011 at 2:48 PM, Grant Ingersoll wrote:

    But are you keeping member variables or do you put everything in the context?

    Anything that you want to remember needs to be put in the context.
    Pig makes sure that the constructor is called with the same arguments on the
    front end and the back end. In addition, for loaders and storers, the
    context-signature APIs (setUDFContextSignature / setStoreFuncUDFContextSignature)
    are invoked with the same signature on the front end and the back end.

    Anything else you need has to go into the context. If something is derived from
    the constructor arguments, you don't need to put it into the context, for example.

    Not sure if I understood the question correctly, but Pig does not transfer
    your object, so what you store in member variables does not matter.
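
    To illustrate, an untested sketch of these rules with hypothetical names (not code from any of the projects in this thread):

    import java.io.IOException;
    import java.util.Properties;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.schema.Schema;
    import org.apache.pig.impl.util.UDFContext;

    // Pig does not ship the UDF object from the front end to the back end; it
    // constructs a new instance there with the same constructor arguments. So
    // state derived from constructor arguments can live in member variables,
    // while anything computed on the front end (e.g. in outputSchema) must go
    // into the UDFContext if the back end (exec) needs it.
    public class FrontendBackendSketch extends EvalFunc<Tuple> {

        private final String instanceName;  // rebuilt on both ends: safe to keep
        private String computedOnFrontend;  // not carried to the back end

        public FrontendBackendSketch(String instanceName) {
            this.instanceName = instanceName;
        }

        private Properties instanceProperties() {
            return UDFContext.getUDFContext().getUDFProperties(
                    this.getClass(), new String[] { instanceName });
        }

        @Override
        public Schema outputSchema(Schema input) {  // runs on the front end
            computedOnFrontend = String.valueOf(input.size());
            instanceProperties().setProperty("field.count", computedOnFrontend);
            return input;
        }

        @Override
        public Tuple exec(Tuple input) throws IOException {  // runs on the back end
            // The UDFContext copy survives because Pig serializes the context
            // into the job configuration; the plain member variable generally
            // does not, since this instance was constructed fresh back here.
            String fieldCount = instanceProperties().getProperty("field.count");
            Tuple out = TupleFactory.getInstance().newTuple(2);
            out.set(0, fieldCount);
            out.set(1, computedOnFrontend); // typically null on the back end
            return out;
        }
    }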

    Raghu.

