Pig user mailing list — March 2011
Thread: PigOutputFormat cannot be cast to OutputFormat
Hello,

First of all thank you all, I really appreciate the help I get here :)
I've now written some UDFs and did some successful local runs.

After copying the data to HDFS and trying to run my Pig file, the job fails.

Looking at the jobtracker I have the following error several times:

Job Setup: Failed <http://shrek01.ethz.ch:50030/jobtasks.jsp?jobid=job_201103031425_0011&type=setup&pagenum=1&state=killed>

java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat cannot be cast to org.apache.hadoop.mapreduce.OutputFormat
at org.apache.hadoop.mapred.Task.initialize(Task.java:413)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:288)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

I tried to use DUMP instead of STORE (using PigStorage()) but the same error remains.

Any idea what I did wrong?

Best,
Will


From: Dmitriy Ryaboy
Sent: Wednesday, March 02, 2011 6:30 PM
To: user@pig.apache.org
Cc: Lai Will
Subject: Re: Shared resources

In practice, I have never seen an EvalFunc get instantiated twice on the same task, except when you are actually using it in two different places (foreach x generate Foo(field1), Foo(field2)), or when things like combiners are involved. In any case, the number of instantiations is << the number of invocations, to the degree that it might as well be one even when it's not actually 1.

Put whatever initialization logic you need into the constructor. You can pass (string) arguments to UDF constructors in Pig using the DEFINE keyword:

define myFoo com.mycompany.Foo('some', 'arguments');

foreach x generate myFoo(field1);

D
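A minimal sketch of what such a UDF might look like on the Java side; the class name, fields, and placeholder logic below are illustrative, not from this thread:

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF: expensive one-time setup lives in the constructor,
// which receives the string arguments from the DEFINE statement.
public class Foo extends EvalFunc<String> {
    private final String mode;

    public Foo(String mode, String resourceDir) {
        this.mode = mode;
        // e.g. build the parser and the resource-file index here,
        // once per instantiation rather than once per exec() call
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        return mode + ":" + input.get(0);  // placeholder transformation
    }
}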

On Wed, Mar 2, 2011 at 9:11 AM, Lai Will wrote:
I understand that there is no inter-task communication at all.
However, my question arises within one task. The documentation says that we should not make any assumptions about how many EvalFunc instances (of the same class) are instantiated.
Therefore I assume that within the same task there might be several instances of my EvalFunc, and if every one of them is doing the parsing of resource files into data structures, a lot of memory and computing power would be wasted. So it's not about inter-task communication but about inter-instance communication.

Thank you for your help.

Best,
Will

-----Original Message-----
From: Alan Gates
Sent: Wednesday, March 02, 2011 5:17 PM
To: user@pig.apache.org
Subject: Re: Shared resources

There is no shared inter-task processing in Hadoop. Each task runs in a separate JVM and is locked off from all other tasks. This is partly because you do not know a priori which tasks will run together on which nodes, and partly for security. Data can be shared by all tasks on a node via the distributed cache. If all your work could be done once on the front end and then serialized to be read later by all tasks, you could use this mechanism to share it. With the code in trunk, UDFs can store data to the distributed cache, though this feature is not in a release yet.

Alan.
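A minimal sketch of the distributed-cache mechanism described above, using the plain MapReduce API of that era (the file path is hypothetical, and the UDF-level hooks mentioned were not yet in a Pig release):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;

public class SharedDataExample {
    // On the front end: register an HDFS file to be copied once to
    // every node that runs tasks of this job.
    static void registerSharedFile(Configuration conf) throws Exception {
        DistributedCache.addCacheFile(new URI("/shared/index.ser"), conf);
    }

    // Inside a task: read the node-local copy instead of redoing the
    // expensive work in every task.
    static Path firstCachedFile(Configuration conf) throws Exception {
        Path[] local = DistributedCache.getLocalCacheFiles(conf);
        return (local == null || local.length == 0) ? null : local[0];
    }
}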
On Mar 2, 2011, at 7:54 AM, Lai Will wrote:

So I still get the redundant work whenever the same clusternode/vm
creates multiple instances of my EvalFunc?
And is it usual to have several instances of the EvalFunc on the same
clusternode/vm?

Will

-----Original Message-----
From: Alan Gates [mailto:gates@yahoo-inc.com]
Sent: Wednesday, March 02, 2011 4:49 PM
To: user@pig.apache.org
Subject: Re: Shared resources

There is no method in the eval func that gets called on the backend
before any exec calls. You can keep a flag that tracks whether you
have done the initialization so that you only do it the first time.

Alan.
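A minimal sketch of the flag pattern described above (the class name and setup placeholders are hypothetical):

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Hypothetical UDF that defers expensive setup to the first exec() call,
// since no per-task setup hook is available on the backend.
public class TransformQuery extends EvalFunc<String> {
    private boolean initialized = false;
    private Object parserAndIndex;  // stand-in for the parser + file index

    @Override
    public String exec(Tuple input) throws IOException {
        if (!initialized) {
            parserAndIndex = buildParserAndIndex();  // runs once per instance
            initialized = true;
        }
        return transform((String) input.get(0));
    }

    private Object buildParserAndIndex() { return new Object(); }  // placeholder
    private String transform(String query) { return query; }       // placeholder
}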
On Mar 2, 2011, at 5:29 AM, Lai Will wrote:

Hello,

I wrote an EvalFunc implementation that


1) Parses a SQL query

2) Scans a folder for resource files and creates an index on
these files

3) According to certain properties of the SQL query, accesses
the corresponding file and creates Java objects holding the relevant
information from the file (for reuse)

4) Does some computation with the SQL query and the information
found in the file

5) Outputs a transformed SQL query

Currently I'm doing local tests without Hadoop and the code works
fine.

The problem I see is that right now I initialize my parser in the
EvalFunc, so that every time it gets instantiated a new instance of
the parser is generated. Ideally only one instance per machine would
be created.
Even worse, right now I create the index and parse the corresponding
resource file once per exec call in the EvalFunc, and therefore do a
lot of redundant computation.

I just don't know where and how to put this shared computation.
Does anybody have a solution for that?

Best,
Will

  • Dmitriy Ryaboy at Mar 3, 2011 at 6:58 pm
    Can you share your loader code?
  • Lai Will at Mar 3, 2011 at 7:50 pm
    Please find it attached.

  • Lai Will at Mar 3, 2011 at 8:30 pm
    Seems that the attachment got lost.
    Anyway, I tried something much simpler, and it seems the problem lies in registering my jar file.
    I'm running

    raw = load 'data/example.xml' using PigStorage();
    dump raw;

    which succeeds, but when running

    register ./laiw.jar;
    raw = load 'data/example.xml' using PigStorage();
    dump raw;

    it fails with

    java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat cannot be cast to org.apache.hadoop.mapreduce.OutputFormat
    at org.apache.hadoop.mapred.Task.initialize(Task.java:413)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:288)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

    in the jobtracker.

  • Lai Will at Mar 3, 2011 at 9:00 pm
    Alright, it's very weird: I solved it now, but I don't know exactly why.
    I removed some resources that I don't need (especially lib/*.jar) from my jar, and now it works. Some class conflict, it seems; presumably one of the bundled jars carried its own copy of the Pig or Hadoop classes and shadowed the cluster's versions.

    Now I have another problem, but I'll open a new thread for that.

    Will

