Grokbase Groups Pig dev August 2010
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aniket Mokashi updated PIG-506:
-------------------------------

Attachment: NativeImplInitial.patch

The attached patch has an initial implementation of this feature.
Dump, store, and explain work fine, and PigStats are generated properly.

ToDos-
Check for multiquery optimization related tests
Add test cases

Usage-
A = load 'dict.txt';
B = mapreduce 'hadoop-0.20.2-examples.jar' Store A into 'input' Load 'output' `wordcount input output`;
Does pig need a NATIVE keyword?
-------------------------------

Key: PIG-506
URL: https://issues.apache.org/jira/browse/PIG-506
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
Fix For: 0.8.0

Attachments: NativeImplInitial.patch


Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(A);
D = native (jar=mymr.jar, infile=frompig, outfile=topig);
E = join D by $0, X by $0;
...
{code}
This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken: data would be written to disk, the native block invoked, and the data then read back from disk.
Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantage of the native keyword is that the user need not worry about coordination between the jobs; pig will take care of it. Also, the user can make use of existing java applications without being a java programmer.
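For comparison, here is a sketch of how the three-piece script above might look with the mapreduce syntax from the usage note on the initial patch. The main class name (mymainclass) is hypothetical, and passing the input and output paths as its arguments is an assumption modeled on the wordcount usage.
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
-- the bag produced by the group is named after the grouped relation (A)
C = foreach B generate group, myudf(A);
-- pig stores C into 'frompig', runs the jar, then loads 'topig' as D
D = mapreduce 'mymr.jar' Store C into 'frompig' Load 'topig' `mymainclass frompig topig`;
E = join D by $0, X by $0;
{code}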
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Aniket Mokashi (JIRA) at Aug 19, 2010 at 9:24 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aniket Mokashi updated PIG-506:
    -------------------------------

    Attachment: NativeMapReduceFinale1.patch

    Attaching the final patch.
    Includes: MR changes, optimizer-related changes, and test cases for basic MR.
    ToDo: test cases for the optimizer.
  • Aniket Mokashi (JIRA) at Aug 19, 2010 at 10:26 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aniket Mokashi updated PIG-506:
    -------------------------------

    Status: Patch Available (was: Open)
  • Aniket Mokashi (JIRA) at Aug 19, 2010 at 10:36 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aniket Mokashi updated PIG-506:
    -------------------------------

    Attachment: TestWordCount.jar

    We also need to add this jar in lib to get tests working.
    ToDo- CreateTestJarAtRuntime
  • Aniket Mokashi (JIRA) at Aug 20, 2010 at 12:56 am
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aniket Mokashi updated PIG-506:
    -------------------------------

    Attachment: NativeMapReduceFinale2.patch

    Submitting the updated patch
  • Aniket Mokashi (JIRA) at Aug 20, 2010 at 11:27 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Aniket Mokashi updated PIG-506:
    -------------------------------

    Attachment: NativeMapReduceFinale3.patch
  • Thejas M Nair (JIRA) at Aug 23, 2010 at 7:37 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-506:
    ------------------------------

    Attachment: PIG-506.patch

    New patch addresses my comments.
    test-patch results -
    [exec] -1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 10 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] -1 release audit. The applied patch generated 433 release audit warnings (more than the trunk's current 425 warnings).

    The release audit warnings are for the javadoc HTML files.
    I will commit once all unit tests pass.
  • Thejas M Nair (JIRA) at Aug 26, 2010 at 3:57 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-506:
    ------------------------------

    Attachment: PIG-506.2.patch

    PIG-506.2.patch has:
    - Changes to get the mapreduce operator working with the new logical plan.
    - Changes to the LO/PO Native operators: the store and load for the operator are no longer within it; they are part of the plan. As a result, several changes in visitors made for handling the load/store within LONative have been reverted.
    - Fix for reporting failure when the MR job corresponding to the native operator fails.
    - Removed TestTestNativeMapReduce from exclude list in ant target.

    Some issues are still to be fixed, which I will address as part of new JIRAs:
    - PIG-1570: The code path for handling failure in the MR job corresponding to native MR is different and does not have the same behavior.
    - PIG-1571: If the output file for native MR exists, the query does not fail at compile time; it fails only at runtime. This file is loaded in the nested load of the native MR operator, so it should be possible to check for this file.

  • Thejas M Nair (JIRA) at Aug 27, 2010 at 12:45 am
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-506:
    ------------------------------

    Attachment: PIG-506.3.patch

    Updated patch; the earlier patch was missing src/org/apache/pig/newplan/logical/relational/LONative.java.

    test-patch and core tests are successful.

    [exec] +1 overall.
    [exec]
    [exec] +1 @author. The patch does not contain any @author tags.
    [exec]
    [exec] +1 tests included. The patch appears to include 8 new or modified tests.
    [exec]
    [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
    [exec]
    [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
    [exec]
    [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
    [exec]
    [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

  • Thejas M Nair (JIRA) at Aug 27, 2010 at 2:56 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Thejas M Nair updated PIG-506:
    ------------------------------

    Status: Resolved (was: Patch Available)
    Hadoop Flags: [Reviewed]
    Resolution: Fixed

    Patch PIG-506.3.patch, with changes suggested by Daniel, committed to trunk.


Discussion Overview
group: dev @
categories: pig, hadoop
posted: Aug 10, '10 at 10:23p
active: Aug 27, '10 at 2:56p
posts: 10
users: 1
website: pig.apache.org
