Grokbase Groups Pig dev August 2010
[ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12899965#action_12899965 ]

Aniket Mokashi commented on PIG-506:
------------------------------------

A wiki page explaining the details of the specification and implementation has been uploaded at http://wiki.apache.org/pig/NativeMapReduce
Does pig need a NATIVE keyword?
-------------------------------

Key: PIG-506
URL: https://issues.apache.org/jira/browse/PIG-506
Project: Pig
Issue Type: New Feature
Components: impl
Reporter: Alan Gates
Assignee: Aniket Mokashi
Priority: Minor
Fix For: 0.8.0

Attachments: NativeImplInitial.patch


Assume a user had a job that broke easily into three pieces. Further assume that pieces one and three were easily expressible in pig, but that piece two needed to be written in map reduce for whatever reason (performance, something that pig could not easily express, a legacy job that was too important to change, etc.). Today the user would either have to use map reduce for the entire job or manually handle the stitching together of pig and map reduce jobs. What if instead pig provided a NATIVE keyword that would allow the script to pass off the data stream to the underlying system (in this case map reduce)? The semantics of NATIVE would vary by underlying system. In the map reduce case, we would assume that this indicated a collection of one or more fully contained map reduce jobs, so that pig would store the data, invoke the map reduce jobs, and then read the resulting data to continue. It might look something like this:
{code}
A = load 'myfile';
X = load 'myotherfile';
B = group A by $0;
C = foreach B generate group, myudf(B);
D = native (jar=mymr.jar, infile=frompig, outfile=topig);
E = join D by $0, X by $0;
...
{code}
This differs from streaming in that it allows the user to insert an arbitrary amount of native processing, whereas streaming allows the insertion of one binary. It also differs in that, for streaming, data is piped directly into and out of the binary as part of the pig pipeline. Here the pipeline would be broken: data would be written to disk, the native block invoked, and the data then read back from disk.
Another alternative is to say this is unnecessary because the user can do the coordination from java, using the PigServer interface to run pig and calling the map reduce job explicitly. The advantages of the native keyword are that the user need not worry about coordination between the jobs, since pig will take care of it, and that the user can make use of existing java applications without being a java programmer.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Thejas M Nair (JIRA) at Aug 20, 2010 at 9:59 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900880#action_12900880 ]

    Thejas M Nair commented on PIG-506:
    -----------------------------------

    Comments on the patch:
    - In a lot of places there is code like "mro.getClass() == NativeMapReduceOper.class". It is better to use instanceof; the code will be more maintainable if we decide to extend NativeMapReduceOper in the future. (Also, it is slightly more readable.)
    - In MapReduceLauncher.launchPig(), the code to calculate the progress and notify it is in two places; it can be moved to a function.
    - In TestNativeMapReduce.java, a comment mentioning the source of the jar file would be useful.
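
The first review point can be illustrated with a minimal sketch. The classes below are stand-ins written for this example, not Pig's actual operator classes, and TimedNativeMapReduceOper is a hypothetical future subclass. The sketch only shows how an exact-class comparison and instanceof diverge once a subclass exists.

```java
// Stand-in classes for illustration only; these are NOT Pig's real classes.
class MapReduceOper {}

class NativeMapReduceOper extends MapReduceOper {}

// Hypothetical future subclass, the case the review comment anticipates.
class TimedNativeMapReduceOper extends NativeMapReduceOper {}

class OperCheck {
    // Exact-class comparison: true only for NativeMapReduceOper itself,
    // false for any subclass.
    static boolean isNativeByClass(MapReduceOper mro) {
        return mro.getClass() == NativeMapReduceOper.class;
    }

    // instanceof: true for NativeMapReduceOper and for every subclass.
    static boolean isNativeByInstanceof(MapReduceOper mro) {
        return mro instanceof NativeMapReduceOper;
    }

    public static void main(String[] args) {
        MapReduceOper sub = new TimedNativeMapReduceOper();
        System.out.println("getClass() == check: " + isNativeByClass(sub));
        System.out.println("instanceof check:    " + isNativeByInstanceof(sub));
    }
}
```

With the subclass instance, the exact-class check reports false and the instanceof check reports true, which is why code written with the former would silently skip an extended operator.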

    Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, TestWordCount.jar


  • Thejas M Nair (JIRA) at Aug 20, 2010 at 9:59 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900881#action_12900881 ]

    Thejas M Nair commented on PIG-506:
    -----------------------------------

    Additional comment on the patch:
    - In PigStats.java, the null check on js has been removed. Is it no longer necessary?

  • Aniket Mokashi (JIRA) at Aug 20, 2010 at 11:21 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12900911#action_12900911 ]

    Aniket Mokashi commented on PIG-506:
    ------------------------------------

    The current patch doesn't consider the case of the parallel keyword. We can fix this by adding -D mapred.reduce.tasks=n to the params of RunJar. (ToDo)
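
A minimal sketch of the proposed fix follows. NativeJobArgs and withParallel are hypothetical names invented for this example and do not come from the patch. The sketch assumes that the params are the program arguments handed to the native job and that the job's main class parses generic -D options (for example through Hadoop's GenericOptionsParser); if it does not, the option would have no effect.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the actual patch.
class NativeJobArgs {
    // Returns a copy of params with "-D mapred.reduce.tasks=n" prepended
    // when a parallel value was given (parallel > 0); otherwise returns
    // an unchanged copy.
    static List<String> withParallel(List<String> params, int parallel) {
        List<String> out = new ArrayList<String>();
        if (parallel > 0) {
            out.add("-D");
            out.add("mapred.reduce.tasks=" + parallel);
        }
        out.addAll(params);
        return out;
    }

    public static void main(String[] args) {
        // Example program arguments; the directory names are placeholders.
        List<String> params = Arrays.asList("indir", "outdir");
        System.out.println(withParallel(params, 4));
    }
}
```

The option is placed ahead of the other program arguments because generic options are conventionally expected before application-specific ones.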
  • Thejas M Nair (JIRA) at Aug 24, 2010 at 4:18 pm
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12901946#action_12901946 ]

    Thejas M Nair commented on PIG-506:
    -----------------------------------

    The unit tests passed, and I committed the changes. But they fail with the latest changes that switch to the new logical plan. I have added the test cases to the exclude list in build.xml.
    Keeping the jira open until this is fixed.

    Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.patch, TestWordCount.jar


  • Daniel Dai (JIRA) at Aug 27, 2010 at 6:43 am
    [ https://issues.apache.org/jira/browse/PIG-506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903263#action_12903263 ]

    Daniel Dai commented on PIG-506:
    --------------------------------

    The patch looks good. One minor comment: PlanHelper.LoadStoreFinder might be better named PlanHelper.LoadStoreNativeFinder.

    Attachments: NativeImplInitial.patch, NativeMapReduceFinale1.patch, NativeMapReduceFinale2.patch, NativeMapReduceFinale3.patch, PIG-506.2.patch, PIG-506.3.patch, PIG-506.patch, TestWordCount.jar



Discussion Overview
group: dev @
categories: pig, hadoop
posted: Aug 18, '10 at 7:09p
active: Aug 27, '10 at 6:43a
posts: 6
users: 1
website: pig.apache.org
