SAMPLE command should accept parameters
---------------------------------------

Key: PIG-1713
URL: https://issues.apache.org/jira/browse/PIG-1713
Project: Pig
Issue Type: Improvement
Reporter: Viraj Bhat


I have a script which takes in a command line parameter.

{code}
pig -p number=100 script.pig
{code}

The script uses the parameter as follows:

{code}
A = load '/user/viraj/test' using PigStorage() as (a,b,c);

B = SAMPLE A 1/$number;

dump B;
{code}

In realistic use cases, statisticians need to compute the SAMPLE fraction on demand.

Ideally, I would like to compute the sample fraction within a single Pig script, without having to run one script first to get its result and then pass that result to a second script.

Ideal use case:

{code}
A = load '/user/viraj/input' using PigStorage() as (col1, col2, col3);

...
...

W = group X by col1;

Z = foreach Y generate AVG(X);

AA = load '/user/viraj/test' using PigStorage() as (a,b,c);

BB = SAMPLE AA 1/Z;

dump BB;
{code}

Viraj

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Thejas M Nair (JIRA) at Nov 9, 2010 at 1:15 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12929865#action_12929865 ]

    Thejas M Nair commented on PIG-1713:
    ------------------------------------

    Once the first use case is supported (an expression as the SAMPLE argument), the ideal use case will also work automatically, thanks to the 'relation as scalar' feature introduced in PIG-1434. Until this feature is available, a workaround is to use a FILTER statement with a UDF that returns true with the probability given as its argument.
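
The filter-based workaround can be illustrated outside Pig. The following is a minimal Python sketch of the logic only, not a Pig UDF and not part of the original comment: the names `make_sample_filter` and `filter_sample` are hypothetical, and a real implementation would have to be written against Pig's UDF interface.

```python
import random
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def make_sample_filter(probability: float,
                       seed: Optional[int] = None) -> Callable[[object], bool]:
    """Return a predicate that is true with the given probability.

    Hypothetical stand-in for the filter UDF mentioned in the comment.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be in [0, 1]")
    rng = random.Random(seed)

    def keep(_record: object) -> bool:
        # rng.random() is uniform on [0, 1), so this is true with
        # probability `probability`, independently for each record.
        return rng.random() < probability

    return keep


def filter_sample(records: Iterable[T], probability: float,
                  seed: Optional[int] = None) -> Iterator[T]:
    """Keep each record independently with the given probability."""
    keep = make_sample_filter(probability, seed)
    return (record for record in records if keep(record))


if __name__ == "__main__":
    number = 100  # plays the role of the $number parameter
    sampled = list(filter_sample(range(100_000), 1 / number, seed=42))
    print(len(sampled))  # close to 1000, but not exactly 1000 in general
```

Because the probability is an ordinary function argument here, it can come from any expression, which is the point of the workaround. The size of the result varies from run to run.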


  • Olga Natkovich (JIRA) at Nov 13, 2010 at 12:00 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1713:
    --------------------------------

    Fix Version/s: 0.9.0

    A "maybe" for 0.9
  • David Ciemiewicz (JIRA) at Jan 25, 2011 at 5:17 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986521#action_12986521 ]

    David Ciemiewicz commented on PIG-1713:
    ---------------------------------------

    An alternative might be to implement SAMPLE using reservoir sampling. That way you never have to adjust the sampling probability: as long as N is greater than the sample size K, you always get exactly K elements.

    http://en.wikipedia.org/wiki/Reservoir_sampling
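
For illustration, here is a minimal single-pass sketch of basic reservoir sampling (the textbook "Algorithm R"), written in Python rather than as a Pig operator. It is not taken from the thread; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Return k items chosen uniformly at random from a stream of unknown length.

    If the stream has fewer than k items, all of them are returned.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), 10, seed=7)
    print(len(sample))  # 10
```

The sample size is fixed up front, so no sampling probability has to be derived from the input size.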

    Actually, to implement a scalable, parallel version of Reservoir Sampling that would work with Accumulator and Combiner interfaces, Weighted Reservoir Sampling (WRS) is required:

    http://utopia.duth.gr/~pefraimi/research/data/2007EncOfAlg.pdf
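
One way to see why a keyed variant fits combiner-style execution is the key-based scheme usually attributed to Efraimidis and Spirakis: each item gets a random key `u ** (1 / w)`, and the sample is the k items with the largest keys. The sketch below is a Python illustration under that assumption, not code from the thread; `partial_sample` and `merge_samples` are hypothetical names, and the mapping onto Pig's Accumulator and Combiner interfaces is not shown.

```python
import heapq
import random
from itertools import chain
from operator import itemgetter
from typing import Iterable, List, Tuple, TypeVar

T = TypeVar("T")
Keyed = Tuple[float, T]  # (random key, item)


def _key(rng: random.Random, weight: float) -> float:
    """Random key u ** (1 / w); larger weights tend to produce larger keys."""
    if weight <= 0:
        raise ValueError("weights must be positive")
    return rng.random() ** (1.0 / weight)


def partial_sample(weighted_items: Iterable[Tuple[T, float]], k: int,
                   rng: random.Random) -> List[Keyed]:
    """Per-partition step: keep the k (key, item) pairs with the largest keys."""
    keyed = ((_key(rng, w), item) for item, w in weighted_items)
    return heapq.nlargest(k, keyed, key=itemgetter(0))


def merge_samples(partials: Iterable[List[Keyed]], k: int) -> List[Keyed]:
    """Combine step: the global top k is the top k of the per-partition top k."""
    return heapq.nlargest(k, chain.from_iterable(partials), key=itemgetter(0))


if __name__ == "__main__":
    rng = random.Random(3)
    # Three partitions; every item has weight 1.0, giving a uniform sample.
    partitions = [[(i, 1.0) for i in range(start, start + 1000)]
                  for start in (0, 1000, 2000)]
    partials = [partial_sample(p, 5, rng) for p in partitions]
    final = merge_samples(partials, 5)
    print([item for _key_value, item in final])
```

Each partial result holds at most k pairs, and merging is associative, so partitions can be combined in any order.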
  • Olga Natkovich (JIRA) at Feb 24, 2011 at 1:35 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1713:
    --------------------------------

    Fix Version/s: (was: 0.9.0)
  • Olga Natkovich (JIRA) at Mar 3, 2011 at 1:11 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1713:
    --------------------------------

    Fix Version/s: 0.10
  • Daniel Dai (JIRA) at Mar 14, 2011 at 10:11 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1713:
    ----------------------------

    Description: (report text unchanged from the original above; the following note was appended)

    LIMIT should support the same use case.
    This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

    Labels: gsoc2011 (was: )
  • Gianmarco De Francisci Morales (JIRA) at Mar 21, 2011 at 6:45 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009280#comment-13009280 ]

    Gianmarco De Francisci Morales commented on PIG-1713:
    -----------------------------------------------------

    To support the simple use case one would simply need to allow expressions in the SAMPLE argument.
    This should mainly require changes to the front-end I assume.
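
As a rough illustration of what "allowing expressions" means for the simple use case, the Python sketch below first substitutes the parameter textually and then evaluates the resulting arithmetic to a probability. It is only a model of the idea, not Pig's parser: the names `eval_arith` and `sample_probability` are hypothetical, and the sketch uses floating-point division, whereas how Pig itself would type `1/100` is not stated in this thread.

```python
import ast
import operator
from string import Template
from typing import Dict

# Arithmetic operators accepted by this sketch.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_arith(expr: str) -> float:
    """Evaluate an expression made only of numbers and + - * /."""

    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return float(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression: " + expr)

    return walk(ast.parse(expr, mode="eval"))


def sample_probability(template: str, params: Dict[str, str]) -> float:
    """Substitute $name parameters, then evaluate to a probability in [0, 1]."""
    expr = Template(template).substitute(params)
    probability = eval_arith(expr)
    if not 0.0 <= probability <= 1.0:
        raise ValueError("sample probability out of range: %r" % probability)
    return probability


if __name__ == "__main__":
    print(sample_probability("1/$number", {"number": "100"}))  # 0.01
```

The range check reflects that the SAMPLE argument is a probability; an expression such as `2/$number` with `number=1` is rejected.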

    For more complex techniques like reservoir sampling, one would need to implement a new (physical?) operator.
    What is the exact scope/goal of the project?

    Maybe it could be split into two parts: supporting sampling with variable arguments first, and adding more complex techniques second?
  • Daniel Dai (JIRA) at Mar 21, 2011 at 7:26 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009323#comment-13009323 ]

    Daniel Dai commented on PIG-1713:
    ---------------------------------

    I think it is better to split this issue into two: one for scalar support, the other for the sampling algorithm.

    The first part is, yes, mostly front-end work.

    For the second part, I think we can allow SAMPLE to take an optional argument. The scope of work is still open, and we need to decide which algorithm to use. Also, AFAIK, Ciemiewicz is already working on reservoir sampling, which we may need to integrate into our framework.
  • Daniel Dai (JIRA) at Mar 21, 2011 at 7:30 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1713:
    ----------------------------

    Description: (report text unchanged from the original above; the appended note now reads as follows)

    This Jira now tracks only the sampling algorithm. PIG-1926 has been opened to track LIMIT/SAMPLE taking a scalar.

    This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

    Summary: SAMPLE command should accept parameters to specify alternative sampling algorithm (was: SAMPLE command should accept parameters)
  • Dmitriy V. Ryaboy (JIRA) at Jun 22, 2011 at 3:29 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13053020#comment-13053020 ]

    Dmitriy V. Ryaboy commented on PIG-1713:
    ----------------------------------------

    When making changes to how SAMPLE works, please keep in mind PIG-2014 (letting the optimizer push this operator around is clearly dangerous).
  • Olga Natkovich (Updated) (JIRA) at Oct 5, 2011 at 12:07 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich updated PIG-1713:
    --------------------------------

    Fix Version/s: (was: 0.10)
  • Daniel Dai (Updated) (JIRA) at Mar 13, 2012 at 11:59 pm
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1713:
    ----------------------------

    Labels: gsoc2012 (was: gsoc2011)
  • Daniel Dai (Updated) (JIRA) at Mar 14, 2012 at 5:10 am
    [ https://issues.apache.org/jira/browse/PIG-1713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Daniel Dai updated PIG-1713:
    ----------------------------

    Description: (report text and PIG-1926 note unchanged; only the Google Summer of Code reference changed)

    This is a candidate project for Google Summer of Code 2012. More information about the program can be found at https://cwiki.apache.org/confluence/display/PIG/GSoc2012

    was:
    This is a candidate project for Google Summer of Code 2011. More information about the program can be found at http://wiki.apache.org/pig/GSoc2011

