Grokbase Groups Pig user June 2010
UDF for generating top xx % of results?
Is there a UDF for generating the top X % of results? For example, in a log
parsing context, it might be the set of search queries that represent the
top 80% of all queries.

I see in the piggybank that there is a TOP function, but that only takes the
top *number* of results, rather than a percentage.

Thanks
Dave Viner


  • Hc busy at Jun 29, 2010 at 5:00 pm
    I think it's impossible to do this within one M/R. You will want to
    implement it in two M/R jobs in Pig, because you have to calculate the
    percentile cutoff in pass 1, and then perform the filter in pass 2.

  • Dave Viner at Jun 29, 2010 at 5:04 pm
    How would I calculate the percentile in one pass? In order to calculate the
    percentile for each item, I need to know the total count. How do I get the
    total count, and then calculate each item's percentile in one pass?

    I don't mind doing multiple passes - I am just not sure how to make the
    calculation.

    Thanks
    Dave Viner

  • Hc busy at Jun 29, 2010 at 5:18 pm
    That's what I tried to say in my last email. I don't believe you can
    calculate the percentiles exactly in just one pass. Writing out the Pig
    for the two-pass algorithm should be easy enough:

    P = group TABLE all;
    U = foreach P generate MIN(x) as min, MAX(x) as max;
    V = foreach U generate min + (max-min)*0.95;

    would give you the 95th percentile cutoff, and you just filter or split by V.
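    For readers of this archive, here is a minimal Python sketch (not part of
    the thread; the function name is hypothetical) of what the Pig above
    computes. Note that it interpolates over the *value range* of the counts,
    which is not the same thing as a rank-based percentile:

    ```python
    # Sketch of the cutoff the Pig snippet computes: min + (max - min) * frac.
    # This is a value-range cutoff, NOT a rank-based percentile, so skewed
    # data can behave surprisingly.
    def range_cutoff(counts, frac=0.95):
        lo, hi = min(counts), max(counts)
        return lo + (hi - lo) * frac

    counts = [1, 2, 2, 3, 100]            # heavily skewed query counts
    cutoff = range_cutoff(counts)          # 1 + 99 * 0.95 = 95.05
    survivors = [c for c in counts if c > cutoff]
    ```

    On the skewed example above only the single outlier survives the filter,
    which is exactly the caveat raised later in the thread.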

  • Dave Viner at Jun 29, 2010 at 10:01 pm
    I don't quite understand this Pig Latin. The piggybank
    function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters
    which are compared. Here's the sample I'm trying, using the tutorial
    excite log as the input.

    X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
    Z = FILTER X BY query is not null;
    A = GROUP Z BY query;
    B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
    U = FOREACH B GENERATE *, MIN(count) as min:long, MAX(count) as max:long;

    This doesn't seem to work at all. It dies with this error:
    ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other
    Field Schema: min: long

    Changing the min:long and max:long to doubles (as suggested by the error
    message) causes this error:
    ERROR 1045: Could not infer the matching function for
    org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
    explicit cast.

    What am I missing in using the sample code you've provided? I can't seem to
    get it to work...

    Thanks for your help.
    Dave Viner

  • Dave Viner at Jun 29, 2010 at 10:24 pm
    Actually, I've gotten the first half of the code to work now. Here's how
    it looks:


    X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS (user:chararray, time:long, query:chararray);
    Z = FILTER X BY query is not null;
    A = GROUP Z BY query;
    B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
    C = GROUP B ALL;
    U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
    V = FOREACH U GENERATE min + (max-min)*0.95;

    V is appropriately set... but I can't perform the final step of actually
    filtering the values by that count.

    The simple approach:
    TOP = FILTER B BY count > V;

    doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
    {query: chararray,count: long}

    Same with SPLIT.

    How do I filter or split on the value of V?

    Dave Viner
  • Aniket Mokashi at Jun 29, 2010 at 11:01 pm
    Hi Dave,

    TOP = FILTER B BY count > V; would not work right now with Pig. Aliases
    can be cast to scalars once https://issues.apache.org/jira/browse/PIG-1434
    is committed. As far as I know, there is no simple way of using values
    generated by MapReduce jobs directly in Pig jobs; you will need a UDF to
    do that. One dirty approach is to store this value in a location with a
    STORE, and then later use it to generate a new file containing $0 and V
    and filter with $0 > $1.
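
    A rough in-memory Python sketch of the workaround being described here
    (purely illustrative; the function names are hypothetical): persist the
    scalar cutoff, then pair it with every row so a plain per-row $0 > $1
    filter can apply it.

    ```python
    # Illustrates the "store the scalar, then attach it to every row" trick:
    # each (count, cutoff, query) triple can then be filtered with an
    # ordinary $0 > $1 comparison, no scalar-from-alias support needed.
    def pair_with_cutoff(rows, cutoff):
        # rows: iterable of (query, count) tuples
        return [(count, cutoff, query) for query, count in rows]

    def filter_pairs(paired):
        return [(query, count) for count, cutoff, query in paired if count > cutoff]

    rows = [("pig", 40), ("hadoop", 10), ("udf", 3)]
    kept = filter_pairs(pair_with_cutoff(rows, 9.5))
    ```

    With a cutoff of 9.5 only the first two rows survive; in real Pig the
    pairing step would come from the stored file rather than a constant.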

    Thanks,
    Aniket

  • Dave Viner at Jun 29, 2010 at 11:15 pm
    Hi Aniket,

    Is it possible to pass an alias to a UDF? In other words, could I do:

    TOP = FILTER B BY myCompare(count, V)

    or something similar?

    I don't quite understand your idea of using STORE. If I do:

    STORE V into 'threshold';
    XX = LOAD 'threshold' AS (threshold:long);

    How can I filter the count by XX? Seems like I'm back in the same
    situation.

    Do you mean doing the STORE(), then directly modifying the files themselves
    before doing the LOAD() so that it contains one row for each value of B,
    with the threshold value V appended to the end?

    Dave Viner
  • Thejas Nair at Jun 29, 2010 at 11:14 pm
    This Pig query will not give you the "top xx% of results". I.e., if xx = 50%
    and results = 200, you may not get 100 rows in the output.
    If you really want "top xx% of results", you need to do something like the
    following.
    (If you just want rows with "count > min(count) + (max(count) - min(count))*0.95",
    you can store and use the value of V the way TOTAL_ROWS is stored in this
    example.)

    Pig query 1
    -----------
    L = load 'file' as (query, count); -- assuming count is already there, for simplicity
    O = order L by count; -- order by the column that defines 'top'
    G = group L all;
    TOTAL_ROWS = foreach G generate COUNT(L);
    store O into 'file.ordered';

    -- at present, there is no way to use TOTAL_ROWS directly to filter the results
    store TOTAL_ROWS into 'total_rows.txt';


    Pig query 2
    -----------
    -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
    -- the result stored in total_rows.txt
    L = load 'file.ordered';
    LIM = limit L $TOP_ROWS;
    store LIM into 'output';
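
    For intuition, an in-memory Python sketch (illustrative only, names are
    hypothetical) of what the two queries do end to end: order by count,
    count the rows, then keep the top fraction of rows by rank rather than
    applying a value cutoff.

    ```python
    # Emulates the two Pig queries in memory: ORDER by count, compute the
    # total row count, then LIMIT to the top fraction of rows. The selection
    # is rank-based (a fixed number of rows), not value-based.
    def top_fraction(rows, frac):
        # rows: iterable of (query, count) tuples
        ordered = sorted(rows, key=lambda r: r[1], reverse=True)  # 'top' first
        top_rows = int(len(ordered) * frac)                       # TOP_ROWS param
        return ordered[:top_rows]

    rows = [("a", 5), ("b", 1), ("c", 3), ("d", 2)]
    top_half = top_fraction(rows, 0.5)
    ```

    Splitting this into two passes in Pig is what forces the scalar (the row
    count) through an external file between the queries.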


  • Dave Viner at Jun 29, 2010 at 11:22 pm
    Hi Thejas,

    Thanks for this suggestion. Just to make sure I understand the strategy
    here:

    Pig query 1 outputs 2 files (or directories w/ files in them):
    "file.ordered" which contains the data set (query, count) in ascending order
    of count; and "total_rows" which contains 1 line with the total number of
    rows.

    Then, I need a small script that reads the value of "total_rows", multiplies
    that value by the desired percentile (0.95 in the example), and invokes pig
    a second time, passing the "TOP_ROWS" value as a define on the command line.
    The second pig query outputs the desired file in "output" which contains
    only those rows which have a count above the given threshold.

    Is that correct?

    Thanks
    Dave Viner


  • Thejas Nair at Jun 30, 2010 at 12:20 am

    On 6/29/10 4:22 PM, "Dave Viner" wrote:

    Hi Thejas,

    Thanks for this suggestion. Just to make sure I understand the strategy
    here:

    Pig query 1 outputs 2 files (or directories w/ files in them):
    "file.ordered" which contains the data set (query, count) in ascending order
    of count; and "total_rows" which contains 1 line with the total number of
    rows.

    Then, I need a small script that reads the value of "total_rows", multiplies
    that value by the desired percentile (0.95 in the example), and invokes pig
    a second time, passing the "TOP_ROWS" value as a define on the command line.
    The second pig query outputs the desired file in "output" which contains
    only those rows which have a count above the given threshold.

    Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is
    not all rows that have a count above a given threshold. So it is possible
    that a few rows that are not in the output have the same count value as
    the last result in "output".
    For example, if the input has 100 rows with 1 as the count value, you will
    get an arbitrary 95 of those rows in the output.

    -Thejas
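
    A tiny Python illustration (not from the thread) of this tie caveat:
    LIMIT keeps exactly k rows, so rows tied at the boundary are cut
    arbitrarily rather than all included or all excluded.

    ```python
    # 100 queries, every one with count 1: "top 95%" by row rank keeps an
    # arbitrary 95 of them, even though all 100 rows are tied on count.
    rows = [("q%d" % i, 1) for i in range(100)]
    k = len(rows) * 95 // 100            # 95% of the rows, integer arithmetic
    kept, dropped = rows[:k], rows[k:]
    # every dropped row has the same count as the last kept row
    ```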


    Is that correct?

    Thanks
    Dave Viner


    On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair wrote:

    This pig query will not give you the "top xx% of results". I.e., if xx =
    50% and results = 200, you may not get 100 rows in output.
    If you really want "top xx% of results", you need to do something like the
    following -
    (if you just want rows that have "count < min(count) + (max(count) -
    min(count))*0.95", you can store and use the value of V the way TOTAL_ROWS
    is stored in this example)

    Pig query 1
    -----------
    L = load 'file' as (query, count); -- assuming that count is already there for simplicity
    O = order L by count; -- order it by the column that defines 'top'
    G = group L all;
    TOTAL_ROWS = foreach G generate COUNT(L);
    store O into 'file.ordered';

    -- at present, there is no way to use the 'TOTAL_ROWS' directly to filter the results
    store TOTAL_ROWS into 'total_rows.txt';



    Pig query 2
    -----------
    -- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is the result stored in total_rows.txt
    L = load 'file.ordered';
    LIM = limit L $TOP_ROWS;
    store LIM into 'output';
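    The glue between the two queries is the small driver script Dave
    describes. A minimal Python sketch, assuming the count from query 1 lands
    in a part file under total_rows.txt (the paths and the query2.pig name are
    assumptions, not part of the thread):

    ```python
    import glob

    def read_total(path_glob="total_rows.txt/part-*"):
        """Read the single row count stored by Pig query 1 (path is an assumption)."""
        for path in sorted(glob.glob(path_glob)):
            with open(path) as f:
                line = f.readline().strip()
                if line:
                    return int(line)
        raise ValueError("no row count found under %s" % path_glob)

    def top_rows(total, fraction=0.95):
        """How many rows make up the top `fraction` of all rows."""
        return int(total * fraction)

    def pig_command(n, script="query2.pig"):
        """Command line passing the computed limit to query 2 via -param."""
        # e.g. subprocess.run(pig_command(n), check=True)
        return ["pig", "-param", "TOP_ROWS=%d" % n, script]
    ```

    The driver reads the count, scales it by the desired percentile, and
    launches the second query with the result substituted for $TOP_ROWS.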


    On 6/29/10 3:24 PM, "Dave Viner" wrote:

    Actually, I've gotten the first half of the code to work now. Here's how
    it looks:


    X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
    (user:chararray, time:long, query:chararray);
    Z = FILTER X BY query is not null;
    A = GROUP Z BY query;
    B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
    C = GROUP B ALL;
    U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
    V = FOREACH U GENERATE min + (max-min)*0.95;

    V is appropriately set... but I can't perform the final step of actually
    filtering the values by that count.

    The simple approach:
    TOP = FILTER B BY count > V;

    doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
    {query: chararray,count: long}

    Same with SPLIT.

    How do I filter or split on the value of V?

    Dave Viner
    On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner wrote:

    I don't quite understand this Pig Latin. The piggybank
    function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters,
    which are compared. Here's the sample I'm trying, using the tutorial
    excite log as a sample.

    X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
    (user:chararray, time:long, query:chararray);
    Z = FILTER X BY query is not null;
    A = GROUP Z BY query;
    B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
    U = FOREACH B GENERATE *,
    MIN(count) as min:long,
    MAX(count) as max:long;

    This doesn't seem to work at all. It dies with this error:
    ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
    Other Field Schema: min: long

    Changing the min:long and max:long to doubles (as suggested by the error
    message) causes this error:
    ERROR 1045: Could not infer the matching function for
    org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
    explicit cast.

    What am I missing in using the sample code you've provided? I can't
    seem
    to get it to work...

    Thanks for your help.
    Dave Viner

    On Tue, Jun 29, 2010 at 10:17 AM, hc busy wrote:

    That's what I tried to say in my last email. I don't believe you can
    calculate the percentiles exactly in just one pass. Writing out the Pig
    for the two-pass algorithm should be easy enough:

    P = group TABLE all;
    U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
    V = foreach U generate min + (max - min)*0.95;

    would give you the 95th percentile cutoff, and you just filter or split
    by V.


    On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <daveviner@pobox.com>
    wrote:
    How would I calculate the percentile in one pass? In order to calculate
    the percentile for each item, I need to know the total count. How do I get
    the total count, and then calculate each item's percentile in one pass?

    I don't mind doing multiple passes - I am just not sure how to make the
    calculation.

    Thanks
    Dave Viner

  • Dave Viner at Jun 30, 2010 at 1:43 am
    Ok. I think I see how it's working.

    As a side note to this conversation, the wiki at
    http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I
    was looking for *should* be possible with this code:

    A = load 'data1';
    B = group A all;
    C = foreach B generate COUNT(B);
    store C into 'count';
    D = load 'data2';
    E = foreach D generate $1/GetScalar(C);


    However, GetScalar is no longer present in 0.7.0.
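    What GetScalar provided can be sketched directly in Python (data1 and
    data2 here are stand-ins for the wiki example's relations):

    ```python
    # B = group A all; C = COUNT(A) gives one scalar over all of data1;
    # E then divides each value in data2 by that single aggregate.
    data1 = ["r1", "r2", "r3", "r4"]          # stand-in for 'data1'
    total = len(data1)                        # the COUNT scalar
    data2 = [("x", 1.0), ("y", 3.0)]          # stand-in for 'data2'
    shares = [(k, v / total) for k, v in data2]
    ```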

    Thanks for your help.

    Dave Viner


  • Hc busy at Jun 30, 2010 at 4:04 pm
    Ahh, I see, I seem to have misread the question. If it's the top xx% of
    entries, then certainly sorting and then limiting works.

    @Thejas I had thought that Limit is distributed and does not guarantee you
    get the results in order?

    @Dave
    Here's what to do after you get min + (max-min)*0.95 into V, which now has
    just one row.

    P = group TABLE all;
    U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
    V = foreach U generate min + (max - min)*0.95 as v;
    D = cross TABLE, V;
    E = filter D by TABLE::x <= V::v;
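    The same two-pass idea, sketched in Python for clarity (TABLE and x are
    the thread's placeholder names; this is an illustration, not the Pig
    implementation):

    ```python
    # Pass 1: compute the cutoff min + (max - min) * frac over the counts.
    # Pass 2: keep every row at or below the cutoff -- the cross-then-filter
    # above does the same, broadcasting the one-row V against every row.
    def percentile_cutoff(counts, frac=0.95):
        lo, hi = min(counts), max(counts)
        return lo + (hi - lo) * frac

    def filter_by_cutoff(rows, frac=0.95):
        # rows: (query, count) pairs
        v = percentile_cutoff([c for _, c in rows], frac)
        return [(q, c) for q, c in rows if c <= v]
    ```

    Note this is a cutoff on the value range (min/max interpolation), not a
    true rank percentile, which is exactly the distinction Thejas raises.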

    We should really start a wiki page to store these common calculations in
    Pig... I'd imagine a lot of people out there needed to do this at one time
    or another.

    This is also why, imnsho, I'd like to see cross as a nested_op inside
    foreach, because of this type of application.

  • Thejas Nair at Jun 30, 2010 at 9:45 pm

    On 6/30/10 9:02 AM, "hc busy" wrote:

    @Thejas I had thought that Limit is distributed and does not guarantee you
    get the results in order?
    As mentioned under section on limit here -
    http://hadoop.apache.org/pig/docs/r0.7.0/piglatin_ref2.html#LIMIT
    " There is no guarantee which tuples will be returned, and the tuples that
    are returned can change from one run to the next. A particular set of tuples
    can be requested using the ORDER operator followed by LIMIT. "


    The query -
    set default_parallel 10;
    L = load 'x';
    O = order L by $0;
    LIM = limit O 100;

    will result in 3 MR jobs (see the explain output for details):
    1st MR - sampling job for the order-by, to determine the distribution of
    the sort key and decide how to partition the data for ordering
    2nd MR - orders the result; each reduce task outputs only its first 100
    records
    3rd MR - does the final limit; the map reads with the sort key as the key,
    and a single reducer task reads the first 100 records
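    The last two stages of that plan can be sketched as a merge of
    per-reducer prefixes (a toy model for intuition, not Pig's actual
    implementation):

    ```python
    import heapq

    def limit_ordered(partitions, n):
        """Toy model of the 2nd and 3rd MR jobs: each sorted partition keeps
        only its first n records, then one final reducer merges those
        prefixes in sort order and takes the global first n."""
        per_reducer = [p[:n] for p in partitions]  # 2nd MR: local limit per reduce task
        merged = heapq.merge(*per_reducer)         # 3rd MR: single reducer, sorted merge
        return [x for _, x in zip(range(n), merged)]
    ```

    Keeping only n records per reduce task is safe because no record beyond a
    partition's first n can appear in the global first n.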

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 29, '10 at 3:16p
active: Jun 30, '10 at 9:45p
posts: 14
users: 5
website: pig.apache.org
