FAQ
Is there a UDF for generating the top X % of results? For example, in a log
parsing context, it might be the set of search queries that represent the
top 80% of all queries.

I see in the piggybank that there is a TOP function, but that only takes the
top *number* of results, rather than a percentile.

Thanks
Dave Viner

## Discussion

•  hc busy at Jun 29, 2010 at 5:00 pm
I think it's impossible to do this within one M/R pass. You will want to
implement it as two M/R passes in Pig, because you have to calculate the
percentile in pass 1 and then perform the filter in pass 2.

On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner wrote:

•  Dave Viner at Jun 29, 2010 at 5:04 pm
How would I calculate the percentile in one pass? In order to calculate the
percentile for each item, I need to know the total count. How do I get the
total count, and then calculate each item's percentile in one pass?

I don't mind doing multiple passes - I am just not sure how to make the
calculation.

Thanks
Dave Viner

On Tue, Jun 29, 2010 at 9:59 AM, hc busy wrote:
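
One way to sketch the count-based multi-pass calculation being asked about (this is illustrative, not from any message in the thread; alias and field names are hypothetical): pass 1 computes per-query counts, pass 2 computes the grand total and attaches it to every row so each item's share can be derived.

```pig
-- pass 1: per-query counts
Q = load 'queries' as (query:chararray);
G = group Q by query;
CNT = foreach G generate group as query, COUNT(Q) as cnt;

-- pass 2: compute the grand total, attach it to every row, derive each share
ALLG = group CNT all;
TOT = foreach ALLG generate SUM(CNT.cnt) as total;
J = cross CNT, TOT;   -- TOT has a single row, so this just tags each record
SHARE = foreach J generate CNT::query, CNT::cnt, (double) CNT::cnt / TOT::total;
```

The single-row cross is the usual workaround in this era of Pig for attaching a global aggregate to every record.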

•  hc busy at Jun 29, 2010 at 5:18 pm
That's what I tried to say in my last email. I don't believe you can
calculate the percentiles exactly in just one pass. Writing out the Pig for
the two-pass algorithm should be easy enough:

P = group TABLE all;
U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
V = foreach U generate min + (max - min) * 0.95;

would give you the 95th percentile cutoff, and you just filter or split by V.

On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner wrote:

•  Dave Viner at Jun 29, 2010 at 10:01 pm
I don't quite understand this Pig Latin. The piggybank
function org.apache.pig.piggybank.evaluation.math.MIN takes 2 parameters,
which are compared. Here's the sample I'm trying, using the tutorial
excite log as a sample.

X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
U = FOREACH B GENERATE *,
MIN(count) as min:long,
MAX(count) as max:long;

This doesn't seem to work at all. It dies with this error:
ERROR 1022: Type mismatch merging schema prefix. Field Schema: double. Other
Field Schema: min: long

Changing the min:long and max:long to doubles (as suggested by the error
message), causes this error:
ERROR 1045: Could not infer the matching function for
org.apache.pig.builtin.MIN as multiple or none of them fit. Please use an
explicit cast.

What am I missing in using the sample code you've provided? I can't seem to
get it to work...

Dave Viner

On Tue, Jun 29, 2010 at 10:17 AM, hc busy wrote:

•  Dave Viner at Jun 29, 2010 at 10:24 pm
Actually, I've gotten the first half of the code to work now. Here's how
it looks:

X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is appropriately set... but I can't perform the final step of actually
filtering the values by that count.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?

Dave Viner
On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner wrote:

•  Aniket Mokashi at Jun 29, 2010 at 11:01 pm
Hi Dave,

TOP = FILTER B BY count > V; will not work right now with Pig. An alias can
be cast into a scalar after https://issues.apache.org/jira/browse/PIG-1434 is
added. As far as I know, there is no simple way of using values generated by
MapReduce jobs directly in Pig jobs; you will need a UDF to do that.
One dirty approach is to store this value in a location with a STORE, and
then later use it to generate a new file with $0 and V and filter with $0 >
$1.

Thanks,
Aniket
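
One way to realize that store-and-recombine idea, using B and V from Dave's earlier script (a sketch only, not from the thread; the single-row cross is the usual trick for attaching one value to every record):

```pig
store V into 'threshold';
-- may need to run as a separate script, after the store above completes
T = load 'threshold' as (cutoff:double);
W = cross B, T;                    -- T has one row: every record gets the cutoff
TOPQ = filter W by B::count > T::cutoff;
store TOPQ into 'top_queries';
```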

-----Original Message-----
From: daveviner@gmail.com On Behalf Of Dave Viner
Sent: Tuesday, June 29, 2010 3:24 PM
Subject: Re: UDF for generating top xx % of results?

•  Dave Viner at Jun 29, 2010 at 11:15 pm
Hi Aniket,

Is it possible to pass an alias to a UDF? In other words could I do:

TOP = FILTER B BY myCompare(count, V)

or something similar?

I don't quite understand your idea of using STORE. If I do:

STORE V into 'threshold';
XX = LOAD 'threshold' AS (threshold:long);

How can I filter the count by XX? Seems like I'm back in the same
situation.

Do you mean doing the STORE(), then directly modifying the files themselves
before doing the LOAD() so that it contains one row for each value of B,
with the threshold value V appended to the end?

Dave Viner
On Tue, Jun 29, 2010 at 3:57 PM, Aniket Mokashi wrote:

•  Thejas Nair at Jun 29, 2010 at 11:14 pm
This Pig query will not give you the "top xx% of results". I.e., if xx = 50%
and there are 200 results, you may not get 100 rows in the output.
If you really want the "top xx% of results", you need to do something like
the following.
(If you just want rows with count > min(count) + (max(count) - min(count)) *
0.95, you can store and use the value of V the way TOTAL_ROWS is stored in
this example.)

Pig query 1
-----------
L = load 'file' as (query, count); -- assuming that count is already there, for simplicity
O = order L by count desc; -- order by the column that defines 'top'
G = group L all;
TOTAL_ROWS = foreach G generate COUNT(L);
store O into 'file.ordered';

-- at present, there is no way to use TOTAL_ROWS directly to filter the results
store TOTAL_ROWS into 'total_rows.txt';

Pig query 2
-----------
-- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
-- the result stored in total_rows.txt
L = load 'file.ordered' as (query, count);
LIM = limit L $TOP_ROWS;
store LIM into 'output';
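
For the gap called out in the comment (no way to use TOTAL_ROWS directly), one possible piece of glue is Pig's parameter substitution with a backquoted command. This is a sketch under stated assumptions: the part-file name and the awk arithmetic are guesses, not something from this thread, and the stored file must be readable from wherever Pig is launched.

```pig
-- hypothetical query2.pig: derive TOP_ROWS at launch time via command substitution
%declare TOP_ROWS `awk '{printf "%d", $1 * 0.95}' total_rows.txt/part-00000`
L = load 'file.ordered' as (query, count);
LIM = limit L $TOP_ROWS;
store LIM into 'output';
```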

On 6/29/10 3:24 PM, "Dave Viner" wrote:

•  Dave Viner at Jun 29, 2010 at 11:22 pm
Hi Thejas,

Thanks for this suggestion. Just to make sure I understand the strategy
here:

Pig query 1 outputs 2 files (or directories w/ files in them):
"file.ordered" which contains the data set (query, count) in ascending order
of count; and "total_rows" which contains 1 line with the total number of
rows.

Then, I need a small script that reads the value of "total_rows", multiplies
that value by the desired percentile (0.95 in the example), and invokes pig
a second time, passing the "TOP_ROWS" value as a -param on the command line.
The second pig query outputs the desired file in "output" which contains
only those rows which have a count above the given threshold.

Is that correct?

Thanks
Dave Viner

On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair wrote:

•  Thejas Nair at Jun 30, 2010 at 12:20 am

On 6/29/10 4:22 PM, "Dave Viner" wrote:


Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is
not all rows that have a count above the given threshold. So it is possible
that a few rows that are not in the output have the same count value as the
last result in "output".
For example, if the input has 100 rows all with a count value of 1, you will
get an arbitrary 95 rows in the output.

-Thejas

•  at Jun 30, 2010 at 1:43 am ⇧
Ok. I think I see how it's working.

As a side note to this conversation, the wiki at
http://wiki.apache.org/pig/PigStreamingFunctionalSpec indicates that what I
was looking for *should* be possible with this code:

B = group A all;
C = foreach B generate COUNT(B);
store C into 'count';
E = foreach D generate $1/GetScalar(C);

However, GetScalar is no longer present in 0.7.0.

Dave Viner

On Tue, Jun 29, 2010 at 5:18 PM, Thejas Nair wrote:

On 6/29/10 4:22 PM, "Dave Viner" wrote:

Hi Thejas,

Thanks for this suggestion. Just to make sure I understand the strategy
here:

Pig query 1 outputs 2 files (or directories w/ files in them): "file.ordered", which contains the data set (query, count) in ascending order of count; and "total_rows", which contains 1 line with the total number of rows.

Then, I need a small script that reads the value of "total_rows", multiplies that value by the desired percentile (0.95 in the example), and invokes pig a second time, passing the "TOP_ROWS" value as a define on the command line. The second pig query outputs the desired file in "output", which contains only those rows which have a count above the given threshold.

Yes, you are right. But "output" contains 0.95 * total_num_rows rows; it is not all rows that have a count above the given threshold. So it is possible that a few rows that are not in "output" have the same count value as the last result in "output".
For example, if the input has 100 rows with 1 as the count value, you will get an arbitrary 95 rows in the output.

-Thejas

Is that correct?

Thanks
Dave Viner

On Tue, Jun 29, 2010 at 4:11 PM, Thejas Nair wrote:

This pig query will not give you the "top xx% of results". I.e., if xx = 50% and results = 200, you may not get 100 rows in the output.
If you really want the "top xx% of results", you need to do something like the following.
(If you just want rows that have count < min(count) + (max(count) - min(count))*0.95, you can store and use the value of V the way TOTAL_ROWS is stored in this example.)

Pig query 1
-----------
L = load 'file' as (query, count); -- assuming count is already there, for simplicity
O = order L by count;              -- order by the column that defines 'top'
G = group L by 'all';
TOTAL_ROWS = foreach G generate COUNT(L);
store O into 'file.ordered';

-- at present, there is no way to use TOTAL_ROWS directly to filter the results
store TOTAL_ROWS into 'total_rows.txt';

Pig query 2
-----------
-- assuming this query is invoked with -param TOP_ROWS=X*0.95, where X is
-- the result stored in total_rows.txt
L = load 'file.ordered' as (query, count);
LIM = limit L $TOP_ROWS;
store LIM into 'output';
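The glue step between the two queries (read total_rows.txt, scale by the percentile, and re-invoke pig with -param) can be sketched as below. This is only an illustration of the arithmetic: the helper names and the script name `query2.pig` are my own, and the actual file read and subprocess call are left to the driver script.

```python
import math

def top_rows_param(total_rows: int, pct: float) -> int:
    # Round up so a fractional row count still covers the requested fraction.
    return math.ceil(total_rows * pct)

def pig_command(total_rows: int, pct: float = 0.95) -> list:
    # Command line for the second query, mirroring -param TOP_ROWS=X*0.95.
    return ["pig", "-param", "TOP_ROWS=%d" % top_rows_param(total_rows, pct),
            "query2.pig"]

# Hypothetical input size: if total_rows.txt held the value 4501
cmd = pig_command(4501)
```

The driver would read the single-line count that Pig query 1 stored in total_rows.txt, then run `cmd` and let Pig query 2 produce 'output'.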

On 6/29/10 3:24 PM, "Dave Viner" wrote:

Actually, I've gotten the first half of the code to work now. Here's how
it looks:

X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
count:long;
C = GROUP B ALL;
U = FOREACH C GENERATE MIN(B.count) as min:long, MAX(B.count) as max:long;
V = FOREACH U GENERATE min + (max-min)*0.95;

V is appropriately set... but I can't perform the final step of
actually
filtering the values by that count.

The simple approach:
TOP = FILTER B BY count > V;

doesn't work... ERROR 1000: Error during parsing. Invalid alias: V in
{query: chararray,count: long}

Same with SPLIT.

How do I filter or split on the value of V?

Dave Viner
On Tue, Jun 29, 2010 at 3:00 PM, Dave Viner wrote:

I don't quite understand this pig latin. The piggybank
function org.apache.pig.piggybank.evaluation.math.MIN takes 2
parameters
which are compared. Here's the sample I'm trying using the tutorial
excitelog as a sample.

X = LOAD 'samples/excite-small.log' USING PigStorage('\t') AS
(user:chararray, time:long, query:chararray);
Z = FILTER X BY query is not null;
A = GROUP Z BY query;
B = FOREACH A GENERATE group as query:chararray, COUNT(Z) as
count:long;
U = FOREACH B GENERATE *,
MIN(count) as min:long,
MAX(count) as max:long;

This doesn't seem to work at all. It dies with this error:
ERROR 1022: Type mismatch merging schema prefix. Field Schema: double.
Other Field Schema: min: long

Changing the min:long and max:long to doubles (as suggested by the
error
message), causes this error:
ERROR 1045: Could not infer the matching function for
org.apache.pig.builtin.MIN as multiple or none of them fit. Please use
an
explicit cast.

What am I missing in using the sample code you've provided? I can't
seem
to get it to work...

Dave Viner

On Tue, Jun 29, 2010 at 10:17 AM, hc busy wrote:

That's what I tried to say in my last email.I don't believe you can
calculate exactly the percentiles in just one pass. Writing out the
pig
for
two pass algorithm should be easy enough..

P = group TABLE all;
U = foreach P generate MIN(x) as min, MAX(x) as max;
V = foreach U generate min + (max-min)*0.95;

would give you the 95th percentile cutoff, and u just filter or split
by
V.

On Tue, Jun 29, 2010 at 10:03 AM, Dave Viner <daveviner@pobox.com>
wrote:
How would I calculate the percentile in one pass? In order to
calculate
the
percentile for each item, I need to know the total count. How do I
get
the
total count, and then calculate each item's percentile in one pass?

I don't mind doing multiple passes - I am just not sure how to make
the
calculation.

Thanks
Dave Viner

On Tue, Jun 29, 2010 at 9:59 AM, hc busy wrote:

I think it's impossible to do this within one M/R. You will want to
implement it in two M/R in Pig, because you have to calculate the
percentile
in pass 1, and then perform the filter in pass 2.

On Tue, Jun 29, 2010 at 8:14 AM, Dave Viner <dave@vinertech.com>
wrote:
Is there a UDF for generating the top X % of results? For
example,
in
a
log
parsing context, it might be the set of search queries that
represent
the
top 80% of all queries.

I see in the piggybank that there is a TOP function, but that only
takes
the
top *number* of results, rather a percentile.

Thanks
Dave Viner
•  at Jun 30, 2010 at 4:04 pm ⇧
Ahh, I see, I seem to have misread the question. If it's the top xx% of entries, then certainly sorting and then limiting is the way.

@Thejas I had thought that limit is distributed and does not guarantee you get the results in order?

@Dave
Here's what to do after you get min + (max-min)*0.95 into V, which now has just one row:

P = group TABLE all;
U = foreach P generate MIN(TABLE.x) as min, MAX(TABLE.x) as max;
V = foreach U generate min + (max - min) * 0.95 as v;
D = cross TABLE, V;
E = filter D by x <= v;
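The same cutoff arithmetic can be checked in plain Python. Note that min + (max - min) * 0.95 is a cutoff on the value *range*, not a rank-based 95th percentile, so on skewed counts the two can differ sharply (the data and variable names below are illustrative):

```python
def range_cutoff(counts, pct=0.95):
    # The cutoff computed above: min + (max - min) * pct.
    lo, hi = min(counts), max(counts)
    return lo + (hi - lo) * pct

counts = [1, 2, 2, 3, 100]            # skewed, like query frequencies
v = range_cutoff(counts)              # 1 + 99 * 0.95 = 95.05
kept = [c for c in counts if c <= v]  # what the filter keeps
```

Here the filter keeps four of five rows even at pct=0.95, because one huge count stretches the range. A rank-based "top xx% of rows" needs the sort-and-limit approach instead.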

We should really start a wiki page to store these common calculations in Pig... I'd imagine a lot of people out there have needed to do this at one time or another.

This is also why, imnsho, I'd like to see cross as a nested_op inside foreach, because of this type of application.

•  at Jun 30, 2010 at 9:45 pm ⇧

On 6/30/10 9:02 AM, "hc busy" wrote:

@Thejas I had thought that Limit is distributed and does not guarantee u
get the results in order ??
As mentioned under section on limit here -
" There is no guarantee which tuples will be returned, and the tuples that
are returned can change from one run to the next. A particular set of tuples
can be requested using the ORDER operator followed by LIMIT. "

The query -

Set default_parallel 10;
O = order L by $0;
LIM = limit O 100;

Will result in 3 MR jobs (see the explain output for details):
1st MR - the sampling job for order-by, which determines the distribution of the sort key and decides how to partition the data for ordering
2nd MR - orders the result; each reduce task outputs only its first 100 records
3rd MR - does the final limit; the map reads with the sort key as the key, it has a
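The last two jobs Thejas describes (each reduce task emitting only its first 100 sorted records, then a final job merging those runs and applying the global limit) can be sketched in Python. The partition contents below are illustrative, not how Pig actually range-partitions an order-by:

```python
import heapq

def distributed_limit(sorted_partitions, n):
    # 2nd MR: each reduce task outputs only the first n records of its
    # already-sorted partition.
    capped = [part[:n] for part in sorted_partitions]
    # 3rd MR: merge the sorted runs on the sort key and apply the
    # global limit.
    return list(heapq.merge(*capped))[:n]

parts = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]  # three sorted reducer outputs
top = distributed_limit(parts, 2)
```

Capping each run first keeps the final job's input small: at most n records per reduce task, regardless of partition size.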

Discussion Overview
group: user (pig.apache.org); categories: pig, hadoop; posted Jun 29 '10 at 3:16p; active Jun 30 '10 at 9:45p; 14 posts; 5 users
