In the case of COUNT(A) or COUNT(A.a2), since the combiner gets used, the
only value sent from map to reduce is the result of COUNT for each group on
a1 in the map. I.e., the data transferred will be the same in both cases.
However, Pig can tell the loader that it needs only column a2 if you are
using COUNT(A.a2) in your query. If the loader has optimizations (selective
deserialization or columnar storage) that make it cheaper when fewer
columns are requested by Pig, then you will benefit from using COUNT(A.a2).
But in the case of GROUP, I think column pruning does not work across it,
and (if so) that should change in a future release.
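
If column pruning really does not reach across the GROUP, one way to get the same effect is to project by hand before grouping. The following is only a minimal sketch under that assumption; the alias A_slim is made up for illustration, and whether the loader actually skips a3 depends on the loader and the Pig version in use.

```pig
-- Sketch only: keep just the columns the query needs before the GROUP,
-- instead of relying on column pruning to work across it.
A = LOAD 'test.csv' AS (a1, a2, a3);

-- A_slim is a hypothetical alias; a3 is dropped here, map-side.
A_slim = FOREACH A GENERATE a1, a2;

B = GROUP A_slim BY a1;
C = FOREACH B GENERATE group, COUNT(A_slim.a2);

-- EXPLAIN C; prints the plans, which can help check what is projected.
```

One caveat on results rather than speed: according to the Pig built-in function documentation, COUNT skips tuples whose first field is null, so COUNT(A) and COUNT(A.a2) can return different numbers when a2 contains nulls.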
On 8/28/10 12:44 PM, "Renato Marroquín Mogrovejo"
Hi, this is also interesting and kind of confusing for me too (=
Coming from the DB world, I would expect the second one to perform better.
But Pig doesn't keep statistics on the data, so it has to read the whole
file anyway. And since the count is mainly done on the map side, all
attributes will be read anyway; the ones we are not interested in will be
dropped and not passed to the reduce part of the job. Besides, wouldn't the
presence of null values affect performance? For example, if a2 had many
null values, then fewer values would be passed
2010/8/27 Mridul Muralidharan <firstname.lastname@example.org>
On second thoughts, that part is obvious - duh
On Thursday 26 August 2010 01:56 PM, Mridul Muralidharan wrote:
But it does for COUNT(A.a2) ?
That is interesting, and somehow weird :)
On Thursday 26 August 2010 09:05 AM, Dmitriy Ryaboy wrote:
I think if you do COUNT(A), Pig will not realize it can ignore a2 and
a3, and will project all of them.
On Wed, Aug 25, 2010 at 4:31 PM, Mridul Muralidharan
I am not sure why the second option is better - in both cases, you are
shipping only the combined counts from map to reduce.
On the other hand, the first could be better since it means we need to
project only 'a1' - and none of the other fields.
Or did I miss something here?
I am not very familiar with what Pig does in this case right now.
On Thursday 26 August 2010 03:45 AM, Dmitriy Ryaboy wrote:
Generally speaking, the second option will be more performant as it may
let you drop column a3 early. In most cases the magnitude of this is likely
to be very small, as COUNT is an algebraic function, so most of the work is
done map-side anyway, and only partial, pre-aggregated counts are sent
from mappers to reducers. However, if A is very wide, or a column store, or
has non-negligible deserialization cost that can be offset by
deserializing only a few fields -- the second option is better.
On Wed, Aug 25, 2010 at 1:58 PM, Corbin Hoenes<email@example.com
Wondering about performance and count...
A = load 'test.csv' as (a1,a2,a3);
B = GROUP A by a1;
-- which preferred?
C = FOREACH B GENERATE COUNT(A);
-- or would this only send a single field through the COUNT and be more performant?
C = FOREACH B GENERATE COUNT(A.a2);