Grokbase Groups Pig user August 2010
Wondering about performance and count...
A = load 'test.csv' as (a1,a2,a3);
B = GROUP A by a1;
-- which is preferred?
C = FOREACH B GENERATE COUNT(A);
-- or would this only send a single field through the COUNT and be more performant?
C = FOREACH B GENERATE COUNT(A.a2);

  • Dmitriy Ryaboy at Aug 25, 2010 at 10:15 pm
    Generally speaking, the second option will be more performant as it might
    let you drop column a3 early. In most cases the magnitude of this is likely
    to be very small as COUNT is an algebraic function, so most of the work is
    done map-side anyway, and only partial, pre-aggregated counts are shipped
    from mappers to reducers. However, if A is very wide, or a column store, or
    has non-negligible deserialization cost that can be offset by only
    deserializing a few fields -- the second option is better.

    -D

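Dmitriy's trade-off can be sketched side by side (a hedged example; the wide schema and the extra columns are hypothetical):

```pig
-- Load a hypothetically wide relation; only a1 (the group key) and a2 are needed.
A = LOAD 'wide.csv' USING PigStorage(',') AS (a1, a2, a3, a4, a5);
B = GROUP A BY a1;

-- Option 1: COUNT over the whole bag -- Pig gets no hint that a3..a5 are unused.
C1 = FOREACH B GENERATE group, COUNT(A);

-- Option 2: COUNT over one projected field -- Pig can ask the loader for
-- a1 and a2 only, which pays off for wide rows or columnar storage.
C2 = FOREACH B GENERATE group, COUNT(A.a2);
```

Both produce one count per a1 group; the difference is only how much of each row must be deserialized and carried to the GROUP.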
  • Mridul Muralidharan at Aug 25, 2010 at 11:32 pm
I am not sure why the second option is better - in both cases, you are
shipping only the combined counts from map to reduce.
On the other hand, the first could be better, since it means we need to
project only 'a1' - and none of the other fields.

Or did I miss something here?
I am not very familiar with what Pig does in this case right now.

    Regards,
    Mridul

  • Dmitriy Ryaboy at Aug 26, 2010 at 3:36 am
    I think if you do COUNT(A), Pig will not realize it can ignore a2 and a3,
    and will project all of them.


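A quick way to check what Pig actually projects for a given script is EXPLAIN (a sketch using the thread's aliases):

```pig
A = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);
B = GROUP A BY a1;
C = FOREACH B GENERATE group, COUNT(A.a2);
-- Prints the logical, physical, and MapReduce plans for C; the logical
-- plan shows which columns survive (or are pruned from) the load.
EXPLAIN C;
```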
  • Mridul Muralidharan at Aug 26, 2010 at 8:28 am
    But it does for COUNT(A.a2) ?
    That is interesting, and somehow weird :)

    Thanks !
    Mridul

  • Mridul Muralidharan at Aug 27, 2010 at 5:16 pm
    On second thoughts, that part is obvious - duh

    - Mridul

  • Renato Marroquín Mogrovejo at Aug 28, 2010 at 7:44 pm
    Hi, this is also interesting and kinda confusing for me too (=
    Coming from the DB world, the second one would have better performance, but
    Pig doesn't keep statistics on the data, so it has to read the whole file
    anyway. And since the count is mostly done on the map side, all attributes
    will be read anyway; the ones that are not interesting to us are just
    dismissed and not passed to the reducer part of the job. Besides, wouldn't
    the presence of null values affect performance? For example, if a2 had many
    null values, then fewer values would be passed too, right?

    Renato M.


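On the null question: in recent Pig versions COUNT skips tuples whose (first) field is null, while COUNT_STAR counts every tuple, so with many nulls in a2 the two can even return different numbers (a sketch):

```pig
A = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);
B = GROUP A BY a1;
-- COUNT(A.a2) ignores tuples where a2 is null; COUNT_STAR(A) counts them all.
C = FOREACH B GENERATE group,
        COUNT(A.a2)   AS a2_nonnull,
        COUNT_STAR(A) AS all_rows;
```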

  • Thejas M Nair at Aug 28, 2010 at 11:32 pm
    In the case of COUNT(A) or COUNT(A.a2), since the combiner gets used, the
    value sent from map to reduce will only be the result of COUNT for each
    group on a1 in the map. I.e., the data transferred will be the same in
    both cases.

    However, Pig can tell the loader that it needs only column a2 if you are
    using COUNT(A.a2) in your query. If the loader has optimizations (selective
    deserialization or columnar storage) that reduce cost when fewer columns
    are requested by Pig, then you will benefit from using COUNT(A.a2).
    But in the case of GROUP, I think the column pruning does not work across
    it, and (if so) that should change in a future release.



    -Thejas





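If column pruning indeed does not cross the GROUP, one workaround is to project by hand before grouping (a sketch, not an official recommendation from the thread):

```pig
A  = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3);
A2 = FOREACH A GENERATE a1, a2;   -- drop a3 explicitly, before the GROUP
B  = GROUP A2 BY a1;
C  = FOREACH B GENERATE group, COUNT(A2.a2);
```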

  • Mridul Muralidharan at Aug 29, 2010 at 4:01 pm
    The reason COUNT(a.field1) would have better performance is that Pig
    does not 'know' what is required from a tuple in the case of COUNT(a).
    In a custom MapReduce job, we can optimize it away so that only the single
    required field is projected out, but that is obviously not possible
    here (COUNT is a UDF), so the entire tuple is deserialized from input.

    Of course, the performance difference, as Dmitriy noted, would not be
    very high.


    Regards,
    Mridul



  • Corbin Hoenes at Sep 2, 2010 at 6:10 pm
    Wow...thanks for all the discussion and insight guys.


  • Renato Marroquín Mogrovejo at Sep 2, 2010 at 9:51 pm
    So in terms of performance it is the same whether I count just a single
    column or the whole data set, right?
    But what Thejas said about the loader having optimizations (selective
    deserialization or columnar storage) - is that something Pig actually has,
    or is it something planned for the future?
    And hey, isn't using a combiner something we should try to avoid? I mean,
    for the COUNT case a combiner is needed, but are there any other operations
    that get put into that combiner, like trying to reuse the computation
    being made?
    Thanks for the replies (=

    Renato M.



  • Dmitriy Ryaboy at Sep 2, 2010 at 10:24 pm
    Pig has selective deserialization and columnar storage if the loader you are
    using implements it. So that depends on what you are doing. Naturally, if
    your data is not stored in a way that separates the columns, Pig can't
    magically read them separately :).

    You should try to always use combiners.

    -D

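"Always use combiners" in practice means keeping the FOREACH after a GROUP combinable; a hedged sketch of a shape that keeps the combiner on versus one that turns it off:

```pig
A = LOAD 'test.csv' USING PigStorage(',') AS (a1, a2, a3:int);
B = GROUP A BY a1;

-- Combinable: only the group key and algebraic UDFs (COUNT, SUM, ...) appear,
-- so partial aggregates are computed map-side.
good = FOREACH B GENERATE group, COUNT(A.a2), SUM(A.a3);

-- Not combinable: the raw bag itself is generated, so every tuple of each
-- group has to be shipped to the reducers.
bad = FOREACH B GENERATE group, A;
```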

  • Renato Marroquín Mogrovejo at Sep 4, 2010 at 3:06 am
    Thanks Dmitriy! Hey, a couple of final questions please.
    Which are the deserializers that implement this selective deserialization?
    And the columnar storage used is Zebra?
    Thanks again for the great replies.

    Renato M.



  • Thejas M Nair at Sep 10, 2010 at 3:39 pm
    Yes, Zebra has a columnar storage format.
    Regarding selective deserialization (i.e., deserializing only the columns
    that are actually needed for the Pig query): as per my understanding,
    elephant-bird has a protocol-buffer-based loader that does lazy
    deserialization. PigStorage also does something similar: when PigStorage is
    used to load data, it returns bytearray types, and Pig adds a type-casting
    FOREACH after the load that performs the type conversion only on the fields
    required by the rest of the query.

    -Thejas
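
    [Editor's note: a rough Python sketch of the lazy-cast idea described
    above (my own illustration, not PigStorage's actual implementation) --
    fields stay as cheap raw strings until the query demands a typed value:]

    ```python
    # Hedged sketch: load splits the line but leaves fields raw
    # (bytearrays in Pig); casting happens later, only for the fields the
    # rest of the query actually uses.
    def load_line(line):
        # PigStorage-style split on the default tab delimiter
        return line.rstrip("\n").split("\t")

    def project_and_cast(raw_fields, wanted):
        # Only the required fields are converted (here: all cast to int)
        return {i: int(raw_fields[i]) for i in wanted}

    raw = load_line("10\t20\t30\n")
    print(project_and_cast(raw, [1]))  # only field a2 is ever cast: {1: 20}
    ```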



    On 9/3/10 8:05 PM, "Renato Marroquín Mogrovejo"
    wrote:
    Thanks Dmitriy! Hey, a couple of final questions, please.
    Which deserializers implement this selective deserialization?
    And is the columnar storage used Zebra?
    Thanks again for the great replies.

    Renato M.

    2010/9/2 Dmitriy Ryaboy <dvryaboy@gmail.com>
    Pig has selective deserialization and columnar storage if the loader you
    are using implements them, so that depends on what you are doing. Naturally,
    if your data is not stored in a way that separates the columns, Pig can't
    magically read them separately :).

    You should try to always use combiners.

    -D
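
    [Editor's note: Dmitriy's point about algebraic functions can be sketched
    as follows. The initial/intermediate/final split loosely mirrors Pig's
    Algebraic interface, though the function names here are illustrative, not
    Pig's actual EvalFunc API:]

    ```python
    # Hedged sketch of why an algebraic COUNT is cheap to ship: each mapper
    # emits one pre-aggregated partial count per group (the combiner step),
    # and the reducer merely sums the partials.
    def initial(tuples):          # map side: count the local tuples of a group
        return len(tuples)

    def intermediate(partials):   # combiner: merge partial counts
        return sum(partials)

    def final(partials):          # reduce side: merge the shipped partials
        return sum(partials)

    map1 = initial([("x", 1), ("x", 2)])   # 2
    map2 = initial([("x", 3)])             # 1
    # Only two small integers cross the wire, not the tuples themselves.
    print(final([map1, map2]))             # 3
    ```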


    On Thu, Sep 2, 2010 at 2:51 PM, Renato Marroquín Mogrovejo <
    renatoj.marroquin@gmail.com> wrote:
    So in terms of performance it is the same whether I count just a single
    column or the whole data set, right?
    But is what Thejas said about the loader having optimizations (selective
    deserialization or columnar storage) something that Pig actually has, or is
    it something planned for the future?
    And hey, isn't using a combiner something we should try to avoid? I mean,
    for the COUNT case a combiner is needed, but are there any other operations
    that get put into that combiner, like trying to reuse the computation being
    made?
    Thanks for the replies (=

    Renato M.


  • Renato Marroquín Mogrovejo at Sep 10, 2010 at 3:52 pm
    Thanks Thejas!

Discussion Overview
group: user
categories: pig, hadoop
posted: Aug 25, '10 at 8:59p
active: Sep 10, '10 at 3:52p
posts: 15
users: 5
website: pig.apache.org
