Grokbase Groups Pig dev May 2010
Hey, guys, how are Bags passed to EvalFunc stored?

I was looking at the Accumulator interface, and it says that the reason it is
needed for COUNT and SUM is that EvalFunc is always given the entire bag when
it is run on a bag.

I always thought if I did COUNT(TABLE) or SUM(TABLE.FIELD), and the code
inside that does


for (Tuple entry : inputDataBag) {
    // ... stuff
}


was an actual iterator that walked the bag sequentially, without necessarily
having the entire bag in memory all at once. Because it's an iterator, there's
no way to do anything other than stream through it.

I'm looking at this because Accumulator has no way of telling Pig "I've seen
enough"; it streams through the entire bag no matter what happens.
(Hypothetically speaking, if I were writing a "5th item of a sorted bag" UDF,
then after I see the 5th item of a 5-million-entry bag, I want to stop
executing if possible.)

Is there an easy way to make this happen?
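For what it's worth, the early exit asked about here is just a break out of the loop in plain Java. Below is a minimal sketch using a stand-in List rather than a real Pig DataBag (the class and method names are hypothetical):

```java
import java.util.Arrays;
import java.util.List;

public class FifthItem {
    // Return the 5th element and stop iterating immediately afterwards,
    // no matter how many elements the iterable holds.
    static Integer fifth(Iterable<Integer> bag) {
        int seen = 0;
        for (Integer entry : bag) {
            seen++;
            if (seen == 5) {
                return entry; // early exit: the remaining elements are never touched
            }
        }
        return null; // fewer than 5 elements
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(10, 20, 30, 40, 50, 60, 70);
        System.out.println(fifth(data)); // prints 50
    }
}
```

Note that inside an EvalFunc such a break only ends the UDF's own loop; whether Pig has already materialized the whole bag by that point is a separate question, which is what the replies below address.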


  • Alan Gates at May 27, 2010 at 9:34 pm
    The default case is that UDFs that take bags (such as COUNT, etc.)
    are handed the entire bag at once. In the case where all UDFs in a
    foreach implement the Algebraic interface and the expression itself is
    algebraic, the combiner will be used, significantly limiting the size
    of the bag handed to the UDF. The Accumulator does hand records to the
    UDF a few thousand at a time; currently it has no way to turn off the
    flow of records.
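    The chunked flow described above can be sketched in plain Java. The interface below is a simplified stand-in for illustration only, not Pig's actual org.apache.pig.Accumulator signature (the real one takes Tuples and may throw IOException):

```java
import java.util.Arrays;
import java.util.List;

public class AccumulatorSketch {
    // Simplified stand-in for Pig's Accumulator interface: Pig calls
    // accumulate() with successive chunks of the bag, then getValue() once.
    interface Accumulator<T> {
        void accumulate(List<Integer> chunk);
        T getValue();
        void cleanup();
    }

    // COUNT written as an accumulator: it never needs the whole bag in memory.
    static class Count implements Accumulator<Long> {
        private long count = 0;
        public void accumulate(List<Integer> chunk) { count += chunk.size(); }
        public Long getValue() { return count; }
        public void cleanup() { count = 0; }
    }

    public static void main(String[] args) {
        Count count = new Count();
        // Pig would feed a few thousand records per call instead of one big bag.
        count.accumulate(Arrays.asList(1, 2, 3));
        count.accumulate(Arrays.asList(4, 5));
        System.out.println(count.getValue()); // prints 5
    }
}
```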

    What you want might be accomplished by the LIMIT operator, which can
    be used inside a nested foreach. Something like:

    C = foreach B {
        C1 = order A by $0;
        C2 = limit C1 5;
        generate myUDF(C2);
    }
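In plain Java terms, the nested block above does the equivalent of the following (hypothetical names, just to show the shape and make the cost of the sort explicit):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortThenLimit {
    // Rough equivalent of: C1 = order A by $0; C2 = limit C1 5;
    static List<Integer> firstFiveSorted(List<Integer> bag) {
        List<Integer> sorted = new ArrayList<>(bag);
        Collections.sort(sorted); // the O(n log n) step
        return sorted.subList(0, Math.min(5, sorted.size()));
    }

    public static void main(String[] args) {
        System.out.println(firstFiveSorted(Arrays.asList(9, 3, 7, 1, 5, 8, 2)));
        // prints [1, 2, 3, 5, 7]
    }
}
```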

    Alan.
    On May 26, 2010, at 11:59 AM, hc busy wrote:

  • Hc busy at Jun 1, 2010 at 8:45 pm
    Well, see, that's the thing: the 'order A by $0' is already O(n log n).

    ahh, I see, my own example suffers from this problem.

    I guess I'm wondering how 'limit' works in conjunction with UDFs... A
    practical application escapes me right now, but if I do

    C = foreach B {
        C1 = MyUdf(B.bag_on_b);
        C2 = limit C1 5;
        generate C2;
    }

    does it know to push the limit in this case?

    On Thu, May 27, 2010 at 2:32 PM, Alan Gates wrote:

  • Alan Gates at Jun 2, 2010 at 4:15 pm
    I don't think it pushes the limit down yet in this case.

    Alan.
    On Jun 1, 2010, at 1:44 PM, hc busy wrote:


Discussion Overview
group: dev
categories: pig, hadoop
posted: May 26, '10 at 6:59p
active: Jun 2, '10 at 4:15p
posts: 4
users: 2
website: pig.apache.org

2 users in discussion

Alan Gates: 2 posts; Hc busy: 2 posts
