Newbie question -- iterating over tuples
Grokbase Groups: Pig user, July 2008

  • Handerson, Steven K. at Jul 28, 2008 at 7:35 pm
Folks,

Is there a way to do something akin to map (of map/reduce) over a tuple?
The input file is lines like this:

category word1 word2 ...

So the simplest thing is to read each line as a tuple (PigStorage ' '), but then
I want to iterate over the words, which are $1 ... <whatever>, and create a bag
of tuples, say:
(word, category)

I know I'm being lazy not writing code at this point, but I think Pig should be
flexible enough to do what I want, ideally.

Tuples might also want something like a Perl "shift" (front) or "pop" (back) --
so you could kind of manually shift distinguished values off the front, and then
treat the rest of the tuple as a list of similar elements.

Anybody?

-- Steve


  • Olga Natkovich at Jul 28, 2008 at 7:42 pm
    You would need a custom load function to do this.

    Olga


  • Handerson, Steven K. at Jul 28, 2008 at 8:15 pm
    Ok, actually I pretty easily wrote an Eval function,
    PairFirstWithRest. Cool!

    Still, I think more of these kinds of things should be in a library; of
    course you should be *able* to write code and integrate it, but a system
    that you can trick into doing what you want (out of the box) is better.

    Here it is, as a donation (a usage sketch follows at the end of this message):

    import java.io.IOException;
    import java.util.ArrayList;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.Datum;
    import org.apache.pig.data.Tuple;

    // Pairs the first field of the input tuple with each remaining field,
    // emitting one (first, i-th field) tuple into the output bag per field.
    public class PairFirstWithRest extends EvalFunc<DataBag> {
        @Override
        public void exec(Tuple input, DataBag output) throws IOException {
            Datum first = input.getField(0);
            for (int i = 1; i < input.arity(); i++) {
                ArrayList<Datum> list = new ArrayList<Datum>();
                list.add(first);
                list.add(input.getField(i));
                output.add(new Tuple(list));
            }
        }
    }

    -- Steve
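
    A minimal usage sketch, for illustration only: the jar and file names below
    are placeholders, calling the UDF by its simple class name assumes it sits
    in the default package as posted, and the syntax follows later Pig releases
    rather than the 2008-era dialect.

        REGISTER pairfirstwithrest.jar;  -- placeholder jar with the compiled UDF
        lines = LOAD 'categories.txt' USING PigStorage(' ');
        -- pass the whole input tuple to the UDF and flatten the returned bag,
        -- yielding one (category, word) output tuple per word on the line
        pairs = FOREACH lines GENERATE FLATTEN(PairFirstWithRest(*));
        STORE pairs INTO 'category_word_pairs';

    FLATTEN turns the bag returned for each line into individual output tuples;
    without it, each line would produce a single field holding the whole bag.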




  • Olga Natkovich at Jul 28, 2008 at 8:31 pm
    You can contribute the function that you wrote back to the Piggybank and
    also use functions available there:

    http://wiki.apache.org/pig/PiggyBank

    Olga


  • Handerson, Steven K. at Jul 29, 2008 at 6:53 am
    Folks:

    Why does Pig Latin not allow nested foreach/generates?
    Arbitrary nesting of data is already "allowed" constructively --
    why not allow operations over arbitrarily nested structures?
    It increases locality, and therefore parallelism; otherwise you have to
    hack things together with additional groups/joins, which add unnecessary
    extra computation and slow Pig programs down. (A sketch of that group/join
    workaround appears after this message.)

    Example:
    In naïve Bayes, you extract features from documents (with categories) and
    group by feature, to get per-category counts for each feature.
    What you want to do next is (simple version) just sum the counts over all
    categories, and then divide each category-specific count by that total.
    In Pig:

    we have:
    { (feature1, {(category1, count), (category2, count), ...}),
      (feature2, {(category1, count), (category2, count), ...}), ... }
    we want to make:
    { (feature1, total, {(category1, count), (category2, count), ...}), ... }
    and then:
    { (feature1, {(category1, count/total), (category2, count/total), ...}), ... }

    Why can't this be done like this?

    foreach feature {
        total = SUM(categorytuples.count);
        categoryprobs = foreach categorytuples {
            generate category, count/total;
        }
        generate feature, categoryprobs;
    }

    I'm afraid people are taking "map/reduce" too literally, and thinking
    every map has to have a reduce. This is a case where all you need is a map,
    which of course is very parallelizable.

    I suppose it makes things a little complicated for the planner in the above
    example, because the optimizer needs to know that it has to do two maps over
    categorytuples -- one to get the total, and one to apply it. But since the
    code spells out how it's done, really all the planner needs to do is
    recognize that it can't optimize this further.

    -- Steve
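
    A hedged sketch of the group/join workaround mentioned above, i.e. what the
    computation looks like without nested foreach. The relation and field names
    are placeholders, and the syntax (typed load schemas, JOIN, the ::
    disambiguation, casts) is that of later Pig releases, so treat this as an
    illustration of the roundabout route rather than 2008-era code.

        -- placeholder input: one (feature, category, count) triple per line
        counts  = LOAD 'feature_counts' AS (feature:chararray, category:chararray, cnt:long);
        grouped = GROUP counts BY feature;
        -- first pass: per-feature totals
        totals  = FOREACH grouped GENERATE group AS feature, SUM(counts.cnt) AS total;
        -- second pass: join the totals back on and normalize each count
        joined  = JOIN counts BY feature, totals BY feature;
        probs   = FOREACH joined GENERATE counts::feature, counts::category,
                      (double)counts::cnt / totals::total;

    The extra GROUP and JOIN are exactly the additional passes that the nested
    foreach block sketched in the message above would fold into a single
    per-feature computation.
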
  • Alan Gates at Jul 29, 2008 at 2:54 pm
    The goal is certainly to support full nesting in foreach. So eventually
    anything that you can do at the top level will be doable inside a foreach,
    including another foreach with additional nesting. We just haven't gotten
    to it yet.

    Alan.


  • Handerson, Steven K. at Jul 29, 2008 at 7:13 pm
    Alan,

    Ok, great. Thanks for the reply.

    I was concerned because the documentation seems to indicate that this was a
    design choice -- "note that we disallow nested foreach..generates, because
    that would allow nesting to arbitrary depths." But I'm glad that's not
    really the case, and that it's just not yet implemented.

    I solved the particular problem by writing an EvalFunc that does the
    necessary iterations (sketched just below) -- but of course it would be
    great if Pig Latin supported this directly, eventually.

    -- Steve
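
    For illustration, the UDF route described here might look roughly like the
    following in a script. CategoryProbs is a hypothetical name standing in for
    the EvalFunc mentioned above (assumed to live in the default package), and
    the syntax again follows later Pig releases.

        REGISTER categoryprobs.jar;  -- hypothetical jar holding the EvalFunc
        counts  = LOAD 'feature_counts' AS (feature:chararray, category:chararray, cnt:long);
        grouped = GROUP counts BY feature;
        -- the hypothetical CategoryProbs UDF receives the bag of grouped tuples
        -- for one feature, sums their counts internally, and returns a bag of
        -- (category, count/total) tuples
        probs   = FOREACH grouped GENERATE group AS feature, CategoryProbs(counts);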



Discussion Overview
group: user
categories: pig, hadoop
posted: Jul 28, '08 at 7:35p
active: Jul 29, '08 at 7:13p
posts: 7
users: 3
website: pig.apache.org
