Grokbase Groups Pig user July 2011
Possible to do conditional and more than one generate inside a foreach?

For example, I have tuples like this (names, days_ago):

(a,0)
(b,1)
(c,9)
(d,40)

b showed up 1 day ago, so it belongs to all of the following: yesterday, last
week, last month, and last quarter. So I'd like to turn the above into:

(a,0,today)
(b,1,yesterday)
(b,1,week)
(b,1,month)
(b,1,quarter)
(c,9,month)
(c,9,quarter)
(d,40,quarter)

I imagine/dream I could do something like this

B = FOREACH A
{
if (days_ago <= 90) generate name,days_ago,'quarter';
if (days_ago <= 30) generate name,days_ago,'month';
if (days_ago <= 7) generate name,days_ago,'week';
if (days_ago == 1) generate name,days_ago,'yesterday';
if (days_ago == 0) generate name,days_ago,'today';
}

Of course that's not valid syntax. I could write my own UDF, but it would be
nice if there were some way to get what I want without a UDF.

Thanks!
Dexin


  • Scott Foster at Jul 23, 2011 at 12:24 am
    Hi Dexin,
    This is the sort of thing I've started using Python UDFs for. See:
    http://wiki.apache.org/pig/UDFsUsingScriptingLanguages for examples of
    how to write the python code.

    If your udf was implemented in Python you could then do this...

    register 'udfs.py' using jython as udf;
    ...
    B = FOREACH A generate name, udf.daysAgoString(days_ago);

    scott.
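
The thread never shows the body of `daysAgoString`, so the following is only a minimal sketch of what such a function in `udfs.py` might look like. The labels, the thresholds (borrowed from the original post), and the fallback `outputSchema` shim are assumptions for illustration; under Pig's Jython engine the `outputSchema` decorator is expected to be supplied by Pig itself.

```python
# Hypothetical udfs.py. Pig's Jython script engine is assumed to provide the
# outputSchema decorator; the fallback below only exists so that this file
# can also be run and tested in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


@outputSchema("period:chararray")
def daysAgoString(days_ago):
    # Return the single narrowest label for days_ago, or None if nothing fits.
    # Thresholds are assumed from the pseudocode in the original post.
    if days_ago is None:
        return None
    if days_ago == 0:
        return 'today'
    if days_ago == 1:
        return 'yesterday'
    if days_ago <= 7:
        return 'week'
    if days_ago <= 30:
        return 'month'
    if days_ago <= 90:
        return 'quarter'
    return None
```

Note that a function like this yields one label per input row, so on its own it does not produce the multiple output rows asked for in the original post.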
  • Dexin Wang at Jul 23, 2011 at 1:11 am
    Thanks. I'm not familiar with Python, but I write a bunch of UDFs in Java.

    One question though: how do I pass the entire tuple to the UDF? I mean, I
    can't do something like this:

    B = FOREACH A GENERATE myudf(A)

    Essentially, given a tuple, I want to enrich it by adding one more field,
    where the value of the new field depends on the values of some existing
    fields in the tuple.

    (a,1) -> (a,1,yesterday)

    how would I do that?

    I imagine I can do
    B = GROUP A BY random;
    C = FOREACH B GENERATE myudf(A);

    But I really don't like adding another GROUP BY here.
  • Scott Foster at Jul 23, 2011 at 11:52 pm
    Dexin,

    After re-reading your original post, I better understand what you were
    asking and I see that I didn't really answer your question.

    Python UDFs do make writing UDFs much simpler so you might be more
    likely to actually use them.

    If you know Java, Python shouldn't be difficult to pick up.

    Though I haven't done it myself, I would say that you should be able
    to pass the tuple to the UDF. I see in the source for the
    ScriptingEngine that a Pig tuple is converted into a Python tuple, so
    you should be able to access any element of the Pig tuple in the Python
    UDF.

    I noticed this comment in the Python UDF manual though @
    http://pig.apache.org/docs/r0.8.1/udf.html#Python+UDFs
    # tuple in python are immutable, appending to a tuple is not possible.

    The immutability comment is important: you won't be able to enrich the
    tuple in place, but you can copy the values into a new tuple and return that.
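
As a rough illustration of the copy-and-return idea in plain Python (the helper name `enrich` is made up for this sketch, and the Pig-side output schema is deliberately left out):

```python
def enrich(row, label):
    """Return a new tuple holding the fields of `row` followed by `label`."""
    # Tuples are immutable, so there is no append(); concatenating with a
    # one-element tuple builds a fresh tuple and leaves `row` untouched.
    return tuple(row) + (label,)


original = ('a', 1)
enriched = enrich(original, 'yesterday')
print(enriched)  # ('a', 1, 'yesterday')
print(original)  # ('a', 1)
```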

    All that being said, here is one possible approach to the original
    problem that produces

    (a,0,(quarter,month,week,,today))
    (b,1,(quarter,month,week,yesterday,))
    (c,9,(quarter,month,,,))
    (d,40,(quarter,,,,))

    for your input data.

    C = FOREACH A GENERATE names, days_ago, udfs.timePeriods(names, days_ago);

    and the python UDF (in another file):

    @outputSchema("timeperiods:tuple(quarter:chararray,month:chararray,week:chararray,yesterday:chararray,today:chararray)")
    def timePeriods(names, days_ago):
        periods = []
        if days_ago <= 90:
            periods.append('quarter')
        else:
            periods.append(None)
        if days_ago <= 30:
            periods.append('month')
        else:
            periods.append(None)
        if days_ago <= 7:
            periods.append('week')
        else:
            periods.append(None)
        if days_ago == 1:
            periods.append('yesterday')
        else:
            periods.append(None)
        if days_ago == 0:
            periods.append('today')
        else:
            periods.append(None)
        return tuple(periods)

    It's not exactly what you wanted but maybe it will suggest a proper solution.

    scott.
  • Raghu Angadi at Jul 24, 2011 at 1:45 am
    I see 3 independent questions:

    1. How can we pass the entire row tuple to a UDF, as in 'B = FOREACH A GENERATE
    myudf(A)', without knowing the schema? I don't know if that is possible. It does
    feel like it should be possible.

    2. How can I return an augmented tuple? Your UDF can make a copy of the
    input tuple, add whatever you like to it, and return it. Maybe your
    question is not this simple.

    3. How can I make a UDF produce multiple rows for one input row, as in your
    example?
    - Your UDF needs to return a bag of row tuples. For (b,1) it would
    return {(b,1,yesterday), (b,1,week), ... }
    - Your Pig script would flatten the output of the UDF:
    B = foreach A generate FLATTEN( myUDF(name, days_ago) );
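
The third point could be sketched as a Jython UDF along the following lines. This is an illustration under assumptions, not code from the thread: the function name `periodRows`, the schema field names, and the `outputSchema` fallback shim are hypothetical, and the rules are copied from the pseudocode in the original post. It also relies on Pig treating a returned Python list of tuples as a bag.

```python
# Pig's Jython script engine is assumed to provide outputSchema; the fallback
# only exists so the sketch can be run and tested in plain Python.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorate(func):
            return func
        return decorate


# (test, label) pairs, in the order used by the pseudocode in the original post.
PERIOD_RULES = [
    (lambda d: d <= 90, 'quarter'),
    (lambda d: d <= 30, 'month'),
    (lambda d: d <= 7, 'week'),
    (lambda d: d == 1, 'yesterday'),
    (lambda d: d == 0, 'today'),
]


@outputSchema("rows:bag{t:tuple(name:chararray,days_ago:int,period:chararray)}")
def periodRows(name, days_ago):
    # Build one (name, days_ago, period) tuple per matching rule; the list
    # plays the role of the bag that FLATTEN later expands into rows.
    if days_ago is None:
        return []
    return [(name, days_ago, label)
            for matches, label in PERIOD_RULES
            if matches(days_ago)]
```

Because it follows the pseudocode literally, (a,0) yields quarter, month, week, and today, which is more than the single (a,0,today) row listed in the desired output, so the rules would need adjusting to match that list exactly.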

    Raghu.
  • Xiaomeng Wan at Jul 25, 2011 at 4:26 pm
    maybe you can try something like this:

    B = foreach A generate name,days_ago, FLATTEN(((days_ago == 1)?{('yesterday','week','month','quarter')}:((...)?:));

    Shawn
  • Xiaomeng Wan at Jul 25, 2011 at 4:28 pm
    no, you want a bag. should be this:

    B = foreach A generate name,days_ago, FLATTEN(((days_ago == 1)?{('yesterday'),('week'),('month'),('quarter')}:((...)?:));
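
To make the effect of this expression easier to follow, here is a small plain-Python model of it (not Pig code). Only the `days_ago == 1` branch comes from the message above; the other branches stand in for the elided `(...)` and are assumptions chosen to reproduce the desired output in the original post. The names `period_bag` and `flatten_rows` are made up for the sketch.

```python
def period_bag(days_ago):
    # Models the chain of nested ?: tests: the first matching test picks a
    # "bag" of one-field tuples. Only the == 1 branch is from the thread;
    # the remaining branches are assumed.
    if days_ago == 1:
        return [('yesterday',), ('week',), ('month',), ('quarter',)]
    if days_ago == 0:
        return [('today',)]
    if days_ago <= 7:
        return [('week',), ('month',), ('quarter',)]
    if days_ago <= 30:
        return [('month',), ('quarter',)]
    if days_ago <= 90:
        return [('quarter',)]
    return []


def flatten_rows(rows):
    # Models FLATTEN on a bag: one output row per tuple in the bag, with the
    # other generated fields repeated. An empty bag yields no rows.
    out = []
    for name, days_ago in rows:
        for (period,) in period_bag(days_ago):
            out.append((name, days_ago, period))
    return out


for row in flatten_rows([('a', 0), ('b', 1), ('c', 9), ('d', 40)]):
    print(row)
```

The reason the one-tuple version in the earlier message did not work is visible here: a bag holding a single four-field tuple would flatten to one wide row, whereas a bag of four one-field tuples flattens to four rows.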
  • Dexin Wang at Jul 25, 2011 at 4:49 pm
    Wow, awesome, works great! Thanks Shawn!

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jul 22, '11 at 11:42p
active: Jul 25, '11 at 4:49p
posts: 8
users: 4
website: pig.apache.org
