Hello,

I've got some activity data that I've processed with Pig to generate a
sequence of bags, one per user, that each contain a set of tuples of the
form (timestamp, activity id) that are ordered in time.
From each bag, I would like to produce a new bag of k-tuples where each
k-tuple contains a consecutive sequence of k activities from the original
ordered bag. For my initial stab at this, I wrote the following Jython UDF:

@outputSchemaFunction("schema")
def k_tuple_expansion(activities, k):
    """Scans through a time-ordered bag of tuples of the form
    (timestamp, activity id) for a given user and returns a bag of
    k-tuples of all activity sequences of length k.
    """
    tups = []
    for i in xrange(k-1, len(activities)):
        actseq = [activities[i-j][1] for j in range(k-1, -1, -1)]
        tups.append(tuple(actseq))
    return tups

@schemaFunction("schema")
def schema(input):
    # Return whatever type we were handed
    return input
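
The windowing logic can be checked outside Pig by running it on an ordinary Python list. This is only a minimal sketch: `k_windows` and the sample data are hypothetical names for illustration, and it assumes each element is a (timestamp, activity id) pair already sorted by time.

```python
def k_windows(activities, k):
    """Return every run of k consecutive activity ids, oldest first."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Drop the timestamps; only the activity ids go into the k-tuples.
    ids = [activity_id for _timestamp, activity_id in activities]
    # A bag with fewer than k activities yields no windows at all.
    return [tuple(ids[i:i + k]) for i in range(len(ids) - k + 1)]


# Hypothetical sample bag for one user.
sample = [(1, "login"), (2, "search"), (3, "view"), (4, "buy")]
print(k_windows(sample, 2))
print(k_windows(sample, 5))
```

Slicing produces the same windows as the index arithmetic in the UDF, and a bag shorter than k simply gives an empty result.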

This code works as intended. The problems arise when I try to process the
results further in Pig. Because the output schema depends on k, I can't
specify it statically, so Pig doesn't know what is being returned, and my
attempt to flatten each user's bag of k-tuples fails.

Given I could construct a string representation of the output schema once k
is known, is there some way to construct the string and pass it back? My
use of the schema function above follows the only example I've seen here.
https://cwiki.apache.org/PIG/udfsusingscriptinglanguages.html
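
For reference, building the schema string once k is known might look like the sketch below. The function name `k_tuple_schema`, the field names `a0`, `a1`, ..., the bag name `sequences`, and the `chararray` field type are all assumptions made for illustration; the actual type of the activity id is not stated here. The sketch covers only string construction, not how to hand the string back to Pig.

```python
def k_tuple_schema(k, field_type="chararray", bag_name="sequences"):
    """Build a Pig-style schema string for a bag of k-tuples."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    # One named, typed field per position in the k-tuple.
    fields = ",".join("a%d:%s" % (i, field_type) for i in range(k))
    return "%s:bag{t:tuple(%s)}" % (bag_name, fields)


print(k_tuple_schema(3))
```

With k = 3 this prints `sequences:bag{t:tuple(a0:chararray,a1:chararray,a2:chararray)}`.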

If the Jython UDF approach is not the best, is there a native Pig approach
to attacking this problem?

Any pointers would be most appreciated!

Chris

  • Andy Schlaikjer at Feb 22, 2012 at 6:45 pm
    Hi Chris,

    Not sure about your options in Jython, but it's pretty easy to do this
    in native Pig:

    1. Create a custom EvalFunc UDF whose constructor takes a String
    argument K (the expected size of the output tuples), parses it to an
    int, and stores it in a private field.
    2. Override the UDF's outputSchema method and use the value of K to
    construct a valid output schema of the appropriate size.
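
The shape of the two steps above can be mirrored in plain Python to show how the pieces fit together. This is a stand-in, not Pig API: a real implementation would be a Java class extending EvalFunc, and the names `KTupleExpansion`, `output_fields`, and `evaluate` are hypothetical.

```python
class KTupleExpansion(object):
    """Plain-Python stand-in for the UDF design described above."""

    def __init__(self, k_text):
        # Step 1: K arrives as a string; parse it once and keep it.
        self._k = int(k_text)
        if self._k < 1:
            raise ValueError("K must be a positive integer")

    def output_fields(self):
        # Step 2: K alone determines how many fields each output tuple has.
        return ["activity_%d" % i for i in range(self._k)]

    def evaluate(self, activities):
        # Each result has exactly K entries, matching output_fields().
        ids = [pair[1] for pair in activities]
        return [tuple(ids[i:i + self._k])
                for i in range(len(ids) - self._k + 1)]


udf = KTupleExpansion("3")
print(udf.output_fields())
print(udf.evaluate([(1, "a"), (2, "b"), (3, "c"), (4, "d")]))
```

Because K is fixed at construction time, the schema can be computed before any data is seen, which is what lets later steps such as flattening know the tuple layout.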

    Andy
    @sagemintblue
    In reply to Chris Diehl's message of Tue, Feb 21, 2012 at 5:55 PM.
