Pig user mailing list, May 2010
Hi,

Is there a way to read a collection (of unknown size) of tab-delimited
values into a single data type (tuple?) during the LOAD phase?

Here's specifically what I'm looking to do. I have a given input file format
of tab-delimited fields like so:

[timestamp] [count] [field1] [field2] [field3] ... [fieldN]

I'm writing a Pig job to take many small files and roll up the counts to a
coarser time granularity. For example, many files with timestamps rounded to
5-minute intervals will be rolled into a single file with 1-hour granularity.

I'm able to do this by grouping on the timestamp (rounded down to the hour)
and on each of the fields shown, provided I know the number of fields and
list them all explicitly. I'd like to write the script, though, so that it
works on different input formats, some of which might have N fields while
others have M. For a given job run, the number of fields in the input files
would be fixed.

So I'd like to be able to do something like this in pseudo code:

LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
...
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
...
STORE round_hour(timestamp), totalCount, rest_of_line

Where I know nothing about how many tokens are in rest_of_line. Any ideas
besides subclassing PigStorage or writing a new FileInputLoadFunc?

thanks,
Bill

  • Mridul Muralidharan at May 19, 2010 at 11:12 pm
    You can simply skip specifying a schema in the load, and access the
    fields either through a UDF or through positional indexes ($0, $1, and so on).

    Like:

    A = load 'myfile' USING PigStorage();
    B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
    C = ...



    Regards,
    Mridul
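
    A fuller sketch of that approach, not from the thread: round_hour is
    assumed to be a user-written UDF that has been REGISTERed, and the
    script sums the per-row counts into one total per hour, ignoring the
    trailing fields for now.

    -- no schema declared: every field is a bytearray, addressed as $0, $1, ...
    A = LOAD 'myfile' USING PigStorage('\t');
    -- bucket rows by the hour computed from the timestamp in $0
    B = GROUP A BY round_hour($0);
    -- sum the per-row counts ($1) within each hour bucket
    C = FOREACH B GENERATE group AS hour, SUM(A.$1) AS total;
    STORE C INTO 'hourly_counts' USING PigStorage('\t');

    Note this collapses everything into one row per hour; it does not yet
    group on the trailing fields, which is the crux of the question.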
  • Bill Graham at May 20, 2010 at 12:09 am
    Thanks Mridul, but how would I access the items in fields 3..N when I
    don't know what N is? Are you suggesting I pass A to a custom UDF to
    convert it to a tuple of [time, count, rest_of_line]?


  • Mridul Muralidharan at May 20, 2010 at 7:41 am
    I am not sure what the processing is once the grouping is done, but
    each tuple has a size() method (for arity) which gives the number of
    fields in that tuple [if you are using it in a UDF].
    So that can be used to aid in the computation.


    If you are interested in aggregating and simply storing it - you don't
    really need to know the arity of a tuple, right? (That is, group by
    timestamp, and store - PigStorage should continue to store the variable
    number of fields that was present in the input.)



    Regards,
    Mridul
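
    A minimal sketch of that aggregate-and-store point, assuming no schema
    is declared at load time: PigStorage writes back out however many
    fields each row happens to carry.

    A = LOAD 'input' USING PigStorage('\t');
    B = GROUP A BY $0;
    -- flattening the bag emits the original rows unchanged, variable arity and all
    C = FOREACH B GENERATE FLATTEN(A);
    STORE C INTO 'out' USING PigStorage('\t');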

  • Bill Graham at May 20, 2010 at 5:05 pm
    Correct, I don't need to know the arity of the tuple, and if I LOAD without
    specifying the fields like you show, I should be able to effectively STORE
    the same data. The problem, though, is that I need to include both the tuple
    and the timestamp in the grouping (but not the count), then sum the counts.

    As an example, this:

    1271201400000 3 1770 162 5
    1271201400000 4 2000 162 100
    1271201700000 3 1770 162 5
    1271201700000 4 2000 162 100

    Would become this (where 1271199600000 is the hour that the two timestamps
    both roll up to):

    1271199600000 6 1770 162 5
    1271199600000 8 2000 162 100

    So in my case I'd like to be able to load timestamp, count and tuple, then
    group on timestamp and tuple, and output in the same format of timestamp,
    count, tuple.

    The easiest hack I've come up with for now is to dynamically insert the
    field definitions into my script before I run it. So in the example above I
    would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another
    run might insert 'f1, f2' for an input that has only 2 extra fields.
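
    That hack can also be driven by Pig's parameter substitution instead of
    editing the script, since -param values are expanded textually before the
    script is parsed. A sketch, where FIELDS is supplied per run and
    round_hour is a user-written UDF; neither is from the thread:

    -- invoked as, e.g.: pig -param FIELDS='f1, f2, f3' rollup.pig
    A = LOAD 'input' USING PigStorage('\t') AS (ts:long, cnt:long, $FIELDS);
    B = GROUP A BY (round_hour(ts), $FIELDS);
    -- one output row per (hour, trailing-fields) combination, counts summed
    C = FOREACH B GENERATE FLATTEN(group), SUM(A.cnt);
    STORE C INTO 'rolled_up' USING PigStorage('\t');

    One caveat: the summed count lands in the last column rather than the
    second, so a final FOREACH with a second substituted field list would be
    needed to reproduce the input column order exactly.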

  • Mridul Muralidharan at May 20, 2010 at 9:29 pm
    Hmm, I am not sure you can do this without a UDF - you want to replace
    fields of a tuple while leaving the rest of it intact.
    If you were simply adding to it, it would have been possible [a
    <new_field, *> style approach].


    Regards,
    Mridul
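
    For reference, the adding case he mentions looks roughly like this
    sketch (round_hour again standing in for a user-written UDF):

    A = LOAD 'input' USING PigStorage('\t');
    -- prepend a computed field while passing every original field through untouched
    B = FOREACH A GENERATE round_hour($0), *;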

  • Dmitriy Ryaboy at May 20, 2010 at 6:38 am
    At the moment the answer is "Preprocessor", I believe.
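
    "Preprocessor" presumably means either generating the script externally
    (as in the hack above) or Pig's own textual substitution, e.g. %declare;
    the SCHEMA parameter here is illustrative:

    -- %declare substitutes the text before the script is parsed
    %declare SCHEMA 'ts:long, cnt:long, f1, f2, f3'
    A = LOAD 'input' USING PigStorage('\t') AS ($SCHEMA);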

