Correct, I don't need to know the arity of the tuple, and if I LOAD without
specifying the fields as you show, I should be able to effectively STORE
the same data. The problem though is that I need to include both the tuple
and the timestamp in the grouping (but not the count), then sum the counts.
As an example, this:
1271201400000 3 1770 162 5
1271201400000 4 2000 162 100
1271201700000 3 1770 162 5
1271201700000 4 2000 162 100
Would become this (where 1271199600000 is the hour that the two timestamps
both roll up to):
1271199600000 6 1770 162 5
1271199600000 8 2000 162 100
So in my case I'd like to be able to load timestamp, count, and tuple, then
group on timestamp and tuple, and output in the same format of
timestamp, count, tuple.
The easiest hack I've come up with for now is to dynamically insert the
field definitions in my script before I run it. So in the example above I
would insert 'f1, f2, f3' everywhere I need to reference the tuple. Another
run might insert 'f1, f2' for an input that only has 2 extra fields.
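That preprocessing hack could be sketched in Python roughly like this (the
$TUPLE_FIELDS placeholder name is made up for illustration):

```python
def tuple_fields(n):
    # Build the field list for n extra fields, e.g. 'f1, f2, f3' for n=3.
    return ', '.join('f%d' % i for i in range(1, n + 1))

def render_script(template, n_fields):
    # Substitute the field list into a Pig script template before running it.
    return template.replace('$TUPLE_FIELDS', tuple_fields(n_fields))

template = "B = GROUP A BY (hour, $TUPLE_FIELDS);"
print(render_script(template, 3))  # B = GROUP A BY (hour, f1, f2, f3);
```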
On Thu, May 20, 2010 at 12:39 AM, Mridul Muralidharan wrote:
I am not sure what the processing is once the grouping is done, but each
tuple has a size() method (for arity) which gives us the number of fields in
that tuple [if using in udf].
So that can be used to aid in computation.
If you are interested in aggregating and simply storing it - you don't
really need to know the arity of a tuple, right? (That is, group by
timestamp, and store - PigStorage should continue to store the variable
number of fields as was present in the input.)
On Thursday 20 May 2010 05:39 AM, Bill Graham wrote:
Thanks Mridul, but how would I access the items in the numbered fields
3..N where I don't know what N is? Are you suggesting I pass A to a
custom UDF to convert to a tuple of [time, count, rest_of_line]?
On Wed, May 19, 2010 at 4:11 PM, Mridul Muralidharan wrote:
You can simply skip specifying schema in the load - and access the
fields either through the udf or through $0, etc positional indexes.
A = load 'myfile' USING PigStorage();
B = GROUP A by round_hour($0) PARALLEL $PARALLELISM;
C = ...
On Thursday 20 May 2010 04:07 AM, Bill Graham wrote:
Is there a way to read a collection (of unknown size) of
values into a single data type (tuple?) during the LOAD phase?
Here's specifically what I'm looking to do. I have a given input
of tab-delimited fields like so:
[timestamp] [count] [field1] [field2] [field3] .. [fieldN]
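As a rough illustration of this layout (Python, not Pig; the helper name is
invented), each line splits into a timestamp, a count, and however many
trailing fields the input happens to carry:

```python
def parse_line(line):
    # Split a tab-delimited line into (timestamp, count, rest_of_line).
    fields = line.rstrip('\n').split('\t')
    return int(fields[0]), int(fields[1]), fields[2:]

ts, count, rest = parse_line('1271201400000\t3\t1770\t162\t5')
# len(rest) varies per input format; the script shouldn't need to know it
```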
I'm writing a pig job to take many small files and roll up the
counts for a
given time increment of a lesser granularity. For example, many
timestamps rounded to 5 minute intervals will be rolled into a single
timestamp with 1 hour granularity.
I'm able to do this by grouping on the timestamp (rounded down
to the hour)
and each of the fields shown if I know the number of fields and
I list them
all explicitly. I'd like to write this script though so that it would
work with different input formats, some of which might have N fields,
and others M. For a given job run, the number of fields in the input files
is constant.
So I'd like to be able to do something like this in pseudo code:
LOAD USING PigStorage('\t') AS (timestamp, count, rest_of_line)
GROUP BY round_hour(timestamp), rest_of_line
[flatten group and sum counts]
STORE round_hour(timestamp), totalCount, rest_of_line
Where I know nothing about how many tokens are in rest_of_line.
Is there a way to do this besides subclassing PigStorage or writing a new
FileInputLoadFunc?