Grokbase Groups Pig user April 2011
Hi all,

I'm trying to do something with Pig and I'm not sure whether it's possible.
I'm hoping somebody can help me work out how to proceed.

I have a log file whose lines have relationships with each other.
The structure of each log line is:

DATE, UUID, CATNAME, DESCRIPTION, ID, PARENTS

Examples of this include:

DATE, UUID, Apple, this is a log line, id:9, parent:8,9
DATE, UUID, Vegetables, this is a log line, id:4
DATE, UUID, Carrots, this is a log line, id:6, parent:4,5
DATE, UUID, Pineapple, this is a log line, id:8, parent:7
DATE, UUID, Potato, id:12, parent:11,12, this is a log line,
DATE, UUID, Parsnip, this is a log line, id:5, parent:4
DATE, UUID, Fruit, this is a log line, id:7
DATE, UUID, Vegetables, id:10, this is a log line,
DATE, UUID, Beetroot, id:11, parent:10, this is a log line,
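
To make the parsing problem concrete, here is a minimal sketch in plain Python (not Pig) of pulling the fields out of one line. It assumes the first three fields are always DATE, UUID, CATNAME, that the description, id: and parent: fields can follow in any order as in the examples above, and that the description itself never contains "id:" or "parent:". The names LogRecord and parse_line are made up for illustration.

```python
import re
from typing import NamedTuple

# Assumed field formats, taken from the example lines: "id:9", "parent:8,9".
ID_RE = re.compile(r"\bid:(\d+)")
PARENT_RE = re.compile(r"\bparent:(\d+(?:,\d+)*)")


class LogRecord(NamedTuple):
    date: str
    uuid: str
    catname: str
    description: str
    id: int
    parents: tuple  # empty tuple when the line has no parent: field


def parse_line(line: str) -> LogRecord:
    # Only the first three commas are field separators we can trust;
    # the parent list contains commas of its own.
    date, uuid, catname, rest = (part.strip() for part in line.split(",", 3))

    id_match = ID_RE.search(rest)
    if id_match is None:
        raise ValueError(f"no id: field in line: {line!r}")

    parent_match = PARENT_RE.search(rest)
    parents = ()
    if parent_match:
        parents = tuple(int(p) for p in parent_match.group(1).split(","))

    # Whatever is left after removing id: and parent: is the description.
    leftover = PARENT_RE.sub("", ID_RE.sub("", rest))
    description = ", ".join(
        piece.strip() for piece in leftover.split(",") if piece.strip()
    )
    return LogRecord(date, uuid, catname, description,
                     int(id_match.group(1)), parents)
```

Matching id: and parent: by pattern rather than by position is what lets the same function handle both the Apple line and the Potato line.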

I'm currently extracting these into a schema using Pig, and I can order them
by UUID, CATNAME, ID, PARENTS to give me an ordered list of lines. The above
would be transformed into the following (including the description):

UUID, Vegetables, id:4, this is a log line,
UUID, Parsnip, id:5, parent:4, this is a log line,
UUID, Carrots, id:6, parent:4,5, this is a log line,
UUID, Fruit, id:7, this is a log line,
UUID, Apple, id:9, parent:8,9, this is a log line,
UUID, Pineapple, id:8, parent:7, this is a log line,
UUID, Vegetables, id:10, this is a log line,
UUID, Beetroot, id:11, parent:10, this is a log line,
UUID, Potato, id:12, parent:11,12, this is a log line,

What I'm then trying to do is generate a report for each CATNAME that
includes the children for that operation. If I specified 'Vegetables', the
resulting report should look like this:

UUID, Vegetables, this is a log line, id:4
- UUID, Parsnip, this is a log line, id:5, parent:4
- UUID, Carrots, this is a log line, id:6, parent:4,5

UUID, Vegetables, this is a log line, id:10,
- UUID, Beetroot, this is a log line, id:11, parent:10
- UUID, Potato, this is a log line, id:12, parent:11,12
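
The report above can be sketched in plain Python (not Pig) to pin down what "children" means. Note that Potato lists parents 11 and 12, not 10, so it only reaches the second Vegetables line through Beetroot; the sketch therefore collects descendants transitively. It assumes ids are unique across the data set and drops DATE, UUID and the description for brevity. The names RECORDS, descendants and report are made up for illustration.

```python
# (catname, id, parents) for the example lines above.
RECORDS = [
    ("Apple", 9, (8, 9)),
    ("Vegetables", 4, ()),
    ("Carrots", 6, (4, 5)),
    ("Pineapple", 8, (7,)),
    ("Potato", 12, (11, 12)),
    ("Parsnip", 5, (4,)),
    ("Fruit", 7, ()),
    ("Vegetables", 10, ()),
    ("Beetroot", 11, (10,)),
]


def descendants(root_id, records):
    """Records reachable from root_id through parent links, ordered by id."""
    included = {root_id}
    found = []
    changed = True
    while changed:  # repeat until a full pass adds nothing new
        changed = False
        for catname, rec_id, parents in records:
            if rec_id in included:
                continue  # also makes self-references like parent:11,12 harmless
            if any(p in included for p in parents):
                included.add(rec_id)
                found.append((catname, rec_id, parents))
                changed = True
    return sorted(found, key=lambda rec: rec[1])


def format_record(catname, rec_id, parents):
    line = f"{catname}, id:{rec_id}"
    if parents:
        line += ", parent:" + ",".join(str(p) for p in parents)
    return line


def report(catname, records):
    """One block per line with the given CATNAME, followed by its descendants."""
    lines = []
    roots = sorted((r for r in records if r[0] == catname), key=lambda r: r[1])
    for root in roots:
        lines.append(format_record(*root))
        for child in descendants(root[1], records):
            lines.append("- " + format_record(*child))
        lines.append("")  # blank line between blocks
    return lines


if __name__ == "__main__":
    print("\n".join(report("Vegetables", RECORDS)))
```

The repeated passes in descendants are the part that has no obvious single-statement equivalent in Pig, which is really the heart of the question.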

I'm not quite sure how to do this in Pig; the difficulty is the relationship
between lines. I was thinking I could:

A = Group all log lines by CATNAME
B = Filter all log lines that have a non-null parent field
C = FOREACH B, extract the parent: field, parse it, and look up each log line
whose id matches an entry in the parent field, possibly using a custom UDF.

I could have thousands of CATNAMEs, though.
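
As a way of checking the plan before writing any Pig, the steps can be sketched in plain Python as a flatten-then-join: turn each parent: field into one (child id, parent id) pair per entry, then join those pairs back to the records on id and group by the parent's CATNAME. This is only a sketch under the assumption that ids are unique; the names RECORDS, explode_parents and children_by_catname are made up, and it has not been tried against Pig itself.

```python
from collections import defaultdict

# (catname, id, parents) for the example lines above.
RECORDS = [
    ("Apple", 9, (8, 9)),
    ("Vegetables", 4, ()),
    ("Carrots", 6, (4, 5)),
    ("Pineapple", 8, (7,)),
    ("Potato", 12, (11, 12)),
    ("Parsnip", 5, (4,)),
    ("Fruit", 7, ()),
    ("Vegetables", 10, ()),
    ("Beetroot", 11, (10,)),
]


def explode_parents(records):
    """Step B and the first half of C: one (child_id, parent_id) pair per
    parent entry. Lines without parents produce no pairs, and entries that
    point at the line itself (e.g. id:12, parent:11,12) are skipped."""
    return [
        (rec_id, parent_id)
        for _, rec_id, parents in records
        for parent_id in parents
        if parent_id != rec_id
    ]


def children_by_catname(records):
    """Second half of C plus step A: join the pairs to the records on id,
    then group the child CATNAMEs under the parent's CATNAME."""
    catname_by_id = {rec_id: catname for catname, rec_id, _ in records}
    grouped = defaultdict(set)
    for child_id, parent_id in explode_parents(records):
        parent_cat = catname_by_id.get(parent_id)
        if parent_cat is not None:  # inner join: unknown parent ids are dropped
            grouped[parent_cat].add(catname_by_id[child_id])
    return {cat: sorted(kids) for cat, kids in grouped.items()}


if __name__ == "__main__":
    for cat, kids in sorted(children_by_catname(RECORDS).items()):
        print(cat, "->", ", ".join(kids))
```

One join only finds direct children: Potato ends up under Beetroot here, not under Vegetables, so the join would have to be repeated to pick up deeper levels. Grouping by CATNAME also merges the two Vegetables lines (id:4 and id:10) into one group.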

Does anybody have an idea on how I could do this in Pig?

Many thanks in advance,
Jon.

group: user @
categories: pig, hadoop
posted: Apr 1, '11 at 3:50p
active: Apr 1, '11 at 3:50p
posts: 1
users: 1
website: pig.apache.org

Jonathan Holloway: 1 post