Hello,

I'll preface this by saying that I know very little Java and am just
learning Pig.

My situation is that I am aggregating logs with Flume into a single
logfile. All my logs are in JSON format and are gzip'd before being added
to S3. Each file contains 3 types of log lines (b, i, c). Since I can't
seem to get anything to work, I have pulled a few logfiles down to the local
machine and am running Pig in local mode on the decompressed log files.

What I am trying to do is write a Pig script to parse the JSON and then
run queries against it. Since there are 3 types of lines in the same file,
when I do an ILLUSTRATE of a regex (that I know works because I have tested
it against multiple regex matching programs), it only shows me the first
line, not the first matching line. The log line of type 'b' is nested
JSON, so I am staying away from that for now (mostly because I can't
figure out how to get the Java in this Gist to build:
https://gist.github.com/601331). Log lines 'i' and 'c' are single-level
JSON (not nested), so a simple regex should work if I understand everything
correctly.
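
The difference between "the first line" and "the first matching line" can be sketched outside Pig in a few lines of Java. This is only an illustration: the sample lines, the "kind" field, the wv-based pattern, and the class and method names are all made up for the sketch and do not come from the real logs.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Pattern;

class FirstMatchingLine {
    // Hypothetical pattern: a line counts as a match if it carries a quoted "wv" value.
    static final Pattern HAS_WV = Pattern.compile("\"wv\":\"([^\"]*)\"");

    // Made-up stand-ins for a mixed logfile; the real 'b', 'i' and 'c' lines look different.
    static final List<String> SAMPLE_LINES = Arrays.asList(
        "{\"kind\":\"b\",\"nested\":{\"k\":\"v\"}}",
        "{\"kind\":\"c\",\"clicked\":\"yes\"}",
        "{\"kind\":\"i\",\"wv\":\"2\"}");

    // Returns the first line in which the pattern is found, or empty if no line matches.
    static Optional<String> firstMatching(List<String> lines, Pattern pattern) {
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                return Optional.of(line);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The first line of the file is not necessarily a line the pattern applies to.
        System.out.println("first line:          " + SAMPLE_LINES.get(0));
        System.out.println("first matching line: "
            + firstMatching(SAMPLE_LINES, HAS_WV).orElse("(none)"));
    }
}
```

In other words, lines the pattern does not apply to have to be skipped or filtered out before the first interesting row shows up.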

More specifics are in this StackOverflow question I posted as well (
http://stackoverflow.com/questions/5013003/how-do-i-parse-json-in-pig).
Feel free to answer it for the points if we answer the question here.

The version of Hadoop is 0.20 and Pig is 0.6 because that is what is on
the EMR (Elastic Map Reduce) instances.

Here is where I am at:
----
Example log line type 'i':
{"exchange_id":"4cc877b81badf422af000010","exchange_user_id":"MTY4Mjk2NTk2eDAuODA2IDEyOTc4MDI5NTh4MTI2NDc5NjY2MA","bid_id":"00cc4341-facb-4ec1-a403-d5309472d70e","bid_amount":"2.05","win_amount":1.369999968133322,"ad_ids":"4d237a731badf45c8200011a,4d237ac81badf45c85000006,4d4c64c0e32b132113000013,4d23807a1badf45c85000299","wv":"2","logged_at":"2011-02-15T23:36:31.386Z"}

Pig Script Attempt:
REGISTER file:/home/hadoop/lib/pig/piggybank.jar;
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();
RAW_LOGS = LOAD 'file:/home/hadoop/logs/adserver.log' USING TextLoader AS (line:chararray);
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(EXTRACT(line,'{"exchange_id":"(.*[^"])","exchange_user_id":"(.*[^"])","bid_id":"(.*[^"])","bid_amount":"(.*[^"])","win_amount":(.*),"ad_ids":"(.*[^"])","wv":"(.*[^"])","logged_at":"(.*[^"])"}'))
    AS (exchange_id:chararray,exchange_user_id:chararray,bid_id:chararray,bid_amount:float,win_amount:float,ad_ids:chararray,wv:int,logged_at:chararray);
WIDGET_VERSION_ONLY = FOREACH LOGS_BASE GENERATE wv;
WIDGET_VERSION_COUNT = FOREACH (GROUP WIDGET_VERSION_ONLY BY $0) GENERATE $0, COUNT($1) AS num;
WIDGET_VERSION_SORTED_COUNT = LIMIT (ORDER WIDGET_VERSION_COUNT BY num DESC) 5;
----
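
A quick way to check the type 'i' pattern outside Pig is a small Java sketch run against the example line above. It assumes EXTRACT hands the pattern to java.util.regex, which (unlike some regex testers) rejects an unescaped `{` that does not start a valid repetition. The escaped braces, the `[^"]*` groups, and the class and method names are choices made for this sketch and are not part of the script.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LogLineRegexCheck {
    // Type 'i' pattern adapted from the script. Assumptions of this sketch:
    // braces are escaped, and each quoted value is captured with [^"]* instead of .*[^"].
    static final Pattern TYPE_I = Pattern.compile(
        "\\{\"exchange_id\":\"([^\"]*)\",\"exchange_user_id\":\"([^\"]*)\","
        + "\"bid_id\":\"([^\"]*)\",\"bid_amount\":\"([^\"]*)\","
        + "\"win_amount\":([^,]*),\"ad_ids\":\"([^\"]*)\","
        + "\"wv\":\"([^\"]*)\",\"logged_at\":\"([^\"]*)\"\\}");

    // The example type 'i' log line from the post.
    static final String SAMPLE =
        "{\"exchange_id\":\"4cc877b81badf422af000010\","
        + "\"exchange_user_id\":\"MTY4Mjk2NTk2eDAuODA2IDEyOTc4MDI5NTh4MTI2NDc5NjY2MA\","
        + "\"bid_id\":\"00cc4341-facb-4ec1-a403-d5309472d70e\","
        + "\"bid_amount\":\"2.05\",\"win_amount\":1.369999968133322,"
        + "\"ad_ids\":\"4d237a731badf45c8200011a,4d237ac81badf45c85000006,"
        + "4d4c64c0e32b132113000013,4d23807a1badf45c85000299\","
        + "\"wv\":\"2\",\"logged_at\":\"2011-02-15T23:36:31.386Z\"}";

    // Returns the eight captured fields in order, or null if the whole line does not match.
    static List<String> extract(String line) {
        Matcher m = TYPE_I.matcher(line);
        if (!m.matches()) {
            return null;
        }
        List<String> fields = new ArrayList<String>();
        for (int i = 1; i <= m.groupCount(); i++) {
            fields.add(m.group(i));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Prints the captured fields for the sample line, one per line.
        List<String> fields = extract(SAMPLE);
        if (fields == null) {
            System.out.println("no match");
            return;
        }
        for (String field : fields) {
            System.out.println(field);
        }
    }
}
```

If the fields come out as expected here, the same pattern string is a reasonable candidate to paste back into the EXTRACT call. Whether Pig needs additional escaping inside its single-quoted string is not something this sketch covers.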

Any help that would push me in the right direction would be greatly
appreciated.

-e
--
Eric Lubow
e: eric.lubow@gmail.com
w: eric.lubow.org

Discussion Overview
group: user @
categories: pig, hadoop
posted: Feb 17, '11 at 4:54p
active: Feb 17, '11 at 4:54p
posts: 1
users: 1
website: pig.apache.org
