So right off the bat, I fixed the regex patterns in my split, but I
kept getting an error from the multiquery optimizer. Specifically:

ERROR 2146: Internal Error. Inconsistency in key index found during
optimization. + stacktrace

As a temporary fix, I re-ran without multiquery optimization. As a
result, the script is running much slower. My question is: what
exactly is causing this issue, and how can I fix my script so that I can
run my queries and still take advantage of the optimizer?
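
A minimal sketch of how the patterns in the SPLIT could be corrected, assuming the intent is "userid contains BROKEN or NOPERM". Pig's matches operator takes a Java regular expression that has to match the entire string, so a glob-style '*BROKEN*' is not a valid pattern. The patterns actually used in the fixed script are not shown in this thread, so the ones below are an assumption:

```pig
-- Sketch only: assumes a substring match on userid is what is wanted.
-- matches uses Java regex and must match the whole value,
-- so the substring form is '.*BROKEN.*' rather than the glob '*BROKEN*'.
SPLIT dailyraw INTO
    broken IF (userid matches '.*BROKEN.*'),
    noperm IF (userid matches '.*NOPERM.*'),
    daily  IF (NOT ((userid matches '.*BROKEN.*') OR (userid matches '.*NOPERM.*')));
```
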
On Thu, Sep 3, 2009 at 4:03 PM, zaki rahaman wrote:

Hi all,

I'm becoming a bit more comfortable writing scripts, but I'm still not always
sure what the best way is to structure/frame my statements in order to optimize
performance. When it comes to Split and Filter, for example, one could
filter multiple times on a raw set of data or condense it into one split
statement, but it's not clear from the docs what the best practice in this
case is. Below is my script as it stands. Your input would be greatly
appreciated.

-- Queries for August by Day/Month/Week

REGISTER mypigudfs.jar;

raw = LOAD 'data' AS (timestamp:chararray, ip:chararray, userid:chararray);

dailyraw = FOREACH raw GENERATE userid, mypigudfs.ExtractDay(timestamp) AS day;

SPLIT dailyraw INTO broken IF (userid matches '*BROKEN*'), noperm IF
(userid matches '*NOPERM*'), daily IF (NOT ((userid matches '*BROKEN*') OR
(userid matches '*NOPERM*')));

-- Daily Count(s)

daygrp = GROUP daily BY day PARALLEL 36;
daycnts = FOREACH daygrp GENERATE group, COUNT(daily);

-- NoPerm
npgrp = GROUP noperm BY day;
npcnts = FOREACH npgrp GENERATE group, COUNT(noperm);

brkgrp = GROUP broken BY day;
brkcnts = FOREACH brkgrp GENERATE group, COUNT(broken);

-- Weekly Count(s)

weekly = FOREACH daily GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
wkgrp = GROUP weekly By week PARALLEL 36;
wkcnts = FOREACH wkgrp GENERATE group, COUNT(weekly);

broken2 = FOREACH broken GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
brkgrp2 = GROUP broken2 BY week;
brkcnts2 = FOREACH brkgrp2 GENERATE group, COUNT(broken2);

noperm2 = FOREACH noperm GENERATE userid, mypigudfs.ExtractWeek(day) AS week;
npgrp2 = GROUP noperm2 BY week;
npcnts2 = FOREACH npgrp2 GENERATE group, COUNT(noperm2);

-- Monthly Count

month = GROUP weekly ALL;
mcnt = FOREACH month GENERATE COUNT(weekly);

npmonth = GROUP noperm2 ALL;
npmcnt = FOREACH npmonth GENERATE COUNT(noperm2);

brkmonth = GROUP broken2 ALL;
brkmcnt = FOREACH brkmonth GENERATE COUNT(broken2);

-- Store Output

Zaki Rahaman
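
On the Split versus Filter question, a short sketch of the same three relations written as separate FILTER statements. Semantically, a SPLIT with conditions behaves like one FILTER per branch: a tuple can land in more than one relation, or in none. The aliases follow the script above, and the '.*X.*' patterns are an assumed substring form rather than anything confirmed in this thread:

```pig
-- Sketch only: three FILTERs equivalent to the single SPLIT statement.
-- Patterns are assumed; matches needs a Java regex covering the whole value.
broken = FILTER dailyraw BY userid matches '.*BROKEN.*';
noperm = FILTER dailyraw BY userid matches '.*NOPERM.*';
daily  = FILTER dailyraw BY NOT ((userid matches '.*BROKEN.*') OR (userid matches '.*NOPERM.*'));
```

With multiquery execution enabled, reusing dailyraw in several statements should be handled as an implicit split, so the two forms are expected to read the input once either way. That expectation is based on the multiquery documentation, not on measurements from this script.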

Posted Sep 4, '09 at 6:35p