Grokbase Groups Pig user August 2010
Update:
After downloading and installing pig-0.6.0, I ran the script again over
the same data set. It produced the desired results. I don't know what I
am doing wrong in 0.7.0, but I will revert to 0.6.0 until I can sort out
what went wrong. Thoughts are still welcome and wanted :D

Thanks,
Matt

-----Original Message-----
From: Matthew Smith
Sent: Monday, August 23, 2010 11:39 AM
To: Thejas M Nair; pig-user@hadoop.apache.org
Subject: RE: ORDER Issue (repost to avoid spam filters)

Changed the script to:
start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
not_null_bytes = FILTER target BY bytes is not null;
dump not_null_bytes;

and dumped the expected tuples. There were plenty of valid records. I
will attempt to revert everything to pig-0.6.0 and re-run the scripts to
determine whether the issue is in pig-0.7.0.

Matt

-----Original Message-----
From: Thejas M Nair
Sent: Friday, August 20, 2010 5:23 PM
To: pig-user@hadoop.apache.org; Matthew Smith
Subject: Re: ORDER Issue (repost to avoid spam filters)

I was wondering if the bytes column has all null values (probably
because the input has formatting issues).

Can you check if the following query gives any output -

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);

target = FILTER start BY sip matches '51.37.8.63';

non_null_bytes = FILTER target BY bytes is not null;

dump non_null_bytes;
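If the bytes column is loading as null, the raw input file is the quickest place to check. A minimal sketch, assuming bytes is the 7th pipe-delimited field of 'inputData' (per the schema above), that prints any row whose bytes field is not a plain integer and would therefore load as null under `bytes:int`:

```shell
# Print rows whose 7th pipe-delimited field (bytes) is not a plain
# integer; these load as NULL under the bytes:int schema.
awk -F'|' '$7 !~ /^[0-9]+$/ { print NR": "$0 }' inputData | head
```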

-Thejas

On 8/20/10 1:56 PM, "Matthew Smith" wrote:

UPDATE: I attempted my code in the Amazon cloud (aws.amazon.com) and the
script worked as intended over the data set. This leads me to believe
that the issue is with pig-0.7.0 or my configuration. I would, however,
like to not pay for something that is free :D. Any other ideas would be
most welcome.



@Thejas

I changed the script to:

start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
just_bytes = FOREACH target GENERATE bytes;
fail = ORDER just_bytes BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;



and received the same error as before. I then changed the script to:



start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
STORE target INTO 'myoutput';
second_start = LOAD 'myoutput/part-m-00000' USING PigStorage('\t') AS
(sip:chararray, dip:chararray, sport:int, dport:int, protocol:int,
packets:int, bytes:int, flags:chararray, startTime:long,
endTime:long);
fail = ORDER second_start BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;



and received the same error.



@Mridul

I am using local mode at the moment. I don't understand the second
question.



Thanks,

Matt







From: Thejas M Nair
Sent: Thursday, August 19, 2010 5:34 PM
To: pig-user@hadoop.apache.org; Matthew Smith
Subject: Re: ORDER Issue (repost to avoid spam filters)



I think 0.7 had an issue where order-by used to fail if the input was
empty, but that does not seem to be the case here.
I am wondering if there is a parsing/data-format issue that is causing
the bytes column to be empty, though I am not aware of empty/null values
in the sort column causing problems.
Can you try dumping just the bytes column?
Another thing you can try is to store the output of the filter and load
the data again before doing the order-by.

Please let us know what you find.

Thanks,
Thejas




On 8/19/10 11:35 AM, "Matthew Smith" wrote:

All,



I am running pig-0.7.0 and I have been running into an issue running the
ORDER command. I have attempted to run pig out of the box on 2 separate
LINUX OS (Ubuntu 10.4 and OpenSuse 11.2) and the same issue has
occurred. I run these commands in a script file:



start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
fail = ORDER target BY bytes DESC;
not_reached = LIMIT fail 10;
dump not_reached;





The error is listed below. I then run:





start = LOAD 'inputData' USING PigStorage('|') AS (sip:chararray,
dip:chararray, sport:int, dport:int, protocol:int, packets:int,
bytes:int, flags:chararray, startTime:long, endTime:long);
target = FILTER start BY sip matches '51.37.8.63';
dump target;





This script produces a large list of sips matching the filter. What am
I doing wrong that keeps Pig from ORDERing these records? I have been
wrestling with this issue for a week now. Any help would be greatly
appreciated.







Best,



Matthew



ERROR:

java.lang.RuntimeException:
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path
does not exist: file:/user/matt/pigsample_24118161_1282155871461
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:135)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)
Caused by: org.apache.hadoop.mapreduce.lib.input.InvalidInputException:
Input path does not exist: file:/user/matt/pigsample_24118161_1282155871461
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigFileInputFormat.listStatus(PigFileInputFormat.java:37)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.pig.impl.io.ReadToEndLoader.init(ReadToEndLoader.java:153)
    at org.apache.pig.impl.io.ReadToEndLoader.<init>(ReadToEndLoader.java:115)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:108)
    ... 6 more
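The missing path in the trace is the temporary sample file that Pig's ORDER writes for its WeightedRangePartitioner (the `pigsample_*` name). One hedged first check, assuming the `file:/user/matt/` prefix in the trace is where Pig resolved its working directory in local mode, is simply whether that directory exists and is writable:

```shell
# The order-by sampling job reads a temp file Pig wrote earlier; check
# that the directory it resolved to (from the trace) exists and is writable.
dir=/user/matt
if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "ok: $dir exists and is writable"
else
    echo "missing or unwritable: $dir"
fi
```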










Discussion Overview
group: user @ pig.apache.org
categories: pig, hadoop
posted: Aug 19, '10 at 6:36p
active: Aug 25, '10 at 4:04p
posts: 8
users: 3
