
On Tue, Apr 20, 2010 at 2:35 PM, Dmitriy Ryaboy wrote:

Like I said, proof of concept. Patches accepted :-).
It shouldn't be too hard to make it work with globs in 0.7 -- iirc, in 0.7
the functionality that interprets the load string and figures out globs is
factored out so that it can be reused by all loaders without needing to
extend PigStorage.

-D

On Tue, Apr 20, 2010 at 11:14 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:

On Tue, Apr 20, 2010 at 1:11 PM, Dmitriy Ryaboy <dvryaboy@gmail.com> wrote:
Edward,
The sequence file loader in piggybank is more a proof of concept than a
real loader. It works, but only if your data happens to match exactly the
format it expects -- namely, key-value pairs where both the key and the
value are one of {Text, IntWritable, LongWritable, FloatWritable,
DoubleWritable, BooleanWritable, ByteWritable}. You seem to be loading 3
columns, which doesn't match this format.

I am not sure what you mean by a delimiter for a SequenceFile.
SequenceFiles are binary, not character-delimited. Are you storing a
string as a value, and trying to interpret said string? If that's the
case, then as you suggested, the thing to do is to load that string as a
value and use TOKENIZE or some other string parsing function to extract
your fields. Or write a loader that knows that the user wants to apply
further processing to values, and do that in the loader.
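
The "further processing" step can be sketched in plain Java. This is only an illustration, assuming each SequenceFile value is a single tab-delimited string holding the version, id, and date fields from the schema in the original question. The class name ValueFields and its parse method are hypothetical. They are not part of Pig or piggybank, and the sketch does not use Pig's loader API.

```java
// Hypothetical helper, not part of Pig or piggybank: splits one tab-delimited
// SequenceFile value into the three fields named in the original schema.
public class ValueFields {
    final String version;
    final int id;
    final String date;

    ValueFields(String version, int id, String date) {
        this.version = version;
        this.id = id;
        this.date = date;
    }

    // Parses "version<TAB>id<TAB>date". Throws IllegalArgumentException when
    // the field count is wrong or the id is not an integer
    // (NumberFormatException is a subclass of IllegalArgumentException).
    static ValueFields parse(String value) {
        // A limit of -1 keeps trailing empty fields, so the count check is accurate.
        String[] parts = value.split("\t", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException(
                "expected 3 tab-separated fields, got " + parts.length);
        }
        return new ValueFields(parts[0], Integer.parseInt(parts[1].trim()), parts[2]);
    }

    public static void main(String[] args) {
        ValueFields f = parse("1.0\t42\t2010-04-20");
        System.out.println(f.version + " | " + f.id + " | " + f.date);
    }
}
```

A custom loader would apply something like this to each value it reads. The alternative described above is to load the value as a single chararray and split it inside the Pig script.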

As for your other question, Pig works with Hadoop globs, not just
individual files (though some loaders may not support that -- PigStorage
definitely does).

-Dmitriy

On Tue, Apr 20, 2010 at 9:36 AM, Edward Capriolo <edlinuxguru@gmail.com> wrote:
Hello,

We have a file hierarchy we want to be accessible with MR/Hive/Pig. In
this way everyone can pick favorites :)

Currently the layout looks like this.

/user/root/data/datepartition1/subpartition2/{sequence file1, sequence fileN}

I have just installed pig-0.6.0. I am trying to follow the advice here
(http://stackoverflow.com/questions/2423949/storing-data-to-sequencefile-from-apache-pig)

REGISTER /opt/pig-0.6.0/contrib/piggybank/java/piggybank.jar;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
raw = load 'datafile' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);

2010-04-20 12:10:46,821 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2999: Unexpected internal error. org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error

[root@rs01 piggybank]# more /root/pig_1271779744816.log
Pig Stack Trace
---------------
ERROR 2999: Unexpected internal error. org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error

java.lang.ClassCastException: org.apache.pig.impl.logicalLayer.FrontendException cannot be cast to java.lang.Error
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:1440)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:949)
        at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:738)
        at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
        at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1036)
        at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:986)
        at org.apache.pig.PigServer.registerQuery(PigServer.java:386)
        at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:720)
        at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:324)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
        at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
        at org.apache.pig.Main.main(Main.java:352)

So it seems like either I have hit a bug or I have done something wrong.
It looks like a bug, because if Pig can't cast the error correctly,
something is wrong.

Two questions:
1) Can I load all the files in a directory rather than operating on one
file?

raw = load '/datadir/*' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);
Rather than
raw = load '/datafile' USING SequenceFileLoader as (version:chararray, id:int, date:chararray);

2) PigStorage seems to let me specify a tab delimiter. How does one
specify a tab delimiter with SequenceFileLoader? Or does one have to pass
the entire line to some other Pig component to be tokenized?

Thank you,

Dmitriy,

Thank you for your help. My columns are Text,Text and I have confirmed the
Pig loader is working fine on a single file. I was specifying multiple
columns because it seems that PigStorage can split on tabs with an
argument. I was hoping the SequenceFileLoader was the same.

In summary, SequenceFileLoader does work on single files but not globs,
and it requires a further step to tokenize.


Is a SequenceFileLoader that takes globs on the Roadmap? :)

Regards,
Edward

Unrelated question: does Pig have an IRC channel on freenode? #pig seems
to be invite-only.
