read sequence files whose values are block compressed (gzip'd). I'm using Pig
0.4.99.0+10 and Hadoop 0.20.1+152, via Cloudera.
Did the following:
* Copied the SequenceFileLoader class into my own project
* Removed
    public LoadFunc.RequiredFieldResponse
    fieldsToRead(LoadFunc.RequiredFieldList requiredFieldList)
  because LoadFunc.RequiredFieldList isn't resolvable, and added
    public void fieldsToRead(Schema schema)
* Jarred up the .class file
* Programmatically created a trivial sequence file of a few lines, with
IntWritable keys and Text values, using the basic code in an example in
Hadoop: The Definitive Guide
* That file is successfully read and keys/values displayed, with "hadoop fs
-text", as well as with pig, doing the following:
    grunt> register sequencefileloader.jar;
    grunt> r = load '/path/to/sequence_file' using com.foobar.SequenceFileLoader();
    grunt> dump r;
* The sequence file with the compressed values is successfully read with
hadoop fs -text
* When doing the load step in pig with that file, the following results:
--
2010-02-19 16:59:14,489 [main] WARN org.apache.hadoop.util.NativeCodeLoader
- Unable to load native-hadoop library for your platform... using
builtin-java classes where applicable
2010-02-19 16:59:14,490 [main] INFO org.apache.hadoop.io.compress.CodecPool
- Got brand-new decompressor
2010-02-19 16:59:14,498 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1018: Problem determining schema during load
Details at logfile: /path/to/pig_1266616744562.log
--
That log file contains the following:
--
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1000: Error during parsing. Problem determining schema during load
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1037)
    at org.apache.pig.PigServer$Graph.registerQuery(PigServer.java:981)
    at org.apache.pig.PigServer.registerQuery(PigServer.java:383)
    at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:717)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:273)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:166)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:142)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
    at org.apache.pig.Main.main(Main.java:363)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Problem determining schema during load
    at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:734)
    at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:63)
    at org.apache.pig.PigServer$Graph.parseQuery(PigServer.java:1031)
    ... 8 more
Caused by: org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1018: Problem determining schema during load
    at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:155)
    at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:732)
    ... 10 more
Caused by: java.io.EOFException
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:207)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:197)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:136)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:68)
    at org.apache.hadoop.io.compress.GzipCodec$GzipInputStream$ResetableGZIPInputStream.<init>(GzipCodec.java:101)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:169)
    at org.apache.hadoop.io.compress.GzipCodec.createInputStream(GzipCodec.java:179)
    at org.apache.hadoop.io.SequenceFile$Reader.init(SequenceFile.java:1520)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFileLoader.java:140)
    at com.media6.SequenceFileLoader.determineSchema(SequenceFileLoader.java:106)
    at org.apache.pig.impl.logicalLayer.LOLoad.getSchema(LOLoad.java:148)
    ... 11 more
--
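For reference, the innermost frames of that trace (the EOFException raised while GZIPInputStream reads its header in the constructor) can be reproduced with the JDK alone, with no Pig or Hadoop classes involved. What follows is only a minimal sketch: GzipHeaderProbe, probe and gzip are hypothetical names made up for illustration, and the code is not taken from SequenceFileLoader. It shows that the plain-Java GZIPInputStream fails this way when the stream it is given ends before a full gzip header is available. It does not show why the sequence file reader hands the codec such a stream in this case.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipHeaderProbe {

    // Returns "ok" if a GZIPInputStream can be constructed over the bytes,
    // otherwise the class name of the exception thrown by the constructor.
    static String probe(byte[] data) {
        try {
            // The constructor reads and validates the gzip header immediately.
            new GZIPInputStream(new ByteArrayInputStream(data)).close();
            return "ok";
        } catch (IOException e) {
            return e.getClass().getName();
        }
    }

    // Gzips a string so there is a well-formed stream to compare against.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            GZIPOutputStream out = new GZIPOutputStream(buffer);
            out.write(text.getBytes(StandardCharsets.UTF_8));
            out.close();
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("empty input: " + probe(new byte[0]));
        System.out.println("valid gzip:  " + probe(gzip("hello")));
    }
}
```

An empty input yields java.io.EOFException from the constructor, the same exception type as the last "Caused by" above, while a well-formed gzip stream constructs without error.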
Maybe something needs to be added to SequenceFileLoader to handle the
compressed values, which hadoop's "fs -text" already handles.
Thanks for any ideas/pointers.
Derek