I think the databyte thing you pointed out is likely a real bug; I only tested
it on an incomplete subset of primitive types (just longs or something).
The main thing that keeps me from recommending it as-is is that if you are
reading a SequenceFile whose composition you already know, you should work
with it directly (manually translating into Pig tuples). IIRC, the loader as
written assumes that the key and value are both primitives -- what if your
value is a Pair of primitives? The sequence file loader won't support it.
So, "some assembly required."
As far as making sure it's working correctly -- you are going to want to
test that it handles records that span split edges correctly, not reading
them twice or dropping them. Actually this is probably taken care of by the
new API, so you should be fine if you're on 0.7+. Might be buggy in 0.6.
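The split-edge check above boils down to one invariant: across all splits, every record is read exactly once. Below is a minimal sketch of that invariant in plain Java, under a simplified model in which a split owns every record that starts inside it. The class and method names (SplitBoundarySketch, recordsForSplit, readCounts) and the offsets are hypothetical, and this is not how SequenceFile sync markers actually work. A real test would run the loader itself over small splits and compare the output against the known input.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitBoundarySketch {

    // Simplified model (an assumption, not the SequenceFile format): a record
    // is identified by its starting byte offset, and a split [start, end)
    // owns exactly the records that start inside it, even if their bytes
    // run past 'end'.
    static List<Long> recordsForSplit(long[] recordStarts, long start, long end) {
        List<Long> owned = new ArrayList<>();
        for (long offset : recordStarts) {
            if (offset >= start && offset < end) {
                owned.add(offset);
            }
        }
        return owned;
    }

    // Walks every split in turn and counts how often each record was seen.
    static int[] readCounts(long[] recordStarts, long fileLength, long splitSize) {
        int[] counts = new int[recordStarts.length];
        for (long start = 0; start < fileLength; start += splitSize) {
            long end = Math.min(start + splitSize, fileLength);
            for (long offset : recordsForSplit(recordStarts, start, end)) {
                for (int i = 0; i < recordStarts.length; i++) {
                    if (recordStarts[i] == offset) {
                        counts[i]++;
                    }
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical layout: the records starting at 95 and 199 are assumed
        // to extend past the split edges at 100 and 200.
        long[] starts = {0, 40, 95, 130, 199};
        int[] counts = readCounts(starts, 220, 100);
        for (int c : counts) {
            if (c != 1) {
                throw new AssertionError("record read " + c + " times");
            }
        }
        System.out.println("every record read exactly once");
    }
}
```

A count of 0 would mean a dropped record and a count of 2 a duplicated one, which are the two failure modes to look for.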
On Mon, Sep 27, 2010 at 3:59 PM, Zach Bailey wrote:
Oh, gosh, well that makes me uneasy, since I was intending to really use
this, in production.
Is there something in particular about this class that makes it not
intended for real-world use? Performance? The way it's written (i.e. still
depends on old APIs, etc.)?
Is there a loader you suggest I look at using instead that has been more
Dmitriy Ryaboy wrote:
Perhaps I should've documented that better.
That class is *not intended for real use*. As far as I know, it's never
used by anyone for anything in production.
It's a demo of how one would go about writing a real SequenceFileLoader for
whatever internal stuff you are using. Feel free to replace anything that
makes sense for you in your implementation.
On Mon, Sep 27, 2010 at 1:23 PM, Zach Bailey wrote:
Not sure if this has been discussed already or if this is due to some
limitation in pig, hadoop, or java - but is there a particular reason the
PiggyBank SequenceFileLoader doesn't support the BytesWritable type for
sequence file keys/values?
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/BytesWritable.html
Looking at the code, it maps the pig-specific DataByteArray class to the
pig type "bytearray" - I don't understand this choice. Why use a
pig-specific class here (which is not very friendly for a mixed
In fact, if you look at the SequenceFileLoader code you will see something
that looks very strange:
protected Object translateWritableToPigDataType(*Writable w*, byte
case DataType.CHARARRAY: return ((Text) w).toString();
* case DataType.BYTEARRAY: return((DataByteArray) w).get();*
case DataType.INTEGER: return ((IntWritable) w).get();
case DataType.LONG: return ((LongWritable) w).get();
case DataType.FLOAT: return ((FloatWritable) w).get();
case DataType.DOUBLE: return ((DoubleWritable) w).get();
case DataType.BYTE: return ((ByteWritable) w).get();
This code smells - the method takes a Writable - which makes sense, but
then for the BYTEARRAY type it's casting it to a DataByteArray, which
doesn't implement Writable! WTF, mate?
I'm going to try my hand at switching this to use BytesWritable instead and
see what explodes.
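
One thing to watch for when making that switch: BytesWritable.getBytes() returns the backing buffer, which can be longer than the valid data, so only the first getLength() bytes should be copied out. Below is a minimal sketch of that copy. FakeBytesWritable is a hypothetical stand-in written so the example compiles without Hadoop on the classpath; it mimics only the getBytes()/getLength() contract, and its growth policy is made up. In a real loader the resulting array would presumably be wrapped in a Pig DataByteArray.

```java
import java.util.Arrays;

public class BytesWritableSketch {

    // Hypothetical stand-in for org.apache.hadoop.io.BytesWritable. It models
    // only one property: the backing array may be longer than the valid data
    // and may hold stale bytes from an earlier record.
    static final class FakeBytesWritable {
        private byte[] buffer = new byte[0];
        private int length = 0;

        void set(byte[] src, int offset, int len) {
            if (len > buffer.length) {
                // Grow with some slack (illustrative policy only).
                buffer = new byte[len * 3 / 2 + 1];
            }
            System.arraycopy(src, offset, buffer, 0, len);
            length = len;
        }

        byte[] getBytes() { return buffer; }
        int getLength() { return length; }
    }

    // Copies only the valid prefix of the backing buffer.
    static byte[] validBytes(FakeBytesWritable w) {
        return Arrays.copyOf(w.getBytes(), w.getLength());
    }

    public static void main(String[] args) {
        FakeBytesWritable w = new FakeBytesWritable();
        w.set(new byte[] {1, 2, 3, 4, 5}, 0, 5);
        // Reusing the object leaves stale bytes after the first two positions.
        w.set(new byte[] {9, 8}, 0, 2);
        System.out.println(Arrays.toString(validBytes(w))); // prints [9, 8]
    }
}
```

Returning getBytes() directly here would hand back the whole eight-byte buffer, stale bytes included, rather than the two valid ones.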