Not sure if this has been discussed already, or if it's due to some
limitation in Pig, Hadoop, or Java, but is there a particular reason
the PiggyBank SequenceFileLoader doesn't support the BytesWritable type
for sequence file keys/values?
Looking at the code, it maps the pig-specific DataByteArray class to the
pig type "bytearray" - I don't understand this choice. Why use a
pig-specific class here (which is not very friendly for a mixed
pig/non-pig hadoop ecosystem)?
In fact, if you look at the SequenceFileLoader code you will see
something that looks very strange:
    protected Object translateWritableToPigDataType(Writable w, byte
        case DataType.CHARARRAY: return ((Text) w).toString();
        case DataType.BYTEARRAY: return((DataByteArray) w).get();
        case DataType.INTEGER: return ((IntWritable) w).get();
        case DataType.LONG: return ((LongWritable) w).get();
        case DataType.FLOAT: return ((FloatWritable) w).get();
        case DataType.DOUBLE: return ((DoubleWritable) w).get();
        case DataType.BYTE: return ((ByteWritable) w).get();
This code smells - the method takes a Writable, which makes sense, but
then for the BYTEARRAY type it casts it to a DataByteArray, which
doesn't implement Writable! WTF, mate?
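To illustrate why a cast like this can compile and still never work, here is a minimal sketch. The types FakeWritable, FakeBytesWritable and FakeDataByteArray are hypothetical stand-ins of mine (not the real Hadoop or Pig classes), so it runs without either library on the classpath. It assumes, as in the real code, that the target class is not final and does not implement the interface.

```java
class CastDemo {
    // Mirrors the shape of the BYTEARRAY branch quoted above.
    static Object translateLikeLoader(FakeWritable w) {
        // This compiles: Java permits a cast from an interface type to any
        // non-final class, because some subclass could implement the interface.
        return ((FakeDataByteArray) w).get();
    }

    // Returns true if the cast blows up at runtime for the given value.
    static boolean failsAtRuntime(FakeWritable w) {
        try {
            translateLikeLoader(w);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        FakeWritable w = new FakeBytesWritable(new byte[] {1, 2, 3});
        System.out.println("cast fails at runtime: " + failsAtRuntime(w));
    }
}

// Hypothetical stand-in for Hadoop's Writable interface.
interface FakeWritable {}

// Hypothetical stand-in for a real Writable holding bytes.
class FakeBytesWritable implements FakeWritable {
    private final byte[] bytes;
    FakeBytesWritable(byte[] bytes) { this.bytes = bytes; }
    byte[] get() { return bytes; }
}

// Hypothetical stand-in for DataByteArray: note it does NOT implement FakeWritable.
class FakeDataByteArray {
    byte[] get() { return new byte[0]; }
}
```

So unless some subclass of the target class happens to implement the interface, any value that reaches that branch ends in a ClassCastException.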
I'm going to try my hand at switching this to use BytesWritable instead
and see what explodes.
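One thing I expect to watch for when making that switch: BytesWritable.getBytes() returns the backing buffer, which can be longer than getLength(), so only the valid prefix should be copied out. Below is a rough sketch of that copy. PaddedBytes is a hypothetical stand-in that mimics just those two accessors so the example runs without Hadoop. In the real loader the result would presumably be wrapped in Pig's DataByteArray, but that part is my assumption and is not shown here.

```java
import java.util.Arrays;

class BytesTranslationSketch {
    // Hypothetical stand-in mimicking BytesWritable's getBytes()/getLength().
    static class PaddedBytes {
        private final byte[] buffer; // backing array, may have unused capacity
        private final int length;    // number of valid bytes in the buffer

        PaddedBytes(byte[] buffer, int length) {
            this.buffer = buffer;
            this.length = length;
        }

        byte[] getBytes() { return buffer; }
        int getLength() { return length; }
    }

    // Copy only the valid bytes, ignoring any padding in the backing array.
    static byte[] toByteArray(PaddedBytes w) {
        return Arrays.copyOf(w.getBytes(), w.getLength());
    }

    public static void main(String[] args) {
        // Five-byte buffer, but only the first three bytes are real data.
        PaddedBytes w = new PaddedBytes(new byte[] {10, 20, 30, 0, 0}, 3);
        System.out.println(Arrays.toString(toByteArray(w))); // prints [10, 20, 30]
    }
}
```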