Hey folks,

Not sure if this has been discussed already or if this is due to some
limitation in pig, hadoop, or java - but is there a particular reason
the PiggyBank SequenceFileLoader doesn't support the BytesWritable type
for sequence file keys/values?

http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/BytesWritable.html

Looking at the code, it maps the pig-specific DataByteArray class to the
pig type "bytearray" - I don't understand this choice. Why use a
pig-specific class here (which is not very friendly for a mixed
pig/non-pig hadoop ecosystem)?

In fact, if you look at the SequenceFileLoader code you will see
something that looks very strange:

protected Object translateWritableToPigDataType(Writable w, byte dataType) {
  switch (dataType) {
    case DataType.CHARARRAY: return ((Text) w).toString();
    case DataType.BYTEARRAY: return ((DataByteArray) w).get();
    case DataType.INTEGER: return ((IntWritable) w).get();
    case DataType.LONG: return ((LongWritable) w).get();
    case DataType.FLOAT: return ((FloatWritable) w).get();
    case DataType.DOUBLE: return ((DoubleWritable) w).get();
    case DataType.BYTE: return ((ByteWritable) w).get();
  }

  return null;
}

This code smells - the method takes a Writable, which makes sense, but
then for the BYTEARRAY type it casts it to a DataByteArray, which
doesn't implement Writable! WTF, mate?

I'm going to try my hand at switching this to use BytesWritable instead
and see what explodes.
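
Roughly what I have in mind for that case (untested sketch - it assumes the
reader actually hands back a BytesWritable for bytearray fields, and the
class/helper names are made up):

import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Writable;
import org.apache.pig.data.DataByteArray;

// Hypothetical replacement for the BYTEARRAY branch above.
// BytesWritable.getBytes() returns the (possibly padded) backing buffer,
// so only the first getLength() bytes are copied into the DataByteArray.
public class BytesWritableTranslation {
    static DataByteArray toPigByteArray(Writable w) {
        BytesWritable bw = (BytesWritable) w;
        return new DataByteArray(bw.getBytes(), 0, bw.getLength());
    }
}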

Cheers,
-Zach


  • Dmitriy Ryaboy at Sep 27, 2010 at 8:49 pm
    Zach,
    Perhaps I should've documented that better.
    That class is *not intended for real use*. As far as I know, it's never been
    used by anyone for anything in production.
    It's a demo of how one would go about writing a real SequenceFileLoader for
    whatever internal stuff you are using. Feel free to replace anything that
    makes sense for you in your implementation.

    -D
  • Zach Bailey at Sep 27, 2010 at 11:00 pm
    Oh, gosh, well that makes me uneasy, since I was intending to use this
    in production.

    Is there something in particular about this class that makes it not
    intended for real-world use? Performance? The way it's written (i.e.
    still depends on old APIs, etc.)?

    Is there a loader you suggest I look at using instead that has been more
    battle-tested?

    -Zach

  • Dmitriy Ryaboy at Sep 28, 2010 at 12:11 am
    I think the DataByteArray cast you pointed out is likely a real bug; I only
    tested it on an incomplete subset of primitive types (just longs or something).

    The main thing that keeps me from recommending it as-is is that if you are
    reading a SequenceFile whose composition you already know, you should work
    with it directly (manually translating into Pig tuples) - see the sketch
    below. IIRC, the loader as written assumes that the key and value are both
    primitives -- what if your value is a Pair of primitives? The
    SequenceFileLoader won't support it. So, "some assembly required."

    As far as making sure it's working correctly -- you are going to want to
    test that it handles records that span split edges correctly, not reading
    them twice or dropping them. Actually this is probably taken care of by the
    new API, so you should be fine if you're on 0.7+. Might be buggy in 0.6.

    -D
