Grokbase Groups Pig dev May 2011
Disclaimer: I'm still learning my way around the Pig and Hadoop internals,
so this question is aimed at better understanding them and some of the Pig
design choices...

Is there a reason why in Pig we are restricted to a fixed set of types
(roughly corresponding to types in Java), instead of having an abstract type
like Hadoop's Writable or WritableComparable? I got to thinking about this
while thinking about the Algebraic interface... in Hadoop, if you want to
have some crazy intermediate objects, you can do that easily as long as they
are serializable (i.e. Writable, and WritableComparable if they are going to
the reducer in the shuffle). In fact, in Hadoop there is no notion of a
special class of objects we work with -- everything is simply Writable or
WritableComparable. Pig is more limited, and I was wondering why that needs
to be the case. Is there any reason why we can't have abstract types at the
same level as String or Integer? My guess is that it has to do with how
these objects are treated internally, but beyond that I'm not sure.

Thanks for helping me think about this
Jon

  • Daniel Dai at Jun 1, 2011 at 10:45 pm
    bytearray can be any data type, including a WritableComparable. Just
    declare your EvalFunc to return bytearray and, at runtime, feed the
    WritableComparable to Pig. Pig should be able to deal with it.

    Daniel
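Daniel's suggestion can be sketched roughly as follows. This is a minimal, self-contained illustration in plain java.io rather than against the real Pig/Hadoop jars: a class following the Writable-style write/readFields contract is flattened to the byte[] that a UDF declared to return bytearray would hand back (in Pig, the bytearray type is backed by DataByteArray). The PairWritable class and its fields are hypothetical stand-ins for whatever "crazy intermediate object" the UDF carries.

```java
import java.io.*;

// Hypothetical intermediate object following the Writable-style
// write/readFields contract (no hadoop jars assumed here).
class PairWritable {
    long count;
    double sum;

    void write(DataOutput out) throws IOException {
        out.writeLong(count);
        out.writeDouble(sum);
    }

    void readFields(DataInput in) throws IOException {
        count = in.readLong();
        sum = in.readDouble();
    }

    // Flatten to the byte[] a UDF declared to return bytearray
    // would wrap (e.g. in a DataByteArray) and return from exec().
    byte[] toBytes() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        write(new DataOutputStream(bos));
        return bos.toByteArray();
    }

    // Reconstruct the object on the other side of the pipeline.
    static PairWritable fromBytes(byte[] b) throws IOException {
        PairWritable p = new PairWritable();
        p.readFields(new DataInputStream(new ByteArrayInputStream(b)));
        return p;
    }
}

public class BytearrayRoundTrip {
    public static void main(String[] args) throws IOException {
        PairWritable p = new PairWritable();
        p.count = 3;
        p.sum = 4.5;
        byte[] wire = p.toBytes();   // what Pig sees: opaque bytes
        PairWritable back = PairWritable.fromBytes(wire);
        System.out.println(back.count + "," + back.sum);  // prints 3,4.5
    }
}
```

The point is that Pig never needs to understand the payload: as far as the pipeline is concerned it is just an opaque bytearray, and only the UDFs at either end know how to decode it.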
  • Mridul Muralidharan at Jun 2, 2011 at 8:24 pm
    I'm not sure about the rationale behind bytearray in Pig (other than
    the early need to decouple from the Hadoop dependency: Pig was
    expected to run against any backend - Hadoop, Dryad, local, etc.) -
    but the direct impact of it is the need to serialize and deserialize
    between internal objects and byte[] and vice versa ...

    As a trivial example, at times this has been a major drain on
    performance - a Writable-based impl could have done smart things like
    deserialize once and keep using the internal object until needing to
    serialize at the 'end of the pipeline' (either to store, or to
    serialize from map to reduce), while the current byte[]-based impls
    have to serialize and deserialize at each udf input/output.

    The philosophy of Pig has definitely changed since the initial
    versions, with the Hadoop interfaces leaking out to Pig udf's ... but
    it is probably too late to make these changes now :-)
    (And if it does support Writable, then having byte[] is just silly
    overhead.)


    Regards,
    Mridul


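The round-trip cost Mridul describes can be sketched with a toy pipeline (plain java.io, all names hypothetical, not the real Pig classes): when the value crosses each UDF boundary as an opaque byte[] it must be decoded and re-encoded per stage, whereas a Writable-style pipeline could deserialize once up front and serialize only at the end of the pipeline.

```java
import java.io.*;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model of the overhead described above: a 3-stage UDF pipeline
// where the value crosses each boundary as opaque byte[] (Pig's
// bytearray), versus one where the live object is passed along.
public class SerializationCost {
    static final AtomicInteger deserializations = new AtomicInteger();

    static byte[] serialize(long v) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeLong(v);
        return bos.toByteArray();
    }

    static long deserialize(byte[] b) throws IOException {
        deserializations.incrementAndGet();  // count the decode cost
        return new DataInputStream(new ByteArrayInputStream(b)).readLong();
    }

    // Each "UDF" must decode its bytearray input and re-encode its output.
    static byte[] udfViaBytes(byte[] in) throws IOException {
        return serialize(deserialize(in) + 1);
    }

    public static void main(String[] args) throws IOException {
        byte[] v = serialize(0);
        for (int stage = 0; stage < 3; stage++)
            v = udfViaBytes(v);          // decode + encode at every stage
        System.out.println("bytearray pipeline: "
                + deserializations.get() + " deserializations");  // 3

        // A Writable-style pipeline would deserialize once and serialize
        // again only at the 'end of the pipeline' (store, or map->reduce).
        deserializations.set(0);
        long obj = deserialize(serialize(0));
        for (int stage = 0; stage < 3; stage++)
            obj = obj + 1;               // work on the live object
        System.out.println("writable-style: "
                + deserializations.get() + " deserialization");   // 1
    }
}
```

With real tuples instead of a single long, the per-stage decode/encode is far more expensive, which is the "major drain on performance" the reply refers to.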

Discussion Overview
group: dev @
categories: pig, hadoop
posted: May 31, '11 at 5:30p
active: Jun 2, '11 at 8:24p
posts: 3
users: 3
website: pig.apache.org
