Grokbase Groups Pig user July 2011
The expectation from PigStorage.getInputFormat() is that it returns an
InputFormat<Writable, Text>, and PigStorage handles converting Text to
Tuple.
This is very useful and makes it easy for users to plug in some other input format.

But the same is not true for PigStorage.getOutputFormat(). Here it
expects an OutputFormat<Writable, Tuple>, so the output format needs to convert
Tuple to Text.

Not sure if this is intentional or not. I can submit a patch to move Tuple
handling into PigStorage. Then PigTextOutputFormat would be as thin as
PigTextInputFormat.
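
A rough sketch of the proposed arrangement, using simplified stand-in types rather than the real Pig and Hadoop classes: `List<Object>` stands in for Tuple, `String` for Text, and the `RecordSink` interface and all class names are hypothetical. The store function does the tuple-to-text conversion itself, so it can sit in front of any text-oriented sink. Writing an empty string for a null field is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for a record writer: V is the value type it accepts.
interface RecordSink<V> {
    void write(V value);
}

// Stand-in for a plain text sink that knows nothing about tuples.
class TextSink implements RecordSink<String> {
    final List<String> lines = new ArrayList<>();

    @Override
    public void write(String line) {
        lines.add(line);
    }
}

// Proposed shape: the store function converts a tuple to text itself,
// so it can be paired with any RecordSink<String>.
class TupleToTextStore {
    private final RecordSink<String> sink;
    private final String delimiter;

    TupleToTextStore(RecordSink<String> sink, String delimiter) {
        this.sink = sink;
        this.delimiter = delimiter;
    }

    // Plays the role of putNext(): serialize the fields, then hand over text.
    void putNext(List<Object> tuple) {
        String line = tuple.stream()
                .map(field -> field == null ? "" : field.toString())
                .collect(Collectors.joining(delimiter));
        sink.write(line);
    }
}

public class StoreSketch {
    public static void main(String[] args) {
        TextSink sink = new TextSink();
        TupleToTextStore store = new TupleToTextStore(sink, "\t");
        store.putNext(Arrays.<Object>asList("alice", 42, null));
        System.out.println(sink.lines);
    }
}
```

In the current arrangement the conversion inside `putNext` would instead live in the sink, which is why a generic text output format cannot be dropped in as easily.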

  • Daniel Dai at Jul 21, 2011 at 8:22 pm
    I agree the tuple -> text conversion would be better in the StoreFunc. Users
    would have a better chance of reusing an OutputFormat.

    For backward compatibility, the signature of StoreFunc.getOutputFormat
    returns a generic OutputFormat object, so this is fine. However, existing
    StoreFuncs that use PigOutputFormat would need to change. I don't know how
    much impact that would have, but we need to be careful. We would need to make
    a clear announcement and document it as an incompatible change if we do so.

    Daniel
  • Raghu Angadi at Jul 22, 2011 at 7:24 pm
    Attached a patch to https://issues.apache.org/jira/browse/PIG-2187

    The only drawback is the extra copies required to make a Text.

    On Thu, Jul 21, 2011 at 1:21 PM, Daniel Dai wrote:

    > I agree the tuple -> text conversion would be better in the StoreFunc. Users
    > would have a better chance of reusing an OutputFormat.
    >
    > For backward compatibility, the signature of StoreFunc.getOutputFormat
    > returns a generic OutputFormat object, so this is fine. However, existing
    > StoreFuncs that use PigOutputFormat would need to change.

    You mean existing classes that override PigStorage.getOutputFormat() and not
    PigStorage.putNext()?
    Yes, they would be affected, but fixing them is very simple: they just need
    to extend putNext().
    As such there is no contract regarding getOutputFormat() for us to break :)

    Raghu.
  • Daniel Dai at Jul 22, 2011 at 8:29 pm
    I mean StoreFuncs that delegate their output format to PigOutputFormat. Though
    PigOutputFormat is not in the package org.apache.pig, it is the OutputFormat of
    PigStorage, which many users will use as a reference implementation for a
    StoreFunc.

    Daniel
  • Raghu Angadi at Jul 22, 2011 at 9:52 pm

    On Fri, Jul 22, 2011 at 1:29 PM, Daniel Dai wrote:

    > I mean StoreFunc that delegate outputformat to PigOutputFormat.
    >
    > Though PigOutputFormat is not in package org.apache.pig, it is the
    > OutputFormat of PigStorage,

    There is no reference to PigOutputFormat in PigStorage. Did you mean
    PigTextOutputFormat?

    Raghu.

  • Daniel Dai at Jul 22, 2011 at 10:44 pm
    Yes, I am talking about PigTextOutputFormat.
  • Raghu Angadi at Jul 22, 2011 at 11:48 pm
    Thanks guys. Updated PIG-2187 with a new patch.
  • Alan Gates at Jul 22, 2011 at 8:37 pm
    At this point I'm -1 on this. I don't want to break existing output formats or store functions. And I don't see that much value here. You can accomplish the same thing by putting the logic in a static method of PigTextOutputFormat and letting other users use it. Also, the cost of an extra copy of the output is bad. We don't want to slow down storing data.

    Alan.
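
One way to picture the static-helper idea and the cost of the extra copy, as a sketch with stand-in types (`List<Object>` for Tuple, a `byte[]` for Text). `TupleWriterSketch` and its method names are hypothetical and are not the API of PigTextOutputFormat. A helper that writes each field straight to the output stream never materializes an intermediate record, whereas converting the whole tuple first adds one buffer per record.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class TupleWriterSketch {

    // Static helper: streams the fields directly to 'out', with no intermediate buffer.
    public static void writeTuple(List<Object> tuple, byte delimiter, OutputStream out)
            throws IOException {
        for (int i = 0; i < tuple.size(); i++) {
            if (i > 0) {
                out.write(delimiter);
            }
            Object field = tuple.get(i);
            if (field != null) { // assumption: null fields are written as empty
                out.write(field.toString().getBytes(StandardCharsets.UTF_8));
            }
        }
        out.write('\n');
    }

    // Alternative: build the whole record first, then hand it over.
    // The returned array plays the role of the extra copy discussed in the thread.
    public static byte[] toTextBytes(List<Object> tuple, byte delimiter) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        writeTuple(tuple, delimiter, buffer);
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<Object> tuple = Arrays.<Object>asList("alice", 42, null);

        ByteArrayOutputStream direct = new ByteArrayOutputStream();
        writeTuple(tuple, (byte) '\t', direct);

        byte[] viaCopy = toTextBytes(tuple, (byte) '\t');
        // Both paths produce the same bytes; only the amount of copying differs.
        System.out.println(Arrays.equals(direct.toByteArray(), viaCopy));
    }
}
```

Because the helper is static, both an output format and a store function could call it without either depending on the other.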
  • Raghu Angadi at Jul 22, 2011 at 9:57 pm
    Yes, I don't like the extra copies either. That's why I didn't mark the Jira
    'patch available'. A static helper method would also be useful.

    But I don't see how it breaks existing StoreFuncs or output formats. Is
    there an example? There are very few StoreFuncs that extend PigStorage.

    Raghu.
  • Alan Gates at Jul 22, 2011 at 10:11 pm
    "There are very few StoreFuncs that extend PigStorage" that we know of. We don't know how our users are extending it for themselves. And PigStorage is a public interface. Breaking it is a non-starter.

    Alan.
  • Raghu Angadi at Jul 22, 2011 at 10:41 pm
    Makes sense. I will attach an updated patch that moves Tuple serialization to
    StorageUtil.

    Since we expect users to extend PigStorage, I would like to add a
    getFieldDelimiter() method. Otherwise the extender has to parse and
    remember the delimiter.

    Raghu.
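
To illustrate the accessor idea, here is a minimal sketch of a base storage class that parses its delimiter argument once and exposes the result, so a subclass does not have to parse and remember it again. `DelimitedStorage`, `QuotedStorage`, and the parsing rules are hypothetical; only the accessor name is modeled on the proposal above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical base class: parses the delimiter argument once.
class DelimitedStorage {
    private final char fieldDelimiter;

    DelimitedStorage(String delimiterSpec) {
        this.fieldDelimiter = parseDelimiter(delimiterSpec);
    }

    // Accepts a single character or the two-character escapes \t and \n
    // (an assumed, simplified rule set for this sketch).
    private static char parseDelimiter(String spec) {
        if (spec.length() == 1) {
            return spec.charAt(0);
        }
        if (spec.equals("\\t")) {
            return '\t';
        }
        if (spec.equals("\\n")) {
            return '\n';
        }
        throw new IllegalArgumentException("Unsupported delimiter: " + spec);
    }

    // The accessor: subclasses read the parsed value instead of re-parsing.
    public char getFieldDelimiter() {
        return fieldDelimiter;
    }

    public String format(List<Object> tuple) {
        return tuple.stream()
                .map(field -> String.valueOf(field))
                .collect(Collectors.joining(String.valueOf(fieldDelimiter)));
    }
}

// Hypothetical extender: quotes any field that contains the delimiter.
class QuotedStorage extends DelimitedStorage {
    QuotedStorage(String delimiterSpec) {
        super(delimiterSpec);
    }

    @Override
    public String format(List<Object> tuple) {
        String delim = String.valueOf(getFieldDelimiter());
        return tuple.stream()
                .map(field -> String.valueOf(field))
                .map(s -> s.contains(delim) ? "\"" + s + "\"" : s)
                .collect(Collectors.joining(delim));
    }
}

public class DelimiterSketch {
    public static void main(String[] args) {
        QuotedStorage storage = new QuotedStorage(",");
        System.out.println(storage.format(Arrays.<Object>asList("a,b", 7)));
    }
}
```

Without the accessor, `QuotedStorage` would need its own copy of `parseDelimiter` and its own field, which is the duplication the proposal is meant to avoid.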

Discussion Overview
group: user @
categories: pig, hadoop
posted: Jul 21, '11 at 6:12p
active: Jul 22, '11 at 11:48p
posts: 11
users: 4
website: pig.apache.org
