Change to default outputSchema for UDFs
---------------------------------------

Key: PIG-354
URL: https://issues.apache.org/jira/browse/PIG-354
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Priority: Critical
Fix For: types_branch


Currently, if a UDF writer does not specify outputSchema, the default is bytearray, which is not what you would want most of the time. Making chararray the default would make things backward compatible.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


  • Olga Natkovich (JIRA) at Aug 3, 2008 at 10:58 pm
    [ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619386#action_12619386 ]

    Olga Natkovich commented on PIG-354:
    ------------------------------------

    I think the same should be true for the default input type as well.
  • Alan Gates (JIRA) at Aug 4, 2008 at 10:09 pm
    [ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619722#action_12619722 ]

    Alan Gates commented on PIG-354:
    --------------------------------

    I don't think we want to be converting data to chararray by default for input to UDFs, for several reasons:

    1. It's expensive.
    2. It mangles any data that isn't UTF-8.
    3. It is a fair amount of work for users to provide type-specific implementations of their UDFs, so I suspect most won't.

    By contrast, on the outbound side I agree that chararray is the right default, for two reasons:

    1. It's very easy to determine what type the UDF is returning, either from a declared schema or by Pig reflecting the return type. Only in the case where they do not give a schema and their return type is tuple or bag (thus we have no idea what is inside that tuple or bag) will we be forcing data to strings.

    2. In general, Pig does not assume any particular representation of data in byte arrays. That's why we make the load function provide casts. So if we took this unknown data from UDFs to be byte arrays, we'd have no idea how to convert it to anything else. Conversions from strings, on the other hand, are well understood.
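
As an illustration of reason 2 in the first list above, here is a minimal sketch in plain Java (no Pig classes) of how decoding arbitrary bytes as UTF-8 and encoding them back can alter the data. The class name Utf8RoundTrip and the sample bytes are made up for this example. It relies on the documented behavior of the String(byte[], Charset) constructor, which replaces malformed input with the charset's replacement string.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8RoundTrip {
    /** Decodes raw bytes as UTF-8 text, then encodes the text back to bytes. */
    public static byte[] roundTrip(byte[] raw) {
        // Malformed sequences are silently replaced with U+FFFD while decoding.
        String text = new String(raw, StandardCharsets.UTF_8);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] ascii = {'p', 'i', 'g'};
        // 0xFF and 0xFE can never appear in well-formed UTF-8.
        byte[] binary = {(byte) 0xFF, (byte) 0xFE, 0x00, 0x41};

        System.out.println(Arrays.equals(ascii, roundTrip(ascii)));   // true
        System.out.println(Arrays.equals(binary, roundTrip(binary))); // false
    }
}
```

Well-formed text survives the round trip, but the binary input comes back longer than it went in, because each invalid byte has become a three-byte replacement character.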
  • Alan Gates (JIRA) at Aug 7, 2008 at 3:40 pm
    [ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620654#action_12620654 ]

    Alan Gates commented on PIG-354:
    --------------------------------

    Here's how output type determination works now: the type checker calls outputSchema on the UDF. If it gets a schema, it goes with that. If not, it uses reflection to determine the return type of the UDF and then maps that to a data type. The only case where we can't really tell what the UDF is returning is if it doesn't declare a schema and its return type is Tuple or Bag. In that case we have no way to guess what's inside. But all through the code we treat tuples and bags with unknown contents as containing byte arrays. If we try to do otherwise just for UDFs, it will be difficult (we'll end up tracking lineage), so I don't want to change that.
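
The lookup order described above can be sketched roughly as follows. This is not Pig's type checker: SketchUdf, SketchTuple, the sample UDF classes, and the type-name strings returned by resolveType are all hypothetical stand-ins invented for this example, and only String and Integer return types are mapped.

```java
import java.lang.reflect.Method;

public class OutputTypeSketch {
    /** Hypothetical stand-in for a UDF base class; this is not Pig's real API. */
    public abstract static class SketchUdf {
        /** Declared schema, or null when the UDF writer does not provide one. */
        public String outputSchema() { return null; }
    }

    /** Hypothetical tuple type; reflection cannot see the types of its fields. */
    public static class SketchTuple {
        public final Object[] fields;
        public SketchTuple(Object... fields) { this.fields = fields; }
    }

    public static class Upper extends SketchUdf {
        public String exec(String in) { return in.toUpperCase(); }
    }

    public static class Length extends SketchUdf {
        public Integer exec(String in) { return in.length(); }
    }

    public static class Split extends SketchUdf {
        public SketchTuple exec(String in) { return new SketchTuple((Object[]) in.split(",")); }
    }

    public static class DeclaredSplit extends Split {
        @Override public String outputSchema() { return "(a: chararray, b: chararray)"; }
    }

    public static String resolveType(SketchUdf udf) {
        // Step 1: a declared schema always wins.
        String declared = udf.outputSchema();
        if (declared != null) return declared;

        // Step 2: otherwise reflect on the return type of exec and map it.
        Class<?> ret;
        try {
            Method exec = udf.getClass().getMethod("exec", String.class);
            ret = exec.getReturnType();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("UDF has no exec(String) method", e);
        }
        if (ret == String.class) return "chararray";
        if (ret == Integer.class) return "int";

        // Step 3: a tuple with no schema has unknown contents, treated as byte arrays.
        if (SketchTuple.class.isAssignableFrom(ret)) return "tuple(bytearray fields)";
        return "bytearray";
    }

    public static void main(String[] args) {
        System.out.println(resolveType(new Upper()));
        System.out.println(resolveType(new Length()));
        System.out.println(resolveType(new Split()));
        System.out.println(resolveType(new DeclaredSplit()));
    }
}
```

Only the Split case falls through to the byte-array assumption. Declaring a schema, as DeclaredSplit does, is the way out of it in this sketch.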

    The one other area we could change is POUserFunc. Currently, when it has a type of bytearray, it checks whether the object passed back is really a bytearray or not. If not, it calls toString().getBytes() on it and constructs a DataByteArray from that. We could do the same check when the type is chararray. Not sure how useful this would be.
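
A rough sketch of that check, and of the chararray analogue being floated, might look like the following. SketchByteArray is a hypothetical stand-in for Pig's DataByteArray, the method names are invented for this example, and the explicit UTF-8 charset is an assumption; none of this is the actual POUserFunc code.

```java
import java.nio.charset.StandardCharsets;

public class ByteArrayCoercionSketch {
    /** Hypothetical stand-in for Pig's DataByteArray; it only wraps a byte[]. */
    public static final class SketchByteArray {
        private final byte[] data;
        public SketchByteArray(byte[] data) { this.data = data.clone(); }
        public byte[] get() { return data.clone(); }
    }

    /** Existing behavior as described: pass byte arrays through, stringify anything else. */
    public static SketchByteArray toByteArray(Object result) {
        if (result == null) return null;
        if (result instanceof SketchByteArray) return (SketchByteArray) result;
        // Assumption: UTF-8 is used here so the result does not depend on the platform default.
        return new SketchByteArray(result.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** The proposed analogue for chararray: pass strings through, stringify anything else. */
    public static String toChararray(Object result) {
        if (result == null) return null;
        if (result instanceof String) return (String) result;
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toByteArray(42).get().length); // 2, the bytes of "42"
        System.out.println(toChararray(42));              // 42
    }
}
```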
  • Olga Natkovich (JIRA) at Aug 7, 2008 at 4:51 pm
    [ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12620669#action_12620669 ]

    Olga Natkovich commented on PIG-354:
    ------------------------------------

    The first part sounds reasonable.

    What would be the use case for the second one?
  • Olga Natkovich (JIRA) at Aug 7, 2008 at 5:16 pm
    [ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

    Olga Natkovich resolved PIG-354.
    --------------------------------

    Resolution: Fixed

    Looks like the right thing is already happening.

Discussion Overview
group: dev @
categories: pig, hadoop
posted: Aug 1, '08 at 9:52p
active: Aug 7, '08 at 5:16p
posts: 6
users: 1
website: pig.apache.org
