Alan Gates (JIRA), Aug 4, 2008 at 10:09 pm
[ https://issues.apache.org/jira/browse/PIG-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12619722#action_12619722 ]
Alan Gates commented on PIG-354:
--------------------------------
I don't think we want to be converting data to chararray by default for input to UDFs, for several reasons:
1. It's expensive.
2. It mangles any data that isn't UTF-8.
3. Providing type-specific implementations of their UDFs is a fair amount of work for users, so I suspect most won't.
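To illustrate the second point, here is a minimal sketch of what a forced conversion to a string does to bytes that are not valid UTF-8. The class and method names are hypothetical and are not part of Pig; the sketch only assumes that the conversion decodes the bytes as UTF-8, which is how Java's `String` constructor behaves when given that charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Pig: models an implicit
// bytearray -> chararray -> bytearray round trip.
final class Utf8RoundTrip {
    private Utf8RoundTrip() {}

    // Decode the raw bytes as UTF-8, then encode the string back to bytes.
    // Malformed input is silently replaced with U+FFFD during decoding.
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    // True only if the bytes come back unchanged, i.e. nothing was mangled.
    static boolean survives(byte[] raw) {
        return Arrays.equals(raw, roundTrip(raw));
    }
}
```

Valid UTF-8 text passes through `survives` unchanged, while arbitrary binary data such as the bytes `0xFF 0xFE` does not, because each malformed byte is replaced during decoding.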
By contrast, on the outbound side I agree that chararray is the right default, for two reasons:
1. It's very easy to determine what type the UDF is returning, either from a declared schema or by Pig reflecting on the return type. We would force data to strings only when the UDF gives no schema and its return type is tuple or bag, because then we have no idea what is inside that tuple or bag.
2. In general Pig does not assume any particular representation of data in byte arrays. That's why we make the load function provide casts. So if we took this unknown data from UDFs to be byte arrays, we'd have no idea how to convert it to anything else. Conversions from strings, on the other hand, are well understood.
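The outbound rule described above can be sketched roughly as follows. This is only an illustrative model under stated assumptions: the `UdfOutputDefault` class, its `FieldType` enum, and the `resolve` method are hypothetical names, not Pig's actual API, and the set of reflected Java types is deliberately small.

```java
import java.util.Optional;

// Hypothetical model of the proposed default, not Pig's implementation.
final class UdfOutputDefault {
    enum FieldType { INTEGER, LONG, DOUBLE, CHARARRAY, BYTEARRAY }

    private UdfOutputDefault() {}

    // Pick the type of a value produced by a UDF.
    static FieldType resolve(Optional<FieldType> declared, Class<?> javaReturnType) {
        // 1. A schema declared by the UDF writer always wins.
        if (declared.isPresent()) {
            return declared.get();
        }
        // 2. Otherwise reflect on the Java return type when it is a known scalar.
        if (javaReturnType == Integer.class) return FieldType.INTEGER;
        if (javaReturnType == Long.class) return FieldType.LONG;
        if (javaReturnType == Double.class) return FieldType.DOUBLE;
        if (javaReturnType == String.class) return FieldType.CHARARRAY;
        // 3. Contents of a tuple or bag with no schema are unknown:
        //    default to chararray, since conversions from strings are well understood.
        return FieldType.CHARARRAY;
    }
}
```

In this model, chararray is reached as a fallback only in the third case, which matches the claim that data is forced to strings only when neither a schema nor a reflectable scalar return type is available.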
Change to default outputSchema for UDFs
---------------------------------------
Key: PIG-354
URL: https://issues.apache.org/jira/browse/PIG-354
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Reporter: Olga Natkovich
Priority: Critical
Fix For: types_branch
Currently, if a UDF writer does not specify outputSchema, the default is bytearray, which is not what you would want most of the time. Making chararray the default would make things backward compatible.