Olga Natkovich wrote:
Have you looked into the updated documentation to see if it answers your
tm. It has section on NULL handling.
From: Mridul Muralidharan
Sent: Monday, February 09, 2009 2:43 AM
Subject: Null's in Pig 2.0
Had a few queries regarding how null's interact with rest
of system which was not very clear to me (based on my reading
of the types func spec).
a) null's are getting treated both as value (a#b) and error
markers (illegal operation/udf error) ? No way to distinguish
between both ?
a.1) Related : project of a field which does not exist in a
tuple - is it an error or just null value ? [As opposed to
referencing a key in map which is not present - this is not an error.]
Strict SQL null clarifies about a.1 - not a though.
Lack of ability to distiguish between error and null seems slightly
disconcerting (weak analogy : returning null vs throwing exception in
java for example - both mean different things).
b) How are null fields stored and loaded ?
b.1) In BinStorage, special marker ?
b.2) In PigStorage ?
b.2.1) Is a subsequent load on the stored result result
through PigStorage going to continue to have null value ?
I think I know what is happening with PigStorage atleast (mentioned at
end of mail), but not about BinStorage yet - and whether there are
issues with that similar to PigStorage or not.
c) Handling of null field vs null record - any difference
between both ?
As in, if we have B = foreach A generate myUDF(*); If myUDF
returns error (or null), will that record be added to B ?
What about in the case of B = foreach A generate $0, myUDF(*); ?
d) Arithmatic operations on null's - func spec says it will
result in null if either operand is null : but aggregates are
treating it differently ?
In general, this would imply we have to introduce "is null"
checks for all arithmetic operations ? In that context, will
the udf here get invoked twice ? Any way to optimize it (if
it is the case above) :
B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) + MY_CONSTANT;
A strict SQL NULL definition makes the table definition for this right
(since it is unknown value - as opposed to a empty value, as in java) : (*)
But rest of pig does not seem to be adhering to it (unless I am missing
For example : JOIN's or GROUP's should have null when an outer join is
done when key is not present in a table - as opposed to having empty
bags - right ? The example was not clear for JOIN's from the doc Olga
mentioned though, but COGROUP example gives empty bag.
Is the doc out of date, or this is a change in pipeline ? Or there is
something I am missing here ? TIA
(*) Would have preferred separation between errors and unknown still ...
e) Projects of fields which dont exist/are null's as input to
udf - is the udf invoked at all (with 'null' for the values,
or it is not ? What if the field is a primitive type ?).
Like : B = FOREACH A generate myUDF($0, $1); where $1 is
absent in some tuples in A - and schema expected to be an int (say).
In particular I am curious about b.2.1 above and how it
interacts with rest of the system - from vague recollection,
null gets stored as empty string : which means, either you
cant have empty strings in input (since they get treated as
null on load -> and null also implies error) or a store
followed by a load through PigStorage converts an error to
This last snippet was added with the assumption that empty strings were
read as empty strings - and not as nulls (in both load and store).
I was wrong with this - it seems that empty string are treated as null
by PigStorage - but this leads to a worse problem (for us atleast) : we
do have quite a bit of fields which are not always present - the value
and so, they are expected to have is empty string.
Other than adding :
A = load 'file' using PigStorage() as (... );
B = foreach A generate $0 is null ? '' : $0, ....
or some such similar dirty hack explicitly, for each load, is there any
other solution ?
Or is there some option to PigStorage to override this empty string ->
NULL behavior (while not causing the problem I mentioned above) ?
Specifically, is it possible to explicitly specify the marker to be used
for NULL in PigStorage ? (We allow this in psox for example - assuming
user knows what 'cant' occur in data and use that for null)
The rest of the queries were not really answered by the doc - though
this and pig14to20 are very helpful document to refer to Olga !