FAQ
Hi,

Had a few queries regarding how null's interact with rest of system
which was not very clear to me (based on my reading of the types func spec).

a) null's are getting treated both as value (a#b) and error markers
(illegal operation/udf error) ? No way to distinguish between both ?
a.1) Related : project of a field which does not exist in a tuple - is
it an error or just null value ? [As opposed to referencing a key in map
which is not present - this is not an error.]

b) How are null fields stored and loaded ?
b.1) In BinStorage, special marker ?
b.2) In PigStorage ?
b.2.1) Is a subsequent load on the stored result result through
PigStorage going to continue to have null value ?

c) Handling of null field vs null record - any difference between both ?
As in, if we have B = foreach A generate myUDF(*);
If myUDF returns error (or null), will that record be added to B ?
What about in the case of B = foreach A generate $0, myUDF(*); ?

d) Arithmatic operations on null's - func spec says it will result in
null if either operand is null : but aggregates are treating it
differently ?
In general, this would imply we have to introduce "is null" checks for
all arithmetic operations ? In that context, will the udf here get
invoked twice ? Any way to optimize it (if it is the case above) :
B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) + MY_CONSTANT;


e) Projects of fields which dont exist/are null's as input to udf - is
the udf invoked at all (with 'null' for the values, or it is not ? What
if the field is a primitive type ?).
Like : B = FOREACH A generate myUDF($0, $1); where $1 is absent in some
tuples in A - and schema expected to be an int (say).



In particular I am curious about b.2.1 above and how it interacts with
rest of the system - from vague recollection, null gets stored as empty
string : which means, either you cant have empty strings in input (since
they get treated as null on load -> and null also implies error) or a
store followed by a load through PigStorage converts an error to valid
value.


Thanks,
Mridul

Search Discussions

  • Olga Natkovich at Feb 9, 2009 at 6:18 pm
    Have you looked into the updated documentation to see if it answers your
    questions:
    http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.h
    tm. It has section on NULL handling.

    Thanks,

    Olga
    -----Original Message-----
    From: Mridul Muralidharan
    Sent: Monday, February 09, 2009 2:43 AM
    To: pig-user@hadoop.apache.org
    Subject: Null's in Pig 2.0

    Hi,

    Had a few queries regarding how null's interact with rest
    of system which was not very clear to me (based on my reading
    of the types func spec).

    a) null's are getting treated both as value (a#b) and error
    markers (illegal operation/udf error) ? No way to distinguish
    between both ?
    a.1) Related : project of a field which does not exist in a
    tuple - is it an error or just null value ? [As opposed to
    referencing a key in map which is not present - this is not an error.]

    b) How are null fields stored and loaded ?
    b.1) In BinStorage, special marker ?
    b.2) In PigStorage ?
    b.2.1) Is a subsequent load on the stored result result
    through PigStorage going to continue to have null value ?

    c) Handling of null field vs null record - any difference
    between both ?
    As in, if we have B = foreach A generate myUDF(*); If myUDF
    returns error (or null), will that record be added to B ?
    What about in the case of B = foreach A generate $0, myUDF(*); ?

    d) Arithmatic operations on null's - func spec says it will
    result in null if either operand is null : but aggregates are
    treating it differently ?
    In general, this would imply we have to introduce "is null"
    checks for all arithmetic operations ? In that context, will
    the udf here get invoked twice ? Any way to optimize it (if
    it is the case above) :
    B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) + MY_CONSTANT;


    e) Projects of fields which dont exist/are null's as input to
    udf - is the udf invoked at all (with 'null' for the values,
    or it is not ? What if the field is a primitive type ?).
    Like : B = FOREACH A generate myUDF($0, $1); where $1 is
    absent in some tuples in A - and schema expected to be an int (say).



    In particular I am curious about b.2.1 above and how it
    interacts with rest of the system - from vague recollection,
    null gets stored as empty string : which means, either you
    cant have empty strings in input (since they get treated as
    null on load -> and null also implies error) or a store
    followed by a load through PigStorage converts an error to
    valid value.


    Thanks,
    Mridul
  • Mridul Muralidharan at Feb 9, 2009 at 7:10 pm
    I was expecting functional specs to be authorative source :-)
    Thanks for the link, will take a look.

    Regards,
    Mridul

    Olga Natkovich wrote:
    Have you looked into the updated documentation to see if it answers your
    questions:
    http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.h
    tm. It has section on NULL handling.

    Thanks,

    Olga
    -----Original Message-----
    From: Mridul Muralidharan
    Sent: Monday, February 09, 2009 2:43 AM
    To: pig-user@hadoop.apache.org
    Subject: Null's in Pig 2.0

    Hi,

    Had a few queries regarding how null's interact with rest
    of system which was not very clear to me (based on my reading
    of the types func spec).

    a) null's are getting treated both as value (a#b) and error
    markers (illegal operation/udf error) ? No way to distinguish
    between both ?
    a.1) Related : project of a field which does not exist in a
    tuple - is it an error or just null value ? [As opposed to
    referencing a key in map which is not present - this is not an error.]

    b) How are null fields stored and loaded ?
    b.1) In BinStorage, special marker ?
    b.2) In PigStorage ?
    b.2.1) Is a subsequent load on the stored result result
    through PigStorage going to continue to have null value ?

    c) Handling of null field vs null record - any difference
    between both ?
    As in, if we have B = foreach A generate myUDF(*); If myUDF
    returns error (or null), will that record be added to B ?
    What about in the case of B = foreach A generate $0, myUDF(*); ?

    d) Arithmatic operations on null's - func spec says it will
    result in null if either operand is null : but aggregates are
    treating it differently ?
    In general, this would imply we have to introduce "is null"
    checks for all arithmetic operations ? In that context, will
    the udf here get invoked twice ? Any way to optimize it (if
    it is the case above) :
    B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) + MY_CONSTANT;


    e) Projects of fields which dont exist/are null's as input to
    udf - is the udf invoked at all (with 'null' for the values,
    or it is not ? What if the field is a primitive type ?).
    Like : B = FOREACH A generate myUDF($0, $1); where $1 is
    absent in some tuples in A - and schema expected to be an int (say).



    In particular I am curious about b.2.1 above and how it
    interacts with rest of the system - from vague recollection,
    null gets stored as empty string : which means, either you
    cant have empty strings in input (since they get treated as
    null on load -> and null also implies error) or a store
    followed by a load through PigStorage converts an error to
    valid value.


    Thanks,
    Mridul
  • Olga Natkovich at Feb 9, 2009 at 7:52 pm
    It might also answer your other questions.

    Olga
    -----Original Message-----
    From: Mridul Muralidharan
    Sent: Monday, February 09, 2009 11:10 AM
    To: pig-user@hadoop.apache.org
    Subject: Re: Null's in Pig 2.0


    I was expecting functional specs to be authorative source :-)
    Thanks for the link, will take a look.

    Regards,
    Mridul

    Olga Natkovich wrote:
    Have you looked into the updated documentation to see if it answers
    your
    questions:
    http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm
    .h
    tm. It has section on NULL handling.

    Thanks,

    Olga
    -----Original Message-----
    From: Mridul Muralidharan
    Sent: Monday, February 09, 2009 2:43 AM
    To: pig-user@hadoop.apache.org
    Subject: Null's in Pig 2.0

    Hi,

    Had a few queries regarding how null's interact with rest of
    system which was not very clear to me (based on my reading of the
    types func spec).

    a) null's are getting treated both as value (a#b) and
    error markers
    (illegal operation/udf error) ? No way to distinguish
    between both ?
    a.1) Related : project of a field which does not exist in
    a tuple -
    is it an error or just null value ? [As opposed to
    referencing a key
    in map which is not present - this is not an error.]

    b) How are null fields stored and loaded ?
    b.1) In BinStorage, special marker ?
    b.2) In PigStorage ?
    b.2.1) Is a subsequent load on the stored result result through
    PigStorage going to continue to have null value ?

    c) Handling of null field vs null record - any difference between
    both ?
    As in, if we have B = foreach A generate myUDF(*); If
    myUDF returns
    error (or null), will that record be added to B ?
    What about in the case of B = foreach A generate $0, myUDF(*); ?

    d) Arithmatic operations on null's - func spec says it
    will result in
    null if either operand is null : but aggregates are treating it
    differently ?
    In general, this would imply we have to introduce "is null"
    checks for all arithmetic operations ? In that context,
    will the udf
    here get invoked twice ? Any way to optimize it (if it is the case
    above) :
    B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) +
    MY_CONSTANT;

    e) Projects of fields which dont exist/are null's as input
    to udf -
    is the udf invoked at all (with 'null' for the values, or
    it is not ?
    What if the field is a primitive type ?).
    Like : B = FOREACH A generate myUDF($0, $1); where $1 is absent in
    some tuples in A - and schema expected to be an int (say).



    In particular I am curious about b.2.1 above and how it interacts
    with rest of the system - from vague recollection, null
    gets stored
    as empty string : which means, either you cant have empty
    strings in
    input (since they get treated as null on load -> and null also
    implies error) or a store followed by a load through PigStorage
    converts an error to valid value.


    Thanks,
    Mridul
  • Mridul Muralidharan at Feb 9, 2009 at 8:09 pm

    Olga Natkovich wrote:
    Have you looked into the updated documentation to see if it answers your
    questions:
    http://wiki.apache.org/pig-data/attachments/FrontPage/attachments/plrm.h
    tm. It has section on NULL handling.

    Thanks,

    Olga
    -----Original Message-----
    From: Mridul Muralidharan
    Sent: Monday, February 09, 2009 2:43 AM
    To: pig-user@hadoop.apache.org
    Subject: Null's in Pig 2.0

    Hi,

    Had a few queries regarding how null's interact with rest
    of system which was not very clear to me (based on my reading
    of the types func spec).

    a) null's are getting treated both as value (a#b) and error
    markers (illegal operation/udf error) ? No way to distinguish
    between both ?
    a.1) Related : project of a field which does not exist in a
    tuple - is it an error or just null value ? [As opposed to
    referencing a key in map which is not present - this is not an error.]
    Strict SQL null clarifies about a.1 - not a though.
    Lack of ability to distiguish between error and null seems slightly
    disconcerting (weak analogy : returning null vs throwing exception in
    java for example - both mean different things).
    b) How are null fields stored and loaded ?
    b.1) In BinStorage, special marker ?
    b.2) In PigStorage ?
    b.2.1) Is a subsequent load on the stored result result
    through PigStorage going to continue to have null value ?

    I think I know what is happening with PigStorage atleast (mentioned at
    end of mail), but not about BinStorage yet - and whether there are
    issues with that similar to PigStorage or not.

    c) Handling of null field vs null record - any difference
    between both ?
    As in, if we have B = foreach A generate myUDF(*); If myUDF
    returns error (or null), will that record be added to B ?
    What about in the case of B = foreach A generate $0, myUDF(*); ?

    d) Arithmatic operations on null's - func spec says it will
    result in null if either operand is null : but aggregates are
    treating it differently ?
    In general, this would imply we have to introduce "is null"
    checks for all arithmetic operations ? In that context, will
    the udf here get invoked twice ? Any way to optimize it (if
    it is the case above) :
    B = foreach A generate myUDF(*) is null ? 0 : myUDF(*) + MY_CONSTANT;

    A strict SQL NULL definition makes the table definition for this right
    (since it is unknown value - as opposed to a empty value, as in java) : (*)
    But rest of pig does not seem to be adhering to it (unless I am missing
    something here).
    For example : JOIN's or GROUP's should have null when an outer join is
    done when key is not present in a table - as opposed to having empty
    bags - right ? The example was not clear for JOIN's from the doc Olga
    mentioned though, but COGROUP example gives empty bag.
    Is the doc out of date, or this is a change in pipeline ? Or there is
    something I am missing here ? TIA


    (*) Would have preferred separation between errors and unknown still ...


    e) Projects of fields which dont exist/are null's as input to
    udf - is the udf invoked at all (with 'null' for the values,
    or it is not ? What if the field is a primitive type ?).
    Like : B = FOREACH A generate myUDF($0, $1); where $1 is
    absent in some tuples in A - and schema expected to be an int (say).



    In particular I am curious about b.2.1 above and how it
    interacts with rest of the system - from vague recollection,
    null gets stored as empty string : which means, either you
    cant have empty strings in input (since they get treated as
    null on load -> and null also implies error) or a store
    followed by a load through PigStorage converts an error to
    valid value.
    This last snippet was added with the assumption that empty strings were
    read as empty strings - and not as nulls (in both load and store).
    I was wrong with this - it seems that empty string are treated as null
    by PigStorage - but this leads to a worse problem (for us atleast) : we
    do have quite a bit of fields which are not always present - the value
    and so, they are expected to have is empty string.

    Other than adding :
    A = load 'file' using PigStorage() as (... );
    B = foreach A generate $0 is null ? '' : $0, ....

    or some such similar dirty hack explicitly, for each load, is there any
    other solution ?
    Or is there some option to PigStorage to override this empty string ->
    NULL behavior (while not causing the problem I mentioned above) ?
    Specifically, is it possible to explicitly specify the marker to be used
    for NULL in PigStorage ? (We allow this in psox for example - assuming
    user knows what 'cant' occur in data and use that for null)



    The rest of the queries were not really answered by the doc - though
    this and pig14to20 are very helpful document to refer to Olga !

    Thanks,
    Mridul

    Thanks,
    Mridul

Related Discussions

Discussion Navigation
viewthread | post
Discussion Overview
groupuser @
categoriespig, hadoop
postedFeb 9, '09 at 10:43a
activeFeb 9, '09 at 8:09p
posts5
users2
websitepig.apache.org

People

Translate

site design / logo © 2022 Grokbase