Grokbase Groups HBase user April 2013
Heya,

Thinking about data types and serialization. I think null support is an
important characteristic for the serialized representations, especially
when considering the compound type. However, supporting it is directly
incompatible with fixed-width representations for numerics. For instance,
if we want a fixed-width signed long stored in 8 bytes, where do
you put null? The float and double types can cheat a little by folding negative
and positive NaNs into a single representation (this isn't strictly
correct!), leaving a place to represent null. In the long case, the
obvious choice is to reduce MAX_VALUE or increase MIN_VALUE by one. This
frees up one encoding, which can then be used for null. My
experience working with scientific data, however, makes me wince at the
idea.
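
To make the fixed-width trade-off concrete, here is a minimal sketch of an 8-byte long encoding that gives up Long.MIN_VALUE so that null has somewhere to live. The class name ReservedNullLongCodec and its methods are hypothetical, not part of HBase or any library mentioned in this thread. It assumes big-endian bytes with the sign bit flipped, so that unsigned byte order matches signed numeric order.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: a fixed-width, order-preserving long that reserves one value for null. */
final class ReservedNullLongCodec {
  private ReservedNullLongCodec() {}

  /** Encodes a nullable Long into exactly 8 bytes. */
  static byte[] encode(Long value) {
    if (value == null) {
      // All zeros is the slot that Long.MIN_VALUE would otherwise occupy.
      return new byte[8];
    }
    if (value == Long.MIN_VALUE) {
      throw new IllegalArgumentException("Long.MIN_VALUE is reserved to represent null");
    }
    // Flipping the sign bit makes unsigned big-endian byte order match signed order.
    return ByteBuffer.allocate(8).putLong(value ^ Long.MIN_VALUE).array();
  }

  /** Decodes 8 bytes produced by encode; the all-zero pattern decodes to null. */
  static Long decode(byte[] bytes) {
    long raw = ByteBuffer.wrap(bytes).getLong();
    return raw == 0L ? null : Long.valueOf(raw ^ Long.MIN_VALUE);
  }
}
```

The cost is exactly the one that prompts the wince above: one legitimate value can no longer be stored.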

The variable-width encodings have it a little easier. There's already
enough going on that it's simpler to make room.

Remember, the final goal is to support order-preserving serialization. This
imposes some limitations on our encoding strategies. For instance, it's not
enough to simply encode null; it really needs to be encoded as 0x00 so that
it sorts lexicographically earlier than any other value.
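
For the floating-point case, the NaN-folding trick and the "null sorts first" requirement can be sketched together. The class NullableDoubleCodec is hypothetical. It relies on Double.doubleToLongBits collapsing every NaN into the single canonical positive NaN, which leaves the encodings that negative NaNs would have used free, including the all-zero pattern.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch: order-preserving 8-byte double with null encoded as all zeros. */
final class NullableDoubleCodec {
  private NullableDoubleCodec() {}

  static byte[] encode(Double value) {
    if (value == null) {
      return new byte[8]; // sorts before every non-null encoding
    }
    // doubleToLongBits folds all NaNs into one canonical positive NaN.
    long bits = Double.doubleToLongBits(value);
    // Negative numbers: flip every bit. Non-negative numbers: flip only the sign bit.
    long sortable = bits < 0 ? ~bits : bits ^ Long.MIN_VALUE;
    return ByteBuffer.allocate(8).putLong(sortable).array();
  }

  static Double decode(byte[] bytes) {
    long sortable = ByteBuffer.wrap(bytes).getLong();
    if (sortable == 0L) {
      return null;
    }
    // A set top bit means the original value was non-negative.
    long bits = sortable < 0 ? sortable ^ Long.MIN_VALUE : ~sortable;
    return Double.longBitsToDouble(bits);
  }
}
```

With this layout the smallest non-null encoding belongs to negative infinity, so null compares lower than every real value, and the canonical NaN sorts last.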

What do you think? Any ideas, experiences, etc?

Thanks,
Nick


  • Matt Corgan at Apr 2, 2013 at 6:17 am
    Ah, I didn't even realize SQL allowed null key parts. Maybe a goal of the
    interfaces should be to provide first-class support for custom user types
    in addition to the standard ones included. Part of the power of HBase's
    plain byte[] keys is that users can concoct the perfect key for their data
    type. For example, I have a lot of geographic data where I interleave
    latitude/longitude bits into a sortable 64-bit value that would probably
    never be included in a standard library.
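
For readers unfamiliar with the interleaving idea, here is a rough sketch of one way it can be done (a Z-order or Morton code). This is not the code from the data-tools repo. The class GeoInterleave, the 32-bit quantization, and the choice to put latitude in the odd bits are all assumptions made for illustration.

```java
/** Hypothetical sketch: interleave quantized latitude/longitude bits into one 64-bit value. */
final class GeoInterleave {
  private GeoInterleave() {}

  /** Maps a coordinate in [min, max] onto the unsigned 32-bit range, held in a long. */
  static long quantize(double value, double min, double max) {
    if (!(value >= min && value <= max)) { // also rejects NaN
      throw new IllegalArgumentException("coordinate out of range: " + value);
    }
    double fraction = (value - min) / (max - min);
    return (long) (fraction * 0xFFFFFFFFL);
  }

  /** Spreads the low 32 bits of x so that bit i moves to bit 2*i. */
  static long spread(long x) {
    x &= 0xFFFFFFFFL;
    x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
    x = (x | (x << 8)) & 0x00FF00FF00FF00FFL;
    x = (x | (x << 4)) & 0x0F0F0F0F0F0F0F0FL;
    x = (x | (x << 2)) & 0x3333333333333333L;
    x = (x | (x << 1)) & 0x5555555555555555L;
    return x;
  }

  /** Latitude bits land in the odd positions, longitude bits in the even positions. */
  static long interleave(double latitude, double longitude) {
    long lat = quantize(latitude, -90.0, 90.0);
    long lon = quantize(longitude, -180.0, 180.0);
    return (spread(lat) << 1) | spread(lon);
  }
}
```

The result should be treated as an unsigned 64-bit value, so writing it big-endian gives row keys whose byte order follows the interleaved bit order.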

    On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar wrote:

    I think having Int32, and NullableInt32 would support minimum overhead, as
    well as allowing SQL semantics.
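
One possible reading of the Int32 / NullableInt32 split, as a sketch: the plain type stays at 4 bytes and covers the full int range, while the nullable type pays for a one-byte header. The class Int32Codecs and the header values 0x00 and 0x01 are assumptions, not an existing HBase API.

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a non-nullable 4-byte int next to a nullable variant with a header byte. */
final class Int32Codecs {
  private Int32Codecs() {}

  /** Int32: exactly 4 bytes, full range, no null. */
  static byte[] encodeInt32(int value) {
    // Sign-bit flip keeps unsigned byte order aligned with signed numeric order.
    return ByteBuffer.allocate(4).putInt(value ^ Integer.MIN_VALUE).array();
  }

  /** NullableInt32: a single 0x00 byte for null, otherwise 0x01 followed by the 4-byte encoding. */
  static byte[] encodeNullableInt32(Integer value) {
    if (value == null) {
      return new byte[] {0x00};
    }
    return ByteBuffer.allocate(5).put((byte) 0x01).putInt(value ^ Integer.MIN_VALUE).array();
  }

  static Integer decodeNullableInt32(byte[] bytes) {
    if (bytes[0] == 0x00) {
      return null;
    }
    return ByteBuffer.wrap(bytes, 1, 4).getInt() ^ Integer.MIN_VALUE;
  }
}
```

Callers who never need null keep the minimum size, and callers who want SQL semantics keep the full numeric range at the cost of one extra byte.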

    On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk wrote:

    Furthermore, is it more important to support null values than to squeeze all
    representations into minimum size (4 bytes for int32, &c.)?

    On Apr 1, 2013 4:41 PM, "Nick Dimiduk" wrote:

    On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com>
    wrote:

    From the SQL perspective, handling null is important.

    From your perspective, it is critical to support NULLs, even at the
    expense of having fixed-width encodings at all, or of representing the
    full range of values. That is, you'd rather be able to represent NULL than
    -2^31?

    On 04/01/2013 01:32 PM, Nick Dimiduk wrote:

    Thanks for the thoughtful response (and code!).

    I'm thinking I will press forward with a base implementation that does not
    support nulls. The idea is to provide an extensible set of interfaces, so I
    think this will not box us into a corner later. That is, a mirroring
    package could be implemented that supports null values and accepts
    the relevant trade-offs.

    Thanks,
    Nick

    On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com>
    wrote:

    I spent some time this weekend extracting bits of our serialization code to
    a public github repo at http://github.com/hotpads/data-tools.
    Contributions are welcome - I'm sure we all have this stuff laying around.

    You can see I've bumped into the NULL problem in a few places:

    * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
    * https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java

    Looking back, I think my latest opinion on the topic is to reject
    nullability as the rule since it can cause unexpected behavior and
    confusion. It's cleaner to provide a wrapper class (so both LongArrayList
    plus NullableLongArrayList) that explicitly defines the behavior, and costs
    a little more in performance. If the user can't find a pre-made wrapper
    class, it's not very difficult for each user to provide their own
    interpretation of null and check for it themselves.

    If you reject nullability, the question becomes what to do in situations
    where you're implementing existing interfaces that accept nullable params.
    The LongArrayList above implements List<Long> which requires an add(Long)
    method. In the above implementation I chose to swap nulls with
    Long.MIN_VALUE, however I'm now thinking it best to force the user to make
    that swap and then throw IllegalArgumentException if they pass null.
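
The "throw instead of swap" behavior described in this message might look roughly like the following. This is an illustrative sketch only, not the LongArrayList from the data-tools repo. The class name StrictLongArrayList and its growth policy are made up for the example.

```java
import java.util.AbstractList;
import java.util.Arrays;

/** Hypothetical sketch: a List<Long> backed by long[] that rejects null instead of swapping it. */
final class StrictLongArrayList extends AbstractList<Long> {
  private long[] values = new long[8];
  private int size;

  @Override
  public boolean add(Long value) {
    if (value == null) {
      // The caller must choose a sentinel (for example Long.MIN_VALUE) explicitly.
      throw new IllegalArgumentException("null is not supported by this list");
    }
    if (size == values.length) {
      values = Arrays.copyOf(values, size * 2); // grow the primitive backing array
    }
    values[size++] = value;
    modCount++; // keeps iterators fail-fast, as AbstractList expects
    return true;
  }

  @Override
  public Long get(int index) {
    if (index < 0 || index >= size) {
      throw new IndexOutOfBoundsException("index " + index + ", size " + size);
    }
    return values[index];
  }

  @Override
  public int size() {
    return size;
  }
}
```

A nullable wrapper could then sit on top of a list like this and map null to whatever sentinel its documentation promises, which keeps the surprising behavior out of the base class.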


    On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com>
    wrote:

    Hmmm… good question.

    I think that fixed width support is important for a great many rowkey
    constructs, so I'd rather see something like losing MIN_VALUE and
    keeping fixed width.





  • Dmitriy Ryaboy at Apr 3, 2013 at 6:29 pm
    Hiya Nick,
    Pig converts data for HBase storage using this class:
    https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java
    (which is mostly just calling into HBase's Bytes class). As long as Bytes
    handles the null stuff, we'll just inherit the behavior.

    On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk wrote:

    I agree that a user-extensible interface is a required feature here.
    Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
    keep in mind, though, that SQL and user applications are not the only
    consumers of this interface. A big motivation is allowing interop with the
    other higher-level MR languages. *cough* Where are my Pig and Hive peeps in
    this thread?

    On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com>
    wrote:

    Maybe if we can keep nullability separate from the
    serialization/deserialization, we can come up with a solution that works?
    We're able to essentially infer that a column is null based on its value
    being missing or empty. So if an iterator through the row key bytes could
    detect/indicate that, then an application could "infer" the value is null.
    We're definitely planning on keeping byte[] accessors for use cases that
    need it. I'm curious about the geographic data case, though: could you use a
    fixed length long with a couple of new SQL built-ins to encode/decode the
    latitude/longitude?
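
The "infer null from a missing or empty value" idea could be sketched as below. This is a guess at the shape of such an iterator, not how any existing SQL layer does it. It assumes variable-length row key fields separated by a 0x00 byte that never appears inside a field, and the class RowKeyFields is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: walk a separator-delimited row key and report empty fields as null. */
final class RowKeyFields {
  static final byte SEPARATOR = 0x00; // assumed never to occur inside a field

  private RowKeyFields() {}

  static List<byte[]> split(byte[] rowKey) {
    List<byte[]> fields = new ArrayList<>();
    int start = 0;
    for (int i = 0; i <= rowKey.length; i++) {
      boolean atEnd = i == rowKey.length;
      if (atEnd || rowKey[i] == SEPARATOR) {
        // An empty span between separators carries no bytes, so treat it as null.
        fields.add(i == start ? null : Arrays.copyOfRange(rowKey, start, i));
        start = i + 1;
      }
    }
    return fields;
  }
}
```

Here the serializers themselves never see null. Nullability lives entirely in how the key is framed, which does not carry over to fixed-width fields, where there is no "empty" to detect.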

    On 04/01/2013 11:29 PM, Jesse Yates wrote:

    Actually, that isn't all that far-fetched of a format Matt - pretty common
    anytime anyone wants to do sortable lat/long (*cough* three letter agencies
    *cough*).

    Wouldn't we get the same by providing a simple set of libraries (a la
    orderly + other HBase useful things) and then still give access to the
    underlying byte array? Perhaps a nullable key type in that lib makes sense
    if lots of people need it, and it would be nice to have standard libraries
    so tools could interop much more easily.
    -------------------
    Jesse Yates
    @jesse_yates
    jyates.github.com


  • Michel Segel at Apr 2, 2013 at 9:34 am
    Silly question...
    Null support. In a system where a column may or may not exist, how do you support null?

    ;-)

    In terms of a key, it's a primary key and can't be null.


    So what am I missing?


    Sent from a remote device. Please excuse any typos...

    Mike Segel

Discussion Overview
group: user @
categories: hbase, hadoop
posted: Apr 1, '13 at 6:01p
active: Apr 3, '13 at 6:29p
posts: 4
users: 4
website: hbase.apache.org
