Pig converts data for HBase storage using this class:
https://svn.apache.org/repos/asf/pig/trunk/src/org/apache/pig/backend/hadoop/hbase/HBaseBinaryConverter.java (which
is mostly just calling into HBase's Bytes class). As long as Bytes
handles the null stuff, we'll just inherit the behavior.
On Tue, Apr 2, 2013 at 9:40 AM, Nick Dimiduk wrote:
I agree that a user-extensible interface is a required feature here.
Personally, I'd love to ship a set of standard GIS tools on HBase. Let's
keep in mind, though, that SQL and user applications are not the only
consumers of this interface. A big motivation is allowing interop with the
other higher MR languages. *cough* Where are my Pig and Hive peeps in this
thread?
On Mon, Apr 1, 2013 at 11:33 PM, James Taylor <jtaylor@salesforce.com> wrote:
Maybe if we can keep nullability separate from the
serialization/deserialization, we can come up with a solution that works?
We're able to essentially infer that a column is null based on its value
being missing or empty. So if an iterator through the row key bytes could
detect/indicate that, then an application could "infer" the value is null.
We're definitely planning on keeping byte[] accessors for use cases that
need it. I'm curious about the geographic data case, though: could you use a
fixed length long with a couple of new SQL built-ins to encode/decode the
latitude/longitude?
On 04/01/2013 11:29 PM, Jesse Yates wrote:
Actually, that isn't all that far-fetched of a format Matt - pretty
common anytime anyone wants to do sortable lat/long (*cough* three letter
agencies *cough*).
Wouldn't we get the same by providing a simple set of libraries (ala
orderly + other HBase useful things) and then still give access to the
underlying byte array? Perhaps a nullable key type in that lib makes
sense if lots of people need it and it would be nice to have standard
libraries so tools could interop much more easily.
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
On Mon, Apr 1, 2013 at 11:17 PM, Matt Corgan wrote:
Ah, I didn't even realize sql allowed null key parts. Maybe a goal of
the interfaces should be to provide first-class support for custom user
types in addition to the standard ones included. Part of the power of
hbase's plain byte[] keys is that users can concoct the perfect key for
their data type. For example, I have a lot of geographic data where I
interleave latitude/longitude bits into a sortable 64 bit value that
would probably never be included in a standard library.
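The interleaving described above is not spelled out in the thread. As a rough sketch, assuming a plain Z-order (Morton) interleave of two 32-bit scaled coordinates, it might look like the following. The class and method names (LatLongKey, scale, spread, interleave, toBytes) are invented for illustration and are not taken from any library mentioned here.

```java
final class LatLongKey {
    // Map a coordinate in [min, max] onto an unsigned 32-bit integer held in a long.
    // Assumes the caller passes a value inside the range; no clamping is done.
    static long scale(double value, double min, double max) {
        double fraction = (value - min) / (max - min);
        return (long) (fraction * 0xFFFFFFFFL);
    }

    // Spread the low 32 bits of x so they occupy the even bit positions of a long.
    static long spread(long x) {
        x &= 0xFFFFFFFFL;
        x = (x | (x << 16)) & 0x0000FFFF0000FFFFL;
        x = (x | (x << 8))  & 0x00FF00FF00FF00FFL;
        x = (x | (x << 4))  & 0x0F0F0F0F0F0F0F0FL;
        x = (x | (x << 2))  & 0x3333333333333333L;
        x = (x | (x << 1))  & 0x5555555555555555L;
        return x;
    }

    // Latitude bits land on odd positions, longitude bits on even positions.
    static long interleave(double lat, double lon) {
        long latBits = scale(lat, -90.0, 90.0);
        long lonBits = scale(lon, -180.0, 180.0);
        return (spread(latBits) << 1) | spread(lonBits);
    }

    // Big-endian bytes, so an unsigned lexicographic byte comparison
    // orders keys the same way as the unsigned 64-bit value.
    static byte[] toBytes(long key) {
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) key;
            key >>>= 8;
        }
        return out;
    }
}
```

The result is treated as an unsigned 64-bit value, which is why it is written out big-endian rather than compared as a signed long.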
On Mon, Apr 1, 2013 at 8:38 PM, Enis Söztutar <enis.soz@gmail.com> wrote:
I think having Int32, and NullableInt32 would support minimum overhead,
as well as allowing SQL semantics.
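To make the Int32 / NullableInt32 split concrete, here is a minimal sketch under stated assumptions: the non-nullable form is 4 big-endian bytes with the sign bit flipped, and the nullable form spends one extra header byte (0x00 for null, 0x01 otherwise) while staying fixed width. The Int32Codec name, the header values, and the zero padding for null are illustrative choices, not something the thread settles on.

```java
final class Int32Codec {
    // Non-nullable: 4 bytes, big-endian, sign bit flipped so negatives sort first.
    static byte[] encode(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24), (byte) (flipped >>> 16),
            (byte) (flipped >>> 8), (byte) flipped
        };
    }

    // Reads the 4-byte form starting at the given offset.
    static int decode(byte[] bytes, int offset) {
        int flipped = ((bytes[offset] & 0xFF) << 24)
                    | ((bytes[offset + 1] & 0xFF) << 16)
                    | ((bytes[offset + 2] & 0xFF) << 8)
                    | (bytes[offset + 3] & 0xFF);
        return flipped ^ 0x80000000;
    }

    // Nullable: always 5 bytes. Header 0x00 plus zero padding means null;
    // header 0x01 is followed by the 4-byte form above.
    static byte[] encodeNullable(Integer value) {
        byte[] out = new byte[5];
        if (value == null) {
            return out; // all zeros, sorts before every non-null value
        }
        out[0] = 0x01;
        System.arraycopy(encode(value), 0, out, 1, 4);
        return out;
    }

    static Integer decodeNullable(byte[] bytes) {
        if (bytes[0] == 0x00) {
            return null;
        }
        return decode(bytes, 1);
    }
}
```

Callers who never need null pay 4 bytes; callers who want SQL semantics pay 5 and keep the full int range.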
On Mon, Apr 1, 2013 at 7:26 PM, Nick Dimiduk <ndimiduk@gmail.com> wrote:
Furthermore, is it more important to support null values than to squeeze
all representations into minimum size (4 bytes for int32, &c.)?
On Apr 1, 2013 4:41 PM, "Nick Dimiduk" wrote:
On Mon, Apr 1, 2013 at 4:31 PM, James Taylor <jtaylor@salesforce.com> wrote:
From the SQL perspective, handling null is important.
From your perspective, it is critical to support NULLs, even at the
expense of fixed-width encodings at all or supporting representation
of a full range of values. That is, you'd rather be able to represent
NULL than -2^31?
On 04/01/2013 01:32 PM, Nick Dimiduk wrote:
Thanks for the thoughtful response (and code!).
I'm thinking I will press forward with a base implementation that
does not support nulls. The idea is to provide an extensible set of
interfaces, so I think this will not box us into a corner later. That
is, a mirroring package could be implemented that supports null values
and accepts the relevant trade-offs.
Thanks,
Nick
On Mon, Apr 1, 2013 at 12:26 PM, Matt Corgan <mcorgan@hotpads.com> wrote:
I spent some time this weekend extracting bits of our serialization
code to a public github repo at http://github.com/hotpads/data-tools.
Contributions are welcome - I'm sure we all have this stuff laying
around.
You can see I've bumped into the NULL problem in a few places:
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/primitive/lists/LongArrayList.java
* https://github.com/hotpads/data-tools/blob/master/src/main/java/com/hotpads/data/types/floats/DoubleByteTool.java
Looking back, I think my latest opinion on the topic is to reject
nullability as the rule since it can cause unexpected behavior and
confusion. It's cleaner to provide a wrapper class (so both
LongArrayList plus NullableLongArrayList) that explicitly defines the
behavior, and costs a little more in performance. If the user can't
find a pre-made wrapper class, it's not very difficult for each user to
provide their own interpretation of null and check for it themselves.
If you reject nullability, the question becomes what to do in
situations where you're implementing existing interfaces that accept
nullable params. The LongArrayList above implements List<Long> which
requires an add(Long) method. In the above implementation I chose to
swap nulls with Long.MIN_VALUE, however I'm now thinking it best to
force the user to make that swap and then throw
IllegalArgumentException if they pass null.
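The two behaviours contrasted above can be sketched roughly as follows. This is not the hotpads LongArrayList: StrictLongList and NullableLongList are hypothetical stand-ins, kept free of the List<Long> interface for brevity. They show only the "throw on null" rule and an explicit wrapper that reserves Long.MIN_VALUE for null.

```java
final class StrictLongList {
    private long[] values = new long[8];
    private int size = 0;

    // Rejects null instead of silently swapping in Long.MIN_VALUE.
    void add(Long value) {
        if (value == null) {
            throw new IllegalArgumentException("null is not supported; choose a sentinel yourself");
        }
        if (size == values.length) {
            long[] grown = new long[values.length * 2];
            System.arraycopy(values, 0, grown, 0, size);
            values = grown;
        }
        values[size++] = value;
    }

    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return values[index];
    }

    int size() {
        return size;
    }
}

final class NullableLongList {
    // The null convention is stated in one place: Long.MIN_VALUE stands in for null.
    static final long NULL_SENTINEL = Long.MIN_VALUE;
    private final StrictLongList delegate = new StrictLongList();

    void add(Long value) {
        if (value != null && value == NULL_SENTINEL) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        }
        delegate.add(value == null ? NULL_SENTINEL : value);
    }

    Long get(int index) {
        long raw = delegate.get(index);
        return raw == NULL_SENTINEL ? null : raw;
    }
}
```

The wrapper pays for an extra comparison on every call, which is the performance cost mentioned above.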
On Mon, Apr 1, 2013 at 11:41 AM, Doug Meil <doug.meil@explorysmedical.com> wrote:
Hmmm... good question.
I think that fixed width support is important for a great many rowkey
construct cases, so I'd rather see something like losing MIN_VALUE and
keeping fixed width.
On 4/1/13 2:00 PM, "Nick Dimiduk" wrote:
Heya,
Thinking about data types and serialization. I think null support is an
important characteristic for the serialized representations, especially
when considering the compound type. However, doing so is directly
incompatible with fixed-width representations for numerics. For
instance, if we want to have a fixed-width signed long stored on
8-bytes, where do you put null? float and double types can cheat a
little by folding negative and positive NaN's into a single
representation (this isn't strictly correct!), leaving a place to
represent null. In the long example case, the obvious choice is to
reduce MAX_VALUE or increase MIN_VALUE by one. This will allocate an
additional encoding which can be used for null. My experience working
with scientific data, however, makes me wince at the idea.
The variable-width encodings have it a little easier. There's already
enough going on that it's simpler to make room.
Remember, the final goal is to support order-preserving serialization.
This imposes some limitations on our encoding strategies. For instance,
it's not enough to simply encode null, it really needs to be encoded as
0x00 so as to sort lexicographically earlier than any other value.
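As a sketch of the "increase MIN_VALUE by one" option for the 8-byte long, assume a big-endian encoding with the sign bit flipped. Under that encoding Long.MIN_VALUE would come out as eight 0x00 bytes, so that slot can be handed to null and will sort before every remaining value. OrderedLongCodec and its methods are hypothetical names, and this is one possible layout rather than a settled design.

```java
final class OrderedLongCodec {
    // 8 bytes, big-endian, sign bit flipped. The all-zero encoding that
    // Long.MIN_VALUE would occupy is given to null; MIN_VALUE is rejected.
    static byte[] encode(Long value) {
        long bits;
        if (value == null) {
            bits = 0L;
        } else if (value == Long.MIN_VALUE) {
            throw new IllegalArgumentException("Long.MIN_VALUE is reserved for null");
        } else {
            bits = value ^ Long.MIN_VALUE;
        }
        byte[] out = new byte[8];
        for (int i = 7; i >= 0; i--) {
            out[i] = (byte) bits;
            bits >>>= 8;
        }
        return out;
    }

    static Long decode(byte[] bytes) {
        long bits = 0L;
        for (int i = 0; i < 8; i++) {
            bits = (bits << 8) | (bytes[i] & 0xFFL);
        }
        return bits == 0L ? null : bits ^ Long.MIN_VALUE;
    }

    // Unsigned lexicographic comparison, the order HBase applies to row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}
```

With this layout null, -5, 0 and 5 sort in that order under a plain byte comparison, at the price of one unrepresentable value.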
What do you think? Any ideas, experiences, etc?
Thanks,
Nick