Hello,

First of all, I'm new to Pig and NoSQL, so I hope you'll forgive any stupid
questions ;-)

So, I'm playing with OpenTSDB (a software layer on top of HBase for handling
time series data) and now I'd like to run some data mining queries on my
timestamped data. I found that Pig could be a solution, so I tried to make
it work on top of the OpenTSDB data in HBase. It nearly works, but I'm
still confused.

OpenTSDB schema :
hbase(main):011:0> describe 'tsdb-uid'
DESCRIPTION                                                                 ENABLED
{NAME => 'tsdb-uid', FAMILIES => [{NAME => 'id', BLOOMFILTER => 'NONE',     true
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}, {NAME => 'name', BLOOMFILTER => 'NONE',
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}

hbase(main):012:0> describe 'tsdb'
DESCRIPTION                                                                 ENABLED
{NAME => 'tsdb', FAMILIES => [{NAME => 't', BLOOMFILTER => 'NONE',          true
REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
BLOCKCACHE => 'true'}]}

Sample UID data:
hbase(main):014:0> scan 'tsdb-uid'
ROW               COLUMN+CELL
\x00\x00\x01      column=name:metrics, timestamp=1314801674803, value=proc.loadavg.1m
\x00\x00\x01      column=name:tagk, timestamp=1314801684953, value=validity
\x00\x00\x01      column=name:tagv, timestamp=1314801685000, value=true
\x00\x00\x02      column=name:metrics, timestamp=1314801674849, value=proc.loadavg.5m
\x00\x00\x02      column=name:tagk, timestamp=1314801685049, value=device
\x00\x00\x02      column=name:tagv, timestamp=1314801685096, value=Device1
\x00\x00\x03      column=name:metrics, timestamp=1314801674898, value=Measurement_1
\x00\x00\x03      column=name:tagk, timestamp=1314801685144, value=accuracy
\x00\x00\x03      column=name:tagv, timestamp=1314801693030, value=Device2
\x00\x00\x04      column=name:metrics, timestamp=1314801674947, value=Measurement_2
\x00\x00\x05      column=name:metrics, timestamp=1314801674994, value=Measurement_3
Device1           column=id:tagv, timestamp=1314801685097, value=\x00\x00\x02
Device2           column=id:tagv, timestamp=1314801693031, value=\x00\x00\x03
Measurement_1     column=id:metrics, timestamp=1314801674899, value=\x00\x00\x03
Measurement_2     column=id:metrics, timestamp=1314801674948, value=\x00\x00\x04
Measurement_3     column=id:metrics, timestamp=1314801674995, value=\x00\x00\x05
accuracy          column=id:tagk, timestamp=1314801685145, value=\x00\x00\x03
device            column=id:tagk, timestamp=1314801685050, value=\x00\x00\x02
proc.loadavg.1m   column=id:metrics, timestamp=1314801674804, value=\x00\x00\x01
proc.loadavg.5m   column=id:metrics, timestamp=1314801674850, value=\x00\x00\x02
true              column=id:tagv, timestamp=1314801685002, value=\x00\x00\x01
validity          column=id:tagk, timestamp=1314801684955, value=\x00\x00\x01

These are the metrics (the names of the timestamped data series, stored under
id:metrics) and the tags that describe the data (tagk for the tag key and tagv
for the tag value, e.g. validity = true).
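
To make the two-way layout of the scan output concrete, here is a minimal Python sketch. It is only an in-memory model built from a few of the rows shown in the scan, not OpenTSDB or HBase code; the `cells` dictionary and the helper names `uid_for` and `name_for` are hypothetical.

```python
# Hypothetical in-memory model of a few 'tsdb-uid' rows from the scan output.
# Keys are (row key, "family:qualifier"); values are the raw cell contents.
cells = {
    (b"\x00\x00\x01", "name:metrics"): b"proc.loadavg.1m",
    (b"\x00\x00\x03", "name:metrics"): b"Measurement_1",
    (b"proc.loadavg.1m", "id:metrics"): b"\x00\x00\x01",
    (b"Measurement_1", "id:metrics"): b"\x00\x00\x03",
}


def uid_for(name: str, kind: str = "metrics") -> bytes:
    """Forward lookup (name -> UID), read from the 'id' column family."""
    return cells[(name.encode("ascii"), f"id:{kind}")]


def name_for(uid: bytes, kind: str = "metrics") -> str:
    """Reverse lookup (UID -> name), read from the 'name' column family."""
    return cells[(uid, f"name:{kind}")].decode("ascii")
```

In this model the row key is a name for forward lookups and a UID for reverse lookups, which is why the same table holds both kinds of rows.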

So from Pig, when I want to retrieve only the metrics and their values (= the
ids used in the data table), I do:
tsd_metrics = LOAD 'hbase://tsdb-uid' using
org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics', '-loadKey true')
AS (metrics:bytearray);
dump tsd_metrics;

HadoopVersion  PigVersion      UserId    StartedAt            FinishedAt           Features
0.20.2         0.8.1-SNAPSHOT  opentsdb  2011-09-06 13:39:27  2011-09-06 13:39:34  UNKNOWN
Success!
Job Stats (time in seconds):
JobId           Alias        Feature   Outputs
job_local_0004  tsd_metrics  MAP_ONLY  file:/tmp/temp-1850282462/tmp1589556736,
Input(s):
Successfully read records from: "hbase://tsdb-uid"
Output(s):
Successfully stored records in: "file:/tmp/temp-1850282462/tmp1589556736"
Job DAG:
job_local_0004

(Measurement_1,)
(Measurement_2,)
(Measurement_3,)
(proc.loadavg.1m,)
(proc.loadavg.5m,)

So that's nearly OK, except that the value (= id) displayed is null instead
of \x00\x00\x03, for example, in the case of Measurement_1.

Any ideas?

thx !

shazz


  • Norbert Burger at Sep 6, 2011 at 1:38 pm

    Shazz -- if you use the "-loadKey" option to HBaseStorage, then your
    LOAD schema includes an extra column containing the row key, and you
    should add a matching field to your schema (the AS clause).
    Try the following:

    tsd_metrics = LOAD 'hbase://tsdb-uid' using
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics',
    '-loadKey true') AS (key:bytearray, metrics:bytearray);

    Norbert
  • shazz Ng at Sep 6, 2011 at 2:00 pm
    Hello Norbert,

    Unfortunately, same result :
    (Measurement_1,)
    (Measurement_2,)
    (Measurement_3,)
    (proc.loadavg.1m,)
    (proc.loadavg.5m,)

    The row key is extracted correctly (Measurement_1, for example), but the
    value (the bytearray id I need for querying the timestamped data) is not :(

    shazz
  • Bryce Poole at Sep 6, 2011 at 2:19 pm
    Try adding -caster=HBaseBinaryConverter along with loadKey

    '-caster=HBaseBinaryConverter -loadKey=true'

    -bp
  • shazz Ng at Sep 6, 2011 at 2:30 pm
    Hello Bryce,

    not better... :-(

    grunt> tsd_metrics2 = LOAD 'hbase://tsdb-uid' using
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics',
    '-caster=HBaseBinaryConverter -loadKey=true') AS (key:bytearray,
    metrics:bytearray);
    grunt> dump tsd_metrics2;

    [...]

    (Measurement_1,)
    (Measurement_2,)
    (Measurement_3,)
    (proc.loadavg.1m,)
    (proc.loadavg.5m,)

  • shazz Ng at Sep 6, 2011 at 3:01 pm
    The 'funny' thing is that if I look at the other CF, name (the reverse
    mapping, which gives the name for a byte id):

    grunt> tsd_metrics2 = LOAD 'hbase://tsdb-uid' using
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('name:metrics',
    '-caster=HBaseBinaryConverter -loadKey=true') AS (key:bytearray,
    metrics:bytearray);

    I've got the same issue:
    (,proc.loadavg.1m)
    (,proc.loadavg.5m)
    (,Measurement_1)
    (,Measurement_2)
    (,Measurement_3)

    So there is a real issue with byte arrays...
  • Bryce Poole at Sep 6, 2011 at 5:03 pm
    My load looks like this

    .... AS (key:chararray, value:long);

    and I'm able to return data.

    I changed the load to

    .... AS (key:chararray, value:bytearray);

    and had results that match yours.

    Try changing the value to long or int type and see if that helps.

    -bp

  • Dmitriy Ryaboy at Sep 6, 2011 at 5:10 pm
    That's interesting... we should be able to return a byte array properly
    (though this is a bit risky for people who try to later turn this bytearray
    into a long using Pig, since the conversion from bytes to longs in Pig is
    different than in HBase).

    Could you guys open a jira, preferably with an easy way to reproduce the
    error?

    D
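
    The byte-to-long mismatch mentioned above can be illustrated outside Pig. The following is a minimal Python sketch, assuming one side stores a number as its text digits and the other as a fixed 8-byte big-endian value; the function names `text_to_long` and `binary_to_long` are hypothetical and are not Pig or HBase APIs.

```python
import struct


def text_to_long(raw: bytes) -> int:
    """Text-style decoding: the bytes hold the ASCII digits of the number."""
    return int(raw.decode("ascii"))


def binary_to_long(raw: bytes) -> int:
    """Binary-style decoding: exactly 8 bytes, big-endian, signed."""
    (value,) = struct.unpack(">q", raw)
    return value


# The same number, 42, in the two encodings.
as_text = b"42"
as_binary = struct.pack(">q", 42)
```

    Each decoder only recovers the number from its own encoding. Feeding the 8-byte form to the text decoder, or the 2-byte text form to the binary decoder, raises an error instead of returning 42, which is the kind of mismatch a caster option is meant to avoid.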
  • Dmitriy Ryaboy at Sep 6, 2011 at 5:11 pm
    (fwiw, HBaseStorage works fine for me when I use it to pull whole protocol
    buffer messages down as byte arrays)
  • shazz Ng at Sep 7, 2011 at 6:41 am
    Thanks Dmitriy !

    Indeed, it works when using the caster AND declaring the value field
    (value or metrics) as long:
    grunt> tsd_metrics2 = LOAD 'hbase://tsdb-uid' using
    org.apache.pig.backend.hadoop.hbase.HBaseStorage('id:metrics',
    '-caster=HBaseBinaryConverter -loadKey=true') AS (key:bytearray,
    metrics:long);

    I don't really understand why the HBaseStorage LoadFunc considers that
    cf:qualifier == value, but why not... I'll look in the code :)
    I'll try to set up an easy way to reproduce it and I'll file a JIRA.

    BTW, I'm not sure I understood your last comment: how did you manage to
    pull byte arrays, then?

    shazz
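
    For reference, the ids in the scan output are 3 bytes wide. A minimal Python sketch of how such an id maps to an integer and back, assuming the ids are unsigned big-endian counters (an assumption based only on the values shown in this thread); `uid_to_int` and `int_to_uid` are hypothetical helper names, and this says nothing about how Pig itself performs the cast.

```python
def uid_to_int(uid: bytes) -> int:
    """Interpret a UID as an unsigned big-endian integer (assumed layout)."""
    return int.from_bytes(uid, byteorder="big")


def int_to_uid(value: int, width: int = 3) -> bytes:
    """Encode an integer as a fixed-width big-endian UID (3 bytes here)."""
    return value.to_bytes(width, byteorder="big")
```

    Under that assumption, the id \x00\x00\x03 stored for Measurement_1 corresponds to the integer 3.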


Discussion Overview
group: user @
categories: pig, hadoop
posted: Sep 6, '11 at 11:59a
active: Sep 7, '11 at 6:41a
posts: 10
users: 4
website: pig.apache.org
