Grokbase Groups Pig user July 2011
Hi,

I'd like Pig to load only a subset of an HBase table, based on
the timestamp of the records or on the row keys.

For example, I'd like to load only records that have a timestamp >
N, or a key > "something".

I know that HBase scanners are highly optimized for
this kind of thing, and it would greatly reduce the time
needed to load my data.

Is there any way to do this?
If not, is it planned for the HBase loader?
If not, is it technically possible?
If so, can I contribute a patch?

Thanks a lot!


  • Norbert Burger at Jul 28, 2011 at 11:19 am
    You can instruct HBaseStorage to load a subset of the rows using its "-gt"
    and "-lt" options, documented here [1].

    I don't believe querying by timestamp is currently supported in Pig, based
    on the comments to [2]. A standalone JIRA has been created [3].

    Norbert

    [1] http://ofps.oreilly.com/titles/9781449302641/community.html#hbase_options_table
    [2] https://issues.apache.org/jira/browse/PIG-1782
    [3] https://issues.apache.org/jira/browse/PIG-1832
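
    As a rough illustration of the "-gt" and "-lt" options mentioned above, a
    load restricted to a row-key range might look like the sketch below. The
    table name 'events', the column family 'cf', the columns 'type' and
    'payload', and the two key bounds are all hypothetical placeholders. The
    exact option syntax may differ between Pig versions, so check the
    HBaseStorage documentation for the release you run.

    ```pig
    -- Sketch only: 'events', 'cf:type', 'cf:payload' and the key bounds are
    -- made-up names, not taken from this thread.
    -- -gt / -lt ask HBaseStorage to return only rows whose key is greater
    -- than / less than the given value. HBase compares row keys
    -- lexicographically as bytes, so 'row10' sorts before 'row9'.
    events = LOAD 'hbase://events'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
            'cf:type cf:payload',
            '-gt something -lt somethinz')
        AS (type:chararray, payload:chararray);

    DUMP events;
    ```

    Because the bounds are applied when the table is read, rows outside the
    range should not have to be loaded and then filtered in the script.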
  • Vincent Barat at Jul 28, 2011 at 12:54 pm
    Thanks for the input. [3] is more about timestamp storage,
    but I added my 2 cents to the issue regarding loading by timestamp.

  • Norbert Burger at Jul 28, 2011 at 1:00 pm
    [3] is titled with respect to storage, but if you read through the comments
    of [2], Dmitriy mentions that it'll also include querying.

    Norbert
  • Bill Graham at Jul 28, 2011 at 5:27 pm
    Timestamp-based querying is being handled in
    https://issues.apache.org/jira/browse/PIG-2114, FYI.

