FAQ
Hi,

Consider this use case:

There is a program that stores CPU usage metrics in an HBase table. This
HBase table has a column family called cpu, and individual CPU core
usage is stored in columns like cpu:user.0, cpu:user.1, etc. The
numeric suffix represents the unique CPU core id in the system.

While it is possible to write a query like:

SystemMetrics = load 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster
cpu:combined.0 cpu:combined.1 ... system:LoadAverage.1','-loadKey') AS
(rowKey: chararray, cluster: chararray, cpuCombined0:float,
cpuCombined1:float ... LoadAverage:float);

this requires a long list of columns to load, and the same list must be
repeated in a foreach or group by command like:

CleanseBuffer = foreach SystemMetrics generate
REGEX_EXTRACT($0,'^\\d+',0) as time, cluster, cpuCombined0,
cpuCombined1, ..., LoadAverage;

The syntax works fine, but it would be nice to load all columns of a
given column family without specifying individual columns.

i.e. SystemMetrics = load 'hbase://SystemMetrics' USING
org.apache.pig.backend.hadoop.hbase.HBaseStorage('tags:cluster cpu
system');

Is this syntax possible to implement in Pig?

Second question, is it possible to make alteration of a tuple in a
bag, but not specifying other tuples in the same bag?

For tables with many columns, it would be nice to have a shorthand
syntax that makes Pig scripts shorter to write.
Any tips on making foreach and group by shorter? Thanks

regards,
Eric

  • Dmitriy Ryaboy at Dec 30, 2010 at 2:10 am
    Hi Eric,
    Yes, we can certainly add the convention that a string without a ":" refers
    to a complete column family.
    It should be fairly straightforward: step 1 is to open a ticket on the
    Jira, step 2 is to do it :).

    I am not sure what you mean by "make alteration of a tuple in a bag, but not
    specifying other tuples in the same bag" -- can you provide an example that
    illustrates what you want to do?

    Thanks,
    -Dmitriy
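
The convention proposed above (a token without a ":" names a whole column family) can be sketched as a small parser. This is only an illustration in Python, not Pig or HBaseStorage code; the function name parse_column_spec and the use of None to mean "every column in the family" are assumptions made for this sketch.

```python
def parse_column_spec(spec):
    """Split an HBaseStorage-style column string into (family, qualifier) pairs.

    A token containing ":" names a single column. A token without ":" is
    taken to mean the whole column family, signalled here by qualifier=None.
    This mirrors the convention proposed in the thread, not a released API.
    """
    columns = []
    for token in spec.split():
        family, sep, qualifier = token.partition(":")
        # sep is empty when the token had no ":", i.e. a bare family name
        columns.append((family, qualifier if sep else None))
    return columns


print(parse_column_spec("tags:cluster cpu system"))
```

With the column string from the question, tags:cluster stays a single column while cpu and system are each treated as a full family.
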
  • Eric Yang at Dec 30, 2010 at 5:12 am
    Hi Dmitriy,

    Issue filed: https://issues.apache.org/jira/browse/PIG-1782

    I meant to say columns in my previous message. It should read as
    "Make alteration of a column in a bag, but not specifying other
    columns in the same bag".

    Let's assume PIG-1782 is addressed and CpuMetrics from the PIG-1782
    example contains 250 columns.
    The next line that I write would look like this:

    ConcatBuffer = foreach CpuMetrics generate CONCAT(CONCAT($0, '-'),
    $1) as rowId, $2, $3, $4, $5, $6, $7, $8, $9, $10, ... $250;

    It would be nice if the statement could be written like this:

    ConcatBuffer = foreach CpuMetrics generate CONCAT(CONCAT($0, '-'),
    $1) as rowId, MIRROR($2..$250);

    Is there something like this among Pig's built-in functions?

    regards,
    Eric
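
To make the request concrete, here is a rough Python sketch of what the hypothetical MIRROR($2..$250) shorthand would do to one row: build rowId from the first two fields and pass the remaining range through unchanged. MIRROR is only the name used in the post, not an existing Pig function; concat_and_mirror and its parameters are likewise invented for this sketch, and the sample row is made up.

```python
def concat_and_mirror(row, first=2, last=None):
    """Build rowId from fields 0 and 1, then copy fields first..last as-is.

    `last` is inclusive, like the $2..$250 range in the post; None means
    "through the final field of the row".
    """
    row_id = f"{row[0]}-{row[1]}"
    stop = len(row) if last is None else last + 1
    return (row_id, *row[first:stop])


# Made-up sample row: two key fields followed by three metric values.
row = ("1293600000", "host1", 0.5, 0.7, 1.2)
print(concat_and_mirror(row))
```

The point of the shorthand is that the pass-through range is named once instead of listing every position from $2 to $250.
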
  • Dmitriy Ryaboy at Dec 30, 2010 at 10:16 am
    Ah, I see. There is no such function available right now.
    There is some discussion of such a feature here:
    https://issues.apache.org/jira/browse/PIG-1693
    As you can see, there isn't yet a consensus on how such syntax would work.
    Feel free to weigh in.

    -Dmitriy
  • Eric Yang at Dec 30, 2010 at 7:33 pm
    Thanks for the pointer. :)

    regards,
    Eric

Discussion Overview
group: user @
categories: pig, hadoop
posted: Dec 29, '10 at 7:10a
active: Dec 30, '10 at 7:33p
posts: 5
users: 2
website: pig.apache.org