Grokbase Groups Pig user June 2011
Howdy,

I'm coming from Cassandra, and I'm trying to count all the columns in a
column family. I believe that is similar to counting the number of tuples in a
bag, in the lingo of the Pig manual. It was harder than I expected, but I
think this works:
rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
AS (key, columns: bag {T: tuple(name, value)});
counts = FOREACH rows GENERATE COUNT(columns);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
dump sum_of_bag;

My question is: am I right that it works? I started with 3 keys having a
total of 5 columns and got (5). Then I added a new key/column, and another
column on an existing key, and got (7). So it seems to be working.
But is there a better way to write it?

Thanks!

will


  • Dmitriy Ryaboy at Jun 3, 2011 at 8:07 pm
    I am not sure what you mean by "count all columns". The code you have
    counts all *cells*.
    So:
    id1: col1, col2
    id2: col1, col2, col3

    has 3 columns in a conventional sense, but your code will return 5. Is
    that what you want? If so, your code seems correct.

    D
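
The cells-versus-columns distinction in the reply above can be sketched in Python. This is a minimal illustration, not Pig: the `rows` dict is a hypothetical in-memory stand-in for the column family, and `count_cells` and `count_distinct_columns` are made-up helper names, not Pig or Cassandra APIs.

```python
# Hypothetical stand-in for the column family:
# row key -> list of (column name, value) cells.
rows = {
    "id1": [("col1", "a"), ("col2", "b")],
    "id2": [("col1", "c"), ("col2", "d"), ("col3", "e")],
}


def count_cells(rows):
    """Sum the per-row cell counts (the COUNT-per-row then SUM pattern)."""
    return sum(len(cells) for cells in rows.values())


def count_distinct_columns(rows):
    """Count distinct column names across all rows."""
    return len({name for cells in rows.values() for name, _value in cells})


print(count_cells(rows))             # 5
print(count_distinct_columns(rows))  # 3
```

The first function mirrors what the posted script computes (5 for this data); the second is the "conventional" column count (3).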

  • William Oberman at Jun 3, 2011 at 8:10 pm
    That is exactly what I wanted, thanks for confirming!

    --
    Will Oberman
    Civic Science, Inc.
    3030 Penn Avenue., First Floor
    Pittsburgh, PA 15201
    (M) 412-480-7835
    (E) oberman@civicscience.com
  • William Oberman at Jun 7, 2011 at 8:34 pm
    I tried this same script on data closer to production, and I'm getting
    errors. I'm 50% sure it's this:
    https://issues.apache.org/jira/browse/PIG-1283

    One of my rows in Cassandra may have no columns, which may produce a
    null bag, which causes COUNT to blow up (at least, that's my theory). As a
    workaround, can I have COUNT ignore/skip rows with null columns? I'll start
    digging through the docs as well.

    will
  • William Oberman at Jun 7, 2011 at 8:59 pm
    I think FILTER will do the trick? E.g.

    rows = LOAD 'cassandra://MyKeyspace/MyColumnFamily' USING CassandraStorage()
    AS (key, columns: bag {T: tuple(name, value)});
    filter_rows = FILTER rows BY columns is not null;
    counts = FOREACH filter_rows GENERATE COUNT(columns);
    counts_in_bag = GROUP counts ALL;
    sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
    dump sum_of_bag;
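
The effect of the FILTER step can be sketched in Python, with `None` standing in for a Pig null bag. Whether CassandraStorage really yields a null bag for a row with no columns is only the theory stated in this thread, and the sketch does not confirm it. The function name `total_cells` and the sample data are hypothetical.

```python
def total_cells(bags):
    """Total cell count across bags, skipping null (None) bags first.

    Steps: filter out nulls, count each remaining bag, then sum the counts.
    """
    non_null = [bag for bag in bags if bag is not None]  # drop null bags
    counts = [len(bag) for bag in non_null]              # per-row count
    return sum(counts)                                   # single total


# Assumed sample: two populated rows, one null bag, one empty bag.
bags = [
    [("a", 1), ("b", 2)],
    None,
    [("c", 3)],
    [],
]

print(total_cells(bags))  # 3
```

An empty bag contributes 0 and passes through harmlessly; only the null entry needs to be dropped before counting.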

  • William Oberman at Jun 8, 2011 at 8:57 pm
    Just in case this ends up as someone else's answer someday, here is the
    working query on real data:
    rows = LOAD 'cassandra://civicscience/observations' USING
    CassandraStorage();
    filter_rows = FILTER rows BY $1 is not null;
    counts = FOREACH filter_rows GENERATE COUNT($1);
    counts_in_bag = GROUP counts ALL;
    sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1);
    dump sum_of_bag;

    For some reason, typing the bag (the AS schema on LOAD) was causing me
    problems, so this version refers to it by position as $1.
  • Dmitriy Ryaboy at Jun 8, 2011 at 9:32 pm
    Thanks for following through William!
    D


Discussion Overview
group: user @
categories: pig, hadoop
posted: Jun 3, '11 at 7:54p
active: Jun 8, '11 at 9:32p
posts: 7
users: 2
website: pig.apache.org
