You can access column values stored in Cassandra from Pig by the column
names defined in the Cassandra schema, using the FromCassandraBag UDF from
pygmalion.


Imagine the following schema:


create column family Users
with column_type = Standard
and comparator = UTF8Type
and default_validation_class = UTF8Type;


and the following data:

RowKey: 1
=> (column=firstname, value=albert, timestamp=1304694447722746)
=> (column=city, value=london, timestamp=1304694447722746)
-------------------
RowKey: 2
=> (column=firstname, value=antonio, timestamp=1304694447140376)
=> (column=city, value=roma, timestamp=1304694447140376)


Note that CassandraStorage returns the columns of each row as a bag of
(name, value) tuples: {T: tuple(name, value)}

So in Pig, your load statement will look something like this:

rows = LOAD 'cassandra://Keyspace/Users' USING
org.apache.cassandra.hadoop.pig.CassandraStorage() as (key:chararray,
columns: bag{T:(columnname, columnvalue)});

If you run ILLUSTRATE on this relation, you get:

--------------------------------------------------------------------------------------------------
rows | key: bytearray | columns: bytearray({T: (columnname: bytearray,columnvalue: bytearray)}) |
--------------------------------------------------------------------------------------------------
1 | {(firstname, albert), (city, london)} |
--------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------
rows | key: chararray | columns: bag({T: (columnname: bytearray,columnvalue: bytearray)}) |
--------------------------------------------------------------------------------------------------
1 | {(firstname, antonio), (city, roma)} |
--------------------------------------------------------------------------------------------------

Now, if you want to access those column values by name, here is the trick.
Register the pygmalion jar first (you need to build it yourself from the
source).

register 'pygmalion.jar';

And then, here is the magic part:

rows_namedcols = foreach rows generate key,
flatten(org.pygmalion.udf.FromCassandraBag('firstname, city', columns))
as (firstname: chararray, city: chararray);
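
To make the effect of that statement concrete, here is a rough Python model of
what FromCassandraBag appears to do: look up the requested names in the bag
and return the values as one tuple, in the order requested. This is only a
sketch, not pygmalion's code. The function name from_cassandra_bag, the
whitespace trimming, and the None result for a missing column are all
assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple

Column = Tuple[str, str]  # one (name, value) pair from the Cassandra bag


def from_cassandra_bag(names: str, columns: Iterable[Column]) -> Tuple[Optional[str], ...]:
    """Hypothetical model of the UDF: pick values by name, in the order requested."""
    lookup = dict(columns)
    # Assumption: names are comma-separated and surrounding whitespace is ignored.
    wanted = [name.strip() for name in names.split(",")]
    # Assumption: a column that is absent from the row yields None (null in Pig).
    return tuple(lookup.get(name) for name in wanted)


# The two rows from the example data above.
row_1 = [("firstname", "albert"), ("city", "london")]
row_2 = [("firstname", "antonio"), ("city", "roma")]

print(from_cassandra_bag("firstname, city", row_1))
print(from_cassandra_bag("firstname, city", row_2))
```

The flatten(...) in the Pig statement then spreads that tuple into separate
fields, which is why firstname and city end up as ordinary top-level columns.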

Now you can query your columns directly from Pig. Isn't that awesome?

rows_london = filter rows_namedcols by city == 'london';
names_london = foreach rows_london generate firstname;
dump names_london;
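
For readers who want to trace the filter and projection by hand, the same
three steps can be modeled in Python. This is a sketch only: the row with key
"3" is hypothetical and is there to show that, in Pig, a FILTER condition on a
null field evaluates to null, so the row is dropped rather than kept.

```python
# Rows as they would look after the foreach/flatten step: (key, firstname, city).
rows_namedcols = [
    ("1", "albert", "london"),
    ("2", "antonio", "roma"),
    ("3", "carla", None),  # hypothetical row with no city column
]


def filter_by_city(rows, city):
    # Keep a row only when the comparison is definitely true;
    # a missing (None) city never matches, mirroring Pig's null handling.
    return [row for row in rows if row[2] is not None and row[2] == city]


rows_london = filter_by_city(rows_namedcols, "london")
# Projection: keep only the firstname field, as a one-field tuple.
names_london = [(firstname,) for _key, firstname, _city in rows_london]
print(names_london)
```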


You can download the UDF from here:
https://github.com/jeromatron/pygmalion

Thanks to jeromatron for this!

Posted May 6, '11 at 6:21p
Gianni Moschini: 1 post

People

Translate

site design / logo © 2021 Grokbase