using the column names defined in the Cassandra Schema, using the UDF from
pygmalion.
Suppose the schema is:
create column family Users
with column_type = Standard
and comparator = UTF8Type
and default_validation_class = UTF8Type;
and the data is:
RowKey: 1
=> (column=firstname, value=albert, timestamp=1304694447722746)
=> (column=city, value=london, timestamp=1304694447722746)
-------------------
RowKey: 2
=> (column=firstname, value=antonio, timestamp=1304694447140376)
=> (column=city, value=roma, timestamp=1304694447140376)
Note that CassandraStorage returns the columns of each row as a bag of
(name, value) tuples, i.e. {T: tuple(name, value)}.
So in Pig, your load statement will look something like this:
rows = LOAD 'cassandra://Keyspace/Users' USING
org.apache.cassandra.hadoop.pig.CassandraStorage() as (key:chararray,
columns: bag{T:(columnname, columnvalue)});
If you ILLUSTRATE this, you get:
---------------------------------------------------------------------------------------------
| rows | key: bytearray | columns: bytearray({T: (columnname: bytearray,columnvalue: bytearray)}) |
---------------------------------------------------------------------------------------------
|      | 1              | {(firstname, albert), (city, london)} |
---------------------------------------------------------------------------------------------
| rows | key: chararray | columns: bag({T: (columnname: bytearray,columnvalue: bytearray)}) |
---------------------------------------------------------------------------------------------
|      | 1              | {(firstname, antonio), (city, roma)} |
---------------------------------------------------------------------------------------------
Now, if you want to access those column values by name, here is the trick.
Register the pygmalion jar first (you need to build it, of course).
register 'pygmalion.jar';
And then, here is the magic part:
rows_namedcols = foreach rows generate key,
flatten(org.pygmalion.udf.FromCassandraBag('firstname, city', columns))
as (firstname: chararray, city: chararray);
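To make it clearer what that statement does to each row, here is a rough model in plain Python. This is only a sketch, not pygmalion's actual code: the function name from_cassandra_bag is made up for illustration, and it assumes the UDF splits the selector on commas, ignores surrounding whitespace, and returns null for a column that the row does not have.

```python
from typing import Iterable, Optional, Tuple


def from_cassandra_bag(
    selector: str, columns: Iterable[Tuple[str, str]]
) -> Tuple[Optional[str], ...]:
    """Hypothetical stand-in for the UDF: bag of (name, value) -> ordered tuple."""
    # Assumption: names are comma-separated, surrounding whitespace is ignored.
    names = [name.strip() for name in selector.split(",")]
    # Index the bag by column name so the result does not depend on column order.
    by_name = dict(columns)
    # One field per requested name, in the requested order.
    # Assumption: a column missing from the row becomes None (null in Pig).
    return tuple(by_name.get(name) for name in names)


# The two sample rows from above, shaped as (key, bag of (name, value)).
rows = [
    ("1", [("firstname", "albert"), ("city", "london")]),
    ("2", [("firstname", "antonio"), ("city", "roma")]),
]

# Rough equivalent of: foreach rows generate key, flatten(FromCassandraBag(...))
rows_namedcols = [
    (key,) + from_cassandra_bag("firstname, city", cols) for key, cols in rows
]

for row in rows_namedcols:
    print(row)
```

The point is that the order of the names in the selector string, not the order of the columns stored in Cassandra, decides the order of the fields you name in the "as (...)" clause.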
Now you can query your columns directly from Pig. Isn't that awesome?
rows_london = filter rows_namedcols by city == 'london';
names_london = foreach rows_london generate firstname;
dump names_london;
You can download the UDF from here:
https://github.com/jeromatron/pygmalion
Thanks to jeromatron for this!