The link below contains a Jython script that populates an HBase table with data in two column families. A corresponding Pig query retrieves data for one column and saves it to a CSV:
https://gist.github.com/766929
The Jython script has the following usage:
jython hbase_test.py [table] [column count] [row count] [batch count]
This will populate a table named [table] with two column families. The first contains static data. The second contains the given number of columns, populated with data.
The Pig query returns an inaccurate number of results for certain table sizes and configurations, most notably with tables exceeding 1.8 million rows and with more than 2 columns in the queried column family, e.g.:
jython hbase_test.py test 3 1800000 100000
For instance, if I execute the above command and the corresponding Pig query, the query returns 905914 results. Note that if the table is re-populated and queried a second time, the count is different. If I run the query again without re-populating the table, I get the same number of results. The HBase shell returns an accurate row count.
Some notes on reproducing this issue (or not):
* If the Jython script doesn't populate the meta column family, the issue goes away with the same query.
* If the Jython script populates 2 columns instead of 3, the issue goes away with the same query.
* The size of the column key or its value may influence whether the issue occurs. For instance, if I change the script to store 'value_%d' instead of 'value_%d_%d', retaining the random int, the issue goes away with the same query.
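To put a rough number on the size difference in the last note, here is a minimal sketch in plain Python. It assumes the first integer in 'value_%d_%d' is a small column index and the second is the random int; the post only states that the random int is retained, so both sample values below are hypothetical.

```python
def value_bytes(template, *args):
    """Return the encoded length in bytes of one formatted cell value."""
    return len((template % args).encode("utf-8"))


# Hypothetical sample values: the gist may use different ranges.
col_index = 2
rand_int = 123456

long_form = value_bytes("value_%d_%d", col_index, rand_int)
short_form = value_bytes("value_%d", rand_int)

# Table shape taken from the example command: 1800000 rows, 3 columns.
rows = 1800000
cols = 3
total_difference = (long_form - short_form) * rows * cols

print("bytes per value: long=%d short=%d" % (long_form, short_form))
print("difference over the whole column family: %d bytes" % total_difference)
```

The per-value difference is small, so if value size does matter, it may be because the total data volume crosses some threshold rather than because of any single cell.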
I am using Pig 0.8.0 and HBase 0.20.6 on a MacBook running Snow Leopard using the stock Java that came with the OS. Attached is a log of the Pig console output. The error logs contain nothing of import.
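For anyone trying to reproduce this, the result count can be checked against the expected row count with a short script. This is a sketch only: it assumes the Pig output was stored to a local directory of part-* files with one record per line, and the function names are made up for illustration.

```python
import glob
import os


def count_records(output_dir):
    """Count non-empty lines across all part-* files in a Pig output directory."""
    total = 0
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path) as handle:
            for line in handle:
                if line.strip():
                    total += 1
    return total


def missing_rows(output_dir, expected_rows):
    """Return how many rows are absent relative to the expected row count."""
    return expected_rows - count_records(output_dir)
```

For the example above, `missing_rows("results", 1800000)` would report the shortfall, where `results` is a hypothetical output directory name and 1800000 is the row count passed to the Jython script.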
Am I doing anything incorrectly? Is there a way I can work around this issue without compromising the column family being queried?
This appears to be a fairly simple case of Pig/HBase usage. Can anyone else reproduce the issue?
thanks,
Ian.