not map in a general way to the
number of rows, or the number of rows with a specific column."
It would be nice to have an index like that; it would solve a lot of
issues for people migrating from MySQL. I assume that without the
'count' feature, people are resorting to storing dataset elements in
other engines, which is not great, since you then end up requiring a
non-HBase index to be consistent and authoritative for all of your
datasets that require counts.
-Jack
On Fri, Jun 3, 2011 at 3:24 PM, Ryan Rawson wrote:
This is a commonly requested feature, and it remains unimplemented
because it is actually quite hard. Each HFile knows how many KV
entries there are in it, but this does not map in a general way to the
number of rows, or the number of rows with a specific column. Keeping
track of the row count as new rows are created is also not as easy as
it seems - this is because a Put does not know if a row already exists
or not. Making it aware of that fact would require doing a get before
a put - not cheap.
-ryan
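
The two points above can be illustrated with a small toy model. This is
only a sketch under loud assumptions: it is plain Python, not HBase
code, and the names kv_count, row_count, and CountingTable are
hypothetical. A "store file" here is just a list of (row, family,
qualifier) keys. The sketch shows that the cheap per-file entry count
differs from the distinct row count, and that keeping a live row count
costs one existence check per put.

```python
# Toy model only; nothing here is an HBase API.

def kv_count(store_files):
    """Sum of per-file entry counts: cheap, each file already knows it."""
    return sum(len(f) for f in store_files)


def row_count(store_files, family=None):
    """Distinct rows, optionally limited to one family.

    Unlike kv_count, this has to look at every key, because the same
    row can appear many times within a file and across files.
    """
    rows = set()
    for f in store_files:
        for row, fam, _qualifier in f:
            if family is None or fam == family:
                rows.add(row)
    return len(rows)


class CountingTable:
    """Maintains a running row count by checking existence on every put."""

    def __init__(self):
        self.rows = {}
        self.count = 0
        self.reads = 0  # number of existence checks performed

    def put(self, row, family, qualifier, value):
        self.reads += 1  # stands in for the "get before a put"
        if row not in self.rows:
            self.rows[row] = {}
            self.count += 1
        self.rows[row][(family, qualifier)] = value


if __name__ == "__main__":
    files = [
        [("r1", "a", "x"), ("r1", "a", "y"), ("r2", "b", "x")],
        [("r1", "a", "x"), ("r3", "a", "z")],
    ]
    print(kv_count(files), row_count(files), row_count(files, family="a"))
```

In this model every put triggers one extra lookup, which is the cost
being described; a real store would pay it as a read against disk or
cache rather than a dictionary check.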
On Fri, Jun 3, 2011 at 3:20 PM, Jack Levin wrote:
I have a feature request: There should be a native function called
'count' that produces a count of rows based on a specific family
filter, is internal to HBase, and does not need to read cells off the
disk/cache. Just count up the rows in the most efficient way
possible. I realize that family definitions are part of the cells, so
it would be nice to have an index that can somehow keep the IO/CPU
hit to HBase low when doing a count (for example, enabling an index
like that in the table schema would be how you turn it on for a
specific family).
Best,
-Jack
