Row count without iterating over ResultScanner?
HBase user mailing list, May 2011
Hi,
I would like to know if there's a way to quickly count the number of rows
from a scan result?
Right now I'm iterating over the ResultScanner like this:
int count = 0;
// next() returns null once the scan is exhausted, so every row is fetched and counted.
for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
    ++count;
}
But with the number of rows reaching millions, this takes a while.
I tried to find something in the documentation, but I didn't find anything.
I would like to use the HBase API, not an MR job (because this cluster only
has HDFS and HBase installed).

Thanks for all the help.

--
Wojciech Langiewicz

  • Doug Meil at May 1, 2011 at 5:54 pm
    What caching value are you using on the scan? If you aren't setting this, it's probably using the default, which is 1 - and that is slow. http://hbase.apache.org/book.html#d379e3504
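    For example, a minimal sketch of counting with a larger caching value (the table name "mytable" and the value 1000 here are placeholders, not something from this thread):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable");   // "mytable" is a placeholder
    Scan scan = new Scan();
    scan.setCaching(1000);                        // rows fetched per RPC instead of the default of 1
    ResultScanner scanner = table.getScanner(scan);
    int count = 0;
    try {
        for (Result rr = scanner.next(); rr != null; rr = scanner.next()) {
            ++count;
        }
    } finally {
        scanner.close();
        table.close();
    }
    System.out.println("row count: " + count);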

    Re: "I would like to use HBase API, not MR job (because this cluster only has HDFS and HBase installed)."

    For Very Large tables you want to start using an MR job for this.


  • Himanshu Vashishtha at May 1, 2011 at 6:03 pm
    If you are interested in the row count only (and do not want to fetch the
    table rows to your client side), you can also try out
    https://issues.apache.org/jira/browse/HBASE-1512.

    PS: Which version are you on? The above patch is in main trunk as of now, so
    to use it you would have to check out the code and build it.

    Thanks,
    Himanshu

  • Wojciech Langiewicz at May 1, 2011 at 6:29 pm
    Hi,
    On 01.05.2011 20:03, Himanshu Vashishtha wrote:
    > If you are interested in the row count only (and do not want to fetch the
    > table rows to your client side), you can also try out
    > https://issues.apache.org/jira/browse/HBASE-1512.
    Yes, I only want to count rows, possibly applying filters or selecting columns.
    Are filters also supported with those aggregate functions?
    > PS: Which version are you on? The above patch is in main trunk as of now, so
    > to use it you would have to check out the code and build it.
    I'm using the version from CDH3, so it is 0.90.1-cdh3u0, but I'm not bound
    to this version.

    Coprocessors with aggregate functions seem to be the thing I need. Thanks!
    --
    Wojciech Langiewicz
  • Himanshu Vashishtha at May 1, 2011 at 6:43 pm
    Yes, you can define your scan object on the client side and pass it to
    AggregateClient.rowCount. You can refer to the AggregateClient javadoc and
    the associated TestAggregateProtocol test methods to get an idea.
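    A rough sketch of what that call can look like, based on how the HBASE-1512
    code is shaped in later builds (there the client class is named
    AggregationClient); "mytable", "cf" and "q" are placeholder names, and the
    AggregateImplementation coprocessor has to be enabled on the region servers
    first (e.g. via hbase.coprocessor.region.classes):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
    import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    AggregationClient aggregationClient = new AggregationClient(conf);
    Scan scan = new Scan();
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // restrict to one small column
    // Filters set on the scan are applied on the server side as well.
    // rowCount is declared to throw Throwable; wrap it in try/catch in real code.
    long rowCount = aggregationClient.rowCount(Bytes.toBytes("mytable"),
                                               new LongColumnInterpreter(),
                                               scan);
    System.out.println("row count: " + rowCount);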

    Thanks,
    Himanshu

  • Wojciech Langiewicz at May 1, 2011 at 6:51 pm
    Thanks, that's great. But first I have to update HBase and read some
    documentation, so I'll let you know in a while how that works for me.
  • Wojciech Langiewicz at May 1, 2011 at 6:12 pm
    Yes, I was using the default caching; setting this value to a few thousand
    made a significant difference in performance. I'll experiment more with
    this option.

    Right now I want to stay away from MR, mainly because of cluster warm-up
    time, and I want to get results in near real time (a few seconds at most).

    Thanks for the tip on caching!
  • Doug Meil at May 1, 2011 at 6:44 pm
    Another thing: be careful about which column families/attributes you have in the Scan. If you add a column family (scan.addFamily), it will pull *all* the attributes of that column family. If you only care about a row count, pick only one very small attribute from the row.
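    For instance, a sketch of restricting the scan to one small qualifier rather than a whole family ("cf" and "q" are placeholder names; the FirstKeyOnlyFilter line is an extra option not mentioned above - it returns only the first KeyValue of each row, which is enough for counting):

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.FirstKeyOnlyFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    Scan scan = new Scan();
    scan.setCaching(1000);                                    // see the earlier caching tip
    scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"));  // one small column, not the whole family
    scan.setFilter(new FirstKeyOnlyFilter());                 // only the first KeyValue per row comes back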


  • Wojciech Langiewicz at May 1, 2011 at 6:50 pm
    Thanks. Also, following the documentation from the link you posted
    (section 13.6.5), I have applied those filters.
  • Michel Segel at May 2, 2011 at 2:44 am
    Hi,
    There's a row counter app in the HBase release that's an M/R job.

    You could also maintain a dynamic counter.
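    A minimal sketch of the dynamic-counter idea, assuming a separate
    (hypothetical) "counters" table whose cell is bumped whenever a row is
    written; all the names here are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    Configuration conf = HBaseConfiguration.create();
    HTable counters = new HTable(conf, "counters");          // hypothetical counters table

    // Bump the running row count atomically every time a row is written.
    counters.incrementColumnValue(Bytes.toBytes("mytable"),  // row key = name of the counted table
                                  Bytes.toBytes("c"),        // family
                                  Bytes.toBytes("rowcount"), // qualifier
                                  1L);                       // delta

    // Reading the total back is a single Get instead of a full table scan.
    Result r = counters.get(new Get(Bytes.toBytes("mytable")));
    long total = Bytes.toLong(r.getValue(Bytes.toBytes("c"), Bytes.toBytes("rowcount")));
    counters.close();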


    Sent from a remote device. Please excuse any typos...

    Mike Segel

Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: May 1, 2011 at 1:44 PM
active: May 2, 2011 at 2:44 AM
posts: 10
users: 4
website: hbase.apache.org
