Grokbase Groups HBase dev July 2012
Hi,

What is the best way to get the total row count?

I tried the following:

a> count 'tablename' at the shell prompt: useful only for a small
number of records.

b> Running the RowCounter job: it took almost 8 hours to count the rows
of 2 TB of data on a 3-node cluster (16-core systems, 48 GB RAM).

c> Using AggregationClient: disk I/O is very high (system wait is 65-70%,
load factor is almost 110), which makes the server unresponsive and
brings the clients down (due to RPCTimeOut exceptions).

Thanks & Regards,

Gopinathan A





  • Jean-Daniel Cryans at Jul 9, 2012 at 5:31 pm
    If you need the exact count the best way is still RowCounter, maybe
    set a bigger scanner caching?

    Another option that works if you only need an estimate is using the
    reported number of KVs per region and then summing them up. Look at
    any of your region servers' web ui and on the right you'll see the
    count per region.

    J-D
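The estimate described above can be sketched roughly as follows. This is only an illustration: the region names and numbers are hypothetical, and `avg_kvs_per_row` is an assumption you would have to supply from knowledge of your own schema, since the web UI reports KeyValues rather than rows. The reported counts may also lag recent writes or include extra versions, so treat the result as a rough figure.

```python
def estimate_row_count(kvs_per_region, avg_kvs_per_row):
    """Estimate table rows from per-region KeyValue (KV) counts.

    kvs_per_region:  mapping of region name -> KV count read off the web UI
    avg_kvs_per_row: assumed average number of KVs stored per row
    """
    if avg_kvs_per_row <= 0:
        raise ValueError("avg_kvs_per_row must be positive")
    # Sum the per-region KV counts, then scale down to rows.
    total_kvs = sum(kvs_per_region.values())
    return round(total_kvs / avg_kvs_per_row)


# Hypothetical numbers, for illustration only.
sample = {"region-a": 1_200_000, "region-b": 900_000, "region-c": 900_000}
print(estimate_row_count(sample, avg_kvs_per_row=3))
```

Unlike RowCounter, this never scans the table, which is why it is cheap and also why it is only an estimate.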

  • Shashwat shriparv at Jul 9, 2012 at 5:56 pm
    Count the number of rows in a table. This operation may take a LONG
    time (Run '$HADOOP_HOME/bin/hadoop jar hbase.jar rowcount' to run a
    counting mapreduce job). Current count is shown every 1000 rows by
    default. Count interval may be optionally specified. Examples:

    hbase> count 't1'
    hbase> count 't1', 100000

  • Shashwat shriparv at Jul 9, 2012 at 5:56 pm
    One more option I would suggest: while putting the data into Hadoop,
    just maintain a count somewhere.
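A minimal sketch of that idea, using a toy in-memory table rather than HBase itself. The `CountingTable` class and its methods are hypothetical names made up for this illustration. The point is that the counter should move only when a row key is seen for the first time (or removed), not on every write.

```python
import threading


class CountingTable:
    """Toy in-memory stand-in for a table that tracks its own row count."""

    def __init__(self):
        self._rows = {}
        self._row_count = 0
        self._lock = threading.Lock()

    def put(self, row_key, columns):
        with self._lock:
            if row_key not in self._rows:
                # First write to this key: it is a new row.
                self._rows[row_key] = {}
                self._row_count += 1
            self._rows[row_key].update(columns)

    def delete(self, row_key):
        with self._lock:
            # Decrement only if the row actually existed.
            if self._rows.pop(row_key, None) is not None:
                self._row_count -= 1

    def row_count(self):
        with self._lock:
            return self._row_count


table = CountingTable()
table.put("row1", {"cf:a": "1"})
table.put("row1", {"cf:b": "2"})  # same key again, count unchanged
table.put("row2", {"cf:a": "3"})
print(table.row_count())
```

In a real deployment the "is this key new?" check is the expensive part, because it generally means a read before each write. Whether that cost is acceptable depends on the workload.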
