full table scan
Grokbase Groups: HBase user, June 2011
hello everybody

i'm trying to scan my hbase table for reporting purposes
the cluster has 4 servers:
- server1: namenode, secondary namenode, jobtracker, hbase master, zookeeper1
- server2: datanode, tasktracker, hbase regionserver, zookeeper2
- server3: datanode, tasktracker, hbase regionserver, zookeeper3
- server4: datanode, tasktracker, hbase regionserver
everything seems to work properly
versions:
- hadoop-0.20.2-CDH3B4
- hbase-0.90.1-CDH3B4
- zookeeper-3.3.2-CDH3B4


at the moment our hbase table has 300,000 entries

if i do a table scan over the hbase api (at the moment without a filter)
ResultScanner scanner = table.getScanner(...);

it takes about 60 seconds to process, which is actually okay, because all records are processed by only one thread sequentially
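
for reference, a minimal sketch of what that client-side scan looks like end to end (a sketch only; the table name "mytable" and the loop body are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class FullScan {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mytable");           // placeholder table name
        ResultScanner scanner = table.getScanner(new Scan()); // full scan, no filter
        try {
            for (Result result : scanner) {
                // every row passes through this single client thread
            }
        } finally {
            scanner.close();
            table.close();
        }
    }
}

with the 0.90 defaults, each row fetched costs its own RPC (scanner caching defaults to 1), which turns out to matter later in this thread.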
BUT it takes approximately the same time if i do the scan as a MapReduce job using TableInputFormat
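
and a minimal sketch of the MapReduce variant, assuming the TableMapReduceUtil helper from the 0.90 mapreduce package (table name, job name and mapper body are placeholders):

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapreduce.Job;

public class ReportScan {
    static class ReportMapper extends TableMapper<ImmutableBytesWritable, LongWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            // per-row reporting logic goes here
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "report-scan");
        job.setJarByClass(ReportScan.class);
        // TableInputFormat (used underneath) creates one map task per region,
        // so a table with a single region gets no parallelism at all
        TableMapReduceUtil.initTableMapperJob("mytable", new Scan(), ReportMapper.class,
                ImmutableBytesWritable.class, LongWritable.class, job);
        job.setNumReduceTasks(0); // map-only
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}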

i'm definitely doing something wrong, because the processing time goes up in direct proportion to the number of rows.
in my understanding, the big advantage of hadoop/hbase is that huge numbers of entries can be processed in parallel and very fast

300k entries are not much; we expect roughly this number to be added to our cluster every hour, but the processing time keeps growing, which is actually not acceptable

anyone got an idea what i'm doing wrong?

best regards
andre


  • Joey Echeverria at Jun 6, 2011 at 1:10 pm
    How many regions does your table have?


    --
    Joseph Echeverria
    Cloudera, Inc.
    443.305.9434
  • Andre Reiter at Jun 6, 2011 at 9:28 pm
    good question... i have no idea...

    i did not explicitly define the number of regions for the table. how can i find out how many regions my table has?
    how many regions should the table have? and how do i change the number of regions?

    best regards
    andre


  • Doug Meil at Jun 6, 2011 at 9:30 pm
    Check the web console.
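
    (alternatively, from the hbase shell — each row of the .META. table describes one region, so this also shows how many regions each table has:)

    hbase(main):001:0> scan '.META.', {COLUMNS => 'info:regioninfo'}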

  • Andre Reiter at Jun 6, 2011 at 10:08 pm
    Check the web console.
    ah, ok thanks!
    on port 60010 on the hbase master i actually found a web interface
    there was only one region; i played a bit with it and executed the "Split" function twice. now i have three regions, one on each hbase region server
    but still, the processing time did not change... i measured the same times as with only one region...

    best regards
    andre
  • Ted Yu at Jun 6, 2011 at 10:20 pm
    I think row counter would help you figure out the number of rows in each
    region.
    Refer to the following email thread, especially Stack's answer on Apr 1:
    row_counter map reduce job & 0.90.1
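
    (a rough invocation — the table name is a placeholder; since each map task scans exactly one region, the per-task "Map input records" counter gives you the rows per region:)

    $ hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'mytable'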
  • Christopher Tarnas at Jun 6, 2011 at 3:00 pm
    How many regions does your table have? If all of the data is still in one
    region then you will be rate-limited by how fast that single region can be
    read. 3 nodes is also pretty small; the more nodes you have the better (at
    least 5 for dev and test, and 10+ for production, in my experience).

    Also, with only 4 servers you probably only need one zookeeper node; you
    will not be putting it under any serious load and you already have a SPOF on
    server1 (namenode, hbase master, etc).

    -chris

  • Himanshu Vashishtha at Jun 6, 2011 at 7:42 pm
    Also:
    How big is each row? Are you using scanner caching? Are you just fetching
    all the rows to the client, and then what?

    300k rows is not big (it seems you have only one region, which could
    explain the similar timing). Add more data and MapReduce will pick up!
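
    (scanner caching can be raised per-Scan in code, or cluster-wide; a sketch of the hbase-site.xml entry, assuming the standard hbase.client.scanner.caching property:)

    <property>
      <name>hbase.client.scanner.caching</name>
      <value>1000</value>
    </property>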

    Thanks,
    Himanshu
  • Andre Reiter at Jun 7, 2011 at 8:08 am
    now i found out that there are three regions, one on each region server (server2, server3, server4)
    the processing time is still >=60sec, which is not very impressive...

    what can i do to speed up the table scan?

    best regards
    andre


  • Stack at Jun 7, 2011 at 5:29 pm
    See http://hbase.apache.org/book/performance.html
    St.Ack
  • Andre Reiter at Jun 8, 2011 at 4:44 am
    cool, just one change

    scan.setCaching(1000);

    reduced the processing time of my MR job from 60sec to 10sec!
    nice :-)

    PS: now looking for other optimizations...
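
    (for anyone following along, roughly where that line goes — a sketch assuming the TableMapReduceUtil-style job setup from earlier in the thread:)

    Scan scan = new Scan();
    scan.setCaching(1000);       // ship 1000 rows per RPC instead of the 0.90 default of 1
    scan.setCacheBlocks(false);  // often recommended for full-scan MR jobs, to spare the block cache
    // job and ReportMapper as in the earlier sketch
    TableMapReduceUtil.initTableMapperJob("mytable", scan, ReportMapper.class,
        ImmutableBytesWritable.class, LongWritable.class, job);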



  • Jean-Daniel Cryans at Jun 10, 2011 at 6:47 pm
    If you expect a MapReduce job to be faster than a plain Scan on small
    data, your expectation is wrong.

    There's a minimum cost to every MR job, on the order of a few seconds,
    and you can't get around it.

    What other people have been trying to tell you is that you don't have
    enough data to benefit from the parallel execution advantages of
    Hadoop and HBase.

    J-D
  • Andre Reiter at Jun 11, 2011 at 8:36 am

    Jean-Daniel Cryans wrote:
    If you expect a MapReduce job to be faster than a plain Scan on small
    data, your expectation is wrong.
    i never expected a MR job to be faster in every context
    There's a minimum cost to every MR job, on the order of a few seconds,
    and you can't get around it.
    for sure there is overhead for an MR job, and a few seconds are OK, but not a whole minute...

    so what time can be expected for processing a full scan of e.g. 1,000,000,000 rows in an hbase cluster with e.g. 3 region servers?
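
    (back-of-envelope, using the numbers from this thread: 300k rows in ~60s is ~5k rows/second, so 1,000,000,000 rows would take on the order of 200,000 seconds, roughly 55 hours, at that rate; at the ~30k rows/second reached after setCaching(1000), still roughly 9 hours on this hardware.)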

    i'm just wondering if it's worth running the full scan only once a day and persisting the results
    i hoped to be able to process it on demand, but if it takes too much time, that's not acceptable

    andre
  • Stack at Jun 11, 2011 at 4:42 pm

    On Sat, Jun 11, 2011 at 1:36 AM, Andre Reiter wrote:
    so what time can be expected for processing a full scan of e.g.
    1,000,000,000 rows in an hbase cluster with e.g. 3 region servers?
    I don't think three servers and (only) 1M rows is enough data and
    resources to contrast and compare. Multiply the data by 100 and the
    servers by three or four (IMO).

    St.Ack
  • Ted Dunning at Jun 12, 2011 at 9:32 am
    He said 10^9. Easy to misread.
  • Stack at Jun 12, 2011 at 7:08 pm
    Thanks, Ted. I misread.


  • Andre Reiter at Jun 21, 2011 at 5:13 am
    sorry guys,
    still the same problem... my MR jobs are not running very fast...

    the job org.apache.hadoop.hbase.mapreduce.RowCounter took 13 minutes to complete, although we do not have many rows, just 3,223,543
    at the moment we have 3 region servers, and the table is split over 13 regions across those 3 servers

    i just cannot believe it's that slow...

    what is going wrong?
  • Stack at Jun 21, 2011 at 5:28 am
    Sounds like you are doing about 5k rows/second per server.
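
    (for reference: 3,223,543 rows in ~13 minutes is about 4.1k rows/second aggregate across the cluster.)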

    What size rows? How many column families? What kinda of hardware?

    St.Ack
  • Andre Reiter at Jun 21, 2011 at 7:03 am
    Hi Stack,

    thanks a lot for the reply
    each row is about 2k on average; there are only 2 column families

    hardware:

    CPU: 2x AMD Opteron(tm) Processor 250 (2.4GHz)
    disk: 500 GB, software RAID 1 (2x WDC WD5000AAKB-00H8A0, ATA disk drives)
    memory: 2 GB
    network: 1 Gbps Ethernet
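
    (back-of-envelope: 3,223,543 rows at ~2k each is roughly 6.5 GB of table data, against 2 GB of RAM per node shared by the datanode, tasktracker and regionserver.)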


  • Stack at Jun 21, 2011 at 3:02 pm
    Andre:

    As per Ted in the other thread, because you have only 2GB of RAM, are
    you sure that you are not swapping? Swapping will slow everything down.
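
    One quick way to check on each node, e.g.:

    $ free -m     # how much swap is currently in use
    $ vmstat 5    # watch the si/so columns; sustained non-zero values mean active swapping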

    St.Ack

Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: Jun 6, '11 at 8:50a
active: Jun 21, '11 at 3:02p
posts: 20
users: 10
website: hbase.apache.org
