HBase user mailing list, March 2011
Data size: 20 GB. The load took about an hour with default HBase settings;
after varying several parameters, we got it down to ~20 minutes. This is
still slow, and we are trying to improve it.

We wrote a Java client that essentially issues `put`s to HBase tables in
batches (a sketch follows the list below). The parameters we tuned include:
1. Disabling compaction
2. Varying the put batch size (tried 1000, 5000, 10000, 20000, and 40000)
3. Turning autoflush on and off
4. Varying the client-side write buffer (2 MB, 128 MB, 256 MB)
5. Raising hbase.regionserver.handler.count to 100
6. Varying the region size from 128 MB to 256/512/1024 MB
7. Increasing the number of regions
8. Creating regions with pre-specified keys (so that clients hit the
regions directly)
9. Varying the number of clients (from 30 to 100)
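
For reference, a minimal sketch of the kind of client described above,
written against the 0.90-era HBase API; the table name, column family, row
format, batch size, and buffer size are illustrative rather than our exact
values.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BatchLoader {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "testtable"); // hypothetical table

            // Items 3 and 4: buffer puts client-side instead of one RPC per put.
            table.setAutoFlush(false);
            table.setWriteBufferSize(2L * 1024 * 1024);   // 2 MB, one size we tried

            // Item 2: accumulate puts and submit them in batches (10000 shown).
            List<Put> batch = new ArrayList<Put>(10000);
            for (long i = 0; i < 20L * 1000 * 1000; i++) { // ~20M rows of ~1 KB
                Put p = new Put(Bytes.toBytes(String.format("row%010d", i)));
                p.add(Bytes.toBytes("f"), Bytes.toBytes("q"), new byte[1024]);
                batch.add(p);
                if (batch.size() >= 10000) {
                    table.put(batch); // buffered until the write buffer fills
                    batch.clear();
                }
            }
            table.put(batch);
            table.flushCommits();     // drain whatever is left in the buffer
            table.close();
        }
    }

Item 8 (pre-created regions) corresponds to creating the table with explicit
split keys, along the lines of the fragment below (it additionally assumes
imports of HBaseAdmin, HTableDescriptor, and HColumnDescriptor):

    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("testtable");
    desc.addFamily(new HColumnDescriptor("f"));
    byte[][] splits = { Bytes.toBytes("row2"), Bytes.toBytes("row4"),  // illustrative
                        Bytes.toBytes("row6"), Bytes.toBytes("row8") }; // split points
    admin.createTable(desc, splits);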

The above was tested on a 38-node cluster with two regions per node.
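
Items 5 and 6 in the list map to server-side settings in hbase-site.xml. A
sketch, assuming "regionserver size" refers to the maximum region size
(hbase.hregion.max.filesize) in MB:

    <!-- hbase-site.xml -->
    <property>
      <name>hbase.regionserver.handler.count</name>
      <value>100</value>            <!-- item 5: RPC handler threads -->
    </property>
    <property>
      <name>hbase.hregion.max.filesize</name>
      <value>1073741824</value>     <!-- item 6: 1024 MB before a region splits -->
    </property>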

We did not try disabling the WAL for fear of losing data.
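
For reference, the per-put switch being avoided here, as it looks in the
0.90-era API; with the WAL off, any edits not yet flushed from the MemStore
are lost if a region server dies:

    Put p = new Put(Bytes.toBytes("row1"));
    p.setWriteToWAL(false); // skip the write-ahead log: faster writes, but
                            // unflushed edits are lost on region server crash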

Are there any other parameters that we missed during the process?


Viv


  • Ted Dunning at Mar 25, 2011 at 12:28 am
    Are you putting this data from a single host? Is your sender
    multi-threaded?

    I note that 20 GB in 20 minutes is less than 20 MB/s, so you aren't
    particularly stressing the network. You would, however, likely be
    stressing a single-threaded client pretty severely.
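
    HTable instances are not safe for concurrent use, so a multi-threaded
    sender would give each worker its own HTable; a minimal sketch, with the
    thread count and table name as placeholder assumptions:

        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.client.HTable;

        public class ParallelSender {
            public static void main(String[] args) throws Exception {
                final Configuration conf = HBaseConfiguration.create();
                ExecutorService pool = Executors.newFixedThreadPool(10);
                for (int i = 0; i < 10; i++) {
                    pool.submit(new Runnable() {
                        public void run() {
                            try {
                                // one HTable per thread; HTable is not thread-safe
                                HTable table = new HTable(conf, "testtable");
                                // ... build batches and call table.put(...) here ...
                                table.close();
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                        }
                    });
                }
                pool.shutdown();
            }
        }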

    What is your record size? It may be that you are bound up by the number of
    records being inserted rather than the total data size.
  • Vivek Krishna at Mar 25, 2011 at 12:33 am
    I have a total of 10 client nodes, with 3-10 threads running on each node.
    Record size is ~1 KB.

    Viv


  • Ted Dunning at Mar 25, 2011 at 12:43 am
    Something is just wrong. You should be able to do 17,000 records per
    second (20 GB at ~1 KB per record is about 20 million records, and 20
    million records in 20 minutes works out to ~17,000 records/s) from a few
    nodes with multiple threads against a fairly small cluster. You should be
    able to come close to that from a single node into a dozen region servers.
  • Vivek Krishna at Apr 11, 2011 at 6:21 pm
    Is there a setting that limits or controls the bandwidth on the HBase
    nodes? I know there is a value to set in zoo.cfg to increase the number
    of incoming connections.
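
    The zoo.cfg setting referred to is presumably maxClientCnxns, which caps
    the number of concurrent connections ZooKeeper accepts from a single
    client IP; the value below is illustrative (0 removes the cap):

        # zoo.cfg
        maxClientCnxns=300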

    Though I am using a 15-gigabit Ethernet card, I can see only 50-100 MB/s
    of transfer per node (from the clients) via Ganglia.
    Viv


