HBase secondary index performance
Hi,
I have an IndexedTable with indexes on around 20 columns. The write
performance on the original table is around 60 rows per second. This is just a one-node
setup. Even with multiple parallel clients, I am getting just 60
writes/second. That means a total of 60 * 20 = 1200 writes/second due to the
20 index tables? This is not good enough for our application. Does this number of 1200
look right? I was expecting around 15k.
I am using HBase 0.20.6 on Hadoop 0.20.2; the hardware config is 8 GB RAM, 2 cores,
and a 7.2k RPM disk. Will adding nodes increase the writes linearly?

Thanks,
Murali Krishna


  • Andrey Stepachev at Sep 2, 2010 at 6:45 pm
    First, check that your connection is not in autoflush mode.
    Second, you can think about custom indexing instead
    of using IndexedTable. In my experience custom indexing
    (especially if the data isn't modified) is much more performant.
    The problem with IndexedTable is that on every insert
    HBase performs one (random) get operation (to check whether
    previously indexed data exists, and delete it if it does). Random gets
    run at around 100-400 requests per node, so your 60 looks about right
    (for IndexedTable).

    How to build custom indexes is described here:
    http://brunodumon.wordpress.com/2010/02/17/building-indexes-using-hbase-mapping-strings-numbers-and-dates-onto-bytes/
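
    For illustration only (this is not the actual IndexedTable source; table, family and
    column names are made up), the per-put work Andrey describes looks roughly like the
    sketch below against the 0.20-era client API - a random get, a possible delete of the
    stale index row, then the puts - which is why indexed writes end up bounded by
    random-read speed:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexMaintenanceSketch {
        static final byte[] FAM = Bytes.toBytes("data");  // assumed column family
        static final byte[] COL = Bytes.toBytes("city");  // one of the ~20 indexed columns

        // Roughly what maintaining ONE secondary index costs on every put.
        static void indexedPut(HTable main, HTable index, byte[] row, byte[] newValue)
                throws IOException {
            // 1. random GET: which value is currently indexed for this row?
            Result old = main.get(new Get(row));
            byte[] oldValue = old.getValue(FAM, COL);

            // 2. if the value changed, DELETE the stale index row (key = oldValue + row)
            if (oldValue != null && !Bytes.equals(oldValue, newValue)) {
                index.delete(new Delete(Bytes.add(oldValue, row)));
            }

            // 3. PUT the new index row and the main row
            Put idx = new Put(Bytes.add(newValue, row));
            idx.add(FAM, COL, newValue);
            index.put(idx);

            Put p = new Put(row);
            p.add(FAM, COL, newValue);
            main.put(p);
        }
    }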

  • Murali Krishna. P at Sep 3, 2010 at 12:31 pm
    Thanks Andrey,

    * Setting autoflush to false and increasing the write buffer size to 12 MB
    improved the writes to 100/s.
    * Custom indexing is good, but our data keeps changing every day, so
    IndexedTable is probably the best option for us.
    * I just added one more region server and it did not help. It actually went back
    to 60/s for some strange reason (with one client). The request counts in the HBase UI
    are not uniform across the 2 region servers: one server is doing around 2000 and the
    other 500. Will writes improve once the region splits and we have lots of data?
    (Right now it is just writing to one region for the main table.)
    * Is there some way to bulk load the IndexedTable? Earlier I used the
    bulk loader tool (a MapReduce job which creates the regions offline), but I am not sure
    whether it works with an indexed table.


    Thanks,
    Murali Krishna




  • Samuru Jackson at Sep 3, 2010 at 12:54 pm
    Hi,

    I wrote my own indexer and it actually has pretty good performance.
    However, there are still known places where I could gain even more
    performance (I just don't have the time right now).

    What is important is to batch the puts when you are indexing something. I
    posted this before, but maybe you missed it:

    I create a Put list out of those records:

    List<Put> pList = new ArrayList<Put>();

    where each Put has writeToWAL set to false:

    put.setWriteToWAL(false);  // skips the WAL: faster, but rows still in the memstore
                               // are lost if a region server crashes
    pList.add(put);

    Then I set autoflush to false and create a larger write buffer:

    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024 * 1024 * 12);
    hTable.put(pList);
    hTable.flushCommits();     // push anything still sitting in the client-side buffer
    hTable.setAutoFlush(true);

    The following settings boosted my load performance 5x - without any bigger
    performance tuning and with no special HW configuration I
    achieve 8000-9000 records per second:

    put.setWriteToWAL(false);
    hTable.setAutoFlush(false);
    hTable.setWriteBufferSize(1024 * 1024 * 12);


    /SJ
    http://uncinuscloud.blogspot.com/






  • Murali Krishna. P at Sep 4, 2010 at 1:56 pm
    Thanks Samuru,
    I was reading about custom indexing in HBase and just wanted to know how we would
    handle updates in the case of custom indexing. If the original data
    doesn't change, it is probably a good solution. But say one of the column values
    gets changed in the original table: we need to query the index table for the
    original column value, delete that entry, and then add an entry for the new value. I think
    this will run into consistency issues since we are doing it in a
    non-transactional manner.

    Are we always doing full indexing and not worrying about increments? Maybe I
    am missing something here, since I am new to this.

    My requirement is that daily updates are around 10 million records, where
    most of them are just updates, and we want it to be real time (or NRT). Any
    suggestions are appreciated.

    Thanks,
    Murali Krishna




  • Samuru Jackson at Sep 4, 2010 at 7:04 pm
    Hi,

    I'm not sure I understand your problem completely, but regarding your
    update issue:

    HBase keeps versions of your columns. If you have an index on something that
    needs to be updated, you just overwrite the value in the index. There is no
    need to remove things.

    I also organize my indexes in separate tables. There is one table for each
    indexed column of a table, and I also keep separate tables for composite
    indexes.

    For fast retrieval I created an indexmanager table which I can use to
    retrieve the corresponding indexes for attributes, and I also keep statistics
    about them, for query planning for instance.


    Cheers!

    /SJ
    -----------
    http://uncinuscloud.blogspot.com/
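
    A minimal sketch of the layout Samuru describes, with assumed table names
    ("idx_<column>" for the per-column index tables) and an "indexmanager" table keyed
    by attribute name; this is not his actual code, just one way the pieces could fit:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PerColumnIndexSketch {
        static final byte[] META = Bytes.toBytes("meta");   // assumed family

        // indexmanager has one row per attribute: which table holds its index
        // (statistics for query planning could live in the same row)
        static String indexTableFor(HTable indexManager, String attribute) throws IOException {
            Result r = indexManager.get(new Get(Bytes.toBytes(attribute)));
            return Bytes.toString(r.getValue(META, Bytes.toBytes("table"))); // e.g. "idx_city"
        }

        // one index table per column: index row key = value + main-table row key
        static void writeIndexEntry(HBaseConfiguration conf, HTable indexManager,
                                    String attribute, byte[] value, byte[] mainRow)
                throws IOException {
            HTable idx = new HTable(conf, indexTableFor(indexManager, attribute));
            Put p = new Put(Bytes.add(value, mainRow));
            p.add(META, Bytes.toBytes("ref"), mainRow);  // points back at the data row
            idx.put(p);
        }
    }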

  • Michael Segel at Sep 3, 2010 at 2:58 pm

    Just a small suggestion...

    If you have a table that is populated and you add a new region server, your data isn't going to balance itself out.
    If you want to balance your existing data, you'll need to bring down HBase, then run Hadoop's balancer app. When it's completed, you'll see that your data is now spread more evenly across the cluster. Please remember that you need to have HBase down when you run the balancer app.
  • Todd Lipcon at Sep 4, 2010 at 7:17 pm

    On Fri, Sep 3, 2010 at 7:57 AM, Michael Segel wrote:

    Just a small suggestion...

    If you have a table that is populated and you add a new region server, your
    data isn't going to balance itself out.
    If you want to balance your existing data, you'll need to bring down HBase,
    then run Hadoop's balancer app. When it's completed, you'll see that your
    data is now spread more evenly across the cluster. Please remember that you
    need to have HBase down when you run the balancer app.

    The above is all incorrect.

    The data *will* balance itself out on HDFS after major compactions have
    taken place, and even before that, the regions *will* balance themselves
    across region servers.

    Running the balancer while HBase is running is also perfectly safe, though
    it is not necessary for performance reasons.

    -Todd






    --
    Todd Lipcon
    Software Engineer, Cloudera
  • Andrey Stepachev at Sep 4, 2010 at 10:24 pm

    2010/9/3 Murali Krishna. P <muralikpbhat@yahoo.com>:

    * custom indexing is good, but our data keeps changing every day. So, probably
    indextable is the best option for us
    In the case of custom indexing you can use timestamps to check that an index
    record is still valid
    (or even simply recheck the existence of the value).
    You also need regular index cleanup (an MR job or some custom application).

    To index some row identified by 'key' having 'value', we can create an index table
    where the row key will be [value:key], and insert an index row every time we insert
    our values. We got 30k rows/s/node this way.
    When we want to find all rows with 'value', we scan [value:0000, value:9999] and find
    all keys which point to rows containing the value.
    We scan the index, random-get the rows, recheck that the index is still valid
    (check the value or the timestamp; the index timestamp should be >= the value
    timestamp), and return only valid values (maybe we can even delete on the fly when we
    get a negative result, to automatically clean up stale data).
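
    A sketch of that scheme against the 0.20-era client API (family/column names and the
    ':' separator are assumptions, not Andrey's actual code): the write path is a blind
    extra put, and the read path is the scan-get-recheck loop described above.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ValueKeyIndexSketch {
        static final byte[] FAM = Bytes.toBytes("d");   // assumed family
        static final byte[] COL = Bytes.toBytes("c");   // the indexed column

        // write path: main put plus one index put keyed [value:key]; no read needed
        static void put(HTable main, HTable index, byte[] key, byte[] value) throws IOException {
            Put m = new Put(key);
            m.add(FAM, COL, value);
            main.put(m);
            Put i = new Put(Bytes.add(value, Bytes.toBytes(":"), key));
            i.add(FAM, COL, value);
            index.put(i);
        }

        // read path: scan the [value:...] range, then validate each hit against the main row
        static void lookup(HTable main, HTable index, byte[] value) throws IOException {
            byte[] start = Bytes.add(value, Bytes.toBytes(":"));
            byte[] stop  = Bytes.add(value, Bytes.toBytes(";"));  // ';' sorts right after ':'
            ResultScanner scanner = index.getScanner(new Scan(start, stop));
            for (Result idxRow : scanner) {
                long idxTs = idxRow.raw()[0].getTimestamp();
                byte[] key = Bytes.tail(idxRow.getRow(), idxRow.getRow().length - start.length);
                Get g = new Get(key);
                g.addColumn(FAM, COL);
                Result row = main.get(g);                 // the random get mentioned above
                byte[] current = row.getValue(FAM, COL);
                boolean valid = current != null && Bytes.equals(current, value)
                        && idxTs >= row.raw()[0].getTimestamp();
                if (valid) {
                    // hand 'key' to the caller; otherwise it is a stale index entry
                    // that could even be deleted on the fly, as suggested above
                }
            }
            scanner.close();
        }
    }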

    * Just added one more regionserver and it did not help. Actually it went back
    to 60/s for some strange reason(with one client). The requests in the hbase ui
    is not uniform across 2 region servers. One server is doing around 2000 and the
    other 500. Probably once the region gets split and when we have lots of data,
    writes will improve ? (Now it is just writing to one region for the main table)
    Looks like all data goes to one region server. Try to make the writes more random
    (maybe make the key a random UUID, or use some other key-randomization technique).
    * Is there some way to do bulk load the indexedtable? Earlier I have used the
    bulk loader tool (mapreduce job which creates the regions offline) but not sure
    whether it works with indexed table.
    Not sure, but you can look at the source code and try to emulate the indexing
    operations in your own code after a regular bulk load.

    Thanks,
    Murali Krishna
    Andrey.
  • Samuru Jackson at Sep 5, 2010 at 12:57 am
    Hi,
    where key will be [value:key] and insert rows every time, when we insert
    our values. We will got 30k rows/s/node.
    Could you specify what kind of hardware you did this on? How did you
    design your indexer? Is it multithreaded?

    /SJ
    -----------
    http://uncinuscloud.blogspot.com/
  • Andrey Stepachev at Sep 5, 2010 at 6:13 pm

    2010/9/5 Samuru Jackson <samurujackson@googlemail.com>:
    Hi,
    where key will be [value:key] and insert rows every time, when we insert
    our values. We will got 30k rows/s/node.
    Could you specify on what kind of hardware you did this?
    A 3-node "cluster": 16 GB RAM, Core 2 Duo, SAS RAID 10.
    How did you design your indexer? Is it multithreaded?
    It is not an indexer; it is an abstraction around HTable which
    does the put plus additional puts (as described before) into the index
    tables. Later (I don't have an actual date now) I will release this
    code, but it is not rocket science.

    30k is a peak request rate, not a constant rate. For effective rows
    (JSON objects with 1-2 indexes on them, 100-500 bytes each) I got
    1-3k objects per second per node.
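
    Not his released code, but the abstraction he describes could look roughly like the
    sketch below: a thin wrapper that forwards the main put and fires one extra put per
    indexed cell (the index key layout and class name are assumptions):

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.hbase.KeyValue;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // Wraps a main table plus one index table and keeps them in step on every put.
    public class IndexingTable {
        private final HTable main;
        private final HTable index;

        public IndexingTable(HTable main, HTable index) {
            this.main = main;
            this.index = index;
        }

        public void put(Put p) throws IOException {
            main.put(p);
            // one additional index put per cell in the original put
            for (List<KeyValue> cells : p.getFamilyMap().values()) {
                for (KeyValue kv : cells) {
                    byte[] indexRow = Bytes.add(kv.getValue(), kv.getQualifier(), p.getRow());
                    Put ip = new Put(indexRow);
                    ip.add(kv.getFamily(), kv.getQualifier(), kv.getValue());
                    index.put(ip);
                }
            }
        }
    }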
  • Murali Krishna. P at Sep 5, 2010 at 5:18 am
    Hi,
    Thanks for the detailed explanation. I liked the idea of the timestamp
    check; this will be good enough for us, and I can run a periodic MR cleaner.
    However, I need some help in understanding the 30K number that was claimed. With
    the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20 index columns).
    I understand that there are additional reads that IndexedTable does, but the 25x
    improvement that you got is very impressive. Can you please help me
    understand this gain? (My hardware is 8 GB RAM, a 7.2k RPM disk, and 2 cores at 2 GHz.)

    Thanks,
    Murali Krishna




  • Andrey Stepachev at Sep 5, 2010 at 6:25 pm

    2010/9/5 Murali Krishna. P <muralikpbhat@yahoo.com>:
    Hi,
    Thanks for the detailed explanation. I liked the idea of the timestamp
    check; this will be good enough for us, and I can run a periodic MR cleaner.
    However, I need some help in understanding the 30K number that was claimed.

    The real insert rate will depend on the size of the row, the size of the write
    buffer, etc. In the case of a simple row with one long per row I got 30k
    requests/second (as shown in the HBase UI).
    For JSON-serialised objects of 100-700 bytes each, with validation, I can insert
    2-6k objects per second.

    With the IndexedTable approach, I got only 1200 rows/s (60 rows/s x 20 index columns).
    I understand that there are additional reads that IndexedTable does, but the 25x
    improvement that you got is very impressive. Can you please help me
    understand this gain? (My hardware is 8 GB RAM, a 7.2k RPM disk, and 2 cores at 2 GHz.)

    Did you try to insert data into a non-indexed table (with the IndexedTable
    extension disabled)? What numbers did you get?
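
    One way to get that baseline number is a quick timing loop of plain puts with the
    same client-side buffering discussed earlier in the thread; the table name, family
    and row size below are assumptions, not anyone's actual benchmark:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class PlainWriteBenchmark {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(new HBaseConfiguration(), "plainTable"); // not an IndexedTable
            table.setAutoFlush(false);
            table.setWriteBufferSize(1024 * 1024 * 12);

            byte[] fam = Bytes.toBytes("d");
            byte[] value = new byte[300];        // a few hundred bytes per row
            int n = 100000;
            long start = System.currentTimeMillis();
            for (int i = 0; i < n; i++) {
                Put p = new Put(Bytes.toBytes("row-" + i));
                p.setWriteToWAL(false);
                p.add(fam, Bytes.toBytes("c"), value);
                table.put(p);
            }
            table.flushCommits();                // push whatever is still buffered
            long ms = System.currentTimeMillis() - start;
            System.out.println((n * 1000L / ms) + " puts/second without index maintenance");
        }
    }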
    Andrey.
  • Murali Krishna. P at Sep 6, 2010 at 5:02 am
    Hi,
    My row size is around 300 bytes with 20 columns in total. I tried the custom
    indexing without writing to the WAL. I currently have only 2 tables: one for the
    main table and another for all 20 indexes. My key in the index table is
    columnValue + columnName + rowKey.
    I am getting around 500 inserts/second now (i.e., a total of ~10K puts/second). This is
    probably comparable with your numbers, given the data size.
    I have some doubts about the HBase write implementation:
    * Is this the maximum that we can achieve with any number of region servers? Why
    does adding region servers not improve the write performance? Is it because, when
    the data doesn't yet exist in the table, it always writes to one region?

    * Would writing to an existing, well-distributed table give better
    performance, since the writes would be spread across machines? In that case, if we have
    multiple tables (one per index), would that be better even during the initial load
    (since each table has a different region)?
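
    For reference, a tiny sketch of building that composite key with the client's Bytes
    utility (the example values are made up):

    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexKeySketch {
        // index row key = columnValue + columnName + rowKey, as described above
        // note: with variable-length values, a separator or fixed-width encoding
        // avoids two different (value, name) pairs producing the same key
        static byte[] indexKey(byte[] columnValue, byte[] columnName, byte[] rowKey) {
            return Bytes.add(columnValue, columnName, rowKey);
        }

        public static void main(String[] args) {
            byte[] k = indexKey(Bytes.toBytes("bangalore"),    // hypothetical column value
                                Bytes.toBytes("city"),         // hypothetical column name
                                Bytes.toBytes("user-00042"));  // main-table row key
            System.out.println(Bytes.toString(k));
        }
    }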

    Thanks,
    Murali Krishna




  • Ted Yu at Sep 6, 2010 at 1:53 pm
    My key to the index table is columnValue+columnName+rowKey.
    You need to consider the distribution of the above key so that writes to the
    index table don't become a bottleneck in the write path.

    Please clarify how this index table serves 20 columns - in the above schema,
    columnValue would be different for the 20 columns indexed, I assume.

  • Murali Krishna. P at Sep 6, 2010 at 5:13 pm

    Please clarify how this index table serves 20 columns - in the above schema,
    columnValue would be different for the 20 columns indexed, I assume.
    My query to the index table will be columnValue + columnName. This is for an exact
    match; if you need to scan on a partial value, you have to reverse the key
    generation to columnName + columnValue + rowKey. I went for this schema to reduce the
    number of tables involved.
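
    A sketch of that exact-match lookup as a prefix scan over the index table (the family
    name and helper names are assumptions, not Murali's code):

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.filter.PrefixFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class IndexLookupSketch {
        // exact match: read every index row whose key starts with columnValue + columnName
        static void findRows(HTable index, byte[] columnValue, byte[] columnName)
                throws IOException {
            byte[] prefix = Bytes.add(columnValue, columnName);
            Scan scan = new Scan(prefix);               // start at the prefix...
            scan.setFilter(new PrefixFilter(prefix));   // ...and stop once we leave it
            ResultScanner scanner = index.getScanner(scan);
            for (Result r : scanner) {
                // the remainder of the index key is the main-table row key
                byte[] rowKey = Bytes.tail(r.getRow(), r.getRow().length - prefix.length);
                System.out.println(Bytes.toString(rowKey));
            }
            scanner.close();
        }
    }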

    Thanks,
    Murali Krishna




  • Andrey Stepachev at Sep 6, 2010 at 6:47 pm

    2010/9/6 Murali Krishna. P <muralikpbhat@yahoo.com>:
    Hi,
    My row size is around 300 bytes with total 20 columns. I tried the custom
    indexing without the write to WAL. Currently having only 2 tables, one for the
    main table and another for all 20 indexes. My key to the index table is
    columnValue+columnName+rowKey.
    As mentioned before, you can randomize your index insertions.
    If you don't need ordered or range scans on columnValue, you can
    prefix it with some hash, e.g. sha(columnValue) + columnValue +
    columnName + rowKey.
    This removes the hotspot on one of your region servers.
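
    A sketch of that salting (the digest choice and prefix length are assumptions): a
    reader can recompute the same salt from the value, so exact-match lookups still work,
    but ordered scans across different values are lost.

    import java.security.MessageDigest;
    import org.apache.hadoop.hbase.util.Bytes;

    public class SaltedIndexKeySketch {
        // a short hash prefix spreads otherwise-adjacent index keys across regions
        static byte[] saltedIndexKey(byte[] columnValue, byte[] columnName, byte[] rowKey)
                throws Exception {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(columnValue);
            byte[] salt = new byte[] { digest[0], digest[1] };   // 2-byte prefix
            return Bytes.add(Bytes.add(salt, columnValue), Bytes.add(columnName, rowKey));
        }
    }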
    I am getting around 500 inserts/second now. (ie, total of ~10K puts). This is
    probably comparable with your numbers based on the data size.
    Are all region servers getting an equal load, or are some servers busier
    than others?
    I have some doubts with the hbase write implementation.
    * Is this the max that we can achieve with any number of region servers? Why
    adding region servers not improving the write performance? Is it because when
    the data doesn't exist in the table, it always writes to one region ?
    In general, yes. Until the table splits, all writes will go to
    one region server.
    * Probably writing to an existing, well distributed table might give better
    performance since the writes will be across machines ? In that case, if we have
    multiple tables (one per index), will it be better during the initial write
    itself (since each table has different region) ??
    Yes - the more servers are involved in the writes, the better.

    Andrey.

Discussion Overview
group: user@hbase.apache.org
categories: hbase, hadoop
posted: Sep 2, 2010 at 5:44 PM
active: Sep 6, 2010 at 6:47 PM
posts: 17
users: 6
website: hbase.apache.org
