scan performance improvement
Hi,

To improve client performance I changed
hbase.client.scanner.caching from 1 to 50.
After running the client with the new value (hbase.client.scanner.caching = 50),
it didn't improve execution time at all.

I have ~9 million small records.
I have to do a full scan, so it brings all 9 million records to the client.
My assumption was that this change would bring a significant improvement, but it
did not.

Additional information:
I scan a table which has 100 regions
5 servers
20 maps
4 concurrent maps
The scan process takes 5.5-6 hours, which seems like too much time to me. Am I right?
And how can I improve it?


I changed the value in all hbase-site.xml files and restarted HBase.

Any suggestions?
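
For reference, the caching value can also be set per scan in client code, instead of (or in addition to) the cluster-wide hbase-site.xml setting. Below is a minimal sketch, assuming the 0.90-era Java client API; the table name is illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class CachedScan {
  public static void main(String[] args) throws Exception {
    HTable table = new HTable(HBaseConfiguration.create(), "URLs_sanity");
    Scan scan = new Scan();
    scan.setCaching(50);  // overrides hbase.client.scanner.caching for this scan only
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result row : scanner) {
        // process one row; with caching = 50, each next() RPC fetches 50 rows
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}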


  • Friso van Vollenhoven at Nov 11, 2010 at 11:30 am
    How small is small? If it is bytes, then setting the value to 50 is not much different from 1, I suppose. If 50 rows fit in one block, it will just fetch one block whether the setting is 1 or 50. You might want to try a larger value. That should be fine if the records are small and you need them all on the client side anyway.

    It also depends on the block size, of course. If you only ever do full scans on a table and do little random access, you might want to increase the block size.

    Friso



  • Oleg Ruchovets at Nov 11, 2010 at 11:55 am
    Yes, I thought about a larger number, and as you said, it depends on the block size.
    Good point.

    One record is ~4 KB, and the block size is:

    <property>
      <name>dfs.block.size</name>
      <value>268435456</value>
      <description>HDFS blocksize of 256MB for large file-systems.</description>
    </property>

    What number should I choose? I am afraid that using a number equal to one
    block would lead to a SocketTimeoutException. Am I right?

    Thanks,
    Oleg.



  • Friso van Vollenhoven at Nov 11, 2010 at 1:08 pm
    Not that block size (that's the HDFS one), but the HBase block size. You set it at table creation or it uses the default of 64K.

    The description of hbase.client.scanner.caching says:
    Number of rows that will be fetched when calling next
    on a scanner if it is not served from memory. Higher caching values
    will enable faster scanners but will eat up more memory and some
    calls of next may take longer and longer times when the cache is empty.

    That means that it will pre-fetch that number of rows, if the next row does not come from memory. So if your rows are small enough to fit 100 of them in one block, it doesn't matter whether you pre-fetch 1, 50 or 99, because it will only go to disk when it exhausts the whole block, which sticks in block cache. So, it will still fetch the same amount of data from disk every time. If you increase the number to a value that is certain to load multiple blocks at a time from disk, it will increase performance.
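
    To put rough numbers on this, using the ~4 KB row size mentioned earlier in the thread and the default 64 KB HBase block size (the arithmetic below is illustrative, not from the original posts):

    64 KB per block / ~4 KB per row ≈ 16 rows per block
    caching = 50   → ~200 KB (~3 blocks) per next() call, so ~180,000 RPCs for 9 million rows
    caching = 1000 → ~4 MB (~63 blocks) per next() call, so ~9,000 RPCs for 9 million rows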


  • Michael Segel at Nov 11, 2010 at 1:11 pm
    Correct me if I'm wrong, but isn't HBase's default block size 256 MB while Hadoop's default block size is 64 MB?

  • Friso van Vollenhoven at Nov 11, 2010 at 1:28 pm
    256 MB = the default MAX_FILESIZE
    64 KB = the default HBase block size
    64 MB = the default HDFS block size

    If you look at a table definition in the HBase master UI, you can see the settings for your table, like this:
    {NAME => 'inrdb_rir_stats', MAX_FILESIZE => '268435456', FAMILIES => [{NAME => 'data', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'LZO', VERSIONS => '1', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

    Also, have a look here to see how HBase stores data: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
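
    The same settings can also be read programmatically from a client; a minimal sketch, assuming the 0.90-era Java admin API (the table name is taken from later in the thread):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ShowBlockSizes {
      public static void main(String[] args) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
        HTableDescriptor desc = admin.getTableDescriptor(Bytes.toBytes("URLs_sanity"));
        for (HColumnDescriptor family : desc.getFamilies()) {
          // BLOCKSIZE here is the HBase (HFile) block size, not dfs.block.size
          System.out.println(family.getNameAsString()
              + " BLOCKSIZE=" + family.getBlocksize()
              + " BLOCKCACHE=" + family.isBlockCacheEnabled());
        }
      }
    }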



  • Oleg Ruchovets at Nov 11, 2010 at 1:34 pm
    Great, thank you for the explanation.

    My table schema is:

    {NAME => 'URLs_sanity', FAMILIES => [{NAME => 'gs', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'meta-data', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'snt', VERSIONS => '1', COMPRESSION => 'NONE', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

    A couple of questions:
    1) How can I know the optimal BLOCKSIZE? What is the best practice regarding this?
    2) Assuming a record is ~4 KB and caching is 50, that is 4 KB * 50 = 200 KB, or ~3 blocks, so performance should have improved, but execution time was the same.

    Oleg.

  • Friso van Vollenhoven at Nov 11, 2010 at 3:36 pm
    > 1) How can I know the optimal BLOCKSIZE? What is the best practice regarding this?

    Check the link I sent. There is an explanation of this setting in there.

    > 2) Assuming a record is ~4 KB and caching is 50, that is 4 KB * 50 = 200 KB, or ~3 blocks, so performance should have improved, but execution time was the same.

    There is of course more involved than just this. Also, you may already be getting the most out of what your hardware can give you. You should try to find out where your bottleneck is (IO, CPU, or network). Hadoop and HBase have many settings; there is no single magic knob that makes things fast or slow.


  • Ryan Rawson at Nov 11, 2010 at 6:03 pm
    I'd be careful about adjusting the HFile block size; we settled on 64 KB after
    benchmarking a bunch of things, and it seemed to be a good performance point.

    As for scanning small rows, I'd go with a caching size of 1000-3000.
    When I set my scanners to that, I can pull 50k+ rows/sec from one client.
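
    For a MapReduce full scan, the caching value can be set on the Scan that drives the job, so it does not depend on hbase-site.xml at all. A minimal sketch, assuming the 0.90-era TableMapReduceUtil API; the table name comes from the thread and CountMapper is only a stub:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    public class FullScanJob {
      static class CountMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context) {
          context.getCounter("scan", "rows").increment(1);  // replace with real per-row work
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = new Job(HBaseConfiguration.create(), "full-scan");
        job.setJarByClass(FullScanJob.class);
        Scan scan = new Scan();
        scan.setCaching(1000);       // rows per next() RPC, per Ryan's suggestion
        scan.setCacheBlocks(false);  // a one-off full scan should not churn the block cache
        TableMapReduceUtil.initTableMapperJob("URLs_sanity", scan,
            CountMapper.class, NullWritable.class, NullWritable.class, job);
        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }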

  • Oleg Ruchovets at Nov 11, 2010 at 7:03 pm
    Hi,

    I didn't change the block size (it is still 64 KB).
    I am running a test configured with a caching size of 3600.
    The test is still running, but I can already see that there is NO performance
    improvement.
    How can I check that HBase is actually using the changed caching size?
    Can I see it in the logs or with some debugging?

    Thanks,
    Oleg.
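
    One way to check is to take hbase-site.xml out of the equation. Note that hbase.client.scanner.caching is read by the client, so the copy of hbase-site.xml that matters is the one on the client (or map task) classpath, not the one the region servers were restarted with. A minimal sketch, assuming the 0.90-era client API, that prints the value the client actually loaded and then sets it explicitly per scan:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Scan;

    public class CheckCaching {
      public static void main(String[] args) {
        Configuration conf = HBaseConfiguration.create();
        // Print what the client picked up from its classpath (the default is 1).
        System.out.println("hbase.client.scanner.caching = "
            + conf.getInt("hbase.client.scanner.caching", 1));

        // Setting it on the Scan itself bypasses the config file entirely,
        // so a stale hbase-site.xml cannot silently undo the change.
        Scan scan = new Scan();
        scan.setCaching(3600);
      }
    }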
