Hey guys,

So in HBASE-4365 I ran multiple uploads and the latest one I reported
was a 5TB import on 14 RS and it took 18h with Stack's patch. Now one
thing we can see is that apart from some splitting, there's a lot of
compacting going on. Stack was wondering exactly how much that IO
costs us, so we devised a test where we could upload 5TB with 0
compactions. Here are the results:

The table was pre-split with 14 regions, 1 per region server.
hbase.hstore.compactionThreshold=100
hbase.hstore.blockingStoreFiles=110
hbase.regionserver.maxlogs=64 (the block size is 128MB)
hfile.block.cache.size=0.05
hbase.regionserver.global.memstore.lowerLimit=0.40
hbase.regionserver.global.memstore.upperLimit=0.74
export HBASE_REGIONSERVER_OPTS="$HBASE_JMX_BASE -Xmx14G
-XX:CMSInitiatingOccupancyFraction=75 -XX:NewSize=256m
-XX:MaxNewSize=256m"

The table had:
MAX_FILESIZE => '549755813888', MEMSTORE_FLUSHSIZE => '549755813888'

Basically what I'm trying to do is to never block and almost always be
flushing. You'll probably notice the big difference between the lower
and upper barriers and think "le hell?", it's because it takes so long
to flush that you have to have enough room to take on more data while
this is happening (and we are able to flush faster than we take on
writes).
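
For a rough sense of the room this leaves, the gap between the two barriers can be worked out from the settings above. The sketch below is arithmetic only; the class and method names are made up for illustration, and it assumes the whole -Xmx14G is usable heap, which slightly overstates what the JVM actually reports.

```java
// Illustrative arithmetic, not HBase code: how much heap lies between
// hbase.regionserver.global.memstore.lowerLimit and upperLimit.
final class MemstoreHeadroom {
    static final long GIB = 1024L * 1024L * 1024L;

    /** Bytes of heap that a fraction such as 0.40 or 0.74 corresponds to. */
    static long bytesAt(long heapBytes, double fraction) {
        return Math.round(heapBytes * fraction);
    }

    /** Heap between the lower and the upper barrier. */
    static long headroomBytes(long heapBytes, double lower, double upper) {
        return bytesAt(heapBytes, upper) - bytesAt(heapBytes, lower);
    }

    public static void main(String[] args) {
        long heap = 14 * GIB; // assumption: all of -Xmx14G counts as heap
        System.out.printf("lower barrier: %.2f GiB%n", bytesAt(heap, 0.40) / (double) GIB);
        System.out.printf("upper barrier: %.2f GiB%n", bytesAt(heap, 0.74) / (double) GIB);
        System.out.printf("headroom:      %.2f GiB%n", headroomBytes(heap, 0.40, 0.74) / (double) GIB);
    }
}
```

Under that assumption the barriers sit at about 5.6 GiB and 10.4 GiB per region server, so a bit under 5 GiB of writes can be absorbed while flushes are in progress.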

The test reports the following:
Wall time: 34984.083 s
Aggregate Throughput: 156893.07 queries/s
Aggregate Throughput: 160030935.29 bytes/s

That's roughly 2x faster (about 9.7h versus 18h) than when we wait for
compactions and splits, not too bad but I'm pretty sure we can do better:

- The QPS was very uneven, it seems that when it's flushing it takes
a big toll and queries drop to ~100k/s while the rest of the time it's
more like 200k/s. Need to figure out what's going on there and if it's
really just caused by flush-related IO.
- The logs were rolling every 6 seconds and since this takes a global
write lock, I can see how we could be slowing down a lot across 14
machines.
- The load was a bit uneven, I miscalculated my split points and the
last region always had 2-3k more queries per second.
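
As a sanity check, the reported figures can be cross-checked against each other. The sketch below is only arithmetic on the numbers quoted in this message (18h for the earlier run, plus the wall time and the two throughput lines); the class and method names are made up for illustration.

```java
// Illustrative arithmetic on the reported test numbers, not HBase code.
final class UploadNumbers {
    /** How many times faster the new run is than the baseline. */
    static double speedup(double baselineSeconds, double newSeconds) {
        return baselineSeconds / newSeconds;
    }

    /** Average payload per query implied by the two throughput figures. */
    static double bytesPerQuery(double bytesPerSecond, double queriesPerSecond) {
        return bytesPerSecond / queriesPerSecond;
    }

    /** Total bytes moved at a given rate over a given wall time. */
    static double totalBytes(double bytesPerSecond, double seconds) {
        return bytesPerSecond * seconds;
    }

    public static void main(String[] args) {
        double wall = 34984.083;        // Wall time, seconds
        double qps = 156893.07;         // Aggregate Throughput, queries/s
        double bps = 160030935.29;      // Aggregate Throughput, bytes/s
        double tib = 1024.0 * 1024.0 * 1024.0 * 1024.0;

        System.out.printf("speedup vs 18h run: %.2fx%n", speedup(18 * 3600, wall));
        System.out.printf("bytes per query:    %.1f%n", bytesPerQuery(bps, qps));
        System.out.printf("total uploaded:     %.2f TiB%n", totalBytes(bps, wall) / tib);
    }
}
```

The figures hang together: the speedup works out to about 1.85x, each query carries about 1020 bytes, and the total comes to roughly 5.1 TiB, in line with the 5TB upload.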

Stay tuned for more.

J-D


  • Matt Corgan at Feb 25, 2012 at 11:19 pm
    I've been meaning to look into something regarding compactions for a while
    now that may be relevant here. It could be that this is already how it
    works, but just to be sure I'll spell out my suspicions...

    I did a lot of large uploads when we moved to .92. Our biggest dataset is
    time series data (partitioned 16 ways with a row prefix). The actual
    inserting and flushing went extremely quickly, and the parallel compactions
    were churning away. However, when the compactions inevitably started
    falling behind I noticed a potential problem. The compaction queue would
    get up to, say, 40, which represented, say, an hour's worth of requests.
    The problem was that by the time a compaction request started executing,
    the CompactionSelection that it held was terribly out of date. It was
    compacting a small selection (3-5) of the 50 files that were now there.
    Then the next request would compact another (3-5), etc, etc, until the
    queue was empty. It would have been much better if a CompactionRequest
    decided what files to compact when it got to the head of the queue. Then
    it could see that there are now 50 files needing compaction and possibly
    compact the 30 smallest ones, not just 5. When the insertions were done
    after many hours, I would have preferred it to do one giant major
    compaction, but it sat there and worked through its compaction queue
    compacting all sorts of different combinations of files.

    Said differently, it looks like .92 picks the files to compact at
    compaction request time rather than compaction execution time which is
    problematic when these times grow far apart. Is that the case? Maybe
    there are some other effects that are mitigating it...
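
The difference between the two selection times can be shown with a toy model. This is a minimal sketch, not HBase's Store or CompactionRequest: the ToyStore class, the 64 MB file size, the 5 and 45 flush counts, and the cap of 30 files per compaction are all assumptions chosen to mirror the numbers in the message above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a store that accumulates flushed files.
final class ToyStore {
    private final List<Long> fileSizes = new ArrayList<>();

    void flush(long sizeMb) { fileSizes.add(sizeMb); }

    int fileCount() { return fileSizes.size(); }

    /** Pick up to maxFiles of the smallest files present right now. */
    List<Long> selectSmallest(int maxFiles) {
        List<Long> sorted = new ArrayList<>(fileSizes);
        Collections.sort(sorted);
        return new ArrayList<>(sorted.subList(0, Math.min(maxFiles, sorted.size())));
    }

    /** Replace the selected files with a single merged file. */
    void compact(List<Long> selection) {
        long merged = 0;
        for (Long size : selection) {
            if (fileSizes.remove(size)) { // remove(Object): drops one matching file
                merged += size;
            }
        }
        if (merged > 0) {
            fileSizes.add(merged);
        }
    }
}

final class StaleSelectionDemo {
    /** Returns how many files remain after one queued compaction runs. */
    static int filesAfterOneCompaction(boolean selectAtExecutionTime) {
        ToyStore store = new ToyStore();
        for (int i = 0; i < 5; i++) store.flush(64);

        // Request time: the selection is captured while only 5 files exist.
        List<Long> requestTimeSelection = store.selectSmallest(30);

        // The request waits in the queue while 45 more flushes arrive.
        for (int i = 0; i < 45; i++) store.flush(64);

        // Execution time: either reuse the stale selection or select again.
        List<Long> selection = selectAtExecutionTime
                ? store.selectSmallest(30)
                : requestTimeSelection;
        store.compact(selection);
        return store.fileCount();
    }

    public static void main(String[] args) {
        System.out.println("request-time selection leaves   " + filesAfterOneCompaction(false) + " files");
        System.out.println("execution-time selection leaves " + filesAfterOneCompaction(true) + " files");
    }
}
```

In this model the stale selection merges 5 of the 50 files and leaves 46, while selecting at execution time merges 30 and leaves 21.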

    Matt
  • Yuzhihong at Feb 25, 2012 at 11:38 pm
    Thanks for sharing this, Matt.

    Do you mind opening a Jira for your suggestion?


  • Lars Hofhansl at Feb 25, 2012 at 11:45 pm
    Interesting. So a compaction request would hold no information beyond the CF, really,
    but is just a promise to do a compaction as soon as possible.
    I agree with Ted, we should explore this in a jira.

    -- Lars


    ----- Original Message -----
    From: Matt Corgan <mcorgan@hotpads.com>
    To: dev@hbase.apache.org
    Cc:
    Sent: Saturday, February 25, 2012 3:18 PM
    Subject: Re: Follow-up to my HBASE-4365 testing

  • Matt Corgan at Feb 26, 2012 at 12:03 am
    Yeah. You would also want a mechanism to prevent queuing the same CF
    multiple times, and probably want the completion of one compaction to
    trigger a check to see if it should queue another.
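
A queue with those two properties could look roughly like the sketch below. It is a minimal illustration under assumptions, not HBase code: stores are identified by a plain string, and the class and method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Hypothetical compaction queue that holds at most one pending request per store.
final class DedupingCompactionQueue {
    private final Queue<String> queue = new ArrayDeque<>();
    private final Set<String> queued = new HashSet<>();

    /** Queue a store for compaction; returns false if it is already pending. */
    synchronized boolean request(String storeName) {
        if (!queued.add(storeName)) {
            return false; // already queued, do not add a second request
        }
        queue.add(storeName);
        return true;
    }

    /** Next store to compact, or null when nothing is pending. */
    synchronized String poll() {
        String next = queue.poll();
        if (next != null) {
            queued.remove(next);
        }
        return next;
    }

    /** After a compaction finishes, re-queue the store only if it still needs work. */
    synchronized boolean onCompactionFinished(String storeName, boolean stillNeedsCompaction) {
        return stillNeedsCompaction && request(storeName);
    }

    synchronized int size() {
        return queue.size();
    }
}
```

Because a request carries only the store's identity, the files to compact would be chosen when the request is polled rather than when it is queued.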

    A possibly different architecture than the current style of queues would be
    to have each Store (all open in memory) keep a compactionPriority score up
    to date after events like flushes, compactions, schema changes, etc. Then
    you create a "CompactionPriorityComparator implements Comparator<Store>"
    and stick all the Stores into a PriorityQueue. The async compaction
    threads would keep pulling off the head of that queue as long as the head
    has compactionPriority > X.
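
A minimal sketch of that idea follows. ScoredStore is a hypothetical stand-in for Store, and using the store file count as the compactionPriority score is an assumption made only to keep the example small; the scheduler class and its methods are likewise made up.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for a Store that keeps a compaction score.
final class ScoredStore {
    final String name;
    int storeFileCount;

    ScoredStore(String name, int storeFileCount) {
        this.name = name;
        this.storeFileCount = storeFileCount;
    }

    /** Assumed score: simply the number of store files. */
    int compactionPriority() {
        return storeFileCount;
    }
}

final class CompactionPriorityComparator implements Comparator<ScoredStore> {
    @Override
    public int compare(ScoredStore a, ScoredStore b) {
        // Highest priority first, so compare in reverse order.
        return Integer.compare(b.compactionPriority(), a.compactionPriority());
    }
}

final class PriorityCompactionScheduler {
    private final PriorityQueue<ScoredStore> stores =
            new PriorityQueue<>(new CompactionPriorityComparator());
    private final int threshold; // the "X" in the description above

    PriorityCompactionScheduler(int threshold) {
        this.threshold = threshold;
    }

    synchronized void add(ScoredStore store) {
        stores.add(store);
    }

    /** Call after a flush or compaction changes a store's file count. */
    synchronized void update(ScoredStore store, int newFileCount) {
        // PriorityQueue does not re-sort on mutation, so remove and re-insert.
        stores.remove(store);
        store.storeFileCount = newFileCount;
        stores.add(store);
    }

    /** Removes and returns the head if its priority exceeds the threshold, else null. */
    synchronized ScoredStore pollIfAboveThreshold() {
        ScoredStore head = stores.peek();
        if (head == null || head.compactionPriority() <= threshold) {
            return null;
        }
        return stores.poll();
    }
}
```

One design point worth noting: java.util.PriorityQueue only orders elements when they are inserted, so a score that changes while the store is in the queue has to go through something like update() above. A compaction thread would also need to add the store back once it is done with it.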

  • Matt Corgan at Feb 26, 2012 at 12:52 am
    https://issues.apache.org/jira/browse/HBASE-5479

    JD - apologies if that was unrelated to your email


Discussion Overview
group: dev @
categories: hbase, hadoop
posted: Feb 25, '12 at 6:06p
active: Feb 26, '12 at 12:52a
posts: 6
users: 4
website: hbase.apache.org
