[HBase-user] Additional disk space required for Hbase compactions..

15 responses

  • Ryan Rawson at May 17, 2010 at 8:02 pm
    Yes, compactions happen on HDFS. HBase will only compact one region at a time
    per regionserver, so in theory you will need k×max(all region sizes).

    But hdfs does a delayed delete, so deleted files are not instantly freed up.
    You could end up requiring much more disk space.

    Considering HDFS disks should be the cheapest disks you own (data drives
    in a low-density configuration), hopefully it won't be hard to over-provision.
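
    A quick back-of-the-envelope sketch of this estimate (the numbers and names are illustrative, not an HBase API):

    public class CompactionSpaceEstimate {
        public static void main(String[] args) {
            // Worst case: each regionserver compacts at most one region at
            // a time, and while it does, the newly written HFile coexists
            // with the old ones on HDFS.
            long maxRegionBytes = 2L << 30; // largest region: 2 GB
            int replication = 3;            // HDFS replication factor r
            int regionServers = 10;         // k regionservers
            long peakExtra = (long) regionServers * replication * maxRegionBytes;
            System.out.printf("Peak extra HDFS space: ~%d GB%n", peakExtra >> 30);
            // HDFS deletes lazily, so budget extra headroom beyond this.
        }
    }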

    On May 17, 2010 11:57 AM, "Vidhyashankar Venkataraman" wrote:

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • Jonathan Gray at May 17, 2010 at 8:08 pm
    We should do better at scheduling major compactions over a longer period of time if we keep it as a background process.

    Also, there has been some discussion about adding heuristics that would never major-compact very old and/or very large HFiles, to prevent old, rarely read data from being rewritten constantly.
    -----Original Message-----
    From: Ryan Rawson
    Sent: Monday, May 17, 2010 12:02 PM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Re: Additional disk space required for Hbase compactions..

    Yes, compactions happen on HDFS. HBase will only compact one region at a
    time per regionserver, so in theory you will need k×max(all region sizes).

    But hdfs does a delayed delete, so deleted files are not instantly
    freed up.
    You could end up requiring much more disk space.

    Considering HDFS disks should be the cheapest disks you own (data drives
    in a low-density configuration), hopefully it won't be hard to
    over-provision.

    On May 17, 2010 11:57 AM, "Vidhyashankar Venkataraman" <vidhyash@yahoo-inc.com> wrote:

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • Jonathan Gray at May 17, 2010 at 8:06 pm
    I'm not sure I understand why you distinguish between small HFiles and a single behemoth HFile. Are you trying to understand more about disk space, or about I/O patterns?

    It looks like your understanding is correct. At the worst point, a given Region will use twice its disk space during a major compaction. Once the compaction is complete, the original files are deleted from HDFS. So it is not the case that your entire dataset will require double the space for compactions, since they are not all run concurrently.
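
    As a quick worked example of this (numbers are illustrative): take U = 1 GB of small HFiles, M = 9 GB for the behemoth, and r = 3. In steady state the region's files occupy r(U+M) = 3 x 10 = 30 GB of raw HDFS space. While that region major-compacts, the old and new files coexist, so usage peaks near 2r(U+M) = 60 GB, then falls back to ~30 GB once the originals are deleted.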

    JG
    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 11:56 AM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Additional disk space required for Hbase compactions..

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • Vidhyashankar Venkataraman at May 17, 2010 at 7:48 pm

    I'm not sure I understand why you distinguish between small HFiles and a single behemoth HFile. Are you trying to understand more
    about disk space, or about I/O patterns?
    I was talking with regard to an application I had in mind.. Right now, I am considering just disk space..

    Ryan's comment:
    Yes, compactions happen on HDFS. HBase will only compact one region at a time
    per regionserver, so in theory you will need k×max(all region sizes).
    So the U and M from my mail are sizes per region. Am I right? And what is a good cutoff region size for hundreds of TB of data to be stored in HBase? I am wondering if this has ever been attempted..

    Vidhya

    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 11:56 AM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Additional disk space required for Hbase compactions..

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • Jonathan Gray at May 17, 2010 at 7:54 pm
    So the question is how large to make your regions if you have 100s of TBs?

    How many nodes will this be on and what are the specs of each node?

    Many people run with 1-2GB regions or higher.

    The primary issues will be memory usage and the propensity for splitting. With that dataset size, you'll need to be careful about splitting too much, because rewrites of data are expensive. The same goes for major compactions (you would definitely need to turn them off and control them manually, if you need them at all).
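
    For reference, region size is governed by hbase.hregion.max.filesize; here is a sketch with an illustrative 2GB value matching the guidance above (check your version's hbase-default.xml for the exact semantics):

    <property>
    <name>hbase.hregion.max.filesize</name>
    <value>2147483648</value>
    <description>Maximum HStoreFile size. A region is split in two once any
    of its store files grows past this; 2GB here, per the 1-2GB guidance
    above.</description>
    </property>

    Manual major compactions can then be issued when convenient, e.g. with major_compact in the HBase shell.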
    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 12:19 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..
    I'm not sure I understand why you distinguish between small HFiles and a
    single behemoth HFile. Are you trying to understand more
    about disk space, or about I/O patterns?
    I was talking with regard to an application I had in mind.. Right now, I am
    considering just disk space..

    Ryan's comment:
    Yes, compactions happen on HDFS. HBase will only compact one region
    at a time per regionserver, so in theory you will need k×max(all region
    sizes).
    So the U and M from my mail are sizes per region. Am I right? And what
    is a good cutoff region size for hundreds of TB of data to be stored in
    HBase? I am wondering if this has ever been attempted..

    Vidhya

    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 11:56 AM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Additional disk space required for Hbase compactions..

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • Vidhyashankar Venkataraman at May 17, 2010 at 8:39 pm
    So the question is how large to make your regions if you have 100s of TBs?
    Yeah.. I realize it would depend on the number of nodes and their specs.. My question leaned more towards how many regions per node HBase would be comfortable with (at least as of now) under the default config.. Assume an off-the-shelf machine for simplicity..
    Same with major compactions (you would definitely need to turn them off and control them manually if you need them at all).
    Huh.. But that can affect reads and scans.. Further, the problem is aggravated if we want to do online writes: a steady stream of updates to the database (whose volume, let's say, is roughly an order of magnitude smaller than the size of the db) may require compactions to be done regularly. Isn't that right?

    V

    On 5/17/10 12:26 PM, "Jonathan Gray" wrote:

    So the question is how large to make your regions if you have 100s of TBs?

    How many nodes will this be on and what are the specs of each node?

    Many people run with 1-2GB regions or higher.

    The primary issues will be memory usage and the propensity for splitting. With that dataset size, you'll need to be careful about splitting too much, because rewrites of data are expensive. The same goes for major compactions (you would definitely need to turn them off and control them manually, if you need them at all).
    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 12:19 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..
    I'm not sure I understand why you distinguish between small HFiles and a
    single behemoth HFile. Are you trying to understand more
    about disk space, or about I/O patterns?
    I was talking with regard to an application I had in mind.. Right now, I am
    considering just disk space..

    Ryan's comment:
    Yes, compactions happen on HDFS. HBase will only compact one region
    at a time per regionserver, so in theory you will need k×max(all region
    sizes).
    So the U and M from my mail are sizes per region. Am I right? And what
    is a good cutoff region size for hundreds of TB of data to be stored in
    HBase? I am wondering if this has ever been attempted..

    Vidhya

    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 11:56 AM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Additional disk space required for Hbase compactions..

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
  • TuX RaceR at May 17, 2010 at 8:47 pm
    Hello List,

    On 17/05/10 20:26, Jonathan Gray wrote:
    Same with major compactions (you would definitely need to turn them off and control them manually if you need them at all).
    How would you turn major compactions off?
    The only major-compaction-related parameter is this one:

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
    <description>The time (in milliseconds) between 'major' compactions of all
    HStoreFiles in a region. Default: 1 day.
    </description>
    </property>

    Is there a cleaner way to turn it off than putting a ridiculously large
    value?

    Thanks
    TuX
  • Jonathan Gray at May 17, 2010 at 10:13 pm
    No there isn't.

    I just opened a JIRA to make it so it can be set to 0 to disable.

    https://issues.apache.org/jira/browse/HBASE-2559

    Will put up a patch for trunk/0.21.

    JG
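
    Once that patch lands, disabling the periodic major compaction would look like this (a sketch; before the patch, a very large value is the usual workaround):

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>0</value>
    <description>0 disables time-based major compactions (HBASE-2559,
    trunk/0.21 only). Minor compactions and manually requested major
    compactions still run.</description>
    </property>
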
    -----Original Message-----
    From: TuX RaceR
    Sent: Monday, May 17, 2010 1:47 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..

    Hello List,

    On 17/05/10 20:26, Jonathan Gray wrote:
    Same with major compactions (you would definitely need to turn them
    off and control them manually if you need them at all).
    How would you turn major compactions off?
    The only major-compaction-related parameter is this one:

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
    <description>The time (in milliseconds) between 'major' compactions of
    all
    HStoreFiles in a region. Default: 1 day.
    </description>
    </property>

    Is there a cleaner way to turn it off than putting a ridiculously large
    value?

    Thanks
    TuX
  • TuX RaceR at May 18, 2010 at 8:07 am
    Thank you Jonathan for raising the Jira and attaching a patch

    I was looking for more info on how major compactions and minor
    compactions work, and Google found me this page:

    http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

    After reading the wiki page and the Google Bigtable paper, it seems to me
    that there is a difference between Google 'minor compactions' and HBase
    'minor compactions'.

    In Google's design, a minor compaction is (from the paper):
    "5.4 Compactions
    As write operations execute, the size of the memtable increases. When
    the memtable size reaches a threshold, the memtable is frozen, a new
    memtable is created, and the frozen memtable is converted to an SSTable
    and written to GFS. This minor compaction process has two goals:
    it shrinks the memory usage of the tablet server, and it reduces the
    amount of data that has to be read from the commit log during recovery
    if this server dies. Incoming read and write operations can continue
    while compactions occur.
    Every minor compaction creates a new SSTable. If this behavior continued
    unchecked, read operations might need to merge updates from an arbitrary
    number of SSTables."

    On the other hand the Hbase wiki:
    "Compactions: When the number of MapFiles exceeds a configurable
    threshold, a minor compaction is performed which consolidates the most
    recently written MapFiles."

    So it seems that:
    1) Google minor compactions are equivalent to HBase cache flushes
    2) Google major compactions are equivalent to HBase major compactions
    3) there is no equivalent of HBase minor compactions in the Google design.

    Can somebody confirm this?
    Since in my case my data is almost immutable (i.e. I do not have a lot of
    space to reclaim for deleted rows, as there are few of them), I am
    wondering if the compactions do more harm than good.

    Thanks
    TuX


    On 17/05/10 23:12, Jonathan Gray wrote:
    No there isn't.

    I just opened a JIRA to make it so it can be set to 0 to disable.

    https://issues.apache.org/jira/browse/HBASE-2559

    Will put up a patch for trunk/0.21.

    JG

    -----Original Message-----
    From: TuX RaceR
    Sent: Monday, May 17, 2010 1:47 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..

    Hello List,

    On 17/05/10 20:26, Jonathan Gray wrote:

    Same with major compactions (you would definitely need to turn them
    off and control them manually if you need them at all).
    How would you turn major compactions off?
    The only major-compaction-related parameter is this one:

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
    <description>The time (in milliseconds) between 'major' compactions of
    all
    HStoreFiles in a region. Default: 1 day.
    </description>
    </property>

    Is there a cleaner way to turn it off than putting a ridiculously large
    value?

    Thanks
    TuX
  • Jean-Daniel Cryans at May 18, 2010 at 5:28 pm
    The equivalent of HBase minor compactions would be Bigtable's merging
    compaction (minus the part where it also reads from memtable).

    About your space problem, the recommended practice is to keep your
    system with at least 20% free disk space, or else you can run into all
    sorts of problems.

    J-D
    On Tue, May 18, 2010 at 4:06 AM, TuX RaceR wrote:
    Thank you Jonathan for raising the Jira and attaching a patch

    I was looking for more info on how major compactions and minor compactions
    work, and Google found me this page:

    http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

    After reading the wiki page and the Google Bigtable paper, it seems to me that
    there is a difference between Google 'minor compactions' and HBase 'minor
    compactions'.

    In Google's design, a minor compaction is (from the paper):
    "5.4 Compactions
    As write operations execute, the size of the memtable increases. When the
    memtable size reaches a threshold, the memtable is frozen, a new memtable is
    created, and the frozen memtable is converted to an SSTable and written to
    GFS. This minor compaction process has two goals:
    it shrinks the memory usage of the tablet server, and it reduces the amount
    of data that has to be read from the commit log during recovery if this
    server dies. Incoming read and write operations can continue while
    compactions occur.
    Every minor compaction creates a new SSTable. If this behavior continued
    unchecked, read operations might need to merge updates from an arbitrary
    number of SSTables."

    On the other hand the Hbase wiki:
    "Compactions: When the number of MapFiles exceeds a configurable threshold,
    a minor compaction is performed which consolidates the most recently written
    MapFiles."

    So it seems that:
    1) Google minor compactions are equivalent to HBase cache flushes
    2) Google major compactions are equivalent to HBase major compactions
    3) there is no equivalent of HBase minor compactions in the Google design.

    Can somebody confirm this?
    Since in my case my data is almost immutable (i.e. I do not have a lot of space
    to reclaim for deleted rows, as there are few of them), I am wondering if the
    compactions do more harm than good.

    Thanks
    TuX


    On 17/05/10 23:12, Jonathan Gray wrote:

    No there isn't.

    I just opened a JIRA to make it so it can be set to 0 to disable.

    https://issues.apache.org/jira/browse/HBASE-2559

    Will put up a patch for trunk/0.21.

    JG

    -----Original Message-----
    From: TuX RaceR
    Sent: Monday, May 17, 2010 1:47 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..

    Hello List,

    On 17/05/10 20:26, Jonathan Gray wrote:


    Same with major compactions (you would definitely need to turn them
    off and control them manually if you need them at all).
    How would you turn major compactions off?
    The only major-compaction-related parameter is this one:

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
    <description>The time (in milliseconds) between 'major' compactions of
    all
    HStoreFiles in a region.  Default: 1 day.
    </description>
    </property>

    Is there a cleaner way to turn it off than putting a ridiculously large
    value?

    Thanks
    TuX
  • TuX RaceR at May 18, 2010 at 7:12 pm
    Thanks J-D,

    Sorry for my bad English. What I meant about the space: because my
    data is almost immutable (i.e. almost no updates and no deletes), if I
    compact two tables of size S1 and S2, the size of the merged table will
    be almost S1+S2, whereas (if I understand correctly how it works) if I
    had made a lot of deletes on the two original tables, the size of the
    merged table could be much less than S1+S2.

    I do not have a problem with disk space right now, but the 20% rule of
    thumb is good to know (we all end up filling our large disks ;) )

    Thanks
    TuX
    On 18/05/10 18:27, Jean-Daniel Cryans wrote:
    The equivalent of HBase minor compactions would be Bigtable's merging
    compaction (minus the part where it also reads from memtable).

    About your space problem, the recommended practice is to keep your
    system with at least 20% free disk space, or else you can run into all
    sorts of problems.

    J-D

    On Tue, May 18, 2010 at 4:06 AM, TuX RaceR wrote:
    Thank you Jonathan for raising the Jira and attaching a patch

    I was looking for more info on how major compactions and minor compactions
    work, and Google found me this page:

    http://wiki.apache.org/hadoop/Hbase/HbaseArchitecture

    After reading the wiki page and the Google Bigtable paper, it seems to me that
    there is a difference between Google 'minor compactions' and HBase 'minor
    compactions'.

    In Google's design, a minor compaction is (from the paper):
    "5.4 Compactions
    As write operations execute, the size of the memtable increases. When the
    memtable size reaches a threshold, the memtable is frozen, a new memtable is
    created, and the frozen memtable is converted to an SSTable and written to
    GFS. This minor compaction process has two goals:
    it shrinks the memory usage of the tablet server, and it reduces the amount
    of data that has to be read from the commit log during recovery if this
    server dies. Incoming read and write operations can continue while
    compactions occur.
    Every minor compaction creates a new SSTable. If this behavior continued
    unchecked, read operations might need to merge updates from an arbitrary
    number of SSTables."

    On the other hand the Hbase wiki:
    "Compactions: When the number of MapFiles exceeds a configurable threshold,
    a minor compaction is performed which consolidates the most recently written
    MapFiles."

    So it seems that:
    1) Google minor compactions are equivalent to HBase cache flushes
    2) Google major compactions are equivalent to HBase major compactions
    3) there is no equivalent of HBase minor compactions in the Google design.

    Can somebody confirm this?
    Since in my case my data is almost immutable (i.e. I do not have a lot of space
    to reclaim for deleted rows, as there are few of them), I am wondering if the
    compactions do more harm than good.

    Thanks
    TuX


    On 17/05/10 23:12, Jonathan Gray wrote:

    No there isn't.

    I just opened a JIRA to make it so it can be set to 0 to disable.

    https://issues.apache.org/jira/browse/HBASE-2559

    Will put up a patch for trunk/0.21.

    JG


    -----Original Message-----
    From: TuX RaceR
    Sent: Monday, May 17, 2010 1:47 PM
    To: hbase-user@hadoop.apache.org
    Subject: Re: Additional disk space required for Hbase compactions..

    Hello List,


    On 17/05/10 20:26, Jonathan Gray wrote:

    Same with major compactions (you would definitely need to turn them
    off and control them manually if you need them at all).

    How would you turn major compactions off?
    The only major-compaction-related parameter is this one:

    <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>86400000</value>
    <description>The time (in milliseconds) between 'major' compactions of
    all
    HStoreFiles in a region. Default: 1 day.
    </description>
    </property>

    Is there a cleaner way to turn it off than putting a ridiculously large
    value?

    Thanks
    TuX
  • Jean-Daniel Cryans at May 18, 2010 at 8:41 pm
    Oh, I see now. Well, HBase doesn't compact tables, only HFiles. So in
    your case the only time you want compactions is for totally new data:
    to regroup all the flushed HFiles together so that reads don't have to
    open a bunch of files.
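
    If the periodic major compactions are disabled as discussed earlier in this thread, a one-off major compaction can also be requested from client code. A sketch against the 0.20-era client API (names as I recall them; verify against your HBase version):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    public class ManualCompact {
        public static void main(String[] args) throws Exception {
            // After a batch of new data has been flushed, regroup the
            // resulting HFiles in one pass instead of waiting for the
            // periodic background major compaction.
            HBaseAdmin admin = new HBaseAdmin(new HBaseConfiguration());
            admin.majorCompact("mytable"); // asynchronous request to the cluster
        }
    }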

    J-D
    On Tue, May 18, 2010 at 3:12 PM, TuX RaceR wrote:
    Thanks J-D,

    Sorry for my bad English. What I meant about the space: because my data is
    almost immutable (i.e. almost no updates and no deletes), if I compact two
    tables of size S1 and S2, the size of the merged table will be almost S1+S2,
    whereas (if I understand correctly how it works) if I had made a lot of
    deletes on the two original tables, the size of the merged table could be
    much less than S1+S2.

    I do not have a problem with disk space right now, but the 20% rule of thumb
    is good to know (we all end up filling our large disks ;) )

    Thanks
    TuX

    On 18/05/10 18:27, Jean-Daniel Cryans wrote:
  • Vidhyashankar Venkataraman at May 25, 2010 at 2:08 pm
    I want to run some experiments with the major compactions turned off..

    There are two parameters that I can see, both of which have to be modified to turn major compactions off (am I right here?):
    hbase.hstore.compactionThreshold and hbase.hregion.majorcompaction...

    But the config file's comments say that setting compactionThreshold to a high value may lead to memory issues during compaction (because an entire HFile will have to be logged)..


    Another related parameter that I found was:
    hbase.regionserver.thread.splitcompactcheckfrequency

    The default value is 20000... Does this mean the check happens once every 20000 seconds?

    Thank you
    Vidhya
  • Jean-Daniel Cryans at May 25, 2010 at 5:43 pm

    I want to run some experiments with the major compactions turned off..

    There are two parameters that I can see, both of which have to be modified to turn major compactions off (am I right here?):
    hbase.hstore.compactionThreshold and hbase.hregion.majorcompaction...

    But the config file's comments say that setting compactionThreshold to a high value may lead to memory issues during compaction (because an entire HFile will have to be logged)..
    compactionThreshold is only used for minor compactions, since major
    compactions always compact all files together. So setting the other
    conf to a very high number will basically disable major compactions. In
    trunk, setting it to 0 disables them. See
    https://issues.apache.org/jira/browse/HBASE-2559

    Another related parameter that I found was:
    hbase.regionserver.thread.splitcompactcheckfrequency

    The default value is 20000... Does this mean the check happens once every 20000 seconds?
    All the time-related configurations are in ms, so that's 20 secs.
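
    So, for example, making that check run once a minute for an experiment would look like this (illustrative value; the property name is the one quoted above):

    <property>
    <name>hbase.regionserver.thread.splitcompactcheckfrequency</name>
    <value>60000</value>
    <description>How often (in ms) the regionserver checks whether any of
    its regions needs a split or compaction. 60000 = once a minute; the
    default 20000 = every 20 seconds.</description>
    </property>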

    J-D
  • Edward Capriolo at May 17, 2010 at 8:08 pm

    On Mon, May 17, 2010 at 3:03 PM, Jonathan Gray wrote:

    I'm not sure I understand why you distinguish between small HFiles and a single
    behemoth HFile. Are you trying to understand more about disk space, or about
    I/O patterns?

    It looks like your understanding is correct. At the worst point, a given
    Region will use twice its disk space during a major compaction. Once the
    compaction is complete, the original files are deleted from HDFS. So it is
    not the case that your entire dataset will require double the space for
    compactions, since they are not all run concurrently.

    JG
    -----Original Message-----
    From: Vidhyashankar Venkataraman
    Sent: Monday, May 17, 2010 11:56 AM
    To: hbase-user@hadoop.apache.org
    Cc: Joel Koshy
    Subject: Additional disk space required for Hbase compactions..

    Hi guys,
    I am quite new to HBase.. I am trying to figure out the max additional
    disk space required for compactions..

    If the set of small HFiles amounts to a total size of U before a major
    compaction happens, the 'behemoth' HFile has size M, the resultant HFile
    after compaction has size U+M (the worst case, with only insertions), and
    the replication factor is r, then the disk space taken by the HFiles is
    2r(U+M).. Is this estimate reasonable? (This is also based on my
    understanding that compactions happen on HDFS and not on the local file
    system: am I correct?)..

    Thank you
    Vidhya
    I do not have the answer, but let me warn you that major compaction is VERY
    VERY IO-intensive. In our case, it pushed a system that would typically
    respond to get requests in 1-2ms up to 30ms, or 2000ms at times. That was
    not acceptable for us. I think in most normal cases it is not good to force
    the issue with major compaction.
