Removal of old data files
We are using Cassandra 0.8.0 with an 8-node ring and only one CF.
Every column has a TTL of 86400 seconds (24 hours). We also set 'GC grace seconds' to
43200 (12 hours). We have to store a massive amount of data for one day now, and
eventually for five days if we get more disk space.
Even for one day, we run out of disk space on a busy day.

We run the nodetool compact command at night or as necessary, and then we run GC from
jconsole. We observed that GC did remove files, but not necessarily the oldest
ones.
Data files from more than 36 hours ago, and quite often from three days ago, are
still there.

Is this behavior expected, or do we need to adjust some other parameters?


Yuki Watanabe


  • Aaron Morton at Aug 25, 2011 at 10:13 pm
    If Cassandra does not have enough disk space to create a new file, it will provoke a JVM GC, which should result in compacted SSTables that are no longer needed being deleted. Otherwise they are deleted at some time in the future.

    Compacted SSTables have a file written out with a "compacted" extension.

    Do you see compacted SSTables in the data directory?

    Cheers.

    -----------------
    Aaron Morton
    Freelance Cassandra Developer
    @aaronmorton
    http://www.thelastpickle.com
  • Hiroyuki Watanabe at Sep 1, 2011 at 10:11 pm
    Yes, I see files with names like
    Orders-g-6517-Compacted

    However, all of those files have a size of 0.

    From Monday to Thursday we have 5642 -Data.db, -Filter.db and -Statistics.db files and only 128 -Compacted files,
    and all of the -Compacted files have a size of 0.

    Is this normal, or are we doing something wrong?


    yuki


    _______________________________________________

    This e-mail may contain information that is confidential, privileged or otherwise protected from disclosure. If you are not an intended recipient of this e-mail, do not duplicate or redistribute it by any means. Please delete it and any attachments and notify the sender that you have received it in error. Unless specifically indicated, this e-mail is not an offer to buy or sell or a solicitation to buy or sell any securities, investment products or other financial product or service, an official confirmation of any transaction, or an official statement of Barclays. Any views or opinions presented are solely those of the author and do not necessarily represent those of Barclays. This e-mail is subject to terms available at the following link: www.barcap.com/emaildisclaimer. By messaging with Barclays you consent to the foregoing. Barclays Capital is the investment banking division of Barclays Bank PLC, a company registered in England (number 1026167) with its registered office at 1 Churchill Place, London, E14 5HP. This email may relate to or be sent from other members of the Barclays Group.
    _______________________________________________
  • Sylvain Lebresne at Sep 2, 2011 at 7:41 am

    You are not doing anything wrong. The -Compacted files are just
    markers indicating that the corresponding -Data files (those with the
    same number) have in fact been compacted and will eventually be removed.
    So those files will always have a size of 0.
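
    The pairing between marker and data files can be checked with a short script. This is a minimal sketch, not part of Cassandra: it assumes file names follow the <CF>-<version>-<number>-<Component> pattern seen in this thread (for example Orders-g-6517-Compacted), and the function name and directory argument are hypothetical.

```python
import os
import re
from collections import defaultdict

# Assumed naming pattern, based on the example "Orders-g-6517-Compacted":
# <column family>-<version letters>-<number>-<component>
SSTABLE_FILE = re.compile(
    r"^(?P<cf>.+)-(?P<version>[a-z]+)-(?P<number>\d+)-(?P<component>.+)$"
)


def pending_deletion(data_dir):
    """Return {(cf, number): total bytes} for SSTables that have a -Compacted marker."""
    sizes = defaultdict(int)  # bytes per SSTable, summed over all its component files
    marked = set()            # SSTables for which a marker file exists
    for name in os.listdir(data_dir):
        match = SSTABLE_FILE.match(name)
        if not match:
            continue  # not an SSTable component under the assumed pattern
        key = (match.group("cf"), int(match.group("number")))
        sizes[key] += os.path.getsize(os.path.join(data_dir, name))
        if match.group("component") == "Compacted":
            marked.add(key)
    return {key: sizes[key] for key in sorted(marked)}
```

    The marker itself contributes 0 bytes, so each total reflects only the component files that are waiting to be deleted.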

    --
    Sylvain

  • Hiroyuki Watanabe at Sep 2, 2011 at 1:02 pm
    I see. Thank you for the helpful information.

    Yuki



  • Hiroyuki Watanabe at Sep 27, 2011 at 9:17 pm
    We now use a TTL of 12 hours and a GC grace period of 8 hours to encourage Cassandra to remove old data files more aggressively.

    Cassandra does remove a fair number of old data files.
    It tends to remove 4 out of every 5 files.
    I noticed this because each data file has a sequence number as part of its name.

    I also noticed that when Cassandra generates *-Compacted files, it generates 4 files at a time.
    They have consecutive numbers in their file names, but skip one number from the previous group of 4.
    The skipped one is the file that ends up not being removed and stays forever.

    I looked at the keys in the index file of a data file that was not removed. If I query any of those keys, Cassandra indicates that there is no data, which is correct because these files are older than 24 hours. All the data must be obsolete due to TTL.

    I am wondering why Cassandra does not remove all data files whose timestamps are much older than TTL + grace period.
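
    For reference, the "TTL + grace period" figure works out as follows for the two configurations mentioned in this thread. This is only the arithmetic behind the expectation stated above, not a statement of when Cassandra actually deletes a file; the function name is hypothetical.

```python
def expected_removal_age_hours(ttl_seconds, gc_grace_seconds):
    """Age in hours after which the poster expects a data file to be removable."""
    return (ttl_seconds + gc_grace_seconds) / 3600


# Settings quoted in the thread.
original = expected_removal_age_hours(86400, 43200)         # 24 h TTL, 12 h grace
current = expected_removal_age_hours(12 * 3600, 8 * 3600)   # 12 h TTL, 8 h grace
print(original, current)
```

    The first value matches the "36 hours" mentioned in the original question.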

    Does anybody have a similar experience?


    Yuki Watanabe



  • Aaron Morton at Sep 28, 2011 at 12:22 am
    Short answer: Cassandra will actively delete files when it needs to make space. Otherwise they will be deleted "some time later". Unless you are getting out-of-disk-space errors, it's not normally something to worry about.

    Longer:
    The TTL guarantee is "do not return this data to get requests after this many seconds".

    Data is "purged" from an SSTable when we run compactions (either minor/auto or major/manual). Purging means it will not be written to the new SSTable created by the compaction process. The main criterion for purging is that either gc_grace_seconds OR the TTL has expired on the column.

    After compaction completes, it writes the -Compacted marker for the SSTables that were compacted. But there is a bunch of logic that decides which files are compacted; it's not "compact the oldest 3 files". Basically it tries to compact files that are about the same size.

    Remember we *never* modify data on disk. If we want to remove data from an SSTable we have to write a new SSTable. It's one of the reasons writes are fast: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/
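
    The "write a new SSTable instead of modifying the old one" idea can be illustrated with a toy model. This is a sketch under stated assumptions, not Cassandra's implementation: SSTables are modelled as immutable tuples of columns, `purgeable_at` is a hypothetical field standing in for whatever purge criterion applies, and later inputs simply override earlier ones.

```python
from typing import NamedTuple


class Column(NamedTuple):
    key: str
    name: str
    value: str
    purgeable_at: int  # hypothetical: time (seconds) after which the column may be dropped


def compact(sstables, now):
    """Merge immutable input SSTables into one new SSTable, skipping purgeable columns.

    The inputs are never modified; in this model the caller would mark them
    as compacted and delete them later.
    """
    merged = {}
    for sstable in sstables:
        for column in sstable:
            # Toy conflict rule: a later input overrides an earlier one.
            merged[(column.key, column.name)] = column
    survivors = [c for c in merged.values() if c.purgeable_at > now]
    return tuple(sorted(survivors))
```

    Purged columns disappear only because they are never copied into the output; the space they occupy is reclaimed when the input files are eventually deleted.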

    Hope that helps.

    -----------------
    Aaron Morton
    Freelance Cassandra Developer
    @aaronmorton
    http://www.thelastpickle.com
  • Hiroyuki Watanabe at Sep 28, 2011 at 3:40 pm
    Thank you. You have been very helpful.

    We store only one day's worth of data for now. However, we want to store 5 days' worth of data eventually.
    That is 5 times more disk space.
    That is our main reason for looking at older SSTables that appear to be holding only tombstones.

    - yuki


  • Aaron Morton at Sep 28, 2011 at 9:21 pm
    For background:

    Minor compaction will bucket files (see https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L989) and then compact a bucket once it holds min_compaction_threshold files or more (the threshold is set per CF).
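
    A rough sketch of size-based bucketing follows. It is an illustration only: the 0.5x to 1.5x similarity bounds, the default threshold of 4, and the function names are assumptions made for this example rather than values read from the linked source.

```python
def bucket_by_size(sizes, low=0.5, high=1.5):
    """Group file sizes into buckets of roughly similar size.

    A file joins the first bucket whose average size it is close to
    (between low * average and high * average); otherwise it starts
    a new bucket.
    """
    buckets = []  # each bucket is a list of file sizes
    for size in sorted(sizes):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if low * average <= size <= high * average:
                bucket.append(size)
                break
        else:
            buckets.append([size])
    return buckets


def compaction_candidates(sizes, min_threshold=4):
    """Return only the buckets that hold at least min_threshold files."""
    return [b for b in bucket_by_size(sizes) if len(b) >= min_threshold]
```

    With sizes such as [100, 110, 95, 105, 1000], the four similar files form one bucket that qualifies for compaction, while the much larger file sits alone in its own bucket and is left untouched.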

    It will then purge tombstones only if a row is contained solely in the SSTables involved in the compaction (see https://github.com/apache/cassandra/blob/cassandra-0.8.6/src/java/org/apache/cassandra/db/compaction/CompactionController.java#L84)
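
    The purge condition above can be sketched as a simple check. This is a simplified model, assuming each SSTable can be represented as a set of row keys; the function name is hypothetical and the real check in the linked source may work differently.

```python
def should_purge(row_key, compacting, all_sstables):
    """True if row_key appears in no SSTable outside the compaction set.

    Each SSTable is modelled as a set of row keys (a simplification).
    """
    # SSTables are compared by identity so that two tables with equal
    # contents are still treated as distinct files.
    others = [s for s in all_sstables if not any(s is c for c in compacting)]
    return all(row_key not in sstable for sstable in others)
```

    In this model, a row that also lives in an SSTable outside the compaction keeps its tombstones, which is one way old data can survive a minor compaction.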

    So there are a couple of approaches you can take if you want to ensure all TTL'd data is purged ASAP. Note that tombstones and TTL'd data will be automatically purged at some point, but if you need more precise control you may need to take a few steps.

    First, if all the data you are storing has a 24 hour TTL you can include a manual major compaction via nodetool in your maintenance routine. We normally advise against it because it tries to create one large file, but if all your data is going to be removed it's prob ok.

    Second, play around with minor compaction to increase the chances that data is purged soon after the

    Third, monkey up a process to kick off user defined compaction runs for SSTables that are over 24 hours old.
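
As a starting point for the third option, the sketch below only finds candidate files. It assumes the file modification time is a good enough proxy for SSTable age and that data files end in -Data.db; the function name is hypothetical, and actually triggering the user defined compaction is deliberately left out.

```python
import os
import time


def old_data_files(data_dir, max_age_seconds=24 * 3600, now=None):
    """Return names of *-Data.db files last modified more than max_age_seconds ago.

    Assumption: mtime approximates the age of the SSTable.
    """
    now = time.time() if now is None else now
    old = []
    for name in sorted(os.listdir(data_dir)):
        if not name.endswith("-Data.db"):
            continue
        mtime = os.path.getmtime(os.path.join(data_dir, name))
        if now - mtime > max_age_seconds:
            old.append(name)
    return old
```

The returned names could then be handed to whatever mechanism you use to request a user defined compaction.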

    I know disk space can be an issue, but if you have the spare capacity you can just let cassandra manage things. Also 1.0 has some major changes in this area.

    Cheers

    -----------------
    Aaron Morton
    Freelance Cassandra Developer
    @aaronmorton
    http://www.thelastpickle.com
    On 29/09/2011, at 4:39 AM, hiroyuki.watanabe@barclayscapital.com wrote:

    Thank you. You have been very helpful.

    We store only one day's worth of data for now. However, we want to store 5 days' worth of data eventually.
    That is 5 times more disk space.
    That is our main reason for looking at older SSTables that appear to be holding only tombstones.

    - yuki


    From: aaron morton
    Sent: Tuesday, September 27, 2011 8:22 PM
    To: user@cassandra.apache.org
    Subject: Re: Removal of old data files

    Short Answer: Cassandra will actively delete files when it needs to make space. Otherwise they will be deleted "some time later". Unless you are getting out of disk space errors it's not normally something to worry about.

    Longer:
    The TTL guarantee is "do not return this data to get requests after this many seconds".

    Data is "purged" from an SSTable when we run compactions (either minor/auto or major/manual). Purging means it will not be written in the new SSTable created by the compaction process. The main criterion for purging is that either gc_grace_seconds OR the TTL has expired on the column.

    After compaction completes it writes the -Compacted file for each SSTable that was compacted. But there is a bunch of logic associated with which files are compacted; it's not "compact the oldest 3 files". Basically it tries to compact files which are about the same size.

    Remember we *never* modify data on disk. If we want to remove data from an SSTable we have to write a new SSTable. It's one of the reasons writes are fast: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

    Hope that helps.

    -----------------
    Aaron Morton
    Freelance Cassandra Developer
    @aaronmorton
    http://www.thelastpickle.com
    On 28/09/2011, at 10:16 AM, hiroyuki.watanabe@barclayscapital.com wrote:


    Now we use a TTL of 12 hours and a GC grace period of 8 hours to encourage Cassandra to remove old data/files more aggressively.

    Cassandra does remove a fair amount of old data files.
    Cassandra tends to remove 4 out of every 5 files.
    I noticed it because each data file has a sequence number as part of its name.

    I also noticed that when Cassandra generated *-Compacted files it generated 4 files at a time.
    They have consecutive numbers in their file names, but skip one number from the previous group of 4.
    The one missing is the file that fails to be removed in the end and stays forever.

    I looked at the keys in an index file that failed to be removed. If I query any of those keys, Cassandra indicates that there is no data, which is correct because these files are older than 24 hours. All the data must be obsolete due to TTL.

    I am wondering why Cassandra does not remove all data files whose timestamp is much older than TTL + grace period.

    Does anybody have similar experience ?


    Yuki Watanabe



    -----Original Message-----
    From: Watanabe, Hiroyuki: IT (NYK)
    Sent: Friday, September 02, 2011 9:01 AM
    To: user@cassandra.apache.org
    Subject: RE: Removal of old data files


    I see. Thank you for helpful information

    Yuki



    -----Original Message-----
    From: Sylvain Lebresne
    Sent: Friday, September 02, 2011 3:40 AM
    To: user@cassandra.apache.org
    Subject: Re: Removal of old data files
    On Fri, Sep 2, 2011 at 12:11 AM, wrote:
    Yes, I see files with names like
    Orders-g-6517-Compacted

    However, all of those files have a size of 0.

    From Monday to Thursday we have 5642 files for -Data.db,
    -Filter.db and -Statistics.db and only 128 -Compacted files,
    and all of the -Compacted files have a size of 0.

    Is this normal, or are we doing something wrong?
    You are not doing something wrong. The -Compacted files are just markers, to indicate that the corresponding -Data file (with the same number) has, in fact, been compacted and will eventually be removed. So those files will always have a size of 0.
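
To see which data files are already marked for removal, the markers can be paired with their data files by name. This is a small sketch that assumes a data file is named like its marker with -Data.db in place of -Compacted (for example Orders-g-6517-Data.db); the function name is hypothetical.

```python
def compacted_data_files(names):
    """Return the -Data.db files that have a matching -Compacted marker."""
    marker_suffix = "-Compacted"
    data_suffix = "-Data.db"
    # Prefixes such as "Orders-g-6517" taken from the marker file names.
    prefixes = {n[: -len(marker_suffix)] for n in names if n.endswith(marker_suffix)}
    return sorted(
        n for n in names
        if n.endswith(data_suffix) and n[: -len(data_suffix)] in prefixes
    )


if __name__ == "__main__":
    listing = [
        "Orders-g-6517-Data.db",
        "Orders-g-6517-Compacted",
        "Orders-g-6517-Filter.db",
        "Orders-g-6518-Data.db",
    ]
    print(compacted_data_files(listing))
```

Data files without a marker, such as Orders-g-6518-Data.db in this listing, are still live and will not be deleted by a GC.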

    --
    Sylvain

    yuki

    ________________________________
    From: aaron morton
    Sent: Thursday, August 25, 2011 6:13 PM
    To: user@cassandra.apache.org
    Subject: Re: Removal of old data files

    If Cassandra does not have enough disk space to create a new file it
    will provoke a JVM GC, which should result in compacted SSTables that
    are no longer needed being deleted. Otherwise they are deleted at some
    time in the future.
    Compacted SSTables have a file written out with a "compacted" extension.
    Do you see compacted sstables in the data directory?
    Cheers.
    -----------------
    Aaron Morton
    Freelance Cassandra Developer
    @aaronmorton
    http://www.thelastpickle.com
    On 26/08/2011, at 2:29 AM, yuki watanabe wrote:

    We are using Cassandra 0.8.0 with an 8-node ring and only one CF.
    Every column has a TTL of 86400 (24 hours). We also set 'GC grace
    seconds' to 43200
    (12 hours). We have to store a massive amount of data for one day now
    and eventually for five days if we get more disk space.
    Even for one day, we do run out of disk space on a busy day.

    We run the nodetool compact command at night or as necessary, then we run
    GC from jconsole. We observed that GC did remove files but not
    necessarily the oldest ones.
    Data files from more than 36 hours ago and quite often three days ago
    are still there.

    Is this behavior expected, or do we need to adjust some other parameters?


    Yuki Watanabe


Discussion Overview
group: user @
categories: cassandra
posted: Aug 25, '11 at 2:30p
active: Sep 28, '11 at 9:21p
posts: 9
users: 4
website: cassandra.apache.org
irc: #cassandra
