Hello all,

I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
backups. The backups are run via rsync/rsnapshot and are large in terms of
the number of files: over 10 million each.

Now the machine is not particularly powerful: it is a 64-bit machine with a
dual-core CPU and 3 GB RAM. So perhaps this is a factor in the following
problem: once in a while that XFS partition starts generating multiple I/O
errors, files that had content become 0 bytes, directories disappear, etc.
Every time, a reboot fixes it. So far I've looked at the logs but could not
find a cause or precipitating event.

Hence the question: has anyone experienced anything along those lines? What
could be the cause of this?

Thanks.

Boris.


  • Boris Epstein at Jan 22, 2012 at 10:00 am

    On Sun, Jan 22, 2012 at 9:06 AM, Boris Epstein wrote:

    Hello all,

    I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
    backups. The backups are run via rsync/rsnapshot and are large in terms of
    the number of files: over 10 million each.

    Now the machine is not particularly powerful: it is a 64-bit machine with a
    dual-core CPU and 3 GB RAM. So perhaps this is a factor in the following
    problem: once in a while that XFS partition starts generating multiple I/O
    errors, files that had content become 0 bytes, directories disappear, etc.
    Every time, a reboot fixes it. So far I've looked at the logs but could not
    find a cause or precipitating event.

    Hence the question: has anyone experienced anything along those lines?
    What could be the cause of this?

    Thanks.

    Boris.
    Correction to the above: the XFS partition is 26TB, not 16 TB (not that it
    should matter in the context of this particular situation).

    Also, here's something else I have discovered. Apparently there is potential
    intermittent RAID disk trouble. At least I found the following in
    the system log:

    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
    Drive ECC error reported:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
    Source drive error occurred:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
    Rebuild failed:unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
    Rebuild paused:unit=0.

    ...

    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
    Rebuild started:unit=0.
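
    If the card is a 3ware 9xxx, smartmontools can talk to the drives behind
    the controller directly, which is one way to check whether the drive on
    port 4 (and the one tripping the SMART threshold on port 9) is actually
    failing. A minimal sketch, assuming the card appears as /dev/twa0 and as
    controller /c6 in tw_cli - both are assumptions, "tw_cli show" lists the
    real controller number:

    # unit and per-port status as the controller sees it
    tw_cli /c6 show
    tw_cli /c6/u0 show

    # per-drive SMART data through the controller; the port numbers are
    # taken from the AEN messages above
    smartctl -a -d 3ware,4 /dev/twa0
    smartctl -a -d 3ware,9 /dev/twa0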

    Even if a disk is misbehaving, in a RAID6 that should not be causing I/O
    errors. Plus, why does it never happen straight after a reboot, and why is
    it always fixed by a reboot?

    Be that as it may, I am still puzzled.

    Boris.
  • Keith Keller at Jan 22, 2012 at 1:34 pm

    On 2012-01-22, Boris Epstein wrote:
    Also, here's something else I have discovered. Apparently there is potential
    intermittent RAID disk trouble. At least I found the following in
    the system log:

    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
    Drive ECC error reported:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
    Source drive error occurred:port=4, unit=0.
    Which 3ware controller is this? I have had lots of problems with the
    3ware 9550SX controller and WD-EA[RD]S drives in a similar
    configuration. (Yes, I know all about the EARS drives, but they work
    mostly fine with the 3ware 9650 controller, so I suspect some weird
    interaction between the cheap drives and the old not-so-great
    controller. I also suspect an intermittently failing port, which I'll
    be testing more later this week.)
    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
    Rebuild started:unit=0.
    What does your RAID look like? Are you using the 3ware's RAID6 (in
    which case it's not a 9550) or mdraid? Are the 3ware errors in the logs
    across a large number of ports or just a few? Have you used the drive
    tester for your drives to verify that they're still good? On all my
    other systems, when the controller has reported a failure, and I've run
    it through the tester, it's reported a failure. (Often when my 9550
    reports a failure the drive passes all tests.)
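
    One quick way to tell whether the errors are spread across many ports or
    concentrated on one or two is to count the port numbers in the kernel
    log. A rough sketch, assuming the default /var/log/messages location:

    grep '3w-9xxx' /var/log/messages | grep -o 'port=[0-9]*' | sort | uniq -c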

    If you happen to have real RAID drive models, you may also try
    contacting LSI support. They will steadfastly refuse to help if you
    have desktop-edition drives, but can be at least somewhat helpful if you
    have enterprise drives.

    --keith


    --
    kkeller at wombat.san-francisco.ca.us
  • Boris Epstein at Jan 22, 2012 at 5:36 pm

    On Sun, Jan 22, 2012 at 1:34 PM, Keith Keller wrote:
    On 2012-01-22, Boris Epstein wrote:

    Also, here's something else I have discovered. Apparently there is potential
    intermittent RAID disk trouble. At least I found the following in
    the system log:

    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
    Drive ECC error reported:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
    Source drive error occurred:port=4, unit=0.
    Which 3ware controller is this? I have had lots of problems with the
    3ware 9550SX controller and WD-EA[RD]S drives in a similar
    configuration. (Yes, I know all about the EARS drives, but they work
    mostly fine with the 3ware 9650 controller, so I suspect some weird
    interaction between the cheap drives and the old not-so-great
    controller. I also suspect an intermittently failing port, which I'll
    be testing more later this week.)
    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:55:23 nrims-bs kernel: 3w-9xxx: scsi6: AEN: WARNING
    (0x04:0x000F): SMART threshold exceeded:port=9.
    Jan 22 09:56:17 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x000B):
    Rebuild started:unit=0.
    What does your RAID look like? Are you using the 3ware's RAID6 (in
    which case it's not a 9550) or mdraid? Are the 3ware errors in the logs
    across a large number of ports or just a few? Have you used the drive
    tester for your drives to verify that they're still good? On all my
    other systems, when the controller has reported a failure, and I've run
    it through the tester, it's reported a failure. (Often when my 9550
    reports a failure the drive passes all tests.)

    If you happen to have real RAID drive models, you may also try
    contacting LSI support. They will steadfastly refuse to help if you
    have desktop-edition drives, but can be at least somewhat helpful if you
    have enterprise drives.

    --keith


    --
    kkeller at wombat.san-francisco.ca.us


    Keith, thanks!

    The RAID is on the controller level. Yes, I believe the controller is a
    3Ware 9xxx series - I don't recall the details right now.

    What are you referring to as "drive tester"?

    Boris.
  • Keith Keller at Jan 22, 2012 at 8:45 pm

    On 2012-01-22, Boris Epstein wrote:
    The RAID is on the controller level. Yes, I believe the controller is a
    3Ware 9xxx series - I don't recall the details right now.
    The details are important in this context--the 9550 is the problematic
    one (at least for me, though I've seen others with similar issues). But
    if it's a hardware RAID6, it's a later controller, as the 9550 doesn't
    support RAID6. I have had some issues with the WD-EARS drives with 96xx
    controllers, but much less frequently.
    What are you referring to as "drive tester"?
    Some drive vendors distribute their own bootable CD image, with which
    you can run tests specific to their drives, which can return proper
    error codes to help determine whether there is actually a problem on the
    drive. Seagate used to require you give them the diagnostic code their
    tester returned in order for them to accept a drive for an RMA; I don't
    think they do that any more, but they still distribute their tester.
    But it's a good way to get another indicator of a problem; if both the
    controller and the drive tester report an error, it's very likely that
    you have a bad drive; if the tester says the drive is fine, and does
    this for a few drives the controller reports as failed, you can suspect
    something behind the drives as a problem. (This is how I came to
    suspect the 9550: it would say my drives had failed, but the WD tester
    repeatedly said they were fine.)

    The latest version of UBCD has the latest versions of these various
    testers; I recall WD, Seagate, and Hitachi testers, and I'm pretty sure
    there are others.
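
    If taking the box down to boot a vendor CD is inconvenient, smartmontools
    can also kick off the drive's own built-in self-test through the 3ware
    controller while the system stays up. A minimal sketch, assuming the card
    is /dev/twa0 and the suspect drive is on port 9 (both assumptions taken
    from the log excerpt earlier in the thread):

    # start the drive's long (extended) self-test in the background
    smartctl -t long -d 3ware,9 /dev/twa0

    # check the outcome later in the self-test log
    smartctl -l selftest -d 3ware,9 /dev/twa0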

    --keith

    --
    kkeller at wombat.san-francisco.ca.us
  • Miguel Medalha at Jan 22, 2012 at 2:27 pm

    Correction to the above: the XFS partition is 26TB, not 16 TB (not that it
    should matter in the context of this particular situation).
    Yes, it does matter:

    Read this:

    *[CentOS] 32-bit kernel+XFS+16.xTB filesystem = potential disaster*
    http://lists.centos.org/pipermail/centos/2011-April/109142.html
  • Boris Epstein at Jan 22, 2012 at 2:29 pm

    On Sun, Jan 22, 2012 at 2:27 PM, Miguel Medalha wrote:

    Correction to the above: the XFS partition is 26TB, not 16 TB (not that it
    should matter in the context of this particular situation).
    Yes, it does matter:

    Read this:

    *[CentOS] 32-bit kernel+XFS+16.xTB filesystem = potential disaster*
    http://lists.centos.org/pipermail/centos/2011-April/109142.html
    Miguel,

    Thanks, but based on the uname output:

    uname -a
    Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST 2011
    x86_64 x86_64 x86_64 GNU/Linux

    this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.

    Boris.
  • Miguel Medalha at Jan 22, 2012 at 2:32 pm

    uname -a
    Linux nrims-bs 2.6.18-274.12.1.el5xen #1 SMP Tue Nov 29 14:18:21 EST
    2011 x86_64 x86_64 x86_64 GNU/Linux

    this is clearly a 64-bit OS so the 32-bit limitations ought not to apply.
    Ok! Since you didn't inform us in your initial post, I thought I should
    ask you in order to eliminate that possible cause.
  • Miguel Medalha at Jan 22, 2012 at 2:35 pm
    Nevertheless, it seems to me that you should have more than 3GB of RAM
    on a 64 bit system...
    Since the width of the binary word is 64 bit in this case, 3GB
    correspond to 1.5GB on a 32 bit system...
    If you have a 64 bit system you should give it space to work properly.
  • Miguel Medalha at Jan 22, 2012 at 2:37 pm

    Nevertheless, it seems to me that you should have more than 3GB of RAM
    on a 64 bit system...
    Since the width of the binary word is 64 bit in this case, 3GB
    correspond to 1.5GB on a 32 bit system...
    If you have a 64 bit system you should give it space to work properly.
    ... and the fact that a reboot seems to fix the problem could also point
    in that direction.
  • Boris Epstein at Jan 22, 2012 at 2:42 pm

    On Sun, Jan 22, 2012 at 2:37 PM, Miguel Medalha wrote:
    Nevertheless, it seems to me that you should have more than 3GB of RAM
    on a 64 bit system...
    Since the width of the binary word is 64 bit in this case, 3GB
    correspond to 1.5GB on a 32 bit system...
    If you have a 64 bit system you should give it space to work properly.
    ... and the fact that a reboot seems to fix the problem could also point
    in that direction.
    That is entirely possible. It does seem to me that some sort of resource
    accumulation is indeed occurring on the system - and I hope there is a way
    to stop that, because filesystem I/O should be a self-balancing process.

    Boris.
  • Boris Epstein at Jan 22, 2012 at 2:40 pm

    On Sun, Jan 22, 2012 at 2:35 PM, Miguel Medalha wrote:
    Nevertheless, it seems to me that you should have more than 3GB of RAM on
    a 64 bit system...
    Since the width of the binary word is 64 bit in this case, 3GB correspond
    to 1.5GB on a 32 bit system...
    If you have a 64 bit system you should give it space to work properly.
    Don't worry, you asked exactly the right question - but, unfortunately, a
    32-bit OS is not the culprit here, so the situation is more involved than
    that.

    You are right - it would indeed be desirable to have more than 3 GB of RAM
    on that system. However, it is not obvious to me why having that little RAM
    should cause I/O failures. That it would make the machine slow is to be
    expected - especially so given that I had to jack the swap up to some
    40 GB. But I do not see why I should have outright failures due solely to
    not having more RAM.

    Boris.
  • Miguel Medalha at Jan 22, 2012 at 2:43 pm

    You are right - it would indeed be desirable to have more than 3 GB of
    RAM on that system. However, it is not obvious to me why having that
    little RAM should cause I/O failures. That it would make the machine
    slow is to be expected - especially so given that I had to jack the
    swap up to some 40 GB. But I do not see why I should have outright
    failures due solely to not having more RAM.
    If I were you, I would be monitoring the system's memory usage. Maybe
    some software component has a memory leak which keeps worsening until a
    reboot cleans it.
    Also, I wouldn't discard the possibility of a physical memory problem.
    Can you test it?
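
    A minimal sketch of what that monitoring could look like with the stock
    CentOS 5 tools; the slab numbers are worth including here because the
    dentry and XFS inode caches can grow very large on a filesystem holding
    tens of millions of files:

    # overall memory/swap trend, sampled every 60 seconds
    vmstat 60

    # largest resident processes
    ps aux --sort=-rss | head -15

    # kernel slab caches - watch dentry and xfs_inode growth over time
    slabtop -o | head -25

    For the physical-memory test, memtest86+ is available from the CentOS
    install media boot menu, but it does require taking the machine offline.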
  • Boris Epstein at Jan 22, 2012 at 2:53 pm

    On Sun, Jan 22, 2012 at 2:43 PM, Miguel Medalha wrote:

    You are right - it would indeed be desirable to have more than 3 GB of
    RAM on that system. However, it is not obvious to me why having that little
    RAM should cause I/O failures. That it would make the machine slow is to be
    expected - especially so given that I had to jack the swap up to some
    40 GB. But I do not see why I should have outright failures due solely to
    not having more RAM.
    If I were you, I would be monitoring the system's memory usage. Maybe some
    software component has a memory leak which keeps worsening until a reboot
    cleans it.
    Also, I wouldn't discard the possibility of a physical memory problem. Can
    you test it?
    Miguel, thanks!

    All that you are saying makes perfect sense. I have tried monitoring the
    system to see if any memory hogs emerge and found no obvious culprits so
    far. I.e., there are processes running that consume large volumes of RAM,
    but none that seem to keep growing over time. Or at least I have failed to
    locate such processes thus far.

    As for testing the RAM - it is always a good test when in doubt. Too bad
    you have to stop your machine in order to do it and for that reason I
    haven't done it yet. Though this is on the short list of things to try.

    Boris.
  • Ross Walker at Jan 22, 2012 at 4:41 pm

    On Jan 22, 2012, at 10:00 AM, Boris Epstein wrote:

    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
    Drive ECC error reported:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
    Source drive error occurred:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
    Rebuild failed:unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
    Rebuild paused:unit=0.
    From 3ware's site:
    004h Rebuild failed

    The 3ware RAID controller was unable to complete a rebuild operation. This error can be caused by drive errors on either the source or the destination of the rebuild. However, due to ATA drives' ability to reallocate sectors on write errors, the rebuild failure is most likely caused by the source drive of the rebuild detecting some sort of read error. The default operation of the 3ware RAID controller is to abort a rebuild if an error is encountered. If it is desired to continue on error, you can set the Continue on Source Error During Rebuild policy for the unit on the Controller Settings page in 3DM.

    026h Drive ECC error reported

    This AEN may be sent when a drive returns the ECC error response to an 3ware RAID controller command. The AEN may or may not be associated with a host command. Internal operations such as Background Media Scan post this AEN whenever drive ECC errors are detected.

    Drive ECC errors are an indication of a problem with grown defects on a particular drive. For redundant arrays, this typically means that dynamic sector repair would be invoked (see AEN 023h). For non-redundant arrays (JBOD, RAID 0 and degraded arrays), drive ECC errors result in the 3ware RAID controller returning failed status to the associated host command.


    Sounds awfully like a hardware error on one of the drives. Replace the failed drive and try rebuilding.
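
    A hedged sketch of how that might look from the command line with tw_cli,
    assuming the card is controller /c6 and the failing drive is on port 4
    (both assumptions; the AEN messages above point at port 4, unit 0). The
    exact syntax differs slightly between tw_cli versions, so check
    "tw_cli /c6 help" first:

    tw_cli /c6 show              # unit state, rebuild %, per-port status
    tw_cli /c6/p4 show           # details for the suspect drive
    tw_cli /c6/u0 start verify   # surface scan of the unit once it is healthy

    # reportedly the CLI equivalent of the "Continue on Source Error During
    # Rebuild" policy quoted above - verify against your CLI guide first:
    # tw_cli /c6/u0 set ignoreECC=on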

    -Ross
  • Ross Walker at Jan 22, 2012 at 4:44 pm

    On Jan 22, 2012, at 4:41 PM, Ross Walker wrote:
    On Jan 22, 2012, at 10:00 AM, Boris Epstein wrote:

    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0026):
    Drive ECC error reported:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x002D):
    Source drive error occurred:port=4, unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: ERROR (0x04:0x0004):
    Rebuild failed:unit=0.
    Jan 22 09:17:53 nrims-bs kernel: 3w-9xxx: scsi6: AEN: INFO (0x04:0x003B):
    Rebuild paused:unit=0.
    From 3ware's site:
    004h Rebuild failed

    The 3ware RAID controller was unable to complete a rebuild operation. This error can be caused by drive errors on either the source or the destination of the rebuild. However, due to ATA drives' ability to reallocate sectors on write errors, the rebuild failure is most likely caused by the source drive of the rebuild detecting some sort of read error. The default operation of the 3ware RAID controller is to abort a rebuild if an error is encountered. If it is desired to continue on error, you can set the Continue on Source Error During Rebuild policy for the unit on the Controller Settings page in 3DM.

    026h Drive ECC error reported

    This AEN may be sent when a drive returns the ECC error response to an 3ware RAID controller command. The AEN may or may not be associated with a host command. Internal operations such as Background Media Scan post this AEN whenever drive ECC errors are detected.

    Drive ECC errors are an indication of a problem with grown defects on a particular drive. For redundant arrays, this typically means that dynamic sector repair would be invoked (see AEN 023h). For non-redundant arrays (JBOD, RAID 0 and degraded arrays), drive ECC errors result in the 3ware RAID controller returning failed status to the associated host command.

    Sounds awfully like a hardware error on one of the drives. Replace the failed drive and try rebuilding.
    This error code does not bode well.

    02Dh Source drive error occurred

    If an error is encountered during a rebuild operation, this AEN is generated if the error was on a source drive of the rebuild. Knowing if the error occurred on the source or the destination of the rebuild is useful for troubleshooting.



    It's possible the whole RAID6 is corrupt.

    -Ross
  • Miguel Medalha at Jan 22, 2012 at 2:23 pm

    Now the machine is not particularly powerful: it is a 64-bit machine with a
    dual-core CPU and 3 GB RAM. So perhaps this is a factor in the following
    problem: once in a while that XFS partition starts generating multiple I/O
    errors, files that had content become 0 bytes, directories disappear, etc.
    Every time, a reboot fixes it. So far I've looked at the logs but could not
    find a cause or precipitating event.
    Is the CentOS you are running a 64 bit one?

    The reason I am asking this is because the use of XFS under a 32 bit OS
    is NOT recommended.
    If you search this list's archives you will find some discussion about
    this subject.
  • Joseph L. Casale at Jan 22, 2012 at 2:56 pm

    I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
    backups. The backups are run via rsync/rsnapshot and are large in terms of
    the number of files: over 10 million each.

    Now the machine is not particularly powerful: it is a 64-bit machine with a
    dual-core CPU and 3 GB RAM. So perhaps this is a factor in the following
    problem: once in a while that XFS partition starts generating multiple I/O
    errors, files that had content become 0 bytes, directories disappear, etc.
    Every time, a reboot fixes it. So far I've looked at the logs but could not
    find a cause or precipitating event.

    Hence the question: has anyone experienced anything along those lines? What
    could be the cause of this?
    In every situation like this that I have seen, it was hardware that never had
    adequate memory provisioned.

    Another consideration is you almost certainly won't be able to run a repair
    on that fs with so little RAM.

    Finally, it would be interesting to know how you architected the storage
    hardware: hardware RAID, BBU, drive cache status, barrier status, etc...
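
    For reference, most of that information can be collected without taking
    the box down. A rough sketch, assuming the filesystem is mounted at
    /backup and the card is controller /c6 in tw_cli (both assumptions):

    # XFS geometry and mount options
    xfs_info /backup
    grep xfs /proc/mounts

    # whether the kernel disabled write barriers on this device
    dmesg | grep -i barrier

    # controller unit settings (write cache) and BBU status, if one is fitted
    tw_cli /c6/u0 show
    tw_cli /c6/bbu show all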
  • Boris Epstein at Jan 22, 2012 at 3:05 pm

    On Sun, Jan 22, 2012 at 2:56 PM, Joseph L. Casale wrote:

    I have a CentOS 5.7 machine hosting a 16 TB XFS partition used to house
    backups. The backups are run via rsync/rsnapshot and are large in terms of
    the number of files: over 10 million each.

    Now the machine is not particularly powerful: it is a 64-bit machine with a
    dual-core CPU and 3 GB RAM. So perhaps this is a factor in the following
    problem: once in a while that XFS partition starts generating multiple I/O
    errors, files that had content become 0 bytes, directories disappear, etc.
    Every time, a reboot fixes it. So far I've looked at the logs but could not
    find a cause or precipitating event.

    Hence the question: has anyone experienced anything along those lines? What
    could be the cause of this?
    In every situation like this that I have seen, it was hardware that never had
    adequate memory provisioned.

    Another consideration is you almost certainly won't be able to run a repair
    on that fs with so little RAM.

    Finally, it would be interesting to know how you architected the storage
    hardware: hardware RAID, BBU, drive cache status, barrier status, etc...
    Joseph,

    If I remember correctly, I pretty much went with the defaults when I created
    this XFS on top of a 16-drive RAID6 configuration.

    Now as far as memory goes - I think for the purposes of XFS repair, RAM and
    swap ought to be equivalent. And I've got plenty of swap on this system. I
    also host a 5 TB XFS in a file there; I ran xfs_repair on it and it completed
    in no more than 5 minutes. That is roughly 20% of the size of the larger XFS.
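
    For what it's worth, xfs_repair memory use is driven mostly by inode
    count rather than raw filesystem size, so a 5 TB filesystem with
    relatively few files is not a linear predictor for a 26 TB one holding
    tens of millions of rsnapshot hardlinks. A hedged sketch of a low-risk
    first step, assuming the filesystem lives on /dev/sdb1 (an assumption)
    and is unmounted:

    # no-modify check: reports problems without writing anything
    xfs_repair -n /dev/sdb1

    # if memory is tight, disabling prefetch trades speed for a smaller
    # memory footprint
    xfs_repair -n -P /dev/sdb1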

    I should try to collect the info you mentioned, though - that was a good
    thought, some clue might be contained in there for sure.

    Thanks for your input.

    Boris.

Discussion Overview
group: centos
categories: centos
posted: Jan 22, '12 at 9:06a
active: Jan 22, '12 at 8:45p
posts: 19
users: 5
website: centos.org
irc: #centos
