[CentOS] Root fs suddenly goes r/o
Out of the blue, dmesg on my HP Proliant w/ a SCSI disk gives loads of
messages like this one:

EXT3-fs error (device dm-0) in start_transaction: Journal has aborted

Then the root fs goes read-only, so little else can be done on the machine.
LVM locks up. At restart, the fs needs an fsck and then another reboot to
recover. The host starts up OK, then I am given a few more minutes before the
problem reappears. This is stock CentOS 4.4; I have never gotten to update it
because of this very same problem.

System logs say SCSI I/O error, but SMART says no problem has been found,
and neither does badblocks (run from a rescue CD boot). SCSI cabling,
terminator, etc. have been checked.

What should I investigate next? Is the disk condemned?
TIA

--
Eduardo Grosclaude
Universidad Nacional del Comahue
Neuquen, Argentina

  • Garrick Staples at Jul 11, 2007 at 11:13 pm

    On Wed, Jul 11, 2007 at 06:20:50PM -0300, Eduardo Grosclaude alleged:
    What should I investigate next? Is the disk condemned?
    Quite likely the drive is dying. If you want proof from SMART, a long
    self-test, something like 'smartctl -t long /dev/sda', will likely fail.
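
A minimal sketch of that check, assuming smartmontools is installed and the disk really is /dev/sda as in the example above. The function names are hypothetical wrappers, not part of smartctl. The long self-test runs on the drive in the background, so its result has to be read from the self-test log afterwards:

```shell
#!/bin/sh
# Hypothetical wrappers around smartctl (from smartmontools).
# The default device /dev/sda is an assumption; pass your own as $1.

# Start the extended (long) self-test; the drive runs it in the background.
start_long_selftest() {
    smartctl -t long "${1:-/dev/sda}"
}

# Print the self-test log, where a failed test would be recorded
# once the test has finished.
show_selftest_log() {
    smartctl -l selftest "${1:-/dev/sda}"
}

# Print the drive's overall health self-assessment.
show_health() {
    smartctl -H "${1:-/dev/sda}"
}
```

Run `start_long_selftest /dev/sda` as root, wait for the test to finish (it can take a long time on a large disk), then look at `show_selftest_log /dev/sda`.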

    --
    Garrick Staples, GNU/Linux HPCC SysAdmin
    University of Southern California

    Please avoid sending me Word or PowerPoint attachments.
    See http://www.gnu.org/philosophy/no-word-attachments.html
  • Stephen John Smoogen at Jul 11, 2007 at 11:18 pm

    On 7/11/07, Eduardo Grosclaude wrote:
    What should I investigate next? Is the disk condemned?
    SMART isn't foolproof. I have had disks that made clunking, scraping and
    spin-up noises and still got the SMART seal of approval. My normal
    checklist is the same as yours, but replacing each item rather than just
    inspecting it (in case that isn't what you meant by "checked").

    Replace:
    terminator
    SCSI cable
    controller
    disk drive

    though I usually do the disk drive first, then the controller.

    --
    Stephen J Smoogen. -- CSIRT/Linux System Administrator
    How far that little candle throws his beams! So shines a good deed
    in a naughty world. = Shakespeare. "The Merchant of Venice"
  • Garrick Staples at Jul 11, 2007 at 11:36 pm

    On Wed, Jul 11, 2007 at 05:18:35PM -0600, Stephen John Smoogen alleged:
    though I usually do disk drive then controller.
    In my experience the drive is, by far, the most likely to have problems.
    Personally I never suspect anything else until I've fully tested the
    drive.

  • Redhat at Jul 12, 2007 at 2:10 am
    We had exactly this problem on a bunch of 4.4 VMs under ESX 3.0.1. It could happen after 1 day or after 6 days, with no real pattern, except that it was related to the mpt SCSI driver timing out after 5 attempts to write to the target. I found on the VMware Technology Network forums that it was quite common, and it turned out there was an unofficial kernel patch from Red Hat to fix the problem. I applied it to my VMs and they never suffered again.
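
One way to see whether a box shows the same symptoms is to filter the kernel log for the relevant messages. A rough sketch under stated assumptions: the "EXT3-fs error" pattern comes from the message quoted at the top of this thread, while "I/O error" and the "mptbase"/"mptscsi" prefixes are only guesses at how the SCSI and mpt driver messages are worded, so adjust them to whatever your logs actually say. The function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helpers: read kernel log text on stdin and pick out lines
# that look related to the root fs going read-only.
# Only "EXT3-fs error" is taken from the thread; the other patterns
# are assumptions about the log wording.

# Print the matching lines.
filter_fs_errors() {
    grep -i -E 'EXT3-fs error|I/O error|mpt(base|scsi)'
}

# Print only the number of matching lines.
count_fs_errors() {
    grep -i -E -c 'EXT3-fs error|I/O error|mpt(base|scsi)'
}
```

For example, `dmesg | filter_fs_errors` shows the current kernel messages, and `count_fs_errors < /var/log/messages` gives a quick count from the system log.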

    I have since upgraded some of those boxes to 4.5 and the related *stock* kernel, and the problem appears to have gone away, so it looks like this has been fixed in 4.5 and you don't need to patch.

    If you can't upgrade to 4.5, you can look at the following article for options. NB: when I went looking for this stuff there was only a kernel patch, but according to the article it looks like they have streamlined that to just a kernel module build:



    http://kb.vmware.com/selfservice/microsites/search.do?cmd=displayKC&externalId=51306

    Hope this helps.

    Cheers,


    Brian.


