We got a bunch of servers from Penguin with Supermicro H8QG6 motherboards.
We put a 3 TB drive in each for additional workspace for the users. Some of
the drives won't read at all; others will go for weeks, then spit out DRDY
errors. lshw shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA.

I did notice that it shows
*-storage
description: SATA controller
product: SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]
vendor: ATI Technologies Inc
<snip>
width: 32 bits
^^^^^^^^^^^^^^
clock: 66MHz
^^^^^^^^^^^^
capabilities: storage pm ahci_1.0 bus_master cap_list
From /var/log/dmesg:
pci 0000:00:0d.0: PME# supported from D0 D3hot D3cold
pci 0000:00:0d.0: PME# disabled
pci 0000:00:11.0: reg 10 io port: [0xd000-0xd007]
pci 0000:00:11.0: reg 14 io port: [0xc000-0xc003]
pci 0000:00:11.0: reg 18 io port: [0xb000-0xb007]
pci 0000:00:11.0: reg 1c io port: [0xa000-0xa003]
pci 0000:00:11.0: reg 20 io port: [0x9000-0x900f]
pci 0000:00:11.0: reg 24 32bit mmio: [0xdfefa400-0xdfefa7ff]
<...>
ahci 0000:00:11.0: version 3.0
alloc irq_desc for 22 on node 0
alloc kstat_irqs on node 0
ahci 0000:00:11.0: PCI INT A -> GSI 22 (level, low) -> IRQ 22
ahci 0000:00:11.0: AHCI 0001.0100 32 slots 4 ports 3 Gbps 0xf impl SATA mode
ahci 0000:00:11.0: flags: 64bit ncq sntf ilck pm led clo pmp pio slum part ccc
<...>
ata1: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa500 irq 22
ata2: SATA max UDMA/133 abar m1024 at 0xdfefa400 port 0xdfefa580 irq 22

I've included the above because the controller shows a 32bit mmio region
while the driver reports the 64bit flag; note also the clock speed for the
controller.

Now, I've been working on one of these servers with Penguin. One thing I
noticed was that the controller was set to native IDE. After googling, I saw
that the most recent spec, which included EIDE, should be good to
petabytes... but I tried resetting it to AHCI anyway.
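
For reference, the "good to petabytes" figure matches 48-bit sector addressing, while a 3 TB drive already exceeds what 28 or 32 bits of sector address can reach. The sketch below is just that arithmetic. It assumes 512-byte logical sectors, which the thread does not state, and the function names are made up for illustration.

```python
SECTOR_BYTES = 512  # assumed logical sector size; not stated in the thread


def max_capacity_bytes(lba_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity addressable with lba_bits bits of sector address."""
    return (1 << lba_bits) * sector_bytes


def sectors_needed(capacity_bytes, sector_bytes=SECTOR_BYTES):
    """Number of sectors on a drive of the given size, rounded up."""
    return -(-capacity_bytes // sector_bytes)


if __name__ == "__main__":
    three_tb = 3 * 10**12  # marketing terabytes
    for bits in (28, 32, 48):
        cap = max_capacity_bytes(bits)
        print(f"{bits}-bit LBA: {cap:,} bytes, covers 3 TB: {cap >= three_tb}")
    print(f"3 TB needs {sectors_needed(three_tb):,} sectors")
```

Under that assumption, 48 bits gives 2^57 bytes (about 144 PB), 32 bits gives about 2.2 TB, and a 3 TB drive needs more than 2^32 sectors, so any layer that truncates sector numbers to 32 bits could not reach the whole disk.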

The user ran one job, ok... then another last night, and it's spitting the
same errors.

In /var/log/messages, I see JBD: detected IO errors while flushing file data:
Mar 7 00:53:28 <server> kernel: ata2.00: exception Emask 0x0 SAct 0x3 SErr 0x0 action 0x6 frozen
Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Mar 7 00:53:28 <server> kernel: ata2.00: cmd 61/08:00:72:4a:a4/00:00:ae:00:00/40 tag 0 ncq 4096 out
Mar 7 00:53:28 <server> kernel: res 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout)
Mar 7 00:53:28 <server> kernel: ata2.00: status: { DRDY }
<...>
Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
Mar 7 00:53:33 <server> kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 7 00:53:33 <server> kernel: ata2.00: configured for UDMA/133
Mar 7 00:53:33 <server> kernel: ata2.00: device reported invalid CHS sector 0
Mar 7 00:53:33 <server> kernel: ata2: EH complete

Notice the "device reported invalid CHS sector 0". The drive does have a
GPT rather than an MBR.
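
The failed command in the log above can be decoded to see which sector the write was aimed at. This is a minimal sketch, assuming the usual libata layout for the "cmd" string (command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device) and the NCQ convention that the sector count travels in the feature fields and the tag in the upper bits of nsect. The function and key names are hypothetical.

```python
def decode_taskfile(tf):
    """Decode a libata 'cmd'/'res' taskfile string into named fields."""
    cmd, low, high, dev = tf.split("/")
    feature, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    hob_feature, hob_nsect, hob_lbal, hob_lbam, hob_lbah = (
        int(x, 16) for x in high.split(":")
    )
    # 48-bit LBA: the "hob" (high order byte) fields carry bits 24..47.
    lba = (
        hob_lbah << 40 | hob_lbam << 32 | hob_lbal << 24
        | lbah << 16 | lbam << 8 | lbal
    )
    return {
        "command": int(cmd, 16),
        "lba": lba,
        # Only meaningful for NCQ commands such as WRITE FPDMA QUEUED (0x61).
        "ncq_sectors": hob_feature << 8 | feature,
        "ncq_tag": nsect >> 3,
        "device": int(dev, 16),
    }


if __name__ == "__main__":
    tf = decode_taskfile("61/08:00:72:4a:a4/00:00:ae:00:00/40")
    print(f"command 0x{tf['command']:02x}, LBA {tf['lba']}, "
          f"{tf['ncq_sectors']} sectors, tag {tf['ncq_tag']}")
```

For this log line the sketch gives LBA 2930002546 and 8 sectors, which agrees with the "ncq 4096 out" in the same line if sectors are 512 bytes. That LBA is below 2^32, so this particular write was not past the 32-bit sector boundary; that one data point does not rule an addressing problem in or out.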

So, has anyone else seen similar problems, or have some suggestions as to
something I can try? Penguin's still waiting for a response from
Supermicro, and has escalated....

mark


  • Peter Kjellström at Mar 7, 2012 at 11:43 am

    On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
    > Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a 3tb
    > drive in for additional workspace for the users, and some of them won't
    > read, others will go for weeks, then spit out DRDY errors. lshw shows the
    > controller as an ATI SB7x0/SB8x0/SB9x0 SATA. ...
    > Now, I've been working on one with Penguin. I noticed one thing, that it
    > was set to native IDE. After googling, I saw that the most recent spec,
    > which included EIDE, should be good to petabytes... but I tried resetting
    > it to AHCI anyway.
    >
    > The user ran one job, ok... then another last night, and it's spitting the
    > same errors. ...
    > Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA QUEUED ...
    > 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) ...
    > Mar 7 00:53:28 <server> kernel: ata2: hard resetting link

    While writing, the drive timed out, and the link to it was then subjected
    to a hard reset. This is not normal and usually points to a bad drive or
    buggy firmware.

    Have you had a look at the SMART data for the drive(s)? (You may want to
    run the SMART self-tests.)

    Also, I'd suggest you test it in a controlled environment. For example, can
    any of your drives survive a full surface write? (dd if=/dev/zero bs=1M of=..)
    Full surface read? Do the tests against /dev/sdX to be sure (excludes
    partitioning, filesystems, volume management, etc.)

    Do note that writing your drive full of zeros _will_ destroy your data (I
    really hope that's stating the obvious...).
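
The non-destructive half of that suggestion, the full surface read, can be sketched as a plain sequential read that stops at the first I/O error. This is a rough illustration only: read_scan is a made-up name, /dev/sdX is the same placeholder used above, and the reads go through the kernel's normal I/O path rather than anything drive-specific.

```python
def read_scan(path, chunk_bytes=1024 * 1024):
    """Read path from start to end in 1 MiB chunks, never writing.

    Returns (bytes_read, error), where error is None if the end was
    reached cleanly or the OSError raised by the first failed read.
    """
    total = 0
    # buffering=0 gives raw, unbuffered reads at the Python level.
    with open(path, "rb", buffering=0) as dev:
        while True:
            try:
                chunk = dev.read(chunk_bytes)
            except OSError as exc:
                return total, exc
            if not chunk:  # empty read means end of device or file
                return total, None
            total += len(chunk)


if __name__ == "__main__":
    import sys

    done, err = read_scan(sys.argv[1])  # e.g. /dev/sdX, run as root
    print(f"read {done} bytes, error: {err}")
```

This is roughly what dd if=/dev/sdX of=/dev/null bs=1M does. If a read fails, the byte offset divided by the sector size gives a sector number to compare against the kernel log.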

    /Peter
  • Mark Roth at Mar 7, 2012 at 1:16 pm

    Peter Kjellström wrote:
    > On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
    >> Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
    >> 3tb drive in for additional workspace for the users, and some of them
    >> won't read, others will go for weeks, then spit out DRDY errors. lshw
    >> shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA. ...
    >> Now, I've been working on one with Penguin. I noticed one thing, that it
    >> was set to native IDE. After googling, I saw that the most recent spec,
    >> which included EIDE, should be good to petabytes... but I tried
    >> resetting it to AHCI anyway.
    >>
    >> The user ran one job, ok... then another last night, and it's spitting
    >> the same errors. ...
    >> Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA
    >> QUEUED ...
    >> 40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) ...
    >> Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
    > While writing the drive timed out and the link to it was then subjected to
    > a hard reset. This is not normal and usually points to bad drive or buggy
    > firmware.
    >
    > Have you had a look at smartdata for the drive(s)? (you may want to run
    > the smart selftests)
    >
    > Also, I'd suggest you test it in a controlled environment. For example,
    > can any of your drives survive a full surface write? (dd if=/dev/zero
    > bs=1M of=..)
    > Full surface read? Do the tests against /dev/sdX to be sure (excludes
    > partitioning, filesystems, volume management, etc.)
    >
    > Do note that writing your drive full of zeros _will_ destroy your data (I
    > really hope that's stating the obvious...).

    <g>
    Of course. Nahhh... I've run bonnie++ against it, but couldn't provoke it.
    It only hits with this one user, who runs *large* jobs with big output.

    smartctl - I ran the short test just before lunch, and smartctl -H reports
    it passed, completed without errors.

    I saw that it timed out. One of the reasons I included some of the stuff
    above was this line:
    kernel: ata2.00: device reported invalid CHS sector 0

    Also, I noticed that lshw showed the ATI controller having a width of 32
    bits, and a clock of 66MHz, and wondered if there could be some sort of
    slip-through-the-cracks where the driver didn't handle this correctly.

    mark

Discussion Overview
group: centos
categories: centos
posted: Mar 7, '12 at 11:17a
active: Mar 7, '12 at 1:16p
posts: 3
users: 2
website: centos.org
irc: #centos
