Peter Kjellström wrote:
On Wednesday 07 March 2012 11.17.15 m.roth at 5-cent.us wrote:
Got a bunch of servers from Penguin. Supermicro m/b's H8QG6. We put a
3tb drive in for additional workspace for the users, and some of them
won't read, others will go for weeks, then spit out DRDY errors. lshw
shows the controller as an ATI SB7x0/SB8x0/SB9x0 SATA. ...
Now, I've been working on one with Penguin. I noticed one thing, that it
was set to native IDE. After googling, I saw that the most recent spec,
which included EIDE, should be good to petabytes... but I tried
resetting it to AHCI anyway.

The user ran one job, ok... then another last night, and it's spitting
the same errors. ...
Mar 7 00:53:28 <server> kernel: ata2.00: failed command: WRITE FPDMA
40/00:04:20:00:00/00:00:00:00:00/40 Emask 0x4 (timeout) ...
Mar 7 00:53:28 <server> kernel: ata2: hard resetting link
While writing, the drive timed out and the link to it was then subjected to
a hard reset. This is not normal and usually points to a bad drive or buggy ...

Have you had a look at the SMART data for the drive(s)? (You may want to run
the SMART self-tests.)

Also, I'd suggest you test it in a controlled environment. For example,
can any of your drives survive a full surface write? (dd if=/dev/zero
bs=1M of=..)
Full surface read? Do the tests against /dev/sdX to be sure (excludes
partitioning, filesystems, volume management, etc.)

Do note that writing your drive full of zeros _will_ destroy your data (I
really hope that's stating the obvious...).
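
The read half of that test could be sketched roughly as follows. This is only an illustration: surface_read is a made-up helper name, /dev/sdX is the same placeholder used above, and bs=1M assumes GNU dd as shipped with CentOS.

```shell
# Sketch: non-destructive full read of a block device (or any file).
# surface_read is a hypothetical helper; /dev/sdX is a placeholder.
surface_read() {
    # dd stops at the first read error and exits non-zero, so the exit
    # status says whether everything could be read.
    dd if="$1" of=/dev/null bs=1M
}

# Usage (as root): surface_read /dev/sdX && echo "read OK"
#
# The destructive write pass quoted above would be:
#   dd if=/dev/zero of=/dev/sdX bs=1M
# It ends with "No space left on device" once the disk is full; that
# message is expected, an I/O error is not.
```

Watching the kernel log in a second terminal while either pass runs should show whether the timeouts can be provoked without the user's jobs.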
Of course. Nahhh... I've run bonnie++ against it, but couldn't provoke it.
It's this one user, who runs *large* jobs, with big o/p, when it hits.

smartctl - I ran the short test just before lunch, and smartctl -H reports
it passed, completed without errors.
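
The short self-test only exercises a small part of the disk, so a follow-up along these lines may be worth a try. It is a sketch under assumptions: smart_deep_check and the SMARTCTL override are invented for illustration, and /dev/sdX is a placeholder. The smartctl options themselves (-t long, -A, -l error, -l selftest) are standard smartmontools flags.

```shell
# Sketch only. SMARTCTL is a hypothetical override so the commands can be
# previewed with SMARTCTL=echo; by default the real smartctl is used.
SMARTCTL=${SMARTCTL:-smartctl}

smart_deep_check() {
    dev=$1
    # Start the extended (long) self-test. It runs inside the drive and
    # can take hours on a 3 TB disk.
    "$SMARTCTL" -t long "$dev" || return 1
    # Vendor attributes: reallocated and pending sectors, CRC error counts.
    "$SMARTCTL" -A "$dev"
    # The drive's own error log, then the self-test history. Re-run the
    # last command once the long test has finished to see its result.
    "$SMARTCTL" -l error "$dev"
    "$SMARTCTL" -l selftest "$dev"
}

# Usage (as root): smart_deep_check /dev/sdX
```

A rising CRC error count alongside clean sector counts would point more toward the cable or controller side than toward the platters.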

I saw that it timed out. One of the reasons I included some of the log
output above was this message:
kernel: ata2.00: device reported invalid CHS sector 0
Also, I noticed that lshw showed the ATI controller having a width of 32
bits, and a clock of 66MHz, and wondered if there could be some sort of
slip-through-the-cracks where the driver didn't handle this correctly.

Discussion Overview
group: centos @
posted: Mar 7, '12 at 11:17a
active: Mar 7, '12 at 1:16p

2 users in discussion

Mark Roth: 2 posts, Peter Kjellström: 1 post