I have a few disks that have offline uncorrectable sectors.

I found on this page how to identify the sectors and force a write on them to
trigger the relocation of bad sectors on the disk:

http://smartmontools.sourceforge.net/BadBlockHowTo.txt
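
If I read it right, the procedure boils down to roughly this (assuming a
512-byte sector size, /dev/sda as an example device, and 1234567 standing in
for the LBA reported in the self-test log; the dd write of course destroys
whatever was stored in that one sector):

  smartctl -t long /dev/sda        # start the extended (long) self-test
  smartctl -l selftest /dev/sda    # read the LBA_of_first_error column from the log
  dd if=/dev/zero of=/dev/sda bs=512 count=1 seek=1234567   # overwrite only that sector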

My question is:

since I'm too lazy to follow the whole procedure, do you think that a forced
rewrite of the full disk would work?

Eg. "dd if=/dev/sda pf=/dev/sda bsQ2"

Should this be done at runlevel 1 or offline, or can I do it without too many
worries, since I'm reading and rewriting the same data on the disk?

TIA and sorry for the OT

Lorenzo Quatrini


  • Nate Amsden at Aug 22, 2008 at 3:59 pm

    Lorenzo Quatrini wrote:
    I have a few disks that have offline uncorrectable sectors.
    Ideally it should be done using the manufacturer's tools,
    and really any disk that has even one bad sector that the OS
    can see should not be relied upon; it should be considered a
    failed disk. Disks automatically keep spare sectors that the
    operating system cannot see and remap bad sectors to them;
    if you're seeing bad sectors, that means the collection of
    spares has been exhausted. I've never seen a disk manufacturer
    not accept a disk that had bad sectors on it (that was still
    under warranty) in as long as I can remember..

    nate
  • Lorenzo Quatrini at Aug 22, 2008 at 4:07 pm

    nate wrote:
    Lorenzo Quatrini wrote:
    I have a few disks that have offline uncorrectable sectors.
    Ideally it should be done using the manufacturer's tools,
    and really any disk that has even one bad sector that the OS
    can see should not be relied upon; it should be considered a
    failed disk. Disks automatically keep spare sectors that the
    operating system cannot see and remap bad sectors to them;
    if you're seeing bad sectors, that means the collection of
    spares has been exhausted. I've never seen a disk manufacturer
    not accept a disk that had bad sectors on it (that was still
    under warranty) in as long as I can remember..

    nate
    From what I understand, "offline uncorrectable" means that the sector will be
    relocated the next time it is accessed for writing... so it is in a "wait for
    relocation" state.
    I don't know of any other way to force this relocation other than actually
    writing over the sector (a simple read doesn't trigger the relocation)...

    And yes, I know that a disk with bad blocks isn't reliable, but remember?
    I'm too lazy to send my home disks back to the manufacturer ;)

    Lorenzo
  • Nate Amsden at Aug 22, 2008 at 4:26 pm

    Lorenzo Quatrini wrote:

    From what I understand, "offline uncorrectable" means that the sector will be
    relocated the next time it is accessed for writing... so it is in a "wait for
    relocation" state.
    I don't know of any other way to force this relocation other than actually
    writing over the sector (a simple read doesn't trigger the relocation)...
    Not sure myself, but the manufacturer's testing tools have
    non-destructive ways of detecting and re-mapping bad sectors.
    Of course, a downside to the manufacturer's tools is that they often
    only support a limited number of disk controllers.

    It's probably been since the IBM "Deathstar" 75GXP that I last recall
    having drives with bad sectors on them, but typically, at least at that
    time, when the OS encountered a bad sector it didn't handle it too
    gracefully; oftentimes I had to reboot the system. Perhaps the Linux
    kernel is more robust about those things these days (I had roughly 75%
    of my 75GXP drives fail - more than 30).

    Interesting that the man page for e2fsck in RHEL 4 doesn't describe
    the -c option, but the man page for it in RHEL 3 does; not sure if
    that is significant (the RHEL 4 man page mentions the option, but gives no
    clear description of what it does). Haven't checked RHEL/CentOS 5.

    from RHEL 3 manpage:
    -c This option causes e2fsck to run the badblocks(8)
    program to find any blocks which are bad on the
    filesystem, and then marks them as bad by adding
    them to the bad block inode. If this option
    is specified twice, then the bad block scan will
    be done using a non-destructive read-write test.

    So if you haven't heard of it already, try e2fsck -c <device> ?
    I recall using this off and on about 10 years ago but found the
    manufacturer's tools to be more accurate.
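
    If you try it, the invocation would look something like this (on an unmounted
    filesystem; the device name here is only an example):

        e2fsck -c /dev/sda1       # read-only badblocks scan; bad blocks get added to the bad block inode
        e2fsck -c -c /dev/sda1    # -c twice: the slower, non-destructive read-write scan
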
    And yes, I know that a disk with bad blocks isn't reliable, but remember?
    I'm too lazy to send my home disks back to the manufacturer ;)
    Ahh ok, I see...just keep in mind that it's quite possible the
    bad sector count will continue to mount as time goes on..

    good luck ..


    nate
  • Akemi Yagi at Aug 22, 2008 at 4:29 pm

    On Fri, Aug 22, 2008 at 9:26 AM, nate wrote:
    Lorenzo Quatrini wrote:
    Not sure myself, but the manufacturer's testing tools have
    non-destructive ways of detecting and re-mapping bad sectors.
    Of course, a downside to the manufacturer's tools is that they often
    only support a limited number of disk controllers.

    It's probably been since the IBM "Deathstar" 75GXP that I last recall
    having drives with bad sectors on them, but typically, at least at that
    time, when the OS encountered a bad sector it didn't handle it too
    gracefully; oftentimes I had to reboot the system. Perhaps the Linux
    kernel is more robust about those things these days (I had roughly 75%
    of my 75GXP drives fail - more than 30).

    Interesting that the man page for e2fsck in RHEL 4 doesn't describe
    the -c option, but the man page for it in RHEL 3 does; not sure if
    that is significant (the RHEL 4 man page mentions the option, but gives no
    clear description of what it does). Haven't checked RHEL/CentOS 5.

    from RHEL 3 manpage:
    -c This option causes e2fsck to run the badblocks(8)
    program to find any blocks which are bad on the
    filesystem, and then marks them as bad by adding
    them to the bad block inode. If this option
    is specified twice, then the bad block scan will
    be done using a non-destructive read-write test.

    So if you haven't heard of it already, try e2fsck -c <device> ?
    I recall using this off and on about 10 years ago but found the
    manufacturer's tools to be more accurate.
    And yes, I know that a disk with bad blocks isn't reliable, but remember?
    I'm too lazy to send my home disks back to the manufacturer ;)
    Ahh ok, I see...just keep in mind that it's quite possible the
    bad sector count will continue to mount as time goes on..
    There is a thread on this topic in the CentOS forum:

    http://www.centos.org/modules/newbb/viewtopic.php?topic_id880&forum9

    Akemi
  • William L. Maltby at Aug 22, 2008 at 4:30 pm

    On Fri, 2008-08-22 at 18:07 +0200, Lorenzo Quatrini wrote:
    nate wrote:
    <snip>
    From what I understand, "offline uncorrectable" means that the sector will be
    relocated the next time it is accessed for writing... so it is in a "wait for
    relocation" state.
    If my memory is still good (I don't recall if it is or not! :-) you are
    correct.
    I don't know of any other way to force this relocation other than actually
    writing over the sector (a simple read doesn't trigger the relocation)...
    You can force this with dd, using the seek, skip and block-size (bs=)
    parameters to (re)write only the desired sectors. The "if=" parameter
    would reference the physical device or partition and the "skip=" would
    be the offset to the sector. Be very careful and have good backups. In
    fact, you could test by making an image of the partition and doing a
    test run on that.

    It might be a lot easier to reference the sector from the start of the disk,
    as the reported sectors will be given relative to that.
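
    As a rough sketch (the partition name, image path and sector number below are
    placeholders only; double-check the numbers against the SMART report before
    running anything like this):

        dd if=/dev/sda3 of=/root/sda3.img bs=1M    # optional: image the partition first for a test run
        dd if=/dev/sda of=/dev/sda bs=512 count=1 skip=1234567 seek=1234567   # read sector 1234567 and write it back in place

    Keep in mind the read may fail outright on a pending sector; in that case
    writing zeros over it (if=/dev/zero) is the usual fallback.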

    And yes, I know that a disk with bad blocks isn't reliable, but remember?
    I'm too lazy to send my home disks back to the manufacturer ;)
    If my other post is correct, it may still be reliable (or be getting old
    enough to become so).
    Lorenzo
    <snip sig stuff>
    HTH
    --
    Bill
  • Scott Silva at Aug 22, 2008 at 4:59 pm

    on 8-22-2008 9:07 AM Lorenzo Quatrini spake the following:
    nate wrote:
    Lorenzo Quatrini wrote:
    I have a few disks that have offline uncorrectable sectors.
    Ideally it should be done using the manufacturer's tools,
    and really any disk that has even one bad sector that the OS
    can see should not be relied upon; it should be considered a
    failed disk. Disks automatically keep spare sectors that the
    operating system cannot see and remap bad sectors to them;
    if you're seeing bad sectors, that means the collection of
    spares has been exhausted. I've never seen a disk manufacturer
    not accept a disk that had bad sectors on it (that was still
    under warranty) in as long as I can remember..

    nate
    From what I understand, "offline uncorrectable" means that the sector will be
    relocated the next time it is accessed for writing... so it is in a "wait for
    relocation" state.
    I don't know of any other way to force this relocation other than actually
    writing over the sector (a simple read doesn't trigger the relocation)...

    And yes, I know that a disk with bad blocks isn't reliable, but remember?
    I'm too lazy to send my home disks back to the manufacturer ;)
    Then I hope you are not too lazy to do some proper backups!
    Sending a disk back to be replaced is a lot less work than recovering a failed
    array when the disk tanks. How much is your data worth?
    I know from experience that a 6-drive RAID 5 array can run near $10,000 US to
    recover.

    --
    MailScanner is like deodorant...
    You hope everybody uses it, and
    you notice quickly if they don't!!!!

  • William L. Maltby at Aug 22, 2008 at 4:22 pm

    On Fri, 2008-08-22 at 08:59 -0700, nate wrote:
    Lorenzo Quatrini wrote:
    I have a few disks that have offline uncorrectable sectors.
    Ideally it should be done using the manufacturer's tools,
    Second that!
    and really any disk that has even one bad sector that the OS
    can see should not be relied upon; it should be considered a
    failed disk. Disks automatically keep spare sectors that the
    operating system cannot see and remap bad sectors to them;
    if you're seeing bad sectors, that means the collection of
    spares has been exhausted. I've never seen a disk manufacturer
    ?? Uncertain about "spares has been exhausted". I recently had one SATA
    drive that kept reporting a bad sector (which actually grew to three). Being
    inured against panic attacks by long exposure to panic-inducing
    situations, I decided to let it ride a bit (it was an empty re-used
    partition upon which I would mke2fs and temporarily mount and use) and
    see if the number continued to grow. To this end, I ran the smart tools
    extended tests, several times over a period of a week, and saw no new
    ones. This was reassuring, as traditionally, if failure is imminent, the
    number tends to grow quickly. A few appearances of bad sectors early in
    the drive's lifetime is not an unusual occurrence and is not a reason for
    a trade-in of the drive (after all, in this case the manufacturer just
    runs the repair software on it and re-sells it). It *is* a reason for
    heightened caution and alertness, depending on your situation.

    After deciding the drive was not in its death-throes, I downloaded the
    DOS utilities from the manufacturer web site and ran the repair
    utilities. No smart tools reports of bad sectors since then (about 2
    months so far).

    Now, I don't know (or care) if an alternate sector was assigned, just
    that the sector was flagged unusable. For my use (temporary use - no
    permanent or critical data) this is fine. Last several mke2fs runs have
    produced the same amount of usable blocks and i-nodes, so I don't see
    evidence that no spare was available.

    I do expect that a few more sectors will be found as the drive ages
    until the manufacturing weak areas have all aged sufficiently to cause
    failures.
    not accept a disk that had bad sectors on it (that was still
    under warranty) in as long as I can remember..
    If your application is critical and you still have warranty, the only
    cost is the inconvenience, delay and extra work to get it exchanged. You will
    likely receive a "reconditioned" drive, though. So for me, in my
    situation, downloading and using the manufacturer's repair software is
    better. The only bad part is that instead of using floppies, they now seem
    to want a CD/DVD to boot from. A minor inconvenience considering the
    alternatives.
    nate
    <snip sig stuff>
    HTH
    --
    Bill
  • Nate Amsden at Aug 22, 2008 at 4:33 pm

    William L. Maltby wrote:

    ?? Uncertain about "spares has been exhausted".
    I don't recall where I read it, and I suppose it may be
    misinformation, but it made sense at the time. The idea is that
    the disks are not made to hold EXACTLY the number of blocks
    the specs call for. There are some extra blocks that
    the disk "hides" from the disk controller. The disk automatically
    re-maps bad blocks to these hidden blocks (making them visible again). By
    the time bad blocks start showing up at the OS level, these
    extra blocks are already full, an indication that there are
    far more bad blocks on the disk than just the ones that you
    can see at the OS level.
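
    You can see how much of that hidden pool the drive thinks it has used from
    the SMART attributes; something like this (the exact attribute names vary a
    little by vendor):

        smartctl -A /dev/sda | egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
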
    Now, I don't know (or care) if an alternate sector was assigned, just
    that the sector was flagged unusable. For my use (temporary use - no
    permanent or critical data) this is fine. Last several mke2fs runs have
    produced the same amount of usable blocks and i-nodes, so I don't see
    evidence that no spare was available.
    Note that mke2fs doesn't write over the entire disk; I doubt it
    even scans the entire disk. I've used a technology called thin
    provisioning, where only data that is written to disk is actually
    allocated on disk (e.g. you can create a 1TB volume; if you only
    write 1GB to it, it only uses 1GB, allowing you to oversubscribe
    the system and dynamically grow physical storage as needed). When
    allocating thinly provisioned volumes and formatting them with
    mke2fs, even on multi-hundred-gig systems only a few megs are written
    to disk (perhaps a hundred megs).

    nate
  • William L. Maltby at Aug 22, 2008 at 4:57 pm

    On Fri, 2008-08-22 at 09:33 -0700, nate wrote:
    William L. Maltby wrote:
    ?? Uncertain about "spares has been exhausted".
    I don't recall where I read it, and I suppose it may be
    misinformation, but it made sense at the time. The idea is that
    the disks are not made to hold EXACTLY the number of blocks
    the specs call for. There are some extra blocks that
    the disk "hides" from the disk controller. The disk automatically
    re-maps bad blocks to these hidden blocks (making them visible again). By
    That is correct. Back in the old days, we had access to a "spares"
    cylinder and could manually maintain the alternate-sector table. We
    could wipe it, add sectors, etc.

    As technology progressed, this capability disappeared and the drive
    electronics and PROMs began taking care of it.

    What I don't know (extreme lack of sufficient interest to find out so
    far) is whether the self-monitoring tools report a sector when a *read*
    results in either a hard or soft failure, and whether the drive tries to
    reassign it at that time. My local evidence seems to indicate that the
    report is made at read time but the assignment of a spare is not made then,
    because the same three sectors kept reporting over and over.

    After running the repair software, the messages stopped, indicating that the
    bad sectors had been marked unusable and alternate sectors had been
    assigned.
    the time bad blocks start showing up at the OS level, these
    extra blocks are already full, an indication that there are
    far more bad blocks on the disk than just the ones that you
    can see at the OS level.
    Correct.
    Now, I don't know (or care) if an alternate sector was assigned, just
    that the sector was flagged unusable. For my use (temporary use - no
    permanent or critical data) this is fine. Last several mke2fs runs have
    produced the same amount of usable blocks and i-nodes, so I don't see
    evidence that no spare was available.
    Note that mke2fs doesn't write over the entire disk; I doubt it
    even scans the entire disk.
    Correct, unless the check is forced. I failed to note in my previous
    post that a *substantial* portion of the partition was written (which I
    knew included the questionable sectors, through manual math and the
    nature of file-system usage).
    I've used a technology called thin
    provisioning, where only data that is written to disk is actually
    allocated on disk (e.g. you can create a 1TB volume; if you only
    write 1GB to it, it only uses 1GB, allowing you to oversubscribe
    the system and dynamically grow physical storage as needed). When
    allocating thinly provisioned volumes and formatting them with
    mke2fs, even on multi-hundred-gig systems only a few megs are written
    to disk (perhaps a hundred megs).
    Yep. Only a few copies of the superblock and the i-node tables are
    written by the file-system make process. That's why it's important for
    file systems in critical applications to be created with the check
    forced. Folks should also keep in mind that the default check (read
    only) is really not sufficient for critical situations. The full
    write/read check should be forced on *new* partitions/disks.
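
    In practice, for a new or scratch partition, that means something like this
    (the device name is an example only; the second -c makes it a full write/read
    pass, and whatever was on the partition is gone once you mke2fs it anyway):

        mke2fs -c /dev/sda3       # default: read-only badblocks scan
        mke2fs -c -c /dev/sda3    # -c twice: the slower read-write scan
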
    nate
    <snip sig stuff>
    --
    Bill
  • Lorenzo Quatrini at Aug 25, 2008 at 8:43 am
    William L. Maltby wrote:
    Yep. Only a few copies of the superblock and the i-node tables are
    written by the file-system make process. That's why it's important for
    file systems in critical applications to be created with the check
    forced. Folks should also keep in mind that the default check (read
    only) is really not sufficient for critical situations. The full
    write/read check should be forced on *new* partitions/disks.
    So again my question is:
    can I use dd to "test" the disk? What about

    dd if=/dev/sda of=/dev/sda bs=512

    Is this safe on a fully running system? Does it have to be done at runlevel 1
    or with a live CD?
    I think this is "better" than the manufacturer's way, as dd is always present
    and works with any brand.

    Lorenzo
  • Stephen Harris at Aug 25, 2008 at 10:36 am

    On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
    So again my question is:
    can I use dd to "test" the disk? What about

    dd if=/dev/sda of=/dev/sda bs=512

    Is this safe on a fully running system? Does it have to be done at runlevel 1
    or with a live CD?
    Do not do this on a mounted filesystem; you risk corruption. I'd be leery
    of this command anyway, though.

    A better way is to use the "badblocks" command; if you want to keep the data,
    then "badblocks -n"; if you don't care about the data, then "badblocks -w".
    Again, you can't do this on a mounted filesystem.
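
    For example (unmounted device assumed; -s and -v just add progress and
    verbose output):

        badblocks -n -s -v /dev/sda    # non-destructive read-write test, existing data preserved
        badblocks -w -s -v /dev/sda    # destructive write-mode test, wipes the whole disk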

    --

    rgds
    Stephen
  • William L. Maltby at Aug 25, 2008 at 11:24 am

    On Mon, 2008-08-25 at 06:36 -0400, Stephen Harris wrote:
    On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
    So again my question is:
    can I use dd to "test" the disk? What about

    dd if=/dev/sda of=/dev/sda bs=512

    Is this safe on a fully running system? Does it have to be done at runlevel 1
    or with a live CD?
    Do not do this on a mounted filesystem; you risk corruption. I'd be leery
    of this command anyway, though.
    Whoo-hoo! The question un-asked ...

    I didn't even think of mentioning this to him in my other reply. I'm
    glad you jumped on that.
    A better way is to use the "badblocks" command; if you want to keep the data,
    then "badblocks -n"; if you don't care about the data, then "badblocks -w".
    Again, you can't do this on a mounted filesystem.
    This is *far* superior to the OP's thoughts of dd.

    And I'll remind here, as mentioned in my other post, about "hard" and
    "soft" errors: "soft" errors are not seen by the OS.

    "Badblocks" (which really should be invoked via mke2fs or e2fsck rather
    than manually) has useful, but limited, utility in ensuring reliability.
    It does require some small storage space in the file system. And it
    does *not* assign alternate blocks (that is, it does not take advantage
    of the hardware's alternate-block capability). And it is not "predictive",
    thereby being useful only for keeping an FS usable *after* data has been
    (potentially) lost on an existing file system. Its best utility is at
    FS creation and check time. It also has use if you can unmount the FS
    (ignoring the "force" capability provided) but cannot take the system
    down to run manufacturer-specific diagnostic and repair software.


    --
    Bill
  • Nifty Cluster Mitch at Aug 25, 2008 at 7:03 pm

    On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:
    "Badblocks" (which really should be invoked via mke2fs or e2fsck rather
    than manually) has useful, but limited, utility in ensuring reliability.
    And it does require some small storage space in the file system. And it
    does *not* assign alternate blocks (that is, it does not take advantage
    of the hardware alternate block capability). And it is not "predictive",
    thereby being useful only for keeping an FS usable *after* data has been
    (potentially) lost on an existing file system. It's best utility is at
    FS creation and check time. It also has use if you can un-mount the FS
    (ignoring the "force" capability provided) but cannot take the system
    down to run manufacturer-specific diagnostic and repair software.
    It might be interesting to add a "catch 22" story.

    I once added -c flags to /fsckoptions and "touch"ed /forcefsck.
    I had to take the disk to the lab and fix it on a bench system.



    --
    T o m M i t c h e l l
    Got a great hat... now what.
  • William L. Maltby at Aug 25, 2008 at 7:43 pm

    On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
    On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:

    <snip>
    (potentially) lost on an existing file system. Its best utility is at
    FS creation and check time. It also has use if you can unmount the FS
    (ignoring the "force" capability provided) but cannot take the system
    down to run manufacturer-specific diagnostic and repair software.
    It might be interesting to add a "catch 22" story.

    I once added -c flags to /fsckoptions and "touch"ed /forcefsck.
    I had to take the disk to the lab and fix it on a bench system.
    YOIKS! Any explanation why such a reliable process would cause such a
    result? Was it a long time ago with a buggy e2fsck maybe? Did you mean
    to say you added the "-f" flag and the FS was mounted and active at the
    time? Is it just one of those "Mysteries of the Universe"? I hate those!
    <snip>
    --
    Bill
  • Nifty Cluster Mitch at Aug 25, 2008 at 10:36 pm

    On Mon, Aug 25, 2008 at 03:43:18PM -0400, William L. Maltby wrote:
    On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
    On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:

    <snip>
    (potentially) lost on an existing file system. Its best utility is at
    FS creation and check time. It also has use if you can unmount the FS
    (ignoring the "force" capability provided) but cannot take the system
    down to run manufacturer-specific diagnostic and repair software.
    It might be interesting to add a "catch 22" story.

    I once added -c flags to /fsckoptions and "touch"ed /forcefsck.
    I had to take the disk to the lab and fix it on a bench system.
    YOIKS! Any explanation why such a reliable process would cause such a
    result? Was it a long time ago with a buggy e2fsck maybe? Did you mean
    to say you added the "-f" flag and the FS was mounted and active at the
    time? Is it just one of those "Mysteries of the Universe"? I hate those!
    The removal of /forcefsck would never happen when badblocks was run.
    Something wonky, perhaps, because I did have a disk with defects..

    Might be worth a retry next time I need to clean and reload a machine,
    but I do not know how to reproduce the disk hardware issue.

    Gone are the days when disk controllers gave you the ability
    to 'expose' defects.






    --
    T o m M i t c h e l l
    Got a great hat... now what.
  • William L. Maltby at Aug 25, 2008 at 11:13 pm

    On Mon, 2008-08-25 at 15:36 -0700, Nifty Cluster Mitch wrote:
    On Mon, Aug 25, 2008 at 03:43:18PM -0400, William L. Maltby wrote:
    On Mon, 2008-08-25 at 12:03 -0700, Nifty Cluster Mitch wrote:
    On Mon, Aug 25, 2008 at 07:24:24AM -0400, William L. Maltby wrote:

    <snip>
    (potentially) lost on an existing file system. Its best utility is at
    FS creation and check time. It also has use if you can unmount the FS
    (ignoring the "force" capability provided) but cannot take the system
    down to run manufacturer-specific diagnostic and repair software.
    It might be interesting to add a "catch 22" story.

    I once added -c flags to /fsckoptions and "touch"ed /forcefsck.
    I had to take the disk to the lab and fix it on a bench system.
    YOIKS! Any explanation why such a reliable process would cause such a
    result? Was it a long time ago with a buggy e2fsck maybe? Did you mean
    to say you added the "-f" flag and the FS was mounted and active at the
    time? Is it just one of those "Mysteries of the Universe"? I hate those!
    The removal of /forcefsck would never happen when badblocks was run.
    Something wonky, perhaps, because I did have a disk with defects..

    Might be worth a retry next time I need to clean and reload a machine,
    but I do not know how to reproduce the disk hardware issue.

    Gone are the days when disk controllers gave you the ability
    to 'expose' defects.
    I don't have an available "smart" drive here at home, but I do have some
    older stuff. I think we can "emulate" defects by defining a partition
    that runs a few sectors beyond the end of the HD, then running mke2fs with
    -c -c and a manually specified size that includes the phantom sectors.

    When I get time (won't be RSN) I'll do both a mke2fs test and then an
    e2fsck test. What I don't know is whether notification of "beyond media end"
    is sent by the hardware and caught by the drivers, or whether the drivers just
    catch an error and a bad block (sector) is presumed, to be logged and avoided.
    ISTR that (on SCSI anyway) a read past media end was handled. But this
    ain't SCSI! 8-)

    If someone has a setup that makes this a quick and easy test to run
    sooner than I'll be able to, that would be "peachy".
    <snip>
    --
    Bill
  • William L. Maltby at Aug 25, 2008 at 10:53 am

    On Mon, 2008-08-25 at 10:43 +0200, Lorenzo Quatrini wrote:
    William L. Maltby wrote:
    Yep. Only a few copies of the superblock and the i-node tables are
    written by the file-system make process. That's why it's important for
    file systems in critical applications to be created with the check
    forced. Folks should also keep in mind that the default check (read
    only) is really not sufficient for critical situations. The full
    write/read check should be forced on *new* partitions/disks.
    First, a correction. I earlier mentioned "-C" as causing the read/write
    check for mke2fs. It is "-c -c". I must've been thinking of some other
    FS software.
    So again my question is:
    can I use dd to "test" the disk? What about

    dd if=/dev/sda of=/dev/sda bs=512
    It ought to do what you think it would. But ...
    Is this safe on a fully running system? Does it have to be done at runlevel 1
    or with a live CD?
    Safe on a full running system? Probably. I suggest a test before you do
    it on an important system. I've never had the urge to do it the way you
    suggest. It can be done at run level 1 or from a live CD too. But ..
    I think this is "better" than the manufacturer's way, as dd is always present
    and works with any brand.
    s/better/convenient/ # IMO

    Now for the "buts". I presume that there are still two basic types of
    media errors on HDs, "hard" and "soft". Hard errors are those that are
    not recoverable through the normal hardware crc check process (or
    whatever they use these days). Soft errors are errors that are
    recoverable via the normal hardware crc check process.

    Hard errors are always reported to the OS, soft errors are not, IIRC. So
    you could have recovered media failures that do not get reported to the
    OS. IF these failures are early indicators of deteriorating media you
    will not be notified of them.

    For this reason, hardware-specific diagnostic software is "better".
    Further, the "smart" capabilities are *really* hardware specific and
    will detect and report things that normal read/write activities, like
    dd, cannot.

    As to running on a live system, you might not want to for several
    reasons. If you are using the system to do anything useful at the time,
    there will be a big hit on responsiveness. Unlike the real original
    UNIX, Linux still does not have preemptive scheduling (somebody please
    correct me if I missed this potentially earth-shattering advancement -
    last I heard, earliest was to be the 2.7 kernel, presuming no slippage).

    Because dd is fast, it will consume all I/O capability, especially the
    way you propose running it. Further, you will be causing a *LARGE*
    number of system calls, further degrading system responsiveness. It
    could be so slow to respond that one might think the system is "frozen".

    If you insist on doing this, I would suggest something like

    nice <:your priority here:> dd if=/dev/xxxx of=/dev/xxxx bs=16384 &

    "Man nice" for details. This helps a little bit. I've not tried to see
    how much responsiveness can be "recovered". A larger "bs=" will reduce
    system calls, but will increase buffer sizes and usage and increase I/O
    load. Even if you omit the trailing "&" to run in foreground, the
    responsiveness may be so slow that a <CTL>-<C> may appear to fail and
    make you think the system is "frozen"... for a little while.

    The larger "bs=" would seem to negate what you want with the "bsQ2".
    Not so. Since the detection of failures happens on the hardware, it will
    still detect failures and handle them as it normally would. The "bs=" is
    only a blocking factor. Your "512" only saves doing math to figure out
    what the "sector" really is. But it has a large cost. BTW, you don't
    really know what the sector size is these days. It may not be 512. Back
    in the old days, sector size was selectable via jumpers. Today I suspect
    the drives don't have sectors in the same way/size as they used to.

    Closing (really, they are!) arguments:
    1. Any OS-level test, rather than a hardware-specific one, will be less
    rigorous. This is "optimal" only if other factors trump reliability. Usually
    "convenience" and "portability" will not trump reliability for server or
    critical platforms.

    2. The "smart" feature has capabilities of which you may not be aware.
    One of these is to run in such a way as to minimize performance impact
    on a live system. If you've run "makewhatis", then "man -k smart" or
    "apropos smart" will get you started on the reading you may want to do.

    3. Hardware-specific diagnostics and repair utilities from the
    manufacturer (this includes the "smart" capability of the drives) will
    be more rigorous and reliable than general-purpose utilities.

    4. The manufacturer utilities can "repair" media failures as they are
    detected. If you are taking the time to run diagnostics, why not fix
    failures at the same time? If you believe that the "dd" way can
    accomplish the same thing (through the alternate block assignment
    process), why not grab a drive with known bad sectors and run a test to
    see if it will be satisfactory to you?
    Lorenzo
    <snip sig stuff>
    --
    Bill
  • Nifty Cluster Mitch at Aug 25, 2008 at 6:53 pm

    On Mon, Aug 25, 2008 at 10:43:01AM +0200, Lorenzo Quatrini wrote:
    William L. Maltby wrote:
    Yep. Only a few copies of the superblock and the i-node tables are
    written by the file-system make process. That's why it's important for
    file systems in critical applications to be created with the check
    forced. Folks should also keep in mind that the default check (read
    only) is really not sufficient for critical situations. The full
    write/read check should be forced on *new* partitions/disks.
    So again my question is:
    can I use dd to "test" the disk? What about

    dd if=/dev/sda of=/dev/sda bs=512

    Is this safe on a fully running system? Does it have to be done at runlevel 1
    or with a live CD?
    I think this is "better" than the manufacturer's way, as dd is always present
    and works with any brand.
    It is not safe on a mounted filesystem or on devices with mounted filesystems.

    Filesystem code on a partition will have no coherency interaction
    with access to the entire raw device.

    See the -f flag in the "badblocks" man page:
    "-f  Normally, badblocks will refuse to do a read/write or a
    non-destructive test on a device which is mounted, since either can
    cause the system to potentially crash and/or damage the filesystem
    even if ....."

    It is also not 100% clear to me that the kernel buffer code will not
    see a paired set of "dd" commands as a no-op and skip the write.

    Vendor tools on an unmounted disk operate at a raw level and also have
    access to the vendor-specific embedded controller commands, bypassing
    buffering and directly interacting with error codes, retry counts and more.

    In normal operation the best opportunity to spare a sector or track is
    on a write... At that time the OS and the disk both have known good data,
    so a read-after-write can detect the defect/error and take the necessary
    action without loss of data. Some disks have read heads that follow the
    write heads to this end. Other disks require an additional revolution...

    When "mke2fs -c -c " is invoked the second -c flag is important because the
    paired read/write can let the firmware on the disk map detected defects
    to spares. With a single "-c" flag the Linux filesystem code can
    assign the error blocks to non files . A system admin that does a dd read
    of a problem disk may find that the OS hurls on the errors and takes the device off line.
    i.e. this command:
    dd if=/dev/sda of=/dev/sda bsQ2
    might not do the expected because the first read can take the device
    off line negating the follow up write intended to fix things.

    The tool "hdparm: is rich in info -- some flags are dangerous.

    Bottom line... use vendor tools....
    Vendors like error reports from their tools for RMA processing and warranty...

    BTW: smartd is a good thing. For me any disk that smartd had made noise
    about has failed... often with weeks or months of warning...
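
    A minimal setup along those lines on CentOS would be something like the
    following (the schedule regexp and mail target are only examples: a short
    self-test nightly at 02:00 and a long one on Saturdays at 03:00):

        # /etc/smartd.conf
        DEVICESCAN -a -m root -s (S/../.././02|L/../../6/03)

        # enable and start the daemon
        chkconfig smartd on
        service smartd start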


    --
    T o m M i t c h e l l
    Got a great hat... now what.
  • Lorenzo Quatrini at Aug 26, 2008 at 8:38 am
    Nifty Cluster Mitch wrote:
    Bottom line... use vendor tools....
    Vendors like error reports from their tools for RMA processing and warranty...

    BTW: smartd is a good thing. For me any disk that smartd had made noise
    about has failed... often with weeks or months of warning...
    So... ok, I see the point: I should monitor for SMART errors and then use
    vendor tools to fix things...

    (BTW, the PC which triggered the thread reallocated the sector by itself: I
    guess that finally the OS tried to write to the bad sector and the disk did all
    the magic relocation stuff)

    Also, I finally noticed that badblocks has a non-destructive read-write mode
    (the man page is outdated and doesn't mention it) which can be used routinely
    (say once a month) to force a check of the whole disk.

    Thanks to all for the explanation

    Regards

    Lorenzo Quatrini
  • William L. Maltby at Aug 26, 2008 at 12:05 pm

    On Tue, 2008-08-26 at 10:38 +0200, Lorenzo Quatrini wrote:
    <snip>
    Also, I finally noticed that badblocks has a non-destructive read-write mode
    (the man page is outdated and doesn't mention it) which can be used routinely
    (say once a month) to force a check of the whole disk.
    From "man badblocks":
    -n  Use non-destructive read-write mode. By default only a
    non-destructive read-only test is done. This option must not be
    combined with the -w option, as they are mutually exclusive.

    Note the phrase beginning with "By default only...". I'll admit it could
    be more clearly stated.
    Thanks to all for the explanation

    Regards

    Lorenzo Quatrini
    <snip>
    --
    Bill
  • Lorenzo Quatrini at Aug 26, 2008 at 2:02 pm
    William L. Maltby wrote:
    From "man badblocks":

    -n  Use non-destructive read-write mode. By default only a
    non-destructive read-only test is done. This option must not be
    combined with the -w option, as they are mutually exclusive.

    Note the phrase beginning with "By default only...". I'll admit it could
    be more clearly stated.
    The Italian translation of the man page is outdated... I guess I should stick
    with the original version of the man pages, or at least remember to check them.

    Lorenzo
  • Nifty Cluster Mitch at Aug 27, 2008 at 4:02 pm

    On Tue, Aug 26, 2008 at 04:02:22PM +0200, Lorenzo Quatrini wrote:
    William L. Maltby wrote:
    From "man badblocks":

    -n  Use non-destructive read-write mode. By default only a
    non-destructive read-only test is done. This option must not be
    combined with the -w option, as they are mutually exclusive.

    Note the phrase beginning with "By default only...". I'll admit it could
    be more clearly stated.
    The Italian translation of the man page is outdated... I guess I should stick
    with the original version of the man pages, or at least remember to check them.
    Consider filing a bug --
    one goal for the user community is to turn the old phrase RTFM
    into "Read The Fine Manual"... in contrast to the historic profanity.

    You can file it against either the English version, the Italian translation,
    or both.

    As an alternative, you can post a difference file to a list like
    this one for discussion and ask ONE person to help you file the bug.

    Translations are commonly not done by the maintainer, so a bug can be
    the best path. If you need help with the mechanics of filing a bug,
    ask...




    --
    T o m M i t c h e l l
    Got a great hat... now what.