Hi List,
I've been getting the following EDAC memory errors
EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, row 3,
channel 0, label "": i82875p CE
and from this seeing that these errors have been corrected.
Checking cat /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count gives
me a count of 4
thus I now know that csrow3 - ch0 is the problem
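As a rough illustration of what the log line above encodes, the following Python sketch pulls out the fields and derives a physical address from the page and offset. The regular expression matches only the format quoted in this post, and the 4 KiB page size is an assumption (typical for x86); `parse_ce_line` is a hypothetical helper name, not part of any EDAC tool.

```python
import re

# Pattern for a corrected-error (CE) line in the format quoted in the post.
# Other kernel versions print these fields differently (assumption: this
# exact layout).
CE_PATTERN = re.compile(
    r"EDAC MC(?P<mc>\d+): CE page 0x(?P<page>[0-9a-f]+), "
    r"offset 0x(?P<offset>[0-9a-f]+), grain (?P<grain>\d+), "
    r"syndrome 0x(?P<syndrome>[0-9a-f]+), row (?P<row>\d+), "
    r"channel (?P<channel>\d+), label \"(?P<label>[^\"]*)\""
)

PAGE_SIZE = 4096  # assumed 4 KiB pages


def parse_ce_line(line):
    """Return the fields of a CE log line as a dict, or None if no match."""
    m = CE_PATTERN.search(line)
    if m is None:
        return None
    page = int(m.group("page"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "mc": int(m.group("mc")),
        "row": int(m.group("row")),
        "channel": int(m.group("channel")),
        "label": m.group("label"),
        "syndrome": int(m.group("syndrome"), 16),
        "grain": int(m.group("grain")),
        # Physical address = page number * page size + offset within the page.
        "phys_addr": page * PAGE_SIZE + offset,
    }


# The line from the post, joined back onto a single line.
line = ('EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, '
        'row 3, channel 0, label "": i82875p CE')
info = parse_ce_line(line)
print(info["row"], info["channel"], hex(info["phys_addr"]))
# prints: 3 0 0xeb0dd000
```

The empty `label` field is what makes the mapping question necessary: the kernel has not been told which silkscreen name belongs to this row and channel.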

My question is, how does this map to the on board labels
DIMM 1A
DIMM 1B
DIMM 2A
DIMM 2B

Am I correct in assuming csrow 3 is DIMM 2B?

Also I have just discovered that both the OS drives sda and sdb have
huge number of errors shown on the SMART records
- can this relate to the memory errors??
- I am just really surprised to have two drives show almost identical
number of errors at the same time, yet no apparent data errors - Drives
are ATA ST380013AS 74.53 GB
TIA for your insightful comments
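For the csrow question above, a small Python sketch can list every per-channel corrected-error counter at once instead of running `cat` on one file at a time. It assumes the csrow-style sysfs layout shown in the post; `read_edac_counts` is a hypothetical helper, and the `chN_dimm_label` file it reads is empty unless someone has written a label into it.

```python
import glob
import os
import re


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return a list of (mc, csrow, channel, ce_count, label) tuples."""
    results = []
    pattern = os.path.join(root, "mc*", "csrow*", "ch*_ce_count")
    for path in sorted(glob.glob(pattern)):
        m = re.search(r"(mc\d+)[/\\](csrow\d+)[/\\](ch\d+)_ce_count$", path)
        if m is None:
            continue
        mc, csrow, channel = m.groups()
        with open(path) as f:
            count = int(f.read().strip())
        # The matching label file sits next to the counter: chN_dimm_label.
        label_path = path[:-len("ce_count")] + "dimm_label"
        label = ""
        if os.path.exists(label_path):
            with open(label_path) as f:
                label = f.read().strip()
        results.append((mc, csrow, channel, count, label))
    return results


# On a machine without EDAC loaded this simply prints nothing.
for mc, csrow, channel, count, label in read_edac_counts():
    print(f"{mc}/{csrow}/{channel}: ce_count={count} label={label!r}")
```

Once a swap test has shown which physical slot a csrow and channel correspond to, the label file can be written with the board's own name for that slot, so later log lines carry it instead of an empty string.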


  • Rob Kampen at Dec 5, 2011 at 3:17 am

    Rob Kampen wrote:
    Hi List,
    I've been getting the following EDAC memory errors
    EDAC MC0: CE page 0xeb0dd, offset 0x0, grain 4096, syndrome 0x45, row
    3, channel 0, label "": i82875p CE
    and from this seeing that these errors have been corrected.
    Checking cat /sys/devices/system/edac/mc/mc0/csrow3/ch0_ce_count gives
    me a count of 4
    thus I now know that csrow3 - ch0 is the problem

    My question is, how does this map to the on board labels
    DIMM 1A
    DIMM 1B
    DIMM 2A
    DIMM 2B

    Am I correct in assuming csrow 3 is DIMM 2B?
    Swapped the memory between DIMM 2A and DIMM 2B - still get fault in row
    3, channel 0 - thus did not move with the RAM??
    Next reboot I'll try swapping 1A and 1B
    Also I have just discovered that both the OS drives sda and sdb have
    huge number of errors shown on the SMART records
    - can this relate to the memory errors??
    - I am just really surprised to have two drives show almost identical
    number of errors at the same time, yet no apparent data errors -
    Drives are ATA ST380013AS 74.53 GB
    Just for safety I swapped /dev/sda with a new, slightly larger drive,
    did the sfdisk foo, and added it to the md raid drives.
    This brand new drive immediately shows high raw read error rate and
    hardware ECC recovered in the tens of millions - I think this is not a
    drive issue but related to the ECC mem errors??
    Anyone with experience?
    TIA for your insightful comments
    ------------------------------------------------------------------------

    _______________________________________________
    CentOS mailing list
    CentOS at centos.org
    http://lists.centos.org/mailman/listinfo/centos
  • John R Pierce at Dec 5, 2011 at 3:34 am

    On 12/05/11 12:17 AM, Rob Kampen wrote:
    Swapped the memory between DIMM 2A and DIMM 2B - still get fault in
    row 3, channel 0 - thus did not move with the RAM??
    Next reboot I'll try swapping 1A and 1B
    often an indication the problem is board/socket related rather than
    the memory DIMM. unless it's the other pair you didn't swap.
    Also I have just discovered that both the OS drives sda and sdb have
    huge number of errors shown on the SMART records
    - can this relate to the memory errors??
    - I am just really surprised to have two drives show almost identical
    number of errors at the same time, yet no apparent data errors -
    Drives are ATA ST380013AS 74.53 GB

    Just for safety I swapped /dev/sda with a new slightly larger drive
    did the sfdisk foo and added it to the md raid drives.
    This brand new drive immediately shows high raw read er
    hmmm. no, they shouldn't be remotely related. unless it's something
    else, like a power supply with noisy or out of spec voltage(s).

    80GB 3.5" SATA drives? aren't those kind of old? like, ancient ?
    looked up that PN, that's a Barracuda 7200.7 from circa 2003-2005.
    http://www.seagate.com/support/disc/manuals/sata/cuda7200_sata_pm.pdf

    those are past their shelf date.


    --
    john r pierce N 37, W 122
    santa cruz ca mid-left coast
  • Rob Kampen at Dec 5, 2011 at 4:24 am

    John R Pierce wrote:
    On 12/05/11 12:17 AM, Rob Kampen wrote:

    Swapped the memory between DIMM 2A and DIMM 2B - still get fault in
    row 3, channel 0 - thus did not move with the RAM??
    Next reboot I'll try swapping 1A and 1B
    often an indication the problem is board/socket related rather than
    memory DIMM. unless its the other pair you didn't swap.

    Also I have just discovered that both the OS drives sda and sdb have
    huge number of errors shown on the SMART records
    - can this relate to the memory errors??
    - I am just really surprised to have two drives show almost identical
    number of errors at the same time, yet no apparent data errors -
    Drives are ATA ST380013AS 74.53 GB

    Just for safety I swapped /dev/sda with a new slightly larger drive
    did the sfdisk foo and added it to the md raid drives.
    This brand new drive immediately shows high raw read er
    hmmm. no, they shouldn't be remotely related. unless its something
    else, like a power supply with noisy or out of spec voltage(s).

    80GB 3.5" SATA drives? aren't those kind of old? like, ancient ?
    looked up that PN, that's a Barracuda 7200.7 from circa 2003-2005.
    http://www.seagate.com/support/disc/manuals/sata/cuda7200_sata_pm.pdf

    those are past their shelf date.
    Yes Christmas 2004 - never a problem until one of the md raid sets
    dropped out today.
    However I put a brand new - never used 120G drive in and it too shows
    these errors - something doesn't seem right
    Getting too tired to think straight so I'll leave it limping along until
    tomorrow
    Thanks for thoughts
  • Nicolas Thierry-Mieg at Dec 5, 2011 at 5:51 am

    Rob Kampen wrote:
    John R Pierce wrote:
    On 12/05/11 12:17 AM, Rob Kampen wrote:
    Swapped the memory between DIMM 2A and DIMM 2B - still get fault in
    row 3, channel 0 - thus did not move with the RAM??
    Next reboot I'll try swapping 1A and 1B
    often an indication the problem is board/socket related rather than
    memory DIMM. unless its the other pair you didn't swap.
    Also I have just discovered that both the OS drives sda and sdb have
    huge number of errors shown on the SMART records
    - can this relate to the memory errors??
    - I am just really surprised to have two drives show almost identical
    number of errors at the same time, yet no apparent data errors -
    Drives are ATA ST380013AS 74.53 GB

    Just for safety I swapped /dev/sda with a new slightly larger drive
    did the sfdisk foo and added it to the md raid drives.
    This brand new drive immediately shows high raw read er
    hmmm. no, they shouldn't be remotely related. unless its something
    else, like a power supply with noisy or out of spec voltage(s).
    as John suggests, I think it sounds like a tired PSU
    try swapping it if you have a spare
  • Ljubomir Ljubojevic at Dec 5, 2011 at 5:57 am

    On 12/05/2011 10:24 AM, Rob Kampen wrote:
    However I put a brand new - never used 120G drive in and it too shows
    these errors - something doesn't seem right
    Getting too tired to think straight so I'll leave it limping along until
    tomorrow
    Thanks for thoughts
    Download Hiren's BootCD and use the bundled memory test if your way is
    too complicated.

    --

    Ljubomir Ljubojevic
    (Love is in the Air)
    PL Computers
    Serbia, Europe

    Google is the Mother, Google is the Father, and traceroute is your
    trusty Spiderman...
    StarOS, Mikrotik and CentOS/RHEL/Linux consultant
  • John R Pierce at Dec 5, 2011 at 6:00 am

    On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:
    Download Hiren's BootCD and use bundled Memory Test if your way is
    complicated.
    that won't do much to detect soft ECC errors, will it?



    --
    john r pierce N 37, W 122
    santa cruz ca mid-left coast
  • Ljubomir Ljubojevic at Dec 5, 2011 at 7:28 am

    On 12/05/2011 12:00 PM, John R Pierce wrote:
    On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:
    Download Hiren's BootCD and use bundled Memory Test if your way is
    complicated.
    that won't do much to detect soft ECC errors, will it?
    They (there are 4-5 apps) check various patterns in memory (write,
    then read), and you can run them for a longer period of time.

    As for ECC errors, I cannot say; I have never used ECC memory or become
    familiar with it.


    --

    Ljubomir Ljubojevic
    (Love is in the Air)
    PL Computers
    Serbia, Europe

    Google is the Mother, Google is the Father, and traceroute is your
    trusty Spiderman...
    StarOS, Mikrotik and CentOS/RHEL/Linux consultant
  • John Hodrien at Dec 7, 2011 at 3:55 am

    On Mon, 5 Dec 2011, Ljubomir Ljubojevic wrote:

    On 12/05/2011 12:00 PM, John R Pierce wrote:
    On 12/05/11 2:57 AM, Ljubomir Ljubojevic wrote:
    Download Hiren's BootCD and use bundled Memory Test if your way is
    complicated.
    that won't do much to detect soft ECC errors, will it?
    They are (there are 4-5 apps) checking various patterns in memory (write
    then read), and you can run it for a longer period of time.

    As for ECC errors I can not say, I never ever used ECC memory or got
    familiar with it.
    In my limited experience, if you can disable ECC in your BIOS, memtest is just
    as good at spotting errors on ECC as non-ECC. With ECC enabled, you'll need
    seriously messed up ECC before it'll be detected.

    jh
  • John R Pierce at Dec 7, 2011 at 4:07 am

    On 12/07/11 12:55 AM, John Hodrien wrote:
    In my limited experience, if you can disable ECC in your BIOS, memtest
    is just as good at spotting errors on ECC as non-ECC. With ECC enabled,
    you'll need seriously messed up ECC before it'll be detected.
    except with ECC disabled, the extra 8 ECC bits per 64bit memory word
    aren't touched at all.

    I'd leave ECC on, and skip running memtest entirely, just run real OS
    workloads and let the ECC do the memory test on the fly, as it's meant to.

    does linux have an ECC scrubber process? 'real' Unix servers (Solaris,
    AIX, etc) generally have a background process, sometimes it's part of the
    Idle process, that does a read/write of every memory location when the
    machine is otherwise idle. this catches and fixes soft ECC errors in
    otherwise idle memory, which in turn gets logged. Solaris (on Sun Sparc
    hardware at least) keeps track of what locations have had bad memory,
    and will stop using a memory page entirely (with a logged alert) if
    there are too many soft ECC errors in the same area.

    --
    john r pierce N 37, W 122
    santa cruz ca mid-left coast
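On the scrubber question above: the Linux EDAC sysfs interface has a per-controller `sdram_scrub_rate` file, which reports the scrubbing bandwidth in bytes per second on drivers that implement it. The Python sketch below only checks whether that file is readable. Whether the i82875p driver from this thread supports scrubbing is not stated anywhere in the thread, so the sketch treats a missing or unreadable file as "not available"; `read_scrub_rate` is a hypothetical helper name.

```python
import os


def read_scrub_rate(mc_dir="/sys/devices/system/edac/mc/mc0"):
    """Return the scrub rate in bytes/sec, or None if it cannot be read."""
    path = os.path.join(mc_dir, "sdram_scrub_rate")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        # File absent, read rejected by the driver, or contents not numeric.
        return None


rate = read_scrub_rate()
if rate is None:
    print("scrub rate not available for this memory controller")
else:
    print(f"scrub rate: {rate} bytes/sec")
```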
