Re: how to debug hardware lockups?

Les Mikesell Re: [CentOS] how to debug hardware lockups?
Rudi Ahlers wrote:
>> I had a machine that would crash about once every week or two in normal
>> operation. Memtest86+ found an error in the 2nd day of running. The worst
>> part was that it left the raid mirrors in a strange state that caused
>> occasional problems for months even after replacing the RAM.
>>
>> --
>
> Did you leave memtest86+ running for 2 days? I thought 1 or 2 cycles
> would be good enough?
>
> I'm hoping to pick up the server in the next 2 hours, then I can see
> what happens when I run memtest86+ or other tests.

Yes, apparently RAM errors can be subtle and only appear when certain
adjacent bit patterns are stored - or when the moon is in a certain
phase or something.
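
To illustrate why a pattern-dependent error can slip past simple test
patterns, here is a toy simulation. It is only a sketch under stated
assumptions: it does not touch real RAM and is not how memtest86+ is
implemented. The FaultyMemory class, the victim address, and the fault
rule (a cell misreads only while both neighbours hold 1) are all made up
for the example.

```python
class FaultyMemory:
    """Toy bit-addressable memory with one simulated pattern-sensitive fault."""

    def __init__(self, size, victim):
        self.bits = [0] * size
        self.victim = victim  # hypothetical address of the weak cell

    def write(self, addr, value):
        self.bits[addr] = value & 1

    def read(self, addr):
        value = self.bits[addr]
        # Simulated fault (an assumption for illustration): the victim cell
        # reads back 1 instead of 0, but only while both neighbours hold 1.
        if (addr == self.victim and value == 0
                and self.bits[addr - 1] == 1 and self.bits[addr + 1] == 1):
            return 1
        return value


def run_pattern(mem, pattern):
    """Write the pattern, read it back, return the failing addresses."""
    for addr, bit in enumerate(pattern):
        mem.write(addr, bit)
    return [addr for addr, bit in enumerate(pattern) if mem.read(addr) != bit]


SIZE = 16
patterns = {
    "all zeros": [0] * SIZE,
    "all ones": [1] * SIZE,
    "checkerboard": [i % 2 for i in range(SIZE)],
    "inverse checkerboard": [(i + 1) % 2 for i in range(SIZE)],
}

results = {}
for name, pattern in patterns.items():
    mem = FaultyMemory(SIZE, victim=5)
    results[name] = run_pattern(mem, pattern)
    print(f"{name:22s} failing addresses: {results[name]}")
```

In this toy model only the inverse checkerboard exposes the bad cell; the
other three patterns pass cleanly. Real faults can also depend on timing
and temperature, which this sketch does not model at all.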

--
   Les Mikesell
    lesmik...@gmail.com
_______________________________________________
CentOS mailing list
C...@centos.org
http://lists.centos.org/mailman/listinfo/centos
