This may or may not be CentOS related, but I am out of ideas at this
point and wanted to bounce this off the list.

I'm running a CentOS 5.5 server with the latest kernel, 2.6.18-194.32.1.el5.

Almost every day around 3:30 AM the server completely locks up and has
to be power cycled before it will come back online.
(This means someone had to wake up and reboot the server, oh how I
love being an internet janitor! :)

Smells like a hardware issue to me too, but I went through all of the
Dell diagnostics and updated the firmware, and everything checks out as
okay: RAID, disks, RAM, etc. I spent an hour on the phone with a Dell
tech. No hardware issues, at least none that we were able to find.

There are no cron jobs that run at 3:30, no backups, the server has a
load of 0, nothing is scheduled around that time...

The only crontab entry at all is:

*/5 * * * * wget -q www.websitedomain.com/cron.php >/dev/null 2>&1

They are running Magento for commerce purposes and this runs every 5 minutes.
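
Worth noting: a `*/5` minute field matches minute 30, so this job does run at 3:30 AM along with every other five-minute mark. Below is a minimal sketch for checking whether a five-field cron expression fires at a given time. The helper names `field_matches` and `cron_fires_at` are made up for illustration, only `*`, `*/n`, `a-b` and comma lists are handled, and all five fields are simply ANDed together (real cron treats day-of-month and day-of-week specially when both are restricted).

```python
from datetime import datetime

# (low, high) bounds for minute, hour, day of month, month, day of week.
BOUNDS = [(0, 59), (0, 23), (1, 31), (1, 12), (0, 7)]


def field_matches(field, value, low, high):
    """Return True if one cron field such as '*/5', '1-5' or '0,30' matches value."""
    for part in field.split(","):
        span, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if span == "*":
            start, end = low, high
        elif "-" in span:
            first, last = span.split("-", 1)
            start, end = int(first), int(last)
        else:
            start = end = int(span)
        if start <= value <= end and (value - start) % step == 0:
            return True
    return False


def cron_fires_at(expression, when):
    """Return True if the five-field cron expression matches the datetime."""
    fields = expression.split()[:5]
    # cron counts Sunday as 0 (or 7); isoweekday() counts it as 7.
    weekday = when.isoweekday() % 7
    values = [when.minute, when.hour, when.day, when.month, weekday]
    for field, value, (low, high) in zip(fields, values, BOUNDS):
        if field_matches(field, value, low, high):
            continue
        # Accept 7 as an alias for Sunday in the day-of-week field.
        if (low, high) == (0, 7) and value == 0 and field_matches(field, 7, low, high):
            continue
        return False
    return True


print(cron_fires_at("*/5 * * * *", datetime(2011, 3, 8, 3, 30)))  # True
print(cron_fires_at("*/5 * * * *", datetime(2011, 3, 8, 3, 33)))  # False
```

So the wget job is a candidate at 3:30, even though it also runs harmlessly at every other five-minute mark.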

Why does the server only lock up around 3:30 AM? Because it knows I
am fast asleep?

I was able to pull this from /var/log/messages. It is logged just
seconds before the server locks up completely...

Mar 8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.
Mar 8 03:33:19 web1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 8 03:33:19 web1 kernel: wget D ffff810001004420 0 13608 13607 (NOTLB)
Mar 8 03:33:19 web1 kernel: ffff81007bc7bc78 0000000000000086 ffff81007bc7bd88 ffff81000100d3f8
Mar 8 03:33:19 web1 kernel: ffff81007bc7bbf0 0000000000000007 ffff8100849db0c0 ffffffff80308b60
Mar 8 03:33:19 web1 kernel: 00013a2964cdf439 0000000000003237 ffff8100849db2a8 0000000064c82eae
Mar 8 03:33:19 web1 kernel: Call Trace:
Mar 8 03:33:20 web1 kernel: [<ffffffff80063c6f>] __mutex_lock_slowpath+0x60/0x9b
Mar 8 03:33:20 web1 kernel: [<ffffffff80063cb9>] .text.lock.mutex+0xf/0x14
Mar 8 03:33:20 web1 kernel: [<ffffffff8000cf82>] do_lookup+0x90/0x1e6
Mar 8 03:33:20 web1 kernel: [<ffffffff8000a29c>] __link_path_walk+0xa01/0xf5b
Mar 8 03:33:20 web1 kernel: [<ffffffff8000ea4b>] link_path_walk+0x42/0xb2
Mar 8 03:33:20 web1 kernel: [<ffffffff8000cd72>] do_path_lookup+0x275/0x2f1
Mar 8 03:33:23 web1 kernel: [<ffffffff80012851>] getname+0x15b/0x1c2
Mar 8 03:33:23 web1 kernel: [<ffffffff800239d1>] __user_walk_fd+0x37/0x4c
Mar 8 03:33:23 web1 kernel: [<ffffffff80028905>] vfs_stat_fd+0x1b/0x4a
Mar 8 03:33:23 web1 kernel: [<ffffffff80023703>] sys_newstat+0x19/0x31
Mar 8 03:33:23 web1 kernel: [<ffffffff8005d116>] system_call+0x7e/0x83

If anyone has some advice on where to go from here it would be greatly
appreciated.

Thanks in advance.

--
PJF
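
To confirm that the hangs really cluster around 3:30 AM, the hung-task lines in the log can be tallied per hour. This is a rough sketch that assumes the syslog format shown in the post above (one message per line, as in the real file). The function name `hung_tasks_by_hour` and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches e.g. "Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more ..."
HUNG_TASK = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+\S+\s+kernel:\s+"
    r"INFO: task (?P<task>.+?):(?P<pid>\d+) blocked for more"
)


def hung_tasks_by_hour(lines):
    """Count hung-task reports, keyed by (task name, hour of day)."""
    counts = Counter()
    for line in lines:
        match = HUNG_TASK.match(line)
        if match:
            counts[(match["task"], int(match["hour"]))] += 1
    return counts


sample = [
    "Mar  8 03:33:18 web1 kernel: INFO: task wget:13608 blocked for more than 120 seconds.",
    "Mar  8 03:33:19 web1 kernel: Call Trace:",
]
for (task, hour), count in sorted(hung_tasks_by_hour(sample).items()):
    print(f"{task} hung {count} time(s) during hour {hour:02d}")
```

Feeding it the lines of /var/log/messages and any rotated copies should show whether wget is the only task that gets stuck and whether it only happens in the 03:00 hour.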


  • Boris Epstein at Mar 11, 2011 at 12:42 pm

    Have you tried disabling the cron job you think is at fault to see if
    the lock up goes away? Also, have you checked all the users' crontabs?

    Boris.
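
One way to check every crontab in one pass is to read the usual locations and print whatever jobs they contain. This is a sketch only: the paths are the usual crontab locations on a CentOS 5 box, reading /var/spool/cron needs root, and the names `job_lines` and `all_jobs` are made up for illustration.

```python
import glob
import re

# Usual crontab locations on a CentOS 5 box (assumed here, adjust as needed).
CRONTAB_GLOBS = ["/etc/crontab", "/etc/cron.d/*", "/var/spool/cron/*"]

ASSIGNMENT = re.compile(r"^\w+\s*=")  # variable lines such as MAILTO=root


def job_lines(text):
    """Return the schedule lines of a crontab, skipping comments and variables."""
    jobs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ASSIGNMENT.match(line):
            continue
        jobs.append(line)
    return jobs


def all_jobs(patterns=CRONTAB_GLOBS):
    """Map each crontab file found to its job lines (None if it cannot be read)."""
    found = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path, errors="replace") as handle:
                    found[path] = job_lines(handle.read())
            except OSError:
                found[path] = None  # no permission, or not a regular file
    return found


for path, jobs in all_jobs().items():
    print(path, jobs)
```

Anything scheduled for hour 3 in that output would be worth disabling for a night to see whether the lockup goes away.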
  • Denniston, Todd A CIV NAVSURFWARCENDIV Crane at Mar 11, 2011 at 4:31 pm

    -----Original Message-----
    From: centos-bounces at centos.org [mailto:centos-bounces at centos.org] On
    Behalf Of PJ
    Sent: Friday, March 11, 2011 12:34
    To: centos at centos.org
    Subject: [CentOS] Server locking up everyday around 3:30 AM - (INFO: task
    wget:13608 blocked for more than 120 seconds) need sleep, help.
    <SNIP>
    There are no cron jobs that run at 3:30, no backups, the server has a
    load of 0, nothing is scheduled around that time...
    <SNIP>

    Are you sure the stuff in /etc/cron.daily/ has finished by then, or has
    not started yet?
    It could be something like mlocate or makewhatis chewing up CPU/memory.
    IIRC the stuff in /etc/cron.daily/ runs in alphabetical order, so are you
    (root) getting the logwatch messages, and at what time?
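
To see where a given job sits in the daily run, the file names can simply be sorted. This is a small sketch with two assumptions: the suffixes skipped below mirror what the Red Hat run-parts script is understood to ignore, and Python's plain string sort stands in for the shell's glob order (which can differ by locale). The name `daily_order` and the sample file names are illustrative.

```python
# Suffixes assumed to be skipped by run-parts (backup and RPM leftover files).
SKIPPED_SUFFIXES = ("~", ",", ".rpmsave", ".rpmorig", ".rpmnew", ".swp")


def daily_order(names):
    """Return job names in the order they are expected to run."""
    return sorted(name for name in names if not name.endswith(SKIPPED_SUFFIXES))


# Sample names; on a real box use os.listdir("/etc/cron.daily") instead.
sample = ["mlocate.cron", "0logwatch", "makewhatis.cron", "logrotate", "prelink.rpmsave"]
for position, name in enumerate(daily_order(sample), start=1):
    print(position, name)
```

If logwatch sorts first, as it does in this sample, the time stamp on its mail gives a rough start time for the whole daily run.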
  • Ross Walker at Mar 11, 2011 at 9:07 pm

    Run fsck on the file system wget is writing to. There might be corruption that it hits only on the 3:30 AM run, since that's when the other vendor dumps data to be downloaded.

    You could also check whether a RAID patrol read (scrub/predictive failure detection) is happening around this time, and disable or reschedule it.

    -Ross
  • Shen, Xin (Sinux) at Mar 14, 2011 at 10:47 am
    Updating the kernel will probably fix your problem.

    Best Regards
    Sinux

    _______________________________________________
    CentOS mailing list
    CentOS at centos.org
    http://lists.centos.org/mailman/listinfo/centos
  • Alexander Georgiev at Mar 12, 2011 at 3:07 am

    <SNIP>
    There are no cron jobs that run at 3:30, no backups, the server has a
    load of 0, nothing is scheduled around that time...
    <SNIP>

    Do you have smartd set to run short/long hard disk checks during the
    night? That is configured in /etc/smartd.conf, not via cron.

Discussion Overview
group: centos
categories: centos
posted: Mar 11, '11 at 12:33p
active: Mar 14, '11 at 10:47a
posts: 6
users: 6
website: centos.org
irc: #centos