I'm running NFS to get content to my web servers. I've now had this
problem 2 times (about 2 weeks since the last occurrence).
I use DRBD on the NFS server for redundancy. Now to my problem:

All my web sites stopped responding, so I started by checking dmesg, and
there I found a bunch of these errors:
Aug 11 16:00:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out
Aug 11 16:02:39 web03 kernel: lockd: server 192.168.20.22 not responding, timed out


But when I checked the NFS server, lockd was running, and I could access
all the files from the web server with ls, cd, etc.

The logs on the NFS server don't say anything of interest, and Apache's
error_log just says "not found or unable to stat".

As mentioned, this has happened 2 times, and both times I "solved" it by
rebooting the NFS server and the web servers. Having to reboot my servers
every couple of weeks isn't a good solution, so I really could use some
help. :)

I also get this from time to time on the web servers; I don't know if it's related:
do_vfs_lock: VFS is out of sync with lock manager!

  • Johan Swensson at Aug 13, 2008 at 2:38 am
    It happened again during the night, but this time I temporarily(?) fixed
    it by mounting with -o nolock on the web servers.
    It works, but dmesg is still being spammed with "lockd: server 192.168.20.22 not
    responding, timed out". At least my sites are up, and the message isn't
    critical anymore.
    But how can I get rid of it?
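
For anyone trying the same workaround, here is a small sketch for confirming which NFS mounts actually carry nolock. It reads lines in /proc/mounts format from stdin. The function name nfs_lock_mode is made up for illustration, and it assumes the kernel lists nolock among the mount options when locking is disabled.

```shell
#!/bin/sh
# Print "<mount point> nolock|lock" for every NFS mount found on stdin.
# Input is expected in /proc/mounts format: device mountpoint fstype options ...
# nfs_lock_mode is a hypothetical helper name, not an existing tool.
nfs_lock_mode() {
    awk '$3 == "nfs" || $3 == "nfs4" {
        # Match nolock only as a whole comma-separated option.
        mode = ($4 ~ /(^|,)nolock(,|$)/) ? "nolock" : "lock"
        print $2, mode
    }'
}
```

Typical use would be `nfs_lock_mode < /proc/mounts` on each web server. Keep in mind that with nolock, locks are only local to each client, so the web servers no longer coordinate locks with each other.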

  • Nate Amsden at Aug 13, 2008 at 3:16 am

    What does 'rpcinfo -p' read on both the servers and the clients?

    Also how about /etc/init.d/nfs status (both client and server)
    and /etc/init.d/nfslock status (both client and server)

    Any firewalls in between client and server?
    Run: iptables -L -n (on both client and server)

    nate
  • Craig White at Aug 13, 2008 at 3:25 am

    I don't want to step on Johan's thread but wanted to commiserate with
    him.

    No firewalls at present...
    nfs and nfslock are running on both client and server and show PIDs

    client
    [root@cube ~]# rpcinfo -p
    program vers proto port service
    100000 4 tcp 111 portmapper
    100000 3 tcp 111 portmapper
    100000 2 tcp 111 portmapper
    100000 4 udp 111 portmapper
    100000 3 udp 111 portmapper
    100000 2 udp 111 portmapper
    100000 4 0 111 portmapper
    100000 3 0 111 portmapper
    100000 2 0 111 portmapper
    100024 1 udp 50259 status
    100024 1 tcp 53710 status
    100021 1 tcp 53045 nlockmgr
    100021 3 tcp 53045 nlockmgr
    100021 4 tcp 53045 nlockmgr

    server
    [root@srv1 log]# rpcinfo -p
    program vers proto port
    100000 2 tcp 111 portmapper
    100000 2 udp 111 portmapper
    100024 1 udp 4003 status
    100024 1 tcp 4003 status
    100011 1 udp 4000 rquotad
    100011 2 udp 4000 rquotad
    100011 1 tcp 4000 rquotad
    100011 2 tcp 4000 rquotad
    100003 2 udp 2049 nfs
    100003 3 udp 2049 nfs
    100003 4 udp 2049 nfs
    100021 1 udp 4001 nlockmgr
    100021 3 udp 4001 nlockmgr
    100021 4 udp 4001 nlockmgr
    100021 1 tcp 4001 nlockmgr
    100021 3 tcp 4001 nlockmgr
    100021 4 tcp 4001 nlockmgr
    100003 2 tcp 2049 nfs
    100003 3 tcp 2049 nfs
    100003 4 tcp 2049 nfs
    100005 1 udp 4002 mountd
    100005 1 tcp 4002 mountd
    100005 2 udp 4002 mountd
    100005 2 tcp 4002 mountd
    100005 3 udp 4002 mountd
    100005 3 tcp 4002 mountd

    Server has ports fixed in place with settings in /etc/sysconfig/nfs

    Craig
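
For readers unfamiliar with pinning the ports: on CentOS 5 this is normally done with variables in /etc/sysconfig/nfs. The values below are inferred from the server's rpcinfo output above, not copied from Craig's actual file, so treat this as a sketch.

```shell
# /etc/sysconfig/nfs (sketch; values inferred from the rpcinfo output, not Craig's real file)
RQUOTAD_PORT=4000    # rquotad
LOCKD_TCPPORT=4001   # nlockmgr over tcp
LOCKD_UDPPORT=4001   # nlockmgr over udp
MOUNTD_PORT=4002     # mountd
STATD_PORT=4003      # status (rpc.statd)
```

Fixed ports mainly matter when a firewall sits between client and server, which is not the case here.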
  • Johan Swensson at Aug 13, 2008 at 4:35 am
    No firewall on either end and server responds to ping.

    client:
    program vers proto port
    100000 2 tcp 111 portmapper
    100000 2 udp 111 portmapper
    100024 1 udp 889 status
    100024 1 tcp 892 status
    server:

    program vers proto port
    100000 2 tcp 111 portmapper
    100000 2 udp 111 portmapper
    100024 1 udp 964 status
    100024 1 tcp 967 status
    100011 1 udp 718 rquotad
    100011 2 udp 718 rquotad
    100011 1 tcp 721 rquotad
    100011 2 tcp 721 rquotad
    100003 2 udp 2049 nfs
    100003 3 udp 2049 nfs
    100003 4 udp 2049 nfs
    100021 1 udp 32768 nlockmgr
    100021 3 udp 32768 nlockmgr
    100021 4 udp 32768 nlockmgr
    100003 2 tcp 2049 nfs
    100003 3 tcp 2049 nfs
    100003 4 tcp 2049 nfs
    100021 1 tcp 58027 nlockmgr
    100021 3 tcp 58027 nlockmgr
    100021 4 tcp 58027 nlockmgr
    100005 1 udp 778 mountd
    100005 1 tcp 781 mountd
    100005 2 udp 778 mountd
    100005 2 tcp 781 mountd
    100005 3 udp 778 mountd
    100005 3 tcp 781 mountd

    However, I had just rebooted the NFS server. When I checked before the
    reboot, lockd was running according to ps -A.
    As Craig said, he started noticing this about the time he upgraded to
    5.2. The same goes for me: I started getting this problem about the time
    I upgraded the clients and server.

    --

    *Johan Swensson | apegroup*
    System Administrator
    johan@apegroup.com
    Mobile: +46 (0) 735 21 98 58
    www.apegroup.com
    Fiskartorpsvägen 52, 115 42 Stockholm
  • Andylockran at Aug 13, 2008 at 11:05 am
    Not wanting to hijack the thread, but since around the same date I've had
    issues with NFS updates being 'delayed' by anything from two seconds
    to six hours.

    Weird one.
  • Nate Amsden at Aug 13, 2008 at 1:16 pm

    Johan Swensson wrote:
    > No firewall on either end and server responds to ping.
    >
    > client:
    > program vers proto port
    > 100000 2 tcp 111 portmapper
    > 100000 2 udp 111 portmapper
    > 100024 1 udp 889 status
    > 100024 1 tcp 892 status

    Doesn't look like nfslock is running on the client?

    What does /etc/init.d/nfslock status say?

    > As Craig said he started noticing this about the time he upgraded to
    > 5.2, the same goes for me, started getting this problem about the time
    > I've upgraded the clients and server.

    Maybe related to this bug:

    https://bugzilla.redhat.com/show_bug.cgi?id=453094

    Try restarting nfslock on both client and server when it occurs?
    Or try setting up a cron job to restart nfslock hourly on all systems
    to see if that can work around the issue until a fix comes out?

    nate
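
When comparing `rpcinfo -p` output between hosts, as in the posts above, it can help to reduce it to the registered services. The following is only a sketch: rpc_summary and rpc_has_service are hypothetical helper names, and both just filter text passed on stdin without making any RPC calls.

```shell
#!/bin/sh
# Summarize `rpcinfo -p` output (read from stdin) as unique "service proto port" lines.
# Lines whose first field is not a program number, such as the header, are skipped.
rpc_summary() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 5 { print $5, $3, $4 }' | sort -u
}

# Succeed if the named service (for example nlockmgr) appears in the output on stdin.
rpc_has_service() {
    rpc_summary | awk -v svc="$1" '$1 == svc { found = 1 } END { exit !found }'
}
```

For example, `rpcinfo -p | rpc_has_service nlockmgr || echo "nlockmgr not registered"` on a client. A missing nlockmgr entry is only a hint; as the replies show, the nfslock service can still report rpc.statd as running.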
  • Johan Swensson at Aug 13, 2008 at 1:48 pm

    nate wrote:
    > Doesn't look like nfslock is running on the client?
    >
    > What does /etc/init.d/nfslock status say?

    [root@web03 ~]# service nfslock status
    rpc.statd (pid 2737) is running...

    > Maybe related to this bug:
    >
    > https://bugzilla.redhat.com/show_bug.cgi?id=453094
    >
    > Try restarting nfslock on both client and server when it occurs?
    > Or try setting up a cron to restart nfslock hourly on all systems
    > to see if that can workaround the issue until a fix comes out?

    Actually, I tried restarting both nfslock (on clients and server) and
    nfs (on server), but it didn't help.
    Is my workaround of mounting with nolock a bad idea?

    I was also thinking about mounting the NFS shares as soft. Is this a
    good idea? Could it help me? And what's the difference between
    soft and intr?
    I read the manual and thought they were pretty similar.
  • Filipe Brandenburger at Aug 15, 2008 at 12:49 am

    On Wed, Aug 13, 2008 at 09:48, Johan Swensson wrote:
    > I was also thinking about mounting the nfs shares as soft, is this a good
    > idea?

    No, this is a bad idea. Mounting as soft means that if there are any
    errors or timeouts, your writes will fail, and most programs don't
    check the status of those, so you will have undetected data
    loss.

    > And also, what's the difference between soft and intr?

    Intr (which is a good idea) means that you can use "kill" to stop
    processes that are hung waiting for the NFS server. The problem with
    "intr" is that I have never seen it work. When my NFS server goes down,
    the processes that are waiting for it stay in "D" state, no
    matter if I try to "kill" or even "kill -9" them... So, although
    "intr" seems like a good idea, in practice it does not make much of a
    difference.

    HTH,
    Filipe
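
To make the options concrete, here is a sketch of a hard,intr mount for one of the web servers. Only the server address comes from the thread; the export path and mount point are invented for illustration, and the command is only printed, not run.

```shell
#!/bin/sh
# Hypothetical export path and mount point; only the server address is from the thread.
NFS_SERVER=192.168.20.22
NFS_EXPORT=/export/www
MOUNT_POINT=/var/www

# hard (the default) keeps retrying instead of returning I/O errors to programs;
# intr asks for signals to be able to interrupt a process stuck waiting on the server.
MOUNT_OPTS=hard,intr

MOUNT_CMD="mount -t nfs -o $MOUNT_OPTS $NFS_SERVER:$NFS_EXPORT $MOUNT_POINT"
echo "$MOUNT_CMD"
```

Replacing hard with soft in MOUNT_OPTS gives the variant Filipe advises against, where a timeout is reported to the program as an error instead of being retried.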
  • Craig White at Aug 13, 2008 at 2:52 am

    I too have been having the same issues with my nfs server - which seems
    to have started when I updated on July 27th (5.2)

    It seems to happen after logrotate on Sunday morning but I didn't know
    about it until users show up on Monday mornings.

    /var/log/messages has...

    Aug 4 09:32:59 cube kernel: lockd: server HOSTNAME not responding, still trying

    and like you, I've rebooted the main server each time (Monday
    mornings)...there's something wrong that I can't figure out

    Craig
  • Matthew Kent at Aug 13, 2008 at 4:27 pm

    This is the exact problem we were having here. Rebooting is the only
    solution.

    And as already mentioned further down the thread, it was attributed to
    this bug: https://bugzilla.redhat.com/show_bug.cgi?id=453094

    My solution was to extract the patch from the upstream kernel in
    http://people.redhat.com/dzickus/el5/103.el5/src/
    called
    linux-2.6-fs-lockd-nlmsvc_lookup_host-called-with-f_sema-held.patch

    and reroll the latest centosplus kernel srpm with it. Servers have been
    fine for 6 days running this kernel.

    As much as I hate carrying custom kernel rpms, this is a showstopper for
    us, and it looks like the fix won't make it in until 5.3.

    Personally given the limited scope of the patch and apparent
    unwillingness of redhat to include it in an update I'd advocate CentOS
    carrying it as a custom patch.

    Here's my srpm if anyone wants it,
    http://magoazul.com/tmp/kernel-2.6.18-92.1.10.1.el5.centos.plus.src.rpm
    the only change is the patch for this issue. Everything builds cleanly
    via mock.
    --
    Matthew Kent \ SA \ bravenet.com
  • Akemi Yagi at Sep 4, 2008 at 2:35 pm

    CentOS developer Tru compiled a patched version of the regular kernel
    and is offering it at:

    http://people.centos.org/tru/kernel+bz453094/

    Also, the fix will be in the upcoming kernel-2.6.18-92.1.13.el5
    according to the bugzilla referred to above.

    Akemi
  • Akemi Yagi at Sep 4, 2008 at 3:09 pm

    The bugzilla link is actually this one:

    https://bugzilla.redhat.com/show_bug.cgi?id=459083

    Akemi
  • Akemi Yagi at Sep 24, 2008 at 8:38 pm

    kernel-2.6.18-92.1.13.el5 is out (upstream):

    http://rhn.redhat.com/errata/RHSA-2008-0885.html

    Akemi
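
To check whether a machine is already at or past the fixed kernel named above, a rough comparison of release strings can be sketched as follows. The function name kernel_at_least is hypothetical, and the comparison is a simplification that looks only at the leading numeric fields; it is not rpm's own version algorithm. A custom rebuild such as Matthew's 92.1.10.1 centos.plus kernel will read as older even though it carries the patch.

```shell
#!/bin/sh
# Succeed if kernel release $1 is at least release $2.
# Fields are split on "." and "-" and compared numerically from the left;
# the first non-numeric field (such as "el5") ends the comparison, and
# missing numeric fields count as 0, so 2.6.18-92.el5 < 2.6.18-92.1.13.el5.
kernel_at_least() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, /[.-]/); nb = split(b, y, /[.-]/)
        for (i = 1; i <= na || i <= nb; i++) {
            if (i > na || x[i] !~ /^[0-9]+$/) adone = 1
            if (i > nb || y[i] !~ /^[0-9]+$/) bdone = 1
            if (adone && bdone) break
            va = adone ? 0 : x[i] + 0
            vb = bdone ? 0 : y[i] + 0
            if (va > vb) exit 0
            if (va < vb) exit 1
        }
        exit 0
    }'
}
```

For example, `kernel_at_least "$(uname -r)" 2.6.18-92.1.13.el5 && echo "kernel includes the lockd fix"` on each client and on the server.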
  • Craig White at Sep 24, 2008 at 9:23 pm

    yep and I'm still running an old kernel to get around this - got the
    notification from bugzilla today myself - hooray

    Craig

Posted: Aug 12, '08 at 12:27p
Active: Sep 24, '08 at 9:23p
Posts: 15
Users: 7
Website: centos.org