kern/160198: amd + NFS reconnect = ICMP storm + unkillable process
+ hung amd mount.
Artem Belevich
art at FreeBSD.org
Fri Aug 26 06:10:09 UTC 2011
>Number: 160198
>Category: kern
>Synopsis: amd + NFS reconnect = ICMP storm + unkillable process + hung amd mount.
>Confidential: no
>Severity: serious
>Priority: low
>Responsible: freebsd-bugs
>State: open
>Quarter:
>Keywords:
>Date-Required:
>Class: sw-bug
>Submitter-Id: current-users
>Arrival-Date: Fri Aug 26 06:10:08 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator: Artem Belevich
>Release: FreeBSD 8.2-STABLE i386
>Organization:
FreeBSD
>Environment:
FreeBSD stable/8, head
>Description:
When a process is interrupted during an NFS reconnect over
UDP, it gets stuck in an unkillable state.
In my particular case the NFS connection is to the amd process
on localhost. Continuous reconnects amount to a self-inflicted
DoS attack on amd, which becomes unresponsive and hangs every
other process that accesses amd-mounted filesystems. As a side
effect we also generate a rather high rate of ICMP port
unreachable replies. All in all the system ends up virtually
unavailable, and in many cases it takes a reboot to get it
out of this state.
The stuck process always has clnt_reconnect_call() in its backtrace:
18779 100511 collect2 -
mi_switch+0x176
turnstile_wait+0x1cb
_mtx_lock_sleep+0xe1
sleepq_catch_signals+0x386
sleepq_timedwait_sig+0x19
_sleep+0x1b1
clnt_dg_call+0x7e6
clnt_reconnect_call+0x12e
nfs_request+0x212
nfs_getattr+0x2e4
VOP_GETATTR_APV+0x44
nfs_bioread+0x42a
VOP_READLINK_APV+0x4a
namei+0x4f9
kern_statat_vnhook+0x92
kern_statat+0x15
freebsd32_stat+0x2e
syscallenter+0x23d
>How-To-Repeat:
In my case the problem most frequently occurs when a parallel
build that touches an amd-mounted filesystem is interrupted.
>Fix:
clnt_dg_call() uses msleep(), which may return ERESTART when
the current process is interrupted. If that happens we return
to clnt_reconnect_call() with RPC_CANTRECV.
clnt_reconnect_call() handles RPC_CANTRECV by trying to
reconnect again, and the story repeats. Because the current
code never returns to userland, the process never exits and
stays stuck, in most cases forever.
The fix is to convert ERESTART to RPC_INTR, which is how
ERESTART is handled elsewhere in the RPC code.
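
The loop described above can be modelled in a few lines of
userland C. This is only a sketch, not the kernel code or the
actual patch: every name here (sketch_stat, SK_ERESTART,
map_sleep_error, reconnect_call) is a hypothetical stand-in,
the constant values are illustrative, and the attempt cap
exists only so the unfixed case terminates in a demo.

```c
/* Hypothetical stand-ins for the RPC status codes named in the report. */
enum sketch_stat { SK_SUCCESS, SK_CANTRECV, SK_INTR };

/* Illustrative value for an interrupted sleep; an assumption of this sketch. */
#define SK_ERESTART (-1)

/*
 * Model of how the datagram call turns a sleep error into an RPC status.
 * With fixed == 0 an interrupted sleep is reported as "can't receive";
 * with fixed != 0 it is reported as "interrupted".
 */
static enum sketch_stat
map_sleep_error(int error, int fixed)
{
	if (error == 0)
		return (SK_SUCCESS);
	if (fixed && error == SK_ERESTART)
		return (SK_INTR);
	return (SK_CANTRECV);
}

/*
 * Model of the reconnect loop: retry on SK_CANTRECV, stop on anything
 * else. The pending signal interrupts every sleep, so each attempt sees
 * SK_ERESTART again. Returns the number of attempts made and stores the
 * final status in *out.
 */
static int
reconnect_call(int fixed, int max_attempts, enum sketch_stat *out)
{
	enum sketch_stat stat;
	int attempts = 0;

	do {
		attempts++;
		stat = map_sleep_error(SK_ERESTART, fixed);
	} while (stat == SK_CANTRECV && attempts < max_attempts);

	*out = stat;
	return (attempts);
}
```

Calling reconnect_call(0, 1000, &st) uses up all 1000 attempts
and ends with SK_CANTRECV, while reconnect_call(1, 1000, &st)
stops after the first attempt with SK_INTR, which lets the
caller unwind and deliver the signal.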
>Release-Note:
>Audit-Trail:
>Unformatted: