kern/160198: amd + NFS reconnect = ICMP storm + unkillable process + hung amd mount.

Fri Aug 26 06:10:09 UTC 2011

>Number:         160198
>Category:       kern
>Synopsis:       amd + NFS reconnect = ICMP storm + unkillable process + hung amd mount.
>Confidential:   no
>Severity:       serious
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Fri Aug 26 06:10:08 UTC 2011
>Closed-Date:
>Last-Modified:
>Originator:     Artem Belevich
>Release:        FreeBSD 8.2-STABLE i386
>Organization:
FreeBSD
>Environment:
FreeBSD stable/8, head

>Description:

	When a process is interrupted during an NFS reconnect over
	UDP, the process gets stuck in an unkillable state.

	In my particular case the NFS connection is to the amd process
	on localhost. The continuous reconnects amount to a
	self-inflicted DoS attack on amd, which renders it
	unresponsive and hangs all other processes that access
	amd-mounted filesystems. As a side effect we also generate a
	rather high rate of ICMP port unreachable replies. All in all,
	the system ends up being virtually unavailable, and in many
	cases it requires a reboot to get it out of this state.

	The stuck process always has clnt_reconnect_call() in its backtrace:

	18779 100511 collect2         -                
	mi_switch+0x176
	turnstile_wait+0x1cb 
	_mtx_lock_sleep+0xe1 
	sleepq_catch_signals+0x386
	sleepq_timedwait_sig+0x19 
	_sleep+0x1b1 
	clnt_dg_call+0x7e6
	clnt_reconnect_call+0x12e 
	nfs_request+0x212 
	nfs_getattr+0x2e4
	VOP_GETATTR_APV+0x44 
	nfs_bioread+0x42a 
	VOP_READLINK_APV+0x4a
	namei+0x4f9 
	kern_statat_vnhook+0x92 
	kern_statat+0x15
	freebsd32_stat+0x2e 
	syscallenter+0x23d
	

>How-To-Repeat:
	In my case the problem most frequently occurs when a parallel
	build that touches an amd-mounted filesystem is interrupted.

>Fix:
	
	clnt_dg_call() uses msleep(), which may return ERESTART when
	the current process is interrupted. If that happens we return
	to clnt_reconnect_call() with RPC_CANTRECV.
	clnt_reconnect_call() handles RPC_CANTRECV by trying to
	reconnect again, and the story repeats. Because the current
	code never returns to userland, the process never exits and
	gets stuck, in most cases forever.

	The fix is to convert ERESTART to RPC_INTR, which is what is
	already done in the other places in the RPC code that handle
	ERESTART.
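
	The retry loop and the effect of the conversion can be modeled
	in userland roughly as follows. This is only a sketch under
	stated assumptions, not the kernel patch: the enum values,
	map_sleep_error(), reconnect_attempts() and the retry cap are
	hypothetical stand-ins, and the real clnt_dg_call() and
	clnt_reconnect_call() do considerably more.

```c
#include <assert.h>

/*
 * Hypothetical stand-ins for the kernel values named in this PR.
 * The numeric values are illustrative only.
 */
enum sleep_err { SLEEP_OK = 0, SLEEP_ERESTART = -1 };
enum rpc_stat  { ST_SUCCESS, ST_CANTRECV, ST_INTR };

/*
 * Model of how the datagram call turns the result of its sleep into
 * an RPC status. With convert_erestart == 0, ERESTART falls through
 * to ST_CANTRECV (the behaviour described in this PR); with
 * convert_erestart != 0 it is reported as ST_INTR (the proposed fix).
 */
static enum rpc_stat
map_sleep_error(int error, int convert_erestart)
{
	if (error == SLEEP_OK)
		return (ST_SUCCESS);
	if (error == SLEEP_ERESTART && convert_erestart)
		return (ST_INTR);
	return (ST_CANTRECV);
}

/*
 * Model of the reconnect wrapper: ST_CANTRECV means "reconnect and
 * try again", anything else is handed back to the caller. The real
 * loop has no cap; max_attempts exists only so the model terminates.
 * Returns the number of attempts made.
 */
static int
reconnect_attempts(int sleep_error, int convert_erestart, int max_attempts)
{
	int attempts = 0;

	assert(max_attempts > 0);
	while (attempts < max_attempts) {
		attempts++;
		if (map_sleep_error(sleep_error, convert_erestart) !=
		    ST_CANTRECV)
			break;	/* success or interrupted: stop retrying */
		/* ST_CANTRECV: drop the connection and go around again */
	}
	return (attempts);
}
```

	In this model, reconnect_attempts(SLEEP_ERESTART, 0, 1000) runs
	all the way to the cap, while reconnect_attempts(SLEEP_ERESTART,
	1, 1000) stops after a single attempt, because the interrupted
	sleep is no longer mistaken for a receive failure.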

>Release-Note:
>Audit-Trail:
>Unformatted: