nfsd hang in sosend_generic

Nikolay Denev ndenev at gmail.com
Wed Nov 21 15:27:37 UTC 2012


On Nov 21, 2012, at 4:01 PM, Rick Macklem <rmacklem at uoguelph.ca> wrote:

> Nikolay Denev wrote:
>> Hello,
>> 
>> First of all, I'm not sure if this is actually an nfsd issue rather than
>> a network stack issue.
>> 
>> I've just had nfsd hang in an unkillable state while doing some I/O from
>> a Linux host running an Oracle DB with Oracle's Direct NFS client.
>> 
>> For some time I had been watching how the Direct NFS client loads the NFS
>> server differently: with the Linux kernel NFS client I see a single TCP
>> session to port 2049 and all traffic goes there, while the Direct NFS
>> client is much more aggressive, creates multiple TCP sessions, and was
>> often able to generate pretty big Send/Recv-Qs on FreeBSD's side.
>> I'm mentioning this as it is probably related.
>> 
> I don't know anything about the Oracle client, but it might be creating
> new TCP connections to try and recover from a "hung" state. Your netstat
> for the client below shows that there are several ESTABLISHED TCP connections
> with large receive queues. I wouldn't expect to see this and it suggests
> that the Oracle client isn't receiving/reading data off the TCP socket for
> some reason. Once it isn't receiving/reading an RPC reply off the TCP socket,
> it might create a new one to attempt a retry of the RPC. (NFSv4 requires that
> any retry of an RPC be done on a new TCP connection. Although that requirement
> doesn't exist for NFSv3, it would probably be considered "good practice" and
> will happen if NFSv3 and NFSv4 share the same RPC socket handling code.)
> 
>> Here's the procstat -kk of the hung nfsd process:
>> 
>> [... snipped huge procstat output …]
>> 
> It appears that all the nfsd threads are trying to send RPC replies
> back to the client and are stuck there. As you can see below, the
> send queues for the TCP sockets are big, so the data isn't getting
> through to the client. The large receive queue in the ESTABLISHED
> connections on the Linux client suggests that Oracle isn't taking
> data off the TCP socket for some reason, which would result in this,
> once the send window is filled. At least that's my rusty old
> understanding of TCP. (That would hint at an Oracle client bug,
> but I don't know anything about the Oracle client.)
> 
> Why? Well, I can't even guess, but a few things you might try are:
> - disabling TSO and rx/tx checksum offload on the FreeBSD server's
>  network interface(s).
> - try a different type of network card, if you have one handy.
> I doubt these will make a difference, since the large receive queues
> for the ESTABLISHED TCP connections on the Linux client suggest that
> the data is getting through. Still might be worth a try, since there
> might be one packet that isn't getting through and that is causing
> issues for the Oracle client.
> 
> - if you can do it, try switching the Oracle client mounts to UDP.
>  (For UDP, you want to start with a rsize, wsize no bigger than
>   16384 and then be prepared to make it smaller if the
>   "fragments dropped due to timeout" becomes non-zero for UDP when
>   you do a "netstat -s".)
>   - There might be a NFS over TCP bug in the Oracle client.
> - when it is stuck again, do a "vmstat -z" and "vmstat -m" to
>  see if there is a large "InUse" for anything.
>  - in particular, check mbuf clusters
> 
> Also, you could try capturing packets when it
> happens and look at them in wireshark to see if/what
> related traffic is going on the wire. Focus on the TCP layer
> as well as NFS.
> 

Looking at it again, it really looks like a bug in the Oracle client, so
for now we've decided to disable the Direct NFS client and switch back to the
standard Linux kernel NFS client.
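
If it gets stuck like this again, I'll also try to grab a packet capture on the
server as you suggest; something along these lines should be enough (the
interface name here is just an example for this box):

  tcpdump -i ix0 -s 0 -w /var/tmp/nfs-hang.pcap host 10.101.0.2 and port 2049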

Unfortunately, testing with UDP won't be possible, as I think Oracle's Direct NFS client only supports TCP.
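
With the kernel client, though, a UDP mount along the lines you suggest should
be doable; roughly something like this on the Linux side, where the export path
and mount point are just placeholders:

  mount -t nfs -o proto=udp,vers=3,rsize=16384,wsize=16384 10.101.0.1:/export /mnt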

What is curious is why the kernel NFS mount from the Linux host was also stuck because of the misbehaving userspace client.
I should have tested mounting from another host to see if the NFS server would still respond, as this looks like a DoS against the NFS server :)

Anyway, I've started collecting and graphing the output of netstat -m and vmstat -z
in case something like this happens again.
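
The collection side is nothing fancy; something as simple as the following
would do (the log path and interval are just placeholders):

  #!/bin/sh
  # Append timestamped mbuf/zone statistics every 5 minutes for later graphing.
  while :; do
      date '+%Y-%m-%dT%H:%M:%S' >> /var/log/nfs-stats.log
      netstat -m >> /var/log/nfs-stats.log
      vmstat -z >> /var/log/nfs-stats.log
      sleep 300
  done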

>> 
>> Here is a netstat output for the nfs sessions from FreeBSD server
>> side:
>> 
>> Proto Recv-Q Send-Q Local Address Foreign Address (state)
>> tcp4 0 37215456 10.101.0.1.2049 10.101.0.2.42856 ESTABLISHED
>> tcp4 0 14561020 10.101.0.1.2049 10.101.0.2.62854 FIN_WAIT_1
>> tcp4 0 3068132 10.100.0.1.2049 10.100.0.2.9712 FIN_WAIT_1
>> 
>> Linux host sees this :
>> 
>> tcp 1 0 10.101.0.2:9270 10.101.0.1:2049 CLOSE_WAIT
>> tcp 477940 0 10.100.0.2:9712 10.100.0.1:2049 ESTABLISHED
> ** These hint that the Oracle client isn't reading the socket
>   for some reason. I'd guess that the send window is now full,
>   so the data is backing up in the send queue in the server.
>> tcp 1 0 10.101.0.2:10588 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:12254 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:12438 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:17583 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:20285 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:20678 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:22892 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:28850 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:33851 10.100.0.1:2049 CLOSE_WAIT
>> tcp 165 0 10.100.0.2:34190 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:35643 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:39498 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:39724 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:40742 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:41674 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:42942 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:42956 10.100.0.1:2049 CLOSE_WAIT
>> tcp 477976 0 10.101.0.2:42856 10.101.0.1:2049 ESTABLISHED
>> tcp 1 0 10.100.0.2:42045 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:42048 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:43063 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:44771 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:49568 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:50813 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:51418 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:54507 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:57201 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:58553 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:59638 10.101.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.100.0.2:62289 10.100.0.1:2049 CLOSE_WAIT
>> tcp 1 0 10.101.0.2:61848 10.101.0.1:2049 CLOSE_WAIT
>> tcp 476952 0 10.101.0.2:62854 10.101.0.1:2049 ESTABLISHED
>> 
>> Then I used "tcpdrop" on FreeBSD's side to drop the sessions, after which
>> nfsd was able to die and be restarted.
>> During the "hung" period, all NFS mounts from the Linux host were
>> inaccessible and I/O hung.
>> 
>> The nfsd is running with drc2/drc3 and lkshared patches from Rick
>> Macklem.
>> 
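
(For anyone hitting the same thing: tcpdrop takes the local and foreign
endpoints, so for the first session in the netstat output above it was
something like "tcpdrop 10.101.0.1 2049 10.101.0.2 42856".)
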
> These shouldn't have any effect on the above, unless you've exhausted
> your mbuf clusters. Once you are out of mbuf clusters, I'm not sure
> what might happen within the lower layers (TCP -> network interface).
> 
> Good luck with it, rick
> 

Thank you for the response!

Cheers,
Nikolay



