NFS 75 second stall

Wed Sep 1 15:56:50 UTC 2010

  On 07/01/10 15:23, Garrett Cooper wrote:
> On Thu, Jul 1, 2010 at 11:51 AM, alan bryan<alan.bryan at yahoo.com>  wrote:
>>
>> --- On Thu, 7/1/10, Garrett Cooper<yanefbsd at gmail.com>  wrote:
>>
>>> From: Garrett Cooper<yanefbsd at gmail.com>
>>> Subject: Re: NFS 75 second stall
>>> To: "alan bryan"<alan.bryan at yahoo.com>
>>> Cc: freebsd-stable at freebsd.org
>>> Date: Thursday, July 1, 2010, 11:13 AM
>>> On Thu, Jul 1, 2010 at 11:01 AM, alan bryan<alan.bryan at yahoo.com> wrote:
>>>> Setup:
>>>>
>>>> server - FreeBSD 8-stable from today.  2 UFS dirs exported via NFS.
>>>> client - FreeBSD 8.0-Release.  Running a test php script that copies
>>>> around various files to/from 2 separate NFS mounts.
>>>>
>>>> Situation:
>>>>
>>>> script is started (forked to do 20 simultaneous runs) and 20 1GB files
>>>> are copied to the NFS dir which works fine.  When it then switches to
>>>> reading those files back and simultaneously writing to the other NFS
>>>> mount I see a hang of 75 seconds.  If I do an "ls -l" on the NFS mount
>>>> it hangs too.  After 75 seconds the client has reported:
>>>>
>>>> nfs server 192.168.10.133:/usr/local/export1: not responding
>>>> nfs server 192.168.10.133:/usr/local/export1: is alive again
>>>> nfs server 192.168.10.133:/usr/local/export1: not responding
>>>> nfs server 192.168.10.133:/usr/local/export1: is alive again
>>>>
>>>> and then things start working again.  The server was originally
>>>> FreeBSD 8.0-Release also but was upgraded to the latest stable to see
>>>> if this issue could be avoided.
>>>>
>>>> # nfsstat -s -W -w 1
>>>>   GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
>>>>        0      0      0    222    257      0      0      0
>>>>        0      0      0    178    135      0      0      0
>>>>        0      0      0     85    127      0      0      0
>>>>        0      0      0      0      0      0      0      0
>>>>        0      0      0      0      0      0      0      0
>>>>        0      0      0      0      0      0      0      0
>>>>        0      0      0      0      0      0      0      0
>>>>        0      0      0      0      0      0      0      0
>>>> ... for 75 rows of all zeros
>>>>
>>>>        0      0      0    272    266      0      0      0
>>>>        0      0      0    167    165      0      0      0
>>>>
>>>> I also tried runs with 15 simultaneous processes and 25.  15 processes
>>>> gave only about a 5 second stall but 25 gave again the same 75 second
>>>> stall.
>>>>
>>>> Further, I tested with 2 mounts to the same server but from ZFS
>>>> filesystems with the exact same stall/timeout periods.  So, it doesn't
>>>> appear to matter what the underlying filesystem is - it's something in
>>>> NFS or networking code.
>>>>
>>>> Any ideas on what's going on here?  What's causing the complete stall
>>>> period of zero NFS activity?  Any flaws with my testing methods?
>>>>
>>>> Thanks for any and all help/ideas.
>>> What network driver are you using? Have you tried
>>> tcpdumping the packets?
>>> -Garrett
>>>
>> I'm using igb currently but have also used em.  I have not tried tcpdumping the packets yet on this test.  Any suggestions on things to look out for (I'm not that familiar with that whole process).
>>
>> Which brings up another point - I'm using TCP connections for NFS, not UDP.
>      Is the net.inet.tcp.tso sysctl enabled or not? What about rxcsum and txcsum?
> Thanks,
> -Garrett
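
For anyone wanting to answer Garrett's question on several clients at
once, here is a rough sketch (not a tested tool) that pulls the
checksum-offload and TSO capabilities out of saved "ifconfig" output.
The function name and the sample text are made up for illustration; it
only assumes the usual FreeBSD "options=<hex><FLAG,FLAG,...>" line.

```python
import re


def enabled_offloads(ifconfig_text):
    """Return the RXCSUM/TXCSUM/TSO* capabilities listed as enabled.

    Assumes FreeBSD-style output where the interface capabilities appear
    on a line of the form "options=<hex><FLAG,FLAG,...>".
    """
    # Anchor at line start so an "nd6 options=..." line is not matched.
    match = re.search(r"^\s*options=[0-9a-fA-F]+<([^>]*)>",
                      ifconfig_text, re.MULTILINE)
    if match is None:
        return set()
    flags = match.group(1).split(",")
    return {f for f in flags
            if f in ("RXCSUM", "TXCSUM") or f.startswith("TSO")}


# Illustrative sample only; the hex value and flag list are invented.
SAMPLE = (
    "igb0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500\n"
    "\toptions=13b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,TSO4>\n"
    "\tether 00:00:00:00:00:00\n"
)

print(sorted(enabled_offloads(SAMPLE)))
```

If any of these show up, "ifconfig igb0 -tso -rxcsum -txcsum" (with the
right interface name) should turn them off for a test run, alongside
the net.inet.tcp.tso sysctl mentioned above.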

We're occasionally seeing these same types of stalls (+ repeated "is
not responding" "is alive again" messages in quick succession).  We're 
seeing it only on our 8.1-RELEASE systems against a variety of NFS 
servers (6.3-RELEASE, 7.2-RELEASE, and 8-STABLE from before the release 
of 8.1).  We also see it happen with a variety of client hardware and 
network adapters (em, bce, bge); the only common denominator is 
8.1-RELEASE on the clients.
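
In case it helps others compare notes: a minimal sketch for measuring
the stall from captured "nfsstat -s -W -w 1" output like the one quoted
above. It just counts the longest run of all-zero rows; since the
interval is 1 second, that run length is roughly the stall in seconds.
The function name is hypothetical and the parsing assumes whitespace
separated integer columns, nothing more.

```python
def longest_idle_run(lines):
    """Return the longest run of consecutive all-zero sample rows."""
    longest = 0
    current = 0
    for line in lines:
        fields = line.split()
        # Ignore blank lines and anything that is not a pure numeric row
        # (for example the "GtAttr Lookup ..." header).
        if not fields or not all(f.isdigit() for f in fields):
            continue
        if all(int(f) == 0 for f in fields):
            current += 1
            longest = max(longest, current)
        else:
            # Any non-zero counter means the server handled requests.
            current = 0
    return longest


# A shortened excerpt in the shape of the output quoted in this thread.
EXCERPT = [
    "  GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir",
    "       0      0      0     85    127      0      0      0",
    "       0      0      0      0      0      0      0      0",
    "       0      0      0      0      0      0      0      0",
    "       0      0      0      0      0      0      0      0",
    "       0      0      0    272    266      0      0      0",
]

print(longest_idle_run(EXCERPT))
```

Running it over a log from each client would at least show whether the
stalls we see are the same length as the 75 seconds reported here.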