NFS FHA issue and possible change to the algorithm

Rick Macklem rmacklem at uoguelph.ca
Fri Oct 23 12:04:40 UTC 2015


Hi,

An off-list discussion occurred where a site running an NFS server found that
they needed to disable File Handle Affinity (FHA) to get good performance.
Here is a re-post of some of that discussion (with Josh's permission).
First, what was observed w.r.t. the machine:
Josh Paetzel wrote:
>>>> It's all good.
>>>>
>>>> It's a 96GB RAM machine and I have 2 million nmbclusters, so 8GB RAM,
>>>> and we've tried 1024 NFS threads.
>>>>
>>>> It might be running out of network memory, but we can't really afford to
>>>> give it any more; for this use case, disabling FHA might end up being the
>>>> way to go.
>>>>
>>>>
I wrote:
>>> Just to fill mav@ in, the person that reported a serious performance
>>> problem to Josh was able to fix it by disabling FHA.
Josh Paetzel wrote:
>>
>> There are about 300 virtual machines that mount root from a read-only NFS
>> share.
>>
>> There's also another few hundred users that mount their home directories
>> over NFS.  When things go sideways it is always the virtual machines that
>> become unusable: 45 seconds to log in via ssh, 15 minutes to boot, stuff
>> like that.
>>
>> [root at head2] ~# nfsstat -s 1
>>  GtAttr Lookup Rdlink   Read  Write Rename Access  Rddir
>>    4117     17      0    124    689      4    680      0
>>    4750     31      5    121    815      3    950      1
>>    4168     16      0    109    659      9    672      0
>>    4416     24      0    112    771      3    748      0
>>    5038     86      0     76    728      4    825      0
>>    5602     21      0     76    740      3    702      6
>>
>> [root at head2] ~# arcstat.py 1
>>     time  read  miss  miss%  dmis  dm%  pmis  pm%  mmis  mm%  arcsz     c
>> 18:25:36    21     0      0     0    0     0    0     0    0    65G   65G
>> 18:25:37  1.8K    23      1    23    1     0    0     7    0    65G   65G
>> 18:25:38  1.9K    88      4    32    1    56   32     3    0    65G   65G
>> 18:25:39  2.2K    67      3    62    2     5    5     2    0    65G   65G
>> 18:25:40  2.7K   132      4    39    1    93   17     8    0    65G   65G
>>
>> last pid:  7800;  load averages:  1.44,  1.65,  1.68   up 0+19:22:29  18:26:16
>> 69 processes:  1 running, 68 sleeping
>> CPU:  0.1% user,  0.0% nice,  1.8% system,  0.9% interrupt, 97.3% idle
>> Mem: 297M Active, 180M Inact, 74G Wired, 140K Cache, 565M Buf, 19G Free
>> ARC: 66G Total, 39G MFU, 24G MRU, 53M Anon, 448M Header, 1951M Other
>> Swap: 28G Total, 28G Free
>>
>>   PID USERNAME       THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
>>  9915 root            37  52    0  9900K  2060K rpcsvc 16  16.7H 24.02% nfsd
>>  6402 root             1  52    0 85352K 20696K select  8  47:17  3.08% python2.7
>> 43178 root             1  20    0 70524K 30752K select  7  31:04  0.59% rsync
>>  7363 root             1  20    0 49512K  6456K CPU16  16   0:00  0.59% top
>> 37968 root             1  20    0 70524K 31432K select  7  16:53  0.00% rsync
>> 37969 root             1  20    0 55752K 11052K select  1   9:11  0.00% ssh
>> 13516 root            12  20    0   176M 41152K uwait  23   4:14  0.00% collectd
>> 31375 root            12  20    0   176M 42432K uwa
>>
>> This is a quick peek at the system at the end of the day, so load has
>> dropped off considerably; however, the main takeaway is that it has plenty
>> of free RAM and the ZFS ARC hit percentage is > 99%.
>>
I wrote:
>>> I took a look at it and I wonder if it is time to consider changing the
>>> algorithm somewhat?
>>>
>>> The main thing that I wonder about is doing FHA for all the RPCs other than
>>> Read and Write.
>>>
>>> In particular, Getattr is often the most frequent RPC and doing FHA for it
>>> seems like wasted overhead to me. Normally separate Getattr RPCs wouldn't
>>> be done for FHs that are being Read/Written, since the Read/Write reply
>>> has updated attributes in it.
>>>
Although the load is mostly Getattr RPCs and I think the above statement is correct,
I don't know whether the overhead of doing FHA for all the Getattr RPCs explains the
observed performance problem.
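
For reference, the NFSv3 Read reply carries post-op attributes (RFC 1813), which is
why a client actively reading a file shouldn't need to issue separate Getattr RPCs
for it. A simplified C rendering of that reply (field names follow the RFC; the data
and most attribute fields are elided, so this is illustrative only):

#include <stdint.h>
#include <stdbool.h>

struct fattr3_min {                 /* subset of the RFC 1813 fattr3 */
        uint64_t size;
        uint64_t fileid;
        /* type, mode, uid, gid, timestamps, ... omitted */
};

struct post_op_attr {
        bool attributes_follow;     /* TRUE when attributes are present */
        struct fattr3_min attributes;
};

struct read3resok_min {
        struct post_op_attr file_attributes;  /* fresh attrs piggybacked here */
        uint32_t count;                       /* bytes actually read */
        bool     eof;
        /* opaque data<> follows in the XDR stream */
};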

I don't see how doing FHA for RPCs like Getattr will improve their performance.
Note that when the FHA algorithm was originally written, there was no shared vnode
lock, so all RPCs on a given FH/vnode would have been serialized by the vnode lock
anyhow. Now, with shared vnode locks, that is no longer the case for frequently
performed RPCs like Getattr, Read (Write for ZFS), Lookup and Access. I have always
felt that doing FHA for RPCs other than Read and Write didn't make much sense, but I
don't have any evidence that it causes a significant performance penalty.
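
To illustrate the shared vnode lock point, here's a rough kernel-style sketch
(illustrative only, not code from the tree): a Getattr-style operation only needs a
shared lock, so any number of nfsd threads can be in it for the same vnode at once
and FHA buys nothing there.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/ucred.h>
#include <sys/vnode.h>

static int
example_getattr(struct vnode *vp, struct vattr *vap, struct ucred *cred)
{
        int error;

        /* LK_SHARED: concurrent readers of the attributes are allowed. */
        vn_lock(vp, LK_SHARED | LK_RETRY);
        error = VOP_GETATTR(vp, vap, cred);
        VOP_UNLOCK(vp, 0);
        return (error);
}

Before shared vnode locking, the same operation would have taken the lock
LK_EXCLUSIVE, so all RPCs on that vnode serialized no matter which nfsd thread they
landed on.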

Anyhow, the attached simple patch limits FHA to Read and Write RPCs.
The simple testing I've done shows it to be about performance neutral (0-1% improvement),
but I only have small hardware, no ZFS and no easy way to emulate a load of mostly
Getattr RPCs. As such, unless others can determine whether this patch (or some other
one) helps with a load like the one above, I don't think committing it makes much sense.
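
For anyone who doesn't want to open the attachment, the gist is along these lines
(the names below are illustrative stand-ins, not the routines the patch actually
touches): when assigning an incoming RPC to an nfsd thread, only consult the file
handle hash for Read and Write, and let the generic code pick a thread for
everything else.

#include <stddef.h>
#include <stdint.h>

struct svc_thread;                      /* opaque here */

/* Assumed stand-ins for the real FHA/krpc routines. */
struct svc_thread *fha_hash_lookup(const void *fh, size_t fhlen);
struct svc_thread *svc_pick_any_thread(void);

struct svc_thread *
assign_rpc_thread(uint32_t procnum, const void *fh, size_t fhlen)
{
        /*
         * NFSv3 procedure numbers from RFC 1813: READ = 6, WRITE = 7.
         * Only these keep file handle affinity; Getattr, Lookup, Access,
         * etc. skip the FHA hashing and its lock.
         */
        if (procnum == 6 || procnum == 7)
                return (fha_hash_lookup(fh, fhlen));
        return (svc_pick_any_thread());
}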

If anyone can test this, or has comments or suggestions for other possible changes
to the FHA algorithm, please let me know.

Thanks, rick

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nfsfha.patch
Type: text/x-patch
Size: 1882 bytes
Desc: not available
URL: <http://lists.freebsd.org/pipermail/freebsd-fs/attachments/20151023/2f816d84/attachment.bin>

