Kernel modules

Thu Apr 18 12:52:53 UTC 2019

My NFS over IB setup has generally been working well, but it goes down
under very heavy load.  The failure is consistently triggered by 256
I/O-intensive processes running across about a dozen compute nodes.

The server itself remains up and responsive, but ib0 runs out of buffer
space and goes down.  So far the only way I've found to get the
interface back is to reboot.

I cranked the TCP buffer limits way up based on some search results, but
the interface went down again at the same point even with the higher
values.  Current settings:

net.inet.tcp.sendbuf_max: 67108864
net.inet.tcp.recvbuf_max: 67108864

Anyone have a suggestion for dealing with this?

root@zfs-01:/home/bacon # ping compute-001-hpc
PING compute-001-hpc.mortimer (192.168.129.18): 56 data bytes
ping: sendto: No buffer space available
ping: sendto: No buffer space available
^C
--- compute-001-hpc.mortimer ping statistics ---
2 packets transmitted, 0 packets received, 100.0% packet loss
root@zfs-01:/home/bacon # ping compute-001
PING compute-001.mortimer (192.168.1.18): 56 data bytes
64 bytes from 192.168.1.18: icmp_seq=0 ttl=64 time=0.492 ms
64 bytes from 192.168.1.18: icmp_seq=1 ttl=64 time=0.294 ms
^C
--- compute-001.mortimer ping statistics ---
2 packets transmitted, 2 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.294/0.393/0.492/0.099 ms
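
In case it helps anyone trying to reproduce this, here is a minimal
sketch for counting the "No buffer space available" errors in ping
output like the above.  The name count_enobufs is a hypothetical helper
made up for illustration, not an existing tool, and the sketch only
detects the symptom; it says nothing about the cause.

```shell
#!/bin/sh
# Hypothetical helper (not an existing tool): count ENOBUFS errors
# in ping output supplied on stdin.
count_enobufs() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when there are none, so mask that to keep callers simple.
    grep -c 'No buffer space available' || true
}

# Example use (ping writes these errors to stderr, hence 2>&1):
#   ping -c 2 compute-001-hpc 2>&1 | count_enobufs
```

A non-zero count on the -hpc hostname together with a zero count on the
plain hostname would match the behavior shown in the transcript.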

dmesg is showing a lot of these:

nfsrv_cache_session: no session

Thanks,

     JB

-- 
Earth is a beta site.