bacon4000 at gmail.com
Sat Apr 13 18:41:17 UTC 2019
On 2019-04-13 13:29, Justin Clift wrote:
> On 2019-04-13 23:52, Jason Bacon wrote:
>> Stability will take a long time to test properly. I'm going to start
>> by rerunning some of our most I/O-intensive jobs on it - jobs that
>> actually broke our CentOS RAID servers until I switched them to NFS
>> over RDMA.
> That's got to be the first time anyone's ever mentioned "NFS over
> RDMA" as increasing a system's stability. :)
> + Justin
Believe it or not... ;-)
After my upgrade from CentOS 6 to CentOS 7, NFS over TCP started falling
apart under heavy load: servers and compute nodes became unresponsive
and required a reboot to restore stability.
If it's due to problems in the CentOS TCP stack, NFS over RDMA would
help by eliminating the TCP stack from the pathway.
On one cluster (old QLogic HCAs), setting net.core.netdev_budget=2000
seems to have solved the issue. On the other (newer Mellanox FDR HCAs),
it did not seem to help, so I tried RDMA and it's been stable ever
since. The downside is that we can no longer monitor traffic with iftop...
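
For anyone wanting to check whether the receive budget is a plausible
suspect before raising it, here is a minimal Python sketch. It is not
what I ran on the clusters. It assumes a Linux host where
/proc/net/softnet_stat has one row per CPU with hexadecimal counters,
the first three being packets processed, packets dropped, and
"time_squeeze" (the number of times the receive softirq stopped with
work still pending because its budget or time limit ran out). The
function names are made up for illustration.

```python
from pathlib import Path


def parse_softnet_stat(text):
    """Parse softnet_stat text into one dict per row.

    Assumes each row starts with three hex counters:
    processed, dropped, time_squeeze. Rows are numbered in file
    order, which may not match CPU ids if some CPUs are offline.
    """
    rows = []
    for row, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        processed, dropped, time_squeeze = (int(f, 16) for f in fields[:3])
        rows.append({
            "row": row,
            "processed": processed,
            "dropped": dropped,
            "time_squeeze": time_squeeze,
        })
    return rows


def read_netdev_budget(path="/proc/sys/net/core/netdev_budget"):
    """Return the current net.core.netdev_budget, or None if unavailable."""
    p = Path(path)
    return int(p.read_text().strip()) if p.exists() else None


if __name__ == "__main__":
    stat_file = Path("/proc/net/softnet_stat")
    if stat_file.exists():
        print("netdev_budget:", read_netdev_budget())
        for r in parse_softnet_stat(stat_file.read_text()):
            print(r)
    else:
        print("softnet_stat not found; this sketch assumes Linux")
```

A time_squeeze count that keeps climbing under load would be consistent
with the budget being too small, though it does not prove that is the
cause of the hangs.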
Earth is a beta site.