when has a pNFS data server failed?

Wed Aug 23 12:36:08 UTC 2017

Karli Sjöberg wrote:
[stuff snipped for brevity]
>>Rick Macklem wrote:
>>To be honest, I think the answer for version 1 will come down to...
>>
>>How long should the MDS try to communicate with the DS before it gives up and
>>considers it failed?
>>
>>It will probably be settable via a sysctl, but does need a reasonable default value.
>>(A "very large" value would indicate "leave it for the sysadmin to decide and do
>>manually".)
[more stuff snipped]
>This is what one prominent "customer" says about timeout:
>https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
>"These issues occur when the guest operating system timeout values are exceeded for
>attached storage disks. This may be caused by an underlying storage problem or due to
>brief transient pauses during normal operations (such as path failover). To accommodate
>transient events, the VMware Tools increases the SCSI disk timeout to 60 seconds for
>Virtual Infrastructure 3 and 180 seconds for vSphere 4 and higher."
>
>Which means that you have a minute before the "customers" start complaining:)
Thanks. I was thinking that a minute or two is about what the default might want
to be. It may need to be longer than that since, as one example, a DS needs to be
able to reboot and start servicing RPCs before this timeout happens.
(Fortunately a DS does not need to wait for the "grace period" that an NFSv4/MDS
 server does after boot, since that time is for clients to recover locks and the locks
 are handled by the MDS and not the DSs.)
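
To make the timeout idea concrete, here is a rough sketch (in Python, just for
illustration) of how the MDS side might decide a DS has failed. This is not the
actual FreeBSD code. The names DSFailureTracker, timeout_secs, rpc_failed and
rpc_succeeded are all made up, and the 120 second default is only the "minute or
two" guess from above. A DS that reboots and answers an RPC before the timeout
resets the clock and is never marked failed.

```python
import time


class DSFailureTracker:
    """Hypothetical helper: tracks how long one DS has been unreachable."""

    def __init__(self, timeout_secs=120.0, clock=time.monotonic):
        # timeout_secs stands in for the sysctl-settable value (assumed default).
        # float("inf") would mean "never fail automatically; leave it to the sysadmin".
        self.timeout_secs = timeout_secs
        self.clock = clock
        self.first_failure = None  # time of the first RPC failure in the current outage

    def rpc_succeeded(self):
        # The DS answered (for example after a reboot), so the outage is over.
        self.first_failure = None

    def rpc_failed(self):
        # Record the start of the outage on the first failure, then report
        # whether the DS has now been unreachable for at least timeout_secs.
        now = self.clock()
        if self.first_failure is None:
            self.first_failure = now
        return (now - self.first_failure) >= self.timeout_secs
```

The MDS would call rpc_failed() each time an RPC to the DS fails and only treat
the DS as failed once it returns True.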

Thanks for the comment, rick