when has a pNFS data server failed?

Rick Macklem rmacklem at uoguelph.ca
Tue Aug 22 19:51:15 UTC 2017


Ronald Klop wrote:
>On Fri, 18 Aug 2017 23:52:12 +0200, Rick Macklem <rmacklem at uoguelph.ca>
>wrote:
>> This is kind of a "big picture" question that I thought I'd throw out.
>>
>> As a brief background, I now have the code for running mirrored pNFS
>> Data Servers working for normal operation. You can look at:
>> http://people.freebsd.org/~rmacklem/pnfs-planb-setup.txt
>> if you are interested in details related to the pNFS server code/testing.
>>
>> So, now I am facing the interesting part:
>> 1 - The Metadata Server (MDS) needs to decide that a mirrored DS has
>>     failed at some point. Once that happens, it stops using the DS, etc.
>> --> This brings me to the question of "when should the MDS decide that
>>     the DS has failed and should be taken offline?".
>>     - I'm not up to date w.r.t. the TCP stack, so I'm not sure how long
>>       it will take for the TCP connection to decide that a DS server is
>>       no longer working and fail the TCP connection. I think it takes a
>>       fair amount of time, so I'm not sure whether TCP connection loss
>>       is a good indicator of DS server failure.
>>     - It seems to me that the MDS should wait a fairly long time before
>>       failing the DS, since this will have a major impact on the pNFS
>>       server, requiring repair/resilvering by a sysadmin once it happens.
>> So, any comments or thoughts on this? rick
>
>This is quite a common problem for all clustered/connected systems. I
>think there is no general answer, and there are a lot of papers written
>about it.
If you have a suggestion for one good paper, I might be willing to read it.
The short answer is that I'm retired after 30 years of working for a university
and have roughly zero interest in reading academic papers.
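
Back to the TCP question in my original post above: the stack's defaults take
a long time (many minutes) to declare a dead peer. One knob is TCP keepalive;
here's a userland-style sketch for FreeBSD (the timer values are arbitrary,
and the krpc would do the equivalent with sosetopt()):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /*
     * Sketch: have a TCP connection to a DS notice a dead peer in roughly
     * idle + intvl * cnt seconds (10 + 5 * 3 = 25 here) instead of the
     * much longer system defaults.
     */
    static int
    set_ds_keepalive(int s)
    {
            int on = 1, idle = 10, intvl = 5, cnt = 3;

            if (setsockopt(s, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == -1)
                    return (-1);
            if (setsockopt(s, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) == -1)
                    return (-1);
            if (setsockopt(s, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) == -1)
                    return (-1);
            return (setsockopt(s, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt)));
    }

Even then, keepalives only catch a dead peer on an otherwise idle connection;
a DS that is reachable but wedged still needs an RPC-level timeout.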

>For example: in NFS you have the 'soft' option. It is recommended not to
>use it. I can imagine that you want the default if your home-dir or /usr
>is mounted over NFS, but at work I want my http-servers to not hang and
>just give an I/O error when the backend fileserver with the data is gone.
>Something similar happens here.
Yes. However, the analogy only goes so far, in that a failure of a "soft" mount
affects the integrity of the file if it is a write that fails.
In this case, there shouldn't be data corruption/loss; however, there may be
degraded performance during the mirror failure and subsequent resilvering.
(A closer analogy might be a drive failure in a mirrored configuration with
 another drive. These days drive hardware does try to indicate "hardware
 health", which the mirrored DS may not provide, at least in the early version.)

> Doesn't the protocol definition say something about this?
Nope, except for some "on the wire" information that the pNFS client can provide
to indicate to the MDS that it is having problems with a DS.
(The RFCs deal with what goes on the wire and not how servers get implemented.)
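
For anyone curious, the main on-the-wire hook is the NFSv4.2 LAYOUTERROR
operation (RFC 7862), where a client can tell the MDS which device (DS) an
I/O failed against. Roughly, the C equivalent of the per-device XDR report
is (field names from memory, so check the RFC):

    #include <stdint.h>

    #define NFS4_DEVICEID4_SIZE     16

    /* One per-device error report carried by LAYOUTERROR. */
    struct device_error4 {
            char            de_deviceid[NFS4_DEVICEID4_SIZE]; /* which DS */
            uint32_t        de_status;      /* NFS status the client saw */
            uint32_t        de_opnum;       /* op (READ/WRITE/...) that failed */
    };

Reports like this from several clients would be corroborating evidence, but a
single one probably shouldn't be enough to fail a mirror.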

> Or what do other implementations do?
I have no idea. At this point, all extant pNFS server implementations are proprietary
blobs, such as a Netapp clustered configuration. I've only seen "high level" white
papers (one notch away from marketing).

To be honest, I think the answer for version 1 will come down to...

How long should the MDS try to communicate with the DS before it gives up and
considers it failed?

It will probably be settable via a sysctl, but does need a reasonable default
value. (A "very large" value would indicate "leave it for the sysadmin to
 decide and do manually".)
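
As a sketch of what I mean (the sysctl name and default are made up):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>
    #include <sys/time.h>

    SYSCTL_DECL(_vfs_nfsd);

    /*
     * Hypothetical tunable: how many seconds the MDS keeps retrying a DS
     * before declaring it failed.  0 = never fail it automatically, i.e.
     * leave the decision to the sysadmin.
     */
    static int pnfs_dsfailtimeout = 600;
    SYSCTL_INT(_vfs_nfsd, OID_AUTO, dsfailtimeout, CTLFLAG_RW,
        &pnfs_dsfailtimeout, 0,
        "Seconds of failed DS RPCs before the MDS marks a mirrored DS failed");

    /* first_err is the time_uptime when RPCs to the DS started failing. */
    static bool
    pnfs_ds_failed(time_t first_err)
    {
            if (pnfs_dsfailtimeout == 0)
                    return (false);         /* manual administration only */
            return (time_uptime - first_err >= pnfs_dsfailtimeout);
    }

A default of ten minutes is a wild guess; whatever it is, it should be long
enough that a quick reboot of a DS doesn't trigger a resilver.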

I also think there might be certain error returns from sosend()/soreceive()
that want special handling.
A simple example I experienced in recent testing was...
- One system was misconfigured with the same IP# as one of the DS systems.
   After fixing the misconfiguration, the pNFS server was wedged because it
   had a bogus arp entry, so it couldn't talk to the one mirror.
--> This was easily handled by an "arp -d" done on the MDS, but if the MDS
      had given up on the DS before I did that, it would have been a lot more
      work to fix. (The bogus arp entry had a very long timeout on it.)
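
To make the sosend()/soreceive() point concrete, the kind of split I'm
imagining looks like this (hypothetical function, and the errno list is
certainly debatable):

    #include <sys/param.h>
    #include <sys/errno.h>

    /*
     * Should a socket error from a DS RPC count toward "the DS has failed",
     * or just trigger a reconnect and retry?  EHOSTUNREACH/EHOSTDOWN can be
     * arp/routing trouble (like the bogus arp entry above), so arguably they
     * shouldn't condemn the DS as quickly as a reset or timeout on an
     * established connection.
     */
    static bool
    pnfs_ds_err_counts(int error)
    {
            switch (error) {
            case EPIPE:
            case ECONNRESET:
            case ETIMEDOUT:
                    return (true);          /* peer was up and went away */
            case EHOSTDOWN:
            case EHOSTUNREACH:
            case ENETDOWN:
            case ENETUNREACH:
                    return (false);         /* maybe local/net misconfiguration */
            default:
                    return (false);
            }
    }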

Anyhow, thanks for the comments, and we'll see if others have thoughts, rick

