Re: RFC: mount_nfs failure due to dns not running yet

From: Toomas Soome <tsoome_at_me.com>
Date: Fri, 21 Feb 2025 08:12:54 UTC

> On 21. Feb 2025, at 04:39, Rick Macklem <rick.macklem@gmail.com> wrote:
> 
> On Thu, Feb 20, 2025 at 4:28 PM Steve Rikli <sr@genyosha.net> wrote:
>> 
>> On Wed, Feb 19, 2025 at 02:40:15PM -0800, Rick Macklem wrote:
>>> 
>>> The subject line basically describes the problem glebius@
>>> ran into.  When doing an NFS mount in /etc/fstab, it failed
>>> since the DNS service was not yet working and, as such,
>>> the DNS lookup of the server fqdn failed, causing the mount
>>> to fail. Note that this behaviour has existed for decades.
>>> 
>>> He feels this is a bug and that mount_nfs(8) should retry
>>> getaddrinfo(3) calls until success, instead of failing the
>>> mount when the first attempt fails.
>>> The problem with just retrying getaddrinfo(3) is that it
>>> could retry forever for simple failures like a typo in the
>>> server fqdn.
>>> I can see several ways this can be handled and would
>>> like feedback from others w.r.t. these alternatives.
>>> 
>>> 1) Simply document this case and encourage use of
>>>    host names in /etc/hosts for NFS servers along with
>>>    specifying use of file before dns in nsswitch.conf.
>>>     Doing this results in the mounts working whether or
>>>      not DNS is working.
>>> 
>>> 2) Call it a bug and patch mount_nfs(8) to retry getaddrinfo(3)
>>>     until it succeeds. (I feel this would be a POLA violation,
>>>     given that the current behaviour has existed for decades
>>>     and for simple cases where the fqdn will never resolve
>>>     the behaviour would be to hang at the mount attempt
>>>     during boot unless "bg" is specified for the /etc/fstab entry.)
>>> 
>>> 3) Add a new NFS mount option "retrydns=<N>", which would enable
>>>    retries of getaddrinfo(3). This would avoid any POLA violation and
>>>    would allow for a convenient way to document the behaviour in
>>>    "man mount_nfs".
>>> 
>>> 4) ???
>>> 
>>> So, what do you think is the preferred change?
>> 
>> I don't think I would change mount_nfs code behavior for this.
>> 
>> That is, requiring services and daemons etc. to workaround missing,
>> misconfigured, slow, or misbehaving nameservice (whether it's DNS,
>> /etc/hosts, NIS, whatever) seems like more complexity, possibly not
>> effective, and maybe not focused on the right thing.
>> 
>> Now, without meaning to be presumptuous, it may be worth re-examining
>> the startup sequence, e.g. to make sure NFS mounts are tried after the
>> known dependencies can reasonably be expected to have started, including
>> the network, plus local_unbound or bind (if used), possibly others.
>> 
>> After a quick look, I don't see an obvious problem with the sequence,
>> but more knowledgeable eyes than mine are welcome.  I don't quite follow
>> some of the output from rcorder and service -r.
>> 
>>> ps: I looked and the return value from getaddrinfo(3) does not
>>>      appear to be useful to discern the case of "DNS service not
>>>      running yet". (I think it replies EAI_FAIL for this case.)
>> 
>> In that area, I'll note FreeBSD rc.d has a "NETWORKING" dependency for
>> PROVIDE and REQUIRE, and it's included in scripts like nfsclient,
>> mountcritremote et al. However there seems to be no similar dependency
>> for something like "NAMESERVICE" (generic, as opposed to "named"
>> specifically), and I'm not sure how that might be implemented, even
>> assuming it could be useful in a situation like this.
>> 
>> I.e. there are many things to potentially check for "can the system
>> resolve hostnames yet", and not all of them involve running a local
>> instance of named, unbound, etc.
>> 
>> In general, if I were running into problems with nameservice not being
>> available by the time NFS mounts happen, I think I'd start by looking
>> into possible nameservice issues, then check out some mechanisms other
>> folks have mentioned (fstab IP addresses or late option, rc.conf
>> netwait_enable, etc.) rather than coding workarounds into NFS itself.
> Well, the patch I have created (it took about 15min) only changes behaviour
> if a new "retrydns" option is used. As such, I think it might be useful for some,
> but doesn't change things unless someone uses it.
> 
> I agree with you that I don't think the rc scripts have a way to check REQUIRE
> dns working. (I, personally, always put the fqdn for NFS servers in /etc/hosts
> and make sure "files" is first in nsswitch.conf, but others argue that is not
> feasible for some deployments.) (Using IP numbers works for AUTH_SYS,
> but not Kerberized mounts.)
> 
> Note that there is already "retrycnt", which specifies retry the mount,
> but that retry loop doesn't include getaddrinfo(3) calls.
> --> Personally, I do not like always doing retries since I often
>     type mount commands manually and I'm a terrible typist, so I
>     often mistype the server's name.
> 
> This reply was mostly a followup on all the good comments and
> not just yours.
> 
> Thanks everyone, for your comments, rick
> 

My 2 cents:

There is a difference between the name service not responding and a name not resolving. In the first case, it falls under:

             bg      If an initial attempt to contact the server fails, fork
                     off a child to keep trying the mount in the background.
                     Useful for fstab(5), where the file system mount is not
                     critical to multiuser operation.

             bgnow   Like bg, fork off a child to keep trying the mount in the
                     background, but do not attempt to mount in the foreground
                     first.  This eliminates a 60+ second timeout when the
                     server is not responding.  Useful for speeding up the
                     boot process of a client when the server is likely to be
                     unavailable.  This is often the case for interdependent
                     servers such as cross-mounted servers (each of two
                     servers is an NFS client of the other) and for cluster
                     nodes that must boot before the file servers.

In the second case, it's a failure you cannot recover from.
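
To illustrate the distinction, here is a minimal sketch in Python (not mount_nfs code, and not Rick's patch). It assumes the resolver reports "name service not responding" as EAI_AGAIN; as Rick noted, it may report EAI_FAIL for that case, and this sketch would then not retry. The names resolve_with_retry, TRANSIENT, retries and delay are made up for the example.

```python
import socket
import time

# Error codes treated as "name service not responding", i.e. worth retrying.
# Assumption: the resolver uses EAI_AGAIN for this case. Anything else,
# including EAI_NONAME (e.g. a typo in the fqdn), is treated as permanent.
TRANSIENT = {socket.EAI_AGAIN}


def resolve_with_retry(host, port, retries=3, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying at most `retries` times on transient errors.

    `resolver` and `sleep` are injectable so the logic can be exercised
    without a real name service.
    """
    attempt = 0
    while True:
        try:
            return resolver(host, port)
        except socket.gaierror as err:
            # Permanent failure, or retry budget used up: give up.
            if err.errno not in TRANSIENT or attempt >= retries:
                raise
            attempt += 1
            sleep(delay)
```

The retry count is bounded, so even if a mistyped server name were reported as a transient error, the lookup would not loop forever.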

rgds,
toomas