Re: resolv.conf question

From: Bob Proulx <bob_at_proulx.com>
Date: Thu, 13 Oct 2022 01:56:29 UTC
Doug Denault wrote:
> > Doug Denault wrote:
> >       So I tried to RTFM, /usr/src/contrib/ldns/resolver.c in this case. It is
> >       almost certain that the system was up but bind did not respond. The source
> >       is a bit above my pay grade but it did seem possible that if that was the
> >       case, the second server was never tried. This is what actually happened.
> >
> >       There were no other issues as each of the jails started fine with a manual
> >       boot. Does anyone know if the timeout and/or retry setting offer a way
> >       around this.
>
> For performance reasons, especially if the first listed server is always
> used, I want that in our data center. Aside from speed, no hacking is
> possible. My purpose here is to figure how resolv.conf works. If more than
> one entry is effectively useless, I would be tempted to use 8.8.8.8. Also
> the jail mother had not been booted in several months and only now because I
> f-ed up changing the root password.

I still have a physical copy of DNS and BIND by Paul Albitz & Cricket
Liu, published by O'Reilly in 1992.  I have no idea whether the
behavior described there still matches how resolution works now, but I
think it is likely still at least similar.

The book says that the timeouts depend upon the number of nameserver
directives in the resolv.conf file.  I reproduce its table here.

      | Name Servers Configured
------+----------------------------
Retry |    1    |    2    |    3
------+---------+---------+--------
   0  |    5s   |  (2x)5s | (3x)5s
   1  |   10s   |  (2x)5s | (3x)3s
   2  |   20s   | (2x)10s | (3x)6s
   3  |   40s   | (2x)20s | (3x)13s
------+---------+---------+--------
Total |   75s   |     80s |     81s

If there is no nameserver configured then the default is to query the
nameserver on the local system.  Having none configured is the same as
configuring a single nameserver on the local host.

If there is one nameserver configured then it will query that
nameserver with a timeout of 5 seconds.  That timeout is how long the
resolver waits before sending another query, that is, a retry.  If the
resolver encounters an error that indicates the nameserver is really
down or unreachable, or the query times out, it will double the timeout
and query the nameserver again.

If there is more than one nameserver configured then the libc resolver
queries the first one in the list with a timeout of 5 seconds.  If
that query times out or receives an error then it falls back to the
next nameserver in the list with the same 5 second timeout.  If the
resolver reaches the end of the list and all of them (up to three)
timed out or received an error then it will update the timeouts and
cycle through the list again.

The next retry through the list will have timeouts set according to a
calculation of 10 seconds divided by the number of nameservers
configured rounded down.  One nameserver is 10 seconds.  Two
nameservers is 5 seconds.  Three nameservers is 3 seconds.

If that round of queries through each of the nameservers again
receives errors or timeouts then the timeout values are doubled and
the queries retry again.

There are four possible rounds of queries.  The first initial round
with the 5s timeouts.  The second round with the calculated timeouts.
The 3rd and 4th rounds with the calculated timeouts doubled each
round.

That accounts for why the total time a DNS lookup using the libc
resolver takes will be 75s, 80s, or 81s depending upon the number of
nameserver directives configured, in the case that all of them either
return errors or are unreachable.
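
The arithmetic behind the table can be checked with a short Python
sketch.  This is only my reading of the description above, not the
resolver's actual source.  The function names are made up for
illustration.  One assumption: the 13s entry for three nameservers
suggests the doubling is applied to the base timeout before dividing
(40 / 3 rounded down is 13) rather than doubling the previous round's
6s, so the sketch computes it that way.

```python
# Sketch of the 1992 timeout schedule as described in the text.
# Assumption: after the first round the per-server timeout is
# (5 << retry) // nservers, which reproduces every cell of the table.

BASE_TIMEOUT = 5  # seconds, first-round timeout per the text
ROUNDS = 4        # the initial round plus three retries


def round_timeout(retry, nservers):
    """Per-server timeout in seconds for a retry round (0-based)."""
    if retry == 0:
        return BASE_TIMEOUT
    return max(1, (BASE_TIMEOUT << retry) // nservers)


def worst_case_total(nservers):
    """Total seconds when every server fails in every round."""
    return sum(nservers * round_timeout(r, nservers) for r in range(ROUNDS))


for n in (1, 2, 3):
    schedule = [round_timeout(r, n) for r in range(ROUNDS)]
    print(n, schedule, worst_case_total(n))
```

For one, two, and three nameservers this gives totals of 75, 80, and
81 seconds, matching the bottom row of the table.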

Again let me repeat that this was as described in 1992 and I have no
idea if the current implementation is still the same.  But at least it
lays the foundation for the way things used to work.

To get some recent data I tried it on my NetBSD 9.0 system here.  (I
know I am behind and need to upgrade it to the current 9.3.)  I tried
the four combinations with unreachable (non-existent) nameservers.
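
The timings that follow come from the shell's time builtin.  For anyone
who wants to repeat the runs from a script, here is a minimal Python
sketch.  The helper name time_command is hypothetical, and pointing it
at the host utility assumes host is installed and on PATH.

```python
import subprocess
import sys
import time


def time_command(argv):
    """Run argv, return (elapsed seconds, exit status, stdout text)."""
    start = time.monotonic()
    proc = subprocess.run(argv, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return elapsed, proc.returncode, proc.stdout


# Self-check with a command that certainly exists: the Python
# interpreter itself.  Swap in ["host", "example.com"] for a real run.
elapsed, status, output = time_command([sys.executable, "-c", "print('ok')"])
print(f"{elapsed:.2f}s real, exit status {status}: {output.strip()}")
```

Editing resolv.conf between calls and running
time_command(["host", "example.com"]) would give numbers comparable to
the real column below.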

No nameservers configured.  No local host nameserver running.

    netbsd# time host example.com
    ;; connection timed out; no servers could be reached
       12.17s real     0.02s user     0.02s system

One unreachable nameserver configured.

    netbsd# time host example.com
    ;; connection timed out; no servers could be reached
       10.05s real     0.02s user     0.00s system

Two unreachable nameservers configured.

    netbsd# time host example.com
    ;; connection timed out; no servers could be reached
       12.07s real     0.01s user     0.02s system

Three unreachable nameservers configured.

    netbsd# time host example.com
    ;; connection timed out; no servers could be reached
       14.10s real     0.03s user     0.01s system

Then I configured two nameservers where the first one was unreachable
but the second one was local, available, and online.

    netbsd# time host example.com
    example.com has address 93.184.216.34
    example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
    example.com mail is handled by 0 .
        3.41s real     0.02s user     0.01s system

Then again with three nameservers but with the first two being
unreachable and again the third one, the last one, being available.

    netbsd# time host example.com
    example.com has address 93.184.216.34
    example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
    example.com mail is handled by 0 .
        6.09s real     0.01s user     0.02s system

Therefore it looks like the algorithm implemented now is similar to,
but somewhat different from, what was historically described.

================================================================

Let's see the same experiment again with FreeBSD 12.3.

No nameservers configured.  No local host nameserver running.

    [root@freebsd ~]# time host example.com
    ;; connection timed out; no servers could be reached

    real    0m20.219s
    user    0m0.002s
    sys     0m0.003s

One unreachable nameserver configured.

    [root@freebsd ~]# time host example.com
    ;; connection timed out; no servers could be reached

    real    0m10.111s
    user    0m0.000s
    sys     0m0.006s

Two unreachable nameservers configured.

    [root@freebsd ~]# time host example.com
    ;; connection timed out; no servers could be reached

    real    0m20.226s
    user    0m0.005s
    sys     0m0.000s

Three unreachable nameservers configured.

    [root@freebsd ~]# time host example.com
    ;; connection timed out; no servers could be reached

    real    0m30.409s
    user    0m0.000s
    sys     0m0.007s

Then I configured two nameservers where the first one was unreachable
but the second one was local, available, and online.

    [root@freebsd ~]# time host example.com
    example.com has address 93.184.216.34
    example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
    example.com mail is handled by 0 .

    real    0m10.091s
    user    0m0.000s
    sys     0m0.007s

Then again with three nameservers but with the first two being
unreachable and again the third one, the last one, being available.

    [root@freebsd ~]# time host example.com
    example.com has address 93.184.216.34
    example.com has IPv6 address 2606:2800:220:1:248:1893:25c8:1946
    example.com mail is handled by 0 .

    real    0m20.309s
    user    0m0.002s
    sys     0m0.004s

================================================================

Let's see the same experiment again with Debian Unstable with glibc
version 2.35.

No nameservers configured.  No local host nameserver running.

    root@glibc:~# time host example.com
    ;; communications error to ::1#53: connection refused
    ;; communications error to ::1#53: connection refused
    ;; communications error to 127.0.0.1#53: connection refused
    ;; no servers could be reached
    real    0m0.031s
    user    0m0.015s
    sys     0m0.005s

Interesting that it complains about both IPv6 failure and IPv4 failure
whereas traditionally it is silent.  ("::1" being IPv6 localhost, and
127.0.0.1 being IPv4 localhost.)

One unreachable IPv4 local host nameserver configured.

    root@glibc:~# time host example.com
    ;; communications error to 127.0.0.1#53: connection refused
    ;; communications error to 127.0.0.1#53: connection refused
    ;; no servers could be reached
    real    0m0.034s
    user    0m0.019s
    sys     0m0.000s

One unreachable IPv4 nameserver configured.  This doesn't show timestamps
but each line was output at 5s intervals.

    root@glibc:~# time host example.com
    ;; communications error to 192.168.1.151#53: timed out
    ;; communications error to 192.168.1.151#53: timed out
    ;; no servers could be reached
    real    0m10.045s
    user    0m0.016s
    sys     0m0.008s

Two unreachable nameservers configured.  This doesn't show timestamps
but each line was output at 5s intervals.

    root@glibc:~# time host example.com
    ;; communications error to 192.168.1.151#53: timed out
    ;; communications error to 192.168.1.151#53: timed out
    ;; communications error to 192.168.1.152#53: timed out
    ;; no servers could be reached
    real    0m15.049s
    user    0m0.014s
    sys     0m0.009s

Three unreachable nameservers configured.

    root@glibc:~# time host example.com
    ;; communications error to 192.168.1.151#53: timed out
    ;; communications error to 192.168.1.151#53: timed out
    ;; communications error to 192.168.1.152#53: timed out
    ;; communications error to 192.168.1.153#53: timed out
    ;; no servers could be reached
    real    0m20.052s
    user    0m0.012s
    sys     0m0.008s
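
To compare the three systems side by side, this small Python sketch
takes the real times from the all-unreachable runs above and works out
how many seconds each additional unreachable nameserver added.  The
numbers are copied from the transcripts; the dictionary layout and the
function name are only for illustration.

```python
# Wall-clock totals in seconds from the runs above, keyed by the number
# of unreachable nameservers configured.
measured = {
    "NetBSD 9.0": {1: 10.05, 2: 12.07, 3: 14.10},
    "FreeBSD 12.3": {1: 10.111, 2: 20.226, 3: 30.409},
    "Debian glibc 2.35": {1: 10.045, 2: 15.049, 3: 20.052},
}


def added_per_server(totals):
    """Seconds added by each extra nameserver, rounded to whole seconds."""
    counts = sorted(totals)
    return [round(totals[b] - totals[a]) for a, b in zip(counts, counts[1:])]


for system, totals in measured.items():
    print(system, added_per_server(totals))
```

In these runs each extra unreachable nameserver cost roughly 2 seconds
on NetBSD, 10 seconds on FreeBSD, and 5 seconds on Debian.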

================================================================

I am not sure if this in any way answers your questions.  But
hopefully it provides some interesting information about the behavior
of the resolver on these various systems.

Personally I almost always configure a local caching nameserver on the
local host for my server systems.  For me that is almost always the
right answer for Internet connected servers.

However for DHCP mobile clients I mostly don't, and use the
DHCP-provided nameservers instead.  That's the best answer to allow
spoofing for captive portal open WiFi Access Points such as at
name-brand coffee shops and airports.

One more "however" here as not validating DNSSEC also allows spoofing.
Therefore I turn my mobile laptop's local DNSSEC validating nameserver
on and off manually.  I need it on for security.  I need it off for
clicking through the EULA on a captive portal.  Captive portals are
rather a mess.

    https://en.wikipedia.org/wiki/Captive_portal

Bob