[Bug 204048] stable/9: r289998: ntpd 4.2.8p4 DNS resolution misbehaves (occasional segfault)

Mon Oct 26 23:06:44 UTC 2015

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=204048

            Bug ID: 204048
           Summary: stable/9: r289998: ntpd 4.2.8p4 DNS resolution
                    misbehaves (occasional segfault)
           Product: Base System
           Version: 9.3-RELEASE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: bin
          Assignee: freebsd-bugs at FreeBSD.org
          Reporter: jdc at koitsu.org

The recent upgrade of ntpd to 4.2.8p4 in stable/9 results in a daemon which
behaves very oddly.  Said upgrade:
http://www.freshbsd.org/commit/freebsd/r289998

My log after several manual troubleshooting attempts -- note the intermixed
segfaults:

Oct 26 15:38:05 icarus ntpd[1092]: giving up resolving host clock.isc.org:
servname not supported for ai_socktype (9)
Oct 26 15:38:23 icarus ntpd[1116]: giving up resolving host clock.isc.org:
servname not supported for ai_socktype (9)
pid 1139 (ntpd), uid 0: exited on signal 11 (core dumped)
Oct 26 15:39:07 icarus ntpd[1176]: giving up resolving host clock.isc.org:
servname not supported for ai_socktype (9)
Oct 26 15:39:59 icarus ntpd[1209]: giving up resolving host ntp-1.cso.uiuc.edu:
servname not supported for ai_socktype (9)
Oct 26 15:40:24 icarus ntpd[1268]: giving up resolving host clock.isc.org:
servname not supported for ai_socktype (9)
pid 1294 (ntpd), uid 0: exited on signal 11 (core dumped)
pid 1312 (ntpd), uid 0: exited on signal 11 (core dumped)
Oct 26 15:44:09 icarus ntpd[1409]: giving up resolving host clock.isc.org:
servname not supported for ai_socktype (9)
Oct 26 15:45:26 icarus ntpd[1490]: giving up resolving host
0.freebsd.pool.ntp.org: servname not supported for ai_socktype (9)
Oct 26 15:50:18 icarus ntpd[1656]: giving up resolving host tick.jrc.us:
servname not supported for ai_socktype (9)
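
For triage, the two event types in the log above (resolver give-ups and
signal-11 exits) can be tallied with a short script.  This is only a minimal
sketch: the function name summarize and both regular expressions are
illustrative assumptions, not part of ntpd, and it expects unwrapped lines as
they would appear in the log file rather than the mail-wrapped form shown here.

```python
import re
from collections import Counter

# Patterns assume the syslog and kernel-console formats quoted in this report.
RESOLVE_RE = re.compile(r"ntpd\[(\d+)\]: giving up resolving host (\S+?):")
CRASH_RE = re.compile(r"^pid (\d+) \(ntpd\), uid \d+: exited on signal (\d+)")


def summarize(lines):
    """Return (resolver failures per host, list of (pid, signal) crashes)."""
    failures = Counter()
    crashes = []
    for line in lines:
        m = RESOLVE_RE.search(line)
        if m:
            failures[m.group(2)] += 1
            continue
        m = CRASH_RE.search(line)
        if m:
            crashes.append((int(m.group(1)), int(m.group(2))))
    return failures, crashes


# Three entries copied from the log above, with the mail line-wrapping undone.
SAMPLE = [
    "Oct 26 15:38:05 icarus ntpd[1092]: giving up resolving host "
    "clock.isc.org: servname not supported for ai_socktype (9)",
    "pid 1139 (ntpd), uid 0: exited on signal 11 (core dumped)",
    "Oct 26 15:39:59 icarus ntpd[1209]: giving up resolving host "
    "ntp-1.cso.uiuc.edu: servname not supported for ai_socktype (9)",
]
```

Calling summarize(SAMPLE) returns a per-host failure count and the list of
crashed PIDs, which makes it easier to see whether crashes follow a particular
host lookup.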

Segfaults are always here:

root@icarus:~ # gdb /usr/sbin/ntpd /ntpd.core
...
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
[New Thread 801807c00 (LWP 100797/ntpd)]
[New Thread 801807400 (LWP 100791/ntpd)]
(gdb) bt
#0  0x000000080114d79d in _malloc_postfork () from /lib/libc.so.7
#1  0x000000080114fb3e in _malloc_postfork () from /lib/libc.so.7
#2  0x00000008011523fe in _malloc_prefork () from /lib/libc.so.7
#3  0x0000000801154482 in calloc () from /lib/libc.so.7
#4  0x000000080117aba6 in __res_state () from /lib/libc.so.7
#5  0x000000080118698c in freeaddrinfo () from /lib/libc.so.7
#6  0x00000008011ab61a in nsdispatch () from /lib/libc.so.7
#7  0x0000000801187ffb in getaddrinfo () from /lib/libc.so.7
#8  0x0000000000474f04 in blocking_getaddrinfo ()
#9  0x0000000000473a43 in blocking_child_common ()
#10 0x00000000004737e9 in blocking_thread ()
#11 0x0000000800afee70 in pthread_getprio () from /lib/libthr.so.3
#12 0x0000000000000000 in ?? ()
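
The backtrace places the crash inside libc's getaddrinfo(), reached from
ntpd's worker thread (blocking_thread -> blocking_child_common ->
blocking_getaddrinfo).  The "(9)" in the log messages appears to correspond to
EAI_SERVICE in FreeBSD's netdb.h.  A minimal sketch for checking whether a
threaded getaddrinfo() call with a service name misbehaves outside ntpd
follows.  The helper name resolve_in_thread is hypothetical, and passing the
"ntp" service with SOCK_DGRAM is an assumption about what ntpd requests, not
something confirmed by this report.

```python
import socket
import threading


def resolve_in_thread(host, service="ntp", family=socket.AF_INET):
    """Call getaddrinfo() from a worker thread.

    Returns (sorted list of addresses, error), where error is None on success
    or an (errno, message) tuple taken from socket.gaierror.
    """
    result = {"addrs": [], "error": None}

    def worker():
        try:
            infos = socket.getaddrinfo(host, service, family,
                                       socket.SOCK_DGRAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the address string.
            result["addrs"] = sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            result["error"] = (exc.errno, exc.strerror)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return result["addrs"], result["error"]
```

On an affected machine, resolve_in_thread("clock.isc.org") returning an error
tuple whose code equals socket.EAI_SERVICE would suggest the problem lies in
the service lookup rather than in ntpd itself; a clean result would point back
at ntpd or at how it is built.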

Important:

The behaviour seen is very strange.  Basically, the daemon starts, emits one of
the aforementioned DNS errors, then proceeds to either a) exit, b) crash, or c)
continue running.

Sometimes when the daemon exits (and possibly when it crashes, too), it
restarts itself.  A couple of times "ps -auxwww | grep ntp" returned nothing,
yet a few seconds later the daemon was found running.

Things I've tried which made no difference:

1. Removing -4 from $ntpd_flags (I set this because while my system has IPv6, I
prefer using IPv4 everywhere)
2. Using /etc/ntp.conf (r289998) instead of my own ntp.conf

There is no workaround for this other than to roll back to something prior to
r289998.

Googling turns up several reports of this problem, but all relate to people
trying to use chroot'ing with ntpd (I DO NOT use this feature).

https://mail-index.netbsd.org/current-users/2014/01/26/msg024169.html
https://mail-index.netbsd.org/current-users/2014/06/01/msg024998.html

One report says that use of -O1 (on ARM) relieves the problem, but crashing is
seen on VAX and other platforms.  (My system uses gcc, not clang, just for the
record)

Footnote: upgrading to stable/10 is not an option until the load average bug
there is rectified (I am not the only one to report this problem).  I can try
testing this ntpd on a VM running stable/10 to see whether the problem is
reproducible there.

My ntp.conf (w/ comments removed):

server clock.isc.org          iburst
server ntp-1.cso.uiuc.edu     iburst
server clock.psu.edu          iburst
server tick.jrc.us            iburst
server 0.us.pool.ntp.org      iburst

restrict default limited kod nomodify notrap nopeer noquery
restrict 127.0.0.1
restrict 192.168.1.0 mask 255.255.255.0

My rc.conf ntp-related flags:

# ntpd_flags: temporary workaround for
# https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199127
ntpd_enable="yes"
ntpd_config="/conf/ME/ntp.conf"
ntpd_sync_on_start="yes"
ntpd_flags="-4 ${ntpd_flags}"
