RELENG_7: something is very wrong with UDP?
Oleg V. Nauman
oleg at opentransfer.com
Sat Sep 20 09:58:05 UTC 2008
Quoting Robert Watson <rwatson at FreeBSD.org>:
>
> On Fri, 19 Sep 2008, Oleg V. Nauman wrote:
>
>>> (1) Start by deleting all but one nameserver entry in /etc/resolv.conf.
>>> Confirm that you can still reproduce the problem.
>>
>> For various reasons my laptop runs a local caching DNS server
>> (named) without any forwarders configured. My /etc/resolv.conf
>> contains nameserver 127.0.0.1
>
> This is simplifying in some senses, but complicating in others. In
> particular, the question it raises is whether the problem is in the DNS
> resolver or the nameserver. Seeing a tcpdump of lo0 for DNS traffic
> would be quite interesting, since we could look at timestamps and try
> to place the blame a bit more precisely.
>
>>> Could you
>>> also use procstat -k on the dig process to generate a kernel stack trace
>>> for it?
>
> Let's add to this list: when the problem happens, could you also
> procstat -k the name server process(es)?
>
>> And procstat -kk output for logger process waiting:
>>
>> PID TID COMM TDNAME KSTACK
>> 1421 100095 logger - mi_switch+0x2c8
>> sleepq_switch+0xd9 sleepq_catch_signals+0x239 sleepq_wait_sig+0x14
>> _sleep+0x35f pipe_read+0x389 dofileread+0x96 kern_readv+0x58
>> read+0x4f syscall+0x2b3 Xint0x80_syscall+0x20
>
> Interesting -- logger is blocked on reading from a pipe, likely
> standard input. So it sounds like something else is failing to
> complete in a timely manner -- perhaps due to DNS.
Nothing strange here: that was the kernel stack of a logger process
waiting on background fsck output (bgfsck never started, though).
>
>>> This is approximately the date of my last UDP MFC. Could you try
>>> backing out just src/sys/netinet6/udp6_usrreq.c revision 1.81.2.7
>>> and see if that helps? (specifically, restore the use of
>>> sosend_generic instead of sosend_dgram)
>
> If you can show that it's definitely a problem with the change to
> sosend_dgram for UDPv6 socket send, then it might suggest it's the same
> problem that it is related to the UDPv46 code there. In which case I
> will propose we back out that portion of the change in the 7-stable
> branch until it's known to be resolved -- I don't want other people
> tripping over this.
Sorry for the false alarm regarding UDP. I noticed that my clock has
stopped incrementing (which also explains the zeroes in the traceroute
output). That gave me an idea of what this issue is related to, so I
backed out revision 1.243.2.4 of src/sys/dev/acpica/acpi.c, and that
fixes my problems. It looks like that revision stops the timecounters
from incrementing on my laptop.
Ironically, I was the one who initiated this ACPI behavior change (I
reported "ACPI HPET stops working on my RELENG_7" to
stable at freebsd.org on July 19), so jhb@ implemented a patch, and it
worked for me at the time. Something has changed over the following 2
months, so the patch now causes problems on my hardware instead of
fixing them. I will experiment a bit with kern.timecounter.choice on
Monday and report back to jhb@.
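For anyone who wants to check for the same symptom, here is a rough
sketch in Python of testing whether a clock is advancing at all. It is
only an illustration, not what I actually ran: the function name
clock_advances and its parameters are made up for this example, and it
assumes the sleep call still returns while the clock is stuck.

```python
import time


def clock_advances(read_clock=time.time, pause=time.sleep,
                   interval=0.2, samples=5):
    """Return True if read_clock() moves forward across any sampled interval.

    read_clock and pause are injectable (hypothetical parameters for this
    sketch) so the check can be exercised without a broken clock.
    """
    previous = read_clock()
    for _ in range(samples):
        pause(interval)
        current = read_clock()
        if current > previous:
            return True
        previous = current
    return False


if __name__ == "__main__":
    # Check both the wall clock and the monotonic clock.
    for name, clock in (("wall clock", time.time),
                        ("monotonic", time.monotonic)):
        state = "advancing" if clock_advances(clock) else "NOT advancing"
        print(f"{name}: {state}")
```

A clock that never moves between samples would also make time
differences (such as traceroute round-trip times) come out as zero.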
>
>>> Could you try compiling your kernel with WITNESS to see if we get
>>> any extended debugging information?
>>
>> I have added the WITNESS option (and STACK, required by procstat),
>> but it is not producing any output (so no LORs or anything like that)
>
> OK. Could you try adding INVARIANT_SUPPORT and INVARIANTS if they
> aren't there? Be aware: this may convert the wedging you are
> experiencing into a kernel panic.
No output is produced with INVARIANT_SUPPORT and INVARIANTS included
in the kernel, and no kernel panic either :) Thank you for the
excellent work.
>
>>>> Is anybody else experiencing the same issues with a fresh RELENG_7?
>>>> I am not sure whether it is a local issue, though
>>>
>>> I'm not experiencing them, but these sorts of things can be quite
>>> subtle and workload-dependent.
>>
>> Well, I am experiencing this issue even during system boot.
>
> OK. So there must be something a bit different about your setup --
> perhaps there's something specific about the way things are interacting
> over the loopback address for the name server. Is this the stock
> system BIND9 or something else? Are you able to temporarily switch to
I am running the stock system BIND.
> an external name server and see if that changes things?
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge