Re: thread on sleepqueue does not wake up after timeout

Reply: Andriy Gapon : "Re: thread on sleepqueue does not wake up after timeout"
In reply to: Andriy Gapon : "Re: thread on sleepqueue does not wake up after timeout"
Go to: [ bottom of page ] [ top of archives ] [ this month ]
From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Tue, 09 Nov 2021 11:52:18 UTC
On Tue, Nov 09, 2021 at 11:58:30AM +0200, Andriy Gapon wrote:
> 
> Two years have passed since my original report, some things have changed,
> but we are still seeing the problem.
> 
> To recap, the problem is that sometimes a callout for a sleepq timeout would
> fire to early.  The code in sleepq_timeout would ignore such a wake-up.
> And, as there would not be another callout, there will never be a wake-up.
> 
> This has been observed on VMWare only.  We have seen it on different
> versions of ESXi (different environments, upgrades, etc) starting with 5.0,
> through 5.5 and 6.0, and now on 6.5.
> 
> We use 12.2 now and are still seeing the problem occasionally.
> 
> On advice from Kostik I added some diagnostic code.
> The code checks for the too early condition and panics when it happens.
> It also provides some additional information.
> 
> Here is an example of such a panic.
> Unread portion of the kernel message buffer:
> panic: too early sleepq timeout, 190543869738008 > 190543869603412 (190543869798505)
> cpuid = 1
> time = 1636195325
> 
> (kgdb) bt
> #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
> #1  doadump (textdump=153080096) at /usr/src/sys/kern/kern_shutdown.c:447
> #2  0xffffffff8038f1cc in db_fncall_generic (addr=<optimized out>, nargs=0,
> args=<optimized out>, rv=<optimized out>) at
> /usr/src/sys/ddb/db_command.c:622
> #3  db_fncall (dummy1=<optimized out>, dummy2=<optimized out>,
> dummy3=<optimized out>, dummy4=<optimized out>) at
> /usr/src/sys/ddb/db_command.c:670
> #4  0xffffffff8038eaf6 in db_command (last_cmdp=<optimized out>,
> cmd_table=<optimized out>, dopager=0) at /usr/src/sys/ddb/db_command.c:494
> #5  0xffffffff80393a68 in db_script_exec (scriptname=<optimized out>,
> warnifnotfound=<optimized out>) at /usr/src/sys/ddb/db_script.c:304
> #6  0xffffffff80393892 in db_script_kdbenter (eventname=<optimized out>) at
> /usr/src/sys/ddb/db_script.c:326
> #7  0xffffffff80391ac3 in db_trap (type=<optimized out>, code=<optimized
> out>) at /usr/src/sys/ddb/db_main.c:251
> #8  0xffffffff807821e2 in kdb_trap (type=3, code=0, tf=0xfffffe01091fd410)
> at /usr/src/sys/kern/subr_kdb.c:700
> #9  0xffffffff809d870e in trap (frame=0xfffffe01091fd410) at
> /usr/src/sys/amd64/amd64/trap.c:583
> #10 <signal handler called>
> #11 kdb_enter (why=0xffffffff80b17af9 "panic", msg=0xffffffff80b17af9
> "panic") at /usr/src/sys/kern/subr_kdb.c:486
> #12 0xffffffff8073ddce in vpanic (fmt=<optimized out>, ap=<optimized out>)
> at /usr/src/sys/kern/kern_shutdown.c:975
> #13 0xffffffff8073dc23 in panic (fmt=0xffffffff81178120 <cnputs_mtx+24> "")
> at /usr/src/sys/kern/kern_shutdown.c:909
> #14 0xffffffff8078f09c in sleepq_timeout (arg=0xfffff8005b6a1000) at
> /usr/src/sys/kern/subr_sleepqueue.c:1042
> #15 0xffffffff8075594a in softclock_call_cc (c=0xfffff8005b6a1470,
> cc=0xffffffff811a0580 <cc_cpu+320>, now=190543869798505, direct=1) at
> /usr/src/sys/kern/kern_timeout.c:768
> #16 0xffffffff8075567c in callout_process (now=190543869798505) at
> /usr/src/sys/kern/kern_timeout.c:511
> #17 0xffffffff80a3d455 in handleevents (now=190543869798505, fake=0) at
> /usr/src/sys/kern/kern_clocksource.c:213
> #18 0xffffffff80a3da46 in timercb (et=0xffffffff81251238 <lapic_et>,
> arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:357
> #19 0xffffffff80a680cb in lapic_handle_timer (frame=0xfffffe01091fd7d0) at
> /usr/src/sys/x86/x86/local_apic.c:1339
> #20 <signal handler called>
> #21 0xffffffff81d90654 in buf_hash_find
> ...
> 
> (kgdb) fr 18
> #18 0xffffffff80a3da46 in timercb (et=0xffffffff81251238 <lapic_et>,
> arg=<optimized out>) at /usr/src/sys/kern/kern_clocksource.c:357
> 357     /usr/src/sys/kern/kern_clocksource.c: No such file or directory.
> (kgdb) i loc
> state = <optimized out>
> next = 0xfffffe00015b7b80
> now = 190543869798505
> bcast = <error reading variable bcast (Cannot access memory at address 0x0)>
> cpu = <optimized out>
> 
> (kgdb) fr 15
> #15 0xffffffff8075594a in softclock_call_cc (c=0xfffff8005b6a1470,
> cc=0xffffffff811a0580 <cc_cpu+320>, now=190543869798505, direct=1) at
> /usr/src/sys/kern/kern_timeout.c:768
> 768     /usr/src/sys/kern/kern_timeout.c: No such file or directory.
> (kgdb) p *c
> $1 = {c_links = {le = {le_next = 0x0, le_prev = 0xfffffe000029f398}, sle =
> {sle_next = 0x0}, tqe = {tqe_next = 0x0, tqe_prev = 0xfffffe000029f398}},
> c_time = 190543869738008, c_precision = 2684354, c_arg = 0xfffff8005b6a1000,
>   c_func = 0xffffffff8078f000 <sleepq_timeout>, c_lock = 0x0, c_flags = 2,
> c_iflags = 272, c_cpu = 1, c_exec_time = 190543869798505, c_lines = {u128 =
> 5514281696343066435261632100963648521, u16 = {1033, 755, 1225, 628, 409,
>       651, 788, 1062}}}
> 
> Here is an explanation for the numbers reproted in the panic message (sorted
> from earliest to latest):
> 190543869603412 - 'now' as seen in sleepq_timeout(), returned by sbinuptime();
> 190543869738008 - td_sleeptimo, also c_time in the callout;
> 190543869798505 - 'now' as captured when the LAPIC timer fired, seen in the
> stack trace and also recorded as c_exec_time in the callout.
> 
> Just in case, here is the code that printed the panic message:
>                         panic("too early sleepq timeout, %jd > %jd (%jd)",
> 
> 
>                             (intmax_t)td->td_sleeptimo,
> 
> 
>                             (intmax_t)sbinuptime(),
> 
> 
>                             (intmax_t)td->td_slpcallout.c_exec_time);
> 
> As can be seen, sbinuptime() in sleepq_timeout() returned a value smaller
> than what sbinuptime() returned in timercb().
> 
> It seems that the code in callout_process() was right to fire the callout
> because now > td_sleeptimo there.
> But sleepq_timeout() thought that it was premature as sbinuptime() <
> td_sleeptimo there.
> 
> In the above case the eventtimer is obviously LAPIC, the timecounter is HPET.
> But we also saw the same issue when we changed the timecounter to ACPI-fast.
> I think that we haven't tried any other choices because:
> kern.timecounter.tc.ACPI-fast.quality: 900
> kern.timecounter.tc.i8254.quality: 0
> kern.timecounter.tc.HPET.quality: 950
> kern.timecounter.tc.TSC-low.quality: -100
> 
> I see three possibilities of why sbinuptime() could go backwards:
> - a bug in FreeBSD time keeping code that gets triggered only under very
> specific conditions that don't normally happen
> - a bug in VMWare, e.g. in HPET and ACPI timer emulation
> - something exotic, like correct C code miscompiled to incorrect machine
> code or weird memory model violation or etc...
> 
> I'd appreciate any suggestions on additional diagnostics to narrow down /
> rule out the possibilities.
I do not have any other idea than to try and see if sbinuptime() might go
backward, by running a parallel test of it on several CPUs.  Might be,
also record the raw timecounter value from the hw register.

> 
> P.S.
> As a workaround I could modify sleepq_timeout() to get "current time" from
> c_exec_time (added by us) instead of sbinuptime().  c_exec_time is the value
> of 'now' in callout_process() when it decides that the callout should fire.
> But I'd like to get to the bottom of the issue.
This was done quite recently in 6df1359e5542f69179c142 by mav.