Re: git: 3d9d64aa1846 - main - kern_tc: unify timecounter to bintime delta conversion

From: Kyle Evans <kevans_at_freebsd.org>
Date: Tue, 30 Nov 2021 16:14:22 UTC
On Tue, Nov 30, 2021 at 7:34 AM Andriy Gapon <avg@freebsd.org> wrote:
>
> The branch main has been updated by avg:
>
> URL: https://cgit.FreeBSD.org/src/commit/?id=3d9d64aa1846217eac9229f8cba5cb6646a688b7
>
> commit 3d9d64aa1846217eac9229f8cba5cb6646a688b7
> Author:     Andriy Gapon <avg@FreeBSD.org>
> AuthorDate: 2021-11-30 13:23:23 +0000
> Commit:     Andriy Gapon <avg@FreeBSD.org>
> CommitDate: 2021-11-30 13:23:23 +0000
>
>     kern_tc: unify timecounter to bintime delta conversion
>
>     There are two places where we convert from a timecounter delta to
>     a bintime delta: tc_windup and bintime_off.
>     Both functions use the same calculations when the timecounter delta is
>     small.  But for a large delta (greater than approximately an equivalent
>     of 1 second) the calculations were different.  Both functions use
>     approximate calculations based on th_scale that avoid division.  Both
>     produce values slightly greater than a true value, calculated with
>     division by tc_frequency, would be.  tc_windup is slightly more
>     accurate, so its result is closer to the true value and, thus, smaller
>     than the bintime_off result.
>
>     As a consequence there can be a jump back in time when time hands are
>     switched after a long period of time (a large delta).  Just before the
>     switch the time would be calculated with a large delta from
>     th_offset_count in bintime_off.  tc_windup does the switch using its own
>     calculations of a new th_offset using the large delta.  As explained
>     earlier, the new th_offset may end up being less than the previously
>     produced binuptime.  So, for a period of time new binuptime values may
>     be "back in time" compared to values just before the switch.
>
>     Such a jump must never happen.  All the code assumes that the uptime is
>     monotonically nondecreasing and some code works incorrectly when that
>     assumption is broken.  For example, we have observed sleepq_timeout()
>     ignoring a timeout when the sbinuptime value obtained by the callout
>     code was greater than the expiration value, but the sbinuptime obtained
>     in sleepq_timeout() was less than it.  In that case the target thread
>     would never get woken up.
>
>     The unified calculations should ensure the monotonic property of the
>     uptime.
>
>     The problem is quite rare as normally tc_windup should be called HZ
>     times per second (typically 1000 or 100).  But it may happen in VMs on
>     very busy hypervisors where a VM's virtual CPU may not get an execution
>     time slot for a second or more.
>
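
To make the mechanism in the commit message concrete, here is a minimal
userland sketch. It is not the kern_tc.c code. struct bt, scale_for(),
delta_by_scale() and delta_by_seconds() are hypothetical names, and the
scale is simply assumed to be 2^64 / frequency rounded up, so that
multiplying by it slightly overestimates. delta_by_scale() models a reader
that scales the whole delta; delta_by_seconds() models an updater that
first peels off whole seconds and scales only the remainder.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for struct bintime: seconds plus a 64-bit fraction. */
struct bt {
	uint64_t sec;
	uint64_t frac;
};

/* Add x to the fraction, carrying into the seconds on wraparound. */
static void
bt_addx(struct bt *b, uint64_t x)
{
	uint64_t old = b->frac;

	b->frac += x;
	if (b->frac < old)
		b->sec++;
}

/* ASSUMPTION: scale is ceil(2^64 / freq), i.e. it never underestimates. */
static uint64_t
scale_for(uint64_t freq)
{
	assert(freq > 1);
	return (UINT64_MAX / freq + 1);
}

/* Reader-style conversion: scale the whole delta, however large it is. */
static struct bt
delta_by_scale(uint64_t scale, uint32_t delta)
{
	struct bt b = { 0, 0 };
	uint64_t x;

	/* Split the multiplication so neither product overflows 64 bits. */
	x = (scale >> 32) * delta;
	b.sec += x >> 32;
	bt_addx(&b, x << 32);
	bt_addx(&b, (scale & 0xffffffff) * delta);
	return (b);
}

/* Updater-style conversion: count whole seconds exactly, scale the rest. */
static struct bt
delta_by_seconds(uint64_t scale, uint64_t freq, uint32_t delta)
{
	struct bt b;
	uint64_t d = delta, secs = 0;

	while (d >= freq) {
		d -= freq;
		secs++;
	}
	b = delta_by_scale(scale, (uint32_t)d);
	b.sec += secs;
	return (b);
}
```

With an assumed 1 GHz counter and a delta of 3500000000 ticks (3.5 s), both
functions return 3 whole seconds, but delta_by_scale() yields the larger
fraction because its rounding error grows with every second in the delta.
A reader using it just before the switch would therefore see a later time
than the new offset computed the delta_by_seconds() way.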

I wonder if this helps explain the behavior we saw when enabling the TSC
timecounter on VirtualBox guests. Threads doing short sleeps of roughly
1 second or less would start to miss their wakeups, so we'd consistently
see, e.g., shutdown issues after applying a high load, while waiting for
the bufdaemon threads.