calcru: negative time ... followed by freeze
Don Lewis
truckman at FreeBSD.org
Mon Jun 21 08:01:49 GMT 2004
On 21 Jun, Bruce Evans wrote:
> On Sun, 20 Jun 2004, Don Lewis wrote:
>
>> On 20 Jun, Robert Watson wrote:
>> > FYI, this is a Xeon box with two physical processors, each with two
>> > logical processors, and the problem could well be SMP-related.
>>
>> The hang might be SMP-related, but I just started getting the calcru
>> messages on my UP Athlon XP box. I just upgraded to today's
>> -CURRENT from a week-old version, and my console got spammed by a bunch
>> of these messages while I was running portupgrade.
>>
>> calcru: negative time of 0 usec for pid 21261 (sh)
>> calcru: negative time of 0 usec for pid 21261 (sh)
>> ...
>> calcru: negative time of 3917 usec for pid 49504 (sh)
>> calcru: negative time of 3917 usec for pid 49504 (sh)
>
> Hmm, these are nonnegative times.
>
> The extra messages in -current might be caused by fill_kinfo() now calling
> calcru().
>
> Misreporting of negative times is by the following too-simple dianostic:
>
> % tu = (u_int64_t)tv.tv_sec * 1000000 + tv.tv_usec;
> % ptu = p->p_uu + p->p_su + p->p_iu;
> % if (tu < ptu || (int64_t)tu < 0) {
> % printf("calcru: negative time of %jd usec for pid %d (%s)\n",
> % (intmax_t)tu, p->p_pid, p->p_comm);
> % tu = ptu;
> % }
>
> The message is also printed for the tu < ptu case, which is what you are
> getting.
>
> I fixed the messages when I got a lot of them due to a local bug. The local
> bug was from double rounding for calcru() on child times (which -current
> doesn't do). I can't see how this could be the problem in -current, since
> the components of ptu are rounded down and there is a KASSERT that they
> added up to less than tu in the previous call.
>
> Ah, here is a likely cause of the bug in -current:
>
> % if (p == curthread->td_proc) {
> % /*
> % * Adjust for the current time slice. This is actually fairly
> % * important since the error here is on the order of a time
> % * quantum, which is much greater than the sampling error.
> % * XXXKSE use a different test due to threads on other
> % * processors also being 'current'.
> % */
> % binuptime(&bt);
> % bintime_sub(&bt, PCPU_PTR(switchtime));
> % bintime_add(&bt, &p->p_runtime);
> % } else
> % bt = p->p_runtime;
>
> The XXXKSE comment is correct that this might be broken. If the (p
> != curthread->td_proc) case happens at all for a running process, then
> it gives a wrong (out of date) timestamp in bt. This wrongness will
> be detected if calcru() is was called called earlier in the current
> timeslice and took the other path here.
>
> The recent change to fill_kinfo() is quite likely to trigger detection
> of this bug. fill_kinfo() is often used to iterate over all processes
> for ps, so it will call calcru() with (p != curthread->td_proc) for
> all processes other than the one running it, and give a bt that is out
> of date for all such processes that are actually running. Since there
> can be at most one running process per CPU, this bug only affects SMP.
>
> The call to calcru() from ttyinfo() may be the only other trigger.
> ttyinfo() picks a process and should rarely or never pick the ithread
> running it, so it will almost always take the (p != curthread->td_proc)
> path. Again, this is only a problem for the SMP case since in the !SMP
> case the picked process must have been switched away from to run the
> ithread, so it cannot be running.
There must be some !SMP trigger for this as well. I just checked and I
was able to trigger this on my Pentium-M laptop as well by leaning on
the ^T key while I was logged on via ssh and running 'portupgrade -aP'.
Jun 21 00:41:31 hairball kernel: calcru: negative time of 23169 usec for pid 44653 (sh)
Jun 21 00:41:32 hairball kernel: calcru: negative time of 21990 usec for pid 44665 (sh)
I didn't use ^T on my Athlon box. I might have had top running, though.
It's interesting that this bug only seems to get triggered on /bin/sh.
Maybe it is fork()/exit()/wait() related?
More information about the freebsd-threads
mailing list