huge nanosleep variance on 11-stable

Konstantin Belousov kostikbel at gmail.com
Wed Nov 2 07:55:20 UTC 2016


On Tue, Nov 01, 2016 at 02:29:13PM -0700, Jason Harmening wrote:
> repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
> 
> On 11/01/16 13:58, Jason Harmening wrote:
> > Hi everyone,
> > 
> > I recently upgraded my main amd64 server from 10.3-stable (r302011) to
> > 11.0-stable (r308099).  It went smoothly except for one big issue:
> > certain applications (but not the system as a whole) respond very
> > sluggishly, and video playback of any kind is extremely choppy.
> > 
> > The system is under very light load, and I see no evidence of abnormal
> > interrupt latency or interrupt load.  More interestingly, if I place the
> > system under full load (~0.0% idle) the problem *disappears* and
> > playback/responsiveness are smooth and quick.
> > 
> > Running ktrace on some of the affected apps points me at the problem:
> > huge variance in the amount of time spent in the nanosleep system call.
> > A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
> > to return of the syscall.  OTOH, anything CPU-bound or that waits on
> > condvars or I/O interrupts seems to work fine, so this doesn't seem to
> > be an issue with overall system latency.
> > 
> > I can repro this with a simple program that just does a 3ms usleep in a
> > tight loop (i.e. roughly the amount of time a video player would sleep
> > between frames @ 30fps).  At light load ktrace will show the huge
> > nanosleep variance; under heavy load every nanosleep will complete in
> > almost exactly 3ms.
> > 
> > FWIW, I don't see this on -current, although right now all my -current
> > images are VMs on different HW so that might not mean anything.  I'm not
> > aware of any recent timer- or scheduler- specific changes, so I'm
> > wondering if perhaps the recent IPI or taskqueue changes might be
> > somehow to blame.
> > 
> > I'm not especially familiar w/ the relevant parts of the kernel, so any
> > guidance on where I should focus my debugging efforts would be much
> > appreciated.
> > 
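
For reference, a minimal repro along the lines Jason describes might look
like the sketch below (hypothetical code, not the actual pastebin
program): sleep 3ms in a tight loop and print how long each nanosleep()
really took.

#include <err.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec req, t0, t1;
	double ms;

	req.tv_sec = 0;
	req.tv_nsec = 3 * 1000 * 1000;	/* 3ms, roughly one frame at 30fps */
	for (;;) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		if (nanosleep(&req, NULL) != 0)
			err(1, "nanosleep");
		clock_gettime(CLOCK_MONOTONIC, &t1);
		ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
		    (t1.tv_nsec - t0.tv_nsec) / 1000000.0;
		/* Expect ~3ms; the reported jitter shows up here. */
		printf("slept %.3f ms\n", ms);
	}
}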

I am confident, with a very high degree of certainty, that the issue is a
CPU bug in the interaction between deep sleep states (C6) and the LAPIC
timer.  Check which hardware is used for the eventtimers:
	sysctl kern.eventtimer.timer
It should report LAPIC, and you should get rid of the jitter by setting
the sysctl to HPET.  Also, please show the first 50 lines of the verbose
boot dmesg.
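
For example (the output below is illustrative; the available timers and
their quality values vary by machine):
	# sysctl kern.eventtimer.choice
	kern.eventtimer.choice: LAPIC(600) HPET(450) i8254(100) RTC(0)
	# sysctl kern.eventtimer.timer=HPET
	kern.eventtimer.timer: LAPIC -> HPET
The setting can be made persistent by adding kern.eventtimer.timer=HPET
to /etc/sysctl.conf.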

I know that Nehalem cores are affected; I do not know whether the bug
was fixed for Westmere or not.  I asked an Intel contact about the
problem, but got no response, which is not unreasonable given that these
CPUs are past their support lifetime.  I intended to automatically bump
the HPET quality on Nehalem, and possibly on Westmere, but I was not able
to check Westmere and waited for more information, so this was forgotten.
BTW, using the latest CPU microcode did not help.

After I discovered this, I specifically looked at my Sandy Bridge and
Haswell test systems, but they do not exhibit this problem.

In the Intel document 320836-036US, 'Intel(R) Core(TM) i7-900 Desktop
Processor Extreme Edition Series and Intel(R) Core(TM) i7-900 Desktop
Processor Series Specification Update', there are two errata which might
be relevant and point at LAPIC bugs: AAJ47 (but the default is to not use
periodic mode) and AAJ121.  AAJ121 might be the real cause, but Intel
does not provide enough details to understand it.  And of course, the
suggested workaround is not feasible.
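
FWIW, whether periodic mode is in use can be checked with the
kern.eventtimer.periodic sysctl, which defaults to 0 (one-shot mode), so
AAJ47 should not apply:
	sysctl kern.eventtimer.periodic
	kern.eventtimer.periodic: 0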

Googling for 'Windows LAPIC Nehalem' shows very interesting results,
in particular,
https://support.microsoft.com/en-us/kb/2000977 (which I think is the bug
you see) and
https://hardware.slashdot.org/story/09/11/28/1723257/microsoft-advice-against-nehalem-xeons-snuffed-out
for amusement.

