huge nanosleep variance on 11-stable

Jason Harmening jason.harmening at gmail.com
Wed Nov 2 16:11:21 UTC 2016



On 11/02/16 00:55, Konstantin Belousov wrote:
> On Tue, Nov 01, 2016 at 02:29:13PM -0700, Jason Harmening wrote:
>> Repro code is at http://pastebin.com/B68N4AFY if anyone's interested.
>>
>> On 11/01/16 13:58, Jason Harmening wrote:
>>> Hi everyone,
>>>
>>> I recently upgraded my main amd64 server from 10.3-stable (r302011) to
>>> 11.0-stable (r308099).  It went smoothly except for one big issue:
>>> certain applications (but not the system as a whole) respond very
>>> sluggishly, and video playback of any kind is extremely choppy.
>>>
>>> The system is under very light load, and I see no evidence of abnormal
>>> interrupt latency or interrupt load.  More interestingly, if I place the
>>> system under full load (~0.0% idle) the problem *disappears* and
>>> playback/responsiveness are smooth and quick.
>>>
>>> Running ktrace on some of the affected apps points me at the problem:
>>> huge variance in the amount of time spent in the nanosleep system call.
>>> A sleep of, say, 5ms might take anywhere from 5ms to ~500ms from entry
>>> to return of the syscall.  OTOH, anything CPU-bound or that waits on
>>> condvars or I/O interrupts seems to work fine, so this doesn't seem to
>>> be an issue with overall system latency.
>>>
>>> I can repro this with a simple program that just does a 3ms usleep in a
>>> tight loop (i.e. roughly the amount of time a video player would sleep
>>> between frames @ 30fps).  At light load ktrace will show the huge
>>> nanosleep variance; under heavy load every nanosleep will complete in
>>> almost exactly 3ms.
>>>
>>> FWIW, I don't see this on -current, although right now all my -current
>>> images are VMs on different HW so that might not mean anything.  I'm not
>>> aware of any recent timer- or scheduler-specific changes, so I'm
>>> wondering if perhaps the recent IPI or taskqueue changes might be
>>> somehow to blame.
>>>
>>> I'm not especially familiar w/ the relevant parts of the kernel, so any
>>> guidance on where I should focus my debugging efforts would be much
>>> appreciated.
>>>
> 
> I am confident, with a very high degree of certainty, that the issue is a
> CPU bug in the interaction between deep sleep states (C6) and the LAPIC timer.
> Check what hardware is used for the eventtimers,
> 	sysctl kern.eventtimer.timer
> It should report LAPIC, and you should get rid of the jitter by setting
> the sysctl to HPET (sysctl kern.eventtimer.timer=HPET).  Also please show
> the first 50 lines of the verbose boot dmesg.
> 
> I know that the Nehalem cores are affected; I do not know whether the
> bug was fixed in Westmere.  I asked an Intel contact about the problem,
> but got no response.  That is not unreasonable, given that these CPUs
> are past their support lifetime.  I intended to automatically bump the
> HPET quality on Nehalem, and possibly Westmere, but I was not able to
> check Westmere and waited for more information, so this was forgotten.
> BTW, using the latest CPU microcode did not help.
> 
> After I discovered this, I specifically looked at my Sandy Bridge and
> Haswell test systems, but they do not exhibit this problem.
> 
> In the Intel document 320836-036US 'Intel(R) Core(TM) i7-900 Desktop
> Processor Extreme Edition Series and Intel(R) Core(TM) i7-900 Desktop
> Processor Series Specification Update', there are two errata that might
> be relevant and point to LAPIC bugs: AAJ47 (though the default is not to
> use periodic mode) and AAJ121.  AAJ121 might be the real cause, but
> Intel does not provide enough detail to be sure.  And of course, the
> suggested workaround is not feasible.
> 
> Googling for 'Windows LAPIC Nehalem' shows very interesting results,
> in particular,
> https://support.microsoft.com/en-us/kb/2000977 (which I think is the bug
> you see) and
> https://hardware.slashdot.org/story/09/11/28/1723257/microsoft-advice-against-nehalem-xeons-snuffed-out
> for amusement.
> 

I think you are probably right.  Hacking out the Intel-specific
additions to C-state parsing in acpi_cpu_cx_cst() from r282678 (thus
going back to sti;hlt instead of monitor+mwait at C1) fixed the problem
for me.  But r282678 also had the effect of enabling C2 and C3 on my
system, because ACPI only presents MWAIT entries for those states and
not p_lvlx.
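
In case it helps anyone compare notes, below is a quick sketch that dumps
the relevant C-state and eventtimer sysctls in one place.  The OID names
are the stock FreeBSD ones, and plain sysctl(8) shows the same
information; this is just a convenience for gathering it, not part of any
fix:

/*
 * Print the eventtimer and C-state sysctls discussed above.
 * Build: cc -o cxdump cxdump.c
 */
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>

int
main(void)
{
	const char *oids[] = {
		"kern.eventtimer.timer",	/* LAPIC vs HPET */
		"dev.cpu.0.cx_supported",	/* C-states ACPI offers */
		"dev.cpu.0.cx_usage",		/* which ones get entered */
		"hw.acpi.cpu.cx_lowest",	/* deepest allowed state */
	};
	char buf[256];

	for (size_t i = 0; i < sizeof(oids) / sizeof(oids[0]); i++) {
		size_t len = sizeof(buf);

		if (sysctlbyname(oids[i], buf, &len, NULL, 0) == 0)
			printf("%s: %s\n", oids[i], buf);
		else
			printf("%s: <unavailable>\n", oids[i]);
	}
	return (0);
}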

I will try switching to HPET when I have more time to test; it may be a
few days.
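
For reference, the usleep/nanosleep repro boils down to something like
the sketch below (illustrative only, not the exact pastebin code; the
3ms interval and the iteration count are arbitrary): sleep ~3ms in a
tight loop and report the worst-case latency.  On the affected box the
worst case under light load is in the hundreds of milliseconds, while
under full load it stays near 3ms.

/*
 * Sleep 3ms in a tight loop and report the worst nanosleep latency.
 * Build: cc -o sleepjitter sleepjitter.c
 */
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int
main(void)
{
	struct timespec req = { 0, 3000000 };	/* 3ms sleep request */
	struct timespec t0, t1;
	int64_t worst = 0;

	for (int i = 0; i < 1000; i++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		nanosleep(&req, NULL);
		clock_gettime(CLOCK_MONOTONIC, &t1);

		/* Elapsed wall time for this nanosleep, in nanoseconds. */
		int64_t ns = (int64_t)(t1.tv_sec - t0.tv_sec) * 1000000000 +
		    (t1.tv_nsec - t0.tv_nsec);
		if (ns > worst)
			worst = ns;
	}
	printf("worst nanosleep(3ms) latency: %.3f ms\n", worst / 1e6);
	return (0);
}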

