[Bug 198149] [hwpmc] pmcstat -P -t (top mode, process sampling) stops after a while

Fri May 8 19:34:29 UTC 2015

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=198149

--- Comment #15 from John Baldwin <jhb at FreeBSD.org> ---
So I don't have a good explanation of why the counters overflow without
generating an interrupt, however, it does seem like the interrupt is lost, or
perhaps there is a race where the counter can be disabled right at about the
time that it should fire, but the counter might still count a few events after
it has been disabled.  In particular, I captured this sequence of events
(filtered to only show CPU 1) where I got an overflow with a value of 2:

 72890   1  563885152232313 MDP:REA:2: iaf-read counter overflowed: 2
  2264   1  563885043095469 MDP:STO:1: iafctrl=0(0) globalctrl=0(0)
  2263   1  563885043094005 MDP:STO:1: iaf-stop iafctrl=0
  2262   1  563885043093777 MDP:STO:1: iaf-stop cpu=1 ri=2
  2256   1  563885043091809 CSW:SWO:1: cpu=1 proc=0xfffff8005698c538 (5697,
threadspin) pp=0xfffff80052fea600
  2188   1  563885043029837 MDP:INT:1: cpu=1 fixedctrl=b00 globalctrl=400000000
status=0 ovf=0
  2187   1  563885043028445 MDP:INT:1: cpu=1 intrenable=400000000
  2186   1  563885043028165 MDP:INT:1: iaf-intr cpu=1 error=0
v=ffffffff0000(ffffffff0000)
  2185   1  563885043026045 MDP:INT:1: cpu=1 intrstatus=400000000
  2184   1  563885043025621 MDP:INT:1: cpu=1 tf=0x0xfffffe083a551f30 um=1
  2113   1  563885042960229 MDP:INT:1: cpu=1 fixedctrl=b00 globalctrl=400000000
status=0 ovf=0
  2112   1  563885042958865 MDP:INT:1: cpu=1 intrenable=400000000
  2111   1  563885042958665 MDP:INT:1: iaf-intr cpu=1 error=0
v=ffffffff0000(ffffffff0000)
  2110   1  563885042957205 MDP:INT:1: cpu=1 intrstatus=400000000
  2109   1  563885042956765 MDP:INT:1: cpu=1 tf=0x0xfffffe083a551f30 um=0
  2108   1  563885042955049 MDP:SWI:1: cr4=0x406e0
  2107   1  563885042954861 MDP:SWI:1: pc=0xfffff8014d0e5300
pp=0xfffff80052fea600 enable-msr=0
  2106   1  563885042954569 MDP:STA:1: iafctrl=b00(b00)
globalctrl=400000000(400000000)
  2103   1  563885042952493 MDP:STA:1: iaf-start cpu=1 ri=2
  2102   1  563885042952109 MDP:WRI:1: iaf-write cpu=1 ri=2 msr=0x40000002
v=fffffffff7d2 iafctrl=0 pmc=fffffffff7d2
  2094   1  563885042949821 CSW:SWI:1: cpu=1 ri=23 new=2094
  2087   1  563885042947433 CSW:SWI:1: cpu=1 proc=0xfffff8005698c538 (5697,
threadspin) pp=0xfffff80052fea600

Here you can see that the interrupts for are occurring "about" every 75 KTR
events or so, so the next one would have fired at around 2250.  That is when
the switchout begins.  Note that there is a long gap between the call to
iaf_stop() from pmc_process_csw_out() and the later call to iaf_read(), not
just in KTR events, but also in terms of clock cycles (109136844 cycles on a
3.3 GHz CPU, so ~33 milliseconds assuming my math is correct).  All of the
intervening PMC events are PMC interrupts on other CPUs during that time.  I'm
actually not sure what CPU 1 could have been spinning on for that duration
aside from the KTR index itself.

I don't see anything that should be blocking the PMC interrupt (an NMI) either,
so I wonder if it's possible for the PMC to increment a few times after it has
been disabled?  I do find that if I change my command to only monitor user
events, then this goes away, so presumably the counter was still "running" and
counting clock cycles in the PMC code itself when it overflowed.  Given that,
I'm just inclined to apply the first hack patch as-is (though also fix the
similar function for IAP).

-- 
You are receiving this mail because:
You are the assignee for the bug.