cpu timer issues

Tue Sep 28 11:02:21 UTC 2010

On 28.09.2010, at 10:54, Jurgen Weber <jurgen at ish.com.au> wrote:

> Hello List
> 
> We have been having issues with some firewall machines of ours using pfSense.
> 
> FreeBSD smash01.ish.com.au 7.2-RELEASE-p5 FreeBSD 7.2-RELEASE-p5 #0: Sun Dec  6 23:20:31 EST 2009 sullrich at FreeBSD_7.2_pfSense_1.2.3_snaps.pfsense.org:/usr/obj.pfSense/usr/pfSensesrc/src/sys/pfSense_SMP.7  i386
> 
> MotherBoard: http://www.supermicro.com/products/motherboard/Xeon3000/3200/X7SBi-LN4.cfm
> 
> Originally the systems started out by showing a lot of packet loss, the system time would fall behind, and the value of "#vmstat -i | grep timer" was dropping below 2000. I was led to believe by the guys at pfSense that this is where the value should sit. I would also receive errors in messages that looked like " kernel: calcru: runtime went backwards from 244314 usec to 236341".
> 
> We tried a variety of things: disabling USB, turning off Intel SpeedStep in the BIOS, disabling ACPI, etc. All had little to no effect. The only thing that would right it was restarting the box, but over time it would degrade again. I talked to SuperMicro and they said that this is a FreeBSD issue and pretty much washed their hands of it.
> 
> After a couple of months of dealing with this and just rebooting the systems regularly, the symptoms slowly but surely disappeared, i.e. the kernel messages went away, the system time was not falling behind and I was experiencing no packet loss, but the "#vmstat -i | grep timer" value would continue to decrease over time. Eventually, I think, when it finally got to 0 the machine restarted (I am only guessing here).
> 
> After this restart it worked again for a couple of hours and then it restarted again.
> 
> After the second time the system has not missed a beat, it has been fine and the "#vmstat -i | grep timer" value remained near the 2000 mark... We set up some zabbix monitoring to watch it. As mentioned, it was fine for about a month. Until today. Today the value has dropped to 0, but the system has not restarted and over the last couple of hours the value has increased to 47.
> 
> This machine is mission critical, we have two in a failover scenario (using pfSense's CARP features) and it seems unfortunate that we have an issue with two brand new SuperMicro boxes that affects both machines. While at the moment everything seems fine I want to ensure that I have no further issues. Does anyone have any suggestions?
> 
> Lastly, I have double-checked both of the below:
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#CALCRU-NEGATIVE-RUNTIME
> We disabled EIST.
> 
> http://www.freebsd.org/doc/en_US.ISO8859-1/books/faq/troubleshoot.html#COMPUTER-CLOCK-SKEW
> 
> # dmesg | grep Timecounter
> Timecounter "i8254" frequency 1193182 Hz quality 0
> Timecounters tick every 1.000 msec
> # sysctl kern.timecounter.hardware
> kern.timecounter.hardware: i8254
> 
> Only have one timer to choose from.
> 
> Thanks
> 
> Jurgen
> 

Hello,
vmstat -i calculates the interrupt rate as interrupt count / uptime, and the interrupt count is a 32-bit integer.
With high values of kern.hz it will overflow within a few days (with kern.hz=4000 it happens every 12 days or so).
If that is the case, use systat -vmstat 1 to get an accurate interrupt rate.
This is just FYI: I was confused by this once and it scared me a bit, and I started changing counters until I noticed it.
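
To make the arithmetic concrete, here is a rough Python sketch of the effect (not the actual vmstat code). It assumes the counter is an unsigned 32-bit value that wraps to 0 and that the rate shown is simply counter / uptime, as described above. The function names are made up for illustration.

```python
# Sketch only: models "rate = wrapped 32-bit count / uptime".
# Assumption: the counter is unsigned 32-bit and wraps to 0 on overflow.
COUNTER_BITS = 32
WRAP = 2 ** COUNTER_BITS
SECONDS_PER_DAY = 86400


def days_until_wrap(interrupts_per_second):
    """Days of uptime before the counter overflows at a steady rate."""
    return WRAP / interrupts_per_second / SECONDS_PER_DAY


def reported_rate(interrupts_per_second, uptime_seconds):
    """Rate that count / uptime would show once the counter has wrapped."""
    true_total = interrupts_per_second * uptime_seconds
    wrapped_total = true_total % WRAP  # what a 32-bit counter would hold
    return wrapped_total // uptime_seconds


if __name__ == "__main__":
    for rate in (2000, 4000):
        print(rate, "interrupts/s wraps after",
              round(days_until_wrap(rate), 1), "days")
    # One day of uptime: no wrap yet, so the reported rate is the true one.
    print(reported_rate(2000, 1 * SECONDS_PER_DAY))
    # 25 days of uptime: the counter has just wrapped, so the rate looks tiny.
    print(reported_rate(2000, 25 * SECONDS_PER_DAY))
```

With a steady 2000 interrupts per second the wrap comes after roughly 25 days, and right after it the computed rate falls to nearly 0 and then climbs slowly, even though the real interrupt rate never changed.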

P.S. Please forgive my poor English.