RELENG_8 pf stack issue (state count spiraling out of control)

Tue May 3 09:22:59 UTC 2011

On Mon, May 02, 2011 at 06:58:54PM -0700, Jeremy Chadwick wrote:

> Here's one piece of core.0.txt which makes no sense to me -- the "rate"
> column.  I have a very hard time believing that was the interrupt rate
> of all the relevant devices at the time (way too high).  Maybe this data
> becomes wrong only during a coredump?  The total column I could believe.
> 
> ------------------------------------------------------------------------
> vmstat -i
> 
> interrupt                          total       rate
> irq4: uart0                        54768        912
> irq6: fdc0                             1          0
> irq17: uhci1+                        172          2
> irq23: uhci3 ehci1+                 2367         39
> cpu0: timer                  13183882632  219731377
> irq256: em0                    260491055    4341517
> irq257: em1                    127555036    2125917
> irq258: ahci0                  225923164    3765386
> cpu2: timer                  13183881837  219731363
> cpu1: timer                  13002196469  216703274
> cpu3: timer                  13183881783  219731363
> Total                        53167869284  886131154
> ------------------------------------------------------------------------
> 
> Here's what a normal "vmstat -i" shows from the command-line:
> 
> # vmstat -i
> interrupt                          total       rate
> irq4: uart0                          518          0
> irq6: fdc0                             1          0
> irq23: uhci3 ehci1+                  145          0
> cpu0: timer                     19041199       1999
> irq256: em0                       614280         64
> irq257: em1                       168529         17
> irq258: ahci0                     355536         37
> cpu2: timer                     19040462       1999
> cpu1: timer                     19040458       1999
> cpu3: timer                     19040454       1999
> Total                           77301582       8119

The cpu0-3 timer totals seem consistent in the first output: 
13183881783/1999/60/60/24 matches 76 days of uptime.

The high rate in the first output comes from vmstat.c dointr()'s
division of the total by the uptime:

        struct timespec sp;
        clock_gettime(CLOCK_MONOTONIC, &sp);
        uptime = sp.tv_sec;
	for (i = 0; i < nintr; i++) {
		printf("%-*s %20lu %10lu\n", istrnamlen, intrname,
		    *intrcnt, *intrcnt / uptime);
	}

>From this we can deduce that the value of uptime must have been
13183881783/219731363 = 60 (seconds).

Since the uptime was 76 days (and not just 60 seconds), the
CLOCK_MONOTONIC clock must have reset, wrapped, or been overwritten.

I don't know how that's possible, but if this means that the kernel
variable time_second was possibly going back, that could very well
have messed up pf's state purging.

Daniel