Polling tuning and performance

Alan Amesbury amesbury at umn.edu
Thu Dec 14 14:40:10 PST 2006

This is a long one, but mainly because I've tried to include notes about
what I've already looked at.  Thanks in advance for taking the time to
read this.

I have a FreeBSD 6.1-RELEASE/amd64 system which routinely needs to
accept traffic at fairly high rates.  'systat -if' suggests 428551GB
(not a typo, but possibly a display bug in 'systat') over the past 63
days, or an average rate of a bit over 600Mb/sec.  'time tcpdump ...'
tends to back up this figure:

amesbury at host % sudo time tcpdump -i bge1 -n -w /dev/null -c 1000000
tcpdump: WARNING: bge1: no IPv4 address assigned
tcpdump: listening on bge1, link-type EN10MB (Ethernet), capture size 96
1000000 packets captured
1000395 packets received by filter
167 packets dropped by kernel
0.268u 0.153s 0:06.84 5.9%      901+3236k 0+0io 0pf+0w

What I'm aiming for, of course, is zero packet loss.  Realizing that's
probably impossible for this system given its load, I'm trying to do
what I can to minimize loss.
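
The figures quoted above can be sanity-checked with a little arithmetic.
What follows is only a minimal Python sketch, not part of the original
measurements.  It assumes that the "GB" reported by 'systat' means 10**9
bytes; the function names are made up for illustration.

```python
# Back-of-the-envelope check of the rates quoted above.
# Assumption: systat's "GB" means 10**9 bytes (not 2**30).

SECONDS_PER_DAY = 86_400


def average_mbit_per_sec(gigabytes: float, days: float) -> float:
    """Average throughput in Mb/s for a byte total spread over `days`."""
    bits = gigabytes * 1e9 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


def packets_per_sec(packets: int, elapsed_seconds: float) -> float:
    """Mean packet rate over a capture of known duration."""
    return packets / elapsed_seconds


def drop_percent(dropped: int, received: int) -> float:
    """Share of received packets that the kernel dropped, in percent."""
    return 100.0 * dropped / received


if __name__ == "__main__":
    # 428551 GB over 63 days, per 'systat -if'
    print(f"systat average: {average_mbit_per_sec(428551, 63):.0f} Mb/s")
    # 1000000 packets captured in 6.84 s wall-clock, per 'time tcpdump'
    print(f"tcpdump rate:   {packets_per_sec(1_000_000, 6.84):.0f} packets/s")
    # 167 dropped out of 1000395 received by filter
    print(f"tcpdump drops:  {drop_percent(167, 1_000_395):.4f} %")
```

Under those assumptions this works out to roughly 630Mb/sec on average,
about 146K packets/sec during the capture, and a bit under 0.02% of
packets dropped.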

The system is running a somewhat leaner kernel than GENERIC.  Notable
changes include:

	* PREEMPTION disabled - /sys/conf/NOTES says this helps with
	  interactivity.  I don't care about interactive performance
	  on this host.

	* COMPAT_FREEBSD4, COMPAT_LINUX32, and COMPAT_43 are removed.
	  They appear to be unneeded.

	* SMP is enabled, as this is a dual-core box (not HTT!).

	* Many devices are removed, e.g., ncr(4), sym(4), adv(4), and
	  other unnecessary block devices; anything relating to cardbus;
	  de(4), bce(4), ti(4), wb(4), ed(4), ex(4), lnc(4), and a
	  number of other network devices that aren't going to ever be
	  used; etc.

	* All wlan(4) and related drivers are gone.

	* pf(4), pflog(4), and some of the ALTQ stuff have been added in,
	  but are not actively used on this host (at the moment).

	  are enabled.

	* Most importantly, HZ=1000, and DEVICE_POLLING and
	  AUTO_EOI_1 are included.  (AUTO_EOI_1 was added because
	  /sys/amd64/conf/NOTES says this can save a few microseconds
	  on some interrupts.  I'm not worried about suspend/resume, but
	  definitely want speed, so it got added.)

As mentioned above, this host is running FreeBSD/amd64, so there's no
need to remove support for I586_CPU, et al; that stuff was never there
in the first place.

Since kern.polling.enable is marked as deprecated in
/sys/kern/kern_poll.c, I'm enabling polling specifically for the
interface receiving the high-volume traffic.  (It is NOT enabled for the
other interface on this system, but traffic loads there are orders of
magnitude lower, so I didn't think it was necessary.)

As mentioned above, I've got HZ set to 1000.  Per /sys/amd64/conf/NOTES,
I'd considered setting it to 2000, but discovered previously that
FreeBSD's RFC1323 support breaks at that rate.  I documented this on
-hackers last year:


Since I've not seen word on a correction for this being added to
FreeBSD, I've limited HZ to 1000.

After reading polling(4) a couple times, I set kern.polling.burst_max to
1000.  The manpage says that "each interface can receive at most (HZ *
burst_max) packets per second", and the default setting is 150, which is
described as "adequate for 100Mbit network and HZ=1000."  I figured,
"Hey, gigabit, how about ten times the default?" but that's prevented by
"#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.

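The (HZ * burst_max) ceiling quoted from polling(4) is easy to tabulate.
The Python sketch below is illustrative only: it takes the formula from
the manpage at face value and, for comparison, computes the theoretical
frame rate of gigabit Ethernet, assuming the standard 20 bytes of
per-frame wire overhead (preamble plus inter-frame gap).  The helper
names are hypothetical.

```python
# Compare the polling(4) per-interface ceiling with gigabit line rate.

HZ = 1000  # "options HZ=1000" in the kernel config

# Per-frame overhead on the wire that is not part of the frame itself:
# 8 bytes preamble/SFD + 12 bytes inter-frame gap.
ETHERNET_OVERHEAD_BYTES = 20


def poll_ceiling_pps(hz: int, burst_max: int) -> int:
    """Upper bound on packets/s per interface, per polling(4)."""
    return hz * burst_max


def line_rate_pps(link_bits_per_sec: int, frame_bytes: int) -> int:
    """Maximum whole frames per second the link can carry."""
    wire_bits = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bits_per_sec // wire_bits


if __name__ == "__main__":
    for burst_max in (150, 1000):
        print(f"burst_max={burst_max:4d}: "
              f"{poll_ceiling_pps(HZ, burst_max):,} packets/s ceiling")
    for frame_bytes in (64, 1518):
        print(f"{frame_bytes:4d}-byte frames: "
              f"{line_rate_pps(10**9, frame_bytes):,} frames/s at 1Gb/s")
```

With HZ=1000 the default burst_max of 150 caps an interface at 150,000
packets/sec, and the compiled-in maximum of 1000 caps it at 1,000,000.
A gigabit link full of minimum-size (64-byte) frames could in theory
deliver about 1.49 million frames/sec.
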
In theory that might've been good enough, but polling(4) says that
kern.polling.burst is "[the] [m]aximum number of packets grabbed from
each network interface in each timer tick.  This number is dynamically
adjusted by the kernel, according to the programmed user_frac,
burst_max, CPU speed, and system load."  I keep seeing
kern.polling.burst hit a thousand, which leads me to believe that
kern.polling.burst_max needs to be higher.

For example:

	secs since
	  epoch	      kern.polling.burst
	----------    ------------------
	1166133997       1000
	1166134006        550
	1166134015        877
	1166134024       1000
	1166134033       1000
	1166134042       1000
	1166134051       1000
	1166134060       1000
	1166134069       1000
	1166134078       1000
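
Samples like the ones above could be collected and summarized with a
short script.  This is a hedged sketch rather than the method actually
used for the table: read_burst() assumes a FreeBSD host with polling
enabled and sysctl(8) on the PATH, and the helper names are invented
for illustration.

```python
import subprocess

# The ten samples from the table above: "epoch-seconds burst".
SAMPLES = """\
1166133997 1000
1166134006 550
1166134015 877
1166134024 1000
1166134033 1000
1166134042 1000
1166134051 1000
1166134060 1000
1166134069 1000
1166134078 1000
"""


def read_burst() -> int:
    """Read the current kern.polling.burst value (FreeBSD only)."""
    result = subprocess.run(
        ["sysctl", "-n", "kern.polling.burst"],
        capture_output=True, text=True, check=True,
    )
    return int(result.stdout.strip())


def parse_samples(text: str) -> list[tuple[int, int]]:
    """Turn 'timestamp burst' lines into (timestamp, burst) pairs."""
    pairs = []
    for line in text.splitlines():
        if line.strip():
            timestamp, burst = line.split()
            pairs.append((int(timestamp), int(burst)))
    return pairs


def fraction_at_cap(samples: list[tuple[int, int]], cap: int) -> float:
    """Fraction of samples where burst has reached the cap."""
    return sum(1 for _, burst in samples if burst >= cap) / len(samples)


if __name__ == "__main__":
    samples = parse_samples(SAMPLES)
    print(f"{len(samples)} samples, "
          f"{fraction_at_cap(samples, 1000):.0%} at burst_max=1000, "
          f"minimum {min(burst for _, burst in samples)}")
```

For the ten samples shown, eight sit at the cap of 1000 and the lowest
is 550.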

Unfortunately, raising the (HZ * burst_max) ceiling appears to be
possible only by a) patching /sys/kern/kern_poll.c to allow larger
values; or b) setting HZ to 2000, as indicated in one of the NOTES,
which will effectively hose certain TCP connectivity because of the
RFC1323 breakage.  Looked at another
way, both essentially require changes to source code, the former being
fairly obvious, and the latter requiring fixes to the RFC1323 support.
Either way, I think that's a bit beyond my abilities; I have NO
illusions about my kernel h4cking sk1llz.

Other possibly relevant data points:

	* System load hovers right around 1.

	* The system has almost zero disk activity.

	* With polling off:

	  - 'vmstat 5' consistently shows about 13K context switches
	    and ~6800 interrupts
	  - 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
	    for bge1, and near zero for everything else
	  - CPU load drops to 0.4-0.8, but CPU idle time sits around 80%

	* With polling on, kern.polling.burst_max=150:

	  - kern.polling.burst holds at 150
	  - 'vmstat 5' shows context switches hold around 2600, with
	    interrupts holding around 30K
	  - 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
	    doesn't increase!), other rates stay the same (looks like
	    possible display bugs in 'vmstat -i' here!)
	  - CPU load holds at 1, but CPU idle time usually stays >95%

	* With polling on, kern.polling.burst_max=1000:

	  - kern.polling.burst is frequently 1000 and almost always >850
	  - 'vmstat 5' shows context switches unchanged, but interrupts
	    are 150K-190K
	  - 'vmstat -i' unchanged from burst_max=150
	  - CPU load and CPU idle time very similar to burst_max=150

So, with all that in mind.....  Any ideas for improvement?  Apologies in
advance for missing the obvious.  'dmesg' and kernel config are attached.

Alan Amesbury
OIT Security and Assurance
University of Minnesota
-------------- next part --------------

machine		amd64

# To statically compile in device wiring instead of /boot/device.hints
#hints		"GENERIC.hints"		# Default places to look for devices.

makeoptions	DEBUG=-g		# Build kernel with gdb(1) debug symbols

#options 	SCHED_ULE		# ULE scheduler
options 	SCHED_4BSD		# 4BSD scheduler
#options 	PREEMPTION		# Enable kernel thread preemption
options 	INET			# InterNETworking
options 	INET6			# IPv6 communications protocols
options 	FFS			# Berkeley Fast Filesystem
options 	SOFTUPDATES		# Enable FFS soft updates support
options 	UFS_ACL			# Support for access control lists
options 	UFS_DIRHASH		# Improve performance on big directories
options 	MD_ROOT			# MD is a potential root device
options 	NFSCLIENT		# Network Filesystem Client
options 	NFSSERVER		# Network Filesystem Server
options 	NFS_ROOT		# NFS usable as /, requires NFSCLIENT
options 	MSDOSFS			# MSDOS Filesystem
options 	CD9660			# ISO 9660 Filesystem
options 	PROCFS			# Process filesystem (requires PSEUDOFS)
options 	PSEUDOFS		# Pseudo-filesystem framework
options 	GEOM_GPT		# GUID Partition Tables.
options 	COMPAT_IA32		# Compatible with i386 binaries
options 	COMPAT_FREEBSD5		# Compatible with FreeBSD5
options 	SCSI_DELAY=5000		# Delay (in ms) before probing SCSI
options 	KTRACE			# ktrace(1) support
options 	SYSVSHM			# SYSV-style shared memory
options 	SYSVMSG			# SYSV-style message queues
options 	SYSVSEM			# SYSV-style semaphores
options 	_KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options 	KBD_INSTALL_CDEV	# install a CDEV entry in /dev
options 	AHC_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~128k to driver.
options 	AHD_REG_PRETTY_PRINT	# Print register bitfields in debug
					# output.  Adds ~215k to driver.
options 	ADAPTIVE_GIANT		# Giant mutex is adaptive.

options 	SMP			# Symmetric MultiProcessor Kernel

# Workarounds for some known-to-be-broken chipsets (nVidia nForce3-Pro150)
device		atpic			# 8259A compatability

# Bus support.
device		acpi
device		isa
device		pci
device		mem
device		io

# Floppy drives
device		fdc

# ATA and ATAPI devices
device		ata
device		atadisk		# ATA disk drives
device		ataraid		# ATA RAID drives
device		atapicd		# ATAPI CDROM drives
device		atapifd		# ATAPI floppy drives
device		atapist		# ATAPI tape drives
options 	ATA_STATIC_ID	# Static device numbering

# SCSI Controllers
device		ahc		# AHA2940 and onboard AIC7xxx devices
device		ahd		# AHA39320/29320 and onboard AIC79xx devices
device		amd		# AMD 53C974 (Tekram DC-390(T))
device		isp		# Qlogic family
device		mpt		# LSI-Logic MPT-Fusion

# SCSI peripherals
device		scbus		# SCSI bus (required for SCSI)
device		ch		# SCSI media changers
device		da		# Direct Access (disks)
device		sa		# Sequential Access (tape etc)
device		cd		# CD
device		pass		# Passthrough device (direct SCSI access)
device		ses		# SCSI Environmental Services (and SAF-TE)

# RAID controllers interfaced to the SCSI subsystem
device		amr		# AMI MegaRAID
device		ciss		# Compaq Smart RAID 5*
device		dpt		# DPT Smartcache III, IV - See NOTES for options
device		hptmv		# Highpoint RocketRAID 182x
device		iir		# Intel Integrated RAID
device		ips		# IBM (Adaptec) ServeRAID
device		mly		# Mylex AcceleRAID/eXtremeRAID
device		twa		# 3ware 9000 series PATA/SATA RAID

# RAID controllers
device		aac		# Adaptec FSA RAID
device		aacp		# SCSI passthrough for aac (requires CAM)
device		ida		# Compaq Smart RAID
device		twe		# 3ware ATA RAID

# atkbdc0 controls both the keyboard and the PS/2 mouse
device		atkbdc		# AT keyboard controller
device		atkbd		# AT keyboard
device		psm		# PS/2 mouse

device		vga		# VGA video card driver

device		splash		# Splash screen and screen saver support

# syscons is the default console driver, resembling an SCO console
device		sc

device		agp		# support several AGP chipsets

# Serial (COM) ports
device		sio		# 8250, 16[45]50 based serial ports

# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to the sio and/or ppc drivers):
#device		puc

# PCI Ethernet NICs.
device		em		# Intel PRO/1000 adapter Gigabit Ethernet Card
device		ixgb		# Intel PRO/10GbE Ethernet Card
device		txp		# 3Com 3cR990 (``Typhoon'')
device		vx		# 3Com 3c590, 3c595 (``Vortex'')

# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device		miibus		# MII bus support
device		bfe		# Broadcom BCM440x 10/100 Ethernet
device		bge		# Broadcom BCM570xx Gigabit Ethernet
device		dc		# DEC/Intel 21143 and various workalikes
device		fxp		# Intel EtherExpress PRO/100B (82557, 82558)
device		lge		# Level 1 LXT1001 gigabit Ethernet
device		nge		# NatSemi DP83820 gigabit Ethernet
device		re		# RealTek 8139C+/8169/8169S/8110S
device		rl		# RealTek 8129/8139
device		sis		# Silicon Integrated Systems SiS 900/SiS 7016
device		sk		# SysKonnect SK-984x & SK-982x gigabit Ethernet
device		tx		# SMC EtherPower II (83c170 ``EPIC'')
device		xl		# 3Com 3c90x (``Boomerang'', ``Cyclone'')

# Pseudo devices.
device		loop		# Network loopback
device		random		# Entropy device
device		ether		# Ethernet support
device		tun		# Packet tunnel.
device		pty		# Pseudo-ttys (telnet etc)
device		md		# Memory "disks"
device		gif		# IPv6 and IPv4 tunneling
device		faith		# IPv6-to-IPv4 relaying (translation)

# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device		bpf		# Berkeley packet filter

# USB support
device		uhci		# UHCI PCI->USB interface
device		ohci		# OHCI PCI->USB interface
device		ehci		# EHCI PCI->USB interface (USB 2.0)
device		usb		# USB Bus (required)
#device		udbp		# USB Double Bulk Pipe devices
device		ugen		# Generic
device		uhid		# "Human Interface Devices"
device		ukbd		# Keyboard
device		ulpt		# Printer
device		umass		# Disks/Mass storage - Requires scbus and da
device		ums		# Mouse

# FireWire support
device		firewire	# FireWire bus code
device		sbp		# SCSI over FireWire (Requires scbus and da)
device		fwe		# Ethernet over FireWire (non-standard!)

options 	ALTQ
options 	ALTQ_CBQ
options 	ALTQ_HFSC
options 	ALTQ_PRIQ
options		ALTQ_NOPCC
device		pf
device		pflog
options 	BRIDGE
options 	MAC
options 	HZ=1000
options 	SC_HISTORY_SIZE=1000
options 	AUTO_EOI_1
-------------- next part --------------
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Thu Oct 12 14:14:54 CDT 2006
    root at specialized:/usr/obj/usr/src/sys/SPECIALIZED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) D CPU 2.80GHz (2800.11-MHz K8-class CPU)
  Origin = "GenuineIntel"  Id = 0xf44  Stepping = 4
  AMD Features=0x20100800<SYSCALL,NX,LM>
  Cores per package: 2
real memory  = 4563402752 (4352 MB)
avail memory = 4140404736 (3948 MB)
ACPI APIC Table: <DELL   PE850   >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
Security policy loaded: TrustedBSD MAC/BSD Extended (mac_bsdextended)
Security policy loaded: TrustedBSD MAC/Partition (mac_partition)
ioapic0: Changing APIC ID to 2
ioapic1: Changing APIC ID to 3
ioapic1: WARNING: intbase 32 != expected base 24
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 32-55 on motherboard
acpi0: <DELL PE850> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 28.4 on pci0
pci4: <ACPI PCI bus> on pcib4
bge0: <Broadcom BCM5721 Gigabit Ethernet, ASIC rev. 0x4101> mem 0xfe8f0000-0xfe8fffff irq 16 at device 0.0 on pci4
miibus0: <MII bus> on bge0
brgphy0: <BCM5750 10/100/1000baseTX PHY> on miibus0
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:15:c5:60:1b:dc
pcib5: <ACPI PCI-PCI bridge> at device 28.5 on pci0
pci5: <ACPI PCI bus> on pcib5
bge1: <Broadcom BCM5721 Gigabit Ethernet, ASIC rev. 0x4101> mem 0xfe6f0000-0xfe6fffff irq 17 at device 0.0 on pci5
miibus1: <MII bus> on bge1
brgphy1: <BCM5750 10/100/1000baseTX PHY> on miibus1
brgphy1:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:15:c5:60:1b:dd
uhci0: <UHCI (generic) USB controller> port 0xbce0-0xbcff irq 20 at device 29.0 on pci0
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0xbcc0-0xbcdf irq 21 at device 29.1 on pci0
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <UHCI (generic) USB controller> port 0xbca0-0xbcbf irq 22 at device 29.2 on pci0
usb2: <UHCI (generic) USB controller> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: <Intel 82801GB/R (ICH7) USB 2.0 controller> mem 0xfeb00400-0xfeb007ff irq 20 at device 29.7 on pci0
usb3: EHCI version 1.0
usb3: wrong number of companions (7 != 3)
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: <Intel 82801GB/R (ICH7) USB 2.0 controller> on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
pcib6: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci6: <ACPI PCI bus> on pcib6
pci6: <display, VGA> at device 5.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH7 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <Intel ICH7 SATA300 controller> port 0xbc98-0xbc9f,0xbc90-0xbc93,0xbc80-0xbc87,0xbc78-0xbc7b,0xbc60-0xbc6f mem 0xfeb00000-0xfeb003ff irq 20 at device 31.2 on pci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xec000-0xeffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
acd0: CDRW <TSSTcorpCD-RW/DVD-ROM TSL462C/DE05> at ata0-master UDMA33
ad4: 152587MB <WDC WD1600JS-75NCB2 10.02E03> at ata2-master SATA150
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad4s1a
bge0: link state changed to UP
bge1: link state changed to UP
