Polling tuning and performance
Alan Amesbury
amesbury at umn.edu
Thu Dec 14 14:40:10 PST 2006
This is a long one, but mainly because I've tried to include notes about
what I've already looked at. Thanks in advance for taking the time to
read this.
I have a FreeBSD 6.1-RELEASE/amd64 system which routinely needs to
accept traffic at fairly high rates. 'systat -if' suggests 428551GB
(not a typo, but possibly a display bug in 'systat') over the past 63
days, or an average rate of a bit over 600Mb/sec. A 'time tcpdump ...'
run tends to back up this figure:
amesbury at host % sudo time tcpdump -i bge1 -n -w /dev/null -c 1000000
tcpdump: WARNING: bge1: no IPv4 address assigned
tcpdump: listening on bge1, link-type EN10MB (Ethernet), capture size 96 bytes
1000000 packets captured
1000395 packets received by filter
167 packets dropped by kernel
0.268u 0.153s 0:06.84 5.9% 901+3236k 0+0io 0pf+0w
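As a rough cross-check of the figures above, the arithmetic can be sketched as follows. The numbers are copied from the systat and tcpdump output in this post; the function names are made up for illustration, decimal units (1 GB = 10**9 bytes) are an assumption, and using "packets received by filter" as the denominator for the loss figure is also an assumption.

```python
# Cross-check of the traffic figures quoted in the post.
# Assumptions: 1 GB = 10**9 bytes; loss is measured against the
# "packets received by filter" count reported by tcpdump.

def average_rate_mbps(total_gb: float, days: float) -> float:
    """Average rate in megabits/sec for total_gb transferred over days."""
    bits = total_gb * 10**9 * 8
    seconds = days * 86400
    return bits / seconds / 10**6

def packet_rate(packets: int, elapsed_seconds: float) -> float:
    """Average packets per second over a capture."""
    return packets / elapsed_seconds

def drop_fraction(dropped: int, received: int) -> float:
    """Fraction of packets the kernel dropped before tcpdump saw them."""
    return dropped / received

avg_mbps = average_rate_mbps(428551, 63)   # systat -if total over 63 days
pps = packet_rate(1_000_000, 6.84)         # tcpdump -c 1000000 took 0:06.84
loss = drop_fraction(167, 1_000_395)       # dropped / received by filter

print(f"average rate: {avg_mbps:.0f} Mb/sec")
print(f"capture rate: {pps:.0f} packets/sec")
print(f"kernel drops: {loss:.4%}")
```

Under those assumptions the long-term average comes out a little under 630Mb/sec, and the capture ran at roughly 146,000 packets/sec with a loss of under 0.02%.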
What I'm aiming for, of course, is zero packet loss. Realizing that's
probably impossible for this system given its load, I'm trying to do
what I can to minimize loss.
The system is running a somewhat leaner kernel than GENERIC. Notable
changes include:
* PREEMPTION disabled - /sys/conf/NOTES says this helps with
interactivity. I don't care about interactive performance
on this host.
* COMPAT_FREEBSD4, COMPAT_LINUX32, and COMPAT_43 are removed.
They appear to be unneeded.
* SMP is enabled, as this is a dual-core box (not HTT!).
* Many devices are removed, e.g., ncr(4), sym(4), adv(4), and
other unnecessary block devices; anything relating to cardbus;
de(4), bce(4), ti(4), wb(4), ed(4), ex(4), lnc(4), and a
number of other network devices that aren't going to ever be
used; etc.
* All wlan(4) and related drivers are gone.
* pf(4), pflog(4), and some of the ALTQ stuff have been added in,
but are not actively used on this host (at the moment).
* ZERO_COPY_SOCKETS, MAC_BSDEXTENDED, MAC_PARTITION, and MAC
are enabled.
* Most importantly, HZ=1000, and DEVICE_POLLING and
AUTO_EOI_1 are included. (AUTO_EOI_1 was added because
/sys/amd64/conf/NOTES says this can save a few microseconds
on some interrupts. I'm not worried about suspend/resume, but
definitely want speed, so it got added.)
As mentioned above, this host is running FreeBSD/amd64, so there's no
need to remove support for I586_CPU, et al; that stuff was never there
in the first place.
Since kern.polling.enable is marked as deprecated in
/sys/kern/kern_poll.c, I'm enabling polling specifically for the
interface receiving the high-volume traffic. (It is NOT enabled for the
other interface on this system, but traffic loads there are orders of
magnitude lower, so I didn't think it was necessary.)
As mentioned above, I've got HZ set to 1000. Per /sys/amd64/conf/NOTES,
I'd considered setting it to 2000, but discovered previously that
FreeBSD's RFC1323 support breaks at that setting. I documented this on
-hackers last year:
http://lists.freebsd.org/pipermail/freebsd-hackers/2005-December/014829.html
Since I've not seen word on a correction for this being added to
FreeBSD, I've limited HZ to 1000.
After reading polling(4) a couple times, I set kern.polling.burst_max to
1000. The manpage says that "each interface can receive at most (HZ *
burst_max) packets per second", and the default setting is 150, which is
described as "adequate for 100Mbit network and HZ=1000." I figured,
"Hey, gigabit, how about ten times the default?" but that's prevented by
"#define MAX_POLL_BURST_MAX 1000" in /sys/kern/kern_poll.c.
In theory that might've been good enough, but polling(4) says that
kern.polling.burst is "[the] [m]aximum number of packets grabbed from
each network interface in each timer tick. This number is dynamically
adjusted by the kernel, according to the programmed user_frac,
burst_max, CPU speed, and system load." I keep seeing
kern.polling.burst hit a thousand, which leads me to believe that
kern.polling.burst_max needs to be higher.
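The ceiling that polling(4) describes, (HZ * burst_max) packets per second, can be worked through for the settings mentioned in this post. This is only a sketch of that one formula; the function name is hypothetical, and the post does not say which burst_max value was active during the tcpdump run used for comparison.

```python
# Sketch of the polling(4) ceiling quoted in the post: HZ * burst_max.
# All figures come from the post; the function name is illustrative only.

def polling_ceiling_pps(hz: int, burst_max: int) -> int:
    """Upper bound on packets/sec per interface under polling."""
    return hz * burst_max

HZ = 1000                  # "options HZ=1000" in the kernel config
DEFAULT_BURST_MAX = 150    # default per polling(4)
MAX_POLL_BURST_MAX = 1000  # compile-time limit in /sys/kern/kern_poll.c

default_ceiling = polling_ceiling_pps(HZ, DEFAULT_BURST_MAX)
raised_ceiling = polling_ceiling_pps(HZ, MAX_POLL_BURST_MAX)

# Rate seen in the tcpdump run: 1000000 packets in 6.84 seconds.
observed_pps = 1_000_000 / 6.84

print(f"ceiling at burst_max=150:  {default_ceiling} packets/sec")
print(f"ceiling at burst_max=1000: {raised_ceiling} packets/sec")
print(f"observed in tcpdump run:   {observed_pps:.0f} packets/sec")
```

With HZ=1000 the default works out to 150,000 packets/sec and the compiled-in maximum to 1,000,000 packets/sec, while the tcpdump run averaged roughly 146,000 packets/sec. The kern.polling.burst readings that sit at 1000 correspond to the second ceiling.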
For example:
secs since epoch   kern.polling.burst
----------------   ------------------
1166133997         1000
1166134006         550
1166134015         877
1166134024         1000
1166134033         1000
1166134042         1000
1166134051         1000
1166134060         1000
1166134069         1000
1166134078         1000
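A small sketch that summarizes the samples above: how far apart they were taken and how many of them sit at the 1000 limit. The data is copied from the table; the variable names are illustrative, and treating "equal to 1000" as "at the cap" is an assumption about how to read the readings.

```python
# Summarize the kern.polling.burst samples listed in the post.
# (timestamp in seconds since epoch, kern.polling.burst reading)
samples = [
    (1166133997, 1000),
    (1166134006, 550),
    (1166134015, 877),
    (1166134024, 1000),
    (1166134033, 1000),
    (1166134042, 1000),
    (1166134051, 1000),
    (1166134060, 1000),
    (1166134069, 1000),
    (1166134078, 1000),
]

CAP = 1000  # kern.polling.burst_max as set in the post

# Spacing between consecutive samples, in seconds.
intervals = [b[0] - a[0] for a, b in zip(samples, samples[1:])]

# Readings that reached the configured cap.
at_cap = sum(1 for _, burst in samples if burst >= CAP)

print(f"sample spacing: {sorted(set(intervals))} seconds")
print(f"samples at cap: {at_cap} of {len(samples)}")
```

The readings were taken 9 seconds apart, and 8 of the 10 are at the cap.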
Unfortunately, raising the effective limit appears to be possible only
by a) patching /sys/kern/kern_poll.c to allow larger values; or b)
setting HZ to 2000,
as indicated in one of the NOTES, which will effectively hose certain
TCP connectivity because of the RFC1323 breakage. Looked at another
way, both essentially require changes to source code, the former being
fairly obvious, and the latter requiring fixes to the RFC1323 support.
Either way, I think that's a bit beyond my abilities; I have NO
illusions about my kernel h4cking sk1llz.
Other possibly relevant data points:
* System load hovers right around 1.
* The system has almost zero disk activity.
* With polling off:
- 'vmstat 5' consistently shows about 13K context switches
and ~6800 interrupts
- 'vmstat -i' shows 2K interrupts per CPU, consistently 6286
for bge1, and near zero for everything else
- CPU load drops to 0.4-0.8, but CPU idle time sits around 80%
* With polling on, kern.polling.burst_max=150:
- kern.polling.burst holds at 150
- 'vmstat 5' shows context switches hold around 2600, with
interrupts holding around 30K
- 'vmstat -i' shows bge1 interrupt rate of 6286 (but total
doesn't increase!), other rates stay the same (looks like
possible display bugs in 'vmstat -i' here!)
- CPU load holds at 1, but CPU idle time usually stays >95%
* With polling on, kern.polling.burst_max=1000:
- kern.polling.burst is frequently 1000 and almost always >850
- 'vmstat 5' shows context switches unchanged, but interrupts
are 150K-190K
- 'vmstat -i' unchanged from burst_max=150
- CPU load and CPU idle time very similar to burst_max=150
So, with all that in mind... any ideas for improvement? Apologies in
advance for missing the obvious. 'dmesg' and kernel config are attached.
--
Alan Amesbury
OIT Security and Assurance
University of Minnesota
-------------- next part --------------
machine amd64
cpu HAMMER
ident SPECIALIZED
# To statically compile in device wiring instead of /boot/device.hints
#hints "GENERIC.hints" # Default places to look for devices.
makeoptions DEBUG=-g # Build kernel with gdb(1) debug symbols
#options SCHED_ULE # ULE scheduler
options SCHED_4BSD # 4BSD scheduler
#options PREEMPTION # Enable kernel thread preemption
options INET # InterNETworking
options INET6 # IPv6 communications protocols
options FFS # Berkeley Fast Filesystem
options SOFTUPDATES # Enable FFS soft updates support
options UFS_ACL # Support for access control lists
options UFS_DIRHASH # Improve performance on big directories
options MD_ROOT # MD is a potential root device
options NFSCLIENT # Network Filesystem Client
options NFSSERVER # Network Filesystem Server
options NFS_ROOT # NFS usable as /, requires NFSCLIENT
options MSDOSFS # MSDOS Filesystem
options CD9660 # ISO 9660 Filesystem
options PROCFS # Process filesystem (requires PSEUDOFS)
options PSEUDOFS # Pseudo-filesystem framework
options GEOM_GPT # GUID Partition Tables.
options COMPAT_IA32 # Compatible with i386 binaries
options COMPAT_FREEBSD5 # Compatible with FreeBSD5
options SCSI_DELAY=5000 # Delay (in ms) before probing SCSI
options KTRACE # ktrace(1) support
options SYSVSHM # SYSV-style shared memory
options SYSVMSG # SYSV-style message queues
options SYSVSEM # SYSV-style semaphores
options _KPOSIX_PRIORITY_SCHEDULING # POSIX P1003_1B real-time extensions
options KBD_INSTALL_CDEV # install a CDEV entry in /dev
options AHC_REG_PRETTY_PRINT # Print register bitfields in debug
# output. Adds ~128k to driver.
options AHD_REG_PRETTY_PRINT # Print register bitfields in debug
# output. Adds ~215k to driver.
options ADAPTIVE_GIANT # Giant mutex is adaptive.
options SMP # Symmetric MultiProcessor Kernel
# Workarounds for some known-to-be-broken chipsets (nVidia nForce3-Pro150)
device atpic # 8259A compatability
# Bus support.
device acpi
device isa
device pci
device mem
device io
# Floppy drives
device fdc
# ATA and ATAPI devices
device ata
device atadisk # ATA disk drives
device ataraid # ATA RAID drives
device atapicd # ATAPI CDROM drives
device atapifd # ATAPI floppy drives
device atapist # ATAPI tape drives
options ATA_STATIC_ID # Static device numbering
# SCSI Controllers
device ahc # AHA2940 and onboard AIC7xxx devices
device ahd # AHA39320/29320 and onboard AIC79xx devices
device amd # AMD 53C974 (Tekram DC-390(T))
device isp # Qlogic family
device mpt # LSI-Logic MPT-Fusion
# SCSI peripherals
device scbus # SCSI bus (required for SCSI)
device ch # SCSI media changers
device da # Direct Access (disks)
device sa # Sequential Access (tape etc)
device cd # CD
device pass # Passthrough device (direct SCSI access)
device ses # SCSI Environmental Services (and SAF-TE)
# RAID controllers interfaced to the SCSI subsystem
device amr # AMI MegaRAID
device ciss # Compaq Smart RAID 5*
device dpt # DPT Smartcache III, IV - See NOTES for options
device hptmv # Highpoint RocketRAID 182x
device iir # Intel Integrated RAID
device ips # IBM (Adaptec) ServeRAID
device mly # Mylex AcceleRAID/eXtremeRAID
device twa # 3ware 9000 series PATA/SATA RAID
# RAID controllers
device aac # Adaptec FSA RAID
device aacp # SCSI passthrough for aac (requires CAM)
device ida # Compaq Smart RAID
device twe # 3ware ATA RAID
# atkbdc0 controls both the keyboard and the PS/2 mouse
device atkbdc # AT keyboard controller
device atkbd # AT keyboard
device psm # PS/2 mouse
device vga # VGA video card driver
device splash # Splash screen and screen saver support
# syscons is the default console driver, resembling an SCO console
device sc
device agp # support several AGP chipsets
# Serial (COM) ports
device sio # 8250, 16[45]50 based serial ports
# If you've got a "dumb" serial or parallel PCI card that is
# supported by the puc(4) glue driver, uncomment the following
# line to enable it (connects to the sio and/or ppc drivers):
#device puc
# PCI Ethernet NICs.
device em # Intel PRO/1000 adapter Gigabit Ethernet Card
device ixgb # Intel PRO/10GbE Ethernet Card
device txp # 3Com 3cR990 (``Typhoon'')
device vx # 3Com 3c590, 3c595 (``Vortex'')
# PCI Ethernet NICs that use the common MII bus controller code.
# NOTE: Be sure to keep the 'device miibus' line in order to use these NICs!
device miibus # MII bus support
device bfe # Broadcom BCM440x 10/100 Ethernet
device bge # Broadcom BCM570xx Gigabit Ethernet
device dc # DEC/Intel 21143 and various workalikes
device fxp # Intel EtherExpress PRO/100B (82557, 82558)
device lge # Level 1 LXT1001 gigabit Ethernet
device nge # NatSemi DP83820 gigabit Ethernet
device re # RealTek 8139C+/8169/8169S/8110S
device rl # RealTek 8129/8139
device sis # Silicon Integrated Systems SiS 900/SiS 7016
device sk # SysKonnect SK-984x & SK-982x gigabit Ethernet
device tx # SMC EtherPower II (83c170 ``EPIC'')
device xl # 3Com 3c90x (``Boomerang'', ``Cyclone'')
# Pseudo devices.
device loop # Network loopback
device random # Entropy device
device ether # Ethernet support
device tun # Packet tunnel.
device pty # Pseudo-ttys (telnet etc)
device md # Memory "disks"
device gif # IPv6 and IPv4 tunneling
device faith # IPv6-to-IPv4 relaying (translation)
# The `bpf' device enables the Berkeley Packet Filter.
# Be aware of the administrative consequences of enabling this!
# Note that 'bpf' is required for DHCP.
device bpf # Berkeley packet filter
# USB support
device uhci # UHCI PCI->USB interface
device ohci # OHCI PCI->USB interface
device ehci # EHCI PCI->USB interface (USB 2.0)
device usb # USB Bus (required)
#device udbp # USB Double Bulk Pipe devices
device ugen # Generic
device uhid # "Human Interface Devices"
device ukbd # Keyboard
device ulpt # Printer
device umass # Disks/Mass storage - Requires scbus and da
device ums # Mouse
# FireWire support
device firewire # FireWire bus code
device sbp # SCSI over FireWire (Requires scbus and da)
device fwe # Ethernet over FireWire (non-standard!)
options ALTQ
options ALTQ_CBQ
options ALTQ_HFSC
options ALTQ_PRIQ
options ALTQ_NOPCC
device pf
device pflog
options BRIDGE
options ZERO_COPY_SOCKETS
options MAC
options MAC_BSDEXTENDED
options MAC_PARTITION
options HZ=1000
options SC_HISTORY_SIZE=1000
options SC_KERNEL_CONS_ATTR=(FG_YELLOW|BG_BLACK)
options SC_KERNEL_CONS_REV_ATTR=(FG_BLACK|BG_RED)
options DEVICE_POLLING
options AUTO_EOI_1
options INCLUDE_CONFIG_FILE
-------------- next part --------------
Copyright (c) 1992-2006 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD 6.1-RELEASE-p10 #1: Thu Oct 12 14:14:54 CDT 2006
root at specialized:/usr/obj/usr/src/sys/SPECIALIZED
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Pentium(R) D CPU 2.80GHz (2800.11-MHz K8-class CPU)
Origin = "GenuineIntel" Id = 0xf44 Stepping = 4
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x641d<SSE3,RSVD2,MON,DS_CPL,CNTX-ID,CX16,<b14>>
AMD Features=0x20100800<SYSCALL,NX,LM>
Cores per package: 2
real memory = 4563402752 (4352 MB)
avail memory = 4140404736 (3948 MB)
ACPI APIC Table: <DELL PE850 >
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 0
cpu1 (AP): APIC ID: 1
Security policy loaded: TrustedBSD MAC/BSD Extended (mac_bsdextended)
Security policy loaded: TrustedBSD MAC/Partition (mac_partition)
ioapic0: Changing APIC ID to 2
ioapic1: Changing APIC ID to 3
ioapic1: WARNING: intbase 32 != expected base 24
ioapic0 <Version 2.0> irqs 0-23 on motherboard
ioapic1 <Version 2.0> irqs 32-55 on motherboard
acpi0: <DELL PE850> on motherboard
acpi0: Power Button (fixed)
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
cpu0: <ACPI CPU> on acpi0
cpu1: <ACPI CPU> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 1.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> at device 28.0 on pci0
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <ACPI PCI-PCI bridge> at device 28.4 on pci0
pci4: <ACPI PCI bus> on pcib4
bge0: <Broadcom BCM5721 Gigabit Ethernet, ASIC rev. 0x4101> mem 0xfe8f0000-0xfe8fffff irq 16 at device 0.0 on pci4
miibus0: <MII bus> on bge0
brgphy0: <BCM5750 10/100/1000baseTX PHY> on miibus0
brgphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge0: Ethernet address: 00:15:c5:60:1b:dc
pcib5: <ACPI PCI-PCI bridge> at device 28.5 on pci0
pci5: <ACPI PCI bus> on pcib5
bge1: <Broadcom BCM5721 Gigabit Ethernet, ASIC rev. 0x4101> mem 0xfe6f0000-0xfe6fffff irq 17 at device 0.0 on pci5
miibus1: <MII bus> on bge1
brgphy1: <BCM5750 10/100/1000baseTX PHY> on miibus1
brgphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto
bge1: Ethernet address: 00:15:c5:60:1b:dd
uhci0: <UHCI (generic) USB controller> port 0xbce0-0xbcff irq 20 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
usb0: <UHCI (generic) USB controller> on uhci0
usb0: USB revision 1.0
uhub0: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 2 ports with 2 removable, self powered
uhci1: <UHCI (generic) USB controller> port 0xbcc0-0xbcdf irq 21 at device 29.1 on pci0
uhci1: [GIANT-LOCKED]
usb1: <UHCI (generic) USB controller> on uhci1
usb1: USB revision 1.0
uhub1: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub1: 2 ports with 2 removable, self powered
uhci2: <UHCI (generic) USB controller> port 0xbca0-0xbcbf irq 22 at device 29.2 on pci0
uhci2: [GIANT-LOCKED]
usb2: <UHCI (generic) USB controller> on uhci2
usb2: USB revision 1.0
uhub2: Intel UHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub2: 2 ports with 2 removable, self powered
ehci0: <Intel 82801GB/R (ICH7) USB 2.0 controller> mem 0xfeb00400-0xfeb007ff irq 20 at device 29.7 on pci0
ehci0: [GIANT-LOCKED]
usb3: EHCI version 1.0
usb3: wrong number of companions (7 != 3)
usb3: companion controllers, 2 ports each: usb0 usb1 usb2
usb3: <Intel 82801GB/R (ICH7) USB 2.0 controller> on ehci0
usb3: USB revision 2.0
uhub3: Intel EHCI root hub, class 9/0, rev 2.00/1.00, addr 1
uhub3: 6 ports with 6 removable, self powered
pcib6: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci6: <ACPI PCI bus> on pcib6
pci6: <display, VGA> at device 5.0 (no driver attached)
isab0: <PCI-ISA bridge> at device 31.0 on pci0
isa0: <ISA bus> on isab0
atapci0: <Intel ICH7 UDMA100 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xfc00-0xfc0f at device 31.1 on pci0
ata0: <ATA channel 0> on atapci0
ata1: <ATA channel 1> on atapci0
atapci1: <Intel ICH7 SATA300 controller> port 0xbc98-0xbc9f,0xbc90-0xbc93,0xbc80-0xbc87,0xbc78-0xbc7b,0xbc60-0xbc6f mem 0xfeb00000-0xfeb003ff irq 20 at device 31.2 on pci0
ata2: <ATA channel 0> on atapci1
ata3: <ATA channel 1> on atapci1
pci0: <serial bus, SMBus> at device 31.3 (no driver attached)
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
sio0: <16550A-compatible COM port> port 0x3f8-0x3ff irq 4 flags 0x10 on acpi0
sio0: type 16550A, console
fdc0: <floppy drive controller> port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
orm0: <ISA Option ROMs> at iomem 0xc0000-0xc7fff,0xec000-0xeffff on isa0
atkbdc0: <Keyboard controller (i8042)> at port 0x60,0x64 on isa0
sc0: <System console> at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x100>
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: <Generic ISA VGA> at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
acd0: CDRW <TSSTcorpCD-RW/DVD-ROM TSL462C/DE05> at ata0-master UDMA33
ad4: 152587MB <WDC WD1600JS-75NCB2 10.02E03> at ata2-master SATA150
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/ad4s1a
bge0: link state changed to UP
bge1: link state changed to UP