RELENG_4 -> 5 -> 6: significant performance regression

Dmitry Pryanishnikov dmitry at atlantis.dp.ua
Thu Apr 27 14:08:23 UTC 2006


Hello!

  I've done a simple (yet, I hope, reality-reflecting) performance benchmark
of the different STABLE branches (4 vs 5 vs 6) using the following hardware:

CPU: Pentium II/Pentium II Xeon/Celeron (334.09-MHz 686-class CPU)
   Origin = "GenuineIntel"  Id = 0x665  Stepping = 5
   Features=0x183f9ff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR>
real memory  = 134152192 (127 MB)
...
rl0: <RealTek 8139 10/100BaseTX> port 0xe800-0xe8ff mem 0xdc101000-0xdc1010ff
  irq 5 at device 20.0 on pci0
...
fxp0: <Intel 82559 Pro/100 Ethernet> port 0xe400-0xe43f mem
  0xdc100000-0xdc100fff,0xdc000000-0xdc0fffff irq 7 at device 19.0 on pci0
...
ad0: 76351MB <SAMSUNG SP0802N TK100-24> at ata0-master UDMA33

and by simply restoring a precompiled 4-, 5- or 6-STABLE system to the same
HDD partition each time. I've used the following kernel config for 4-STABLE:

ident           TEST
machine		i386
maxusers	32
makeoptions	CONF_CFLAGS=-fno-builtin
makeoptions	DEBUG=-g
options 	INCLUDE_CONFIG_FILE
cpu		I686_CPU
options 	COMPAT_43
options 	USER_LDT
options 	SYSVSHM
options 	SYSVSEM
options 	SYSVMSG
options 	INVARIANTS
options 	INVARIANT_SUPPORT
options 	USERCONFIG
options 	INET
options 	FAST_IPSEC
options 	IPSEC_FILTERGIF
pseudo-device	ether
pseudo-device	vlan	1
pseudo-device	loop
pseudo-device	bpf
pseudo-device	ppp	8
options 	PPP_BSDCOMP
options 	PPP_DEFLATE
options 	PPP_FILTER
options 	IPFIREWALL
options 	IPFW2
options 	IPFIREWALL_VERBOSE
options 	IPFIREWALL_VERBOSE_LIMIT=100
options 	IPFIREWALL_FORWARD
options 	IPDIVERT
options 	IPSTEALTH
options 	ICMP_BANDLIM
options 	DUMMYNET
options 	FFS
options 	FFS_ROOT
options 	SOFTUPDATES
options 	QUOTA
options 	P1003_1B
options 	_KPOSIX_PRIORITY_SCHEDULING
options 	_KPOSIX_VERSION=199309L
pseudo-device	pty
pseudo-device	crypto
device		isa
device		atkbdc0	at isa? port IO_KBD
device		atkbd0	at atkbdc? irq 1
device		psm0	at atkbdc? irq 12
device		vga0	at isa?
pseudo-device	splash
device		sc0	at isa?
options 	SC_HISTORY_SIZE=1000
options 	SC_TWOBUTTON_MOUSE
device		npx0	at nexus? port IO_NPX flags 0x0 irq 13
device		ata
device		atadisk
options 	ATA_STATIC_ID
device		fdc0	at isa? port IO_FD1 irq 6 drq 2
device		fd0	at fdc0 drive 0
device		fd1	at fdc0 drive 1
device          sio0    at isa? port IO_COM1 irq 4
device          sio1    at isa? port IO_COM2 irq 3
device		pci

and slightly modified it for 5/6-STABLE. Here is the diff ("<" = 4-only
option, ">" = 5/6-only option):

> options 	SCHED_4BSD

< options 	USER_LDT
< options 	USERCONFIG

< pseudo-device	ether
< pseudo-device	vlan	1
< pseudo-device	loop
< pseudo-device	bpf
< pseudo-device	ppp	8
> device	ether
> device	loop
> device	bpf

< options 	IPFW2
> options 	IPFIREWALL_FORWARD_EXTENDED

< options 	ICMP_BANDLIM
< options 	FFS_ROOT
< options 	P1003_1B
< options 	_KPOSIX_VERSION=199309L

< pseudo-device	pty
< pseudo-device	crypto
> device	pty
> device		crypto

< device		atkbdc0	at isa? port IO_KBD
< device		atkbd0	at atkbdc? irq 1
< device		psm0	at atkbdc? irq 12
< device		vga0	at isa?
< pseudo-device	splash
< device		sc0	at isa?
---
> device		atkbdc
> device		atkbd
> device		psm
> options 	KBD_INSTALL_CDEV
> device		vga
> device		splash
> device		sc

< device		npx0	at nexus? port IO_NPX flags 0x0 irq 13
> device		npx

< device		fdc0	at isa? port IO_FD1 irq 6 drq 2
< device		fd0	at fdc0 drive 0
< device		fd1	at fdc0 drive 1
< device          sio0    at isa? port IO_COM1 irq 4
< device          sio1    at isa? port IO_COM2 irq 3

Also, I've set kern.hz="100" in /boot/loader.conf on every system.
I've effectively taken ipfw out of the picture by using a single
'add 1 pass all from any to any' rule. I hope I've compared apples with
apples this way.

   For every x-STABLE, I've downloaded a large ISO image via FTP in binary mode
twice: once using the rl NIC and once using the fxp one, both in 10baseT mode
(approx. 1 Mbyte/s transfer rate). I've noted the CPU utilization reported by
"systat -vm 1" once the numbers had stabilized. Here are the results (average
numbers; %User and %Nice are close to zero):

                   %Sys   %Intr   %Idl

RELENG_4 + rl0      14      14     72
RELENG_4 + fxp0     14      10     76

RELENG_5 + rl0      40      30     30
RELENG_5 + fxp0     35      25     40

RELENG_6 + rl0      45      40     15
RELENG_6 + fxp0     45      35     20
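
To make the comparison easier to read, the table can be reduced to the
non-idle CPU share (100 - %Idl) relative to RELENG_4 on the same NIC. This is
only a minimal sketch: the numbers are the ones in the table above, while the
names `results` and `busy` and the dictionary layout are illustrative
assumptions, not part of the original measurements.

```python
# CPU utilization from the table above: (branch, NIC) -> (%Sys, %Intr, %Idl).
# Only the numbers come from the post; the layout is an illustrative choice.
results = {
    ("RELENG_4", "rl0"):  (14, 14, 72),
    ("RELENG_4", "fxp0"): (14, 10, 76),
    ("RELENG_5", "rl0"):  (40, 30, 30),
    ("RELENG_5", "fxp0"): (35, 25, 40),
    ("RELENG_6", "rl0"):  (45, 40, 15),
    ("RELENG_6", "fxp0"): (45, 35, 20),
}


def busy(branch, nic):
    """Non-idle CPU share in percent (%User and %Nice are close to zero)."""
    _sys_pct, _intr_pct, idle_pct = results[(branch, nic)]
    return 100 - idle_pct


for branch, nic in results:
    baseline = busy("RELENG_4", nic)  # same NIC under RELENG_4
    ratio = busy(branch, nic) / baseline
    print(f"{branch} + {nic}: busy {busy(branch, nic)}%, {ratio:.1f}x RELENG_4")
```

With these figures the same FTP transfer costs roughly 2.5 times the CPU under
RELENG_5 and about 3 times under RELENG_6.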

I've tried to verify these numbers by running 'md5 -t' in parallel with the
download and measuring its wall-clock time: "time md5 -t". Indeed, under
RELENG_4 this benchmark took 43 sec of wall-clock time vs 2:01 under RELENG_5
and 2:05 under RELENG_6 (I don't understand why the difference between 5 and 6
is so small here).
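
The wall-clock times can be put on the same footing as a slowdown factor
relative to RELENG_4. Again a minimal sketch: the three times are the ones
quoted above, and the helper `to_seconds` and the dictionaries `wall` and
`slowdown` are hypothetical names introduced only for this illustration.

```python
def to_seconds(text):
    """Convert 'SS' or 'M:SS' (as quoted in the post) to whole seconds."""
    parts = [int(p) for p in text.split(":")]
    if len(parts) == 1:
        return parts[0]
    minutes, seconds = parts
    return minutes * 60 + seconds


# Wall-clock time of "time md5 -t" while the FTP download was running.
wall = {"RELENG_4": "43", "RELENG_5": "2:01", "RELENG_6": "2:05"}

baseline = to_seconds(wall["RELENG_4"])
slowdown = {branch: to_seconds(t) / baseline for branch, t in wall.items()}

for branch, factor in slowdown.items():
    print(f"{branch}: {to_seconds(wall[branch])} s, {factor:.2f}x RELENG_4")
```

This gives a slowdown of about 2.8x for RELENG_5 and 2.9x for RELENG_6, which
is in line with the CPU utilization figures.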

  I would call these numbers discouraging. Such high CPU usage while simply
writing _only_ 10 Mbit/s of traffic to the HDD will surely prevent deployment
of 6-STABLE on many not-very-powerful production servers. Am I missing
something simple regarding compile-time or runtime optimization?

Sincerely, Dmitry
-- 
Atlantis ISP, System Administrator
e-mail:  dmitry at atlantis.dp.ua
nic-hdl: LYNX-RIPE

