processes not getting fair share of available disk I/O (was: Re: TCP parameters and interpreting tcpdump output)

Mon Dec 11 10:57:09 PST 2006

> Did this problem start before you made port2file run with rtprio?

Yes.  I only added rtprio because it wasn't working.

> Can you please include a copy of your kernel configuration file and dmesg?

I think you asked that before:  :-)

>   > OK, that's correct.  Can you also provide details of your disk
>   > hardware (e.g. dmesg) and kernel configuration?
>   
>   FreeBSD 6.0
>   
>   Kernel is stock except for addition of:
>   
>           device                atapicam        # needed to burn dvd
>   
>   /boot/loader.conf:
>   
>           console="comconsole"
>           hw.ata.wc=0
>           hw.ata.atapi_dma="1"
>           kern.ipc.nmbclusters="256000"
>   
>   Mainboard: Tyan Tomcat k8e 2865
>   
>   CPU: AMD64 3000+
>   
>   Chipset: Nvidia nforce4 ultra
>   
>   Memory: 2 GB DDR400 ECC
>   
>   Disks:  4x Seagate 7200 rpm SATA
>           1x Seagate 7200 rpm PATA
>           1x LG CD/DVD
>   
>   atapci0: <nVidia nForce4 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xe000-0xe00f at device 6.0 on pci0
>   ata0: <ATA channel 0> on atapci0
>   ata1: <ATA channel 1> on atapci0
>   atapci1: <nVidia nForce4 SATA150 controller> port 0x9f0-0x9f7,0xbf0-0xbf3,0x970-0x977,0xb70-0xb73,0xcc00-0xcc0f mem 0xfebfb000-0xfebfbfff irq 10 at device 7.0 on pci0
>   ata2: <ATA channel 0> on atapci1
>   ata3: <ATA channel 1> on atapci1
>   atapci2: <nVidia nForce4 SATA150 controller> port 0x9e0-0x9e7,0xbe0-0xbe3,0x960-0x967,0xb60-0xb63,0xb800-0xb80f mem 0xfebfa000-0xfebfafff irq 11 at device 8.0 on pci0
>   ata4: <ATA channel 0> on atapci2
>   ata5: <ATA channel 1> on atapci2
>   acd0: DVDR <HL-DT-ST DVDRAM GSA-4160B/A301> at ata0-master UDMA66
>   ad2: 305245MB <Seagate ST3320620A 3.AAC> at ata1-master UDMA100
>   ad4: 238475MB <Seagate ST3250823AS 3.03> at ata2-master SATA150
>   ad6: 238475MB <Seagate ST3250823AS 3.03> at ata3-master SATA150
>   ad8: 238475MB <Seagate ST3250823AS 3.03> at ata4-master SATA150
>   ad10: 305245MB <Seagate ST3320620AS 3.AAC> at ata5-master SATA150
>   cd0 at ata0 bus 0 target 0 lun 0

Since then I added another Seagate 7200 rpm PATA, connected via a PATA-to-USB
bridge.  The idea was to get a different controller path to a disk, although I
think all I/O has to go through the nforce one way or another.

This USB disk writes at about 15 MB/s instead of 6-7 MB/s, but otherwise it and
the other disks interfere with each other the same as two disks connected
directly to the nforce.
Perhaps a clue in there somewhere?

umass0: Prolific Technology Inc. ATAPI-6 Bridge Controller, rev 2.00/0.01, addr 2
da0 at umass-sim0 bus 0 target 0 lun 0
da0: <ST332062 0A 3.AA> Fixed Direct Access SCSI-0 device
da0: 40.000MB/s transfers
da0: 305245MB (625142449 512 byte sectors: 255H 63S/T 38913C)

The Ethernet is on the mainboard:

pcib5: <ACPI PCI-PCI bridge> at device 13.0 on pci0
pci5: <ACPI PCI bus> on pcib5
bge0: <Broadcom BCM5721 Gigabit Ethernet, ASIC rev. 0x4101> mem 0xfe4f0000-0xfe4fffff irq 11 at device 0.0 on pci5
miibus1: <MII bus> on bge0
brgphy0: <BCM5750 10/100/1000baseTX PHY> on miibus1
brgphy0:  10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, 1000baseTX, 1000baseTX-FDX, auto

The only stuff that says Giant or GIANT-LOCKED is:

	atkbd0  only used with firmware
	usb     the new disk, otherwise not used
	nve     not in use
	fwe     not in use

Is Giant the only mutex/lock that could be a bottleneck across disks?
I can't figure out anything else that would create a common bottleneck
across drives.

The nforce can read from all four SATA drives at once as fast as the
disks can go, 65-70 MB/s per drive at the fast end of the platter.  I
assume that the nforce doesn't care about read vs write, and is not
the bottleneck.

The filesystem has to allocate blocks and such, but that shouldn't be
common across drives.

The problem shows up without the CPU being maxed out, assuming you believe
the numbers from systat -vmstat or top.

Memory buffer cache?  However they do that these days...

I was thinking maybe part of port2file's circular buffer was getting paged
out, so I added mlock(2) of the buffer.  Still fails.  :-(

Writing to disk doesn't seem to hurt the Ethernet.  If I direct the
output of port2file to /dev/null it works fine.

I don't suppose you happen to know how to enable NCQ on the SATA drives?

I did some experiments with rtprio and dd.  rtprio reduces
the effect of other disk activity somewhat, but not enough.
I noticed that the transfer rates as reported by systat -vmstat varied
more than I would expect.  First one disk would be faster for a few seconds,
then the other.  Sometimes they would be about equal.  The sum of the
two drives looked to be approximately constant.  The sum was only slightly higher
than a single drive by itself.

It certainly smells like there is *some* single resource for writing
that all the disks have to share.