processes not getting fair share of available disk I/O (was: Re: TCP parameters and interpreting tcpdump output)

Thu Dec 7 17:46:10 PST 2006

> > > > > > > > 	hw.ata.wc=0
> > > > > > >         ^^^^^^^^^^^
> > > > > > > "Make my hard drive go reeeeally slow please (just in case I
> > > > > > > crash)" :)
> > > > > >
> > > > > > Slower, yes, but not *that* slow.
> > > > > >
> > > > > > Normal ls : 0.032 second.  Two processes using same disk, multiply
> > > > > > by two, so 0.064 second.  Maybe the multiplier is more than 2, call
> > > > > > it 10x, so 0.32 second.  But I'm seeing a factor of over 9100x.
> > > > >
> > > > > Humour me and turn it back on, then see what happens.
> > > >
> > > > Where is the knob to turn the write cache on/off on a per-drive basis
> > > > in FreeBSD?  I can do this in NetBSD, but the only knob I can find in
> > > > FreeBSD affects all drives, and requires a reboot.
> > >
> > > Yes, I think you need to do it globally at boot time.
> > >
> > > > Humour me and read the Subject line.  The ls does not get its fair
> > > > share of disk I/O.
> > > >
> > > > Both times are with the disk's write cache in write-through mode.
> > > > I'm not comparing times with the write cache in different modes.
> > > > I'm comparing ls by itself against ls competing with cp.
> > >
> > > Your cp is going to be running synchronously, i.e. spend a lot of time
> > > waiting on the disk to perform the writes.  This may well be the cause
> > > of your problem.  Once we have established whether or not it is the
> > > cause, we can proceed to whether this behaviour can be improved.
> >
> > I submitted PR 106340 asking for a way to control the disk write cache on
> > a per disk basis like NetBSD can.  Meanwhile, I added a PATA via USB disk,
> > which judging from the write speed, appears to be immune from
> > hw.ata.wc=0.
> >
> > So I now have a disk which has the write cache on, is connected via a
> > different controller, and thus uses a different device driver.
> >
> > I still see the same problems.  Writing to one disk *significantly* slows
> > down writing to another disk.  Even if one process is at normal default
> > priority and the other is running at rtprio 5.  Regardless of which
> > process uses the USB disk and which uses the direct-to-chipset disk.
> > Even if the rtprio 5 process only needs a very small fraction of the
> > disk bandwidth, it still gets slowed down to the point that data is lost.
> >
> > My current SWAG is that writing to a disk requires some spl/mutex/lock
> > that is global across all disks on the system.  And this spl/mutex/lock
> > is a bottleneck.
> 
> In the case of USB devices, yes - all USB accesses require Giant so
> all USB I/O is serialized.  This isn't true in general though, unless
> you have debug.mpsafevfs=0 set (or forced because of something else,
> e.g. quotas).  If this is set then all filesystem I/O is serialized
> (and maybe it's even worse, if there are also device drivers in the
> I/O path that also require Giant, like USB).

debug.mpsafevfs: 1
machine is single CPU
I'm not using quotas.

> However, I don't know what you mean by "data is lost".  Data should
> never be lost from the filesystem regardless of how slow the I/O is
> happening, unless there's something else going wrong (e.g. driver
> bug).
> 
> Also, rtprio should not be used in general - see the manpage.  Were
> you using rtprio in your original scenario?  It can easily cause
> resource starvation.

I have data arriving on Ethernet.  The data rate is 2.5 MB/s max,
but the other end only has a small buffer.  If the BSD box doesn't read
the port fast enough, the data is lost.  I have a C program (port2file)
reading from the port into a *large* circular buffer, currently 431,226,880
bytes.  This should be enough to buffer over 2 minutes of data.  It does
non-blocking 64KB writes to stdout.  Shell script calls this program and
redirects stdout to a disk file.  Very little if any other i/o to this
disk.  Even with disk cache in write-through mode, I can write at about
6-7 MB/s.  The process needs very little CPU.  Sounds like this should
be no problem.
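
As a rough check of the buffer figures above, here is a minimal sketch of the
arithmetic.  It assumes "2.5 MB/s" means decimal megabytes and that nothing is
drained from the buffer while it fills; the variable and function names are
illustrative only and are not taken from port2file.

```python
# Back-of-the-envelope check of the buffer figures quoted in the post.
BUFFER_BYTES = 431_226_880   # circular buffer size from the post
WRITE_BYTES = 64 * 1024      # size of each write to stdout
PEAK_RATE = 2.5e6            # bytes/s; ASSUMES decimal megabytes


def seconds_of_headroom(buffer_bytes, rate_bytes_per_s):
    """Time the buffer can absorb input if nothing at all is drained."""
    return buffer_bytes / rate_bytes_per_s


# Number of whole 64 KB writes that fit in the buffer.
blocks = BUFFER_BYTES // WRITE_BYTES
# Worst-case stall the buffer can ride out at the peak input rate.
headroom = seconds_of_headroom(BUFFER_BYTES, PEAK_RATE)

print("64KB blocks in buffer:", blocks)
print("seconds of headroom at peak rate:", round(headroom, 1))
```

Under these assumptions the buffer is a whole number of 64KB blocks and covers
well over two minutes of input at the maximum rate, which matches the estimate
above.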

And it seems to work okay if the system is otherwise idle.

The problem is that if some other process is writing to some other disk,
it somehow slows down writes to ALL disks.  Enough that, despite the non-blocking
writes (?), the TCP receive window shrinks and shrinks and finally is smaller
than a packet.  The src machine obediently stops sending packets, its small
buffer fills up, and data is lost.

Things I have done so far:

   BIG buffer (over 2 minutes worth).

   The port2file process cranks up the TCP receive window from 65700 to 197100.

   It also cranks up rtprio from 20 to 5.

   sysctl net.inet.tcp.delayed_ack=0
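
To put the receive window numbers in perspective, the sketch below requests a
larger receive buffer with SO_RCVBUF and estimates how long the reader can go
without draining the socket before the window is full.  It is an illustration
under stated assumptions, not the port2file code: it assumes the advertised
window is bounded by the socket receive buffer, that the sender runs at the
full 2.5 MB/s (decimal), and the helper names are hypothetical.  The kernel
may round or cap the requested size, so the granted value is only printed.

```python
import socket

DEFAULT_WINDOW = 65_700    # default receive window quoted in the post
RAISED_WINDOW = 197_100    # value port2file asks for
PEAK_RATE = 2.5e6          # bytes/s; ASSUMES decimal megabytes


def request_receive_buffer(sock, nbytes):
    """Ask the kernel for a receive buffer of nbytes; return what it reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


def stall_budget_ms(window_bytes, rate=PEAK_RATE):
    """Milliseconds of unread input at `rate` before the window is full."""
    return 1000.0 * window_bytes / rate


# The socket is never connected; it is only used to set and read the option.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    granted = request_receive_buffer(s, RAISED_WINDOW)

print("kernel reports SO_RCVBUF =", granted)
print("default window lasts about %.1f ms" % stall_budget_ms(DEFAULT_WINDOW))
print("raised window lasts about %.1f ms" % stall_budget_ms(RAISED_WINDOW))
```

If these assumptions hold, even the raised window only covers a stall of well
under a tenth of a second in the reader, so the large circular buffer helps
only as long as the process keeps getting scheduled to read the socket.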

The only process running rtprio is port2file.  All other processes are
either default priority or niced down with the classic nice(1).