kern/178997: Heavy disk I/O may hang system

Bruce Evans brde at optusnet.com.au
Mon May 27 05:58:12 UTC 2013


On Sun, 26 May 2013, Klaus Weber wrote:

>> Environment:
> FreeBSD filepile 9.1-STABLE FreeBSD 9.1-STABLE #5 r250475: Sun May 12 19:14:21 CEST 2013     root at filepile:/usr/obj/usr/src/sys/FILEPILE  amd64
> (Kernel config has some drivers removed and debugging options added. Tested with GENERIC as well)
>> Description:
> Heavy disk I/O (two bonnie++ processes working on the same disk simultaneously) causes an extreme degradation in disk throughput (combined throughput as observed in iostat drops to ~1-3 MB/sec). The problem shows when both bonnie++ processes are in the "Rewriting..." phase.

Please use the unix newline character in mail.

A huge drop in disk throughput is normal with multiple processes,
unfortunately.  Perhaps not quite this much.

> Sometimes, the system "hangs" (throughput reaches 0 MB/sec); in that case, the system does not recover from this state. "Hang" means that the system responds to ping etc., processes that are already running will continue, but every command that requires disk accesses (even on disks other than the disk being tested) just hangs. If shutdown can be started without disk access (i.e. still in cache), the system will begin the shutdown, but fails to sync all buffers.

Hangs are bugs though.

I have been arguing with kib@ that some methods of handling heavy disk
i/o are nonsense, since they either make the problem worse (by switching
to direct unclustered writes, so that slow i/o goes even slower) or have
no effect except to complicate the analysis of the problem (because they
are old hackish methods, and newer, better methods make the cases handled
by the old methods unreachable).  But the analysis is complicated, and
we couldn't agree on anything.

One thing I noticed more recently than my last mail to kib about this
is that even a single heavy disk write (say dd bs=256m count=2
</dev/zero >foo) can block another non-write-bound process (say iostat
1 or systat -v 1 to watch the progress) from starting up (sometimes
even when it has just been run to load it into cache(s)) until the dd
completes 10 seconds later.  This was on a 2-core system running a
fairly old version of FreeBSD (8.0-CURRENT from ~2008).  I found that
the problem could be fixed by killing cluster_write() by turning it into
bdwrite() (by editing the running kernel using ddb, since this is easier
than rebuilding the kernel).  I was trying many similar things since I
had a theory that cluster_write() is useless.  Actually, cluster_write()
gives some minor optimizations, but this misbehaviour indicates that it
has a resource starvation bug.  My theory for what the bug is is that
cluster_write() and cluster_read() share the limited resource of pbufs.
pbufs are not managed as carefully as normal buffers.  In particular,
there is nothing to limit write pressure from pbufs like there is for
normal buffers.  However, this behaviour doesn't happen on the much older
version of FreeBSD that I normally use (5.2-CURRENT from 2004 with many
local modifications).  It starts iostat to watch dd instantly, even running
on a single core system.
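
Something like the following reproduces that observation (a sketch
only; "foo" is just a scratch file on the file system being tested, and
the exact timing depends on the system):

    # start a single large sequential write in the background
    dd bs=256m count=2 </dev/zero >foo &
    # while it runs, time how long a normally instant, non-write-bound
    # command takes to start; on the affected system it may produce no
    # output until the dd completes ~10 seconds later
    time iostat
    wait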

> gpart create -s gpt da0
> gpart add -s 500G -t freebsd-ufs -l test0 -a 1M da0
> newfs -b 64k -f 8k /dev/da0p1
> mount /dev/da0p1 /mnt && mkdir /mnt/t{1,2} && chmod g+w /mnt/t{1,2}

The default for newfs is -b 32k.  This asks for buffer cache fragmentation.
Someone increased the default from 16k to 32k without changing the buffer
cache's preferred size (BKVASIZE = 16K).  BKVASIZE has always been too
small, but on 32-bit arches kernel virtual memory is too limited to have
a larger BKVASIZE by default.  BKVASIZE is still 16K on all arches
although this problem doesn't affect 64-bit arches.

-b 64k is worse.
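
For comparison, something like this keeps the block size equal to the
buffer cache's preferred size (a sketch only; da0p1 is the partition
from your setup above, and 16k/2k is just the block/fragment pair that
matches BKVASIZE):

    # block size equal to BKVASIZE (16K), fragments at the usual 8:1 ratio
    newfs -b 16k -f 2k /dev/da0p1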

> When both bonnie++ processes are in their "Rewriting" phase, the system hangs within a few seconds. Both bonnie++ processes are in state "nbufkv". bufdaemon takes about 40% CPU time and is in state "qsleep" when not active.

You got the buffer cache fragmentation that you asked for.

Buffer cache fragmentation caused serious problems 10-20 years ago, but the
ones causing hangs were supposed to be fixed about 12 years ago and the ones
causing extreme slowness for just the software parts of handling
fragmentation were fixed a little later.  But apparently there are still some.

However, I have never been able to reproduce serious fragmentation problems
from using too-large-block sizes, or demonstrate significant improvements
from avoiding the known fragmentation problem by increasing BKVASIZE.
Perhaps my systems are too small, or have tuning or local changes that
accidentally avoid the problem.

Apparently you found a way to reproduce the serious fragmentation
problems.  Try using a block size that doesn't ask for the problem.
Increasing BKVASIZE would take more work than this, since although it
was intended to be a system parameter which could be changed to reduce
the fragmentation problem, one of the many bugs in it is that it was
never converted into a "new" kernel option.  Another of the bugs in
it is that doubling it halves the number of buffers, so doubling it
does more than use twice as much kva.  This severely limited the number
of buffers back when memory sizes were 64MB.  It is not a very
significant limitation if the memory size is 1GB or larger.

> I have also briefly tested a single disk instead of a RAID volume. In that case, I could not get the system to hang, but performance drops from ~150 MB/sec to ~1-3 MB/sec. I do not know if that indicates a problem with the arcmsr driver, or whether I cannot create "enough" I/O load with a single disk:

I get ~39MB/sec with 1 "dd bs=128m count=2 </dev/zero >foo" writing to
a nearly full old ffs file system on an old PATA disk, and 2 times
20MB/sec with 2 dd's.  This is almost as good as possible.  The
configuration was somewhat flawed:
- the writes all fitted in memory because my disks are too full to hold
   more.  This shouldn't be a problem for write benchmarks, because write
   pressure is limited (too much) by writing out dirty buffers (too) early,
   so writes are mostly physical.
- the fs block size was 16K.  This is not a flaw, just the right size, but
   other sizes should be tested.
- the file system was mounted async.  This should have little effect,
   since for large writes all data is written async anyway (the write
   pressure handling causes async writes instead of delayed writes even
   for not-so-large writes).
- ffs was modified to remove its bogus write pressure handling.  This
   should have no effect, but I didn't verify it.
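
The test itself was essentially just the following (a sketch; the file
names and sizes are only illustrative, and "iostat 1" in another
terminal shows the throughput):

    # one writer
    dd bs=128m count=2 </dev/zero >foo1
    # then two concurrent writers on the same file system
    dd bs=128m count=2 </dev/zero >foo1 &
    dd bs=128m count=2 </dev/zero >foo2 &
    wait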

> I have, however, logged the sysctl vfs subtree every second while the problem occurs:
> As long as both bonnie++ are in "Writing intelligently..." phase, vfs.numdirtybuffers stays mostly between 80-100. When the first bonnie++ goes to "Rewriting..." (at 3:40 below), vfs.numdirtybuffers starts to increase (about 150-350 new dirty buffers per second). When the second bonnie++ goes to "Rewriting..." as well (at 4:24), this speeds up a lot (about 3000-5000 new dirty buffers/sec). A few seconds later, vfs.numdirtybuffers reaches its limit, and remains there until the machine is rebooted:
>
> 00-03-28.log:vfs.numdirtybuffers: 107
> 00-03-29.log:vfs.numdirtybuffers: 106
> 00-03-30.log:vfs.numdirtybuffers: 105
> ...
> [first bonnie++ starts Rewriting]
> 00-03-41.log:vfs.numdirtybuffers: 294
> 00-03-42.log:vfs.numdirtybuffers: 404
> 00-03-43.log:vfs.numdirtybuffers: 959
> 00-03-44.log:vfs.numdirtybuffers: 1071

There are various write pressure handling mechanisms that have different
thresholds and get in each other's way a bit.

The dirty buffer watermarks are presumably much higher than 105 or even
1071.  Other mechanisms apparently work well and limit the number of
dirty buffers to only 107 with 1 writer.  Writes to memory can go much
faster than writes to the disk even when the latter go at full speed, but
some mechanism blocks writers (too) early so the 1 writer soon reaches a
steady state with not many dirty buffers.
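
The actual watermarks and the buildup are easy to watch directly while
the test runs (a sketch; these are the same sysctls discussed elsewhere
in this mail):

    while :; do
            date
            sysctl vfs.numdirtybuffers vfs.lodirtybuffers \
                vfs.hidirtybuffers vfs.dirtybufthresh
            sleep 1
    done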

> 00-03-45.log:vfs.numdirtybuffers: 1155
> 00-03-46.log:vfs.numdirtybuffers: 1253
> 00-03-47.log:vfs.numdirtybuffers: 1366
> 00-03-48.log:vfs.numdirtybuffers: 1457
> ...
> 00-04-21.log:vfs.numdirtybuffers: 8156
> 00-04-22.log:vfs.numdirtybuffers: 8362
> 00-04-23.log:vfs.numdirtybuffers: 8558
> [second bonnie goes Rewriting as well]
> 00-04-24.log:vfs.numdirtybuffers: 11586
> 00-04-25.log:vfs.numdirtybuffers: 16325
> 00-04-26.log:vfs.numdirtybuffers: 24333
> ...
> 00-04-54.log:vfs.numdirtybuffers: 52096
> 00-04-57.log:vfs.numdirtybuffers: 52098
> 00-05-00.log:vfs.numdirtybuffers: 52096
> [ etc. ]

This is a rather large buildup and may indicate a problem.  Try reducing
the dirty buffer watermarks.  Their default values are mostly historical
nonsense.  They are just some fixed fraction (about 1/4) of the number
of buffers.  With multiple GB of memory, the number of buffers may be
huge.  On a system with 24GB, vfs.nbuf is 157800, and vfs.dirtybufthresh
is 35523.  But the size of main memory has nothing to do with the
correct limit on the number of dirty buffers (unless the latter is not
being used to limit write pressure, but only to limit starvation of
buffers for readers.  This is perhaps all it is supposed to do, but see
below).  We want buildups of dirty buffers to be limited so that the
system never gets too far behind on physical writes.  The correct limit
depends on the physical write capabilities of the machine and on which
write channels have pressure on them.  Default global limits don't work
at all for this.  You can fix the defaults but you can't distribute the
limits properly across channels, or easily keep the limits up to date
as channels are added.  It is easy to add slow USB channels.  These
won't increase the write capability much, but it is very easy for a
slow disk (or the channel to it) to grab many dirty buffers and keep them for
a long time so that faster disks are starved.
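
A sketch of "reduce the dirty buffer watermarks" (assuming these
sysctls are still writable at run time on your kernel; the values are
only guesses and should be tuned against what your disks can actually
write):

    # cap the dirty-buffer buildup far below the ~52000 limit in your log
    sysctl vfs.lodirtybuffers=1000
    sysctl vfs.hidirtybuffers=2000
    sysctl vfs.dirtybufthresh=1800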

Other mechanisms handle write pressure better than the simple dirty
buffer count one can (others count the amount of i/o, which is more
relevant to how long it will take).  So assume that any effect of
the dirty buffer count mechanism on write pressure is unintentional.
There are still the following problems with it:
- as above, slow channels can starve fast channels, thus giving a
   severe side effect on the write pressure (at best the users of the
   fast channels block so that they are slowed down to the speed of
   the slow channels)
- buffer starvation for readers can happen anyway.  vfs.nbuf is the
   number of buffers that are allocated, but not all of these can be
   used when some file systems use buffer sizes larger than about
   BKVASIZE.  The worst case is when all file systems use a block size of
   64KB.  Then only about 1/4 of the allocated buffers can be used.
   The number that can be used is limited by vfs.maxbufspace and
   related variables.  The default for vfs.maxbufspace is controlled
   indirectly by BKVASIZE (e.g., maxbufspace = 2GB goes with nbuf =
   128K; the ratio is 16K = BKVASIZE).

So when you ask for buffer cache fragmentation with -b 64k, you get
even more.  The effective nbuf is reduced by a factor of 4.  This
means that all the dirty buffer count watermarks except one are
slightly above the effective nbuf, so the higher watermarks
cannot quite be reached, and when they are nearly reached, readers
are starved of buffers.
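
A quick way to see the reduction (the sysctl names are the ones used
above; the example numbers assume maxbufspace of about 2GB, as above):

    sysctl vfs.nbuf vfs.maxbufspace
    # effective number of usable buffers with 64K blocks:
    echo $((2 * 1024 * 1024 * 1024 / 65536))    # ~32768, about nbuf/4 for nbuf = 128K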

> vfs.hidirtybuffers: 52139
> vfs.lodirtybuffers: 26069
> (the machine has 32 GB RAM)

Fairly consistent with my examples of a machine with 24GB RAM.
The upper watermark is 33% higher, etc.  Your nbuf will be more like
200K than 150K.  You reached numdirtybuffers = 52096.  At 64K each,
that's about 3GB.  maxbufspace will be about the same, and bufspace
(the amount used) will be not much smaller.  There will be little
space for buffers for readers, and the 2 writers apparently manage
to starve even each other, perhaps for similar reasons.  The backlog
of 3GB would take 20 seconds to clear even at 150MB/sec, and at 1MB/sec
it would take almost an hour, so it is not much better than a hang.
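
The arithmetic behind those estimates (assuming 64K per dirty buffer
and the peak numdirtybuffers from your log):

    echo $((52096 * 65536 / 1024 / 1024 / 1024))    # ~3GB of dirty data
    echo $((52096 * 65536 / (150 * 1000000)))       # ~22 seconds at 150MB/sec
    echo $((52096 * 65536 / 1000000 / 60))          # ~56 minutes at 1MB/sec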

> I have found http://freebsd.1045724.n5.nabble.com/7-1-RELEASE-I-O-hang-td3966945.html , which seems to describe the same issue (also in combination with a RAID controller, albeit a different one).

I think the hardware is usually not the problem (at least after kib's fix
for the above), except that when it is large it can stress the software more.

Bruce

