swap_pager complaints but not using swap

Sun Jan 25 21:03:09 PST 2009

>>>> AMD64  FreeBSD 7.0  2 GiB main memory
>>>>
>>>> My console says:
>>>>
>>>> login: swap_pager: indefinite wait buffer: bufobj: 0, blkno: 22, size: 4096
>>>> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 22, size: 4096
>>>> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 22, size: 4096
>>>> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 22, size: 4096
>>>>
>>>> pstat -sk
>>>> Device          1K-blocks     Used    Avail Capacity
>>>> /dev/ad6s10       4590208       96  4590112     0%
>>>>
>>>> Wow, using a whole 96K of swap.  I don't see any disk related
>>>> complaints in dmesg.
>>>>
>>>> Is this something to worry about?
>>> Yes, the system was *trying* to do swap I/O and timing out while doing so.
>>>
>>> Kris
>> 
>> Whoops, I forgot to change the subject line after adding the k option
>> to pstat.  Without the k it said 0 used.  And this morning it occurs to
>> me that even if swap used was zero, it could have been trying to *start*
>> using swap.
>> 
>> Anyway... given this timeout explanation, I'm guessing that page/swap
>> has to compete with user processes for disk i/o, and thus probably
>> suffers from the same lack of fair i/o scheduling that user processes
>> suffer from.  E.g. one process doing disk i/o can lock out another
>> process for at least several minutes, probably indefinitely.  :-(
> 
> There is a timeout of (from memory) 60 seconds.  I've not seen this 
> timeout exceeded on properly functioning disk hardware (even heavily 
> loaded), only on broken hardware/controllers, or on I/O devices that are 
> intrinsically slow for some reason (USB stick, or swapping to a file).
> 
> Unless you're doing something truly unspeakable to that disk's load, I'd 
> look at the hardware.

zcat /ad8/7.1-RELEASE-amd64-dvd1.iso.gz > /ad6/7.1-RELEASE-amd64-dvd1.iso

I'll spare you the real paths. :-)  The target was slice 2, which is
very near the beginning of the disk, while swap is in slice 10 at the
very end of the same disk.  These disks are both Seagate 7200 SATA
connected to nforce4-ultra.

I just ran the same command again, and the CPU is 89-96% idle.  So it is
I/O bound writing to the disk, as expected.

The machine was rebooted Tuesday afternoon (I had been testing a firewire
patch for Sean).  Friday morning I copied the 7.1 ISO to the machine
and was verifying checksums.  After I noticed the swap_pager complaints
on the console I checked and it was only using 96 KiB of swap.

Two days later (Sunday morning) swap usage has grown to 500 KiB:

pstat -sk
Device          1K-blocks     Used    Avail Capacity
/dev/ad6s10       4590208      500  4589708     0%

So the machine doesn't normally use swap much at all, but messing with
the large ISO apparently kicked something out of memory, and the disk
with the swap partition was already busy writing at the other end of
the disk.
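
To put the two pstat readings side by side, here is a minimal Python sketch
that parses the `pstat -sk` output quoted above and works out the growth.
The function name parse_pstat is made up for illustration, and it assumes
the five-column layout shown in this message.

```python
def parse_pstat(text):
    """Parse `pstat -sk` style output into a list of dicts (sizes in KiB)."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        # Skip the header and anything that is not a 5-column device row.
        if len(fields) != 5 or fields[0] == "Device":
            continue
        device, total, used, avail, _capacity = fields
        rows.append({
            "device": device,
            "total": int(total),
            "used": int(used),
            "avail": int(avail),
        })
    return rows


# The two readings from this message (Friday and Sunday).
friday = parse_pstat("""
Device          1K-blocks     Used    Avail Capacity
/dev/ad6s10       4590208       96  4590112     0%
""")[0]
sunday = parse_pstat("""
Device          1K-blocks     Used    Avail Capacity
/dev/ad6s10       4590208      500  4589708     0%
""")[0]

growth = sunday["used"] - friday["used"]
pct = 100.0 * sunday["used"] / sunday["total"]
print(f"swap grew by {growth} KiB; {pct:.4f}% of the partition in use")
```

Either way, the amount of swap in use is a tiny fraction of the partition.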

Do you consider writing a large file to disk a "truly unspeakable" load?

Scott Bennett writes:

>> This machine has 2 GiB of main memory and almost never uses the swap
>> partition, so I put swap at the slow end of the drive.  Yes I have
>> swap on slice 10.  I use NetBSD's fdisk, as it handles more than
>> 4 slices nicely, unlike FreeBSD's fdisk.  As far as I know, the BIOS
>
>     So NetBSD's fdisk understands logical partitions in an extended
> partition?  Cool.  I wish we had it in FreeBSD.  It's really a pain to
> have to shut FreeBSD down and boot a standalone program to change the layout
> of a disk that has an EP. :-(  At least the FreeBSD kernel has no problem
> understanding a disk like that.

I haven't ported NetBSD's fdisk to FreeBSD, I just boot NetBSD, fdisk
the new disk, and boot back to FreeBSD.  I also use NetBSD's MBR,
which has a nice boot menu.  (Well, as nice as it can be with only
512 bytes to work with.)  It would be nice to have NetBSD's fdisk
ported, as FreeBSD's fdisk can't even read the logical/extended
partitions.

>> message I suspect that the pager/swapper is competing for disk i/o.
>> I forgot to ask if there is some sysctl or other knob to lengthen
>> the timeout.  The real fix is to improve the i/o fairness, but I've
>> been asking about this for 2-3 years and not getting anywhere.
>>
>     BSD UNIX introduced the disksort() routine into its kernel ages ago.
> I know it was in 4.2BSD, but it may well have been there long before then.
> disksort() was added to satisfy a maximum number of disk I/O requests with
> a minimum of head movement and delay.  Basically, it sorts new requests
> into queues for each drive such that the arm moves from request to request
> in one direction through the disk, and then the next queue started is sorted
> into the opposite sequence for the arm to move in the opposite direction.
> The result is that the arm moves back and forth from the start to the end
> of the disk and then back again, reading and writing as it goes, thus
> minimizing the distance traveled for each request handled.  In FreeBSD,
> I think there is also some sort of change to the algorithm that tends to
> subprioritize or subdivide requests according to the amount of data to be
> read/written in each request, but I don't know any of its details.  In
> general, disksort() gives pretty good performance.
>     I doubt that the current algorithm is the source of your problems, but
> if it is, then perhaps moving swap to sit between the two most active file
> systems on that drive could help.  You may wish to look carefully at the
> disk I/O system in FreeBSD to see whether your idea of "fairness" could be
> implemented without running afoul of the existing code structure and also
> to get an idea as to whether what you want done would really be likely to
> yield any performance improvement.
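
For anyone who hasn't seen the idea, the back-and-forth sweep described above
can be sketched in a few lines of Python.  This is only an illustration of
the two-way elevator as the quoted text describes it, not the FreeBSD kernel
code; the function names and the sample block numbers are made up.

```python
def elevator_order(pending, head, ascending=True):
    """Order pending block numbers for a two-way elevator sweep.

    Requests in the current direction of travel are served first, in
    order of position; the rest are served on the way back.
    """
    here = [b for b in pending if b == head]
    up = sorted(b for b in pending if b > head)
    down = sorted((b for b in pending if b < head), reverse=True)
    return here + (up + down if ascending else down + up)


def total_seek(order, head):
    """Sum of head movement (in blocks) needed to serve `order`."""
    distance = 0
    for block in order:
        distance += abs(block - head)
        head = block
    return distance


pending = [98, 183, 37, 122, 14, 124, 65, 67]   # arrival order
head = 53                                       # current arm position

swept = elevator_order(pending, head)
print("arrival order seek:", total_seek(pending, head))
print("elevator order:    ", swept)
print("elevator seek:     ", total_seek(swept, head))
```

In this sketch the sweep cuts head movement from 640 blocks to 299, but a
request at the far end from the direction of travel (block 14 here) is served
last.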

I don't think the elevator algorithm is the problem.  I think it has something
to do with the disk buffer cache, which has changed a lot since I took that
internals class back in the dark ages.  Something about the buffer cache and
VM being unified now.  I haven't had a chance to study the new way.  :-(  Anyway,
the problem seems to be related to reading or writing large files and being
i/o bound.  My theory is that the i/o bound process keeps the buffer cache
full, and then other processes don't get their i/o queued, so they block.
Often I see one disk i/o bound while a different process blocks waiting
for i/o on an idle disk.  Occasionally I try to come up with a demo for this
two-disk scenario, but so far no luck, so there may be something else going
on.  I did find a good demo using one disk:

http://lists.freebsd.org/pipermail/freebsd-performance/2008-July/003533.html

I haven't received any replies to this.  It should be trivially easy for
someone to try it and see if they get similar results or not.  Just find
or create a file larger than main memory and run the test.
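
Separately from the linked test (which is not reproduced here), one rough way
to watch for the blocking I'm describing is to time small synchronous writes
while the big copy runs.  The Python below is only a sketch under that
assumption; probe_write_latency and its parameters are hypothetical names,
not an existing tool.

```python
import os
import tempfile
import time


def probe_write_latency(directory, samples=5, interval=0.0):
    """Time small write+fsync operations in `directory`.

    Returns a list of elapsed times in seconds, one per sample.
    """
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, b"x" * 4096)
            os.fsync(fd)  # wait for the disk so caching cannot hide the delay
            elapsed = time.perf_counter() - start
        finally:
            os.close(fd)
            os.unlink(path)
        latencies.append(elapsed)
        time.sleep(interval)
    return latencies


if __name__ == "__main__":
    # Point this at a directory on the disk you want to probe.
    lat = probe_write_latency(tempfile.gettempdir(), samples=5, interval=0.2)
    print("max %.3f s, mean %.3f s" % (max(lat), sum(lat) / len(lat)))
```

Run it once on a quiet system for a baseline, then again while the large
file is being written, and compare the maximum latency.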