dangerous situation with shutdown process

Don Lewis truckman at FreeBSD.org
Wed Jul 20 02:42:04 GMT 2005


On 18 Jul, Matthias Buelow wrote:
> Paul Mather <paul at gromit.dlib.vt.edu> writes:
> 
>>Why would that necessarily be more successful?  If the outstanding
>>buffers count is not reducing between time intervals, it is most likely
>>because there is some underlying hardware problem (e.g., a bad block).
>>If the count still persists in staying put, it likely means whatever the
>>hardware is doing to try and fix things (e.g., write reallocation) isn't
>>working, and so the kernel may as well give up.
> 
> So the kernel is relying on guesswork whether the buffers are flushed
> or not...
> 
>>You can enumerate the buffers and *try* to write them, but that doesn't
>>guarantee they will be written successfully any more than observing the
>>relative number left outstanding.
> 
> That's rather nonsensical. If I write each buffer synchronously (and
> wait for the disk's response) this is for sure a lot more reliable than
> observing changes in the number of remaining buffers. I mean, where's
> the sense in the latter? It would be analogous to, in userspace, having
> to monitor write(2) continuously over a given time interval and check
> whether the number it returns eventually reaches zero. That's complete
> madness, imho.

During syncer shutdown, the numbers being printed are actually the
number of vnodes that have dirty buffers.  The syncer walks the list of
vnodes with dirty buffers and synchronously flushes each one to disk
(modulo whatever write-caching is done by the drive).
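
To make the shape of that concrete, here is a rough sketch of a single
pass (just an illustration with made-up names like flush_vnode_sync();
it is not the actual syncer code):

/*
 * Illustrative sketch only, not the real syncer code.  The struct and
 * flush_vnode_sync() are invented stand-ins for the kernel's vnode and
 * its synchronous flush routine.
 */
struct vnode {
    int           v_dirty;    /* vnode still has dirty buffers */
    struct vnode *v_next;     /* next vnode on the worklist */
};

/*
 * Stand-in for the synchronous flush; assume it blocks until the write
 * completes (modulo whatever write caching the drive itself does).
 */
static int
flush_vnode_sync(struct vnode *vp)
{
    (void)vp;
    return (0);
}

/* One pass over the worklist: flush every vnode that has dirty buffers. */
static void
syncer_pass(struct vnode *worklist)
{
    struct vnode *vp;

    for (vp = worklist; vp != NULL; vp = vp->v_next)
        if (vp->v_dirty && flush_vnode_sync(vp) == 0)
            vp->v_dirty = 0;
}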

The reason that it monitors the number of dirty vnodes instead of just
iterating once over the list is that with softupdates, flushing one
vnode to disk can cause another vnode to be dirtied and put on the list,
so it can take multiple passes to flush all the dirty vnodes.  It's
normal to see this if the machine was at least moderately busy before
being shut down.  The number of dirty vnodes will start off high,
decrease rapidly at first, and then taper off to zero.  It is not
unusual to see the number bounce from zero back into the low single
digits a few times before stabilizing at zero and triggering the syncer
termination code.
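
In other words, the termination logic looks roughly like the sketch
below (a toy model, not the real code): keep counting the dirty vnodes
each interval and only finish once the count has held at zero, because
a pass can put new vnodes on the list.

#include <stdio.h>
#include <unistd.h>

/*
 * Toy model of the shutdown loop.  In the kernel, count_dirty_vnodes()
 * and drain_pass() would inspect and flush the real syncer worklist;
 * here they just simulate a count that drains and briefly bounces back.
 */
static int remaining = 1000;
static int bounced;

static int
count_dirty_vnodes(void)
{
    return (remaining);
}

static void
drain_pass(void)
{
    remaining /= 2;            /* most of the list drains quickly */
    if (remaining == 0 && !bounced) {
        remaining = 3;         /* flushing dirtied a few more vnodes */
        bounced = 1;
    }
}

int
main(void)
{
    int n, zero_intervals = 0;

    while (zero_intervals < 2) {        /* require the count to hold at zero */
        n = count_dirty_vnodes();
        printf("%d ", n);
        fflush(stdout);
        if (n == 0) {
            zero_intervals++;
        } else {
            zero_intervals = 0;
            drain_pass();               /* may dirty more vnodes */
        }
        sleep(1);                       /* delay between intervals */
    }
    printf("done\n");
    return (0);
}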

The syncer shutdown algorithm could definitely be improved to speed it
up.  I didn't want it to push out too many vnodes at the start of the
shutdown sequence, but later in the sequence the delay intervals could
be shortened and more worklist buckets could be visited per interval to
speed the shutdown.  One possible complication that I worry about is
that the new vnodes being added to the list might not be added
synchronously, so if the syncer processes the worklist and shuts down
too quickly it might miss vnodes that got added too late.
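
Hand-waving a bit, the sort of schedule I have in mind would look
something like this (the thresholds, delays and bucket counts are
invented for illustration):

/*
 * Invented numbers, just to illustrate the idea: start gently so we
 * don't push out too many vnodes at once, then shorten the delay and
 * sweep more worklist buckets per interval as the count shrinks.
 */
#define N_BUCKETS 32              /* worklist buckets (invented value) */

struct shutdown_sched {
    int delay_ms;                 /* pause between intervals */
    int buckets_per_interval;     /* how many buckets to sweep */
};

static struct shutdown_sched
shutdown_schedule(int vnodes_remaining)
{
    struct shutdown_sched s;

    if (vnodes_remaining > 1000) {
        s.delay_ms = 1000;
        s.buckets_per_interval = 1;
    } else if (vnodes_remaining > 100) {
        s.delay_ms = 100;
        s.buckets_per_interval = 4;
    } else {
        s.delay_ms = 10;
        s.buckets_per_interval = N_BUCKETS;    /* sweep everything */
    }
    return (s);
}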

I've never seen a syncer shutdown timeout, though it could happen if
either the underlying media became unwriteable or if a process got
wedged while holding a vnode lock.  In either case, it might never be
possible to flush the dirty vnodes in question.

The final sync code in boot() just iterated over the dirty buffers, but
it was not unusual for it to get stuck on mutually dependent buffers. I
would see this quite frequently if I did a shutdown immediately after
running mergemaster.  The final sync code would flush all but the last
few buffers and finally time out.  This problem was my motivation for
adding the shutdown code to the syncer so that the final sync code would
hopefully not have anything to do.
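
For comparison, the final sync in boot() is basically a bounded retry
loop over the dirty buffer count; something like this sketch (made-up
names and a simulated stuck pair of buffers, not the real boot() code):

#include <stdio.h>

/*
 * Toy model: two mutually dependent buffers that can never be written,
 * so the count never reaches zero and the loop eventually gives up.
 * In the kernel these helpers would talk to the real buffer cache.
 */
static int stuck = 2;

static int
count_dirty_buffers(void)
{
    return (stuck);
}

static void
push_dirty_buffers(void)
{
    /* Each buffer depends on the other being written first, so the
       writes never complete and the count stays where it is. */
}

int
main(void)
{
    int iter, nbusy = 0;

    for (iter = 0; iter < 20; iter++) {
        nbusy = count_dirty_buffers();
        if (nbusy == 0) {
            printf("done\n");
            return (0);
        }
        printf("%d ", nbusy);
        fflush(stdout);
        push_dirty_buffers();
    }
    printf("giving up on %d buffers\n", nbusy);
    return (0);
}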

The final sync code also gets confused if you have any ext2 file systems
mounted (even read-only) and times out while waiting for the ext2 file
system to release its private buffers (which only happens when the file
system is unmounted).




