vgone() calling VOP_CLOSE() -> blocked threads?

Bruce Evans brde at optusnet.com.au
Sat Mar 15 16:55:38 UTC 2008


On Sat, 15 Mar 2008, Ed Schouten wrote:

> The last couple of days I'm seeing some strange things in my mpsafetty
> branch related to terminal revocation.
>
> In my current TTY design, I hold a count (t_ldisccnt) of the amount of
> threads that are sleeping in the line discipline. I need to store such a
> count, because it's not possible to change line disciplines while some
> threads are still blocked inside the discipline. This means that when
> d_close() is called on a TTY, t_ldisccnt should always be 0. There
> cannot be any threads stuck inside the line discipline when there aren't
> any descriptors referencing it.
>
> Unfortunately, this isn't entirely true with the current VFS/devfs
> design. When vgone() is called, a VOP_CLOSE() is performed , which means
> there could be a dozen threads still stuck inside a device driver, but
> the close routine is already called to clean up stuff. There are a
> *real* lot of drivers that blindly clean up their stuff in the d_close()
> routine, expecting that the device is completely unused. This can
> easily be demonstrated by revoking a bpf device, while running tcpdump.

Yes, most drivers are broken here, but the problem is rarely noticed
because revoke() isn't normally applied to any devices except ttys.
Even ordinary close() can cause problems when a thread is sleeping
in device open, but this too is only common for ttys (for callin and
callout devices).

The tty driver is about the only driver that handles this problem
almost correctly.  It uses a generation count.  All tty drivers are
supposed to sleep using only ttysleep().  ttysleep() checks the
generation count and returns ERESTART if the generation is new.  All
tty drivers should consider this error to be fatal and propagate it
up to the syscall level where the syscall is restarted.  This tends
to happen naturally, but some places (in device close IIRC), the driver
ignores the error and does more i/o (to finish cleaning up in close
-- close and open can easily pass each other and clobber each others
state when this happens).  More I/O also tends to occur if a revoke()
happens when a thread is blocked but not sleeping.  Then ttysleep()
isn't in sight, so the thread has no idea that the generation count
changed.  Giant locking limits this problem.

> To be honest, I'm not completely sure how to solve this issue, though I
> know it should at least do something similar to this:
>
> - The device driver should have a seperate routine (d_revoke) to wake
>  up any blocked threads, to make sure they leave the device driver
>  properly.

Something is needed to signal blocked but non-sleeping threads.  I
think the wakeup for ttys now normally occurs as a side effect of
flushing i/o.  revoke() normally calls ttyclose() which calls ttyldclose()
which normally calls ttylclose() which flushes i/o which wakes up
threads waiting on the i/o.  I don't see how ttylclose() can work right
in the usual !FNONBLOCK case.  Maybe revoke() sets FNONBLOCK.  The
generation count stuff doesn't help here because the flush is done
before incrementing the generation count.

There is an obvious race here for threads doing i/o instead of waiting
for it.  These muse be blocked (by Giant now for ttys, or by your
MPSAFE locking).  They will run when revoke() releases the lock and
find the i/o flushed and maybe the generation count incrememented, but
they normally won't check these states and will just blunder on doing
more i/o.  It would be painful to check these states every time the
lock is aquired, but this seems to be necessary.  Magic Giant locking
makes the places where the lock is acquired hard see.

> - Maybe vgonel() shouldn't call VOP_CLOSE(). It should probably move the
>  vnode into deadfs, with the exception of the close() routine. Maybe
>  it's better to add a new function to do this, vrevoke().
>
> This means that when a revoke() call is performed, all blocked threads
> are woken up, will leave the driver, to find out their terminal has been
> revoked. Further system calls will fail, because the vnode is in deadfs,
> but when the processes close the descriptor, the device driver can still
> clean up everything.

I think vfs already moves the vnode to deadfs.  It doesn't do anything
to synchronize with threads running in device drivers.  The forced
last-close() should complete synchronously as part of revoke().  Then
other threads leave the device driver asynchronously, hopefully not
much later.  Then if the generation count stuff is working right, the
syscall is restarted, but now file descriptors point to deadfs so the
syscall normally fails.  I think the async completion is OK provided
it is done right (don't delay it indefinitely, and don't do more
i/o on completion).  It doesn't seem to be useful to make revoke()
wait for the completions.

I don't think it would work well to move everything except d_close to
deadfs.

Other problems near here:
- neither vfs nor drivers currently know how many threads are in a
   driver.  vfs uses vp->v_rdev->si_usecount, but this doesn't quite work
   since it doesn't count threads sleeping in open.  Maybe ones excuting
   last close too -- this would be more of a problem.  revoke() just
   uses vcount(), which just acquires the device locks and returns
   si_usecount after releasing the device lock.  (I don't understand
   this locking -- what stops the count changing after the lock is
   released, or if it cannot changed then why acquire the lock?)  This
   can result in revoke() not calling device close when it should.
   Drivers can obviously keep count of their activities using large
   code.  I can't see any way for vfs to keep count short of asking
   drivers for their counts.
- there can be any number of threads in device open and close concurrently,
   even without the complications for revoke().  The most problematic
   cases happen when last-close blocks, as is common for ttys waiting
   for output to drain (since no one cares about their output actually
   working and ensures draining it using tcdrain() -- normal losing
   programs finish up with something like printf(); exit(); and depend
   on the close() in exit(); blocking to drain the output).  Then new
   opens are allowed, and this is useful for doing ioctls() to unblocked
   blocked closes.  If the new open or fcntl sets non-blocking mode, then
   the last-close for the new open may pass the blocked last close.  If
   the new mode is blocking, then the last-close for the new open may block
   too.  The number of threads in last-close is thus unlimited.  A thundering
   herd of them tends to stomp on each other when they are all unblocked at
   the same time.

   The connections of this with revoke() are:
   - it takes vfs's not counting of all threads in the device driver to
     allow the useful behaviour of opens while a close is blocked and
     the necessary behaviour of last-close while another last-close is
     executing (drivers should be aware of this possibility and merge
     the closes, but don't).
   - I think revoke() sets FNONBLOCK somewhere.  Thus it tends to unblock
     any thread waiting in last-close for output to drain.

   Less problematic cases occur when opens block.  ttyopen() understands
   this possibility and handles it almost right using its t_wopeners
   count.  ttyopen() uses various sleeps where it should use ttysleep()
   or check the generation count itself; this results in it looping
   internally instead of restarting the syscall, which is only a small
   error since for open() alone, restarting the syscall would call back
   to the same non-dead device open except in unusual cases where there
   was a signal and syscalls are not restarted, or the device name went
   away.  There is still a problem with the vfs usage counting -- in
   one case involving callin and callout devices whose details I forget,
   last-close is not called when it needs to be called to wake up all
   the threads sleeping in open so that they can enter a new state.

Bruce


More information about the freebsd-arch mailing list