cvs commit: src/sys/dev/io iodev.c

Bruce Evans brde at optusnet.com.au
Tue Aug 12 14:07:48 UTC 2008


On Tue, 12 Aug 2008, Bruce Evans wrote:

> On Sat, 9 Aug 2008, John Baldwin wrote:

>> You failed to note that not using D_TRACKCLOSE doesn't actually reliably
>> close devices.
>
> No, I noted that vfs doesn't count last closes right.  Unreliability of
> last-close is a special case of that.  But using D_TRACKCLOSE mostly makes
> things worse.  To avoid the vfs bugs in tracked closes, it is necessary
> for individual drivers to understand vfs's bugs better than vfs does.
>
>> I set D_TRACKCLOSE on bpf because under load at work (lots of
>> concurrent nmap's) bpf devices would sometimes never get fully closed, so
>> you'd have an unopened bpf device that was no longer available.
>
> It is unclear how that helps.  vfs's bugs must be really large for the
> last-close to go completely missing.  kib fixed some of them a few months
> ago (before you changed bpf I think).  bpf is especially simple since it
> is exclusive-access, so the vfs bugs should be smaller for it.
>
>> D_TRACKCLOSE
>> gives much saner semantics for devices, each open() of an fd calls
>> the 'd_open' method, and the last close of that fd (which could be
>> in another process) calls 'd_close'.
>
> This is insane semantics for most devices, and D_TRACKCLOSE doesn't give it.
> Most devices don't care about the difference between file descriptors and
> files -- they want d_close to be called only when the last file descriptor
> on the device is closed.  !D_TRACKCLOSE gives this, modulo the vfs bugs.
>
> D_TRACKCLOSE doesn't give these semantics: revoke() closes everything
> in 1 step and then calls d_close(), giving an N-1 correspondence between
> d_open()s and d_close()s (modulo the vfs bugs giving a fuzzier
> correspondence).  (IIRC, at least before kib's fixes, it was possible
> for revoke() to cause any of 0, 1 or 2 calls to d_close().  I think 1
> call always gets as far as devfs_close() but this is only guaranteed
> to reach d_close() with D_TRACKCLOSE.)  This is not a bug, but all
> drivers that depend on the correspondence being 1-1 are broken.  Drivers
> that enforce "exclusive" access like bpf have N=1, so they probably
> work modulo the vfs bugs, and they have a good chance of not being
> affected by the vfs bug that gives an extra devfs_close() and thus an
> extra d_close().
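
To make that concrete, here is a minimal sketch (invented x_* names, not
code from the tree) of the least a D_TRACKCLOSE d_close() must do so
that a possible extra call from the revoke() path degrades to a no-op
instead of a second teardown:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

struct x_softc;				/* hypothetical per-device state */

static struct mtx x_mtx;		/* protects dev->si_drv1; set up at attach */

static int
x_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{
	struct x_softc *sc;

	mtx_lock(&x_mtx);
	sc = dev->si_drv1;
	dev->si_drv1 = NULL;		/* any later d_close() call sees NULL */
	mtx_unlock(&x_mtx);
	if (sc == NULL)
		return (0);		/* teardown already done exactly once */
	/* ... the one and only teardown and free of sc ... */
	return (0);
}

This only deals with the extra close; readers still using sc when the
teardown runs are a separate problem (see the sketches further down).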

I checked that bpf panics (even under UP) due to the obvious bugs in
its d_close():

     # Generate lots of network activity using something like:
     sysctl net.inet.icmp.icmplim=0; ping -fq localhost &

     # Race to panic eventually:
     while :; do tcpdump -i lo0 & sleep 0.001; revoke /dev/bpf0; done

Most or all device drivers have obvious bugs in their d_close(); bpf
is just a bit easier to understand and more likely to cause a panic
than most device drivers, since it is simple and frees resources.  A
panic is very likely when si_drv1 is freed, and si_drv1 is only locked
accidentally.

% static	int
% bpfclose(struct cdev *dev, int flags, int fmt, struct thread *td)
% {
% 	struct bpf_d *d = dev->si_drv1;
% 
% 	BPFD_LOCK(d);
% 	if (d->bd_state == BPF_WAITING)
% 		callout_stop(&d->bd_callout);
% 	d->bd_state = BPF_IDLE;
% 	BPFD_UNLOCK(d);
% 	funsetown(&d->bd_sigio);
% 	mtx_lock(&bpf_mtx);
% 	if (d->bd_bif)
% 		bpf_detachd(d);
% 	mtx_unlock(&bpf_mtx);
% 	selwakeuppri(&d->bd_sel, PRINET);
% #ifdef MAC
% 	mac_bpfdesc_destroy(d);
% #endif /* MAC */
% 	knlist_destroy(&d->bd_sel.si_note);
% 	bpf_freed(d);
% 	dev->si_drv1 = NULL;
% 	free(d, M_BPF);

Lots of possibly-in-use data structures are destroyed here.

% 
% 	return (0);
% }
% 
% static	int
% bpfread(struct cdev *dev, struct uio *uio, int ioflag)
% {
% 	struct bpf_d *d = dev->si_drv1;

There is no locking for si_drv1 here and elsewhere.  There used to be
locking -- long ago, there was implicit locking for non-SMP non-preemptive
kernels.  Then there was Giant locking.

d can be either NULL or garbage here.  (It is null if revoke() just
cleared si_drv1 and there is an implicit or accidental memory barrier
that makes the change visible here.  It is garbage if revoke() has
started destroying it but hasn't yet set si_drv1 to NULL when we read
si_drv1, or if revoke() starts destroying it after we read si_drv1.)
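
For comparison, here is roughly the bookkeeping (a sketch only, with
invented names and malloc type) that a driver needs before it may trust
si_drv1 at all: the pointer and a reference count live under one lock,
and whoever drops the last reference after the close frees the softc.
d_open() is assumed to create sc with sc_refs == 1 and store it in
si_drv1; d_close() clears si_drv1, sets sc_dying and calls x_release()
to drop the open's reference.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/kernel.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/malloc.h>

static MALLOC_DEFINE(M_XDEV, "xdev", "hypothetical softc");

static struct mtx x_mtx;		/* protects si_drv1 and sc_refs */

struct x_softc {
	int	sc_refs;		/* open + in-flight I/O references */
	int	sc_dying;		/* set once d_close() has run */
	/* ... the real per-device state ... */
};

static struct x_softc *
x_acquire(struct cdev *dev)
{
	struct x_softc *sc;

	mtx_lock(&x_mtx);
	sc = dev->si_drv1;
	if (sc != NULL)
		sc->sc_refs++;		/* pin sc against a racing close */
	mtx_unlock(&x_mtx);
	return (sc);			/* NULL: lost the race, return ENXIO */
}

static void
x_release(struct x_softc *sc)
{
	int last;

	mtx_lock(&x_mtx);
	last = (--sc->sc_refs == 0 && sc->sc_dying);
	mtx_unlock(&x_mtx);
	if (last)
		free(sc, M_XDEV);	/* no lookup can reach sc any more */
}

bpfread() does the lookup with no lock and no reference, so nothing
stops bpfclose() from freeing d underneath it.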

% 	int timed_out;
% 	int error;
% 
% 	/*
% 	 * Restrict application to use a buffer the same size as
% 	 * as kernel buffers.
% 	 */
% 	if (uio->uio_resid != d->bd_bufsize)
% 		return (EINVAL);

If we're lucky after losing the race, then d is null and we get a null
pointer panic here.  I didn't observe this.

% 
% 	BPFD_LOCK(d);

If we're unlucky after losing the race, then we start corrupting *d here.

% 		error = msleep(d, &d->bd_mtx, PRINET|PCATCH,
% 		     "bpf", d->bd_rtout);

Since we block, we implement panic(9) even with implicit or explicit
Giant locking.  The sleep gives a very large race window...

% 		if (error == EINTR || error == ERESTART) {
% 			BPFD_UNLOCK(d);
% 			return (error);
% 		}

...however, we now have locking for d, and bpfclose() blocks on this,
so I think we don't implement panic(9) here.  We have a broken revoke()
instead -- it should do a forced close without waiting, but instead it
waits for the unlock in the above or later.
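
Breaking readers out of the sleep instead of waiting for them is not
hard in principle.  A sketch (invented names again), assuming the softc
is kept alive by something like the reference counting sketched above:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/proc.h>

struct x_softc {
	struct mtx	sc_mtx;		/* protects the fields below */
	int		sc_hlen;	/* bytes ready in the hold buffer */
	int		sc_rtout;	/* read timeout in ticks */
	int		sc_detached;	/* set by the close/revoke path */
};

/* Reader side, called with sc->sc_mtx held. */
static int
x_wait_for_data(struct x_softc *sc)
{
	int error;

	error = 0;
	while (sc->sc_hlen == 0 && !sc->sc_detached) {
		error = msleep(sc, &sc->sc_mtx, PZERO | PCATCH, "xread",
		    sc->sc_rtout);
		if (error != 0)
			break;		/* signal or timeout */
	}
	if (sc->sc_detached)
		error = ENXIO;		/* the device went away under us */
	return (error);
}

/* Close/revoke side: flag the softc and kick every sleeper instead of
 * queueing up behind them on sc_mtx. */
static void
x_detach_readers(struct x_softc *sc)
{

	mtx_lock(&sc->sc_mtx);
	sc->sc_detached = 1;
	wakeup(sc);
	mtx_unlock(&sc->sc_mtx);
	/* sc must still not be freed until the last reader drops its reference. */
}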

% 		if (error == EWOULDBLOCK) {
% 			/*
% 			 * On a timeout, return what's in the buffer,
% 			 * which may be nothing.  If there is something
% 			 * in the store buffer, we can rotate the buffers.
% 			 */
% 			if (d->bd_hbuf)
% 				/*
% 				 * We filled up the buffer in between
% 				 * getting the timeout and arriving
% 				 * here, so we don't need to rotate.
% 				 */
% 				break;
% 
% 			if (d->bd_slen == 0) {
% 				BPFD_UNLOCK(d);
% 				return (0);
% 			}
% 			ROTATE_BUFFERS(d);
% 			break;
% 		}
% 	}
% 	/*
% 	 * At this point, we know we have something in the hold slot.
% 	 */
% 	BPFD_UNLOCK(d);

If bpfclose() was blocked on d, then our d is especially likely to become
corrupt just after here.

% 
% 	/*
% 	 * Move data from hold buffer into user space.
% 	 * We know the entire buffer is transferred since
% 	 * we checked above that the read buffer is bpf_bufsize bytes.
% 	 *
% 	 * XXXRW: More synchronization needed here: what if a second thread
% 	 * issues a read on the same fd at the same time?  Don't want this
% 	 * getting invalidated.
% 	 */
% 	error = bpf_uiomove(d, d->bd_hbuf, d->bd_hlen, uio);
% 
% 	BPFD_LOCK(d);
% 	d->bd_fbuf = d->bd_hbuf;
% 	d->bd_hbuf = NULL;
% 	d->bd_hlen = 0;
% 	bpf_buf_reclaimed(d);
% 	BPFD_UNLOCK(d);
% 
% 	return (error);
% }
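
Finally, the cleanest escape from the si_drv1 lifetime problem is not
to keep per-open state there at all.  With the devfs cdevpriv KPI
(devfs_set_cdevpriv()/devfs_get_cdevpriv() -- I'm assuming a tree that
has it), the destructor rather than d_close() frees the state, and as
far as I can tell it cannot run while a read through the same open file
is still in progress.  A sketch, again with invented names:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/proc.h>
#include <sys/uio.h>

static MALLOC_DEFINE(M_XDEV, "xdev", "hypothetical per-open state");

struct x_softc {
	int	sc_state;		/* ... per-open state ... */
};

static void
x_dtor(void *arg)
{

	/* Runs once, when the last reference to this open goes away. */
	free(arg, M_XDEV);
}

static int
x_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
	struct x_softc *sc;
	int error;

	sc = malloc(sizeof(*sc), M_XDEV, M_WAITOK | M_ZERO);
	error = devfs_set_cdevpriv(sc, x_dtor);
	if (error != 0)
		free(sc, M_XDEV);
	return (error);
}

static int
x_read(struct cdev *dev, struct uio *uio, int ioflag)
{
	struct x_softc *sc;
	int error;

	error = devfs_get_cdevpriv((void **)&sc);
	if (error != 0)
		return (error);		/* never a dangling si_drv1 */
	/* ... use sc; it stays valid for the duration of this call ... */
	return (0);
}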

Bruce

