protection against module unloading ?

Mon Jul 13 15:29:22 UTC 2015

On Mon, Jul 13, 2015 at 05:00:30PM +0200, Luigi Rizzo wrote:
> On Mon, Jul 13, 2015 at 03:46:03PM +0300, Konstantin Belousov wrote:
> > On Mon, Jul 13, 2015 at 02:28:40PM +0200, Luigi Rizzo wrote:
> > > Hi,
> > > I am trying to understand how to protect efficiently against
> > > module removals when a device driver is in use.
> > > This issue came up some time ago when trying netmap
> > > loaded as a module.
> > > 
> > > --- Summary of the current mode of operation (as i understand it): ---
> > > 
> > > When a device is compiled as a module, all devfs calls in
> > > sys/fs/devfs/devfs_vnops.c are protected by setting
> > > dev->si_threadcount (within dev*_refthread()) for the duration of the call.
> > > 
> > > This is implemented using devfs_fp_check()/dev_relthread() which
> > > in turn, for the module case, updates the dev->si_threadcount under
> > > dev_lock() which is a single shared mutex, devmtx .
> > > 
> > > The above is problematic for driver modules that do a lot of I/O
> > > using poll, ioctl, read or write, because of the system-wide contention
> > > on devmtx
> > > 
> > > To mitigate the contention, statically compiled modules have the
> > > SI_ETERNAL attribute that prevents the lock (major contention point)
> > > and si_threadcount update.
> > > 
> > > --- Alternative ?
> > > 
> > > I was hoping to make the protection mechanism cheaper by only
> > > increasing  dev->si_threadcount on devfs_open() without calling
> > > dev_relthread() ), and then decreasing si_threadcount on defvs_close() .
> > > This way the reference is active for the lifetime of the file descriptor
> > > without needing to track individual high-frequency calls.
> > > 
> > > This is probably not enough because there could be mmap handlers,
> > > which remain active even after the device is closed.
> > This is of course wrong because destroy_dev(9) would be blocked until
> > all file descriptors referencing the device are closed.  The blocked
> > thread which called destroy_dev() is blocked hard, i.e. you cannot
> > interrupt it with a signal.  The situation is incomprehensible for
> > the user issued the kldunload command.
> > 
> > The current si_threadcount mechanism is designed and intended to work on
> > the code call granulariry, to allow the device destruction and driver
> > unload when there is no code call into the driver.
> 
> thanks a lot for the clarification on the intent.
> I clearly need to understand more on the architecture of the module unload.
> 
> In any case: the global contention on devmtx for I/O syscalls is
> really a showstopper for making effective use of modular drivers
> so we should really find a way to remove it.
What contention do you see ?  Is it on the device node for a single
device ?  If yes, then any modification of the below proposal would
not help.  I explained this below.

Also note that modules have more issues than lack of the ETERNAL cdevs.

E.g., on everything non-amd64 modules use PIC code. This means that any
access to the global variables is indirected through GOT, and that on
i386 one register is not available for computations from the already
scarce set of seven.

Modules do not use inlined fast-path for the locking primitives, as
well as they do not use inlined atomics.

Each loadable syscall requires an atomic on entry and atomic on exit.

There are probably more issues, this list is from the top of my
memory. The common wisdom is that you should not use modules if you do
benchmarking.

> Is there any other way to protect access to dev->si_threadcount ?
> 
> Eg how about the following:
> - use a (leaf) lock into struct cdev to protect dev->si_threadcount, so that
>   we could replace dev_lock() with mtx_lock(&dev->foo) in dev_refthread(dev)
>   dev_relthread(dev) and other places that access si_threadcount
This would not work, you cannot protect a lifetime of the object by a lock
contained in the object.

A modification that might work is to have the substitution lock for dev_mtx
embedded into cdevsw for the given device.  This is not completely trivial,
since there are some data in devfs and special node management that is
global.  Another complication is that drivers would need to call some
destructor for cdevsw, which is not required for current arrangement.

Still, even per-cdevsw dev_mtx would not help for the non-ETERNAL
contention on single devfs node.

> 
> - devvn_refthread() further uses vp->v_rdev, which elsewhere is protected by
>   VI_LOCK(vp), so devvn_refthread() could use the sequence
> 	VI_LOCK(vp);
> 	dev = vp->v_rdev;
> 	if (dev != NULL) {
> 		mtx_lock(&dev->foo)
> 		if ( ... ) {
> 			atomic_add_long(&dev->si_threadcount, 1);
> 		}
> 		mtx_unlock(&dev->foo);
> 	}
> 	VI_UNLOCK(vp)
> 
> 
>   (this is almos the same as what we have in devfs_allocv() and devfs_reclaim(),
>   except that they use dev_lock() instead of the device-specific lock.
> 
> cheers
> luigi