svn commit: r277199 - in head/sys: fs/devfs kern

Sun Jan 18 11:32:01 UTC 2015

On Sun, Jan 18, 2015 at 12:08:26PM +0100, Hans Petter Selasky wrote:
> Hi Konstantin,
> 
> On 01/18/15 11:54, Konstantin Belousov wrote:
> > On Fri, Jan 16, 2015 at 10:00:11AM +0100, Hans Petter Selasky wrote:
> >> Hi Konstantin,
> >>
> >> On 01/16/15 09:03, Konstantin Belousov wrote:
> >>> My opinion is that you should have tried to handle the issue at the driver
> >>> level, instead of making this devfs issue.  I.e., if you already have
> >>> cdev node with the correct name, driver should have replaced the private
> >>> data to point to new device.
> >>
> >> I think this way cannot be implemented in a clean way, because of
> >> locking order reversal. And if you try to avoid the LOR you end up with
> >> the race. Chess mate sort of ;-)  Let me explain:
> >>
> >> Thread 1:
> >> usb_sx_lock();
> >>     cdev = Look in freelist for existing device();
> >> else
> >>     cdev = make_dev();
> >> usb_sx_unlock();
> >>
> >> Thread 2:
> >> usb_sx_lock();
> >> put cdev on freelist
> >> usb_sx_unlock();
> >>
> >> Thread 3:
> >> usb_sx_lock();
> >> cdev = remove first entry in freelist
> >> usb_sx_unlock();
> >>
> >> /*
> >>    * XXX because USB needs to call destroy_dev() unlocked we
> >>    * are racing with Thread 1 again
> >>    */
> >> destroy_dev(cdev);
> > I do not understand the 'freelist' part.
> >
> > You have some usb (or whatever other kind) device, which is represented by
> > some data structure.  This data structure references cdev.  In reverse,
> > cdev points to this data structure by si_drv1 or similar driver-owned
> > member of struct cdev.
> >
> > The structures naming usb devices have some mapping to/from usb bus addresses.
> > When device is destroyed or created due to external plug event, there is
> > some mechanism that ensures mapping consistence.  It could be lock, it
> > could be single-threaded process which discover the bus, or something
> > else, I do not know.
> >
> > While the structure notes that device with some address goes out and comes
> > in, it can look up the corresponding cdev and just change the si_drv1 pointer
> > to point to the new device.
> 
> What about events for "devd" that a new character devices is present in 
> the system or being destroyed for user-space applications?
It is up to the driver to decide.  If wanted, it could post the pair of
destroy/create itself (e.g., if there is some state which require
the userspace to reset), or it could decide not to.  The later is
reasonable if the destroy/create is believed to be caused by a glitch
rather than the actual replug.

> 
> >
> > I find it very strange that you need sleepless operation which can
> > _both_ destroy and create cdev. At least one operation of creation or
> > destruction must sleep, at least in the current devfs design. It could
> > be made sleepless, by making VFS visible part even less connected to the
> > cdev part, but this is not how it (can) currently work.
> 
> I think it would be good practice that all resources needed for creating 
> a character device can be allocated prior to registration. That means we 
> first should allocate all resources needed, but not register as a single 
> function.
> 
> For example:
> 
> Once make_dev() has returned, we can get an "d_open" callback. But 
> "si_drv1/2" is always set by drivers _after_ that "make_dev()" has 
> returned! This causes a race we could simply avoid by splitting the 
> make_dev() like I suggest. Instead of putting more logic and code inside 
> the drivers to handle the race!
I wanted to tunnel the si_drv values through the make_dev().
I.e. there could appear yet another variant of make_dev() which takes
the initialization parameters for the si_* driver vars.

I just did not have time/motivation to do the pass.
> 
> >>
> >>>
> >>> This would also close a window where /dev node is non-existent or operate
> >>> erronously.
> >>
> >> I'm not saying I plan so, but I think "cdevs" at some point need to
> >> understand mutexes and locks. That means like with other API's in the
> >> kernel we can associate a lock with the "cdev", and this lock is then
> >> used to ensure an atomic shutdown of the system in a non-blocking
> >> fashion. In my past experience multithreaded APIs should be high level
> >> implemented like this:
> >>
> >> NON-BLOCKING methods:
> >> lock(); **
> >> xxx_start();
> >> xxx_stop();
> >> unlock();
> >>
> >> BLOCKING methods:
> >> setup(); // init
> >> unsetup(); // drain
> >>
> >> Any callbacks should always be called locked **
> >>
> >> In devfs there was no non-blocking stop before I added the delist_dev()
> >> function.
> >>
> >>>
> >>> You do not understand my point.
> >>>
> >>> I object against imposing one additional global reference on all cdevs
> >>> just to cope with the delist hack.  See the patch at the end of the message.
> >>
> >> It's fine by me.
> >>>
> >>> WRT destroy_dev_sched_cb() calling delist_dev(), even after calling
> >>> delist_dev(), the node still exists in the /dev. It is only removed
> >>> after populate loop is run sometime later. dev_sched() KPI is inheritly
> >>> racy, drivers must handle the races for other reasons.
> >>
> >> The populate loop is all running under the dev_lock() from what I can
> >> see and make_dev() is also keeping the same lock when inserting and
> >> removing new cdevs. The populate loop should always give a consistent
> > No, populate loop is not run under dev_mtx.
> 
> Are you sure:
Yes, I am sure.  I explained it in the next sentences which you break from
the first one in the citation.

> 
> > static int
> > devfs_populate_loop(struct devfs_mount *dm, int cleanup)
> > {
> >         struct cdev_priv *cdp;
> >         struct devfs_dirent *de;
> >         struct devfs_dirent *dd, *dt;
> >         struct cdev *pdev;
> >         int de_flags, depth, j;
> >         char *q, *s;
> >
> >         sx_assert(&dm->dm_lock, SX_XLOCKED);
> >         dev_lock();
>            ^^^^ what is this ?
> >         TAILQ_FOREACH(cdp, &cdevp_list, cdp_list) {
> 
> > It would not be able to
> > allocate memory, even in M_NOWAIT fashion.  The dev_mtx is after the
> > vm and vnode locks.  Only iteration over the cdevp_list is protected by
> > dev_mtx, which is dropped right after something which require an
> > action, is found on the list. Populate() needs some way to avoid
> > reading inconsistent data from the cdev layer, but there is not
> > guarantee that we are always up to date.
^^^^^^^^^

> >
> >> view of the character devices available, and I don't see how "cdev"
> >> structures without the CDP_ACTIVE flag can appear with recently created
> >> ones, even if the name is the same.
> > It cannot simply because it is not synchronized with the device
> > creation/destruction.
> 
> 
> 
> > Lookup (or any other VFS-level code) is not protected by dev_mtx, it
> > is only dm lock which ensures that populate loop is not run in parallel,
> > modifying direntries.
> 
> See comment above.
> 
> >
> >>
> >> That system functions can still call into the dangling read/write/ioctl
> >> functions is another story, and that is why I tell, that in order to
> >> simplify this teardown, we possibly should associate a client selectable
> >> lock with each CDEV, for teardown purposes. Like done for several years
> >> in the callout and USB APIs and possibly many other places.
> > There is d_purge method which provides the tear-down semantic for
> > drivers which needs it.  Only tty uses the method AFAIK.
> >
> > Devfs ensures that no client code is active when cdev is removed for real.
> 
> --HPS