How to do proper locking

Fri Aug 5 19:26:07 GMT 2005

On Thursday 04 August 2005 04:53 pm, Hans Petter Selasky wrote:
> On Thursday 04 August 2005 20:15, John Baldwin wrote:
> > On Thursday 04 August 2005 12:50 pm, Hans Petter Selasky wrote:
> > > On Thursday 04 August 2005 15:53, John Baldwin wrote:
> > > > On Thursday 04 August 2005 07:40 am, Hans Petter Selasky wrote:
> > > > > On Wednesday 03 August 2005 19:21, John Baldwin wrote:
> > > > > > On Tuesday 02 August 2005 06:23 pm, Hans Petter Selasky wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I am looking for a safe way to access structures that can be
> > > > > > > freed. The solution I am looking for must not:
> > > > > > >
> > > > > > > - hinder scaleability
> > > > > > > - lead to use of a single lock
> > > > > > > - lead to lock order reversal
> > > > >
> > > > > There is one more thing. When the structure is freed it should not
> > > > > block waiting for any reference counts.
> > > >
> > > > crhold() is always called when the caller knows that it has a
> > > > reference to cr that can't go away.  Thus, in the case of p_ucred,
> > > > one holds the associated PROC_LOCK() while doing crhold().  After the
> > > > crhold() you can drop the proc lock, so that lock is only held for a
> > > > very short period of time.  You then have your own reference to the
> > > > cred structure that is valid until you call crfree().  In the case of
> > > > curthread->td_ucred, the reference is valid until curthread releases
> > > > it on a cred update at the start of a syscall, so your reference is
> > > > valid without needing locks for the life of a syscall.  The key here
> > > > is that you solve this problem at a higher level, not within the
> > > > structure itself.
> > >
> > > Yes, this works with mbufs, crhold() and alike, where one doesn't
> > > actually free the object until the refcount reaches zero. But I am
> > > speaking about destroying an object while the refcount is non-zero. Do
> > > you see the difference?
> > >
> > > There are two ways to handle this:
> > >
> > > 1) blocking: wait until the refcount reaches zero.
> > >
> > > 2) nonblocking: increment some other refcount that the
> > >    callback checks before accessing any data.
> > >
> > >
> > > Do you agree that method 2 is going to:
> > >
> > > - avoid deadlock
> > > - avoid use of sx-locks, locks that allow calls to msleep(),
> > >   to keep synchronization while destroying objects
> > >   with refcounts.
> >
> > Not always. :)
> >
> > > When calling functions like "callout_stop()", it is likely that one is
> > > holding a lock to prevent other code from calling "callout_start()" on
> > > the same callback before "callout_stop()" has been called. Now, if
> > > "callout_stop()" sleeps, the lock that is held while calling this
> > > function, must allow calls msleep(). So this must be an sx-lock (see
> > > "man sx"). Now if one routine sleeps, then any callers of this routine
> > > must sleep too. So this affects the synchronization of the whole
> > > system.
> > >
> > > Isn't nonblocking operation preferred over blocking operation?
> >
> > You can call callout_stop() while holding the lock, then do a
> > callout_drain() later when you are prepared to block.  This is what I'm
> > doing with ethernet drivers right now that typically call callout_stop in
> > their foo_stop() routine so detach looks like:
> >
> >  FOO_LOCK(sc);
> >  foo_stop(sc); // calls callout_stop()
> >  FOO_UNLOCK(sc);
> >  callout_drain(&sc->foo_stat_ch);
>
> This is only going to work if you are detaching. But consider the
> following:
>
>   FOO_LOCK(sc);
>   callout_stop();
>   callout_start();
>   FOO_UNLOCK(sc);
>
> Your solution is only spinning halfway around. What I say is that you need
> some more parameters passed to callbacks, so that one can check if it has
> been cancelled before proceeding.

Err, there is no callout_start(), but callout_reset() will just reschedule a 
callout if it's already pending.  The consumer of the callout can maintain 
his own side state that he can check in his callout to handle a state change 
in between scheduling the callout and the callout running.

> > > No, the callback functions read/write/ioctl/poll can be called after
> > > that "destroy_dev()" has been called.
> >
> > Nope.  Every call to read/write/ioctl bumps a reference count on the
> > underlying dev_t and destroy_dev() blocks until the reference count drops
> > to zero.
>
> Yes, I see that "dev_refthread()" and "dev_relthread()" are called from
> "devfs_xxx()", but "destroy_dev()" will not wait for these refcounts to
> reach zero unless the "d_purge" field is initialized in "struct cdevsw".
> And I cannot see that "prep_cdevs()" is setting any default value for this
> field. So by default "destroy_dev()" is non-blocking which is right.
>
> I am a bit lazy today, so I did:
>
> cat `find /usr/src/sys/dev/*` | grep d_purge
>
> And got two hits. This feature is not used at all.

Yes, I think that's a bug.  It should sleep as long as threadcount is > 0 and 
dev_relthread() should do a wakeup when the refcount hits zero.  I've talked 
with phk@ about this and we might try that out in HEAD.

> > > What I want is that the kernel provides some routine that can do
> > > locking based on a structure like below. Also the kernel must provide
> > > some routines to allocate, free and read a refcount.
> > >
> > > struct callback_args {
> > >   void *sc;
> > >   struct mtx *p_mtx;
> > >   u_int32_t refcount_copy;
> > >   u_int32_t *p_refcount;
> > > };
> > >
> > > Then I want routines like "callout_reset()", "make_dev()",
> > > "bus_setup_intr()" and so on, to pass these four parameters to the
> > > callback function, either directly, or like a pointer to this
> > > structure.
> >
> > You are just going to move the blocking behavior into contention on your
> > locks, you aren't going to actually remove it anywhere
>
> Yes, that is correct, but from what I understand it is inherently more
> efficient to wait for a mutex of type "struct mtx", than it is to use
> msleep/wakeup, which is the current way of doing it, according to "man sx".

Well, it depends on what you are doing.  Waiting for a device to detach is not 
a critical path.

> > , and since all users
> > of the structures have to do more checking at runtime
>
> I don't think so.
>
> > , I think you will end up increasing overhead.
>
> Let's consider a real example, for example "devfs_read_f()":
>
> devfs_read_f(...)
> {
>
>         dev_refthread(dev);
>
>         error = dsw->d_read(dev, uio, ioflag);
>
>         dev_relthread(dev);
>
>         return (error);
> }
>
> Per "devfs_read_f()" one has got to lock/unlock two times. It involves two
> refcount reads and two refcount writes, which means that the CPU cache will
> be invalidated.

Err, refcount operations don't invalidate the CPU cache all by themselves.  If 
there is contention then individual lines might be invalidated, but if your 
CPUs are contending you're going to lose anyway.

> Now consider my proposal. There is no more need for
> dev_refthread/dev_relthread. Instead the callback, "d_read()" will do the
> checking. This consist of:
>
> d_read()
> {
>     mtx_lock(args->mtx);
>
>     if(args->refcount_copy == atomic_read(args->refcount_p))
>     {
>       /* valid */
>     }
>
>     mtx_unlock(args->mtx);
>
>     return;
> }
>
> Per "devfs_read_f()" one has got to lock/unlock one time. It involves two
> refcount reads and no refcount writes.
>
> Well, isn't this 200 percent faster?

Maybe, but another thing you need to consider is "maintainenance" overhead.  
Device drivers, especially, should be the simplest parts of the kernel to 
implement because we want to minimize the potential for screw up in that 
area.  Having to trust that the same code is going to be duplicated 40 or 50 
times without any errors versus having it done once in a centralized place 
where it breaks everyone if it is broken is just insane.  Otherwise, we might 
as well go write the whole kernel in assembly so we can tweak every last bit 
out of it. :)

-- 
John Baldwin <jhb at FreeBSD.org>  <><  http://www.FreeBSD.org/~jhb/
"Power Users Use the Power to Serve"  =  http://www.FreeBSD.org