HAVE TRACE & DDB Re: FreeBSD 5.2-RC1 released

Sun Dec 14 10:41:18 PST 2003

On Sun, 14 Dec 2003, Jeff Roberson wrote:

>
> On Sun, 14 Dec 2003, Jeff Roberson wrote:
>
> > On Sat, 13 Dec 2003, Don Lewis wrote:
> >
> > > On 13 Dec, Don Lewis wrote:
> > > > On 12 Dec, Jeff Roberson wrote:
> > > >
> > > >
> > > >> fsync: giving up on dirty: 0xc4e18000: tag devfs, type VCHR, usecount 44,
> > > >> writecount 0, refcount 14, flags (VI_XLOCK|VV_OBJBUF), lock type devfs: EXCL
> > > >> (count 1) by thread 0xc20ff500
> > > >
> > > > Why are we trying to reuse a vnode with a usecount of 44 and a refcount
> > > > of 14?  What is thread 0xc20ff500 doing?
> > >
> > > Following up to myself ...
> > >
> > > It looks like we're trying to recycle this vnode because of the
> > > following sysinstall code, in distExtractTarball():
> > >
> > >     if (is_base && RunningAsInit && !Fake) {
> > >         unmounted_dev = 1;
> > >         unmount("/dev", MNT_FORCE);
> > >     } else
> > >         unmounted_dev = 0;
> > >
> > > What happens if we forcibly unmount /dev while /dev/whatever holds a
> > > mounted file system?  It looks like this is handled by vgonechrl().  It
> > > looks to me like vclean() is going to do some scary stuff to this vnode.
> > >
> >
> > Excellent work!  I think I may know what's wrong.  If you look at rev
> > 1.461 of vfs_subr.c I changed the semantics of cleaning a VCHR that was
> > being unmounted.  I now acquire the xlock around the operation.  This may
> > be the culprit.  I'm too tired to debug this right now, but I can look at
> > it in the am.
> >
>
> Ok, I think I understand what happens.  The syncer runs, and at the same
> time, we're doing the forced unmount.  This causes the sync of the device
> vnode to fail.  This isn't really a problem.  After this, while syncing
> an ffs volume that is mounted on a VCHR from /dev, we bread() and get a
> buffer for this device and then immediately block.  The forced unmount
> then proceeds, calling vclean() on the device, which goes into the VM via
> DESTROYVOBJECT.  The VM frees all of the pages associated with the object
> etc.  Then, the ffs_update() is allowed to run again with a pointer to a
> buffer that has pointers to pages that have been freed.  This is where
> vfs_setdirty() comes in and finds a NULL object.
>
> The wired counts on the pages are 1, which is consistent with a page in
> the bufcache.  Also the object is NULL which is the only indication we
> have that this is a free page.
>
> I think that if we want to allow unmounting of the underlying device for
> VCHR, we need to not call vclean() from vgonechrl().  We need to just lock,
> VOP_RECLAIM, cache_purge(), and insmntque to NULL.
>
> I've looked through my changes here, and I don't see how I could have
> introduced this bug.  Were we vclean()ing before?  That seems to be the
> main problem.  There have been some changes to device aliasing that could
> have impacted this.  I'm trying to get the scoop from phk now.
>
> I'm going to change the way vgonechrl() works, but I'd really like to know
> what changed that broke this.
>

Please test the patch at:
http://www.chesapeake.net/~jroberson/forcevchr.diff

If this works I'll come up with a more compact arrangement for the code so
that we can avoid all of this duplication.
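
To make the ordering concrete, here is a small user-space model of the
interleaving described above.  It is only a sketch under stated assumptions:
every toy_* name is made up for illustration, none of it is kernel code, and
it does no real locking or VM work.  A page whose object pointer is NULL
stands in for a freed page, since that is the only indication we have.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NPAGES 2

/* Toy stand-ins only; these are not the real kernel structures. */
struct toy_object { int id; };

struct toy_page {
	struct toy_object *object;	/* NULL models a freed page */
	int wire_count;			/* 1 while held by the buffer cache */
};

struct toy_buf {
	struct toy_page *pages[TOY_NPAGES];
	int npages;
};

struct toy_vnode {
	struct toy_object *object;
	int reclaimed;		/* set by the VOP_RECLAIM step */
	int namecache_valid;	/* cleared by the cache_purge() step */
	int on_mount_list;	/* cleared by "insmntque to NULL" */
};

/* Proposed path: reclaim, purge the name cache, leave the mount list. */
static void
toy_detach(struct toy_vnode *vp)
{
	vp->reclaimed = 1;
	vp->namecache_valid = 0;
	vp->on_mount_list = 0;
}

/* Models vclean(): the same steps, plus tearing down the object's pages. */
static void
toy_vclean(struct toy_vnode *vp, struct toy_page *pages, int npages)
{
	for (int i = 0; i < npages; i++)
		pages[i].object = NULL;
	vp->object = NULL;
	toy_detach(vp);
}

/* Models the check that trips in vfs_setdirty(): every page needs an object. */
static int
toy_buf_pages_ok(const struct toy_buf *bp)
{
	assert(bp->npages <= TOY_NPAGES);
	for (int i = 0; i < bp->npages; i++)
		if (bp->pages[i]->object == NULL)
			return (0);
	return (1);
}

/*
 * One interleaving: the ffs_update() side gets its buffer and blocks, the
 * forced unmount runs, then the ffs_update() side resumes and checks its
 * pages.  Returns 1 if the buffer is still usable afterwards.
 */
int
toy_forced_unmount(int use_vclean)
{
	struct toy_object obj = { 1 };
	struct toy_page pages[TOY_NPAGES] = { { &obj, 1 }, { &obj, 1 } };
	struct toy_buf buf = { { &pages[0], &pages[1] }, TOY_NPAGES };
	struct toy_vnode vn = { &obj, 0, 1, 1 };

	/* bread() has returned buf; the thread is now blocked. */
	if (use_vclean)
		toy_vclean(&vn, pages, TOY_NPAGES);
	else
		toy_detach(&vn);
	/* The blocked thread runs again. */
	return (toy_buf_pages_ok(&buf));
}
```

In this model toy_forced_unmount(1) leaves the buffer pointing at pages with
no object, while toy_forced_unmount(0) leaves the pages alone, which is the
behaviour I'm after for VCHR.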

Cheers,
Jeff
