Filesystem wedges caused by r251446
Konstantin Belousov
kostikbel at gmail.com
Fri Jul 12 20:11:01 UTC 2013
On Fri, Jul 12, 2013 at 08:11:36PM +0200, Ian FREISLICH wrote:
> John Baldwin wrote:
> > On Thursday, July 11, 2013 6:54:35 am Ian FREISLICH wrote:
> > > John Baldwin wrote:
> > > > On Thursday, July 04, 2013 5:03:29 am Ian FREISLICH wrote:
> > > > > Konstantin Belousov wrote:
> > > > > >
> > > > > > Care to provide any useful information ?
> > > > > >
> > > > > > http://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/kerneldebug-deadlocks.html
> > > > >
> > > > > Well, the system doesn't deadlock; it's perfectly usable so long
> > > > > as you don't touch the file that's wedged. A lot of the time the
> > > > > userland process is unkillable, but often it is killable. How do
> > > > > I get from the PID to where the FS is stuck in the kernel?
> > > >
> > > > Use kgdb. 'proc <pid>', then 'bt'.
> > >
> > > So, I set up a remote kgdb session, but I still can't figure out how
> > > to get at the information we need.
> > >
> > > (kgdb) proc 5176
> > > only supported for core file target
> > >
> > > In the meantime, I'll just force it to make a core dump from ddb.
> > > However, I can't recreate the issue while the mirror (gmirror) is
> > > rebuilding, so we'll have to wait for that to finish.
> >
> > Sorry, just run 'sudo kgdb' on the box itself. You can inspect the running
> > kernel without having to stop it.
>
> So, this machine's installworld *always* stalls installing clang.
> The install can be stopped (ctrl-c) leaving behind this process:
>
> root 23147 0.0 0.0 9268 1512 1 D 7:51PM 0:00.01 install -s -o root -g wheel -m 555 clang /usr/bin/clang
>
> This is the backtrace from gdb. I suspect frame 4.
>
> (kgdb) proc 23147
> [Switching to thread 117 (Thread 100059)]#0 sched_switch (
> td=0xfffffe000c012920, newtd=0x0, flags=<value optimized out>)
> at /usr/src/sys/kern/sched_ule.c:1954
> 1954 cpuid = PCPU_GET(cpuid);
> Current language: auto; currently minimal
> (kgdb) bt
> #0 sched_switch (td=0xfffffe000c012920, newtd=0x0,
> flags=<value optimized out>) at /usr/src/sys/kern/sched_ule.c:1954
> #1 0xffffffff8047539e in mi_switch (flags=260, newtd=0x0)
> at /usr/src/sys/kern/kern_synch.c:487
> #2 0xffffffff804acbea in sleepq_wait (wchan=0x0, pri=0)
> at /usr/src/sys/kern/subr_sleepqueue.c:620
> #3 0xffffffff80474ee9 in _sleep (ident=<value optimized out>,
> lock=0xffffffff80a20300, priority=84, wmesg=0xffffffff8071129a "wdrain",
> sbt=<value optimized out>, pr=0, flags=<value optimized out>)
> at /usr/src/sys/kern/kern_synch.c:249
> #4 0xffffffff804e6523 in waitrunningbufspace ()
> at /usr/src/sys/kern/vfs_bio.c:564
> #5 0xffffffff804e6073 in bufwrite (bp=<value optimized out>)
> at /usr/src/sys/kern/vfs_bio.c:1226
> #6 0xffffffff804f05ed in cluster_wbuild (vp=0xfffffe008fec4000, size=32768,
> start_lbn=136, len=<value optimized out>, gbflags=<value optimized out>)
> at /usr/src/sys/kern/vfs_cluster.c:1002
> #7 0xffffffff804efbc3 in cluster_write (vp=0xfffffe008fec4000,
> bp=0xffffff80f83da6f0, filesize=4456448, seqcount=127,
> gbflags=<value optimized out>) at /usr/src/sys/kern/vfs_cluster.c:592
> #8 0xffffffff805c1032 in ffs_write (ap=0xffffff8121c81850)
> at /usr/src/sys/ufs/ffs/ffs_vnops.c:801
> #9 0xffffffff8067fe21 in VOP_WRITE_APV (vop=<value optimized out>,
> a=<value optimized out>) at vnode_if.c:999
> #10 0xffffffff80511eca in vn_write (fp=0xfffffe006a5f7410,
> uio=0xffffff8121c81a90, active_cred=0x0, flags=<value optimized out>,
> td=0x0) at vnode_if.h:413
> #11 0xffffffff8050eb3a in vn_io_fault (fp=0xfffffe006a5f7410,
> uio=0xffffff8121c81a90, active_cred=0xfffffe006a6ca000, flags=0,
> td=0xfffffe000c012920) at /usr/src/sys/kern/vfs_vnops.c:983
> #12 0xffffffff804b506a in dofilewrite (td=0xfffffe000c012920, fd=5,
> fp=0xfffffe006a5f7410, auio=0xffffff8121c81a90,
> offset=<value optimized out>, flags=0) at file.h:290
> #13 0xffffffff804b4cde in sys_write (td=0xfffffe000c012920,
> uap=<value optimized out>) at /usr/src/sys/kern/sys_generic.c:460
> #14 0xffffffff8061807a in amd64_syscall (td=0xfffffe000c012920, traced=0)
> at subr_syscall.c:134
> #15 0xffffffff806017ab in Xfast_syscall ()
> at /usr/src/sys/amd64/amd64/exception.S:387
> #16 0x000000000044e75a in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb)
Please apply the (mostly debugging) patch below, then reproduce the issue.
I need the backtrace of the 'main' hung process, assuming it is stuck
in waitrunningbufspace(). Also, from the same kgdb session, print
runningbufreq, runningbufspace and lorunningspace.
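For reference, the kgdb session requested above might look roughly like
the sketch below. It reuses the 'proc <pid>' and 'bt' commands suggested
earlier in the thread; <pid> is a placeholder for the hung process, and
'p' is the ordinary gdb print command. Whether kgdb resolves these
variables without a file-scope qualifier depends on the kernel symbols
available, which is an assumption here.

```
# kgdb
(kgdb) proc <pid>
(kgdb) bt
(kgdb) p runningbufreq
(kgdb) p runningbufspace
(kgdb) p lorunningspace
```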
diff --git a/sys/kern/vfs_bio.c b/sys/kern/vfs_bio.c
index 68021e0..205e9b3 100644
--- a/sys/kern/vfs_bio.c
+++ b/sys/kern/vfs_bio.c
@@ -474,10 +474,12 @@ runningbufwakeup(struct buf *bp)
{
long space, bspace;
- if (bp->b_runningbufspace == 0)
- return;
- space = atomic_fetchadd_long(&runningbufspace, -bp->b_runningbufspace);
bspace = bp->b_runningbufspace;
+ if (bspace == 0)
+ return;
+ space = atomic_fetchadd_long(&runningbufspace, -bspace);
+ KASSERT(space >= bspace, ("runningbufspace underflow %ld %ld",
+ space, bspace));
bp->b_runningbufspace = 0;
/*
* Only acquire the lock and wakeup on the transition from exceeding
@@ -561,7 +563,7 @@ waitrunningbufspace(void)
mtx_lock(&rbreqlock);
while (runningbufspace > hirunningspace) {
- ++runningbufreq;
+ runningbufreq = 1;
msleep(&runningbufreq, &rbreqlock, PVM, "wdrain", 0);
}
mtx_unlock(&rbreqlock);
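To make the accounting in the patch easier to follow, here is a small
userland model of it. This is only a sketch, not kernel code: the
variable names mirror vfs_bio.c, but the watermark values, the model_*
function names, the wakeups counter, and the exact low-watermark
transition test are assumptions (the diff shows only the comment
describing that test, not the test itself). Locking and atomics are
left out.

```c
#include <assert.h>

/* Userland model only; the watermark values are assumed, not taken from a real system. */
static long runningbufspace;                 /* bytes of writes in flight */
static long lorunningspace = 512L * 1024;    /* assumed low watermark */
static long hirunningspace = 1024L * 1024;   /* assumed high watermark */
static int  runningbufreq;                   /* nonzero: a writer is waiting */
static int  wakeups;                         /* hypothetical: counts simulated wakeup() calls */

struct buf {
	long b_runningbufspace;
};

/* Hypothetical helper: account for a write of 'size' bytes being started. */
static void
model_start_write(struct buf *bp, long size)
{
	bp->b_runningbufspace = size;
	runningbufspace += size;
}

/*
 * Models the check in waitrunningbufspace(): returns 1 if the caller
 * would have to sleep, and records the request as the patched code
 * does (runningbufreq = 1 rather than ++runningbufreq).
 */
static int
model_must_wait(void)
{
	if (runningbufspace > hirunningspace) {
		runningbufreq = 1;
		return (1);
	}
	return (0);
}

/* Models the patched runningbufwakeup(), including the underflow check. */
static void
model_runningbufwakeup(struct buf *bp)
{
	long space, bspace;

	bspace = bp->b_runningbufspace;
	if (bspace == 0)
		return;
	space = runningbufspace;        /* value before the subtraction */
	runningbufspace -= bspace;
	assert(space >= bspace);        /* stands in for the KASSERT */
	bp->b_runningbufspace = 0;

	/* Assumed transition test: wake only when crossing the low watermark. */
	if (space < lorunningspace)
		return;
	if (space - bspace > lorunningspace)
		return;
	if (runningbufreq) {
		runningbufreq = 0;
		wakeups++;
	}
}
```

With three 512 KiB writes in flight under these assumed watermarks, a
writer has to wait. Completing the first write does not wake it, because
the total is still above the low watermark. Completing the second one
does. A completion that subtracted more than is currently accounted for
would trip the assertion, which is the condition the KASSERT in the patch
is meant to catch.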