panic: detach with active requests on 10.1-RC3

Guido Falsi mad at madpilot.net
Sat Oct 25 15:02:53 UTC 2014


On 10/24/14 15:26, Guido Falsi wrote:
> Hi,
> 
> I'm running some experiments with 10.1-RC3 on ALIX boards as the
> hardware, using NanoBSD.
> 
> By mounting and unmounting UFS filesystems I have seen umount
> consistently hang hard in a deadlock. I have tested on two boards with
> two distinct CompactFlash disks, with the same results. This was not
> happening with 10.0-RELEASE.
> 
> I have built a 10.1-RC3 kernel with full debugging and triggered the
> problem again; this is what I got:
> 
> root at qtest:~ [0]# umount /cfg
> panic: detach with active requests
> KDB: stack backtrace:
> db_trace_self_wrapper(c0968053,c08ea7f0,c2d48800,c23d6bc8,c0536a16,...)
> at db_trace_self_wrapper+0x2d/frame 0xc23d6b98
> kdb_backtrace(c09639e1,c09fa7e8,c095761d,c23d6c54,c095761d,...) at
> kdb_backtrace+0x30/frame 0xc23d6c00
> vpanic(c09fa682,100,c095761d,c23d6c54,c23d6c54,...) at vpanic+0x80/frame
> 0xc23d6c24
> kassert_panic(c095761d,c09575b3,c2d7acc0,4c7,c2d7acc0,...) at
> kassert_panic+0xe9/frame 0xc23d6c48
> g_detach(c2d7acc0,4,c095725c,1c2,c09c8d5c,...) at g_detach+0x1d3/frame
> 0xc23d6c64
> g_wither_washer(c09f7df4,0,c0956544,124,0,...) at
> g_wither_washer+0x109/frame 0xc23d6c90
> g_run_events(0,c23d6d08,c095d42a,3dc,0,...) at g_run_events+0x40/frame
> 0xc23d6ccc
> fork_exit(c05c4e60,0,c23d6d08) at fork_exit+0x7f/frame 0xc23d6cf4
> fork_trampoline() at fork_trampoline+0x8/frame 0xc23d6cf4
> --- trap 0, eip = 0, esp = 0xc23d6d40, ebp = 0 ---
> KDB: enter: panic
> [ thread pid 12 tid 100006 ]
> Stopped at      kdb_enter+0x3d: movl    $0,kdb_why
> db>
> 

I tried to investigate some more by myself. Maybe what I found is
obvious to anyone with decent VFS knowledge; anyway:

After some fumbling around I did:

db> show geom 0xc2e98b40
consumer: 0xc2e98b40
  class:    VFS (0xc09c8d5c)
  geom:     ffs.ada0s3 (0xc3293600)
  provider: ada0s3 (0xc2e7e200)
  access:   r0w0e0
  flags:    0x0030
  nstart:   19
  nend:     18

This shows nstart != nend, while g_detach() asserts that they are equal.
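
For reference, here is (roughly) the check that fires; a paraphrase of
g_detach() from sys/geom/geom_subr.c, trimmed to the relevant assertion
and not the verbatim source:

/*
 * Paraphrase of g_detach(), sys/geom/geom_subr.c, trimmed to the
 * check that panics here; not the verbatim source.
 */
void
g_detach(struct g_consumer *cp)
{

	g_topology_assert();
	/* (checks that the consumer is attached and fully closed) */
	/* Every bio started on this consumer must also have completed: */
	KASSERT(cp->nstart == cp->nend, ("detach with active requests"));
	/* ... the actual detach work follows ... */
}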

Going up the chain of providers, I find that its providers also have
nstart - nend == 1:

db> show geom 0xc2e9b7c0
consumer: 0xc2e9b7c0
  class:    PART (0xc09c96b0)
  geom:     ada0 (0xc2e7e780)
  provider: ada0 (0xc2e7e500)
  access:   r2w0e0
  flags:    0x0030
  nstart:   1430
  nend:     1429
db> show geom 0xc2e7e500
provider: ada0 (0xc2e7e500)
  class:        DISK (0xc09c8890)
  geom:         ada0 (0xc2e7e580)
  mediasize:    4017807360
  sectorsize:   512
  stripesize:   0
  stripeoffset: 0
  access:       r2w0e0
  flags:         (0x0030)
  error:        0
  nstart:       2085
  nend:         2084
  consumer: 0xc2e9a700 (ada0), access=r0w0e0, flags=0x0030
  consumer: 0xc2e9b480 (ada0), access=r0w0e0, flags=0x0030
  consumer: 0xc2e9b7c0 (ada0), access=r2w0e0, flags=0x0030

Looking at the code, these values are touched only in g_io_request()
and g_io_deliver(), respectively.
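
To illustrate the failure mode (this is just a toy userland model of the
accounting, not the kernel code): if a request goes through the start
path but its completion never reaches the deliver path, the counters end
up off by one and the detach-time check can only fail:

/*
 * Toy userland model of the GEOM nstart/nend accounting; the names
 * mirror the kernel ones, but this is not the kernel code.
 */
#include <assert.h>
#include <stdio.h>

struct consumer {
	int nstart;	/* requests started (g_io_request() side) */
	int nend;	/* requests completed (g_io_deliver() side) */
};

static void start_request(struct consumer *cp)  { cp->nstart++; }
static void finish_request(struct consumer *cp) { cp->nend++; }

int
main(void)
{
	struct consumer cp = { 0, 0 };
	int i;

	/* 19 requests started, but one completion is lost somewhere. */
	for (i = 0; i < 19; i++)
		start_request(&cp);
	for (i = 0; i < 18; i++)
		finish_request(&cp);

	printf("nstart=%d nend=%d\n", cp.nstart, cp.nend);
	/* The userland equivalent of the KASSERT in g_detach(). */
	assert(cp.nstart == cp.nend && "detach with active requests");
	return (0);
}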

So this now looks like a GEOM problem.

In fact, the only commit that touched those functions between the 10.0
and 10.1 branches is r260385, which merged quite a few things.

I've tried reverting it so I could test without that change, but "svn
merge -c -260385 ." generated a few conflicts I'm unable to resolve, so
I need some guidance even to perform this simple test.


> 
> The machine is sitting there and I am connected via serial console; is
> anyone willing to help me debug this further? I really know very little
> about kernel debugging. If necessary I can also make myself available
> via IRC or Jabber.
> 
> It looks like this has some similarities with what was reported here:
> 
> https://lists.freebsd.org/pipermail/freebsd-fs/2014-September/020035.html
> 
> I also tested with head (including r272130) and it deadlocks the same way.
> 

After the analysis above I think that there really is no similarity with
the problem reported by bdrewery.

-- 
Guido Falsi <mad at madpilot.net>

