RELENG_6 kernel panic + savecore(8) problem

Sun Nov 25 18:21:36 PST 2007

On Wed, Nov 07, 2007 at 03:23:28PM -0800, Jeremy Chadwick wrote:
> On Wed, Nov 07, 2007 at 11:16:11AM -0800, Jeremy Chadwick wrote:
> > Tracing pid 3 tid 100001 td 0xc7c6ad80
> > kdb_enter(3228441820,3228796672,3228487817,3867634632,256,...) at kdb_enter+48
> > panic(3228487817,3426817152,256,3228643296,0,...) at panic+206
> > handle_written_inodeblock(3459887104,3688934424,3226775710,3228787204,3228175693,...) at handle_written_inodeblock+1503
> > softdep_disk_write_complete(3688934424,3227842097,3356275348,3867634836,3226342800,...) at softdep_disk_write_complete+241
> > bufdone(3688934424,0,3867634856,3226352850,3356275348,...) at bufdone+126
> > g_vfs_done(3356275348,0,0,3352445440,3355957180) at g_vfs_done+198
> > biodone(3356275348,3228786984,588,3228423470,100,...) at biodone+178
> > g_io_schedule_up(3351686528,76,3351679512,3226344072,3867634980,...) at g_io_schedule_up+137
> > g_up_procbody(0,3867635000,0,0,0,...) at g_up_procbody+122
> > fork_exit(3226344072,0,3867635000) at fork_exit+122
> > fork_trampoline() at fork_trampoline+8
> 
> A follow-up to this:
> 
> It appears that somehow a few of the filesystems on the disk (it's a
> single-disk system) were suffering from some bizarre form of soft update
> corruption.

And tonight we had the same problem happen again.  The backtrace when
the machine panic'd is identical:

Tracing pid 3 tid 100001 td 0xc7c6ed80
kdb_enter(c06e475e,c073ade0,c06efb55,e6876bc8,100,...) at kdb_enter+0x30
panic(c06efb55,ce04b280,100,c07156c0,0,...) at panic+0xce
handle_written_inodeblock(c858d200,dbda0a70,c07388e4,c06a3e4a,e6876c30,...) at handle_written_inodeblock+0x5df
softdep_disk_write_complete(dbda0a70,c0652591,c80e65ac,e6876c94,c04e16c4,...) at softdep_disk_write_complete+0xf1
bufdone(dbda0a70,0,e6876ca8,c04e3e06,c80e65ac,...) at bufdone+0x7e
g_vfs_done(c80e65ac,0,0,c7d28200,c80a418c) at g_vfs_done+0xc6
biodone(c80e65ac,c0738808,24c,c06dff1c,64,...) at biodone+0xb2
g_io_schedule_up(c7c6ed80,4c,c7c6d218,c04e1bbc,e6876d24,...) at g_io_schedule_up+0x89
g_up_procbody(0,e6876d38,0,0,0,...) at g_up_procbody+0x7a
fork_exit(c04e1bbc,0,e6876d38) at fork_exit+0x7a
fork_trampoline() at fork_trampoline+0x8

Background fsck reported similar weirdness until I brought down the
system into single-user, fsck'd things, and brought the system back up
again.

Also like last time, savecore found no cores on /dev/ad0s1b, despite
one being done after doing "panic" in DDB.  I believe I may have
figured out why that's occurring, though:

Using `rcorder /etc/rc.d/*` I'm able to see that the following rc
scripts are run in this respective order:

/etc/rc.d/dumpon    <-- Runs `dumpon -v /dev/ad0s1b`
/etc/rc.d/swap1     <-- Runs `swapon -a`
/etc/rc.d/addswap   <-- Adds additional swap spaces
/etc/rc.d/savecore  <-- Runs `savecore /var/crash /dev/ad0s1b`

I believe the problem is that /etc/rc.d/swap1 is being run before
savecore.  I'm guessing that swapon(8) actually destroys/clobbers the
existing saved kernel panic/core data, thus one will never get a
coredump in /var/crash.  I believe some re-organisation of rcorder(8)
arguments in the files is in order, but I don't know what should
be changed to what.

I'll submit a PR for the above, because IMHO that's a serious one.

-- 
| Jeremy Chadwick                                    jdc at parodius.com |
| Parodius Networking                           http://www.parodius.com/ |
| UNIX Systems Administrator                      Mountain View, CA, USA |
| Making life hard for others since 1977.                  PGP: 4BD6C0CB |