Kernel Panic help.

Fri Aug 22 16:21:16 UTC 2008

Eric Crist wrote:
> Hey folks,
> 
> First, please 'reply-all' as I'm not on the list.
> 
> I've got a backup server that, every night, offloads things to a 
> secondary, USB attached hard disk.  We've got two of these disks, which 
> we rotate so as to have a fairly recent off-site version, in the event 
> of a disaster.  One of the two drives has started to cause the backup 
> server to core dump and reboot.  The other works fine.  I tried taking 
> the problematic drive and repartitioning and reformatting it, but the 
> problems persist.
> 
> Here is what I get from a kgdb:
> 
> ecrist at leopard:/usr/obj/usr/src/sys/GENERIC-> sudo kgdb kernel.debug 
> /var/crash/vmcore.17
> [GDB will not be able to debug user-mode threads: 
> /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-marcel-freebsd".
> 
> Unread portion of the kernel message buffer:
> panic: softdep_deallocate_dependencies: dangling deps
> cpuid = 0
> Uptime: 11d20h37m38s
> Physical memory: 1011 MB
> Dumping 201 MB: 186 170 154 138 122 106 90 74 58 42 26 10
> 
> #0  doadump () at pcpu.h:195
> 195        __asm __volatile("movl %%fs:0,%0" : "=r" (td));
> 
> 
> Any insight is appreciated.  uname -a is:
> 
> FreeBSD hostname 7.0-RELEASE-p3 FreeBSD 7.0-RELEASE-p3 #1: Tue Jul 15 
> 13:53:28 CDT 2008     root at hostname:/usr/obj/usr/src/sys/GENERIC  i386

See the Developers' Handbook for more details on how to report panics 
(you also need the backtrace, and turning on debugging may help to catch 
the problem earlier).

However, this kind of panic can happen if the drive is marginal, e.g. if 
it loses or corrupts I/O in transit.  Try compiling the fsx tool 
(/usr/src/tools/regression/fsx) and running it against the problem disk 
for a few days, or even run multiple instances on different files at 
once to really stress it.  It does lots of I/O to a file and verifies 
that the file remains consistent throughout.  It won't touch the whole 
drive though, so if only parts of the disk are bad it may not catch it.
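
To illustrate the kind of check meant here, this is a rough sketch in 
Python of the idea behind such a tool.  It is not fsx itself and does 
much less; the function name stress_file and its parameters are made up 
for the example.  It does random writes to one file, mirrors them in 
memory, and periodically reads the file back to compare.

```python
import os
import random


def stress_file(path, file_size=1 << 20, iterations=1000, seed=0):
    """Do random writes to `path`, mirror them in memory, and return the
    number of read-back checks that did not match the expected contents."""
    rng = random.Random(seed)
    shadow = bytearray(file_size)      # what the file is supposed to contain
    mismatches = 0

    def check(f):
        # Push the data out, then read the whole file back and compare.
        # Note: the read may be served from the buffer cache rather than
        # the disk, so a file larger than RAM makes this more meaningful.
        f.flush()
        os.fsync(f.fileno())
        f.seek(0)
        return 0 if f.read() == bytes(shadow) else 1

    with open(path, "w+b") as f:
        f.write(shadow)                # start from a zero-filled file
        for i in range(iterations):
            offset = rng.randrange(file_size)
            length = rng.randint(1, min(65536, file_size - offset))
            data = os.urandom(length)
            f.seek(offset)
            f.write(data)
            shadow[offset:offset + length] = data
            if i % 50 == 0:
                mismatches += check(f)
        mismatches += check(f)         # final check after the last write
    return mismatches
```

Something like stress_file("/mnt/usbdisk/stress.bin") (the mount point is 
just an assumption) should return 0 on a healthy disk.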

For that you could try generating a large random file on another disk, 
keeping the md5 checksum, then writing lots of copies of it to the bad 
disk to fill or almost fill it, then recomputing the md5 checksum of 
each copy and comparing it with the original.  A small script could run 
this in a loop.
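
A minimal sketch of such a script, in Python, might look like the 
following.  The function names (md5sum, fill_and_verify) and the 
copy.NNNN file naming are assumptions for the example; the number of 
copies has to be chosen by hand so that they fit on the disk.

```python
import hashlib
import os
import shutil


def md5sum(path, chunk=1 << 20):
    """Return the hex md5 digest of a file, reading it in chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()


def fill_and_verify(source, target_dir, copies):
    """Write `copies` copies of `source` into `target_dir`, then re-read
    each one and return the list of copies whose md5 does not match."""
    expected = md5sum(source)
    paths = []
    for i in range(copies):
        dest = os.path.join(target_dir, "copy.%04d" % i)
        shutil.copyfile(source, dest)
        paths.append(dest)
    os.sync()  # ask the kernel to flush pending writes to disk
    # Reads may still come from the buffer cache; unmounting and
    # remounting the disk before verifying is a stricter test.
    return [p for p in paths if md5sum(p) != expected]
```

Calling fill_and_verify in a loop (for example with a random file 
created by dd from /dev/random, and the USB disk's mount point as 
target_dir) and logging any non-empty result would cover most of the 
disk surface over time.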

Yet another option would be to configure the disk as a geli volume (with 
data authentication enabled) or a zfs volume, since that will validate 
checksums with each read and will catch data corruption anywhere on the 
disk that gets read back.

I'd validate those things before proceeding with the existing panic.

Kris