FreeBSD-9.1: machine reboots during snapshot creation, LORs found

Jeremy Chadwick jdc at koitsu.org
Sun Jun 16 10:30:24 UTC 2013


On Sun, Jun 16, 2013 at 11:55:38AM +0200, Andre Albsmeier wrote:
> On Sun, 16-Jun-2013 at 10:49:37 +0200, Jeremy Chadwick wrote:
> > On Sun, Jun 16, 2013 at 10:02:39AM +0200, Andre Albsmeier wrote:
> > > On Sun, 16-Jun-2013 at 08:54:41 +0200, Jeremy Chadwick wrote:
> > > > On Fri, May 31, 2013 at 07:25:23PM +0200, Andre Albsmeier wrote:
> > > > > On Fri, 31-May-2013 at 16:51:03 +0200, John Baldwin wrote:
> > > > > > On Friday, May 31, 2013 8:26:11 am Andre Albsmeier wrote:
> > > > > > > Each day at 5:15 we are generating snapshots on various machines.
> > > > > > > This used to work perfectly under 7-STABLE for years but since
> > > > > > > we started to use 9.1-STABLE the machine reboots in about 10%
> > > > > > > of all cases.
> > > > > > > 
> > > > > > > After rebooting we find a new snapshot file which is a bit
> > > > > > > > smaller than the good ones and with different permissions.
> > > > > > > > It does not pass fsck. In this example it is the one
> > > > > > > whose name is beginning with s3:
> > > > > > > 
> > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 29 May 05:15 s2-2013.05.28-03.15.04
> > > > > > > -r--------   1 root  operator  snapshot 72802893824 29 May 05:15 s3-2013.05.29-03.15.03
> > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s4-2013.05.23-06.38.44
> > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s5-2013.05.24-03.15.03
> > > > > > > -r--r-----   1 root  operator  snapshot 72802894528 28 May 14:22 s6-2013.05.25-03.15.03
> > > > > > > 
> > > > > > > After enabling DIAGNOSTIC, WITNESS and INVARIANTS in the kernel
> > > > > > > I see the following LORs (mksnap_ffs starts exactly at 5:15):
> > > > > > > 
> > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 1st 0xc2371da8 ufs (ufs) @ /src/src-9/sys/kern/vfs_mount.c:1240
> > > > > > > May 29 05:15:00 <kern.crit> palveli kernel: 2nd 0xc2371ec4 devfs (devfs) @ /src/src-9/sys/ufs/ffs/ffs_vfsops.c:1414
> > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: lock order reversal:
> > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 1st 0xc228471c snaplk (snaplk) @ /src/src-9/sys/ufs/ufs/ufs_vnops.c:976
> > > > > > > May 29 05:15:04 <kern.crit> palveli kernel: 2nd 0xc22f25e4 ufs (ufs) @ /src/src-9/sys/ufs/ffs/ffs_snapshot.c:1626
> > > > > > > 
> > > > > > > > Unfortunately no core files are being generated ;-(.
> > > > > > > 
> > > > > > > I have checked and even rebuilt the (UFS1) fs in question
> > > > > > > > from scratch. I have also seen this happen on a UFS2 fs on
> > > > > > > another machine and on a third one when running "dump -L"
> > > > > > > on a root fs.
> > > > > > > 
> > > > > > > > Any hints on how to proceed?
> > > > > > 
> > > > > > > Would it be possible to set up a serial console that is logged on this machine
> > > > > > > to see if it is panicking but failing to write out a crashdump?
> > > > > 
> > > > > I'll try to arrange that. It'll take a bit since this
> > > > > box is 200 km away... 
> > > > > 
> > > > > Maybe I'll find another one nearby to reproduce it...
> > > > 
> > > > SPECIFICALLY regarding "lack of crash dumps": I need to see the
> > > > following:
> > > > 
> > > > * cat /etc/rc.conf
> > > > * cat /etc/fstab
> > > > 
> > > > I may need output from other commands, but shall deal with that when I
> > > > see output from the above.  Thanks.
> > > 
> > > No problem, see below...
> > > 
> > > To make a long story short, the machine dumps core perfectly
> > > (tested that a while ago), but not when dealing with _this_
> > > issue...
> > > 
> > > I dump on da1s1b and savecore fetches it from there and puts
> > > it on /var (sitting on da0), that's faster.
> > > 
> > > rc.conf (beware, rc.conf.local exists):
> > > ---------------------------------------
> > > rcshutdown_timeout=180
> > > tmpmfs=YES
> > > tmpsize="$(( `/sbin/sysctl -n hw.usermem` / 3000000 ))m"
> > > tmpmfs_flags="$tmpmfs_flags -v 1 -n"
> > > 
> > > background_fsck=NO
> > > 
> > > nisdomainname=ofw.tld
> > > pflog_flags=-S
> > > 
> > > syslogd_flags=-svv
> > > inetd_enable=YES
> > > inetd_flags=-l
> > > named_flags="-S 1000"
> > > named_chrootdir=""
> > > rwhod_enable=YES
> > > sshd_enable=YES
> > > amd_enable=YES
> > > amd_flags="-F /etc/amd.conf"
> > > nfs_client_enable=YES
> > > nfs_access_cache=2
> > > mountd_flags=-n
> > > rpcbind_enable=YES
> > > 
> > > ntpdate_enable=YES
> > > ntpdate_hosts=ntp
> > > ntpd_enable=YES
> > > ntpd_flags="-p /var/run/ntpd.pid"
> > > 
> > > nis_client_enable=YES
> > > nis_client_flags="-s -S ofw.tld,nis-16-1,nis-16-2"
> > > nis_server_flags=-n
> > > nis_yppasswdd_flags="-t /var/yp/src/master.passwd -f -v"
> > > 
> > > defaultrouter=192.168.16.2
> > > 
> > > keyrate=fast
> > > 
> > > sendmail_flags="-bd -q5m"
> > > sendmail_submit_flags="$sendmail_flags -ODaemonPortOptions=Addr=localhost"
> > > sendmail_msp_queue_flags="-Ac -q30m"
> > > sendmail_rebuild_aliases=NO
> > > 
> > > lpd_enable=YES
> > > lpd_flags=-s
> > > chkprintcap_enable=YES
> > > dumpdev=AUTO
> > > clear_tmp_X=NO
> > > ldconfig_paths=/usr/local/lib
> > > ldconfig_paths_aout=""
> > > entropy_file=/boot/entropy-file
> > > 
> > > 
> > > rc.conf.local:
> > > --------------
> > > hostname=typhon.ofw.tld
> > > ifconfig_msk0="inet 192.168.24.1/21"
> > > ifconfig_msk0_alias0="inet 192.168.24.10/32"
> > > 
> > > named_enable=YES
> > > nfs_server_enable=YES
> > > 
> > > nis_client_flags="-s -S ofw.tld,nis-24-1,nis-24-2"
> > > nis_server_enable=YES
> > > 
> > > defaultrouter=192.168.24.2
> > > 
> > > lpd_flags=-l
> > > dumpdev=/dev/da1s1b
> > > quota_enable=YES
> > > 
> > > 
> > > fstab:
> > > ------
> > > /dev/da0s1a	/		ufs	noatime,rw				0 1
> > > /dev/da0s1b	none		swap	sw					0 0
> > > proc		/proc		procfs	rw					0 0
> > > /dev/da0s1d	/usr		ufs	noatime,rw				0 2
> > > /dev/da0s1e	/var		ufs	noatime,nosuid,rw			0 2
> > > 
> > > /dev/da10p1	/share2		ufs	suiddir,groupquota,noatime,nosuid,rw	0 2
> > > /dev/da10p2	/raid2		ufs	userquota,noatime,nosuid,rw		0 2
> > 
> > Thank you.  Can you show me output from the following?
> 
> Thanks to you for looking into this...
> 
> > 
> > * camcontrol devlist
> 
> <IBM DDRS-39130W S92A>             at scbus0 target 0 lun 0 (da0,pass0)
> <IBM DDRS-39130W S97B>             at scbus0 target 1 lun 0 (da1,pass1)
> <AMCC 9690SA-8I  DISK 4.10>        at scbus1 target 0 lun 0 (da10,pass2)
> 
> > * gpart show -p da1
> 
> =>      63  17849937    da1  MBR  (8.5G)
>         63  17849937  da1s1  freebsd  [active]  (8.5G)
> 
> And here is gpart show -p da1s1
> 
> =>       0  17849937   da1s1  BSD  (8.5G)
>          0        16          - free -  (8.0k)
>         16    599984  da1s1a  freebsd-ufs  (293M)
>     600000   2000000  da1s1d  freebsd-ufs  (976M)
>    2600000  11000000  da1s1e  freebsd-ufs  (5.3G)
>   13600000   4249937  da1s1b  freebsd-swap  (2.0G)
> 
> > 
> > I'm pretty sure I see the problem, but I want to be extra sure.
> 
> I am curious already!

Okay, theory #1 shot down -- you have a valid da1s1b.  I was curious
because rc.conf had dumpdev=AUTO, rc.conf.local had dumpdev=/dev/da1s1b,
and /etc/fstab made no mention of /dev/da1s1b (as swap).  So I was
thinking "oh, maybe he meant /dev/da0s1b" -- hence my camcontrol + gpart
request.  :-)
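
For anyone wanting to repeat that cross-check without doing it by eye, here
is a minimal sketch.  It assumes the files are read in the order given, with
later files winning (as rc.conf.local does over rc.conf).  The function names
effective_dumpdev and listed_as_swap are made up for illustration; they are
not part of rc(8).

```shell
#!/bin/sh
# Hypothetical helpers, not part of the base system.

# Print the last dumpdev= assignment found across the given files, in order.
# Later files win, mirroring rc.conf.local being read after rc.conf.
effective_dumpdev() {
    val=""
    for f in "$@"; do
        [ -r "$f" ] || continue
        # Last uncommented dumpdev= line in this file, if any.
        line=$(grep -E '^[[:space:]]*dumpdev=' "$f" | tail -n 1)
        [ -n "$line" ] || continue
        val=${line#*=}
        # Strip optional surrounding quotes.
        val=$(printf '%s' "$val" | tr -d "\"'")
    done
    printf '%s\n' "$val"
}

# Succeed if device $1 appears as a swap entry in the fstab-style file $2.
listed_as_swap() {
    awk -v dev="$1" '$1 == dev && $3 == "swap" { found = 1 } END { exit !found }' "$2"
}
```

On the box in question one would run something like
effective_dumpdev /etc/rc.conf /etc/rc.conf.local, then pass the result and
/etc/fstab to listed_as_swap.  A device that is not listed as swap is not an
error by itself; it only means the device is worth confirming, which is what
the camcontrol + gpart output did.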

I have 2 more possibilities in mind.  Could I get...

* Output from: sysctl -a hw | grep mem:

* Output from: uname -a  (you can hide the machine name if you want)

* Output from: strings /boot/kernel/kernel | egrep ^option
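
As an illustration of what output like this can be used for (not necessarily
either of the two possibilities above): a full, non-mini crash dump needs
roughly as much space on the dump device as the machine has RAM, so the
memory figure can be compared with the size of da1s1b.  A minimal sketch,
assuming 512-byte sectors for the gpart numbers; part_bytes and dump_fits
are made-up names.

```shell
#!/bin/sh
# Hypothetical helpers for a rough size comparison; not an exact rule,
# since minidumps need far less space than a full dump.

# Convert a sector count ($1) to bytes; sector size ($2) defaults to 512.
part_bytes() {
    echo $(( $1 * ${2:-512} ))
}

# Succeed if a partition of $2 sectors is at least as large as $1 bytes of RAM.
dump_fits() {
    physmem=$1
    size=$(part_bytes "$2" "${3:-512}")
    [ "$size" -ge "$physmem" ]
}
```

For example, dump_fits "$(sysctl -n hw.physmem)" 4249937 would test the
da1s1b size from the gpart output above against the machine's physical memory.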

Thanks.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |



More information about the freebsd-stable mailing list