kern/176636: Periodical crashes with 9.1-R

Mon Mar 4 13:10:01 UTC 2013

>Number:         176636
>Category:       kern
>Synopsis:       Periodical crashes with 9.1-R
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Mar 04 13:10:00 UTC 2013
>Closed-Date:
>Last-Modified:
>Originator:     Rasmus Skaarup
>Release:        9.1-RELEASE
>Organization:
>Environment:
FreeBSD dentredje.dvconsulting.dk 9.1-RELEASE FreeBSD 9.1-RELEASE #0 r243825: Tue Dec  4 09:23:10 UTC 2012     root@farrell.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

>Description:

I've been having trouble with one of our virtual machines running FreeBSD under KVM on a CentOS host. The other virtual machine runs without issues, but this one keeps crashing. We moved it to another physical machine to rule out hardware issues, and the crashes still occur.

Today the machine crashed three times. 

Crash 0 (2nd of March):

Unread portion of the kernel message buffer:
panic: bad pte
cpuid = 0
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bcffe8 at pmap_remove_pages+0x3a8
#3 0xffffffff80b49d9a at vmspace_exit+0x9a
#4 0xffffffff808b9d69 at exit1+0x379
#5 0xffffffff808bac3e at sys_sys_exit+0xe
#6 0xffffffff80bd7ae6 at amd64_syscall+0x546
#7 0xffffffff80bc3447 at Xfast_syscall+0xf7
Uptime: 3d4h17m27s

(gdb) l *pmap_remove_pages+0x3a8
0xffffffff80bcffe8 is in pmap_remove_pages (/usr/src/sys/amd64/amd64/pmap.c:4183).
4178	
4179					/*
4180					 * Update the vm_page_t clean/reference bits.
4181					 */
4182					if ((tpte & (PG_M | PG_RW)) == (PG_M | PG_RW)) {
4183						if ((tpte & PG_PS) != 0) {
4184							for (mt = m; mt < &m[NBPDR / PAGE_SIZE]; mt++)
4185								vm_page_dirty(mt);
4186						} else
4187							vm_page_dirty(m);
(gdb) 

Crash 1 (4th of March):

panic: page fault
cpuid = 2
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bd8240 at trap_fatal+0x290
#3 0xffffffff80bd857d at trap_pfault+0x1ed
#4 0xffffffff80bd8b9e at trap+0x3ce
#5 0xffffffff80bc315f at calltrap+0x8
#6 0xffffffff80b49d9a at vmspace_exit+0x9a
#7 0xffffffff808b9d69 at exit1+0x379
#8 0xffffffff808bac3e at sys_sys_exit+0xe
#9 0xffffffff80bd7ae6 at amd64_syscall+0x546
#10 0xffffffff80bc3447 at Xfast_syscall+0xf7

(gdb) l *vmspace_exit+0x9a
0xffffffff80b49d9a is in vmspace_exit (/usr/src/sys/vm/vm_map.c:427).
422			pmap_remove_pages(vmspace_pmap(vm));
423			/* Switch now since this proc will free vmspace */
424			PROC_VMSPACE_LOCK(p);
425			p->p_vmspace = &vmspace0;
426			PROC_VMSPACE_UNLOCK(p);
427			pmap_activate(td);
428			vmspace_dofree(vm);
429		}
430		vmspace_container_reset(p);
431	}
(gdb) 

Crash 2 (also 4th of March):

cpuid = 0
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80b50923 at vm_page_free_toq+0x273
#3 0xffffffff816bd5ba at zfs_freebsd_read+0x62a
#4 0xffffffff8099113d at vn_rdwr+0x1ad
#5 0xffffffff8095c9dd at kern_sendfile+0xdad
#6 0xffffffff8095d12c at do_sendfile+0xdc
#7 0xffffffff80bd7ae6 at amd64_syscall+0x546
#8 0xffffffff80bc3447 at Xfast_syscall+0xf7

(gdb) l *vm_page_free_toq+0x273
0xffffffff80b50923 is in vm_page_free_toq (/usr/src/sys/vm/vm_page.c:1886).
1881	
1882		m->valid = 0;
1883		vm_page_undirty(m);
1884	
1885		if (m->wire_count != 0)
1886			panic("vm_page_free: freeing wired page %p", m);
1887		if (m->hold_count != 0) {
1888			m->flags &= ~PG_ZERO;
1889			vm_page_lock_queues();
1890			vm_page_enqueue(PQ_HOLD, m);
(gdb) 

Crash 3 (also 4th of March):

panic: vm_page_free: freeing busy page 0xfffffe00d5d2db38
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80b50923 at vm_page_free_toq+0x273
#3 0xffffffff816bd5ba at zfs_freebsd_read+0x62a
#4 0xffffffff8099113d at vn_rdwr+0x1ad
#5 0xffffffff8095c9dd at kern_sendfile+0xdad
#6 0xffffffff8095d12c at do_sendfile+0xdc
#7 0xffffffff80bd7ae6 at amd64_syscall+0x546
#8 0xffffffff80bc3447 at Xfast_syscall+0xf7

The last two seem similar, but otherwise no pattern is obvious to me.
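
To make the comparison less dependent on eyeballing, here is a rough sketch that extracts the function names from the four backtraces above and lists the frames each pair has in common. The helper names (frames, distinctive, shared) and the GENERIC set are made up for illustration; treating kdb_backtrace, panic, amd64_syscall and Xfast_syscall as uninformative is an assumption, not something the kernel reports.

```python
import re

# A KDB frame line looks like: "#2 0xffffffff80b50923 at vm_page_free_toq+0x273"
FRAME_RE = re.compile(r"^#\d+\s+0x[0-9a-f]+\s+at\s+(\w+)\+0x[0-9a-f]+$")

# Assumption: these frames appear in every trace here and carry no signal.
GENERIC = {"kdb_backtrace", "panic", "amd64_syscall", "Xfast_syscall"}


def frames(trace):
    """Return the function names in a KDB backtrace, innermost frame first."""
    names = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            names.append(match.group(1))
    return names


def distinctive(trace):
    """Function names with the frames listed in GENERIC dropped."""
    return [name for name in frames(trace) if name not in GENERIC]


def shared(trace_a, trace_b):
    """Distinctive frames of trace_a that also occur in trace_b, in trace_a order."""
    other = set(distinctive(trace_b))
    return [name for name in distinctive(trace_a) if name in other]


# Frame lines copied verbatim from the crash reports above.
CRASH_0 = """
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bcffe8 at pmap_remove_pages+0x3a8
#3 0xffffffff80b49d9a at vmspace_exit+0x9a
#4 0xffffffff808b9d69 at exit1+0x379
#5 0xffffffff808bac3e at sys_sys_exit+0xe
#6 0xffffffff80bd7ae6 at amd64_syscall+0x546
#7 0xffffffff80bc3447 at Xfast_syscall+0xf7
"""

CRASH_1 = """
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80bd8240 at trap_fatal+0x290
#3 0xffffffff80bd857d at trap_pfault+0x1ed
#4 0xffffffff80bd8b9e at trap+0x3ce
#5 0xffffffff80bc315f at calltrap+0x8
#6 0xffffffff80b49d9a at vmspace_exit+0x9a
#7 0xffffffff808b9d69 at exit1+0x379
#8 0xffffffff808bac3e at sys_sys_exit+0xe
#9 0xffffffff80bd7ae6 at amd64_syscall+0x546
#10 0xffffffff80bc3447 at Xfast_syscall+0xf7
"""

CRASH_2 = """
#0 0xffffffff809208a6 at kdb_backtrace+0x66
#1 0xffffffff808ea8be at panic+0x1ce
#2 0xffffffff80b50923 at vm_page_free_toq+0x273
#3 0xffffffff816bd5ba at zfs_freebsd_read+0x62a
#4 0xffffffff8099113d at vn_rdwr+0x1ad
#5 0xffffffff8095c9dd at kern_sendfile+0xdad
#6 0xffffffff8095d12c at do_sendfile+0xdc
#7 0xffffffff80bd7ae6 at amd64_syscall+0x546
#8 0xffffffff80bc3447 at Xfast_syscall+0xf7
"""

# The frame lines of crash 3 are identical to those of crash 2 in this report.
CRASH_3 = CRASH_2

if __name__ == "__main__":
    crashes = {0: CRASH_0, 1: CRASH_1, 2: CRASH_2, 3: CRASH_3}
    for a in crashes:
        for b in crashes:
            if a < b:
                print("crash %d / crash %d: %s" % (a, b, shared(crashes[a], crashes[b])))
```

On these traces the only overlaps are between crashes 0 and 1 (vmspace_exit, exit1, sys_sys_exit, i.e. the process exit path) and between crashes 2 and 3 (the whole sendfile path down to vm_page_free_toq). Whether the two groups have a common cause cannot be told from the frame names alone.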

The core and info text files for all crashes are available as a zip archive (98 KB) here:

http://gal.dk/text-files.zip

Cores are available upon request; their sizes vary:

-rw-------   1 root  wheel  2482114560 Mar  2 15:04 vmcore.0
-rw-------   1 root  wheel  2597650432 Mar  4 03:08 vmcore.1
-rw-------   1 root  wheel  1687846912 Mar  4 12:48 vmcore.2
-rw-------   1 root  wheel  1076121600 Mar  4 13:06 vmcore.3

>How-To-Repeat:
Occurs during regular production use. The machine functions as a web and mail server (with MySQL also running on localhost).

>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: