Page fault

Tue Nov 4 11:20:19 PST 2003

On Tue, 4 Nov 2003, Nils Andreas Hakansson wrote:

> I've disabled softupdates because of
> a panic("softdep_move_dependencies: need merge code");

Can't comment on this bit.  Might want to send e-mail to Kirk directly.

> Could someone take a look at this?
> 
> pst: timeout mfa=0x0032d5d0 cmd=0x02
> pst: timeout mfa=0x00336390 cmd=0x02
> pst: timeout mfa=0x0034cdd0 cmd=0x02
> <cut>
> pst: timeout mfa=0x003b7ab0 cmd=0x02
> pst: timeout mfa=0x00396db0 cmd=0x02
> pst: timeout mfa=0x003a3530 cmd=0x02
> pst: timeout mfa=0x00376890 cmd=0x02

This is your storage device getting unhappy, but I'm not really informed
enough on pst to say how or why.  I don't know whether the requests
themselves are bad, or whether the controller/chain/device is unable to
service them.

> ufs_access(): Error retrieving ACL on object (5).
> <cut>
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).
> ufs_access(): Error retrieving ACL on object (5).

This is the UFS ACL code failing closed: it's unable to read the ACLs from
disk due to EIO (I/O failure).  This is a correct response to that
scenario.
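
To illustrate what "failing closed" means here, the following is a minimal
sketch in C, not the actual ufs_access() code.  The names read_acl() and
check_access() are hypothetical stand-ins: if the ACL cannot be read, the
check logs the error and denies access rather than falling back to a more
permissive decision.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical stand-in for reading an ACL from disk.  Returns 0 on
 * success and sets *allowed_out, or returns an errno value such as EIO.
 */
static int
read_acl(int simulate_io_error, int *allowed_out)
{
	if (simulate_io_error)
		return (EIO);
	*allowed_out = 1;
	return (0);
}

/*
 * Fail-closed access check: any error while retrieving the ACL results
 * in denial (EACCES), never in an implicit grant.
 */
static int
check_access(int simulate_io_error)
{
	int allowed = 0;
	int error;

	error = read_acl(simulate_io_error, &allowed);
	if (error != 0) {
		fprintf(stderr, "check_access: error retrieving ACL (%d)\n",
		    error);
		return (EACCES);
	}
	return (allowed ? 0 : EACCES);
}
```

With this sketch, check_access(0) grants access and check_access(1) denies
it after logging the underlying error number.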

> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; lapic.id = 00000000
> fault virtual address   = 0xae18c0de
> fault code              = supervisor read, page not present
> instruction pointer     = 0x8:0xc066a566
> stack pointer           = 0x10:0xea3a78cc
> frame pointer           = 0x10:0xea3a7900
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 76932 (smbd)
> kernel: type 12 trap, code=0
> Stopped at      generic_bcopy+0x1a:     repe movsl      (%esi),%es:(%edi)
> db> trace
> generic_bcopy(cf6b0000,1a8,2,c06bd12c,0) at generic_bcopy+0x1a
> ffs_getextattr(ea3a7960,ea3a795c,c05159ad,d0346200,184) at ffs_getextattr+0xe0

This appears to be a bug in UFS2's handling of corrupted EA data on disk.
We have some changes in the TrustedBSD development trees to improve
resilience to on-disk corruption, but haven't merged them yet.  Just to
confirm, could you use "gdb -k" on a copy of your kernel with debugging
symbols to see where *ffs_getextattr+0xe0 is?  For me, it turns up in
ffs_vnops.c:1616, which is a variable assignment.  There's a bcopy not far
above there, which seems the likely candidate.
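
As a rough illustration of the kind of resilience meant here, the sketch
below validates an attribute's offset and length against the bounds of the
EA area before copying.  It is not the FreeBSD or TrustedBSD code; the
function ea_copy_checked() and its parameters are hypothetical, and the
real on-disk EA layout is not modelled.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy an attribute value of `len` bytes found at
 * offset `off` inside an EA area of `area_len` bytes.  The offset and
 * length are treated as untrusted, since they come from disk.
 */
static int
ea_copy_checked(const uint8_t *area, size_t area_len, size_t off,
    size_t len, uint8_t *dst, size_t dst_len)
{
	/* Reject records that start or end outside the EA area. */
	if (off > area_len || len > area_len - off)
		return (EIO);
	/* Reject values that do not fit in the caller's buffer. */
	if (len > dst_len)
		return (ERANGE);
	memcpy(dst, area + off, len);
	return (0);
}
```

The comparison is written as len > area_len - off rather than
off + len > area_len so that a very large corrupted length cannot wrap
around and pass the check.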

> vn_extattr_get(cb1a8c8c,8,2,c06bd12c,ea3a79d0) at vn_extattr_get+0xaa
> ufs_getacl(ea3a7a14,ea3a7a40,c061560b,ea3a7a14,c06df280) at ufs_getacl+0x99
> ufs_vnoperate(ea3a7a14,c06df280,2,a6,c853cd10) at ufs_vnoperate+0x18
> ufs_access(ea3a7a6c,ea3a7b28,c057dcc9,ea3a7a6c,c0716cc8) at ufs_access+0xca
> ufs_vnoperate(ea3a7a6c,c0716cc8,c0716cc8,c853cd10,cb1a8c8c) at ufs_vnoperate+0x18
> vn_open_cred(ea3a7bdc,ea3a7cdc,1a4,d0bb7800,22) at vn_open_cred+0x359
> vn_open(ea3a7bdc,ea3a7cdc,1a4,22,c3ee0fb4) at vn_open+0x30
> kern_open(c853cd10,bfbff130,0,1,1a4) at kern_open+0x143
> open(c853cd10,ea3a7d14,c06c44d0,3ed,3) at open+0x30
> syscall(bfbf002f,82b002f,bfbf002f,bfbffd70,82b3724) at syscall+0x28f
> Xint0x80_syscall() at Xint0x80_syscall+0x1d
> --- syscall (5, FreeBSD ELF32, open), eip = 0x662b5233, esp = 0xbfbff07c, ebp = 0xbfbff098 ---
> db> show locks
> exclusive sleep mutex Giant r = 0 (0xc07115c0) locked @ /usr/src/sys/vm/vm_fault.c:223

Holding Giant here is good.  So to summarize:

- This could be the result of a disk read failure.
- The UFS code appears to be intolerant of said failure.
- The ACL code failed closed properly, although perhaps not so usefully.

Robert N M Watson             FreeBSD Core Team, TrustedBSD Projects
robert at fledge.watson.org      Network Associates Laboratories