kern/111831: page fault while in kernel mode with samba in vfs_vmio_release

Adam McDougall mcdouga9 at egr.msu.edu
Wed Apr 18 19:30:03 UTC 2007


>Number:         111831
>Category:       kern
>Synopsis:       page fault while in kernel mode with samba in vfs_vmio_release
>Confidential:   no
>Severity:       serious
>Priority:       high
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Apr 18 19:30:02 GMT 2007
>Closed-Date:
>Last-Modified:
>Originator:     Adam McDougall
>Release:        FreeBSD 6.2-STABLE #1: Tue Apr 17 11:55:07 EDT 2007
>Organization:
>Environment:
FreeBSD 6.2-STABLE #1: Tue Apr 17 11:55:07 EDT 2007
  amd64  root at ghost2:/usr/obj/usr/src/sys/X4100

>Description:
Background: I have some Samba servers I set up recently that serve all of their data from NFSv3 mounts.  I generally have around 400+ total concurrent Samba connections during the day.  Access to the two servers is logically controlled by a Foundry load balancer, but depending on the situation we may have only one server running.  The servers "ghost2" and "niobe2" are dual-CPU, dual-core Opteron Sun Fire X4100 M2 systems from Sun, running a very recent 6-STABLE in amd64 mode.  I have also tried the same setup on some dual-Xeon 2.0 GHz Dell PowerEdge 2650 systems.

FreeBSD stays up for only a few hours while in production, and then it panics.  I have not been able to establish a repeatable test case other than putting a server into production and waiting.  I prefer to do this as little as possible because the clients have trouble whenever I have to fall back to the old Samba server, which I want to replace.

The panic is always a Fatal trap 12: page fault while in kernel mode, and going by memory I am fairly sure it is always "supervisor read data, page not present" with a very low (two-digit) fault virtual address, in vfs_vmio_release.
During earlier crashes while using DDB_UNATTENDED I never got a kernel coredump, and had to work from the pointer values in the trap message to determine that vfs_vmio_release was involved.  Today was the first time I had prepared enough that I could let the servers drop into DDB instead of trying to reboot, so I could do some live debugging without worrying about getting the server back up as soon as possible.  Both ghost2 and niobe2 are running the same binary world and kernel from ghost2.
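For reference, my understanding is that a very low fault virtual address usually means the kernel dereferenced a member of a structure through a NULL pointer, since the faulting address is then just the member's offset within the structure.  The small userland program below only illustrates that idea; the structure and field names are made up for the example and are not FreeBSD code.

    #include <stddef.h>
    #include <stdio.h>

    /*
     * Hypothetical structure, for illustration only -- not a FreeBSD type.
     * If a pointer p to such a structure is NULL and code reads p->npages,
     * the faulting virtual address is simply offsetof(struct demo, npages),
     * i.e. a small number close to zero.
     */
    struct demo {
            void    *vp;            /* offset 0x00 on amd64 */
            long     flags;         /* offset 0x08 */
            int      npages;        /* offset 0x10 */
    };

    int
    main(void)
    {
            printf("fault address for p->npages with p == NULL: 0x%zx\n",
                offsetof(struct demo, npages));
            return (0);
    }

On amd64 the pointer and long members are 8 bytes each, so the npages member lands at offset 0x10, which is the kind of small two-digit address I keep seeing in the trap messages.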

This morning, ghost2 panicked much earlier than I expected, and the DDB trace involved FFS while the current process was smbstatus (which I was running once per minute from a script).  At that point all of the client load was shifted over to niobe2 by the load balancer.  niobe2 survived until noon, when it panicked in a similar manner, but the current process was smbd and the trace involved NFS.  Both panics were in vfs_vmio_release.

Both servers will remain in DDB waiting for further probing; I have no reason to reboot them until there is a proposed solution or workaround to test.  I don't know what else to do in DDB until I am pointed to a guide or instructed on how I can help further a solution for this case.  I am not a coder and have only basic kernel debugging skills.  I do have several other servers available for testing, but I would have to put them into production to reproduce the problem.  Please let me know what I can do.  I hope I have not forgotten anything important.  Thanks.

The following URL contains the kernel config, the DDB output from the panic (ps, trace, show pcpu/allpcpu/lockedvnods) on both servers, and dmesg.
http://www.egr.msu.edu/~mcdouga9/x4100

>How-To-Repeat:
Unsure how to repeat on demand; the servers must be put into production to produce a panic.
>Fix:

>Release-Note:
>Audit-Trail:
>Unformatted: