[PATCH] Add ktrace records for user page faults

Kostik Belousov kostikbel at gmail.com
Mon May 2 20:16:06 UTC 2011


On Mon, May 02, 2011 at 04:02:02PM -0400, John Baldwin wrote:
> On Monday, May 02, 2011 3:55:55 pm Kostik Belousov wrote:
> > On Mon, May 02, 2011 at 03:37:19PM -0400, John Baldwin wrote:
> > > One thing I have found useful is knowing when processes are in the kernel 
> > > instead of in userland.  ktrace already provides records for syscall 
> > > entry/exit.  The other major source of time spent in the kernel that I've seen 
> > > is page fault handling.  To that end, I have a patch that adds ktrace records 
> > > to the beginning and end of VM faults.  This gives a pair of records so a user 
> > > can see how long a fault took (similar to how one can see how long a syscall 
> > > takes now).  Sample output from kdump is below:
> > > 
> > >  47565 echo     CALL  mmap(0x800a87000,0x179000,PROT_READ|
> > > PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
> > >  47565 echo     RET   mmap 34370777088/0x800a87000
> > >  47565 echo     PFLT  0x800723000 VM_PROT_EXECUTE
> > >  47565 echo     RET   KERN_SUCCESS
> > >  47565 echo     CALL  munmap(0x800887000,0x179000)
> > >  47565 echo     RET   munmap 0
> > >  47565 echo     PFLT  0x800a00000 VM_PROT_WRITE
> > >  47565 echo     RET   KERN_SUCCESS
> > > 
> > > The patch is available at www.freebsd.org/~jhb/patches/ktrace_fault.patch and 
> > > included below.
> > 
> > One immediate detail is that trap() truncates the fault address to the
> > page address, that arguably looses useful information.
> 
> It is true that it would be nice to have the exact faulting address, though
> having page granularity has been sufficient for the few times I've actually
> used the address itself (e.g. I could figure out which page of libstdc++ a
> fault occurred on and narrow down from there as to which of the routines most
> likely was executed given what the app was doing at the time).  In my case
> knowing how much time was spent handling a page fault has been useful.
> 
> Would we have to push this logic out of vm_fault and into every trap() routine
> to get the original address?  That would make the patch quite a bit bigger
> (touching N MD files vs 1 MI file).

Or do the reverse, making vm_fault() do trunc_page() [if doing this
change at all].

Also, I want to note another small detail, that is relevant if you plan
to MFC the change. In HEAD, vm_fault() is only called from trap()s, and
in-kernel page reads where substituted by vm_fault_hold() calls. This is
not true for stable.

Hm, was it indended to report faults from uiomove etc ?
I had this change (that relates to OOM handling) for long time, and
realized that it might be useful there.

diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c
index 4e5f8b8..0d1e68f 100644
--- a/sys/amd64/amd64/trap.c
+++ b/sys/amd64/amd64/trap.c
@@ -697,7 +697,8 @@ trap_pfault(frame, usermode)
 		PROC_UNLOCK(p);
 
 		/* Fault in the user page: */
-		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL);
+		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL |
+		    (usermode ? VM_FAULT_USERMODE : 0));
 
 		PROC_LOCK(p);
 		--p->p_lock;
diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c
index 5a8016c..236f295 100644
--- a/sys/i386/i386/trap.c
+++ b/sys/i386/i386/trap.c
@@ -856,7 +856,8 @@ trap_pfault(frame, usermode, eva)
 		PROC_UNLOCK(p);
 
 		/* Fault in the user page: */
-		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL);
+		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL |
+		    (usermode ? VM_FAULT_USERMODE : 0));
 
 		PROC_LOCK(p);
 		--p->p_lock;
diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
index d417a84..c1e87ae 100644
--- a/sys/vm/vm_fault.c
+++ b/sys/vm/vm_fault.c
@@ -408,6 +408,11 @@ RetryFault:;
 			 */
 			fs.m = NULL;
 			if (!vm_page_count_severe() || P_KILLED(curproc)) {
+				if (P_KILLED(curproc) && (fault_flags &
+				    VM_FAULT_USERMODE) != 0) {
+					unlock_and_deallocate(&fs);
+					return (KERN_RESOURCE_SHORTAGE);
+				}
 #if VM_NRESERVLEVEL > 0
 				if ((fs.object->flags & OBJ_COLORED) == 0) {
 					fs.object->flags |= OBJ_COLORED;
diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h
index 5311e02..7444549 100644
--- a/sys/vm/vm_map.h
+++ b/sys/vm/vm_map.h
@@ -326,6 +326,7 @@ long vmspace_wired_count(struct vmspace *vmspace);
 #define VM_FAULT_NORMAL 0		/* Nothing special */
 #define VM_FAULT_CHANGE_WIRING 1	/* Change the wiring as appropriate */
 #define	VM_FAULT_DIRTY 2		/* Dirty the page; use w/VM_PROT_COPY */
+#define	VM_FAULT_USERMODE 4		/* Fault initiated by usermode */
 
 /*
  * The following "find_space" options are supported by vm_map_find()
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 196 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-arch/attachments/20110502/fa8bf18f/attachment.pgp


More information about the freebsd-arch mailing list