[PATCH] Add ktrace records for user page faults

Mon May 2 21:09:29 UTC 2011

On Monday, May 02, 2011 4:16:02 pm Kostik Belousov wrote:
> On Mon, May 02, 2011 at 04:02:02PM -0400, John Baldwin wrote:
> > On Monday, May 02, 2011 3:55:55 pm Kostik Belousov wrote:
> > > On Mon, May 02, 2011 at 03:37:19PM -0400, John Baldwin wrote:
> > > > One thing I have found useful is knowing when processes are in the kernel 
> > > > instead of in userland.  ktrace already provides records for syscall 
> > > > entry/exit.  The other major source of time spent in the kernel that I've seen 
> > > > is page fault handling.  To that end, I have a patch that adds ktrace records 
> > > > to the beginning and end of VM faults.  This gives a pair of records so a user 
> > > > can see how long a fault took (similar to how one can see how long a syscall 
> > > > takes now).  Sample output from kdump is below:
> > > > 
> > > >  47565 echo     CALL  mmap(0x800a87000,0x179000,PROT_READ|
> > > > PROT_WRITE,MAP_PRIVATE|MAP_ANON,0xffffffff,0)
> > > >  47565 echo     RET   mmap 34370777088/0x800a87000
> > > >  47565 echo     PFLT  0x800723000 VM_PROT_EXECUTE
> > > >  47565 echo     RET   KERN_SUCCESS
> > > >  47565 echo     CALL  munmap(0x800887000,0x179000)
> > > >  47565 echo     RET   munmap 0
> > > >  47565 echo     PFLT  0x800a00000 VM_PROT_WRITE
> > > >  47565 echo     RET   KERN_SUCCESS
> > > > 
> > > > The patch is available at www.freebsd.org/~jhb/patches/ktrace_fault.patch and 
> > > > included below.
> > > 
> > > One immediate detail is that trap() truncates the fault address to the
> > > page address, that arguably looses useful information.
> > 
> > It is true that it would be nice to have the exact faulting address, though
> > having page granularity has been sufficient for the few times I've actually
> > used the address itself (e.g. I could figure out which page of libstdc++ a
> > fault occurred on and narrow down from there as to which of the routines most
> > likely was executed given what the app was doing at the time).  In my case
> > knowing how much time was spent handling a page fault has been useful.
> > 
> > Would we have to push this logic out of vm_fault and into every trap() routine
> > to get the original address?  That would make the patch quite a bit bigger
> > (touching N MD files vs 1 MI file).
> 
> Or do the reverse, making vm_fault() do trunc_page() [if doing this
> change at all].

Ok.  That sounds sensible.

> Also, I want to note another small detail, that is relevant if you plan
> to MFC the change. In HEAD, vm_fault() is only called from trap()s, and
> in-kernel page reads where substituted by vm_fault_hold() calls. This is
> not true for stable.
> 
> Hm, was it indended to report faults from uiomove etc ?
> I had this change (that relates to OOM handling) for long time, and
> realized that it might be useful there.

I don't mind if it reports those faults as they will be bracketed by
a system call entry/exit.  However, I primarily care about user-initiated
faults.

> diff --git a/sys/amd64/amd64/trap.c b/sys/amd64/amd64/trap.c
> index 4e5f8b8..0d1e68f 100644
> --- a/sys/amd64/amd64/trap.c
> +++ b/sys/amd64/amd64/trap.c
> @@ -697,7 +697,8 @@ trap_pfault(frame, usermode)
>  		PROC_UNLOCK(p);
>  
>  		/* Fault in the user page: */
> -		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL);
> +		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL |
> +		    (usermode ? VM_FAULT_USERMODE : 0));
>  
>  		PROC_LOCK(p);
>  		--p->p_lock;
> diff --git a/sys/i386/i386/trap.c b/sys/i386/i386/trap.c
> index 5a8016c..236f295 100644
> --- a/sys/i386/i386/trap.c
> +++ b/sys/i386/i386/trap.c
> @@ -856,7 +856,8 @@ trap_pfault(frame, usermode, eva)
>  		PROC_UNLOCK(p);
>  
>  		/* Fault in the user page: */
> -		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL);
> +		rv = vm_fault(map, va, ftype, VM_FAULT_NORMAL |
> +		    (usermode ? VM_FAULT_USERMODE : 0));
>  
>  		PROC_LOCK(p);
>  		--p->p_lock;
> diff --git a/sys/vm/vm_fault.c b/sys/vm/vm_fault.c
> index d417a84..c1e87ae 100644
> --- a/sys/vm/vm_fault.c
> +++ b/sys/vm/vm_fault.c
> @@ -408,6 +408,11 @@ RetryFault:;
>  			 */
>  			fs.m = NULL;
>  			if (!vm_page_count_severe() || P_KILLED(curproc)) {
> +				if (P_KILLED(curproc) && (fault_flags &
> +				    VM_FAULT_USERMODE) != 0) {
> +					unlock_and_deallocate(&fs);
> +					return (KERN_RESOURCE_SHORTAGE);
> +				}
>  #if VM_NRESERVLEVEL > 0
>  				if ((fs.object->flags & OBJ_COLORED) == 0) {
>  					fs.object->flags |= OBJ_COLORED;
> diff --git a/sys/vm/vm_map.h b/sys/vm/vm_map.h
> index 5311e02..7444549 100644
> --- a/sys/vm/vm_map.h
> +++ b/sys/vm/vm_map.h
> @@ -326,6 +326,7 @@ long vmspace_wired_count(struct vmspace *vmspace);
>  #define VM_FAULT_NORMAL 0		/* Nothing special */
>  #define VM_FAULT_CHANGE_WIRING 1	/* Change the wiring as appropriate */
>  #define	VM_FAULT_DIRTY 2		/* Dirty the page; use w/VM_PROT_COPY */
> +#define	VM_FAULT_USERMODE 4		/* Fault initiated by usermode */
>  
>  /*
>   * The following "find_space" options are supported by vm_map_find()
> 

-- 
John Baldwin