Updated rusage patch

Bruce Evans brde at optusnet.com.au
Fri Jun 1 08:18:54 UTC 2007


On Thu, 31 May 2007, Jeff Roberson wrote:

> On Thu, 31 May 2007, Bruce Evans wrote:
>
>> On Wed, 30 May 2007, Jeff Roberson wrote:
>> 
>> % Index: kern/kern_exit.c
>> % ===================================================================
>> % RCS file: /usr/home/ncvs/src/sys/kern/kern_exit.c,v
>> % retrieving revision 1.298
>> % diff -u -p -r1.298 kern_exit.c
>> % --- kern/kern_exit.c	14 May 2007 22:21:58 -0000	1.298
>> % +++ kern/kern_exit.c	30 May 2007 00:10:21 -0000
>> % @@ -169,7 +170,8 @@ retry:
>> %  		 * Threading support has been turned off.
>> %  		 */
>> %  	}
>> % -
>> % +	KASSERT(p->p_numthreads == 1,
>> % +	    ("exit1: proc %p exiting with %d threads", p, p->p_numthreads));
>> %  	/*
>> %  	 * Wakeup anyone in procfs' PIOCWAIT.  They should have a hold
>> %  	 * on our vmspace, so we should block below until they have
>> 
>> Lost some formatting.
>
> The extra newline?  Isn't that a style violation?

This is arguable.  Although strict KNF may require no extra blank lines
(indent -sob), and this rule is not too bad for block comments because
the line with the "/*" provides some separation, there are many examples
of old code in kern, including many in this file, where a blank line
is left before the block comment.  In the 4.4Lite2 kern_exit.c, the
count is approx. 10:8 in favour of no extra blank line before block
comments within functions (not counting cases where there clearly
shouldn't be one, e.g., after #ifdef).  In the current kern_exit.c
after this change was committed, the count is 32:13 in favour of a
blank line (down from 33:12 before this change).

My version of style.9 requires leaving a blank line before comments
and not leaving a blank line before statements (if you want to leave
a blank line between statements to form separate blocks of statements,
there must be a comment at the start of each new block).
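
For concreteness, a made-up fragment in that style (all names
hypothetical):

        static int
        frobnicate(struct proc *p)
        {
                int error;

                error = do_thing(p);
                if (error != 0)
                        return (error);

                /*
                 * The blank line above is required before this comment,
                 * and the comment is what justifies starting a separate
                 * block of statements here.
                 */
                do_other_thing(p);
                do_third_thing(p);      /* no blank line before a bare statement */
                return (0);
        }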

>> % @@ -438,15 +442,17 @@ retry:
>> %  		PROC_UNLOCK(q);
>> %  	}
>> % 
>> % -	/*
>> % -	 * Save exit status and finalize rusage info except for times,
>> % -	 * adding in child rusage info later when our time is locked.
>> % -	 */
>> % +	/* Save exit status. */
>> %  	PROC_LOCK(p);
>> %  	p->p_xstat = rv;
>> %  	p->p_xthread = td;
>> % -	p->p_stats->p_ru.ru_nvcsw++;
>> % -	*p->p_ru = p->p_stats->p_ru;
>> % +	/*
>> % +	 * All statistics have been aggregated into the final td_ru by
>> % +	 * thread_exit(), just copy it into p_ru here.
>> % +	 */
>> % +	*ru = td->td_ru;
>> % +	ru->ru_nvcsw++;
>> % +	p->p_ru = ru;
>> % 
>> %  	/*
>> %  	 * Notify interested parties of our demise.
>> 
>> Still have various bugs here [mainly races with statclock].

>> These bugs can probably be fixed by moving the copying to the final
>> thread_exit().  Maybe acct_process() can be moved there too, but I
>> suspect too much of the process has gone away by then for acct_process()
>> to work.
>
> I tried moving the rusage summing and assignment to p_ru to later in this 
> function.  This led to a race which produced "runtime went backwards" 
> printfs because runtime became 0.  I didn't quite understand this as we check 
> for PRS_ZOMBIE in wait, but I also didn't investigate terribly far.

Some technical problem.

>> Oops, I now see that you removed all the (j)s and sched locking here.
>> But this isn't quite right -- many of the fields are accessed by
>> statclock(), so something like interrupt atomicity or splhigh() is
>> required to access them.  The above (mainly in exit1()) gives much
>> the same races as in RELENG_4:
>> - RELENG_4: statclock() uses splhigh() but not interrupt atomicity.
>>             exit1() uses no locking and thus races with statclock().
>> - above:    statclock() still uses sched_lock but not interrupt atomicity.
>>             exit1() uses no locking and thus races with statclock().
>> Time fields are mostly in rux and still fully locked by sched_lock.
>> exit1() copies some of them to p_ru, but that copy is not used.  I
>> think proc locking is still used for p_ru -- it is used in kern_wait(),
>> where it seems to be necessary to prevent multiple threads in the
>> parent process racing to reap the child.
>
> I believe the sched lock locking is necessary for rux.  For rusage, however, 
> it's only ever modified by curthread.  The only race present when reading 
> without locks is a potentially inconsistent rusage, where the stats are not 
> all from a single snapshot of the program.  However, the stats can obviously 
> change before the copy makes it back to user-space anyhow.  I don't see a 
> real race that needs protecting.

No, at least the ru_*rss fields are updated by statclock().  These
seem to be the only problematic fields.  They are unimportant.  However,
they should be easy to protect correctly using the same lock as the
td_*ticks fields.  They are still protected by sched_lock in statclock()
and rufetch() but not here.
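
(Paraphrasing from memory -- not verbatim -- the statclock() updates in
question are something like:

        vm = p->p_vmspace;
        ru = &td->td_ru;
        ru->ru_ixrss += pgtok(vm->vm_tsize);    /* integral text rss */
        ru->ru_idrss += pgtok(vm->vm_dsize);    /* integral data rss */
        ru->ru_isrss += pgtok(vm->vm_ssize);    /* integral stack rss */
        rss = pgtok(vmspace_resident_count(vm));
        if (ru->ru_maxrss < rss)
                ru->ru_maxrss = rss;

i.e., read-modify-write updates of td_ru from "interrupt" context, so a
reader that wants a consistent copy must keep statclock() out.)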

Hmm, this is confusing.  Normal locking is not used for thread-local
fields.  Instead, a side effect of spinlocking is used:
mtx_lock_spin(&any) in non-interrupt code has the side effect of
masking interrupts on the current CPU, so statclock() can access
anything on the current CPU without acquiring any locks, just like
an interrupt handler on a UP system.  This is used for td_*ticks.
It doesn't work for ru_*rss since exit1() doesn't hold a
spinlock when copying td_ru.  The sched_locking of ru_*rss in
statclock() doesn't help -- I think it now does nothing except
waste time, since these fields are now thread-local so they need
the same locking as td_*ticks, which is null in statclock() but
non-null elsewhere.
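
Something like the following in exit1() (or better, the final
thread_exit()) should be enough -- a sketch only; any spin mutex works,
since the point is just that interrupts are masked on the current CPU
while we hold it:

        /*
         * Holding a spin mutex masks interrupts on this CPU, so
         * statclock() cannot fire here and cannot change td_ru or
         * the td_*ticks fields underneath the copy.
         */
        mtx_lock_spin(&sched_lock);
        *ru = td->td_ru;
        ru->ru_nvcsw++;
        mtx_unlock_spin(&sched_lock);
        p->p_ru = ru;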

Related bugs:
- td_[usip]ticks are still under (j) (sched_lock) in proc.h.
- td_{uu,us}ticks have always (?) been under (k) (thread-local).  That
   is more correct than (j), but they are updated by an interrupt handler
   and seem to be accessed without disabling interrupts elsewhere.
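
(Key, for readers without proc.h handy -- illustrative excerpt, not
verbatim:

        u_int   td_uticks;      /* (j) Statclock hits in user mode. */
        u_int   td_uuticks;     /* (k) Statclock hits in user, for UTS. */

where (j) means "locked by sched_lock" and (k) means "thread-local,
only written by curthread".)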

> This is why I believe the code in exit1() is also safe

Modification by only curthread isn't quite enough -- see above.

> although technically 
> it could lose some information on the way to thread_exit().  To fix this we'd 
> have to do the final accumulation under sched_lock the last time we lock it. 
> However, there are other races in the proc code there that were not 
> immediately obvious.  Also, doing the accumulation here negates the 
> possibility of running process accounting after p_ru is valid, which it does 
> not do now.

I see.

> Related to rusage but unrelated to my patch: why are ixrss, idrss, and isrss 
> expressed in units of bytes*ticks of execution when rusage doesn't report the 
> ticks of execution but rather a timeval?  Are consumers expected to sum the 
> timevals and then extrapolate?

I don't really know, but guess this is historical.  Ticks can be managed
more efficiently than timevals.  There are remarkably few consumers --
mainly time(1), csh and window(1) in /usr/src.  csh still seems to
hard-code the tick frequency as 100.  This would make all rss values off
by a factor of stathz/100.  time(1) hard-coded the frequency as 100 in
FreeBSD-1.1.
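
E.g., a consumer has to reconstruct the tick count from the timevals,
something like this userland sketch (hz_assumed is my name for the
hard-coded frequency; on FreeBSD the ru_i*rss integrals are in
kilobytes * ticks):

        #include <sys/time.h>
        #include <sys/resource.h>
        #include <stdio.h>

        int
        main(void)
        {
                struct rusage ru;
                long ticks, hz_assumed = 100;   /* what csh hard-codes */

                if (getrusage(RUSAGE_SELF, &ru) != 0)
                        return (1);
                /* Reconstruct ticks of execution from the timevals. */
                ticks = (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) *
                    hz_assumed + (ru.ru_utime.tv_usec +
                    ru.ru_stime.tv_usec) * hz_assumed / 1000000;
                if (ticks == 0)
                        ticks = 1;              /* avoid dividing by 0 */
                /* integral rss / ticks ~= average rss in kB */
                printf("avg text rss:  %ld kB\n", ru.ru_ixrss / ticks);
                printf("avg data rss:  %ld kB\n", ru.ru_idrss / ticks);
                printf("avg stack rss: %ld kB\n", ru.ru_isrss / ticks);
                return (0);
        }

If the kernel actually accumulates at stathz, everything here is off by
the same stathz/100 factor as above.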

Bruce

