svn commit: r222537 - in head/sys: kern sys

Fri Jun 3 10:47:12 UTC 2011

On Thu, 2 Jun 2011, Kenneth D. Merry wrote:

> On Wed, Jun 01, 2011 at 20:07:31 +1000, Bruce Evans wrote:
>> On Wed, 1 Jun 2011, Andriy Gapon wrote:
>>
>>> on 31/05/2011 20:29 Kenneth D. Merry said the following:
>>>> +	mtx_init(&mbp->msg_lock, "msgbuf", NULL, MTX_SPIN);
>>>
>>> Sorry that I didn't gather myself together for a review before this change
>>> got
>>> actually committed.
>>> Do you see any reason not to make this spinlock recursive?

It does need to be recursive to avoid deadlock (as I described), but this
prevents it actually working as a lock.  I think the new code needs it to
work as a lock.

>> I see reasons why it must not exist.  The message buffer code cannot use
>> any normal locking, and was carefully written to avoid doing so.
>>
>>> I am a little bit worried about "exotic" situations like receiving an NMI
>>> in the
>>> middle of printing and wanting to print in the NMI context, or similar
>>> things
>>> that penetrate contexts with disabled interrupts - e.g. Machine Check
>>> Exception.
>>> Also it's not clear to me if there won't any bigger damage in the
>>> situations
>>> like those described above.
>>>
>>> P.S. I have been thinking about fixing the problem in a different fashion,
>>> via
>>> reserving portions of dmesg buffer for a whole message using CAS:
>>> http://lists.freebsd.org/pipermail/freebsd-hackers/2010-April/031535.html
>>
>> Something like this might work.
>>
>> PRINTF_BUFR size should not exist either, especially now that it is
>> much more complicated and broken.  Here is some of my old mail about
>> adding (necessarily not normal) locking to to printf.  No one replied,
>> so I never finished this :-(.
>
> That particular solution doesn't lock the normal kernel printf path, and so
> won't fix what we're trying to fix.  (We've got lots of disks in each
> system, and when there are various SES events and disks arriving and
> departing, there are multiple kernel contexts doing printfs simultaneously,
> and that results in scrambled dmesg output.)

But it does.  PRINTF_BUFR_SIZE (at least used to) only be applied to
vprintf(), but my printf_lock() is applied to all calls to kvprintf()
that set TOLOG.  (If I missed some, this takes 2 lines each to fix.)
The calls in tprintf() and log() have comments saying that this is
intentional "to serialize msgbuf", though I fear this might not be safe
for at least tprintf(), since it may block unboundedly.  But even when
it blocks unboundedly, nothing really bad happens: the output just
becomes very slow due to everything waiting for the timeout until the
blockage clears, and may be interleaved.  Recursion gives similar
behaviour, except it takes unrecursing to clear the blockage.  I
consider the very slow output to be a feature.  Unbounded blockages
and recursion are bugs and the slow output makes them more obvious.

> I think the goals should be:
>
> - console output, syslog output, and dmesg output are not scrambled.
>   (i.e. individual kernel printfs make it out atomically, or at least
>   with a certain granularity, like PRINTF_BUFR_SIZE.)

Console output includes debugger output.  This must be able to break into
any otherwise-atomic kernel printf.  Possibly similarly for NMIs (if they
print.  They really shouldn't print, but they at least used to get this
horribly wrong from the beginning on i386).  Similarly for certain (non
maskable) traps.  These really shouldn't happen while we're printing, but
if they do then they may need to panic and then print for things like
machine check exceptions.

Debugger output is the easiest way of showing why printf cannot use
any normal locking.  There is an option (that I don't like or use)
which supports debugger output going through normal printf.  I think
the message buffer is used for this.  So message buffer accesses cannout
use normal locking either.

Apart from this, I don't know any reason why a printf or even a msgbuf
write can't wait for a while (a very long while if the output is to a
slow device -- don't forget to test with a 50 bps serial console) until
any previous printfs complete.  My simple serialization gives this if
the other printfs are on other CPUs, and possibly if they are on other
ithreads (the simple serialization doesn't give up control for simplicity,
since it is hard to tell when it is safe to give up control, but
preemption while we are spinning might get back to the other printfs).
And when the printf is recursive, there is no way back to the other
printfs, so we must proceed.  We could proceed immediately, but my
simple serialization waits in this case too for simplicity.  This case
should be even rarer than waiting too long for another CPU.

> - Can be called by any kernel routine (i.e. doesn't sleep)

Some complications for tprintf and uprintf (and maybe ordinary printf if
it is connected to a pty).  I think we already have the necessary
complications.  Something involving printing only to the message buffer
in the non-sleeping part.

> - Offers a string at a time interface.

I don't provide this.  So printf's a character or word or 2 at a time
may be interleaved.  It is perhaps not to much to ask that important
printfs that are likely to be interleaved preprint their strings into
lines.

> - Does the right thing for log messages (priority, etc.)
>
> It looks like we should add "does not use normal locking" to the list as
> well.
>
> As Justin mentioned, we started off down the path of using atomic
> operations, and then figured out that wouldn't fully work.  Then we
> discussed doing a per-CPU message buffer, with each message tagged with a
> sequence number, and having the reader re-serialize the messages in the
> right order.  The complexity started to get large enough that we decided
> that using a spin lock would be a much easier approach.

But almost eveyone who has worked on this before knows that a spinlock
cannot be used.

> cnputs() already uses a spinlock, so we're no worse off than before from a
> locking standpoint.

An obvious bug for PRINTF_BUFR_SIZE.  But as I said, this one might
work well enough if it were recursive.  This depends on the lower-level
console drivers supporting recursion.  Most don't.  Some have simple
serialization, but have to bypass it open to support debugger output,
much like higher levels except they have more control over the timing.
But in the message buffer accesses, I think you used the lock to provide
exlusive accesses.

> We could perhaps make the message buffer mutex recursive and set the
> MTX_NOWITNESS flag as well, that might make things a little better from a
> side effect standpoint.

I don't see how a recursive mutex can protect the message buffer.  Don't
forget to trace through the code protected by the mutex using ddb, while
at the same time using this code for ddb itself.  It's just too hard to
make this work.  Normally, this type of problem is avoided by avoiding
not using code like this for the low levels like ddb itself.  Ddb mostly
does this, but has complications:

% static void
% db_putc(int c)
% {
% 
% 	/*
% 	 * If not in the debugger or the user requests it, output data to
% 	 * both the console and the message buffer.
% 	 */
% 	if (!kdb_active || ddb_use_printf) {
% 		printf("%c", c);

Here it uses ordinary printf, so it uses whatever printf uses.  In
particular, PRINTF_BUFR_SIZE makes printf use cnputs() and the buggy
locking in it, and printf always uses the message buffer and I think
the new code gives buggy locking there unconditionally.

This problem is not very large, since it takes some foot shooting to get
here:
- there is some buggy code that uses db_printf() when !kdb_active.
   Perhaps this is not used recursively.
- we can configure ddb to use printf.  I think it is a feature that the
   message buffer gets used in this mode.

% 		if (!kdb_active)
% 			return;
% 		if (c == '\r' || c == '\n')
% 			db_check_interrupt();
% 		if (c == '\n' && db_maxlines > 0) {
% 			db_newlines++;
% 			if (db_newlines >= db_maxlines)
% 				db_pager();
% 		}
% 		return;
% 	}
% 
% 	/* Otherwise, output data directly to the console. */
% 	if (c > ' ' && c <= '~') {
% 	    /*
% 	     * Printing character.
% 	     * If we have spaces to print, print them first.
% 	     * Use tabs if possible.
% 	     */
% 	    db_force_whitespace();
% 	    cnputc(c);
% 	    db_capture_writech(c);
% 	    db_output_position++;
% 	    db_last_non_space = db_output_position;
% 	}

I think this routine is used mainly by db_printf() calling kvprintf() and
kvprintf() calling here.  kvprintf() is supposed to be reentrant and in
particular not use any locking to support this.  I think it still does.

Bruce