random FreeBSD panics

Mon Mar 29 20:30:50 UTC 2010

On Mon, Mar 29, 2010 at 02:27:34PM -0400, John Baldwin wrote:
> On Monday 29 March 2010 1:30:38 pm Jeremy Chadwick wrote:
> > On Mon, Mar 29, 2010 at 05:01:02PM +0000, Masoom Shaikh wrote:
> > > On Sun, Mar 28, 2010 at 5:38 PM, Ivan Voras <ivoras at freebsd.org> wrote:
> > > > On 28 March 2010 16:42, Masoom Shaikh <masoom.shaikh at gmail.com> wrote:
> > > >
> > > >> Let's assume this is a h/w problem; then how do other OSes overcome
> > > >> it? Is there a way to make FreeBSD ignore it as well, even if that
> > > >> results in a reasonable performance penalty?
> > > >
> > > > Very probably, if only we could detect where the problem is.
> > > > Try adding "options     PRINTF_BUFR_SIZE=128" to the kernel
> > > 
> > > this option is already there
> > 
> > The key word in Ivan's phrase is "less mangled".  Neither using nor
> > increasing PRINTF_BUFR_SIZE solves the problem of interspersed console
> > output.  I've been ranting/raving about this problem for years now; it
> > truly looks like a mutex lock issue (or lack of such lock), but I've
> > been told numerous times that isn't the case.
> > 
> > To developers: what incentives would help get this issue much-needed
> > attention?  This problem makes kernel debugging, panic analysis, and
> > other console-oriented viewing basically impossible.
> 
> I was recently going to look at it.  The somewhat drastic approach I was going 
> to take was to add a simple serializing lock around trap_fatal() and a few 
> other places that do similar block prints (e.g. mca_log()).  One of the issues 
> with fixing this in printf itself is that you'd probably want to 
> serialize complete lines of text on a per-thread basis.  You would want to be 
> able to accumulate this line of text across multiple calls to printf (think of 
> it as line-buffering ala stdio).  However, some folks may be nervous about 
> printf not printing things immediately.
> 
> The other issue is that lots of code assumes it can call printf from anywhere 
> and everywhere.  Mostly this just means that if you add locking and line-
> buffering to printf(9) you have to be very careful to make sure it works in 
> odd places.  Probably a lot of this could be solved by deferring things like 
> trap_fatal() until panic() has already been called (which is bde's preferred
> solution I think).

John,

Thanks for the insights; they're greatly appreciated.
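
For reference, here is a rough userland sketch of the line-buffering
idea described above: each thread accumulates text across several
printf-style calls and hands a complete line to the console under one
serializing lock.  This is only an illustration under stated
assumptions, not kernel code.  The names lb_printf(), lb_flush(),
console[] and LINE_MAX_LEN are made up for the example, pthreads and
C11 _Thread_local stand in for kernel locking and per-thread storage,
and nothing here addresses calling printf from "odd places" such as
interrupt or panic context.

```c
#include <assert.h>
#include <pthread.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define LINE_MAX_LEN 256    /* hypothetical per-thread buffer size */

/* Stand-in for the console: one shared buffer guarded by one lock. */
static char console[4096];
static size_t console_len;
static pthread_mutex_t console_lock = PTHREAD_MUTEX_INITIALIZER;

/* Per-thread pending text (hypothetical; userland thread-local storage). */
static _Thread_local char line_buf[LINE_MAX_LEN];
static _Thread_local size_t line_len;

/* Hand this thread's pending bytes to the console as one unit. */
void
lb_flush(void)
{
    if (line_len == 0)
        return;
    pthread_mutex_lock(&console_lock);
    size_t room = sizeof(console) - 1 - console_len;
    size_t n = line_len < room ? line_len : room;   /* drop what won't fit */
    memcpy(console + console_len, line_buf, n);
    console_len += n;
    console[console_len] = '\0';
    pthread_mutex_unlock(&console_lock);
    line_len = 0;
}

/* printf-like call: accumulate until a newline or a full buffer. */
void
lb_printf(const char *fmt, ...)
{
    char tmp[LINE_MAX_LEN];
    va_list ap;

    va_start(ap, fmt);
    int n = vsnprintf(tmp, sizeof(tmp), fmt, ap);
    va_end(ap);
    if (n < 0)
        return;
    if ((size_t)n >= sizeof(tmp))
        n = (int)sizeof(tmp) - 1;                   /* output was truncated */
    for (int i = 0; i < n; i++) {
        assert(line_len < sizeof(line_buf));
        line_buf[line_len++] = tmp[i];
        if (tmp[i] == '\n' || line_len == sizeof(line_buf))
            lb_flush();
    }
}
```

With this, lb_printf("Fatal trap %d: ", 12) leaves the console untouched,
and a following lb_printf("page fault\n") emits the whole line in one
locked write.  The cost is exactly the one mentioned above: a partial
line stays invisible until its newline arrives, so anything that must
appear immediately would need an explicit lb_flush().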

I went looking this morning to see how Linux addressed this issue (if at
all); it's been discussed a few times in the past.  The longest lkml
thread I could find that mentioned the problem dates from around 2002.
It's probably not worth reading, since work was done in 2009 to solve
the issue.

http://lkml.indiana.edu/hypermail/linux/kernel/0204.1/index.html#161

Work done by Red Hat in 2009 details how they implemented a lockless
version of their kernel ring buffer (similar to our system message
buffer, but probably a lot more complex):

http://lwn.net/Articles/340400/
http://lwn.net/Articles/340443/

Supposedly, having multiple writers to the ring is 100% safe, with no
interspersed output; the same goes for interrupt-generated output.  Some
comments in the technical document (2nd link) imply there's an
individual ring buffer for each CPU.  Would per-CPU kernel message
buffers possibly solve our issue?
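
To make the per-CPU idea concrete, here is a minimal userland sketch.
It is not the algorithm from the LWN articles and not anything in our
tree; it's a simplification with made-up names (pcpu_log(),
pcpu_snapshot(), NCPU, NREC, RECLEN).  It assumes each CPU appends only
complete messages to its own ring, stamps each one from a shared atomic
counter, and that a reader merges the rings by stamp afterwards.  It
also assumes the reader runs while writers are quiet, which a real
implementation could not assume.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdio.h>
#include <string.h>

#define NCPU   4    /* hypothetical CPU count */
#define NREC   8    /* records per CPU ring */
#define RECLEN 64   /* max bytes per record, including the NUL */

struct rec {
    unsigned long seq;      /* global order stamp; 0 means empty */
    char text[RECLEN];
};

struct cpu_ring {
    struct rec recs[NREC];
    unsigned head;          /* next slot; written only by the owning CPU */
};

static struct cpu_ring rings[NCPU];
static atomic_ulong next_seq = 1;

/* Append one complete message to the ring owned by `cpu`. */
void
pcpu_log(unsigned cpu, const char *msg)
{
    assert(cpu < NCPU);
    struct cpu_ring *r = &rings[cpu];
    struct rec *slot = &r->recs[r->head % NREC];    /* oldest is overwritten */

    snprintf(slot->text, sizeof(slot->text), "%s", msg);
    slot->seq = atomic_fetch_add(&next_seq, 1);
    r->head++;
}

/* Copy all records into `out` in stamp order; returns records copied. */
size_t
pcpu_snapshot(char *out, size_t outlen)
{
    unsigned long last = 0;
    size_t used = 0, count = 0;

    if (outlen == 0)
        return 0;
    out[0] = '\0';
    for (;;) {
        const struct rec *best = NULL;

        /* Find the lowest stamp not yet copied, across every ring. */
        for (unsigned c = 0; c < NCPU; c++) {
            for (unsigned i = 0; i < NREC; i++) {
                const struct rec *r = &rings[c].recs[i];
                if (r->seq > last && (best == NULL || r->seq < best->seq))
                    best = r;
            }
        }
        if (best == NULL)
            break;
        size_t len = strlen(best->text);
        if (used + len >= outlen)
            break;                                  /* output buffer full */
        memcpy(out + used, best->text, len + 1);
        used += len;
        last = best->seq;
        count++;
    }
    return count;
}
```

Because writers never share a ring, no message can be split by another
CPU's output.  Whole messages from different CPUs still alternate in the
merged view, though, so a multi-line block like the trap_fatal() output
would need to go in as a single record (or carry a CPU tag) to stay
readable.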

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |