UART grab patch

Sun Dec 29 22:27:21 UTC 2013

On Sun, 29 Dec 2013, Warner Losh wrote:

>
> On Dec 29, 2013, at 1:14 AM, Bruce Evans wrote:
>
>> On Sat, 28 Dec 2013, Warner Losh wrote:
>>
>>> OK. Looks like my initial assessment of mountroot was too optimistic. With the exception of the imx driver, all the uart drivers in the tree fell victim to this bug. I hit a second one with the Raspberry Pi, and went looking.
>>
>> Please use the Unix newline character in mail.
>
> Apple's Mail.app hates me. I don't see how to change it.

Most mail I get from apple mail clients is misformatted.

>>> I've uploaded a patch to http://people.freebsd.org/~imp/uart-grab.diff
> ...
>> This is broken or fragile on most or all device types.  It "restores" to a
>> hard-coded state instead of to the previous state.  Maybe uart fixates
>> the state of the console device sufficently for the final state to be
>> constant, but I doubt this.
>
> For many of the UARTs, the state is fixed in attached, so blasting
> out a fixed state is sufficient.

Well, that is fragile.  What is "in attached"?  Normal device attachment
is too late for console drivers.  The state must be set at console
attach time and never changed.  This is not easy to arrange.  First
you have to ensure that normal probe/attach after console attach never
changes it.  Then you have to ensure that normal higher-level device
operations never change it.  For uart, you have to do this for many
different types of hardware.

I forgot to mention a problem with kdb activity again.  This _always_
gives an unfixed state.  The mutex is ignored in kdb mode, so locking
using the mutex alone is insuffient.  The unfixed state and races from
it occur as follows:

   someone does non-kdb console i/o that grabs properly
   grabbing consists of:
     acquire hwmtx
     disable device interrupts (note that this unfixes the state)
   suppose this is interrupted at this point or later by a kdb trap doing
     console i/o
   --> this grabs too, but igores the mutex.  Grabbing consists of:
 	disable interrupts.
       This then does the i/o and ungrabs.  Ungrabbing consists of:
 	enable interrupts (blindly in your version).
   --> back to interrupted context
   Continue doing the i/o with corrupted state.

uart has a much more serious bug from ignoring the mutex (old versions
that didn't ignore the mutex had deadlock instead).  The easiest way
to hit this bug on xx50 is when ns8250_param() deselects the data
register:

   caller acquires hwmtx.  This locks for everything except kdb
   ns8250_param() selects the divisor latch registers and deselects
     the data register and the interrupt enable register.  (Note that
     this is reachable for console devices.  Even though uart's fixation
     breaks the initial state device's control over part of the state
     and the lock state device's control over the fixation, it only
     does this for the line speed and a couple of software parameters;
     ns8250_param() is still reached to set things like the parity,
     and in fact it deselects the data register, etc. even for null
     changes.)
   suppose a kdb trap doing console i/o occurs while the data register
     is deselected.  When this waited for hwmtx, it deadlocked.  Now
     it obviously cannot write any data since the data register is
     deselected.  It actually writes garbage to the low half of the
     divisor latch.  This may or may not be overwritten with the correct
     value for the divisor latch, depending on the ordering of the
     writes.  Blind grabbing and ungrabbing increase this bug slightly.
     The critical interrupt enable register is the other one that is
     deselected.  Grabbing writes garbage to the upper half of the
     divisor latch.  Blind ungrabbing writes different garbage.
     Restoring the previous value would fix up the garbage in this
     half.

Some xx50's have modes in which a larger set of registers are deselected
and these modes are especially needed for changing parameters, but uart
doesn't support them.

>> Indeed, just for xx50, buggy hardware needs
>> special handling for tx interrupts, and uart has this.  It sets the tx
>> interrupt enable bit while transmitting and keeps it clear it otherwise.
>> Thus the initial state of the interrupt enable register varies and the
>> state restored should vary to match.  The correct state to restore
>> for this register can possibly be determined by looking at the global
>> state of the driver, but this would be complicated (it needs transient
>> locking of software state all over; only hardware state is locked).
>
> Yes. That's a flaw in the current implementation. We don't have the
> software state passed around, and we should.

Urk.  cn_arg is passed around, but it is always &uart_console.  This
seems to be a placeholder for doing it better someday.

>> In general, hardware state should be saved in *sc by grab and restored
>> by ungrab.  Locking only this part of *sc across grab/ungrab is still
>> difficult.  Parts of the driver that might invalidate the saved state
>> must be locked out.
>
> Yes. "sc" isn't saved in uart_devinfo for the console, so we can't do this
> today. I'll work things up so that we can do that...

Maybe the state can be stored in the devinfo.  It is good to decouple
the console state from the normal softc state as much as possible, but
with no hints from the softc this seems harder.

>> The patch also "sets" the state to a hard-coded state instead of modifying
>> the current state.  If this is done right, then it is correct since it
>> amounts to an initialization of the driver to console mode.  But the
>> parts of the patch that set the correct (interrupt control) register
>> enable all the non-tx interrupts even if none were enabled before (these
>> are later "restored").  This asks for interrupts that might cause problems.
>> On xx50, this has a chance of being harmless only due to the implementation
>> detail that the hardware has an interrupt identification register and the
>> software uses this register so as to not see hardware state changes unless
>> their interrupt has happened.
>
> Good point...  I'd forgotten just how picky the nsXX50 parts are.

They are fairly orthogonal and thus less picky than some.  Only simpler
devices like sy6551 are much easier to program.  z8530 is much harder.

>> The locking is certainly incomplete.  I don't remember even locking
>> against recursion in earlier changes in uart grab/ungrab.  (Note that
>> there is no anti-recursion code at the top level, except for some
>> deadlocking in cnputs()).
>
> The proper fix to this is at the top level. I'll look into this as a separate issue.

No, the details are too device-specific to handle at the top level.

>> ...
> Concurrent access is verboten, so I'm not sure how useful this would be...
> I'll have to think about this..
>
>> I changed this to use a mutex because atomic accesses are too hard to
>> use.  They would have to be spammed through normal_operation().  The
>> lock must be held throughout.  The !normal_operation() case may still
>> be complicated.  E.g., suppose you have a buggy grab routine that
>> somehow allows an interrupt.  Normal operation is to service the
>> interrupt so that it doesn't repeat.  Abnormal operation can't be to
>> simply return from the interrupt handler, since then the interrupt
>> might repeat endlessly or never occur again.  Here it would be better
>> to do the normal operation and break the console operation.  Make this
>> a feature and let invalid (but not very harmful) states occur
>> occasionally and fix up the invalid states using polling.
>
> True. This would make it more robust. However, given the extent of grabs
> today, I'm not sure that this is worth fixing. I'll have to think about it.

Sloppy grabs aren't worth having.  You are only using them to work
around the multiple consoles breakage.  The former is easy to fix
quickly by dropping support for multiple consoles.  Or don't drop it
completely, but hack on it for mountroot and not much else.

Bruce