psm(4) & atkbdc(4) locking

Mon May 8 21:02:51 UTC 2017

On Mon, 8 May 2017, Warner Losh wrote:

> On Mon, May 8, 2017 at 1:39 PM, Konstantin Belousov <kostikbel at gmail.com> wrote:
>> On Mon, May 08, 2017 at 09:58:27PM +0300, Vladimir Kondratyev wrote:
>>> On 2017-05-08 21:39, Warner Losh wrote:
>>>> On Mon, May 8, 2017 at 12:33 PM, Vladimir Kondratyev
>>>> <vladimir at kondratyev.su> wrote:
>>>>> Hi All
>>>>>
>>>>> In order to implement evdev support in psm(4) driver I`m going to add
>>>>> mutexes to psm and atkbdc drivers and mark psm interrupt and cdev
>>>>> handlers
>>>>> MPSAFE.
>>>>> Atkbd(4) is depending on Giant like before.
>>>>>
>>>>> Locking of these drivers seems to be low-hanging fruit as spl() calls
>>>>> are
>>>>> still in place but it has not occurred.

High-hanging fruit.  it is a lot of work to replace the buggy locking by
something that works, without causing deadlocks instead of relatively
harmless races.

>>>>> Does someone know why it is not done yet? For reasons other than
>>>>> "Nobody
>>>>> took care of it"?
>>>>> Any hidden traps?

There is ddb (discussed below), and also syscons calling the keyboard
driver.  syscons is still Giant-locked, partly because keyboard drivers
are.  The common Giant lock works more simply than separate non Giant
ones.  The same non-Giant lock for both would would have some of the
complexities of Giant without the support already being there.

>>>> Working rock-solid reliably in ddb(4) is what tripped me up when I
>>>> started down this path 10 years ago.
>>>
>>> I tried to avoid atkbd changes as much as possible. Patch adds atkbdc
>>> mutex acquisition
>>> before each hardware access and nothing more. I think mutexes should be
>>> ignored in ddb mode.

They are only ignored if you do it explicitly, but this is fragile.  If
the hardware works after ignoring locks while in kdb mode, then it must
work after ignoring locks in all modes.

>> Locks are not ignored in kdb_active mode, and this is reasonable.
>> Instead, kdb should not acquire locks at all.

(This should be ddb_active mode.  Only ddb uses the keyboard.  But there
is only a flag for kdb.)

See my recent fixes for syscons.  These basically lock console output
correctly, but for the keyboard they depend on keyboard drivers having
no locking and not needing any (really, both syscons and keyboard
drivers are supposed to be locked by a common lock which is Giant, but
this doesn't work for i/o in kdb mode and is fragile for low-level
console output not in kdb mode.  My tests arrange for console output
from almost to worst places where it shouldn't be done (from IPI STOP
handlers, where the CPU doing the output must succeed despite other
CPUs that have already been stopped holding console driver locks).
Keyboard drivers are harder to test in this context, and are sure to
fail. so I don't try hard to test them.  However, kdb and panics in
this context sometime reach the keyboard driver.  Also, on syscons
entry in such context, it tries to not use keyboard drivers if not
necessary.

>> Consider a situation where you need to send a composite command to
>> the hardware, which consists of several registers accesses, and the
>> whole sequence of accesses is covered by a lock to ensure atomicity.
>> Kdb may be entered at arbitrary moment, including the middle of the
>> lock-protected section. This means that kdb might be entered with the
>> lock owned by some thread, not neccessary the thread which was executing
>> when entrance occured.  More, the hardware state is some intermediate
>> state of being commanded the composite sequence, not yet finished.

AT keyboards don't seem to mind some corruption of the hardware state.
In general, provided a bad state doesn't destroy the device, then at
worst the driver can probably recover by resetting everything after a
timeout.

My fixes are simpler and more complete for sio.  They are simpler because
the driver only has 1 lock for input and output (but this lock is also
shared between sio devices, which gives problems related to those for
Giant -- console output is supposed to be immediate and must be synchronous,
but you are forced to wait for other devices).  sio always did an almost
complete state switch for console entry and exit.  This is possible since
the device registers are read-write and restoring them works as well
as a reset for getting back to a usable state.  My recent fixes mainly
add some control of the nesting for the stacked states, and extra care
for some !kdb cases.

> Yes. This is the snag I ran into...
>
>> I do not think that atkbd code correctly handles this situation now,
>> and simply adding a lock there probably make things worse.

Yes, it would just give deadlock.  See my recent fixes.  atkbd currently
requires Giant locking, but this is obviously impossible for low-level
console i/o (even without ddb).  Syscons used to do null locking in this
case, without really trying.  Initially it used non-null spls, but spls
don't work at all for this use (since spl only locks out interrupts).
Then when syscons was converted to Giant, Giant works too well for hiding
itself, so syscons has almost no explicit references to Giant, and
certainly none for low-level-console i/o where they wouldn't work.  Giant
also works well for locking out the interrupt handlers, but converting
the spls to null ones made them do even less than before to accidentally
limit console i/o.

If you sprinkle locks that actually, work, then deadlock is certain in
some cases.  If you use recursive locks like syscons still does, then
they will often avoid deadlock but won't actually work.  They reduce
to null spls if they are already held by the same thread.

Typical examples of deadlocks are:
- thread holding lock is stopped in IPI STOP handler
- another thread calls the console driver, perhaps to report a problem
   in its IPI STOP handler

Typical examples of recursive (or otherwise ignored) locks not actually
working when the are needed are:
- thread is in super-critical region of console driver and aquires lock.
   This should be a spinlock, and interrupts should be disabled (this
   is a side effect of spinlocks)
- NMIs and traps cannot be disabled.  They are often caused by NMI IPI
   STOPs and debugger traps.  Since the region is super-critical, i/o
   should not be done, at least blindly.  But recursive locks are useless
   for stopping the i/o, since the same thread owns the lock.  Ignoring
   the lock in kdb mode is equally dangerous.

> It did for me, since breaking to keyboard was totally broken by the
> changes I made to try to lock things since it was quite likely to
> interrupt things when locks were held... I had a very naive
> implementation, which wasn't up to the task, so some care is needed.
> But this was in the 5.x or 6.x time frame, and the locking situation
> wrt GIANT is much better today than then...

Not much better.  There are just more deadlocks and less races now.
Lots of deadlocks were added even at the printf() and msgbuf level.
One deadlock in cnputs() was there for many years.  Output is now
droppped there instead.

> I don't think it's a huge problem, but just one that the implementor
> needs to be aware of... Since I was unaware of all the details up
> front, I came up with a solution that couldn't possibly work so I
> abandoned it.

The difficulties are mostly in complex hardware and drivers.  The
only method that is sure to work is complete reinitialization/reset
whenever needed (including before the normal driver is probed), without
losing too much i/o.  To take shortcuts in this, so that it doesn't
take as much code as the normal driver, you have to understand the
hardware better than is needed for writing the normal driver.

Bruce