Handle kernel module crashes

Sun Jun 16 10:46:27 UTC 2013

On 10 Jun 2013, at 15:40, Florent Peterschmitt <florent at peterschmitt.fr> wrote:

> Ok and isn't it a "bad" thing ? I mean, even if the video driver
> crashes, I still want to have the ability to reboot the right way,
> avoiding corrupted files and WIP loss.
> 
> Another thing is a non-critical module that can crash, but because it is
> not used by all apps on the machine, letting the ones that can continue run.
> 
> But I don't know what is the approach of FreeBSD and devs about that.

Yes, it's a bad thing.  If we had privilege domain crossing that was as cheap as a function call (or, at least, almost as cheap) then we could implement fine-grained separation within the kernel and not incur any performance penalty.  Unfortunately, this is not possible without some fairly significant changes to current CPU instruction sets (which, actually, several of us in FreeBSD land are working on, but that's unlikely to be seen in any mainstream processor for at least 5-10 years).  

In the current world, we have a fairly poor selection of choices for isolation.  On i386, we had 4 protection rings, but on the 486 and newer, transitions to and from rings 1 and 2 became increasingly expensive because most operating systems only used rings 0 and 3 (NetWare and OS/2 are the two exceptions that I know of).  On other architectures we just have privileged and unprivileged modes.  Code in privileged mode can't be isolated from other code in privileged mode, and code in unprivileged mode incurs some overhead for calls into privileged mode.

There are some tricks that you can do to enforce some weaker protection.  For example, on 64-bit platforms every driver could be written to use 32-bit pointers, be given a 4GB segment of privileged-mode virtual memory to use, and have to go through special gates to do anything with the whole kernel's address space.  You'd then end up with a lot more TLB churn, but you'd gain protection against a number of kinds of pointer error (protection faults inside the 32-bit window would just result in that module being killed and restarted).
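To make the 32-bit window idea concrete, here is a rough sketch of the address arithmetic only.  Everything in it (struct drv_window, drv_ptr_t and the function names) is made up for illustration and is not an existing FreeBSD interface; it also assumes the window base is 4GB-aligned and says nothing about how the gates themselves would be enforced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-driver window: 4GB of privileged virtual address space. */
#define DRV_WINDOW_SIZE (UINT64_C(1) << 32)

struct drv_window {
    uint64_t base;  /* kernel virtual address where the window starts */
};

/* Inside the driver, a "pointer" is only a 32-bit offset into its window. */
typedef uint32_t drv_ptr_t;

/* Assumption for this sketch: windows are 4GB-aligned. */
static inline void
drv_window_init(struct drv_window *w, uint64_t base)
{
    assert((base & (DRV_WINDOW_SIZE - 1)) == 0);
    w->base = base;
}

/*
 * Turn a driver pointer into a full kernel address.  The offset is 32 bits
 * wide, so the result cannot leave [base, base + 4GB), whatever value a
 * buggy driver computes.
 */
static inline uint64_t
drv_ptr_to_kva(const struct drv_window *w, drv_ptr_t p)
{
    return (w->base + (uint64_t)p);
}

/* Gate-side check: does this kernel address lie inside the window? */
static inline bool
drv_window_contains(const struct drv_window *w, uint64_t kva)
{
    /* Unsigned subtraction wraps for kva < base, so one compare suffices. */
    return (kva - w->base < DRV_WINDOW_SIZE);
}

/*
 * Gate-side conversion of a kernel address into something the driver can
 * name.  Fails for anything outside the window.
 */
static inline bool
drv_kva_to_ptr(const struct drv_window *w, uint64_t kva, drv_ptr_t *out)
{
    if (!drv_window_contains(w, kva))
        return (false);
    *out = (drv_ptr_t)(kva - w->base);
    return (true);
}
```

The point of the narrow pointer type is that a stray pointer is confined by construction: the driver can corrupt its own 4GB, but it has no way to express an address outside it without going through a gate.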

Unfortunately, there are several problems with this.  The most obvious is that killing a module is not always trivial.  For example, a module may hold various locks, but it's not always clear which module owns a lock.  Locks are held by kernel threads, but a thread can have a call stack spanning several modules.  Working out exactly which driver holds the lock is not always trivial, and there is also the question of what you do about a thread that contains some call frames belonging to the module that you've just killed.  You'd need to provide some exception-like mechanism for handling this case (and unwinding the stack in the case where it is potentially corrupt is also nontrivial).  
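As a toy illustration of the call-stack problem, the sketch below takes a list of return addresses for one thread (innermost frame first) and works out how far the stack would have to be unwound if a given module were killed.  The structures and names are hypothetical, the return addresses are assumed to have been collected already, and nothing here deals with a corrupt stack or with releasing locks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of where one module's code is loaded: [start, end). */
struct module_range {
    int      module_id;
    uint64_t start;
    uint64_t end;
};

/* Map a return address to the module that contains it, or -1 if none does. */
static int
module_for_pc(const struct module_range *mods, size_t nmods, uint64_t pc)
{
    for (size_t i = 0; i < nmods; i++) {
        assert(mods[i].start < mods[i].end);
        if (pc >= mods[i].start && pc < mods[i].end)
            return (mods[i].module_id);
    }
    return (-1);
}

/*
 * frames[0] is the innermost (most recent) call.  Returns the index of the
 * outermost frame that belongs to dead_module, or -1 if the thread never
 * passed through it.  Frames 0 through that index would all have to be
 * discarded, including frames in modules that are still healthy.
 */
static int
unwind_limit(const struct module_range *mods, size_t nmods,
    const uint64_t *frames, size_t nframes, int dead_module)
{
    int limit = -1;

    for (size_t i = 0; i < nframes; i++) {
        if (module_for_pc(mods, nmods, frames[i]) == dead_module)
            limit = (int)i;
    }
    return (limit);
}
```

If a thread's frames belong to modules 2, 1, 2, 3 (innermost first) and module 1 is killed, the limit is 1, so the innermost module 2 frame has to go as well even though module 2 did nothing wrong.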

An alternative is to run the driver entirely, or mostly, in userspace.  The 'mostly' option is often better.  For example, certain categories of USB devices are exposed by the FreeBSD kernel as USB generic devices (ugen driver) and some userspace component sends USB commands to it.  This involves some extra copying, but means that most of the (potentially buggy) driver logic is in the application.  If it crashes, you lose the application state (which, in a desktop setting, is only slightly better than crashing the kernel), but not the whole kernel.  

In the case of certain modern network interfaces (InfiniBand in particular) and modern GPUs, the kernel handles even less.  The device has some hardware support for multiplexing and isolation and so all that the kernel has to do is set up some memory that both the device and the userspace code can access - including the device registers for controlling a command queue - and then delegate most of the operation to the userspace code.  This requires an IOMMU to actually provide isolation; otherwise an errant DMA request can still result in accessing or modifying kernel memory.
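A command queue of this kind is often just a ring buffer in the shared memory.  The sketch below is a simplified model, not any real device's layout: struct cmd_queue, the opcode/arg fields and the function names are invented, the "device" is an ordinary function, and a real implementation would need memory barriers and a separate doorbell register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CQ_SLOTS 8u  /* must be a power of two, see cq_init() */

/* Hypothetical command format. */
struct cq_cmd {
    uint32_t opcode;
    uint32_t arg;
};

/*
 * Hypothetical layout of the region that the kernel maps into both the
 * process and the device.  head and tail are free-running counters; the
 * slot index is the counter modulo CQ_SLOTS.
 */
struct cmd_queue {
    struct cq_cmd slots[CQ_SLOTS];
    uint32_t head;  /* next slot the device will consume */
    uint32_t tail;  /* next slot userspace will fill */
};

static void
cq_init(struct cmd_queue *q)
{
    /* Free-running 32-bit counters only wrap correctly for powers of two. */
    assert((CQ_SLOTS & (CQ_SLOTS - 1)) == 0);
    q->head = 0;
    q->tail = 0;
}

/* Userspace side: queue a command without entering the kernel. */
static bool
cq_submit(struct cmd_queue *q, struct cq_cmd c)
{
    if (q->tail - q->head == CQ_SLOTS)
        return (false);  /* ring is full */
    q->slots[q->tail % CQ_SLOTS] = c;
    /* Real hardware would need a write barrier before publishing tail. */
    q->tail++;
    return (true);
}

/* Stand-in for the device: take the oldest pending command, if any. */
static bool
cq_device_consume(struct cmd_queue *q, struct cq_cmd *out)
{
    if (q->head == q->tail)
        return (false);  /* ring is empty */
    *out = q->slots[q->head % CQ_SLOTS];
    q->head++;
    return (true);
}
```

Once the mapping exists, submitting work is a couple of stores into shared memory, which is why the kernel can stay out of the fast path.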

Even with this kind of isolation, there are still potential problems.  Many devices react poorly to bad input and can be left in a state that is hard to recover from, even if the driver itself is easy to restart.  A lot of OS instability (I saw a number as high as 20% of OS crashes quoted at MSR recently) is caused by drivers reacting poorly to intermittent hardware errors.  Just restarting the driver (an approach that they tried) solved some, but not all, of these cases.

Of course, there are a lot of things in the kernel that are not drivers.  For example, FUSE allows us to run filesystems in userspace instead of in the kernel.  This comes with a performance penalty as a result of having to copy data from the kernel's buffer cache into the filesystem process, then back into the kernel, and then into the destination process (for a read - the same sequence in the opposite order on write).  Similarly, we have CUSE for character devices, which is used by a lot of webcam drivers.  These are a relatively good use-case for userspace drivers, because they are typically a streaming interface (data comes just from the device and there isn't a lot of need for latency-sensitive round trips from the app to the driver) and the latency that users care about is on the order of 1/24th of a second, which is a very long time on a modern computer.  There are other examples, such as Netmap for pushing network packets directly into userspace, which can be combined with something like Ilias Marinos' userspace network stack to run the entire TCP/IP stack in userspace.

Moving drivers into userspace is not a panacea.  It adds more asynchronous behaviour, which makes reasoning about the code harder and makes deadlocks far easier to introduce (for example, any userspace process has a lot of implicit interactions with the VM subsystem, which are more explicit in the kernel, and doesn't have a shared global namespace for locks).  Most of the code in the kernel is there because, when the code was written, it was the most sensible place for it.  In most cases, that is still true, although as CPU and software architectures evolve that may change.

David
