suspending threads before devices

Wed Nov 19 03:43:20 UTC 2014

On Nov 18, 2014, at 3:21 PM, John Baldwin <jhb at FreeBSD.org> wrote:

> On Saturday, November 15, 2014 1:00:15 pm Konstantin Belousov wrote:
>> On Sat, Nov 15, 2014 at 05:05:10PM +0200, Andriy Gapon wrote:
>>> On 15/11/2014 12:58, Konstantin Belousov wrote:
>>>> On Fri, Nov 14, 2014 at 11:10:45PM +0200, Andriy Gapon wrote:
>>>>> On 22/03/2012 16:14, Konstantin Belousov wrote:
>>>>>> I already noted this to Jung-uk, I think that current suspend handling
>>>>>> is (somewhat) wrong. We shall not stop other CPUs for suspension when
>>>>>> they are executing some random kernel code. Rather, CPUs should be safely
>>>>>> stopped at the kernel->user boundary, or at sleep point, or at designated
>>>>>> suspend point like idle loop.
>>>>>> 
>>>>>> We already are engaged into somewhat doubtful actions like restoring of %cr2,
>>>>>> since we might, for instance, preempt page fault handler with suspend IPI.
>>>>> 
>>>>> I recently revisited this issue in the context of some suspend+resume problems
>>>>> that I am having with radeonkms driver.  What surprised me is that the driver's
>>>>> suspend code has no synchronization whatsoever with its other code paths.  So, I
>>>>> looked first at the Linux code and then at the illumos code to see how suspend
>>>>> is implemented there.
>>>>> As far as I can see, those kernels do exactly what you suggest that we do.
>>>>> Before suspending devices they first suspend all threads except for one that
>>>>> initiates the suspend.  For userland threads a signal-like mechanism is used to
>>>>> put them in a state similar to SIGSTOP-ed one.  With the kernel threads
>>>>> mechanisms are different between the kernels.  Also, illumos freezes kernel
>>>>> threads after suspending the devices, not before.
>>>>> 
>>>>> I think that we could start with only the userland threads initially.  Do you
>>>>> think the SIGSTOP-like approach would be hard to implement for us?
>>>> We have most, if not all, parts of the stopping code
>>>> already implemented. I mean the single-threading code, see
>>>> thread_single(SINGLE_BOUNDARY). The code ensures that other threads in
>>>> the current process are stopped either at the kernel->user boundary, or
>>>> at the safe kernel sleep point.
>>>> 
>>>> This is not immediately applicable, since the caller is supposed to be
>>>> a thread in the suspended process, but modifications to allow external
>>>> process to do the same are really small comparing with the complexity
>>>> of the code.  I suspect that all what is needed is change of
>>>> 	while/if (remaining != 1)
>>>> to
>>>> 	while/if ((p == curproc && remaining != 1) ||
>>>> 	    (p != curproc && remaining != 0))
>>>> together with explicit passing of struct proc *p to thread_single.
>>> 
>>> Thank you for the pointer!
>>> I think that maybe even more changes are required for that code to be usable for
>>> suspending.  E.g. maybe a different p_flag bit should be used, because I think
>>> that we would like to avoid interaction between the process level suspend and
>>> the global suspend.  I.e. the global suspend might encounter a multi-threaded
>>> process in a single thread mode and would need to suspend its remaining thread.
>> 
>> Thread which is a p_singlethread, is not at the safe point; in other
>> words, a process which is under the singlethreading, should prevent
>> the system from entering sleep state. The singlethreading is the
>> temporal state anyway, it is established during exec() or exit(), so
>> it is fine to wait for in-process singlethreading to end before outer
>> singlethreading is done.
>> 
>> Anyway, this requires real coding to experiment.  I started looking at
>> it since I did somewhat related changes now.
> 
> I would certainly like a way to quiesce threads before entering the real suspend
> path.  I would also like to cleanly unmount filesystems during suspend as well and
> the thread issue is a prerequisite for that.  However, reusing "stop at boundary"
> may not be quite correct because you probably don't want to block suspend because
> you have an NFS request that is retrying due to a down NFS server.  For NFS I
> think you want any threads asleep to just not get a chance to run again until
> after resume completes.
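
To make the thread_single() change quoted above concrete, here is a minimal userland sketch of just the wait condition. It is not the kernel code: struct proc is a hypothetical stand-in, and must_keep_waiting() is a name invented for illustration. It only encodes the proposed rule that a caller inside the process waits until it is the last runnable thread, while an external caller waits until none remain.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct proc (assumption). */
struct proc {
	int p_pid;
};

/*
 * Returns true while the single-threading loop still has to wait.
 * 'remaining' is the number of threads in p that have not yet stopped
 * at the kernel->user boundary or at a safe sleep point.
 */
static bool
must_keep_waiting(const struct proc *p, const struct proc *curproc,
    int remaining)
{
	assert(remaining >= 0);
	if (p == curproc)
		return (remaining != 1);	/* the caller itself keeps running */
	return (remaining != 0);		/* external caller: all must stop */
}
```

With this rule an in-process caller is done once remaining drops to 1, and an external caller (such as a global suspend thread) is done only at 0. It says nothing about the NFS retry case raised above, which would still block on a thread that never reaches a boundary.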

I’m almost certain you don’t want to “unmount” the filesystems. This would invalidate
all open file handles and would be mondo-bado, and would only succeed if you forced
this issue due to all the open references. Perhaps you’re being imprecise.

I think you want to pause all the user land threads, sync the filesystems, which should
mark them as clean and allow for the battery to drain w/o too much trouble. It would
also mean that all the threads that start up again won’t have to reopen files, etc. Once
you’ve done this, you can proceed to kernel threads and suspending the hardware.
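
As a rough sketch of that ordering (userland threads, then filesystem sync, then kernel threads, then hardware), the following models the sequence as phases that are undone in reverse if one of them fails. Everything here is hypothetical: the enum, run_suspend(), and the trace callbacks are illustrative names, not existing kernel interfaces, and the real work of each phase is left out.

```c
#include <stddef.h>

/* Phases in the order described above (names are assumptions). */
enum suspend_phase {
	PHASE_STOP_USER_THREADS,
	PHASE_SYNC_FILESYSTEMS,
	PHASE_STOP_KERNEL_THREADS,
	PHASE_SUSPEND_DEVICES,
	PHASE_COUNT
};

typedef int (*phase_fn)(void *ctx, enum suspend_phase phase);

/*
 * Enter each phase in order.  If one fails, leave the phases that were
 * already entered, most recent first, and return the error.
 */
static int
run_suspend(phase_fn enter, phase_fn leave, void *ctx)
{
	int phase, error;

	for (phase = 0; phase < PHASE_COUNT; phase++) {
		error = enter(ctx, (enum suspend_phase)phase);
		if (error != 0) {
			while (phase-- > 0)
				(void)leave(ctx, (enum suspend_phase)phase);
			return (error);
		}
	}
	return (0);
}

/* Demo callbacks that only record the order of events. */
struct trace {
	char	log[2 * PHASE_COUNT + 1];
	size_t	len;
	int	fail_phase;	/* phase that should fail, or -1 for none */
};

static int
trace_enter(void *ctx, enum suspend_phase phase)
{
	struct trace *t = ctx;

	if ((int)phase == t->fail_phase)
		return (-1);
	t->log[t->len++] = "USKD"[phase];
	t->log[t->len] = '\0';
	return (0);
}

static int
trace_leave(void *ctx, enum suspend_phase phase)
{
	struct trace *t = ctx;

	t->log[t->len++] = "uskd"[phase];
	t->log[t->len] = '\0';
	return (0);
}
```

The point of the reverse unwind is that a driver refusing to suspend should leave the system with kernel threads running again before userland is released, mirroring the order in which things were stopped.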

Filesystems implemented in userland, however, would be a problem. As would logging
to stable media once the suspend begins (since you’d have already suspended syslogd
and even if you hadn’t, you’d then be dirtying the very disks you want to keep clean). There
are currently hooks into devd that would need attention that are (were?) used to put the video
mode into a state that one can resume from.

And then there’s CardBus / PC Card, which detach the devices since they power them down.
They could easily be changed to not detach the devices (but they have to power them down
to avoid interrupt storms), but then all the attachments would need to ensure that their
‘resume’ routines did everything that attach did to initialize the hardware. And you’d need a way to
know whether you’ve suspended and resumed with hardware that’s the same type but a different unit,
which would need to be treated differently than hardware that’s the same type and actually the same unit.
The device interface would likely need to grow a UUID function that would be the hash of the
network card’s MAC or the disk’s serial number. (Swapping CF cards while suspended today is
completely safe since we force a detach; if that were to be fixed, it could become dangerous,
as the new disk is unlikely to be a bitwise copy of the old.) This functionality would need to live
at the driver level, not the bus level, because PCI config space doesn’t have the MAC, nor
does it have enough knowledge to know about serial numbers. PC Card CIS space doesn’t
typically have a serial number (a vanishingly small number of cards have it; 99%+
don’t). Oh, and it doesn’t help that the disk isn’t a direct child of the PC Card bus, but a child of
a child (possibly without any device_t in the case of CAM devices). Oh, and did I mention that PC Card
and CardBus hot plug require kernel threads to be active during suspend / resume to work properly
in some rare cases whose details time has erased from my memory?
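
A minimal sketch of what such a per-driver identity check might look like, under loud assumptions: none of these names exist in the device interface today, the hash is plain 64-bit FNV-1a chosen only for brevity, and a device with no usable identity (the common PC Card case) is conservatively treated as changed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* 64-bit FNV-1a over the identity bytes (MAC, serial number, ...). */
static uint64_t
ident_hash(const void *data, size_t len)
{
	const unsigned char *p = data;
	uint64_t h = 0xcbf29ce484222325ULL;	/* FNV offset basis */
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;		/* FNV prime */
	}
	return (h);
}

/* Hypothetical result of a driver's identity method. */
struct dev_ident {
	bool		valid;	/* false if the hardware exposes no identity */
	uint64_t	hash;
};

static struct dev_ident
make_ident(const void *data, size_t len)
{
	struct dev_ident id = { false, 0 };

	if (data != NULL && len > 0) {
		id.valid = true;
		id.hash = ident_hash(data, len);
	}
	return (id);
}

enum resume_action {
	RESUME_IN_PLACE,	/* same unit: run the driver's resume routine */
	RESUME_REATTACH		/* different or unknown unit: detach + attach */
};

/* Compare the identity saved at suspend with the one read at resume. */
static enum resume_action
resume_decision(struct dev_ident before, struct dev_ident after)
{
	if (!before.valid || !after.valid)
		return (RESUME_REATTACH);
	return (before.hash == after.hash ? RESUME_IN_PLACE : RESUME_REATTACH);
}
```

A network driver would feed in its MAC at suspend and again at resume; a disk driver would use its serial number. Falling back to reattach when no identity is available keeps today's safe behavior for swapped CF cards.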

And then there’s USB, which implemented things differently and possibly unsafely. I haven’t looked
into that in as much detail as I have PC Card and CardBus.

Warner