On COW memory mapping in d_mmap_single

Flavius Anton f.v.anton at gmail.com
Wed Apr 12 11:11:34 UTC 2017


Hi Chris,

Thanks a lot for your answer. I've added Peter to CC, as he knows
about this ongoing project, and some of the design decisions, such as
the COW mapping, had already been made to some extent before I joined.
Please see my inline answers below.

On Wed, Apr 12, 2017 at 1:10 AM, Chris Torek <torek at elf.torek.net> wrote:
>>Yes, all vCPUs are locked before calling mmap(). I agree that we don't
>>need 'COW', as long as we keep all vCPUs locked while we copy the
>>entire VM memory. But this might take a while, imagine a VM with 32GB
>>or more of RAM. This will take maybe minutes to write to disk, so we
>>don't actually want the VM to be frozen for so long. That's the
>>reason we'd like to map the memory COW and then unlock vCPUs.
>
> You'll need to save the device state while holding the CPUs locked,
> too, so that the virtio queues can be in sync when you restore.

Yes, saving the vCPU state, vlapic, ioapic, etc. is done with all
vCPUs locked. Memory, on the other hand, may be too large and take
too long to copy while the vCPUs are held. I am currently working on
saving the virtio queues and device state.

>>It's an OBJT_DEFAULT object. It's not a device object; it's the
>>memory object given to the guest to use as physical memory.
>
> Your copy code path is basically a simplified vm_map_copy_entry()
> as called from vmspace_fork() for the MAP_INHERIT case.  But if
> these are OBJT_DEFAULT, shouldn't you be calling vm_object_collapse()?
> See https://github.com/flaviusanton/freebsd/blob/bhyve-save-restore/sys/vm/vm_map.c#L3170
> (Maybe src_object->handle is never NULL?  There are several things
> in the VM object code that I do not understand fully here, so this
> might be the case.)

I have looked at vm_map_copy_entry() and vm_object_collapse(), but I
don't yet understand the VM system well enough to tell whether they
would do other things we don't want. I'll read through them again
after this e-mail; a sketch of the pattern I have in mind is below.
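For reference, here is roughly the pattern I understand
vm_map_copy_entry() to use for the OBJT_DEFAULT/OBJT_SWAP case. This
is a sketch from my reading of the code, not a verbatim excerpt, and
src_entry/src_object just stand for the guest memory map entry and
its backing object:

    /*
     * Sketch of the collapse-before-share pattern in
     * vm_map_copy_entry(); not verbatim upstream code.
     */
    vm_object_t src_object;

    if ((src_object = src_entry->object.vm_object) != NULL) {
        VM_OBJECT_WLOCK(src_object);
        if (src_object->handle == NULL &&
            (src_object->type == OBJT_DEFAULT ||
             src_object->type == OBJT_SWAP)) {
            /* Fold any shadow chain before sharing the object. */
            vm_object_collapse(src_object);
        }
        /* Take a reference and drop OBJ_ONEMAPPING before COW sharing. */
        vm_object_reference_locked(src_object);
        vm_object_clear_flag(src_object, OBJ_ONEMAPPING);
        VM_OBJECT_WUNLOCK(src_object);
    }

If that collapse step is what matters for OBJT_DEFAULT objects, we
would presumably need the equivalent call in our ioctl path before
marking the object COW.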

>>>Next, how do you undo the damage done by your 'COW' ?
>
>>This is one thing that we've thought about, but we don't have a
>>solution for now. I agree it is very important, though. I figured that
>>it might be possible to 'unmark' the memory object as COW with some
>>additional tricks.
>
> I think you may be better off doing actual vm_map_copy_entry()
> calls.
>
> I am assuming, here, that snapshot-saving is implemented by
> sending a request to the running bhyve, which spins off a thread
> or process that does the snapshot-save.  If you spin it off as
> a real process, i.e., do a fork(), you will get the existing
> VM system to do all the work for you.  The overall strategy
> then looks something like this:
>
>     handle_external_suspend_or_snapshot_request() {
>         set global suspending flag /* if needed */
>         stop all vcpus
>         signal virtio and emulated devices to quiesce, if needed
>         if (snapshot) {
>             open snapshot file
>             pid = fork()
>             if (pid == 0) { /* child */
>                 COW is now in effect on memory: save more-volatile
>                     vcpu and dev state
>                 pthread_cond_signal parent that it's safe to resume
>                 save RAM state
>                 close snapshot file
>                 _exit(0)
>             }
>             if (pid < 0) ... handle error ...
>             /* parent */
>             close snapshot file
>             wait for child to signal OK to resume
>         } else {
>             wait for external resume signal
>         }
>         clear suspending flag
>         resume devices and vcpus
>     }
>
> To resume a snapshot from a file, we load its state and then run
> the last two steps (clear suspending flag and resume devices and
> vcpus).
>
> This way all the COW action happens through fork(), so there is no
> new kernel-side code required.

This looks perfect to me; it was one of my first questions when I
joined. However, I am not sure it's OK to fork the entire bhyve
address space; I remember seeing some discussion about that, which is
why I CCed Peter. Right now we have a checkpoint thread that listens
for the checkpoint request (via a UNIX socket), locks the vCPUs,
saves some state, requests the COW mapping (via an ioctl), unlocks
the vCPUs, and copies the COW memory to a checkpoint file. I haven't
done anything about unmapping the COW entry yet; a sketch of the
fork-based flow, for comparison, is below.
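Here is a minimal userspace sketch of the fork-based approach you
describe, assuming fork() really does give us COW semantics on the
guest memory mapping (which is exactly the part I'm unsure about).
pause_vcpus(), resume_vcpus(), save_device_state() and
save_guest_ram() are hypothetical placeholders for the bhyve-side
pieces, not existing functions:

    #include <sys/wait.h>

    #include <err.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical helpers, not existing bhyve functions. */
    void pause_vcpus(void);
    void resume_vcpus(void);
    void save_device_state(int fd);   /* vCPU, vlapic, virtio, ... */
    void save_guest_ram(int fd);      /* the large, slow part */

    static void
    snapshot(const char *path)
    {
        pid_t pid;
        int fd;

        pause_vcpus();                 /* quiesce vCPUs and devices */

        fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
        if (fd < 0)
            err(1, "open %s", path);

        /* Small, volatile state is written while everything is paused. */
        save_device_state(fd);

        pid = fork();
        if (pid < 0)
            err(1, "fork");
        if (pid == 0) {
            /* Child: guest RAM is copy-on-write courtesy of fork(). */
            save_guest_ram(fd);
            close(fd);
            _exit(0);
        }

        /* Parent: the guest runs again while the child writes RAM. */
        close(fd);
        resume_vcpus();
        waitpid(pid, NULL, 0);         /* or reap the child asynchronously */
    }

The appealing part is that the COW bookkeeping, and its teardown when
the child exits, is entirely the regular fork() machinery, so there
would be nothing for us to unmap afterwards.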

> (Frankly, I think the hard part here is saving device and virtual
> APIC state.  If you have the vlapic state saving working, you have
> made pretty good progress.)

Thanks. I am almost sure it is not complete yet, but I do have the
vlapic state being saved. In fact, I can already restore VMs that use
a ramdisk and no devices other than the console. I'd like to open a
pull request for review as soon as possible; in the meantime I have
started looking at the virtio devices so that we can save and restore
virtio-net as well. A sketch of the queue state I think we need to
capture is below.
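To be concrete about what I mean by virtio queue state, this is
roughly the per-queue data I expect the snapshot needs on top of the
RAM image (the descriptor/avail/used rings themselves live in guest
memory and are already covered by the memory copy). The struct and
field names below are placeholders, not an existing bhyve structure:

    #include <stdint.h>

    /* Placeholder layout for per-virtqueue snapshot state. */
    struct vq_snapshot {
        uint16_t vq_num;         /* queue index within the device */
        uint16_t vq_size;        /* ring size negotiated by the guest */
        uint64_t vq_pfn;         /* guest-physical page frame of the ring */
        uint16_t vq_last_avail;  /* next avail entry the host will consume */
        uint16_t vq_saved_used;  /* used index the host has published */
        uint16_t vq_msix_idx;    /* MSI-X vector assigned to the queue */
        uint8_t  vq_enabled;     /* whether the guest configured the queue */
    };

Per-device state (negotiated features, device-specific config such as
the MAC address for virtio-net) would be saved alongside this.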

--
Flavius

