consistent VM hang during reboot

Fri May 9 18:42:01 UTC 2014

On May 8, 2014, at 12:42 PM, Andrew Duane <aduane at juniper.net> wrote:

> From: owner-freebsd-hackers at freebsd.org [mailto:owner-freebsd-hackers at freebsd.org] On Behalf Of John Nielsen
> 
>> On May 8, 2014, at 11:03 AM, John Baldwin <jhb at freebsd.org> wrote:
>> 
>>> On Wednesday, May 07, 2014 7:15:43 pm John Nielsen wrote:
>>>> I am trying to solve a problem with amd64 FreeBSD virtual machines running on a Linux+KVM hypervisor. To be honest I'm not sure if the problem is in FreeBSD or the hypervisor, but I'm trying to rule out the OS first.
>>>> 
>>>> The _second_ time FreeBSD boots in a virtual machine with more than one core, the boot hangs just before the kernel would normally print e.g. "SMP: AP CPU #1 Launched!" (The last line on the console is "usbus0: 12Mbps Full Speed USB v1.0", but the problem persists even without USB). The VM will boot fine a first time, but running either "shutdown -r now" OR "reboot" will lead to a hung second boot. Stopping and starting the host qemu-kvm process is the only way to continue.
>>>> 
>>>> The problem seems to be triggered by something in the SMP portion of cpu_reset() (from sys/amd64/amd64/vm_machdep.c). If I hit the virtual "reset" button the next boot is fine. If I have 'kern.smp.disabled="1"' set for the initial boot then subsequent boots are fine (but I can only use one CPU core, of course). However, if I boot normally the first time then set 'kern.smp.disabled="1"' for the second (re)boot, the problem is triggered. Apparently something in the shutdown code is "poisoning the well" for the next boot.
>>>> 
>>>> The problem is present in FreeBSD 8.4, 9.2, 10.0 and 11-CURRENT as of yesterday.
>>>> 
>>>> This (heavy-handed and wrong) patch (to HEAD) lets me avoid the issue:
>>>> 
>>>> --- sys/amd64/amd64/vm_machdep.c.orig	2014-05-07 13:19:07.400981580 -0600
>>>> +++ sys/amd64/amd64/vm_machdep.c	2014-05-07 17:02:52.416783795 -0600
>>>> @@ -593,7 +593,7 @@
>>>> void
>>>> cpu_reset()
>>>> {
>>>> -#ifdef SMP
>>>> +#if 0
>>>> 	cpuset_t map;
>>>> 	u_int cnt;
>>>> 
>>>> I've tried skipping or disabling smaller chunks of code within the #if block but haven't found a consistent winner yet.
>>>> 
>>>> I'm hoping the list will have suggestions on how I can further narrow down the problem, or theories on what might be going on.
>>> 
>>> Can you try forcing the reboot to occur on the BSP (via 'cpuset -l 0 reboot')
>>> or a non-BSP ('cpuset -l 1 reboot') to see if that has any effect?  It might
>>> not, but if it does it would help narrow down the code to consider.
>> 
>> Hello jhb, thanks for responding.
>> 
>> I tried your suggestion but unfortunately it does not make any difference. The reboot hangs regardless of which CPU I assign the command to.
>> 
>> Any other suggestions?
> 
> When I was doing some early work on some of the Octeon multi-core chips, I encountered something similar. If I remember correctly, there was an issue in the shutdown sequence that did not properly halt the cores and set up the "start jump" vector. So the first core would start, and when it tried to start the next ones it would hang waiting for the ACK that they were running (since they didn't have a start vector and hence never started). I know MIPS, not AMD, so I can't say what the equivalent would be, but I'm sure there is one. Check that part, setting up the early state.
> 
> If Juli and/or Adrian are reading this: do you remember anything about that, something like 2 years ago?

That does sound promising; I would love more details if anyone can provide them.

Here's another wrinkle:

The KVM machine in question is part of a cluster of identical servers (hardware, OS, software revisions). The problem is present on all servers in the cluster.

I also have access to a second homogeneous cluster. The OS and software revisions on this cluster are identical to the first. The hardware is _nearly_ identical--slightly different mainboards from the same manufacturer and slightly older CPUs. The same VMs (identical disk image and definition, including CPU flags passed to the guest) that have a problem on the first cluster work flawlessly on this one.

Not sure if that means the bad behavior only appears on certain CPUs or if it's timing-related or something else entirely. I'd welcome speculation at this point.

CPU details below in case it makes a difference.

== Problem Host ==
model name      : Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms

== Good Host ==
model name      : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
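
For anyone who wants to compare the two host flag lists without eyeballing them, here is a minimal sketch in Python. The helper names (parse_flags, diff_flags) are made up for illustration; the two strings are the "flags" values copied from the lists above.

```python
def parse_flags(line):
    """Return the set of flag names from a 'flags : a b c' style line."""
    _, sep, value = line.partition(":")
    if not sep:
        # No "name :" prefix, so treat the whole line as the flag list.
        value = line
    return set(value.split())


def diff_flags(problem_line, good_line):
    """Return (flags only on the problem host, flags only on the good host)."""
    problem = parse_flags(problem_line)
    good = parse_flags(good_line)
    return sorted(problem - good), sorted(good - problem)


# Flag lists copied from the two hosts in this message.
PROBLEM_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp "
    "lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc "
    "aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 "
    "ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt "
    "tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb "
    "xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase "
    "smep erms"
)
GOOD_FLAGS = (
    "fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 "
    "clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp "
    "lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc "
    "aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 "
    "ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt "
    "tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts "
    "dtherm tpr_shadow vnmi flexpriority ept vpid"
)

if __name__ == "__main__":
    only_problem, only_good = diff_flags(PROBLEM_FLAGS, GOOD_FLAGS)
    print("only on problem host:", " ".join(only_problem))
    print("only on good host:   ", " ".join(only_good))
```

On the two lists above this reports f16c, rdrand, fsgsbase, smep and erms as present only on the problem host, and nothing unique to the good host. That only narrows down the host-side difference; it does not show which of these, if any, is involved in the hang.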

Thanks,

JN