jenkins bhyve vms crashing and burning after several days of use

Neel Natu neelnatu at gmail.com
Fri Jun 27 21:23:13 UTC 2014


Hi,

On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu <neelnatu at gmail.com> wrote:
> Hi Sean,
>
> On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno <sbruno at ignoranthack.me> wrote:
>> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
>>> Hi Sean,
>>>
>>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno <sbruno at ignoranthack.me> wrote:
>>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
>>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
>>> >> jenkins crashing and burning after a couple of days of use.
>>> >>
>>> >> vm exit[9]
>>> >> reason          VMX
>>> >> rip             0x0000000029286336
>>> >> inst_length     3
>>> >> status          0
>>> >> exit_reason     49
>>> >> qualification   0x0000000000000000
>>> >> inst_type       0
>>> >> inst_error      0
>>> >>
>>> >>
>>> >> It looks like we have an active core file on havoc.ysv if you have a
>>> >> moment to look at it:
>>> >>
>>> >> http://people.freebsd.org/~sbruno/bhyve.core
>>> >>
>>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
>>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
>>> >> sbruno at havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC  amd64
>>> >>
>>> >
>>> > Also, from chaos.ysv
>>> >
>>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
>>> >
>>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
>>> > r267362: Wed Jun 11 15:50:24 UTC 2014
>>> > sbruno at chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS  amd64
>>> >
>>>
>>> Can you tell us the processor and memory configuration on havoc and chaos?
>>>
>>> Also, could you execute the following commands on havoc:
>>>
>>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
>>> -- this will output the offending guest physical address that
>>> triggered the EPT misconfiguration
>>>
>>> # bhyvectl --vm=vmname --get-gpa-pmap=<gpa_from_above>
>>> -- this will output the page table entries in the EPT that map to the
>>> offending GPA
>>>
>>> Hopefully that provides us with something to work with.
>>>
>>> best
>>> Neel
>>>
>>> >
>>
>> chaos:
>> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
>>   Origin="GenuineIntel"  Id=0x206d6  Family=0x6  Model=0x2d  Stepping=6
>> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>> Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
>>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>>   AMD Features2=0x1<LAHF>
>>   TSC: P-state invariant, performance statistics
>> avail memory = 66298322944 (63227 MB)
>>
>> havoc:
>> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
>> CPU: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz (2400.14-MHz
>> K8-class CPU)
>>   Origin="GenuineIntel"  Id=0x206c2  Family=0x6  Model=0x2c  Stepping=2
>> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
>> Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
>>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
>>   AMD Features2=0x1<LAHF>
>>   TSC: P-state invariant, performance statistics
>> avail memory = 16571621376 (15803 MB)
>>
>
> Thanks, we'll see if there are relevant errata for these processors.
>

Actually, these processors have entirely different microarchitectures
(Nehalem and Sandy Bridge), so it's unlikely that this is due to
processor errata.
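
For anyone who wants to double-check the model numbers, here is a rough
sketch (plain Python, not part of bhyve) that splits the "Id=" values
from the dmesg output quoted above into family, model and stepping using
the usual CPUID leaf 1 layout. The function name is made up for
illustration.

```python
def decode_cpuid_signature(sig):
    """Split a CPUID leaf 1 EAX signature into (family, model, stepping)."""
    stepping = sig & 0xF
    base_model = (sig >> 4) & 0xF
    base_family = (sig >> 8) & 0xF
    ext_model = (sig >> 16) & 0xF
    ext_family = (sig >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model is only used for base family 0x6 or 0xF.
    model = base_model
    if base_family in (0x6, 0xF):
        model |= ext_model << 4

    return family, model, stepping


# "Id=" values copied from the dmesg output for chaos and havoc.
for host, sig in (("chaos", 0x206D6), ("havoc", 0x206C2)):
    family, model, stepping = decode_cpuid_signature(sig)
    print(f"{host}: Family={family:#x} Model={model:#x} Stepping={stepping}")
```

The decoded values agree with the Family/Model/Stepping fields that
dmesg printed, and the two hosts come out with different model numbers.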

>>
>> There appear to be three vms running on havoc:
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9]          0x0000000000000000
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9]          0x0000000000000000
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
>> --get-vmcs-guest-physical-address
>> gpa[9]          0x0000000000000000
>>
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
>>
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
>>
>> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
>> --get-gpa-pmap=0x0000000000000000
>> gpa 0: 0x300002c9348007 0x300002c9339007 0
>>
>>
>> But there's no information available on chaos at the moment as there are
>> no active vms running.
>>
>
> Sorry, I should have explained a bit more.
>
> After bhyve(8) exits because of the EPT misconfiguration error, there
> are breadcrumbs left over in the VMCS as well as in the nested page
> tables. We can use them to diagnose what happened.
>
> The bhyvectl commands above should be executed after the VM exits but
> before it is restarted. Once it restarts, the breadcrumbs are
> overwritten and are of no use.
>
> The "--vm=<vmname>" passed to the bhyvectl command should be of the
> virtual machine that crashed.
> The "--cpu=<vcpuid>" passed to the bhyvectl command should be the
> vcpuid that detected the EPT misconfiguration. The reason I used '9'
> as an example above was because you saw this on the console:
>
> vm exit[9]
> reason          VMX
> rip             0x0000000029286336
>
> Hope that helps.
>
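
To make the procedure quoted above concrete, here is a minimal sketch
(Python, hypothetical helper names, not something that ships with bhyve)
that pulls the vcpuid and exit reason out of the console dump and builds
the two bhyvectl command lines. It only uses the bhyvectl options that
already appear in this thread, and it assumes the output formats look
like the samples pasted earlier. It does not run anything; the commands
still have to be executed by hand after the crash and before the VM is
restarted.

```python
import re

# Basic VMX exit reason 49 is "EPT misconfiguration".
EXIT_REASON_EPT_MISCONFIG = 49

# Console output copied from the original report.
CONSOLE = """\
vm exit[9]
reason          VMX
rip             0x0000000029286336
inst_length     3
status          0
exit_reason     49
"""


def parse_vm_exit(console_text):
    """Return (vcpuid, exit_reason) from a 'vm exit[N]' console dump."""
    vcpu = int(re.search(r"vm exit\[(\d+)\]", console_text).group(1))
    reason = int(re.search(r"exit_reason\s+(\d+)", console_text).group(1))
    return vcpu, reason


def gpa_command(vmname, vcpu):
    """Command that reports the guest physical address stored in the VMCS."""
    return ["bhyvectl", f"--vm={vmname}", f"--cpu={vcpu}",
            "--get-vmcs-guest-physical-address"]


def parse_gpa(output):
    """Extract the address from a line such as 'gpa[9]   0x0000...'."""
    return int(re.search(r"gpa\[\d+\]\s+(0x[0-9a-fA-F]+)", output).group(1), 16)


def pmap_command(vmname, vcpu, gpa):
    """Command that reports the EPT entries mapping the given address."""
    return ["bhyvectl", f"--vm={vmname}", f"--cpu={vcpu}",
            f"--get-gpa-pmap={gpa:#018x}"]


vcpu, reason = parse_vm_exit(CONSOLE)
if reason == EXIT_REASON_EPT_MISCONFIG:
    # "vm1" is only a placeholder for the VM that actually crashed.
    print(" ".join(gpa_command("vm1", vcpu)))
```

The address printed by the first command is what gets fed to
parse_gpa() and then to pmap_command().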

I submitted a change in r267966 to dump this information to the
console. It is also stashed in the process memory so we can inspect it
in a coredump.
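
When reading the dumped page table entries, a small decoder along these
lines may help. This is only a sketch in Python with made-up function
names. It assumes the "gpa N: entry entry ..." line format shown in the
--get-gpa-pmap output earlier in the thread, with all values in hex. It
looks at the read/write/execute bits (bits 0 to 2) and the address field
(bits 12 to 51) of each EPT entry, and flags the "writable but not
readable" combination, which is one of the conditions that causes an EPT
misconfiguration. It does not check reserved bits or memory types. The
sample line below comes from a VM that had already been restarted, so it
is just input for the example and not the actual breadcrumbs.

```python
EPT_READ, EPT_WRITE, EPT_EXEC = 0x1, 0x2, 0x4
EPT_ADDR_MASK = 0x000FFFFFFFFFF000  # bits 12..51


def parse_pmap_line(line):
    """Parse 'gpa 0: 0x... 0x... 0' into (gpa, [entries])."""
    head, _, tail = line.partition(":")
    gpa = int(head.split()[1], 16)
    entries = [int(tok, 16) for tok in tail.split()]
    return gpa, entries


def decode_ept_entry(entry):
    """Break one EPT entry into permission bits and address field."""
    perms = entry & 0x7
    return {
        "read": bool(perms & EPT_READ),
        "write": bool(perms & EPT_WRITE),
        "exec": bool(perms & EPT_EXEC),
        "phys_addr": entry & EPT_ADDR_MASK,
        "high_bits": entry >> 52,  # reported as-is, not interpreted here
        "writable_not_readable":
            bool(perms & EPT_WRITE) and not (perms & EPT_READ),
    }


sample = "gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0"
gpa, entries = parse_pmap_line(sample)
for level, entry in enumerate(entries):
    d = decode_ept_entry(entry)
    print(f"entry {level}: addr={d['phys_addr']:#x} "
          f"r={d['read']} w={d['write']} x={d['exec']} "
          f"suspect={d['writable_not_readable']}")
```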

Would it be possible to upgrade chaos and/or havoc to r267966 so we
can make progress on debugging this issue?

best
Neel

> best
> Neel
>
>> sean
>>

