jenkins bhyve vms crashing and burning after several days of use

Sean Bruno sbruno at ignoranthack.me
Fri Jun 27 23:46:56 UTC 2014


On Fri, 2014-06-27 at 14:23 -0700, Neel Natu wrote:
> Hi,
> 
> On Thu, Jun 26, 2014 at 3:43 PM, Neel Natu <neelnatu at gmail.com> wrote:
> > Hi Sean,
> >
> > On Thu, Jun 26, 2014 at 3:23 PM, Sean Bruno <sbruno at ignoranthack.me> wrote:
> >> On Thu, 2014-06-26 at 15:00 -0700, Neel Natu wrote:
> >>> Hi Sean,
> >>>
> >>> On Thu, Jun 26, 2014 at 2:46 PM, Sean Bruno <sbruno at ignoranthack.me> wrote:
> >>> > On Thu, 2014-06-26 at 14:42 -0700, Sean Bruno wrote:
> >>> >> so, we're seeing the bhyve vms running in the freebsd cluster for
> >>> >> jenkins crashing and burning after a couple of days of use.
> >>> >>
> >>> >> vm exit[9]
> >>> >> reason          VMX
> >>> >> rip             0x0000000029286336
> >>> >> inst_length     3
> >>> >> status          0
> >>> >> exit_reason     49
> >>> >> qualification   0x0000000000000000
> >>> >> inst_type       0
> >>> >> inst_error      0
> >>> >>
> >>> >>
> >>> >> It looks like we have an active core file on havoc.ysv if you have a
> >>> >> moment to look at it:
> >>> >>
> >>> >> http://people.freebsd.org/~sbruno/bhyve.core
> >>> >>
> >>> >> FreeBSD havoc.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #2
> >>> >> r267362: Wed Jun 11 14:56:34 UTC 2014
> >>> >> sbruno at havoc.freebsd.org:/usr/obj/usr/src/sys/HAVOC  amd64
> >>> >>
> >>> >
> >>> > Also, from chaos.ysv
> >>> >
> >>> > http://people.freebsd.org/~sbruno/bhyve.core.chaos
> >>> >
> >>> > FreeBSD chaos.ysv.freebsd.org 11.0-CURRENT FreeBSD 11.0-CURRENT #1
> >>> > r267362: Wed Jun 11 15:50:24 UTC 2014
> >>> > sbruno at chaos.ysv.freebsd.org:/usr/obj/usr/src/sys/CHAOS  amd64
> >>> >
> >>>
> >>> Can you tell us the processor and memory configuration on havoc and chaos?
> >>>
> >>> Also, could you execute the following commands on havoc:
> >>>
> >>> # bhyvectl --vm=vmname --cpu=9 --get-vmcs-guest-physical-address
> >>> -- this will output the offending guest physical address that
> >>> triggered the EPT misconfiguration
> >>>
> >>> # bhyvectl --vm=vmname --get-gpa-pmap=<gpa_from_above>
> >>> -- this will output the page table entries in the EPT that map to the
> >>> offending GPA
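> >>>
> >>> For example, a rough sketch that collects both values in one go (the
> >>> VM name and vcpu id are placeholders for whichever instance crashed;
> >>> the awk bit assumes the "gpa[N]  <address>" output format):
> >>>
> >>> #!/bin/sh
> >>> # gather EPT-misconfig breadcrumbs for a crashed bhyve VM
> >>> vm=$1      # e.g. vm1
> >>> vcpu=$2    # the N from the "vm exit[N]" console message
> >>> # offending guest physical address from the VMCS
> >>> gpa=$(bhyvectl --vm="$vm" --cpu="$vcpu" \
> >>>     --get-vmcs-guest-physical-address | awk '{print $2}')
> >>> echo "gpa: $gpa"
> >>> # EPT page table entries that map that GPA
> >>> bhyvectl --vm="$vm" --get-gpa-pmap="$gpa"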
> >>>
> >>> Hopefully that provides us with something to work with.
> >>>
> >>> best
> >>> Neel
> >>>
> >>> >
> >>
> >> chaos:
> >> CPU: Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz (2200.05-MHz K8-class CPU)
> >>   Origin="GenuineIntel"  Id=0x206d6  Family=0x6  Model=0x2d  Stepping=6
> >> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >> Features2=0x1fbee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,x2APIC,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX>
> >>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
> >>   AMD Features2=0x1<LAHF>
> >>   TSC: P-state invariant, performance statistics
> >> avail memory = 66298322944 (63227 MB)
> >>
> >> havoc:
> >> FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
> >> CPU: Intel(R) Xeon(R) CPU           E5620  @ 2.40GHz (2400.14-MHz
> >> K8-class CPU)
> >>   Origin="GenuineIntel"  Id=0x206c2  Family=0x6  Model=0x2c  Stepping=2
> >> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> >> Features2=0x29ee3ff<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,SMX,EST,TM2,SSSE3,CX16,xTPR,PDCM,PCID,DCA,SSE4.1,SSE4.2,POPCNT,AESNI>
> >>   AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
> >>   AMD Features2=0x1<LAHF>
> >>   TSC: P-state invariant, performance statistics
> >> avail memory = 16571621376 (15803 MB)
> >>
> >
> > Thanks, we'll see if there are relevant errata for these processors.
> >
> 
> Actually, these processors have entirely different microarchitectures
> (Nehalem and Sandy Bridge), so it's unlikely that this is due to
> processor errata.
> 
> >>
> >> There appear to be three vms running on havoc:
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9]          0x0000000000000000
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9]          0x0000000000000000
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
> >> --get-vmcs-guest-physical-address
> >> gpa[9]          0x0000000000000000
> >>
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm1 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c936e007 0x300002c9353007 0x300002c9352007 0
> >>
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm2 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x30000286cb0007 0x300003ad105007 0x3000019b1fd007 0
> >>
> >> root at havoc.ysv:/home/sbruno # bhyvectl --vm=vm3 --cpu=9
> >> --get-gpa-pmap=0x0000000000000000
> >> gpa 0: 0x300002c9348007 0x300002c9339007 0
> >>
> >>
> >> But there's no information available on chaos at the moment as there are
> >> no active vms running.
> >>
> >
> > Sorry, I should have explained a bit more.
> >
> > After a bhyve(8) exits because of the EPT misconfiguration error there
> > are breadcrumbs left over in the VMCS as well as the nested page
> > tables. We can use them to diagnose what happened.
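> >
> > The values that --get-gpa-pmap printed are the raw EPT entries at each
> > level of the walk. If you want to eyeball the attribute bits, here is a
> > quick sketch based on the Intel EPT entry layout (plain shell
> > arithmetic, not a bhyve tool):
> >
> > #!/bin/sh
> > # decode the low bits of one EPT entry, e.g. 0x300002c936e007
> > pte=$1
> > printf 'read=%d write=%d exec=%d\n' \
> >     $(( pte & 1 )) $(( (pte >> 1) & 1 )) $(( (pte >> 2) & 1 ))
> > # memory type (bits 5:3) and the page-size bit (7) only matter in leaf
> > # entries; write set with read clear, or a reserved memory type, is one
> > # way to end up with an EPT misconfiguration
> > printf 'memtype=%d pagesize=%d\n' $(( (pte >> 3) & 7 )) $(( (pte >> 7) & 1 ))
> > printf 'phys=0x%x\n' $(( pte & 0x000ffffffffff000 ))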
> >
> > The bhyvectl commands above should be executed after the VM exits but
> > before it is restarted again. Once it restarts, the breadcrumbs get
> > written over and are of no use.
> >
> > The "--vm=<vmname>" passed to the bhyvectl command should be of the
> > virtual machine that crashed.
> > The "--cpu=<vcpuid>" passed to the bhyvectl command should be the
> > vcpuid that detected the EPT misconfiguration. The reason I used '9'
> > as an example above was because you saw this on the console:
> >
> > vm exit[9]
> > reason          VMX
> > rip             0x0000000029286336
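> >
> > If that console output lands in a log file, the number in the brackets
> > is the vcpu id to pass to --cpu; something like this would pull out the
> > most recent one (the file name is just an example):
> >
> > sed -n 's/^vm exit\[\([0-9]*\)\].*/\1/p' vm1-console.log | tail -1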
> >
> > Hope that helps.
> >
> 
> I submitted a change in r267966 to dump this information to the
> console. It is also stashed in the process memory so we can inspect it
> in a coredump.
> 
> Would it be possible to upgrade chaos and/or havoc to r267966 so we
> can make progress on debugging this issue?
> 
> best
> Neel
> 
> > best
> > Neel
> >
> >> sean
> >>


Yeah, I'll see if I can get that done this weekend.  Waiting for build
breakages to be resolved.  :-)

sean


