Please help me diagnose this crazy VMware/FreeBSD 8.x crash

Thu Mar 29 14:41:07 UTC 2012

> Hi,
> 
> * have you filed a PR?
> * is the crash easily reproducible?
> * are you able to boot some ramdisk-only FreeBSD-8.2 images (e.g. create
> a ramdisk image using nanobsd?) and do some stress testing inside
> that?
> 
> It sounds like you've established it's a storage issue, or at least
> an interrupt-handling issue for storage. So I'd definitely try the
> ramdisk-only boot and thrash it using lighttpd/httperf or something.
> If that survives fine, I'd look at trying to establish whether there's
> something wrong in the disk driver(s) FreeBSD is using. I'm not that
> cluey on ESXi, but there may be some PIC/APIC/ACPI change between 7.x
> and 8.0 which has caused this to surface.

We've seen this, or something very much like it.

We run dozens of FreeBSD VMs, many of which are 8.mumble.  We have a
scripted build environment dating back many years, so generally servers
come out in a fairly reproducible form.

After several months of smooth running, we needed to shuffle some
things around, and migrated some servers to a different datastore.
Suddenly, one particular VM, our corporate Jabber server, started
randomly disconnecting people every morning.  Some inspection showed
that the machine was running, but disk I/O in the VM was freezing up.
Subsequent inspection suggested that it was happening during the
periodic daily run, though we never managed to trigger it by manually
forcing periodic daily, so that's only a theory.  Since one of the
find commands appeared to be running on several of these occasions,
I guessed that something in the thin-provisioned disk image for the
system had gone bad.  However, reading the entire disk with dd didn't
cause a hang, and neither did running the periodic daily by hand.

Migrating the VM to a different host and datastore did not fix the
issue.  Migrating the VM from an Opteron to a Xeon host with all the
latest ESXi 4 patches also didn't make any difference.  Converting the
disk image from thin to thick provisioning seemed to fix it, but I
only gave it a day or two, then decided there were other good reasons
to reload the VM, so I nuked the VM, which, of course, fixed it.

In the meantime, a dozen other similar VMs alongside it run just
fine.  My conclusion was that something specific had gone awry in
that virtual machine, probably in the disk image, but I could not
identify it without significant digging that I had no particular
reason or inclination to do.  Since it appeared to be a VMware
problem, "reload it and be done with it" seemed the quickest path to
resolution.

That having been said, if anyone has any brilliant ideas about what
would constitute useful further steps to isolate this, I can look at
recovering the faulty VM from backup and seeing if it still exhibits
the problem.
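
If the VM does get recovered, one cheap thing to leave running inside
it would be a small I/O probe that timestamps any slow write, so a
freeze can be lined up against the periodic daily schedule.  What
follows is only a rough sketch in Python and has not been run against
the faulty VM.  The function names, the probe path, the interval and
the one-second threshold are all made up for illustration.

```python
import os
import time


def probe_once(path, payload=b"x" * 4096):
    """Write one small block, fsync it, and return the elapsed seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, payload)
        os.fsync(fd)  # force the write down to the (virtual) disk
    finally:
        os.close(fd)
    return time.monotonic() - start


def is_stall(elapsed, threshold=1.0):
    """Treat any probe at or above the threshold (seconds) as a stall."""
    return elapsed >= threshold


def watch(path, interval=5.0, threshold=1.0, rounds=None, log=print):
    """Probe repeatedly and log stalls; return the number of stalls seen.

    rounds=None means run until interrupted.
    """
    done = 0
    stalls = 0
    while rounds is None or done < rounds:
        elapsed = probe_once(path)
        if is_stall(elapsed, threshold):
            stalls += 1
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            log("%s stall: write+fsync took %.3fs" % (stamp, elapsed))
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)
    return stalls


if __name__ == "__main__":
    # Hypothetical probe file; put it on the filesystem under suspicion.
    watch("/var/tmp/ioprobe.dat")
```

Note that if disk I/O freezes completely, the probe itself blocks, so
the log line only appears once the write finally returns; the reported
duration then shows roughly how long the freeze lasted.  Sending the
output somewhere off the VM (a serial console or the network) would
avoid having the log depend on the same disk.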

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.