Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Thu Mar 29 15:49:41 UTC 2012

> On Thursday 29 March 2012 15:42:42 Joe Greco wrote:
> > > Hi,
> 
> Do both 32- and 64-bit versions of FreeBSD crash?

We've only seen it happen on one virtual machine.  That was a 32-bit
version.  And it's not so much a crash as it is a "disk I/O hang".

The fact that it was happening regularly to that one VM, while a
bunch of other similar VMs were running alongside it without
incident, and that the problem moved with the VM as it was moved
from host to host and from Opteron to Xeon, strongly points at
something being wrong with the VM itself.  Our systems are built
mostly by script; I rebuilt the VM a few months ago and the
problem vanished.  The rebuilt system "should" have been virtually
identical to the original, though I never actually compared them.

My working theory was that something bad had happened to the VM
during a migration from one datastore to another.  We have a really
slow-writing iSCSI server that it had been migrated onto for a little
bit, which was where the problem first appeared, I believe.  At
first I thought it was the nightly cron jobs just exceeding the iSCSI
server's capacity to cope, so we migrated the VM onto a host with
local datastores, and it remained broken thereafter.

So my conclusion was that VMware's thin-provisioned disk image had
likely gotten fouled up somehow, and that under some unknown use
case it could be teased into locking up further I/O on the VM.
I wasn't able to prove it.  I tried a read-dd of the entire disk;
it passed with flying colors.  I tried several things to duplicate
the nightly periodic tasks where it seemed so prone to locking up.
They all ran fine.  But if I left the machine running, it would
lock up again eventually.
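
For anyone wanting to repeat the "read-dd" style check, here is a
minimal sketch of the same idea in Python: read the whole device from
start to end and throw the data away, so that an unreadable region
surfaces as an error.  The function name and the 1 MiB chunk size are
assumptions for illustration, not what was actually run here.

```python
def read_whole_device(path, chunk_size=1024 * 1024):
    """Read `path` sequentially from start to end, discarding the data.

    Returns the total number of bytes read. A failing read raises OSError.
    """
    total = 0
    # buffering=0 disables Python-level buffering, so each read() goes to the OS.
    with open(path, "rb", buffering=0) as dev:
        while True:
            chunk = dev.read(chunk_size)
            if not chunk:  # an empty read means end of device/file
                break
            total += len(chunk)
    return total

# Example (the device name is an assumption; substitute the real one):
#   read_whole_device("/dev/da0")
```

A clean pass only shows that every block was readable at that moment.
As described above, it does not rule out a hang that appears under
some other access pattern.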

I explained it at the time to one of my VMware friends:

> But here's where it gets weird.  Three times, now, one VM - our Jabber
> server - has gone wonky in the wee early AM hours.  Disk I/O on the VM
> just locks up.  You can type at the console until it does I/O, so you
> can put in "root" at the login: prompt but never get a pw prompt.  My
> systems all run "top" from /etc/ttys and I can see that a whole bunch
> of processes are stopped in "getblk".  It's like the iSCSI disk has gone
> away, except it hasn't, since the other VMs are all happily churning
> away, on the same datastore, on the same VMware host.

http://www.sol.net/tmp/freebsd/freebsd-esxi-lockup.gif
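
The "stopped in getblk" observation can also be captured in text form
rather than read off a console screenshot.  The sketch below parses
`ps -axo pid,state,wchan,comm` output and picks out processes whose
wait channel is "getblk".  The helper names are hypothetical, and the
exact column layout of ps is an assumption worth checking on the
system in question.

```python
import subprocess


def parse_ps(ps_output):
    """Parse `ps -axo pid,state,wchan,comm` text into (pid, state, wchan, comm) tuples."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split(None, 3)
        if len(fields) == 4:
            pid, state, wchan, comm = fields
            rows.append((int(pid), state, wchan, comm))
    return rows


def stuck_in(rows, wchan="getblk"):
    """Return only the rows whose wait channel equals `wchan`."""
    return [r for r in rows if r[2] == wchan]


def snapshot():
    """Run ps and return the parsed rows (assumes ps itself can still run)."""
    out = subprocess.check_output(["ps", "-axo", "pid,state,wchan,comm"])
    return parse_ps(out.decode("ascii", "replace"))
```

Once disk I/O has frozen, starting a new ps may hang as well, so
something like this would probably have to be running already, much as
"top" is on the console.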

> Now it's *possible* that the problem actually happens after the 3AM cron
> run (note slight CPU/memory drop) but the Jabber implosion actually
> happens around 0530, see drop in memory%.  But the root problem at the
> VM level seems to be that disk I/O has frozen.  I can't tell for sure when
> that happens.  All three instances are similar to this.
> 
> I can't explain this or figure out how to debug it.  Since it's locked up
> right now, thought I'd ping you for ideas before resetting it.
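
One possible way to narrow down when the I/O actually freezes is a
small heartbeat that appends a timestamp to a file and forces it to
disk at a fixed interval.  After a reset, the last timestamp in the
file gives a lower bound on the time of the hang.  This is only a
sketch: the file path, the 60-second interval, and the function names
are all assumptions.

```python
import os
import time


def heartbeat_once(path):
    """Append a timestamp to `path`, force it to disk, return seconds taken."""
    start = time.time()
    with open(path, "a") as f:
        f.write("%.3f\n" % start)
        f.flush()
        os.fsync(f.fileno())  # blocks until the data has reached the disk
    return time.time() - start


def last_heartbeat(path):
    """Return the last valid timestamp recorded in `path`, or None."""
    last = None
    with open(path) as f:
        for line in f:
            try:
                last = float(line.strip())
            except ValueError:
                pass  # ignore blank or torn lines left by a hard reset
    return last


def run(path, interval=60):
    """Write a heartbeat every `interval` seconds until I/O hangs or we are killed."""
    while True:
        heartbeat_once(path)
        time.sleep(interval)
```

If the disk locks up, the fsync simply never returns and the
timestamps stop, which is the signal being looked for.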

Now that was actually before we migrated it back to a local datastore,
but when we did, the problem remained, suggesting that whatever
happened to the VM is contained within the VM's vmdk or other
files.

... JG
-- 
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.