Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
jgreco at ns.sol.net
Fri Mar 30 16:53:14 UTC 2012
> On Fri, 30 Mar 2012 09:44:47 -0500, Joe Greco <jgreco at ns.sol.net> wrote:
> > Have you migrated these hosts, or were they installed in-place and
> > never moved?
> > fwiw the apparent integrity of things on the VM is consistent with
> > our experience too.
> VMMotion and StorageVMMotion does not seem to affect the stability. Even
> deleting the VM, rebuilding from scratch, re-installing all packages from
> scratch, copying over a few configs and then copying in any other data
> (perhaps website data) does not solve the problem.
On the same vmdk files? "Deleting the VM" makes it sound like not.
> However, our two most notorious for crashing happen to be webservers. We
> moved one to hardware. We simply rsync'd the exact data (entire OS and
> files) off the VM onto hardware, made a few config changes (fstab, network
> interface) and it's been running for 4+ months now with zero crashes.
That part doesn't shock me at all.
> I don't think it's corruption :/
Then it is hard to see what it is.
From my perspective:
We had a perfectly functional, nearly zero-traffic VM, since Jabber
traffic averages no more than a few messages per hour. It was working
for quite some time.
We moved it from a local datastore to an iSCSI datastore that ended up
getting periodically crushed by the load (in particular during the
periodic daily load imposed by a bunch of VM's all running at once).
At this point, this one VM started hanging on I/O. We expected that
this would clear up upon return to a host with a local datastore. It
did not.
This ended up as a broken VM, one that would hang up overnight, maybe
not every night, but several times a week at least.
None of the other VM's, even the VM's that had been abused in this
horribly insensitive manner of being placed on intolerably slow iSCSI,
developed this condition.
There are dozens of other VM's running on these hosts, alongside the
one that was exhibiting this behaviour.
The VM continued to exhibit this behaviour even after having been moved
onto a different ESXi platform and architecture (Opteron->Xeon).
For the problem to "follow" the VM in this manner, and afflict *only*
the one VM, strongly suggests that it is something that is contained
within the VM files that constitute this VM. That is consistent with
the observation that the problem arose at a point where the VM is
known to have had all those files moved from one location to a dodgy
one.
That's why I believe the evidence points to corruption of some sort.
Of course, your case makes this all interesting.
Joe Greco - sol.net Network Services - Milwaukee, WI - http://www.sol.net
"We call it the 'one bite at the apple' rule. Give me one chance [and] then I
won't contact you again." - Direct Marketing Ass'n position on e-mail spam(CNN)
With 24 million small businesses in the US alone, that's way too many apples.