Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

Fri Mar 30 01:25:26 UTC 2012

On 28/03/2012 22:59, Mark Felder wrote:
> Alright guys, I'm at the end of my rope here. For those that haven't 
> seen my previous emails here's the (not so) quick breakdown:
>
> Overview:
>
> FreeBSD ?? - 7.4 never crash
> FreeBSD 8.0 - 8.2 crashes
> FreeBSD 8-STABLE, 8.3, and 9.0 are untested (Sorry, not possible in 
> our production at this time, and we were hoping we could base some 
> stuff on 8.3 for long term stability...)
> ESXi: Confirmed ESXi 4.0 - 5.0 has this problem. Haven't tested on 
> others.
>
>
> History:
>
> Over the course of the last 2 years we've been banging our heads on 
> the wall. VMWare is done debugging this. They claim it's not a VMWare 
> issue. They can't identify what the heck happens. We had a glimmer of 
> hope with ESXi 5.0 fixing it because we never saw any crashes in the 
> handful of deployments, but our dreams were crushed today -- two days 
> before an outage to begin migration to ESXi 5.0 -- when a customer's 
> ESXi 5.0 server and FreeBSD 8.2 guest crashed.
>
>
> Crash Details:
>
> The keyboard/mouse usually stops responding for input on the console; 
> normally we can't type in a username or password. However, we can 
> switch VTs.
>
> If there's a shell on the console and we can type, we can only run 
> things in memory. Any time we try to access the disk it will hang 
> indefinitely.
>
> The server still has network access. We can ping it without issue. SSH 
> of course kicks you out because it can't do any I/O.
>
> If we were to serve a lightweight http server off a memory backed 
> filesystem I'm confident it would run just fine as long as it wasn't 
> logging or anything.
>
> On ESXi you see that there is a CPU spike of 100% that goes on 
> indefinitely. No idea what the FreeBSD OS itself thinks it is doing 
> because we can't run top during the crash.
>
> This crash can affect a server and happen multiple times a week. It 
> can also not show up for 180 days or more. But it does happen. The 
> server can be 100% idle and crash. We have servers that do more I/O 
> than the ones that crash could ever attempt to do and these don't 
> crash at all. Completely inexplicable.
>
>
> Things we've looked into:
>
> Nothing about the installed software matters. We've tried cross 
> referencing the crashed servers by the programs they run but the base 
> OS is the only common denominator due to the wide variety of servers 
> it has affected.
>
> Storage doesn't matter. We've tried different iSCSI SANs, we've tried 
> different switches, we've tried local datastores on the ESXi servers 
> themselves.
>
> HP servers, Dell servers -- doesn't seem to matter either. (All with 
> latest firmwares, BIOSes, etc)
>
> VMWare gave us a ton of debugging tasks, and we've given them 
> gigabytes of debugging info and data; they can't find anything.
>
> VMWare tools -- with, without, using open-vm-tools makes no 
> difference. I think we've done a fair job ruling out VMWare.
>
>
> I think we've finally found enough data that this is definitely 
> something in the FreeBSD world. I'm going to begin prepping some of 
> the known crashy servers with more debugging. Any suggestions on what 
> I should build the kernel with? They never do a proper panic, but I 
> definitely want to at least *try* to get into the debugger the next 
> time it crashes. And when it crashes, what the heck should I be 
> running? I've never played with the KDB before...
>
>
> Thank you for any suggestions and help you can give me....

Sorry, I'm coming a bit late to the party.

I have seen similar behavior on a few VMs, all of them running either 
Debian or FreeBSD. Even though CPU figures are not necessarily 
meaningful inside a VM, vi launched through crontab -e would take an 
insane amount of CPU (up to 84%), and Apache was sitting at around 350% 
to 400% (on a quad-CPU VM).
Now the thing is that taking a VM snapshot and deploying it a while 
later, or on a different (far less loaded) VMWare platform, would 
make the VM perfectly usable again.
Shutting down the VM and starting it again with only one CPU would also 
basically solve the problem. Debian seemed to be able to survive the 
crisis, but its disk I/O had latencies of many seconds, sometimes 
minutes. This would happen only on a heavily loaded VMWare host. 
Similarly, older versions of Debian never showed the problem.

Can you test whether you see similar behavior on your platform?
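
One rough way to check for the multi-second disk latencies described 
above is to time small synchronous writes from inside the guest. The 
following is only a minimal sketch, assuming Python 3 is installed on 
the VM; the names timed_write and probe, the scratch file name, and the 
1-second threshold are all made up for illustration. It has to be 
started before the trouble begins, because a guest whose disk I/O is 
already hung will not be able to load the interpreter.

```python
import os
import sys
import time


def timed_write(directory, size=4096):
    """Write `size` bytes to a scratch file in `directory`, fsync it,
    remove it, and return the elapsed wall-clock time in seconds."""
    # Hypothetical scratch file name; the PID keeps parallel runs apart.
    path = os.path.join(directory, ".io_probe.%d" % os.getpid())
    start = time.monotonic()
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        os.write(fd, b"\0" * size)
        os.fsync(fd)  # push the data down to the (virtual) disk
    finally:
        os.close(fd)
        os.unlink(path)
    return time.monotonic() - start


def probe(directory, samples=5, interval=1.0, threshold=1.0):
    """Take `samples` timed writes, `interval` seconds apart, and return
    a list of (elapsed_seconds, is_slow) tuples. A sample counts as slow
    when it takes at least `threshold` seconds (an arbitrary cut-off)."""
    results = []
    for i in range(samples):
        elapsed = timed_write(directory)
        results.append((elapsed, elapsed >= threshold))
        if i + 1 < samples:
            time.sleep(interval)
    return results


if __name__ == "__main__":
    # Probe the directory given on the command line, or the current one.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for elapsed, slow in probe(target):
        print("%.3fs%s" % (elapsed, "  <-- slow" if slow else ""))
```

Point it at a directory on the virtual disk rather than on a 
memory-backed filesystem, otherwise the writes never reach the 
datastore. Comparing the numbers with one CPU and with several, or on 
a loaded and an idle host, would show whether the pattern matches.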