[Bug 243531] Unstable ena and nvme on AWS

Fri Feb 21 18:44:45 UTC 2020

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=243531

--- Comment #2 from Leif Pedersen <leif at ofWilsonCreek.com> ---
I'm at a bit of a loss to come up with anything particularly helpful. A few
thoughts, though they are mostly naive observations and wild speculation:

It seems that when one machine has a problem, several others do too. This
suggests that it could be triggered by a shared event in the host's
networking or EBS. (None of our instances have local storage.) I don't have
enough machines or samples to show that it's not just a coincidence though.

The nvme errors are always (or almost always?) accompanied by ena errors, but
ena errors sometimes happen without nvme errors. That suggests it might be
triggered by a network event in the AWS hosting infrastructure, such as a
network topology change.

I'll attach a /var/log/all.log and the screenshot from a crash that happened
today. Probably nothing new there. This time, the machine did not panic, but
rather wedged after Nagios reported its CPU load at 9. There's nothing running
on this one besides the hourly zfs snapshot transfers, so I think the load came
from processes piling up while waiting for IO.

The error messages are spread over many minutes, starting with ena errors at
02:20:16, with nvme errors not appearing until 02:28:06. That seems odd, more
like a problem that ramps up slowly than an abrupt crash.
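
For anyone wanting to measure that delay across several incidents, here is a
rough sketch in Python. It assumes the log uses the traditional syslog
timestamp prefix ("Mon DD HH:MM:SS") and that driver messages carry a unit
prefix such as "ena0:" or "nvme0:". The helper names (first_seen, gap_seconds)
and the sample lines are made up for illustration; the message text is a
placeholder, not real driver output, and only the two timestamps come from this
report.

```python
import re
from datetime import datetime

# Assumed line shape: "Feb 21 02:20:16 host kernel: ena0: ..."
STAMP = re.compile(r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2})\s")


def first_seen(lines, driver):
    """Return HH:MM:SS of the first line naming a unit of `driver`, else None."""
    unit = re.compile(r"\b" + re.escape(driver) + r"\d+:")
    for line in lines:
        m = STAMP.match(line)
        if m and unit.search(line):
            return m.group(1)
    return None


def gap_seconds(start, end):
    """Seconds between two same-day HH:MM:SS stamps (no midnight wrap)."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds())


# Placeholder lines; only the timestamps are taken from the report.
sample = [
    "Feb 21 02:20:16 host kernel: ena0: <placeholder message>",
    "Feb 21 02:28:06 host kernel: nvme0: <placeholder message>",
]

print(gap_seconds(first_seen(sample, "ena"), first_seen(sample, "nvme")))  # 470
```

For this incident the gap works out to 470 seconds, or 7 minutes 50 seconds.
Running the same thing over the all.log from each affected machine might show
whether the delay is consistent between incidents.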

It's also interesting that the messages shown in the console screenshot made
it into syslog, so IO must have recovered, if only briefly.
