[Bug 225791] ena driver causing kernel panics on AWS EC2

Sun Sep 9 00:27:21 UTC 2018

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225791

Leif Pedersen <leif at ofWilsonCreek.com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |leif at ofWilsonCreek.com

--- Comment #18 from Leif Pedersen <leif at ofWilsonCreek.com> ---
(In reply to pete from comment #16)

I've been able to reproduce this repeatedly (but not predictably) on 11.2 on an
r4.large. Not to state the blindingly obvious, but smaller instances such as
t2.* aren't affected since they use xn instead of ena. The panic seems to be
most likely at times of high network IO, which again risks stating the
forehead-slappingly obvious. :)

Multiple times, the crash included the same back-trace shown in this bug.
However, at least once it panicked on a double-fault, which, if related,
suggests that the bug in ena could be causing memory corruption. Now granted,
I only know of one instance of a double-fault, so it could've been running on
a host with faulty RAM or something at the time. However, after each panic, I'd
stop/start the instance rather than reboot, to provoke it to move to new
hardware, so I'm not suggesting that the whole bug is merely from faulty host
hardware.

I'd ask that the fix be patched into 11.2, or at least included in 11.3, so it
won't have to wait for 12. Otherwise, AWS users will find themselves
stuck on 11.1, and the approaching EOL of 11.1 will leave them without security
updates, which in turn makes this an indirect security issue. However, I
understand there are other considerations at play, and very much appreciate the
relentless work of the security team (not to mention the work on AWS support
and FreeBSD in general).

Probably too much detail: The particular case was our standby MySQL database on
an r4.large. It was stable on 11.1, and problematic after I upgraded it to 11.2
(with `freebsd-update upgrade`); after five or so crashes in a month, I
downgraded it back to 11.1 (again with `freebsd-update upgrade`), after which
it has been perfectly stable for a couple of weeks now. It's in master-master
replication with our production replica, and normally gets a fairly low but
steady stream of activity from the replication. However, we have several
nightly jobs that crank away on updating a model and cause a large volume of
traffic in the replication stream. I don't have proper metrics on bytes/sec, so
I don't have any idea whether it saturates the interface. It's enough that
replication falls behind for up to a few hours, but I wouldn't call our system
"huge" in terms of network traffic by any means.

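One rough way to put a number on that would be to sample the interface's
cumulative byte counter twice and divide the difference by the interval. The
sketch below is only an illustration: the function name and the sample values
are hypothetical, the counters are assumed to come from something like
`netstat -I ena0 -b`, and the link speed is a parameter because the real
capacity of the instance is not known here.

```python
def throughput(bytes_before, bytes_after, interval_s, link_bits_per_s):
    """Return (bytes_per_sec, fraction_of_link) from two cumulative samples.

    All arguments are supplied by the caller; nothing is read from the system.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if bytes_after < bytes_before:
        # The counter wrapped or was reset between samples; take new samples.
        raise ValueError("counter went backwards; resample")
    bytes_per_sec = (bytes_after - bytes_before) / interval_s
    # Convert bytes/sec to bits/sec before comparing against the link speed.
    utilization = bytes_per_sec * 8 / link_bits_per_s
    return bytes_per_sec, utilization


# Hypothetical samples taken 60 seconds apart on an assumed 1 Gbit/s link.
rate, util = throughput(1_000_000_000, 1_750_000_000, 60, 1_000_000_000)
print(f"{rate:.0f} B/s, {util:.1%} of link")
```

Sampling this during the nightly jobs and during normal replication would show
whether the interface gets anywhere near saturation when the panics occur.
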
The reason I included all that detail is to point out: (1) it seems to be a
regression between 11.1 and 11.2, (2) r4.* are for sure affected, and (3) it
may be that the problem is more likely to be triggered by moderate or bursty
network traffic with much task-switching between MySQL threads than by a
simple stream such as a high-speed file transfer.

-Leif
