[Bug 225330] clang bug can incorrectly enable or disable interrupts on amd64
bugzilla-noreply at freebsd.org
bugzilla-noreply at freebsd.org
Sat Jan 20 01:03:28 UTC 2018
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225330
Bug ID: 225330
Summary: clang bug can incorrectly enable or disable interrupts
on amd64
Product: Base System
Version: CURRENT
Hardware: amd64
OS: Any
Status: New
Severity: Affects Some People
Priority: ---
Component: bin
Assignee: freebsd-bugs at FreeBSD.org
Reporter: jtl at freebsd.org
Created attachment 189922
--> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=189922&action=edit
Patch to make clang do a slightly better job at restoring the EFLAGS register
While making a change to kernel code, I recently discovered a compiler bug that
can incorrectly enable or disable interrupts on amd64.
Imagine this code:
void
fx(void)
{
_Bool needtocall;
mtx_lock_spin(&lock);
needtocall = (--count == 0);
mtx_unlock_spin(&lock);
if (needtocall)
otherfx();
}
If you compile this code in clang (5.0.0, 5.0.1, or 6.0.0 all behave similarly
in the critical respect) either in a kernel module or with mutex inlining
disabled, you get this assembly (dumped with `objdump -S` to interleave the
assembly and source code):
void
fx(void)
{
0: 55 push %rbp
1: 48 89 e5 mov %rsp,%rbp
_Bool needtocall;
mtx_lock_spin(&lock);
4: 53 push %rbx
5: 50 push %rax
6: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
d: 31 f6 xor %esi,%esi
f: 48 c7 c2 00 00 00 00 mov $0x0,%rdx
16: b9 18 00 00 00 mov $0x18,%ecx
1b: e8 00 00 00 00 callq 20 <fx+0x20>
needtocall = (--count == 0);
20: ff 0c 25 00 00 00 00 decl 0x0
27: 9c pushfq
28: 5b pop %rbx
mtx_unlock_spin(&lock);
29: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
30: 31 f6 xor %esi,%esi
32: 48 c7 c2 00 00 00 00 mov $0x0,%rdx
39: b9 1a 00 00 00 mov $0x1a,%ecx
3e: e8 00 00 00 00 callq 43 <fx+0x43>
if (needtocall)
otherfx();
43: 48 83 c4 08 add $0x8,%rsp
47: 53 push %rbx
48: 9d popfq
_Bool needtocall;
mtx_lock_spin(&lock);
needtocall = (--count == 0);
mtx_unlock_spin(&lock);
if (needtocall)
49: 74 03 je 4e <fx+0x4e>
otherfx();
}
4b: 5b pop %rbx
4c: 5d pop %rbp
4d: c3 retq
mtx_lock_spin(&lock);
needtocall = (--count == 0);
mtx_unlock_spin(&lock);
if (needtocall)
otherfx();
4e: 5b pop %rbx
4f: 5d pop %rbp
50: e9 00 00 00 00 jmpq 55 <count+0x51>
The critical part is the way that clang handles the conditional:
1. Decrement the value. This will set ZF if count reaches 0:
20: ff 0c 25 00 00 00 00 decl 0x0
2. Now, save the status flags (particularly, ZF) using pushfq. Critically,
pushfq saves the entire EFLAGS value, which includes the interrupt flag.
Because we are holding a spin mutex, interrupts are disabled at this point:
27: 9c pushfq
28: 5b pop %rbx
3. Now, call the function to unlock the spin mutex, which will potentially
enable interrupts.
4. Now, restore the status flags. Critically, popfq restores the entire EFLAGS
value, which includes the interrupt flag. Because interrupts were disabled when
the flags were stored, these instructions will disable them again:
43: 48 83 c4 08 add $0x8,%rsp
47: 53 push %rbx
48: 9d popfq
5. Now, use the status flags to determine whether to call the function:
49: 74 03 je 4e <fx+0x4e>
After running this code, interrupts may remain erroneously disabled.
(This is a contrived minimal test case, but I did run into this bug while
testing a code change in the kernel. Interrupts remained erroneously disabled
after dropping a spin lock. And, in that case, the code was in the actual
kernel -- not a module.)
By now, the bug should be obvious: clang shouldn't save and restore the entire
EFLAGS register. It doesn't know what changes these other function calls make
to the values of the EFLAGS register about which clang doesn't care (e.g. IF,
IOPL, AC, etc.). By saving and restoring EFLAGS across function calls, it can
load incorrect values for these other flags.
It looks like the correct solutions are to either not save EFLAGS across
function calls, or to just save the individual flag(s) that need to be
evaluated across the function call. (For example, a simple setz/test would
probably be sufficient -- and more efficient -- in this case.) However, these
are very significant changes that are complicated.
Two other things to note:
1. If clang knows that the target hardware supports lahf/sahf, it will use
those instructions instead. That solves the problem, as it doesn't reset any
flags other than status flags. (It still may not be as efficient as setz, but
at least it is safe for kernel code.) lahf/sahf are supported on all i386 and
*most* (but not all) amd64 CPUs. (Apparently, a few older amd64 CPUs didn't
support the instruction.) Therefore, clang assumes lahf/sahf are supported on
all i386 architectures, but assume it is not supported on amd64 unless the user
specifies a target CPU class known to support it. This is a long way of saying
that the default amd64 FreeBSD build assumes that it is not supported, leading
to the buggy use of pushf/popf shown above.
2. In practice, this seems to very seldom show up in our kernel binaries. Even
a trivial change to the above code (for example, "--count == 5" instead of
"--count == 0") would make the compiler no longer save the EFLAGS register
across the function call. I quickly audited the kernel and modules I compile
and use at work, and found only one place that compiles with the pushf/popf
instructions to save/restore the EFLAGS register across a function call. (Of
course, this fragility can work in the reverse: a trivial change, such as the
one I tried when I found this bug, can cause the compiler to use the pushf/popf
instructions where it had not previously and, therefore, introduce a bug in
interrupt handling.)
Finally, I have a patch that "fixes" (or, probably more accurately, "works
around") the bug. It changes the pushf/popf so that the program will retrieve
the latest EFLAGS register at restore time and only modify the status bits.
In C Pseudo-code, it does this:
save_eflags(register dest) {
dest = load_eflags();
dest &= (OF|SF|ZF|AF|PF|CF);
}
restore_eflags(register src) {
register RAX;
RAX = load_eflags();
RAX &= ~(OF|SF|ZF|AF|PF|CF);
RAX |= src;
set_eflags(RAX);
}
The new assembly sequence looks like this:
Save EFLAGS:
28: 9c pushfq
29: 5b pop %rbx
2a: 48 81 e3 d5 08 00 00 and $0x8d5,%rbx
Restore EFLAGS:
4f: 9c pushfq
50: 48 8b 04 24 mov (%rsp),%rax
54: 48 25 2a f7 ff ff and $0xfffffffffffff72a,%rax
5a: 48 09 d8 or %rbx,%rax
5d: 48 89 04 24 mov %rax,(%rsp)
61: 9d popfq
This is far from ideal, but is better than what is there. And, the good news is
that the compiler seems to actually save/restore EFLAGS very few places in our
kernel code (hopefully, none which are in hot code paths). So, perhaps, this is
good enough for now.
Yet another alternative would be to enable lahf/sahf by default, and provide a
way for people to disable it if their CPU doesn't support those instructions
(with an appropriate warning about how it may be buggy).
--
You are receiving this mail because:
You are the assignee for the bug.
More information about the freebsd-bugs
mailing list