[Bug 239894] security.bsd.stack_guard_page default causes Java to crash

Tue Aug 20 21:26:53 UTC 2019

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=239894

--- Comment #6 from Greg Lewis <glewis at FreeBSD.org> ---
Hi Konstantin,

I think my explanation hasn't been clear enough.  So let me try and include a
few more links and some diagrams.

Here is a diagram for what the Java thread stack looks like from
https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/hotspot/os/bsd/os_bsd.cpp#L4262

   Low memory addresses
    +------------------------+
    |                        |\  Java thread created by VM does not have
    |   pthread guard page   | - pthread guard, attached Java thread usually
    |                        |/  has 1 pthread guard page.
 P1 +------------------------+ Thread::stack_base() - Thread::stack_size()
    |                        |\
    |  HotSpot Guard Pages   | - red, yellow and reserved pages
    |                        |/
    +------------------------+ JavaThread::stack_reserved_zone_base()
    |                        |\
    |      Normal Stack      | -
    |                        |/
 P2 +------------------------+ Thread::stack_base()

When the JVM is creating the HotSpot guard pages, the kernel, based on the
security.bsd.stack_guard_page setting, will create some extra guarded pages
that extend into the normal stack region.  This causes the SIGSEGV to have a
fault address in the normal stack region.

There are two initial problems with this.

The first is that StackOverflowError is defined as an error that is "Thrown
when a stack overflow occurs because an application recurses too deeply."
(see
https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/StackOverflowError.html).
 However, there are other reasons a SIGSEGV could occur in the normal stack
region (e.g. a buffer overflow).  The JVM uses the guard pages to be able to
detect that it is clearly a stack overflow that is causing the SIGSEGV rather
than any other possible cause.  You can observe this in the JVM source itself. 
See
https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/hotspot/os_cpu/bsd_x86/os_bsd_x86.cpp#L510
where it checks for the fault address being in the guard zone (first in the
reserved + yellow zones, which it tries to handle gracefully, and then in the
red zone, which is less graceful).  The code on Linux is very similar (see
https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/hotspot/os_cpu/linux_x86/os_linux_x86.cpp#L356).
 I'll note that the continuation of the code provides for some different
handling if the fault address doesn't occur within the guard pages.
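
To make the check described above concrete, here is a minimal sketch of
classifying a fault address against the zones in the diagram.  It is not the
HotSpot code: the StackLayout struct, the FaultZone enum and classify_fault()
are hypothetical names invented for illustration, and the zone order (red
lowest, then yellow, then reserved) is my assumption about the layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical description of one thread stack (illustrative names only).
struct StackLayout {
    std::uintptr_t base;   // highest address, P2 in the diagram
    std::size_t size;      // total size, so P1 = base - size
    std::size_t red;       // red zone size in bytes
    std::size_t yellow;    // yellow zone size in bytes
    std::size_t reserved;  // reserved zone size in bytes
};

enum class FaultZone { Red, Yellow, Reserved, NormalStack, OutsideStack };

// Assumed order from low to high addresses: red, yellow, reserved, normal.
FaultZone classify_fault(const StackLayout& s, std::uintptr_t addr) {
    const std::uintptr_t low = s.base - s.size;               // P1
    const std::uintptr_t red_end = low + s.red;
    const std::uintptr_t yellow_end = red_end + s.yellow;
    const std::uintptr_t reserved_end = yellow_end + s.reserved;

    if (addr < low || addr >= s.base) return FaultZone::OutsideStack;
    if (addr < red_end) return FaultZone::Red;
    if (addr < yellow_end) return FaultZone::Yellow;
    if (addr < reserved_end) return FaultZone::Reserved;
    // A fault here cannot be attributed to a stack overflow by this check.
    return FaultZone::NormalStack;
}
```

In this sketch, a fault on a page the kernel guarded inside the normal stack
region comes back as NormalStack, i.e. it is indistinguishable from any other
bad access to the stack.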

The second is that you'll note in that code that when a stack overflow does
occur, the JVM will often unprotect portions of the guard zone it has set up. 
E.g. at
https://github.com/battleblow/openjdk-jdk11u/blob/bsd-port/src/hotspot/os_cpu/bsd_x86/os_bsd_x86.cpp#L525.
 This is because a StackOverflowError is something the Java program can catch
and ignore, if it so chooses.  The reserved pages provide an area the JVM can
unprotect to allow a critical code section to complete so that a Java program
which catches StackOverflowError and continues execution will not be left in a
condition where, for example, it is deadlocked due to the fault occurring
during the critical section of changing a lock state.  The pages created by the
security.bsd.stack_guard_page setting create problems with doing this.  For
starters, the fault is not in the reserved section but in the normal stack, so
unprotecting the reserved pages won't help.  Also, since it was the kernel
which protected the pages, the JVM can't unprotect them.  As a result the
critical section can't complete, which may leave data structures in an
inconsistent state, including a deadlock as above.  The JEP
(https://openjdk.java.net/jeps/270) goes into a
lot more detail around this and the motivation for introducing reserved pages.
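
As a rough illustration of why the JVM can lift the protection on its own
guard pages, here is a small sketch using mmap() and mprotect().  The
GuardedRegion struct and the three helper functions are hypothetical names
made up for this example, and an anonymous mapping stands in for a real thread
stack.  It only shows the case where the process itself applied the
protection; it does not model the guard created by the kernel.

```cpp
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical stand-in for a thread stack with guard pages at the low end.
struct GuardedRegion {
    char* low;                // lowest address of the mapping
    std::size_t page;         // page size in bytes
    std::size_t pages;        // total pages mapped
    std::size_t guard_pages;  // pages at the low end set to PROT_NONE
};

// Map `pages` pages read/write, then make the lowest `guard_pages` inaccessible.
bool make_region(GuardedRegion& r, std::size_t pages, std::size_t guard_pages) {
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || guard_pages > pages) return false;
    const std::size_t len = pages * static_cast<std::size_t>(page);
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    if (p == MAP_FAILED) return false;
    if (mprotect(p, guard_pages * static_cast<std::size_t>(page), PROT_NONE) != 0) {
        munmap(p, len);
        return false;
    }
    r = GuardedRegion{static_cast<char*>(p), static_cast<std::size_t>(page),
                      pages, guard_pages};
    return true;
}

// Lift the protection again, as a runtime might do with pages it guarded itself.
bool unprotect_guard(const GuardedRegion& r) {
    return mprotect(r.low, r.guard_pages * r.page, PROT_READ | PROT_WRITE) == 0;
}

void release_region(const GuardedRegion& r) {
    munmap(r.low, r.pages * r.page);
}
```

After unprotect_guard() succeeds, the formerly guarded pages can be written
to, which is the property the reserved zone relies on.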

There are some other problems here as well.  E.g., the JVM can't predictably
determine which pages might have been protected by the kernel, since the sysctl
can be changed dynamically but libthr can cache thread stacks.  These are less
likely but still problematic.

Hopefully that has provided some clarification.  I'd also like to draw your
attention to Kurt's comment that this doesn't just impact the JVM but the
interaction with libthr in general.  This is something to consider in terms of
a proposed fix.

I'm also curious about how Linux (and other OSes) went about fixing the Stack
Clash vulnerability and whether there is an approach there that might not cause
application issues like this.
