[Bug 238700] mmap bug when using file-systems with small blocks and seeking randomly to read from the file descriptor before mapping

Wed Jun 19 04:38:27 UTC 2019

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238700

            Bug ID: 238700
           Summary: mmap bug when using file-systems with small blocks and
                    seeking randomly to read from the file descriptor
                    before mapping
           Product: Base System
           Version: 12.0-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: kern
          Assignee: bugs at FreeBSD.org
          Reporter: alan.murtagh at vorteil.io

I've been hacking on the FreeBSD kernel and I believe I've discovered a rare
bug with the mmap syscall. I've spent quite some time trying to understand
exactly what the cause is, and at this point I think I've got a pretty good
idea, logically, what's wrong. I'm not experienced enough to find the problem
in the code, so I'll just have to give my best description of the problem. 

I'm using Ext2 with 1 KiB blocks to store some files. This is important because
a block size smaller than the system's page size is necessary to cause the bug.
I do not know if the bug is exclusive to Ext, however. 

I have a file that is 13773 bytes long stored on this file-system. That means
four 4KiB pages would be needed to map its 14 blocks of data. What I've noticed
is that if I open the file and use the file descriptor to mmap the entire
contents, then read the entire contents, I trigger four separate page faults to
load in that data. Demand paging. Makes sense so far. 
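
The block and page counts above follow from rounding the file size up to the
next whole unit. A minimal sketch of that arithmetic (the helper name
units_needed is made up for illustration and is not part of any FreeBSD API):

```c
#include <assert.h>
#include <stddef.h>

/* Number of fixed-size units (blocks or pages) needed to hold `size` bytes. */
size_t units_needed(size_t size, size_t unit)
{
    assert(unit > 0);
    return (size + unit - 1) / unit;   /* round up */
}
```

With size 13773, a unit of 1024 gives 14 blocks and a unit of 4096 gives 4
pages, matching the numbers in the description.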

I've also noticed that if I use the read syscall to read data from the file
before using the mmap syscall on the same file descriptor, I will not trigger
the same number of page faults after mapping the file in. Reading four bytes
from the start (a typical magic number check) seems to pre-fetch two full pages
worth of data from the file, and somehow stores that data in a way that makes
the first two page faults unnecessary. A cool bit of optimization that still
makes perfect sense. 

If, before calling mmap, I seek to 128 bytes from the end of the file and read
from there, something unexpected happens. As I read the mapped memory I do not
trigger the final page fault, but the page is not completely valid! 
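
The sequence described above could be replayed from userspace roughly as
follows. This is only a sketch under assumptions: the function name
count_mmap_mismatches is hypothetical, the mmap flags (PROT_READ, MAP_PRIVATE)
are a guess at what the affected program uses, and it assumes a pread of the
whole file returns the correct bytes to compare against.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Replays the access pattern on `path`: read 4 bytes from the start, read
 * `tail` bytes from the end, then mmap the whole file.  Returns the number
 * of mapped bytes that differ from a pread() of the file, or -1 on error.
 */
long count_mmap_mismatches(const char *path, size_t tail)
{
    long mismatches = -1;
    char magic[4];
    unsigned char *tailbuf = NULL, *snap = NULL, *ref = NULL;
    void *map = MAP_FAILED;
    struct stat st;
    size_t size = 0;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0 || st.st_size <= 0)
        goto out;
    size = (size_t)st.st_size;
    if (tail > size)
        tail = size;

    /* Step 1: magic-number style read from the start of the file. */
    if (read(fd, magic, sizeof(magic)) < 0)
        goto out;

    /* Step 2: seek to `tail` bytes before the end and read from there. */
    tailbuf = malloc(tail + 1);
    if (tailbuf == NULL ||
        lseek(fd, (off_t)(size - tail), SEEK_SET) < 0 ||
        read(fd, tailbuf, tail) < 0)
        goto out;

    /* Step 3: map the file and snapshot what the mapping shows right now. */
    map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED)
        goto out;
    snap = malloc(size);
    if (snap == NULL)
        goto out;
    memcpy(snap, map, size);

    /* Reference copy, taken only after the snapshot so it cannot mask it. */
    ref = malloc(size);
    if (ref == NULL || pread(fd, ref, size, 0) != (ssize_t)size)
        goto out;

    mismatches = 0;
    for (size_t i = 0; i < size; i++)
        if (snap[i] != ref[i])
            mismatches++;
out:
    if (map != MAP_FAILED)
        munmap(map, size);
    free(ref);
    free(snap);
    free(tailbuf);
    close(fd);
    return mismatches;
}
```

Calling count_mmap_mismatches(path, 128) on the 13773 byte file should return
0 on a system without the problem. For the check to be meaningful the file
presumably must not already be cached, for example right after mounting the
file-system.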

I've played around with different file-sizes, seek locations, and read buffer
sizes. Here's what I believe is happening. The file is chunked up into the
number of pages it needs when it is opened. When this file is mapped into
memory, a page fault on one of its pages causes the entire page to be read from
disk (4 blocks). Even when it is not mapped into userspace memory, these pages
still exist attached to the file descriptor. The read syscall is capable of
making these pages valid, but it DOES NOT take measures to ensure that it's
reading in the complete page. 

So, when I ask it to read the last 128 bytes of my 13773 byte file, taking data
from offsets 13645-13772, the kernel and/or ext driver correctly determines
that I'm after block 13 (counting from zero), which it loads into page 3 at an
offset within the page of 1024. This means that page 3 has no data loaded
between offsets 0 and 1023, but it will not trigger the page fault the kernel
needs to load the missing data once it has been mapped. 
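
The block and page numbers in this paragraph can be checked with a few integer
divisions. A small sketch, with zero-based indices and hypothetical names
(struct location and locate are for illustration only):

```c
#include <assert.h>
#include <stddef.h>

struct location {
    size_t block;          /* file-system block index containing the offset */
    size_t page;           /* page index containing the offset */
    size_t block_in_page;  /* byte offset of that block within its page */
};

/* Maps a file offset to its block, its page, and the block's place in the page. */
struct location locate(size_t offset, size_t block_size, size_t page_size)
{
    struct location loc;

    /* Assumes whole blocks fit evenly into a page (e.g. 1024 into 4096). */
    assert(block_size > 0 && page_size % block_size == 0);
    loc.block = offset / block_size;
    loc.page = offset / page_size;
    loc.block_in_page = (loc.block * block_size) % page_size;
    return loc;
}
```

For offset 13645 with 1024 byte blocks and 4096 byte pages this gives block 13,
page 3 and an in-page offset of 1024, so the first 1024 bytes of that page
(block 12) are not touched by the read.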

This problem caused a real-life bug for me as I attempted to run an old Java
runtime on the Linux compat layer. For whatever reason, Java does exactly this
when setting up its virtual machine: it reads the first four bytes from a jar
file, then the last 128, then uses mmap to link up some functions defined in
the jar. I was able to solve the problem by adding the following code to the
Ext driver, but I feel like this cannot be the optimal place to put a fix.

Branched from a commit (d1bd24e3) to the "stable/12" branch of the GitHub repo
three weeks ago.
In "sys/fs/ext2fs/ext2_vnops.c", beginning at line 2085 (within the "ext2_read"
function).

        // before: read the blocks of the first page that precede the request
        if (((int)uio->uio_offset / 1024)%4) {
                int before_offset = ((int)uio->uio_offset / 4096) * 4096;
                do {
                        lbn = lblkno(fs, before_offset);
                        error = bread(vp, lbn, 1024, NOCRED, &bp);
                        brelse(bp);
                        bp = NULL;
                        if (error) {
                                break;
                        }
                        before_offset += 1024;
                } while(before_offset < ((int)uio->uio_offset/1024)*1024);
        }
        // after: read the blocks of the last page that follow the request
        int last_byte = (int)uio->uio_offset + (int)uio->uio_resid - 1;
        if (last_byte > 0) {
                int last_page = last_byte / 4096;
                int last_block_in_page = (last_byte % 4096)/1024;
                for (int i = last_block_in_page + 1; i < 4; i++) {
                        int after_offset = ((int)last_page * 4096 + i * 1024);
                        lbn = lblkno(fs, after_offset);
                        error = bread(vp, lbn, 1024, NOCRED, &bp);
                        brelse(bp);
                        bp = NULL;
                        if (error) {
                                break;
                        }
                }
        }
        //

The purpose of this code is to force the ext driver to load in some surrounding
blocks.
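
To make it easier to see which blocks the workaround touches, here is a
userspace sketch of the same offset arithmetic with the kernel calls removed.
The function name surrounding_blocks is hypothetical, and the block and page
sizes are parameters here rather than the hard-coded 1024 and 4096. Like the
patch, it does not clamp the result to the file size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fills `out` with the byte offsets of the blocks that share a page with the
 * read [offset, offset + resid) but are not covered by it, and returns how
 * many there are.  `out` needs room for 2 * (page_size / block_size - 1)
 * entries.
 */
size_t surrounding_blocks(size_t offset, size_t resid, size_t block_size,
                          size_t page_size, size_t *out)
{
    size_t n = 0;
    size_t first_block, last_byte, page_end;

    assert(block_size > 0 && page_size % block_size == 0);
    if (resid == 0)
        return 0;

    /* Blocks from the start of the first page up to the first block read. */
    first_block = offset / block_size * block_size;
    for (size_t b = offset / page_size * page_size; b < first_block;
         b += block_size)
        out[n++] = b;

    /* Blocks from just past the last block read up to the end of its page. */
    last_byte = offset + resid - 1;
    page_end = (last_byte / page_size + 1) * page_size;
    for (size_t b = (last_byte / block_size + 1) * block_size; b < page_end;
         b += block_size)
        out[n++] = b;

    return n;
}
```

For the 128 byte read at offset 13645 this yields the blocks starting at 12288,
14336 and 15360. The last two lie beyond the end of the 13773 byte file, so a
more careful fix would probably want to take the file size into account.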

If my description is insufficient I should be able to provide a virtual machine
image reproducing the issue.
