bugs in contigmalloc*() related to "page not found in hash" panics

Wed Nov 10 22:51:47 PST 2004

:>    Here is the DragonFly commit.
:>
:>    http://www.dragonflybsd.org/cvsweb/src/sys/vm/vm_contig.c.diff?r1=1.10&r2=1.11&f=u
:>
:>    FreeBSD-4:
:>
:> 	FreeBSD-4 is in the same situation that DFly was in and requires
:> 	the same fixes as the above patch, though note that in FreeBSD-4
:> 	the contigmalloc() code is in vm_page.c, not vm_contig.c.
:
:I tried the patch in the hopes it would fix my Nvidia-driver
:crash-on-demand system.  :)  While my system appears stable without the
:Nvidia driver but with this patch, my system can still crash easily with
:the Nvidia driver.  It usually dies with a:

    Point me at the nvidia driver source and I will do a quick audit of it
    to see if there is anything obviously broken.  This is running on
    FreeBSD-4.x?  If it's a binary-only driver there isn't much I can do,
    though.

    The 'page not found in hash' panic can ONLY occur one way: when a
    vm_page's pindex or object fields are changed directly or (under 4.x)
    when the VM object's hash_rand field is changed.  The only valid way
    to change either of these fields is to call vm_page_insert()
    or to call vm_page_remove().  That's it.  There is *NO* other legal
    way to change those fields within a vm_page that won't result in
    corruption of the VM page hash table (4.x) or object->root splay
    tree (5.x).  The fields cannot be modified directly, the vm_page
    cannot be safely bzero'd, you can't 0 or NULL out the fields, or assign
    a new index or object, etc... only vm_page_insert() and vm_page_remove()
    can do that safely.
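
    A minimal user-space sketch of why this holds, assuming a toy hash
    table keyed on (object, pindex).  The toy_* names and the hash
    function are made up for illustration and are NOT the kernel's;
    only the insert/remove discipline mirrors the rule above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HASH_SIZE 16            /* toy bucket count, power of two */

struct toy_object { int id; };      /* hypothetical stand-in for a VM object */

struct toy_page {                   /* hypothetical stand-in for a vm_page */
    struct toy_object *object;      /* owning object, NULL when not inserted */
    unsigned long      pindex;      /* page index within the object */
    struct toy_page   *hnext;       /* next page in the same bucket */
};

static struct toy_page *toy_hash[TOY_HASH_SIZE];

/* The bucket depends on BOTH key fields (toy formula, not the kernel's). */
static unsigned toy_bucket(const struct toy_object *obj, unsigned long pindex)
{
    return (unsigned)((((uintptr_t)obj >> 4) + pindex) & (TOY_HASH_SIZE - 1));
}

/* Legal way to set the key: assign the fields and link the page together. */
static void toy_page_insert(struct toy_page *m, struct toy_object *obj,
                            unsigned long pindex)
{
    unsigned b = toy_bucket(obj, pindex);

    assert(m->object == NULL);      /* must not already be inserted */
    m->object = obj;
    m->pindex = pindex;
    m->hnext = toy_hash[b];
    toy_hash[b] = m;
}

/*
 * Legal way to clear the key.  Returns 0 on success, or -1 if the page is
 * not in the bucket its fields point at: the toy analogue of the panic.
 */
static int toy_page_remove(struct toy_page *m)
{
    struct toy_page **pp = &toy_hash[toy_bucket(m->object, m->pindex)];

    while (*pp != NULL && *pp != m)
        pp = &(*pp)->hnext;
    if (*pp == NULL)
        return -1;
    *pp = m->hnext;
    m->hnext = NULL;
    m->object = NULL;
    return 0;
}

/* Look a page up by key; returns NULL if no page with that key is linked. */
static struct toy_page *toy_page_lookup(struct toy_object *obj,
                                        unsigned long pindex)
{
    struct toy_page *m;

    for (m = toy_hash[toy_bucket(obj, pindex)]; m != NULL; m = m->hnext)
        if (m->object == obj && m->pindex == pindex)
            return m;
    return NULL;
}
```

    In this toy, writing m->pindex directly after toy_page_insert()
    leaves the page linked in its old bucket, so a later lookup or
    remove under the new key fails.  Moving a page means
    toy_page_remove() followed by toy_page_insert().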

    From looking at your bug reports and comparing them with my own
    extensive research on this particular crash I will say *DEFINITIVELY*
    that it is *NOT* a RAM problem.  It's software-caused corruption,
    period, end of story.

    I will also note that the backtrace from the panic path in the
    second PR URL is very similar to what we were seeing before we fixed
    the issue in contigmalloc... the problem is that the VM page hash
    table / splay tree gets corrupted *LONG* before the code path that
    actually causes the panic, so it's virtually impossible to glean any
    information from the panic itself.

    There is a test you can run.  If you have a kernel vmcore and the
    matching kernel image from a 'page not found in hash' panic, you
    can run this program on them to do a sanity check on the VM page array
    and hash table.  I have modified this program to work with FreeBSD-4.x
    (I'd have to rewrite it to make it work with 5.x/6.x, which I don't have
    time to do):

	fetch http://leaf.dragonflybsd.org/~dillon/vmpageinfo_4x.c

	and follow the instructions in the comments to compile it. 

	Run it with '-N kernel.X -M vmcore.X -d'.

    This program will sanity check the VM page hash table from the core
    file and tell you if there are any pages missing from the hash table
    or sitting in the wrong slot. 

    My expectation is that it will find a page sitting in the wrong slot.
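
    The kind of check described above can be sketched in user space as
    follows.  This is NOT vmpageinfo_4x.c and it does not read a vmcore;
    it is a toy model with hypothetical toy_* names and a made-up hash
    function, shown only to illustrate what "missing from the hash
    table" and "sitting in the wrong slot" mean.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NBUCKETS 8              /* toy bucket count, power of two */

struct toy_page {                   /* hypothetical stand-in for a vm_page */
    const void      *object;        /* owning object, NULL if page is free */
    unsigned long    pindex;        /* page index within the object */
    struct toy_page *hnext;         /* next page in the same bucket */
};

struct toy_report {
    int missing;                    /* pages with an object but on no chain */
    int wrong_slot;                 /* pages on a chain other than their own */
};

/* Toy hash over the (object, pindex) key; not the kernel's formula. */
static unsigned toy_bucket(const void *obj, unsigned long pindex)
{
    return (unsigned)((((uintptr_t)obj >> 4) + pindex) & (TOY_NBUCKETS - 1));
}

/* Set the key and link the page into the matching bucket in one step. */
static void toy_link(struct toy_page **hash, struct toy_page *m,
                     const void *obj, unsigned long pindex)
{
    unsigned b = toy_bucket(obj, pindex);

    assert(m->object == NULL);      /* must not already be linked */
    m->object = obj;
    m->pindex = pindex;
    m->hnext = hash[b];
    hash[b] = m;
}

/* Return the bucket whose chain actually contains m, or -1 if none does. */
static int toy_find_chain(struct toy_page **hash, const struct toy_page *m)
{
    const struct toy_page *p;
    int b;

    for (b = 0; b < TOY_NBUCKETS; b++)
        for (p = hash[b]; p != NULL; p = p->hnext)
            if (p == m)
                return b;
    return -1;
}

/* Walk the page array and compare each page's real chain with its key. */
static struct toy_report toy_check(struct toy_page **hash,
                                   struct toy_page *pages, size_t npages)
{
    struct toy_report r = { 0, 0 };
    size_t i;

    for (i = 0; i < npages; i++) {
        int actual, expect;

        if (pages[i].object == NULL)
            continue;               /* unowned pages are not hashed */
        actual = toy_find_chain(hash, &pages[i]);
        expect = (int)toy_bucket(pages[i].object, pages[i].pindex);
        if (actual < 0)
            r.missing++;
        else if (actual != expect)
            r.wrong_slot++;
    }
    return r;
}
```

    In this model a page whose pindex was overwritten after linking
    shows up as wrong_slot, and a page whose object was set without
    linking shows up as missing.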

:
:     Fatal trap 12: page fault while in kernel mode
:     fault virtual address   = 0x30
:     fault code              = supervisor read, page not present
:

    This is a different failure.  I'd need a backtrace or a kernel.debug and
    vmcore to play with, and a FreeBSD developer would probably be able to
    help you more with it.  It's obviously a NULL pointer indirection of some
    sort (the fault address, 0x30, is a small offset from NULL).
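
    As a sketch of how a fault address like 0x30 arises: reading a
    structure field through a NULL pointer touches address 0 plus the
    field's offset.  The structure below is purely hypothetical (it is
    not taken from the kernel or from any driver) and is laid out so
    that one field lands at offset 0x30.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure, NOT a real kernel or driver type. */
struct example {
    char     pad[0x30];     /* whatever occupies the first 0x30 bytes */
    uint32_t flags;         /* the field that lands at offset 0x30 */
};

/*
 * Compute the address that reading base->flags would touch, without
 * actually dereferencing base.
 */
static uintptr_t field_address(const struct example *base)
{
    /* Sanity-check the layout this example assumes. */
    assert(offsetof(struct example, flags) == 0x30);
    return (uintptr_t)base + offsetof(struct example, flags);
}
```

    On the usual platforms, where a NULL pointer converts to 0,
    field_address(NULL) yields 0x30, which is the sort of address
    reported as "fault virtual address" in the trap message.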

:Two "page not found in hash" panics that I believe are related to the
:Nvidia driver:
:http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/71086
:http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/72539

    The 'page not found in hash' bug is *NOT* likely to be related to any
    of the pmap code, simply because the sanity checks already in the
    kernel (assuming the kernel is compiled with options INVARIANTS and
    options INVARIANT_SUPPORT) mostly preclude an error path to this
    panic from the pmap code.  However, pmap panics could be related to
    corrupted VM pages.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

:The first PR (mine) asks about a change in pmap_remove() that was later
:removed from FreeBSD-4 but left in FreeBSD-5.  If anyone knows why this
:happened, I would be interested in knowing.
:
:Sean
:-- 
:sean-freebsd at farley.org