CURRENT crashes with nvidia GPU BLOB: vm_radix_insert: key 23c078 is already present

Rainer Hurling rhurlin at gwdg.de
Sat Aug 10 09:12:53 UTC 2013


On 10.08.2013 10:37, Gary Jennejohn wrote:
> On Fri, 9 Aug 2013 10:12:37 -0700
> David Wolfskill <david at catwhisker.org> wrote:
> 
>> On Fri, Aug 09, 2013 at 07:32:51AM +0200, O. Hartmann wrote:
>>> ...
>>>>> On 8 August 2013 11:10, O. Hartmann <ohartman at zedat.fu-berlin.de>
>>>>> wrote:
>>>>>> The most recent CURRENT doesn't work with the x11/nvidia-driver
>>>>>> (which is at 319.25 in the ports and 325.15 from nVidia).
>>>>>>
>>>>>> After build- and installworld AND successfully rebuilding the
>>>>>> x11/nvidia-driver port, the system crashes immediately after a
>>>>>> reboot, as soon as the kernel module nvidia.ko gets loaded (in my
>>>>>> case, I load nvidia.ko via /etc/rc.conf.local, since the nVidia
>>>>>> BLOB doesn't load cleanly every time when loaded
>>>>>> from /boot/loader.conf).
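>>>>>>
>>>>>> (As an illustrative sketch only, not the exact lines from my
>>>>>> configuration: loading from the loader would look like
>>>>>>
>>>>>>     # /boot/loader.conf
>>>>>>     nvidia_load="YES"
>>>>>>
>>>>>> while deferring the load to the rc(8) stage could be done with
>>>>>> something like
>>>>>>
>>>>>>     # /etc/rc.conf.local (kld_list assumed as one possible mechanism)
>>>>>>     kld_list="nvidia"
>>>>>>
>>>>>> with kld_list only assumed here as an example.)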
>>>>>>
>>>>>> The crash occurs both on systems where world was built with the
>>>>>> default compilation options and on systems built with settings
>>>>>> like -O3 -march=native; it makes no difference.
>>>>>>
>>>>>> FreeBSD and the x11/nvidia-driver port have both been compiled
>>>>>> with clang.
>>>>>>
>>>>>> The most recent FreeBSD revision that still crashes is r254097.
>>>>>>
>>>>>> When a vmcore is saved, I always see something like
>>>>>>
>>>>>> savecore: reboot after panic: vm_radix_insert: key 23c078 is
>>>>>> already present
>>>>>>
>>>>>>
>>>>>> Does anyone have any idea what's going on?
>>>>>>
>>>>>> Thanks in advance for your help,
>>>>>>
>>>>>> Oliver
>>>>
>>>> I'm seeing a complete deadlock on my T520 with today's CURRENT and
>>>> the latest portsnap'd ports tree for the nvidia-driver updates.
>>>>
>>>> A little bisection and help from others seem to point the finger
>>>> at Jeff's r254025.
>>>>
>>>> I'm getting a complete deadlock when X starts, but loading the
>>>> module by itself seems to have no ill effects.
>>>>
>>>> Sean
>>>
>>> Right, I also loaded the module via /boot/loader.conf and it loads
>>> cleanly. I start xdm and then the deadlock occurs.
>>>
>>> I tried recompiling the whole xorg suite via "portmaster -f xorg xdm".
>>> It took a while, but had no effect; it's still dying.
>>> .....
>>
>> Sorry to be rather late to the party; the Internet connection I'm using
>> at the moment is a bit flaky.  (I'm out of town.)
>>
>> I managed to get head/i386 @r254135 built and booting ... by removing
>> the "options DEBUG_MEMGUARD" from my kernel.
>>
>> However, that merely prevented a (very!) early panic, and got me to the
>> point where trying to start xdm with the x11/nvidia-driver as the
>> display driver causes an immediate reboot (no crash dump, despite
>> 'dumpdev="AUTO"' in /etc/rc.conf).  No drop to debugger, either.
>>
>> Booting & starting xdm with the nv driver works -- that's my present
>> environment as I am typing this.
>>
>> However, the panic with DEBUG_MEMGUARD may offer a clue.  Unfortunately,
>> it's early enough that screen lock/scrolling doesn't work, and I only
>> had the patience to write down part of the panic information.  (This is
>> on my laptop; no serial console, AFAICT -- and no device to capture the
>> output if I did, since I'm not at home.)
>>
>> The top line of the screen (at the panic) reads:
>>
>> s/kern/subr_vmem.c:1050
>>
>> The backtrace has the expected stuff near the top (about kbd, panic, and
>> memguard stuff); just below that is:
>>
>> vmem_alloc(c1226100,6681000,2,c1820cc0,3b5,...) at 0xc0ac5673=vmem_alloc+0x53/frame 0xc1820ca0
>>
>> Caveat: that was hand-transcribed from the screen to paper, then
>> hand-transcribed from paper to this email message.  And my highest grade
>> in "Penmanship" was a D+.
>>
>> Be that as it may, here's the relevant section of subr_vmem.c with line
>> numbers (cut/pasted, so tabs get munged):
>>
>>    1039 /*
>>    1040  * vmem_alloc: allocate resource from the arena.
>>    1041  */
>>    1042 int
>>    1043 vmem_alloc(vmem_t *vm, vmem_size_t size, int flags, vmem_addr_t *addrp)
>>    1044 {
>>    1045         const int strat __unused = flags & VMEM_FITMASK;
>>    1046         qcache_t *qc;
>>    1047 
>>    1048         flags &= VMEM_FLAGS;
>>    1049         MPASS(size > 0);
>>    1050         MPASS(strat == M_BESTFIT || strat == M_FIRSTFIT);
>>    1051         if ((flags & M_NOWAIT) == 0)
>>    1052                 WITNESS_WARN(WARN_GIANTOK | WARN_SLEEPOK, NULL, "vmem_alloc");
>>    1053
>>    1054         if (size <= vm->vm_qcache_max) {
>>    1055                 qc = &vm->vm_qcache[(size - 1) >> vm->vm_quantum_shift];
>>    1056                 *addrp = (vmem_addr_t)uma_zalloc(qc->qc_cache, flags);
>>    1057                 if (*addrp == 0)
>>    1058                         return (ENOMEM);
>>    1059                 return (0);
>>    1060         }
>>    1061
>>    1062         return vmem_xalloc(vm, size, 0, 0, 0, VMEM_ADDR_MIN, VMEM_ADDR_MAX,
>>    1063             flags, addrp);
>>    1064 }
>>
>>
>> This is at r254025.
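>>
>> As a minimal sketch of what the assertion at line 1050 enforces (the
>> call below is illustrative only, not taken from the crashing code):
>> every caller must pass exactly one fit strategy bit in flags, roughly
>>
>>     vmem_addr_t addr;
>>     int error;
>>
>>     /* Reserve PAGE_SIZE bytes of address space from kmem_arena.
>>      * Exactly one of M_BESTFIT or M_FIRSTFIT is mandatory here,
>>      * otherwise the MPASS() at line 1050 fires on a debug kernel. */
>>     error = vmem_alloc(kmem_arena, PAGE_SIZE,
>>         M_BESTFIT | M_WAITOK, &addr);
>>     if (error != 0)
>>         return (ENOMEM);
>>
>> so a path that reaches vmem_alloc() with neither (or both) of those
>> bits set trips exactly this check.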
>>
> 
> The REINPLACE_CMD at line 160 of nvidia-driver/Makefile is incorrect.
> 
> How do I know that?  Because I made a patch which results in a working
> nvidia-driver-319.32 with r254050.  That's what I'm running right now.
> 
> Here's the patch (loaded with :r in vi, so all spaces etc. are correct):
> 
> --- src/nvidia_subr.c.orig	2013-08-09 11:32:26.000000000 +0200
> +++ src/nvidia_subr.c	2013-08-09 11:33:23.000000000 +0200
> @@ -945,7 +945,7 @@
>          return ENOMEM;
>      }
>  
> -    address = kmem_alloc_contig(kernel_map, size, flags, 0,
> +    address = kmem_alloc_contig(kmem_arena, size, flags, 0,
>              sc->dma_mask, PAGE_SIZE, 0, attr);
>      if (!address) {
>          status = ENOMEM;
> @@ -994,7 +994,7 @@
>          os_flush_cpu_cache();
>  
>      if (at->pte_array[0].virtual_address != NULL) {
> -        kmem_free(kernel_map,
> +        kmem_free(kmem_arena,
>                  at->pte_array[0].virtual_address, at->size);
>          malloc_type_freed(M_NVIDIA, at->size);
>      }
> @@ -1021,7 +1021,7 @@
>      if (at->attr != VM_MEMATTR_WRITE_BACK)
>          os_flush_cpu_cache();
>  
> -    kmem_free(kernel_map, at->pte_array[0].virtual_address,
> +    kmem_free(kmem_arena, at->pte_array[0].virtual_address,
>              at->size);
>      malloc_type_freed(M_NVIDIA, at->size);
>  
> @@ -1085,7 +1085,7 @@
>      }
>  
>      for (i = 0; i < count; i++) {
> -        address = kmem_alloc_contig(kernel_map, PAGE_SIZE, flags, 0,
> +        address = kmem_alloc_contig(kmem_arena, PAGE_SIZE, flags, 0,
>                  sc->dma_mask, PAGE_SIZE, 0, attr);
>          if (!address) {
>              status = ENOMEM;
> @@ -1139,7 +1139,7 @@
>      for (i = 0; i < count; i++) {
>          if (at->pte_array[i].virtual_address == 0)
>              break;
> -        kmem_free(kernel_map,
> +        kmem_free(kmem_arena,
>                  at->pte_array[i].virtual_address, PAGE_SIZE);
>          malloc_type_freed(M_NVIDIA, PAGE_SIZE);
>      }
> @@ -1169,7 +1169,7 @@
>          os_flush_cpu_cache();
>  
>      for (i = 0; i < count; i++) {
> -        kmem_free(kernel_map,
> +        kmem_free(kmem_arena,
>                  at->pte_array[i].virtual_address, PAGE_SIZE);
>          malloc_type_freed(M_NVIDIA, PAGE_SIZE);
>      }
> 
> The primary differences are
> 1) use kmem_arena instead of kernel_map everywhere (the REINPLACE_CMD
>    uses kernel_arena);
> 2) DO NOT use kva_free, but keep kmem_free as before.
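> 
> As a rough sketch of that post-r254025 calling convention (the helper
> names below are made up for illustration and are not from the driver
> source):
> 
>     #include <sys/param.h>
>     #include <sys/malloc.h>
>     #include <vm/vm.h>
>     #include <vm/vm_kern.h>
>     #include <vm/vm_extern.h>
> 
>     /* Hypothetical helper: allocate 'size' bytes of wired, physically
>      * contiguous memory below 'dma_mask', mapped with memory attribute
>      * 'attr'.  kmem_arena replaces the old kernel_map first argument. */
>     static vm_offset_t
>     dma_buf_alloc(vm_size_t size, vm_paddr_t dma_mask, vm_memattr_t attr)
>     {
>         return (kmem_alloc_contig(kmem_arena, size, M_WAITOK | M_ZERO,
>             0, dma_mask, PAGE_SIZE, 0, attr));
>     }
> 
>     /* Matching release: kmem_free(), not kva_free(), since the backing
>      * pages as well as the KVA have to be returned. */
>     static void
>     dma_buf_free(vm_offset_t va, vm_size_t size)
>     {
>         kmem_free(kmem_arena, va, size);
>     }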
> 
> To use the patch:
> Delete or comment out the 4 lines starting at line 160 in the Makefile
> Run ``make patch''
> cd work/NVIDIA-FreeBSD-x86_64-319.32/src
> patch < [wherever the patch is]
> cd ../../..
> make deinstall install clean
> kldunload the old nvidia.ko
> kldload the new nvidia.ko
> start X
> 

Yes, I can confirm that it builds, installs, and runs fine for me.

The patch should be placed as
x11/nvidia-driver/files/patch-src__nvidia_subr.c, shouldn't it?

Many thanks for this work.

Regards and a nice weekend,
Rainer Hurling


