Panic in arpresolve->rt_check?

Dan Nelson dnelson at allantgroup.com
Tue Oct 16 09:31:25 PDT 2007


In the last episode (Oct 12), Kris Kennaway said:
> Dan Nelson wrote:
>> In the last episode (Oct 10), John Baldwin said:
>>> On Wednesday 12 September 2007 02:50:37 pm Ivan Voras wrote:
>>>> Dan Nelson wrote:
>>>>> The same panic was also reported for 6.2 via PR 107865 and PR
>>>>> 112490.  112490 included a workaround patch (I haven't tried it;
>>>>> just found it).
>>>>
>>>> The proposed patch in kern/112490 looks trivial but someone who
>>>> knows more about net locking should check it out. Unfortunately it
>>>> lacks context and I don't know the code in question to apply it
>>>> safely on a production machine :(
>>>
>>> I also get panics with what appears to be a double free of
>>> rt_gwroute in rtexpunge(), so I think while this PR may help some
>>> with figuring out the problem, I'm not sure it solves the root bug.
>>> 
>>> Hmm, possibly try this patch:
>>
>> This makes the panics more frequent on my machine, actually :)
> 
> Since you can reproduce this frequently the best thing might be to
> instrument all the route handling with KTR so that you can do
> post-mortem and try to figure out where the double-free or missing
> reference happened.

I've added some KTR debugging (why are the macros named CTR* instead of
KTR* ? ) and I think I've got the problem nailed down, but I don't know
anything about networking so I don't know what the solution is.

I've attached a KTR dump and the debugging patches I made (done to
preserve line numbering at the expense of style).

It looks like two threads are entering rt_check at the same time.  In
the ktrdump, lrt0 is 0xc674d000 and lrt0->rt_gwroute is 0xc674ae88 for
both threads.

The thread on CPU1 locks lrt0 at line :1287 (ktr index 641), then locks
lrt0->rt_gwroute at :1303 (k642).  It frees lrt0->rt_gwroute at :1305
(k643), then unlocks lrt0 at :1308 (k651) before calling rtalloc1().

Meanwhile, the thread on CPU0 has entered the rt_check function and is
spinning on the lrt0 lock at line :1287 (k649).  When CPU1 unlocks ltr0
(k651 above), lrt0->rt_gwroute is still pointing to the freed rtentry. 
CPU0 then attempts to lock the now-freed lrt0->rt_gwroute and crashes.

So, the problem is that ltr0->rt_gwroute is being left in an
inconsistent state while ltr0 is unlocked.  What's the solution?  Zero
out rt_gwroute before unlocking lrt0, then do some extra checks after
re-locking to handle the case where another thread has called rtalloc1
before us, or something else?  Or is there some other locking problem
higher up that's allowing rt_check to be called in parellel on the same
rtentry in the first place?

-- 
	Dan Nelson
	dnelson at allantgroup.com
-------------- next part --------------
index  cpu timestamp        file and line                            trace
------ --- ---------------- ---------------------------------------- ----- 
   658   1   55794610845896 net/route.c:1316                         0xcd38caa0 lock 0xc674d000 in rt_check
   657   1   55794610828094 net/route.c:175                          0xcd38caa0 lock 0xc7a49168 in rtalloc1
   656   1   55794610826764 net/route.c:911                          0xcd38caa0 unlock 0xc7a49168 in rtrequest1
   655   1   55794610826184 net/route.c:909                          0xcd38caa0 addref 0xc7a49168 1->2 in rtrequest1
   654   1   55794610824041 netinet/if_ether.c:224                   0xcd38caa0 addref 0xc7a49168 0->1 in arp_rtrequest
   653   1   55794610811998 net/route.c:788                          0xcd38caa0 lock 0xc7a49168 in rtrequest1
   652   0   55794610803227 net/route.c:1303                         0xcb107000 lock 0xc674ae88 in rt_check
   651   1   55794610799304 net/route.c:1308                         0xcd38caa0 unlock 0xc674d000 in rt_check
   650   1   55794610797886 net/route.c:302                          0xcd38caa0<rtfree 0xc674ae88 destroyed
   649   0   55794610795429 net/route.c:1287                         0xcb107000 lock 0xc674d000 in rt_check
   648   1   55794610794813 net/route.c:300                          0xcd38caa0 destroy 0xc674ae88 in rtfree
   647   0   55794610794075 net/route.c:1285                         0xcb107000>rt_check 0xc674d000
   646   0   55794610791936 netinet/if_ether.c:365                   0xcb107000>arpresolve 0xc674d000
   645   1   55794610786793 net/route.c:245                          0xcd38caa0 remref 0xc674ae88 1->0 in rtfree
   644   1   55794610786128 net/route.c:238                          0xcd38caa0>rtfree 0xc674ae88
   643   1   55794610783781 net/route.c:1305                         0xcd38caa0 free 0xc674ae88
   642   1   55794610782649 net/route.c:1303                         0xcd38caa0 lock 0xc674ae88 in rt_check
   641   1   55794610780584 net/route.c:1287                         0xcd38caa0 lock 0xc674d000 in rt_check
   640   1   55794610779823 net/route.c:1285                         0xcd38caa0>rt_check 0xc674d000
   639   1   55794610777390 netinet/if_ether.c:365                   0xcd38caa0>arpresolve 0xc674d000
   638   0   55794610752772 net/route.c:119                          0xcb107000 unlock 0xc674d000 in rtalloc_ign
   637   0   55794610751654 net/route.c:198                          0xcb107000 addref 0xc674d000 1->2 in rtalloc1
   636   0   55794610750193 net/route.c:197                          0xcb107000 lock 0xc674d000 in rtalloc1
   635   1   55794610736882 net/route.c:119                          0xcd38caa0 unlock 0xc674d000 in rtalloc_ign
   634   1   55794610735429 net/route.c:198                          0xcd38caa0 addref 0xc674d000 0->1 in rtalloc1
   633   1   55794610733358 net/route.c:197                          0xcd38caa0 lock 0xc674d000 in rtalloc1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ktr.diff
Type: text/x-diff
Size: 7118 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-current/attachments/20071016/68c3ec86/ktr.bin


More information about the freebsd-current mailing list