Panic in arpresolve->rt_check?
Dan Nelson
dnelson at allantgroup.com
Tue Oct 16 09:31:25 PDT 2007
In the last episode (Oct 12), Kris Kennaway said:
> Dan Nelson wrote:
>> In the last episode (Oct 10), John Baldwin said:
>>> On Wednesday 12 September 2007 02:50:37 pm Ivan Voras wrote:
>>>> Dan Nelson wrote:
>>>>> The same panic was also reported for 6.2 via PR 107865 and PR
>>>>> 112490. 112490 included a workaround patch (I haven't tried it;
>>>>> just found it).
>>>>
>>>> The proposed patch in kern/112490 looks trivial but someone who
>>>> knows more about net locking should check it out. Unfortunately it
>>>> lacks context and I don't know the code in question to apply it
>>>> safely on a production machine :(
>>>
>>> I also get panics with what appears to be a double free of
>>> rt_gwroute in rtexpunge(), so I think while this PR may help some
>>> with figuring out the problem, I'm not sure it solves the root bug.
>>>
>>> Hmm, possibly try this patch:
>>
>> This makes the panics more frequent on my machine, actually :)
>
> Since you can reproduce this frequently the best thing might be to
> instrument all the route handling with KTR so that you can do
> post-mortem and try to figure out where the double-free or missing
> reference happened.
I've added some KTR debugging (why are the macros named CTR* instead of
KTR* ? ) and I think I've got the problem nailed down, but I don't know
anything about networking so I don't know what the solution is.
I've attached a KTR dump and the debugging patches I made (done to
preserve line numbering at the expense of style).
It looks like two threads are entering rt_check at the same time. In
the ktrdump, lrt0 is 0xc674d000 and lrt0->rt_gwroute is 0xc674ae88 for
both threads.
The thread on CPU1 locks lrt0 at line :1287 (ktr index 641), then locks
lrt0->rt_gwroute at :1303 (k642). It frees lrt0->rt_gwroute at :1305
(k643), then unlocks lrt0 at :1308 (k651) before calling rtalloc1().
Meanwhile, the thread on CPU0 has entered the rt_check function and is
spinning on the lrt0 lock at line :1287 (k649). When CPU1 unlocks ltr0
(k651 above), lrt0->rt_gwroute is still pointing to the freed rtentry.
CPU0 then attempts to lock the now-freed lrt0->rt_gwroute and crashes.
So, the problem is that ltr0->rt_gwroute is being left in an
inconsistent state while ltr0 is unlocked. What's the solution? Zero
out rt_gwroute before unlocking lrt0, then do some extra checks after
re-locking to handle the case where another thread has called rtalloc1
before us, or something else? Or is there some other locking problem
higher up that's allowing rt_check to be called in parellel on the same
rtentry in the first place?
--
Dan Nelson
dnelson at allantgroup.com
-------------- next part --------------
index cpu timestamp file and line trace
------ --- ---------------- ---------------------------------------- -----
658 1 55794610845896 net/route.c:1316 0xcd38caa0 lock 0xc674d000 in rt_check
657 1 55794610828094 net/route.c:175 0xcd38caa0 lock 0xc7a49168 in rtalloc1
656 1 55794610826764 net/route.c:911 0xcd38caa0 unlock 0xc7a49168 in rtrequest1
655 1 55794610826184 net/route.c:909 0xcd38caa0 addref 0xc7a49168 1->2 in rtrequest1
654 1 55794610824041 netinet/if_ether.c:224 0xcd38caa0 addref 0xc7a49168 0->1 in arp_rtrequest
653 1 55794610811998 net/route.c:788 0xcd38caa0 lock 0xc7a49168 in rtrequest1
652 0 55794610803227 net/route.c:1303 0xcb107000 lock 0xc674ae88 in rt_check
651 1 55794610799304 net/route.c:1308 0xcd38caa0 unlock 0xc674d000 in rt_check
650 1 55794610797886 net/route.c:302 0xcd38caa0<rtfree 0xc674ae88 destroyed
649 0 55794610795429 net/route.c:1287 0xcb107000 lock 0xc674d000 in rt_check
648 1 55794610794813 net/route.c:300 0xcd38caa0 destroy 0xc674ae88 in rtfree
647 0 55794610794075 net/route.c:1285 0xcb107000>rt_check 0xc674d000
646 0 55794610791936 netinet/if_ether.c:365 0xcb107000>arpresolve 0xc674d000
645 1 55794610786793 net/route.c:245 0xcd38caa0 remref 0xc674ae88 1->0 in rtfree
644 1 55794610786128 net/route.c:238 0xcd38caa0>rtfree 0xc674ae88
643 1 55794610783781 net/route.c:1305 0xcd38caa0 free 0xc674ae88
642 1 55794610782649 net/route.c:1303 0xcd38caa0 lock 0xc674ae88 in rt_check
641 1 55794610780584 net/route.c:1287 0xcd38caa0 lock 0xc674d000 in rt_check
640 1 55794610779823 net/route.c:1285 0xcd38caa0>rt_check 0xc674d000
639 1 55794610777390 netinet/if_ether.c:365 0xcd38caa0>arpresolve 0xc674d000
638 0 55794610752772 net/route.c:119 0xcb107000 unlock 0xc674d000 in rtalloc_ign
637 0 55794610751654 net/route.c:198 0xcb107000 addref 0xc674d000 1->2 in rtalloc1
636 0 55794610750193 net/route.c:197 0xcb107000 lock 0xc674d000 in rtalloc1
635 1 55794610736882 net/route.c:119 0xcd38caa0 unlock 0xc674d000 in rtalloc_ign
634 1 55794610735429 net/route.c:198 0xcd38caa0 addref 0xc674d000 0->1 in rtalloc1
633 1 55794610733358 net/route.c:197 0xcd38caa0 lock 0xc674d000 in rtalloc1
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ktr.diff
Type: text/x-diff
Size: 7118 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-current/attachments/20071016/68c3ec86/ktr.bin
More information about the freebsd-current
mailing list