Lock Order Reversal on 7.0-STABLE with pf and ipfw / dummynet

Sat Mar 15 14:30:18 PDT 2008

On Saturday 15 March 2008, Robert Watson wrote:
> On Fri, 14 Mar 2008, Alex Popa wrote:
> > World was cvsupped on March 6th, around 18:00 GMT.
> >
> > Built and installed kernel + world, with options WITNESS and
> > WITNESS_SKIPSPIN.
> >
> > Short background:  7.0-RELEASE had excellent performance on the
> > machine, but it would randomly lock up after some hours (usually over
> > 10 hours). The lockups were hard, meaning nothing seemed to work
> > (NumLock didn't toggle the keyboard LED, no replies to ping, no disk
> > activity).  We changed the motherboard and RAM and had the same
> > behaviour.  6.2-REL is rock solid on this machine (had over 50 days
> > uptime), but upgrading to 6.3-REL made it lock up just like 7.0 (so
> > we put 6.2 back and accepted the lower performance for the time
> > being).
> >
> > The LOR messages from dmesg of 7.0-STABLE are as follows:
> >
> > lock order reversal:
> >  1st 0xffffffffb19e0680 pf task mtx (pf task mtx) @ /usr/src/sys/modules/pf/../../contrib/pf/net/pf.c:6729
> >  2nd 0xffffff00042ea0f0 radix node head (radix node head) @ /usr/src/sys/net/route.c:147

I haven't seen this one before. Can you obtain the trace for it, please?
You might need KDB and DDB in the kernel for that, though I'm not sure.

> > lock order reversal:
> >  1st 0xffffffff80938508 PFil hook read/write mutex (PFil hook read/write mutex) @ /usr/src/sys/net/pfil.c:73
> >  2nd 0xffffffff80938c48 tcp (tcp) @ /usr/src/sys/netinet/tcp_input.c:400

This one is most certainly harmless and can be ignored.  It is caused by 
user/group rules, but a LOR with the read instance of a rwlock will not 
lead to a deadlock.

> Dear Alex,
>
> Thanks for this report, and sorry about the problem.  It could well be
> that the lock order warning from WITNESS is related to the hang, and
> might reflect a recursion-related bug in the pf policy routing code. 
> I'm not sure to what extent you can tolerate further downtime, but it
> would be useful to gather some more information about the hang itself
> to try and confirm the involvement of lock order.  In particular, if
> it's feasible, it would be very helpful if you could boot back to the
> 7-STABLE kernel (keeping the 6.2-STABLE userspace should be fine, I

You'll need at least a new pfctl, because the ioctl interface to /dev/pf
has changed.

> think), and when the hang occurs, use the console debugger (ideally
> hooked up to serial or firewire) to run the following debugging
> commands:
>
>    show pcpu
>    show allpcpu
>    trace
>    alltrace
>    show alllocks
>    show witness
>    show lockedvnods
>    show uma
>    show malloc
>
> A shot-in-the-dark guess is that something about pf's interactions with
> the protocol stack is involved here, but unfortunately I suspect we'll
> need some more information to track it down.
>
> Also, could you confirm if you're using any credential-related firewall
> rules with either ipfw or pf?  These would be uid/gid/jail matching
> rules.
>
> Robert N M Watson
> Computer Laboratory
> University of Cambridge
>
> > More details about the machine are in the attached dmesg.  It's an
> > SMP machine with 4GB of RAM and 3 gigabit cards (em0, em1 and,
> > depending on the motherboard we used, either bge0 or msk0).  Only em0
> > is linked to a gigabit port; the others run at 100Mbits/s.
> >
> > My setup has in-kernel IPFIREWALL, IPFIREWALL_VERBOSE,
> > IPFIREWALL_DEFAULT_TO_ACCEPT, DUMMYNET.  I have commented out INET6,
> > SCTP and the wireless interfaces.  WITNESS and WITNESS_SKIPSPIN were
> > only added in the hope of figuring out what locks it up, and they did
> > signal these 2 LORs.
> >
> > pf and pflog are loaded as modules (pf_enable and pflog_enable set to
> > yes in rc.conf).
> >
> > - The ipfw/dummynet side:
> >
> > I use net.link.ether.ipfw = 1 for MAC address checking, ipfw +
> > dummynet for traffic shaping (4 queues at 95Mbits/s for the 2
> > external interfaces in/out, and 4 more queues for traffic that goes
> > outside the AS group for which we have fast access).  Deciding which
> > queue traffic goes in depends on its source address and whether its
> > destination is in ipfw tables 1, 2 or none.  These tables are
> > synchronized from pf tables via a custom script in crontab, which
> > runs every 3 minutes.  The pf tables used as source for these are
> > controlled by OpenBGPD.
> >
> > - The pf side:
> >
> > Filtering is done here, as is policy routing.  The filtering rules
> > also redirect traffic destined to port 80 to a transparent squid
> > proxy, unless it is bound for networks received via BGP and saved to
> > tables <metro> and <special>.  Metro and special port 80 traffic goes
> > directly to the destination server.
> >
> > Traffic from net1 and net2 is routed via the "other" external
> > interface, which doesn't contain the default route... with the
> > exception of traffic to pf table <special> (from BGP, same as table 2
> > in ipfw).  Traffic to <special> is routed via fastroute in pf
> > (meaning using the default route).

That's quite a complex setup.  It would really be interesting to get the 
trace for the first LOR in order to figure out which code path we are 
looking at.  I have a feeling that it might be the dummynet entry point, 
but w/o the trace this is only speculation.

> > Attached are full dmesg and the kernel config.
> >
> > I still have access to the hard drive with 7.0-STABLE on it, but not
> > the motherboard/CPU and the network cards... they are running off the
> > hard drive with 6.2 on it.

-- 
/"\  Best regards,                      | mlaier at freebsd.org
\ /  Max Laier                          | ICQ #67774661
 X   http://pf4freebsd.love2party.net/  | mlaier at EFnet
/ \  ASCII Ribbon Campaign              | Against HTML Mail and News