[Bug 276870] mbuf cluster leak with on pf+bird2 bgp routers

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 07 Feb 2024 15:25:11 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=276870

            Bug ID: 276870
           Summary: mbuf cluster leak with on pf+bird2 bgp routers
           Product: Base System
           Version: 13.2-STABLE
          Hardware: Any
                OS: Any
            Status: New
          Severity: Affects Some People
          Priority: ---
         Component: kern
          Assignee: bugs@FreeBSD.org
          Reporter: thomas@gibfest.dk

Created attachment 248234
  --> https://bugs.freebsd.org/bugzilla/attachment.cgi?id=248234&action=edit
Screenshot of mbuf cluster total use as reported by netstat -m over time

Hello 🙂

Last month one of my FreeBSD routers stopped forwarding (it stopped responding
on the network entirely and I had to go in via IPMI) because it ran out of
mbuf clusters. It usually operates far from the limit, but something is (was)
leaking mbuf clusters badly, and I suspect it might be bird2, or a combination
of bird2 and a FreeBSD kernel bug.

----

Some background:

The boxes in question are BGP routers for a small network. They run bird2 and
only get a default route from upstream BGP, not a full table.

Due to a missing/misconfigured kernel export filter, bird was repeatedly trying
to export some routes to the kernel which the kernel already knew (from
statically configured blackhole routes). These errors have been repeating in
the logs for some time (more than a year, so this in itself has not been an
issue):

Jan 11 19:09:04 dgncr2a bird[30963]: KRT: Error sending route 2a09:94c0::/29 to kernel: File exists
Jan 11 19:10:04 dgncr2a syslogd: last message repeated 1 times
Jan 11 19:10:04 dgncr2a bird[30963]: KRT: Error sending route 85.209.116.0/22 to kernel: File exists
Jan 11 19:11:04 dgncr2a syslogd: last message repeated 1 times

Over the holidays I upgraded from bird 2.0.9 to bird 2.14, as well as upgrading
FreeBSD from 13-STABLE-384a885111ad to 13-STABLE-2cbd132986a7. I suspect one of
these two changes made this problem appear. I made no changes to bird or router
config other than the upgrades.

----

The mbuf cluster leak was pretty bad: roughly 8-10 clusters per second at a
steady rate. The kern.ipc.nmbclusters limit on my routers was around 2 million
and I have now raised it to 4 million.
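
To put the leak rate in perspective, here is a rough back-of-the-envelope
estimate of how long a steady leak takes to reach the cluster limit. It is
only a sketch: the function name is made up for illustration, and it assumes
the leak is constant and that baseline usage is negligible compared to the
limit (I have not measured the real baseline).

```python
# Rough time-to-exhaustion estimate for a steady mbuf cluster leak.
# Assumption: usage starts near zero relative to the limit (in_use=0).

SECONDS_PER_DAY = 86400


def days_until_exhaustion(limit, leak_per_second, in_use=0):
    """Days until `limit` clusters are in use at a constant leak rate."""
    return (limit - in_use) / leak_per_second / SECONDS_PER_DAY


# Old and new kern.ipc.nmbclusters limits, at 10 and 8 clusters per second.
for limit in (2_000_000, 4_000_000):
    fastest = days_until_exhaustion(limit, 10)
    slowest = days_until_exhaustion(limit, 8)
    print(f"limit {limit}: {fastest:.1f} to {slowest:.1f} days")
```

Under those assumptions the old limit of about 2 million lasts roughly 2.3 to
2.9 days, so doubling the limit only buys a few extra days.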

Since I had no idea what was causing the leak and I was desperate for a fix, I
at one point tried adding the missing kernel export filter (to at least
silence the noisy warnings in the logs), and imagine my surprise when the mbuf
cluster leak stopped.

I tried removing the filters again; the leak started again, and stopped again
when I re-added the filters. It appears some combination of bird 2.14 and
exporting routes already found in the kernel means leaking mbuf clusters like
crazy.

I have no idea if this is a bird or a FreeBSD problem. I reported the issue to
the bird-users@ list
(http://trubka.network.cz/pipermail/bird-users/2024-January/017314.html) and
was encouraged in that thread to open this PR as well.

The attached grafana screenshot shows the per-second rate of increase (seen
over 5 minutes) of the "total" number in the "mbuf clusters in use" line of the
`netstat -m` output for both routers. The green line is the active and the
yellow line is the passive router.
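
For anyone who wants to reproduce the graphed number without grafana, a
minimal sketch of the calculation follows. It assumes the `netstat -m` line
looks like "N/N/N/N mbuf clusters in use (current/cache/total/max)"; the
function names and the sample values are invented for illustration and are
not real output from my routers.

```python
import re

# Assumed format of the relevant `netstat -m` line:
# "499000/1000/500000/2000000 mbuf clusters in use (current/cache/total/max)"
CLUSTER_LINE = re.compile(
    r"^\s*(\d+)/(\d+)/(\d+)/(\d+) mbuf clusters in use", re.MULTILINE
)


def total_clusters(netstat_output):
    """Return the "total" field of the "mbuf clusters in use" line."""
    match = CLUSTER_LINE.search(netstat_output)
    if match is None:
        raise ValueError("no 'mbuf clusters in use' line found")
    return int(match.group(3))


def leak_rate(earlier_output, later_output, seconds_between):
    """Clusters gained per second between two `netstat -m` samples."""
    delta = total_clusters(later_output) - total_clusters(earlier_output)
    return delta / seconds_between


# Made-up samples taken 300 seconds (5 minutes) apart.
before = "499000/1000/500000/2000000 mbuf clusters in use (current/cache/total/max)"
after = "501500/1200/502700/2000000 mbuf clusters in use (current/cache/total/max)"
print(leak_rate(before, after, 300))
```

With these invented samples the total grows by 2700 clusters in 300 seconds,
i.e. 9 per second, which is in the range I saw while the leak was active.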

The drop in the green line and the following spike towards the end
(2000-2100ish) is me filtering the blackhole routes from the bird kernel
export, removing the filter to confirm, and re-adding it.

I can to some extent test stuff, but the routers are in production so nothing
too wild.
