No buffer space available error

Tue Jul 18 14:30:47 UTC 2006

Hello,

I've been trying to solve this problem by myself for a long time now, but with no luck.
I run a few dozen FreeBSD 5.3/5.4 machines, which serve as routers, NAT boxes, and
Apache, Postfix, OpenVPN, ... servers. Most of them are low-cost PCs, since
they are usually deployed in SOHO environments and the loads are rather low.

I am having problems with the "No buffer space available" error like this:

  Jul 18 08:49:36 Router openvpn[661]: write UDPv4: No buffer space available (code=55)

so this is obviously when OpenVPN tries to send UDP packets. And also like this:

  Jun 23 06:27:38 Router pdns[2182]: Unable to send a packet to our recursing
  backend: No buffer space available

when the PowerDNS server tries to do some recursive work. I have been searching Google
for a solution and found that the error should appear when the mbuf (or sfbuf?) pool
is "full", and that I can print the current buffer status with 'netstat -m'.
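To see which daemons hit the error and how often, log entries like the two above can
be tallied per program. This is only a minimal sketch: the function name
enobufs_by_program is made up for illustration, and it assumes the standard syslog
layout where each entry is a single line and the fifth field is "program[pid]:".

```sh
#!/bin/sh
# Hypothetical helper: reads syslog lines on stdin and prints how many
# "No buffer space available" entries each program logged.
enobufs_by_program() {
    grep 'No buffer space available' |
    awk '{
        prog = $5                        # e.g. "openvpn[661]:"
        sub(/\[[0-9]+\]:$/, "", prog)    # strip the "[pid]:" suffix
        count[prog]++
    }
    END { for (p in count) printf "%s %d\n", p, count[p] }' |
    sort
}
```

It would be fed from the system log, for example
"enobufs_by_program < /var/log/messages" (the path is an assumption about where
syslog writes on these boxes).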

Because the error shows up at random times (and when it does, it also blocks networking
on that server), I set up the "swatch" daemon on all those servers, so that as soon as
the error is logged in messages, swatch runs this script:

#!/usr/local/bin/bash
# Run by swatch when the error is logged: append a snapshot of sockets,
# connections, network buffers and processes to the log file.
LOG=/var/log/swatch.log

datum=$(date)
# The header goes to the log as well, so that each snapshot is timestamped
echo "============== $datum ===============" >> "$LOG"
sockstat >> "$LOG"
echo "------------------------------------------------------------" >> "$LOG"
netstat -n -a >> "$LOG"
echo "------------------------------------------------------------" >> "$LOG"
netstat -m >> "$LOG"
echo "------------------------------------------------------------" >> "$LOG"
ps ax >> "$LOG"
echo "============================================================" >> "$LOG"

Even though the log was growing as expected, I couldn't find anything particularly
interesting, because the "netstat -m" command issued by swatch (at the time of the
error) still shows something like this:

2 mbufs in use
1/17088 mbuf clusters in use (current/max)
0/6/4528 sfbufs in use (current/peak/max)
2 KBytes allocated to network
0 requests for sfbufs denied
0 requests for sfbufs delayed
1819 requests for I/O initiated by sendfile
7578 calls to protocol drain routines

I am not sure, but as I understand it, this means that the buffers are quite OK.
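To make that reading explicit, the cluster line can be turned into a percentage of the
configured maximum. This is a minimal sketch that assumes the "current/max" output
format shown above (other FreeBSD versions may print the line differently); the
function name mbuf_cluster_usage is made up for illustration.

```sh
#!/bin/sh
# Hypothetical helper: reads `netstat -m` output on stdin and reports
# mbuf cluster usage as a percentage of the configured maximum.
mbuf_cluster_usage() {
    awk '/mbuf clusters in use/ {
        # The first field looks like "current/max", e.g. "1/17088"
        split($1, f, "/")
        if (f[2] > 0)
            printf "mbuf clusters: %d/%d in use (%d%%)\n", f[1], f[2], (100 * f[1]) / f[2]
    }'
}
```

Used as "netstat -m | mbuf_cluster_usage", it prints a single line; the percentage is
truncated to a whole number, so the snapshot above comes out at 0%, nowhere near the
limit.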

What would be the "proper" way to debug this problem? This is happening on machines
with various hardware, from a good old Pentium I with 32 MB RAM up to a P4 3 GHz with
1 GB RAM, with various network cards (mostly rtl8139), on ADSL or VDSL, although the
errors are very rare on the VDSL boxes (where the upstream bandwidth is substantially
greater).

Usually the errors appear but the users are not really bothered, so it looks like
the problem sometimes goes away by itself (the connection is restored), but sometimes
a reboot is needed.

Thanks for your ideas.

P.S.: If the output of the script above could be helpful, let me know, I can publish
it somewhere.

Cheers,
Nejc