ARM64 hosts eventually lock up running net-mgmt/unifi

Tue Aug 4 22:37:57 UTC 2020

On Mon, 03 Aug 2020 10:44:01 -0700,
Ronald Klop wrote:
> 
> On Mon, 03 Aug 2020 18:59:26 +0200, Josh Howard <bsd at zeppelin.net> wrote:
> 
> > This one has been sort of a pain to narrow down, but on any of:
> > RockPro64, RockPi4b, or RPI4, if I run net-mgmt/unifi, eventually the
> > host just hard locks. Nothing over serial, nothing interesting in the
> > logs, no other hints, so it's not clear what precisely is causing it.
> > For those unfamiliar, unifi runs both a Java app and a mongodb server.
> > I've tried with openjdk8 (their only supported version) and openjdk11;
> > neither one made any difference. I'm not totally sure how a userland
> > app like this could cause this to happen, but it's getting consistent
> > that it eventually does kill my host.
> > 
> > Any ideas or hints would be great!
> 
> 
> I had the same problem. The default number of nmbclusters is too
> low. When they are all in use, the OS becomes very unresponsive.
> 
> I run this script hourly. It doubles the number of nmbclusters if more
> than half are occupied.
> 
> @hourly bin/nmbclustercheck.sh
> [root@rpi3 ~]# more bin/nmbclustercheck.sh
> #! /bin/sh
> 
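> # The first field of the "mbuf clusters" line is current/cache/total/max.
> LINE=$( netstat -m | grep "mbuf clusters" | cut -d ' ' -f 1 )
> CURRENT=$( echo $LINE | cut -d '/' -f 1 )   # clusters currently in use
> MAX=$( echo $LINE | cut -d '/' -f 4 )       # configured maximum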
> 
> if test $CURRENT -gt $(( $MAX / 2 ))
> then
>         NEW_MAX=$(( $MAX * 2 ))
>         echo Increase kern.ipc.nmbclusters from $MAX to $NEW_MAX
>         sysctl kern.ipc.nmbclusters=$NEW_MAX
> fi
> 
> 
> Current value after 14 days of uptime:
> [root@rpi3 ~]# sysctl kern.ipc.nmbclusters
> kern.ipc.nmbclusters: 19250
> 

Thanks for the lead!

I did try this, but sadly it didn't change anything. I graphed the
nmbcluster usage over about 12 hours, but at some point the system
simply hung, and there was no recovering short of a hard reboot. The
number of used clusters did increase gradually, but never got close
to the limit. I agree it seems likely to be related to some kind of
resource exhaustion, but I'm just not getting any indication of which
resource it is.
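
For anyone wanting to collect the same kind of data, sampling along
these lines would work. This is only a minimal sketch, not the exact
commands used for the graph above: the function names and the log path
are hypothetical, and it assumes `netstat -m` prints the
"mbuf clusters in use (current/cache/total/max)" line that the quoted
script also relies on.

```shell
#!/bin/sh
# Sketch: record mbuf cluster usage over time so it can be graphed later.
# Assumption: netstat -m prints a line whose first field is
# current/cache/total/max, followed by "mbuf clusters in use".

# parse_clusters: read netstat -m output on stdin, print "current max".
parse_clusters() {
    awk '/mbuf clusters in use/ { split($1, f, "/"); print f[1], f[4]; exit }'
}

# log_once: append "epoch_seconds current max" to the given log file.
# Each sample is appended immediately, so the last line written before
# a hang shows how close usage was to the limit.
log_once() {
    logfile=$1
    set -- $(netstat -m | parse_clusters)
    echo "$(date +%s) $1 $2" >> "$logfile"
}

# Example (hypothetical path), run from cron once a minute:
#   log_once /var/log/nmbclusters.log
```

The resulting file has one "timestamp current max" line per sample,
which any plotting tool can read directly.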