Re: tracking down i386 mount issue between Aug 2023 and now-- RELENG_13

From: mike tancsa <mike_at_sentex.net>
Date: Mon, 29 Jan 2024 15:45:01 UTC

Still trying to track this issue down. It's not just one partition; often
the entire disk I/O locks up, with processes stuck. The CF comes up as
ada0, and I don't see any commits that have touched that. The box is a
single GEODE CPU, but I tried both SMP and UP kernels and it still seems
to happen. If I play with rtprio on some processes, that *seems* to
trigger the issue more often.  I did try a RELENG_14 image on a couple
of test boxes, and so far those seem to have survived the weekend without
lockups. It doesn't seem to be memory pressure, as available RAM holds
steady from bootup to lockup.

     ---Mike


On 1/16/2024 9:48 AM, mike tancsa wrote:
> Not sure exactly where to start, but I noticed this recently on an
> i386 nanobsd image running on old PC Engines Alix devices that had
> been rock solid for years. We have a few dozen in the field running
> RELENG_13 from Aug, and they have been very stable on STABLE over
> the years.  However, somewhere between Aug 2023 and now I started
> getting some lockups that are difficult to diagnose as the devices are
> remote.  I did manage to find one odd thing on a local test unit, where
> a remount of a backup partition is hung.
>
> # ps -auxwwwwp 3443
> USER  PID %CPU %MEM  VSZ  RSS TT  STAT STARTED     TIME COMMAND
> root 3443  3.3  0.9 4708 2320  -  D<   20:18   34:55.20 /sbin/mount -ur /dev/ada0s4 /logs
>
> I don't have truss on the box to attach to the process, and ktrace
> doesn't seem to show anything either.  Does this sort of hang ring a
> bell for anyone? Looking back at the git logs, a coarse search for
> anything to do with mount doesn't come up with much (2 below).  Also,
> a new version of clang has come in since then, so I'm not quite sure
> where to start.
>
> Any guidance appreciated. Testing is difficult as the hang doesn't
> always happen -- sometimes within a day, sometimes after 5 days.  ssh is
> usually borked, as are some processes.  I have a scaled-down
> telegraf agent collecting some basic stats, and the CPU is pegged at
> 100%. These are single-core devices, so I'm not sure what is pegging
> the CPU.  RAM still shows some available, so it doesn't seem to be
> memory pressure.
>
>
> commit 71fceff2480999b3fc921f47ec9adea9eff32041
> Author: Andrew Gierth <andrew@tao146.riddles.org.uk>
> Date:   Sun Dec 24 14:04:21 2023 +0200
>
>     vfs_domount_update(): correct fsidcmp() usage
>
>     (cherry picked from commit 2a1d50fc12f6e604da834fbaea961d412aae6e85)
>
> and
>
> commit 608ccfc29fb48d8edc59a97382936790c02d27f3
> Author: Konstantin Belousov <kib@FreeBSD.org>
> Date:   Thu Nov 9 22:18:47 2023 +0200
>
>     vfs_domount_update(): ensure that 'goto end' works
>
>     PR:     274992
>
>     (cherry picked from commit ede4c412b3ea9289ef42c664b01b6b5ff7eac434)
>