Re: Arm v7 RPi2 -current unresponsive to debugger escape during buildworld

From: bob prohaska <fbsd_at_www.zefox.net>
Date: Fri, 07 Nov 2025 02:22:37 UTC
On Thu, Nov 06, 2025 at 10:00:19AM -0800, Mark Millard wrote:
> On Nov 6, 2025, at 08:38, bob prohaska <fbsd@www.zefox.net> wrote:
> 
> > On Thu, Nov 06, 2025 at 03:45:01PM +0100, Ronald Klop wrote:
> >> Hi,
> >> 
> >> To me it sounds like your machine is overwhelmed by swapping.
> >> 
> >> Try -j1 buildworld.
Maybe a -j1 buildworld could be at least somewhat informative.
Lately none of my Pi2s has made it through buildworld
without hanging silently. If -j1 buildworld completes,
that would be a significant change. The test will take a
week, but the problem has been going on for a year.   

> > 
> > In most cases of stoppage the swap use is low, 50 MB or sometimes less.
> > Up to about 600-700 MB the machines slow their progress, but keep going and
> > there are no complaints on the console about swap taking too long or
> > insufficient. If there's a connection to swap use, it isn't obvious.
> > 
> > It seems to be related more to hours of runtime than swap use. 
> > 
> > More to the point of my question, if the machine is swap-bound,
> > shouldn't the debugger escape still work?
> 
> 
> Do your descriptions of failing to gain control refer only
> to the serial console? Do you also have ssh or such? Do all
> of them appear hung or crashed during a hang?
All comms become unresponsive, serial console or ssh.

> Do you get notices about
> loss of network connections to the RPi2 v1.1 in question?
Sometimes, but not always. Occasionally an ssh session will
become unresponsive and only later report a disconnection.

> Do any of those happen automatically? If so, the time
> of such a message could put a bound on when the RPi2 v1.1
> hung up/deadlocked/crashed, the message about failing
> communication having occurred after the problem starts on
> the RPi2 v1.1.
In some cases the stuck ssh sessions are disconnected only
after reboot completes. In others, it appears to be a matter
of time. Overnight is usually sufficient.

> I'll note that your prior reporting of the end-of-log
> content gives evidence of things that completed, including
> being flushed to the disk. But there likely was more that
> was not flushed to the disk, some of which may have
> otherwise completed. Also, what was actually active at the
> time of the potential deadlock (or other form of crash) is
> unlikely to show in the logs with such a known status.
>
In a lot of cases there's been a top session with a timestamp
and swap usage running at the time of the crash. I've not
made careful comparisons. That's the only timestamping at hand.
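
A small heartbeat logger might give tighter timestamps than a top
display: it appends one timestamped swap line per interval and calls
fsync after each write, so the last line found on disk after a reboot
puts a lower bound on when the machine stopped. This is only a sketch,
assuming Python is installed on the Pi and that `swapinfo -k` is the
swap report of interest. The log path, the interval, and the function
names are hypothetical.

```python
import os
import subprocess
import time


def swap_summary(cmd=("swapinfo", "-k")):
    """Return the last line of the swap report, or a note if the query fails."""
    try:
        out = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=10
        ).stdout
    except (OSError, subprocess.SubprocessError) as exc:
        return f"swap query failed: {exc}"
    lines = out.strip().splitlines()
    return lines[-1] if lines else "no swap output"


def write_heartbeat(path, note):
    """Append one UTC-timestamped line and fsync it before returning."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    line = f"{stamp} {note}\n"
    with open(path, "a") as log:
        log.write(line)
        log.flush()             # push Python's buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to push it to the media
    return line


def run(path="/var/tmp/heartbeat.log", interval=60):
    """Log a heartbeat forever; path and interval are placeholder values."""
    while True:
        write_heartbeat(path, swap_summary())
        time.sleep(interval)
```

Calling `run()` from a separate ssh session before starting buildworld
would leave a log whose final entry is the last heartbeat that reached
the disk. That still only bounds the failure time; it does not show
what was active when the machine stopped.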

> The I/O tries to keep the file system media content from being
> corrupted, but not necessarily that it is up to date. (Fully
> attempting both leads to either a contradiction or horrible
> performance. UFS has different tradeoffs than ZFS for such
> issues but the same general goal applies to both. At least
> that is how I'd summarize it.)
> 
> Knowing where the logs stop can give some idea what might
> follow or have been active, but it involves other analysis.
> 
> I do not know if tail -f reports buffered information vs.
> only data that makes it to media. It might be that tail -f
> in an ssh session on the/a log file might report closer
> to the failure time, showing information that does not
> make it to the media. That need not be the same as showing
> the actual failure time: just possibly closer.
> 
> 
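
On the tail -f question: a reader that opens the file goes through the
same kernel buffer cache as the writer, so data handed to write(2) is
visible to it whether or not it has reached the media. Data still held
in the writing program's own buffers is not visible. A minimal sketch
of the first point, using a temporary file and a hypothetical function
name:

```python
import os
import tempfile


def visible_before_sync(payload=b"log line\n"):
    """Write without fsync, then read back through a second file handle."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, payload)  # write(2) only; no fsync is issued
        with open(path, "rb") as reader:
            seen = reader.read()
        return seen == payload
    finally:
        os.close(fd)
        os.unlink(path)
```

If this holds on the Pi as well, tail -f in an ssh session can show
lines that never make it to the disk, which would put its last output
closer to the failure time than the on-disk log.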
> As for debugger use, there are thousands of processes.
> If you mean gdb or lldb, there is no uniquely relevant
> process to attach to and monitor that survives across all
> the activity.

Would running the buildworld command under a debugger's control
give any better access to the enter-tilde-control-B sequence on
the serial console? Usually buildworld runs from an ssh session
in the background with top display over it. I could run buildworld
under the debugger from the serial console if it makes any difference.

> 
> Are your kernel builds debug/invariants/witness builds?
> Is world a debug build? (I do not mean just having symbols
> and such as a debug build.) I wonder what the behavior would
> be for avoiding the resource overhead involved in having and
> using the debug code. (But, if it does fail, extracting
> information is normally a problem.)

Sources are all unmodified, so it's whatever -current offers.
I'd expect that to include all three; there's an explicit warning
that the witness option is enabled.
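
Rather than relying on expectation, the running kernel's configuration
could be checked directly. The sketch below assumes the kernel was
built with INCLUDE_CONFIG_FILE, so that `sysctl -n kern.conftxt`
returns the configuration text, and that options appear as
`options NAME` lines as in a kernel config file. The function names and
the list of options to look for are my own choices.

```python
import subprocess

# Kernel debug options to look for; this list is an assumption.
DEBUG_OPTIONS = ("INVARIANTS", "INVARIANT_SUPPORT", "WITNESS")


def debug_options(conftxt):
    """Return the sorted debug options named on 'options' lines of conftxt."""
    found = set()
    for raw in conftxt.splitlines():
        fields = raw.split()
        if len(fields) >= 2 and fields[0] == "options":
            name = fields[1].split("=", 1)[0]  # drop any =value suffix
            if name in DEBUG_OPTIONS:
                found.add(name)
    return sorted(found)


def read_conftxt():
    """Fetch the embedded kernel config; fails if the sysctl is absent."""
    result = subprocess.run(
        ["sysctl", "-n", "kern.conftxt"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

On the Pi, `print(debug_options(read_conftxt()))` would list whichever
of the three options are compiled in. This covers the kernel only; it
says nothing about whether world is a debug build.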

Thanks for writing!

bob prohaska