Re: Arm v7 RPi2 -current unresponsive to debugger escape during buildworld

From: Mark Millard <marklmi_at_yahoo.com>
Date: Fri, 07 Nov 2025 04:15:49 UTC

On Nov 6, 2025, at 18:22, bob prohaska <fbsd@www.zefox.net> wrote:

> On Thu, Nov 06, 2025 at 10:00:19AM -0800, Mark Millard wrote:
>> On Nov 6, 2025, at 08:38, bob prohaska <fbsd@www.zefox.net> wrote:
>> 
>>> On Thu, Nov 06, 2025 at 03:45:01PM +0100, Ronald Klop wrote:
>>>> Hi,
>>>> 
>>>> To me it sounds like your machine is overwhelmed by swapping.
>>>> 
>>>> Try -j1 buildworld.
> Maybe a -j1 buildworld could be at least somewhat informative.
> Lately none of my Pi2s has made it through buildworld
> without hanging silently. If -j1 buildworld completes,
> that would be a significant change. The test will take a
> week, but the problem has been going on for a year.
> 
>>> 
>>> In most cases of stoppage the swap use is low, 50 MB or sometimes less.
>>> Up to about 600-700 MB the machines slow their progress, but keep going and
>>> there are no complaints on the console about swap taking too long or
>>> insufficient. If there's a connection to swap use, it isn't obvious.
>>> 
>>> It seems to be related more to hours of runtime than swap use. 
>>> 
>>> More to the point of my question, if the machine is swap-bound,
>>> shouldn't the debugger escape still work?
>> 
>> 
>> Do your descriptions of being unable to gain control refer to
>> the serial console? Do you also have ssh or similar? Do all
>> such connections show the machine as hung-up/crashed?
> All comms become unresponsive, serial console or ssh.
> 
>> Do you get notices about
>> loss of network connections to the RPi2 v1.1 in question?
> Sometimes, but not always. Occasionally an ssh session will
> become unresponsive and only later report a disconnection.
> 
>> Do any of those happen automatically? If so, the time
>> of such a message could put a bound on when the RPi2 v1.1
>> hung up/deadlocked/crashed, since the message about failing
>> communication would have occurred after the problem started
>> on the RPi2 v1.1.
> In some cases the stuck ssh sessions are disconnected only
> after reboot completes. In others, it appears to be a matter
> of time. Overnight is usually sufficient.
> 
>> I'll note that your prior reporting of the end-of-log
>> content gives evidence of things that completed, including
>> being flushed to the disk. But there likely was more that
>> was not flushed to the disk, some of which may have
>> otherwise completed. Also, what was actually active at the
>> time of the potential deadlock (or other form of crash) is
>> unlikely to show in the logs with such a known status.
>> 
> In a lot of cases there's been a top session with a timestamp
> and swap usage running at the time of the crash. I've not
> made careful comparisons. That's the only timestamping at hand.
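
A tighter bound on the hang time than the top display might come from
a small heartbeat logger that forces each timestamp out to the media,
so that the last line found after reboot marks a time at which the
machine was still running. What follows is only a minimal sketch in
Python, assuming python3 is installed on the RPi2 and untested there;
the log path, the interval, and the function names are made up for
illustration.

```python
import os
import time


def write_heartbeat(path, note=""):
    """Append one UTC-timestamped line to path and force it to the media."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    line = f"{stamp} {note}".rstrip() + "\n"
    with open(path, "a") as log:
        log.write(line)
        log.flush()             # push Python's buffer to the kernel
        os.fsync(log.fileno())  # ask the kernel to write it to the media
    return line


def run_heartbeat(path, interval_s=10.0, count=None):
    """Write a heartbeat every interval_s seconds.

    count=None keeps going until the process is killed (or the machine hangs).
    Returns the number of lines written.
    """
    written = 0
    while count is None or written < count:
        write_heartbeat(path, note=f"beat {written}")
        written += 1
        if count is None or written < count:
            time.sleep(interval_s)
    return written


# Example use (hypothetical path; pick a file system that survives the reboot):
#   run_heartbeat("/var/tmp/heartbeat.log", interval_s=10.0)
```

Left running in a spare ssh session during buildworld, the last
timestamp in the file should be within about one interval of the
hang, give or take whatever delay the I/O itself was seeing by then.
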
> 
>> The I/O tries to keep the file system media content from being
>> corrupted, but not necessarily that it is up to date. (Fully
>> attempting both leads to either a contradiction or horrible
>> performance. UFS has different tradeoffs than ZFS for such
>> issues but the same general goal applies to both. At least
>> that is how I'd summarize it.)
>> 
>> Knowing where the logs stop can give some idea what might
>> follow or have been active, but it involves other analysis.
>> 
>> I do not know if tail -f reports buffered information vs.
>> only data that makes it to media. It might be that tail -f
>> in an ssh session on the/a log file might report closer
>> to the failure time, showing information that does not
>> make it to the media. That need not be the same as showing
>> the actual failure time: just possibly closer.
>> 
>> 
>> As for debugger use, there are thousands of processes.
>> If you mean gdb or lldb, there is no uniquely relevant
>> process to attach to and monitor that survives across all
>> the activity.
> 
> Would running the buildworld command under a debugger's control
> give any better access to the enter-tilde-control-B sequence on
> the serial console? Usually buildworld runs from an ssh session
> in the background with top display over it.

The enter-tilde-control-B sequence is via the tty driver and
kernel. It is not tied to a specific process.
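
Because the sequence is handled in the kernel, the thing to check is
the kernel's debugger settings rather than anything about how
buildworld is started. A rough sketch in Python of such a check
follows. It assumes the debug.kdb.* sysctl names listed in the code
match what the running kernel exposes (worth confirming with
"sysctl debug.kdb" on the RPi2); the function names are illustrative
only.

```python
import subprocess

# Sysctl names believed to control break handling; treat them as assumptions
# and confirm against `sysctl debug.kdb` on the target system.
KDB_SYSCTLS = (
    "debug.kdb.available",
    "debug.kdb.current",
    "debug.kdb.break_to_debugger",
    "debug.kdb.alt_break_to_debugger",
)


def parse_sysctl(text):
    """Turn 'name: value' lines from sysctl(8) output into a dict."""
    settings = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            settings[name.strip()] = value.strip()
    return settings


def alt_break_enabled(settings):
    """True if the alternate break sequence is on and a debugger backend is set."""
    enabled = settings.get("debug.kdb.alt_break_to_debugger") == "1"
    has_backend = bool(settings.get("debug.kdb.current"))
    return enabled and has_backend


def read_kdb_settings():
    """Query the live system; only meaningful on FreeBSD."""
    result = subprocess.run(
        ["sysctl", *KDB_SYSCTLS], capture_output=True, text=True, check=False
    )
    return parse_sysctl(result.stdout)
```

If alt_break_enabled(read_kdb_settings()) reports False on a machine
that is still healthy, the sequence would not be expected to work
during a hang either.
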

> I could run buildworld
> under the debugger from the serial console if it makes any difference.
> 
>> 
>> Are your kernel builds debug/invariants/witness builds?
>> Is world a debug build? (I do not mean just having symbols
>> and such as a debug build.) I wonder what the behavior would
>> be for avoiding the resource overhead involved in having and
>> using the debug code. (But, if it does fail, extracting
>> information is normally a problem.)
> 
> Sources are all unmodified, so it's whatever -current offers.
> I'd expect that to include all three; there's explicit warning
> that the witness option is enabled.  


===
Mark Millard
marklmi at yahoo.com