Re: -current on armv7 stuck with flashing disk light

From: bob prohaska <fbsd_at_www.zefox.net>
Date: Tue, 27 Jun 2023 17:16:57 UTC
On Tue, Jun 27, 2023 at 09:59:40AM -0700, Mark Millard wrote:
> On Jun 27, 2023, at 09:47, Mark Millard <marklmi@yahoo.com> wrote:
> 
> > On Jun 27, 2023, at 09:29, bob prohaska <fbsd@www.zefox.net> wrote:
> > 
> >> On Mon, Jun 26, 2023 at 07:57:05PM -0700, Mark Millard wrote:
> >>> On Jun 26, 2023, at 19:12, bob prohaska <fbsd@www.zefox.net> wrote:
> >>> 
> >>>> A Pi2 freshly updated to 
> >>>> FreeBSD 14.0-CURRENT #41 main-c3e58ace31: Mon Jun 26 17:06:01 PDT 2023
> >>>>  bob@www.zefox.com:/usr/obj/usr/src/arm.armv7/sys/GENERIC arm
> >>>> got stuck with a flashing USB disk LED after starting a -j3 buildworld.
> >>>> No response to debugger escape, had to pull the plug.
> > 
> > I'm confused.
> > 
> > That says "stuck with a flashing USB disk LED". But:
> > 
> > http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/readme
> > 
> > says: "the disk had gone to sleep mode. Both LEDs were off"
> > 
> > Are these two different examples with variable behavior
> > across the examples?
> > 

Yes, I got mixed up. There have been several failures, some
belated, and the most recent one, which was prompt (with a new
kernel).

> >>> If I understand right, the LED flashing means the disk
> >>> had not stopped doing I/O: the system was still running,
> >>> doing disk activity. (But I do not have a description
> >>> of what your drive documentation says about how the
> >>> drive handles the LED and what various patterns/colors
> >>> may mean.)
> >>> 
> >>> If the processes associated with processing input that
> >>> would identify the debugger escape had the kernel stacks
> >>> involved swapped out to swap space, I doubt that the
> >>> debugger escape would work until/unless the kernel
> >>> stacks are brought back into kernel RAM.
> >>> 
> >>> Avoiding the specific way of losing control is why I
> >>> have in /etc/sysctl.conf :
> >>> 
> >>> #
> >>> # Together this pair avoids swapping out the process kernel stacks.
> >>> # This keeps the processes used for interacting with the system
> >>> # from being hung up by such swapping.
> >>> vm.swap_enabled=0
> >>> vm.swap_idle_enabled=0
> >>> 
> >> 
> >> This combination was tried and didn't seem to have any consistent
> >> effect. It's commented out at the moment.
> > 
> > By not having them, we have no way to know if the
> > relevant kernel stacks had been moved to swap space.
> > Having them is part of problem isolation/identification
> > even when other forms of loss of control happen.
> > 
> > The 2 lines serve more than one goal.
> > 
> >>> (No claim such is the only way to lose control.)
> >>> 
> >>> You might be able to get a clue if there was disk I/O going
> >>> on based on modification times on files you know would have
> >>> been modified periodically for some time (minutes) before
> >>> you pulled the plug --but not modified on reboot and later
> >>> activity. Maybe a log file that would only be modified by
> >>> the build that you had been trying to do?
> >>> 
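The modification-time check suggested above could be sketched roughly as
follows. This is only an illustration, not something run in this thread:
the function name modified_since, the reference-file approach, and the
example paths are all assumptions. It lists files under a directory whose
modification time is newer than that of a reference file, using find(1)
with the -newer primary.

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): print the regular files
# under directory $1 that were modified more recently than file $2.
modified_since() {
    dir=$1
    ref=$2
    find "$dir" -type f -newer "$ref" -print
}

# Example use (paths and time are placeholders, adjust to the case at hand):
#   touch -t 202306230700 /tmp/ref      # reference time: 2023-06-23 07:00
#   modified_since /path/to/build/logs /tmp/ref
```

If the build log shows up as newer than a reference time set shortly
before the apparent hang, that would suggest writes were still reaching
the disk at that point. Files touched again during reboot or later
activity cannot be used this way.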
> >> 
> >> There are log files for build and disk activity (for a cold
> >> hang, no disk activity at all) at
> >> http://nemesis.zefox.com/~bob/fbsd/rpi2/20230623/
> > 
> > So this is a different hangup?
> 
> j4swapscript.log has internal timestamp pairs:
> 
> Wed Jun 21 16:34:06 PDT 2023
> . . .
> Fri Jun 23 07:26:10 PDT 2023
> 
> It would be interesting to know if "Jun 23 07:26:10"
> was after the apparent hangup was identified vs.
> before.
> 
> >> In this case the top window was via ssh. Lately I've
> >> taken to running top on the serial console in hopes
> >> that will help distinguish system hangs from USB hangs.
> > 
> > If you want to identify system hangs, please
> > put back:
> > 
> > vm.swap_enabled=0
> > vm.swap_idle_enabled=0
> > 

They're reinstated now, but I don't want to disturb the system
while it seems to be building world acceptably. 
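
For anyone following along, a minimal sketch of how one might confirm that
both lines are active (not commented out) in the configuration file. The
function name check_swap_settings is made up for illustration, and the
check assumes the plain name=0 form quoted above with nothing else on the
line.

```sh
#!/bin/sh
# Hypothetical helper (not from the thread): report whether a
# sysctl.conf-style file has an active "name=0" line for each of the
# two settings. Returns non-zero if either one is missing or commented out.
check_swap_settings() {
    conf=${1:-/etc/sysctl.conf}
    status=0
    for name in vm.swap_enabled vm.swap_idle_enabled; do
        # Escape the dots so they match literally in the regex.
        pat=$(printf '%s' "$name" | sed 's/\./\\./g')
        if grep -Eq "^[[:space:]]*${pat}=0[[:space:]]*\$" "$conf"; then
            echo "$name: set to 0 in $conf"
        else
            echo "$name: not set to 0 in $conf"
            status=1
        fi
    done
    return $status
}
```

This only inspects the file. The values the running kernel is using can be
read with "sysctl vm.swap_enabled vm.swap_idle_enabled", which is the more
direct check when the file was edited after boot.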

Sorry for mixing things up!

bob prohaska