Re: 13-STABLE high idprio load gives poor responsiveness and excessive CPU time per task

From: Peter <pmc_at_citylink.dinoex.sub.org>
Date: Thu, 29 Feb 2024 18:56:10 UTC

First off: the honorable Sir Edward, III. had the decency to have me
notified that they prefer to censor my messages (reasons were not given).

I for my part consider it rather pointless to publicly ask questions,
only to then inform those who bother to answer that one declines to
receive their answers.

So much for that. Now for the call for popcorn:

On Thu, Feb 29, 2024 at 09:40:39AM -0800, Mark Millard wrote:
! > ! The kernel has multiple, distinct OOM messages. Which type are you
! > ! seeing? :
! > ! 
! > ! "a thread waited too long to allocate a page"
! > 
! > That one.
! 
! That explains why vm.pageout_oom_seq=5120 did not make a
! notable difference in the time frame.

Good. Glad it explains something. 

! If you cause a verbose boot the code:
! 
!        if (bootverbose)
!                printf(
!            "proc %d (%s) failed to alloc page on fault, starting OOM\n",
!                    curproc->p_pid, curproc->p_comm);
! 
! likely will report what process had failed to get a
! page in a timely manner.

These are ad-hoc bhyve guests which are created only for the purpose
of compiling some ports. So there is zero interest in /which/ process
fails, because any failing process will just fail the build.
The essential point is rather: the very same sizing of resources works
when booting a 13.2, and crashes reproducibly with 13.3.

! # run out), avoid pageout delays leading to
! # Out Of Memory killing of processes:
! #vm.pfault_oom_attempts=-1

Yes, I already got that far, and that doesn't help: If the system is
neither allowed to oom-kill nor to crash, it freezes and waits for the
reset button.

As this is an endless loop in the kernel, it is not resource
exhaustion, but rather the inability of the kernel to adjust resources
accordingly, due to being busy with other things (i.e. running an
endless loop).

But then, discussion about this is futile, because there exists a
patch that fully anticipates and nicely fixes the mentioned behaviour.