[Bug 263908] Something spawning many "sh" process, system no longer boots, in single user /var/log empty

From: <bugzilla-noreply_at_freebsd.org>
Date: Wed, 11 May 2022 00:11:02 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=263908

            Bug ID: 263908
           Summary: Something spawning many "sh" process, system no longer
                    boots, in single user /var/log empty
           Product: Base System
           Version: 13.1-STABLE
          Hardware: amd64
                OS: Any
            Status: New
          Severity: Affects Only Me
          Priority: ---
         Component: misc
          Assignee: bugs@FreeBSD.org
          Reporter: greg@teamworkweb.com

Not sure how, or even if, I should report this. However, I figured I should say
something, since the process I am using to install and run 13.1-RC6 is basically
the same as what I had going with 13.0. But... with a serious issue! All things
being equal, the issues point to a flaw or difference in 13.1-RC6 compared to 13.0.

Did a fresh install of 13.1-RC6 on Sunday (05/08) evening. Ran into an issue
with the MFI driver (reported as bug 263906) but was able to work around it with
the MRSAS driver (which I intended to use anyway). Installed common packages for
benchmarks. Built a zpool using dRAID out of HDDs, plus a special vdev using a
3x mirror of SSDs. Applied a mix of system tunables that had been working
reliably under 13.0 (can provide if requested). Started a test set of
back-to-back fio and iozone benchmarks.

Next morning I went to check results. Found I could not run anything; I was
getting "No more processes" in my shell. Left it running, and later Monday
evening found I was able to run processes again. But there were over 37,000
instances of "sh" running! Mostly sleeping. I was able to pull
/var/log/messages, and found:

May  9 20:11:00 freebsd kernel: maxproc limit exceeded by uid 2 (pid 21916); see tuning(7) and login.conf(5)
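
To see whether one account is responsible for all of these messages, the log can be tallied by uid. This is only a minimal sketch: `maxproc_uids` is a hypothetical helper name, and it assumes each message sits on a single line in the log file, in the format quoted above.

```shell
#!/bin/sh
# Hypothetical helper: count "maxproc limit exceeded" kernel messages per uid.
# Reads syslog lines on stdin and prints one "uid=N count=M" line per uid.
maxproc_uids() {
    awk '/maxproc limit exceeded by uid/ {
             # The uid is the field right after the literal word "uid".
             for (i = 1; i < NF; i++) if ($i == "uid") n[$(i + 1)]++
         }
         END { for (u in n) print "uid=" u, "count=" n[u] }'
}

# Usage on the affected system (not run here):
#   maxproc_uids < /var/log/messages
#   getent passwd 2      # look up which account owns uid 2
```

Knowing which account owns the uid narrows down which service or scheduled job to look at first.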

Results from top at the time:

last pid: 22684;  load averages:  0.26,  0.18,  0.11    up 0+22:20:59  20:15:46
37976 processes:1 running, 37975 sleeping
CPU:  0.1% user,  0.0% nice,  6.0% system,  0.0% interrupt, 93.8% idle
Mem: 1112K Active, 19G Inact, 8491M Laundry, 2648M Wired, 40K Buf, 817M Free
ARC: 236M Total, 50M MFU, 108M MRU, 2067K Header, 75M Other
     90M Compressed, 222M Uncompressed, 2.46:1 Ratio
Swap: 8192M Total, 2784M Used, 5408M Free, 33% Inuse

  PID USERNAME    THR PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
22684 root          1  20    0    72M    46M CPU1     1   0:16  85.79% top
25011 ntpd          1  20    0    21M  1724K select   3   0:02   0.00% ntpd
 8242 root          1  52    0    13M  2004K wait     1   0:01   0.00% sh

Did a reboot, and it has been all downhill from there. The system will no longer
boot, at least not to a login prompt. It stalls at several points during
startup: after the USB driver loads, and after starting the network. I can coax
it along somewhat with ctrl-c/x/z; the last thing it will do is "Starting devd".

Kernel seems to be running, as it will reboot if you hit ctrl-alt-del, or power
down if you tap power button.

I can get into single user mode, but find /var/log is empty.

I let it sit for a while at one point, and it displayed a few lines over time
saying that it was killing off "sh" processes.

Because I had rebooted several times on the first night, right now I suspect
some stock ("out of the box") cron job is running and looping, creating all the
"sh" processes. But I don't have enough detail yet.

Honestly, I am still figuring out how to get the root file system out of
read-only mode when booted single user. I want to comment out everything in
/etc/crontab and try booting, to see if one of these jobs is the cause. (Again,
all "stock"; I didn't create any custom cron jobs yet.)
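
For reference, a minimal sketch of that plan from the single-user shell. The function names are hypothetical, and the root file system type is an assumption: the commands shown are the usual ones for a UFS root, with the commonly cited ZFS equivalent (using the installer's default dataset name, which may not match this system) left as a comment. Mounting the remaining file systems may also explain the empty /var/log, if it lives on a separate file system or dataset that single-user mode does not mount.

```shell
#!/bin/sh
# Nothing in this file runs automatically; call the functions by hand.

# Re-mount root read-write and mount the remaining file systems.
remount_rw() {
    # Assumption: UFS root.
    mount -u / && mount -a -t ufs
    # Assumption: ZFS root with the default dataset name; use instead:
    #   zfs set readonly=off zroot/ROOT/default && zfs mount -a
}

# Comment out every job line in a crontab file, keeping a backup as FILE.orig.
# Blank lines, existing comments, and VAR=value settings are left untouched.
disable_crontab() {
    file=$1
    cp -p "$file" "$file.orig" || return 1
    awk '/^[ \t]*$/ || /^[ \t]*#/ || /^[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ { print; next }
         { print "#" $0 }' "$file.orig" > "$file"
}

# Usage (not run here):
#   remount_rw
#   disable_crontab /etc/crontab
```

Restoring the backup with `cp -p /etc/crontab.orig /etc/crontab` undoes the change. Note that cron also reads /etc/cron.d and per-user tables under /var/cron/tabs, so an empty /etc/crontab alone does not rule cron out.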

Because of the issues with the MFI driver, I did pull the LSI 9361 HBA out of
the server. I even destroyed the dRAID pool. Neither seems related; the issue
persists.

So why am I reporting this as a "bug", when I lack enough detail to confirm the
actual issue? Because every single step I did was the same as performed under
13.0. On the same hardware, that had been 100% stable for 3+ months. All things
being equal, there is something "wrong" or "different" in 13.1-RC6 which is now
broken / breaking my setup.

In the interest of helping rule this out as a flaw in RC6, I am willing to do
what I can to troubleshoot further. But honestly I would need more input as to
proper diagnostic steps. I do have a little more time to "play" with this hardware,
before I have to select a version and put it into production. I was holding out
so I could run 13.1 when it goes to release. But if I cannot figure this out I
will roll back to 13.0 for production, since that was fully stable.

Please let me know what other details to provide, and any suggestions for
troubleshooting or further diagnostics. Just looking to contribute to RC6
testing and determine if this is a bug or a "just me" problem. Thanks!

-Greg-
