kernel bug in 11.3-STABLE causes frequent crashes

Sat Nov 9 13:56:50 UTC 2019

Eugene,
     Thank you very much for the fast reply!

Eugene Grosbein <eugen at grosbein.net> wrote:

> 09.11.2019 19:45, Scott Bennett wrote:
> >      The rest of this message was posted a little while ago to the
> > freebsd-questions list by mistake.  It was intended for freebsd-stable,
> > so I am posting it here now after posting a brief apology on the other
> > list.
> >      I have had to waste a great deal of time lately in recovering my
> > system from crashes due to a kernel bug.  At present, my system is
> > 
> > FreeBSD hellas 11.3-STABLE FreeBSD 11.3-STABLE #12 r352571: Sat Sep 21 11:39:52 CDT 2019     bennett at hellas:/usr/obj/usr/src/sys/hellas  amd64
> > 
> > There are actually at least two problems, but this particular one has been
> > causing a large portion of my forced reboots.  It usually fails to produce
> > a dump and freezes right after the panic and backtrace messages, as it did
> > earlier tonight, but Wednesday night it did create a dump, which I am
> > keeping in case it should prove helpful in getting the bug identified and
> > solved.  I copied the console messages to paper painstakingly by hand.
> > They appear to be identical each time, except, of course, for the messages
> > that a dump is produced when, indeed, it does produce one.  I am omitting
> > those fairly standard messages.
> > 
> > Fatal trap 12: page fault while in kernel mode
> > cpuid = 2; apic id = 02
> > fault virtual address   = 0x3b8
> > fault code              = supervisor read data, page not present
> > instruction pointer     = 0x20:0xffffffff80a4b14c
> > stack pointer           = 0x0:0xfffffe012a60ea50
> > frame pointer           = 0x0:0xfffffe012a60eae0
> > code segment            = base 0x0, limit 0xfffff, type 0x1b
> >                         = DPL 0, pres 1, long 1, def32 0, gran 1
> > processor eflags        = interrupt enabled, resume, IOPL = 0
> > current process         = 28 (flowcleaner)
> > trap number             = 12
> > panic: page fault
> > cpuid = 2
> > KDB: stack backtrace:
> > #0 0xffffffff80a94707 at kdb_backtrace+0x67
> > #1 0xffffffff80a4fa2e at vpanic+0x17e
> > #2 0xffffffff80a4f8a3 at panic+0x43
> > #3 0xffffffff80f3a4d0 at trap_pfault+0
> > #4 0xffffffff80f3a519 at trap_pfault+0x49
> > #5 0xffffffff80f39bad at trap+0x29
> > #6 0xffffffff80f19f33 at calltrap+0x8
> > #7 0xffffffff80b3bb8d at flowtable_clean_vnet+0x43d
> > #8 0xffffffff80b3c758 at flowtable_cleaner+0xc8
> > #9 0xffffffff80a12ea2 at fork_exit+0x82
> > #10 0xffffffff80f1af4e at fork_trampoline+0xe
> > 
> >      The machine is ancient.  The CPU is a QX9650 (last group of Core 2
> > Quads) with 8 GB of DDR3 memory.
> >      If this can be identified as a known bug and a clue provided to a
> > patch or a safer version to upgrade to, I would be grateful.  I am getting
> > very, very tired of these crashes.
> >      The other forced reboots I will describe in a separate message, but
> > that problem has existed since the time of 11.2-RELEASE and apparently was
> > never investigated, much less fixed, although people began complaining on
> > this list and possibly -questions within the first few days after the
> > release date.
> >      Thanks in advance for any help with this problem!
>
> It seems you have a custom kernel with options FLOWTABLE. The code it includes
> is known to be buggy; this option was removed from GENERIC many releases ago.
> Remove it from your kernel configuration, rebuild the kernel, and you will be fine.
>
     Wonderful.  I have a comment on that line, saying I added it for 8.x, so I
probably found it in 8.1's GENERIC configuration file when I was preparing to upgrade
from 7.3.  It is interesting that it only started hitting me (hard enough to make
me notice it, at least) in 11.3 and maybe a bit earlier in 11.2.  Anyway, that will
be easy enough to fix, but will require rolling /usr/src back to the revision I am
running, which is probably also no big deal.
     I don't seem to be able to build it at the current source revision because
11-STABLE's buildworld began failing during the libc build two or three weeks ago.
I just tried "svn update /usr/src" again, followed by "make -j6 buildworld", and it
still fails, ending as follows.

--- libc_pic.a ---
ranlib -D libc_pic.a
--- libc.a ---
ranlib -D libc.a
--- libc.so.7.full ---
cc: error: unable to execute command: posix_spawn failed: Permission denied
cc: error: linker command failed with exit code 1 (use -v to see invocation)
*** [libc.so.7.full] Error code 1

make[4]: stopped in /usr/src/lib/libc
1 error

make[4]: stopped in /usr/src/lib/libc
*** [lib/libc__L] Error code 2

make[3]: stopped in /usr/src
1 error

make[3]: stopped in /usr/src
*** [libraries] Error code 2

make[2]: stopped in /usr/src
1 error

make[2]: stopped in /usr/src
*** [_libraries] Error code 2

make[1]: stopped in /usr/src
1 error

make[1]: stopped in /usr/src
*** [buildworld] Error code 2

make: stopped in /usr/src
1 error

make: stopped in /usr/src

     Oh, well.  During the intervening weeks, I haven't seen any src updates that
appear to have anything to do with fixing the virtual memory management bug(s)
that is/are the other thing wasting my time.  I'll start a separate thread for
that, but first I want to do the rollback and get the buildworld started.  Oh,
wait a minute...ah, yes!  I also have a snapshot of /usr/obj to the same revision,
so I won't even need the buildworld, only the buildkernel.  This should be quite
quick then.
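
     For reference, the plan above might look roughly like the sketch below.
It is only an illustration under stated assumptions: disable_flowtable is a
hypothetical helper, not part of FreeBSD; the kernel configuration name
"hellas" and its location under /usr/src/sys/amd64/conf are inferred from the
uname line quoted earlier; and r352571 is the revision that line reports.

```sh
# Hypothetical helper: comment out an active "options FLOWTABLE" line in a
# kernel configuration file.  Writes to a temporary file and renames it, so
# it does not depend on any particular sed in-place flag.
disable_flowtable() {
    conf="$1"
    tmp="${conf}.tmp.$$"
    # Match an uncommented "options FLOWTABLE" line, allowing leading
    # whitespace and any amount of whitespace between the two words.
    sed 's/^\([[:space:]]*options[[:space:]][[:space:]]*FLOWTABLE\)/#\1/' \
        "$conf" > "$tmp" && mv "$tmp" "$conf"
}

# Assumed usage (path and KERNCONF name are guesses based on the uname line):
#   svn update -r 352571 /usr/src
#   disable_flowtable /usr/src/sys/amd64/conf/hellas
#   cd /usr/src && make -j6 buildkernel KERNCONF=hellas
#   make installkernel KERNCONF=hellas
```

Commenting the line out rather than deleting it keeps the existing note about
when the option was added, and skipping buildworld relies on /usr/obj matching
the same source revision.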
     Thanks a bundle for your help.

                                  Scott Bennett, Comm. ASMELG, CFIAG
**********************************************************************
* Internet:   bennett at sdf.org   *xor*   bennett at freeshell.org  *
*--------------------------------------------------------------------*
* "A well regulated and disciplined militia, is at all times a good  *
* objection to the introduction of that bane of all free governments *
* -- a standing army."                                               *
*    -- Gov. John Hancock, New York Journal, 28 January 1790         *
**********************************************************************