Processes hang in state "kmem a", system hang follows

Matthew Rezny matthew at reztek.cz
Sun Feb 2 19:04:54 UTC 2014


On Sat, 1 Feb 2014 20:31:12 +0100
Matthew Rezny <matthew at reztek.cz> wrote:

> I'm seeing rather strange behavior from 10.0 on i386 thus far. This is
> another long message, so if you want the summary without back-story,
> skip to the end. Sometimes it's hard to include relevant details
> without feeling like I'm rambling.
> 
> I started with FreeBSD not long before 4.0 release and ran 4.x
> releases on i386 and Alpha for a long time. I tried the 5.x releases
> and had nothing but trouble so stuck with 4.x through that time. The
> Alpha never did move off 4.x before it got retired, but some of my
> i386 boxes made it onto 6.x and then sat there until they were taken
> out of active use. For years, FreeBSD 4.x and 6.x were the reliable OS
> I used for everything but my desktop (which had been OS X).
> 
> More recently I started using FreeBSD 8 on amd64 with ZFS and quickly
> moved on to 9 as soon as 9.0 was released. At the same time, i386
> hardware retired from desktop roles but suitable for network services
> got 8.x installed on UFS. I had rather good experience with 9-STABLE
> on amd64 running with ZFS. For the most part it's solid, ZFS support
> is much better than the sorry state Apple left it in before
> abandoning it on OS X, though I did get a few kernel panics when
> simply connecting disks that contained zpools from OS X. Due to both
> compilation speed difference and the fact older hardware tends to be
> in more entrenched roles, I left my i386 systems out of the ZFS and
> 9.x experiments. I did also try 9.x on my one ppc64 box at various
> times to see if that might be a good way to utilize hardware Apple
> dropped support for years prior. The state on ppc64 varied between
> panic on boot to being able to buildworld but an idle system left for
> a few days would randomly go zombie, console freezes but clearly
> there is some system activity and it responds to ping but might not
> take a ssh connection, which I chalked up to the experimental state
> of the port. I did see console freezes on i386 boxes booted from a
> 9.1 mfsbsd image but never investigated because I was just using it
> to image and erase disks on old machines where I considered the
> hardware suspect.
> 
> In the last couple months I've been moving my amd64 systems to 10,
> starting during the RCs and keeping up such that they are now all
> 10-STABLE. The transition was fairly smooth and they are running quite
> well. Even one box that has prior chipset and BIOS, which was
> panicking with an early 10-BETA is now running 10.0-RELEASE with KMS.
> All very impressive. So, time to start migrating some i386 boxes I
> figure. I had recently moved a number of them to 9.2 and figured I
> should just go ahead and move everything up to 10.0 at close to the
> same time if possible. I had seen no problems with 9.2 or 9-STABLE on
> the i386 boxes that I was preparing to upgrade, I already sorted out
> one Clang bug that affected a few (but less severe than a similar GCC
> bug that remains unfixed) since I had switched compilers when going
> to 9.
> 
> Since I started moving i386 boxes to 10.0, I've had nothing but
> strange problems. Last night I wrote a message about kern.maxswzone,
> something I started getting warnings about on one particular box when
> I put 9.2 on it but which I didn't try to do anything about until
> now. I wrote that message with this one in mind, mentioning that I
> would have another about processes hanging. That one came first
> because it has at least some hard numbers and not so much subjective
> feelings of performance and reliability. Between then and now, the
> pattern struck me, all my early successes with 10 were amd64, and now
> all the i386 boxes I've upgraded are barely functional.
> 
> I have 4 i386 boxes that I tried to put 10.0 on in the past week with
> various degrees of fail. There are 2 sets within the four, two are the
> low-end C3 boxes with 256MB and 384MB RAM described in my prior
> message to the list. The other two are Pentium4 systems, one with 2GB
> RAM and the other with 3GB, substantially bigger disks, decent GPU,
> etc. In other words, two are ancient and two are merely a little
> dated but still very usable. This faster pair I will mention first,
> then I will return to the slow pair. All these boxes are things I use
> around the house for network services or as essentially terminals in
> other rooms (kitchen pc to look up stuff, bedroom pc to watch movies,
> etc). The i386 boxes that run important services (externally facing
> network services, routing/firewall, etc) are being left to a second
> round once all issues are sorted out on these lower-importance boxes
> first.
> 
> The P4s had 9-STABLE installed on UFS volumes. I did the switch from
> csup to svnup to pull the 10.0 sources, did the buildworld/kernel and
> install on both and all looked good. Before I went on to reinstall
> packages or anything else, I decided now might be a good time to try
> switching from UFS to ZFS, everything in /home was already backed up.
> So far I had only tried ZFS on amd64 due to early reports of
> flakiness on i386 related to exhausting kernel memory. In the couple
> years since initial support, the ZFS code has gotten better
> integrated, more people have tried it, some tuning guides have been
> written, and I've seen reports of it being used on boxes with 512MB
> RAM. Most of my i386 boxes in server roles have 2GB and it would be
> nice to migrate those to ZFS if possible. Best to test on these boxes
> first and try tuning if needed.
> 
> I booted both P4 boxes from mfsbsd CD, mounted the existing UFS
> volumes, tarred the whole mess, and dropped the uncompressed tar on my
> file server. On the server, I fired off xz to compress the tar file to
> speed the restore (or so I thought) while I prepared the machines. I
> setup the zpools in the normal way I'd done all my amd64 boxes. One
> P4 box has a single disk, the other has two, so one is a single vdev
> pool and the other is multiple, which adds a little variety for
> testing. Aside from vdevs, the pool properties, filesystems and their
> properties are all identical to how I've been setting up my other ZFS
> boxes. LZ4 on most filesystems, gzip or none on a few, sha256 hashes
> entirely, no dedupe, pretty normal. With the pools configured and
> mounted on /zroot, I scp the tar.xz file for each box into /tmp
> (which is tmpfs), and try tar xjpvf in /zroot.
> 
> After initial good progress, both boxes seemed to hang at about the
> same time. Disk activity stops, tar is sitting there as if it's going
> to do something, but no further progress on either when left for an
> hour. I started top on both boxes and notice that the tar process on
> each is in the state "kmem a" and the resident memory allocation on
> each is exactly the same (around 750MB). My first thought was that I
> used too much RAM with the 500MB tar.xz file in tmpfs. One box says
> 800MB free and the other says 1800MB free but maybe there is a
> shortage of kernel memory. I can't seem to kill tar, so I just reboot
> each, clear the zpools to try from a fresh state again, mount the
> swap before filling /tmp this time, then attempt another extract. No
> joy, it stops the same way, with the exact same memory allocation,
> and each box is stopped on the exact same file as where each stopped
> on the first attempt. The free memory reports are the same as before,
> no swap is being used, so whatever is running out must be non-pageable.
> 
> The next thing I try is decoupling the stages. The tar process is
> growing so large because it has to decompress lzma which requires a
> huge dictionary. I figure maybe the heavy disk I/O is causing
> buffers/cache to contend with the process in some way. Reboot again
> for a fresh start, scp the .tar.xz to /zroot/tmp, xz -d so it's just a
> plain tar, then tar xpvf in /zroot and both complete without error.
> Set the mountpoint to / for each zroot and reboot into the running
> system. That was strange but solvable. I don't know what the "kmem a"
> state is but I can guess it's probably short for something like "kmem
> alloc" which would suggest to me the process is waiting on a kernel
> allocation. So I figure I've got some tuning to do and a hung process
> isn't as bad as the kernel panics others had reported on i386 under
> heavy I/O load (e.g. rsync) with default settings. After all, the boot
> messages include two warnings about tuning ZFS memory on i386. In
> order to do the tuning, I need some reproducible load, and buildworld
> is good for that. So, first thing is switch from svnup to svnlite
> that is now in base and use that to get 10-STABLE sources. I do the
> rm -r on /usr/src and /usr/ports and then fire off the svnlite co for
> each. I find that the slowness of svn checkout is due to network
> latency and running the two in parallel doesn't create I/O contention
> on either disk or network.
> 
> While the P4s are fetching their sources, I go to deal with the pair
> of Via C3 boxes that I had taken to 10-PRERELEASE just a week prior
> and was ready to upgrade to 10-STABLE. Since that upgrade, they sat
> unused waiting for an impending MFC so I could do away with a local
> patch. As mentioned in my other message, I made a mistake here on my
> first attempt, I forgot to clear the existing /usr/src and /usr/ports
> before starting the svnlite checkout. After realizing my mistake, I
> did the now larger (as it includes a .svn dir) rm -r of those dirs to
> start fresh. That's when I hit the problem with rm hanging on one box.
> Without repeating all the details, I had to boot mfsbsd to do the rm
> on the one box with only 256MB RAM, but what difference that made is
> simply inexplicable. Once I had gotten that straightened out, I
> started off the svnlite checkout fresh. On the box with 384MB, the
> checkout completed with only one restart for network dropout (common since it
> takes 2-3 hours per checkout). On the box with 256MB (which had
> previously fully checked out and gotten to the point where it wanted
> to prompt me for the conflict on every file in the tree), svnlite
> could only do a hundred files or so before it seemed to hang in the
> same way as rm. Running just one instance on /usr/src without the
> parallel checkout on /usr/ports made no difference. When rm was
> hanging, I might be able to kill it (after several minutes wait) and
> reboot or the console might lock. When svnlite hung, I could not
> login but I might be able to run a command on another VT. I was able
> to catch that svnlite is getting stuck in the state "kmem a". Hmmm...
> the same state that tar was getting stuck in on the other boxes. How
> were those doing now?
> 
> I look back at the P4s, which should be done as it's been a few hours
> spent on the C3 boxes. They are sitting there in the middle of
> checkout not making any visible progress. Ctrl-c doesn't work, I can't
> switch VTs, even ctrl-alt-del seems to not work. Seems like the
> consoles are hung in a way eerily similar to what I'd seen from 9.x on
> non-amd64 platforms (both ppc64 and i386). I attempted to initiate an
> ssh connection into each of the P4s and then walked off for a minute
> for refreshment. When I came back, expecting to find a login prompt or
> a timeout, I found the ssh attempts timed out and the two boxes had
> rebooted. I don't know if the ctrl-alt-del finally registered or if
> the incoming ssh connection pushed them over the edge. I wasn't there
> to see and the logs for both stop sometime before the hang. With both
> rebooted, I do a svnlite cleanup in /usr/src and /usr/ports on both,
> then fire off the svnlite co for each directory on both boxes.
> 
> While those were running, I started digging into the kern.maxswzone
> tunable on the C3 box with less RAM. The box with more RAM was able
> to do the rm, svn checkout of both src and ports in parallel, and
> showed no obvious sign of trouble, though I hadn't started a
> buildworld yet. The box with less RAM was failing all over the place
> and the only obvious difference was the warning about that tunable.
> After I wasted hours figuring out the value is already sufficient but
> is apparently reduced after it's set, so it can't be effectively
> turned up, only down, I wrote my previous message to this list on
> that topic specifically and then went to bed.
> 
> This morning I got up and was already thinking about the correlation,
> that 10 is a disaster on all my i386 boxes thus far. The first thing I
> checked was the P4 boxes. Both completed the svn checkout on both src
> and ports, good sign. However, the box with 3GB RAM has the message
> "vm_thread_new: kstack allocation failed" repeated about a dozen
> times, bad sign. First thing I do is try to run top to see what the
> size of ARC is, free RAM, etc. "No more processes." Uh Oh, that's no
> good at all, can't even run top. Curiously, the box with less RAM,
> only 2GB, has no messages so I try to start top on it to see what
> its state is. Nothing happens when I push return, the cursor is just
> sitting there after top. On another VT, reboot gets the same
> response, none, cursor just sits. I can't type but I can switch VTs
> and scroll, until I do ctrl-alt-del, then every key press after that
> is a beep. Back on the one that said no processes left for top,
> reboot gets the same non-response. ctrl-alt-del doesn't beep, it just
> spits out the ^[[3~ typical of a dead console. Ugh, not even a reset
> button to punch on these P4 boxes.
> 
> So, svnlite checkout is a real strain that can bring a system to its
> knees. I'm not sure if this should be regarded as horrible
> inefficiency or as a means of checking the box before launching into
> a buildworld (as if that wasn't enough strain to uncover most
> problems). While 10.0 is good on amd64, it seems a disaster on i386.
> Processes hang in this "kmem a" state, and it doesn't take much more to
> get the box to livelock. I've only seen the "kmem a" state a few
> times as most other times I can't inspect anything before the box is
> locked too hard to do anything. In some cases I'm not sure there's
> even a way to get the box shutdown clean as the most trivial of
> things lock it up hard. It's not even required to do anything. When I
> was experimenting with kern.maxswzone last night I rebooted one box a
> few dozen times, so if I didn't need to look at sysctl output I just
> hit ctrl-alt-del at the login prompt. Once the console died right
> then, it had just booted and ctrl-alt-del was met with a beep and
> then it's hung, have to punch reset. I'm guessing the console dies as
> a result of total wedging of I/O systems following heavy disk I/O.
> The cause is not just ZFS because the C3 boxes are UFS. The problem
> is not just the excess swap on the smallest box because I see the
> same sort of troubles on the box with the most RAM. Some kernel
> resource seems to be exhausted regardless of how much RAM or swap is
> present. 
> 
> I'm going to try buildworld on 3 of these to see what happens. For the
> fourth, I still need to get sources onto the disk before I can even
> attempt that. I'm not sure what to expect. It might be instant
> miserable failure, or it might actually run a long time since the I/O
> load is in bursts with lots of recovery time between. It'll take a few
> hours to see if the P4s succeed. It'll take two days to see a C3
> succeed. Maybe by that time, someone will get through all I've written
> and have some useful suggestion for debugging. To me, it's rather hard
> to debug since I have little hint where to start, when the problem
> manifests any logging stops, and the box is in a state where it is
> essentially unobservable without a JTAG to jump in and directly
> inspect the state of its world.

Replying to self to give status update to anyone reading along.

The pair of P4 boxes made it through buildworld/kernel after a few
tries. On these boxes I have /usr/obj mounted on a tmpfs as that's how
I've been setting up the other boxes with ZFS. Between the ZFS ARC
filling with source, the tmpfs filling with binaries, and the actual
compilation tasks there should be a good bit of memory pressure.
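To put numbers on that pressure, the kernel memory limits and the ARC
size can be read with sysctl(8) before and during a build. This is only
a minimal sketch: it assumes the stock FreeBSD OID names listed below
(the ZFS ones exist only once the ZFS module is loaded), and the
function name kmem_report is made up for illustration.

```sh
#!/bin/sh
# kmem_report: print kernel memory limits and the current ZFS ARC size.
# The OID list is an assumption about what is relevant here; adjust it.
kmem_report() {
    for oid in vm.kmem_size vm.kmem_size_max \
               vfs.zfs.arc_max kstat.zfs.misc.arcstats.size; do
        # sysctl -n prints only the value. A missing OID (ZFS not loaded,
        # or not a FreeBSD host) is reported instead of aborting.
        if value=$(sysctl -n "$oid" 2>/dev/null); then
            printf '%s: %s\n' "$oid" "$value"
        else
            printf '%s: unavailable\n' "$oid"
        fi
    done
}
```

Calling kmem_report once before the build and again when a process
sits in "kmem a" would show whether the ARC is close to its limit at
that moment.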

The first build attempt was with -j10 on both boxes. As these are
single core CPUs, -j4 would have probably been more appropriate for
optimal speed. The build process on each failed after about an hour.
The exact stopping point was not noted since the actual error is beyond
reach of syscons history by the time the parallel build process exits.
The two boxes appear to have stopped at different points.
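
One way to avoid losing the error to the scrollback limit is to copy
the build output to a file while it runs. The following is a sketch
under assumptions, not what was done here: the helper name run_logged
and the log path in the usage note are hypothetical.

```sh
#!/bin/sh
# run_logged LOGFILE COMMAND [ARGS...]
# Run COMMAND with stderr merged into stdout, show the output on the
# console, keep a copy in LOGFILE, and return COMMAND's exit status.
run_logged() {
    log=$1
    shift
    # A plain sh pipeline reports only tee's status, so the command's
    # own status is written to a side file and read back afterwards.
    { "$@" 2>&1; echo $? > "$log.status"; } | tee "$log"
    rc=$(cat "$log.status")
    rm -f "$log.status"
    return "$rc"
}
```

Usage would be along the lines of
"cd /usr/src && run_logged /var/tmp/buildworld.log make buildworld",
after which the first error can be found by searching the log rather
than the console history.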

I restarted the make buildworld on each without any -j parameter and
without rebooting. I didn't want to clear the state, if the overly
parallel build caused anything to leak, I want to see that blow up the
non-parallel build. The first run through on each failed at different
points with one of the strangest compiler errors I've yet seen. The
builds failed with a fatal error: unable to open file [something].c
(where something was rlogin.c on one and citrus_[forgotten].c on the
other). On both boxes, the first thing I did was cat thefile.c and of
course I see the source file as expected, so the compiler failing to
open the source file is a transient error.

Following those odd errors, I restarted the build on each box with
exactly the same options and without rebooting to check reproducibility.
On the second non-parallel build attempt, both boxes succeeded in building
world and then proceeded on to the kernel build without issue. Whatever
resource exhaustion had cleared itself. I checked the memory stats at
that point. The box with 3GB RAM had no swap currently in use, but
might have experienced swapping during the build. The box with 2GB RAM
had 800MB swap used, which is reasonable given the /usr/obj tmpfs was
holding 2.2GB. Interestingly, the box with more RAM was the first of
the pair to fail out of the build both times. The installkernel and
installworld went off without a hitch. I did get a warning about
swapoff failing when dropping to single user on the box with only 2GB,
which is expected given the tmpfs spill into swap.

The situation with buildworld is not too bad. The spurious file open
errors are troubling, but not as bad as a panic or hang. The problem is
likely more specifically ZFS-triggered kernel memory pressure and not
general memory pressure. The low memory use but higher disk I/O
processes like tar and svn are more prone to trigger the problem.
Even higher disk I/O might hit the point of panic as some others have
reported with e.g. rsync on i386. Perhaps with some tuning, these boxes
can be made to behave reasonably. The initial problems with tar seemed
very troubling and I still don't have a good explanation for why the
memory use of the decompress while untarring seemed to make such a
difference.
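
For anyone wanting to repeat the comparison, the two-stage restore
described in the quoted message (decompress first, then extract the
plain tar) can be sketched as below. The function name and argument
layout are illustrative assumptions; only the xz -d and tar xpf steps
come from the original procedure.

```sh
#!/bin/sh
# extract_two_stage ARCHIVE.tar.xz DESTDIR
# Decompress and extract as two separate steps, so the large xz
# dictionary is no longer held in memory while tar writes to the pool.
extract_two_stage() {
    archive=$1
    dest=$2
    plain=${archive%.xz}    # foo.tar.xz -> foo.tar

    # Stage 1: decompress only; xz replaces the .xz with the plain tar.
    xz -d "$archive" || return 1

    # Stage 2: extract the uncompressed tar, preserving permissions.
    tar -xpf "$plain" -C "$dest"
}
```

Running the single-step extract and this two-step version against the
same pool would be the simplest way to confirm that the decompressor's
memory footprint is what makes the difference.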

The situation with the C3 boxes is much worse. More details on those
will be in the other thread since that is where I gave the initial
details on those and got some reply. The most interesting bit from that
pair of boxes is the possible spurious file open fail. Running svnlite
through truss, I couldn't help but notice that it hung immediately
following a failure to stat a file that was in fact present (fsck
truncated it on the reboot after hang). Some VFS issue that therefore
affects UFS and ZFS on i386?
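
To make that kind of failure easier to spot next time, the trace can be
written to a file with truss -o and then filtered for file calls that
returned an error. This is a rough sketch: the helper name
failed_file_calls is made up, and the pattern assumes failed calls show
up in the truss log as "name(...) ERR#n", which should be checked
against real output.

```sh
#!/bin/sh
# failed_file_calls TRUSS_LOG
# Print stat/lstat/open/openat calls from a truss log that ended in an
# error. The ERR# marker is an assumption about truss(1) output format.
failed_file_calls() {
    grep -E '(^|[[:space:]])(stat|lstat|open|openat)\(.*ERR#[0-9]+' "$1"
}

# Example capture (paths are placeholders, not run here):
#   truss -f -o /var/tmp/svnlite.truss svnlite update /usr/src
#   failed_file_calls /var/tmp/svnlite.truss | tail
```

The last few lines of that output, taken just before a hang, would show
whether the failing stat is always on a file that actually exists.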


More information about the freebsd-stable mailing list