ZFS commits corruption to disk when kernel allocation fails

Fri Feb 7 08:50:49 UTC 2014

On Sun, 2 Feb 2014 20:04:46 +0100
Matthew Rezny <matthew at reztek.cz> wrote:

> On Sat, 1 Feb 2014 20:31:12 +0100
> Matthew Rezny <matthew at reztek.cz> wrote:
> 
> > I'm seeing rather strange behavior from 10.0 on i386 thus far. This
> > is another long message, so if you want the summary without
> > back-story, skip to the end. Sometimes it's hard to include
> > relevant details without feeling like I'm rambling.
> > 
> > I started with FreeBSD not long before 4.0 release and ran 4.x
> > releases on i386 and Alpha for a long time. I tried the 5.x releases
> > and had nothing but trouble so stuck with 4.x through that time. The
> > Alpha never did move off 4.x before it got retired, but some of my
> > i386 boxes made it onto 6.x and then sat there until they were taken
> > out of active use. For years, FreeBSD 4.x and 6.x were the reliable
> > OS I used for everything but my desktop (which had been OS X).
> > 
> > More recently I started using FreeBSD 8 on amd64 with ZFS and
> > quickly moved on to 9 as soon as 9.0 was released. At the same
> > time, i386 hardware retired from desktop roles but suitable for
> > network services got 8.x installed on UFS. I had rather good
> > experience with 9-STABLE on amd64 running with ZFS. For the most
> > part it's solid, ZFS support is much better than the sorry state
> > Apple left it in before abandoning it on OS X, though I did get a
> > few kernel panics when simply connecting disks that contained
> > zpools from OS X. Due to both compilation speed difference and the
> > fact older hardware tends to be in more entrenched roles, I left my
> > i386 systems out of the ZFS and 9.x experiments. I did also try 9.x
> > on my one ppc64 box at various times to see if that might be a good
> > way to utilize hardware Apple dropped support for years prior. The
> > state on ppc64 varied between panic on boot to being able to
> > buildworld but an idle system left for a few days would randomly go
> > zombie, console freezes but clearly there is some system activity
> > and it responds to ping but might not take a ssh connection, which
> > I chalked up to the experimental state of the port. I did see
> > console freezes on i386 boxes booted from a 9.1 mfsbsd image but
> > never investigated because I was just using it to image and erase
> > disks on old machines where I considered the hardware suspect.
> > 
> > In the last couple months I've been moving my amd64 systems to 10,
> > starting during the RCs and keeping up such that they are now all
> > 10-STABLE. The transition was fairly smooth and they are running
> > quite well. Even one box that has prior chipset and BIOS, which was
> > panicking with an early 10-BETA is now running 10.0-RELEASE with
> > KMS. All very impressive. So, time to start migrating some i386
> > boxes I figure. I had recently moved a number of them to 9.2 and
> > figured I should just go ahead and move everything up to 10.0 at
> > close to the same time if possible. I had seen no problems with 9.2
> > or 9-STABLE on the i386 boxes that I was preparing to upgrade, I
> > already sorted out one Clang bug that affected a few (but was less
> > severe than a similar GCC bug that remains unfixed) since I had
> > switched compilers when going to 9.
> > 
> > Since I started moving i386 boxes to 10.0, I've had nothing but
> > strange problems. Last night I wrote a message about kern.maxswzone,
> > something I started getting warnings about on one particular box
> > when I put 9.2 on it but which I didn't try to do anything about
> > until now. I wrote that message with this one in mind, mentioning
> > that I would have another about processes hanging. That one came
> > first because it has at least some hard numbers and not so much
> > subjective feelings of performance and reliability. Between then
> > and now, the pattern struck me, all my early successes with 10 were
> > amd64, and now all the i386 boxes I've upgraded are barely
> > functional.
> > 
> > I have 4 i386 boxes that I tried to put 10.0 on in the past week
> > with various degrees of fail. There are 2 sets within the four, two
> > are the low-end C3 boxes with 256MB and 384MB RAM described in my
> > prior message to the list. The other two are Pentium4 systems, one
> > with 2GB RAM and the other with 3GB, substantially bigger disks,
> > decent GPU, etc. In other words, two are ancient and two are merely
> > a little dated but still very usable. This faster pair I will
> > mention first, then I will return to the slow pair. All these boxes
> > are things I use around the house for network services or as
> > essentially terminals in other rooms (kitchen pc to look up stuff,
> > bedroom pc to watch movies, etc). The i386 boxes that run important
> > services (externally facing network services, routing/firewall,
> > etc) are being left to a second round once all issues are sorted
> > out on these lower-importance boxes first.
> > 
> > The P4s had 9-STABLE installed on UFS volumes. I did the switch from
> > csup to svnup to pull the 10.0 sources, did the buildworld/kernel
> > and install on both and all looked good. Before I went on to
> > reinstall packages or anything else, I decided now might be a good
> > time to try switching from UFS to ZFS, everything in /home was
> > already backed up. So far I had only tried ZFS on amd64 due to
> > early reports of flakiness on i386 related to exhausting kernel
> > memory. In the couple years since initial support, the ZFS code has
> > gotten better integrated, more people have tried it, some tuning
> > guides have been written, and I've seen reports of it being used on
> > boxes with 512MB RAM. Most of my i386 boxes in server roles have
> > 2GB and it would be nice to migrate those to ZFS if possible. Best
> > to test on these boxes first and try tuning if needed.
> > 
> > I booted both P4 boxes from mfsbsd CD, mounted the existing UFS
> > volumes, tarred the whole mess, and dropped the uncompressed tar on my file
> > server. On the server, I fired off xz to compress the tar file to
> > speed the restore (or so I thought) while I prepared the machines. I
> > setup the zpools in the normal way I'd done all my amd64 boxes. One
> > P4 box has a single disk, the other has two, so one is a single vdev
> > pool and the other is multiple, which adds a little variety for
> > testing. Aside from vdevs, the pool properties, filesystems and
> > their properties are all identical to how I've been setting up my
> > other ZFS boxes. LZ4 on most filesystems, gzip or none on a few,
> > sha256 hashes entirely, no dedupe, pretty normal. With the pools
> > configured and mounted on /zroot, I scp the tar.xz file for each
> > box into /tmp (which is tmpfs), and try tar xjpvf in /zroot.
> > 
> > After initial good progress, both boxes seemed to hang at about the
> > same time. Disk activity stops, tar is sitting there as if it's
> > going to do something, but no further progress on either when left
> > for an hour. I started top on both boxes and notice that the tar
> > process on each is in the state "kmem a" and the resident memory
> > allocation on each is exactly the same (around 750MB). My first
> > thought was that I used too much RAM with the 500MB tar.xz file in
> > tmpfs. One box says 800MB free and the other says 1800MB free but
> > maybe there is a shortage of kernel memory. I can't seem to kill
> > tar, so I just reboot each, clear the zpools to try from a fresh
> > state again, mount the swap before filling /tmp this time, then
> > attempt another extract. No joy, it stops the same way, with the
> > exact same memory allocation, and each box is stopped on the exact
> > same file as where each stopped on the first attempt. The free
> > memory reports are the same as before, no swap is being used,
> > whatever is running out must be non-pageable.
> > 
> > The next thing I try is decoupling the stages. The tar process is
> > growing so large because it has to decompress lzma which requires a
> > huge dictionary. I figure maybe the heavy disk I/O is causing
> > buffers/cache to contend with the process in some way. Reboot again
> > for a fresh start, scp the .tar.xz to /zroot/tmp, xz -d so it's
> > just a plain tar, then tar xpvf in /zroot and both complete without
> > error. Set the mountpoint to / for each zroot and reboot into the
> > running system. That was strange but solvable. I don't know what
> > the "kmem a" state is but I can guess it's probably short for
> > something like "kmem alloc" which would suggest to me the process
> > is waiting on a kernel allocation. So I figure I've got some tuning
> > to do and a hung process isn't as bad as the kernel panics others
> > had reported on i386 under heavy I/O load (e.g. rsync) with default
> > settings. After all, the boot messages include two warnings about
> > tuning ZFS memory on i386. In order to do the tuning, I need some
> > reproducible load, and buildworld is good for that. So, first thing
> > is switch from svnup to svnlite that is now in base and use that to
> > get 10-STABLE sources. I do the rm -r on /usr/src and /usr/ports
> > and then fire off the svnlite co for each. I find that the slowness
> > of svn checkout is due to network latency and running the two in
> > parallel doesn't create I/O contention on either disk or network.
> > 
> > While the P4s are fetching their sources, I go to deal with the pair
> > of Via C3 boxes that I had taken to 10-PRERELEASE just a week prior
> > and was ready to upgrade to 10-STABLE. Since that upgrade, they sat
> > unused waiting for an impending MFC so I could do away with a local
> > patch. As mentioned in my other message, I made a mistake here on my
> > first attempt, I forgot to clear the existing /usr/src
> > and /usr/ports before starting the svnlite checkout. After
> > realizing my mistake, I did the now larger (as it includes a .svn
> > dir) rm -r of those dirs to start fresh. That's when I hit the
> > problem with rm hanging on one box. Without repeating all the
> > details, I had to boot mfsbsd to do the rm on the one box with only
> > 256MB RAM, but what difference that made is simply inexplicable.
> > Once I had gotten that straightened out, I started off the svnlite
> > checkout fresh. On the box with 384MB, it completed with only one
> > restart for network dropout (common since it takes 2-3 hours per
> > checkout). On the box with 256MB (which had previously fully
> > checked out and gotten to the point where it wanted to prompt me
> > for the conflict on every file in the tree), svnlite could only do
> > a hundred files or so before it seemed to hang in the same way as
> > rm. Running just one instance on /usr/src without the parallel
> > checkout on /usr/ports made no difference. When rm was hanging, I
> > might be able to kill it (after several minutes wait) and reboot or
> > the console might lock. When svnlite hung, I could not login but I
> > might be able to run a command on another VT. I was able to catch
> > that svnlite is getting stuck in the state "kmem a". Hmmm... the
> > same state that tar was getting stuck in on the other boxes. How
> > were those doing now?
> > 
> > I look back at the P4s, which should be done as it's been a few
> > hours spent on the C3 boxes. They are sitting there in the middle
> > of checkout not making any visible progress. Ctrl-c doesn't work, I
> > can't switch VTs, even ctrl-alt-del seems to not work. Seems like
> > the consoles are hung in a way eerily similar to what I'd seen from
> > 9.x on non-amd64 platforms (both ppc64 and i386). I attempted to
> > initiate an ssh connection into each of the P4s and then walked off
> > for a minute for refreshment. When I came back, expecting to find a
> > login prompt or a timeout, I found the ssh attempts timed out and
> > the two boxes had rebooted. I don't know if the ctrl-alt-del
> > finally registered or if the incoming ssh connection pushed them
> > over the edge. I wasn't there to see and the logs for both stop
> > sometime before the hang. With both rebooted, I do a svnlite
> > cleanup in /usr/src and /usr/ports on both, then fire off the
> > svnlite co for each directory on both boxes.
> > 
> > While those were running, I started digging into the kern.maxswzone
> > tunable on the C3 box with less RAM. The box with more RAM was able
> > to do the rm, svn checkout of both src and ports in parallel, and
> > showed no obvious sign of trouble, though I hadn't started a
> > buildworld yet. The box with less RAM was failing all over the place
> > and the only obvious difference was the warning about that tunable.
> > After I wasted hours figuring out the value is already sufficient
> > but is apparently reduced after it's set, so it can't be effectively
> > turned up, only down, I wrote my previous message to this list on
> > that topic specifically and then went to bed.
> > 
> > This morning I got up and was already thinking about the
> > correlation, that 10 is a disaster on all my i386 boxes thus far.
> > The first thing I checked was the P4 boxes. Both completed the svn
> > checkout on both src and ports, good sign. However, the box with
> > 3GB RAM has the message "vm_thread_new: kstack allocation failed"
> > repeated about a dozen times, bad sign. First thing I do is try to
> > run top to see what the size of ARC is, free RAM, etc. "No more
> > processes." Uh Oh, that's no good at all, can't even run top.
> > Curiously, the box with less RAM, only 2GB, has no messages so I
> > try to start top on it to see what its state is. Nothing happens
> > when I push return, the cursor is just sitting there after top. On
> > another VT, reboot gets the same response, none, cursor just sits.
> > I can't type but I can switch VTs and scroll, until I do
> > ctrl-alt-del, then every key press after that is a beep. Back on
> > the one that said no processes left for top, reboot gets the same
> > non-response. ctrl-alt-del doesn't beep, it just spits out the
> > ^[[3~ typical of a dead console. Ugh, not even a reset button to
> > punch on these P4 boxes.
> > 
> > So, svnlite checkout is a real strain that can bring a system to
> > its knees. I'm not sure if this should be regarded as horrible
> > inefficiency or as a means of checking the box before launching into
> > a buildworld (as if that wasn't enough strain to uncover most
> > problems). While 10.0 is good on amd64, it seems a disaster on i386.
> > Processes hang in this "kmem a" state and it doesn't take much more to
> > get the box to livelock. I've only seen the "kmem a" state a few
> > times as most other times I can't inspect anything before the box is
> > locked too hard to do anything. In some cases I'm not sure there's
> > even a way to get the box shutdown clean as the most trivial of
> > things lock it up hard. It's not even required to do anything. When
> > I was experimenting with kern.maxswzone last night I rebooted one
> > box a few dozen times, so if I didn't need to look at sysctl output
> > I just hit ctrl-alt-del at the login prompt. Once the console died
> > right then, it had just booted and ctrl-alt-del was met with a beep
> > and then it's hung, have to punch reset. I'm guessing the console
> > dies as a result of total wedging of I/O systems following heavy
> > disk I/O. The cause is not just ZFS because the C3 boxes are UFS.
> > The problem is not just the excess swap on the smallest box because
> > I see the same sort of troubles on the box with the most RAM. Some
> > kernel resource seems to be exhausted regardless of how much RAM or
> > swap is present. 
> > 
> > I'm going to try buildworld on 3 of these to see what happens. For
> > the fourth, I still need to get sources onto the disk before I can
> > even attempt that. I'm not sure what to expect. It might be instant
> > miserable failure, or it might actually run a long time since the
> > I/O load is in bursts with lots of recovery time between. It'll
> > take a few hours to see if the P4s succeed. It'll take two days to
> > see a C3 succeed. Maybe by that time, someone will get through all
> > I've written and have some useful suggestion for debugging. To me,
> > it's rather hard to debug since I have little hint where to start,
> > when the problem manifests any logging stops, and the box is in a
> > state where it is essentially unobservable without a JTAG to jump
> > in and directly inspect the state of its world.
> 
> Replying to self to give status update to anyone reading along.
> 
> The pair of P4 boxes made it through buildworld/kernel after a few
> tries. On these boxes I have /usr/obj mounted on a tmpfs as that's how
> I've been setting up the other boxes with ZFS. Between the ZFS ARC
> filling with source, the tmpfs filling with binaries, and the actual
> compilation tasks there should be a good bit of memory pressure.
> 
> The first build attempt was with -j10 on both boxes. As these are
> single core CPUs, -j4 would have probably been more appropriate for
> optimal speed. The build process on each failed after about an hour.
> The exact stopping point was not noted since the actual error is
> beyond reach of syscons history by the time the parallel build
> process exits. The two boxes appear to have stopped at different
> points.
> 
> I restarted the make buildworld on each without any -j parameter and
> without rebooting. I didn't want to clear the state, if the overly
> parallel build caused anything to leak, I want to see that blow up the
> non-parallel build. The first run through on each failed at different
> points with one of the strangest compiler errors I've yet to see. The
> builds failed with a fatal error: unable to open file [something].c
> (where something was rlogin.c on one and citrus_[forgotten].c on the
> other). On both boxes, the first thing I did was cat thefile.c and of
> course I see the source file as expected, so the compiler failing to
> open the source file is a transient error.
> 
> Following those odd errors, I restarted the build on each box with
> exactly the same options and without rebooting to check
> reproducibility. On the second non-parallel build attempt, both boxes
> succeeded to build world and then proceeded on to the kernel build
> without issue. Whatever resource exhaustion had cleared itself. I
> checked the memory stats at that point. The box with 3GB RAM had no
> swap currently in use, but might have experienced swapping during the
> build. The box with 2GB RAM had 800MB swap used, which is reasonable
> given the /usr/obj tmpfs was holding 2.2GB. Interestingly, the box
> with more RAM was the first of the pair to fail out of the build both
> times. The installkernel and installworld went off without a hitch. I
> did get a warning about swapoff failing when dropping to single user
> on the box with only 2GB, which is expected given the tmpfs spill
> into swap.
> 
> The situation with buildworld is not too bad. The spurious file open
> errors are troubling, but not as bad as a panic or hang. The problem
> is likely more specifically ZFS-triggered kernel memory pressure and
> not general memory pressure. The low memory use but higher disk I/O
> processes like tar and svn are more prone to trigger the problem.
> Even higher disk I/O might hit the point of panic as some others have
> reported with e.g. rsync on i386. Perhaps with some tuning, these
> boxes can be made to behave reasonably. The initial problems with tar
> seemed very troubling and I still don't have a good explanation for
> why the memory use of the decompress while untarring seemed to make
> such a difference.
> 
> The situation with the C3 boxes is much worse. More details on those
> will be in the other thread since that is where I gave the initial
> details on those and got some reply. The most interesting bit from
> that pair of boxes is the possible spurious file open fail. Running
> svnlite through truss, I couldn't help but notice that it hung
> immediately following a failure to stat a file that was in fact
> present (fsck truncated it on the reboot after hang). Some VFS issue
> that therefore affects UFS and ZFS on i386?

Continuing this discussion with myself.

I found the cause of the file open errors during buildworld and the
cause is more troubling than the symptom. After getting ZFS tuned
(details below) I ran a scrub and found that there was a bad file
under /usr/src on both systems. Different files on each, and on each it
was the exact same file that the compile had failed on. I know ZFS will
deny reads when it can't verify data integrity (which is annoying for
file recovery, but potentially very bad when the corruption is inside
directory data). So it could have denied the reads when the compiler
opened the file, but then why could I read it moments later with cat?
Better question, why is there a corrupt file? A bad sector, maybe,
except ZFS doesn't report read or checksum errors on the vdevs during
the scrub. A bad sector on drives in two boxes, not reporting SMART
errors either? Unlikely. Bad sectors on three drives (one box is a
mirror zpool), all unreported, and two on independent disks coinciding
with the same file such that ZFS can't heal the file data? Impossibly
unlikely! The only possible explanation is in-memory corruption of ZFS
data that is then committed to disk. The hang during svn checkout was
likely the moment ZFS lost its marbles and wrote some junk to disk in
the directory it was working in.

After the last message, I started on tuning ZFS on i386. I should
have started into the tuning effort sooner, but my vision was clouded
by the similarity to the problem I had just hit on the C3 boxes which
have UFS filesystems. Also, most reports I saw of ZFS failing on i386
had manifested as panics whereas I was seeing livelock with failed
kernel allocations. Once I started tuning the P4 boxes with ZFS, it
became clear the kmem_size adjustment would also be the solution for the
troubled box running UFS.

First thing was vm.kmem_size as that is both first in the tuning guide
(in conjunction with KVA_PAGES) and was mentioned in the warning from
the kernel. It was a little under 400MB by default. I turned it up to
512MB, the claimed max without adjusting KVA_PAGES. That was enough to
do fresh svn checkout, to tar and untar the entire /usr/src and
/usr/ports in a temp dir, xz compress it, untar with -j, rebuild
the world a few times, etc all without any hangs or reboots to reset
state. I tried with tmpfs full enough to spill to disk to duplicate the
mfsbsd case and still no trouble. So it seemed that setting
vm.kmem_size to 512MB is the magic. Considering that is so important a
value to the extent it can be difficult to even get installed onto a
zpool without setting it, I wonder why the kernel only warns about it
but doesn't just set it to the appropriate size. If it knows to warn it
knows enough to set it.

Unfortunately, 512MB is not enough and the repercussions of overrunning
can be far worse. I continued through the ZFS tuning guide, increased
ARC a bit to 320MB while leaving more room between arc_max and kmem_size
than there was by default (192MB gap vs 150MB), enabled prefetch, set
vdev cache size to 5MB, etc. With all the suggested tuning, I ran
through everything again and all seemed ok.
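
For anyone following along, a minimal sketch of the arithmetic behind
those settings. It assumes the usual loader.conf tunable names
(vm.kmem_size, vfs.zfs.arc_max, vfs.zfs.prefetch_disable,
vfs.zfs.vdev.cache.size); the actual config is not quoted in this
message, so the names and the helper functions here are illustrative
only. The sizes are the ones given above.

```python
# Illustrative only: tunable names are assumed, sizes come from the message.
MB = 1024 * 1024


def parse_size(text):
    """Convert a size such as '512M' into bytes (only the M suffix is handled)."""
    if not text.endswith("M"):
        raise ValueError("expected a size in megabytes, e.g. '512M'")
    return int(text[:-1]) * MB


def arc_headroom_mb(kmem_size, arc_max):
    """Return the gap in MB between kmem_size and arc_max."""
    return (parse_size(kmem_size) - parse_size(arc_max)) // MB


def loader_conf_lines(kmem_size, arc_max, vdev_cache):
    """Render the settings in loader.conf syntax."""
    return [
        'vm.kmem_size="%s"' % kmem_size,
        'vfs.zfs.arc_max="%s"' % arc_max,
        'vfs.zfs.prefetch_disable="0"',  # 0 means prefetch is enabled
        'vfs.zfs.vdev.cache.size="%s"' % vdev_cache,
    ]


headroom = arc_headroom_mb("512M", "320M")  # 192
lines = loader_conf_lines("512M", "320M", "5M")
print("gap between arc_max and kmem_size: %d MB" % headroom)
print("\n".join(lines))
```

The point of the gap calculation is only to make the 192MB figure
reproducible; it says nothing about whether that gap is sufficient.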

Content that the problem was resolved, I went on to ports installs. Both
built Xorg and I quickly tested it. On the box with 2GB RAM I built
XBMC, tested it, and called that one done for the moment. On the box
with 3GB RAM I started compiling KDE4 yesterday. I expected it to
finish today and would have declared this matter resolved if it had.
Unfortunately, it didn't finish, but died about 580 ports into a set
of over 650. The place where it died was installing lapack, which would
be a burst of disk I/O. The symptoms were all the same, hung process,
can't start any new processes, can't login on another console, can't
run top on already logged in console, can't reboot clean. The real
surprise was on reboot, immediately after "Trying to mount root from
zfs:zroot []..." I get a double fault. Tried booting twice, same fault
both times.

Fatal double fault:
eip = 0xc1618ec7
esp = 0xe96b4f80
ebp = 0xe96b52e0
cpuid = 0; apic id = 00
panic: double fault
cpuid = 0
KDB: stack backtrace:
#0 0xc0a1cdbf at kdb_backtrace+0x52
#1 0xc09ef0db at panic+0x121
#2 0xc0e526bb at cpu_fetch_syscall_args+0
Uptime: 10s

That doesn't tell me much specifically. Generally, I know the zpool must
be severely borked to cause that on attempt to import the pool/mount
root. It was bad enough to see that ZFS could manage to write corrupt
data to disk that damages a file. At least that was easy to fix with an
svn revert followed by another svn up and another scrub to be sure. This
time it looked fatal. Fortunately, I was able to boot a mfsbsd CD and
import the pool without panic. Scrub showed zero errors of any type and
the following attempt to boot off the pool succeeded. The error must
have been minor enough to fix without note on import or scrub, but was
severe enough to make the pool unbootable.

Why don't we default to higher vm.kmem_size, at least when using ZFS if
not always? (It is undersized with UFS on low-memory boxes too.) Is
there a benefit to not letting the kernel use all the RAM on i386 as
it's allowed to do on amd64? Why does KVA_PAGES, which gives 1GB kernel
address space by default, need to be increased in order to increase
vm.kmem_size beyond 512M? Is there something other than the kernel
allocating inside the kernel's address space? Is there some reason to
not let the kernel grow to the limit of its address space or physical
RAM, whichever is less, when it feels a need to?
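
To put numbers on the KVA_PAGES question, here is a rough sketch of
the address-space arithmetic. It assumes a non-PAE i386 kernel, where
each KVA_PAGES unit maps 4MB and the default value is 256 (which
matches the 1GB figure above); the function names are made up for
illustration, and this does not attempt to explain where the rest of
the kernel address space goes.

```python
# Assumption: non-PAE i386, where one KVA_PAGES unit covers 4 MB.
MB = 1024 * 1024
KVA_UNIT_BYTES = 4 * MB


def kva_bytes(kva_pages):
    """Total kernel virtual address space for a given KVA_PAGES value."""
    return kva_pages * KVA_UNIT_BYTES


def kva_left_mb(kva_pages, kmem_size_mb):
    """KVA in MB that remains after reserving kmem_size_mb for vm.kmem_size."""
    return kva_bytes(kva_pages) // MB - kmem_size_mb


default_kva_mb = kva_bytes(256) // MB   # 1024, i.e. 1GB
left_default = kva_left_mb(256, 512)    # 512
left_doubled = kva_left_mb(512, 512)    # 1536
print(default_kva_mb, left_default, left_doubled)
```

So with the default KVA_PAGES and vm.kmem_size at 512MB, half of the
kernel address space is still outside kmem under these assumptions.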