ZFS commits corruption to disk when kernel allocation fails

Matthew Rezny matthew at reztek.cz
Sun Feb 9 22:37:21 UTC 2014


On Fri, 7 Feb 2014 09:50:49 +0100
Matthew Rezny <matthew at reztek.cz> wrote:

> On Sun, 2 Feb 2014 20:04:46 +0100
> Matthew Rezny <matthew at reztek.cz> wrote:
> 
> > On Sat, 1 Feb 2014 20:31:12 +0100
> > Matthew Rezny <matthew at reztek.cz> wrote:
> > 
> > > I'm seeing rather strange behavior from 10.0 on i386 thus far.
> > > This is another long message, so if you want the summary without
> > > back-story, skip to the end. Sometimes it's hard to include
> > > relevant details without feeling like I'm rambling.
> > > 
> > > I started with FreeBSD not long before 4.0 release and ran 4.x
> > > releases on i386 and Alpha for a long time. I tried the 5.x
> > > releases and had nothing but trouble so stuck with 4.x through
> > > that time. The Alpha never did move off 4.x before it got
> > > retired, but some of my i386 boxes made it onto 6.x and then sat
> > > there until they were taken out of active use. For years, FreeBSD
> > > 4.x and 6.x was the reliable OS I used for everything but my
> > > desktop (which had been OS X).
> > > 
> > > More recently I started using FreeBSD 8 on amd64 with ZFS and
> > > quickly moved on to 9 as soon as 9.0 was released. At the same
> > > time, i386 hardware retired from desktop roles but suitable for
> > > network services got 8.x installed on UFS. I had rather good
> > > experience with 9-STABLE on amd64 running with ZFS. For the most
> > > part it's solid, ZFS support is much better than the sorry state
> > > Apple left it in before abandoning it on OS X, though I did get a
> > > few kernel panics when simply connecting disks that contained
> > > zpools from OS X. Due to both compilation speed difference and the
> > > fact older hardware tends to be in more entrenched roles, I left
> > > my i386 systems out of the ZFS and 9.x experiments. I did also
> > > try 9.x on my one ppc64 box at various times to see if that might
> > > be a good way to utilize hardware Apple dropped support for years
> > > prior. The state on ppc64 varied from panic on boot to being
> > > able to buildworld, but an idle system left for a few days would
> > > randomly go zombie: the console freezes but clearly there is some
> > > system activity, and it responds to ping but might not take an ssh
> > > connection, which I chalked up to the experimental state of the
> > > port. I did see console freezes on i386 boxes booted from a 9.1
> > > mfsbsd image but never investigated because I was just using it
> > > to image and erase disks on old machines where I considered the
> > > hardware suspect.
> > > 
> > > In the last couple months I've been moving my amd64 systems to 10,
> > > starting during the RCs and keeping up such that they are now all
> > > 10-STABLE. The transition was fairly smooth and they are running
> > > quite well. Even one box that has prior chipset and BIOS, which
> > > was panicking with an early 10-BETA, is now running 10.0-RELEASE
> > > with KMS. All very impressive. So, time to start migrating some
> > > i386 boxes I figure. I had recently moved a number of them to 9.2
> > > and figured I should just go ahead and move everything up to 10.0
> > > at close to the same time if possible. I had seen no problems
> > > with 9.2 or 9-STABLE on the i386 boxes that I was preparing to
> > > upgrade; I had already sorted out one Clang bug that affected a
> > > few (though less bad than a similar GCC bug that remains unfixed)
> > > since I had switched compilers when going to 9.
> > > 
> > > Since I started moving i386 boxes to 10.0, I've had nothing but
> > > strange problems. Last night I wrote a message about
> > > kern.maxswzone, something I started getting warnings about on one
> > > particular box when I put 9.2 on it but which I didn't try to do
> > > anything about until now. I wrote that message with this one in
> > > mind, mentioning that I would have another about processes
> > > hanging. That one came first because it has at least some hard
> > > numbers and not so much subjective feelings of performance and
> > > reliability. Between then and now, the pattern struck me: all my
> > > early successes with 10 were amd64, and now all the i386 boxes
> > > I've upgraded are barely functional.
> > > 
> > > I have 4 i386 boxes that I tried to put 10.0 on in the past week
> > > with various degrees of fail. There are 2 sets within the four,
> > > two are the low-end C3 boxes with 256MB and 384MB RAM described
> > > in my prior message to the list. The other two are Pentium4
> > > systems, one with 2GB RAM and the other with 3GB, substantially
> > > bigger disks, decent GPU, etc. In other words, two are ancient
> > > and two are merely a little dated but still very usable. This
> > > faster pair I will mention first, then I will return to the slow
> > > pair. All these boxes are things I use around the house for
> > > network services or as essentially terminals in other rooms
> > > (kitchen pc to look up stuff, bedroom pc to watch movies, etc).
> > > The i386 boxes that run important services (externally facing
> > > network services, routing/firewall, etc) are being left to a
> > > second round once all issues are sorted out on these
> > > lower-importance boxes first.
> > > 
> > > The P4s had 9-STABLE installed on UFS volumes. I did the switch
> > > from csup to svnup to pull the 10.0 sources, did the
> > > buildworld/kernel and install on both and all looked good. Before
> > > I went on to reinstall packages or anything else, I decided now
> > > might be a good time to try switching from UFS to ZFS, everything
> > > in /home was already backed up. So far I had only tried ZFS on
> > > amd64 due to early reports of flakiness on i386 related to
> > > exhausting kernel memory. In the couple years since initial
> > > support, the ZFS code has gotten better integrated, more people
> > > have tried it, some tuning guides have been written, and I've
> > > seen reports of it being used on boxes with 512MB RAM. Most of my
> > > i386 boxes in server roles have 2GB and it would be nice to
> > > migrate those to ZFS if possible. Best to test on these boxes
> > > first and try tuning if needed.
> > > 
> > > I booted both P4 boxes from mfsbsd CD, mounted the existing UFS
> > > volumes, tarred the whole mess and dropped the uncompressed tar on
> > > my file server. On the server, I fired off xz to compress the tar
> > > file to speed the restore (or so I thought) while I prepared the
> > > machines. I setup the zpools in the normal way I'd done all my
> > > amd64 boxes. One P4 box has a single disk, the other has two, so
> > > one is a single vdev pool and the other is multiple, which adds a
> > > little variety for testing. Aside from vdevs, the pool
> > > properties, filesystems and their properties are all identical to
> > > how I've been setting up my other ZFS boxes. LZ4 on most
> > > filesystems, gzip or none on a few, sha256 hashes entirely, no
> > > dedupe, pretty normal. With the pools configured and mounted
> > > on /zroot, I scp the tar.xz file for each box into /tmp (which is
> > > tmpfs), and try tar xjpvf in /zroot.
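> > > 
> > > (For the record, the pool setup is roughly along these lines; the
> > > pool name, disk device, file names and the one gzip dataset shown
> > > are only illustrative:
> > > 
> > >   zpool create -o altroot=/zroot -O compression=lz4 \
> > >       -O checksum=sha256 -O dedup=off zroot ada0p3
> > >   zfs create -o compression=gzip zroot/var/log
> > >   cd /zroot && tar xjpvf /tmp/backup.tar.xz
> > > 
> > > with the second box the same except the pool is built on its two
> > > disks.)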
> > > 
> > > After initial good progress, both boxes seemed to hang at about
> > > the same time. Disk activity stops, tar is sitting there as if
> > > it's going to do something, but no further progress on either
> > > when left for an hour. I started top on both boxes and notice
> > > that the tar process on each is in the state "kmem a" and the
> > > resident memory allocation on each is exactly the same (around
> > > 750MB). My first thought was that I used too much RAM with the
> > > 500MB tar.xz file in tmpfs. One box says 800MB free and the other
> > > says 1800MB free but maybe there is a shortage of kernel memory.
> > > I can't seem to kill tar, so I just reboot each, clear the zpools
> > > to try from a fresh state again, mount the swap before
> > > filling /tmp this time, then attempt another extract. No joy, it
> > > stops the same way, with the exact same memory allocation, and
> > > each box is stopped on the exact same file as where each stopped
> > > on the first attempt. The free memory reports are the same as
> > > before, no swap is being used, so whatever is running out must be
> > > non-pageable.
> > > 
> > > The next thing I try is decoupling the stages. The tar process is
> > > growing so large because it has to decompress lzma which requires
> > > a huge dictionary. I figure maybe the heavy disk I/O is causing
> > > buffers/cache to contend with the process in some way. Reboot
> > > again for a fresh start, scp the .tar.xz to /zroot/tmp, xz -d so
> > > it's just a plain tar, then tar xpvf in /zroot and both complete
> > > without error. Set the mountpoint to / for each zroot and reboot
> > > into the running system. That was strange but solvable. I don't
> > > know what the "kmem a" state is but I can guess it's probably
> > > short for something like "kmem alloc" which would suggest to me
> > > the process is waiting on a kernel allocation. So I figure I've
> > > got some tuning to do and a hung process isn't as bad as the
> > > kernel panics others had reported on i386 under heavy I/O load
> > > (e.g. rsync) with default settings. After all, the boot messages
> > > include two warnings about tuning ZFS memory on i386. In order to
> > > do the tuning, I need some reproducible load, and buildworld is
> > > good for that. So, first thing is switch from svnup to svnlite
> > > that is now in base and use that to get 10-STABLE sources. I do
> > > the rm -r on /usr/src and /usr/ports and then fire off the
> > > svnlite co for each. I find that the slowness of svn checkout is
> > > due to network latency and running the two in parallel doesn't
> > > create I/O contention on either disk or network.
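> > > 
> > > (Concretely, the two-step extraction that worked is roughly, with
> > > an illustrative host and file name:
> > > 
> > >   scp fileserver:backup.tar.xz /zroot/tmp/
> > >   xz -d /zroot/tmp/backup.tar.xz
> > >   cd /zroot && tar xpvf tmp/backup.tar
> > > 
> > > i.e. decompress to a plain tar on the pool first, then untar
> > > without xz in the picture.)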
> > > 
> > > While the P4s are fetching their sources, I go to deal with the
> > > pair of Via C3 boxes that I had taken to 10-PRERELEASE just a
> > > week prior and was ready to upgrade to 10-STABLE. Since that
> > > upgrade, they sat unused waiting for an impending MFC so I could
> > > do away with a local patch. As mentioned in my other message, I
> > > made a mistake here on my first attempt, I forgot to clear the
> > > existing /usr/src and /usr/ports before starting the svnlite
> > > checkout. After realizing my mistake, I did the now larger (as it
> > > includes a .svn dir) rm -r of those dirs to start fresh. That's
> > > when I hit the problem with rm hanging on one box. Without
> > > repeating all the details, I had to boot mfsbsd to do the rm on
> > > the one box with only 256MB RAM, but what difference that made is
> > > simply inexplicable. Once I had gotten that straightened out, I
> > > started off the svnlite checkout fresh. On the box with 384MB, the
> > > checkout completed with only one restart for a network dropout
> > > (common since it takes 2-3 hours per checkout). On the box with 256MB
> > > (which had previously fully checked out and gotten to the point
> > > where it wanted to prompt me for the conflict on every file in
> > > the tree), svnlite could only do a hundred files or so before it
> > > seemed to hang in the same way as rm. Running just one instance
> > > on /usr/src without the parallel checkout on /usr/ports made no
> > > difference. When rm was hanging, I might be able to kill it
> > > (after several minutes wait) and reboot or the console might
> > > lock. When svnlite hung, I could not login but I might be able to
> > > run a command on another VT. I was able to catch that svnlite is
> > > getting stuck in the state "kmem a". Hmmm... the same state that
> > > tar was getting stuck in on the other boxes. How were those doing
> > > now?
> > > 
> > > I look back at the P4s, which should be done as it's been a few
> > > hours spent on the C3 boxes. They are sitting there in the middle
> > > of checkout not making any visible progress. Ctrl-c doesn't work,
> > > I can't switch VTs, even ctrl-alt-del seems to not work. Seems
> > > like the consoles are hung in a way eerily similar to what I'd
> > > seen from 9.x on non-amd64 platforms (both ppc64 and i386). I
> > > attempted to initiate an ssh connection into each of the P4s and
> > > then walked off for a minute for refreshment. When I came back,
> > > expecting to find a login prompt or a timeout, I found the ssh
> > > attempts timed out and the two boxes had rebooted. I don't know
> > > if the ctrl-alt-del finally registered or if the incoming ssh
> > > connection pushed them over the edge. I wasn't there to see and
> > > the logs for both stop sometime before the hang. With both
> > > rebooted, I do a svnlite cleanup in /usr/src and /usr/ports on
> > > both, then fire off the svnlite co for each directory on both
> > > boxes.
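> > > 
> > > (The recovery on each box amounts to roughly this, with the svn
> > > mirror URL only as an example:
> > > 
> > >   svnlite cleanup /usr/src /usr/ports
> > >   svnlite co svn://svn.freebsd.org/base/stable/10 /usr/src
> > >   svnlite co svn://svn.freebsd.org/ports/head /usr/ports
> > > 
> > > run on both boxes.)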
> > > 
> > > While those were running, I started digging into the
> > > kern.maxswzone tunable on the C3 box with less RAM. The box with
> > > more RAM was able to do the rm, svn checkout of both src and
> > > ports in parallel, and showed no obvious sign of trouble, though
> > > I hadn't started a buildworld yet. The box with less RAM was
> > > failing all over the place and the only obvious difference was
> > > the warning about that tunable. After I wasted hours figuring out
> > > the value is already sufficient but is apparently reduced after
> > > it's set, so it can't be effectively turned up, only down, I
> > > wrote my previous message to this list on that topic specifically
> > > and then went to bed.
> > > 
> > > This morning I got up and was already thinking about the
> > > correlation, that 10 is a disaster on all my i386 boxes thus far.
> > > The first thing I checked was the P4 boxes. Both completed the svn
> > > checkout on both src and ports, good sign. However, the box with
> > > 3GB RAM has the message "vm_thread_new: kstack allocation failed"
> > > repeated about a dozen times, bad sign. First thing I do is try to
> > > run top to see what the size of ARC is, free RAM, etc. "No more
> > > processes." Uh Oh, that's no good at all, can't even run top.
> > > Curiously, the box with less RAM, only 2GB, has no messages so I
> > > try to start top on it to see what its state is. Nothing happens
> > > when I push return, the cursor is just sitting there after top. On
> > > another VT, reboot gets the same response, none, cursor just sits.
> > > I can't type but I can switch VTs and scroll, until I do
> > > ctrl-alt-del, then every key press after that is a beep. Back on
> > > the one that said no processes left for top, reboot gets the same
> > > non-response. ctrl-alt-del doesn't beep, it just spits out the
> > > ^[[3~ typical of a dead console. Ugh, not even a reset button to
> > > punch on these P4 boxes.
> > > 
> > > So, svnlite checkout is a real strain that can bring a system to
> > > its knees. I'm not sure if this should be regarded as horrible
> > > inefficiency or as a means of checking the box before launching
> > > into a buildworld (as if that wasn't enough strain to uncover most
> > > problems). While 10.0 is good on amd64, it seems a disaster on
> > > i386. Processes hang in this "kmem a" state and it doesn't take much
> > > more to get the box to livelock. I've only seen the "kmem a"
> > > state a few times as most other times I can't inspect anything
> > > before the box is locked too hard to do anything. In some cases
> > > I'm not sure there's even a way to get the box shutdown clean as
> > > the most trivial of things lock it up hard. It's not even
> > > required to do anything. When I was experimenting with
> > > kern.maxswzone last night I rebooted one box a few dozen times,
> > > so if I didn't need to look at sysctl output I just hit
> > > ctrl-alt-del at the login prompt. Once, the console died right
> > > then: it had just booted, ctrl-alt-del was met with a beep, and
> > > then it was hung and I had to punch reset. I'm guessing the console
> > > dies as a result of total wedging of I/O systems following heavy
> > > disk I/O. The cause is not just ZFS because the C3 boxes are UFS.
> > > The problem is not just the excess swap on the smallest box
> > > because I see the same sort of troubles on the box with the most
> > > RAM. Some kernel resource seems to be exhausted regardless of how
> > > much RAM or swap is present. 
> > > 
> > > I'm going to try buildworld on 3 of these to see what happens. For
> > > the fourth, I still need to get sources onto the disk before I can
> > > even attempt that. I'm not sure what to expect. It might be
> > > instant miserable failure, or it might actually run a long time
> > > since the I/O load is in bursts with lots of recovery time
> > > between. It'll take a few hours to see if the P4s succeed. It'll
> > > take two days to see a C3 succeed. Maybe by that time, someone
> > > will get through all I've written and have some useful suggestion
> > > for debugging. To me, it's rather hard to debug since I have
> > > little hint where to start, when the problem manifests any
> > > logging stops, and the box is in a state where it is essentially
> > > unobservable without a JTAG to jump in and directly inspect the
> > > state of its world.
> > 
> > Replying to self to give status update to anyone reading along.
> > 
> > The pair of P4 boxes made it through buildworld/kernel after a few
> > tries. On these boxes I have /usr/obj mounted on a tmpfs as that's
> > how I've been setting up the other boxes with ZFS. Between the ZFS
> > ARC filling with source, the tmpfs filling with binaries, and the
> > actual compilation tasks there should be a good bit of memory
> > pressure.
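> > 
> > (For reference, /usr/obj on these boxes is just a tmpfs mount, i.e.
> > something like this fstab line, with the size picked to taste:
> > 
> >   tmpfs   /usr/obj   tmpfs   rw,size=3g,mode=755   0   0
> > 
> > so the object tree competes directly with ARC and the compilers for
> > the same RAM.)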
> > 
> > The first build attempt was with -j10 on both boxes. As these are
> > single core CPUs, -j4 would have probably been more appropriate for
> > optimal speed. The build process on each failed after about an hour.
> > The exact stopping point was not noted since the actual error is
> > beyond reach of syscons history by the time the parallel build
> > process exits. The two boxes appear to have stopped at different
> > points.
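> > 
> > (That first attempt is nothing more than the stock build with a high
> > job count, i.e. roughly:
> > 
> >   cd /usr/src && make -j10 buildworld
> > 
> > before the later plain make buildworld attempts.)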
> > 
> > I restarted the make buildworld on each without any -j parameter and
> > without rebooting. I didn't want to clear the state; if the overly
> > parallel build caused anything to leak, I wanted to see that blow up
> > the non-parallel build. The first run through on each failed at
> > different points with one of the strangest compiler errors I've
> > ever seen. The builds failed with a fatal error: unable to open file
> > [something].c (where something was rlogin.c on one and
> > citrus_[forgotten].c on the other). On both boxes, the first thing
> > I did was cat thefile.c and of course I see the source file as
> > expected, so the compiler failing to open the source file is a
> > transient error.
> > 
> > Following those odd errors, I restarted the build on each box with
> > exactly the same options and without rebooting to check
> > reproducibility. On the second non-parallel build attempt, both
> > boxes succeeded to build world and then proceeded on to the kernel
> > build without issue. Whatever resource exhaustion had cleared
> > itself. I checked the memory stats at that point. The box with 3GB
> > RAM had no swap currently in use, but might have experienced
> > swapping during the build. The box with 2GB RAM had 800MB swap
> > used, which is reasonable given the /usr/obj tmpfs was holding
> > 2.2GB. Interestingly, the box with more RAM was the first of the
> > pair to fail out of the build both times. The installkernel and
> > installworld went off without a hitch. I did get a warning about
> > swapoff failing when dropping to single user on the box with only
> > 2GB, which is expected given the tmpfs spill into swap.
> > 
> > The situation with buildworld is not too bad. The spurious file open
> > errors are troubling, but not as bad as a panic or hang. The problem
> > is likely more specifically ZFS-triggered kernel memory pressure and
> > not general memory pressure. The low memory use but higher disk I/O
> > processes like tar and svn are more prone to trigger the problem.
> > Even higher disk I/O might hit the point of panic as some others
> > have reported with e.g. rsync on i386. Perhaps with some tuning,
> > these boxes can be made to behave reasonably. The initial problems
> > with tar seemed very troubling and I still don't have a good
> > explanation for why the memory use of the decompress while untarring
> > seemed to make such a difference.
> > 
> > The situation with the C3 boxes is much worse. More details on those
> > will be in the other thread since that is where I gave the initial
> > details on those and got some reply. The most interesting bit from
> > that pair of boxes is the possibly spurious file open failure. Running
> > svnlite through truss, I couldn't help but notice that it hung
> > immediately following a failure to stat a file that was in fact
> > present (fsck truncated it on the reboot after hang). Some VFS issue
> > that therefore affects UFS and ZFS on i386?
> 
> Continuing this discussion with myself.
> 
> I found the cause of the file open errors during buildworld and the
> cause is more troubling than the symptom. After getting ZFS tuned
> (details below) I ran a scrub and found that there was a bad file
> under /usr/src on both systems. Different files on each, and on each it
> was the exact same file that the compile had failed on. I know ZFS
> will deny reads when it can't verify data integrity (which is
> annoying for file recovery, but potentially very bad when the
> corruption is inside directory data). So it could have denied the
> reads when the compiler opened the file, but then why could I read it
> moments later with cat? Better question, why is there a corrupt file?
> A bad sector, maybe, except ZFS doesn't report read or checksum
> errors on the vdevs during the scrub. A bad sector on drives in two
> boxes, not reporting SMART errors either? Unlikely. Bad sectors on
> three drives (one box is a mirror zpool), all unreported, and two on
> independent disks coinciding with the same file such that ZFS can't
> heal the file data? Impossibly unlikely! The only possible
> explanation is in-memory corruption of ZFS data that is then
> committed to disk. The hang during svn checkout was likely the moment
> ZFS lost its marbles and wrote some junk to disk in the directory it
> was working in.
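> 
> (The check is simply a scrub followed by a look at the error list:
> 
>   zpool scrub zroot
>   zpool status -v zroot
> 
> where status -v names the damaged file even though the per-vdev read,
> write and checksum counters all stay at zero.)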
> 
> After the last message, I started on tuning ZFS on i386. I should
> have started into the tuning effort sooner, but my vision was clouded
> by the similarity to the problem I had just hit on the C3 boxes which
> have UFS filesystems. Also, most reports I saw of ZFS failing on i386
> had manifested as panics whereas I was seeing livelock with failed
> kernel allocations. Once I started tuning the P4 boxes with ZFS, it
> became clear the kmem_size adjustment would also be the solution for
> the troubled box running UFS.
> 
> First thing was vm.kmem_size as that is both first in the tuning guide
> (in conjunction with KVA_PAGES) and was mentioned in the warning from
> the kernel. It was a little under 400MB by default. I turned it up to
> 512MB, the claimed max without adjusting KVA_PAGES. That was enough to
> do a fresh svn checkout, to tar and untar the entire /usr/src and
> /usr/ports in a temp dir, xz compress it, untar with -j, rebuild
> the world a few times, etc., all without any hangs or reboots to reset
> state. I tried with tmpfs full enough to spill to disk to duplicate
> the mfsbsd case and still no trouble. So it seemed that setting
> vm.kmem_size to 512MB is the magic. Considering that value is so
> important that it can be difficult to even get installed onto a
> zpool without setting it, I wonder why the kernel only warns about it
> but doesn't just set it to the appropriate size. If it knows to warn
> it knows enough to set it.
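> 
> (That first adjustment is just a loader tunable, i.e. in
> /boot/loader.conf:
> 
>   vm.kmem_size="512M"
> 
> and nothing else changed at that point.)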
> 
> Unfortunately, 512MB is not enough and the repercussions of
> overrunning can be far worse. I continued through the ZFS tuning
> guide, increased ARC a bit to 320MB while leaving more room between
> arc_max and kmem_size than there was by default (192MB gap vs 150MB),
> enabled prefetch, set vdev cache size to 5MB, etc. With all the
> suggested tuning, I ran through everything again and all seemed ok.
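> 
> (The full set now in /boot/loader.conf looks roughly like:
> 
>   vm.kmem_size="512M"
>   vfs.zfs.arc_max="320M"
>   vfs.zfs.prefetch_disable="0"
>   vfs.zfs.vdev.cache.size="5M"
> 
> i.e. the values the i386 section of the tuning guide suggests.)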
> 
> Content that the problem was resolved, I went on to ports installs. Both
> built Xorg and I quickly tested it. On the box with 2GB RAM I built
> XBMC, tested it, and called that one done for the moment. On the box
> with 3GB RAM I started compiling KDE4 yesterday. I expected it to
> finish today and would have declared this matter resolved if it had.
> Unfortunately, it didn't finish, but died about 580 ports into a set
> of over 650. The place where it died was installing lapack, which
> would be a burst of disk I/O. The symptoms were all the same, hung
> process, can't start any new processes, can't login on another
> console, can't run top on already logged in console, can't reboot
> clean. The real surprise was on reboot, immediately after "Trying to
> mount root from zfs:zroot []..." I get a double fault. Tried booting
> twice, same fault both times.
> 
> Fatal double fault:
> eip = 0xc1618ec7
> esp = 0xe96b4f80
> ebp = 0xe96b52e0
> cpuid = 0; apic id = 00
> panic: double fault
> cpuid = 0
> KDB: stack backtrace:
> #0 0xc0a1cdbf at kdb_backtrace+0x52
> #1 0xc09ef0db at panic+0x121
> #2 0xc0e526bb at cpu_fetch_syscall_args+0
> Uptime: 10s
> 
> That doesn't tell me much specifically. Generally, I know the zpool
> must be severely borked to cause that on attempt to import the
> pool/mount root. It was bad enough to see that ZFS could manage to
> write corrupt data to disk that damages a file. At least that was
> easy to fix with an svn revert followed by another svn up and another
> scrub to be sure. This time it looked fatal. Fortunately, I was able
> to boot a mfsbsd CD and import the pool without panic. Scrub showed
> zero errors of any type and the following attempt to boot off the
> pool succeeded. The error must have been minor enough to fix without
> note on import or scrub, but is severe enough to make the pool
> unbootable.
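> 
> (The repair from mfsbsd is, roughly:
> 
>   zpool import -f -o altroot=/mnt zroot
>   zpool scrub zroot
> 
> then wait for the scrub, export the pool, and the next boot off it
> succeeds.)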
> 
> Why don't we default to higher vm.kmem_size, at least when using ZFS
> if not always? (It is undersized with UFS on low-memory boxes too.) Is
> there a benefit to not letting the kernel use all the RAM on i386 as
> it's allowed to do on amd64? Why does KVA_PAGES, which gives 1GB
> kernel address space by default, need to be increased in order to
> increase vm.kmem_size beyond 512M? Is there something other than the
> kernel allocating inside the kernel's address space? Is there some
> reason to not let the kernel grow to the limit of its address space
> or physical RAM, whichever is less, when it feels a need to?
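> 
> (For anyone following along, going past 512M on i386 means building a
> custom kernel with something like
> 
>   options         KVA_PAGES=512
> 
> to double KVA from the default 1GB to 2GB, and then raising
> vm.kmem_size accordingly in loader.conf.)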

Unfortunately, this problem is easily reproducible. After fixing the
pool and booting the system off it, I resumed the portmaster run. It
got through lapack (starting off clean) but soon hung exactly the same
way after installing py27-numpy. On reboot it panics almost exactly
the same (backtrace is identical, register values differ by only a few
bytes). So, large ports that strain the system harder than building
world are able to knock this box over. Each time that happens, the pool
is unbootable until I scrub it after booting mfsBSD. Obviously I've
still got some tuning to do to stop it from crashing. It's rather
disturbing that each time the system hangs the pool is left in a rather
bad state.


