ZFS crashing while zfs recv in progress

Jeremy Chadwick jdc at koitsu.org
Tue Jun 4 09:53:23 UTC 2013


On Tue, Jun 04, 2013 at 10:54:30AM +0200, Pascal Braun, Continum wrote:
> I've put swap on a separate disk this time and re-run the zfs send / recv, but I'm still getting the problem. What's interesting is, I'm still getting the same output about swap space on the console:
> swap_pager: indefinite wait buffer: bufobj: 0, blkno: 36, size: 24576

What this means is covered here:

http://www.freebsd.org/doc/en/books/faq/troubleshoot.html#idp75389104

Google "swap_pager: indefinite wait buffer" and you will see lots of
people talking about this.  The commonality in most situations is that
the I/O subsystem is stalled or too busy to swap in/out a page of
memory, and the kernel throws a nastygram about it.

Basically, your I/O subsystem is taking too long to accomplish a task
(swapping a page of memory in or out).  It could be the disk taking too
long, the controller taking too long, the disk being in bad shape, the
overall PCI/PCI-X/PCIe bus being overwhelmed, or something as simple as
excessive CPU load (your pool/vdev setup, I imagine, is very CPU
intensive).
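
If you want a quick read on interrupt and CPU pressure while the
send/recv runs, the stock tools are enough -- just a sketch, nothing
exotic:

  vmstat -i       # interrupt counts/rates per device; watch the mpt0..mpt5 lines
  top -SH -s 1    # include kernel threads; look for zfs/geom threads hogging CPU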

Your system is utterly massive.  Warning: this is the first time I have
taken a stab at such a huge system.  The topology, for those wondering:

* mpt0 (LSI SATA/SAS; rev 1.5.20.0), irq 17
  |-- 6 disks attached
      |-> da0  = Hitachi HDS72101 A3MA, 1953525168 sectors
      |-> da1  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da2  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da3  = Hitachi HDS5C302 A580, 3907029168 sectors
      |-> da4  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da5  = Hitachi HDS5C302 A800, 3907029168 sectors
* mpt1 (LSI SATA/SAS; rev 1.5.20.0), irq 18
  |-- 7 disks attached
      |-> da6  = Hitachi HDS72101 A3MA, 1953525168 sectors
      |-> da7  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da8  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da9  = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da10 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da11 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da12 = Hitachi HDS5C302 A800, 3907029168 sectors
* mpt2 (LSI SATA/SAS; rev 1.5.20.0), irq 18
  |-- 6 disks attached
      |-> da13 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da14 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da15 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da16 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da17 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da18 = Hitachi HDS5C302 A800, 3907029168 sectors
* mpt3 (LSI SATA/SAS; rev 1.5.20.0), irq 40
  |-- 8 disks attached
      |-> da19 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da20 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da21 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da22 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da23 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da24 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da25 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da26 = Hitachi HDS5C302 A800, 3907029168 sectors
* mpt4 (LSI SATA/SAS; rev 1.5.20.0), irq 41
  |-- 7 disks attached
      |-> da27 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da28 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da29 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da30 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da31 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da32 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da33 = Hitachi HDS5C302 A800, 3907029168 sectors
* mpt5 (LSI SATA/SAS; rev 1.5.20.0), irq 41 (shares IRQ with mpt4)
  |-- 5 disks attached
      |-> da34 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da35 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da36 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da37 = Hitachi HDS5C302 A800, 3907029168 sectors
      |-> da38 = Hitachi HDS5C302 A800, 3907029168 sectors

Things to note / things that caught my eye:

- mpt2 shares an IRQ with mpt1 (irq 18)

- mpt5 shares an IRQ with mpt4 (irq 41)

- Disk da3 has a different drive firmware (A580) than the A800 drives.

- (For readers only): Disks {da0,da6} are a completely different model
  and capacity than all the other drives in the array.  These drives
  are used (in a mirror) for the ZFS root pool ("zroot"), not the big
  fat gigantic pool ("tank").

- I have not verified whether any of these disks use 4KByte sectors
  (dmesg is not going to tell you the entire truth).  I would appreciate
  seeing "smartctl -x" output from {da0,da1,da3} so I can get an idea;
  see the example just after this list.  Your pools use GPT labels, so I
  am left hoping that your labels refer to partitions with proper 4KB
  alignment regardless.

- The system has only 16GB RAM.  That's a bit shocking for something of
  this size.
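
Regarding the 4K question above: something along these lines would
answer it.  The device names are just examples picked from the list
above, substitute as appropriate:

  # sector sizes as reported by the drive itself
  smartctl -x /dev/da1 | grep -i 'sector size'
  # what GEOM thinks (a stripesize above 512 usually means a 4K drive)
  diskinfo -v /dev/da1 | egrep 'sectorsize|stripesize|stripeoffset'
  # partition layout; start offsets should be multiples of 8 sectors for 4K alignment
  gpart show da1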

Moving on.

Can you tell me exactly which disk (e.g. daXX) in the above list you used
for swap, and what the system and disk load looked like at the time you
saw the swap message?
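
If you're not sure off-hand which device that ended up being, these two
will show it:

  swapinfo                # which device(s) swap currently lives on
  grep -w sw /etc/fstab   # how swap is configured at boot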

I'm looking for a capture of "gstat -I500ms" output (you will need a
VERY long/big terminal window to capture this given how many disks you
have) while I/O is happening, as well as "top -s 1" in another window.
I would also like to see "zpool iostat -v 1" output while things are
going on, to help narrow down whether a single disk is causing the
entire I/O subsystem for that controller to choke.
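
A low-effort way to capture all of that is script(1) -- one command per
terminal, stopped with 'q' or ^C when done; the file names here are just
examples:

  script /var/tmp/gstat.out  gstat -I 500ms
  script /var/tmp/top.out    top -s 1
  script /var/tmp/zpool.out  zpool iostat -v tank 1

The gstat/top captures will contain some screen-control noise, but
they're still readable.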

Next: are you using compression or dedup on any of your filesystems?
If not, have you ever used either of them in the past?
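
The quickest way I know of to answer that for everything at once -- just
a sketch, adjust pool names as needed:

  zfs get -r compression,dedup tank zroot | grep -v @
  zpool get dedupratio tank

A dedupratio above 1.00x would mean deduplicated data is still on disk
even if the property is off now.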

Next: could we have your loader.conf and sysctl.conf please?
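
In addition to those files, the live values for the usual ZFS/VM knobs
would help -- a sketch of what I'd run, assuming these sysctls exist on
your release (names can vary slightly between versions):

  sysctl vfs.zfs.arc_max vfs.zfs.arc_meta_limit vfs.zfs.prefetch_disable
  sysctl kstat.zfs.misc.arcstats.size kstat.zfs.misc.arcstats.c
  sysctl vm.kmem_size vm.kmem_size_max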

My gut feeling is that if you're doing zfs {send,recv} for "tank" --
which you are -- multiple subsystems and buses are so incredibly
overwhelmed by all the I/O and interrupts and *everything* that it's
very hard for the swap pager to get a decent share of I/O time to page
something out to the swap device (even worse if that controller is
overwhelmed with requests).  Worse, you're using raidz2, which means
even more CPU time and calculation overhead, which means less time for
other tasks (threads).  Everything on the system -- everything! -- is
fighting for time at multiple levels.

If you could put a swap disk on a dedicated controller (and no other
disks on it), that would be ideal.  Please do not use USB for this task
(the USB stack may introduce its own set of complexities pertaining to
interrupt usage).
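
For reference, once such a disk is in place it's a one-line change; the
device and partition names below are purely hypothetical:

  # /etc/fstab -- hypothetical dedicated swap partition
  /dev/da39p1   none   swap   sw   0   0

  swapon /dev/da39p1    # enable it immediately, no reboot needed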

If all this turns out to be an "overall system overwhelmed" situation,
my advice is to cut back on the usage.  I would STRONGLY suggest in that
case a 2nd system, splitting the disks across the two.

I'm really surprised, given how many disks you have, that you didn't
choose to get an actual filer (NetApp).  I sure as hell would have.  I
really do not know why people think ZFS at this scale is a full-blown
replacement for a NetApp -- it isn't.

Anyway take what I say with a grain of salt -- really.  I'm just
throwing out thoughts/ideas as I look over everything.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Making life hard for others since 1977.             PGP 4BD6C0CB |


