I/O to pool appears to be hung, panic!

Ben RUBSON ben.rubson at gmail.com
Mon Jul 3 13:36:45 UTC 2017


> On 03 Jul 2017, at 13:10, Edward Napierala <trasz at freebsd.org> wrote:
> 
> 2017-07-03 10:07 GMT+01:00 Ben RUBSON <ben.rubson at gmail.com>:
> 
> > > On 29 Jun 2017, at 15:36, Ben RUBSON <ben.rubson at gmail.com> wrote:
> >
> > >> On 29 Jun 2017, at 14:43, Fabian Keil <freebsd-listen at fabiankeil.de> wrote:
> >
> > > Thank you for your feedback, Fabian.
> >
> > >> Ben RUBSON <ben.rubson at gmail.com> wrote:
> >>
> > >>> One of my servers had a kernel panic last night, giving the following message:
> >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
> >> [...]
> > >>> Here are some numbers regarding this disk, taken from the server hosting the pool:
> > >>> (unfortunately not from the iSCSI target server)
> > >>> https://s23.postimg.org/zd8jy9xaj/busydisk.png
> >>>
> > >>> We clearly see that suddenly the disk became 100% busy, while the CPU was almost idle.
> 
> We also clearly see that 5 minutes later (02:09) the disk seems to have come back but became 100% busy again,
> and that 16 minutes later (the default vfs.zfs.deadman_synctime_ms), the panic occurred.
> 
> > >>> No error messages at all on either server.
> >> [...]
> > >>> The only log I have is the following stack trace, taken from the server console:
> >>> panic: I/O to pool 'home' appears to be hung on vdev guid 122... at '/dev/label/G23iscsi'.
> >>> cpuid = 0
> >>> KDB: stack backtrace:
> >>> #0 0xffffffff80b240f7 at kdb_backtrace+0x67
> >>> #1 0xffffffff80ad9462 at vpanic+0x182
> >>> #2 0xffffffff80ad92d3 at panic+0x43
> >>> #3 0xffffffff82238fa7 at vdev_deadman+0x127
> >>> #4 0xffffffff82238ec0 at vdev_deadman+0x40
> >>> #5 0xffffffff82238ec0 at vdev_deadman+0x40
> >>> #6 0xffffffff8222d0a6 at spa_deadman+0x86
> >>> #7 0xffffffff80af32da at softclock_call_cc+0x18a
> >>> #8 0xffffffff80af3854 at softclock+0x94
> >>> #9 0xffffffff80a9348f at intr_event_execute_handlers+0x20f
> >>> #10 0xffffffff80a936f6 at ithread_loop+0xc6
> >>> #11 0xffffffff80a900d5 at fork_exit+0x85
> >>> #12 0xffffffff80f846fe at fork_trampoline+0xe
> >>> Uptime: 92d2h47m6s
> >>>
> > >>> I would have been pleased to make a dump available.
> > >>> However, despite my (correct?) configuration, the server did not dump:
> > >>> (nevertheless, "sysctl debug.kdb.panic=1" does make it dump)
> >>> # grep ^dump /boot/loader.conf /etc/rc.conf
> >>> /boot/loader.conf:dumpdev="/dev/mirror/swap"
> >>> /etc/rc.conf:dumpdev="AUTO"
> >>
> >> You may want to look at the NOTES section in gmirror(8).
> >
> > > Yes, I should already be OK (balance algorithm set to "prefer").
> >
> > >>> I use the default kernel, with a rebuilt ZFS module:
> >>> # uname -v
> >>> FreeBSD 11.0-RELEASE-p8 #0: Wed Feb 22 06:12:04 UTC 2017     root at amd64-builder.daemonology.net:/usr/obj/usr/src/sys/GENERIC
> >>>
> > >>> I use the following iSCSI configuration, which disconnects the disks "as soon as" they become unavailable:
> >>> kern.iscsi.ping_timeout=5
> >>> kern.iscsi.fail_on_disconnection=1
> >>> kern.iscsi.iscsid_timeout=5
> >>>
> > >>> I therefore think the disk was at least still reachable during these 20 busy minutes.
> >>>
> > >>> So, any idea why I could have faced this issue?
> >>
> >> Is it possible that the system was under memory pressure?
> >
> > > No, I don't think it was:
> > > https://s1.postimg.org/uvsebpyyn/busydisk2.png
> > > More than 2 GB of available memory.
> > > Swap not used (624 kB).
> > > ARC behaviour seems correct (anon increases because ZFS can't actually write, I think).
> > > Regarding the pool itself, it was receiving data at 6 MB/s, sending around 30 kB blocks to the disks.
> > > When the disk went busy, throughput fell to a few kB/s, with 128 kB blocks.
> >
> >> geli's use of malloc() is known to cause deadlocks under memory pressure:
> > >> https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=209759
> >>
> > >> Given that gmirror uses malloc() as well, it probably has the same issue.
> >
> > > I don't use geli, so I should not be facing this issue.
> >
> > >>> I would have thought ZFS would have taken the busy device offline instead of raising a panic.
> > >>> Perhaps it is already possible to make ZFS behave like this?
> >>
> >> There's a tunable for this: vfs.zfs.deadman_enabled.
> >> If the panic is just a symptom of the deadlock it's unlikely
> >> to help though.
> >
> > I think this tunable should have prevented the server from raising a panic:
> > # sysctl -d vfs.zfs.deadman_enabled
> > vfs.zfs.deadman_enabled: Kernel panic on stalled ZFS I/O
> > # sysctl vfs.zfs.deadman_enabled
> > vfs.zfs.deadman_enabled: 1
> >
> > But I'm not sure how it would have behaved then...
> > (busy disk miraculously back to normal, memory pressure due to anon increasing...)
> 
> I therefore think it would be nice, once vfs.zfs.deadman_synctime_ms has expired,
> to be able to take the busy device offline instead of raising a panic.
> Currently, disabling deadman will avoid the panic but will leave the device slowing down the pool.
> 
> I still have not found the root cause of this issue, and I'm not sure I will;
> it's quite difficult with only a stack trace and some performance graphs :/
> 
> What exactly is the disk doing when that happens?  What does "gstat" say?  If the iSCSI
> target is also FreeBSD, what does ctlstat say?

As shown on this graph made with gstat numbers from the initiator:
https://s23.postimg.org/zd8jy9xaj/busydisk.png
The disk is continuously writing at 3 MB/s before the issue happens.
When it occurs, response time increases to around 30 seconds (100% busy),
and consequently disk throughput drops to a few kB/s.
The CPU stays almost fully idle.
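
In case it helps next time, I will try to keep a continuous log of those numbers
on the initiator. Something along these lines should do it (endless batch mode,
filtered on the label name; exact flags to be double-checked against gstat(8)):

# gstat -B -I 1s -f G23iscsi >> /var/log/gstat-G23iscsi.log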

As shown here, there was no memory pressure:
https://s1.postimg.org/uvsebpyyn/busydisk2.png

At the end of the graphs, the panic is raised.

The iSCSI target is also FreeBSD; unfortunately ctlstat was not running when the issue occurred,
so its numbers are only averages since system startup (102 days ago).
I also do not have gstat numbers from this disk on the target side
(which would help to determine whether it's a hardware issue, an iSCSI issue, or something else).
I will think about collecting these numbers if the issue ever occurs again.
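
If it does occur again, I will also try to have something like this running on the
target, so that the ctlstat counters cover the event itself rather than the whole
uptime (interval and count are just an example, options to be checked against ctlstat(8)):

# ctlstat -t -w 10 -c 360 >> /var/log/ctlstat.log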

> If everything else fails, you might want to do a packet trace using "tcpdump -w" and take
> a look at it using Wireshark; it decodes iSCSI quite nicely.
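
Good idea, I will keep that in mind. For the record, something along these lines
should capture the iSCSI traffic on the initiator for later inspection in Wireshark
(assuming the default iSCSI port 3260; "em0" is just a placeholder interface name):

# tcpdump -i em0 -s 0 -w /var/tmp/iscsi.pcap port 3260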


