Kernel panic in zfs code; 12-STABLE

Karl Denninger karl at denninger.net
Thu Jul 18 21:17:51 UTC 2019


On 7/18/2019 15:35, Karl Denninger wrote:
> On 7/18/2019 15:19, Eugene Grosbein wrote:
>> 19.07.2019 3:13, Karl Denninger wrote:
>>
>>> FreeBSD 12.0-STABLE #2 r349024M: Thu Jun 13 18:01:16 CDT 2019    
>>> karl at NewFS.denninger.net:/usr/obj/usr/src/amd64.amd64/sys/KSD-SMP
>>>
>>> Note -- no patches of any sort in the ZFS code; I am NOT running any of
>>> my former patch set.
>>>
>>> NewFS.denninger.net dumped core - see /var/crash/vmcore.8
>>>
>>> Thu Jul 18 15:02:54 CDT 2019
>>>
>>> FreeBSD NewFS.denninger.net 12.0-STABLE FreeBSD 12.0-STABLE #2 r349024M:
>>> Thu Jun 13 18:01:16 CDT 2019    
>>> karl at NewFS.denninger.net:/usr/obj/usr/src/amd64.amd64/sys/KSD-SMP  amd64
>>>
>>> panic: double fault
>> [skip]
>>
>>> #283 0xffffffff82748d91 in zio_vdev_io_done (zio=0xfffff8000b8b8000)
>>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:3376
>>> #284 0xffffffff82744eac in zio_execute (zio=0xfffff8000b8b8000)
>>>     at /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:1786
>>> #285 0xffffffff80c3b7f4 in taskqueue_run_locked (queue=0xfffff801a8b35100)
>>>     at /usr/src/sys/kern/subr_taskqueue.c:467
>>> #286 0xffffffff80c3cb28 in taskqueue_thread_loop (arg=<value optimized out>)
>>>     at /usr/src/sys/kern/subr_taskqueue.c:773
>>> #287 0xffffffff80b9ab23 in fork_exit (
>>>     callout=0xffffffff80c3ca90 <taskqueue_thread_loop>,
>>>     arg=0xfffff801a0577520, frame=0xfffffe009d4edc00)
>>>     at /usr/src/sys/kern/kern_fork.c:1063
>>> #288 0xffffffff810b367e in fork_trampoline ()
>>>     at /usr/src/sys/amd64/amd64/exception.S:996
>>> #289 0x0000000000000000 in ?? ()
>>> Current language:  auto; currently minimal
>>> (kgdb)
>> You have "double fault" and completely insane number of stack frames in the trace.
>> This is obviously infinite recursion resulting in kernel stack overflow and panic.
> Yes, but.... why and how?
>
> What's executing at the time is this command:
>
> zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP
>
> Which in turn results in the old snapshots on the target that are not
> on the source being deleted, then the new ones being sent.  It never
> gets to the sending part; it blows up during the delete of the OLD
> snapshots.
>
> The one(s) it deletes, however, it DOES delete.  When the box is
> rebooted those two snapshots on the target are indeed gone.
>
> That is, it is NOT getting "stuck" on one (which would imply there's an
> un-detected fault in the filesystem on the target in the metadata for
> that snapshot, resulting in a recursive call that blows up the stack)
> and it never gets to send the new snapshot, so whatever is going on is
> NOT on the source filesystem.  Neither source nor destination shows any
> errors on the filesystem; both pools are healthy with zero error counts.
>
> Therefore the question -- is the system queueing enough work to blow the
> stack *BUT* the work it queues is all legitimate?  If so there's a
> serious problem in the way the code now functions in that an "ordinary"
> operation can result in what amounts to kernel stack exhaustion.
>
> One note -- I haven't run this backup for the last five days, as I do it
> manually and I've been out of town.  Previously, running it on a daily
> basis completed without trouble.  This smells like a backlog of "things
> to do" when the send runs that results in the allegedly-infinite
> recursion (that isn't really infinite) that runs the stack out of space
> -- and THAT implies that the system is trying to queue a crazy amount of
> work on a recursive basis for what is a perfectly-legitimate operation
> -- which it should *NOT* do.

Update: This looks like an OLD bug that came back.

Previously the system would go absolutely insane on the first few
accesses to spinning rust during a snapshot delete and ATTEMPT to send
thousands of TRIM requests -- which spinning rust does not support.  On
a system with mixed vdevs, where some pools are rust and some are SSD,
this was a problem since you can't turn TRIM off globally because you
REALLY want it on the SSDs.

The FIX for this was to do this on the import of said pool comprised of
spinning rust:

#
# Now try to trigger TRIM so that we don't have a storm of them
#
# echo "Attempting to disable TRIM on spinning rust...."

mount -t zfs $BACKUP/no-trim /mnt
dd if=/dev/random of=/mnt/kill-trim bs=128k count=2
echo "Performed 2 writes"
sleep 2
rm /mnt/kill-trim
echo "Performed delete of written file; wait"
sleep 35
umount /mnt
echo "Unmounted temporary filesystem"
sleep 2
echo "TRIM disable theoretically done...."

This would cause ZFS to "figure out" that TRIM doesn't work on the VDEVs
on that pool and not send them.  You were then ok.
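
A quick way to sanity-check that the priming actually "took" is to dump
the TRIM counters before and after it (rough sketch only -- this assumes
the stock legacy TRIM code in 12, and the exact sysctl names may differ
on your build):

sysctl vfs.zfs.trim
sysctl kstat.zfs.misc.zio_trim

On the rust pool you'd expect the failed attempts to show up under
"unsupported" rather than "success".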

When that problem went away I commented those lines out.

I just un-commented them -- and the backup is now running, and appears
to be ok -- it's well past where it blew up before, and no panic.

So it APPEARS that ZFS (still) doesn't figure out quickly enough after a
pool import that TRIM doesn't work on those disks, and this leads to a
CRAZY number of queued requests which blow up the stack.  But if you
perform an operation that *should* result in only a small number of
TRIMs being issued and then wait a few seconds, the ZFS code correctly
realizes that these are spinning drives that don't do TRIM, doesn't
stack the requests up, and doesn't panic.

This wouldn't bite you on an import followed by MOST ordinary I/O
sequences, because the first few blocks you delete or overwrite will
generate a TRIM attempt, it will fail, and the vdevs get properly
marked.  But if the first thing you do after the pool import is a "zfs
recv" *and* that is an incremental receive (that is, one that first
deletes the snapshots on the target that no longer exist on the source),
you get screwed.
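
In other words it's purely an ordering problem in the backup script.
What I'm doing now amounts to roughly this (sketch only -- the pool and
dataset names are from my setup, and the import line here is schematic):

zpool import $BACKUP

# Prime the rust pool so the vdevs get flagged no-TRIM first....
mount -t zfs $BACKUP/no-trim /mnt
dd if=/dev/random of=/mnt/kill-trim bs=128k count=2
sleep 2
rm /mnt/kill-trim
sleep 35
umount /mnt

# ....and only THEN run the incremental receive that deletes the stale
# snapshots ($i is each source filesystem; it's inside a loop in the
# real script).
zfs send -RI $i@zfs-old $i@zfs-base | zfs receive -Fudv $BACKUP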

I don't know why this was "fixed" and now once again is not, or if it
was just luck that it didn't bite me earlier after moving to 12.0 -- but
going through the backtrace I found a crap-ton of TRIMs in those frames,
which of course just flat-out don't work on spinning drives.

IMHO ZFS should figure this out when the pool is imported and set the
flags appropriately, but it appears that it does not until you actually
attempt to TRIM, and the "attempt" in the case of a send/recv pair
results in a very large number of TRIM requests all being queued up at
once -- enough to cause a panic -- before the code figures out that it
can't do them at all.

-- 
Karl Denninger
karl at denninger.net
/The Market Ticker/
/[S/MIME encrypted email preferred]/