ZFS pool hangs (live-locks?) after adding L2ARC

From: Lev Serebryakov <lev_at_FreeBSD.org>
Date: Wed, 20 Dec 2023 13:31:15 UTC
Hello!

    The system in question runs FreeBSD 13.2-STABLE stable/13-n256849-05c55eed44e5.

    I have 3 ZFS pools: one "simple" pool (nda1p2), which is the system pool with BE, root, etc., and two raidz1 pools:
      "zstor", consisting of 5 HDDs (daX) and
      "ztorr", consisting of 3 HDDs (adaX).

    I also have an NVMe disk, nvme0 (nda0, a brand-new AData Legend 960 2TB), with one GPT partition of type "freebsd-swap"
    (it is NOT configured or enabled as swap in the system!). The size of this partition is 1.6T.

    When I add nda0p1 (the AData partition) as a "cache" device to the "zstor" pool, it is added without problems, but later the pool hangs.

    I've experienced 2 hangs:

    (1) Right after adding the cache device and rebooting, the import of the pool hung. When I tried to import the pool by hand
        in single-user mode, I saw one kernel thread with a name like "z_int_2_2" consuming 100% of one core.
        I waited for an hour without any result. After that I physically removed the NVMe, booted successfully,
        and removed the device from the pool with "zpool remove".

    (2) After that I re-added the "cache" device and everything worked for some time (10+ days). But suddenly one filesystem on the pool
        (only one!) started to livelock: if you run "ls" on this filesystem it hangs forever, "ls" consumes one core (100%, system time), and
        again a thread with a name like "z_int_X_Y" consumes 100% of another core. "ls" cannot be killed; only a reboot (which also hangs after
        "all bufs synced"!) helps. After the reboot the problem reproduced again, with exactly the same symptoms.
        This time I was able to remove the cache device with "zpool remove", without detaching it physically.
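
    For reference, the add/remove sequence described in (1) and (2) can be sketched roughly as follows. The pool and
    device names are the ones from this report; the run() helper and the DRY_RUN variable are illustrative assumptions
    (not part of ZFS) so that the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch of the cache add/remove sequence from this report.
# POOL and CACHE_DEV come from the report; run() and DRY_RUN are
# hypothetical helpers added only for illustration.
POOL="zstor"
CACHE_DEV="nda0p1"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of executing them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run zpool add "$POOL" cache "$CACHE_DEV"   # attach the partition as an L2ARC device
run zpool status "$POOL"                   # the device should be listed under "cache"
run zpool remove "$POOL" "$CACHE_DEV"      # detach the L2ARC device again
```

    Removing a cache device does not touch pool data, which is why "zpool remove" was enough to recover in both cases.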

   Status of the pool:

 > zpool status zstor
   pool: zstor
  state: ONLINE
status: Some supported and requested features are not enabled on the pool.
         The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
         the pool may no longer be accessible by software that does not support
         the features. See zpool-features(7) for details.
   scan: resilvered 1.85T in 04:02:19 with 0 errors on Sat Dec  9 16:21:33 2023
config:

         NAME        STATE     READ WRITE CKSUM
         zstor       ONLINE       0     0     0
           raidz1-0  ONLINE       0     0     0
             da4     ONLINE       0     0     0
             da2     ONLINE       0     0     0
             da3     ONLINE       0     0     0
             da1     ONLINE       0     0     0
             da0     ONLINE       0     0     0

errors: No known data errors

   I have two non-default settings for ZFS:

vfs.zfs.min_auto_ashift=12
vfs.zfs.abd_scatter_enabled=0

   I cannot find any discussion of such a problem on the Internet. Also, the "live" system doesn't have these "z_int_X_Y" threads at all.

   I want my L2ARC, I've paid for this NVMe!

-- 
// Lev Serebryakov