Re: ZFS operations hanging, but no visible errors?

From: Chris Ross <cross+freebsd_at_distal.com>
Date: Thu, 11 Nov 2021 13:27:42 UTC
Following up on a new hang this same system had (see yesterday's freebsd-fs mail with the subject "swap_pager: cannot allocate bio"), I think the same problem might have occurred again.  Certainly the system got stuck again.

Based on the below, my executing that dtrace command caused the system to report "ACPI Error: AE_NO_MEMORY".  In what way is the system out of memory here?  And does that failure running dtrace suggest that the "out of memory" problem is the core problem causing the ZFS hang in the first place?  My system has 128GB of RAM, which is nothing to sneeze at.  Are there parameters that I should change because the default values just don’t work well with a pool or fs this large?

And, from earlier in this thread from last week:  Now that I have the system running again, I can provide the "zpool status" output for information.  Let me know if I’ve just tried something crazy here; this is the largest ZFS filesystem I’ve attempted.  I have a 30T pool on another system without issue, and with less RAM.  (The largest fs on that pool is about 18T.)

% zpool status
  pool: tank
 state: ONLINE
  scan: scrub repaired 0B in 05:05:55 with 0 errors on Sat Oct 23 04:38:36 2021
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    da3     ONLINE       0     0     0
	    da2     ONLINE       0     0     0
	    da1     ONLINE       0     0     0
	  raidz1-1  ONLINE       0     0     0
	    da4     ONLINE       0     0     0
	    da5     ONLINE       0     0     0
	    da6     ONLINE       0     0     0
	  raidz1-2  ONLINE       0     0     0
	    da7     ONLINE       0     0     0
	    da8     ONLINE       0     0     0
	    da9     ONLINE       0     0     0

errors: No known data errors
% zfs list tank
NAME   USED  AVAIL     REFER  MOUNTPOINT
tank  14.2T  35.0T     14.2T  /tank


                               - Chris

> On Nov 7, 2021, at 03:35, Andriy Gapon <avg@freebsd.org> wrote:
> 
> On 05/11/2021 18:59, Chris Ross wrote:
>> Running procstat -kk on the rsync that was hung, then killed, then SIGKILL’d shows:
>> procstat -kk 35220
>>   PID    TID COMM                TDNAME              KSTACK
>> 35220 102499 rsync               -                   mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51 vm_domain_alloc_fail+0x86 vm_page_alloc_domain_after+0x7e uma_small_alloc+0x58 keg_alloc_slab+0xba zone_import+0xee zone_alloc_item+0x6f abd_alloc_chunks+0x61 abd_alloc+0x102 arc_hdr_alloc_abd+0xb0 arc_hdr_alloc+0x11e arc_read+0x4f4 dbuf_issue_final_prefetch+0x108 dbuf_prefetch_impl+0x3d0 dmu_zfetch+0x558
> 
> Looks like the system is out of memory.
> It seems that you already established that.
> 
> -- 
> Andriy Gapon
>
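
For reference, the "out of memory" reading comes straight from the quoted procstat stack, which lists the innermost frame first: arc_read asks abd_alloc for a buffer, UMA asks the VM for a page, the page allocation fails, and the thread goes to sleep in vm_wait_doms until pages are freed.  A minimal Python sketch of picking those frames out of a KSTACK string follows.  The names WAIT_FRAMES, frame_names and memory_wait_frames are my own for illustration, and the set of "wait" frames is an assumption taken from this one stack, not an official list.

```python
import re

# Frames that suggest a thread is sleeping until free pages appear.
# Assumption: this set is picked from the stack quoted above, not from
# any official FreeBSD list.
WAIT_FRAMES = {"vm_wait_doms", "vm_wait_domain", "vm_domain_alloc_fail"}

# KSTACK column copied from the quoted `procstat -kk 35220` output.
KSTACK = (
    "mi_switch+0xc1 _sleep+0x1cb vm_wait_doms+0xe2 vm_wait_domain+0x51 "
    "vm_domain_alloc_fail+0x86 vm_page_alloc_domain_after+0x7e "
    "uma_small_alloc+0x58 keg_alloc_slab+0xba zone_import+0xee "
    "zone_alloc_item+0x6f abd_alloc_chunks+0x61 abd_alloc+0x102 "
    "arc_hdr_alloc_abd+0xb0 arc_hdr_alloc+0x11e arc_read+0x4f4 "
    "dbuf_issue_final_prefetch+0x108 dbuf_prefetch_impl+0x3d0 "
    "dmu_zfetch+0x558"
)


def frame_names(kstack):
    """Split a KSTACK string into function names, dropping the +0x offsets."""
    return [re.sub(r"\+0x[0-9a-fA-F]+$", "", frame) for frame in kstack.split()]


def memory_wait_frames(kstack):
    """Return the frames, in stack order, that appear in WAIT_FRAMES."""
    return [name for name in frame_names(kstack) if name in WAIT_FRAMES]


if __name__ == "__main__":
    hits = memory_wait_frames(KSTACK)
    if hits:
        print("thread is waiting for free pages in:", ", ".join(hits))
    else:
        print("no page-allocation wait frames found")
```

Running the same check over the stacks of other stuck processes would show whether they are all blocked on page allocation or on something else.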