Re: close() taking minutes (ZFS related)

From: Peter <pmc_at_citylink.dinoex.sub.org>
Date: Fri, 20 Jan 2023 14:26:39 UTC
On Fri, Jan 20, 2023 at 12:05:20PM +0100, Mateusz Guzik wrote:
! On 1/20/23, Peter <pmc@citylink.dinoex.sub.org> wrote:
! >
! > Does this look familiar with anybody? ("truss -dD ls -l something")
! >
! > 0.047760369 0.000017548 open("/etc/nsswitch.conf",O_RDONLY|O_CLOEXEC,0666) =
! > 4 (0x4)
! > 0.047845099 0.000011290 fstat(4,{ mode=-rw-r--r--
! > ,inode=13886,size=272,blksize=4096 }) = 0 (0x0)
! > 0.047932166 0.000034182 read(4,"#\n# nsswitch.conf(5) - name ser"...,4096) =
! > 272 (0x110)
! > 0.048007308 0.000010661 read(4,0x801848000,4096) = 0 (0x0)
! > 45.827173802 45.779056767 close(4)               = 0 (0x0)
! >              ^^^^^^^^^^^^
! > ...
! > 45.866272643 0.000020608 open("/etc/localtime",O_RDONLY,077) = 6 (0x6)
! > 45.866398048 0.000017731 fstat(6,{ mode=-r--r--r--
! > ,inode=41572,size=2309,blksize=4096 }) = 0 (0x0)
! > 45.866505596 0.000054084 read(6,"TZif2\0\0\0\0\0\0\0\0\0\0\0\0\0"...,41448)
! > = 2309 (0x905)
! > 49.511709215 3.645026276 close(6)                = 0 (0x0)
! >              ^^^^^^^^^^^
! >
! > (The longest I've seen was more than 5 minutes)
! >
! > There is otherwise nothing wrong with the machine, only I am running
! > a postgres table merge for some 10 mio. rows. And only during that the
! > effect happens, reproducibly.
! >
! 
! This is most likely stalls happening on atime update.

I agree, that is the most plausible location.

! Most likely the following will help:
! sysctl vfs.zfs.per_txg_dirty_frees_percent=30
! 
! see https://github.com/openzfs/zfs/issues/13932 for more information

Thank You, but sadly that doesn't change it.

I did now revert some file-locking and async in postgres to defaults.
And I run truss from UFS, because otherwise it would already hang:
    root ~# truss -dD /j/admn/bin/ls -l /j/admn/etc/

And so I get things like this:
...
0.010328158 0.000009112 fstatat(AT_FDCWD,"newsyslog.conf",{ mode=-rw-r--r-- ,inode=13822,size=1708,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
0.010367043 0.000008568 fstatat(AT_FDCWD,"mail",{ mode=drwxr-xr-x ,inode=39,size=30,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
0.010405598 0.000008903 fstatat(AT_FDCWD,"apmd.conf",{ mode=-rw-r--r-- ,inode=44969,size=1240,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
16.746787754 16.736352005 fstatat(AT_FDCWD,"libmap.conf",{ mode=-rw-r--r-- ,inode=13980,size=47,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
16.746863876 0.000012441 fstatat(AT_FDCWD,"rc.conf.d",{ mode=drwxr-xr-x ,inode=51,size=2,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
16.746908593 0.000010568 fstatat(AT_FDCWD,"autofs",{ mode=drwxr-xr-x ,inode=31,size=9,blksize=4096 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
16.746951173 0.000010509 fstatat(AT_FDCWD,"rc.initdiskless",{ mode=-rw-r--r-- ,inode=14262,size=13680,blksize=13824 },AT_SYMLINK_NOFOLLOW) = 0 (0x0)
...
Is there an atime involved? (I tried to switch them of, but it seems
that requires a remount to become effective)

and 'ps' gives me the usual "zfs is busy waiting":
 UID   PID  PPID  C PRI NI     VSZ    RSS MWCHAN   STAT TT          TIME COMMAND
   0 90738 90737  8  23  0   13396   3156 tx->tx_s DX+   4       0:00.00 /j/admn/bin/ls -l /j/admn/etc/

    # sysctl vfs.zfs.txg.timeout=1
    vfs.zfs.txg.timeout: 15 -> 1
Doesn't change things noticeably.

Do You have an idea where to look further? (dtrace should be working,
but I don't know where to look)