Re: close() taking minutes (ZFS related)
Date: Sat, 21 Jan 2023 12:53:26 UTC
On Fri, Jan 20, 2023 at 03:55:43PM +0100, Mateusz Guzik wrote:
! offcpu.d:
!
! #pragma D option dynvarsize=32m
!
! /*
! * The p_flag & 0x4 test filters out kernel threads.
! */
!
! sched:::sleep
! /(curthread->td_proc->p_flag & 0x4) == 0/
! {
! self->ts = timestamp;
! self->mesg = curthread->td_wmesg;
! }
!
! sched:::wakeup
! /self->ts/
! {
! @[stack(30), stringof(self->mesg)] = sum(timestamp - self->ts);
! self->ts = 0;
! self->mesg = 0;
! }
!
! dtrace:::END
! {
! printa("%k\n%s\n%@d\n\n", @);
! }
!
! dtrace -s offcpu.d -o offcpu.out
!
! git clone https://github.com/brendangregg/FlameGraph.git
! cat offcpu.out | perl FlameGraph/stackcollapse.pl | perl
! FlameGraph/flamegraph.pl --color=io > offcpu.svg
Okay, got it. And now?
I created two pictures, one when the machine is idle and one when it
is completely stuck, but I don't see significant differences between them.
I have dropped them into the ACME directory so you can have a look:
http://flag.daemon.contact/.well-known/acme-challenge/ZFS.idle.svg
http://flag.daemon.contact/.well-known/acme-challenge/ZFS.busy.svg
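Maybe a differential graph would show the delta better; the FlameGraph repo
also ships difffolded.pl for exactly that. Untested sketch, with the two .out
files standing in for the idle and busy captures:

  perl FlameGraph/stackcollapse.pl offcpu.idle.out > idle.folded
  perl FlameGraph/stackcollapse.pl offcpu.busy.out > busy.folded
  # render only what changed between the two captures
  perl FlameGraph/difffolded.pl idle.folded busy.folded | \
      perl FlameGraph/flamegraph.pl --color=io > offcpu.diff.svg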
Probably this needs more time to really understand; my brain is
currently occupied with the fallout of these table migrations.
I also tried to get a better view of what exactly is happening.
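For reference, the MWCHAN column in the snapshots below is the kernel wait
channel; FreeBSD ps shows it in the long listing, e.g.:

  # long listing, wide output; MWCHAN tells where each process sleeps
  ps -axlww | grep -E 'tx->tx_s|zcw->zcw|zfs'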
This is a little database load:
The beginning, everything normal:
770 85475 83995 0 21 0 17812 6032 select S+J 2 0:00.18 pg_restore -C -d postgres pgdump_TESTLOAD
The backend starts working, and things start to hang:
770 85503 44085 4 52 0 156652 109460 zcw->zcw DsJ - 0:05.78 postgres: postgres testload [local] ALTER TABLE (post
770 85504 44090 1 20 0 580 384 tx->tx_s DJ - 0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
770 85475 83995 4 22 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD
It's getting bad:
0 13187 1 1 20 0 13276 768 tx->tx_s DsJ - 0:14.51 /usr/sbin/blacklistd
770 44089 44085 16 20 0 80464 12620 tx->tx_s DsJ - 0:09.90 postgres: walwriter (postgres)
770 85503 44085 15 52 0 177132 124516 tx->tx_s DsJ - 0:06.65 postgres: postgres testload [local] CREATE INDEX (pos
770 85504 44090 10 20 0 580 384 tx->tx_s DJ - 0:00.00 sh -c /ext/libexec/RedoLog.copy "00000001000000290000
770 85475 83995 19 22 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD
And worse:
770 44089 44085 17 20 0 80464 12620 zcw->zcw DsJ - 0:09.92 postgres: walwriter (postgres)
770 85503 44085 2 27 0 177132 124980 usem SsJ - 0:07.39 postgres: postgres testload [local] CREATE INDEX (pos
8 85660 10053 1 52 0 428 252 tx->tx_s DJ - 0:00.00 expr ## : \\(.\\).*
770 85661 85504 8 22 0 416 232 zfs DJ - 0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
770 85475 83995 19 20 0 17812 6036 select S+J 2 0:00.84 pg_restore -C -d postgres pgdump_TESTLOAD
DB loaded, but things are ugly now:
8 85660 10053 16 52 0 428 252 tx->tx_s DJ - 0:00.00 expr ## : \\(.\\).*
770 85661 85504 8 22 0 416 232 zfs DJ - 0:00.00 fsync /var/db/pg-int/arch/000000010000002900000014
5001 85683 85674 16 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /ext/libexec/svnsync.cron /ext/SVNR svn+ss
0 85687 85677 4 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /usr/libexec/atrun
8 85688 85679 5 20 0 580 380 zfs DsJ - 0:00.00 /bin/sh -c /usr/local/news/bin/send-uucp
770 85666 83995 10 20 0 1020 764 zfs D+J 2 0:00.00 psql -c drop database testload
Then I tried that load again with sync writes disabled, to see if
things change (they do not). But at that time the daily periodic run
kicked in, and it does a VACUUM, that is, a first VACUUM on all the
migrated tables. And now I give up, because this is unusable:
0 1430 1429 4 20 0 12860 3496 tx->tx_s D - 0:00.09 / /usr /var /home /ext /j/kerb /media /j/oper /usr/loc
0 2569 10505 14 20 0 21072 5800 tx->tx_s DJ - 0:00.03 sendmail: 30L26S6D002569 wand-n.intra.daemon.contact [
0 2273 2269 13 30 0 13680 3480 tx->tx_s Ds - 0:00.02 /bin/sh /ext/libexec/Key.renew
770 2645 44085 2 23 0 83024 23528 tx->tx_s DsJ - 0:25.15 postgres: postgres bareos [local] VACUUM (postgres)
770 2924 89818 9 20 0 17176 5308 tx->tx_s DJ - 0:00.01 fetch --no-verify-peer -1 -T 30 -o /tmp/kurse.89801.ht
8 2936 10053 2 22 0 428 252 tx->tx_s DJ - 0:00.00 expr 29 + 1
0 2947 2845 0 20 0 18384 3736 tx->tx_s DJ - 0:00.00 sendmail: ./30L26SBv001628 from queue (sendmail)
770 2949 2668 2 21 0 424 240 zfs DJ - 0:00.00 mv /var/db/pg-int/arch/000000010000002900000023 /var/d
0 3110 3108 12 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SDj001446 from queue (sendmail)
0 3335 3334 11 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26Sjq001053 from queue (sendmail)
997 3504 4734 7 20 0 580 380 zfs DJ - 0:00.01 sh -c /sbin/ping -c2 wand-n.intra.daemon.contact > /de
997 3505 4734 15 20 0 580 380 zfs DJ - 0:00.01 sh -c sleep `jot -r 1 1 55`
997 3506 4734 12 20 0 580 380 zfs DJ - 0:00.01 sh -c /sbin/ping -c2 pole-n.intra.daemon.contact > /de
997 3507 4734 8 21 0 580 380 zfs DJ - 0:00.01 sh -c /usr/local/bin/xlsclients -display disp-e.intra.
0 3548 3547 4 21 0 18384 3696 zfs DJ - 0:00.00 sendmail: ./30L26SRV001167 from queue (sendmail)
0 3742 3740 9 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SrD001048 from queue (sendmail)
0 3955 3954 8 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26Sjs001053 from queue (sendmail)
0 4155 4154 17 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SBx001628 from queue (sendmail)
0 4375 4374 6 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SLE002237 from queue (sendmail)
0 4575 4574 17 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SDl001446 from queue (sendmail)
0 4753 4752 11 20 0 18384 3656 zfs DJ - 0:00.00 sendmail: ./30L26SrF001048 from queue (sendmail)
0 4952 4951 12 20 0 18384 3696 zfs DJ - 0:00.00 sendmail: ./30L26SRX001167 from queue (sendmail)
0 5170 5169 3 20 0 18384 3700 zfs DJ - 0:00.00 sendmail: ./30L26SWb001552 from queue (sendmail)
0 5383 5382 1 20 0 18384 3660 zfs DJ - 0:00.00 sendmail: ./30L26SYD000965 from queue (sendmail)
0 10486 1 6 20 0 13000 620 tx->tx_s DsJ - 0:13.29 /usr/sbin/cron -s
770 3514 83995 16 20 0 596 288 zfs D+J 2 0:00.00 pg_restore -C -d postgres pgdump_TESTLOAD
1000 3518 83924 18 20 0 13396 3200 zfs D+ 4 0:00.00 ls -la /j/admn/etc/
I attached truss to the VACUUM worker, and it does file reads and
writes in 8k blocks continuously, just as it should, keeping the
disks busy.
But meanwhile, nothing else is making any progress.
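For reference, attaching truss to the running backend works roughly like
this (2645 is the VACUUM backend PID from the listing above, the output
file is just a placeholder):

  truss -o /tmp/vacuum.truss -p 2645   # trace syscalls of the running VACUUM backend
  tail -f /tmp/vacuum.truss            # watch the 8k read()/write() stream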
I had to wait half an hour until the VACUUM was done; then everything
went back to normal, and nothing bad had happened (my scripting is
usually written to cope with indefinite stalls at any place).
The issue is a lot worse than I initially thought; it affects
everything: sort, reindex, vacuum, ...
By the looks of it, ZFS allows only a single process to write to the pool,
and everything else is blocked from accessing it during that time.
And this seems SSD-specific: the DB was on mechanical disks before
and did fine for the last 15 years, precisely saturating the SCSI bus,
with no stalls.
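For what it's worth, the tx->tx_s wait channel in the listings above looks
like threads waiting for the ZFS transaction group (txg) to sync. The few
related knobs that still seem to exist can at least be inspected; a guess at
the current OpenZFS sysctl names, untested here:

  sysctl vfs.zfs.txg.timeout        # seconds between forced txg syncs
  sysctl vfs.zfs.dirty_data_max     # dirty-data limit before writers get throttled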
I suspect this could be related to all the pseudo-AI that went into
ZFS over the years: the knobs I used to tune have disappeared, and
even arc-min/arc-max seem to be in the process of disappearing.
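For example, the ARC limits now seem to live under new names, with the
legacy names apparently kept as compatibility aliases (a quick check,
assuming OpenZFS 2.x, as far as I can tell):

  sysctl vfs.zfs.arc_max vfs.zfs.arc_min    # legacy names
  sysctl vfs.zfs.arc.max vfs.zfs.arc.min    # current names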
Back in 2008 this DB ran on a Pentium II with 750 MB RAM, and things
worked after proper tuning. Now it has 20 cores and 48 GB RAM, and it
doesn't work.