Scrub incredibly slow with 13.0-RC3 (as well as RC1 & 2)

Mathieu Chouquet-Stringer me+freebsd at mathieu.digital
Sat Mar 27 16:06:48 UTC 2021


	Hello,

TL;DR: read_max makes no difference, don't use hpet with KVM, use
scan_legacy.

On Sat, Mar 27, 2021 at 11:59:43AM +0000, tech-lists wrote:
> Have you tried the following sysctl:
> 
> vfs.read_max=128

It's something I hadn't tried yet, so I gave it a shot. Note that I
first ran my tests on zroot, which sits on a virtio controller (the
image file lives on a mirror made of two SSDs).
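
For anyone wanting to reproduce this, the knob is just a sysctl, so a
runtime change is a one-liner (persisting it would go through
/etc/sysctl.conf as usual):

  # runtime change only, lost at reboot
  sysctl vfs.read_max=128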

This pool is small with only 12.1GB allocated out of 21.9GB:
NAME    SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
zroot  21.9G  12.1G  9.88G        -         -    37%    54%  1.00x    ONLINE  -

Scrubbing is hard to benchmark because the status appears to be stuck
forever:

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        11.8G scanned at 146M/s, 1.23M issued at 15.2K/s, 12.1G total
        0B repaired, 0.01% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.0G scanned at 145M/s, 3.58M issued at 43.2K/s, 12.1G total
        0B repaired, 0.03% done, no estimated completion time

  scan: scrub in progress since Sat Mar 27 15:16:27 2021
        12.1G scanned at 142M/s, 409M issued at 4.70M/s, 12.1G total
        0B repaired, 3.31% done, no estimated completion time

Look how it has already read 12.0G, so pretty much 100% of the
allocated space in the pool, and yet it still shows 0.03% done...
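
(For reference, progress snapshots like the ones above can be grabbed
with a trivial polling loop along these lines:)

  # poll scrub progress; the 60 second interval is arbitrary
  while :; do
      zpool status zroot | grep -A 2 'scan:'
      sleep 60
  done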

And with a bs of 128k (the ZFS recordsize), dd with iflag=direct tells
me I can read this thing at:
  23266197504 bytes (23 GB, 22 GiB) transferred 23.042s, 1010 MB/s

With a bs of 32k I still get 310MB/s.
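
(The read test above is just sequential dd runs, roughly like this,
assuming GNU dd -- iflag=direct is a GNU extension -- and a
hypothetical /dev/vtbd0 as the device backing the pool:)

  # hypothetical device path; point it at whatever backs the pool
  dd if=/dev/vtbd0 of=/dev/null bs=128k iflag=direct   # ~1010 MB/s here
  dd if=/dev/vtbd0 of=/dev/null bs=32k  iflag=direct   # ~310 MB/s here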

So please find below the scrub durations with vfs.read_max set to 512,
256, 128 and 64: it makes no difference.

Please note it seems to be completely CPU bound again, as I see this
kernel thread using 100% of a CPU the whole time the scrub is running:

   31 root         -8    -     0B  3952K CPU0     0  20:05 100.00% zfskern{txg_thread_enter}

- vfs.read_max=512
vfs.read_max: 64 -> 512
  - Scrub 1 scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:10:41 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:13:33 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:54 with 0 errors on Sat Mar 27 15:16:27 2021
- vfs.read_max=256
vfs.read_max: 512 -> 256
  - Scrub 1 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:19:19 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:22:12 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:25:04 2021
- vfs.read_max=128
vfs.read_max: 256 -> 128
  - Scrub 1 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:27:56 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:30:49 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:33:41 2021
- vfs.read_max=64
vfs.read_max: 128 -> 64
  - Scrub 1 scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:36:34 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:52 with 0 errors on Sat Mar 27 15:39:26 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:53 with 0 errors on Sat Mar 27 15:42:19 2021
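
(For reference, the timings above and below were collected by just
flipping the sysctl and running back-to-back scrubs; a rough sh harness
for that, with the pool name as the obvious knob:)

  # rough sketch: three timed scrubs per vfs.read_max value
  for rm in 512 256 128 64; do
      sysctl vfs.read_max=$rm
      for i in 1 2 3; do
          zpool scrub zroot
          # wait for this scrub to finish before starting the next one
          while zpool status zroot | grep -q 'scrub in progress'; do
              sleep 5
          done
          zpool status zroot | grep 'scrub repaired'
      done
  done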

I am attaching a flamegraph for that busy thread: again, a lot of time
spent in hpet_get_timecount... So I went and disabled hpet in KVM: I
gained 30 seconds and, again, playing with vfs.read_max made no
difference, but then again it's an SSD. And given that I'm still CPU
bound, I'm not sure I'll gain anything by playing with block sizes...
But the first lesson is not to use hpet with KVM (a rough sketch of how
to turn it off follows the timings below).

- vfs.read_max=512
vfs.read_max: 64 -> 512
  - Scrub 1 scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:14:25 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:16:44 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:19:02 2021
- vfs.read_max=256
vfs.read_max: 512 -> 256
  - Scrub 1 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:21:20 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:23:39 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:25:57 2021
- vfs.read_max=128
vfs.read_max: 256 -> 128
  - Scrub 1 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:28:15 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:30:33 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:32:51 2021
- vfs.read_max=64
vfs.read_max: 128 -> 64
  - Scrub 1 scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:35:10 2021
  - Scrub 2 scan: scrub repaired 0B in 00:02:18 with 0 errors on Sat Mar 27 16:37:28 2021
  - Scrub 3 scan: scrub repaired 0B in 00:02:19 with 0 errors on Sat Mar 27 16:39:47 2021
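
(How the hpet got turned off, roughly: with a plain QEMU/KVM command
line the -no-hpet switch does it; with libvirt, the equivalent is a
<timer name='hpet' present='no'/> element under <clock> in the domain
XML -- the exact mechanism depends on how the VM is launched. Inside
the guest, FreeBSD then picks another timecounter, which can be checked
with:)

  # guest side: current timecounter and the available choices
  sysctl kern.timecounter.hardware
  sysctl kern.timecounter.choice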

An interesting thing is that vfs.zfs.scan_legacy=1 makes the progress
counter move right away. See how I have read almost 50% of the pool
and am already 49% done, while with the non-legacy scan, progress only
moves after reading all the allocated space.

  scan: scrub in progress since Sat Mar 27 16:41:33 2021
        5.85G scanned at 133M/s, 5.85G issued at 133M/s, 12.1G total
        0B repaired, 48.50% done, 00:00:47 to go

It's altogether way quicker (I gain 54 seconds over the no-hpet
test!!!) and totally reproducible:
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:43:08 2021
  scan: scrub repaired 0B in 00:01:35 with 0 errors on Sat Mar 27 16:46:10 2021
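
(Nothing fancy needed to try it, it's just another sysctl followed by a
scrub; as usual, /etc/sysctl.conf is the place to make it stick:)

  # use the legacy (non-sorting) scrub code path
  sysctl vfs.zfs.scan_legacy=1
  zpool scrub zroot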

I will start a scrub of my raidz2 zpool and tweak read_max to see if it
makes a difference, but at this point I doubt it.

And scan_legacy could be quicker because I see as many busy threads as
there are vdevs, while if I start a scrub with the non-legacy scan, I
only see one from the get-go:

  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
    0 root        -16    -     0B    13M CPU0     0   8:40  99.17% kernel{z_rd_iss_4}
    0 root        -16    -     0B    13M CPU9     9   8:40  99.15% kernel{z_rd_iss_6}
    0 root        -16    -     0B    13M CPU7     7   8:40  98.97% kernel{z_rd_iss_5}
    0 root        -16    -     0B    13M CPU4     4   8:40  98.96% kernel{z_rd_iss_7}
    0 root        -16    -     0B    13M CPU6     6   8:40  98.59% kernel{z_rd_iss_1}
    0 root        -16    -     0B    13M CPU5     5   8:40  98.56% kernel{z_rd_iss_2}
    0 root        -16    -     0B    13M CPU14   14   8:40  98.33% kernel{z_rd_iss_3}
    0 root        -16    -     0B    13M CPU11   11   8:40  97.89% kernel{z_rd_iss_0}
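
(Those kernel issue threads are visible with top's system-process and
per-thread flags:)

  # -S: include system/kernel processes, -H: show individual threads
  top -SH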

-- 
Mathieu Chouquet-Stringer
            The sun itself sees not till heaven clears.
	             -- William Shakespeare --

