Re: measuring swap partition speed

From: Warner Losh <imp_at_bsdimp.com>
Date: Fri, 15 Dec 2023 15:41:10 UTC
On Fri, Dec 15, 2023 at 7:29 AM void <void@f-m.fm> wrote:

> Hello list, I have on a rpi4 a usb3-connected disk partitioned like this:
>
> # gpart show
>
> =>        40  1953525088  da0  GPT  (932G)
>            40      532480    1  efi  (260M)
>        532520        2008       - free -  (1.0M)
>        534528     4194304    2  freebsd-swap  (2.0G)
>       4728832     4194304    4  freebsd-swap  (2.0G)
>       8923136     4194304    5  freebsd-swap  (2.0G)
>      13117440     4194304    6  freebsd-swap  (2.0G)
>      17311744     4194304    7  freebsd-swap  (2.0G)
>      21506048     4194304    8  freebsd-swap  (2.0G)
>      25700352  1927823360    3  freebsd-zfs  (920G)
>    1953523712        1416       - free -  (708K)
>
> If processes swap out, it runs like a slug [1]. I'd like to test if it's
> the disk on its way out. How would I test swap partitions? [2]
>
> [1] it didn't always run like a slug.
>

What's the underlying hardware?


> [2] would nfs-mounted swap be faster? (1G network)
>

Maybe.


> [3] swap is not encrypted
>

Good. You aren't CPU bound.

So the good news, kinda, is that if this is spinning rust, your swap
partitions are on the fastest part of the disk. I'd expect one partition
that's 12G to work better than six that are 2G, since you'd have less head
thrash (a rough gpart sketch for consolidating them is below). Parallelism
with multiple swap partitions works best when they are on separate spindles.
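
If you want to go that route, something like the following should collapse
the six 2G partitions into one 12G one. I'm assuming the device is da0 and
the indices match the gpart show output above, so double-check before
deleting anything, and note that swapoff can fail if there isn't enough free
memory to page everything back in:

  # swapoff /dev/da0p2 /dev/da0p4 /dev/da0p5 /dev/da0p6 /dev/da0p7 /dev/da0p8
  # gpart delete -i 8 da0
  # gpart delete -i 7 da0
  # gpart delete -i 6 da0
  # gpart delete -i 5 da0
  # gpart delete -i 4 da0
  # gpart resize -i 2 da0
  # swapon /dev/da0p2

gpart resize without -s grows partition 2 into all the freed space up to the
ZFS partition, which works out to 12G here. Then trim /etc/fstab so only the
da0p2 swap entry remains.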

The bad news is that your disk may be fine. I'd expect that as the ZFS
partition fills up, seek distances will increase, since the heads have to
travel further and further to get back to the swap space. There's a sweet
spot of a few tens of GB that drives can usually seek within far faster than
longer throws...

But if it is an SSD, some comments. It makes no sense to have 6 swap
partitions; 1 will do the job (though this is on a rpi, so maybe you are
hitting some of our silly limits in the swap code on 32-bit architectures).
LBAs are LBAs: which ones you use doesn't matter at all (and I don't want to
hear about wear leveling; that doesn't matter at this level, since the FTL
does it behind the scenes in SSDs and NVMe drives). Your drive may be wearing
out if it has slowed down over time (though a certain amount, like 10-20%,
may be expected early in its life; the rate of performance decline often
slows for a large part of life before steeply declining again).

QLC SSDs do require a lot more drive care and feeding by the firmware,
including a lot more writes to deal with 'read disturb' in a read-heavy
workload, plus the rewrite from the initial landing EB (typically SLC, to be
fast) to the longer-term storage (QLC, for the capacity). Many workloads
trigger a lot more housekeeping than on older TLC or MLC drives. And the
cheapest NAND in the marketplace tends to be QLC, so the cheapest SSDs (and
sometimes NVMe drives) tend to be QLC. For light use it doesn't matter, but
if you are starting to notice slowdowns, you are beyond the light use these
drives do almost OK at (I'm not a fan of QLC drives, if you can't tell).

If this is a thumb drive, you lose. Those are the cheapest of the cheap and
crappiest of the crap in terms of performance (there are a few notable
exceptions, but I'm playing the odds here). You are doomed to crappy
performance.

If it's really a micro-sd card behind a USB adapter, see my comments on
thumb drives :).

Now, having said all that, your best bet is to run a fio test. fio is my
go-to choice for benchmarking storage. Do a random workload with an 8k write
size (since that's the page size of aarch64) on one of the swap partitions
when it's not in active use. I suspect you have an SSD, and that it will
kinda suck, but be in line with the swap performance you are seeing.
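
To pull one of the swap partitions out of active use for the test and put it
back afterwards, something like this should work (the device name is a guess
from your gpart output, and swap8k.fio is just whatever you name the fio job
file):

  # swapoff /dev/da0p4
  # fio swap8k.fio
  # swapon /dev/da0p4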

I use the following template for my testing (128k should be reduced to 8k
for this test, though I've not looked at how much we cluster writes in our
swap code, so maybe that's too pessimistic). You might also try reducing the
number of I/O jobs, since I'm measuring, or trying to, the best possible
sustained throughput numbers (latency in this test tends to run kinda high).

; SSD testing: 128k I/O 64 jobs 32 deep queue

[global]
direct=1
rw=randread
refill_buffers
norandommap
randrepeat=0
bs=128k
ioengine=posixaio
iodepth=32
numjobs=64
runtime=60
group_reporting
thread

[ssd128k]
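
For reference, a version of that template adjusted along those lines might
look like the following. The 8k block size, random-write workload, reduced
job count, and the da0p4 filename are illustrative guesses, so adjust them to
your setup, and only point it at a partition that has been swapped off, since
it writes to the raw device:

; swap testing: 8k random writes against an idle swap partition

[global]
direct=1
rw=randwrite
refill_buffers
norandommap
randrepeat=0
bs=8k
ioengine=posixaio
iodepth=32
numjobs=4
runtime=60
group_reporting
thread

[swap8k]
filename=/dev/da0p4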

Good luck.

Warner