From nobody Fri Apr 08 17:41:11 2022
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id E838E1A879CD;
	Fri,  8 Apr 2022 17:41:22 +0000 (UTC)
	(envelope-from egoitz@ramattack.net)
Received: from cu1208c.smtpx.saremail.com (cu1208c.smtpx.saremail.com [195.16.148.183])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256)
	(Client did not present a certificate)
	by mx1.freebsd.org (Postfix) with ESMTPS id 4KZlss2FNpz3j5J;
	Fri,  8 Apr 2022 17:41:20 +0000 (UTC)
	(envelope-from egoitz@ramattack.net)
Received: from www.saremail.com (unknown [194.30.0.183])
	by sieve-smtp-backend02.sarenet.es (Postfix) with ESMTPA id A585B60C13A;
	Fri,  8 Apr 2022 19:41:11 +0200 (CEST)
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
Content-Type: multipart/alternative;
 boundary="=_fe85fe2db536d584a8585a29da09a2df"
Date: Fri, 08 Apr 2022 19:41:11 +0200
From: egoitz@ramattack.net
To: Stefan Esser <se@freebsd.org>
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org,
 freebsd-performance@freebsd.org, Rainer Duffner <rainer@ultra-secure.de>
Subject: Re: Re: Desperate with 870 QVO and ZFS
In-Reply-To: <b9dba1b4-1db1-9d73-da8a-080906c8e146@FreeBSD.org>
References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net>
 <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de>
 <0ef282aee34b441f1991334e2edbcaec@ramattack.net>
 <dd9a55ac-053d-7802-169d-04c95c045ed2@FreeBSD.org>
 <ce51660b5f83f92aa9772d764ae12dff@ramattack.net>
 <e4b7252d-525e-1c0f-c22b-e34b96c1ce83@FreeBSD.org>
 <e3ccbea91aca7c8870fd56ad393401a4@ramattack.net>
 <b9dba1b4-1db1-9d73-da8a-080906c8e146@FreeBSD.org>
Message-ID: <3d24c87110b4a155e3f14d53a9309c61@ramattack.net>
X-Sender: egoitz@ramattack.net
User-Agent: Saremail webmail
X-Rspamd-Queue-Id: 4KZlss2FNpz3j5J
X-Spamd-Bar: ---
Authentication-Results: mx1.freebsd.org;
	dkim=none;
	dmarc=pass (policy=reject) header.from=ramattack.net;
	spf=pass (mx1.freebsd.org: domain of egoitz@ramattack.net designates 195.16.148.183 as permitted sender) smtp.mailfrom=egoitz@ramattack.net
X-Spamd-Result: default: False [-3.79 / 15.00];
	 RCVD_TLS_LAST(0.00)[];
	 RCVD_VIA_SMTP_AUTH(0.00)[];
	 XM_UA_NO_VERSION(0.01)[];
	 TO_DN_SOME(0.00)[];
	 R_SPF_ALLOW(-0.20)[+ip4:195.16.148.0/24];
	 NEURAL_HAM_LONG(-1.00)[-1.000];
	 MIME_GOOD(-0.10)[multipart/alternative,text/plain];
	 ARC_NA(0.00)[];
	 RCPT_COUNT_FIVE(0.00)[5];
	 TO_MATCH_ENVRCPT_SOME(0.00)[];
	 NEURAL_HAM_SHORT(-1.00)[-1.000];
	 DMARC_POLICY_ALLOW(-0.50)[ramattack.net,reject];
	 FROM_NO_DN(0.00)[];
	 NEURAL_HAM_MEDIUM(-1.00)[-1.000];
	 MLMMJ_DEST(0.00)[freebsd-fs,freebsd-hackers,freebsd-performance];
	 FROM_EQ_ENVFROM(0.00)[];
	 R_DKIM_NA(0.00)[];
	 MIME_TRACE(0.00)[0:+,1:+,2:~];
	 ASN(0.00)[asn:3262, ipnet:195.16.128.0/19, country:ES];
	 RCVD_COUNT_TWO(0.00)[2];
	 MID_RHS_MATCH_FROM(0.00)[]
X-ThisMailContainsUnwantedMimeParts: N

--=_fe85fe2db536d584a8585a29da09a2df
Content-Transfer-Encoding: 8bit
Content-Type: text/plain; charset=UTF-8

Hi Stefan, 

Again extremely grateful. It's an absolute honor to receive your help..
really.... 

I have read this mail now but I need to read it slower and in a more
relaxed way.... When I do that I'll answer you (during the weekend or on
Monday at most). 

Don't worry I will keep you updated with news :) :) . I promise :) :) 

Cheers!

On 2022-04-08 13:14, Stefan Esser wrote:

> On 07.04.22 at 14:30, egoitz@ramattack.net wrote: On 2022-04-06 23:49, Stefan Esser wrote: 
> 
> On 2022-04-06 17:43, Stefan Esser wrote: 
> 
> On 06.04.22 at 16:36, egoitz@ramattack.net wrote: Hi Rainer!
> 
> Thank you so much for your help :) :)
> 
> Well I assume they are in a datacenter and should not be a power outage....
> 
> About dataset size... yes... our ones are big... they can be 3-4 TB easily each
> dataset.....
> 
> We bought them, because as they are for mailboxes and mailboxes grow and
> grow.... for having space for hosting them... 
> Which mailbox format (e.g. mbox, maildir, ...) do you use? 
> 
> I'M RUNNING CYRUS IMAP SO SORT OF MAILDIR... TOO MANY LITTLE FILES NORMALLY..... SOMETIMES DIRECTORIES WITH TONS OF LITTLE FILES....

Assuming that many mails are much smaller than the erase block size of
the SSD, this may cause issues. (You may know the following ...) 

For example, if you have message sizes of 8 KB and an erase block size
of 64 KB (just guessing), then 8 mails will be in an erase block. If
half the mails are deleted, then the erase block will still occupy 64
KB, but only hold 32 KB of useful data (and the SSD will only be aware
of this fact if TRIM has signaled which data is no longer relevant). The
SSD will copy several partially filled erase blocks together in a
smaller number of free blocks, which then are fully utilized. Later
deletions will repeat this game, and your data will be copied multiple
times until it has aged (and the user is less likely to delete further
messages). This leads to "write amplification" - data is internally
moved around and thus written multiple times. 

STEFAN!! YOU ARE NICE!! I THINK THIS COULD EXPLAIN ALL OUR PROBLEM. SO,
WHY WE ARE HAVING THE MOST RANDOMNESS IN OUR PERFORMANCE DEGRADATION AND
THAT DOES NOT NECESSARILY HAS TO MATCH WITH THE MOST IO PEAK HOURS...
THAT I COULD CAUSE THAT PERFORMANCE DEGRADATION JUST BY DELETING A
COUPLE OF HUGE (PERHAPS 200.000 MAILS) MAIL FOLDERS IN A MIDDLE TRAFFIC
HOUR TIME!!

Yes, if deleting large amounts of data triggers performance
issues (and the disk does not have a deficient TRIM implementation),
then the issue is likely to be due to internal garbage collections
colliding with other operations.

>> THE PROBLEM IS THAT BY WHAT I KNOW, ERASE BLOCK SIZE OF AN SSD DISK IS SOMETHING FIXED IN THE DISK FIRMWARE. I DON'T REALLY KNOW IF PERHAPS IT COULD BE MODIFIED WITH SAMSUNG MAGICIAN OR THOSE KIND OF TOOL OF SAMSUNG.... ELSE I DON'T REALLY SEE THE MANNER OF IMPROVING IT... BECAUSE APART FROM THAT, YOU ARE DELETING A FILE IN RAIDZ-2 ARRAY... NO JUST IN A DISK... I ASSUME ALIGNING CHUNK SIZE, WITH RECORD SIZE AND WITH THE "SECRET" ERASE SIZE OF THE SSD, PERHAPS COULD BE SLIGHTLY COMPENSATED?.

The erase block size is a fixed hardware feature of each flash chip.
There is a block size for writes (e.g. 8 KB) and many such blocks are
combined in one erase block (of e.g. 64 KB, probably larger in today's
SSDs); they can only be returned to the free block pool all together.
And if some of these writable blocks hold live data, they must be
preserved by collecting them in newly allocated free blocks. 

An example of what might happen, showing a simplified layout of files 1,
2, 3 (with writable blocks 1a, 1b, ..., 2a, 2b, ... and "--" for stale
data of deleted files, ".." for erased/writable flash blocks) in an SSD
might be: 

erase block 1: |1a|1b|--|--|2a|--|--|3a| 

erase block 2: |--|--|--|2b|--|--|--|1c| 

erase block 3: |2c|1d|3b|3c|--|--|--|--| 

erase block 4: |..|..|..|..|..|..|..|..| 

This is just a random example of how data could be laid out on the physical
storage array. It is assumed that the 3 erase blocks once were
completely occupied. 

In this example, 10 of 32 writable blocks are occupied, and only one
free erase block exists. 

This situation must not persist, since the SSD needs more empty erase
blocks. 10/32 of the capacity is used for data, but 3/4 of the blocks
are occupied and not immediately available for new data. 

The garbage collection might combine erase blocks 1 and 3 into a
currently free one, e.g. erase block 4: 

erase block 1: |..|..|..|..|..|..|..|..| 

erase block 2: |--|--|--|2b|--|--|--|1c| 

erase block 3: |..|..|..|..|..|..|..|..| 

erase block 4: |1a|1b|2a|3a|2c|1d|3b|3c| 

Now only 2/4 of the capacity is not available for new data (which is
still a lot more than 10/32, but better than before). 

Now assume file 2 is deleted:

erase block 1: |..|..|..|..|..|..|..|..| 

erase block 2: |--|--|--|--|--|--|--|1c| 

erase block 3: |..|..|..|..|..|..|..|..| 

erase block 4: |1a|1b|--|3a|--|1d|3b|3c| 

There is now a new sparsely used erase block 4, and it will soon need to
be garbage collected, too - in fact it could be combined with the live
data from erase block 2, but this may be delayed until there is demand
for more erased blocks (since e.g. file 1 or 3 might also have been
deleted by then). 
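
The layout example above can be replayed with a small toy model. This is only a sketch: the 8-slot erase blocks and the block names come from the example, while the helper names (`live`, `occupied_fraction`, `compact`, `delete_file`) are made up for illustration and say nothing about how real SSD firmware works.

```python
# Toy model of the example: 4 erase blocks with 8 writable blocks each.
# "--" = stale data, ".." = erased (writable), anything else = live data.
STALE, FREE = "--", ".."

blocks = [
    ["1a", "1b", "--", "--", "2a", "--", "--", "3a"],
    ["--", "--", "--", "2b", "--", "--", "--", "1c"],
    ["2c", "1d", "3b", "3c", "--", "--", "--", "--"],
    [FREE] * 8,
]

def live(block):
    """Return the writable blocks that still hold live data."""
    return [b for b in block if b not in (STALE, FREE)]

def occupied_fraction(blocks):
    """Fraction of erase blocks that are not completely erased."""
    return sum(any(b != FREE for b in blk) for blk in blocks) / len(blocks)

def compact(blocks, sources, target):
    """Copy live data from the source erase blocks into an erased target,
    then erase the sources. Returns the number of blocks copied."""
    moved = [b for s in sources for b in live(blocks[s])]
    assert all(b == FREE for b in blocks[target]), "target must be erased"
    assert len(moved) <= len(blocks[target]), "live data must fit"
    blocks[target] = moved + [FREE] * (len(blocks[target]) - len(moved))
    for s in sources:
        blocks[s] = [FREE] * len(blocks[s])
    return len(moved)

def delete_file(blocks, file_id):
    """Mark every live block of one file as stale (what TRIM would signal)."""
    for blk in blocks:
        for i, b in enumerate(blk):
            if b not in (STALE, FREE) and b.startswith(file_id):
                blk[i] = STALE

live_before = sum(len(live(b)) for b in blocks)
print(live_before, occupied_fraction(blocks))   # 10 live of 32, 3/4 occupied

copied = compact(blocks, sources=[0, 2], target=3)
print(copied, occupied_fraction(blocks))        # 8 extra writes, 2/4 occupied

delete_file(blocks, "2")
for n, blk in enumerate(blocks, start=1):
    print(f"erase block {n}: |" + "|".join(blk) + "|")
```

The 8 blocks copied by `compact` are writes the host never asked for; that is the write amplification in this tiny example.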

The garbage collection does not know which data blocks belong to which
file, and therefore it cannot collect the data belonging to a file into
a single erase block. Blocks are allocated as data comes in (as long as
enough SLC cells are available in this area, else directly in QLC
cells). Your many parallel updates will cause fractions of each larger
file to be spread out over many erase blocks. 

As you can see, a single file that is deleted may affect many erase
blocks, and you have to take redundancy into consideration, which will
multiply the effect by a factor of up to 3 for small files (one ZFS
allocation block). And remember: deleting a message in mdir format will
free the data blocks, but will also remove the directory entry, causing
additional meta-data writes (again multiplied by the raid redundancy). 

A consumer SSD would normally see only very few parallel writes, and
sequential writes of full files will have a high chance to put the data
of each file contiguously in the minimum number of erase blocks,
so that multiple complete erase blocks can be freed when such a file is
deleted, thus obviating the need for many garbage collection copies
(that occur if data from several independent files is in one erase
block). 

Actual SSDs have many more cells than advertised. Some 10% to 20% may be
kept as a reserve for aging blocks that e.g. may have failed kind of a
"read-after-write test" (implemented in the write function, which adds
charges to the cells until they return the correct read-outs). 

BTW: Having an ashift value that is lower than the internal write block
size may also lead to higher write amplification values, but a large
ashift may lead to more wasted capacity, which may become an issue if
typical file lengths are much smaller than the allocation granularity
that results from the ashift value. 

>> Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted. 
>> 
>> I SEE I SEE STEFAN... 
>> 
>> Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile. 
>> 
>> YES... AND THE DISKS FOR OUR PURPOSE... PERHAPS WEREN'T QVOS....

You should have got (much more expensive) server grade SSDs, IMHO. 

But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO drive
would result in better performance (but would need a lot of extra SATA
ports). 

In fact, I'm not sure whether rotating media and a reasonable L2ARC
consisting of a fast M.2 SSD plus a mirror of small SSDs for a LOG
device would not be a better match for your use case. Reading the L2ARC
would be very fast, writes would be purely sequential and relatively
slow, you could choose a suitable L2ARC strategy (caching of file data
vs. meta data), and the LOG device would support fast fsync() operations
required for reliable mail systems (which confirm data is on stable
storage before acknowledging the reception to the sender).

> We knew they had some speed issues, but those speed issues, we thought (as
> Samsung explains in the QVO site) they started after exceeding the speeding
> buffer this disks have. We though that meanwhile you didn't exceed it's
> capacity (the capacity of the speeding buffer) no speed problem arises. Perhaps
> we were wrong?. 
> These drives are meant for small loads in a typical PC use case,
> i.e. some installations of software in the few GB range, else only
> files of a few MB being written, perhaps an import of media files
> that range from tens to a few hundred MB at a time, but less often
> than once a day. 
> 
> WE MOVE, YOU KNOW... LOTS OF LITTLE FILES... AND LOT'S OF DIFFERENT CONCURRENT MODIFICATIONS BY 1500-2000 CONCURRENT IMAP CONNECTIONS WE HAVE...

I do not expect the read load to be a problem (except possibly when the
SSD is moving data from SLC to QLC blocks, but even then reads will get
priority). But writes and trims might very well overwhelm the SSD,
especially when it's getting full. Keeping a part of the SSD unused
(excluded from the partitions created) will lead to a large pool of
unused blocks. This will reduce the write amplification - there are many
free blocks in the "unpartitioned part" of the SSD, and thus there is
less urgency to compact partially filled blocks. (E.g. if you include
only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4
of each erase block could be free due to deletions/TRIM without any
compactions required to hold all this data.) 

Keeping a significant percentage of the SSD unallocated is a good
strategy to improve its performance and resilience. 

WELL, WE HAVE ALLOCATED ALL THE DISK SPACE... BUT NOT USED... JUST
ALLOCATED.... YOU KNOW... WE DO A ZPOOL CREATE WITH THE WHOLE DISKS.....


I think the only chance for a solution that does not require new
hardware is to make sure, only some 80% of the SSDs are used (i.e.
allocate only 80% for ZFS, leave 20% unallocated). This will
significantly reduce the rate of garbage collections and thus reduce the
load they cause. 

I'd use a fast compression algorithm (zstd - choose a level that does not
overwhelm the CPU, there are benchmark results for ZFS with zstd, and I
found zstd-2 to be best for my use case). This will more than make up
for the space you left unallocated on the SSDs. 

A different mail box format might help, too - I'm happy with dovecot's
mdbox format, which is as fast but much more efficient than mdir.

> As the SSD fills, the space available for the single level write
> cache gets smaller 
> 
> THE SINGLE LEVEL WRITE CACHE IS THE CACHE THESE SSD DRIVERS HAVE, FOR COMPENSATING THE SPEED ISSUES THEY HAVE DUE TO USING QLC MEMORY?. DO YOU REFER TO THAT?. SORRY I DON'T UNDERSTAND WELL THIS PARAGRAPH.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC
cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells
as 24 GB of data in QLC mode. 

OK, TRUE.... YES.... 

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB
(600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells). 
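
The arithmetic behind these figures can be written down in a few lines. This is a sketch under the assumptions used in this mail (4 bits per cell in QLC mode, 1 bit in SLC mode, cells counted in the same "GB of SLC data" units as above); the function names are illustrative only.

```python
BITS_QLC = 4  # bits stored per cell in QLC mode
BITS_SLC = 1  # bits stored per cell in SLC mode

def qlc_equivalent_gb(slc_cache_gb):
    """QLC capacity given up by the cells that hold an SLC cache of this size."""
    return slc_cache_gb * BITS_QLC / BITS_SLC

def usable_capacity_gb(nominal_gb, slc_cache_gb):
    """Total data the drive can hold while part of its cells run in SLC mode."""
    total_cells = nominal_gb / BITS_QLC        # e.g. 1000 GB -> 250 units
    slc_cells = slc_cache_gb / BITS_SLC        # 1 unit per GB of SLC data
    qlc_cells = total_cells - slc_cells
    if qlc_cells < 0:
        raise ValueError("SLC cache larger than the available cells")
    return qlc_cells * BITS_QLC + slc_cache_gb

print(qlc_equivalent_gb(6))            # 24.0 GB of QLC capacity for a 6 GB cache
print(usable_capacity_gb(1000, 100))   # 700.0 GB with a 100 GB SLC cache
print(usable_capacity_gb(1000, 6))     # 982.0 GB with the minimum 6 GB cache
```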

AHH! YOU MEAN THAT SLC CAPACITY FOR SPEEDING UP THE QLC DISKS, IS
OBTAINED FROM EACH SINGLE LAYER OF THE QLC?. 

There are no specific SLC cells. A fraction of the QLC capable cells is
written with only 1 instead of 4 bits. This is a much simpler
process, since only 2 charge levels per cell are used,
while QLC uses 16 charge levels, and you can only add charge (and must not
overshoot), so charge is added in small increments until the correct
value can be read out. 

But since SLC cells take away specified capacity (which is calculated
assuming all cells hold 4 bits each, not only 1 bit), their number is
limited and shrinks as demand for QLC cells grows. 

The advantage of the SLC cache is fast writes, but also that data in it
may have become stale (trimmed) and thus will never be copied over into
a QLC block. But as the SSD fills and the size of the SLC cache shrinks,
this capability will be mostly lost, and lots of very short lived data
is stored in QLC cells, which will quickly become partially stale and
thus needing compaction as explained above.

> Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells). 
> 
> SORRY I DON'T GET THIS LAST SENTENCE... DON'T UNDERSTAND IT BECAUSE I DON'T REALLY KNOW THE MEANING OF TN... 
> 
> BUT I THINK I'M GETTING THE IDEA IF YOU SAY THAT EACH QLC LAYER, HAS IT'S OWN SLC CACHE OBTAINED FROM THE DISK SPACE AVAIABLE FOR EACH QLC LAYER.... 
> 
> And with less SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased. And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them. 
> 
> IF I CORRECT ABOVE, I THINK I GOT THE IDEA YES....
> 
> (on many SSDs, I have no numbers for this
> particular device), and thus the amount of data that can be
> written at single cell speed shrinks as the SSD gets full. 
> 
> I have just looked up the size of the SLC cache, it is specified
> to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB
> version, smaller models will have a smaller SLC cache). 
> 
> ASSUMING YOU WERE TALKING ABOUT THE CACHE FOR COMPENSATING SPEED WE PREVIOUSLY COMMENTED, I SHOULD SAY THESE ARE THE 870 QVO BUT THE 8TB VERSION. SO THEY SHOULD HAVE THE BIGGEST CACHE FOR COMPENSATING THE SPEED ISSUES...

I have looked up the data: the larger versions of the 870 QVO have the
same SLC cache configuration as the 2 TB model, 6 GB minimum and up to
72 GB more if there are enough free blocks. 

OURS ONE IS THE 8TB MODEL SO I ASSUME IT COULD HAVE BIGGER LIMITS. THE
DISKS ARE MOSTLY EMPTY, REALLY.... SO... FOR INSTANCE.... 

ZPOOL LIST
NAME          SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
ROOT_DATASET  448G   2.29G  446G   -        -         1%    0%   1.00X  ONLINE  -
MAIL_DATASET  58.2T  11.8T  46.4T  -        -         26%   20%  1.00X  ONLINE  -

Ok, seems you have got 10 * 8 TB in a raidz2 configuration. 

Only 20% of the mail dataset is in use, the situation will become much
worse when the pool will fill up!

>> I SUPPOSE FRAGMENTATION AFFECTS TOO....

On magnetic media fragmentation means that a file is spread out over the
disk in a non-optimal way, causing access latencies due to seeks and
rotational delay. That kind of fragmentation is not really relevant for
SSDs, which allow for fast random access to the cells. 

And the FRAG value shown by the "zpool list" command is not about
fragmentation of files at all, it is about the structure of free space.
Anyway less relevant for SSDs than for classic hard disk drives.

> But after writing those few GB at a speed of some 500 MB/s (i.e.
> after 12 to 150 seconds), the drive will need several minutes to
> transfer those writes to the quad-level cells, and will operate
> at a fraction of the nominal performance during that time.
> (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the
> 2 TB model.) 
> 
> WELL WE ARE IN THE 8TB MODEL. I THINK I HAVE UNDERSTOOD WHAT YOU WROTE IN PREVIOUS PARAGRAPH. YOU SAID THEY CAN BE FAST BUT NOT CONSTANTLY, BECAUSE LATER THEY HAVE TO WRITE ALL THAT TO THEIR PERPETUAL STORAGE FROM THE CACHE. AND THAT'S SLOW. AM I WRONG?. EVEN IN THE 8TB MODEL YOU THINK STEFAN?.

The controller in the SSD supports a given number of channels (e.g 4),
each of which can access a Flash chip independently of the others. Small
SSDs often have less Flash chips than there are channels (and thus a
lower throughput, especially for writes), but the larger models often
have more chips than channels and thus the performance is capped. 

THIS IS TOTALLY LOGICAL. IF A QVO DISK WOULD OUTPERFORM BEST OR SIMILAR
THAN AN INTEL WITHOUT CONSEQUENCES.... WHO WAS GOING TO BUY A EXPENSIVE
INTEL ENTERPRISE?.

The QVO is bandwidth limited due to the SATA data
rate of 6 Gbit/s anyway, and it is optimized for reads (which are not
significantly slower than offered by the TLC models). This is a viable
concept for a consumer PC, but not for a server.
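
To put the cache figures quoted earlier in this thread into numbers (SLC cache of 6 GB to 78 GB, about 500 MB/s into the cache, 160 MB/s into the QLC cells), here is a rough back-of-the-envelope sketch. It assumes that the cache is not drained while it is being filled and that draining runs at the full QLC write rate; both are simplifications, and the function names are made up for this example.

```python
def seconds_to_fill(cache_gb, write_mb_s=500):
    """Time until a burst at SLC speed has filled the SLC cache."""
    return cache_gb * 1000 / write_mb_s

def seconds_to_drain(cache_gb, qlc_mb_s=160):
    """Time to move a full SLC cache into QLC cells (assumed: at QLC write speed)."""
    return cache_gb * 1000 / qlc_mb_s

for cache_gb in (6, 78):
    fill = seconds_to_fill(cache_gb)
    drain = seconds_to_drain(cache_gb)
    print(f"{cache_gb:>2} GB cache: full after {fill:.0f} s, "
          f"drained after {drain / 60:.1f} min")
```

With these assumptions the cache is full after 12 s (6 GB) or 156 s (78 GB) of sustained writes, and emptying a full 78 GB cache takes roughly 8 minutes.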

> In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model. 
> 
> TOTALLY LOGICAL STEFAN... 
> 
> If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit. 
> THE MAIN PROBLEM WE ARE FACING IS THAT IN SOME PEAK MOMENTS, WHEN THE MACHINE SERVES CONNECTIONS FOR ALL THE INSTANCES IT HAS, AND ONLY AS SAID IN SOME PEAK MOMENTS... LIKE THE 09AM OR THE 11AM.... IT SEEMS THE MACHINE BECOMES SLOWER... AND LIKE IF THE DISKS WEREN'T ABLE TO SERVE ALL THEY HAVE TO SERVE.... IN THESE MOMENTS, NO BIG FILES ARE MOVED... BUT AS WE HAVE 1800-2000 CONCURRENT IMAP CONNECTIONS... NORMALLY THEY ARE DOING EACH ONE... LITTLE CHANGES IN THEIR MAILBOX. DO YOU THINK PERHAPS THIS DISKS THEN ARE NOT APPROPRIATE FOR THIS KIND OF USAGE?-

I'd guess that the drives get into a state in which they have to recycle
lots of partially free blocks (i.e. perform kind of a garbage
collection) and then three kinds of operations are competing with each
other: 

 	1. reads (generally prioritized)
 	2. writes (filling the SLC cache up to its maximum size)
 	3. compactions of partially filled blocks (required to make free blocks
available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a
filled SSD with partially filled erase blocks means that operations of
type 3 need to be performed with priority to not stall all writes. 

My assumption is that this is what you are observing under peak load. 

IT COULD BE ALTHOUGH THE DISKS ARE NOT FILLED.... THE POOL ARE AT 20 OR
30% OF CAPACITY AND FRAGMENTATION FROM 20%-30% (AS ZPOOL LIST STATES).

Yes, and that means that your issues will become much more critical over
time when the free space shrinks and garbage collections will be
required at an even faster rate, with the SLC cache becoming less and
less effective to weed out short lived files as an additional factor
that will increase write amplification.

> And cheap SSDs often have no RAM cache (not checked, but I'd be
> surprised if the QVO had one) and thus cannot keep bookkeeping data
> in such a cache, further limiting the performance under load. 
> 
> THIS BROCHURE (HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/BROCHURE/870_SERIES_BROCHURE.PDF AND THE DATASHEET HTTPS://SEMICONDUCTOR.SAMSUNG.COM/RESOURCES/DATA-SHEET/SAMSUNG_SSD_870_QVO_DATA_SHEET_REV1.1.PDF) SAIS IF I HAVE READ PROPERLY, THE 8TB DRIVE HAS 8GB OF RAM?. I ASSUME THAT IS WHAT THEY CALL THE TURBO WRITE CACHE?.

No, the turbo write cache consists of the cells used in SLC mode (which
can be any cells, not only cells in a specific area of the flash chip). 

I SEE I SEE.... 

The RAM is needed for fast lookup of the position of data for reads and
of free blocks for writes. 

OUR ONES... SEEM TO HAVE 8GB LPDDR4 OF RAM.... AS DATASHEET STATES.... 

Yes, and it makes sense that the RAM size is proportional to the
capacity since a few bytes are required per addressable data block. 

If the block size was 8 KB the RAM could hold 8 bytes (e.g. a pointer
and some status flags) for each logically addressable block. But there
is no information about the actual internal structure of the QVO that I
know of. [...]
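
The estimate above is simple division. As a sketch, assuming binary units and the hypothetical 8 KB block size mentioned above (the real internal block size of the QVO is not known):

```python
KIB, GIB, TIB = 2**10, 2**30, 2**40

def mapping_bytes_per_block(capacity_bytes, ram_bytes, block_bytes):
    """RAM available per logically addressable block of the drive."""
    n_blocks = capacity_bytes // block_bytes
    return ram_bytes / n_blocks

# 8 TB drive, 8 GB of LPDDR4, assumed 8 KB blocks
print(mapping_bytes_per_block(8 * TIB, 8 * GIB, 8 * KIB))   # 8.0
```

A smaller assumed block size would leave proportionally fewer bytes per block, e.g. 4 bytes at 4 KB.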

>> I SEE.... IT'S EXTREMELY MISLEADING YOU KNOW... BECAUSE... YOU CAN COPY FIVE MAILBOXES OF 50GB CONCURRENTLY FOR INSTANCE.... AND YOU FLOOD A GIGABIT INTERFACE COPYING (OBVIOUSLY BECAUSE DISKS CAN KEEP THAT THROUGHPUT)... BUT LATER.... YOU SEE... YOU ARE IN AN HOUR THAT YESTERDAY, AND EVEN 4 DAYS BEFORE YOU HAVE NOT HAD ANY ISSUES... AND THAT DAY... YOU SEE THE COMMENTED ISSUE... EVEN NOT BEING EXACTLY AT A PEAK HOUR (PERHAPS IS TWO HOURS LATER THE PEAK HOUR EVEN)... OR... BUT I WASN'T NOTICING ABOUT ALL THINGS YOU SAY IN THIS EMAIL.... 
>> 
>> I have seen advice to not use compression in a high load scenario in some other reply. 
>> 
>> I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels. 
>> 
>> I SEE... YOU SAY COMPRESSION SHOULD BE ENABLED.... THE MAIN REASON BECAUSE WE HAVE NOT ENABLED IT YET, IS FOR KEEPING THE SYSTEM THE MOST NEAR POSSIBLE TO CONFIG DEFAULTS... YOU KNOW... FOR LATER BEING ABLE TO ASK IN THIS MAILING LISTS IF WE HAVE AN ISSUE... BECAUSE YOU KNOW... IT WOULD BE FAR MORE EASIER TO ASK ABOUT SOMETHING STRANGE YOU ARE SEEING WHEN THAT STRANGE THING IS NEAR TO A WELL TESTED CONFIG, LIKE THE CONFIG BY DEFAULT.... 
>> 
>> BUT NOW YOU SAY STEFAN... IF YOU SWITCH BETWEEN COMPRESSION ALGORITHMS YOU WILL END UP WITH A MIX OF DIFFERENT FILES COMPRESSED IN A DIFFERENT MANNER... THAT IS NOT A BIT DISASTER LATER?. DOESN'T AFFECT PERFORMANCE IN SOME MANNER?.
 The compression used is stored in the per file information, each file
in a dataset could have been written with a different compression method
and level. Blocks are independently compressed - a file level
compression may be more effective. Large mail files will contain
incompressible attachments (already compressed), but in base64 encoding.
This should allow a compression ratio of ~1.3. Small files will be plain
text or HTML, offering much better compression factors.

>> One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes). 
>> 
>> WE WOULD USE I ASSUME THE LZ4... WHICH IS THE LESS "EXPENSIVE" COMPRESSION ALGORITHM FOR THE CPU... AND I ASSUME TOO FOR AVOIDING DELAY ACCESSING DATA... DO YOU RECOMMEND ANOTHER ONE?. DO YOU ALWAYS RECOMMEND COMPRESSION THEN?.

I'd prefer zstd over lz4 since it offers a much higher compression
ratio. 

Zstd offers higher compression ratios than lz4 at similar or better
decompression speed, but may be somewhat slower compressing the data.
But in my opinion this is outweighed by the higher effective amount of
data in the ARC/L2ARC possible with zstd. 

For some benchmarks of different compression algorithms available for
ZFS and compared to uncompressed mode see the extensive results
published by Allan Jude:

https://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2yPtiUQ/edit?usp=sharing

The SQL benchmarks might best resemble your use case - but remember that
a significant reduction of the amount of data being written to the SSDs
might be more important than the highest transaction rate, since your
SSDs put a low upper limit on that when highly loaded.

>> I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load. 
>> 
>> BUT WHEN YOU ENABLE COMPRESSION... ONLY GETS COMPRESSED THE NEW DATA MODIFIED OR ENTERED. AM I WRONG?.
 Compression is per file system data block (at most 1 MB if you set the
blocksize to that value). Each such block is compressed independently of
all others, to not require more than 1 block to be read and decompressed
when randomly reading a file. If a block does not shrink when compressed
(it may contain compressed file data) the block is written to disk as-is
(uncompressed).
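
The per-block behaviour described above can be illustrated with a short sketch. It is not ZFS code: `zlib` from the Python standard library stands in for zstd/lz4, the 128 KiB block size is an assumed record size, and the rule "store the raw block if compression does not make it smaller" is a simplification of whatever threshold ZFS really applies. All function names are made up for the example.

```python
import base64
import os
import zlib

BLOCK_SIZE = 128 * 1024  # assumed record size for this sketch

def compress_blocks(data, block_size=BLOCK_SIZE):
    """Compress every block on its own; keep the raw block if it does not shrink."""
    stored = []
    for off in range(0, len(data), block_size):
        raw = data[off:off + block_size]
        packed = zlib.compress(raw, 6)
        stored.append(("zlib", packed) if len(packed) < len(raw) else ("raw", raw))
    return stored

def read_block(stored, index):
    """Random read: only the requested block has to be decompressed."""
    kind, payload = stored[index]
    return zlib.decompress(payload) if kind == "zlib" else payload

def ratio(data):
    """Logical size divided by stored size."""
    stored = compress_blocks(data)
    return len(data) / sum(len(payload) for _, payload in stored)

binary = os.urandom(256 * 1024)                       # already compressed data
attachment = base64.b64encode(os.urandom(192 * 1024)) # same kind of data as base64
text = b"Subject: test\r\nHello, this is a plain text mail.\r\n" * 5000

for name, data in (("binary", binary), ("base64", attachment), ("text", text)):
    print(f"{name:7s} ratio {ratio(data):.2f}")
```

Incompressible data ends up stored as-is (ratio 1.0), base64-encoded attachments land near the ~1.3 mentioned earlier in this mail, and repetitive plain text compresses far better.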

>> BY THE WAY, WE HAVE MORE OR LESS 1/4 OF EACH DISK USED (12 TB ALLOCATED IN A POLL STATED BY ZPOOL LIST, DIVIDED BETWEEN 8 DISKS OF 8TB...)... DO YOU THINK WE COULD BE SUFFERING ON WRITE AMPLIFICATION AND SO... HAVING A SO LITTLE DISK SPACE USED IN EACH DISK?.
 Your use case will cause a lot of garbage collections and thus
particularly high write amplification values.

>> Regards, STefan 
>> 
>> HEY MATE, YOUR MAIL IS INCREDIBLE. IT HAS HELPED AS A LOT. CAN WE INVITE YOU A CUP OF COFFEE OR A BEER THROUGH PAYPAL OR SIMILAR?. CAN I HELP YOU IN SOME MANNER?.

Thanks, I'm glad to help, and I'd appreciate hearing whether you get
your setup optimized for the purpose (and how well it holds up when you
approach the capacity limits of your drives). 

I'm always interested in experience of users with different use cases
than I have (just being a developer with too much archived mail and
media collected over a few decades). 

Regards, STefan
--=_fe85fe2db536d584a8585a29da09a2df
Content-Transfer-Encoding: quoted-printable
Content-Type: text/html; charset=UTF-8

<html><head><meta http-equiv=3D"Content-Type" content=3D"text/html; charset=
=3DUTF-8" /></head><body style=3D'font-size: 10pt; font-family: Verdana,Gen=
eva,sans-serif'>
<p>Hi Stefan,</p>
<p><br /></p>
<p>Again extremely grateful. It's an absolute honor to receive your help.=
=2E really....</p>
<p><br /></p>
<p>I have read this mail now but I need to read it slower and in a more rel=
axed way.... When I do that I'll answer you (during the weekend or on Monda=
y at most).</p>
<p><br /></p>
<p>Don't worry I will keep you updated with news :) :) . I promise :) :)</p=
>
<p><br /></p>
<p>Cheers!</p>
<div>&nbsp;</div>
<p><br /></p>
<p>El 2022-04-08 13:14, Stefan Esser escribi&oacute;:</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0"><!-- html ignored --><br />
<table style=3D"background-color: yellow; border: 1px solid red; font-famil=
y: Arial; font-size: 12px; font-weight: bold;" width=3D"100%">
<tbody>
<tr>
<td><span style=3D"color: red;">ATENCION:</span> Este correo se ha enviado =
desde fuera de la organizaci&oacute;n. No pinche en los enlaces ni abra los=
 adjuntos a no ser que reconozca el remitente y sepa que el contenido es se=
guro.</td>
</tr>
</tbody>
</table>
<br /> <!-- meta ignored -->
<div class=3D"moz-cite-prefix">Am 07.04.22 um 14:30 schrieb <a class=3D"moz=
-txt-link-abbreviated moz-txt-link-freetext" href=3D"mailto:egoitz@ramattac=
k.net">egoitz@ramattack.net</a>:</div>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">El 2022-04-06 23:49, Stefan Esser escribi&oacute;:
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>El 2022-04-06 17:43, Stefan Esser escribi&oacute;:</p>
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><br /> Am 06.04.22 um 16:36 schrieb <a class=3D"moz-txt-link-abbreviated m=
oz-txt-link-freetext" href=3D"mailto:egoitz@ramattack.net">egoitz@ramattack=
=2Enet</a>:
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">Hi Rainer!<br /> <br /> Thank you so much for your help :) :)<br />=
 <br /> Well I assume they are in a datacenter and should not be a power ou=
tage....<br /> <br /> About dataset size... yes... our ones are big... they=
 can be 3-4 TB easily each<br /> dataset.....<br /> <br /> We bought them, =
because as they are for mailboxes and mailboxes grow and<br /> grow.... for=
 having space for hosting them...</blockquote>
<br /> Which mailbox format (e.g. mbox, maildir, ...) do you use?</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">I'm running Cyrus imap so sort of =
Maildir... too many little files normally..... Sometimes directories with t=
ons of little files....</span></strong></div>
</blockquote>
</blockquote>
<p>Assuming that many mails are much smaller than the erase block size of t=
he SSD, this may cause issues. (You may know the following ...)</p>
<p>For example, if you have message sizes of 8 KB and an erase block size o=
f 64 KB (just guessing), then 8 mails will be in an erase block. If half th=
e mails are deleted, then the erase block will still occupy 64 KB, but only=
 hold 32 KB of useful data (and the SSD will only be aware of this fact if =
TRIM has signaled which data is no longer relevant). The SSD will copy seve=
ral partially filled erase blocks together in a smaller number of free bloc=
ks, which then are fully utilized. Later deletions will repeat this game, a=
nd your data will be copied multiple times until it has aged (and the user =
is less likely to delete further messages). This leads to "write amplificat=
ion" - data is internally moved around and thus written multiple times.</p>
<p><br /></p>
<p><strong><span style=3D"color: #0000ff;">Stefan!! You are nice!! I think =
this could explain our whole problem: why the performance degradation looks=
 so random and does not necessarily match the peak I/O hours... I could cau=
se that performance degradation just by deleting a couple of huge mail fold=
ers (perhaps 200,000 mails) at a medium-traffic hour!!</span></strong></p>
</blockquote>
</blockquote>
Yes, if deleting large amounts of data triggers performance issues (and the=
 disk does not have a deficient TRIM implementation), then the issue is lik=
ely to be due to internal garbage collections colliding with other operatio=
ns.<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p><strong><span style=3D"color: #0000ff;">The problem is that, as far as =
I know, the erase block size of an SSD is fixed in the disk firmware. I don=
't really know whether it could be modified with Samsung Magician or a simi=
lar Samsung tool... otherwise I don't really see a way of improving it... b=
ecause apart from that, you are deleting a file in a raidz-2 array... not j=
ust on one disk... I assume that aligning the chunk size with the record si=
ze and with the "secret" erase size of the SSD could perhaps mitigate it sl=
ightly?</span></strong></p>
</blockquote>
</blockquote>
<p>The erase block size is a fixed hardware feature of each flash chip. The=
re is a block size for writes (e.g. 8 KB) and many such blocks are combined=
 in one erase block (of e.g. 64 KB, probably larger in today's SSDs); they =
can only be returned to the free block pool all together. And if some of th=
ese writable blocks hold live data, they must be preserved by collecting th=
em in newly allocated free blocks.</p>
<p>An example of what might happen, showing a simplified layout of files 1,=
 2, 3 (with writable blocks 1a, 1b, ..., 2a, 2b, ... and "--" for stale dat=
a of deleted files, ".." for erased/writable flash blocks) in an SSD might =
be:</p>
<p><span style=3D"font-family: monospace;">erase block 1: |1a|1b|--|--|2a|-=
-|--|3a|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 2: |--|--|--|2b|--|-=
-|--|1c|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 3: |2c|1d|3b|3c|--|-=
-|--|--|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 4: |..|..|..|..|..|=
=2E.|..|..|</span></p>
<p>This is just a random example of how data could be laid out on the physi=
cal storage array. It is assumed that the 3 erase blocks once were complete=
ly occupied.</p>
<p>In this example, 10 of 32 writable blocks are occupied, and only one fre=
e erase block exists.</p>
<p>This situation must not persist, since the SSD needs more empty erase bl=
ocks. 10/32 of the capacity is used for data, but 3/4 of the blocks are occ=
upied and not immediately available for new data.</p>
<p>The garbage collection might combine erase blocks 1 and 3 into a current=
ly free one, e.g. erase block 4:</p>
<span style=3D"font-family: monospace;">erase block 1: |..|..|..|..|..|..|=
=2E.|..| </span>
<p><span style=3D"font-family: monospace;">erase block 2: |--|--|--|2b|--|-=
-|--|1c|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 3: |..|..|..|..|..|=
=2E.|..|..|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 4: |1a|1b|2a|3a|2c|1=
d|3b|3c|</span></p>
<p>Now only 2/4 of the capacity is not available for new data (which is sti=
ll a lot more than 10/32, but better than before).</p>
<p>Now assume file 2 is deleted:<span style=3D"font-family: monospace;"><br=
 /> </span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 1: |..|..|..|..|..|=
=2E.|..|..| </span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 2: |--|--|--|--|--|-=
-|--|1c|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 3: |..|..|..|..|..|=
=2E.|..|..|</span></p>
<span style=3D"font-family: monospace;"> </span>
<p><span style=3D"font-family: monospace;">erase block 4: |1a|1b|--|3a|--|1=
d|3b|3c|</span></p>
<p>There is now a new sparsely used erase block 4, and it will soon need to=
 be garbage collected, too - in fact it could be combined with the live dat=
a from erase block 2, but this may be delayed until there is demand for mor=
e erased blocks (since e.g. file 1 or 3 might also have been deleted by the=
n).</p>
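<p>The compaction step in the example above can be replayed with a small
toy model. This is only a sketch under stated assumptions: the 8 pages
per erase block come from the example, and the helper names (parse,
is_live, live_pages, occupied, compact) are made up for illustration.
It says nothing about how the firmware of a real SSD works.</p>
<pre>
```python
# Toy model of the layout above: 4 erase blocks with 8 pages each.
# "--" is stale data, ".." an erased page, anything else a live page.
PAGES = 8


def parse(row):
    """Turn a row like "|1a|--|..|" into a list of page labels."""
    return row.strip("|").split("|")


def is_live(page):
    return page not in ("--", "..")


def live_pages(blocks):
    return sum(is_live(p) for b in blocks.values() for p in b)


def occupied(blocks):
    """Number of erase blocks that are not completely erased."""
    return sum(any(p != ".." for p in b) for b in blocks.values())


def compact(blocks, sources, target):
    """Copy the live pages of the source blocks into the erased target
    block, then erase the sources. Returns the number of pages copied."""
    live = [p for i in sources for p in blocks[i] if is_live(p)]
    if len(live) not in range(PAGES + 1):
        raise ValueError("live pages do not fit into one erase block")
    blocks[target] = live + [".."] * (PAGES - len(live))
    for i in sources:
        blocks[i] = [".."] * PAGES
    return len(live)


blocks = {
    1: parse("|1a|1b|--|--|2a|--|--|3a|"),
    2: parse("|--|--|--|2b|--|--|--|1c|"),
    3: parse("|2c|1d|3b|3c|--|--|--|--|"),
    4: parse("|..|..|..|..|..|..|..|..|"),
}
print(live_pages(blocks), occupied(blocks))  # 10 3
copied = compact(blocks, [1, 3], 4)
print(copied, occupied(blocks))              # 8 2
```
</pre>
<p>In this toy run, 8 of the 10 live pages are written a second time
only to gain one additional free erase block. That internal rewriting is
the write amplification described above.</p>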
<p>The garbage collection does not know which data blocks belong to which f=
ile, and therefore it cannot collect the data belonging to a file into a si=
ngle erase block. Blocks are allocated as data comes in (as long as enough =
SLC cells are available in this area, else directly in QLC cells). Your man=
y parallel updates will cause fractions of each larger file to be spread ou=
t over many erase blocks.</p>
<p>As you can see, a single file that is deleted may affect many erase bloc=
ks, and you have to take redundancy into consideration, which will multiply=
 the effect by a factor of up to 3 for small files (one ZFS allocation bloc=
k). And remember: deleting a message in maildir format will free the data b=
locks, but will also remove the directory entry, causing additional metadat=
a writes (again multiplied by the raid redundancy).</p>
<p><br /></p>
<p>A consumer SSD would normally see only very few parallel writes, and seq=
uential writes of full files will have a high chance to put the data of eac=
h file contiguously in the minimum number of erase blocks, allowing to free=
 multiple complete erase blocks when such a file is deleted and thus obviat=
ing the need for many garbage collection copies (that occur if data from se=
veral independent files is in one erase block).</p>
<p>Actual SSDs have many more cells than advertised. Some 10% to 20% may be=
 kept as a reserve for aging blocks that e.g. may have failed kind of a "re=
ad-after-write test" (implemented in the write function, which adds charges=
 to the cells until they return the correct read-outs).</p>
<p>BTW: Having an ashift value that is lower than the internal write block =
size may also lead to higher write amplification values, but a large ashift=
 may lead to more wasted capacity, which may become an issue if typical fil=
e length are much smaller than the allocation granularity that results from=
 the ashift value.</p>
<p><br /></p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>Larger mails are less of an issue since they span multiple erase blocks,=
 which will be completely freed when such a message is deleted.</p>
<p><strong><span style=3D"color: #0000ff;">I see I see Stefan...</span></st=
rong></p>
<p>Samsung has a lot of experience and generally good strategies to deal wi=
th such a situation, but SSDs specified for use in storage systems might be=
 much better suited for that kind of usage profile.</p>
<p><strong><span style=3D"color: #0000ff;">Yes... and the disks for our pur=
pose... perhaps weren't QVOs....</span></strong></p>
</blockquote>
</blockquote>
<p>You should have got (much more expensive) server grade SSDs, IMHO.</p>
<p>But even 4 * 2 TB QVO (or better EVO) drives per each 8 TB QVO drive wou=
ld result in better performance (but would need a lot of extra SATA ports)=
=2E</p>
<p>In fact, I'm not sure whether rotating media and a reasonable L2ARC cons=
isting of a fast M.2 SSD plus a mirror of small SSDs for a LOG device would=
 not be a better match for your use case. Reading the L2ARC would be very f=
ast, writes would be purely sequential and relatively slow, you could choos=
e a suitable L2ARC strategy (caching of file data vs. meta data), and the L=
OG device would support fast fsync() operations required for reliable mail =
systems (which confirm data is on stable storage before acknowledging the r=
eception to the sender).</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">We knew they had some speed issues, but those speed issues, we thou=
ght (as<br /> Samsung explains in the QVO site) they started after exceedin=
g the speeding<br /> buffer this disks have. We though that meanwhile you d=
idn't exceed it's<br /> capacity (the capacity of the speeding buffer) no s=
peed problem arises. Perhaps<br /> we were wrong?.</blockquote>
<br /> These drives are meant for small loads in a typical PC use case,<br =
/> i.e. some installations of software in the few GB range, else only<br />=
 files of a few MB being written, perhaps an import of media files<br /> th=
at range from tens to a few hundred MB at a time, but less often<br /> than=
 once a day.</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">We move, you know... lots of littl=
e files... and lot's of different concurrent modifications by 1500-2000 con=
current imap connections we have...</span></strong></div>
</blockquote>
</blockquote>
<p>I do not expect the read load to be a problem (except possibly when the =
SSD is moving data from SLC to QLC blocks, but even then reads will get pri=
ority). But writes and trims might very well overwhelm the SSD, especially =
when it's getting full. Keeping a part of the SSD unused (excluded from the=
 partitions created) will lead to a large pool of unused blocks. This will=
 r=
educe the write amplification - there are many free blocks in the "unpartit=
ioned part" of the SSD, and thus there is less urgency to compact partially=
 filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a part=
ition used for the ZPOOL, then 1/4 of each erase block could be free due to=
 deletions/TRIM without any compactions required to hold all this data.)</p=
>
<p>Keeping a significant percentage of the SSD unallocated is a good strate=
gy to improve its performance and resilience.</p>
<p><strong><span style=3D"color: #0000ff;">Well, we have allocated all the =
disk space... but not used... just allocated.... you know... we do a zpool =
create with the whole disks.....</span></strong><span style=3D"color: #0000=
ff;"></span></p>
</blockquote>
</blockquote>
<p>I think the only chance for a solution that does not require new hardwar=
e is to make sure, only some 80% of the SSDs are used (i.e. allocate only 8=
0% for ZFS, leave 20% unallocated). This will significantly reduce the rate=
 of garbage collections and thus reduce the load they cause.</p>
<p>I'd use a fast compression algorithm (zstd - choose a level that does no=
t overwhelm the CPU, there are benchmark results for ZFS with zstd, and I f=
ound zstd-2 to be best for my use case). This will more than make up for th=
e space you left unallocated on the SSDs.</p>
<p>A different mailbox format might help, too - I'm happy with Dovecot's m=
dbox format, which is as fast but much more efficient than maildir.</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>As the SSD fills, the space available for the single level write<br /> cac=
he gets smaller</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">The single level write cache is th=
e cache these ssd drivers have, for compensating the speed issues they have=
 due to using qlc memory?. Do you refer to that?. Sorry I don't understand =
well this paragraph.</span></strong></div>
</blockquote>
</blockquote>
<p>Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC =
cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as=
 24 GB of data in QLC mode.</p>
<p><strong><span style=3D"color: #0000ff;">Ok, true.... yes....</span></str=
ong></p>
<p>A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (60=
0 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).</p>
<p><strong><span style=3D"color: #0000ff;">Ahh! you mean that SLC capacity =
for speeding up the QLC disks, is obtained from each single layer of the QL=
C?.</span></strong></p>
</blockquote>
</blockquote>
<p>There are no specific SLC cells. A fraction of the QLC capable cells is =
written with only 1 instead of 4 bits. This is a much simpler process, sinc=
e only 2 charge levels per cell are used, while QLC uses 16 charge levels (=
and since you can only add charge and must not overshoot, only small increm=
ents are added until the correct value can be read out).</p>
<p>But since SLC cells take away specified capacity (which is calculated as=
suming all cells hold 4 bits each, not only 1 bit), their number is limited=
 and shrinks as demand for QLC cells grows.</p>
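<p>A minimal sketch of the cell arithmetic behind the 6 GB and 100 GB
figures quoted above. It assumes a nominal 1 TB (1000 GB) drive in which
every cell can be used either in QLC mode (4 bits) or in SLC mode
(1 bit). The function name is made up for illustration.</p>
<pre>
```python
BITS_QLC = 4  # bits stored per cell in QLC mode
BITS_SLC = 1  # bits stored per cell in SLC mode


def capacity_with_slc_cache(nominal_gb, slc_gb):
    """Return (qlc_gb, total_gb) while slc_gb of data sits in SLC mode.

    Cells holding slc_gb in SLC mode could have held four times as much
    in QLC mode, so that amount is missing from the QLC capacity.
    """
    qlc_equivalent = slc_gb * BITS_QLC // BITS_SLC
    qlc_gb = nominal_gb - qlc_equivalent
    return qlc_gb, qlc_gb + slc_gb


print(capacity_with_slc_cache(1000, 100))  # (600, 700)
print(capacity_with_slc_cache(1000, 6))    # (976, 982)
```
</pre>
<p>Every GB held in SLC mode costs 3 GB of total capacity in this model,
which is why the cache has to shrink as the drive fills.</p>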
<p>The advantage of the SLC cache is fast writes, but also that data in it =
may have become stale (trimmed) and thus will never be copied over into a Q=
LC block. But as the SSD fills and the size of the SLC cache shrinks, this =
capability will be mostly lost, and lots of very short lived data is stored=
 in QLC cells, which will quickly become partially stale and thus needing c=
ompaction as explained above.</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>Therefore, the fraction of the cells used as an SLC cache is reduced whe=
n it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cell=
s).</p>
<p><span style=3D"color: #0000ff;"><strong>Sorry I don't get this last sent=
ence... don't understand it because I don't really know the meaning of tn=
=2E.. </strong></span></p>
<p><span style=3D"color: #0000ff;"><strong>but I think I'm getting the idea=
 if you say that each QLC layer, has it's own SLC cache obtained from the d=
isk space avaiable for each QLC layer....</strong></span></p>
<p>And with less SLC cells available for short term storage of data the pro=
bability of data being copied to QLC cells before the irrelevant messages h=
ave been deleted is significantly increased. And that will again lead to ma=
ny more blocks with "holes" (deleted messages) in them, which then need to =
be copied possibly multiple times to compact them.</p>
<p><strong><span style=3D"color: #0000ff;">If I correct above, I think I go=
t the idea yes....</span></strong></p>
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;"><span style=3D"font-family: monospace;">(on many SSDs, I have no nu=
mbers for this</span><br />
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>particular device), and thus the amount of data that can be<br /> written =
at single cell speed shrinks as the SSD gets full.</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><br /> I have just looked up the size of the SLC cache, it is specified<br=
 /> to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB<br />=
 version, smaller models will have a smaller SLC cache).</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">Assuming you were talking about th=
e cache for compensating speed we previously commented, I should say these =
are the 870 QVO but the 8TB version. So they should have the biggest cache =
for compensating the speed issues...</span></strong></div>
</blockquote>
</blockquote>
<p>I have looked up the data: the larger versions of the 870 QVO have the s=
ame SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB=
 more if there are enough free blocks.</p>
<p><strong><span style=3D"color: #0000ff;">Ours one is the 8TB model so I a=
ssume it could have bigger limits. The disks are mostly empty, really.... s=
o... for instance....</span></strong></p>
<p><strong><span style=3D"color: #0000ff;">zpool list</span></strong><br />=
 <strong><span style=3D"color: #0000ff;">NAME&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;=
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; SIZE&nbsp; ALLOC&nbsp;&nbsp; FRE=
E&nbsp; CKPOINT&nbsp; EXPANDSZ&nbsp;&nbsp; FRAG&nbsp;&nbsp;&nbsp; CAP&nbsp;=
 DEDUP&nbsp; HEALTH&nbsp; ALTROOT</span></strong><br /> <strong><span style=
=3D"color: #0000ff;">root_dataset&nbsp; 448G&nbsp; 2.29G&nbsp;&nbsp; 446G&n=
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb=
sp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;&nbsp; 1%&nbsp;&nbsp;&nbsp;&nbsp; 0%&nbs=
p; 1.00x&nbsp; ONLINE&nbsp; -</span></strong><br /> <strong><span style=3D"=
color: #0000ff;">mail_dataset&nbsp; 58.2T&nbsp; 11.8T&nbsp; 46.4T&nbsp;&nbs=
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp=
;&nbsp; -&nbsp;&nbsp;&nbsp; 26%&nbsp;&nbsp;&nbsp; 20%&nbsp; 1.00x&nbsp; ONL=
INE&nbsp; -</span></strong></p>
</blockquote>
</blockquote>
<p>Ok, seems you have got 10 * 8 TB in a raidz2 configuration.</p>
<p>Only 20% of the mail dataset is in use; the situation will become much w=
orse when the pool fills up!</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p><strong><span style=3D"color: #0000ff;">I suppose fragmentation affects =
too....</span></strong></p>
</blockquote>
</blockquote>
<p>On magnetic media fragmentation means that a file is spread out over the=
 disk in a non-optimal way, causing access latencies due to seeks and rotat=
ional delay. That kind of fragmentation is not really relevant for SSDs, wh=
ich allow for fast random access to the cells.</p>
<p>And the FRAG value shown by the "zpool list" command is not about fragme=
ntation of files at all, it is about the structure of free space. Anyway le=
ss relevant for SSDs than for classic hard disk drives.</p>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>But after writing those few GB at a speed of some 500 MB/s (i.e.<br /> aft=
er 12 to 150 seconds), the drive will need several minutes to<br /> transfe=
r those writes to the quad-level cells, and will operate<br /> at a fractio=
n of the nominal performance during that time.<br /> (QLC writes max out at=
 80 MB/s for the 1 TB model, 160 MB/s for the<br /> 2 TB model.)</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">Well we are in the 8TB model. I th=
ink I have understood what you wrote in previous paragraph. You said they c=
an be fast but not constantly, because later they have to write all that to=
 their perpetual storage from the cache. And that's slow. Am I wrong?. Even=
 in the 8TB model you think Stefan?.</span></strong></div>
</blockquote>
</blockquote>
<p>The controller in the SSD supports a given number of channels (e.g. 4), =
each of which can access a Flash chip independently of the others. Small SS=
Ds often have fewer Flash chips than there are channels (and thus a lower t=
hroughput, especially for writes), but the larger models often have more ch=
ips than channels and thus the performance is capped.</p>
<p><strong><span style=3D"color: #0000ff;">This is totally logical. If a QV=
O disk would outperform best or similar than an Intel without consequences=
=2E... who was going to buy a expensive Intel enterprise?.</span></strong><=
/p>
</blockquote>
</blockquote>
The QVO is bandwidth limited due to the SATA data rate of 6 Gbit/s anyway, =
and it is optimized for reads (which are not significantly slower than offe=
red by the TLC models). This is a viable concept for a consumer PC, but not=
 for a server.<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>In the case of the 870 QVO, the controller supports 8 channels, which al=
lows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has=
 only 4 Flash chips and is thus limited to 80 MB/s in that situation, while=
 the larger versions have 8, 16, or 32 chips. But due to the limited number=
 of channels, the write rate is limited to 160 MB/s even for the 8 TB model=
=2E</p>
<p><strong><span style=3D"color: #0000ff;">Totally logical Stefan...</span>=
</strong></p>
<p>If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in thi=
s limit.</p>
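<p>A rough back-of-the-envelope sketch of the fill and drain times
implied by the numbers in this thread: 500 MB/s into the SLC cache,
160 MB/s into the QLC cells, and cache sizes of 6 GB and 78 GB. It
assumes decimal units and a drive that does nothing else while draining,
which is a simplification. The function name is made up.</p>
<pre>
```python
def seconds(gigabytes, megabytes_per_second):
    """Time to move the given amount of data at the given rate."""
    return gigabytes * 1000 / megabytes_per_second


for cache_gb in (6, 78):
    fill = seconds(cache_gb, 500)   # host writes at SLC speed
    drain = seconds(cache_gb, 160)  # folding the cache into QLC cells
    print(cache_gb, "GB: fill", fill, "s, drain", drain, "s")
# 6 GB: fill 12.0 s, drain 37.5 s
# 78 GB: fill 156.0 s, drain 487.5 s
```
</pre>
<p>Under these assumptions a full 78 GB cache takes about 8 minutes to
drain, roughly three times as long as it took to fill.</p>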
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><span style=3D"color: #008000;"><strong>The main problem we are facing is =
that in some peak moments, when the machine serves connections for all the =
instances it has, and only as said in some peak moments... like the 09am or=
 the 11am.... it seems the machine becomes slower... and like if the disks =
weren't able to serve all they have to serve.... In these moments, no big f=
iles are moved... but as we have 1800-2000 concurrent imap connections... n=
ormally they are doing each one... little changes in their mailbox. Do you =
think perhaps this disks then are not appropriate for this kind of usage?-<=
/strong></span></div>
</blockquote>
</blockquote>
<p>I'd guess that the drives get into a state in which they have to recycle=
 lots of partially free blocks (i.e. perform kind of a garbage collection) =
and then three kinds of operations are competing with each other:</p>
<ol>
<li>reads (generally prioritized)</li>
<li>writes (filling the SLC cache up to its maximum size)</li>
<li>compactions of partially filled blocks (required to make free blocks av=
ailable for re-use)</li>
</ol>
<p>Writes can only proceed if there are sufficient free blocks, which on a =
filled SSD with partially filled erase blocks means that operations of type=
 3. need to be performed with priority to not stall all writes.</p>
<p>My assumption is that this is what you are observing under peak load.</p=
>
<p><strong><span style=3D"color: #0000ff;">It could be although the disks a=
re not filled.... the pool are at 20 or 30% of capacity and fragmentation f=
rom 20%-30% (as zpool list states).</span></strong></p>
</blockquote>
</blockquote>
Yes, and that means that your issues will become much more critical over ti=
me when the free space shrinks and garbage collections will be required at =
an even faster rate, with the SLC cache becoming less and less effective to=
 weed out short lived files as an additional factor that will increase writ=
e amplification.<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>And cheap SSDs often have no RAM cache (not checked, but I'd be<br /> surp=
rised if the QVO had one) and thus cannot keep bookkeeping data<br /> in su=
ch a cache, further limiting the performance under load.</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
>&nbsp;</div>
<div class=3D"pre" style=3D"margin: 0; padding: 0; font-family: monospace;"=
><strong><span style=3D"color: #008000;">This brochure (<a class=3D"moz-txt=
-link-freetext" style=3D"color: #008000;" href=3D"https://semiconductor.sam=
sung.com/resources/brochure/870_Series_Brochure.pdf" target=3D"_blank" rel=
=3D"noopener noreferrer">https://semiconductor.samsung.com/resources/brochu=
re/870_Series_Brochure.pdf</a> and the datasheet <a class=3D"moz-txt-link-f=
reetext" href=3D"https://semiconductor.samsung.com/resources/data-sheet/Sam=
sung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf" target=3D"_blank" rel=3D"noopener n=
oreferrer">https://semiconductor.samsung.com/resources/data-sheet/Samsung_S=
SD_870_QVO_Data_Sheet_Rev1.1.pdf</a>) sais if I have read properly, the 8TB=
 drive has 8GB of ram?. I assume that is what they call the turbo write cac=
he?.</span></strong></div>
</blockquote>
</blockquote>
<p>No, the turbo write cache consists of the cells used in SLC mode (which =
can be any cells, not only cells in a specific area of the flash chip).</p>
<p><strong><span style=3D"color: #0000ff;">I see I see....</span></strong><=
/p>
<p>The RAM is needed for fast lookup of the position of data for reads and =
of free blocks for writes.</p>
<p><strong><span style=3D"color: #0000ff;">Our ones... seem to have 8GB LPD=
DR4 of ram.... as datasheet states....</span></strong></p>
</blockquote>
</blockquote>
<p>Yes, and it makes sense that the RAM size is proportional to the capacit=
y since a few bytes are required per addressable data block.</p>
<p>If the block size was 8 KB the RAM could hold 8 bytes (e.g. a pointer an=
d some status flags) for each logically addressable block. But there is no =
information about the actual internal structure of the QVO that I know of=
=2E</p>
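<p>The proportionality can be checked with a few lines of arithmetic.
This is only a sketch: the 8 KB block size and the 8 bytes per entry are
the hypothetical values from the paragraph above, decimal units are
assumed, and the function name is made up for illustration.</p>
<pre>
```python
def mapping_table_bytes(capacity_bytes, block_bytes, entry_bytes):
    """RAM needed for one table entry per logically addressable block."""
    return capacity_bytes // block_bytes * entry_bytes


TB = 10**12
KB = 10**3
table = mapping_table_bytes(8 * TB, 8 * KB, 8)
print(table / 10**9, "GB")  # 8.0 GB
```
</pre>
<p>With these assumed values the table for an 8 TB drive comes to 8 GB,
which matches the RAM size given in the data sheet.</p>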
[...]<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p><strong><span style=3D"color: #0000ff;">I see.... It's extremely mislead=
ing you know... because... you can copy five mailboxes of 50GB concurrently=
 for instance.... and you flood a gigabit interface copying (obviously beca=
use disks can keep that throughput)... but later.... you see... you are in =
an hour that yesterday, and even 4 days before you have not had any issues=
=2E.. and that day... you see the commented issue... even not being exactly=
 at a peak hour (perhaps is two hours later the peak hour even)... or... bu=
t I wasn't noticing about all things you say in this email....</span></stro=
ng></p>
<p>I have seen advice to not use compression in a high load scenario in som=
e other reply.</p>
<p>I tend to disagree: Since you seem to be limited when the SLC cache is e=
xhausted, you should get better performance if you compress your data. I ha=
ve found that zstd-2 works well for me (giving a significant overall reduct=
ion of size at reasonable additional CPU load). Since ZFS allows switching =
compression algorithms at any time, you can experiment with different algor=
ithms and levels.</p>
<p><strong><span style=3D"color: #0000ff;">I see... you say compression sho=
uld be enabled.... The main reason because we have not enabled it yet, is f=
or keeping the system the most near possible to config defaults... you know=
=2E.. for later being able to ask in this mailing lists if we have an issue=
=2E.. because you know... it would be far more easier to ask about somethin=
g strange you are seeing when that strange thing is near to a well tested c=
onfig, like the config by default....</span></strong></p>
<p><strong><span style=3D"color: #0000ff;">But now you say Stefan... if you=
 switch between compression algorithms you will end up with a mix of differ=
ent files compressed in a different manner... that is not a bit disaster la=
ter?. Doesn't affect performance in some manner?.</span></strong><span styl=
e=3D"color: #0000ff;"></span></p>
</blockquote>
</blockquote>
The compression used is stored in the per file information, each file in a =
dataset could have been written with a different compression method and lev=
el. Blocks are independently compressed - a file level compression may be m=
ore effective. Large mail files will contain incompressible attachments (al=
ready compressed), but in base64 encoding. This should allow a compression =
ratio of ~1.3. Small files will be plain text or HTML, offering much better=
 compression factors.<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>One advantage of ZFS compression is that it applies to the ARC, too. And=
 a compression factor of 2 should easily be achieved when storing mail (not=
 for .docx, .pdf, .jpg files though). Having more data in the ARC will redu=
ce the read pressure on the SSDs and will give them more cycles for garbage=
 collections (which are performed in the background and required to always =
have a sufficient reserve of free flash blocks for writes).</p>
<p><strong><span style=3D"color: #0000ff;">I assume we would use lz4, =
which is the least "expensive" compression algorithm for the CPU and, =
I assume, also adds the least delay when accessing data. Do you =
recommend another one? Do you always recommend compression, =
then?</span></strong></p>
</blockquote>
</blockquote>
<p>I'd prefer zstd over lz4 since it offers a much higher compression ratio=
=2E</p>
<p>Zstd offers higher compression ratios than lz4 at similar or better deco=
mpression speed, but may be somewhat slower compressing the data. But in my=
 opinion this is outweighed by the higher effective amount of data in the A=
RC/L2ARC possible with zstd.</p>
<p>For some benchmarks of the different compression algorithms available =
for ZFS, compared to uncompressed mode, see the extensive results =
published by Allan Jude:</p>
<pre class=3D"moz-quote-pre"><a class=3D"moz-txt-link-freetext" href=3D"htt=
ps://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2y=
PtiUQ/edit?usp=3Dsharing" target=3D"_blank" rel=3D"noopener noreferrer">htt=
ps://docs.google.com/spreadsheets/d/1TvCAIDzFsjuLuea7124q-1UtMd0C9amTgnXm2y=
PtiUQ/edit?usp=3Dsharing</a>

The SQL benchmarks might best resemble your use case - but remember that a =
significant reduction of the amount of data being written to the SSDs might=
 be more important than the highest transaction rate, since your SSDs put a=
 low upper limit on that when highly loaded.
</pre>
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>I'd give it a try - and if it reduces your storage requirements by =
only 10%, then keep 10% of each SSD unused (not assigned to any =
partition). That=
 will greatly improve the resilience of your SSDs, reduce the write-amplifi=
cation, will allow the SLC cache to stay at its large value, and may make a=
 large difference to the effective performance under high load.</p>
<p><strong><span style=3D"color: #0000ff;">But when you enable =
compression, only data that is newly written or modified gets =
compressed. Am I wrong?</span></strong></p>
</blockquote>
</blockquote>
Compression is applied per file system data block (at most 1 MB if you =
set the blocksize to that value). Each such block is compressed =
independently of all others, so that no more than one block has to be =
read and decompressed when randomly reading a file. If a block does =
not shrink when compressed (it may contain already-compressed file =
data), the block is written to disk as-is (uncompressed).<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p><br /></p>
<p><strong><span style=3D"color: #0000ff;">By the way, we have roughly =
1/4 of each disk used (12 TB allocated in the pool according to zpool =
list, spread across 8 disks of 8 TB). Do you think we could be =
suffering from write amplification and the like, with so little disk =
space used on each disk?</span></strong></p>
</blockquote>
</blockquote>
Your use case will cause a lot of garbage collection and thus =
particularly high write amplification values.<br />
<blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left: #1010ff 2=
px solid; margin: 0">
<blockquote style=3D"padding: 0 0.4em; border-left: #1010ff 2px solid; marg=
in: 0;">
<p>Regards, STefan</p>
<p><strong><span style=3D"color: #0000ff;">Hey mate, your mail is =
incredible. It has helped us a lot. Can we buy you a cup of coffee or =
a beer through PayPal or similar? Can I help you in some way?</span>=
</strong></p>
</blockquote>
</blockquote>
<p>Thanks, I'm glad to help, and I'd appreciate hearing whether you get =
your setup optimized for the purpose (and how well it holds up when you =
approach the capacity limits of your drives).</p>
<p>I'm always interested in the experience of users with use cases =
different from mine (I'm just a developer with too much archived mail =
and media collected over a few decades).</p>
<p>Regards, STefan</p>
</blockquote>
</body></html>

--=_fe85fe2db536d584a8585a29da09a2df--