From nobody Wed Apr 06 21:49:15 2022
X-Original-To: freebsd-hackers@mlmmj.nyi.freebsd.org
Received: from mx1.freebsd.org (mx1.freebsd.org [IPv6:2610:1c1:1:606c::19:1])
	by mlmmj.nyi.freebsd.org (Postfix) with ESMTP id 4B1871AA6E8B;
	Wed,  6 Apr 2022 21:49:21 +0000 (UTC)
	(envelope-from se@FreeBSD.org)
Received: from smtp.freebsd.org (smtp.freebsd.org [96.47.72.83])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256
	 client-signature RSA-PSS (4096 bits) client-digest SHA256)
	(Client CN "smtp.freebsd.org", Issuer "R3" (verified OK))
	by mx1.freebsd.org (Postfix) with ESMTPS id 4KYdSx1J3Rz3BqT;
	Wed,  6 Apr 2022 21:49:21 +0000 (UTC)
	(envelope-from se@FreeBSD.org)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org; s=dkim;
	t=1649281761;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=NTQDA7AmEiqtZXUltjZC6r1zqKWrWLTuqyQu/QurDx8=;
	b=gb/sw9et6KM9TPLK36zERnOOE6ihAh7FHW0Z5Fgu10FhytX/Y4J4Pnd4u2AlvMSr8DL7Kg
	iRDaVGq2tiA5Yz7RzmwkeTQxw3flBHgJmswMtg8EOC1gl/L6SIqgBKMVJ4ICBWBxc95STj
	fSxZCei8tWY+Pwm4+qpUrzivT6+QMaMMm2ig9Wlk6oFlv+yEKwmxiHaqhQpVVpPCqystKX
	1MPoTCNvMJgUU7gyYkTGWICu5UvV4fBq9gtF1ErEGXDeqCQ1w3ENkbV79lPsZ6knddmHRz
	6YK0Vv0o6WEPzjboA8bpJY6rC5cS6KIfq81o7l4IM/VRAkx566gjymcWXm8a9g==
Received: from [IPV6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1] (p200300cd5f226f00953e7ee1500e87a1.dip0.t-ipconnect.de [IPv6:2003:cd:5f22:6f00:953e:7ee1:500e:87a1])
	(using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits))
	(Client did not present a certificate)
	(Authenticated sender: se/mail)
	by smtp.freebsd.org (Postfix) with ESMTPSA id 176914D30;
	Wed,  6 Apr 2022 21:49:19 +0000 (UTC)
	(envelope-from se@FreeBSD.org)
Message-ID: <e4b7252d-525e-1c0f-c22b-e34b96c1ce83@FreeBSD.org>
Date: Wed, 6 Apr 2022 23:49:15 +0200
List-Id: Technical discussions relating to FreeBSD <freebsd-hackers.freebsd.org>
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
List-Help: <mailto:freebsd-hackers+help@freebsd.org>
List-Post: <mailto:freebsd-hackers@freebsd.org>
List-Subscribe: <mailto:freebsd-hackers+subscribe@freebsd.org>
List-Unsubscribe: <mailto:freebsd-hackers+unsubscribe@freebsd.org>
Sender: owner-freebsd-hackers@freebsd.org
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:91.0)
 Gecko/20100101 Thunderbird/91.7.0
Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS
Content-Language: en-US
To: egoitz@ramattack.net
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org,
 freebsd-performance@freebsd.org, Rainer Duffner <rainer@ultra-secure.de>
References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net>
 <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de>
 <0ef282aee34b441f1991334e2edbcaec@ramattack.net>
 <dd9a55ac-053d-7802-169d-04c95c045ed2@FreeBSD.org>
 <ce51660b5f83f92aa9772d764ae12dff@ramattack.net>
From: Stefan Esser <se@FreeBSD.org>
In-Reply-To: <ce51660b5f83f92aa9772d764ae12dff@ramattack.net>
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature";
 boundary="------------Lic1usorjc8S7FifC6L0nUmq"
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=freebsd.org;
	s=dkim; t=1649281761;
	h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
	 to:to:cc:cc:mime-version:mime-version:content-type:content-type:
	 in-reply-to:in-reply-to:references:references;
	bh=NTQDA7AmEiqtZXUltjZC6r1zqKWrWLTuqyQu/QurDx8=;
	b=v5/9q0D36uM6by3oYOW443+rDA3y21ugXHLeseeuzJs6uALdA6p7lQNGhL/K5PLN0SncDG
	Ze2s0aAILCb2FaUJDRp5csshI8Bd9nT+L5jXZJrOKJ7L4RHetAoNVXCj2/Onql93fRJft7
	0IIaiCTwWEkQWgpWvpqm8VhLvNkr1LlxW5Ml6Oh7uxJgefQH/r0Y5jKry+tmSeRtIvNAdc
	2kaZJrECFKqL+zOJeb5qObc1ss0F4ViPi/wrPL3DXSGn6Cc2J/hdU6A18B4fGGMJEtVOeJ
	+Ln/wIfCbLW+CMB3b86IlBeIF6YaestXttvguHQa68/IgnQl+mCKQ9TkFYvjqw==
ARC-Seal: i=1; s=dkim; d=freebsd.org; t=1649281761; a=rsa-sha256; cv=none;
	b=KT9L8BUHeIW4nZrivqVSEpnzwZ2veeKiP2ovXRvBoh2v+loi8Ts72fDcwkndPNSnqlNzPQ
	CD1E9/NP2epXmB33xK+oL9fJN1k0ZbmLf9uC8jasQwU77EQM7jKDKbuLu1rSHhNC2Avw3r
	Q+0lAxAyhzDjMAoF4MoAND7D7RAMkOLMeRvKMjRnYKV4FMCLtN+LTtXs2bLcr+j+hQ74Rv
	D1mKpI60anTgLJDuSxe0ScgthHQ1MpvgjLUTRFPuB/g1Bz10QS/jTP/IAWKkWua97/6hQV
	PCaznQuyUuI+N+IFFpJ2JhW1zuQfJUgJMP3Vt4tzVxYuM1WPzVcSeYA0EMUtRA==
ARC-Authentication-Results: i=1;
	mx1.freebsd.org;
	none
X-ThisMailContainsUnwantedMimeParts: N

This is an OpenPGP/MIME signed message (RFC 4880 and 3156)
--------------Lic1usorjc8S7FifC6L0nUmq
Content-Type: multipart/mixed; boundary="------------OGljiRSjG08yilMHhaNFyDtW";
 protected-headers="v1"
From: Stefan Esser <se@FreeBSD.org>
To: egoitz@ramattack.net
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org,
 freebsd-performance@freebsd.org, Rainer Duffner <rainer@ultra-secure.de>
Message-ID: <e4b7252d-525e-1c0f-c22b-e34b96c1ce83@FreeBSD.org>
Subject: Re: {* 05.00 *}Re: Desperate with 870 QVO and ZFS
References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net>
 <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de>
 <0ef282aee34b441f1991334e2edbcaec@ramattack.net>
 <dd9a55ac-053d-7802-169d-04c95c045ed2@FreeBSD.org>
 <ce51660b5f83f92aa9772d764ae12dff@ramattack.net>
In-Reply-To: <ce51660b5f83f92aa9772d764ae12dff@ramattack.net>

--------------OGljiRSjG08yilMHhaNFyDtW
Content-Type: multipart/alternative;
 boundary="------------ASZf2V7VPlu8lceG03SodTv5"

--------------ASZf2V7VPlu8lceG03SodTv5
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

On 06.04.22 at 18:34, egoitz@ramattack.net wrote:

> Hi Stefan!
>
> Thank you so much for your answer!!. I do answer below in green bold
> for instance... for a better distinction....
>
> Very thankful for all your comments Stefan!!! :) :) :)
>
> Cheers!!
>
Hi,

glad to hear that it is useful information - I'll add comments below ...

> On 2022-04-06 17:43, Stefan Esser wrote:
>
>>
>> On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
>>> Hi Rainer!
>>>
>>> Thank you so much for your help :) :)
>>>
>>> Well I assume they are in a datacenter and should not be a power
>>> outage....
>>>
>>> About dataset size... yes... our ones are big... they can be 3-4 TB
>>> easily each dataset.....
>>>
>>> We bought them, because as they are for mailboxes and mailboxes grow
>>> and grow.... for having space for hosting them...
>>
>> Which mailbox format (e.g. mbox, maildir, ...) do you use?
>>
>> *I'm running Cyrus imap so sort of Maildir... too many little files
>> normally..... Sometimes directories with tons of little files....*

Assuming that many mails are much smaller than the erase block size of
the SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size
of 64 KB (just guessing), then 8 mails will be in an erase block. If
half the mails are deleted, then the erase block will still occupy
64 KB, but only hold 32 KB of useful data (and the SSD will only be
aware of this fact if TRIM has signaled which data is no longer
relevant). The SSD will copy several partially filled erase blocks
together into a smaller number of free blocks, which then are fully
utilized. Later deletions will repeat this game, and your data will be
copied multiple times until it has aged (and the user is less likely to
delete further messages). This leads to "write amplification" - data is
internally moved around and thus written multiple times.
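
The example above can be turned into a toy calculation. This is only a
sketch under the guessed sizes (64 KB erase block, half of the remaining
mails deleted before each compaction); the function name and the model
are assumptions for illustration, not how any SSD firmware works:

```python
# Toy model of write amplification caused by compacting erase blocks.
# Assumed size (a guess taken from the text): 64 KB erase block.
ERASE_BLOCK_KB = 64


def write_amplification(rounds: int, deleted_fraction: float) -> float:
    """Return flash writes divided by host writes.

    Each round, the given fraction of the data still stored in a block
    is deleted and the survivors are copied once into a fresh block.
    """
    host_kb = float(ERASE_BLOCK_KB)   # data the host wrote originally
    flash_kb = host_kb                # first write of that data
    live_kb = host_kb
    for _ in range(rounds):
        live_kb *= 1.0 - deleted_fraction   # deletions leave holes
        flash_kb += live_kb                 # compaction copies survivors
    return flash_kb / host_kb


print(write_amplification(1, 0.5))   # 1.5
print(write_amplification(3, 0.5))   # 1.875
```

In this model every compaction adds the surviving data to the amount
written to flash, which is why data that keeps losing neighbours gets
rewritten again and again.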

Larger mails are less of an issue since they span multiple erase
blocks, which will be completely freed when such a message is deleted.

Samsung has a lot of experience and generally good strategies to deal
with such a situation, but SSDs specified for use in storage systems
might be much better suited for that kind of usage profile.

>>> We knew they had some speed issues, but those speed issues, we
>>> thought (as Samsung explains in the QVO site) they started after
>>> exceeding the speeding buffer these disks have. We thought that as
>>> long as you didn't exceed its capacity (the capacity of the speeding
>>> buffer) no speed problem arises. Perhaps we were wrong?.
>>
>> These drives are meant for small loads in a typical PC use case,
>> i.e. some installations of software in the few GB range, else only
>> files of a few MB being written, perhaps an import of media files
>> that range from tens to a few hundred MB at a time, but less often
>> than once a day.
>>
>> *We move, you know... lots of little files... and lots of different
>> concurrent modifications by 1500-2000 concurrent imap connections we
>> have...*

I do not expect the read load to be a problem (except possibly when the
SSD is moving data from SLC to QLC blocks, but even then reads will get
priority). But writes and trims might very well overwhelm the SSD,
especially when it's getting full. Keeping a part of the SSD unused
(excluded from the partitions created) will lead to a large pool of
unused blocks. This will reduce the write amplification - there are
many free blocks in the "unpartitioned part" of the SSD, and thus there
is less urgency to compact partially filled blocks. (E.g. if you
include only 3/4 of the SSD capacity in a partition used for the ZPOOL,
then 1/4 of each erase block could be free due to deletions/TRIM
without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good
strategy to improve its performance and resilience.
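
A minimal sketch of the sizing arithmetic, assuming you decide on a
share of the drive to leave unpartitioned (the helper name is made up
for illustration, and 8000 GB is a nominal figure for the 8 TB model):

```python
# Sketch: how large a partition to create when a share of the SSD is
# to stay unallocated. The helper name is made up for illustration.

def partition_size_gb(ssd_gb: float, unallocated_share: float) -> float:
    """Return the partition size that leaves the given share unused."""
    if not 0.0 <= unallocated_share < 1.0:
        raise ValueError("share must be in [0, 1)")
    return ssd_gb * (1.0 - unallocated_share)


# 8 TB drive, using only 3/4 of it for the ZPOOL as in the example
print(partition_size_gb(8000, 0.25))   # 6000.0
```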

>> As the SSD fills, the space available for the single level write
>> cache gets smaller
>>
>> *The single level write cache is the cache these ssd drives have, for
>> compensating the speed issues they have due to using qlc memory?. Do
>> you refer to that?. Sorry I don't understand well this paragraph.*

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC
cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many
cells as 24 GB of data in QLC mode.

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB
(600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

Therefore, the fraction of the cells used as an SLC cache is reduced
when the SSD gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in
6 tn SLC cells).
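
The capacity arithmetic above can be checked with a few lines. This is
a sketch only; it counts cells in units of "GB stored at 1 bit per
cell", which is an assumption chosen to match the numbers in the text:

```python
# Capacity arithmetic for a drive whose cells can run in QLC mode
# (4 bits per cell) or SLC mode (1 bit per cell). Sizes are in GB.
QLC_BITS = 4
SLC_BITS = 1


def usable_capacity_gb(qlc_capacity_gb: float,
                       slc_cache_gb: float) -> float:
    """Total capacity when part of the cells is used as SLC cache."""
    # Express the cell count as "GB at 1 bit per cell".
    total_cells = qlc_capacity_gb / QLC_BITS
    slc_cells = slc_cache_gb / SLC_BITS
    qlc_cells = total_cells - slc_cells
    if qlc_cells < 0:
        raise ValueError("SLC cache larger than the whole drive")
    return qlc_cells * QLC_BITS + slc_cache_gb


print(usable_capacity_gb(1000, 100))   # 700.0
print(usable_capacity_gb(1000, 6))     # 982.0
```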


And with fewer SLC cells available for short term storage of data, the
probability of data being copied to QLC cells before the irrelevant
messages have been deleted is significantly increased. And that will
again lead to many more blocks with "holes" (deleted messages) in them,
which then need to be copied, possibly multiple times, to compact them.

>> (on many SSDs, I have no numbers for this
>> particular device), and thus the amount of data that can be
>> written at single cell speed shrinks as the SSD gets full.
>>
>> I have just looked up the size of the SLC cache, it is specified
>> to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB
>> version, smaller models will have a smaller SLC cache).
>>
>> *Assuming you were talking about the cache for compensating speed we
>> previously commented, I should say these are the 870 QVO but the 8TB
>> version. So they should have the biggest cache for compensating the
>> speed issues...*

I have looked up the data: the larger versions of the 870 QVO have the
same SLC cache configuration as the 2 TB model, 6 GB minimum and up to
72 GB more if there are enough free blocks.

>> But after writing those few GB at a speed of some 500 MB/s (i.e.
>> after 12 to 150 seconds), the drive will need several minutes to
>> transfer those writes to the quad-level cells, and will operate
>> at a fraction of the nominal performance during that time.
>> (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the
>> 2 TB model.)
>>
>> *Well we are in the 8TB model. I think I have understood what you
>> wrote in previous paragraph. You said they can be fast but not
>> constantly, because later they have to write all that to their
>> perpetual storage from the cache. And that's slow. Am I wrong?. Even
>> in the 8TB model you think Stefan?.*

The controller in the SSD supports a given number of channels (e.g. 4),
each of which can access a Flash chip independently of the others.
Small SSDs often have fewer Flash chips than there are channels (and
thus a lower throughput, especially for writes), but the larger models
often have more chips than channels and thus the performance is capped.

In the case of the 870 QVO, the controller supports 8 channels, which
allows it to write 160 MB/s into the QLC cells. The 1 TB model
apparently has only 4 Flash chips and is thus limited to 80 MB/s in
that situation, while the larger versions have 8, 16, or 32 chips. But
due to the limited number of channels, the write rate is limited to
160 MB/s even for the 8 TB model.

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in
this limit.
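
The channel limit can be written down as a small sketch. The 20 MB/s
per channel is only derived from "8 channels give 160 MB/s", and the
mapping of 8 chips to the 2 TB model and 32 chips to the 8 TB model is
assumed, so treat both as illustrative rather than specified values:

```python
# Sketch of the channel limit described above. The 20 MB/s per channel
# is derived from "8 channels give 160 MB/s" and is an assumption.
CHANNELS = 8
MB_PER_S_PER_CHANNEL = 20


def qlc_write_rate(flash_chips: int) -> int:
    """Sustained QLC write rate in MB/s for one drive."""
    return min(flash_chips, CHANNELS) * MB_PER_S_PER_CHANNEL


for chips in (4, 8, 16, 32):
    print(chips, "chips:", qlc_write_rate(chips), "MB/s")

# Four 2 TB drives (8 chips each) instead of one 8 TB drive (32 chips)
print(4 * qlc_write_rate(8), "MB/s versus", qlc_write_rate(32), "MB/s")
```

More chips behind the same 8 channels add capacity but no write
bandwidth in this model, which is why several smaller drives come out
ahead.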

>> *The main problem we are facing is that in some peak moments, when
>> the machine serves connections for all the instances it has, and only
>> as said in some peak moments... like the 09am or the 11am.... it seems
>> the machine becomes slower... and like if the disks weren't able to
>> serve all they have to serve.... In these moments, no big files are
>> moved... but as we have 1800-2000 concurrent imap connections...
>> normally they are doing each one... little changes in their mailbox.
>> Do you think perhaps these disks then are not appropriate for this
>> kind of usage?-*

I'd guess that the drives get into a state in which they have to
recycle lots of partially free blocks (i.e. perform a kind of garbage
collection) and then three kinds of operations are competing with each
other:

 1. reads (generally prioritized)
 2. writes (filling the SLC cache up to its maximum size)
 3. compactions of partially filled blocks (required to make free blocks
    available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a
filled SSD with partially filled erase blocks means that operations of
type 3 need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.

>> And cheap SSDs often have no RAM cache (not checked, but I'd be
>> surprised if the QVO had one) and thus cannot keep bookkeeping data
>> in such a cache, further limiting the performance under load.
>>
>> *This brochure
>> (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf
>> and the datasheet
>> https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf)
>> says, if I have read properly, the 8TB drive has 8GB of ram?. I assume
>> that is what they call the turbo write cache?.*

No, the turbo write cache consists of the cells used in SLC mode (which
can be any cells, not only cells in a specific area of the flash chip).

The RAM is needed for fast lookup of the position of data for reads and
of free blocks for writes.

There is no simple relation between SSD "block number" (in the sense of
a disk block on some track of a magnetic disk) and its storage location
on the Flash chip. If an existing "data block" (what would be a sector
on a hard disk drive) is overwritten, it is instead written at the end
of an "open" erase block, and a pointer from that "block number" to the
location on the chip is stored in an index. This index is written to
Flash storage and could be read from it, but it is much faster to have
a RAM with these pointers that can be accessed independently of the
Flash chips. This RAM is required for high transaction rates
(especially random reads), but it does not really help speed up writes.
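
To make the index idea concrete, here is a minimal sketch of such a
mapping. All names are invented for illustration, and real firmware
works on pages and erase blocks with far more bookkeeping:

```python
# Minimal model of the logical-to-physical index an SSD keeps in RAM.
# All names are illustrative; real firmware is far more involved.

class TinyFTL:
    def __init__(self) -> None:
        self.flash: list = []      # append-only log of written pages
        self.index: dict = {}      # block number -> position in flash
        self.stale: set = set()    # positions holding outdated data

    def write(self, block_no: int, data: bytes) -> None:
        """Append data at the end of the open erase block."""
        old = self.index.get(block_no)
        if old is not None:
            self.stale.add(old)    # old copy becomes garbage
        self.flash.append(data)
        self.index[block_no] = len(self.flash) - 1

    def read(self, block_no: int) -> bytes:
        """Look up the current location through the index."""
        return self.flash[self.index[block_no]]


ftl = TinyFTL()
ftl.write(7, b"first version")
ftl.write(7, b"second version")   # overwrite goes to a new location
print(ftl.read(7), "stale positions:", sorted(ftl.stale))
```

The overwrite never touches the old location; it only redirects the
pointer, and the stale copy is what garbage collection has to clean up
later.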

>> And the resilience (max. amount of data written over its lifetime)
>> is also quite low - I hope those drives are used in some kind of
>> RAID configuration.
>>
>> *Yep we use raidz-2*

Makes sense ... But you know that you multiply the amount of data
written due to the redundancy.

If a single 8 KB block is written, for example, 3 * 8 KB will be
written if you take the 2 redundant copies into account.
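
As a sketch of that multiplication (a deliberately simplified model:
actual RAID-Z allocation also depends on the sector size and the number
of disks in the vdev, which is not modelled here):

```python
# Simplified view of the RAID-Z2 write overhead for one small block:
# the data plus two redundancy blocks of the same size. Real RAID-Z
# allocation also depends on sector size and vdev width.
PARITY_BLOCKS = 2


def raidz2_written_kb(block_kb: int) -> int:
    """KB hitting the disks for one logical block of block_kb."""
    return block_kb * (1 + PARITY_BLOCKS)


print(raidz2_written_kb(8))   # 24
```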

>> The 870 QVO is specified for 370 full capacity
>> writes, i.e. 370 TB for the 1 TB model. That's still a few hundred
>> GB a day - but only if the write amplification stays in a reasonable
>> range ...
>>
>> *Well yes... 2880TB in our case....not bad.. isn't it?*

I assume that 2880 TB is your total storage capacity? That's not too
bad, in fact. ;-)

This would be 360 * 8 TB ...

Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of
write throughput (if all writes were evenly distributed).
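
The estimate is simple multiplication; a short sketch, with the drive
count taken from the assumption above that 2880 TB is total capacity:

```python
# Aggregate QLC write rate if writes were spread evenly over all drives.
DRIVES = 360               # 2880 TB / 8 TB, as assumed in the mail
MB_PER_S_PER_DRIVE = 160   # QLC write limit of one 870 QVO


def aggregate_gb_per_s(drives: int, mb_per_s: int) -> float:
    """Return the combined write rate in GB/s."""
    return drives * mb_per_s / 1000.0


print(aggregate_gb_per_s(DRIVES, MB_PER_S_PER_DRIVE))   # 57.6
```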

Taking all odds into account, I'd guess that at least 10 GB/s can be
continuously written (if supported by the CPUs and controllers).

But this may not be true if the drive is simultaneously reading,
trimming, and writing ...


I have seen advice to not use compression in a high load scenario in
some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is
exhausted, you should get better performance if you compress your data.
I have found that zstd-2 works well for me (giving a significant
overall reduction of size at reasonable additional CPU load). Since ZFS
allows switching compression algorithms at any time, you can experiment
with different algorithms and levels.

One advantage of ZFS compression is that it applies to the ARC, too.
And a compression factor of 2 should easily be achieved when storing
mail (not for .docx, .pdf, .jpg files though). Having more data in the
ARC will reduce the read pressure on the SSDs and will give them more
cycles for garbage collections (which are performed in the background
and required to always have a sufficient reserve of free flash blocks
for writes).
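
If you want a rough feeling for how compressible your mail is before
changing the pool, something like the following sketch could be run
over a few sample messages. It uses zlib from the Python standard
library purely as a stand-in (ZFS would use zstd or lz4 and compresses
per record), and the sample text is made up, so the number it prints is
only indicative:

```python
# Rough estimate of how well mail-like text compresses. zlib from the
# standard library is only a stand-in here; ZFS would use zstd or lz4.
import zlib


def compression_factor(data: bytes, level: int) -> float:
    """Return original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, level))


# Made-up sample; use real messages from the mail store for a
# meaningful number.
sample = (
    b"Received: from mx.example.org by mail.example.org\r\n"
    b"Subject: weekly report\r\n\r\n"
    b"Hello, please find the weekly report below.\r\n"
) * 50
print(round(compression_factor(sample, 6), 1))
```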

I'd give it a try - and if it reduces your storage requirements by only
10%, then keep 10% of each SSD unused (not assigned to any partition).
That will greatly improve the resilience of your SSDs, reduce the
write amplification, allow the SLC cache to stay at its large value,
and may make a large difference to the effective performance under high
load.

Regards, Stefan

--------------ASZf2V7VPlu8lceG03SodTv5
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

<html>
  <head>
    <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DUTF=
-8">
  </head>
  <body>
    <p>Am 06.04.22 um 18:34 schrieb <a class=3D"moz-txt-link-abbreviated"=
 href=3D"mailto:egoitz@ramattack.net">egoitz@ramattack.net</a>:<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3DU=
TF-8">
      <p>Hi Stefan!</p>
      <p>Thank you so much for your answer!!. I do answer below in green
        bold for instance... for a better distinction....</p>
      <p>Very thankful for all your comments Stefan!!! :) :) :)</p>
      <p>Cheers!!</p>
    </blockquote>
    <p>Hi,</p>
    <p>glad to hear that it is useful information - I'll add comments
      below ...</p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <p>El 2022-04-06 17:43, Stefan Esser escribi=C3=B3:</p>
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0"><!-- html ignored --><!-- head igno=
red --><!-- meta ignored -->
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">
          Am 06.04.22 um 16:36 schrieb <a class=3D"moz-txt-link-abbreviat=
ed" href=3D"mailto:egoitz@ramattack.net">egoitz@ramattack.net</a>:
          <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-lef=
t:
            #1010ff 2px solid; margin: 0">Hi Rainer!<br>
            <br>
            Thank you so much for your help :) :)<br>
            <br>
            Well I assume they are in a datacenter and should not be a
            power outage....<br>
            <br>
            About dataset size... yes... our ones are big... they can be
            3-4 TB easily each<br>
            dataset.....<br>
            <br>
            We bought them, because as they are for mailboxes and
            mailboxes grow and<br>
            grow.... for having space for hosting them...</blockquote>
          <br>
          Which mailbox format (e.g. mbox, maildir, ...) do you use?</div=
>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">I'm running
              Cyrus imap so sort of Maildir... too many little files
              normally..... Sometimes directories with tons of little
              files....</span></strong><br>
        </div>
      </blockquote>
    </blockquote>
    <p>Assuming that many mails are much smaller than the erase block
      size of the SSD, this may cause issues. (You may know the
      following ...) <br>
    </p>
    <p>For example, if you have message sizes of 8 KB and an erase block
      size of 64 KB (just guessing), then 8 mails will be in an erase
      block. If half the mails are deleted, then the erase block will
      still occupy 64 KB, but only hold 32 KB of useful data (and the
      SSD will only be aware of this fact if TRIM has signaled which
      data is no longer relevant). The SSD will copy several partially
      filled erase blocks together in a smaller number of free blocks,
      which then are fully utilized. Later deletions will repeat this
      game, and your data will be copied multiple times until it has
      aged (and the user is less likely to delete further messages).
      This leads to "write amplification" - data is internally moved
      around and thus written multiple times.</p>
    <p>Larger mails are less of an issue since they span multiple erase
      blocks, which will be completely freed when such a message is
      deleted.<br>
    </p>
    <p>Samsung has a lot of experience and generally good strategies to
      deal with such a situation, but SSDs specified for use in storage
      systems might be much better suited for that kind of usage
      profile.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">
          <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-lef=
t:
            #1010ff 2px solid; margin: 0">We knew they had some speed
            issues, but those speed issues, we thought (as<br>
            Samsung explains in the QVO site) they started after
            exceeding the speeding<br>
            buffer this disks have. We though that meanwhile you didn't
            exceed it's<br>
            capacity (the capacity of the speeding buffer) no speed
            problem arises. Perhaps<br>
            we were wrong?.</blockquote>
          <br>
          These drives are meant for small loads in a typical PC use
          case,<br>
          i.e. some installations of software in the few GB range, else
          only<br>
          files of a few MB being written, perhaps an import of media
          files<br>
          that range from tens to a few hundred MB at a time, but less
          often<br>
          than once a day.</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">We move, you=

              know... lots of little files... and lot's of different
              concurrent modifications by 1500-2000 concurrent imap
              connections we have...</span></strong></div>
      </blockquote>
    </blockquote>
    <p>I do not expect the read load to be a problem (except possibly
      when the SSD is moving data from SLC to QLC blocks, but even then
      reads will get priority). But writes and trims might very well
      overwhelm the SSD, especially when it's getting full. Keeping a
      part of the SSD unused (excluded from the partitions created) will
      lead to a large pool of unused blocks. This will reduce the write
      amplification - there are many free blocks in the "unpartitioned
      part" of the SSD, and thus there is less urgency to compact
      partially filled blocks. (E.g. if you include only 3/4 of the SSD
      capacity in a partition used for the ZPOOL, then 1/4 of each erase
      block could be free due to deletions/TRIM without any compactions
      required to hold all this data.)</p>
    <p>Keeping a significant percentage of the SSD unallocated is a good
      strategy to improve its performance and resilience.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"> As the SSD fills, the space available for the
          single level write<br>
          cache gets smaller</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">The single
              level write cache is the cache these ssd drivers have, for
              compensating the speed issues they have due to using qlc
              memory?. Do you refer to that?. Sorry I don't understand
              well this paragraph.</span></strong></div>
      </blockquote>
    </blockquote>
    <p>Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell.
      The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs
      as many cells as 24 GB of data in QLC mode.</p>
    <p>A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700
      GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells). <b=
r>
    </p>
    <p>Therefore, the fraction of the cells used as an SLC cache is
      reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6
      GB in 6 tn SLC cells).</p>
    <p>And with fewer SLC cells available for short term storage of data
      the probability of data being copied to QLC cells before the
      irrelevant messages have been deleted is significantly increased.
      And that will again lead to many more blocks with "holes" (deleted
      messages) in them, which then need to be copied possibly multiple
      times to compact them.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">(on many SSDs, I have no numbers
        for this<br>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"> particular device), and thus the amount of data
          that can be<br>
          written at single cell speed shrinks as the SSD gets full.</div=
>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><br>
          <br>
          I have just looked up the size of the SLC cache, it is
          specified<br>
          to be 78 GB for the empty SSD, 6 GB when it is full (for the 2
          TB<br>
          version, smaller models will have a smaller SLC cache).</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">Assuming you=

              were talking about the cache for compensating speed we
              previously commented, I should say these are the 870 QVO
              but the 8TB version. So they should have the biggest cache
              for compensating the speed issues...</span></strong></div>
      </blockquote>
    </blockquote>
    <p>I have looked up the data: the larger versions of the 870 QVO
      have the same SLC cache configuration as the 2 TB model, 6 GB
      minimum and up to 72 GB more if there are enough free blocks.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"> But after writing those few GB at a speed of some
          500 MB/s (i.e.<br>
          after 12 to 150 seconds), the drive will need several minutes
          to<br>
          transfer those writes to the quad-level cells, and will
          operate<br>
          at a fraction of the nominal performance during that time.<br>
          (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s
          for the<br>
          2 TB model.)</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">Well we are
              in the 8TB model. I think I have understood what you wrote
              in previous paragraph. You said they can be fast but not
              constantly, because later they have to write all that to
              their perpetual storage from the cache. And that's slow.
              Am I wrong?. Even in the 8TB model you think Stefan?.</span=
></strong></div>
      </blockquote>
    </blockquote>
    <p>The controller in the SSD supports a given number of channels
      (e.g. 4), each of which can access a Flash chip independently of
      the others. Small SSDs often have fewer Flash chips than there are
      channels (and thus a lower throughput, especially for writes), but
      the larger models often have more chips than channels and thus the
      performance is capped.</p>
    <p>In the case of the 870 QVO, the controller supports 8 channels,
      which allows it to write 160 MB/s into the QLC cells. The 1 TB
      model apparently has only 4 Flash chips and is thus limited to 80
      MB/s in that situation, while the larger versions have 8, 16, or
      32 chips. But due to the limited number of channels, the write
      rate is limited to 160 MB/s even for the 8 TB model.</p>
    <p>If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s
      in this limit.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><span style=3D"color: #008000;"><strong>The main
              problem we are facing is that in some peak moments, when
              the machine serves connections for all the instances it
              has, and only as said in some peak moments... like the
              09am or the 11am.... it seems the machine becomes
              slower... and like if the disks weren't able to serve all
              they have to serve.... In these moments, no big files are
              moved... but as we have 1800-2000 concurrent imap
              connections... normally they are doing each one... little
              changes in their mailbox. Do you think that perhaps these
              disks are not appropriate for this kind of
              usage?</strong></span><br>
        </div>
      </blockquote>
    </blockquote>
    <p>I'd guess that the drives get into a state in which they have to
      recycle lots of partially free blocks (i.e. perform a kind of
      garbage collection), and then three kinds of operations compete
      with each other:</p>
    <ol>
      <li>reads (generally prioritized)</li>
      <li>writes (filling the SLC cache up to its maximum size)</li>
      <li>compactions of partially filled blocks (required to make free
        blocks available for re-use)</li>
    </ol>
    <p>Writes can only proceed if there are enough free blocks. On a
      full SSD with partially filled erase blocks, this means that
      operations of type 3 must be given priority so that writes do not
      stall.</p>
    <p>My assumption is that this is what you are observing under peak
      load.<br>
    </p>
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"> And cheap SSDs often have no RAM cache (not
          checked, but I'd be<br>
          surprised if the QVO had one) and thus cannot keep bookkeeping
          data<br>
          in such a cache, further limiting the performance under load.</=
div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">This brochur=
e
              (<a style=3D"color: #008000;"
href=3D"https://semiconductor.samsung.com/resources/brochure/870_Series_B=
rochure.pdf"
                moz-do-not-send=3D"true" class=3D"moz-txt-link-freetext">=
https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.=
pdf</a>
              and the datasheet
<a class=3D"moz-txt-link-freetext" href=3D"https://semiconductor.samsung.=
com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf">https=
://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Dat=
a_Sheet_Rev1.1.pdf</a>)
              says, if I have read it properly, that the 8TB drive has
              8GB of RAM. I assume that is what they call the turbo
              write cache?</span></strong><br>
        </div>
      </blockquote>
    </blockquote>
    <p>No, the turbo write cache consists of the cells used in SLC mode
      (which can be any cells, not only cells in a specific area of the
      flash chip).</p>
    <p>The RAM is needed for fast lookup of the position of data for
      reads and of free blocks for writes.</p>
    <p>There is no simple relation between SSD "block number" (in the
      sense of a disk block on some track of a magnetic disk) and its
      storage location on the Flash chip. If an existing "data block"
      (what would be a sector on a hard disk drive) is overwritten, it
      is instead written at the end of an "open" erase block, and a
      pointer from that "block number" to the location on the chip is
      stored in an index. This index is written to Flash storage and
      could be read from it, but it is much faster to have a RAM with
      these pointers that can be accessed independently of the Flash
      chips. This RAM is required for high transaction rates (especially
      random reads), but it does not really help speed up writes.<br>
    </p>
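    <p>A toy model of such an index, as a Python sketch. The dict stands
      in for the RAM index and the list for the Flash erase blocks; all
      names are hypothetical, and real firmware is far more involved.</p>
    <pre>
```python
def write_block(index, flash, block_number, data):
    # Never overwrite in place: append at the end of the open block ...
    flash.append(data)
    # ... and point the "block number" at the new location.
    index.update({block_number: len(flash) - 1})

def read_block(index, flash, block_number):
    # One lookup in the (RAM) index, then one Flash access.
    return flash[index[block_number]]

def demo(index, flash):
    write_block(index, flash, 7, "old")
    write_block(index, flash, 7, "new")   # overwrite of block 7
    return read_block(index, flash, 7), len(flash)

print(demo({}, []))
```
</pre>
    <p>The read returns the new data, while the stale first copy still
      occupies a Flash location until garbage collection reclaims it.</p>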
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"> And the resilience (max. amount of data written
          over its lifetime)<br>
          is also quite low - I hope those drives are used in some kind
          of<br>
          RAID configuration.</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">Yep we use
              raidz-2</span></strong></div>
      </blockquote>
    </blockquote>
    <p>Makes sense ... But be aware that the redundancy multiplies the
      amount of data written.</p>
    <p>For example, if a single 8 KB block is written, 3 * 8 KB will be
      written once you take the 2 redundant copies into account.<br>
    </p>
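    <p>As a Python sketch of this multiplication (a simplification that
      follows the example above; the real raidz2 overhead depends on the
      vdev width and sector size, and the function name is
      hypothetical):</p>
    <pre>
```python
def raid_written_kb(block_kb, redundant_copies):
    # One data block plus the redundant blocks written alongside it.
    return block_kb * (1 + redundant_copies)

print(raid_written_kb(8, 2))
```
</pre>
    <p>For the 8 KB example this prints 24.</p>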
    <blockquote type=3D"cite"
      cite=3D"mid:ce51660b5f83f92aa9772d764ae12dff@ramattack.net">
      <blockquote type=3D"cite" style=3D"padding: 0 0.4em; border-left:
        #1010ff 2px solid; margin: 0">
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">The 870 QVO is specified for 370 full capacity<br>
          writes, i.e. 370 TB for the 1 TB model. That's still a few
          hundred<br>
          GB a day - but only if the write amplification stays in a
          reasonable<br>
          range ...</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace">=C2=A0</div>
        <div class=3D"pre" style=3D"margin: 0; padding: 0; font-family:
          monospace"><strong><span style=3D"color: #008000;">Well yes...
              2880TB in our case....not bad.. isn't it?</span></strong></=
div>
      </blockquote>
    </blockquote>
    <p>I assume that 2880 TB is your total storage capacity? That's not
      too bad, in fact. ;-)</p>
    <p>This would be 360 * 8 TB ...</p>
    <p>Even at 160 MB/s per 8 TB SSD this would allow for more than 50
      GB/s of write throughput (if all writes were evenly distributed).</=
p>
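    <p>The estimate above, spelled out in Python. It assumes, as above,
      that 2880 TB is the total capacity and that writes are spread
      evenly over all drives; the function name is hypothetical.</p>
    <pre>
```python
def aggregate_write_gb_s(total_tb, drive_tb, per_drive_mb_s):
    # Number of drives times the sustained QLC write rate per drive.
    return (total_tb // drive_tb) * per_drive_mb_s / 1000

print(2880 // 8)                           # number of 8 TB drives
print(aggregate_write_gb_s(2880, 8, 160))  # GB/s over all drives
```
</pre>
    <p>This gives 360 drives and about 57.6 GB/s as the upper bound.</p>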
    <p>Taking all odds into account, I'd guess that at least 10 GB/s can
      be continuously written (if supported by the CPUs and
      controllers).</p>
    <p>But this may not be true if the drive is simultaneously reading,
      trimming, and writing ...<br>
    </p>
    <p><br>
    </p>
    <p>I have seen advice to not use compression in a high load scenario
      in some other reply.</p>
    <p>I tend to disagree: Since you seem to be limited when the SLC
      cache is exhausted, you should get better performance if you
      compress your data. I have found that zstd-2 works well for me
      (giving a significant overall reduction of size at reasonable
      additional CPU load). Since ZFS lets you switch compression
      algorithms at any time, you can experiment with different
      algorithms and levels.</p>
    <p>One advantage of ZFS compression is that it applies to the ARC,
      too. And a compression factor of 2 should easily be achieved when
      storing mail (not for .docx, .pdf, .jpg files though). Having more
      data in the ARC will reduce the read pressure on the SSDs and will
      give them more cycles for garbage collections (which are performed
      in the background and required to always have a sufficient reserve
      of free flash blocks for writes).<br>
    </p>
    <p>I'd give it a try - and even if it reduces your storage
      requirements by only 10%, keep that 10% of each SSD unused (not
      assigned to any partition). That will greatly improve the
      resilience of your SSDs, reduce the write amplification, allow the
      SLC cache to stay at its maximum size, and may make a large
      difference to the effective performance under high load.</p>
    <p>Regards, STefan<br>
    </p>
    <strong></strong>
  </body>
</html>

--------------ASZf2V7VPlu8lceG03SodTv5--

--------------OGljiRSjG08yilMHhaNFyDtW--

--------------Lic1usorjc8S7FifC6L0nUmq
Content-Type: application/pgp-signature; name="OpenPGP_signature.asc"
Content-Description: OpenPGP digital signature
Content-Disposition: attachment; filename="OpenPGP_signature"

-----BEGIN PGP SIGNATURE-----

wsB5BAABCAAjFiEEo3HqZZwL7MgrcVMTR+u171r99UQFAmJOCtsFAwAAAAAACgkQR+u171r99USt
jAf+Jqn2i8WZUDjj7wNiYznxQzyyjhsmvUb2d7NygsZaC0lcdNuEpjkhWG+Cn7tc5mPuWkbP2nz0
HGpERxnAnf+6chQw6E/3ZXVKCBM+HdiVw1HpmnX91K5FiLecnPC8aD5VlFsrGg7LpTtKBLCwgwls
ssSPRqJvI5wYZEsiGydp/nMcaJeruVOXpjwH7kUDy5HvANKOdtM3X2JJMxHbwPsqtbwo8nGAiE9r
NaNLI9hO0Ljfud4rgCaHo0dWq9sD9zAKOvbmDbSGgQbgVXxgh2Oz+lVmfHfC6MMGcM57K1HJ+LnH
QXhvEcKQif+HVq2LNpwkTRjJdY4a0ajeGSIFS5FXuA==
=LAcb
-----END PGP SIGNATURE-----

--------------Lic1usorjc8S7FifC6L0nUmq--