From: egoitz@ramattack.net
To: Stefan Esser
Cc: freebsd-fs@freebsd.org, freebsd-hackers@freebsd.org, freebsd-performance@freebsd.org, Rainer Duffner
Date: Thu, 07 Apr 2022 14:30:47 +0200
Subject: Re: Re: Re: Desperate with 870 QVO and ZFS
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers
References: <4e98275152e23141eae40dbe7ba5571f@ramattack.net> <665236B1-8F61-4B0E-BD9B-7B501B8BD617@ultra-secure.de> <0ef282aee34b441f1991334e2edbcaec@ramattack.net>

Hi Stefan,

An extremely interesting answer and email. I'm extremely thankful for all your in-depth explanations... They are really like gold for us.

I answer below, in blue bold, to better distinguish your lines from mine...

On 2022-04-06 23:49, Stefan Esser wrote:

> On 06.04.22 at 18:34, egoitz@ramattack.net wrote:
>
>> Hi Stefan!
>>
>> Thank you so much for your answer!! I answer below in green bold, for instance... for a better distinction...
>>
>> Very thankful for all your comments, Stefan!!! :) :) :)
>>
>> Cheers!!
>
> Hi,
>
> glad to hear that it is useful information - I'll add comments below ...
>
> EXTREMELY HELPFUL INFORMATION, REALLY! THANK YOU SO MUCH, STEFAN. VERY, VERY THANKFUL FOR YOUR NICE HELP!
>
> On 2022-04-06 17:43, Stefan Esser wrote:
>
> On 06.04.22 at 16:36, egoitz@ramattack.net wrote:
> Hi Rainer!
>
> Thank you so much for your help :) :)
>
> Well, I assume they are in a datacenter and there should not be a power outage...
>
> About dataset size... yes... ours are big... each dataset can easily be 3-4 TB...
>
> We bought them because they are for mailboxes, and mailboxes grow and grow... so that we have space for hosting them...
>
> Which mailbox format (e.g. mbox, maildir, ...) do you use?
>
> I'M RUNNING CYRUS IMAP, SO SORT OF MAILDIR... NORMALLY LOTS OF LITTLE FILES... SOMETIMES DIRECTORIES WITH TONS OF LITTLE FILES...

Assuming that many mails are much smaller than the erase block size of the SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size of 64 KB (just guessing), then 8 mails will be in an erase block. If half the mails are deleted, then the erase block will still occupy 64 KB, but only hold 32 KB of useful data (and the SSD will only be aware of this fact if TRIM has signaled which data is no longer relevant). The SSD will copy several partially filled erase blocks together into a smaller number of free blocks, which then are fully utilized. Later deletions will repeat this game, and your data will be copied multiple times until it has aged (and the user is less likely to delete further messages). This leads to "write amplification" - data is internally moved around and thus written multiple times.

STEFAN!! YOU ARE NICE!! I THINK THIS COULD EXPLAIN OUR WHOLE PROBLEM: WHY THE PERFORMANCE DEGRADATION IS SO RANDOM AND DOES NOT NECESSARILY MATCH THE PEAK IO HOURS... I COULD CAUSE THAT PERFORMANCE DEGRADATION JUST BY DELETING A COUPLE OF HUGE MAIL FOLDERS (PERHAPS 200,000 MAILS) AT A MEDIUM-TRAFFIC HOUR!!

THE PROBLEM IS THAT, AS FAR AS I KNOW, THE ERASE BLOCK SIZE OF AN SSD IS SOMETHING FIXED IN THE DISK FIRMWARE.
I DON'T REALLY KNOW WHETHER IT COULD PERHAPS BE MODIFIED WITH SAMSUNG MAGICIAN OR SOME OTHER SAMSUNG TOOL... OTHERWISE I DON'T REALLY SEE A WAY OF IMPROVING IT... BECAUSE, APART FROM THAT, YOU ARE DELETING A FILE IN A RAIDZ-2 ARRAY... NOT JUST ON ONE DISK... I ASSUME THAT ALIGNING THE CHUNK SIZE WITH THE RECORD SIZE AND WITH THE "SECRET" ERASE SIZE OF THE SSD COULD PERHAPS COMPENSATE FOR IT SLIGHTLY?

Larger mails are less of an issue since they span multiple erase blocks, which will be completely freed when such a message is deleted.

I SEE, I SEE, STEFAN...

Samsung has a lot of experience and generally good strategies to deal with such a situation, but SSDs specified for use in storage systems might be much better suited for that kind of usage profile.

YES... AND THE RIGHT DISKS FOR OUR PURPOSE... PERHAPS WEREN'T QVOS...

> We knew they had some speed issues, but we thought (as Samsung explains on the QVO site) that those issues only start after you exceed the speed buffer these disks have. We thought that as long as you didn't exceed its capacity (the capacity of the speed buffer), no speed problem would arise. Perhaps we were wrong?
>
> These drives are meant for small loads in a typical PC use case, i.e. some installations of software in the few GB range, else only files of a few MB being written, perhaps an import of media files that range from tens to a few hundred MB at a time, but less often than once a day.
>
> WE MOVE, YOU KNOW... LOTS OF LITTLE FILES... AND LOTS OF DIFFERENT CONCURRENT MODIFICATIONS BY THE 1500-2000 CONCURRENT IMAP CONNECTIONS WE HAVE...

I do not expect the read load to be a problem (except possibly when the SSD is moving data from SLC to QLC blocks, but even then reads will get priority). But writes and trims might very well overwhelm the SSD, especially when it's getting full. Keeping a part of the SSD unused (excluded from the partitions created) will lead to a large pool of unused blocks.
This will reduce the write amplification - there are many free blocks in the "unpartitioned part" of the SSD, and thus there is less urgency to compact partially filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a partition used for the ZPOOL, then 1/4 of each erase block could be free due to deletions/TRIM without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good strategy to improve its performance and resilience.

WELL, WE HAVE ALLOCATED ALL THE DISK SPACE... BUT NOT USED IT... JUST ALLOCATED IT... YOU KNOW... WE DID A ZPOOL CREATE WITH THE WHOLE DISKS...

>> As the SSD fills, the space available for the single level write cache gets smaller
>>
>> THE SINGLE LEVEL WRITE CACHE IS THE CACHE THESE SSD DRIVES HAVE TO COMPENSATE FOR THE SPEED ISSUES CAUSED BY USING QLC MEMORY? IS THAT WHAT YOU ARE REFERRING TO? SORRY, I DON'T UNDERSTAND THIS PARAGRAPH WELL.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as 24 GB of data in QLC mode.

OK, TRUE... YES...

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (600 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

AHH! YOU MEAN THAT THE SLC CAPACITY FOR SPEEDING UP THE QLC DISKS IS TAKEN FROM EACH SINGLE LAYER OF THE QLC?

Therefore, the fraction of the cells used as an SLC cache is reduced when it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cells).

SORRY, I DON'T GET THIS LAST SENTENCE... I DON'T UNDERSTAND IT BECAUSE I DON'T REALLY KNOW THE MEANING OF TN... BUT I THINK I'M GETTING THE IDEA IF YOU ARE SAYING THAT EACH QLC LAYER HAS ITS OWN SLC CACHE, TAKEN FROM THE DISK SPACE AVAILABLE FOR EACH QLC LAYER...

And with fewer SLC cells available for short term storage of data the probability of data being copied to QLC cells before the irrelevant messages have been deleted is significantly increased.
And that will again lead to many more blocks with "holes" (deleted messages) in them, which then need to be copied possibly multiple times to compact them.

IF I'M CORRECT ABOVE, I THINK I GOT THE IDEA, YES...

>> (on many SSDs, I have no numbers for this particular device), and thus the amount of data that can be written at single cell speed shrinks as the SSD gets full.
>>
>> I have just looked up the size of the SLC cache, it is specified to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 TB version, smaller models will have a smaller SLC cache).
>>
>> ASSUMING YOU WERE TALKING ABOUT THE SPEED-COMPENSATING CACHE WE DISCUSSED PREVIOUSLY, I SHOULD SAY THESE ARE 870 QVOS, BUT THE 8TB VERSION. SO THEY SHOULD HAVE THE BIGGEST CACHE FOR COMPENSATING THE SPEED ISSUES...

I have looked up the data: the larger versions of the 870 QVO have the same SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB more if there are enough free blocks.

OURS IS THE 8TB MODEL, SO I ASSUME IT COULD HAVE BIGGER LIMITS. THE DISKS ARE MOSTLY EMPTY, REALLY... SO... FOR INSTANCE...

zpool list
NAME          SIZE   ALLOC  FREE   CKPOINT  EXPANDSZ  FRAG  CAP  DEDUP  HEALTH  ALTROOT
root_dataset  448G   2.29G  446G   -        -         1%    0%   1.00x  ONLINE  -
mail_dataset  58.2T  11.8T  46.4T  -        -         26%   20%  1.00x  ONLINE  -

I SUPPOSE FRAGMENTATION HAS AN EFFECT TOO...

>> But after writing those few GB at a speed of some 500 MB/s (i.e. after 12 to 150 seconds), the drive will need several minutes to transfer those writes to the quad-level cells, and will operate at a fraction of the nominal performance during that time. (QLC writes max out at 80 MB/s for the 1 TB model, 160 MB/s for the 2 TB model.)
>>
>> WELL, WE HAVE THE 8TB MODEL. I THINK I HAVE UNDERSTOOD WHAT YOU WROTE IN THE PREVIOUS PARAGRAPH. YOU SAID THEY CAN BE FAST, BUT NOT CONSTANTLY, BECAUSE LATER THEY HAVE TO WRITE ALL THAT FROM THE CACHE TO THEIR PERMANENT STORAGE, AND THAT'S SLOW. AM I WRONG?
>> EVEN IN THE 8TB MODEL, DO YOU THINK, STEFAN?

The controller in the SSD supports a given number of channels (e.g. 4), each of which can access a Flash chip independently of the others. Small SSDs often have fewer Flash chips than there are channels (and thus a lower throughput, especially for writes), but the larger models often have more chips than channels and thus the performance is capped.

THIS IS TOTALLY LOGICAL. IF A QVO DISK PERFORMED AS WELL AS OR BETTER THAN AN INTEL WITHOUT CONSEQUENCES... WHO WOULD BUY AN EXPENSIVE INTEL ENTERPRISE DISK?

In the case of the 870 QVO, the controller supports 8 channels, which allows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has only 4 Flash chips and is thus limited to 80 MB/s in that situation, while the larger versions have 8, 16, or 32 chips. But due to the limited number of channels, the write rate is limited to 160 MB/s even for the 8 TB model.

TOTALLY LOGICAL, STEFAN...

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in this limit.

>> THE MAIN PROBLEM WE ARE FACING IS THAT AT SOME PEAK MOMENTS, WHEN THE MACHINE SERVES CONNECTIONS FOR ALL THE INSTANCES IT HAS, AND ONLY, AS I SAID, AT SOME PEAK MOMENTS... LIKE 9 AM OR 11 AM... THE MACHINE SEEMS TO BECOME SLOWER... AS IF THE DISKS WEREN'T ABLE TO SERVE EVERYTHING THEY HAVE TO SERVE... AT THESE MOMENTS, NO BIG FILES ARE MOVED... BUT AS WE HAVE 1800-2000 CONCURRENT IMAP CONNECTIONS... NORMALLY EACH ONE IS MAKING... LITTLE CHANGES IN ITS MAILBOX. DO YOU THINK THESE DISKS ARE PERHAPS NOT APPROPRIATE FOR THIS KIND OF USAGE?

I'd guess that the drives get into a state in which they have to recycle lots of partially free blocks (i.e.
perform kind of a garbage collection) and then three kinds of operations are competing with each other:

 1. reads (generally prioritized)
 2. writes (filling the SLC cache up to its maximum size)
 3. compactions of partially filled blocks (required to make free blocks available for re-use)

Writes can only proceed if there are sufficient free blocks, which on a filled SSD with partially filled erase blocks means that operations of type 3. need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.

IT COULD BE, ALTHOUGH THE DISKS ARE NOT FULL... THE POOLS ARE AT 20 OR 30% OF CAPACITY, WITH FRAGMENTATION AT 20%-30% (AS ZPOOL LIST STATES).

>> And cheap SSDs often have no RAM cache (not checked, but I'd be surprised if the QVO had one) and thus cannot keep bookkeeping data in such a cache, further limiting the performance under load.
>>
>> THIS BROCHURE (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf AND THE DATASHEET https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf) SAYS, IF I HAVE READ IT PROPERLY, THAT THE 8TB DRIVE HAS 8GB OF RAM? I ASSUME THAT IS WHAT THEY CALL THE TURBO WRITE CACHE?

No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip).

I SEE, I SEE...

The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes.

OURS... SEEM TO HAVE 8GB OF LPDDR4 RAM... AS THE DATASHEET STATES...

There is no simple relation between SSD "block number" (in the sense of a disk block on some track of a magnetic disk) and its storage location on the Flash chip. If an existing "data block" (what would be a sector on a hard disk drive) is overwritten, it is instead written at the end of an "open" erase block, and a pointer from that "block number" to the location on the chip is stored in an index.
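
As a rough illustration of the indirection described above, here is a toy model in Python. It is only a sketch under assumptions, not Samsung's actual firmware: the class name ToyFTL, the 8 pages per erase block and the logical block numbers are all made up for the example. It shows an overwrite being appended to the "open" erase block while the index is updated and the old location is merely marked stale.

```python
"""Toy model of SSD block-number indirection (illustrative only)."""

PAGES_PER_ERASE_BLOCK = 8  # assumption, e.g. 8 pages of 8 KB in a 64 KB erase block


class ToyFTL:
    def __init__(self):
        self.index = {}      # logical block number -> (erase block, page)
        self.stale = set()   # physical locations that hold superseded data
        self.open_block = 0  # erase block currently being filled
        self.next_page = 0   # next free page inside the open erase block

    def write(self, lbn):
        """Append data for logical block `lbn` and repoint the index at it."""
        if self.next_page == PAGES_PER_ERASE_BLOCK:
            # The open erase block is full, so continue in the next one.
            self.open_block += 1
            self.next_page = 0
        location = (self.open_block, self.next_page)
        self.next_page += 1
        old = self.index.get(lbn)
        if old is not None:
            # Overwrite: the old copy is not erased, only marked stale.
            self.stale.add(old)
        self.index[lbn] = location
        return location

    def read(self, lbn):
        """Look up where the current copy of `lbn` lives."""
        return self.index[lbn]


ftl = ToyFTL()
first = ftl.write(42)
ftl.write(7)
second = ftl.write(42)  # overwriting block 42 lands in a new page
print(first, second, ftl.read(42), ftl.stale)
```

The stale entries are the "holes" that a later compaction pass has to clean up, and every read goes through the index first, which is why keeping that index in RAM matters for random reads.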
This index is written to Flash storage and could be read from it, but it is much faster to have a RAM with these pointers that can be accessed independently of the Flash chips. This RAM is required for high transaction rates (especially random reads), but it does not really help speed up writes.

I SEE... I SEE... I GOT IT...

>> And the resilience (max. amount of data written over its lifetime) is also quite low - I hope those drives are used in some kind of RAID configuration.
>>
>> YEP, WE USE RAIDZ-2

Makes sense ... But you know that you multiply the amount of data written due to the redundancy. If a single 8 KB block is written, for example, 3 * 8 KB will be written if you take the 2 redundant copies into account.

I SEE, I SEE...

>> The 870 QVO is specified for 370 full capacity writes, i.e. 370 TB for the 1 TB model. That's still a few hundred GB a day - but only if the write amplification stays in a reasonable range ...
>>
>> WELL, YES... 2880TB IN OUR CASE... NOT BAD, IS IT?

I assume that 2880 TB is your total storage capacity? That's not too bad, in fact. ;-)

NO... THE TOTAL AMOUNT OF WRITES YOU CAN DO... BEFORE THE DISK "BREAKS"... LOL :) :) ... WE HAVE 50TB STORAGE POOLS BUILT FROM 8 DISKS OF 8TB IN RAIDZ-2...

This would be 360 * 8 TB ... Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of write throughput (if all writes were evenly distributed). Taking all odds into account, I'd guess that at least 10 GB/s can be continuously written (if supported by the CPUs and controllers). But this may not be true if the drive is simultaneously reading, trimming, and writing ...

I SEE... IT'S EXTREMELY MISLEADING, YOU KNOW... BECAUSE... YOU CAN COPY FIVE 50GB MAILBOXES CONCURRENTLY, FOR INSTANCE... AND YOU SATURATE A GIGABIT INTERFACE WHILE COPYING (OBVIOUSLY BECAUSE THE DISKS CAN SUSTAIN THAT THROUGHPUT)... BUT LATER... YOU SEE... YOU ARE AT AN HOUR WHEN YESTERDAY, AND EVEN 4 DAYS BEFORE, YOU HAD NO ISSUES... AND THAT DAY...
YOU SEE THE ISSUE I DESCRIBED... EVEN WITHOUT BEING EXACTLY AT A PEAK HOUR (PERHAPS IT IS EVEN TWO HOURS AFTER THE PEAK HOUR)... OR... BUT I WASN'T AWARE OF ALL THE THINGS YOU SAY IN THIS EMAIL...

I have seen advice to not use compression in a high load scenario in some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels.

I SEE... YOU SAY COMPRESSION SHOULD BE ENABLED... THE MAIN REASON WE HAVE NOT ENABLED IT YET IS TO KEEP THE SYSTEM AS CLOSE AS POSSIBLE TO THE DEFAULT CONFIG... YOU KNOW... SO THAT LATER WE CAN ASK ON THESE MAILING LISTS IF WE HAVE AN ISSUE... BECAUSE, YOU KNOW... IT IS FAR EASIER TO ASK ABOUT SOMETHING STRANGE YOU ARE SEEING WHEN THAT STRANGE THING HAPPENS ON A WELL-TESTED CONFIG, LIKE THE DEFAULT CONFIG...

BUT NOW YOU SAY, STEFAN... IF YOU SWITCH BETWEEN COMPRESSION ALGORITHMS YOU WILL END UP WITH A MIX OF DIFFERENT FILES COMPRESSED IN DIFFERENT WAYS... ISN'T THAT A BIT OF A DISASTER LATER? DOESN'T IT AFFECT PERFORMANCE IN SOME WAY?

One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes).

WE WOULD USE, I ASSUME, LZ4... WHICH IS THE LEAST "EXPENSIVE" COMPRESSION ALGORITHM FOR THE CPU... AND, I ASSUME, ALSO TO AVOID DELAYS WHEN ACCESSING DATA... DO YOU RECOMMEND ANOTHER ONE?
DO YOU ALWAYS RECOMMEND COMPRESSION, THEN?

I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load.

BUT WHEN YOU ENABLE COMPRESSION... ONLY NEW OR MODIFIED DATA GETS COMPRESSED. AM I WRONG?

BY THE WAY, WE HAVE MORE OR LESS 1/4 OF EACH DISK USED (12 TB ALLOCATED IN A POOL, AS STATED BY ZPOOL LIST, DIVIDED BETWEEN 8 DISKS OF 8TB...)... DO YOU THINK WE COULD BE SUFFERING FROM WRITE AMPLIFICATION AND SO ON... WITH SO LITTLE DISK SPACE USED ON EACH DISK?

Regards, STefan

HEY MATE, YOUR MAIL IS INCREDIBLE. IT HAS HELPED US A LOT. CAN WE INVITE YOU TO A CUP OF COFFEE OR A BEER THROUGH PAYPAL OR SIMILAR? CAN I HELP YOU IN SOME WAY?

CHEERS!

Hi Stefan,


An extremely interesting answer and email. Extremely thankful for all yo= ur deep explatanations...... They are like gold for us really....


I answer below and in blue bold for better distinction between your line= s and mine ones...

 


El 2022-04-06 23:49, Stefan Esser escribió:


ATENCION: Este correo se ha enviado = desde fuera de la organización. No pinche en los enlaces ni abra los= adjuntos a no ser que reconozca el remitente y sepa que el contenido es se= guro.

Am 06.04.22 um 18:34 schrieb egoitz@ramattack.net:

Hi Stefan!

Thank you so much for your answer!!. I do answer below in green bold for= instance... for a better distinction....

Very thankful for all your comments Stefan!!! :) :) :)

Cheers!!

Hi,

glad to hear that it is useful information - I'll add comments below .= =2E.


Extremely helpful information re= ally! Thank you so much Steffan really. Very very thankful for your nice he= lp!.


El 2022-04-06 17:43, Stefan Esser escribió:



Am 06.04.22 um 16:36 schrieb egoitz@ramattack.net:
Hi Rainer!

Thank you so much for your help :) :)
=
Well I assume they are in a datacenter and should not be a power ou= tage....

About dataset size... yes... our ones are big... they= can be 3-4 TB easily each
dataset.....

We bought them, = because as they are for mailboxes and mailboxes grow and
grow.... for= having space for hosting them...

Which mailbox format (e.g. mbox, maildir, ...) do you use?
 
I'm running Cyrus imap so sort of = Maildir... too many little files normally..... Sometimes directories with t= ons of little files....

Assuming that many mails are much smaller than the erase block size of t= he SSD, this may cause issues. (You may know the following ...)

For example, if you have message sizes of 8 KB and an erase block size o= f 64 KB (just guessing), then 8 mails will be in an erase block. If half th= e mails are deleted, then the erase block will still occupy 64 KB, but only= hold 32 KB of useful data (and the SSD will only be aware of this fact if = TRIM has signaled which data is no longer relevant). The SSD will copy seve= ral partially filled erase blocks together in a smaller number of free bloc= ks, which then are fully utilized. Later deletions will repeat this game, a= nd your data will be copied multiple times until it has aged (and the user = is less likely to delete further messages). This leads to "write amplificat= ion" - data is internally moved around and thus written multiple times.


Stefan!! you are nice!! I think = this could explain all our problem. So, why we are having the most randomne= ss in our performance degradation and that does not necessarily has to matc= h with the most io peak hours... That I could cause that performance degrad= ation just by deleting a couple of huge (perhaps 200.000 mails) mail folder= s in a middle traffic hour time!!


The problem is that by what I kn= ow, erase block size of an SSD disk is something fixed in the disk firmware= =2E I don't really know if perhaps it could be modified with Samsung magici= an or those kind of tool of Samsung.... else I don't really see the manner = of improving it... because apart from that, you are deleting a file in raid= z-2 array... no just in a disk... I assume aligning chunk size, with record= size and with the "secret" erase size of the ssd, perhaps could be slightl= y compensated?.

Larger mails are less of an issue since they span multiple erase blocks,= which will be completely freed when such a message is deleted.

I see I see Stefan...

Samsung has a lot of experience and generally good strategies to deal wi= th such a situation, but SSDs specified for use in storage systems might be= much better suited for that kind of usage profile.

Yes... and the disks for our pur= pose... perhaps weren't QVOs....


We knew they had some speed issues, but those speed issues, we thou= ght (as
Samsung explains in the QVO site) they started after exceedin= g the speeding
buffer this disks have. We though that meanwhile you d= idn't exceed it's
capacity (the capacity of the speeding buffer) no s= peed problem arises. Perhaps
we were wrong?.

These drives are meant for small loads in a typical PC use case,
i.e. some installations of software in the few GB range, else only
= files of a few MB being written, perhaps an import of media files
th= at range from tens to a few hundred MB at a time, but less often
than= once a day.
 
We move, you know... lots of littl= e files... and lot's of different concurrent modifications by 1500-2000 con= current imap connections we have...

I do not expect the read load to be a problem (except possibly when the = SSD is moving data from SLC to QLC blocks, but even then reads will get pri= ority). But writes and trims might very well overwhelm the SSD, especially = when its getting full. Keeping a part of the SSD unused (excluded from the = partitions created) will lead to a large pool of unused blocks. This will r= educe the write amplification - there are many free blocks in the "unpartit= ioned part" of the SSD, and thus there is less urgency to compact partially= filled blocks. (E.g. if you include only 3/4 of the SSD capacity in a part= ition used for the ZPOOL, then 1/4 of each erase block could be free due to= deletions/TRIM without any compactions required to hold all this data.)

Keeping a significant percentage of the SSD unallocated is a good strate= gy to improve its performance and resilience.

Well, we have allocated all the = disk space... but not used... just allocated.... you know... we do a zpool = create with the whole disks.....

As the SSD fills, the space available for the single level write
cac= he gets smaller
 
The single level write cache is th= e cache these ssd drivers have, for compensating the speed issues they have= due to using qlc memory?. Do you refer to that?. Sorry I don't understand = well this paragraph.

Yes, the SSD is specified to hold e.g. 1 TB at 4 bits per cell. The SLC = cache has only 1 bit per cell, thus a 6 GB SLC cache needs as many cells as= 24 GB of data in QLC mode.

Ok, true.... yes....

A 100 GB SLC cache would reduce the capacity of a 1 TB SSD to 700 GB (60= 0 GB in 150 tn QLC cells plus 100 GB in 100 tn SLC cells).

Ahh! you mean that SLC capacity = for speeding up the QLC disks, is obtained from each single layer of the QL= C?.

Therefore, the fraction of the cells used as an SLC cache is reduced whe= n it gets full (e.g. ~1 TB in ~250 tn QLC cells, plus 6 GB in 6 tn SLC cell= s).

Sorry I don't get this last sent= ence... don't understand it because I don't really know the meaning of tn= =2E..

but I think I'm getting the idea= if you say that each QLC layer, has it's own SLC cache obtained from the d= isk space avaiable for each QLC layer....

And with less SLC cells available for short term storage of data the pro= bability of data being copied to QLC cells before the irrelevant messages h= ave been deleted is significantly increased. And that will again lead to ma= ny more blocks with "holes" (deleted messages) in them, which then need to = be copied possibly multiple times to compact them.

If I correct above, I think I go= t the idea yes....

(on many SSDs, I have no numbers for this
particular device), and thus the amount of data that can be
written = at single cell speed shrinks as the SSD gets full.
 


I have just looked up the size of the SLC cache, it is speci= fied
to be 78 GB for the empty SSD, 6 GB when it is full (for the 2 T= B
version, smaller models will have a smaller SLC cache).
 
Assuming you were talking about th= e cache for compensating speed we previously commented, I should say these = are the 870 QVO but the 8TB version. So they should have the biggest cache = for compensating the speed issues...

I have looked up the data: the larger versions of the 870 QVO have the s= ame SLC cache configuration as the 2 TB model, 6 GB minimum and up to 72 GB= more if there are enough free blocks.

Ours one is the 8TB model so I a= ssume it could have bigger limits. The disks are mostly empty, really.... s= o... for instance....

zpool list
= NAME     &= nbsp;       SIZE  ALLOC   FREE=   CKPOINT  EXPANDSZ   FRAG    CAP  = DEDUP  HEALTH  ALTROOT
root_dataset        = ;     448G  2.29G   446G  &nbs= p;     -        = ; -     1%     0%  1.00x = ONLINE  -
I suppose fragmentation affects = too....

But after writing those few GB at a speed of some 500 MB/s (i.e.
aft= er 12 to 150 seconds), the drive will need several minutes to
transfe= r those writes to the quad-level cells, and will operate
at a fractio= n of the nominal performance during that time.
(QLC writes max out at= 80 MB/s for the 1 TB model, 160 MB/s for the
2 TB model.)
 
Well we are in the 8TB model. I th= ink I have understood what you wrote in previous paragraph. You said they c= an be fast but not constantly, because later they have to write all that to= their perpetual storage from the cache. And that's slow. Am I wrong?. Even= in the 8TB model you think Stefan?.

The controller in the SSD supports a given number of channels (e.g 4), e= ach of which can access a Flash chip independently of the others. Small SSD= s often have less Flash chips than there are channels (and thus a lower thr= oughput, especially for writes), but the larger models often have more chip= s than channels and thus the performance is capped.

This is totally logical. If a QV= O disk would outperform best or similar than an Intel without consequences= =2E... who was going to buy a expensive Intel enterprise?.<= /p>

In the case of the 870 QVO, the controller supports 8 channels, which al= lows it to write 160 MB/s into the QLC cells. The 1 TB model apparently has= only 4 Flash chips and is thus limited to 80 MB/s in that situation, while= the larger versions have 8, 16, or 32 chips. But due to the limited number= of channels, the write rate is limited to 160 MB/s even for the 8 TB model= =2E

Totally logical Stefan...=

If you had 4 * 2 TB instead, the throughput would be 4 * 160 MB/s in thi= s limit.

The main problem we are facing is = that in some peak moments, when the machine serves connections for all the = instances it has, and only as said in some peak moments... like the 09am or= the 11am.... it seems the machine becomes slower... and like if the disks = weren't able to serve all they have to serve.... In these moments, no big f= iles are moved... but as we have 1800-2000 concurrent imap connections... n= ormally they are doing each one... little changes in their mailbox. Do you = think perhaps this disks then are not appropriate for this kind of usage?-<= /strong>

I'd guess that the drives get into a state in which they have to recycle= lots of partially free blocks (i.e. perform kind of a garbage collection) = and then three kinds of operations are competing with each other:

  1. reads (generally prioritized)
  2. writes (filling the SLC cache up to its maximum size)
  3. compactions of partially filled blocks (required to make free blocks av= ailable for re-use)

Writes can only proceed if there are sufficient free blocks, which on a = filled SSD with partially filled erase blocks means that operations of type= 3. need to be performed with priority to not stall all writes.

My assumption is that this is what you are observing under peak load.

It could be although the disks a= re not filled.... the pool are at 20 or 30% of capacity and fragmentation f= rom 20%-30% (as zpool list states).

And cheap SSDs often have no RAM cache (not checked, but I'd be
surprised if the QVO had one) and thus cannot keep bookkeeping data
in such a cache, further limiting the performance under load.
 
This brochure (https://semiconductor.samsung.com/resources/brochure/870_Series_Brochure.pdf) and the datasheet (https://semiconductor.samsung.com/resources/data-sheet/Samsung_SSD_870_QVO_Data_Sheet_Rev1.1.pdf) say, if I have read them properly, that the 8 TB drive has 8 GB of RAM. I assume that is what they call the turbo write cache?

No, the turbo write cache consists of the cells used in SLC mode (which can be any cells, not only cells in a specific area of the flash chip).

I see, I see...

The RAM is needed for fast lookup of the position of data for reads and of free blocks for writes.

Ours seem to have 8 GB of LPDDR4 RAM, as the datasheet states.

There is no simple relation between SSD "block number" (in the sense of a disk block on some track of a magnetic disk) and its storage location on the Flash chip. If an existing "data block" (what would be a sector on a hard disk drive) is overwritten, it is instead written at the end of an "open" erase block, and a pointer from that "block number" to the location on the chip is stored in an index. This index is written to Flash storage and could be read from it, but it is much faster to have a RAM with these pointers that can be accessed independently of the Flash chips. This RAM is required for high transaction rates (especially random reads), but it does not really help speed up writes.
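The indirection described above can be sketched as a toy mapping table. This is a minimal illustration, not how any real SSD firmware is written: the class name `TinyFTL`, the erase block size of 4 pages, and the `stale` set are all hypothetical and exist only to show that an overwrite lands in a new place and leaves an outdated copy behind.

```python
class TinyFTL:
    """Toy flash translation layer: logical block number -> physical location."""

    def __init__(self, pages_per_erase_block: int = 4):
        self.pages_per_erase_block = pages_per_erase_block
        self.index = {}        # the pointer table that a drive would keep in RAM
        self.stale = set()     # physical locations that now hold outdated data
        self.open_block = 0    # erase block currently being filled
        self.next_page = 0     # next free page inside the open erase block

    def write(self, logical_block: int) -> tuple:
        """Append the data to the open erase block and update the pointer."""
        old_location = self.index.get(logical_block)
        if old_location is not None:
            # The old copy is not overwritten in place; it only becomes stale
            # and must later be reclaimed by garbage collection.
            self.stale.add(old_location)
        location = (self.open_block, self.next_page)
        self.index[logical_block] = location
        self.next_page += 1
        if self.next_page == self.pages_per_erase_block:
            self.open_block += 1
            self.next_page = 0
        return location

    def lookup(self, logical_block: int) -> tuple:
        """A read first needs this lookup, which is why a RAM index helps."""
        return self.index[logical_block]


ftl = TinyFTL(pages_per_erase_block=2)
ftl.write(7)           # first copy of logical block 7
ftl.write(9)
ftl.write(7)           # overwrite: goes to a new location
print(ftl.lookup(7), ftl.stale)
```

Many small overwrites, as with busy mailboxes, leave erase blocks that are partly stale, which is the material the compaction step has to work through.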

I see... I see.... I got it...


And the resilience (max. amount of data written over its lifetime)
is also quite low - I hope those drives are used in some kind of
RAID configuration.
 
Yep, we use raidz-2.

Makes sense ... But you know that you multiply the amount of data written due to the redundancy.

If a single 8 KB block is written, for example, 3 * 8 KB will be written if you take the 2 redundant copies into account.
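The 3 * 8 KB figure can be reproduced with a small sizing sketch. It is modeled on my understanding of how RAID-Z sizes an allocation (data sectors, plus parity sectors per stripe row, padded to a multiple of parity + 1); treat that rule, the function name, and the `ashift` values as assumptions rather than something stated in this thread.

```python
def raidz_allocated_bytes(block_bytes: int, disks: int = 8,
                          parity: int = 2, ashift: int = 12) -> int:
    """Approximate bytes written to the vdev for one logical block.

    Assumed rule: data sectors + parity sectors per stripe row,
    rounded up to a multiple of (parity + 1) sectors.
    """
    sector = 1 << ashift
    data_sectors = -(-block_bytes // sector)            # ceiling division
    data_columns = disks - parity
    parity_sectors = parity * -(-data_sectors // data_columns)
    total = data_sectors + parity_sectors
    pad_to = parity + 1
    total = -(-total // pad_to) * pad_to                # padding sectors
    return total * sector


for size_kb in (8, 128):
    written = raidz_allocated_bytes(size_kb * 1024)
    print(f"{size_kb} KB block -> {written // 1024} KB on disk "
          f"(factor {written / (size_kb * 1024):.2f})")
```

Under these assumptions small blocks pay the full factor of 3, while large blocks spread the parity over more data and come out much closer to the nominal 8/6 overhead of an 8-disk raidz-2.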

I see, I see...


The 870 QVO is specified for 360 full capacity
writes, i.e. 360 TB for the 1 TB model. That's still a few hundred
GB a day - but only if the write amplification stays in a reasonable
range ...
 
Well yes... 2880 TB in our case... not bad, is it?

I assume that 2880 TB is your total storage capacity? That's not too bad, in fact. ;-)

No... the total amount of data you can write before the disk "breaks"...


lol :) :) ... we have storage of 50 TB, from 8 disks of 8 TB in raidz-2...


This would be 360 * 8 TB ...
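For reference, the endurance arithmetic behind the 2880 TB figure and the "few hundred GB a day" estimate can be written out as follows. The 3-year period and the `extra_write_factor` parameter are assumptions added for illustration (the factor stands in for anything that multiplies writes, such as redundancy or write amplification); only the 360 full-capacity writes come from the thread.

```python
FULL_CAPACITY_WRITES = 360  # rated full-drive writes, as discussed above


def rated_tbw(capacity_tb: float) -> float:
    """Total data (TB) that may be written over the drive's lifetime."""
    return FULL_CAPACITY_WRITES * capacity_tb


def daily_write_budget_gb(capacity_tb: float, years: float = 3.0,
                          extra_write_factor: float = 1.0) -> float:
    """GB per day of application data that stays within the rating.

    years: assumed service period (hypothetical default of 3 years).
    extra_write_factor: hypothetical multiplier for writes beyond the
    application's own data (redundancy, write amplification).
    """
    total_gb = rated_tbw(capacity_tb) * 1000
    return total_gb / (years * 365) / extra_write_factor


print(f"8 TB model: {rated_tbw(8):.0f} TB rated")
print(f"1 TB model: {daily_write_budget_gb(1):.0f} GB/day")
print(f"1 TB model, writes tripled: "
      f"{daily_write_budget_gb(1, extra_write_factor=3):.0f} GB/day")
```

The budget shrinks in direct proportion to whatever multiplies the writes, which is why the write amplification matters as much as the rating itself.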

Even at 160 MB/s per 8 TB SSD this would allow for more than 50 GB/s of write throughput (if all writes were evenly distributed).

Taking all odds into account, I'd guess that at least 10 GB/s can be continuously written (if supported by the CPUs and controllers).

But this may not be true if the drive is simultaneously reading, trimming, and writing ...

I see... It's extremely misleading, you know, because you can copy five 50 GB mailboxes concurrently, for instance, and flood a gigabit interface while copying (obviously because the disks can keep up with that throughput). But later, at an hour when you had no issues yesterday or even four days earlier, you see the problem I described, even though it is not exactly a peak hour (perhaps it is even two hours after the peak). But I wasn't aware of all the things you explain in this email...

I have seen advice to not use compression in a high load scenario in some other reply.

I tend to disagree: Since you seem to be limited when the SLC cache is exhausted, you should get better performance if you compress your data. I have found that zstd-2 works well for me (giving a significant overall reduction of size at reasonable additional CPU load). Since ZFS allows switching compression algorithms at any time, you can experiment with different algorithms and levels.

I see... you say compression should be enabled. The main reason we have not enabled it yet is to keep the system as close as possible to the default config, you know, so that we can later ask on these mailing lists if we have an issue. It is far easier to ask about something strange you are seeing when the setup is close to a well-tested config, like the default one...

But now, Stefan, you say... if you switch between compression algorithms, you will end up with a mix of files compressed in different ways. Isn't that a bit of a disaster later? Doesn't it affect performance in some way?

One advantage of ZFS compression is that it applies to the ARC, too. And a compression factor of 2 should easily be achieved when storing mail (not for .docx, .pdf, .jpg files though). Having more data in the ARC will reduce the read pressure on the SSDs and will give them more cycles for garbage collections (which are performed in the background and required to always have a sufficient reserve of free flash blocks for writes).

We would use lz4, I assume, which is the least "expensive" compression algorithm for the CPU, and I assume it also avoids delays when accessing data. Do you recommend another one? Do you always recommend compression, then?

I'd give it a try - and if it reduces your storage requirements by 10% only, then keep 10% of each SSD unused (not assigned to any partition). That will greatly improve the resilience of your SSDs, reduce the write-amplification, will allow the SLC cache to stay at its large value, and may make a large difference to the effective performance under high load.

But when you enable compression, only newly written or modified data gets compressed. Am I wrong?

By the way, we have more or less 1/4 of each disk used (12 TB allocated in a pool, as reported by zpool list, spread across 8 disks of 8 TB). Do you think we could be suffering from write amplification and the like, with so little disk space used on each disk?

Regards, Stefan

Hey mate, your mail is incredible. It has helped us a lot. Can we buy you a cup of coffee or a beer through PayPal or similar? Can I help you in some way?


Cheers!

