Re: ZFS + mysql appears to be killing my SSD's

From: Daniel Lysfjord via stable <stable_at_freebsd.org>
Date: Mon, 05 Jul 2021 15:37:03 UTC
"Karl Denninger" <karl@denninger.net> skrev 5. juli 2021 kl. 17:10:

> On 7/5/2021 10:30, Pete French wrote:
> 
>> On 05/07/2021 14:37, Stefan Esser wrote:
>>> Hi Pete,
>>> 
>>> have you checked the drive state and statistics with smartctl?
>> 
>> Hi, thanks for the reply - yes, I did check the statistics, and they
>> don't make a lot of sense. I was just looking at them again, in fact.
>> 
>> So, here's one of the machines where we changed a drive when this
>> first started, four weeks ago.
>> 
>> root@telehouse04:/home/webadmin # smartctl -a /dev/ada0 | grep Perc
>> 169 Remaining_Lifetime_Perc 0x0000   082   082   000    Old_age   Offline      -       82
>> root@telehouse04:/home/webadmin # smartctl -a /dev/ada1 | grep Perc
>> 202 Percent_Lifetime_Remain 0x0030   100   100   001    Old_age   Offline      -       0
>> 
>> Now, from that you might think the second drive was the one changed,
>> but no - it's the first one, which is now at 82% lifetime remaining!
>> The other drive, still at 100%, has been in there a year. The drives
>> are from different manufacturers, which unfortunately makes comparing
>> most of the numbers tricky.
>> 
>> I am now even more worried than when I sent the first email - if that
>> 18% is accurate then I am going to be doing this again in another four
>> months, and that's not sustainable. It also looks as if this problem
>> has got a lot worse recently, though I wasn't looking at the numbers
>> before, only noticing the failures. If I look at the 'Percentage Used
>> Endurance Indicator' instead of the 'Percent_Lifetime_Remain' value
>> then I see some of those well over 200%. That value is, on the newer
>> drives, 100 minus the 'Percent_Lifetime_Remain' value, so I guess they
>> have the same underlying metric.
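>>
>> (On the four-month figure: assuming the wear rate stays roughly
>> linear, 18% consumed in four weeks is about 4.5% a week, so the
>> remaining 82% lasts around 18 more weeks - call it four months.)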
>> 
>> I didn't mention it in my original email, but I am encrypting these
>> with geli. Does geli do any write amplification at all? That might
>> explain the high write volumes...
>> 
>> -pete.
> 
> As noted elsewhere, assuming ashift=12, the answer on write
> amplification is no.
> 
> Geli should be initialized with -s 4096; I'm assuming you did that?
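>
> If it helps, a quick way to double-check both pieces is something like
> this (using one of my providers below as the example - substitute your
> own device and provider names):
>
> # geli list ada0p4.eli | grep Sectorsize
> # diskinfo -v /dev/ada0
> # zdb | grep ashift
>
> The geli provider should report Sectorsize: 4096 and the vdev entries
> in the zdb output should show ashift: 12. A provider initialized with
> 512-byte sectors can't be changed in place; it has to be re-created
> with -s 4096.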
> 
> I have a 5-unit geli-encrypted root pool, all Intel 240GB SSDs. They do
> not report remaining lifetime via SMART but do report indications of
> trouble. Here's one example snippet from one of the drives in that pool:
> 
> SMART Attributes Data Structure revision number: 1
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
> 5 Reallocated_Sector_Ct   -O--CK   098   098   000    -    0
> 9 Power_On_Hours          -O--CK   100   100   000    - 53264
> 12 Power_Cycle_Count       -O--CK   100   100   000    -    100
> 170 Available_Reservd_Space PO--CK   100   100   010    -    0
> 171 Program_Fail_Count      -O--CK   100   100   000    -    0
> 172 Erase_Fail_Count        -O--CK   100   100   000    -    0
> 174 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    41
> 175 Power_Loss_Cap_Test     PO--CK   100   100   010    -    631 (295 5442)
> 183 SATA_Downshift_Count    -O--CK   100   100   000    -    0
> 184 End-to-End_Error        PO--CK   100   100   090    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   000    -    0
> 190 Temperature_Case        -O---K   068   063   000    -    32 (Min/Max 29/37)
> 192 Unsafe_Shutdown_Count   -O--CK   100   100   000    -    41
> 194 Temperature_Internal    -O---K   100   100   000    -    32
> 197 Current_Pending_Sector  -O--CK   100   100   000    -    0
> 199 CRC_Error_Count         -OSRCK   100   100   000    -    0
> 225 Host_Writes_32MiB       -O--CK   100   100   000    - 1811548
> 226 Workld_Media_Wear_Indic -O--CK   100   100   000    -    205
> 227 Workld_Host_Reads_Perc  -O--CK   100   100   000    -    49
> 228 Workload_Minutes        -O--CK   100   100   000    - 55841
> 232 Available_Reservd_Space PO--CK   100   100   010    -    0
> 233 Media_Wearout_Indicator -O--CK   089   089   000    -    0
> 234 Thermal_Throttle        -O--CK   100   100   000    -    0/0
> 241 Host_Writes_32MiB       -O--CK   100   100   000    - 1811548
> 242 Host_Reads_32MiB        -O--CK   100   100   000    - 1423217
> ||||||_ K auto-keep
> |||||__ C event count
> ||||___ R error rate
> |||____ S speed/performance
> ||_____ O updated online
> |______ P prefailure warning
> 
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 2) ==
> 0x01  0x008  4             100  ---  Lifetime Power-On Resets
> 0x01  0x018  6    118722148102  ---  Logical Sectors Written
> 0x01  0x020  6        89033895  ---  Number of Write Commands
> 0x01  0x028  6     93271951909  ---  Logical Sectors Read
> 0x01  0x030  6         6797990  ---  Number of Read Commands
> 
> 6 years in use, roughly, and no indication of anything going on in
> terms of warnings about utilization or wear-out. There is a MySQL
> database on this box, used by Cacti, that is running all the time, and
> while the traffic isn't real high, it's there (there is also a Postgres
> server running on there that sees some traffic too). These specific
> drives were selected for having power-fail protection for data
> in-flight -- they were among the few I've tested that passed a "pull
> the cord" test, even though they're actually the 730s, NOT the "DC"
> series.
> 
> Raidz2 configuration:
> 
> root@NewFS:/home/karl # zpool status zsr
> pool: zsr
> state: ONLINE
> scan: scrub repaired 0 in 0 days 00:07:05 with 0 errors on Mon Jun 28 03:43:58 2021
> config:
> 
> NAME            STATE     READ WRITE CKSUM
> zsr             ONLINE       0     0     0
>   raidz2-0      ONLINE       0     0     0
>     ada0p4.eli  ONLINE       0     0     0
>     ada1p4.eli  ONLINE       0     0     0
>     ada2p4.eli  ONLINE       0     0     0
>     ada3p4.eli  ONLINE       0     0     0
>     ada4p4.eli  ONLINE       0     0     0
> 
> errors: No known data errors
> 
> Micron appears to be the only people making suitable replacements if and
> when these do start to fail on me, but from what I see here it will be a
> good while yet.
> 
> --
> Karl Denninger
> karl@denninger.net
> /The Market Ticker/
> /[S/MIME encrypted email preferred]/

Running MariaDB and PostgreSQL with FreeBSD 12.2 on a couple of Samsung 250GB 960 EVO drives in a mirror. Very low usage, and the expected amount of wear:

smartctl snippet:
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    5 294 592 [2,71 TB]
Data Units Written:                 25 471 775 [13,0 TB]
Host Read Commands:                 55 763 074
Host Write Commands:                1 245 546 898
Controller Busy Time:               3 290
Power Cycles:                       81
Power On Hours:                     29 491
Unsafe Shutdowns:                   46
Media and Data Integrity Errors:    0
Error Information Log Entries:      6
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               55 Celsius
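
For scale (an NVMe "data unit" is 1000 x 512 bytes): 25 471 775 data units written is the ~13 TB shown above, and over 29 491 power-on hours - roughly 3.4 years - that averages out to a bit over 10 GB/day per drive, which matches the very low usage described above.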

zpool status:
  pool: znvme
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:14 with 0 errors on Fri Jun  4 03:03:46 2021
config:

	NAME        STATE     READ WRITE CKSUM
	znvme       ONLINE       0     0     0
	  mirror-0  ONLINE       0     0     0
	    nvd0    ONLINE       0     0     0
	    nvd1    ONLINE       0     0     0

errors: No known data errors