Re: TRIM question and zfs

From: Warner Losh <imp@bsdimp.com>
Date: Wed, 18 Dec 2024 21:49:17 UTC
On Wed, Dec 18, 2024 at 1:26 PM mike tancsa <mike@sentex.net> wrote:

> TL;DR  does zpool trim <poolname> actually work as well as one expects /
> needs ?
>

I'd expect it to trim the unused part of the drive(s), but that may or
may not help in high-wear situations.
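
If you want to confirm that the trim actually ran, zpool can show the
per-vdev TRIM state (tank1 here is just the pool name from your mail):

  zpool trim tank1
  zpool status -t tank1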

> I had a very old server that had been running RELENG_12 for many years on
> some SSDs which were now getting to EOL with 6 yrs of work on them -- wear
> level showed it getting low for sure.  I had migrated everything live
> off the box, but for some reason, trying to do a zfs send on a volume
> was REALLY slow. I am talking KB/s slow.  It took a long time, but it
> eventually got done.  As there was nothing on this server in production,
> I thought it a good exercise to try and upgrade it in the field. So
> buildworld to 13 and then 14.  I deleted some of the old unneeded files
> and got down to just the zfs volume that was left on the pool, so just
> under 200G. I then did a zpool trim tank1, but didn't see any improved
> performance at all. Still crazy slow. So I then did
>
> gpart backup <disk> > /tmp/disk-part.txt
>
> zpool offline tank1 <disk>p1
> trim -f /dev/<disk>
> cat /tmp/disk-part.txt | gpart restore <disk>
> zpool online tank1 <disk>p1
> zpool replace tank1 <disk>p1 <disk>p1
>
> for all 3 <disk>s in the pool one by one.
>
> The first resilver took 13hrs, the second 8 or so and the last 13min.
> After the final resilver was done, I could do a zfs send of the volume
> pretty well at full speed with zpool iostat 1 showing close to a GB/s
> reads.
>
> I know that zfs autotrim and trim just kinda keep track of what can and
> can't be deleted.  But I would have thought the zpool trim would have
> had some impact ?
>

It all depends on the drive's history. Mostly, TRIMming a drive is useful
for reducing write amplification. Done frequently, it gives the drive's
firmware more options when it does the housekeeping needed to keep
blocks available to write. The increased choice lets it make better
decisions and reduces the extra writes it has to do to keep the data
fresh and provide free blocks for future writes.

Sometimes you get lucky and the trimming kicks off the right sort of
housekeeping, adding lots of free blocks to the pool the drive keeps
internally, so writes get faster.
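
If you want that to happen continuously rather than on demand, the pool
has an autotrim property (again using tank1 as the example pool):

  zpool set autotrim=on tank1
  zpool get autotrim tank1

Autotrim issues trims as space is freed; an occasional manual zpool trim
is still useful to catch the small ranges it skips.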

But you are seeing really poor read performance. That's usually caused
by data nearing the end of its useful life in the blocks it currently
occupies, which triggers transparent data recovery. At a high enough
error rate, that recovery can no longer run at the "line speed" of the
device; it happens at firmware speeds, which can be quite a bit slower.
By basically wiping and rewriting the drives with the resilvers, you've
refreshed all the data, so none of it now takes a long time to read. I'm
not sure how much 'old data' was on the drives, but that would also
explain the faster resilver times.


> Questions:
>
> Does this mean that prior to deploying SSDs for use in a zfs pool, you
> should do a full trim -f of the disk ?
>

Yes.
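
Something like this, per disk, before you partition it (ada0 is just a
placeholder, and note that it erases everything on the device):

  trim -f /dev/ada0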

> Apart from offlining and doing a trim, resilver, etc., is there a better
> way to get back performance ?  Or with a once-a-week trim prior to
> scrub, will it be "good enough" ?
>

Weekly should suffice.
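
A root crontab entry along these lines would do it (pool name and timing
are just examples):

  # TRIM Sunday morning, ahead of the weekly scrub
  0 3 * * 0  /sbin/zpool trim tank1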

However, if the problem is due to 'old data' decaying and the drive's
reliability firmware not moving it aggressively enough to preserve
performance, all the trims in the world won't help. It could also be
that the drives are too busy (though the aggregate numbers from SMART
don't suggest that).

The current temperature is good, but if the drives baked for a while
for some reason, that could explain the degraded performance.


> Is there a way to tell if a disk REALLY needs to be fully trimmed, other
> than inferring it from slowing performance ?
>

You might be able to compare the current wear against the promised
lifetime of the drive. 6 years is out of warranty for sure, so it may
just be that they are too worn for anything needing any real level of
performance.
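
The relevant numbers are in the smartctl output you posted, e.g.

  smartctl -x /dev/ada0

and then attribute 230 (Media_Wearout_Indicator), 233/234 (NAND GB
written), and the Percentage Used Endurance Indicator under Device
Statistics. (ada0 is just a placeholder for whichever device it is.)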

But usually the only tell is performance. And even then, there's no
silver bullet.


> I know these disks were super old, so maybe current SSDs don't have this
> issue ? Last few years I have switched to Samsung EVOs and they don't
> seem to have these problems, at least not yet in any obvious way.  Not
> sure why this particularly showed up in the zfs volume set, and other
> normal datasets performed ok.
>

Yeah, from the SMART info it looks like you've worn them out by about a
third. You've written about 112TB to the drive based on ~80TB of host
write traffic (if I'm doing the math right). That is a fairly good
number. There aren't any real link errors to speak of (which is another
way you can be slow), nor have you been thermal throttling.
So you've done about 80 drive writes over 6 years (roughly 0.04 DWPD).
This is well below the datasheet rating for most TLC drives of around
0.37 DWPD (though that rating is only for 3 years). The drive should be
good for about 400 drive writes total, and you are at about 1/5 of that.
But the wear indicators are closer to 1/3, approximately 2x what the raw
write totals would suggest. The write amplification is relatively low,
though, at about 1.4, which suggests trimming wouldn't help all that
much; it would bump total bytes written into the 1/3-of-lifetime range.
Plus TLC writes tend to be quite a bit harder on the drive than SLC
writes, so the 1/3 wear numbers kinda make sense.
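
Back of the envelope, from the SMART data below and the ~1.4 write amp
mentioned above:

  ~80 TB host writes (attr 241) / 1 TB capacity   ~= 80 drive writes
  80 drive writes / (6 yr x 365 days)             ~= 0.04 DWPD
  ~80 TB host writes x ~1.4 write amplification   ~= 112 TB to the NAND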

None of these raw numbers suggests a good root cause for the slowness,
which is in line with 'bad data from the NAND taking a while to
recover'. It all has to do with the drive's write / power-on /
temperature / etc. history, and many key details of that are simply
unobtainable, though some hints at them are in the SMART data. Not
enough for me to say for sure why your drives degraded, though.

Warner


>      ---Mike
>
>
> disk
>
> smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.2-STABLE amd64] (local build)
> Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
>
> === START OF INFORMATION SECTION ===
> Model Family:     WD Blue / Red / Green SSDs
> Device Model:     WDC  WDS100T2B0A-00SM50
> Serial Number:    191011A00A72
> LU WWN Device Id: 5 001b44 8b89825ed
> Firmware Version: 401000WD
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    Solid State Device
> Form Factor:      2.5 inches
> TRIM Command:     Available, deterministic, zeroed
> Device is:        In smartctl database 7.3/5528
> ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
> SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Wed Dec 18 15:23:10 2024 EST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> AAM feature is:   Unavailable
> APM level is:     128 (minimum power consumption without standby)
> Rd look-ahead is: Enabled
> Write cache is:   Enabled
> DSN feature is:   Unavailable
> ATA Security is:  Disabled, NOT FROZEN [SEC1]
> Wt Cache Reorder: Unavailable
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x00) Offline data collection activity
>                                          was never started.
>                                          Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                          without error or no self-test has ever
>                                          been run.
> Total time to complete Offline
> data collection:                (    0) seconds.
> Offline data collection
> capabilities:                    (0x11) SMART execute Offline immediate.
>                                          No Auto Offline data collection support.
>                                          Suspend Offline collection upon new
>                                          command.
>                                          No Offline surface scan supported.
>                                          Self-test supported.
>                                          No Conveyance Self-test supported.
>                                          No Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                          power-saving mode.
>                                          Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                          General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        (  10) minutes.
>
> SMART Attributes Data Structure revision number: 4
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
>    5 Reallocated_Sector_Ct   -O--CK   100   100   ---    -    0
>    9 Power_On_Hours          -O--CK   100   100   ---    -    47271
>   12 Power_Cycle_Count       -O--CK   100   100   ---    -    33
> 165 Block_Erase_Count       -O--CK   100   100   ---    - 906509291245
> 166 Minimum_PE_Cycles_TLC   -O--CK   100   100   ---    -    1
> 167 Max_Bad_Blocks_per_Die  -O--CK   100   100   ---    -    34
> 168 Maximum_PE_Cycles_TLC   -O--CK   100   100   ---    -    33
> 169 Total_Bad_Blocks        -O--CK   100   100   ---    -    534
> 170 Grown_Bad_Blocks        -O--CK   100   100   ---    -    0
> 171 Program_Fail_Count      -O--CK   100   100   ---    -    0
> 172 Erase_Fail_Count        -O--CK   100   100   ---    -    0
> 173 Average_PE_Cycles_TLC   -O--CK   100   100   ---    -    12
> 174 Unexpected_Power_Loss   -O--CK   100   100   ---    -    19
> 184 End-to-End_Error        -O--CK   100   100   ---    -    0
> 187 Reported_Uncorrect      -O--CK   100   100   ---    -    0
> 188 Command_Timeout         -O--CK   100   100   ---    -    0
> 194 Temperature_Celsius     -O---K   075   044   ---    -    25 (Min/Max 22/44)
> 199 UDMA_CRC_Error_Count    -O--CK   100   100   ---    -    0
> 230 Media_Wearout_Indicator -O--CK   007   007   ---    - 0x074001140740
> 232 Available_Reservd_Space PO--CK   100   100   004    -    100
> 233 NAND_GB_Written_TLC     -O--CK   100   100   ---    -    12346
> 234 NAND_GB_Written_SLC     -O--CK   100   100   ---    -    90919
> 241 Host_Writes_GiB         ----CK   253   253   ---    -    80762
> 242 Host_Reads_GiB          ----CK   253   253   ---    -    19908
> 244 Temp_Throttle_Status    -O--CK   000   100   ---    -    0
>                              ||||||_ K auto-keep
>                              |||||__ C event count
>                              ||||___ R error rate
>                              |||____ S speed/performance
>                              ||_____ O updated online
>                              |______ P prefailure warning
>
> General Purpose Log Directory Version 1
> SMART           Log Directory Version 1 [multi-sector log support]
> Address    Access  R/W   Size  Description
> 0x00       GPL,SL  R/O      1  Log Directory
> 0x01           SL  R/O      1  Summary SMART error log
> 0x02           SL  R/O      2  Comprehensive SMART error log
> 0x03       GPL     R/O      1  Ext. Comprehensive SMART error log
> 0x04       GPL,SL  R/O      8  Device Statistics log
> 0x06           SL  R/O      1  SMART self-test log
> 0x07       GPL     R/O      1  Extended self-test log
> 0x10       GPL     R/O      1  NCQ Command Error log
> 0x11       GPL     R/O      1  SATA Phy Event Counters log
> 0x24       GPL     R/O   2261  Current Device Internal Status Data log
> 0x25       GPL     R/O   2261  Saved Device Internal Status Data log
> 0x30       GPL,SL  R/O      9  IDENTIFY DEVICE data log
> 0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
> 0xde       GPL     VS       8  Device vendor specific log
>
> SMART Extended Comprehensive Error Log Version: 1 (1 sectors)
> No Errors Logged
>
> SMART Extended Self-test Log Version: 1 (1 sectors)
> No self-tests have been logged.  [To run self-tests, use: smartctl -t]
>
> Selective Self-tests/Logging not supported
>
> SCT Commands not supported
>
> Device Statistics (GP Log 0x04)
> Page  Offset Size        Value Flags Description
> 0x01  =====  =               =  ===  == General Statistics (rev 1) ==
> 0x01  0x008  4              33  ---  Lifetime Power-On Resets
> 0x01  0x010  4           47271  ---  Power-on Hours
> 0x01  0x018  6    169371253578  ---  Logical Sectors Written
> 0x01  0x020  6      2639812949  ---  Number of Write Commands
> 0x01  0x028  6     41752136282  ---  Logical Sectors Read
> 0x01  0x030  6        89429189  ---  Number of Read Commands
> 0x07  =====  =               =  ===  == Solid State Device Statistics (rev 1) ==
> 0x07  0x008  1               1  N--  Percentage Used Endurance Indicator
>                                  |||_ C monitored condition met
>                                  ||__ D supports DSN
>                                  |___ N normalized value
>
> Pending Defects log (GP Log 0x0c) not supported
>
> SATA Phy Event Counters (GP Log 0x11)
> ID      Size     Value  Description
> 0x0001  4            0  Command failed due to ICRC error
> 0x0002  4            0  R_ERR response for data FIS
> 0x0005  4            0  R_ERR response for non-data FIS
> 0x000a  4            7  Device-to-host register FISes sent due to a COMRESET
>
>