[EXTERNAL] Re: FreeBSD 11.1 Beta 2 ZFS performance degradation on SSDs

Warner Losh imp at bsdimp.com
Wed Jun 21 15:11:05 UTC 2017


On Wed, Jun 21, 2017 at 6:31 AM, Caza, Aaron <Aaron.Caza at ca.weatherford.com>
wrote:

> >
> > From: Steven Hartland [mailto:killing at multiplay.co.uk]
> > Sent: Wednesday, June 21, 2017 2:01 AM
> > To: Caza, Aaron; freebsd-fs at freebsd.org
> > Subject: [EXTERNAL] Re: FreeBSD 11.1 Beta 2 ZFS performance degradation
> on SSDs
> >
> > On 20/06/2017 21:26, Caza, Aaron wrote:
>
>  On 20/06/2017 17:58, Caza, Aaron wrote:
>
> dT: 1.001s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>     0   4318   4318  34865    0.0      0      0    0.0      0      0    0.0   14.2| ada0
>     0   4402   4402  35213    0.0      0      0    0.0      0      0    0.0   14.4| ada1
>
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>     1   4249   4249  34136    0.0      0      0    0.0      0      0    0.0   14.1| ada0
>     0   4393   4393  35287    0.0      0      0    0.0      0      0    0.0   14.5| ada1
>
> Your %busy is very low, so it sounds like the bottleneck isn't in raw disk
> performance but somewhere else.
>
>
>
> Might be interesting to see if anything stands out in top -Sz and then
> press h for threads.
>
>
>
>
>
> I rebooted the system to disable Trim, so it's currently not degraded.
>
>
>
> dT: 1.001s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>     3   3887   3887 426514    0.7      0      0    0.0      0      0    0.0   90.7| ada0
>     3   3987   3987 434702    0.7      0      0    0.0      0      0    0.0   92.0| ada1
>
> dT: 1.002s  w: 1.000s
>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w    d/s   kBps   ms/d   %busy Name
>     3   3958   3958 433563    0.7      0      0    0.0      0      0    0.0   91.6| ada0
>     3   3989   3989 438417    0.7      0      0    0.0      0      0    0.0   93.0| ada1
>
>
>
> test@f111beta2:~ # dd if=/testdb/test of=/dev/null bs=1m
>
> 16000+0 records in
>
> 16000+0 records out
>
> 16777216000 bytes transferred in 19.385855 secs (865435959 bytes/sec)
> > Now that is interesting, as you're getting a smaller number of ops/s but
> > much higher throughput.
> >
> > In the normal case you're seeing ~108kB per read, whereas in the degraded
> > case you're seeing 8kB per read.
> >
> > Given this, and knowing the application level isn't affecting it, we need
> > to identify where in the IO stack the reads are getting limited to 8kB.
> >
> > With your additional information about ARC, it could be that the limited
> > memory is the cause.
> >
> >    Regards
> >    Steve
>
> I’m glad to learn that the above info is of some use.  The 50M limit I
> previously used for the ZFS ARC served me well for the past several years.
> And, in fact, I thought it was still working well and only accidentally
> stumbled over the performance drop when testing some Intel 540 SSDs, which
> were working surprisingly snappily despite using TLC NAND flash.
> Initially, I saw a simple SQL query in Postgres go from ~35 seconds to ~635
> seconds and suspected the Intel 540s were the cause.  Turns out it was me
> hamstringing the ARC that was to blame.  That said, it’s interesting that,
> using the GEOM ELI layer for 4k sector emulation, it runs fine for several
> hours before performance drops off when the ARC is set too small, going
> from 850MB/s down to 80MB/s.  In the case of ashift=12, the performance
> impact is immediate on bootup, going from 850MB/s with default ARC settings
> down to 450MB/s with ARC set to 50M, then, some hours later, dropping down
> to ~70MB/s.
>

Yes. TLC drives typically have some amount, maybe 3%, of their NAND
configured as SLC. For a 1TB drive, this would be 30GB. This SLC cache
is super fast compared to the TLC parts. What's typically done is that the
writes land in the SLC part and are then moved to the TLC parts as device
bandwidth allows. If you overrun this landing pad, you are stuck with TLC
write performance, which is going to be about 1/10th that of SLC.
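
A rough way to see that landing-pad effect directly (a sketch only; the
/testdb path and the ~60GB size are assumptions based on the ~30GB estimate
above, and it assumes compression is off on that dataset so the zeroes
actually reach the NAND):

# sustained sequential write, roughly twice the estimated SLC cache size
dd if=/dev/zero of=/testdb/slc-fill bs=1m count=60000
# in another terminal, watch per-disk write throughput with gstat (as you
# did above); it should start near the SLC rate and then step down once
# the landing pad is exhausted
gstat
# clean up afterwards
rm /testdb/slc-fill

If the on-disk write rate starts high and later settles at roughly a tenth
of that, it's consistent with the SLC cache filling up rather than with
anything at the ZFS layer.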


> With regard to Trim, there were a number of suggestions to disable it.
> My understanding is that Trim support is highly desirable to maintain peak
> performance, but it seems it’s the #1 suspect when there’s a performance
> drop.  Is it that problematic?  I’m considering switching from GEOM ELI to
> ashift=12 for my 4k sector emulation in order to get Trim, but if it’s of
> no benefit then there’s not much point.
>

TRIM is highly desired to keep the write amplification on the drives down,
which is often more of a longevity thing than a performance thing (though
if you are write constrained, it may help a little). There are some drives
that don't handle TRIM quickly enough, which is why it's often
implicated in performance issues. TRIM generally helps the drive make
better decisions and copy less data around (thus reducing the number of
non-user-initiated writes, which is what drives write amplification),
though for some workloads it doesn't help much (especially ones where any
files deleted are immediately replaced). The net benefit is that your
drives don't wear out as quickly.
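
For what it's worth, on 11.x most of this is visible from sysctl before you
decide, so a sketch (the pool and device names below are placeholders, and
vfs.zfs.min_auto_ashift only affects newly created vdevs):

# whether ZFS TRIM is enabled (boot-time tunable; set
# vfs.zfs.trim.enabled="0" in /boot/loader.conf to disable it)
sysctl vfs.zfs.trim.enabled
# counters for the TRIMs ZFS has issued; growing "failed" or "unsupported"
# numbers suggest the drives aren't handling them well
sysctl kstat.zfs.misc.zio_trim
# if you drop the GELI 4k emulation, this gets a new pool created with
# ashift=12, so you keep 4k alignment and TRIM without the extra layer
sysctl vfs.zfs.min_auto_ashift=12
zpool create tank mirror ada0 ada1

That way you can confirm whether TRIMs are actually being issued (and
failing) before attributing the slowdown to them.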

Warner

