Geom stripe bottleneck

John-Mark Gurney jmg at funkthat.com
Wed Jun 4 16:30:44 UTC 2014


Frank Broniewski wrote this message on Wed, Jun 04, 2014 at 10:38 +0200:
> Thank you very much for your detailed and very helpful answer! I think
> that clears things up for me.

You're welcome...

> I've got a question concerning NCQ though:
> 
> # grep ahci /var/run/dmesg.boot
> ahci0: <ATI IXP700 AHCI SATA controller> port
> 0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f
> mem 0xfaffe400-0xfaffe7ff irq 22 at device 17.0 on pci0
> ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
> ahcich0: <AHCI channel> at channel 0 on ahci0
> ahcich1: <AHCI channel> at channel 1 on ahci0
> ahcich2: <AHCI channel> at channel 2 on ahci0
> ahcich3: <AHCI channel> at channel 3 on ahci0
> ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
> ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0

Try running grep ada0 on /var/run/dmesg.boot; mine shows:
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EFRX-68AX9N0 80.00A80> ATA-9 SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
ada0: 2861588MB (5860533168 512 byte sectors: 16H 63S/T 16383C)
ada0: Previously was known as ad0

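You should probably see something similar. The "Command Queueing enabled"
line is the one to look for, and the dmesg output you posted earlier
already shows it for ada2 and ada3.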
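
The same check can be scripted. This is only a minimal sketch in Python,
assuming the driver prints a line of the form "<disk>: Command Queueing
enabled" as in the output above; the function name ncq_enabled and the
sample text are made up for illustration.

```python
import re

def ncq_enabled(dmesg_text, disk):
    """Return True if dmesg_text contains '<disk>: Command Queueing enabled'."""
    pattern = re.compile(
        r"^%s: Command Queueing enabled\s*$" % re.escape(disk),
        re.MULTILINE,
    )
    return bool(pattern.search(dmesg_text))

# Sample lines copied from the ada0 output quoted above.
sample = """\
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada0: <WDC WD30EFRX-68AX9N0 80.00A80> ATA-9 SATA 3.x device
ada0: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
ada0: Command Queueing enabled
"""
```

On a live system the text would come from /var/run/dmesg.boot, e.g.
ncq_enabled(open("/var/run/dmesg.boot").read(), "ada3").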

> and:
> 
> # camcontrol identify ada3
> pass3: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
> pass3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> 
> protocol              ATA/ATAPI-8 SATA 3.x
> device model          WDC WD6000HLHX-01JJPV0
> firmware revision     04.05G04
> serial number         WD-WXL1E61PWAL2
> WWN                   50014ee7aaab0118
> cylinders             16383
> heads                 16
> sectors/track         63
> sector size           logical 512, physical 512, offset 0
> LBA supported         268435455 sectors
> LBA48 supported       1172123568 sectors
> PIO supported         PIO4
> DMA supported         WDMA2 UDMA6
> media RPM             10000
> 
> Feature                      Support  Enabled   Value           Vendor
> read ahead                     yes      yes
> write cache                    yes      yes
> flush cache                    yes      yes
> overlap                        no
> Tagged Command Queuing (TCQ)   no       no
> Native Command Queuing (NCQ)   yes              32 tags
> SMART                          yes      yes
> microcode download             yes      yes
> security                       yes      no
> power management               yes      yes
> advanced power management      yes      yes     128/0x80
> automatic acoustic management  no       no
> media status notification      no       no
> power-up in Standby            yes      no
> write-read-verify              no       no
> unload                         yes      yes
> free-fall                      no       no
> Data Set Management (DSM/TRIM) no
> Host Protected Area (HPA)      yes      no      1172123568/1172123568
> HPA - Security                 no
> 
> 
> Is NCQ now enabled? The corresponding line in the camcontrol identify
> output doesn't tell me that explicitly, but also doesn't deny it ...
> The dmesg.boot may hint that the ahci module is loaded, though ... I'm
> confused :-)
> 
> I do not have an ahci_load=YES in /boot/loader.conf (this is on FreeBSD
> 9.2-p6) and I don't know if that's still necessary or not. Searching the
> internet turned up mostly rather old (2010,2011) results.
> 
> 
> Am 2014-06-03 22:48, schrieb John-Mark Gurney:
> > Frank Broniewski wrote this message on Tue, Jun 03, 2014 at 11:56 +0200:
> >> I have a stripe (RAID0) geom setup for my database's data. Currently I
> >> am applying some large updates on the data and I think the performance
> >> of my stripe could be better. But I am uncertain and so I thought I'd
> >> request some interpretation help from the community :)
> >>
> >> The stripe consists of two disks (WD Velociraptor with 10,000 rpm):
> >>> diskinfo -v ada2
> >> ada2
> >>         512             # sectorsize
> >>         600127266816    # mediasize in bytes (558G)
> >>         1172123568      # mediasize in sectors
> >>         0               # stripesize
> >>         0               # stripeoffset
> >>         1162821         # Cylinders according to firmware.
> >>
> >>         16              # Heads according to firmware.
> >>
> >>         63              # Sectors according to firmware.
> >>
> >>         WD-WXH1E61ASNX9 # Disk ident.
> >>
> >>
> >> and /var/run/dmesg.boot
> >> # snip
> >> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
> >> ada2: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
> >> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> >> ada2: Command Queueing enabled
> >> ada2: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> >> ada2: Previously was known as ad8
> >> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
> >> ada3: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
> >> ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
> >> ada3: Command Queueing enabled
> >> ada3: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
> >> ada3: Previously was known as ad10
> >> #snap
> >>
> >>
> >> And here's some iostat -d -w 10 ada0 ada1 ada2 ada3 example output
> >> #snip
> >>            ada0             ada1             ada2             ada3
> >>   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s
> >>   0.00   0  0.00   0.00   0  0.00  19.33 176  3.32  19.33 176  3.32
> >>  16.25   0  0.01  16.25   0  0.01  16.87 133  2.20  16.87 133  2.20
> >>   0.00   0  0.00   0.00   0  0.00  16.77 146  2.40  16.77 147  2.40
> >>   0.00   0  0.00   0.00   0  0.00  19.46 170  3.24  19.45 170  3.23
> >>  21.50   0  0.01  21.50   0  0.01  17.00 125  2.08  17.00 125  2.08
> >>   0.50   0  0.00   0.50   0  0.00  16.88 145  2.38  16.88 145  2.38
> >>   0.00   0  0.00   0.00   0  0.00  16.96 125  2.07  16.97 125  2.07
> >>   0.00   0  0.00   0.00   0  0.00  19.82 158  3.06  19.81 158  3.07
> >>  28.77   1  0.03  28.77   1  0.03  16.83 133  2.19  16.82 133  2.19
> >> #snap
> > 
> > The key here is the tps. Spinning drives have a limited number of
> > tps: first you have to move the heads, which on average takes ~4ms,
> > then you have to wait, on average, half a rotation, which for a 10k RPM
> > drive is ~3ms, so each seek will take around 7ms. As you can see, your
> > numbers run from 125 TPS (8ms/transaction) to 176 TPS
> > (~5.7ms/transaction), so it looks like your drives are performing as
> > they should...
> > 
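The arithmetic above can be written out as a small Python sketch. The ~4ms
average seek is the rough figure used in the paragraph, not a measured
value, and the function names are made up for illustration.

```python
def avg_service_ms(seek_ms, rpm):
    """Average seek time plus the wait for half a rotation, in milliseconds."""
    rotation_ms = 60_000.0 / rpm      # one full rotation
    return seek_ms + rotation_ms / 2.0

def ms_per_transaction(tps):
    """Convert an observed transactions-per-second figure to ms per transaction."""
    return 1000.0 / tps

# Assumed ~4 ms average seek on a 10k RPM drive.
estimate_ms = avg_service_ms(seek_ms=4.0, rpm=10_000)
estimate_tps = 1000.0 / estimate_ms

# Lowest and highest tps values from the iostat output quoted above.
observed_ms = {tps: ms_per_transaction(tps) for tps in (125, 176)}
```

Comparing estimate_tps with the iostat tps column shows whether the drives
are in the range a seek-bound workload would be expected to reach.
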
> >> I think the MB/s output is rather low for such a disk. To gain further
> >> insight I started gstat:
> >> dT: 1.001s  w: 1.000s
> >>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
> >>     0     27      0      0    0.0     27   2226    4.8    7.0| ada0
> >>     0     28      1     32   23.9     27   2226    1.3    3.9| ada1
> >>     2    120    115   1838    6.4      5     96    0.2   74.3| ada2
> >>     2    121    116   1854    6.3      5     96    0.4   72.9| ada3
> >>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gm
> >>     2    121    116   3708    7.9      5    192    0.4   92.2| stripe/gs
> >>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gms1
> >>     0     12      0      0    0.0     12   1343    9.1    6.9| mirror/gms1a
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1b
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1d
> >>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1e
> >>     0     16      1     32   24.0     15    883    1.7    2.9| mirror/gms1f
> >>
> >>
> >> What bothers me here is that the stripe/gs is 92% busy while the disks
> >> themselves are only at 74/72%. This led me to post here and seek
> >> some advice, since I don't know enough about the mechanics and so I
> >> can't really tell whether there is a problem at all.
> > 
> > This is because the stripe has to wait for both drives to return data
> > before passing the data up. If you're just running a single-threaded
> > benchmark, there aren't multiple IOs in flight, and therefore the
> > remaining time is spent in your application before it sends another
> > request down to the stripe. The difference between the stripe and the
> > drives comes from the fact that each drive is sometimes faster than the
> > other, so the faster one won't have work to do until another IO is
> > submitted.
> > 
> > Try sending more IO at it, like running 4 or more dd reads, so that
> > during the latency of one IO there is other IO to service.
> > 
> > Also, make sure that you're using NCQ, where the OS can submit multiple
> > IOs to the drives at once. This should improve things, but won't
> > change the results you see above, as it requires multiple IOs
> > outstanding.
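
As a rough stand-in for "4 or more dd reads", here is a minimal Python
sketch that keeps several reads outstanding at once by reading different
ranges of the same file or device from separate threads. The function names,
the 1 MiB block size and the default of 4 workers are assumptions chosen for
illustration, not something from the thread.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_range(path, offset, length, block_size=1 << 20):
    """Read `length` bytes starting at `offset`; return the byte count read."""
    total = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while total < length:
            want = min(block_size, length - total)
            chunk = os.pread(fd, want, offset + total)
            if not chunk:             # end of file or device
                break
            total += len(chunk)
    finally:
        os.close(fd)
    return total

def parallel_read(path, size, workers=4):
    """Split [0, size) into `workers` ranges and read them concurrently."""
    span = size // workers
    ranges = [
        (i * span, span if i < workers - 1 else size - i * span)
        for i in range(workers)
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda r: read_range(path, *r), ranges))
```

When pointed at a raw device rather than a file, size should be chosen so
that every range starts and ends on a sector boundary (for example a
multiple of workers * 512 bytes). Watching gstat while it runs shows
whether L(q) and %busy on the member disks go up.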

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."


More information about the freebsd-geom mailing list