Geom stripe bottleneck

Frank Broniewski brfr at metrico.lu
Wed Jun 4 08:38:47 UTC 2014


Hey,

thank you very much for your detailed and very helpful answer! I think
that clears things up for me.

I've got a question concerning NCQ though:

# grep ahci /var/run/dmesg.boot
ahci0: <ATI IXP700 AHCI SATA controller> port
0xb000-0xb007,0xa000-0xa003,0x9000-0x9007,0x8000-0x8003,0x7000-0x700f
mem 0xfaffe400-0xfaffe7ff irq 22 at device 17.0 on pci0
ahci0: AHCI v1.10 with 4 3Gbps ports, Port Multiplier supported
ahcich0: <AHCI channel> at channel 0 on ahci0
ahcich1: <AHCI channel> at channel 1 on ahci0
ahcich2: <AHCI channel> at channel 2 on ahci0
ahcich3: <AHCI channel> at channel 3 on ahci0
ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
ada1 at ahcich1 bus 0 scbus1 target 0 lun 0
ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
ada3 at ahcich3 bus 0 scbus3 target 0 lun 0


and:

# camcontrol identify ada3
pass3: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
pass3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)

protocol              ATA/ATAPI-8 SATA 3.x
device model          WDC WD6000HLHX-01JJPV0
firmware revision     04.05G04
serial number         WD-WXL1E61PWAL2
WWN                   50014ee7aaab0118
cylinders             16383
heads                 16
sectors/track         63
sector size           logical 512, physical 512, offset 0
LBA supported         268435455 sectors
LBA48 supported       1172123568 sectors
PIO supported         PIO4
DMA supported         WDMA2 UDMA6
media RPM             10000

Feature                      Support  Enabled   Value           Vendor
read ahead                     yes      yes
write cache                    yes      yes
flush cache                    yes      yes
overlap                        no
Tagged Command Queuing (TCQ)   no       no
Native Command Queuing (NCQ)   yes              32 tags
SMART                          yes      yes
microcode download             yes      yes
security                       yes      no
power management               yes      yes
advanced power management      yes      yes     128/0x80
automatic acoustic management  no       no
media status notification      no       no
power-up in Standby            yes      no
write-read-verify              no       no
unload                         yes      yes
free-fall                      no       no
Data Set Management (DSM/TRIM) no
Host Protected Area (HPA)      yes      no      1172123568/1172123568
HPA - Security                 no


Is NCQ enabled now? The corresponding line in the camcontrol identify
output doesn't say so explicitly, but it doesn't deny it either ...
and the dmesg.boot output may hint that the ahci driver is loaded ... I'm
confused :-)

I do not have ahci_load="YES" in /boot/loader.conf (this is on FreeBSD
9.2-p6) and I don't know whether that's still necessary. Searching the
internet turned up mostly rather old (2010, 2011) results.
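
One place to look, offered as a sketch rather than a confirmed answer: the
dmesg.boot excerpt quoted further down contains lines of the form
"adaN: Command Queueing enabled". Assuming that message is how the ada
driver reports that queueing is in use, a small scan for it could look like
this (the function name is made up for illustration):

```python
import re

# Matches lines such as "ada2: Command Queueing enabled".
# Assumption: this message is what the ada driver prints when queueing is active.
QUEUEING_RE = re.compile(r"^(ada\d+): Command Queueing enabled$")


def disks_with_queueing(dmesg_text):
    """Return the ada device names that report command queueing, in order."""
    found = []
    for line in dmesg_text.splitlines():
        match = QUEUEING_RE.match(line.strip())
        if match and match.group(1) not in found:
            found.append(match.group(1))
    return found
```

It would be fed the contents of /var/run/dmesg.boot, for example
disks_with_queueing(open("/var/run/dmesg.boot").read()).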


On 2014-06-03 22:48, John-Mark Gurney wrote:
> Frank Broniewski wrote this message on Tue, Jun 03, 2014 at 11:56 +0200:
>> I have a stripe (RAID0) geom setup for my database's data. Currently I
>> am applying some large updates on the data and I think the performance
>> of my stripe could be better. But I am uncertain and so I thought I'd
>> request some interpretation help from the community :)
>>
>> The stripe consists of two disks (WD Velociraptor, 10,000 rpm):
>> > diskinfo -v ada2
>> ada2
>>         512             # sectorsize
>>         600127266816    # mediasize in bytes (558G)
>>         1172123568      # mediasize in sectors
>>         0               # stripesize
>>         0               # stripeoffset
>>         1162821         # Cylinders according to firmware.
>>         16              # Heads according to firmware.
>>         63              # Sectors according to firmware.
>>         WD-WXH1E61ASNX9 # Disk ident.
>>
>>
>> and /var/run/dmesg.boot
>> # snip
>> ada2 at ahcich2 bus 0 scbus2 target 0 lun 0
>> ada2: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
>> ada2: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>> ada2: Command Queueing enabled
>> ada2: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
>> ada2: Previously was known as ad8
>> ada3 at ahcich3 bus 0 scbus3 target 0 lun 0
>> ada3: <WDC WD6000HLHX-01JJPV0 04.05G04> ATA-8 SATA 3.x device
>> ada3: 300.000MB/s transfers (SATA 2.x, UDMA6, PIO 8192bytes)
>> ada3: Command Queueing enabled
>> ada3: 572325MB (1172123568 512 byte sectors: 16H 63S/T 16383C)
>> ada3: Previously was known as ad10
>> #snap
>>
>>
>> And here's some iostat -d -w 10 ada0 ada1 ada2 ada3 example output
>> #snip
>>            ada0             ada1             ada2             ada3
>>   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s
>>   0.00   0  0.00   0.00   0  0.00  19.33 176  3.32  19.33 176  3.32
>>  16.25   0  0.01  16.25   0  0.01  16.87 133  2.20  16.87 133  2.20
>>   0.00   0  0.00   0.00   0  0.00  16.77 146  2.40  16.77 147  2.40
>>   0.00   0  0.00   0.00   0  0.00  19.46 170  3.24  19.45 170  3.23
>>  21.50   0  0.01  21.50   0  0.01  17.00 125  2.08  17.00 125  2.08
>>   0.50   0  0.00   0.50   0  0.00  16.88 145  2.38  16.88 145  2.38
>>   0.00   0  0.00   0.00   0  0.00  16.96 125  2.07  16.97 125  2.07
>>   0.00   0  0.00   0.00   0  0.00  19.82 158  3.06  19.81 158  3.07
>>  28.77   1  0.03  28.77   1  0.03  16.83 133  2.19  16.82 133  2.19
>> #snap
> 
> The key here is the tps... Spinning drives have a limited number of
> tps... first you have to move the heads, which on average takes ~4ms,
> then you have to wait, on average, half a rotation, which for a 10k RPM
> drive is ~3ms, so each seek will take around 7ms. As you can see,
> your numbers range from 125 to 176 TPS, or ~6-8ms/transaction... so it
> looks like your drives are performing as they should...
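
The estimate above can be sketched as a few lines of arithmetic. The ~4ms
average seek time is the figure given in the reply, not a measured value,
and the function names are made up for illustration:

```python
def avg_rotational_latency_ms(rpm):
    """Average wait for the sector to come around: half a rotation."""
    ms_per_rotation = 60_000.0 / rpm
    return ms_per_rotation / 2.0


def estimated_tps(rpm, avg_seek_ms):
    """Rough random transactions/s with a single IO in flight."""
    service_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return 1000.0 / service_ms


def ms_per_transaction(tps):
    """Convert an observed tps figure (e.g. from iostat) to ms per transaction."""
    return 1000.0 / tps


if __name__ == "__main__":
    # Assumption: ~4 ms average seek; 10k RPM as reported by camcontrol.
    print(round(avg_rotational_latency_ms(10000), 1))  # 3.0
    print(round(estimated_tps(10000, 4.0)))            # 143
    print(round(ms_per_transaction(176), 1))           # 5.7
    print(round(ms_per_transaction(125), 1))           # 8.0
```

The iostat samples (125 to 176 tps) sit on both sides of the ~143 tps
estimate. Since gstat shows a queue length of 2 on the disks, observed tps
may exceed what a single outstanding IO would allow.
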
> 
>> I think the MB/s output is rather low for such a disk. To gain further
>> insight I started gstat:
>> dT: 1.001s  w: 1.000s
>>  L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
>>     0     27      0      0    0.0     27   2226    4.8    7.0| ada0
>>     0     28      1     32   23.9     27   2226    1.3    3.9| ada1
>>     2    120    115   1838    6.4      5     96    0.2   74.3| ada2
>>     2    121    116   1854    6.3      5     96    0.4   72.9| ada3
>>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gm
>>     2    121    116   3708    7.9      5    192    0.4   92.2| stripe/gs
>>     0     28      1     32   24.0     27   2226    5.0    8.7| mirror/gms1
>>     0     12      0      0    0.0     12   1343    9.1    6.9| mirror/gms1a
>>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1b
>>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1d
>>     0      0      0      0    0.0      0      0    0.0    0.0| mirror/gms1e
>>     0     16      1     32   24.0     15    883    1.7    2.9| mirror/gms1f
>>
>>
>> What bothers me here is that stripe/gs is 92% busy while the disks
>> themselves are only at 74/72%. This led me to post here and seek
>> some advice, since I don't know enough about the mechanics and so I
>> can't really find the problem, if there is one at all.
> 
> This is because the stripe has to wait for both drives to return data
> before passing the data up... If you're just running a single-threaded
> benchmark, there aren't multiple IOs in flight, and therefore the
> remaining time is spent in your application before it sends another
> request down to the stripe...  The difference between the stripe and the
> drives comes from the fact that each of them is sometimes faster than
> the other, so again, it won't have work to do until another IO is
> submitted...
> 
> Try sending more IO at it, like running 4 or more dd reads, so that
> during the latency of one IO there is other IO to serve...
> 
> Also, make sure that you're using NCQ, where the OS can submit multiple
> IOs to the drives at once; this should improve things, but won't
> change the results you see above, as it requires multiple IOs
> outstanding...
> 
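
The "send more IO at it" suggestion can be sketched as a handful of
concurrent readers. This is a stand-in for several parallel dd processes,
not a replacement for them: it assumes a large regular file on the stripe's
file system (the path is up to the caller), the default of 4 workers mirrors
the number suggested above, the 64k block size is arbitrary, and reads served
from the buffer cache will not reach the disks at all:

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length, block_size):
    """Read `length` bytes starting at `offset`; return the bytes actually read."""
    done = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while done < length:
            chunk = os.pread(fd, min(block_size, length - done), offset + done)
            if not chunk:  # reached end of file
                break
            done += len(chunk)
    finally:
        os.close(fd)
    return done


def parallel_read(path, workers=4, block_size=64 * 1024):
    """Split the file into `workers` ranges and read them concurrently.

    Returns (total_bytes_read, elapsed_seconds).
    """
    size = os.path.getsize(path)
    per_worker = -(-size // workers)  # ceiling division
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(
                read_range,
                path,
                i * per_worker,
                max(0, min(per_worker, size - i * per_worker)),
                block_size,
            )
            for i in range(workers)
        ]
        total = sum(f.result() for f in futures)
    return total, time.monotonic() - start
```

Running it with workers=1 and then workers=4 while watching gstat should
show whether the queue length and %busy on ada2/ada3 move closer to the
figures for the stripe.
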


-- 
Frank BRONIEWSKI

METRICO s.à r.l.
géomètres
technologies d'information géographique
rue des Romains 36
L-5433 NIEDERDONVEN

tél.: +352 26 74 94 - 28
fax.: +352 26 74 94 99
http://www.metrico.lu


More information about the freebsd-geom mailing list