Improving geom_mirror(4)'s read balancing

Tue Apr 28 09:20:49 UTC 2009

Hi,

We have a few production systems using geom_mirror. The functionality is
rock solid; however, I have noticed that the read performance of the
array, especially for sequential reads, is often worse than that of a
single member of the array under the same conditions, which made me
curious as to what's going on there.

After a bit of research and experimenting with different settings, I
came to the conclusion that the problem arises because all read
balancing algorithms implemented in geom_mirror ignore an important
property of modern hard drives: even when asked to read a single
sector, the drive usually reads the whole track and stores it in its
internal buffer. Therefore, sending requests for sectors N and N+1 to
different drives (round-robin), or splitting one big request and
sending the parts to two separate disks in parallel (split), *degrades*
combined performance compared to reading from a single drive instead of
improving it. The observed decline is apparently due to additional
latency: the disks need different amounts of time to position
themselves on the track in question, which increases the combined
latency on average. Sustained linear transfer speed is limited by the
platter-to-buffer speed, not the buffer-to-interface speed, so
combining two or more streams gains nothing. Moreover, such "balancing"
causes both disks to seek, potentially distracting one of them from
serving other requests in the meantime and reducing the RAID's
potential for handling concurrent requests.
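
To put a rough number on the latency argument for the split case, here
is a toy model in C. It is only a sketch under an assumption that is
not measured anywhere in this message: each disk's positioning time is
independent and uniformly distributed between 0 and t_max. The function
name and parameters are made up for illustration.

```c
/*
 * Toy model (an assumption, not a measurement): each disk's positioning
 * time is independent and uniform on [0, t_max] ms. A request split
 * across ndisks completes only when the slowest disk has positioned,
 * and the expected maximum of n such values is t_max * n / (n + 1).
 */
static double
expected_positioning_ms(double t_max, int ndisks)
{
	return (t_max * ndisks / (ndisks + 1));
}
```

With t_max = 12 ms this gives 6 ms for one disk and 8 ms for a request
split across two, so under this model the split request waits longer on
average while the platter-to-buffer rate stays the same.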

As a result I have produced a small patch, which caches the offset of
the last served request in the disk parameters and sends subsequent
requests that fall within a certain area around that offset to the same
disk. In addition, it implements another small optimization: it looks
at the number of outstanding requests and uses only the least busy
disks for round-robin. This should smooth out any inequality of load
distribution caused by the proximity algorithm and also help in cases
where disks require different amounts of time to complete their read or
write requests. Most of the improvement comes from the first part of
the patch, though.
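
The selection logic described above can be sketched in standalone C
roughly as follows. This is an illustration of the idea, not the code
from the patch: struct mdisk, pick_disk() and rr_cursor are
hypothetical names, the real geom_mirror structures look different, and
the 1MB window is just one of the HDD_CACHE_SIZE values mentioned in
this message.

```c
#include <stdint.h>
#include <stdlib.h>

#define HDD_CACHE_SIZE (1024 * 1024)	/* proximity window, bytes (assumed) */

/* Hypothetical per-disk state; not the real geom_mirror disk structure. */
struct mdisk {
	int64_t	last_offset;	/* end offset of the last read sent here */
	int	inflight;	/* requests currently outstanding */
};

/*
 * Pick a disk for a read of `length` bytes at `offset`.
 * *rr_cursor is the round-robin position, kept by the caller.
 * Returns the index of the chosen disk and updates its state.
 */
static int
pick_disk(struct mdisk *disks, int ndisks, int64_t offset, int64_t length,
    int *rr_cursor)
{
	int i, idx, chosen = -1, min_inflight;

	/* 1. Proximity: reuse a disk that recently read near this offset. */
	for (i = 0; i < ndisks; i++) {
		if (llabs(offset - disks[i].last_offset) < HDD_CACHE_SIZE) {
			chosen = i;
			break;
		}
	}

	if (chosen < 0) {
		/* 2. Otherwise round-robin among the least busy disks only. */
		min_inflight = disks[0].inflight;
		for (i = 1; i < ndisks; i++)
			if (disks[i].inflight < min_inflight)
				min_inflight = disks[i].inflight;
		for (i = 0; i < ndisks; i++) {
			idx = (*rr_cursor + i) % ndisks;
			if (disks[idx].inflight == min_inflight) {
				chosen = idx;
				break;
			}
		}
		*rr_cursor = (chosen + 1) % ndisks;
	}

	disks[chosen].last_offset = offset + length;
	disks[chosen].inflight++;
	return (chosen);
}
```

A caller would decrement inflight when a request completes; that part
is left out here. In this sketch a sequential stream keeps landing on
the same disk, while a read far away from every cached offset goes to
whichever disk has the fewest outstanding requests.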

I have tested a few values of HDD_CACHE_SIZE from 1MB to 8MB and did not
find much difference in performance, which probably suggests that most
of the improvement comes from clustering very close reads.

To measure the effect I have run a few benchmarks:

- file copy over gigabit LAN (SMB)

- local bonnie++

- local raidtest

- Intel NASPT over gigabit LAN (SMB)

Perhaps the most obvious improvement I've seen was in the single-thread
copy to the Vista SMB client: the speed increased from some 55MB/sec to
86MB/sec. Due to its fully random nature there has been no improvement
in the raidtest results (no degradation either). All other benchmarks
have shown improvement in all I/O bound read tests, ranging from 20% to
400%. The latter was observed in bonnie++, with random create speed
increasing from 5,000/sec to 20,000/sec. No test has registered any
measurable speed degradation.

For example, below are typical results with NASPT (numbers are in MB/sec):

New code:
Test: HDVideo_1Play Throughput: 38.540
Test: HDVideo_2Play Throughput: 29.655
Test: HDVideo_4Play Throughput: 32.885
Test: HDVideo_1Record Throughput: 33.925
Test: HDVideo_1Play_1Record Throughput: 23.967
Test: ContentCreation Throughput: 14.012
Test: OfficeProductivity Throughput: 20.053
Test: FileCopyToNAS Throughput: 24.906
Test: FileCopyFromNAS Throughput: 46.035
Test: DirectoryCopyToNAS Throughput: 11.367
Test: DirectoryCopyFromNAS Throughput: 17.806
Test: PhotoAlbum Throughput: 19.161

Old code:
Test: HDVideo_1Play Throughput: 26.037
Test: HDVideo_2Play Throughput: 28.666
Test: HDVideo_4Play Throughput: 31.623
Test: HDVideo_1Record Throughput: 29.714
Test: HDVideo_1Play_1Record Throughput: 16.857
Test: ContentCreation Throughput: 11.934
Test: OfficeProductivity Throughput: 18.524
Test: FileCopyToNAS Throughput: 25.329
Test: FileCopyFromNAS Throughput: 26.182
Test: DirectoryCopyToNAS Throughput: 10.139
Test: DirectoryCopyFromNAS Throughput: 13.306
Test: PhotoAlbum Throughput: 20.783

The patch is available here:
http://sobomax.sippysoft.com/~sobomax/geom_mirror.diff. I would like to
get input on the functionality/code itself, as well as on the best way
to add this functionality. Right now, it's part of the round-robin
balancing code. Technically, it could be added as a separate new
balancing method, but for the reasons outlined above I really doubt
that "pure" round-robin has any practical value now. The only case
where the previous behavior might be beneficial is with solid-state/RAM
disks, where there is virtually no seek time, so that reading nearby
sectors from two separate disks could actually give better speed.
At the very least, the new method should become the default, with "old
round-robin" remaining as another option with clearly documented
shortcomings. I would really like to hear what people think about that.

-Maxim