Improving geom_mirror(4)'s read balancing
Maxim Sobolev
sobomax at FreeBSD.org
Tue Apr 28 09:20:49 UTC 2009
Hi,
We have a few production systems using geom_mirror. The functionality is
rock solid, however I have noticed that the read performance of the
array, especially for sequential reads, is often worse than the
performance of a single member of the array under the same conditions,
which made me curious as to what's going on there.
After a little bit of research and experimenting with different
settings, I came to the conclusion that this problem happens because all
of the read balancing algorithms implemented in geom_mirror ignore an
important property of modern hard drives. In particular, I am talking
about the fact that even when asked to read a single sector, the drive
usually reads the whole track and stores it in its internal buffer.
Therefore, sending requests for sectors N and N+1 to different drives
(round-robin), or splitting one big request and sending the pieces to
two separate disks in parallel (split), *degrades* combined performance
compared to reading from a single drive instead of improving it. The
observed decline apparently comes from additional latency: the disks
need different amounts of time to position themselves over the track in
question, which increases the combined latency on average. Sustained
linear transfer speed is limited by the platter-to-buffer speed, not the
buffer-to-interface speed, so combining two or more streams gains
nothing. Moreover, such "balancing" causes both disks to seek,
potentially distracting one of them from serving other requests in the
meantime and reducing the RAID's capacity for handling concurrent requests.
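As a back-of-envelope illustration of the latency point (my own
simplified model, not something from the patch): assume seek time is
negligible and each disk's rotational positioning time is independent
and uniform on [0, R], where R is one revolution. A single disk is
positioned after E[T] = R/2 on average, but a request split across two
disks completes only when the slower of the two is in position, i.e.
after E[max(T_1, T_2)] = 2R/3 -- roughly a one-third increase in average
latency before any data flows, with both spindles tied up instead of one.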
As a result I have produced a small patch, which caches the offset of
the last served request in the per-disk parameters and sends subsequent
requests that fall within a certain area around that offset to the same
disk. In addition, it implements another small optimization: it looks at
the number of outstanding requests and uses only the least busy disks
for round-robin. This should smooth out any inequality in load
distribution caused by the proximity algorithm and also help in cases
when disks need different amounts of time to complete their read or
write requests. Most of the improvement comes from the first part of the
patch, though.
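To make the idea concrete, here is a simplified, self-contained sketch
of the dispatch logic described above. The structure and identifier
names (disk_state, last_offset, inflight, pick_disk) are illustrative
only and are not the ones used in the actual patch; see the diff linked
below for the real code.

#include <limits.h>
#include <stdint.h>
#include <stddef.h>

#define HDD_CACHE_SIZE	(2 * 1024 * 1024)  /* assumed proximity window, bytes */

struct disk_state {
	int64_t	last_offset;	/* offset of the last request sent to this disk */
	int	inflight;	/* number of requests currently outstanding */
};

/*
 * Pick the disk that should serve a read starting at 'offset'.
 * First preference: a disk whose previous request was within
 * HDD_CACHE_SIZE of this one, so its internal track cache is reused.
 * Otherwise: round-robin, but only over the least busy disks.
 * The caller is expected to update last_offset and inflight when it
 * dispatches and completes requests.
 */
static struct disk_state *
pick_disk(struct disk_state *disks, size_t ndisks, int64_t offset)
{
	static size_t rr = 0;		/* round-robin cursor */
	struct disk_state *best = NULL;
	int min_inflight = INT_MAX;
	size_t i, j;

	/* 1. Proximity: stay on the disk that just read nearby sectors. */
	for (i = 0; i < ndisks; i++) {
		int64_t d = offset - disks[i].last_offset;

		if (d > -HDD_CACHE_SIZE && d < HDD_CACHE_SIZE)
			return (&disks[i]);
	}

	/* 2. Find the smallest outstanding-request count. */
	for (i = 0; i < ndisks; i++) {
		if (disks[i].inflight < min_inflight)
			min_inflight = disks[i].inflight;
	}

	/* 3. Round-robin only among the least busy disks. */
	for (i = 0; i < ndisks; i++) {
		j = (rr + i) % ndisks;
		if (disks[j].inflight == min_inflight) {
			rr = (j + 1) % ndisks;
			best = &disks[j];
			break;
		}
	}
	return (best);
}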
I have tested a few values of HDD_CACHE_SIZE from 1MB to 8MB and did not
find much difference in performance, which probably suggests that most
of the improvement comes from clustering very close reads.
To measure the effect I have run a few benchmarks:
- file copy over gigabit LAN (SMB)
- local bonnie++
- local raidtest
- Intel NASPT over gigabit LAN (SMB)
Perhaps the most obvious improvement I've seen is in the single-threaded
copy to a Vista SMB client: the speed increased from about 55MB/sec to
86MB/sec. Due to its fully random nature there was no improvement in the
raidtest results (no degradation either). All other benchmarks showed
improvement in all I/O-bound read tests, ranging from 20% to 400%. The
latter was observed in bonnie++, with the random create rate increasing
from 5,000/sec to 20,000/sec. No test registered any measurable speed
degradation.
For example, below are typical results with NASPT (numbers are in MB/sec):
New code:
Test: HDVideo_1Play Throughput: 38.540
Test: HDVideo_2Play Throughput: 29.655
Test: HDVideo_4Play Throughput: 32.885
Test: HDVideo_1Record Throughput: 33.925
Test: HDVideo_1Play_1Record Throughput: 23.967
Test: ContentCreation Throughput: 14.012
Test: OfficeProductivity Throughput: 20.053
Test: FileCopyToNAS Throughput: 24.906
Test: FileCopyFromNAS Throughput: 46.035
Test: DirectoryCopyToNAS Throughput: 11.367
Test: DirectoryCopyFromNAS Throughput: 17.806
Test: PhotoAlbum Throughput: 19.161
Old code:
Test: HDVideo_1Play Throughput: 26.037
Test: HDVideo_2Play Throughput: 28.666
Test: HDVideo_4Play Throughput: 31.623
Test: HDVideo_1Record Throughput: 29.714
Test: HDVideo_1Play_1Record Throughput: 16.857
Test: ContentCreation Throughput: 11.934
Test: OfficeProductivity Throughput: 18.524
Test: FileCopyToNAS Throughput: 25.329
Test: FileCopyFromNAS Throughput: 26.182
Test: DirectoryCopyToNAS Throughput: 10.139
Test: DirectoryCopyFromNAS Throughput: 13.306
Test: PhotoAlbum Throughput: 20.783
The patch is available here:
http://sobomax.sippysoft.com/~sobomax/geom_mirror.diff. I would like to
get input on the functionality/code itself, as well as on what the best
way to add this functionality is. Right now, it's part of the
round-robin balancing code. Technically, it could be added as a separate
new balancing method, but for the reasons outlined above I really doubt
a "pure" round-robin has any practical value now. The only case where
the previous behavior might be beneficial is with solid-state/RAM disks,
where there is virtually no seek time, so reading adjacent sectors from
two separate disks could actually improve speed. At the very least, the
new method should become the default, with "old round-robin" kept as
another option with clearly documented shortcomings. I would really like
to hear what people think about that.
-Maxim