ATA disk performance (ICH2 controller), some tests and comparison with Linux 2.6.5

Matthew Dillon dillon at apollo.backplane.com
Sun Sep 26 21:24:24 PDT 2004


:> ii) Now I tried a fair raw throughput comparison between Linux and 
:>    FreeBSD. This time I always read the same whole (Linux) partition 
:>    (~4GB) so that the results should be comparable. I always used the
:>    native dd (FreeBSD and Linux). A run with the Linux dd under
:>    emulation on FreeBSD gave the same result:
:> 
:>    dd if=/dev/ad0s9 bs=nnn  of=/dev/null
:>    FreeBSD:
:> nnn=4k:   4301789184 bytes transferred in 170.031898 secs (25299895 
:...
:> 
:>    I notice that the rates are very similar if bs >= 16k. Under FreeBSD 
:>    the raw throughput rate depends on the block size. The read rate under
:>    Linux is independent of the block size. Is there a special reason for that?

:...
:
:Last I heard Linux did not have a raw device.  This means that reads will
:always be cached and will get various read-ahead optimizations, while
:FreeBSD no longer has a buffered/cooked device.  Without the cooked
:device FreeBSD has to suffer the latency of each command to the drive...
:
:-- 
:  John-Mark Gurney				Voice: +1 415 225 5579

    Don't guess, experiment!

    I'm sure Linux has a device monitor program.  If not iostat, then
    something else.  It should be fairly easy to determine whether it is
    buffering the data.  The numbers alone don't tell the story.

    You can glean a lot of information by running a script like this:

#!/bin/csh
#
# Read the same 64MB from the raw disk at each doubling block size so
# the runs are directly comparable; 'time' reports the user and system
# cpu spent shepherding each transfer.
time dd if=/dev/ad0 bs=512 of=/dev/null count=131072
time dd if=/dev/ad0 bs=1k of=/dev/null count=65536
time dd if=/dev/ad0 bs=2k of=/dev/null count=32768
time dd if=/dev/ad0 bs=4k of=/dev/null count=16384
time dd if=/dev/ad0 bs=8k of=/dev/null count=8192
time dd if=/dev/ad0 bs=16k of=/dev/null count=4096
time dd if=/dev/ad0 bs=32k of=/dev/null count=2048
time dd if=/dev/ad0 bs=64k of=/dev/null count=1024
time dd if=/dev/ad0 bs=128k of=/dev/null count=512
time dd if=/dev/ad0 bs=256k of=/dev/null count=256
time dd if=/dev/ad0 bs=512k of=/dev/null count=128

    In particular, you can see both the transfer rate and the user and 
    supervisor overheads involved in shepherding the transfer.  At some
    point the cpu stops being saturated and the transfer rate maxes out,
    but even after that happens you can see that the larger block sizes
    require less supervisor overhead.

    If you don't see this sort of marked reduction in supervisor overhead
    with Linux then Linux is probably buffering the I/O.
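
    One way to take the page cache out of the picture on the Linux side
    is O_DIRECT.  A minimal sketch, assuming a GNU dd new enough to
    support iflag=direct, that the disk shows up as /dev/hda, and that
    the sysstat package supplies an iostat:

# same sweep as above, but bypassing the Linux page cache entirely
time dd if=/dev/hda bs=4k  of=/dev/null count=16384 iflag=direct
time dd if=/dev/hda bs=16k of=/dev/null count=4096  iflag=direct
time dd if=/dev/hda bs=64k of=/dev/null count=1024  iflag=direct
# in another window, watch the request sizes the device actually sees
iostat -x 1

    If the direct reads show the same block-size dependence FreeBSD shows,
    then the flat numbers measured earlier were the cache and read-ahead
    talking, not the hardware.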

					    TRANSFER RATE
					    VVVVVVVV
67108864 bytes transferred in 8.996482 secs (7459456 bytes/sec)
0.007u 3.412s 0:08.99 37.9%     13+29k 0+0io 0pf+0w
^^^^^  ^^^^^	     ^^^^^
USER    SUPERVISOR   CPU PERCENTAGE
		     (note: only 37% so raw device transaction overheads
		     are likely limiting throughput)

67108864 bytes transferred in 4.563335 secs (14706101 bytes/sec)
0.000u 1.608s 0:04.56 35.0%     10+22k 0+0io 0pf+0w
67108864 bytes transferred in 2.430306 secs (27613340 bytes/sec)
0.000u 0.773s 0:02.43 31.6%     10+23k 0+0io 0pf+0w
67108864 bytes transferred in 1.538318 secs (43624834 bytes/sec)
0.000u 0.312s 0:01.53 20.2%     15+35k 0+0io 0pf+0w
67108864 bytes transferred in 1.264007 secs (53092158 bytes/sec)
0.015u 0.132s 0:01.26 11.1%     24+58k 0+0io 0pf+0w
					    ^^^^^^^^
					    transfer rate maxes out, but

		     cpu time continues to
		     decrease
		     vvvvv
67108864 bytes transferred in 1.174806 secs (57123364 bytes/sec)
0.000u 0.093s 0:01.17 7.6%      22+60k 0+0io 0pf+0w
67108864 bytes transferred in 1.208589 secs (55526618 bytes/sec)
0.000u 0.046s 0:01.20 3.3%      0+0k 0+0io 0pf+0w
67108864 bytes transferred in 1.241938 secs (54035605 bytes/sec)
0.000u 0.007s 0:01.24 0.0%      0+0k 0+0io 0pf+0w
67108864 bytes transferred in 1.208579 secs (55527089 bytes/sec)
0.007u 0.015s 0:01.20 0.8%      0+0k 0+0io 0pf+0w
67108864 bytes transferred in 1.183573 secs (56700232 bytes/sec)
0.000u 0.007s 0:01.18 0.0%      0+0k 0+0io 0pf+0w
67108864 bytes transferred in 1.200243 secs (55912731 bytes/sec)
0.000u 0.015s 0:01.20 0.8%      0+0k 0+0io 0pf+0w

Doing an 'iostat ad0 1' (in my case) at the same time in another window,
on *BSD anyway, tells you what the controller is actually being asked to do.
In this case it is obvious that the controller is being told to make tiny
transfers and is able to do 14000+ transactions/sec, and that even with the
next step up the controller is still only able to do 14000+ transactions/sec,
which indicates that the system has hit the controller's transaction rate
limit.

      tty             ad0             cpu
 tin tout  KB/t tps  MB/s  us ni sy in id
   0    0  0.50 14629  7.14   0  0 39  0 61
		^^^^^
		PHYSICAL TRANSACTIONS PER SECOND
		VVVVVV
   0    0  0.50 14626  7.14   0  0 40  0 60
   0    0  0.50 14627  7.14   0  0 39  0 61
   0    0  0.50 14634  7.15   0  0 42  0 58
   0    0  0.50 14638  7.15   0  0 41  0 59
   0    0  0.50 14491  7.08   0  0 40  0 60
   0    0  0.50 14629  7.14   0  0 37  0 63
   0    0  0.50 14630  7.14   0  0 39  0 61
   0    0  0.73 14393 10.26   0  0 27  0 73
   0    0  1.00 14389 14.05   0  0 38  0 62	<<< note, same max tps
   0    0  1.00 14386 14.05   0  0 35  0 65
   0    0  1.00 14391 14.05   0  0 30  0 70
   0    0  1.00 13915 13.59   0  0 25  0 75
   0    0  1.86 13384 24.35   0  0 28  0 72
   0    0  2.00 13468 26.30   0  0 23  0 77	<<< nearly same max tps
      tty             ad0             cpu
 tin tout  KB/t tps  MB/s  us ni sy in id
   0    0  2.74 12245 32.71   0  0 34  0 66
   0    0  4.00 10840 42.34   0  0 27  0 73	<<< now the tps starts to drop
						    with the larger block size
						    and the transfer is no
						    longer limited by the
						    controller or system.
   0    0  7.41 7061 51.11   0  0 16  0 84
   0    0 12.15 4500 53.37   0  0 10  0 90
   0    0 20.96 2562 52.45   0  0  7  0 93
   0    0 36.15 1461 51.57   0  0  2  0 98
   0    0 64.69 837 52.87   2  0  1  0 98
   0    0 124.72 439 53.46   0  0  2  0 98
   0    0 127.41 429 53.38   0  0  2  0 98
   0    0 127.69 406 50.62   0  0  1  0 99
   0    0 128.00 268 33.50   0  0  1  0 99

Generally speaking, since the hard drive itself will cache data off the
platter, reduced I/O bandwidth at smaller block sizes will almost always
come down to one of two limits: the controller's transaction rate, or the
cpu (the cpu gets overburdened).
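
You can sanity-check this against the iostat output above, since throughput
is just transaction size times transaction rate (MB/s ~ KB/t x tps / 1024):

      0.50 KB/t x 14629 tps =  7315 KB/s ~  7.14 MB/s   (controller-limited)
    124.72 KB/t x   439 tps = 54752 KB/s ~ 53.46 MB/s   (platter-limited)

The tps column stays pinned near 14600 until the block size gets large
enough that the ~53 MB/s media rate, rather than the transaction rate,
becomes the limit.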

Buffered access to a raw device is not necessarily a good thing.  In fact,
most of the time you don't want it, because you already have a caching
layer on top of your raw accesses: the filesystem buffer cache / VM cache,
or, in the case of a database, the database's own cache, which buffered
access would interfere with.

					-Matt
					Matthew Dillon 
					<dillon at backplane.com>

