dd(1) performance when copying a disk to another

Bruce Evans bde at zeta.org.au
Mon Oct 3 07:21:30 PDT 2005


On Mon, 3 Oct 2005, Patrick Proniewski wrote:

>>>> # dd if=/dev/ad4 of=/dev/null bs=1m count=1000
>>>> 1000+0 records in
>>>> 1000+0 records out
>>>> 1048576000 bytes transferred in 17.647464 secs (59417943
>>>> bytes/sec)

Many wrong answers to the original question have been given.  dd with
a block size of 1m between (separate) disk devices is much slower
just because that block size is far too large...

The above is a fairly normal speed.  The expected speed depends mainly
on the disk technology generation and the placement of the sectors being
read.  I get the following speeds for _sequential_ _reading_ from the
outer (fastest) tracks of 6- and 3-year-old drives which are about 2
generations apart:

%%%
Sep 25 21:52:35 besplex kernel: ad0: 29314MB <IBM-DTLA-307030> [59560/16/63] at ata0-master UDMA100
Sep 25 21:52:35 besplex kernel: ad2: 58644MB <IC35L060AVV207-0> [119150/16/63] at ata1-master UDMA100
ad0 bs 512: 16777216 bytes transferred in 2.788209 secs (6017201 bytes/sec)
ad0 bs 1024: 16777216 bytes transferred in 1.433675 secs (11702245 bytes/sec)
ad0 bs 2048: 16777216 bytes transferred in 0.787466 secs (21305320 bytes/sec)
ad0 bs 4096: 16777216 bytes transferred in 0.479757 secs (34970249 bytes/sec)
ad0 bs 8192: 16777216 bytes transferred in 0.477803 secs (35113250 bytes/sec)
ad0 bs 16384: 16777216 bytes transferred in 0.462006 secs (36313842 bytes/sec)
ad0 bs 32768: 16777216 bytes transferred in 0.462038 secs (36311331 bytes/sec)
ad0 bs 65536: 16777216 bytes transferred in 0.486850 secs (34460748 bytes/sec)
ad0 bs 131072: 16777216 bytes transferred in 0.462046 secs (36310693 bytes/sec)
ad0 bs 262144: 16777216 bytes transferred in 0.469866 secs (35706382 bytes/sec)
ad0 bs 524288: 16777216 bytes transferred in 0.462035 secs (36311555 bytes/sec)
ad0 bs 1048576: 16777216 bytes transferred in 0.478534 secs (35059612 bytes/sec)
ad2 bs 512: 16777216 bytes transferred in 4.115675 secs (4076419 bytes/sec)
ad2 bs 1024: 16777216 bytes transferred in 2.105451 secs (7968466 bytes/sec)
ad2 bs 2048: 16777216 bytes transferred in 1.132157 secs (14818809 bytes/sec)
ad2 bs 4096: 16777216 bytes transferred in 0.662452 secs (25325935 bytes/sec)
ad2 bs 8192: 16777216 bytes transferred in 0.454654 secs (36901065 bytes/sec)
ad2 bs 16384: 16777216 bytes transferred in 0.304761 secs (55050416 bytes/sec)
ad2 bs 32768: 16777216 bytes transferred in 0.304761 secs (55050416 bytes/sec)
ad2 bs 65536: 16777216 bytes transferred in 0.304765 secs (55049683 bytes/sec)
ad2 bs 131072: 16777216 bytes transferred in 0.304762 secs (55050200 bytes/sec)
ad2 bs 262144: 16777216 bytes transferred in 0.304760 secs (55050588 bytes/sec)
ad2 bs 524288: 16777216 bytes transferred in 0.304762 secs (55050200 bytes/sec)
ad2 bs 1048576: 16777216 bytes transferred in 0.304757 secs (55051148 bytes/sec)
%%%
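A loop like the following sketch produces numbers in the shape of the table above.  The device name is an assumption; /dev/zero is used here so the sketch runs anywhere without touching a disk, so substitute your own disk (e.g. /dev/ad0) for a real measurement.  It reads 16MB at each power-of-two block size:

```shell
# Read 16MB at each block size and keep dd's summary line (the timing info).
# dev=/dev/zero is a stand-in; use a real disk device for actual numbers.
dev=/dev/zero
for bs in 512 1024 2048 4096 8192 16384 32768 65536 131072 262144 524288 1048576; do
    printf '%s bs %s: ' "$dev" "$bs"
    dd if="$dev" of=/dev/null bs="$bs" count=$((16777216 / bs)) 2>&1 | tail -1
done
```

Numeric byte counts are used for bs because the size suffixes differ between BSD and GNU dd (lowercase m works on BSD, not everywhere).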

Drive technology hit a speed plateau a few years ago so newer single drives
aren't much faster unless they are more expensive and/or smaller.

The speed is low for small block sizes because the device has to be
talked to too much and the protocol and firmware are not very good.
(Another drive, a WDC 120GB with more cache (8MB instead of 2), ramps
up to about half speed (26MB/sec) for a block size of 4K but sticks
at that speed for block sizes of 8K and 16K, then jumps up to full speed
for block sizes of 32K and larger.  This indicates some firmware
stupidity.)  Most drives ramp up almost linearly in the block size
(doubling the block size almost doubles the speed).  This behaviour is
especially evident on slow SCSI drives like some (most?) ZIP and dvd/cd
drives.  The command overhead can be 20 msec, so you had better not do
one 512-byte i/o per command or you will get a speed of 25K/sec.  The
command overhead of a new ATA drive is more like 50 usec, but that is
still far too much for high speed with a block size of 512 bytes.
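The overhead figures above can be sanity-checked with back-of-envelope arithmetic: with one command per 512-byte block, the per-command overhead alone caps the transfer rate.

```python
# Transfer-rate ceiling imposed by per-command overhead at bs=512.
block = 512                  # bytes moved per command
print(block / 20e-3)         # 20 ms overhead (slow SCSI): 25600 B/s, i.e. ~25K/sec
print(block / 50e-6)         # 50 us overhead (new ATA): ~10 MB/s ceiling
```

The second figure is only a ceiling; the ~6MB/sec that ad0 actually achieves at bs=512 above suggests the real per-command cost there is somewhat higher.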

The speed is insignificantly different for block sizes larger than a
limit because the drive's physical limits dominate except possibly
with old (slow) CPUs.

>>> That seems to be 2 or about 2 times faster than disc->disc
>>> transfer... But still slower, than I would have expected...
>>> SATA150 sounds like the drive can do 150MB/sec...
>
> As Eric pointed out, you just can't reach 150 MB/s with one disk, it's a 
> technological maximum for the bus, but real world performance is well below 
> this max.
> In fact, I thought I would reach about 50 to 60 MB/s.

50-60 MB/s is about right.  I haven't benchmarked any SATA or very new
drives.  Apparently they are not much faster.  ISTR that WDC Raptors are
specced for 70-80MB/sec.  You pay twice as much to get a tiny drive with
only 25% more throughput plus faster seeks.

>>>>> (Maybe you could find a way to copy /dev/zero to /dev/ad6
>>>>> without destroying the previous work... :-))
>>>> 
>>>> well, not very easy both disk are the same size ;)
>
>>> I thought of the first 1000 1MB blocks... :-)
>
> damn, I misread this one... :)
> I'm gonna try this asap.

I divide disks into equally sized (fairly small, or half the disk size)
partitions, and cp between them.  dd is too hard to use for me ;-).  cp
is easier to type and automatically picks a reasonable block size.  Of
course I use dd if the block size needs to be controlled, but mostly I
only use it in preference to cp to get its timing info.
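The partition approach can be sketched as follows, here demonstrated on ordinary files so it runs anywhere; with real disks the arguments would be equally sized partition devices (e.g. /dev/ad0s1e and /dev/ad2s1e, names illustrative only):

```shell
# cp picks a sensible buffer size by itself, so there is no bs to get wrong;
# dd is kept around mainly for its timing summary.
src=$(mktemp); dst=$(mktemp)
dd if=/dev/zero of="$src" bs=65536 count=16 2>/dev/null   # 1MB stand-in "partition"
cp "$src" "$dst"
```

With partition devices of equal size, cp reads the source until EOF and writes it to the destination, which is exactly the whole-partition copy.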

>...
>> Have you tried a smaller block size?  What does 8k, 16k, or 512k do for 
>> you?  There really isn't much room for improvement here on a single device.
>
> nop, I'll try one of them, but I can't do many experiments, the box is in my 
> living room, it's a 1U rack, and it's VERY VERY noisy. My girlfriend will 
> kill me if it's running more than an hour a day :))

Smaller block sizes will go much faster, except for copying from a disk to
itself.  Large block sizes are normally a pessimization, and the pessimization
is especially noticeable for dd.  Just use the smallest block size that gives
an almost-maximal throughput (e.g., 16K for reading ad2 above, possibly
different for writing).  Large block sizes are pessimal for the synchronous
i/o that dd does.  The timing for dd'ing blocks of size N MB at R MB/sec
between ad0 and ad2 is something like:

 	time in secs	activity on ad0		activity on ad2
 	------------	---------------		---------------
 	0		start read of 1 block	idle
 	N/R		finish read; idle	start write of 1 block
 	2*N/R-epsilon	start read of 1 block	pretend to complete write
 	2*N/R		continue read		complete write
 	3*N/R-epsilon	finish read; idle	start write of 1 block
 	4*N/R-2*epsilon	...			...

After the first block (which takes a little longer), it takes 2*N/R-epsilon
seconds to copy 1 block, where epsilon is the time between the writer's
pretending to complete the write and actually completing it.  This time
is obviously not very dependent on the block size, since it is limited by
the drive's resources and policies (in particular, if the drive doesn't do
write caching, perhaps because write caching is not enabled, then epsilon is
0, and if our block size is large compared with the drive's cache then the
drive won't be able to signal completion until no more than the drive's
cache size is left to do).  Thus epsilon becomes small relative to the
N/R term when N is large.  Apparently, in your case the speed drops from
59MB/sec to 35MB/sec, so with N == 1 and R == 59, epsilon is about 1/200.
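The 1/200 figure can be checked numerically, assuming the per-block copy time is one full read plus a write shortened by the early completion, so that it equals N/C at the observed copy speed C:

```python
# Rough check of the epsilon estimate: 2*N/R - epsilon = N/C.
N = 1.0        # block size, MB
R = 59.0       # single-disk read speed, MB/s
C = 35.0       # observed disk-to-disk copy speed, MB/s
epsilon = 2 * N / R - N / C
print(epsilon)           # ~0.0053 s, i.e. roughly 1/200 of a second
```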

With large block sizes, the speed can be increased using asynchronous output.
There is a utility (in ports) named team that fakes async output using
separate processes.  I have never used it.  Something as simple as 2
dd's in a pipe should work OK.
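The two-dd pipeline can be sketched as below, again on temporary files so it runs anywhere; for a real copy the first dd would read one disk and the second write the other (e.g. if=/dev/ad0 ... of=/dev/ad2):

```shell
# Two dd's in a pipe: the first can keep reading while the second drains the
# pipe and writes, so reads and writes overlap instead of strictly alternating.
src=$(mktemp); dst=$(mktemp)
dd if=/dev/zero of="$src" bs=65536 count=16 2>/dev/null    # 1MB of sample data
dd if="$src" bs=65536 2>/dev/null | dd of="$dst" bs=65536 2>/dev/null
```

The pipe buffer provides the decoupling; the reading dd runs ahead by up to the pipe's capacity, which is what "fakes" the async output.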

For copying from a disk to itself, a large block size is needed to limit the
number of seeks, and concurrent reads and writes are exactly what is not
needed (since they would give competing seeks).  The i/o must be
serialized, and dd does the right things for this, though the drive
might not (you would prefer epsilon == 0, since if the drive signals
write completion early then it might get confused when you flood it
with the next read, seek to start the read before it completes the
write, and then thrash back and forth between writing and reading).

It is interesting that writing large sequential files to at least the
ffs file system (not mounted with -sync) in FreeBSD is slightly faster
than writing directly to the raw disk using write(2), even if the
device driver sees almost the same block sizes for these different
operations.  This is because write(2) is synchronous and sync writes
always cause idle periods (the idle periods are just much smaller for
writing data that is already in memory), while the kernel uses async
writes for data.

Bruce