NCQ vs UFS/ZFS benchmark [Was: Re: FreeBSD 8.0 Performance (at
ivoras at freebsd.org
Thu Dec 3 09:00:42 UTC 2009
Alexander Motin wrote:
> Ivan Voras wrote:
>> If you have a drive to play with, could you also check UFS vs ZFS on
>> both ATA & AHCI, to see whether the I/O scheduling of ZFS plays nicely?
>> For benchmarks I suggest blogbench and bonnie++ (in ports) and, if you
>> want to bother, randomio, http://arctic.org/~dean/randomio .
> gstat showed that most of the time only one request at a time was running on
> the disk. It looks like read or read-modify-write operations (due to the many
> short writes in the test pattern) are heavily serialized in UFS, even when
> several processes work with the same file. This almost eliminated the effect
> of NCQ in this test.
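Just to check that I understand the test pattern: something like the sketch
below, where each of N processes runs this loop against one shared,
pre-created file? Plain C, and all names and sizes here are my guesses, not
your actual tool.

/*
 * Minimal sketch (my reconstruction, not the actual benchmark): short
 * writes at random offsets in one shared file.  A write shorter than
 * the fs block size forces a read-modify-write; dropping O_DIRECT
 * gives the cached behaviour of Test 2 below.
 */
#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define IOSIZE	512			/* short write => RMW */
#define NBLKS	(16 * 1024 * 1024)	/* ~8 GB test file, assumed */

int
main(void)
{
	char buf[IOSIZE];
	off_t off;
	int fd, i;

	memset(buf, 0, sizeof(buf));
	/* assumes "testfile" was pre-created at full size */
	fd = open("testfile", O_RDWR | O_DIRECT);
	if (fd == -1)
		err(1, "open");
	srandomdev();
	for (i = 0; i < 100000; i++) {
		off = (off_t)(random() % NBLKS) * IOSIZE;
		if (pwrite(fd, buf, IOSIZE, off) != IOSIZE)
			err(1, "pwrite");
	}
	return (0);
}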
> Test 2: Same as before, but without the O_DIRECT flag:
> ata(4), 1 process, first tps: 78
> ata(4), 1 process, second tps: 469
> ata(4), 32 processes, first tps: 83
> ata(4), 32 processes, second tps: 475
> ahci(4), 1 process, first tps: 79
> ahci(4), 1 process, second tps: 476
> ahci(4), 32 processes, first tps: 93
> ahci(4), 32 processes, second tps: 488
Ok, so this is UFS, normal caching.
> Data doesn't fit into the cache. Multiple parallel requests give some effect
> even with the legacy driver, but with NCQ enabled the gain is much larger,
> almost doubling performance!
You've seen queueing in gstat for ZFS+NCQ?
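(For reference, what I'd watch is the L(q) column in gstat while the test
runs, e.g. with "gstat -I 500ms"; with NCQ really engaged it should climb
well above 1 under the 32-process load.)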
> Test 4: Same as 3, but with kmem_size=1900M and arc_max=1700M.
> ata(4), 1 process, first tps: 90
> ata(4), 1 process, second tps: ~160-300
> ata(4), 32 processes, first tps: 112
> ata(4), 32 processes, second tps: ~190-322
> ahci(4), 1 process, first tps: 90
> ahci(4), 1 process, second tps: ~140-300
> ahci(4), 32 processes, first tps: 180
> ahci(4), 32 processes, second tps: ~280-550
And this is ZFS with some tuning. I've also seen high variation in ZFS
performance, so this seems normal.
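For the archives, I assume the tuning above means something like the
following in /boot/loader.conf (the values are from your mail, the exact
spelling of the tunables is my assumption):

vm.kmem_size="1900M"
vfs.zfs.arc_max="1700M"

If I remember the ZFS tuning wiki correctly, going that high on i386 also
needs a kernel built with a larger KVA (options KVA_PAGES=512).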
> In conclusion:
> - in this particular test ZFS scaled well with parallel requests,
> effectively using multiple disks. NCQ showed great benefits. But i386
> constraints significantly limit ZFS's caching abilities.
> - UFS behaves very poorly in this test. Even with a parallel workload it
> often serializes device accesses. Maybe results would be different if
I wouldn't say UFS behaves poorly based on your results. It looks like only
the multiprocess case is bad on UFS. For single-process access the
difference in favour of ZFS is ~10 TPS on the first run, and UFS is
apparently much better in all cases but the last on the second run. This
could be explained by a large variation between runs.
Also, did you use the whole drive for the file system? In cases like
this it would be interesting to create a special partition (in all
cases, on all drives) covering only a small segment of the disk
(thinking of the drive as rotational media, made of cylinders): for
example, a 30 GB partition covering only the outer tracks.
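With gpart and GPT that could be something like (device name and numbers
are only an example):

gpart create -s GPT ad4
gpart add -b 34 -s 30G -t freebsd-ufs ad4
newfs /dev/ad4p1

Since low LBAs map to the outer tracks, the first partition sits in the
fastest zone of the disk, which takes the zoning variable out of the
UFS vs ZFS comparison.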
> there were a separate file for each process, or with some other
> options, but I think the pattern I used is also possible in some
> applications. The only benefit UFS has shown here is more effective
> memory management on i386, leading to higher cache effectiveness.
> It would be nice if somebody explained that UFS behavior.
Possibly, read-only access to the in-memory cache structures is protected
by shared (read) locks, which are cheap, while ZFS's ARC is more complex
than it's worth? But others should have better guesses :)