An order of magnitude higher IOPS needed with ZFS than UFS

Stefan Esser se at freebsd.org
Wed Jun 12 08:49:54 UTC 2013


On 12.06.2013 03:48, Bruce Evans wrote:
> I first thought that clobbering the pointer was a bug, but now I think
> it is a feature.  The i/o really is non-sequential.  Basing most i/o
> sequentiality guesses on a single per-disk pointer (shared across
> different partitions on the same disk) might work better than all
> the separate pointers.  Accesses that are sequential at the file level
> would only be considered sequential if no other physical accesses
> intervene.  After getting that right, use sequentiality guesses again
> to delay some physical accesses if they would intervene.

Hi Bruce,

I tend to disagree ... ;-)

Recognizing sequential reads on a per-file basis hints at whether
read-ahead (fetching data beyond the current request and keeping it
buffered for the access expected to follow) might be useful. This
"knowledge" can lead to drastically higher total throughput in
situations where multiple processes (or network clients) read files
sequentially (e.g. a media server delivering many parallel streams).
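
To illustrate what I mean by the per-file heuristic, here is a small
userland sketch (the structure and function names are invented, the
numbers are arbitrary, and this is not the code the kernel actually
uses):

#include <sys/types.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical per-open-file state; the names are made up for this sketch. */
struct file_ra_state {
        off_t   next_off;       /* offset right after the previous read */
        int     seqcount;       /* consecutive sequential reads seen    */
};

/*
 * Called for every read of 'len' bytes at 'off'.  Returns the number of
 * bytes worth reading ahead, 0 while the pattern still looks random.
 */
static size_t
ra_hint(struct file_ra_state *ra, off_t off, size_t len)
{
        size_t ahead;

        if (off == ra->next_off) {
                /* Continuation of the previous read: raise confidence. */
                if (ra->seqcount < 16)
                        ra->seqcount++;
        } else {
                /* A seek: lower confidence, but do not forget it at once. */
                ra->seqcount /= 2;
        }
        ra->next_off = off + (off_t)len;

        /* Scale read-ahead with confidence, capped at 1 MB. */
        ahead = (size_t)ra->seqcount * 64 * 1024;
        return (ahead > 1024 * 1024 ? 1024 * 1024 : ahead);
}

int
main(void)
{
        struct file_ra_state ra = { 0, 0 };
        off_t off = 0;

        /* One streaming reader issuing 128 kB sequential reads. */
        for (int i = 0; i < 8; i++) {
                printf("read at %8jd -> read-ahead hint %7zu\n",
                    (intmax_t)off, ra_hint(&ra, off, 128 * 1024));
                off += 128 * 1024;
        }
        return (0);
}

A process that streams a file keeps hitting the continuation branch,
so its hint grows; interleaving with other readers does not disturb
it, because the state lives with the file, not with the disk.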

If you try to recognize sequential accesses at the device level,
then you may identify cases where one reader is likely to perform
back-to-back reads. But in all other cases (and especially under
high load), you will not be able to identify the processes that
might be helped by reading larger chunks than requested (lowering
the number of seeks required and taking pressure off the storage).

So, I think you need the per-file heuristics to identify candidates
for read-ahead. And I doubt you can get the same effect by tracking
disk accesses.

Hmmm, you could keep a list of read-ahead pointers per disk, which
could be recycled in an LRU scheme. Any new read that continues a
prior read is detected and updates the corresponding pointer, which
lives in a struct together with a read-ahead flag or the amount to
read ahead. Access to this list of pointers could be sped up by a
hash table that points to them (the hash key being some number of
LSBs of the block number, e.g. for 256 or 1024 buckets). That way
the temporal distribution of the accesses could be included in the
heuristic: if sequential reads are spread out over a long time, then
their corresponding pointer is lost (after e.g. 256 or 1024
non-sequential accesses to the volume).
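
A rough userland sketch of that bookkeeping could look like this
(the tracker count, bucket count and read-ahead policy are arbitrary
numbers picked for the example, and I assume the hash key is taken
from the LSBs of the expected block number):

#include <sys/types.h>
#include <sys/queue.h>
#include <stdint.h>
#include <stdio.h>

#define RA_NTRACKERS    256             /* read-ahead pointers per volume */
#define RA_NBUCKETS     256             /* hash buckets                   */
#define RA_HASH(blk)    ((blk) & (RA_NBUCKETS - 1))

/* One "read-ahead pointer" as described above; the names are illustrative. */
struct ra_tracker {
        daddr_t                 expect;     /* block where the stream continues */
        int                     ahead;      /* blocks worth reading ahead       */
        LIST_ENTRY(ra_tracker)  hash_link;  /* chain in its hash bucket         */
        TAILQ_ENTRY(ra_tracker) lru_link;   /* position in the LRU list         */
};

static struct ra_tracker        ra_trackers[RA_NTRACKERS];
static LIST_HEAD(, ra_tracker)  ra_buckets[RA_NBUCKETS];
static TAILQ_HEAD(, ra_tracker) ra_lru = TAILQ_HEAD_INITIALIZER(ra_lru);

static void
ra_init(void)
{
        for (int i = 0; i < RA_NBUCKETS; i++)
                LIST_INIT(&ra_buckets[i]);
        for (int i = 0; i < RA_NTRACKERS; i++) {
                ra_trackers[i].expect = -1;     /* matches no real block */
                LIST_INSERT_HEAD(&ra_buckets[RA_HASH(ra_trackers[i].expect)],
                    &ra_trackers[i], hash_link);
                TAILQ_INSERT_TAIL(&ra_lru, &ra_trackers[i], lru_link);
        }
}

/*
 * Feed one read request (start block, length in blocks) into the table
 * and return the number of blocks that could be read ahead for it.
 */
static int
ra_access(daddr_t blk, int len)
{
        struct ra_tracker *t;

        /* Does this read continue a stream we already track? */
        LIST_FOREACH(t, &ra_buckets[RA_HASH(blk)], hash_link)
                if (t->expect == blk)
                        break;

        if (t == NULL) {
                /* Unknown stream: recycle the least recently used tracker. */
                t = TAILQ_FIRST(&ra_lru);
                t->ahead = 0;
        } else if (t->ahead < 64) {
                /* Sequential continuation: grow the read-ahead amount. */
                t->ahead += 8;
        }

        /* Re-file under the new expected block and mark most recently used. */
        LIST_REMOVE(t, hash_link);
        TAILQ_REMOVE(&ra_lru, t, lru_link);
        t->expect = blk + len;
        LIST_INSERT_HEAD(&ra_buckets[RA_HASH(t->expect)], t, hash_link);
        TAILQ_INSERT_TAIL(&ra_lru, t, lru_link);

        return (t->ahead);
}

int
main(void)
{
        daddr_t a = 1000, b = 500000;

        ra_init();

        /* Two interleaved sequential streams plus unrelated random reads. */
        for (int i = 0; i < 6; i++) {
                printf("A: blk %7jd -> ahead %2d\n",
                    (intmax_t)a, ra_access(a, 64));
                printf("B: blk %7jd -> ahead %2d\n",
                    (intmax_t)b, ra_access(b, 64));
                printf("noise          -> ahead %2d\n",
                    ra_access(7919 * (i + 1), 8));
                a += 64;
                b += 64;
        }
        return (0);
}

The simulation in main() shows the two interleaved streams keeping
their trackers and growing their read-ahead, while the random reads
just churn through the LRU end of the list until their pointers are
recycled.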

This could be implemented as a scheduler class in GEOM, I think
(to make it easily loadable and selectable per volume, but it might
also be appropriate for production use). That way different
strategies (with regard to read-ahead and the potential for
clustering of writes) could be tested.
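
The glue into such a scheduler class could be as small as consulting
the tracker for every incoming read request. The fragment below is
only meant to show the shape of that hook; it is not the real
g_gsched callback interface, it assumes kernel context (sys/param.h,
sys/bio.h) plus the ra_access() sketch from above, and registration
with the GEOM_SCHED framework as well as all locking are omitted:

/*
 * Hypothetical per-volume dispatch hook: for every read that passes
 * through the scheduler, ask the tracker table how many bytes beyond
 * the request might be worth fetching.
 */
static off_t
ra_sched_hint(struct bio *bp, u_int sectorsize)
{
        int ahead;

        if (bp->bio_cmd != BIO_READ)
                return (0);

        /* Translate the byte-based request into block numbers. */
        ahead = ra_access(bp->bio_offset / sectorsize,
            (int)howmany(bp->bio_length, sectorsize));

        /* Bytes the scheduler could fetch (or queue) beyond the request. */
        return ((off_t)ahead * sectorsize);
}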

It might be interesting to compare such a scheduler with the per-file
heuristics as currently implemented in the kernel ...

Best regards, STefan

