disk scheduling (was: Re: RFC: adding 'proxy' nodes to provider ports (with patch))

Luigi Rizzo rizzo at iet.unipi.it
Sat Mar 21 18:29:16 PDT 2009


On Sat, Mar 21, 2009 at 9:24 PM, Poul-Henning Kamp <phk at phk.freebsd.dk> wrote:
> In message <20090321200334.GB3102 at garage.freebsd.pl>, Pawel Jakub Dawidek writes:
>
>>       Special GEOM classes.
>>       ---------------------
>>
>>       - There are no special GEOM classes.
>>
>>I wonder if phk changed his opinion over time. :)
>
> He didn't.
>
>>Maybe instead of adding special providers and GEOM classes, the
>>infrastructure should be extended in some way, so that we won't use
>>the provider term to describe something that isn't really a regular
>>GEOM provider.
>
> I have not had time to read this entire thread, being somewhat
> snowed under with work elsewhere.
>
> First up, I am not sure I understand why the proxy nodes would
> be the (or even 'a') right solution for I/O scheduling.
>
> In fact, it is not at all clear to me that scheduling should
> happen inside geom.
>
> I would tend to think that it belongs in the device driver, where
> intelligent information about things like tagged queuing abilities
> can be taken into account.
>
> For any kind of scheduling to do anything non-trivial, requests
> need to be piled up so they can be reordered; doing that in
> places where bio's don't naturally pile up would require a damn
> good argument and strong numbers to convince me.
>
> Where they already do pile up, the existing disksort mechanism
> and API can be used.  (If you want to mess with the disksort
> *algorithm*, by all means do so, but that should not require
> you to hack up any APIs, apart from the one to select the algorithm).

The thread was meant to be on inserting transparent nodes in GEOM.

Scheduling was just the example where the problem came up, but
since you ask, let's take a short diversion (and let me relabel this
thread so we can discuss the two topics separately).

+ nobody disputes that the ideal place for scheduling is where
  requests naturally "pile up". Unfortunately, that ideal place is
  sometimes one we cannot access, namely the firmware of the disk
  drive.

+ some scheduling algorithms are "non work-conserving": they work
  by delaying some requests in the hope of saving some seeks. They
  can be very effective (we posted numbers in January; see the
  literature on anticipatory scheduling for more). Because of the
  way they work, these algorithms artificially cause queues to build
  up, so they can be implemented effectively even above the device
  driver (see the dispatcher sketch after this list).

+ changing disksort can do some things, but not everything one
  would want. E.g. if you need to delay requests (as several disk
  schedulers do), you need to interact heavily with the driver,
  e.g. to make sure it does not assume that the scheduler is
  work-conserving (some drivers do, as we found out in the GSoC 2005
  work on disk schedulers), and to find out which kind of locking to
  use when it is time to reinject the delayed requests (see the
  driver sketch after this list). So implementing certain scheduling
  algorithms in the device driver requires specific code in each and
  every driver.

+ of course, adding a disk scheduler to one's system is completely
  optional, and there is no intention to change any current default.

If you want a quick example of how some severe problems with the
current disk scheduling can be fixed even when scheduling above the
device driver, try the same experiments we did, first without a
scheduler and then with the geom_sched module that we posted:

1. run a few 'dd' readers in parallel on top of an ATA or SATA disk,
   and look at the overall throughput with and without the scheduler;

2. run a cvs update (or another seeky application) in parallel with
   a sequential dd reader, and look at how slowly 'dd' runs without
   the scheduler;

3. run a cvs update (or another seeky application) in parallel with
   a sequential dd writer, and look at how slowly cvs goes without
   the scheduler.  This is mostly an effect of the "capture effect"
   discussed below.

Examples #1 and #2 are a direct result of the request pattern issued
by readers, and cannot be fixed with work-conserving changes to
disksort.  Each reader has only one pending request at a time, so the
disk does a seek on every request and throughput degrades heavily.
With anticipation, after serving one request you give the process a
little time to issue the next one, so you can serve a short burst of
requests from each reader, boosting both individual and overall
throughput.

Example #3 is a result of the "capture effect" of our disksort:
writers have many pending requests, and if these are for contiguous
blocks, then once one of them is served the disk keeps serving the
same process, starving the others.  Here you can do a lot of useful
work even above the device driver, e.g. by not serving more than a
certain number of contiguous requests in a row (see the sketch
below).
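
For instance, here is a sketch of that "bounded run" idea (names and
the interface are made up; the real code would live next to the bioq
handling): allow at most a fixed number of back-to-back requests from
the same stream before giving the others a chance.

#define MAX_RUN 16                      /* tunable: max requests per run */

static int run_owner = -1;              /* stream owning the current run */
static int run_len;                     /* length of the current run */

/*
 * The scheduler asks this before continuing with a request from
 * 'owner'.  Returns 1 to go ahead, 0 to suggest switching to another
 * stream; after returning 0 the run is reset, so the same stream can
 * still be picked again if nobody else has pending work.
 */
static int
may_continue_run(int owner)
{
        if (owner != run_owner) {       /* a switch already happened */
                run_owner = owner;
                run_len = 1;
                return (1);
        }
        if (++run_len > MAX_RUN) {
                run_owner = -1;         /* end the run */
                run_len = 0;
                return (0);
        }
        return (1);
}

A real scheduler would of course also choose the next stream more
carefully, but the point is that this kind of bound needs nothing
from the driver.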

cheers
luigi

