[rfc] small bioq patch

Mon Oct 21 03:37:00 UTC 2013

Maksim Yevmenkin wrote this message on Tue, Oct 15, 2013 at 11:15 -0700:
> On Fri, Oct 11, 2013 at 5:14 PM, John-Mark Gurney <jmg at funkthat.com> wrote:
> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 15:39 -0700:
> >> > On Oct 11, 2013, at 2:52 PM, John-Mark Gurney <jmg at funkthat.com> wrote:
> >> >
> >> > Maksim Yevmenkin wrote this message on Fri, Oct 11, 2013 at 11:17 -0700:
> >> >> i would like to submit the attached bioq patch for review and
> >> >> comments. this is proof of concept. it helps with smoothing disk read
> >> >> service times and appears to eliminate outliers. please see attached
> >> >> pictures (about a week worth of data)
> >> >>
> >> >> - c034 "control" unmodified system
> >> >> - c044 patched system
> >> >
> >> > Can you describe how you got this data?  Were you using the gstat
> >> > code or some other code?
> >>
> >> Yes, it's basically gstat data.
> >
> > The reason I ask this is that I don't think the data you are getting
> > from gstat is what you think it is...  It accumulates time for a set
> > of operations and then divides by the count...  So I'm not sure if the
> > stat improvements you are seeing are as meaningful as you might think
> > they are...
> 
> yes, i'm aware of it. however, i'm not aware of "better" tools. we
> also use dtrace and PCM/PMC. ktrace is not particularly usable for us
> because it does not really work well when we push the system above 5 Gbps.
> in order to actually see any "issues" we need to push the system to the 10
> Gbps range at least.
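
To illustrate the concern about accumulate-then-divide statistics, here is
a minimal sketch. The sample numbers are made up for illustration (99 reads
at 10 ms plus one 6 s outlier) and are not measurements from either system;
the function name is hypothetical.

```python
def summarize(latencies_ms):
    """Compare an accumulate-then-divide average with per-request figures."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    # Nearest-rank 99th percentile, using integer math: ceil(0.99 * n).
    rank = (99 * n + 99) // 100
    return {
        "avg": sum(ordered) / n,   # total time / op count, as described above
        "p99": ordered[rank - 1],
        "max": ordered[-1],
    }


# Hypothetical interval: 99 reads at 10 ms and a single 6 s outlier.
sample = [10.0] * 99 + [6000.0]
stats = summarize(sample)
print(stats)
```

The average for this interval comes out near 70 ms, so a single multi-second
read is only visible if per-request times (or at least a max) are recorded.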

So, I put a test together using dtrace...  And my test wasn't a big
test, but I put HEAD on a 16G fs, and did a:
find /mnt -type f -exec cat {} +

And I varied the sysctl values as 0, 16, 32, 64, 128, 256 and 512... I
was unable to get a significant difference between the runs... Between
each run I would unmount the fs to make sure the fs cache was clean...

I've posted my scripts/results at:
https://people.freebsd.org/~jmg/disklat/

genresults runs the dtrace script and gets the results
makeresults extracts the results from each run
disklatencycmd.d is the dtrace script I used
catall/catallp is the script containing the find command... I tried two
different versions of the command, one single threaded as above, the
other using xargs w/ 4 processes running...  the p4 results are from
the xargs, the sing is the find -exec command above...

Though I will admit that before the patch, on occasion I did see a max
latency of 6s, but in these tests I didn't see it...  The disk was:
Model Family:     Maxtor MaXLine Pro 500
Device Model:     Maxtor 7H500F0
Firmware Version: HA431DN0
User Capacity:    500,107,862,016 bytes [500 GB]

and the partition was close to the beginning of the drive...

> >> >> graphs show max/avg disk read service times for both systems across 36
> >> >> spinning drives. both systems are relatively busy serving production
> >> >> traffic (about 10 Gbps at peak). grey shaded areas on the graphs
> >> >> represent time when systems are refreshing their content, i.e. disks
> >> >> are both reading and writing at the same time.
> >> >
> >> > Can you describe why you think this change makes an improvement?  Unless
> >> > you're running 10k or 15k RPM drives, 128 seems like a large number.. as
> >> > that's about half the number of IOPs that a normal HD handles in a second..
> >>
> >> Our (Netflix) load is basically random disk io. We have tweaked the system to ensure that our io path is "wide" enough, i.e. we read 1mb per disk io for the majority of the requests. However, the offsets we read from are all over the place. It appears that we are getting into a situation where larger offsets are getting delayed because smaller offsets are "jumping" ahead of them. Forcing a bioq insert tail operation, and effectively moving the insertion point, seems to help avoid getting into this situation. And, no. We don't use 10k or 15k drives. Just regular enterprise 7200 sata drives.
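
A toy model of the behavior described above may help. This is only a sketch
under assumptions: the patch itself is not shown in this thread, the names
ToyBioq, batched and batch_size are hypothetical, a batch size of 0 is
assumed to mean "never force a tail insert", and the one-way elevator key
used by the real bioq_disksort() is left out.

```python
class ToyBioq:
    """Toy sorted disk queue with a movable insertion point (not kernel code)."""

    def __init__(self, batch_size):
        self.queue = []            # pending offsets, head at index 0
        self.insert_point = 0      # sorted inserts land at or after this index
        self.batched = 0           # sorted inserts since the last tail insert
        self.batch_size = batch_size

    def insert_tail(self, offset):
        # Append and move the insertion point past the new tail, so later
        # sorted inserts cannot jump ahead of anything already queued.
        self.queue.append(offset)
        self.insert_point = len(self.queue)
        self.batched = 0

    def disksort(self, offset):
        # Assumed patch behavior: after batch_size sorted inserts, force a tail insert.
        if self.batch_size and self.batched >= self.batch_size:
            self.insert_tail(offset)
            return
        i = self.insert_point
        while i < len(self.queue) and self.queue[i] <= offset:
            i += 1
        self.queue.insert(i, offset)
        self.batched += 1

    def pop_head(self):
        offset = self.queue.pop(0)
        self.insert_point = max(0, self.insert_point - 1)
        return offset


def steps_until_served(batch_size, victim=1_000_000, arrivals=1000):
    """One small-offset arrival and one dispatch per step; when is victim served?"""
    q = ToyBioq(batch_size)
    q.disksort(victim)
    for step in range(1, arrivals + 1):
        q.disksort(step)           # sorts ahead of the large-offset victim
        if q.pop_head() == victim:
            return step
    return None                    # victim still waiting after all arrivals


print(steps_until_served(0), steps_until_served(128))
```

In this model the large-offset request is never dispatched within 1000
arrivals when batching is off, and is dispatched on step 128 with a batch
size of 128, which is the kind of bounded delay described above.
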
> >
> > I assume that the 1mb reads are then further broken up into 8 128kb
> > reads? so it's more like every 16 reads in your work load that you
> > insert the "ordered" io...
> 
> i'm not sure where 128kb comes from. are you referring to
> MAXPHYS/DFLTPHYS? if so, then, no, we have increased *PHYS to 1MB.

Ahh, ok, so another difference between your system and HEAD...

> > I want to make sure that we choose the right value for this number..
> > What number of IOPs are you seeing?
> 
> generally we see < 100 IOPs per disk on a system pushing 10+ Gbps.

w/ 1MB IO's, that makes sense...
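
A back-of-the-envelope check of that, as a sketch. It assumes every byte
served comes off disk (no cache hits), that load is spread evenly over the
36 drives mentioned earlier, and that 1MB means 2^20 bytes; the function
name is hypothetical.

```python
def per_disk_iops(throughput_gbps, disks, io_size_bytes):
    """Rough IOs per second per disk needed to sustain a network throughput."""
    bytes_per_sec = throughput_gbps * 1e9 / 8   # Gbps -> bytes per second
    return bytes_per_sec / disks / io_size_bytes


iops = per_disk_iops(throughput_gbps=10, disks=36, io_size_bytes=1024 * 1024)
print(f"{iops:.1f} IOs/sec per disk")
```

Under those assumptions 10 Gbps works out to roughly 33 IOs per second per
disk, which is in line with the "< 100 IOPs per disk" figure.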

> i've experimented with different numbers on our system and i did not
> see much of a difference on our workload. i'm up to a value of 1024 now.
> higher numbers seem to produce a slightly bigger difference between
> average and max time, but i do not think it's statistically meaningful.
> general shape of the curve remains smooth for all tried values so far.
> 
> [...]
> 
> >> > Also, do you see a similar throughput of the system?
> >>
> >> Yes. We do see almost identical throughput from both systems.  I have not pushed the system to its limit yet, but having much smoother disk read service time is important for us because we use it as one of the components of system health metrics. We also need to ensure that disk io request is actually dispatched to the disk in a timely manner.
> >
> > Per above, have you measured at the application layer that you are
> > getting better latency times on your reads?  Maybe by doing a ktrace
> > of the io, and calculating times between read and return or something
> > like that...
> 
> ktrace is not particularly useful. i can see if i can come up with
> dtrace probe or something. our application (or rather clients) are
> _very_ sensitive to latency. having read service time outliers is not
> very good for us.

It shouldn't be hard to set up a dtrace probe similar to the one I use,
but put it on the read call/return of your app...

> > Have you looked at the geom disk schedulers work that Luigi did a few
> > years back?  There have been known issues w/ our io scheduler for a
> > long time...  If you search the mailing lists, you'll see lots of
> > reports of some processes starving out others, probably due to a
> > similar issue...  I've seen similar unfair behavior between processes,
> > but haven't spent time tracking it down...
> 
> yes, we have looked at it. it makes things worse for us, unfortunately.
> 
> > It does look like a good improvement though...
> >
> > Thanks for the work!
> 
> ok :) i'm interested to hear from people who have different workload
> profile. for example lots of iops, i.e. very small file reads or
> something like that.

It'd be interesting to run your workload through my scripts to see
whether you get different results...

-- 
  John-Mark Gurney				Voice: +1 415 225 5579

     "All that I will do, has been done, All that I have, has not."