devstat overhead VS precision

Poul-Henning Kamp phk at freebsd.org
Tue Apr 16 06:24:15 UTC 2013


In message <516C71BC.4000902 at FreeBSD.org>, Alexander Motin writes:
>On 15.04.2013 23:43, Poul-Henning Kamp wrote:
>> In message <516C515A.9090602 at FreeBSD.org>, Alexander Motin writes:
>>

>> For tuning anything on a non-ridiculous SSD device or modern
>> hard disks, it will be useless, because the bias you introduce is
>> *not* one which averages out over many operations.
>
>Could you please explain why?
>
>> The fundamental problem is that on a busy system, getbinuptime()
>> does not get called at random times, it will be heavily affected
>> by the I/O traffic, because of the interrupts, the bus-traffic
>> itself, the cache-effects of I/O transfers and the context-switches
>> by the processes causing the I/O.
>
>I'm sorry, but I am not sure I understand above paragraphs.

That was the exact explanation you asked for, and I'm not sure I can
find a better way to explain it, but I'll try:

Your assumption that the error will cancel out implicitly relies on
the timestamp returned by getbinuptime() being updated at times
which are totally independent of the I/O traffic whose latency you
are trying to measure.

That is not the case.  The interrupt which updates getbinuptime()'s
cached timestamp is affected a lot by the I/O traffic, for the various
reasons I mention above.
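
To make the trade-off concrete, here is a minimal kernel-side sketch;
getbinuptime() and binuptime() are the real interfaces, the wrapper
function around them is made up for illustration:

#include <sys/param.h>
#include <sys/time.h>

/*
 * Made-up wrapper: two ways to timestamp the start of a transaction.
 * getbinuptime() copies a cached timestamp that only advances when
 * the timecounter interrupt runs tc_windup(), so it is cheap but
 * tick-granular, and that interrupt is itself perturbed by the I/O
 * load being measured.  binuptime() reads the timecounter hardware
 * on every call: precise, but more expensive.
 */
static void
transaction_start(struct bintime *stamp, int cheap)
{
	if (cheap)
		getbinuptime(stamp);	/* cached, ~1/hz resolution */
	else
		binuptime(stamp);	/* reads the hardware */
}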

>Sure, getbinuptime() won't allow answering how many requests completed 
>within 0.5ms, but the present API doesn't allow calculating that anyway, 
>providing only total/average times. And why "_5-10_ timecounter interrupts"?

A: Yes, it actually does: a userland application running on a dedicated
CPU core can poll the shared-memory devstat structure at a very high
rate and get very useful information about short latencies.

Most people don't do that, because they don't care about the difference
between 0.5 and 0.45 milliseconds.
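
A rough sketch of such a poller, built on the stock libdevstat(3)
calls (devstat_getdevs(), devstat_compute_statistics()); the program
itself is made up, and the single-device selection and the ~2 kHz
polling rate are arbitrary choices:

/*
 * Rough sketch of a high-rate devstat poller.  Samples the first
 * device only and prints the average ms/transaction seen between
 * consecutive snapshots.
 */
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/devicestat.h>
#include <devstat.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	struct statinfo cur, prev, tmp;
	long double etime, mspt;

	memset(&cur, 0, sizeof(cur));
	memset(&prev, 0, sizeof(prev));
	if ((cur.dinfo = calloc(1, sizeof(struct devinfo))) == NULL ||
	    (prev.dinfo = calloc(1, sizeof(struct devinfo))) == NULL)
		err(1, "calloc");
	if (devstat_checkversion(NULL) == -1)
		errx(1, "%s", devstat_errbuf);

	for (;;) {
		tmp = prev;		/* rotate snapshots */
		prev = cur;
		cur = tmp;
		if (devstat_getdevs(NULL, &cur) == -1)
			errx(1, "%s", devstat_errbuf);
		if (prev.dinfo->numdevs > 0 && cur.dinfo->numdevs > 0) {
			etime = cur.snap_time - prev.snap_time;
			if (devstat_compute_statistics(
			    &cur.dinfo->devices[0], &prev.dinfo->devices[0],
			    etime, DSM_MS_PER_TRANSACTION, &mspt,
			    DSM_NONE) != 0)
				errx(1, "%s", devstat_errbuf);
			printf("%.3Lf ms/transaction\n", mspt);
		}
		usleep(500);		/* ~2 kHz polling */
	}
}

Link it against libdevstat and pin it to an otherwise idle core.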

B: To get the systematic bias down to 10-20% of the measured interval:
the cached timestamp only advances once per timecounter interrupt, so
an interval has to span 5-10 of them before the +/-1 tick quantization
error shrinks to 10-20%.

>> 	Latency distribution:
>>
>> 		<5msec:		92.12 %
>> 		<10msec:	 0.17 %
>> 		<20msec:	 1.34 %
>> 		<50msec:	 6.37 %
>> 		>50msec:	 0.00 %
>>
>I agree that such functionality could be interesting. The only worry is 
>which buckets should be there. For modern HDDs the above buckets could be 
>fine. For a high-end SSD it may be about microseconds rather than 
>milliseconds. I doubt that 5 buckets will be universal enough, unless 
>they are separated by a factor of 5-10.

Remember what people use this for:  Answering the question "Does my
disk subsystem suck, and if so, how much?"

Buckets like the ones proposed will tell you that.
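
For what it is worth, a minimal sketch of such a fixed-bucket
classifier, using the limits from the example above; the struct and
function names are made up, nothing like this exists in devstat today:

#include <sys/param.h>
#include <sys/time.h>

/*
 * Made-up illustration: a fixed-bucket latency histogram with the
 * limits <5, <10, <20, <50 and >=50 msec.
 */
#define	LAT_NBUCKETS	5

struct lat_hist {
	uint64_t	lh_buckets[LAT_NBUCKETS];
};

static const uint64_t lat_limit_usec[LAT_NBUCKETS - 1] = {
	5000, 10000, 20000, 50000
};

static void
lat_record(struct lat_hist *lh, const struct bintime *elapsed)
{
	uint64_t usec;
	int i;

	/* bintime -> microseconds, same idiom as bintime2timeval(). */
	usec = (uint64_t)elapsed->sec * 1000000 +
	    (((uint64_t)1000000 * (uint32_t)(elapsed->frac >> 32)) >> 32);

	for (i = 0; i < LAT_NBUCKETS - 1; i++) {
		if (usec < lat_limit_usec[i]) {
			lh->lh_buckets[i]++;
			return;
		}
	}
	lh->lh_buckets[LAT_NBUCKETS - 1]++;
}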

>> The %busy crap should be killed, all it does is confuse people.
>
>I agree that it heavily lies, especially for cached writes, but at least 
>it allows making some very basic estimates.

For rotating disks:  It always lies.

For SSD: It almost always lies.

Kill it.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG         | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe    
Never attribute to malice what can adequately be explained by incompetence.

