devstat overhead VS precision

Mon Apr 15 21:31:46 UTC 2013

On 15.04.2013 23:43, Poul-Henning Kamp wrote:
> In message <516C515A.9090602 at FreeBSD.org>, Alexander Motin writes:
>
>>>> I propose to switch that
>>>> statistics from using binuptime() to getbinuptime() to solve the problem
>>>> globally.
>
>>> No objections here, but I wonder if you were able to compare the results
>>> somehow before and after the change so we have some hard numbers to show
>>> that we don't lose much by applying the change.
>>
>> I haven't tested it statistically, but I haven't noticed any visual
>> difference in gstat output with its 0.1ms displayed resolution.
>
> I have tested it statistically, back when I wrote GEOM:  It leads
> to very significant statistical bias.
>
> Just about the only thing in devstat that has any predictive power
> with respect to filesystem performance, is the latency, which measures
> how long time it takes to satisfy each I/O request.
>
> If you run gstat(8), this is the "ms/*" numbers:  milliseconds per
> this or that.
>
> The rest of what's in devstat, with the exception of the queue-length
> ("L(q)") has almost no predictive power, and is IMO, practically
> pointless.  In particular the %busy is totally misleading and I
> deeply regret that I didn't fight to kill it back then.
>
> If you switch to getbinuptime(), the latency measurements will only
> be precise if the I/O operations take much longer than the timecounter
> update period, which is not guaranteed to be 1000 Hz btw.
>
> For measuring how much USB-sticks suck, that will work fine.
>
> For tuning anything on a non-ridiculous SSD device or modern
> harddisks, it will be useless because of the bias you introduce is
> *not* one which averages out over many operations.

Could you please explain why? Unless disk I/O is somehow aliased to
hardclock(), each request should get a random error between 0 and
max(1ms, 1s/HZ). With a large number of I/Os, that error should average
out when calculating the mean time. I am not talking about microseconds,
but I think a fraction of a millisecond should be realistic to get.
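
The averaging argument can be checked with a small simulation. This is
only a sketch under a stated assumption: each request starts at a
uniformly random phase within the tick, and the clock reports the time of
the most recent tick (a simplified model of a cached timestamp such as
getbinuptime() returns). The names quantized_now and
mean_measured_latency and the 1000 us tick are made up for illustration.
It does not model any correlation between I/O traffic and the tick.

```python
import random

TICK_US = 1000.0  # assumed tick period (1 kHz); not guaranteed on a real system


def quantized_now(t_us, tick_us=TICK_US):
    """Model a tick-updated clock: report the time of the most recent tick."""
    return (t_us // tick_us) * tick_us


def mean_measured_latency(true_latency_us, n, tick_us=TICK_US, seed=0):
    """Average the latency seen through the quantized clock over n requests.

    Assumption: each request starts at a uniformly random phase in the tick.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        start = rng.uniform(0.0, tick_us)
        end = start + true_latency_us
        total += quantized_now(end, tick_us) - quantized_now(start, tick_us)
    return total / n


if __name__ == "__main__":
    for latency in (100.0, 300.0, 2500.0):
        print(latency, mean_measured_latency(latency, 200_000))
```

Under the uniform-phase assumption each single measurement is off by up
to one tick, but the mean converges to the true latency. Whether that
assumption holds on a busy system is the open question.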

> The fundamental problem is that on a busy system, getbinuptime()
> does not get called at random times, it will be heavily affected
> by the I/O traffic, because of the interrupts, the bus-traffic
> itself, the cache-effects of I/O transfers and the context-switches
> by the processes causing the I/O.

I'm sorry, but I am not sure I understand the above paragraph. Do you
mean that in some realistic conditions (not counting entering the
debugger with interrupts disabled, etc.) hardclock() can be delayed by
more than some significant percentage of its period, and that this
depends on the I/O traffic itself? Or do you mean that disk I/Os are
somehow aliased with hardclock(), making it impossible to hide the error
by averaging?
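
To make the aliasing case concrete, here is a minimal sketch of what
would happen if requests always started at the same phase relative to the
tick. The fixed phase is a deliberately extreme assumption, not a
measurement of any real system, and measured_latency and
mean_with_fixed_phase are hypothetical names.

```python
TICK_US = 1000.0  # assumed tick period


def measured_latency(start_phase_us, true_latency_us, tick_us=TICK_US):
    """Latency seen through a clock that only advances once per tick."""
    start = start_phase_us
    end = start + true_latency_us
    return (end // tick_us - start // tick_us) * tick_us


def mean_with_fixed_phase(start_phase_us, true_latency_us, n):
    """Mean measured latency when every request starts at the same phase."""
    total = sum(measured_latency(start_phase_us, true_latency_us)
                for _ in range(n))
    return total / n


if __name__ == "__main__":
    # Same 300 us true latency, two different fixed start phases.
    print(mean_with_fixed_phase(50.0, 300.0, 1000))
    print(mean_with_fixed_phase(900.0, 300.0, 1000))
```

With a fixed phase every sample carries the same error, so averaging more
samples does not bring the mean any closer to the true latency.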

> So yes, you can switch to getbinuptime(), but the only statistical
> responsible way to do so, would be to supress latency measurements
> on all I/O operations which complete in less than 5-10 timecounter
> interrupts.

Sure, getbinuptime() won't let us answer how many requests completed
within 0.5ms, but the present API doesn't allow calculating that anyway,
since it provides only total/average times. And why "_5-10_ timecounter
interrupts"?

> Apart from some practical issues implementing it, the numbers
> that came out would be pretty useless.
>
> The right idea is probably to bucketize the latencies, so that
> rather than having to keep track of devstat in real time to find
> out, you could get a histogram at any time showing past
> performance something like:
>
> 	Latency distribution:
>
> 		<5msec:		92.12 %
> 		<10msec:	 0.17 %
> 		<20msec:	 1.34 %
> 		<50msec:	 6.37 %
> 		>50msec:	 0.00 %
>
> Doing that with getbinuptime() would be statistically defensible
> provided the top bucket is "<5msec" and it would very clearly tell
> people if they have I/O trouble or not, which IMO is what people
> want to know.
>
> The cost 20 64bit counters in struct devstat (N|R|W|E)*5*8 = 160
> bytes, but since devstat is already 288 bytes, that isn't a major
> catastropy.

I agree that such functionality could be interesting. My only worry is
which buckets should be there. For modern HDDs the above buckets could be
fine. For high-end SSDs the scale may be microseconds rather than
milliseconds. I doubt that 5 buckets will be universal enough, unless
they are separated by a factor of 5-10.
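
As a rough illustration of buckets separated by a constant factor, here
is a sketch with five buckets a factor of 10 apart. The bounds, labels
and function names are hypothetical choices for this example and are not
part of devstat.

```python
from bisect import bisect_right

# Hypothetical upper bounds in microseconds, each a factor of 10 apart.
BOUNDS_US = [100, 1_000, 10_000, 100_000]
LABELS = ["<100us", "<1ms", "<10ms", "<100ms", ">=100ms"]


def bucket_index(latency_us):
    """Index of the first bucket whose upper bound exceeds the latency."""
    return bisect_right(BOUNDS_US, latency_us)


def histogram(latencies_us):
    """Count how many latencies fall into each bucket."""
    counts = [0] * len(LABELS)
    for latency in latencies_us:
        counts[bucket_index(latency)] += 1
    return counts


def format_histogram(counts):
    """Render the counts as percentages, one bucket per line."""
    total = sum(counts) or 1  # avoid division by zero on an empty histogram
    return "\n".join(
        f"{label:>8}: {100.0 * count / total:6.2f} %"
        for label, count in zip(LABELS, counts)
    )


if __name__ == "__main__":
    samples_us = [50, 150, 150, 5_000, 250_000]  # made-up sample latencies
    print(format_histogram(histogram(samples_us)))
```

Five buckets at a factor of 10 span 100 us to 100 ms, which covers both
ends only coarsely. A factor of 5 would give finer resolution over a
narrower range.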

> The ability to measure latency precisly should be retained, but it
> could be made a sysctl enabled debugging facility.
>
> The %busy crap should be killed, all it does is confuse people.

I agree that it lies heavily, especially for cached writes, but at least
it allows making some very basic estimates. The value has a valid
explanation, and the only problem is that users misinterpret it.

-- 
Alexander Motin