Increasing MAXPHYS

Sun Mar 21 16:13:07 UTC 2010

On Mar 21, 2010, at 8:05 AM, Alexander Motin wrote:

> Ivan Voras wrote:
>> Julian Elischer wrote:
>>> You can get better throughput by using TSC for timing because the geom
>>> and devstat code does a bit of timing.. Geom can be told to turn off
>>> its timing but devstat can't. The 170 ktps is with TSC as timer,
>>> and geom timing turned off.
>> 
>> I see. I just ran randomio on a gzero device and with 10 userland
>> threads (this is a slow 2xquad machine) I get g_up and g_down saturated
>> fast with ~~ 120 ktps. Randomio uses gettimeofday() for measurements.
> 
> I've just got 140Ktps from two real Intel X25-M SSDs on ICH10R AHCI
> controller and single Core2Quad CPU. So at least on synthetic tests it
> is potentially reachable even with casual hardware, while it completely
> saturated quad-core CPU.
> 
>> Hmm, it looks like it could be easy to spawn more g_* threads (and,
>> barring specific class behaviour, it has a fair chance of working out of
>> the box) but the incoming queue will need to also be broken up for
>> greater effect.
> 
> According to "notes", looks there is a good chance to obtain races, as
> some places expect only one up and one down thread.
> 

I agree that adding more threads just creates many more race complications.  Even if it didn't, the storage driver is a serialization point; it doesn't matter if you have a dozen g_* threads if only one of them can be in the top half of the driver at a time.  No amount of fine-grained locking is going to help with this.

I'd like to go in the opposite direction.  The queue-dispatch-queue model of GEOM is elegant and easy to extend, but very wasteful for the simple case, where the simple case is one or two simple partition transforms (mbr, bsdlabel) and/or a simple stripe/mirror transform.  None of these need a dedicated dispatch context in order to operate.  What I'd like to explore is compiling the GEOM stack at creation time into a linear array of operations that run without a g_down/g_up context switch.  As providers and consumers taste each other and build a stack, that stack gets compiled into a graph, and that graph gets executed directly from the calling context, both from the dev_strategy() side on the top and the bio_done() side on the bottom.  GEOM classes that need a detached context can mark themselves as such.  Doing so will prevent a graph from being created, and the current dispatch model will be retained for that stack.
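
To make the idea above concrete, here is a minimal userland sketch in C.  It is not GEOM code: struct io_req, struct layer, part_xform(), stack_compile() and stack_run_down() are all hypothetical names invented for illustration, and only the downward (request) path is shown.  The assumption is that each simple class can be reduced to a pure function on the request, so a stack of them becomes an array that is walked in the caller's context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a bio: an offset and a length. */
struct io_req {
	uint64_t offset;	/* byte offset as seen by the current layer */
	uint64_t length;	/* byte count */
};

/* One step of the compiled stack: a transform run in the caller's context. */
typedef int (*xform_fn)(struct io_req *req, const void *arg);

/* Hypothetical description of one layer, as found when the stack is built. */
struct layer {
	xform_fn	 fn;
	const void	*arg;
	int		 needs_context;	/* nonzero: class wants its own thread */
};

#define MAX_STEPS 8

struct compiled_stack {
	struct { xform_fn fn; const void *arg; } steps[MAX_STEPS];
	size_t nsteps;
};

/* Partition-style transform: bounds check, then shift by partition start. */
struct part_arg {
	uint64_t start;		/* where the partition begins on the parent */
	uint64_t size;		/* partition size in bytes */
};

int
part_xform(struct io_req *req, const void *arg)
{
	const struct part_arg *p = arg;

	if (req->offset > p->size || req->length > p->size - req->offset)
		return (-1);	/* request falls outside the partition */
	req->offset += p->start;
	return (0);
}

/*
 * Flatten the layers (top first) into a linear array.  Returns -1 if any
 * layer asked for a detached context or the stack is too deep, in which
 * case the caller would keep the existing queued dispatch model.
 */
int
stack_compile(struct compiled_stack *cs, const struct layer *layers, size_t n)
{
	cs->nsteps = 0;
	if (n > MAX_STEPS)
		return (-1);
	for (size_t i = 0; i < n; i++) {
		if (layers[i].needs_context)
			return (-1);
		cs->steps[i].fn = layers[i].fn;
		cs->steps[i].arg = layers[i].arg;
	}
	cs->nsteps = n;
	return (0);
}

/* Run every transform in order, directly from the calling context. */
int
stack_run_down(const struct compiled_stack *cs, struct io_req *req)
{
	assert(cs->nsteps <= MAX_STEPS);
	for (size_t i = 0; i < cs->nsteps; i++) {
		int error = cs->steps[i].fn(req, cs->steps[i].arg);
		if (error != 0)
			return (error);
	}
	return (0);
}
```

With a label-style layer on top of an mbr-style layer, a request goes through both offset shifts in one loop with no queueing in between.  If either layer sets needs_context, stack_compile() refuses and nothing changes for that stack.  A real implementation would also need the matching completion path and proper handling of stripe/mirror fan-out, which this sketch leaves out.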

I expect that this will reduce I/O latency by a great margin, thus directly addressing the performance problem that FusionIO makes an example of.  I'd like to also explore having the g_bio model not require a malloc at every stage in the stack/graph; even though going through UMA is fairly fast, it still represents overhead that can be eliminated.  It also represents an out-of-memory failure case that can be prevented.
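
One possible way to avoid the per-stage allocation, sketched in userland C under the assumption that the depth of the stack is known when the request is created: allocate one object with a slot per stage up front, and have each stage take the next slot instead of allocating.  The names struct io_chain, chain_alloc(), chain_push() and chain_pop() are hypothetical and not part of GEOM; malloc() stands in for whatever allocator the kernel would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Per-stage view of the request (hypothetical, much smaller than a bio). */
struct stage_state {
	uint64_t offset;
	uint64_t length;
};

/* One allocation carrying a slot for every stage in the stack. */
struct io_chain {
	size_t depth;			/* slots reserved at creation */
	size_t used;			/* slots currently in use */
	struct stage_state stage[];	/* flexible array member */
};

/* The only allocation, and so the only place that can fail for lack of memory. */
struct io_chain *
chain_alloc(size_t depth, uint64_t offset, uint64_t length)
{
	struct io_chain *c;

	if (depth == 0)
		return (NULL);
	c = malloc(sizeof(*c) + depth * sizeof(c->stage[0]));
	if (c == NULL)
		return (NULL);
	c->depth = depth;
	c->used = 1;
	c->stage[0].offset = offset;
	c->stage[0].length = length;
	return (c);
}

/*
 * Descend one stage: copy the current state into the next reserved slot
 * and hand it to the caller to modify.  No allocation happens here.
 * Returns NULL only if the stack is deeper than was reserved.
 */
struct stage_state *
chain_push(struct io_chain *c)
{
	if (c->used == c->depth)
		return (NULL);
	c->stage[c->used] = c->stage[c->used - 1];
	return (&c->stage[c->used++]);
}

/* Completion path: release the deepest slot and return the parent's state. */
struct stage_state *
chain_pop(struct io_chain *c)
{
	if (c->used <= 1)
		return (NULL);
	c->used--;
	return (&c->stage[c->used - 1]);
}
```

The point of the layout is that once chain_alloc() succeeds, walking down and back up the stack cannot hit an out-of-memory condition.  Requests that fan out to several children (stripe, mirror) would need more than a single linear chain, so this only covers the simple partition-transform case.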

I might try to work on this over the summer.  It's really a research project in my head at this point, but I'm hopeful that it'll show results.

Scott