kernel: mps0: Out of chain frames, consider increasing hw.mps.max_chains.

Scott Long scottl at
Tue Mar 8 20:29:59 UTC 2016

> On Mar 8, 2016, at 11:02 AM, Slawa Olhovchenkov <slw at> wrote:
> On Tue, Mar 08, 2016 at 10:56:39AM -0800, Scott Long wrote:
>>> On Mar 8, 2016, at 10:48 AM, Slawa Olhovchenkov <slw at> wrote:
>>> On Tue, Mar 08, 2016 at 10:34:23AM -0800, Scott Long wrote:
>>>>> On Mar 8, 2016, at 10:07 AM, Slawa Olhovchenkov <slw at> wrote:
>>>>> On Mon, Mar 07, 2016 at 02:10:12PM +0300, Slawa Olhovchenkov wrote:
>>>>>>>>>> This allocated one for all controllers, or allocated for every controller?
>>>>>>>>> It’s per-controller.
>>>>>>>>> I’ve thought about making the tuning be dynamic at runtime.  I
>>>>>>>>> implemented similar dynamic tuning for other drivers, but it seemed
>>>>>>>>> overly complex for low benefit.  Implementing it for this driver
>>>>>>>>> would be possible but require some significant code changes.
>>>>>>>> What is the cause of chain_free + io_cmds_active << max_chains?
>>>>>>>> Can one command use many chains?
>>>>>>> Yes.  A request uses an active command, and depending on the size of the I/O,
>>>>>>> it might use several chain frames.
>>>>> I have been playing with max_chains and see a significant cost to handling
>>>>> large max_chains values: with 8192 the system responded badly versus 2048.
>>>>> Now trying 3192; the response is like with 2048.
>>>> Hi, I’m not sure I understand what you’re saying.  You said that you tried 8192, but the system still complained of being out of chain frames?  Now you are trying fewer, only 3192?
>>> With 8192 the system did not complain of being out of chain frames, but it
>>> seems to need more CPU power to handle this chain list -- the traffic graph
>>> (this host serves HTTP via nginx) has a lot of "jerking"; with 3192 the
>>> traffic graph is much smoother.
>> Hi,
>> The CPU overhead of doing more chain frames is nil.  They are just
>> objects in a list, and processing the list is O(1), not O(n).  What
>> you are likely seeing is other problems: the VM and VFS-BIO systems
>> struggling to deal with the amount of I/O that you are doing.
>> Depending on what kind of I/O you are doing (buffered filesystem
>> reads/writes, memory-mapped I/O, unbuffered I/O) there are limits
>> and high/low water marks on how much I/O can be outstanding, and
>> when the limits are reached processes are put to sleep and then race
>> back in when they are woken up.  This causes poor, oscillating
>> system behavior.  There’s some tuning you can do to increase the
>> limits, but yes, it’s a problem that behaves poorly in an untuned
>> system.
> Sorry, I don't understand your point: how can large numbers of unused chain
> frames consume CPU power?

A ‘chain frame’ is 128 bytes.  By jumping from 2048 to 8192 chain frames allocated, you’ve jumped from 256KB to 1MB of allocated memory.  This sounds like a lot, but if you’re doing enough I/O to saturate the tunings then you likely have many GB of RAM.  The 1MB of memory consumed is going to be well less than 1% of what you have, and likely 0.1% to 0.01%.  So it’s unlikely that the VM is having to work much harder to deal with the missing memory.  As for the chain frames themselves, they are stored on a linked list, and that list is never walked from head to tail.  The driver adds to the head and removes from the head, so there is no cost associated with the length of the list.

For comparison, we use 4 ‘mps’ controllers in our servers at Netflix, and run 20Gbps (2.5GB/s) through them.  We’ve done extensive profiling and tuning of the kernel, and we’ve never measured a change in cost for having different chain frame lengths, other than the difficulties that come from having too few.  The problems exist in the VM and VFS-BIO interfaces being poorly tuned for modern workloads.
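For reference, the tunable named in the original error message can be set at boot in /boot/loader.conf; the value below is only an example to experiment with, not a recommendation:

```
hw.mps.max_chains="4096"
```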


More information about the freebsd-stable mailing list