Right-sizing the geli thread pool
Josh Paetzel
jpaetzel at FreeBSD.org
Fri Jul 10 03:34:34 UTC 2020
>
> On Jul 9, 2020, at 4:27 PM, Alan Somers <asomers at freebsd.org> wrote:
>
> Currently, geli creates a separate thread pool for each provider, and by
> default each thread pool contains one thread per cpu. On a large server
> with many encrypted disks, that can balloon into a very large number of
> threads! I have a patch in progress that switches from per-provider thread
> pools to a single thread pool for the entire module. Happily, I see read
> IOPs increase by up to 60%. But to my surprise, write IOPs _decreases_ by
> up to 25%. dtrace suggests that the CPU usage is dominated by the
> vmem_free call in biodone, as in the below stack.
>
> kernel`lock_delay+0x32
> kernel`biodone+0x88
> kernel`g_io_deliver+0x214
> geom_eli.ko`g_eli_write_done+0xf6
> kernel`g_io_deliver+0x214
> kernel`md_kthread+0x275
> kernel`fork_exit+0x7e
> kernel`0xffffffff8104784e
>
> I only have one idea for how to improve things from here. The geli thread
> pool is still fed by a single global bio queue. That could cause cache
> thrashing, if bios get moved between cores too often. I think a superior
> design would be to use a separate bio queue for each geli thread, and use
> work-stealing to balance them. However,
>
> 1) That doesn't explain why this change benefits reads more than writes, and
> 2) work-stealing is hard to get right, and I can't find any examples in the
> kernel.
>
> Can anybody offer tips or code for implementing work stealing? Or any
> other suggestions about why my write performance is suffering? I would
> like to get this change committed, but not without resolving that issue.
>
> -Alan
Alan,
Several years ago I spent a bunch of time optimizing geli+ZFS performance.
Nothing as ambitious as what you are doing though.
I have some hand-wavy theories about the write performance and why cache thrash would be more expensive for writes than for reads. The default configuration is essentially pathological for systems with large numbers of disks, but that doesn’t really explain why your change drops write performance.

I’ll send you over some dtrace stuff I have at work tomorrow. It’s pretty sophisticated and should let you visualize the entire I/O pipeline. (You’ll have to add the geli part.)
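Roughly the shape of thing I mean, just as a sketch from memory (not the actual script; it leans on the fbt probes for g_io_request()/g_io_deliver() and assumes the usual struct bio / g_provider layout):

/* per-provider I/O latency, measured from g_io_request() to g_io_deliver() */
fbt::g_io_request:entry
{
	ts[arg0] = timestamp;
}

fbt::g_io_deliver:entry
/ts[arg0] != 0/
{
	this->bp = (struct bio *)arg0;
	/* key by provider name and bio_cmd (0x01 = BIO_READ, 0x02 = BIO_WRITE) */
	@lat[stringof(this->bp->bio_to->name), this->bp->bio_cmd] =
	    quantize(timestamp - ts[arg0]);
	ts[arg0] = 0;
}

Run it with dtrace -s and each provider gets its own latency histogram, split by read vs write, so you can see where in the stack the writes start paying extra.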
What I discovered is that without a histogram-based auto-tuner it was not possible to tune for optimal performance under dynamic workloads.
As to your question about work stealing, I’ve got nothing there.
Thanks,
Josh Paetzel
FreeBSD - The Power to Serve