thread taskq / unp_gc() using 100% cpu and stalling unix socket IPC

Mon Nov 19 17:04:06 UTC 2012

On 14.11.2012, at 11:50, Markus Gebert <markus.gebert at hostpoint.ch> wrote:

> On 14.11.2012, at 08:21, Konstantin Belousov <kostikbel at gmail.com> wrote:
> 
>> On Wed, Nov 14, 2012 at 01:41:04AM +0100, Markus Gebert wrote:
>>> 
>>> On 13.11.2012, at 19:30, Markus Gebert <markus.gebert at hostpoint.ch> wrote:
>>> 
>>>> To me it looks like the unix socket GC is triggered way too often and/or runs too long, which uses CPU and, worse, causes a lot of contention on the unp_list_lock, which in turn delays all processes relying on unix sockets for IPC.
>>>> 
>>>> I don't know why unp_gc() is called so often or what is triggering it.
>>> 
>>> I have a guess now. Dovecot and relayd both use unix sockets heavily. According to dtrace, uipc_detach() gets called quite often by dovecot closing unix sockets. Each time uipc_detach() is called, unp_gc_task is taskqueue_enqueue()d if fds are in flight.
>>> 
>>> in uipc_detach():
>>> 	if (local_unp_rights)
>>> 		taskqueue_enqueue(taskqueue_thread, &unp_gc_task);
>>> 
>>> We use relayd in a way that keeps the source address of the client when connecting to the backend server (transparent load balancing). This requires IP_BINDANY on the socket which cannot be set by unprivileged processes, so relayd sends the socket fd to the parent process just to set the socket option and send it back. This means an fd gets transferred twice for every new backend connection.
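
For anyone not familiar with fd passing, here is a minimal user-space sketch of the mechanism, assuming a plain socketpair() instead of relayd's real IPC channel. The helper names send_fd(), recv_fd() and demo_roundtrip() are made up for illustration and are not relayd code. Between sendmsg() and recvmsg() the descriptor counts as "in flight", which is the state the unix socket GC has to account for.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer sized and aligned for exactly one descriptor. */
union fd_ctl {
	struct cmsghdr hdr;
	char buf[CMSG_SPACE(sizeof(int))];
};

/* Hypothetical helper: send one descriptor as SCM_RIGHTS ancillary data. */
static int
send_fd(int sock, int fd)
{
	char dummy = 'x';	/* one byte of ordinary data carries the control message */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	assert(fd >= 0);
	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return (sendmsg(sock, &msg, 0) == 1 ? 0 : -1);
}

/* Hypothetical helper: receive one descriptor, or -1 on failure. */
static int
recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_ctl ctl;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&ctl, 0, sizeof(ctl));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = ctl.buf;
	msg.msg_controllen = sizeof(ctl.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return (-1);
	cmsg = CMSG_FIRSTHDR(&msg);
	if (cmsg == NULL || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return (-1);
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return (fd);
}

/* Pass the write end of a pipe across a socketpair and use the copy. */
static int
demo_roundtrip(void)
{
	int sp[2], p[2], got;
	char c = 0;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sp) == -1 || pipe(p) == -1)
		return (-1);
	if (send_fd(sp[0], p[1]) == -1)
		return (-1);
	close(p[1]);	/* the write end is now referenced only by the in-flight message */
	got = recv_fd(sp[1]);
	/* A privileged receiver would set its socket option on 'got' here. */
	if (got == -1 || write(got, "k", 1) != 1 || read(p[0], &c, 1) != 1)
		return (-1);
	close(got);
	close(p[0]);
	close(sp[0]);
	close(sp[1]);
	return (c == 'k' ? 0 : -1);
}
```

In the relayd case the same round trip happens twice per backend connection: once to hand the socket to the parent, and once to get it back after the option has been set.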
>>> 
>>> So we have dovecot calling uipc_detach() often and relayd making it likely that fds are in flight (unp_rights > 0). With a certain amount of load this could cause unp_gc_task to be added to the thread taskq too often, slowing down everything related to unix sockets because unp_gc() holds global locks.
>>> 
>>> I don't know if the slowdown can even cause a self-reinforcing feedback loop at some point by increasing the chance of fds being in flight. This would explain why sometimes the condition goes away by itself and sometimes requires intervention (taking load away for a moment).
>>> 
>>> I'll look into a way to (dis)prove all this tomorrow. Ideas still welcome :-).
>>> 
>> 
>> If the only issue is indeed overly aggressive scheduling of the taskqueue,
>> then postponing the task until the next tick could fix it. The patch below
>> tries to schedule the gc task for the next tick if it is not yet
>> scheduled. Could you try it?
> 
> Sounds like a good idea, thanks! I'm testing the patch right now. It could take a few days to know it works for sure. I'll get back to you soon.

We haven't had any problems since I booted the patched kernel. So the assumption that the gc gets scheduled too often in that situation seems correct.
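
To illustrate why deferring helps, here is a rough user-space model of the idea as I understand it. It is not the patch itself; struct gc_sched, gc_request(), gc_tick() and simulate() are names invented for this sketch, and the "GC pass" is just a counter. The point is only that any number of requests arriving within one tick collapse into a single run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model state, not a kernel structure. */
struct gc_sched {
	bool pending;		/* a GC run is already scheduled for the next tick */
	unsigned runs;		/* number of GC passes actually executed */
};

/*
 * Stand-in for the call made on every detach while fds are in flight.
 * It only schedules a run if none is scheduled yet, so it stays cheap.
 */
static void
gc_request(struct gc_sched *s)
{
	if (!s->pending)
		s->pending = true;
}

/* Stand-in for the timer tick: runs the GC at most once per tick. */
static void
gc_tick(struct gc_sched *s)
{
	if (!s->pending)
		return;
	s->pending = false;
	s->runs++;		/* placeholder for the expensive GC pass */
}

/* Issue 'detaches_per_tick' requests before each of 'ticks' ticks. */
static unsigned
simulate(unsigned ticks, unsigned detaches_per_tick)
{
	struct gc_sched s = { false, 0 };
	unsigned t, d;

	for (t = 0; t < ticks; t++) {
		for (d = 0; d < detaches_per_tick; d++)
			gc_request(&s);
		gc_tick(&s);
	}
	return (s.runs);
}
```

In this model, simulate(10, 1000) performs 10 GC passes for 10000 requests, so the cost is bounded by the tick rate rather than by how often sockets are closed.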

I realize we're creating an edge case with relayd passing around so many fds. On the other hand, I think the patch makes the unix socket code more robust without hurting anyone. So do you see any chance of getting it committed?

Markus