kern/127024: Problem with unix sockets garbage collector

Mon Sep 1 15:10:02 UTC 2008

>Number:         127024
>Category:       kern
>Synopsis:       Problem with unix sockets garbage collector
>Confidential:   no
>Severity:       non-critical
>Priority:       medium
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Sep 01 15:10:00 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator:     Anton Yuzhaninov
>Release:        FreeBSD 7.0-STABLE amd64
>Organization:
Rambler
>Environment:
System: FreeBSD mx22.rambler.ru 7.0-STABLE FreeBSD 7.0-STABLE #1: Fri Jun 27 16:59:59 MSD 2008 root@mx22.rambler.ru:/usr/obj/usr/src/sys/MAIL amd64

The problem occurs on SMP boxes when unix sockets are used under high load.
In our case it is a server running the postfix MTA, where unix sockets are used for IPC.

>Description:
1. Normal operation (after reboot):

In top, the thread taskq is at about 0.00% WCPU.

sysctl net.local.inflight is almost always zero.
sysctl net.local.taskcount increases only rarely.

2. After several days of uptime, the thread taskq starts to consume a full CPU (100% WCPU):

1684 processes:26 running, 1639 sleeping, 19 waiting
CPU states:  6.7% user,  0.0% nice, 54.5% system,  1.1% interrupt, 37.7% idle
Mem: 1332M Active, 1903M Inact, 505M Wired, 118M Cache, 214M Buf, 76M Free
Swap: 2060M Total, 2060M Free

   PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
     9 root        1   8    -     0K    16K CPU1   1 536:07 100.00% thread taskq
    12 root        1 171 ki31     0K    16K RUN    0  53.5H 64.06% idle: cpu0
    11 root        1 171 ki31     0K    16K RUN    1  50.3H 14.26% idle: cpu1

sysctl net.local.inflight is always less than 0 (I see values from -1 to -4).
sysctl net.local.taskcount increases at a high rate (about 100 per second).

This looks like a race in the unix sockets code, because we cannot reproduce it on a uniprocessor box.
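
The two states described above can be told apart by sampling both sysctls
over time. What follows is only a minimal sketch, not a tool we actually ran:
the function names and the rate_threshold cutoff are hypothetical, and it
assumes that `sysctl -n` prints a bare integer for each of these two names.

```python
import subprocess
import time


def read_sysctl(name):
    """Return the integer value of a sysctl via `sysctl -n NAME`."""
    out = subprocess.run(
        ["sysctl", "-n", name], capture_output=True, text=True, check=True
    ).stdout
    return int(out.strip())


def classify(inflight, taskcount_rate, rate_threshold=50.0):
    """Label one sample. rate_threshold is an arbitrary illustrative cutoff."""
    if inflight < 0 or taskcount_rate >= rate_threshold:
        return "suspect"
    return "normal"


def monitor(interval=1.0, samples=10, read=read_sysctl, sleep=time.sleep):
    """Sample both counters; return a list of (inflight, rate, label)."""
    results = []
    prev = read("net.local.taskcount")
    for _ in range(samples):
        sleep(interval)
        cur = read("net.local.taskcount")
        inflight = read("net.local.inflight")
        # Increase of taskcount per second over the last interval.
        rate = (cur - prev) / interval
        results.append((inflight, rate, classify(inflight, rate)))
        prev = cur
    return results
```

Calling monitor() on a box in the normal state should mostly yield "normal"
samples. A negative inflight value or a taskcount rate near 100 per second,
as seen in case 2, would be labelled "suspect".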

>How-To-Repeat:
Run the postfix MTA on a heavily loaded mail server (> 100 connections per second) with 6-stable or 7-stable (SMP).
The problem should occur after several days (or weeks) of uptime.
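
For a smaller test case than a production mail server, one option might be
to exercise descriptor passing over a unix socket directly. This is a sketch
under assumptions, not a confirmed reproducer: it assumes that passing file
descriptors (SCM_RIGHTS) is the activity that net.local.inflight counts, and
the pass_descriptors helper is hypothetical. It needs Python 3.9 or later for
socket.send_fds and socket.recv_fds.

```python
import os
import socket
import threading


def pass_descriptors(rounds=200):
    """Send `rounds` descriptors over a unix socketpair; return how many arrived."""
    left, right = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    received = 0

    def receiver():
        nonlocal received
        for _ in range(rounds):
            # Read one data byte and at most one passed descriptor.
            _data, fds, _flags, _addr = socket.recv_fds(right, 1, 1)
            for fd in fds:
                os.close(fd)
                received += 1

    t = threading.Thread(target=receiver)
    t.start()
    for _ in range(rounds):
        r, w = os.pipe()
        # The descriptor is "in flight" from here until the receiver reads it.
        socket.send_fds(left, [b"x"], [r])
        os.close(r)
        os.close(w)
    t.join()
    left.close()
    right.close()
    return received
```

Running several such loops in parallel on an SMP box while watching
net.local.inflight and net.local.taskcount might show whether the counters
drift, but we have not verified that this triggers the problem.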

>Fix:
Not known yet.
This problem may already be fixed in 8-current, but we can't run 8-current on this hardware.
>Release-Note:
>Audit-Trail:
>Unformatted: