FreeBSD 9.1 - openldap slapd lockups, mutex problems

Kai Gallasch gallasch at free.de
Tue Jan 22 10:19:32 UTC 2013


Hi.

(I am sending this to the "stable" list, because it may be kernel related.)

On 9.1-RELEASE I am witnessing lockups of the openldap slapd daemon.

The slapd process runs for some days and then hangs, consuming large amounts of CPU.
In this state slapd can only be stopped with SIGKILL and then restarted.

 # procstat -kk 71195
  PID    TID COMM             TDNAME           KSTACK                       
71195 149271 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d do_wait+0x678 __umtx_op_wait+0x68 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 194998 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _cv_wait_sig+0x12e seltdwait+0x110 kern_select+0x6ef sys_select+0x5d amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 195544 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 196183 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_timedwait_sig+0x19 _sleep+0x2d4 userret+0x9e doreti_ast+0x1f 
71195 197966 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 198446 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 198453 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 198563 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 199520 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 200038 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 200670 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 200674 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 200675 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 201179 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 201180 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 201181 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 201183 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7 
71195 201189 slapd            -                mi_switch+0x186 sleepq_catch_signals+0x2cc sleepq_wait_sig+0x16 _sleep+0x29d _do_lock_umutex+0x5e8 do_lock_umutex+0x17c __umtx_op_wait_umutex+0x63 amd64_syscall+0x546 Xfast_syscall+0xf7
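
Most of the threads are sleeping in _do_lock_umutex, i.e. waiting on a userland pthread mutex. A next step would probably be to attach a debugger and dump the userland backtraces to see which mutex they are blocked on (assuming debug symbols are available; the binary path is just where the port installs slapd here):

 # gdb /usr/local/libexec/slapd 71195
 (gdb) thread apply all bt
 (gdb) detach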

When I try to stop slapd through the rc script, I can see in the logs that the process waits indefinitely for a thread to terminate.
Other multithreaded server processes (apache-worker, mysqld, bind, etc.) run on the same server without problems.
On UFS2 slapd runs fine and does not show the problem.


Things I have already tried to stop the lockups:

- running openldap-server23 and openldap24, both with different BDB backend versions
- tuning the BDB init file (DB_CONFIG)
- reducing the number of threads used by slapd in slapd.conf (rough examples of both below)
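
For reference, the settings I experimented with were roughly of this form (the values here are only illustrative, not tuned recommendations):

 # DB_CONFIG in the BDB database directory
 set_cachesize   0 268435456 1
 set_lg_bsize    2097152

 # slapd.conf
 threads 8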

What I haven't tried yet:

Mounting a UFS filesystem (e.g. on a ZFS zvol) into the jail, so that BDB stores its data on UFS - roughly as sketched below. (I don't like the idea.)
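
Something like this (pool name, size and jail path are made up for the example):

 # zfs create -V 8G tank/ldap-ufs
 # newfs -U /dev/zvol/tank/ldap-ufs
 # mount /dev/zvol/tank/ldap-ufs /jails/ldap1/var/db/openldap-data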


Environment:

- FreeBSD 9.1-RELEASE amd64 multi-jail server with the CPU resource limit patch [1], which didn't make it into 9.1-RELEASE
- filesystem: ZFS only, swap on ZFS
- active jail resource limits through rctl.conf (memory, max processes, open files; example rules below)
- a handful of openldap-server jails that all show the same slapd lockup tendency
- slapd started through daemontools (supervise)
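
The jail limits are rules of the following form (jail name and values are only placeholders):

 jail:ldap1:memoryuse:deny=2g
 jail:ldap1:maxproc:deny=300
 jail:ldap1:openfiles:deny=8192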

Some ideas:
- openldap-server with the BDB backend stores its data in sparse files - here on top of ZFS (a quick check is shown below).
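
One way to check this is to compare the apparent file size with the allocated blocks of the database files (the path is just where this jail keeps its data):

 # ls -ls /jails/ldap1/var/db/openldap-data/*.bdb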

Has anyone else running openldap-server on FreeBSD 9.1 inside a jail seen similar problems?
How can I debug this further?

Any hints appreciated :-)

Regards.


[1] https://wiki.freebsd.org/JailResourceLimits

