4.9p1 deadlock on "inode"

Peter Jeremy peter.jeremy at alcatel.com.au
Mon Dec 22 18:42:14 PST 2003

This morning I found one of my systems would not let me login or issue
commands but still seemed to be running.  ddb showed that lots of
processes were waiting on "inode".  I forced a crash dump and found
166 processes total, 95 waiting on inode and 94 on the same wchan:

(kgdb) p *(struct lock *)0xc133eb00
$9 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440, lk_sharecount = 0, 
  lk_waitcount = 94, lk_exclusivecount = 1, lk_prio = 8, 
  lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101, lk_lockholder = 304}

The lockholder (pid 304) is a cron process, itself waiting on "inode"
on a different lock:
(kgdb) p *(struct lock *)0xc1901a00
$10 = {lk_interlock = {lock_data = 0}, lk_flags = 0x200440, lk_sharecount = 0, 
  lk_waitcount = 1, lk_exclusivecount = 1, lk_prio = 8, 
  lk_wmesg = 0xc02b0a8a "inode", lk_timo = 101, lk_lockholder = 15123}

Pid 15123 is another cron process waiting on "vlruwk" because there are
too many vnodes in use:
(kgdb) p numvnodes
$12 = 8904
(kgdb) p freevnodes
$13 = 24
(kgdb) p desiredvnodes
$14 = 8879

Process vnlru is waiting on "vlrup" with vnlru_nowhere = 18209.

Looking through the mountlist, mnt_nvnodelistsize was sane on all
filesystems except one (/mnt), where it was 8613 (97% of all vnodes).
Only one process was actively using files in /mnt, though some other
processes may have been using it for $PWD or similar.  This process
was scanning most of the files in /mnt (about 750,000) checking for
files with identical content - basically all files that could
potentially be the same (eg same length) are mmap'd and compared.
This process had 2816 entries in its vm_map.  (It has just occurred to
me that one set of data appears in a large number of files (~30000),
but I would have expected that to produce an error from mmap(), not a
deadlock.)

Scanning through the mnt_nvnodelist on /mnt:
5797 entries were for directories with entries in v_cache_src
2804 entries were for files with a usecount > 0
  11 entries were for directories with VFREE|VDOOMED|VXLOCK
   1 VNON entry

This means that none of the vnodes in /mnt were available for
recycling (and the total vnodes on the other filesystems would not be
enough to reach the hysteresis point to unlock the vnode allocation).
I can understand that an mmap'd file holds a usecount on the file's
vnode, but my understanding is that vnodes with v_cache_src entries
should be recyclable (though recycling them will slow down namei()).
If so, should vnlru grow a "try harder" loop that recycles these
vnodes when it would otherwise wedge?

I notice vlrureclaim() contains the comment "don't set kern.maxvnodes
too low".  In this case, it is auto-tuned based on 128MB RAM and
"maxusers=0".  Maybe this is too low for my purposes but it would be
much nicer if the system managed to handle this situation gracefully
rather than by deadlocking.

And finally, a question on vlrureclaim():  Why does it scan through
mnt_nvnodelist performing a TAILQ_REMOVE() and TAILQ_INSERT_TAIL() on
each vnode?  Wouldn't it be cheaper to just walk the list, rather than
moving every node to the tail?

