Kernel hang on 6.x
Brad L. Chisholm
blc at bsdwins.com
Wed Jan 10 22:21:41 UTC 2007
On Thu, 4 Jan 2007 12:53:47 -0500, John Baldwin wrote:
> On Thursday 04 January 2007 10:27, Brian Dean wrote:
> >
> > I believe that I can generate a kernel dump. We tried this yesterday
> > but didn't have a dump device configured.
>
> If this is 6.x, turn on minidumps via the sysctl. The dump size normally is
> the size of RAM. With minidumps it can be a lot smaller. If you get a dump,
> let me know and I'll point you at some gdb scripts to generate 'ps' type
> output, etc.
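[For reference, the sysctl John refers to is presumably debug.minidump; here is a minimal sketch of setting up crash dumps on 6.x. The sysctl name and the dump device are assumptions -- adjust for your hardware:]

```shell
# Turn on minidumps (assumed sysctl name: debug.minidump) -- the dump then
# contains only kernel-mapped pages rather than all of physical RAM:
sysctl debug.minidump=1

# Point the kernel at a dump device (normally a swap partition;
# /dev/da0s1b is just an example):
dumpon /dev/da0s1b

# Or persistently, via /etc/rc.conf:
#   dumpdev="/dev/da0s1b"

# After the panic, savecore(8) recovers the dump from the swap
# partition into /var/crash on the next boot:
savecore /var/crash /dev/da0s1b
```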
>
I work with Brian, and have been helping him analyze this problem. We have
been able to generate kernel dumps, and have also done some additional
analysis under ddb. Here is a summary of our analysis so far. Suggestions
as to how to proceed from here are most welcome.

Summary
-------
There appears to be a VM deadlock, not due to any actual locks or mutexes
being held, but due to an inability to satisfy page faults.
In our scenario, one huge process is essentially consuming most of the
memory on the machine. Depending upon the physical memory configured,
that process will consume 1.5-3 times physical memory. When the machine
hangs, however, there is still plenty of swap space available. One
possibly interesting observation is that the hang always seems to occur
when ~14G of swap has been consumed, regardless of the size of physmem
(6G, 8G, 16G) or amount of swap (32G, 64G) configured.
Dropping to ddb on the serial console and running "show locks",
"show alllocks", and "show lockedvnods" usually indicates that only an
sio lock is held (probably for the serial console itself). A "ps" shows
the pagedaemon in the VMWait state, while a number of other processes
(including the huge memory process) are in the pfault state.
We have been able to generate a full core dump, as well as a couple of
minidumps with differing hw.physmem settings. We can make these dumps
available, if that would help, or we can provide information from gdb
on request. We can usually reproduce the hang within 1.5-2 hours, and
can provide additional information from ddb, as well.

Details
-------
Here is an example of a 'top' session that was running when the machine
became unresponsive. Note the one huge 'sdsbuild' process.
last pid: 3331; load averages: 0.00, 0.27, 0.44 up 0+03:22:40 17:06:05
42 processes: 1 running, 41 sleeping
CPU states: 0.0% user, 0.0% nice, 0.8% system, 0.0% interrupt, 99.2% idle
Mem: 14G Active, 1115M Inact, 535M Wired, 39M Cache, 214M Buf, 21M Free
Swap: 64G Total, 14G Used, 50G Free, 21% Inuse
PID USERNAME THR PRI NICE SIZE RES STATE TIME WCPU COMMAND
2698 blc 1 114 0 29579M 14635M pfault 30:40 0.00% sdsbuild
3159 blc 1 5 0 7492K 856K ttyin 0:23 0.00% systat
3158 root 1 96 0 6984K 744K RUN 0:04 0.00% top
2096 root 1 96 0 3656K 104K select 0:02 0.00% rlogind
2699 blc 1 -8 0 2492K 212K piperd 0:01 0.00% tee
1092 root 1 96 0 2692K 168K select 0:01 0.00% cvd
773 root 1 96 0 7712K 600K select 0:01 0.00% ntpd
736 root 1 97 0 3984K 0K select 0:00 0.00% <mountd>
2525 root 1 96 0 3656K 104K select 0:00 0.00% rlogind
1986 root 1 96 0 3656K 104K select 0:00 0.00% rlogind
707 root 1 96 0 9760K 352K select 0:00 0.00% amd
817 root 1 96 0 9340K 564K select 0:00 0.00% sendmail
657 root 1 96 0 3644K 304K select 0:00 0.00% syslogd
2000 root 1 20 0 9796K 0K pause 0:00 0.00% <csh>
825 root 1 -16 0 3664K 336K vmwait 0:00 0.00% cron
2527 blc 1 20 0 960K 0K pause 0:00 0.00% <ksh>
676 root 1 96 0 4940K 352K select 0:00 0.00% rpcbind
1956 root 1 96 0 6108K 0K select 0:00 0.00% <inetd>
738 root 1 109 0 2724K 0K select 0:00 0.00% <nfsd>
1987 root 1 8 0 18408K 0K wait 0:00 0.00% <login>
2097 root 1 8 0 18408K 0K wait 0:00 0.00% <login>
2526 root 1 8 0 18408K 0K wait 0:00 0.00% <login>
2098 blc 1 20 0 960K 0K pause 0:00 0.00% <ksh>
793 root 1 96 0 2516K 152K select 0:00 0.00% usbd
1999 blc 1 8 0 18400K 0K wait 0:00 0.00% <su>
1988 blc 1 20 0 960K 0K pause 0:00 0.00% <ksh>
2682 blc 1 20 0 964K 0K pause 0:00 0.00% <ksh>
1982 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
1974 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
1981 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
1978 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
1977 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
1976 root 1 5 0 3600K 0K ttyin 0:00 0.00% <getty>
Here is a 'ps' from ddb corresponding to the 'top' session above:
db> ps
pid ppid pgrp uid state wmesg wchan cmd
3332 817 817 0 N sendmail
3331 825 825 0 N cron
3159 2527 3159 207 S+ ttyin 0xffffff02574d4c10 systat
3158 2000 3158 0 SL+ pfault 0xffffffff809326f8 top
3155 0 0 0 SL mdwait 0xffffff030c9e7800 [md0]
2699 2682 2682 207 S+ piperd 0xffffff004e432000 tee
2698 2682 2682 207 SL+ pfault 0xffffffff809326f8 sdsbuild
2682 2098 2682 207 SW+ pause 0xffffff03253933c0 ksh
2527 2526 2527 207 SW+ pause 0xffffff030b906068 ksh
2526 2525 2526 0 SWs+ wait 0xffffff031af246b0 login
2525 1956 2525 0 Ss select 0xffffffff8091fe50 rlogind
2098 2097 2098 207 SW+ pause 0xffffff030bbc7a70 ksh
2097 2096 2097 0 SWs+ wait 0xffffff030bbc7358 login
2096 1956 2096 0 Ss select 0xffffffff8091fe50 rlogind
2000 1999 2000 0 SW+ pause 0xffffff030663b718 csh
1999 1988 1999 207 SW+ wait 0xffffff030663ba08 su
1988 1987 1988 207 SW+ pause 0xffffff031b2e4718 ksh
1987 1986 1987 0 SWs+ wait 0xffffff031b2e4a08 login
1986 1956 1986 0 Ss select 0xffffffff8091fe50 rlogind
1982 1 1982 0 ?s+ getty
1981 1 1981 0 SWs+ ttyin 0xffffff03e0c42410 getty
1980 1 1980 0 SWs+ ttyin 0xffffff03e0c42010 getty
1979 1 1979 0 SWs+ ttyin 0xffffff03dd03c010 getty
1978 1 1978 0 SWs+ ttyin 0xffffff03e14b6410 getty
1977 1 1977 0 SWs+ ttyin 0xffffff03e0921810 getty
1976 1 1976 0 SWs+ ttyin 0xffffff03e0921410 getty
1975 1 1975 0 SWs+ ttyin 0xffffff03e0200410 getty
1974 1 1974 0 SWs+ ttyin 0xffffff03e0200010 getty
1956 1 1956 0 SWs select 0xffffffff8091fe50 inetd
1092 1 1092 0 Ss select 0xffffffff8091fe50 cvd
825 1 825 0 SLs vmwait 0xffffffff809326f8 cron
817 1 817 0 SLs vmwait 0xffffffff809326f8 sendmail
811 1 811 0 SWs select 0xffffffff8091fe50 sshd
793 1 793 0 Ss select 0xffffffff8091fe50 usbd
773 1 773 0 SLs pfault 0xffffffff809326f8 ntpd
745 738 738 0 SW - 0xffffff03e068bc00 nfsd
742 738 738 0 SW - 0xffffff0000f47600 nfsd
741 738 738 0 SW - 0xffffff03e068b400 nfsd
740 738 738 0 SW - 0xffffff03e068b600 nfsd
738 1 738 0 SWs select 0xffffffff8091fe50 nfsd
736 1 736 0 SWs select 0xffffffff8091fe50 mountd
707 1 707 0 SLs pfault 0xffffffff809326f8 amd
676 1 676 0 Ss select 0xffffffff8091fe50 rpcbind
657 1 657 0 Ss select 0xffffffff8091fe50 syslogd
547 1 547 0 SWs select 0xffffffff8091fe50 devd
41 0 0 0 SL - 0xffffffffbf5ddbe4 [schedcpu]
40 0 0 0 SL sdflush 0xffffffff80931ae0 [softdepflush]
39 0 0 0 SL syncer 0xffffffff8089aa20 [syncer]
38 0 0 0 SL vlruwt 0xffffff03dccfc6b0 [vnlru]
37 0 0 0 SL psleep 0xffffffff80920718 [bufdaemon]
36 0 0 0 SL pgzero 0xffffffff80933460 [pagezero]
35 0 0 0 SL psleep 0xffffffff80932b2c [vmdaemon]
34 0 0 0 SL VMWait 0xffffffff80932ae0 [pagedaemon]
33 0 0 0 WL [swi0: sio]
32 0 0 0 WL [irq15: ata1]
31 0 0 0 WL [irq14: ata0]
30 0 0 0 SL usbevt 0xffffffff94066420 [usb1]
29 0 0 0 SL usbtsk 0xffffffff80895750 [usbtask]
28 0 0 0 SL usbevt 0xffffffff94064420 [usb0]
27 0 0 0 WL [irq19: ohci0 ohci1]
26 0 0 0 SL idle 0xffffffff940562e0 [mpt_raid0]
25 0 0 0 SL idle 0xffffffff94056000 [mpt_recovery0]
24 0 0 0 WL [irq28: mpt0]
23 0 0 0 SL - 0xffffff0000e2d700 [em3 taskq]
22 0 0 0 SL - 0xffffff0000e2dd00 [em2 taskq]
21 0 0 0 SL - 0xffffff0000e2a200 [em1 taskq]
20 0 0 0 SL - 0xffffff0000e36300 [em0 taskq]
19 0 0 0 WL [irq9: acpi0]
18 0 0 0 WL [swi6: Giant taskq]
9 0 0 0 SL - 0xffffff0000db5000 [thread taskq]
17 0 0 0 WL [swi5: +]
8 0 0 0 SL - 0xffffff0000db5300 [acpi_task_2]
7 0 0 0 SL - 0xffffff0000db5300 [acpi_task_1]
6 0 0 0 SL - 0xffffff0000db5300 [acpi_task_0]
5 0 0 0 SL - 0xffffff0000db5400 [kqueue taskq]
16 0 0 0 WL [swi2: cambio]
15 0 0 0 WL [swi6: task queue]
14 0 0 0 SL - 0xffffffff80893300 [yarrow]
4 0 0 0 SL - 0xffffffff80896468 [g_down]
3 0 0 0 SL - 0xffffffff80896460 [g_up]
2 0 0 0 SL - 0xffffffff80896450 [g_event]
13 0 0 0 WL [swi1: net]
12 0 0 0 WL [swi3: vm]
11 0 0 0 WL [swi4: clock sio]
10 0 0 0 RL CPU 0 [idle]
1 0 1 0 SLs wait 0xffffff03e1566000 [init]
0 0 0 0 SLs vmwait 0xffffffff809326f8 [swapper]
Here is where the pagedaemon appears to be blocked whenever the hang
occurs. Note that it is sleeping inside swp_pager_meta_build(), i.e. the
page daemon itself is waiting for free memory while trying to page
memory out:
db> tr 34
Tracing pid 34 tid 100027 td 0xffffff03e1557000
sched_switch() at sched_switch+0x11f
mi_switch() at mi_switch+0x14c
sleepq_wait() at sleepq_wait+0x2e
msleep() at msleep+0x191
swp_pager_meta_build() at swp_pager_meta_build+0x13b
swap_pager_putpages() at swap_pager_putpages+0x261
vm_pageout_flush() at vm_pageout_flush+0xde
vm_pageout() at vm_pageout+0x1668
fork_exit() at fork_exit+0x87
fork_trampoline() at fork_trampoline+0xe
--- trap 0, rip = 0, rsp = 0xffffffffbf56ed00, rbp = 0 ---
Here is the lock info from ddb:
db> show locks
exclusive spin mutex sio r = 0 (0xffffffff8097a100) locked @ /usr/src/sys/dev/sio/sio.c:1390
db> show alllocks
db> show lockedvnods
Locked vnodes
db> show pcpu
cpuid = 0
curthread = 0xffffff017f421720: pid 10 "idle"
curpcb = 0xffffffffb6ce2d10
fpcurthread = none
idlethread = 0xffffff017f421720: pid 10 "idle"
spin locks held:
exclusive spin mutex sio r = 0 (0xffffffff8097a100) locked @ /usr/src/sys/dev/sio/sio.c:1390
db> show allpcpu
Current CPU: 0
cpuid = 0
curthread = 0xffffff017f421720: pid 10 "idle"
curpcb = 0xffffffffb6ce2d10
fpcurthread = none
idlethread = 0xffffff017f421720: pid 10 "idle"
spin locks held:
exclusive spin mutex sio r = 0 (0xffffffff8097a100) locked @ /usr/src/sys/dev/sio/sio.c:1390
Here is a 'vmstat -s' snapshot taken as the machine hung:
45794937 cpu context switches
20220635 device interrupts
810196 software interrupts
66195618 traps
6615004 system calls
61 kernel threads created
5940 fork() calls
560 vfork() calls
0 rfork() calls
1851 swap pager pageins
13980 swap pager pages paged in
234711 swap pager pageouts
3670484 swap pager pages paged out
1088 vnode pager pageins
5591 vnode pager pages paged in
0 vnode pager pageouts
0 vnode pager pages paged out
744 page daemon wakeups
301408821 pages examined by the page daemon
3228902 pages reactivated
233903 copy-on-write faults
425 copy-on-write optimized faults
55595560 zero fill pages zeroed
10623188 zero fill pages prezeroed
1878 intransit blocking page faults
66078208 total VM faults taken
0 pages affected by kernel thread creation
377526 pages affected by fork()
1282533 pages affected by vfork()
0 pages affected by rfork()
71450207 pages freed
3805 pages freed by daemon
7974727 pages freed by exiting processes
3577239 pages active
294202 pages inactive
13662 pages in VM cache
106557 pages wired down
10548 pages free
4096 bytes per page
3665219 total name lookups
cache hits (75% pos + 0% neg) system 1% per-directory
deletions 0%, falsehits 0%, toolong 0%
Here is a snapshot of 'sysctl vm' as the machine hung:
vm.vmtotal:
System wide totals computed every five seconds: (values in kilobytes)
===============================================
Processes: (RUNQ: 2 Disk Wait: 1 Page Wait: 1 Sleep: 43)
Virtual Memory: (Total: 35242K, Active 30153728K)
Real Memory: (Total: 15917288K Active 15540600K)
Shared Virtual Memory: (Total: 7128K Active: 6576K)
Shared Real Memory: (Total: 2368K Active: 2368K)
Free Memory Pages: 96860K
vm.loadavg: { 2.67 1.84 1.53 }
vm.v_free_min: 25512
vm.v_free_target: 107384
vm.v_free_reserved: 5336
vm.v_inactive_target: 161076
vm.v_cache_min: 107384
vm.v_cache_max: 214768
vm.v_pageout_free_min: 34
vm.pageout_algorithm: 0
vm.swap_enabled: 1
vm.kmem_size_scale: 3
vm.kmem_size_max: 419430400
vm.kmem_size: 419430400
vm.nswapdev: 1
vm.dmmax: 32
vm.swap_async_max: 4
vm.zone_count: 69
vm.zone:
ITEM SIZE LIMIT USED FREE REQUESTS
FFS2 dinode: 256, 0, 38765, 25060, 633124
FFS1 dinode: 128, 0, 0, 0, 0
FFS inode: 192, 0, 38765, 26355, 633124
Mountpoints: 936, 0, 181, 11, 181
SWAPMETA: 288, 116519, 116519, 0, 116551
rtentry: 264, 0, 40, 30, 72
ripcb: 304, 33792, 0, 24, 18
sackhole: 32, 0, 0, 0, 0
tcpreass: 40, 2184, 0, 0, 0
hostcache: 136, 15372, 17, 39, 20
syncache: 128, 15370, 0, 58, 12
tcptw: 80, 6795, 0, 810, 1595
tcpcb: 752, 33795, 34, 21, 1655
inpcb: 304, 33792, 34, 794, 1655
udpcb: 304, 33792, 204, 24, 3543
ipq: 56, 1071, 0, 126, 2604
unpcb: 192, 33800, 12, 28, 577
socket: 616, 33792, 250, 26, 5794
KNOTE: 120, 0, 0, 62, 6251
PIPE: 768, 0, 2, 23, 2416
NFSNODE: 816, 0, 9012, 7898, 65144
NFSMOUNT: 584, 0, 174, 15, 174
DIRHASH: 1024, 0, 1219, 293, 2098
L VFS Cache: 327, 0, 729, 687, 10160
S VFS Cache: 104, 0, 52929, 30231, 714353
NAMEI: 1024, 0, 4, 8, 1395238
VNODEPOLL: 152, 0, 0, 0, 0
VNODE: 648, 0, 47992, 34748, 698512
ata_composit: 376, 0, 0, 0, 0
ata_request: 336, 0, 0, 22, 24
g_bio: 216, 0, 0, 1818, 1418264
ACL UMA zone: 388, 0, 0, 0, 0
mbuf_jumbo_1: 16384, 0, 0, 0, 0
mbuf_jumbo_9: 9216, 0, 0, 0, 0
mbuf_jumbo_p: 4096, 0, 0, 0, 0
mbuf_cluster: 2048, 33792, 1152, 72, 7441
mbuf: 256, 0, 1169, 121, 2197657
mbuf_packet: 256, 0, 1152, 138, 1780924
VMSPACE: 544, 0, 47, 16, 6504
UPCALL: 88, 0, 0, 0, 0
KSEGRP: 136, 0, 112, 18, 112
THREAD: 608, 0, 112, 8, 112
PROC: 856, 0, 88, 24, 6562
Files: 120, 0, 149, 68, 223724
4096: 4096, 0, 351, 8, 9244
2048: 2048, 0, 29, 137, 3815
1024: 1024, 0, 727, 17, 1059555
512: 512, 0, 593, 149, 16584
256: 256, 0, 753, 117, 57787
128: 128, 0, 5084, 1325, 291338
64: 64, 0, 3596, 436, 53093
32: 32, 0, 1895, 226, 27977
16: 16, 0, 3166, 194, 389649
mt_zone: 64, 0, 203, 77, 203
DP fakepg: 120, 0, 0, 0, 0
PV ENTRY: 48, 5678064, 3753306, 116118, 106702024
MAP ENTRY: 112, 0, 1707, 207, 273770
KMAP ENTRY: 112, 98406, 16, 116, 11646
MAP: 352, 0, 7, 15, 7
VM OBJECT: 224, 0, 41676, 8678, 178636
128 Bucket: 1048, 0, 240, 306, 1081
64 Bucket: 536, 0, 38, 32, 73
32 Bucket: 280, 0, 16, 26, 39
16 Bucket: 152, 0, 30, 20, 43
UMA Hash: 256, 0, 5, 10, 8
UMA RCntSlab: 128, 0, 612, 26, 619
UMA Slabs: 128, 0, 4452, 3668, 14083
UMA Zones: 184, 0, 67, 13, 67
UMA Kegs: 240, 0, 67, 8, 67
vm.old_contigmalloc: 0
vm.swap_idle_threshold2: 10
vm.swap_idle_threshold1: 2
vm.exec_map_entries: 16
vm.stats.misc.zero_page_count: 1415
vm.stats.misc.cnt_prezero: 6936472
vm.stats.vm.v_kthreadpages: 0
vm.stats.vm.v_rforkpages: 0
vm.stats.vm.v_vforkpages: 1282533
vm.stats.vm.v_forkpages: 377616
vm.stats.vm.v_kthreads: 61
vm.stats.vm.v_rforks: 0
vm.stats.vm.v_vforks: 560
vm.stats.vm.v_forks: 5941
vm.stats.vm.v_interrupt_free_min: 2
vm.stats.vm.v_pageout_free_min: 34
vm.stats.vm.v_cache_max: 214768
vm.stats.vm.v_cache_min: 107384
vm.stats.vm.v_cache_count: 13662
vm.stats.vm.v_inactive_count: 294202
vm.stats.vm.v_inactive_target: 161076
vm.stats.vm.v_active_count: 3577241
vm.stats.vm.v_wire_count: 106556
vm.stats.vm.v_free_count: 10547
vm.stats.vm.v_free_min: 25512
vm.stats.vm.v_free_target: 107384
vm.stats.vm.v_free_reserved: 5336
vm.stats.vm.v_page_count: 4035617
vm.stats.vm.v_page_size: 4096
vm.stats.vm.v_tfree: 71450323
vm.stats.vm.v_pfree: 7974792
vm.stats.vm.v_dfree: 3805
vm.stats.vm.v_pdpages: 301408821
vm.stats.vm.v_pdwakeups: 744
vm.stats.vm.v_reactivated: 3228902
vm.stats.vm.v_intrans: 1878
vm.stats.vm.v_vnodepgsout: 0
vm.stats.vm.v_vnodepgsin: 5591
vm.stats.vm.v_vnodeout: 0
vm.stats.vm.v_vnodein: 1088
vm.stats.vm.v_swappgsout: 3670484
vm.stats.vm.v_swappgsin: 13980
vm.stats.vm.v_swapout: 234711
vm.stats.vm.v_swapin: 1851
vm.stats.vm.v_ozfod: 10623195
vm.stats.vm.v_zfod: 55595610
vm.stats.vm.v_cow_optim: 425
vm.stats.vm.v_cow_faults: 233938
vm.stats.vm.v_vm_faults: 66078331
vm.stats.sys.v_soft: 810461
vm.stats.sys.v_intr: 20220649
vm.stats.sys.v_syscall: 6616609
vm.stats.sys.v_trap: 66195742
vm.stats.sys.v_swtch: 45795521
vm.v_free_severe: 15424
vm.max_proc_mmap: 37449
vm.old_msync: 0
vm.msync_flush_flags: 3
vm.boot_pages: 48
vm.pageout_lock_miss: 0
vm.disable_swapspace_pageouts: 0
vm.defer_swapspace_pageouts: 0
vm.swap_idle_enabled: 0
vm.pageout_stats_interval: 5
vm.pageout_full_stats_interval: 20
vm.pageout_stats_max: 107384
vm.max_launder: 32
vm.idlezero_maxrun: 16
vm.idlezero_enable: 1
vm.kvm_free: 1042280448
vm.kvm_size: 2147479552
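[One datum in the vm.zone table above that may explain the ~14G
correlation: the SWAPMETA zone shows USED equal to LIMIT (116519), i.e.
the zone is exhausted, and the pagedaemon's backtrace has it sleeping in
swp_pager_meta_build(), which allocates from that zone. If -- and this is
our assumption, not something verified against the 6.x sources -- each
SWAPMETA entry tracks 32 pages (SWAP_META_PAGES in the swap pager), the
zone limit works out to almost exactly the ~14G of swap at which the
hang occurs, independent of physmem and total swap:]

```shell
# Back-of-the-envelope check; the 32 pages-per-entry figure is assumed.
# 116519 swblock entries * 32 pages/entry * 4096 bytes/page
bytes=$((116519 * 32 * 4096))
echo "SWAPMETA limit covers $((bytes / 1024 / 1024 / 1024)) GiB of swap"
# prints: SWAPMETA limit covers 14 GiB of swap
```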
---
Brad Chisholm
blc at bsdwins.com