deadlock in "zoneli" state on 6.2-PRERELEASE

Nikolay Pavlov quetzal at zone3000.net
Thu Nov 23 08:02:46 PST 2006


On Thursday, 23 November 2006 at 20:24:15 +0800, delphij at delphij.net wrote:
> 
> Hi,
> 
> On Wed, 22 Nov 2006 21:55:49 +0200, Nikolay Pavlov <quetzal at zone3000.net> wrote:
> > Hi.
> > It seems i have a deadlock on 6.2-PRERELEASE.
> > This is squid server in accelerator mode.
> > I can easily trigger it with a high rate of requests.
> > Squid is locked on some "zoneli" state, i am not sure what it is.
> > Also i can't KILL proccess even with SIGKILL.
> > In addition one of sshd proccess is locked too.
> 
> Would you please update to the latest RELENG_6 and apply this patch:
> 
> http://people.freebsd.org/~delphij/misc/patch-zonelimit-workaround
> 
> to see if things gets improved?
> 
> Thanks in advance!
> 
> Cheers,

Well. This patch works quite ambiguous for me.
Under heavy load this box become unresponseble via network.
System is mostly idle. Squid is locked in zoneli.

ast pid:   840;  load averages:  0.26,  0.24,  0.17 up 0+00:11:50  10:19:46
34 processes:  1 running, 33 sleeping
CPU states:  0.4% user,  0.0% nice,  0.4% system,  1.5% interrupt, 97.8% idle
Mem: 225M Active, 144M Inact, 261M Wired, 12K Cache, 112M Buf, 3259M Free
Swap: 4070M Total, 4070M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
  682 squid       1 -16    0   207M   207M zoneli   2:18  6.59% squid
  709 root        1  -8    0  7768K  7240K piperd   0:00  0.00% perl5.8.8
  691 root        1  96    0  6632K  4796K select   0:00  0.00% snmpd
  829 root        1  76  -20  2400K  1648K RUN      0:00  0.00% top
  790 quetzal     1  96    0  6220K  3252K select   0:00  0.00% sshd
  788 root        1   4    0  6232K  3232K sbwait   0:00  0.00% sshd
  837 root        1  20    0  5048K  3024K pause    0:00  0.00% tcsh
  832 root        1   4    0  6232K  3236K sbwait   0:00  0.00% sshd
  820 root        1  20    0  4700K  2856K pause    0:00  0.00% tcsh
  645 root        1  96    0  2984K  1808K select   0:00  0.00% ntpd
  791 quetzal     1  20    0  4708K  2872K pause    0:00  0.00% tcsh
  560 root        1  96    0  1352K   996K select   0:00  0.00% syslogd
  362 _pflogd     1 -58    0  1600K  1144K bpf      0:00  0.00% pflogd
  835 quetzal     1  20    0  4728K  2960K pause    0:00  0.00% tcsh
  688 squid       1  -8    0  1224K   632K piperd   0:00  0.00% unlinkd
  834 quetzal     1  96    0  6220K  3252K select   0:00  0.00% sshd
  840 root        1  20    0  1540K   960K pause    0:00  0.00% netstat
  719 root        1  96    0  3464K  2796K select   0:00  0.00% sendmail
  729 root        1   8    0  1364K  1060K nanslp   0:00  0.00% cron

root at accel1:~# netstat -h 1
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
      1.6K     0       1.3M       1.5K     0       1.6M     0
      1.8K     0       1.6M       1.7K     0       1.6M     0
      1.3K     0       1.0M       1.4K     0       1.4M     0
      1.5K     0       1.3M       1.5K     0       1.4M     0
      1.6K     0       1.4M       1.6K     0       1.5M     0
      1.7K     0       1.5M       1.6K     0       1.5M     0
      1.3K     0       830K       1.4K     0       1.5M     0
      1.1K     0       679K       1.3K     0       1.4M     0
       812     0       501K        912     0       971K     0
      1.2K     0       1.1M       1.2K     0       1.1M     0
       617     0       325K        742     0       806K     0
       634     0       312K        769     0       818K     0
      1.8K     0       1.7M       1.5K     0       1.1M     0
       11K     0        13M       7.5K     0       3.8M     0
       10K     0        12M       8.0K     0       5.2M     0
      9.7K     0       9.9M       8.2K     0       6.3M     0
       513  1.7K       666K        328     0       151K     0
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Here goes load...

      1.0K   543       782K        434     0       247K     0
         0  2.3K          0          0     0          0     0
         2   605       1.5K          2     0        132     0
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
         0   334          0          0     0          0     0
         0   286          0          0     0          0     0
         0   288          0          0     0          0     0
       819   204       689K        328     0       122K     0
         0  1.7K          0          0     0          0     0
       866  1.2K       719K        375     0       141K     0
       144  1.5K       175K        111     0        55K     0
         0  1.3K          0          0     0          0     0
       687   182       426K        304     0        73K     0
         0  3.2K          0          0     0          0     0
      1.0K     0       723K        405     0       126K     0
        17  1.8K        25K         11     0       2.2K     0
       598   990       409K        163     0        32K     0
       785  1.9K       635K        313     0        85K     0
         0  3.1K          0          0     0          0     0
         0  1.0K          0          0     0          0     0
         0    82          0          0     0          0     0
         0   205          0          0     0          0     0
         0   181          0          0     0          0     0
         0   197          0          0     0          0     0
      1.0K   633       915K        442     0        89K     0
            input        (Total)           output
   packets  errs      bytes    packets  errs      bytes colls
       821  1.8K       712K        297     0        53K     0
       481  1.8K       437K        326     0        91K     0
         0  1.5K          0          0     0          0     0
         2   190       3.0K          2     0        356     0
         0   210          0          0     0          0     0
         0   150          0          0     0          0     0
         0   210          0          0     0          0     0
       799   631       938K        458     0        99K     0
        36  2.1K        41K         27     0       5.5K     0
       678   260       898K        366     0        70K     0

BUT after some time, with out any load squid is responding again without
any reboot.

Some other info:

root at accel1:~# netstat -i
Name    Mtu Network       Address              Ipkts Ierrs    Opkts Oerrs  Coll
em0    1500 <Link#1>      00:30:48:2a:de:5a   984648 11332   939484     0     0
em0    1500 206.53.57/26  accel1              968817     -   939431     -     -
em1*   1500 <Link#2>      00:30:48:2a:de:5b        0     0        0     0     0
lo0   16384 <Link#3>                             314     0      314     0     0
lo0   16384 localhost     ::1                      0     -        0     -     -
lo0   16384 your-net      localhost              314     -      314     -     -
pflog 33208 <Link#4>                               0     0        0     0     0

Another intresting thing - i see that mbufs deniend counter is growing:

root at accel1:~# netstat -m
65485/65/65550 mbufs in use (current/cache/total)
65284/418/65702/131072 mbuf clusters in use (current/cache/total/max)
65284/0 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/0 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/0 9k jumbo clusters in use (current/cache/total/max)
0/0/0/0 16k jumbo clusters in use (current/cache/total/max)
146939K/852K/147791K bytes allocated to network (current/cache/total)
21823/0/75016 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/3/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

root at accel1:~# netstat -m
65538/12/65550 mbufs in use (current/cache/total)
65338/364/65702/131072 mbuf clusters in use (current/cache/total/max)
65338/6 mbuf+clusters out of packet secondary zone in use (current/cache)
0/0/0/0 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/0 9k jumbo clusters in use (current/cache/total/max)
0/0/0/0 16k jumbo clusters in use (current/cache/total/max)
147060K/731K/147791K bytes allocated to network (current/cache/total)
23262/0/86054 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0/3/6656 sfbufs in use (current/peak/max)
0 requests for sfbufs denied
0 requests for sfbufs delayed
0 requests for I/O initiated by sendfile
0 calls to protocol drain routines

My configuration files:

root at accel1:~# cat /boot/loader.conf
# --- Loader settings ---
autoboot_delay="5"              # Delay in seconds before autobooting

# --- Kernel tunables ---
kern.ipc.nmbclusters="131072"   # Set the number of mbuf clusters
kern.cam.scsi_delay="5000"      # Delay (in ms) before probing SCSI
kern.maxdsiz="2752M"            # Allow more memory allocation for squid
kern.dfldsiz="2752M"            # Allow more memory allocation for squid
#kern.maxssiz="256M"            # Allow more memory allocation for squid
#kern.maxusers="512"

# --- Networking modules ---
pf_load="YES"                   # Packet filter

# --- Other modules ---
accf_data_load="YES"             # Wait for data accept filter
accf_http_load="YES"             # Wait for full HTTP request accept filter


root at accel1:~# cat /etc/sysctl.conf
# $FreeBSD: src/etc/sysctl.conf,v 1.8 2003/03/13 18:43:50 mux Exp $
#
#  This file is read when going to multi-user and its contents piped thru
#  ``sysctl'' to adjust kernel values.  ``man 5 sysctl.conf'' for details.
#

# Uncomment this to prevent users from seeing information about processes that
# are being run under another UID.
#security.bsd.see_other_uids=0

# --- MAC access for squid ---
net.inet.ip.portrange.reservedlow=0
net.inet.ip.portrange.reservedhigh=0
security.mac.portacl.rules=uid:100:tcp:80


# --- Kernel tunning ---
kern.ipc.somaxconn=8192

# --- Network tunning and protection ---
kern.ipc.maxsockbuf=1048576
net.inet.tcp.sendspace=262144
net.inet.tcp.recvspace=131072
net.inet.tcp.msl=3000
net.inet.icmp.icmplim=50
net.inet.icmp.drop_redirect=1
net.inet.icmp.log_redirect=1
net.inet.ip.redirect=0
net.inet6.ip6.redirect=0

Some kernel options:

options VM_KMEM_SIZE_MAX=536870912"
options         MAC
options         MAC_PORTACL


Ahh... And here is also a panic :)


root at accel1:/usr/obj/usr/src/sys/ACCEL# kgdb kernel.debug /var/crash/vmcore.3
kgdb: kvm_nlist(_stopped_cpus):
kgdb: kvm_nlist(_stoppcbs):
[GDB will not be able to debug user-mode threads: /usr/lib/libthread_db.so: Undefined symbol "ps_pglobal_lookup"]
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd".

Unread portion of the kernel message buffer:


Fatal trap 12: page fault while in kernel mode
fault virtual address   = 0xc
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc054dfac
stack pointer           = 0x28:0xf4817ad0
frame pointer           = 0x28:0xf4817adc
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 13 (swi1: net)
trap number             = 12
panic: page fault
KDB: stack backtrace:
kdb_backtrace(100,c927aa80,28,f4817a90,c,...) at kdb_backtrace+0x29
panic(c06c9d38,c06f033a,0,fffff,c927d69b,...) at panic+0xa8
trap_fatal(f4817a90,c,c927aa80,0,c,...) at trap_fatal+0x2a6
trap_pfault(f4817a90,0,c) at trap_pfault+0x1f3
trap(c93b0008,f4810028,c0590028,1,0,...) at trap+0x325
calltrap() at calltrap+0x5
--- trap 0xc, eip = 0xc054dfac, esp = 0xf4817ad0, ebp = 0xf4817adc ---
m_copydata(0,ffffffff,1,cfcbc174,c1040460,...) at m_copydata+0x28
tcp_output(cc6781d0) at tcp_output+0xa17
tcp_input(ca16ab00,14,1,c927aa80,c927aa80,...) at tcp_input+0x286c
ip_input(ca16ab00) at ip_input+0x5e1
netisr_processqueue(c073d7b8) at netisr_processqueue+0x9f
swi_net(0) at swi_net+0xf2
ithread_execute_handlers(c9279648,c92c3400) at ithread_execute_handlers+0x121
ithread_loop(c92436a0,f4817d38) at ithread_loop+0x54
fork_exit(c04ff5c8,c92436a0,f4817d38) at fork_exit+0x70
fork_trampoline() at fork_trampoline+0x8
--- trap 0x1, eip = 0, esp = 0xf4817d6c, ebp = 0 ---
Uptime: 11m24s
Dumping 3967 MB (3 chunks)
  chunk 0: 1MB (159 pages) ... ok
  chunk 1: 3966MB (1015280 pages) 3950 3934 3918 3902 3886 3870 3854 3838 3822 3806 3790 3774 3758 3742 3726 3710 3694 3678 3662 3646 3630 3614 3598 3582 3566 3550 3534 3518 3502 3486 3470 3454 3438 3422 3406 3390 3374 3358 3342 3326 3310 3294 3278 3262 3246 3230 3214 3198 3182 3166 3150 3134 3118 3102 3086 3070 3054 3038 3022 3006 2990 2974 2958 2942 2926 2910 2894 2878 2862 2846 2830 2814 2798 2782 2766 2750 2734 2718 2702 2686 2670 2654 2638 2622 2606 2590 2574 2558 2542 2526 2510 2494 2478 2462 2446 2430 2414 2398 2382 2366 2350 2334 2318 2302 2286 2270 2254 2238 2222 2206 2190 2174 2158 2142 2126 2110 2094 2078 2062 2046 2030 2014 1998 1982 1966 1950 1934 1918 1902 1886 1870 1854 1838 1822 1806 1790 1774 1758 1742 1726 1710 1694 1678 1662 1646 1630 1614 1598 1582 1566 1550 1534 1518 1502 1486 1470 1454 1438 1422 1406 1390 1374 1358 1342 1326 1310 1294 1278 1262 1246 1230 1214 1198 1182 1166 1150 1134 1118 1102 1086 1070 1054 1038 1022 1006 990 974 958 942 926 910 894 878 862 846 830 814 798 782 766 750 734 718 702 686 670 654 638 622 606 590 574 558 542 526 510 494 478 462 446 430 414 398 382 366 350 334 318 302 286 270 254 238 222 206 190 174 158 142 126 110 94 78 62 46 30 14 ... ok
  chunk 2: 1MB (128 pages)

#0  doadump () at pcpu.h:165
165     pcpu.h: No such file or directory.
        in pcpu.h
(kgdb) bt
#0  doadump () at pcpu.h:165
#1  0xc05167d2 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:409
#2  0xc0516a98 in panic (fmt=0xc06c9d38 "%s") at /usr/src/sys/kern/kern_shutdown.c:565
#3  0xc06a416a in trap_fatal (frame=0xf4817a90, eva=12) at /usr/src/sys/i386/i386/trap.c:837
#4  0xc06a3e9b in trap_pfault (frame=0xf4817a90, usermode=0, eva=12) at /usr/src/sys/i386/i386/trap.c:745
#5  0xc06a3a95 in trap (frame=
      {tf_fs = -918880248, tf_es = -192872408, tf_ds = -1067909080, tf_edi = 1, tf_esi = 0, tf_ebp = -192840996, tf_isp = -192841028, tf_ebx = 4380, tf_edx = -1, tf_ecx = 0, tf_eax = -808730252, tf_trapno = 12, tf_err = 0, tf_eip = -1068179540, tf_cs = 32, tf_eflags = 66050, tf_esp = 4380, tf_ss = -865631792}) at /usr/src/sys/i386/i386/trap.c:435
#6  0xc0691cda in calltrap () at /usr/src/sys/i386/i386/exception.s:139
#7  0xc054dfac in m_copydata (m=0x0, off=-1, len=1, cp=0xcfcbc174 "") at /usr/src/sys/kern/uipc_mbuf.c:543
#8  0xc05ac60f in tcp_output (tp=0xcc6781d0) at /usr/src/sys/netinet/tcp_output.c:770
#9  0xc05aaea4 in tcp_input (m=0xca16ab00, off0=20) at /usr/src/sys/netinet/tcp_input.c:2471
#10 0xc05a1bb9 in ip_input (m=0xca16ab00) at /usr/src/sys/netinet/ip_input.c:785
#11 0xc0590a6f in netisr_processqueue (ni=0xc073d7b8) at /usr/src/sys/net/netisr.c:236
#12 0xc0590c6a in swi_net (dummy=0x0) at /usr/src/sys/net/netisr.c:349
#13 0xc04ff511 in ithread_execute_handlers (p=0xc9279648, ie=0xc92c3400) at /usr/src/sys/kern/kern_intr.c:682
#14 0xc04ff61c in ithread_loop (arg=0xc92436a0) at /usr/src/sys/kern/kern_intr.c:765
#15 0xc04fe498 in fork_exit (callout=0xc04ff5c8 <ithread_loop>, arg=0xc92436a0, frame=0xf4817d38) at /usr/src/sys/kern/kern_fork.c:821
#16 0xc0691d3c in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:208




More information about the freebsd-stable mailing list