vm_zone corruption 4.x

Fri Jan 25 08:35:29 PST 2008

Good day,

    I have stumbled into a strange problem where my FBSD 4.x box keeps 
crashing under network traffic load. I have enabled INVARIANTS and 
debugging and was able to gather a trace. The context here is that a 
listening connection created a syncache entry, sent a SYN-ACK, and is now 
processing the ACK it got back. Everything seems fine until it tries to 
create a new socket from the listening one, and as it is about to get 
another TCP control block the kernel dies :(

(kgdb) bt
#0  Debugger (msg=0xc02bd93b "panic") at ../../i386/i386/db_interface.c:321
#1  0xc016b080 in panic (fmt=0xc02e6cd9 "zone: entry not free") at ../../kern/kern_shutdown.c:593
#2  0xc025046b in zerror () at ../../vm/vm_zone.c:547
#3  0xc02500ab in zalloci (z=0xce703180) at ../../vm/vm_zone.c:76
#4  0xc01c1809 in in_pcballoc (so=0xeef12fe0, pcbinfo=0xc034ba40, p=0x0)
    at ../../netinet/in_pcb.c:167
#5  0xc01df8f0 in tcp_attach (so=0xeef12fe0, p=0x0) at ../../netinet/tcp_usrreq.c:1603
#6  0xc01ddbc9 in tcp_usr_attach (so=0xeef12fe0, proto=0, p=0x0) at ../../netinet/tcp_usrreq.c:175
#7  0xc018cd1d in sonewconn3 (head=0xeedfb7c0, connstatus=2, p=0x0) at ../../kern/uipc_socket2.c:223
#8  0xc018cc54 in sonewconn (head=0xeedfb7c0, connstatus=2) at ../../kern/uipc_socket2.c:196
#9  0xc01dbc40 in syncache_socket (sc=0xf0f0ac80, lso=0xeedfb7c0)
    at ../../netinet/tcp_syncache.c:594
#10 0xc01dc290 in syncache_expand (inc=0xf585ac50, th=0xc61d0034, sop=0xf585ac48, m=0xc3774200)
    at ../../netinet/tcp_syncache.c:946
#11 0xc01d2ce7 in tcp_input (m=0xc3774200, off0=20, proto=6) at ../../netinet/tcp_input.c:1058
#12 0xc01ca93f in ip_input (m=0xc3774200) at ../../netinet/ip_input.c:1279
#13 0xc01ca9a3 in ipintr () at ../../netinet/ip_input.c:1300
#14 0xc027e5b9 in swi_net_next ()
#15 0xc016de61 in tsleep (ident=0xce7e9700, priority=280, wmesg=0xc02bb3b8 "kqread", timo=3)
    at ../../kern/kern_synch.c:479
#16 0xc01616e3 in kqueue_scan (fp=0xce7f7040, maxevents=65535, ulistp=0x80a2000, tsp=0xf585af2c,
    p=0xed3c3d80) at ../../kern/kern_event.c:645
#17 0xc0161211 in kevent (p=0xed3c3d80, uap=0xf585af80) at ../../kern/kern_event.c:454
#18 0xc028c33e in syscall2 (frame={tf_fs = 47, tf_es = -562495441, tf_ds = -1078001617,
      tf_edi = 60, tf_esi = 134881340, tf_ebp = -1077937120, tf_isp = -175788076,
      tf_ebx = 134852608, tf_edx = 1, tf_ecx = -1077937128, tf_eax = 363, tf_trapno = 7,
      tf_err = 2, tf_eip = 134690428, tf_cs = 31, tf_eflags = 663, tf_esp = -1077937180,
      tf_ss = 47}) at ../../i386/i386/trap.c:1175
#19 0xc027d155 in Xint0x80_syscall ()

(kgdb) p *z
$2 = {zlock = {lock_data = 0}, zitems = 0x0, zfreecnt = 13945, zfreemin = 6, znalloc = 253356,
  zkva = 4021252096, zpagecount = 3687, zpagemax = 5120, zmax = 32768, ztotal = 23596, zsize = 640,
  zalloc = 1, zflags = 1, zallocflag = 1, zobj = 0xc0341c80, zname = 0xc02cb489 "tcpcb",
  znext = 0xce703200}

Now there are a couple of strange things here, and maybe someone with more 
experience with the VM can shed some light on them.
1) I can't help but find it unusual that zitems is NULL ...
2) The sum of zfreecnt + ztotal (13945 + 23596 = 37541) is bigger than 
zmax (32768) ...
3) If we are in zalloci(), why is the zlock not held (0)?

What else should I be looking for here? The crash only happens after a 
certain number of items are used (>20k so far).

Thanks,

Karim.