svn commit: r221346 - head/sys/netinet
Mikolaj Golub
trociny at freebsd.org
Sat May 14 10:59:09 UTC 2011
Hi,
On Mon, 2 May 2011 21:05:52 +0000 (UTC) John Baldwin wrote:
JB> Author: jhb
JB> Date: Mon May 2 21:05:52 2011
JB> New Revision: 221346
JB> URL: http://svn.freebsd.org/changeset/base/221346
JB> Log:
JB> Handle a rare edge case with nearly full TCP receive buffers. If a TCP
JB> buffer fills up causing the remote sender to enter into persist mode, but
JB> there is still room available in the receive buffer when a window probe
JB> arrives (either due to window scaling, or due to the local application
JB> very slowing draining data from the receive buffer), then the single byte
JB> of data in the window probe is accepted. However, this can cause rcv_nxt
JB> to be greater than rcv_adv. This condition will only last until the next
JB> ACK packet is pushed out via tcp_output(), and since the previous ACK
JB> advertised a zero window, the ACK should be pushed out while the TCP
JB> pcb is write-locked.
JB>
JB> During the window while rcv_nxt is greather than rcv_adv, a few places
JB> would compute the remaining receive window via rcv_adv - rcv_nxt.
JB> However, this value was then (uint32_t)-1. On a 64 bit machine this
JB> could expand to a positive 2^32 - 1 when cast to a long. In particular,
JB> when calculating the receive window in tcp_output(), the result would be
JB> that the receive window was computed as 2^32 - 1 resulting in advertising
JB> a far larger window to the remote peer than actually existed.
JB>
JB> Fix various places that compute the remaining receive window to either
JB> assert that it is not negative (i.e. rcv_nxt <= rcv_adv), or treat the
JB> window as full if rcv_nxt is greather than rcv_adv.
JB>
JB> Reviewed by: bz
JB> MFC after: 1 month
JB> Modified:
JB> head/sys/netinet/tcp_input.c
JB> head/sys/netinet/tcp_output.c
JB> head/sys/netinet/tcp_timewait.c
JB> Modified: head/sys/netinet/tcp_input.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_input.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_input.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -1831,6 +1831,9 @@ tcp_do_segment(struct mbuf *m, struct tc
JB> win = sbspace(&so->so_rcv);
JB> if (win < 0)
JB> win = 0;
JB> + KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
JB> + ("tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
JB> + tp->rcv_adv, tp->rcv_nxt));
I am getting this when running tests with HAST (both primary and secondary HAST
instances on the same host).
HAST is synchronizing data in MAXPHYS (131072 bytes) blocks. The sender splits
them on smaller chunks of MAX_SEND_SIZE (32768 bytes), while the receiver
receives the whole block calling recv() with MSG_WAITALL option.
FreeBSD kopusha.home.net 9.0-CURRENT FreeBSD 9.0-CURRENT #6 r221878: Sat May 14 11:44:42 EEST 2011 root at kopusha.home.net:/usr/obj/usr/src/sys/GENERIC i386
panic: tcp_input negative window: tp 0xc9777ad0 rcv_nxt 1530593650 rcv_adv 1530593651
#0 doadump () at pcpu.h:244
#1 0xc04ddac9 in db_fncall (dummy1=-1063410006, dummy2=0, dummy3=-1,
dummy4=0xe67547f0 "\004HuФ") at /usr/src/sys/ddb/db_command.c:548
#2 0xc04ddeff in db_command (last_cmdp=0xc0fcdbfc, cmd_table=0x0, dopager=0)
at /usr/src/sys/ddb/db_command.c:445
#3 0xc04ddfb4 in db_command_script (command=0xc0fceb04 "call doadump")
at /usr/src/sys/ddb/db_command.c:516
#4 0xc04e2280 in db_script_exec (scriptname=0xe67548fc "kdb.enter.panic",
warnifnotfound=Variable "warnifnotfound" is not available.
) at /usr/src/sys/ddb/db_script.c:302
#5 0xc04e2367 in db_script_kdbenter (eventname=0xc0e7d4ee "panic")
at /usr/src/sys/ddb/db_script.c:324
#6 0xc04dff98 in db_trap (type=3, code=0) at /usr/src/sys/ddb/db_main.c:228
#7 0xc09da822 in kdb_trap (type=3, code=0, tf=0xe6754a40)
at /usr/src/sys/kern/subr_kdb.c:533
#8 0xc0ccc2eb in trap (frame=0xe6754a40) at /usr/src/sys/i386/i386/trap.c:719
#9 0xc0cb4fec in calltrap () at /usr/src/sys/i386/i386/exception.s:168
#10 0xc09da6aa in kdb_enter (why=0xc0e7d4ee "panic", msg=0xc0e7d4ee "panic")
at cpufunc.h:71
#11 0xc09a5db4 in panic (
fmt=0xc0ea2d6b "tcp_input negative window: tp %p rcv_nxt %u rcv_adv %u")
at /usr/src/sys/kern/kern_shutdown.c:584
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680,
tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
---Type <return> to continue, or q <return> to quit---
at /usr/src/sys/netinet/tcp_input.c:1834
#13 0xc0b2d309 in tcp_input (m=0xc9aac800, off0=20)
at /usr/src/sys/netinet/tcp_input.c:1369
#14 0xc0ac4676 in ip_input (m=0xc9aac800)
at /usr/src/sys/netinet/ip_input.c:765
#15 0xc0a6948a in swi_net (arg=0xc1425880) at /usr/src/sys/net/netisr.c:653
#16 0xc097c675 in intr_event_execute_handlers (p=0xc65bf578, ie=0xc6608280)
at /usr/src/sys/kern/kern_intr.c:1257
#17 0xc097d559 in ithread_loop (arg=0xc65836a0)
at /usr/src/sys/kern/kern_intr.c:1270
#18 0xc0979928 in fork_exit (callout=0xc097d4b0 <ithread_loop>,
arg=0xc65836a0, frame=0xe6754d28) at /usr/src/sys/kern/kern_fork.c:920
#19 0xc0cb5064 in fork_trampoline () at /usr/src/sys/i386/i386/exception.s:275
(kgdb) fr 12
#12 0xc0b29cdb in tcp_do_segment (m=0xc9aac800, th=0xc9aac874, so=0xc9e0b680,
tp=0xc9777ad0, drop_hdrlen=52, tlen=1, iptos=0 '\0', ti_locked=3)
at /usr/src/sys/netinet/tcp_input.c:1834
1834 KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
(kgdb) p *tp
$1 = {
t_segq = {
lh_first = 0x0
},
t_pspare = {0x0, 0x0},
t_segqlen = 0,
t_dupacks = 0,
t_timers = 0xc9777cbc,
t_inpcb = 0xc8c4cdc0,
t_state = 9,
t_flags = 525300,
t_vnet = 0x0,
snd_una = 1602773057,
snd_max = 1602773057,
snd_nxt = 1602773057,
snd_up = 1602773057,
snd_wl1 = 1530593650,
snd_wl2 = 1602773057,
iss = 1602772219,
irs = 1499210531,
rcv_nxt = 1530593651,
rcv_adv = 1530593650,
rcv_wnd = 671,
---Type <return> to continue, or q <return> to quit---
rcv_up = 1530593650,
snd_wnd = 70872,
snd_cwnd = 29510,
snd_spare1 = 0,
snd_ssthresh = 1073725440,
snd_spare2 = 0,
snd_recover = 1602773057,
t_maxopd = 16344,
t_rcvtime = 175861,
t_starttime = 171604,
t_rtttime = 0,
t_rtseq = 1602772220,
t_bw_spare1 = 0,
t_bw_spare2 = 0,
t_rxtcur = 31,
t_maxseg = 14336,
t_srtt = 56,
t_rttvar = 37,
t_rxtshift = 0,
t_rttmin = 3,
t_rttbest = 40,
t_rttupdated = 3,
max_sndwnd = 71680,
---Type <return> to continue, or q <return> to quit---
t_softerror = 0,
t_oobflags = 0 '\0',
t_iobc = 0 '\0',
snd_scale = 3 '\003',
rcv_scale = 3 '\003',
request_r_scale = 3 '\003',
ts_recent = 175361,
ts_recent_age = 175361,
ts_offset = 2326744134,
last_ack_sent = 1530593651,
snd_cwnd_prev = 0,
snd_ssthresh_prev = 0,
snd_recover_prev = 0,
t_sndzerowin = 4,
t_badrxtwin = 0,
snd_limited = 0 '\0',
snd_numholes = 0,
snd_holes = {
tqh_first = 0x0,
tqh_last = 0xc9777bb8
},
snd_fack = 0,
rcv_numsacks = 0,
---Type <return> to continue, or q <return> to quit---
sackblks = {{
start = 0,
end = 0
}, {
start = 0,
end = 0
}, {
start = 0,
end = 0
}, {
start = 0,
end = 0
}, {
start = 0,
end = 0
}, {
start = 0,
end = 0
}},
sack_newdata = 0,
sackhint = {
nexthole = 0x0,
sack_bytes_rexmit = 0,
---Type <return> to continue, or q <return> to quit---
last_sack_ack = 0,
ispare = 0,
_pad = {0, 0}
},
t_rttlow = 0,
rfbuf_ts = 175361,
rfbuf_cnt = 0,
t_tu = 0x0,
t_sndrexmitpack = 0,
t_rcvoopack = 0,
t_toe = 0x0,
t_bytes_acked = 0,
cc_algo = 0xc0fa6960,
ccv = 0xc9777d5c,
osd = 0xc9777d74,
t_ispare = 0,
t_pspare2 = {0x0, 0x0, 0x0, 0x0},
_pad = {0 <repeats 12 times>}
}
(kgdb) p *so
$2 = {
so_count = 1,
so_type = 1,
so_options = 4,
so_linger = 0,
so_state = 2,
so_qstate = 0,
so_pcb = 0xc8c4cdc0,
so_vnet = 0x0,
so_proto = 0xc0fa56a8,
so_head = 0x0,
so_incomp = {
tqh_first = 0x0,
tqh_last = 0x0
},
so_comp = {
tqh_first = 0x0,
tqh_last = 0x0
},
so_list = {
tqe_next = 0x0,
tqe_prev = 0xc9e0ab88
},
---Type <return> to continue, or q <return> to quit---
so_qlen = 0,
so_incqlen = 0,
so_qlimit = 0,
so_timeo = 0,
so_error = 0,
so_sigio = 0x0,
so_oobmark = 0,
so_aiojobq = {
tqh_first = 0x0,
tqh_last = 0xc9e0b6cc
},
so_rcv = {
sb_sel = {
si_tdlist = {
tqh_first = 0x0,
tqh_last = 0x0
},
si_note = {
kl_list = {
slh_first = 0x0
},
kl_lock = 0xc0970a90 <knlist_mtx_lock>,
kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>,
---Type <return> to continue, or q <return> to quit---
kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>,
kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>,
kl_lockarg = 0xc9e0b6f8
},
si_mtx = 0x0
},
sb_mtx = {
lock_object = {
lo_name = 0xc0e84d6b "so_rcv",
lo_flags = 16973824,
lo_data = 0,
lo_witness = 0xc655a618
},
mtx_lock = 4
},
sb_sx = {
lock_object = {
lo_name = 0xc0e75c2c "so_rcv_sx",
lo_flags = 36896768,
lo_data = 0,
lo_witness = 0xc65634b0
},
sx_lock = 3356203168
---Type <return> to continue, or q <return> to quit---
},
sb_state = 0,
sb_mb = 0xc9aab800,
sb_mbtail = 0xc88f0000,
sb_lastrecord = 0xc9aab800,
sb_sndptr = 0x0,
sb_sndptroff = 0,
sb_cc = 71010,
sb_hiwat = 71680,
sb_mbcnt = 85760,
sb_mcnt = 23,
sb_ccnt = 20,
sb_mbmax = 262144,
sb_ctl = 0,
sb_lowat = 1,
sb_timeo = 2000,
sb_flags = 2052,
sb_upcall = 0,
sb_upcallarg = 0x0
},
so_snd = {
sb_sel = {
si_tdlist = {
---Type <return> to continue, or q <return> to quit---
tqh_first = 0x0,
tqh_last = 0x0
},
si_note = {
kl_list = {
slh_first = 0x0
},
kl_lock = 0xc0970a90 <knlist_mtx_lock>,
kl_unlock = 0xc0970ac0 <knlist_mtx_unlock>,
kl_assert_locked = 0xc0970e10 <knlist_mtx_assert_locked>,
kl_assert_unlocked = 0xc0970de0 <knlist_mtx_assert_unlocked>,
kl_lockarg = 0xc9e0b78c
},
si_mtx = 0x0
},
sb_mtx = {
lock_object = {
lo_name = 0xc0e84d64 "so_snd",
lo_flags = 16973824,
lo_data = 0,
lo_witness = 0xc655a5b0
},
mtx_lock = 4
---Type <return> to continue, or q <return> to quit---
},
sb_sx = {
lock_object = {
lo_name = 0xc0e75c22 "so_snd_sx",
lo_flags = 36896768,
lo_data = 0,
lo_witness = 0xc6563448
},
sx_lock = 1
},
sb_state = 16,
sb_mb = 0x0,
sb_mbtail = 0x0,
sb_lastrecord = 0x0,
sb_sndptr = 0x0,
sb_sndptroff = 0,
sb_cc = 0,
sb_hiwat = 43008,
sb_mbcnt = 0,
sb_mcnt = 0,
sb_ccnt = 0,
sb_mbmax = 262144,
sb_ctl = 0,
---Type <return> to continue, or q <return> to quit---
sb_lowat = 2048,
sb_timeo = 2000,
sb_flags = 2048,
sb_upcall = 0,
sb_upcallarg = 0x0
},
so_cred = 0xc879b700,
so_label = 0x0,
so_peerlabel = 0x0,
so_gencnt = 1987,
so_emuldata = 0x0,
so_accf = 0x0,
so_fibnum = 0,
so_user_cookie = 0
}
(kgdb)
JB> tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));
JB>
JB> /* Reset receive buffer auto scaling when not in bulk receive mode. */
JB> @@ -2868,7 +2871,10 @@ dodata: /* XXX */
JB> * buffer size.
JB> * XXX: Unused.
JB> */
JB> - len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt))
JB> + len = so->so_rcv.sb_hiwat - (tp->rcv_adv - tp->rcv_nxt);
JB> + else
JB> + len = so->so_rcv.sb_hiwat;
JB> #endif
JB> } else {
JB> m_freem(m);
JB> Modified: head/sys/netinet/tcp_output.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_output.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_output.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -561,15 +561,21 @@ after_sack_rexmit:
JB> * taking into account that we are limited by
JB> * TCP_MAXWIN << tp->rcv_scale.
JB> */
JB> - long adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale) -
JB> - (tp->rcv_adv - tp->rcv_nxt);
JB> + long adv;
JB> + int oldwin;
JB> +
JB> + adv = min(recwin, (long)TCP_MAXWIN << tp->rcv_scale);
JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt)) {
JB> + oldwin = (tp->rcv_adv - tp->rcv_nxt);
JB> + adv -= oldwin;
JB> + } else
JB> + oldwin = 0;
JB>
JB> /*
JB> * If the new window size ends up being the same as the old
JB> * size when it is scaled, then don't force a window update.
JB> */
JB> - if ((tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale ==
JB> - (adv + tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale)
JB> + if (oldwin >> tp->rcv_scale == (adv + oldwin) >> tp->rcv_scale)
JB> goto dontupdate;
JB> if (adv >= (long) (2 * tp->t_maxseg))
JB> goto send;
JB> @@ -1008,7 +1014,8 @@ send:
JB> if (recwin < (long)(so->so_rcv.sb_hiwat / 4) &&
JB> recwin < (long)tp->t_maxseg)
JB> recwin = 0;
JB> - if (recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
JB> + if (SEQ_GT(tp->rcv_adv, tp->rcv_nxt) &&
JB> + recwin < (long)(tp->rcv_adv - tp->rcv_nxt))
JB> recwin = (long)(tp->rcv_adv - tp->rcv_nxt);
JB> if (recwin > (long)TCP_MAXWIN << tp->rcv_scale)
JB> recwin = (long)TCP_MAXWIN << tp->rcv_scale;
JB> Modified: head/sys/netinet/tcp_timewait.c
JB> ==============================================================================
JB> --- head/sys/netinet/tcp_timewait.c Mon May 2 21:04:37 2011 (r221345)
JB> +++ head/sys/netinet/tcp_timewait.c Mon May 2 21:05:52 2011 (r221346)
JB> @@ -242,6 +242,9 @@ tcp_twstart(struct tcpcb *tp)
JB> /*
JB> * Recover last window size sent.
JB> */
JB> + KASSERT(SEQ_GEQ(tp->rcv_adv, tp->rcv_nxt),
JB> + ("tcp_twstart negative window: tp %p rcv_nxt %u rcv_adv %u", tp,
JB> + tp->rcv_adv, tp->rcv_nxt));
JB> tw->last_win = (tp->rcv_adv - tp->rcv_nxt) >> tp->rcv_scale;
JB>
JB> /*
--
Mikolaj Golub
More information about the svn-src-head
mailing list