kern/68011: [patch] Isochronous delays in PPPoE

Wed Jun 16 14:50:56 GMT 2004

>Number:         68011
>Category:       kern
>Synopsis:       [patch] Isochronous delays in PPPoE
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Wed Jun 16 14:50:26 GMT 2004
>Closed-Date:
>Last-Modified:
>Originator:     Sergio de Souza Prallon
>Release:        FreeBSD 4.10-STABLE i386
>Organization:
>Environment:
>Description:

	I use clockspeed (ports/sysutils/clockspeed) to keep my clock
	in sync. A couple of months ago I noticed it was no longer able
	to get the time reliably. When run from the command line it produced
	error messages and sometimes failed to set the clock. Pinging the
	NTP server, I saw the RTT was too high (~500-1000ms). Even more
	annoying was the fact that most of the ICMP replies were taking
	the same RTT (to within 1ms). Pinging other sites and
	servers gave the same results, as did pinging the PPPoE terminator.
	TCP connections were normal except for a "lag" in interactive
	SSH sessions to remote hosts. HTTP downloads were acceptable.
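
	The "same RTT to within 1ms" symptom can be checked mechanically
	rather than by eye. The following is only a minimal sketch: the
	function name rtt_largest_cluster() and its parameters are made up
	for illustration, and it assumes the RTT values have already been
	collected (for example, copied from ping output) into an array of
	milliseconds.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of ping or any FreeBSD tool).
 * Return the largest number of samples that lie within tol_ms of any
 * single sample. rtt_ms[] holds round-trip times in milliseconds.
 * O(n^2), which is fine for a few dozen pings.
 */
static size_t
rtt_largest_cluster(const double *rtt_ms, size_t n, double tol_ms)
{
	size_t best = 0, i, j;

	for (i = 0; i < n; i++) {
		size_t count = 0;

		for (j = 0; j < n; j++) {
			double d = rtt_ms[j] - rtt_ms[i];

			if (d < 0)
				d = -d;
			if (d <= tol_ms)
				count++;
		}
		if (count > best)
			best = count;
	}
	return (best);
}
```

	With tol_ms set to 1.0, a result close to n means most replies
	share one RTT, which is the behaviour described above. On a link
	with ordinary jitter the result would be expected to be much
	smaller than n.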

	At first, I thought it was a problem with my access provider, but
	they assured me everything was just fine on their side (no alarms,
	no abnormal error rates, etc). Not that I really trust them, but
	I decided to investigate my side. My HW configuration hadn't
	changed in months, so the problem had to be software
	related. A week or two before, I had cvsup'ed and rebuilt my
	system. To check this, I cvsup'ed again, this time to 4.9-REL.

	The problem vanished.

	After diffing 4.9-REL against 4.10-ST, I began trying to
	pinpoint the change(s) that caused the problem. Eventually
	I narrowed it down to 3 diffs that were committed at the same
	time with the same CVS comment:

	----8<--------8<--------8<--------8<--------8<--------8<----
	MFC:
	speedup stream socket recv handling by tracking the tail of the
	mbuf chain instead of walking the list for each append.  This has
	been pretty well tested at Yahoo!

	Obtained from:  netbsd (jason thorpe)
	Reviewed by:    silby
	----8<--------8<--------8<--------8<--------8<--------8<----

	I fail to understand how such a change could slow down (or
	synchronize) my traffic. I don't see any time dependency (spin
	loops or sleeps) in it, but it does trigger the problem.
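
	For reference, the optimization the CVS comment describes
	(remembering the tail of a chain instead of walking it on every
	append) looks roughly like the sketch below. This is NOT the
	code from the commit: struct buf, struct chain and all function
	names are hypothetical stand-ins for the real mbuf and socket
	buffer structures, kept only to show the idea.

```c
#include <stddef.h>

/* Hypothetical stand-in for an mbuf; not the real struct mbuf. */
struct buf {
	struct buf *next;
	int len;
};

/* Hypothetical stand-in for a socket buffer. */
struct chain {
	struct buf *head;
	struct buf *tail;	/* last element, or NULL when empty */
	int walked;		/* nodes visited by append_walk() */
};

/* Old approach: walk the whole list on every append, O(n). */
static void
append_walk(struct chain *c, struct buf *b)
{
	struct buf **pp = &c->head;

	while (*pp != NULL) {
		pp = &(*pp)->next;
		c->walked++;
	}
	b->next = NULL;
	*pp = b;
	c->tail = b;		/* keep the cached tail consistent */
}

/* New approach: use the cached tail pointer, O(1). */
static void
append_tail(struct chain *c, struct buf *b)
{
	b->next = NULL;
	if (c->tail == NULL)
		c->head = b;
	else
		c->tail->next = b;
	c->tail = b;
}

/* Total bytes queued; walks the chain once. */
static int
chain_len(const struct chain *c)
{
	int total = 0;
	const struct buf *b;

	for (b = c->head; b != NULL; b = b->next)
		total += b->len;
	return (total);
}
```

	Nothing in this pattern waits on a timer. The obvious way for it
	to go wrong is for some code path to change the chain without
	updating the cached tail, leaving the two out of step. Whether
	anything like that is involved here is only a guess.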

	To document it, I produced a screen (ports/misc/screen) session
	where I show:

	1) The problem occurring on an up to date system.
	2) That a 4.9-REL system does not have it.
	3) That a patched 4.9-REL kernel has it (with both userlands).

	The screenlog plus (possibly) relevant syslog and config info
	(including the diff that causes the bug) are in an attached file.

	I don't know if it affects other types of connections. I only
	have ADSL here.

>How-To-Repeat:

	Start with a 4.9-REL system. Apply the patch and build a new
	kernel. It should exhibit the problem. Based on what the patch
	changes, I don't think it's platform specific, but I can't
	prove it.

>Fix:

	I'm currently running a 4.9-REL kernel with a 4.10-ST userland
	just fine. I believe that undoing the change should fix(?) the
	problem. I have not tested it, because the patch fails to reverse
	due to other changes in the code after this one. Of course, the
	correct solution is to understand what's going on and rewrite
	the change.

>Release-Note:
>Audit-Trail:
>Unformatted:
 >System:
 	FreeBSD ethshar 4.10-STABLE FreeBSD 4.10-STABLE #0:
 		Sun Jun 13 13:05:35 BRT 2004
 		root@ethshar:/aux/src/sys/compile/TEST i386

 	Machine is an Intel Seattle II (SE440BX-2) + PIII 600E
 		+ 256MB RAM + 20GB HD.

 	The Internet connection is ADSL (256Kbps).
 	It uses a VIA Rhine III ethernet + USR 9001 ADSL modem.
 	I don't know the brand of the DSLAM but the tunnel terminator
 		is probably a Cisco 6400.