Strange timing problems with BETA7

Thu Oct 14 02:47:11 PDT 2004

Hi,

I'm having very strange stability problems with BETA7 which seem to be related
to timing or the clock:

Hardware is a Netra t 1125 with 2 CPUs.

Symptoms:

After a fresh reboot, when I run a standard ping against any IP address, the
interval between pings is not constant and is generally shorter than the
default of 1 second.

I also sometimes get negative latencies with ping or traceroute:

# ping 62.4.16.70
PING 62.4.16.70 (62.4.16.70): 56 data bytes
64 bytes from 62.4.16.70: icmp_seq=0 ttl=60 time=-432.827 ms
64 bytes from 62.4.16.70: icmp_seq=1 ttl=60 time=1.955 ms

# traceroute 62.4.16.70
traceroute to 62.4.16.70 (62.4.16.70), 64 hops max, 52 byte packets
 1  gi0-12-swr102-mix-courbevoie (213.215.63.1)  436.046 ms  0.733 ms  0.611 ms
 2  gi0-2-3-edou.nerim.net (194.79.130.114)  0.619 ms  -434.763 ms  435.882 ms
 3  gi0-3-32-svenny.nerim.net (194.79.130.1)  1.737 ms  1.435 ms  1.715 ms
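
As far as I understand, a round-trip time is just the difference between two
clock readings, so a negative value suggests the clock stepped backwards
between them. To check for that directly, something like the following could
be run. It is only a minimal sketch in Python, not what ping or the kernel
actually does; the names find_backward_steps and sample_wall_clock are made up
for the illustration.

```python
import time


def find_backward_steps(samples):
    """Return (index, delta_seconds) for every sample earlier than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        delta = samples[i] - samples[i - 1]
        if delta < 0:
            steps.append((i, delta))
    return steps


def sample_wall_clock(count):
    """Read the wall clock `count` times in a tight loop."""
    return [time.time() for _ in range(count)]


if __name__ == "__main__":
    # One million samples is an arbitrary choice for the illustration.
    steps = find_backward_steps(sample_wall_clock(1_000_000))
    for index, delta in steps:
        print(f"sample {index}: clock stepped back by {-delta * 1000:.3f} ms")
    print(f"{len(steps)} backward step(s) seen")
```

On a healthy system I would expect this to report no backward steps. Note that
the odd values in the output above (-432.827, 436.046, -434.763, 435.882 ms)
are all close to 434 ms in magnitude, which may hint at a fixed offset between
two time sources, but that is only a guess.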

After a few hours of activity (this box is an FTP server), the kernel logs
messages like these:

calcru: negative runtime of -893918 usec for pid 1344 (pure-ftpd)
calcru: negative runtime of -761379 usec for pid 1339 (pure-ftpd)
calcru: negative runtime of -1687109 usec for pid 1337 (pure-ftpd)
calcru: negative runtime of -295856 usec for pid 7 (pagedaemon)
calcru: runtime went backwards from 162673274 usec to 159978646 usec for pid 29 (intr2017: hme0)
calcru: runtime went backwards from 33673531 usec to 30674086 usec for pid 4 (g_down)
calcru: runtime went backwards from 102734677682 usec to 102731983847 usec for pid 12 (idle: cpu0)
calcru: runtime went backwards from 102678868452 usec to 102678764016 usec for pid 11 (idle: cpu1)
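
To see how large the backward steps are, the "runtime went backwards" lines
can be parsed and the two values subtracted. A minimal sketch in Python,
assuming the message format is exactly as shown above; the regular expression
and the function name backward_steps are my own and not part of any tool.

```python
import re
import sys

# Matches the "runtime went backwards" form of the calcru message.
# The "negative runtime" form is deliberately not matched.
BACKWARDS_RE = re.compile(
    r"calcru: runtime went backwards from (\d+) usec to (\d+) usec "
    r"for pid (\d+) \((.+)\)"
)


def backward_steps(log_lines):
    """Return (pid, process name, step in usec) for each matching line."""
    results = []
    for line in log_lines:
        match = BACKWARDS_RE.search(line)
        if match:
            before, after, pid, name = match.groups()
            results.append((int(pid), name, int(before) - int(after)))
    return results


if __name__ == "__main__":
    # Reads kernel messages from standard input, e.g. piped from dmesg.
    for pid, name, step in backward_steps(sys.stdin):
        print(f"pid {pid} ({name}): went back by {step} usec")
```

For the four lines above this gives steps of 2694628, 2999445, 2693835 and
104436 usec.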

At this point, netstat -Iw 1 prints nothing but the header line. Similarly,
pinging any IP address gives a single reply and then the ping command hangs.
(Both processes are in the select() state when they hang and can be
interrupted with ^C.)

When I reboot after a few hours of uptime, the reboot seems to get stuck
after killing all the running processes. I never see the kernel shutdown
messages and have to power cycle the box.

Some applications seem to have timing problems too.
wget randomly aborts with:

Assertion failed: (msecs >= 0), function calc_rate, file retr.c, line 262.
Abort trap (core dumped)
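
My reading of this assertion is that the elapsed time between two clock
readings came out negative. The sketch below only illustrates that failure
mode in Python; it is not wget's calc_rate code, and the function
transfer_rate and its arguments are hypothetical. Instead of asserting, it
returns None when time did not advance.

```python
def transfer_rate(bytes_done, start_secs, end_secs):
    """Bytes per second between two timestamps; None if time did not advance."""
    msecs = (end_secs - start_secs) * 1000.0
    if msecs <= 0:
        # The clock stood still or stepped backwards between the two readings.
        return None
    return bytes_done / (msecs / 1000.0)


if __name__ == "__main__":
    print(transfer_rate(1000, 10.0, 12.0))  # clock moved forward
    print(transfer_rate(1000, 10.0, 9.5))   # clock stepped back by 0.5 s
```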

This started when I upgraded from 5.2.1 to BETA3 and the problem is still
present in BETA7 (last cvsup from Oct 5).

I reset the date according to the heads-up about the mk48txx commit.

I tried mpsafenet=0 with the same result. My kernel config is pretty much like
GENERIC except that I'm using SCHED_4BSD, maxusers 512 and ZERO_COPY_SOCKETS
(no WITNESS, no INVARIANTS).

Any ideas on this? Could this be a hardware problem?

-- 
Herve Boulouis