From sean at chittenden.org Mon Jun 2 05:22:12 2008 From: sean at chittenden.org (Sean Chittenden) Date: Mon Jun 2 05:22:16 2008 Subject: Micro-benchmark for various time syscalls... Message-ID: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> I wrote a small micro-benchmark utility[1] to test various time syscalls and the results were a bit surprising to me. The results were from a UP machine and I believe that the difference between gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and performance would've degraded further with each additional core. clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume that CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) (a statement I can't back up, but believe is likely to be correct) and therefore there isn't much harm (if any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use vs. gettimeofday(2). FYI. -sc PS Is there a reason that time(3) can't be implemented in terms of clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to time_t's whole second resolution. % ./bench_time 9079882 | sort -rnk1 Timing micro-benchmark. 9079882 syscall iterations. Avg. 
us/call Elapsed Name 9.322484 84.647053 gettimeofday(2) 8.955324 81.313291 time(3) 8.648315 78.525684 clock_gettime(2/CLOCK_REALTIME) 8.598495 78.073325 clock_gettime(2/CLOCK_MONOTONIC) 0.674194 6.121600 clock_gettime(2/CLOCK_PROF) 0.648083 5.884515 clock_gettime(2/CLOCK_VIRTUAL) 0.330556 3.001412 clock_gettime(2/CLOCK_REALTIME_FAST) 0.306514 2.783111 clock_gettime(2/CLOCK_SECOND) 0.262788 2.386085 clock_gettime(2/CLOCK_MONOTONIC_FAST) Last value from gettimeofday(2): 1212380080.620649 Last value from time(3): 1212380161 Last value from clock_gettime(2/CLOCK_VIRTUAL): 2.296430000 Last value from clock_gettime(2/CLOCK_SECOND): 1212380338.000000000 Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212380243.461081040 Last value from clock_gettime(2/CLOCK_REALTIME): 1212380240.459788612 Last value from clock_gettime(2/CLOCK_PROF): 185.560343000 Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 5747219.271879584 Last value from clock_gettime(2/CLOCK_MONOTONIC): 5747216.886509281 [1] http://sean.chittenden.org/pubfiles/freebsd/bench_time.c [2] sys/time.h comments about precision. http://fxr.watson.org/fxr/source/sys/time.h#L269 -- Sean Chittenden sean@chittenden.org http://sean.chittenden.org/ From kometen at gmail.com Mon Jun 2 08:26:28 2008 From: kometen at gmail.com (Claus Guttesen) Date: Mon Jun 2 08:26:32 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> Message-ID: > I wrote a small micro-benchmark utility[1] to test various time syscalls and > the results were a bit surprising to me. The results were from a UP machine > and I believe that the difference between gettimeofday(2) and > clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and > performance would've degraded further with each additional core. 
> > clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most > authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of > CLOCK_REALTIME's value[2]). In fact, I'd assume that CLOCK_REALTIME_FAST is > just as accurate as Linux's gettimeofday(2) (a statement I can't back up, > but believe is likely to be correct) and therefore there isn't much harm (if > any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more > widespread use vs. gettimeofday(2). FYI. -sc > > PS Is there a reason that time(3) can't be implemented in terms of > clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to time_t's > whole second resolution. > > % ./bench_time 9079882 | sort -rnk1 > Timing micro-benchmark. 9079882 syscall iterations. > Avg. us/call Elapsed Name > 9.322484 84.647053 gettimeofday(2) > 8.955324 81.313291 time(3) > 8.648315 78.525684 clock_gettime(2/CLOCK_REALTIME) > 8.598495 78.073325 clock_gettime(2/CLOCK_MONOTONIC) > 0.674194 6.121600 clock_gettime(2/CLOCK_PROF) > 0.648083 5.884515 clock_gettime(2/CLOCK_VIRTUAL) > 0.330556 3.001412 clock_gettime(2/CLOCK_REALTIME_FAST) > 0.306514 2.783111 clock_gettime(2/CLOCK_SECOND) > 0.262788 2.386085 clock_gettime(2/CLOCK_MONOTONIC_FAST) > Last value from gettimeofday(2): 1212380080.620649 > Last value from time(3): 1212380161 > Last value from clock_gettime(2/CLOCK_VIRTUAL): 2.296430000 > Last value from clock_gettime(2/CLOCK_SECOND): 1212380338.000000000 > Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212380243.461081040 > Last value from clock_gettime(2/CLOCK_REALTIME): 1212380240.459788612 > Last value from clock_gettime(2/CLOCK_PROF): 185.560343000 > Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 5747219.271879584 > Last value from clock_gettime(2/CLOCK_MONOTONIC): 5747216.886509281 rozetta~/devel/c%>sysctl hw.model hw.model: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1 Timing micro-benchmark. 9079882 syscall iterations. 
Avg. us/call Elapsed Name 1.405469 12.761494 clock_gettime(2/CLOCK_REALTIME) 1.313101 11.922799 time(3) 1.305518 11.853953 clock_gettime(2/CLOCK_MONOTONIC) 1.303947 11.839681 gettimeofday(2) 0.442908 4.021557 clock_gettime(2/CLOCK_PROF) 0.436484 3.963223 clock_gettime(2/CLOCK_VIRTUAL) 0.217718 1.976851 clock_gettime(2/CLOCK_MONOTONIC_FAST) 0.215264 1.954571 clock_gettime(2/CLOCK_REALTIME_FAST) 0.211779 1.922932 clock_gettime(2/CLOCK_SECOND) Value from time(3): 1212391638 Last value from gettimeofday(2): 1212391626.146308 Last value from clock_gettime(2/CLOCK_VIRTUAL): 4.179847000 Last value from clock_gettime(2/CLOCK_SECOND): 1212391676.000000000 Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212391652.785214038 Last value from clock_gettime(2/CLOCK_REALTIME): 1212391650.830730996 Last value from clock_gettime(2/CLOCK_PROF): 60.276182000 Last value from clock_gettime(2/CLOCK_MONOTONIC_FAST): 1190915.000747909 Last value from clock_gettime(2/CLOCK_MONOTONIC): 1190913.024357334 gettimeofday is 6 times slower on this system, 28 times slower on your system. -- regards Claus When lenity and cruelty play for a kingdom, the gentlest gamester is the soonest winner. Shakespeare From gary at velocity-servers.net Mon Jun 2 06:06:24 2008 From: gary at velocity-servers.net (Gary Stanley) Date: Mon Jun 2 09:35:01 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> Message-ID: <20080602060624.93F5F8FC4A@mx1.freebsd.org> At 12:54 AM 6/2/2008, Sean Chittenden wrote: >I wrote a small micro-benchmark utility[1] to test various time >syscalls and the results were a bit surprising to me. The results >were from a UP machine and I believe that the difference between >gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've been >bigger on an SMP system and performance would've degraded further with >each additional core. 
> >clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for >most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- >10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume that >CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2) (a >statement I can't back up, but believe is likely to be correct) and >therefore there isn't much harm (if any) in seeing clock_gettime(2) + >CLOCK_REALTIME_FAST receive more widespread use vs. gettimeofday(2). >FYI. -sc > >PS Is there a reason that time(3) can't be implemented in terms of >clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >time_t's whole second resolution. Another interesting idea is to map gettimeofday() to userland, sort of like darwin (commpage) and linux (vsyscall) via read only page. Can you try changing microtime() in kern_time.c:gettimeofday() to getmicrotime() to see if your benchmarks change any? Also; what clock are you using for your benchmarks? ACPI? TSC? From brde at optusnet.com.au Mon Jun 2 10:55:36 2008 From: brde at optusnet.com.au (Bruce Evans) Date: Mon Jun 2 10:55:39 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <20080602060624.93F5F8FC4A@mx1.freebsd.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602060624.93F5F8FC4A@mx1.freebsd.org> Message-ID: <20080602203217.T3100@delplex.bde.org> On Mon, 2 Jun 2008, Gary Stanley wrote: > At 12:54 AM 6/2/2008, Sean Chittenden wrote: >> PS Is there a reason that time(3) can't be implemented in terms of >> clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >> time_t's whole second resolution. > > Another interesting idea is to map gettimeofday() to userland, sort of like > darwin (commpage) and linux (vsyscall) via read only page. time() can reasonably be implemented like that, but not gettimeofday(). gettimeofday() should have an accuracy of 1 usec and it returns a large data structure that cannot be locked by simple atomic accesses. 
The read-only page would have to be updated millions of times per second or take a pagefault to access to give the same functionality as FreeBSD gettimeofday(). The updates would cost about 100% of 1 CPU. Other CPUs could then read the time using locking like that in binuptime() but more complicated (needs an atomic update for at least the generation count, and probably more). The pagefaults would give a smaller pessimization (I guess slightly longer to reach microtime() than via the current syscall, and identical time in microtime() to do the update on demand). Bruce From brde at optusnet.com.au Mon Jun 2 11:25:40 2008 From: brde at optusnet.com.au (Bruce Evans) Date: Mon Jun 2 11:25:47 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> Message-ID: <20080602205953.X3162@delplex.bde.org> On Mon, 2 Jun 2008, Claus Guttesen wrote: >> % ./bench_time 9079882 | sort -rnk1 >> Timing micro-benchmark. 9079882 syscall iterations. >> Avg. us/call Elapsed Name >> 9.322484 84.647053 gettimeofday(2) >> 8.955324 81.313291 time(3) >> 8.648315 78.525684 clock_gettime(2/CLOCK_REALTIME) >> 8.598495 78.073325 clock_gettime(2/CLOCK_MONOTONIC) >> 0.674194 6.121600 clock_gettime(2/CLOCK_PROF) >> 0.648083 5.884515 clock_gettime(2/CLOCK_VIRTUAL) >> 0.330556 3.001412 clock_gettime(2/CLOCK_REALTIME_FAST) >> 0.306514 2.783111 clock_gettime(2/CLOCK_SECOND) >> 0.262788 2.386085 clock_gettime(2/CLOCK_MONOTONIC_FAST) In previous mail, I said that these were very slow. > rozetta~/devel/c%>sysctl hw.model > hw.model: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz > > rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1 > Timing micro-benchmark. 9079882 syscall iterations. > Avg. 
us/call Elapsed Name > 1.405469 12.761494 clock_gettime(2/CLOCK_REALTIME) > 1.313101 11.922799 time(3) > 1.305518 11.853953 clock_gettime(2/CLOCK_MONOTONIC) > 1.303947 11.839681 gettimeofday(2) > 0.442908 4.021557 clock_gettime(2/CLOCK_PROF) > 0.436484 3.963223 clock_gettime(2/CLOCK_VIRTUAL) > 0.217718 1.976851 clock_gettime(2/CLOCK_MONOTONIC_FAST) > 0.215264 1.954571 clock_gettime(2/CLOCK_REALTIME_FAST) > 0.211779 1.922932 clock_gettime(2/CLOCK_SECOND) These seem about right for a normal untuned ~2GHz system: - there is a syscall overhead of about 200 nsec - the hardware parts of the ACPI (?) timecounter are very slow, so they add 1100 nsec - anomalous extra 100 nsec for CLOCK_REALTIME. CLOCK_REALTIME does less than gettimeofday(). - CLOCK_PROF and CLOCK_VIRTUAL use the slow function calcru() in the kernel. This apparently takes about the same time as a syscall. calcru() uses cpu_ticks() (which normally uses the TSC on i386 and amd64) to determine the time spent since the thread was last context switched, so it is more accurate than CLOCK_REALTIME_FAST but less accurate than CLOCK_REALTIME; using the TSC makes it faster than a non-TSC timecounter. calcru() still seems to have broken accounting for the current timeslice in other running threads in the process. > gettimeofday is 6 times slower on this system, 28 times slower on your system. 1.epsilon times slower on my system :-). Bruce From brde at optusnet.com.au Mon Jun 2 17:11:09 2008 From: brde at optusnet.com.au (Bruce Evans) Date: Mon Jun 2 17:11:14 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> Message-ID: <20080602182214.I2764@delplex.bde.org> On Sun, 1 Jun 2008, Sean Chittenden wrote: > I wrote a small micro-benchmark utility[1] to test various time syscalls and > the results were a bit surprising to me. 
The results were from a UP machine > and I believe that the difference between gettimeofday(2) and > clock_gettime(CLOCK_REALTIME_FAST) would've been bigger on an SMP system and > performance would've degraded further with each additional core. I wouldn't expect SMP to make much difference between CLOCK_REALTIME and CLOCK_REALTIME_FAST. The only difference is that the former calls nanotime() where the latter calls getnanotime(). nanotime() always does more, but it doesn't have any extra SMP overheads in most cases (in rare cases like i386 using the i8254 timecounter, it needs to lock accesses to the timecounter hardware). gettimeofday() always does more than CLOCK_REALTIME, but again no more for SMP. > clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most > authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of > CLOCK_REALTIME's value[2]). In fact, I'd assume that CLOCK_REALTIME_FAST is > just as accurate as Linux's gettimeofday(2) (a statement I can't back up, but > believe is likely to be correct) and therefore there isn't much harm (if any) > in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use > vs. gettimeofday(2). FYI. -sc The existence of most of CLOCK_* is a bug. I wouldn't use CLOCK_REALTIME_FAST for anything (if only because it doesn't exist in most kernels that I run. I switched from using gettimeofday() to CLOCK_REALTIME many years ago when syscalls started taking less than 1 usec and still occasionally have problems from this running old kernels, because old i386 kernels don't support CLOCK_REALTIME and old amd64 kernels have a broken CLOCK_REALTIME in 32-bit mode). > PS Is there a reason that time(3) can't be implemented in terms of > clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to time_t's > whole second resolution. 
I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
accuracy timers provided by the get*time() family are accurate enough
to give the time in seconds. Unfortunately, they are still broken --
they are all incoherent relative to nanotime() and some are incoherent
relative to each other. CLOCK_SECOND can lag the time in seconds given
by nanotime() by up to tc_tick/HZ seconds. This is because CLOCK_SECOND
returns the time in seconds at the last tc_windup(), so it misses seeing
rollovers of the second in the interval between the rollover and the next
tc_windup(), while nanotime() doesn't miss seeing these rollovers. The
two therefore give incoherent times, with nanotime()/CLOCK_REALTIME being
correct and time_second/CLOCK_SECOND broken.

vfs_timestamp() already defaults to using time_second, so it gives times
incoherent with time() since the latter still uses gettimeofday(). Some
file system test programs see this incoherency and I run them with
vfs.timestamp.precision=3 (nanotime()) to avoid it. File systems were
micro-optimized to use time_second (now not so micro-optimized to use
vfs_timestamp() which defaults to using time_second), but micro-pessimizing
them to use nanotime() makes no significant difference. This is because
most file system timestamp updates are cached (delayed until the next
sync or disk write), and in cases where the updates are written to disk
the time to read the clock is in the noise relative to the time for the
disk write.

> % ./bench_time 9079882 | sort -rnk1
> Timing micro-benchmark. 9079882 syscall iterations.
> Avg. us/call  Elapsed    Name
> 9.322484      84.647053  gettimeofday(2)
> 8.955324      81.313291  time(3)
> 8.648315      78.525684  clock_gettime(2/CLOCK_REALTIME)
> 8.598495      78.073325  clock_gettime(2/CLOCK_MONOTONIC)
> 0.674194      6.121600   clock_gettime(2/CLOCK_PROF)
> 0.648083      5.884515   clock_gettime(2/CLOCK_VIRTUAL)
> 0.330556      3.001412   clock_gettime(2/CLOCK_REALTIME_FAST)
> 0.306514      2.783111   clock_gettime(2/CLOCK_SECOND)
> 0.262788      2.386085   clock_gettime(2/CLOCK_MONOTONIC_FAST)

These are very slow. Are they on a 486? :-) I get about 262 ns for
CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems.
The syscall overhead is about 200 nsec (170 nsec for a simpler syscall
and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading
the TSC timecounter adds another 60 nsec, including a whole 6 nsec for
the hardware part of the read (perhaps more like 30 nsec than 60 for the
whole read). The TSC doesn't work on all machines (never for SMP), but
this will hopefully change. (Phenom is supposed to have TSCs that are
coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+
to implement this :-(. Core2 already has a 40+ cycles rdtsc, but AFAIK
it doesn't have coherent TSCs.) Other timecounters are much slower than
the TSC, but I haven't seen one take 8000 nsec since 486 days.

Some of my benchmark results:

2.205GHz A64 in 32-bit mode, VIA motherboard:
%%%
2008/01/05 (TSC) bde-current, -O2 -mcpu=athlon-xp
min 240, max 77658, mean 242.171787, std 65.655259
2007/11/23 (TSC) bde-current
min 247, max 11890, mean 247.857786, std 62.559317
2007/05/19 (TSC) plain -current-noacpi
min 262, max 286965, mean 263.941187, std 41.801400
2007/05/19 (TSC) plain -current-acpi
min 261, max 68926, mean 279.848650, std 40.477440
2007/05/19 (ACPI-fast timecounter) plain -current-acpi
min 558, max 285494, mean 827.597038, std 78.322301
2007/05/19 (i8254) plain -current-acpi
min 3352, max 288288, mean 4182.774148, std 257.977752
%%%

These times are for CLOCK_REALTIME, in nsec.
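
The program behind the figures above is not shown in this thread. As a
rough sketch only, statistics of the same shape (min, max, mean and std
of the time per call, in nsec) could be gathered on a POSIX system along
the following lines. The names lat_stats, measure_clock, ts_to_ns and
sqrt_newton are hypothetical, and treating the difference between two
consecutive readings as the cost of one call is an assumption, not a
description of how the numbers above were produced.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical container for the summary; not from the thread. */
struct lat_stats {
    int64_t min_ns;   /* smallest gap between consecutive readings */
    int64_t max_ns;   /* largest gap between consecutive readings */
    double  mean_ns;  /* arithmetic mean of the gaps */
    double  std_ns;   /* population standard deviation of the gaps */
};

static int64_t ts_to_ns(const struct timespec *ts)
{
    return (int64_t)ts->tv_sec * 1000000000LL + (int64_t)ts->tv_nsec;
}

/* Newton iteration for sqrt so the sketch links without -lm.
 * Adequate for the magnitudes seen here; not a general replacement. */
static double sqrt_newton(double x)
{
    if (x <= 0.0)
        return 0.0;
    double r = x > 1.0 ? x : 1.0;
    for (int i = 0; i < 64; i++)
        r = 0.5 * (r + x / r);
    return r;
}

/* Read clock `id` n+1 times back to back and summarize the n gaps.
 * Returns 0 on success, -1 if n < 1 or clock_gettime() fails. */
static int measure_clock(clockid_t id, long n, struct lat_stats *out)
{
    struct timespec prev, cur;
    double mean = 0.0, m2 = 0.0;

    assert(out != NULL);
    if (n < 1 || clock_gettime(id, &prev) != 0)
        return -1;

    out->min_ns = INT64_MAX;
    out->max_ns = INT64_MIN;
    for (long k = 1; k <= n; k++) {
        if (clock_gettime(id, &cur) != 0)
            return -1;
        int64_t d = ts_to_ns(&cur) - ts_to_ns(&prev);
        if (d < out->min_ns) out->min_ns = d;
        if (d > out->max_ns) out->max_ns = d;

        /* Welford's running mean and sum of squared deviations. */
        double delta = (double)d - mean;
        mean += delta / (double)k;
        m2 += delta * ((double)d - mean);
        prev = cur;
    }
    out->mean_ns = mean;
    out->std_ns = sqrt_newton(m2 / (double)n);
    return 0;
}
```

A caller would do something like measure_clock(CLOCK_REALTIME, 1000000,
&s) and print the four fields. The loop overhead and the cost of the
subtraction are included in each gap, so the results are an upper bound
on the cost of the call itself.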
This system has fairly fast ACPI and i8254 timecounters. 800-1500
nsec is more typical for ACPI-fast, and 4000-5000 is more typical
for i8254. ACPI-fast should be named ACPI-not-very-slow. ACPI-safe
is very slow, perhaps slower than i8254. i8254 could be made about
twice as fast if anyone cared.

133MHz P1:
%%%
1996/07/12:          min 3, max 472, mean 3.320346, std 0.694846
1998/02/21 pre-phk:  min 3, max 595, mean 3.443382, std 0.767383
1998/02/21 post-phk: min 4, max 99, mean 4.614527, std 0.710407
1999/12/04:          min 4, max 120, mean 4.630231, std 0.777733
2000/09/29:          min 5, max 203, mean 5.376130, std 1.912127
2001/05/19:          min 6, max 1715, mean 6.783378, std 2.015211
2001/09/02:          min 5, max 482, mean 5.474384, std 2.683939
%%%

These times are for gettimeofday(). Note that these are now in usec.
The timecounter is always the TSC (post-phk) or uses the TSC more
directly (pre-phk). These times serve mainly to document time bloat
due to timecounters and SMPng. The P1 has limited caching and suffers
more from longer code paths than new CPUs.

66MHz 486DX2:
%%%
1995/11/03: min 13, max 171, mean 14.286634, std 1.836667
2000/11/15: min 20, max 542, mean 21.843003, std 8.003137
%%%

Here the timecounter is always the i8254. These times serve mainly as
a reminder of how slow old machines were. The i8254 timecounter
hardware didn't take any longer back then (it was probably faster,
since old machines didn't have PCI bridges, and they had tunable ISA
wait states which I tuned), but a simple syscall took 7.2 usec and
gettimeofday() took much longer. The bloat between 1995 and 2000 was
relatively similar to that on the P1 system.

Other implementation bugs (all in clock_getres()):
- all of the clock ids that use getnanotime() claim a resolution of 1
  nsec, but that is bogus. The actual resolution is more like tc_tick/HZ.
  The extra resolution in a struct timespec is only used to return
  garbage related to the incoherency of the clocks. (If it could be
  arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
  the clocks would be coherent and the times would always be a multiple
  of tc_tick/HZ, with no garbage in low bits.)
- CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
  bogus. The actual resolution is more like 1/stathz, or perhaps 1
  microsecond. hz is irrelevant here since statclock ticks are used.
  statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
  correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
  calcru() calculates the time to a resolution of 1 usec; it is just
  very inaccurate at that resolution.

"Resolution" is a poor term for the functionality needed here. I think
a hint about the accuracy is more important. In simple implementations
using interrupts and ticks, the accuracy would be about the same as
the resolution, but FreeBSD is more complicated.

Bruce

From sean at chittenden.org Mon Jun 2 19:05:42 2008
From: sean at chittenden.org (Sean Chittenden)
Date: Mon Jun 2 19:05:52 2008
Subject: Micro-benchmark for various time syscalls...
In-Reply-To: <20080602182214.I2764@delplex.bde.org>
References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602182214.I2764@delplex.bde.org>
Message-ID:

>> I wrote a small micro-benchmark utility[1] to test various time
>> syscalls and the results were a bit surprising to me. The results
>> were from a UP machine and I believe that the difference between
>> gettimeofday(2) and clock_gettime(CLOCK_REALTIME_FAST) would've
>> been bigger on an SMP system and performance would've degraded
>> further with each additional core.
>
> I wouldn't expect SMP to make much difference between CLOCK_REALTIME
> and CLOCK_REALTIME_FAST. The only difference is that the former calls
> nanotime() where the latter calls getnanotime().
> nanotime() always does more, but it doesn't have any extra SMP
> overheads in most cases (in rare cases like i386 using the i8254
> timecounter, it needs to lock accesses to the timecounter hardware).
> gettimeofday() always does more than CLOCK_REALTIME, but again no
> more for SMP.

You may be right, I can only speculate. Going off of phk@'s
rhetorical questions regarding gettimeofday(2) working across
cores/threads, I assumed there would be some kind of synchronization.

http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html

>> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for
>> most authors (CLOCK_REALTIME_FAST is supposed to be precise to +/-
>> 10ms of CLOCK_REALTIME's value[2]). In fact, I'd assume that
>> CLOCK_REALTIME_FAST is just as accurate as Linux's gettimeofday(2)
>> (a statement I can't back up, but believe is likely to be correct)
>> and therefore there isn't much harm (if any) in seeing
>> clock_gettime(2) + CLOCK_REALTIME_FAST receive more widespread use
>> vs. gettimeofday(2). FYI. -sc
>
> The existence of most of CLOCK_* is a bug. I wouldn't use
> CLOCK_REALTIME_FAST for anything (if only because it doesn't exist
> in most kernels that I run.

I think that's debatable, actually. I modified my little
micro-benchmark program to test the realtime values returned from each
execution and found that CLOCK_REALTIME_FAST likely updates itself
sufficiently frequently for most applications (not all, but most). My
test ensures that time doesn't go backwards and tallies the number of
times that the values are identical. It'd be nice if
CLOCK_REALTIME_FAST incremented by a small and reasonable fudge factor
every time it's invoked, so that the values aren't identical.
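
The check described above (time never goes backwards, count how often
two consecutive readings are identical) can be sketched roughly as
follows. This is not the code from bench_clock_realtime.c; the names
clock_tally, tally_clock and ts_cmp are made up for illustration, and
the clock id is left as a parameter so that a FreeBSD-specific id such
as CLOCK_REALTIME_FAST can be passed where the system headers define it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>

/* Hypothetical tally of how consecutive readings compare. */
struct clock_tally {
    long equal;      /* reading identical to the previous one */
    long backwards;  /* reading earlier than the previous one */
    long forward;    /* reading later than the previous one */
};

/* Three-way comparison of two timespecs: -1, 0 or 1. */
static int ts_cmp(const struct timespec *a, const struct timespec *b)
{
    if (a->tv_sec != b->tv_sec)
        return a->tv_sec < b->tv_sec ? -1 : 1;
    if (a->tv_nsec != b->tv_nsec)
        return a->tv_nsec < b->tv_nsec ? -1 : 1;
    return 0;
}

/* Read clock `id` n+1 times and classify each of the n transitions.
 * Returns 0 on success, -1 if clock_gettime() fails. */
static int tally_clock(clockid_t id, long n, struct clock_tally *out)
{
    struct timespec prev, cur;

    assert(out != NULL && n >= 0);
    out->equal = out->backwards = out->forward = 0;
    if (clock_gettime(id, &prev) != 0)
        return -1;

    for (long i = 0; i < n; i++) {
        if (clock_gettime(id, &cur) != 0)
            return -1;
        int c = ts_cmp(&cur, &prev);
        if (c == 0)
            out->equal++;
        else if (c < 0)
            out->backwards++;
        else
            out->forward++;
        prev = cur;
    }
    return 0;
}
```

For a clock that is only refreshed periodically, equal is expected to
dominate and forward roughly counts the refreshes seen during the run.
A non-zero backwards count on a wall clock is not necessarily a bug,
since the system time can be stepped while the loop runs.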

On my machine, I can make 100K gettimeofday(2) calls per second compared
to 3M CLOCK_REALTIME_FAST calls per second, which is a significantly
large delta when you're aiming for software that's handling around
40-50Kpps and want to include time information periodically (see above
comment about a fudge factor being included after every call *grin*).

http://sean.chittenden.org/pubfiles/freebsd/bench_clock_realtime.c

% ./bench_clock_realtime 9079882 | sort -rnk1
clock realtime micro-benchmark. 9079882 syscall iterations.
Avg. us/call  Elapsed    Name
9.317078      84.597968  gettimeofday(2)
8.960372      81.359120  time(3)
8.776467      79.689287  clock_gettime(2/CLOCK_REALTIME)
0.332357      3.017763   clock_gettime(2/CLOCK_REALTIME_FAST)
0.311705      2.830246   clock_gettime(2/CLOCK_SECOND)
Value from time(3): 1212427374
Last value from gettimeofday(2): 1212427293.590511 Equal: 0
Last value from clock_gettime(2/CLOCK_SECOND): 1212427460.000000000 Equal: 9079878
Last value from clock_gettime(2/CLOCK_REALTIME_FAST): 1212427457.656410126 Equal: 9078198
Last value from clock_gettime(2/CLOCK_REALTIME): 1212427454.639076390 Equal: 0

% irb
>> tot = 9079882
=> 9079882
>> eq = 9078198
=> 9078198
>> tot - eq
=> 1684
>> time = 3.017763
=> 3.017763
>> (tot - eq) / time
=> 558.029242190324
>> tot / time
=> 3008812.15655437 # number of CLOCK_REALTIME_FAST calls per second
>> tot / 84.597968
=> 107329.788346689 # number of gettimeofday(2) calls per second

> I switched from using gettimeofday() to CLOCK_REALTIME many years
> ago when syscalls started taking less than 1 usec and still
> occasionally have problems from this running old kernels, because
> old i386 kernels don't support CLOCK_REALTIME and old amd64 kernels
> have a broken CLOCK_REALTIME in 32-bit mode).

Entirely possible that's why things are more expensive on my test machine.
% sysctl hw.model hw.model: AMD Athlon(tm) 64 Processor 3500+ % uname -a FreeBSD dev2.office.chittenden.org 7.0-RELEASE FreeBSD 7.0-RELEASE #0: Sun Feb 24 10:35:36 UTC 2008 root@driscoll.cse.buffalo.edu:/usr/ obj/usr/src/sys/GENERIC amd64 >> PS Is there a reason that time(3) can't be implemented in terms of >> clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >> time_t's whole second resolution. > > I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low > accuracy timers provided by the get*time() family are accurate enough > to give the time in seconds. Unfortunately, they are still broken -- > they are all incoherent relative to nanotime() and some are incoherent > relative to each other. CLOCK_SECOND can lag the time in seconds > given > by up to tc_tick/HZ seconds. This is because CLOCK_SECOND returns the > time in seconds at the last tc_windup(), so it misses seeing rollovers > of the second in the interval between the rollover and the next > tc_windup(), while nanotime() doesn't miss seeing these rollovers so > it gives incoherent times, with nanotime()/CLOCK_REALTIME being > correct > and time_second/CLOCK_SECOND broken. Interesting. Incoherent, but accurate enough? We're talking about a <10ms window of incoherency, right? >> % ./bench_time 9079882 | sort -rnk1 >> Timing micro-benchmark. 9079882 syscall iterations. >> Avg. us/call Elapsed Name >> 9.322484 84.647053 gettimeofday(2) >> 8.955324 81.313291 time(3) >> 8.648315 78.525684 clock_gettime(2/CLOCK_REALTIME) >> 8.598495 78.073325 clock_gettime(2/CLOCK_MONOTONIC) >> 0.674194 6.121600 clock_gettime(2/CLOCK_PROF) >> 0.648083 5.884515 clock_gettime(2/CLOCK_VIRTUAL) >> 0.330556 3.001412 clock_gettime(2/CLOCK_REALTIME_FAST) >> 0.306514 2.783111 clock_gettime(2/CLOCK_SECOND) >> 0.262788 2.386085 clock_gettime(2/CLOCK_MONOTONIC_FAST) > > These are very slow. Are they on a 486? :-) I get about 262 ns for > CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems. 
> The syscall overhead is about 200 nsec (170 nsec for a simpler syscall > and maybe 30 nsec extra for copyin/out for clock_gettime()) and > reading > the TSC timecounter adds another 60 nsec, including a whole 6 nsec for > the hardware part of the read (perhaps more like 30 nsec than 60 for > the > whoe read). The TSC doesn't work on all machines (never for SMP), but > this will hopefully change. (Phenom is supposed to have TSCs that are > coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+ > to implement this :-(. Core2 already has a 40+ cycles rdtsc, but > AFAIK > it doesn't have coherent TSCs.) Other timecounters are much slower > than > the TSC, but I haven't seen one take 8000 nsec since 486 days. *shrug* elapsed / number of calls. Not doing anything fancy here. > Some of my benchmark results: Can I run this same test/see how this was written? > This system has a fairly fast ACPI and i8254 timecounters. 1500-800 > nsec is more typical for ACPI-fast, and 4000-5000 is more typical > for i8254. ACPI-fast should be named ACPI-not-very-slow. ACPI-safe > is very slow, perhaps slower than i8254. i8254 could be made about > twice as fast if anyone cared. Hrm. % sysctl -a | grep -i acpi_timer machdep.acpi_timer_freq: 3579545 dev.acpi_timer.0.%desc: 24-bit timer at 3.579545MHz dev.acpi_timer.0.%driver: acpi_timer dev.acpi_timer.0.%location: unknown dev.acpi_timer.0.%pnpinfo: unknown dev.acpi_timer.0.%parent: acpi0 % sysctl -a | grep -i tsc kern.timecounter.choice: TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000) kern.timecounter.tc.TSC.mask: 4294967295 kern.timecounter.tc.TSC.counter: 2749242907 kern.timecounter.tc.TSC.frequency: 2222000000 kern.timecounter.tc.TSC.quality: 800 kern.timecounter.smp_tsc: 0 machdep.tsc_freq: 2222000000 > Other implementation bugs (all in clock_getres()): > - all of the clock ids that use getnanotime() claim a resolution of 1 > nsec, but that us bogus. The actual resolution is more like > tc_tick/HZ. 
> The extra resolution in a struct timespec is only used to return > garbage related to the incoherency of the clocks. (If it could be > arranged that tc_windup() always ran on a tc_tick/HZ boundary, then > the clocks would be coherent and the times would always be a multiple > of tc_tick/HZ, with no garbage in low bits.) > - CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is > bogus. The actual resolution is more like 1/stathz, or perhaps 1 > microsecond. hz is irrelevant here since statclock ticks are used. > statclock ticks only have a resolution of 1/stathz, but if 1 nsec is > correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since > caclru() calculates the time to a resolution of 1 usec; it is just > very inaccurate at that resolution. > "Resolution" is a poor term for the functionality needed here. I > think > a hint about the accuracy is more important. In simple > implementations > using interrupts and ticks, the accuracy would be about the the same > as > the resolution, but FreeBSD is more complicated. Is there any reason that the garbage resolution can't be zero'ed out to indicate confidence of the kernel in the precision of the information? -sc -- Sean Chittenden sean@chittenden.org http://sean.chittenden.org/ From sean at chittenden.org Mon Jun 2 19:11:29 2008 From: sean at chittenden.org (Sean Chittenden) Date: Mon Jun 2 19:11:36 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <20080602205953.X3162@delplex.bde.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602205953.X3162@delplex.bde.org> Message-ID: >> rozetta~/devel/c%>sysctl hw.model >> hw.model: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz >> >> rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1 >> Timing micro-benchmark. 9079882 syscall iterations. >> Avg. 
us/call Elapsed Name >> 1.405469 12.761494 clock_gettime(2/CLOCK_REALTIME) >> 1.313101 11.922799 time(3) >> 1.305518 11.853953 clock_gettime(2/CLOCK_MONOTONIC) >> 1.303947 11.839681 gettimeofday(2) >> 0.442908 4.021557 clock_gettime(2/CLOCK_PROF) >> 0.436484 3.963223 clock_gettime(2/CLOCK_VIRTUAL) >> 0.217718 1.976851 clock_gettime(2/CLOCK_MONOTONIC_FAST) >> 0.215264 1.954571 clock_gettime(2/CLOCK_REALTIME_FAST) >> 0.211779 1.922932 clock_gettime(2/CLOCK_SECOND) > > These seem about right for a normal untuned ~2GHz system: This begs the question, tuning for time calls. Do you have a best practice that you use for reducing the cost of time calls? -sc -- Sean Chittenden sean@chittenden.org http://sean.chittenden.org/ From gary at velocity-servers.net Mon Jun 2 19:51:25 2008 From: gary at velocity-servers.net (Gary Stanley) Date: Mon Jun 2 19:58:47 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <20080602203217.T3100@delplex.bde.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602060624.93F5F8FC4A@mx1.freebsd.org> <20080602203217.T3100@delplex.bde.org> Message-ID: <20080602195125.245858FC1D@mx1.freebsd.org> At 06:55 AM 6/2/2008, Bruce Evans wrote: >On Mon, 2 Jun 2008, Gary Stanley wrote: > >>At 12:54 AM 6/2/2008, Sean Chittenden wrote: >>>PS Is there a reason that time(3) can't be implemented in terms of >>>clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >>>time_t's whole second resolution. >> >>Another interesting idea is to map gettimeofday() to userland, sort >>of like darwin (commpage) and linux (vsyscall) via read only page. > >time() can reasonably be implemented like that, but not gettimeofday(). >gettimeofday() should have an accuracy of 1 usec and it returns a large >data structure that cannot be locked by simple atomic accesses. 
The >read-only page would have to be updated millions of times per second >or take a pagefault to access to give the same functionality as FreeBSD >gettimeofday(). The updates would cost about 100% of 1 CPU. Other >CPUs could then read the time using locking like that in binuptime() >but more complicated (needs an atomic update for at least the generation >count, and probably more). The pagefaults would give a smaller >pessimization (I guess slightly longer to reach microtime() than via >the current syscall, and identical time in microtime() to do the update >on demand). Here's a sloppy thought :) What about just rewriting gettimeofday in libc to query the TSC and convert it to usecs etc? That would eliminate any costly userland -> kernel overhead. I have a proof of concept here to do this. The only bad thing is the skewing of the TSC.. From gary at velocity-servers.net Mon Jun 2 19:55:52 2008 From: gary at velocity-servers.net (Gary Stanley) Date: Mon Jun 2 20:24:49 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <20080602182214.I2764@delplex.bde.org> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602182214.I2764@delplex.bde.org> Message-ID: <20080602195552.ADC778FC1F@mx1.freebsd.org> At 06:19 AM 6/2/2008, Bruce Evans wrote: >These are very slow. Are they on a 486? :-) I get about 262 ns for >CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems. >The syscall overhead is about 200 nsec (170 nsec for a simpler syscall >and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading >the TSC timecounter adds another 60 nsec, including a whole 6 nsec for >the hardware part of the read (perhaps more like 30 nsec than 60 for the >whoe read). The TSC doesn't work on all machines (never for SMP), but >this will hopefully change. (Phenom is supposed to have TSCs that are >coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+ >to implement this :-(. 
Core2 already has a 40+ cycles rdtsc, but AFAIK >it doesn't have coherent TSCs.) Other timecounters are much slower than >the TSC, but I haven't seen one take 8000 nsec since 486 days. Phenom's don't have TSCs that are coherent, as least on a few machines here: 4 CPUs, running 4 parallel test-tasks. checking for time-warps via: - read time stamp counter (RDTSC) instruction (cycle resolution) - gettimeofday (TOD) syscall (usec resolution) - clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution) new TSC-warp maximum: -4294919263 cycles, 00000000ffffe11b -> 0000000000009cbc new TSC-warp maximum: -4294919300 cycles, 00000000ffff74e4 -> 0000000000003060 | TSC: 2.24us, fail:3 | TOD: 2.24us, fail:0 | CLK: 2.24us, fail:0 | The code to test the TSC to check for warping: http://leaf.dragonflybsd.org/~gary/tests/time-warp-test.c However, it seems that Core2's don't have any warping of the TSC. I tested that code on a core2quad for 8 hours with no TSC failures. From brde at optusnet.com.au Tue Jun 3 08:03:14 2008 From: brde at optusnet.com.au (Bruce Evans) Date: Tue Jun 3 08:03:20 2008 Subject: Micro-benchmark for various time syscalls... In-Reply-To: <200806021951.m52JpQEd013447@mail14.syd.optusnet.com.au> References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org> <20080602060624.93F5F8FC4A@mx1.freebsd.org> <20080602203217.T3100@delplex.bde.org> <200806021951.m52JpQEd013447@mail14.syd.optusnet.com.au> Message-ID: <20080603175313.O6038@delplex.bde.org> On Mon, 2 Jun 2008, Gary Stanley wrote: > At 06:55 AM 6/2/2008, Bruce Evans wrote: >> On Mon, 2 Jun 2008, Gary Stanley wrote: >> >>> At 12:54 AM 6/2/2008, Sean Chittenden wrote: >>>> PS Is there a reason that time(3) can't be implemented in terms of >>>> clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to >>>> time_t's whole second resolution. >>> >>> Another interesting idea is to map gettimeofday() to userland, sort of >>> like darwin (commpage) and linux (vsyscall) via read only page. 
>>
>> time() can reasonably be implemented like that, but not gettimeofday().
>> gettimeofday() should have an accuracy of 1 usec and it returns a large
>> data structure that cannot be locked by simple atomic accesses...
>
> Here's a sloppy thought :) What about just rewriting gettimeofday in libc to
> query the TSC and convert it to usecs etc? That would eliminate any costly
> userland -> kernel overhead. I have a proof of concept here to do this.

This is hard enough to do in the kernel. The result is the TSC
timecounter, which is too hard to make work properly (coherently, without
interference from power saving and other things that change the clock
frequency, on arches that don't have a TSC, and on arches that have a
TSC whose access methods are spelled differently than on i386...), except
on some machines.

> The only bad thing is the skewing of the TSC..

Closer to impossible to handle in userland.

Of course, some userland benchmarks that don't need very precise timing
can just call rdtsc() and depend on the frequency not changing too much
while the benchmark is running. Process times in the kernel use
essentially this method.

Another complication with using the TSC is that rdtsc executes out of
order on many (i386/amd64) CPU types, so rdtsc's inside short sections
of code don't work right.

Bruce

From brde at optusnet.com.au Tue Jun 3 09:20:00 2008
From: brde at optusnet.com.au (Bruce Evans)
Date: Tue Jun 3 09:20:06 2008
Subject: Micro-benchmark for various time syscalls...
In-Reply-To: <200806021955.m52Jtqg2019409@mail14.syd.optusnet.com.au>
References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org>
 <20080602182214.I2764@delplex.bde.org>
 <200806021955.m52Jtqg2019409@mail14.syd.optusnet.com.au>
Message-ID: <20080603185459.T6038@delplex.bde.org>

On Mon, 2 Jun 2008, Gary Stanley wrote:

> At 06:19 AM 6/2/2008, Bruce Evans wrote:
>
>> These are very slow. Are they on a 486?
:-) I get about 262 ns for >> CLOCK_REALTIME using the TSC timecounter on all ~2GHz UP systems. >> The syscall overhead is about 200 nsec (170 nsec for a simpler syscall >> and maybe 30 nsec extra for copyin/out for clock_gettime()) and reading >> the TSC timecounter adds another 60 nsec, including a whole 6 nsec for >> the hardware part of the read (perhaps more like 30 nsec than 60 for the >> whoe read). The TSC doesn't work on all machines (never for SMP), but >> this will hopefully change. (Phenom is supposed to have TSCs that are >> coherent across CPUs, and rdtsc has slowed down from 12 cycles to 40+ >> to implement this :-(. Core2 already has a 40+ cycles rdtsc, but AFAIK >> it doesn't have coherent TSCs.) Other timecounters are much slower than >> the TSC, but I haven't seen one take 8000 nsec since 486 days. > > Phenom's don't have TSCs that are coherent, as least on a few machines here: According to the amd64 arch manual (volume 3 3.14 Sep 2007): If CPUID 8000_0007.edx[8] = 1, then [details about hardware states...] then the TSC is suitable for use as a source of time. Google shows support for this feature in at least Linux and Xen. Phenom also has a rdtscp instruction which is serializing. > 4 CPUs, running 4 parallel test-tasks. > checking for time-warps via: > - read time stamp counter (RDTSC) instruction (cycle resolution) > - gettimeofday (TOD) syscall (usec resolution) > - clock_gettime(CLOCK_MONOTONIC) syscall (nsec resolution) > > new TSC-warp maximum: -4294919263 cycles, 00000000ffffe11b -> > 0000000000009cbc > new TSC-warp maximum: -4294919300 cycles, 00000000ffff74e4 -> > 0000000000003060 > | TSC: 2.24us, fail:3 | TOD: 2.24us, fail:0 | CLK: 2.24us, fail:0 | The difference seems to be only about -0x6000, with an overflow bug in the test giving a value near -2^32. > The code to test the TSC to check for warping: > > http://leaf.dragonflybsd.org/~gary/tests/time-warp-test.c > However, it seems that Core2's don't have any warping of the TSC. 
I tested
> that code on a core2quad for 8 hours with no TSC failures.

Interesting. Please check the manual. I don't have current Intel arch
manuals handy.

Bruce

From brde at optusnet.com.au Tue Jun 3 09:32:02 2008
From: brde at optusnet.com.au (Bruce Evans)
Date: Tue Jun 3 09:32:06 2008
Subject: Micro-benchmark for various time syscalls...
In-Reply-To:
References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org>
 <20080602205953.X3162@delplex.bde.org>
Message-ID: <20080603192056.B6242@delplex.bde.org>

On Mon, 2 Jun 2008, Sean Chittenden wrote:

>>> rozetta~/devel/c%>sysctl hw.model
>>> hw.model: Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
>>>
>>> rozetta~/devel/c%>./bench_time 9079882 | sort -rnk1
>>> Timing micro-benchmark. 9079882 syscall iterations.
>>> Avg. us/call Elapsed Name
>>> 1.405469 12.761494 clock_gettime(2/CLOCK_REALTIME)
>>> ...
>>
>> These seem about right for a normal untuned ~2GHz system:
>
> This begs the question, tuning for time calls. Do you have a best practice
> that you use for reducing the cost of time calls? -sc

At least try all possible timecounters, and choose the one that works
best. Best == fastest and accurate enough. Best != highest quality
according to the kernel's hard-coded quality numbers. If it isn't obvious
whether a timecounter is accurate enough, ntp will tell you. This
normally means the TSC on UP systems without power management, and
ACPI-fast otherwise. The kernel quality parameter gives too much
preference to ACPI-fast.

Switching between all possible timecounters at runtime is easier in
reasonably recent versions of FreeBSD. Old versions didn't even list all
timecounters considered at boot time. Some timecounters, e.g., HPET and
of course ACPI* on non-ACPI systems, are not available even if the
hardware supports them unless they are configured at compile time or
boot time. It's hard to test the HPET counter on new FreeBSD cluster
machines because it is not configured, and it would require privilege to
use if it were configured but not selected.
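
One way to follow the "try them all and pick the fastest that is accurate enough" advice is to select each timecounter in turn (on FreeBSD the candidates are listed in the kern.timecounter.choice sysctl and the active one is set through kern.timecounter.hardware) and time a loop of clock_gettime() calls under each setting. The following is only a minimal sketch under stated assumptions: it uses nothing beyond POSIX clock_gettime(), the helper name avg_ns_per_call is made up for illustration, and it is not the bench_time.c used earlier in this thread.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: average cost in nanoseconds of one
 * clock_gettime(id, ...) call, measured over `iters` calls with
 * CLOCK_MONOTONIC as the reference clock.
 * Returns a negative value if the clock id is not supported.
 */
double
avg_ns_per_call(clockid_t id, long iters)
{
    struct timespec start, end, scratch;
    double elapsed_ns;
    long i;

    assert(iters > 0);
    if (clock_gettime(id, &scratch) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < iters; i++)
        clock_gettime(id, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (double)(end.tv_sec - start.tv_sec) * 1e9 +
        (double)(end.tv_nsec - start.tv_nsec);
    return elapsed_ns / (double)iters;
}
```

Calling avg_ns_per_call(CLOCK_REALTIME, 1000000) once per timecounter setting gives per-call figures comparable to the "Avg. us/call" column above (after dividing by 1000). This only measures speed; whether a counter is accurate enough still has to be judged separately, e.g. by watching ntp.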
Bruce

From brde at optusnet.com.au Tue Jun 3 10:14:24 2008
From: brde at optusnet.com.au (Bruce Evans)
Date: Tue Jun 3 10:14:29 2008
Subject: Micro-benchmark for various time syscalls...
In-Reply-To:
References: <2B465A44-2578-4675-AA17-EBE17A072017@chittenden.org>
 <20080602182214.I2764@delplex.bde.org>
Message-ID: <20080603193227.K6242@delplex.bde.org>

On Mon, 2 Jun 2008, Sean Chittenden wrote:

>> I wouldn't expect SMP to make much difference between CLOCK_REALTIME and
>> CLOCK_REALTIME_FAST. The only difference is that the former calls
>> nanotime() where the latter calls getnanotime(). nanotime() always does
>> more, but it doesn't have any extra SMP overheads in most cases (in rare
>> cases like i386 using the i8254 timecounter, it needs to lock accesses to
>> the timecounter hardware). gettimeofday() always does more than
>> CLOCK_REALTIME, but again no more for SMP.
>
> You may be right, I can only speculate. Going off of phk@'s rhetorical
> questions regarding gettimeofday(2) working across cores/threads, I assumed
> there would be some kind of synchronization.
>
> http://lists.freebsd.org/mailman/htdig/freebsd-current/2005-October/057280.html

The synchronization is all in binuptime(). It is quite delicate. It
depends mainly on an unlocked, non-atomically accessed generation count
for software synchronization, and on the hardware being
almost-automatically synchronized with itself for hardware
synchronization. It takes various magic for an unlocked, non-atomically
accessed generation count to work. Since it has no locking and executes
identical code for SMP and !SMP, it has identical overheads for SMP and
!SMP. Hardware is almost-automatically synchronized with itself by using
identical hardware for all CPUs. This is what breaks down for the TSC on
SMP systems (power management may affect both). Some hardware
timecounters like the i8254 require locking to give exclusive access to
the hardware.
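
The generation-count idea described above can be sketched in portable C. This is an illustration only, not the kernel's binuptime(): the names timehands_sketch, th_publish and th_read are hypothetical, a single writer is assumed, and the sketch uses C11 atomics and fences where the kernel relies on a plain, non-atomically accessed count plus careful ordering. A reader retries whenever the generation is 0 (update in progress) or changed while it was copying the data.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's timehands: one writer, many readers. */
struct timehands_sketch {
    atomic_uint gen;        /* 0 means "update in progress" */
    _Atomic uint64_t sec;
    _Atomic uint64_t frac;
};

/* Writer side (a tc_windup()-like update). Assumes a single writer. */
void
th_publish(struct timehands_sketch *th, uint64_t sec, uint64_t frac)
{
    unsigned next;

    assert(th != NULL);
    next = atomic_load_explicit(&th->gen, memory_order_relaxed) + 1;
    if (next == 0)          /* 0 is reserved, skip it on wraparound */
        next = 1;

    /* Mark the data as unstable before touching it. */
    atomic_store_explicit(&th->gen, 0, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&th->sec, sec, memory_order_relaxed);
    atomic_store_explicit(&th->frac, frac, memory_order_relaxed);
    /* Publish the new generation only after the data is in place. */
    atomic_store_explicit(&th->gen, next, memory_order_release);
}

/* Reader side: takes no lock, retries if an update raced with the copy. */
void
th_read(struct timehands_sketch *th, uint64_t *sec, uint64_t *frac)
{
    unsigned gen;

    assert(th != NULL && sec != NULL && frac != NULL);
    do {
        gen = atomic_load_explicit(&th->gen, memory_order_acquire);
        *sec = atomic_load_explicit(&th->sec, memory_order_relaxed);
        *frac = atomic_load_explicit(&th->frac, memory_order_relaxed);
        atomic_thread_fence(memory_order_acquire);
    } while (gen == 0 ||
        gen != atomic_load_explicit(&th->gen, memory_order_relaxed));
}
```

Readers execute the same code whether or not other CPUs exist, which is the point made above about identical overheads for SMP and !SMP. Note that th_read() spins until the first th_publish() has completed, since a zero-initialized structure has gen == 0.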
>>> clock_gettime(CLOCK_REALTIME_FAST) is likely the ideal function for most
>>> authors (CLOCK_REALTIME_FAST is supposed to be precise to +/- 10ms of
>>> CLOCK_REALTIME's value[2]). In fact, I'd assume that CLOCK_REALTIME_FAST
>>> is just as accurate as Linux's gettimeofday(2) (a statement I can't back
>>> up, but believe is likely to be correct) and therefore there isn't much
>>> harm (if any) in seeing clock_gettime(2) + CLOCK_REALTIME_FAST receive
>>> more widespread use vs. gettimeofday(2). FYI. -sc
>>
>> The existence of most of CLOCK_* is a bug. I wouldn't use
>> CLOCK_REALTIME_FAST for anything (if only because it doesn't exist in
>> most kernels that I run).
>
> I think that's debatable, actually. I modified my little micro-benchmark

It's debatable, but not with me :-).

> program to test the realtime values returned from each execution and found
> that CLOCK_REALTIME_FAST likely updates itself sufficiently frequently for
> most applications (not all, but most). My test ensures that time doesn't go
> backwards and tallies the number of times that the values are identical.
> It'd be nice if CLOCK_REALTIME_FAST incremented by a small and reasonable
> fudge factor every time it's invoked so that the values aren't identical.

I would probably go direct to the hardware if doing a large enough
number of measurements for clock granularity or access overheads to
matter. Otherwise, CLOCK_REALTIME or CLOCK_MONOTONIC is best. These are
easy to use and give the most accurate results possible.

>>> PS Is there a reason that time(3) can't be implemented in terms of
>>> clock_gettime(CLOCK_SECOND)? 10ms seems precise enough compared to
>>> time_t's whole second resolution.
>>
>> I might use CLOCK_SECOND (unlike CLOCK_REALTIME_FAST), since the low
>> accuracy timers provided by the get*time() family are accurate enough
Unfortunately, they are still broken -- >> they are all incoherent relative to nanotime() and some are incoherent >> relative to each other. CLOCK_SECOND can lag the time in seconds given >> by up to tc_tick/HZ seconds. This is because CLOCK_SECOND returns the >> time in seconds at the last tc_windup(), so it misses seeing rollovers >> of the second in the interval between the rollover and the next >> tc_windup(), while nanotime() doesn't miss seeing these rollovers so >> it gives incoherent times, with nanotime()/CLOCK_REALTIME being correct >> and time_second/CLOCK_SECOND broken. > > Interesting. Incoherent, but accurate enough? We're talking about a <10ms > window of incoherency, right? Yes. 10ms is a lot. It results in about 1 in every 100 timestamps being coherent, so my fs benchmark that tests for file times being coherent (it actually tests for ctime/mtime/atime updates happening in the correcy order when file times are incoherent with time(1)) doesn't have to run for very long to find an incoherency. After rounding the times to a seconds boundary, the amount of the incoherency is rounded up from 1-10ms to 1 second. Incoherencies of 1 second persist for the length of the window. The delicate locking in binuptime() doesn't allow the data structure updates that would be required to make all the access methods coherent. Full locking would probably be required for that. >> Some of my benchmark results: > > Can I run this same test/see how this was written? It is an old sanity test program by wollman which I've touched as little as possible, just to convert to CLOCK_REALTIME and to hack around some bugs involving array overruns which became larger with the larger range of values in nanoseconds. He probably doesn't want to see it, but I will include it here :-). 
%%%
#include <limits.h>
#include <math.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define N 2000000

int diffs[N];
int hist[N * 10];   /* XXX various assumptions on diffs */

int
main(void)
{
    int i, j;
    int min, max;
    double sum, mean, var, sumsq;
    struct timespec tv, otv;

    memset(diffs, '\0', sizeof diffs);  /* fault in whole array, we hope */
    for (i = 0; i < N; i++) {
        clock_gettime(CLOCK_REALTIME, &tv);
        /* Spin until the clock visibly advances, then record the step. */
        do {
            otv = tv;
            clock_gettime(CLOCK_REALTIME, &tv);
        } while (tv.tv_sec == otv.tv_sec && tv.tv_nsec == otv.tv_nsec);
        diffs[i] = tv.tv_nsec - otv.tv_nsec +
            1000000000 * (tv.tv_sec - otv.tv_sec);
    }

    min = INT_MAX;
    max = INT_MIN;
    sum = 0;
    sumsq = 0;
    for (i = 0; i < N; i++) {
        if (diffs[i] > max)
            max = diffs[i];
        if (diffs[i] < min)
            min = diffs[i];
        sum += diffs[i];
        /* Square in double: int * int overflows for steps above ~46 usec. */
        sumsq += (double)diffs[i] * diffs[i];
    }
    mean = sum / (double)N;
    var = (sumsq - 2 * mean * sum + sum * mean) / (double)N;
    printf("min %d, max %d, mean %f, std %f\n", min, max, mean, sqrt(var));

    for (i = 0; i < N; i++) {
        /* Skip steps outside the histogram instead of overrunning it. */
        if (diffs[i] >= 0 && diffs[i] < N * 10)
            hist[diffs[i]]++;
    }

    /* Report the 5 most common steps; only steps below N nsec are scanned. */
    for (j = 0; j < 5; j++) {
        max = 0;
        min = 0;
        for (i = 0; i < N; i++) {
            if (hist[i] > max) {
                max = hist[i];
                min = i;    /* xxx */
            }
        }
        printf("%dth: %d (%d observations)\n", j + 1, min, max);
        hist[min] = 0;
    }
    return 0;
}
%%%

>> Other implementation bugs (all in clock_getres()):
>> - all of the clock ids that use getnanotime() claim a resolution of 1
>> nsec, but that is bogus. The actual resolution is more like tc_tick/HZ.
>> The extra resolution in a struct timespec is only used to return
>> garbage related to the incoherency of the clocks. (If it could be
>> arranged that tc_windup() always ran on a tc_tick/HZ boundary, then
>> the clocks would be coherent and the times would always be a multiple
>> of tc_tick/HZ, with no garbage in low bits.)
>> - CLOCK_VIRTUAL and CLOCK_PROF claim a resolution of 1/hz, but that is
>> bogus. The actual resolution is more like 1/stathz, or perhaps 1
>> microsecond. hz is irrelevant here since statclock ticks are used.
>> statclock ticks only have a resolution of 1/stathz, but if 1 nsec is
>> correct for CLOCK_REALTIME_FAST, then 1 usec is correct here since
>> calcru() calculates the time to a resolution of 1 usec; it is just
>> very inaccurate at that resolution.
>> "Resolution" is a poor term for the functionality needed here. I think
>> a hint about the accuracy is more important. In simple implementations
>> using interrupts and ticks, the accuracy would be about the same as
>> the resolution, but FreeBSD is more complicated.
>
> Is there any reason that the garbage resolution can't be zero'ed out to
> indicate confidence of the kernel in the precision of the information? -sc

Well, I only recently decided that "garbage" is the right way to think
of the extra precision. Some care would be required to not increase
incoherency when discarding the garbage.

Bruce

From felipebgn at gmail.com Thu Jun 12 01:07:47 2008
From: felipebgn at gmail.com (Felipe Neuwald)
Date: Thu Jun 12 01:07:50 2008
Subject: Performance with python and FreeBSD 7.0 amd64
Message-ID: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com>

Hi all,

We have a few servers running zope + plone.
On one server running FreeBSD 6.3-STABLE i386, I got no problems, but, with one server running FreeBSD 7.0-STABLE amd64, same versions of applications, I got some errors, like the following: dmesg result: pid 74775 (python), uid 1002: exited on signal 11 pid 74861 (python), uid 1002: exited on signal 11 pid 74911 (python), uid 1002: exited on signal 11 pid 74926 (python), uid 1002: exited on signal 11 pid 74970 (python), uid 1002: exited on signal 11 pid 75038 (python), uid 1002: exited on signal 11 pid 75069 (python), uid 1002: exited on signal 11 pid 75095 (python), uid 1002: exited on signal 11 pid 75131 (python), uid 1002: exited on signal 11 pid 75136 (python), uid 1002: exited on signal 11 pid 75204 (python), uid 1002: exited on signal 11 pid 75842 (python), uid 1002: exited on signal 11 pid 75949 (python), uid 1002: exited on signal 10 pid 75962 (python), uid 1002: exited on signal 11 pid 75999 (python), uid 1002: exited on signal 4 pid 76097 (python), uid 1002: exited on signal 10 pid 77452 (python), uid 1002: exited on signal 11 pid 78012 (python), uid 1002: exited on signal 10 pid 78044 (python), uid 1002: exited on signal 11 pid 78425 (python), uid 1002: exited on signal 4 pid 78464 (python), uid 1002: exited on signal 11 pid 78615 (python), uid 1002: exited on signal 11 pid 78638 (python), uid 1002: exited on signal 11 pid 78656 (python), uid 1002: exited on signal 11 pid 78809 (python), uid 1002: exited on signal 11 pid 79076 (python), uid 1002: exited on signal 11 pid 84577 (python), uid 1002: exited on signal 11 I'm using python version 2.4 on every server. So, if someone can help me: - How can I debug the system to find the error? - How can I configure the server for plone + zope (python) best performance? Thank you very much, Felipe Neuwald. 
From andymac at bullseye.apana.org.au Thu Jun 12 12:40:43 2008 From: andymac at bullseye.apana.org.au (Andrew MacIntyre) Date: Thu Jun 12 13:18:11 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> Message-ID: <485107C2.7080202@bullseye.andymac.org> Felipe Neuwald wrote: > We have a few servers running zope + plone. On one server running > FreeBSD 6.3-STABLE i386, I got no problems, but, with one server > running FreeBSD 7.0-STABLE amd64, same versions of applications, I got > some errors, like the following: > > dmesg result: > pid 74775 (python), uid 1002: exited on signal 11 segmentation violation {...} > pid 75949 (python), uid 1002: exited on signal 10 bus error {...} > pid 75999 (python), uid 1002: exited on signal 4 illegal instruction {...} Hmm... that's an interesting mix of failures. I have seen bus errors when Python runs out of stack space either in the main thread or child threads (not an unknown issue with Zope). gcc 4.x in my limited experience generates sometimes noticeably larger stack frames than gcc 3.x (which is standard on 6.x), which can provoke unexpected stack exhaustion. You don't mention whether you're using a local build or a binary package. Nor do you mention the point release (python 2.4.5 is the most recent in the 2.4 series). The default thread stack size according to my 6.3 box's ports is 1MB for Python 2.4.4) which should be adequate for most circumstances. The illegal instruction failure suggests something wrong with your binaries (including those built for Zope). The segmentation violations often indicate a problem with reference counts, frequently attributable to bugs in 3rd party extensions. You might want to check that all binaries for Python, Zope & Plone (if it has any) link against the same libraries. 
If you can snaffle cores, you might want to try and extract backtraces from gdb (debugging symbols would make this more productive...) -- ------------------------------------------------------------------------- Andrew I MacIntyre "These thoughts are mine alone..." E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370 andymac@pcug.org.au (alt) | Belconnen ACT 2616 Web: http://www.andymac.org/ | Australia From felipebgn at gmail.com Thu Jun 12 16:32:25 2008 From: felipebgn at gmail.com (Felipe Neuwald) Date: Thu Jun 12 16:32:31 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <485107C2.7080202@bullseye.andymac.org> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> <485107C2.7080202@bullseye.andymac.org> Message-ID: <928b5da90806120932v49113d35if81b12c45c86c662@mail.gmail.com> > > Hmm... that's an interesting mix of failures. > > I have seen bus errors when Python runs out of stack space either in the > main thread or child threads (not an unknown issue with Zope). > > gcc 4.x in my limited experience generates sometimes noticeably larger > stack frames than gcc 3.x (which is standard on 6.x), which can provoke > unexpected stack exhaustion. > > You don't mention whether you're using a local build or a binary package. > Nor do you mention the point release (python 2.4.5 is the most recent in > the 2.4 series). I'm using a local build, installed via ports tree (python24-2.4.5_1). > The default thread stack size according to my 6.3 box's ports is 1MB for > Python 2.4.4) which should be adequate for most circumstances. > > The illegal instruction failure suggests something wrong with your > binaries (including those built for Zope). > > The segmentation violations often indicate a problem with reference > counts, frequently attributable to bugs in 3rd party extensions. > > You might want to check that all binaries for Python, Zope & Plone (if it > has any) link against the same libraries. 
Ok, I'll try to check these. I'm not the python + zope + plone guy, I'm the FreeBSD administrator. I'll have to work with the application team to find the solution for these problem. > If you can snaffle cores, you might want to try and extract backtraces > from gdb (debugging symbols would make this more productive...) Ok, I'll also try to get more information with cores. Thanks, Felipe Neuwald. From felipebgn at gmail.com Thu Jun 12 17:57:07 2008 From: felipebgn at gmail.com (Felipe Neuwald) Date: Thu Jun 12 17:57:10 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <485107C2.7080202@bullseye.andymac.org> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> <485107C2.7080202@bullseye.andymac.org> Message-ID: <928b5da90806121057j48c178bdp773adfe759561e15@mail.gmail.com> Andrew, I'll try to recompile python with "HUGE STACK SIZE" option. Let's see. Felipe Neuwald. From felipebgn at gmail.com Fri Jun 13 12:45:54 2008 From: felipebgn at gmail.com (Felipe Neuwald) Date: Fri Jun 13 12:45:58 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <928b5da90806121057j48c178bdp773adfe759561e15@mail.gmail.com> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> <485107C2.7080202@bullseye.andymac.org> <928b5da90806121057j48c178bdp773adfe759561e15@mail.gmail.com> Message-ID: <928b5da90806130545w37621619t20dab5495318459a@mail.gmail.com> Andrew and all, After recompile python 2.4 with HUGE_STACK_SIZE option, I got no more problems. I'll still wait the weekend to say it again, and wait for the customer reply about system performance / errors. If I got news, I'll send to you. Thank you very much, Felipe Neuwald. 2008/6/12 Felipe Neuwald : > Andrew, I'll try to recompile python with "HUGE STACK SIZE" option. Let's see. > > Felipe Neuwald. 
> From felipebgn at gmail.com Mon Jun 16 12:44:55 2008 From: felipebgn at gmail.com (Felipe Neuwald) Date: Mon Jun 16 12:45:06 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <928b5da90806130545w37621619t20dab5495318459a@mail.gmail.com> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> <485107C2.7080202@bullseye.andymac.org> <928b5da90806121057j48c178bdp773adfe759561e15@mail.gmail.com> <928b5da90806130545w37621619t20dab5495318459a@mail.gmail.com> Message-ID: <928b5da90806160544y26c069ddl5bb6a6b946fefe5c@mail.gmail.com> Hi Andrew and all, Just to inform: after the weekend test, I still got no more errors. I think the problem is solved. Thank you very much, Felipe Neuwald. 2008/6/13 Felipe Neuwald : > Andrew and all, > > After recompile python 2.4 with HUGE_STACK_SIZE option, I got no more > problems. I'll still wait the weekend to say it again, and wait for > the customer reply about system performance / errors. If I got news, > I'll send to you. > > Thank you very much, > > Felipe Neuwald. > > 2008/6/12 Felipe Neuwald : >> Andrew, I'll try to recompile python with "HUGE STACK SIZE" option. Let's see. >> >> Felipe Neuwald. >> > From andymac at bullseye.apana.org.au Mon Jun 16 14:40:16 2008 From: andymac at bullseye.apana.org.au (Andrew MacIntyre) Date: Mon Jun 16 14:45:01 2008 Subject: Performance with python and FreeBSD 7.0 amd64 In-Reply-To: <928b5da90806160544y26c069ddl5bb6a6b946fefe5c@mail.gmail.com> References: <928b5da90806111738h55bbbbb0y3e9731323a1561f4@mail.gmail.com> <485107C2.7080202@bullseye.andymac.org> <928b5da90806121057j48c178bdp773adfe759561e15@mail.gmail.com> <928b5da90806130545w37621619t20dab5495318459a@mail.gmail.com> <928b5da90806160544y26c069ddl5bb6a6b946fefe5c@mail.gmail.com> Message-ID: <48566D5E.3090208@bullseye.andymac.org> Felipe Neuwald wrote: > Just to inform: after the weekend test, I still got no more errors. I > think the problem is solved. > > Thank you very much, > > Felipe Neuwald. 
> > 2008/6/13 Felipe Neuwald : >> Andrew and all, >> >> After recompile python 2.4 with HUGE_STACK_SIZE option, I got no more >> problems. I'll still wait the weekend to say it again, and wait for >> the customer reply about system performance / errors. If I got news, >> I'll send to you. Hope it stays that way for you. FWIW, Python 2.5 and later don't need to be recompiled to change the thread stack size, provided you can change the main script - there's a function in the threading module that will do it. Regards, Andrew. -- ------------------------------------------------------------------------- Andrew I MacIntyre "These thoughts are mine alone..." E-mail: andymac@bullseye.apana.org.au (pref) | Snail: PO Box 370 andymac@pcug.org.au (alt) | Belconnen ACT 2616 Web: http://www.andymac.org/ | Australia From hazlewood at gmail.com Wed Jun 18 21:55:37 2008 From: hazlewood at gmail.com (Hazlewood) Date: Wed Jun 18 21:55:51 2008 Subject: 6.1 busy server periodically hangs, waits, then recovers a couple minutes later - analysis? Message-ID: Hello List, I've done some searching but this particular random and temporary lockup condition that I'm experiencing doesn't seem to happen that much...anyways, here goes with my symptoms and I was hoping someone could guide me towards some add'l testing or stats I can gather to help pinpoint the root cause. The symptoms are as follows during this indefinite frozen condition: - Existing shell's will continue to be responsive - Existing sessions such as http download continue to work - Programs running within shells such as vmstat/systat/iostat, etc.. continue to spit out data - *New* incoming socket requests or commands executed on the shell will sit there indefinitely and come return an established connection or execute said command several minutes later when the system returns to life. 
Here's some data to start with: kern.ostype: FreeBSD kern.osrelease: 6.1-STABLE kern.osrevision: 199506 kern.version: FreeBSD 6.1-STABLE #1: Sat Jul 15 03:08:58 MST 2006 net.isr.swi_count: -993194527 net.isr.drop: 0 net.isr.queued: 57342094 net.isr.deferred: 1573529131 net.isr.directed: -1846058772 net.isr.count: -272529641 net.isr.direct: 1 net.route.netisr_maxqlen: 256 I changed to net.isr.direct=1 and saw a dramatic drop in the number of interrupts and a small drop in context switches but no real observed change in the number of lockups that happen.. net.inet.ip.intr_queue_maxlen: 512 net.inet.ip.intr_queue_drops: 10855650 I changed maxlen to 512 and the queue_drops don't increment anymore... hw.machine: amd64 hw.model: Intel(R) Xeon(R) CPU 5140 @ 2.33GHz hw.ncpu: 4 hw.byteorder: 1234 hw.physmem: 8578887680 hw.usermem: 8305831936 hw.pagesize: 4096 hw.floatingpoint: 1 hw.machine_arch: amd64 dev.em.0.%desc: Intel(R) PRO/1000 Network Connection Version - 5.1.5 dev.em.0.%driver: em dev.em.0.%location: slot=0 function=0 dev.em.0.%pnpinfo: vendor=0x8086 device=0x1096 subvendor=0x15d9 subdevice=0x0000 class=0x020000 dev.em.0.%parent: pci4 dev.em.0.debug_info: -1 dev.em.0.stats: -1 dev.em.0.rx_int_delay: 66 dev.em.0.tx_int_delay: 66 dev.em.0.rx_abs_int_delay: 666 dev.em.0.tx_abs_int_delay: 666 dev.em.0.rx_processing_limit: -1 I tried messing with the default moderated polling settings for the em driver thinking the large number of interrupts coming from the NIC might possibly have something to do with it but so far no change. dev.mpt.0.%desc: LSILogic SAS Adapter dev.mpt.0.%driver: mpt dev.mpt.0.%location: slot=1 function=0 dev.mpt.0.%pnpinfo: vendor=0x1000 device=0x0054 subvendor=0x1000 subdevice=0x3050 class=0x010000 dev.mpt.0.%parent: pci5 dev.mpt.0.debug: 3 This is scary, does the GIANT-LOCKED mean that this storage subsystem driver locks the entire kernel when it does I/O calls? 
(sorry I'm a little sketched out reading about the random bits of
freebsd6 that don't yet use finer grained locking...)

/var/run/dmesg.boot:mpt0: port 0x3000-0x30ff mem 0xc8310000-0xc8313fff,0xc8300000-0xc830ffff irq 24 at device 1.0 on pci5
/var/run/dmesg.boot:mpt0: [GIANT-LOCKED]
/var/run/dmesg.boot:mpt0: MPI Version=1.5.12.0

Here's the various stats and logs....

vmstat -i
interrupt                          total       rate
irq4: sio0                      45153724         35
irq6: fdc0                             3          0
irq14: ata0                           47          0
irq18: em0                    6188085724       4798
irq24: mpt0                    340758261        264
cpu0: timer                   2572161860       1994
cpu1: timer                   2541470854       1970
cpu3: timer                   2575006045       1996
cpu2: timer                   2575266013       1997
Total                        16837902531      13057

Here I caught about a minute's worth of frozen state as evidenced by
the solid 15 or 17 processes in the wait queue and nothing really going
on.

vmstat 5
 procs      memory     page                       disks   faults               cpu
 r  b w     avm    fre  flt re pi po   fr    sr da0 da1   in    sy    cs us sy  id
 2  4 0 1374060 281876   41  0  0  0 2694  2323   0   0  192  3062  2137  3 10  88
 2 11 0 1374060 348312  218  0  0  0 5263  8216  75 107 8129 25567 26444  6 15  79
 2  6 0 1374060 396152    0  2  0  0 6063  8359 107 115 8190 25814 26399  6 19  75
 0  7 1 1374060 309376    0  1  0  0 5623     0  63 130 8071 23806 25064  6 18  76
 1  5 0 1374060 389544    0  0  0  0 5632  8438  83  92 8371 24884 26412  6 15  79
 1  7 1 1374060 284640    0  1  0  0 5296     0  91  85 8380 23548 25405  6 18  76
 0  5 3 1374060 359252    0  0  0  0 5072  8226  71  79 8294 24216 26170  5 18  77
 1  8 0 1374060 411248    0  0  0  0 6034  8231  88 126 8366 27454 27470  6 18  76
 1  6 1 1374060 303560    0  1  0  0 5475     0  94 121 8239 25001 25857  6 17  77
 1  5 1 1374188 402788   14  1  0  0 5392 10179  52 139 8193 24831 25380  5 17  78
 0  5 1 1374192 301200   20  0  0  0 5178     0  86  92 8336 24750 26702  5 15  79
 1  7 0 1374192 369736    0  1  0  0 4941  8176  63  74 7998 23055 24414  6 17  78
 1  8 1 1374192 275292   13  1  0  0 5192     0 119 102 8230 24862 26339  5 15  79
 1  5 0 1374192 364596  459  0  1  1 5755  8438 112 124 8332 26315 26651  5 16  78
 0  8 0 1374192 267856    0  1  0  0 4897     0 105  88 7997 22995 24989  6 16  78
 1 12 0 1374192 339624    0  0  0  0 4920  8294  75 135 8227 24276 25400  5 16  80
 2  5 0 1374200 397088   13  1  0  0 5718  8412  89  91 8116 26028 26313  6 18  77
 2  6 0 1374228 292596   10  1  0  0 5374     0  94  96 8075 24513 25413  6 18  76
 0  6 1 1374236 341492    7  1  0  0 6057  8342 134 116 8268 27172 27364  6 19  76
 3  8 0 1374240 390940    7  1  0  0 5916  8243 108 121 8304 27672 27495  6 18  76
 2  7 0 1374252 275052    7  2  0  0 5810     0 129  91 8198 24374 25501  5 18  77
 4  8 0 1374252 321808    3  1  0  0 5921  8236 100 113 8273 26616 27128  6 19  75
 1  8 0 1374268 272796   11  0  0  0 4735     0  83 114 7848 21924 24042  5 15  80
 2  7 0 1374284 324676    7  0  0  0 6110  8201 114  93 8284 24293 25378  6 18  76
 2  8 0 1374296 360076   10  1  0  0 6956  8328 136 125 8348 26724 27040  6 19  75
 1  5 1 1374312 407356  230  0  0  0 6934  8229 126 131 8319 28257 27213  6 18  76
 1 11 1 1374344 300228   13  0  0  0 6493     0 127  97 8354 25438 25542  6 19  75
 0  4 1 1374344 348096   13  1  0  0 6719  8345  96 112 6806 28724 24457  7 19  74
 1 10 1 1374952 404552   31  0  0  0 5796  8165 107 123 6340 24678 21992  6 17  78
 0 13 1 1374952 374140    0  0  0  0 7774     0  97 117 6452 23443 22131  4 14  81
 3  6 0 1374956 441772    3  0  0  0 3232  8181 147 110 6429 24878 22451  5 17  78
 1  7 0 1374968 341796   10  0  0  0 4838     0  89 115 6542 24689 22643  5 17  78
 0 13 0 1375012 414996   11  1  0  0 4828  8282  97 103 6162 21969 20812  4 14  81
 0 15 0 1375192 414196   14  0  0  0   39     0   0   0 1695   380  3494  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0 1255    23  2565  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0 1248    25  2549  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0 1012    23  2079  0  0 100
 0 15 0 1375192 414196  218  0  0  0  252     0   0   0  828   476  1779  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  783    23  1618  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  809    24  1670  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  725    23  1503  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  545    25  1140  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  489    23  1027  0  0 100
 0 15 0 1375192 414196    0  0  0  0    0     0   0   0  509    72  1075  0  0 100
 2 14 0 1375376 412284   39  0  0  0   46     0   6   2 5397   359  1170  0 79  21
 0 15 0 1375380 410128   99  0  0  0  211     0  10  46 1921   839  1182  0 69  31
 1 15 0 1375380 421280  266  0  0  0  907     0  15   0 1582   578   589  0 73  26
 0 15 0 1375380 421504   13  0  0  0   45     0  26   1 1581    94   497  0 59  41
 0 17 1 1377496 430164  211  0  0  0 1505     0  16  19 2869  2736  2900  2 68  30
 2 17 0 1377060 386444  264  1  0  0  970     0  34  30 3532  4017  3917  3 87  10
 2  7 0 1377180 414764 1166  5  2  0 8861  8362 124 174 6236 31661 20389 10 26  64
 0  7 1 1377180 346060   67  0  0  0 8122     0  77 220 6442 24740 22522  5 21  74
 2  4 0 1377180 391492    7  0  0  0 5783  8273  87  72 6218 23687 22724  5 17  78
 0  6 0 1377180 324412   10  0  0  0 5293     0  63 111 6949 20025 20603  4 17  79
 2  6 0 1377180 426368   15  0  0  0 6419  8211 101 104 6487 18747 19175  5 23  72
 0  4 1 1377180 334612   26  1  0  0 4488     0 133 115 6490 23064 22344  5 15  79
 2  8 0 1377180 286908  104  0  0  0 4959     0 132  75 6222 18812 18577  5 27  68
 1 10 0 1377180 389248  226  1  0  0 6362  9045 125 111 6263 22443 21853  5 17  78
 2 11 1 1377180 291636   51  0  0  0 5923     0  98 116 6474 21827 20642  6 26  68
 0  4 2 1377180 306052   37  1  0  0 7971     0  68  91 6006 21735 20893  5 18  78
 1  6 0 1377180 428352   30  0  0  0 3050  8609 120  81 5949 19114 18973  4 25  71
 0  6 1 1377180 374184   24  0  0  0 4525     0  97  92 6483 22294 21853  5 18  77
 2  5 1 1377180 319532    4  0  0  0 4965     0 105  88 6348 21421 21703  5 18  77

Here's a normal top -S output; I haven't been able to grab one while
the problem is occurring:
last pid: 67537;  load averages:  0.72,  0.93,  0.94    up 15+16:05:21  14:12:38
83 processes:  7 running, 63 sleeping, 13 waiting
CPU states:  4.7% user,  0.0% nice, 12.6% system,  0.2% interrupt, 82.5% idle
Mem: 1232M Active, 5912M Inact, 261M Wired, 285M Cache, 214M Buf, 11M Free
Swap: 9216M Total, 32K Used, 9216M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME   WCPU COMMAND
   12 root        1 171   52     0K    16K CPU2   2 344.0H 88.72% idle: cpu2
   14 root        1 171   52     0K    16K RUN    0 338.1H 88.33% idle: cpu0
   13 root        1 171   52     0K    16K RUN    1 338.3H 85.64% idle: cpu1
   11 root        1 171   52     0K    16K CPU3   3 282.4H 62.89% idle: cpu3
   24 root        1 -68    0     0K    16K CPU3   3  29.5H 20.17% em0 taskq
66405 squid       1   4    0   832M   819M kqread 3  13:47 13.28% squid
82587 squid       1   4    0   294M   280M kqread 3 885:07 12.55% squid

At first I thought the single NIC, and the single CPU that takes its
interrupts, was draining to 0% idle, but I don't think that's really
related, and it wouldn't explain why new processes couldn't run in new
sessions, etc.

As for dmesg errors, all we see are these, which pop up every minute or
so:

mpt0: Unhandled Event Notify Frame. Event 0xe (ACK not required).
mpt0: mpt_cam_event: 0xe

But once the problem starts we also see a bunch of timeouts from mpt0:

mpt0: request 0xffffffff8bd10010:19992 timed out for ccb 0xffffff022d40b400 (req->ccb 0xffffff022d40b400)
mpt0: request 0xffffffff8bd0f5c0:19993 timed out for ccb 0xffffff01949b4800 (req->ccb 0xffffff01949b4800)
mpt0: attempting to abort req 0xffffffff8bd10010:19992 function 0
mpt0: request 0xffffffff8bd0fb98:19998 timed out for ccb 0xffffff020d3f3c00 (req->ccb 0xffffff020d3f3c00)
mpt0: completing timedout/aborted req 0xffffffff8bd10010:19992
mpt0: request 0xffffffff8bd0a9c8:20002 timed out for ccb 0xffffff022e147800 (req->ccb 0xffffff022e147800)
mpt0: request 0xffffffff8bd098f0:20003 timed out for ccb 0xffffff0000f01800 (req->ccb 0xffffff0000f01800)
(da2:mpt0:0:17:0): WRITE(10). CDB: 2a 0 c 5e f cf 0 0 80 0
(da2:mpt0:0:17:0): CAM Status: SCSI Status Error
(da2:mpt0:0:17:0): SCSI Status: Check Condition
(da2:mpt0:0:17:0): UNIT ATTENTION asc:29,7
(da2:mpt0:0:17:0): Reserved ASC/ASCQ pair
(da2:mpt0:0:17:0): Retrying Command (per Sense Data)
mpt0: abort of req 0xffffffff8bd10010:0 completed
mpt0: attempting to abort req 0xffffffff8bd0f5c0:19993 function 0
mpt0: request 0xffffffff8bd13978:20004 timed out for ccb 0xffffff0000d60000 (req->ccb 0xffffff0000d60000)
mpt0: completing timedout/aborted req 0xffffffff8bd0f5c0:19993
mpt0: request 0xffffffff8bd0e648:20005 timed out for ccb 0xffffff0000cb2000 (req->ccb 0xffffff0000cb2000)
mpt0: abort of req 0xffffffff8bd12a58:0 completed
mpt0: attempting to abort req 0xffffffff8bd130e0:20027 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd130e0:20027
mpt0: abort of req 0xffffffff8bd130e0:0 completed
mpt0: attempting to abort req 0xffffffff8bd11f58:20028 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd11f58:20028
mpt0: abort of req 0xffffffff8bd11f58:0 completed
mpt0: attempting to abort req 0xffffffff8bd0d4c0:20029 function 0
mpt0: completing timedout/aborted req 0xffffffff8bd0d4c0:20029
Etc...

Here's what some typical disk activity looks like:

iostat -d 5
      da0               da1               da2               da3
  KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s    KB/t tps  MB/s
 47.87  45  2.11   51.09  54  2.69   50.72  56  2.77   50.09  61  3.00
 47.47  93  4.30   54.25  92  4.87   49.92  77  3.77   43.49  81  3.45
 70.62  49  3.41   62.31  80  4.89   54.47  86  4.58   44.36  70  3.03
 56.75  50  2.79   52.30 103  5.26   64.82  56  3.53   50.95  55  2.73
 59.76  74  4.34   63.63  64  3.98   43.97 141  6.04   45.34 104  4.60
 41.91 105  4.30   50.78  96  4.75   61.19  66  3.96   50.23  82  4.00
 54.88  60  3.20   64.17  70  4.37   71.42  74  5.16   40.63 104  4.11
 52.17 110  5.60   54.50  86  4.55   50.85  95  4.74   36.75  91  3.28

I have not used WITNESS before, but would this be a good time to start
looking at it? Is the server simply too busy?
What else could I look for, or try tweaking, to get around this
problem, which doesn't happen at lower, off-peak load levels?

Thanks,
Hazlewood
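
The frozen interval in the vmstat 5 output has a recognizable signature: about 15 processes blocked, 100% idle CPU, and no transfers on da0 or da1. Below is a minimal Python sketch, assuming the 19-column layout shown in this message, that picks such samples out of a saved vmstat log. The function names and the min_blocked threshold of 10 are illustrative assumptions, not part of any FreeBSD tool, and the sample rows are copied from the output above.

```python
# Column order taken from the `vmstat 5` header quoted in this message.
# The second "sy" column (system CPU %) is renamed "sys" to keep keys unique.
COLUMNS = ["r", "b", "w", "avm", "fre", "flt", "re", "pi", "po", "fr", "sr",
           "da0", "da1", "in", "sy", "cs", "us", "sys", "id"]


def parse_rows(text):
    """Yield one dict per numeric vmstat sample; header lines are skipped."""
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # e.g. the "procs memory page ..." group header
        try:
            values = [int(f) for f in fields]
        except ValueError:
            continue  # the "r b w avm ..." column header
        yield dict(zip(COLUMNS, values))


def is_stalled(row, min_blocked=10):
    """Hypothetical heuristic: many blocked processes, idle CPUs, idle disks."""
    return (row["b"] >= min_blocked and row["id"] == 100
            and row["da0"] == 0 and row["da1"] == 0)


def stalled_samples(text, min_blocked=10):
    """Return the samples that match the stall heuristic, in input order."""
    return [row for row in parse_rows(text) if is_stalled(row, min_blocked)]


# Four rows copied from the vmstat output in this message.
SAMPLE = """\
 r b w avm fre flt re pi po fr sr da0 da1 in sy cs us sy id
 0 13 0 1375012 414996 11 1 0 0 4828 8282 97 103 6162 21969 20812 4 14 81
 0 15 0 1375192 414196 14 0 0 0 39 0 0 0 1695 380 3494 0 0 100
 0 15 0 1375192 414196 0 0 0 0 0 0 0 0 1255 23 2565 0 0 100
 2 14 0 1375376 412284 39 0 0 0 46 0 6 2 5397 359 1170 0 79 21
"""

if __name__ == "__main__":
    for row in stalled_samples(SAMPLE):
        print("stalled sample: b=%(b)d in=%(in)d cs=%(cs)d id=%(id)d" % row)
```

Something along these lines could watch a running vmstat log and trigger a top -S capture when a stalled sample shows up, since no capture from the frozen state exists yet. The thresholds would need tuning against real logs.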