Recent Problems with RELENG_7 i386

Thu Oct 9 00:20:42 PDT 2008

--- On Thu, 10/9/08, Jeremy Chadwick <koitsu at FreeBSD.org> wrote:

> From: Jeremy Chadwick <koitsu at FreeBSD.org>
> Subject: Re: Recent Problems with RELENG_7 i386
> To: "bf" <bf2006a at yahoo.com>
> Cc: freebsd-stable at freebsd.org
> Date: Thursday, October 9, 2008, 1:12 AM
> On Wed, Oct 08, 2008 at 10:00:32PM -0700, bf wrote:
> > 
> > 
> > 
> > --- On Wed, 10/8/08, Jeremy Chadwick <koitsu at FreeBSD.org> wrote:
> > 
> > > From: Jeremy Chadwick <koitsu at FreeBSD.org>
> > > Subject: Re: Recent Problems with RELENG_7 i386
> > > To: "bf" <bf2006a at yahoo.com>
> > > Cc: freebsd-stable at freebsd.org
> > > Date: Wednesday, October 8, 2008, 2:36 PM
> > > On Wed, Oct 08, 2008 at 10:19:47AM -0700, bf wrote:
> > > > After updating to RELENG_7 i386 of this weekend, I have been having
> > > > problems with my machine.  When booting normally, the system slows or
> > > > hangs at the login prompt.  If I am able to continue past the prompt,
> > > > I sometimes experience erratic mouse behavior, and a subsequent hang,
> > > > after varying lengths of time, even under light workloads.  The same
> > > > problem does not seem to occur in single-user mode, and did not occur
> > > > with the RELENG_7 i386 of just over a week ago.  I have been unable to
> > > > obtain crashdumps so far, and the only log messages I can find that
> > > > weren't present before are notices like those recorded below:
> > > > 
> > > > Oct  8 11:00:40 myhost kernel: t_delta 15.fd80bdcb75b60200 too short
> > > 
> > > This comes from src/sys/kern/kern_tc.c, around line 908.  I'm not
> > > familiar with the kernel, but three ideas come to mind:
> > > 
> > > 1) If you have Intel SpeedStep (EIST) or AMD Cool'n'Quiet enabled in
> > > your BIOS, try disabling it,
> > > 
> > > 2) If you're using powerd, disable it (I don't see it enabled),
> > > 
> > > 3) Try keeping HZ at 1000 (the default).
> > > 
> > 
> > Thanks, Jeremy, for taking the time to consider my question and reply.
> > 
> > My CPU is pre-Cool'n'Quiet, and as far as I can tell I had disabled
> > all forms of power management that may affect the clock speeds.  I have
> > found that by raising kern.hz to 250, or by using the default, I no
> > longer receive the t_delta is too short messages, and the other problems
> > are no longer apparent.  My question is: why did this occur now?
> 
> I don't know.  We can't rewind time and find out system parameters and
> kernel details from 6 months ago.  :-)

Well, actually, with version control, we can -- if we're willing to take
the trouble.  My local settings haven't changed much, and the load we're
talking about is not some pattern lost in the mists of time, but simply
booting up.

But I'm not suggesting going to any such lengths: the changed behavior
occurred after the most recent changes to RELENG_7, in the past two weeks
or less.  Like you, I haven't taken the time to delve into the inner
workings of the kernel, and so I was hoping that someone who had been
monitoring RELENG_7's behavior (it's being tested before-and-after
commits fairly often during this pre-release freeze, right?) or someone
who had made recent changes might be able to narrow it down even further,
either by pointing to some changes that might have tipped my machine over
the edge at kern.hz=100, or by ruling out FreeBSD changes entirely and
pointing the finger at some hardware problem.
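As a rough check on the logged value itself, here is a minimal Python
sketch that decodes "15.fd80bdcb75b60200".  It rests on two assumptions
that are not confirmed anywhere in this thread: that the number is whole
seconds followed by a 64-bit binary fraction printed as 16 hex digits (the
layout of a struct bintime), and that the interval being measured is
nominally 16 seconds.  The helper name decode_bintime is made up for the
illustration.

```python
# Decode the value from "t_delta 15.fd80bdcb75b60200 too short".
# ASSUMPTION: whole seconds, a dot, then a 64-bit binary fraction
# printed as 16 hex digits (struct bintime layout).

def decode_bintime(text):
    """Return the value of 'SEC.HEXFRAC' in seconds as a float."""
    sec_text, frac_text = text.split(".")
    return int(sec_text) + int(frac_text, 16) / 2**64

t_delta = decode_bintime("15.fd80bdcb75b60200")

# ASSUMPTION: the interval being checked is nominally 16 seconds.
NOMINAL_SECONDS = 16
shortfall_ms = (NOMINAL_SECONDS - t_delta) * 1000.0

print(f"t_delta = {t_delta:.6f} s, shortfall = {shortfall_ms:.3f} ms")

# One clock tick lasts 1000/hz milliseconds; express the shortfall
# in ticks for the kern.hz values mentioned in this thread.
for hz in (100, 250, 1000):
    tick_ms = 1000.0 / hz
    print(f"hz={hz}: tick = {tick_ms:.1f} ms, "
          f"shortfall = {shortfall_ms / tick_ms:.2f} ticks")
```

If those assumptions hold, the measured interval is a little under 10 ms
short of 16 seconds, which is close to the length of a single tick at
kern.hz=100.  That is only suggestive, not a diagnosis.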

Some of the related sysctls are:

kern.timecounter.tick: 1
kern.timecounter.choice: TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000)
kern.timecounter.hardware: ACPI-safe
kern.timecounter.stepwarnings: 0
kern.timecounter.tc.i8254.mask: 4294967295
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.i8254.quality: 0
kern.timecounter.tc.ACPI-safe.mask: 16777215
kern.timecounter.tc.ACPI-safe.frequency: 3579545
kern.timecounter.tc.ACPI-safe.quality: 850
kern.timecounter.tc.TSC.mask: 4294967295
kern.timecounter.tc.TSC.frequency: 906349154
kern.timecounter.tc.TSC.quality: 800
vfs.timestamp_precision: 0
machdep.acpi_timer_freq: 3579545
dev.acpi_timer.0.%desc: 24-bit timer at 3.579545MHz
dev.acpi_timer.0.%driver: acpi_timer
dev.acpi_timer.0.%location: unknown
dev.acpi_timer.0.%pnpinfo: unknown
dev.acpi_timer.0.%parent: acpi0
dev.attimer.0.%desc: AT realtime clock
dev.attimer.0.%driver: attimer
dev.attimer.0.%location: handle=\_SB_.PCI0.PIB_.RTC_
dev.attimer.0.%pnpinfo: _HID=PNP0B00 _UID=0
dev.attimer.0.%parent: acpi0
dev.attimer.1.%desc: AT timer
dev.attimer.1.%driver: attimer
dev.attimer.1.%location: handle=\_SB_.PCI0.PIB_.TIME
dev.attimer.1.%pnpinfo: _HID=PNP0100 _UID=0
dev.attimer.1.%parent: acpi0
dev.pmtimer.0.%driver: pmtimer
dev.pmtimer.0.%parent: isa0
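
To make the numbers above easier to read, here is a small Python sketch
that parses the kern.timecounter.choice string, picks the entry with the
highest quality, and works out how long each counter runs before it wraps,
assuming the wrap period is simply (mask + 1) / frequency.  The function
names are hypothetical; only the values are copied from the sysctl output.

```python
import re

# Values copied from the sysctl output above.
choice = "TSC(800) ACPI-safe(850) i8254(0) dummy(-1000000)"
counters = {
    # name: (mask, frequency in Hz)
    "i8254": (4294967295, 1193182),
    "ACPI-safe": (16777215, 3579545),
    "TSC": (4294967295, 906349154),
}

def parse_choice(text):
    """Turn 'NAME(QUALITY) ...' into a {name: quality} dict."""
    return {name: int(q) for name, q in re.findall(r"(\S+)\((-?\d+)\)", text)}

def wrap_seconds(mask, frequency):
    """Seconds until a counter with this mask wraps (assumed model)."""
    return (mask + 1) / frequency

qualities = parse_choice(choice)
best = max(qualities, key=qualities.get)
print("highest quality:", best)

for name, (mask, freq) in counters.items():
    print(f"{name}: wraps every {wrap_seconds(mask, freq):.3f} s")
```

The highest-quality entry is ACPI-safe, which agrees with
kern.timecounter.hardware.  Under the assumed model its 24-bit counter
wraps roughly every 4.7 seconds, far longer than the 10 ms tick at
kern.hz=100, so a plain counter wrap between ticks looks unlikely to be
the whole story.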

and my original message showed how some of the timers were handled during
boot.  I don't think these have changed since 12 Sept., when jhb@ changed
local_apic.c in SVN rev 182982.  So that change _alone_ would not have
caused my problem.  More recently, kib@ committed changes to db_trace.c,
devfs, vm, ata, and to various kernel routines; and rwatson@ made changes
to udp, tcp, sockets, and ipfw.

Regards, 

             b.

> 
> I'm thinking it might have something to do with the timecounter selected
> by the kernel, but as I said, we can't rewind time to find out what
> things were in the past.
> 
> The kernel environment variables I'm talking about are kern.timecounter.
> "sysctl kern.timecounter" could help shed some light here, maybe.  It
> would at least allow us to see what timecounters are available on your
> system, and if a bad/unreliable one is being selected automatically.
> 
> > I have been using a similar configuration for months now without any
> > apparent problems. My original goal in using a lower kern.hz was to
> > avoid burdening my machine with excessive context switching.
> 
> This is over my head, technically.  I would need to pull John Baldwin
> into this, since he knows a bit about both (timecounters and context
> switching).  I'm just a simple caveman..... :-)
> 
> > I saw the relevant section of kern_tc.c before I wrote my first
> > message, but when skimming through the changes in RELENG_7 over the
> > past week or two, I couldn't see any commit that may have directly
> > affected kernel timekeeping.  Has some new workload been imposed on
> > the system by recent changes, that may have made a kern.hz of 100
> > insufficient?  Is this tunable setting properly implemented, so that
> > all parts of the base system are using its current value rather than
> > the default?  Could some of my hardware, such as my RTC, be
> > malfunctioning?
> 
> Well, I believe HZ was increased from 100 to 1000 long ago (RELENG_6?)
> as a default.  I'm really not sure of the implications of decreasing it,
> besides having less granularity for some things (the only things I know
> of would be something pertaining to firewalls, I just can't remember
> what.  My brain is full.  :-) )
> 
> -- 
> | Jeremy Chadwick                                 jdc at parodius.com |
> | Parodius Networking                        http://www.parodius.com/ |
> | UNIX Systems Administrator                   Mountain View, CA, USA |
> | Making life hard for others since 1977.              PGP: 4BD6C0CB |