5.1-RELEASE TODO
Peter Wemm
peter at wemm.org
Tue May 13 15:41:55 PDT 2003
Peter Wemm wrote:
> Don Lewis wrote:
> > On 13 May, Robert Watson wrote:
> > >
> > > On Tue, 13 May 2003, Heiko Schaefer wrote:
> > >
> > >> > That said, we are actively discussing what, if any, workarounds are
> > >> > appropriate, including some alternative workarounds from the ones
> > >> > currently present.
> > >>
> > >> bosko (who was mentioned here various times, regarding a patch to work
> > >> around this) has contacted me, and i am looking forward to trying his
> > >> patch. assuming that the patch is correct (whatever that would mean in
> > >> this context), and there is some chance of accepting it anytime soon,
> > >> maybe it would be sensible to try to get that into the release - or
> > >> delay the release until this is sorted out ?!
> > >>
> > >> wouldn't a release that corrupts data in many, relevant, cases (i
> > >> consider the box i had the trouble with entirely mainstream) be worse
> > >> than no release at all?
> > >
> > > You don't need to argue to me that we need stability (I'm a fan of it
> > > myself): what I need is evidence that some set of changes is actually
> > > solving the problem, not masking it. If there exists a patch that
> > > substantially improves stability on some set of systems (and not at the
> > > cost of another set), I think you can rest assured that we'll get it into
> > > the release. As with you, we're very concerned by the recent spate of
> > > instability, especially in the beta cycle, and how to address that is very
> > > much on our minds.
> >
> > Both my AMD system running -current and PII system running -stable are
> > afflicted with these data corruption problems. The limited amount of
> > information that I've seen about these problems leads me to believe that
> > the only way to use the 4 MB page feature without danger to system integrity
> > is to relocate the kernel. If this is the case, then it would seem to
> > make sense to disable the use of 4 MB pages by adding the DISABLE_PSE
> > option until the system is patched.
>
> The thing is, we only use 4MB pages for two things.
> 1) The first 4MB of KVM is mapped as a 4MB page.
> 2) Large device mappings, e.g. the X server mmap'ing /dev/mem for the frame
> buffer. The thing is though, these 4MB pages are not mapped with PG_G.
>
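For anyone unsure what the two flags mean at the page-table level, here is a
minimal sketch, not taken from pmap.c. The bit values are the architectural
i386 page-directory-entry bits; make_4m_pde(), pde_is_4m() and pde_is_global()
are hypothetical helpers for illustration only, and PSE-36 address bits are
ignored.

```c
#include <stdint.h>

/* i386 page-directory-entry bits (architectural values). */
#define PG_V   0x001u   /* valid / present */
#define PG_RW  0x002u   /* writable */
#define PG_PS  0x080u   /* page size: this entry maps a 4 MB page */
#define PG_G   0x100u   /* global: TLB entry is kept across a %cr3 reload */

/* Bits 22..31 hold the physical base address of a 4 MB page. */
#define PDE_4M_FRAME 0xffc00000u

/*
 * Hypothetical helper: build a writable 4 MB page-directory entry for the
 * physical address 'pa'.  'global' selects whether PG_G is set.
 */
static uint32_t
make_4m_pde(uint32_t pa, int global)
{
	uint32_t pde = (pa & PDE_4M_FRAME) | PG_PS | PG_RW | PG_V;

	if (global)
		pde |= PG_G;
	return (pde);
}

/* True if the entry is valid and maps a 4 MB page. */
static int
pde_is_4m(uint32_t pde)
{
	return ((pde & (PG_V | PG_PS)) == (PG_V | PG_PS));
}

/* True if the entry carries the global flag. */
static int
pde_is_global(uint32_t pde)
{
	return ((pde & PG_G) != 0);
}
```

With these definitions, make_4m_pde(0x00400000, 0) gives 0x00400083 (a 4 MB
mapping without PG_G, as described above), and passing 1 as the second
argument gives 0x00400183.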
> Are you running X? Are you using the broadcom ethernet driver?
>
> Also of note: I recently saw a brand new P4 system with a genuine Intel
> motherboard, for a RELENG_4 system. It had shocking data corruption
> problems. The memory was swapped - no change. The motherboard and CPU were
> swapped (same motherboard model, much newer P4 cpu stepping) - no change.
> It was simply unreliable. Backporting DISABLE_PG_G to RELENG_4 and turning
> it and DISABLE_PSE on greatly reduced the problem, but it still happened.
> In the end, the Intel motherboard was replaced with a P4 Xeon system
> motherboard and the problem instantly went away. The trouble appeared
> to be a generic problem with the Intel 845 chipset motherboard.
>
> Remember, this was RELENG_4 as of a few months ago. It isn't a 5.x-only
> problem.
>
> The bge driver has been firmly implicated in one of the cases of data
> corruption. Paul's recent if_bge fixes completely solved one person's
> long-standing problems. There are DMA bugs in the earlier chipsets that
> we didn't have the prescribed workarounds for. And even though the compiles
> were happening on local disks, all it took was running the build in an Xterm
> so that the make output was going over the network, or doing a tail -f etc.
>
> > PG_G is probably different. A better case can be made that using this
> > option is only masking software bugs that should be fixable. The
> > problem is that these bugs are only rarely triggered, look a lot like
> > flakey hardware, and it's just about impossible for most FreeBSD users
> > to track the problem to its root cause.
>
> For what it's worth, we have #ifdef'ed code in i386/pmap.c:
> #ifdef I686_CPU_not     /* Problem seems to have gone away */
>         /* Deal with un-resolved Pentium4 issues */
>         if (cpu_class == CPUCLASS_686 &&
>             strcmp(cpu_vendor, "GenuineIntel") == 0 &&
>             (cpu_id & 0xf00) == 0xf00) {
>                 printf("Warning: Pentium 4 cpu: PG_G disabled (global flag)\n");
>                 pgeflag = 0;
>         }
> #endif
>
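As a sketch of what that test keys on: the family field is bits 8..11 of the
CPU signature (the "Id = 0xNNN" value in the cpuid line), and the Pentium 4
reports family 0xf. The helper names below are hypothetical, the CPUCLASS_686
check is left out, and the Id values in the note are only illustrative.

```c
#include <string.h>

/* Family field of the CPU signature: bits 8..11 of the "Id = 0xNNN" value. */
static unsigned int
cpu_family(unsigned int cpu_id)
{
	return ((cpu_id >> 8) & 0xf);
}

/*
 * Hypothetical helper mirroring the vendor and family part of the quoted
 * pmap.c test: Intel vendor string and family 0xf.
 */
static int
is_pentium4(const char *vendor, unsigned int cpu_id)
{
	return (strcmp(vendor, "GenuineIntel") == 0 &&
	    (cpu_id & 0xf00) == 0xf00);
}
```

For example, is_pentium4("GenuineIntel", 0xf27) is true, while an Id of 0x686
(family 6) is not matched, and neither is a family 0xf part from another
vendor.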
> I really do not want DISABLE_PSE and DISABLE_PG_G turned on for a problem
> that appears to have a hardware component. I'd much rather see the above
> ifdefs turned on.
>
> For the folks having problems, here's what I'd like to know:
>
> - Are you running X? (standard XFree86 or do you have the agp and drm
> drivers enabled?)
> - What ethernet card? (particularly if bge)
> - Is there any network traffic at the time? ie: if you remove the network
> card entirely and do the compile tests on a /dev/ttyv0 console, does it still
> happen?
> - What hardware do you have? (cpuid line showing the Id = 0xNNN number,
> memory size/type and whether it has ECC or not, motherboard chipset, etc)
> - Have you replaced any hardware? If so, which parts?
Oh, and two more things:
- Do DISABLE_PG_G and/or DISABLE_PSE actually affect the stability?
- Are you seeing application faults (segfault etc.) or kernel instability
(fatal trap, panic etc.)?
Cheers,
-Peter
--
Peter Wemm - peter at wemm.org; peter at FreeBSD.org; peter at yahoo-inc.com
"All of this is for nothing if we don't go to the stars" - JMS/B5