From northox at mantor.org Wed Mar 5 02:23:55 2008
From: northox at mantor.org (Danny Fullerton)
Date: Wed Mar 5 02:23:59 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
Message-ID: <47CDFFFF.10507@mantor.org>

Hi guys,

I've been having quite some trouble tracking down a problem that seems to be related to SMP on one of my production servers.

The problem is not easily reproducible, but the best way I found was to fire up "make buildworld" while having some other things going on (mysql, apache, bind, jails, etc). When SMP is active, the compile will end up with a segfault or, quite rarely, with a crash. I recently configured the crash device but was still unable to recreate a full system crash.

At first, I thought it was related to the memory, so I did some tests and changed most of the DIMMs, but ultimately the problem was still there. To pinpoint the problem, I first tried adding options to the GENERIC kernel, which I found to be stable. That's how I found that it was related to SMP. I then tried some other things, like reducing the drivers in the kernel to the minimum I could, for different reasons. One of them is that the motherboard is a "Tyan thunder K7X" (http://www.tyan.com/archive/products/html/thunderk7x.html) and it has an onboard Adaptec SCSI controller which I don't use. Since the driver used for this adapter is not MP safe, I tried disabling it via the BIOS and/or by disabling the driver in the kernel, but it had no effect. The actual SCSI adapter in use is the Dell 4/DC (LSILogic MegaRAID) you can see in the dmesg.

Now I have no clue how I could further debug this problem.

dmesg from generic kernel:

Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.3-RELEASE-p1 #0: Wed Feb 27 07:56:51 EST 2008
root@megatron.mantor.org:/usr/obj/usr/src/sys/GENERIC
ACPI APIC Table:
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) MP 2200+ (1800.07-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x680 Stepping = 0
Features=0x383fbff
AMD Features=0xc0480800
real memory = 3220701184 (3071 MB)
avail memory = 3150741504 (3004 MB)
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 irqs 0-23 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
hptrr: HPT RocketRAID controller driver v1.1 (Feb 27 2008 07:56:28)
acpi0: on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
cpu0: on acpi0
acpi_button0: on acpi0
pcib0: port 0xcf8-0xcff,0x8000-0x807f,0x8080-0x80ff iomem 0xd8000-0xdbfff on acpi0
pci0: on pcib0
agp0: port 0x1810-0x1813 mem 0xf8000000-0xfbffffff,0xf6210000-0xf6210fff at device 0.0 on pci0
pcib1: at device 1.0 on pci0
pci1: on pcib1
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 7.1 on pci0
ata0: on atapci0
ata1: on atapci0
pci0: at device 7.3 (no driver attached)
amr0: mem 0xf6200000-0xf620ffff irq 20 at device 8.0 on pci0
amr0: delete logical drives supported by controller
amr0: Firmware 350O, BIOS 1.09, 128MB RAM
ahc0: port 0x1000-0x10ff mem 0xf4000000-0xf4000fff irq 20 at device 10.0 on pci0
ahc0: [GIANT-LOCKED]
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
ahc1: port 0x1400-0x14ff mem 0xf4001000-0xf4001fff irq 21 at device 10.1 on pci0
ahc1: [GIANT-LOCKED]
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
pcib2: at device 16.0 on pci0
pci2: on pcib2
ohci0: mem 0xf4100000-0xf4100fff irq 19 at device 0.0 on pci2
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: on ohci0
usb0: USB revision 1.0
uhub0: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pci2: at device 7.0 (no driver attached)
xl0: <3Com 3c980C Fast Etherlink XL> port 0x2400-0x247f mem 0xf4102000-0xf410207f irq 18 at device 8.0 on pci2
miibus0: on xl0
ukphy0: on miibus0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:e0:81:22:2e:c4
xl1: <3Com 3c980C Fast Etherlink XL> port 0x2480-0x24ff mem 0xf4102400-0xf410247f irq 19 at device 9.0 on pci2
miibus1: on xl1
ukphy1: on miibus1
ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl1: Ethernet address: 00:e0:81:22:2e:c5
atkbdc0: port 0x60,0x64 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
pmtimer0 on isa0
orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xe0000-0xe3fff on isa0
ppc0: parallel port not found.
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 8250 or not responding
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounter "TSC" frequency 1800073530 Hz quality 800
Timecounters tick every 1.000 msec
hptrr: no controller detected.
Waiting 5 seconds for SCSI devices to settle
ad0: 476940MB at ata0-master UDMA100
amr0: delete logical drives supported by controller
amrd0: on amr0
amrd0: 139900MB (286515200 sectors) RAID 1 (optimal)
Trying to mount root from ufs:/dev/amrd0s1a

kldstat:

Id Refs Address    Size   Name
 1   10 0xc0400000 7a05b0 kernel
 2    1 0xc0ba1000 5c304  acpi.ko
 3    1 0xc8093000 3000   fdescfs.ko
 4    1 0xc8106000 3000   pflog.ko
 5    1 0xc8109000 2d000  pf.ko
 6    1 0xc817b000 19000  linux.ko

If you have any ideas, or you need more information to diagnose the problem, please let me know.

regards,

---
Danny Fullerton
Mantor Organization

From missmanp at cfw.com Wed Mar 5 03:14:46 2008
From: missmanp at cfw.com (Paul Missman)
Date: Wed Mar 5 03:14:51 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
References: <47CDFFFF.10507@mantor.org>
Message-ID: <007e01c87e6d$1c9c00a0$0b28a8c0@a1000>

Danny,

I don't know what the bug is, but it does exist.

I have an IBM x3455 with 2 Opteron dual core processors. Under heavy loads it crashes. As a step in debugging, I unplugged one of the processors, and the problem went away. I switched to CentOS version 4, and it operates perfectly.

In addition to FreeBSD, the problem also exists in Fedora Core.

Of the OSes I tested, only Red Hat and CentOS worked correctly on the x3455.

I didn't try Windows, so I can't say whether or not it operates properly on this system.

Unfortunately, that is all I know about the issue.

Paul Missman

----- Original Message -----
From: "Danny Fullerton"
To:
Sent: Tuesday, March 04, 2008 9:05 PM
Subject: Dual AMD MP unstable under heavy load when smp is active

> [...]
> _______________________________________________
> freebsd-smp@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-smp
> To unsubscribe, send any mail to "freebsd-smp-unsubscribe@freebsd.org"

From northox at mantor.org Wed Mar 5 03:32:03 2008
From: northox at mantor.org (Danny Fullerton)
Date: Wed Mar 5 03:32:07 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
In-Reply-To: <007e01c87e6d$1c9c00a0$0b28a8c0@a1000>
References: <47CDFFFF.10507@mantor.org> <007e01c87e6d$1c9c00a0$0b28a8c0@a1000>
Message-ID: <47CE1433.2020308@mantor.org>

Hello Paul,

I would like to know whether you did those tests with the recent FreeBSD 7.0. I've seen lots of work in the SMP area of this release, and I'm wondering if I would have a better chance with this version.

thanks,

dmesg with smp on (GENERIC + option smp):

Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 6.3-RELEASE-p1 #0: Wed Feb 27 21:11:40 EST 2008
root@megatron.mantor.org:/usr/obj/usr/src/sys/MEGATRONTEST
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: AMD Athlon(tm) MP 2200+ (1800.07-MHz 686-class CPU)
Origin = "AuthenticAMD" Id = 0x680 Stepping = 0
Features=0x383fbff
AMD Features=0xc0480800
real memory = 3220701184 (3071 MB)
avail memory = 3146387456 (3000 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 2 CPUs
cpu0 (BSP): APIC ID: 1
cpu1 (AP): APIC ID: 0
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 irqs 0-23 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
hptrr: HPT RocketRAID controller driver v1.1 (Feb 27 2008 21:11:16)
acpi0: on motherboard
acpi0: Power Button (fixed)
acpi0: Sleep Button (fixed)
Timecounter "ACPI-safe" frequency 3579545 Hz quality 850
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x8008-0x800b on acpi0
cpu0: on acpi0
cpu1: on acpi0
acpi_button0: on acpi0
pcib0: port 0xcf8-0xcff,0x8000-0x807f,0x8080-0x80ff iomem 0xd8000-0xdbfff on acpi0
pci0: on pcib0
agp0: port 0x1810-0x1813 mem 0xf8000000-0xfbffffff,0xf6210000-0xf6210fff at device 0.0 on pci0
pcib1: at device 1.0 on pci0
pci1: on pcib1
isab0: at device 7.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xf000-0xf00f at device 7.1 on pci0
ata0: on atapci0
ata1: on atapci0
pci0: at device 7.3 (no driver attached)
amr0: mem 0xf6200000-0xf620ffff irq 20 at device 8.0 on pci0
amr0: delete logical drives supported by controller
amr0: Firmware 350O, BIOS 1.09, 128MB RAM
ahc0: port 0x1000-0x10ff mem 0xf4000000-0xf4000fff irq 20 at device 10.0 on pci0
ahc0: [GIANT-LOCKED]
aic7899: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
ahc1: port 0x1400-0x14ff mem 0xf4001000-0xf4001fff irq 21 at device 10.1 on pci0
ahc1: [GIANT-LOCKED]
aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs
pcib2: at device 16.0 on pci0
pci2: on pcib2
ohci0: mem 0xf4100000-0xf4100fff irq 19 at device 0.0 on pci2
ohci0: [GIANT-LOCKED]
usb0: OHCI version 1.0, legacy support
usb0: SMM does not respond, resetting
usb0: on ohci0
usb0: USB revision 1.0
uhub0: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1
uhub0: 4 ports with 4 removable, self powered
pci2: at device 7.0 (no driver attached)
xl0: <3Com 3c980C Fast Etherlink XL> port 0x2400-0x247f mem 0xf4102000-0xf410207f irq 18 at device 8.0 on pci2
miibus0: on xl0
ukphy0: on miibus0
ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl0: Ethernet address: 00:e0:81:22:2e:c4
xl1: <3Com 3c980C Fast Etherlink XL> port 0x2480-0x24ff mem 0xf4102400-0xf410247f irq 19 at device 9.0 on pci2
miibus1: on xl1
ukphy1: on miibus1
ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
xl1: Ethernet address: 00:e0:81:22:2e:c5
atkbdc0: port 0x60,0x64 irq 1 on acpi0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on acpi0
fdc0: does not respond
device_attach: fdc0 attach returned 6
pmtimer0 on isa0
orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xe0000-0xe3fff on isa0
ppc0: parallel port not found.
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 8250 or not responding
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
Timecounters tick every 1.000 msec
hptrr: no controller detected.
Waiting 5 seconds for SCSI devices to settle
ad0: 476940MB at ata0-master UDMA100
amr0: delete logical drives supported by controller
amrd0: on amr0
amrd0: 139900MB (286515200 sectors) RAID 1 (optimal)
SMP: AP CPU #1 Launched!
Trying to mount root from ufs:/dev/amrd0s1a

---
Danny Fullerton
Mantor Organization

Paul Missman wrote:
> [...]

From bug at camisano.net Wed Mar 5 09:40:19 2008
From: bug at camisano.net (Daniel Ponticello)
Date: Wed Mar 5 09:40:28 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
Message-ID: <4ed61a046d15d39fc605e39fe6b02ce0@camisano.net>

Hi Danny,

I ran some tests with FreeBSD 7.0 prerelease in December, and the problem is no longer present. The crash and segfault you see seem to be related to the ACPI/SMP implementation of FreeBSD 6. The problem is also present, and more evident, with VMware virtual hardware. No problems if you are using Intel hardware.

Hope this helps.

Daniel

-----Original message-----
From: Danny Fullerton northox@mantor.org
Date: Wed, 05 Mar 2008 04:32:03 +0100
To: freebsd-smp@freebsd.org
Subject: Re: Dual AMD MP unstable under heavy load when smp is active

> [...]
[GIANT-LOCKED] > >> aic7899: Ultra160 Wide Channel B, SCSI Id=7, 32/253 SCBs > >> pcib2: at device 16.0 on pci0 > >> pci2: on pcib2 > >> ohci0: mem 0xf4100000-0xf4100fff irq 19 > >> at device 0.0 on pci2 > >> ohci0: [GIANT-LOCKED] > >> usb0: OHCI version 1.0, legacy support > >> usb0: SMM does not respond, resetting > >> usb0: on ohci0 > >> usb0: USB revision 1.0 > >> uhub0: AMD OHCI root hub, class 9/0, rev 1.00/1.00, addr 1 > >> uhub0: 4 ports with 4 removable, self powered > >> pci2: at device 7.0 (no driver attached) > >> xl0: <3Com 3c980C Fast Etherlink XL> port 0x2400-0x247f mem > >> 0xf4102000-0xf410207f irq 18 at device 8.0 on pci2 > >> miibus0: on xl0 > >> ukphy0: on miibus0 > >> ukphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto > >> xl0: Ethernet address: 00:e0:81:22:2e:c4 > >> xl1: <3Com 3c980C Fast Etherlink XL> port 0x2480-0x24ff mem > >> 0xf4102400-0xf410247f irq 19 at device 9.0 on pci2 > >> miibus1: on xl1 > >> ukphy1: on miibus1 > >> ukphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto > >> xl1: Ethernet address: 00:e0:81:22:2e:c5 > >> atkbdc0: port 0x60,0x64 irq 1 on acpi0 > >> atkbd0: irq 1 on atkbdc0 > >> kbd0 at atkbd0 > >> atkbd0: [GIANT-LOCKED] > >> fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on > >> acpi0 > >> fdc0: does not respond > >> device_attach: fdc0 attach returned 6 > >> fdc0: port 0x3f0-0x3f5,0x3f7 irq 6 drq 2 on > >> acpi0 > >> fdc0: does not respond > >> device_attach: fdc0 attach returned 6 > >> pmtimer0 on isa0 > >> orm0: at iomem > >> 0xc0000-0xc7fff,0xc8000-0xc87ff,0xc8800-0xc8fff,0xe0000-0xe3fff on isa0 > >> ppc0: parallel port not found. 
> >> sc0: at flags 0x100 on isa0 > >> sc0: VGA <16 virtual consoles, flags=0x300> > >> sio0: configured irq 4 not in bitmap of probed irqs 0 > >> sio0: port may not be enabled > >> sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0 > >> sio0: type 8250 or not responding > >> sio1: configured irq 3 not in bitmap of probed irqs 0 > >> sio1: port may not be enabled > >> vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on > >> isa0 > >> Timecounter "TSC" frequency 1800073530 Hz quality 800 > >> Timecounters tick every 1.000 msec > >> hptrr: no controller detected. > >> Waiting 5 seconds for SCSI devices to settle > >> ad0: 476940MB at ata0-master UDMA100 > >> amr0: delete logical drives supported by controller > >> amrd0: on amr0 > >> amrd0: 139900MB (286515200 sectors) RAID 1 (optimal) > >> Trying to mount root from ufs:/dev/amrd0s1a > >> > >> kldstat: > >> > >> Id Refs Address Size Name > >> 1 10 0xc0400000 7a05b0 kernel > >> 2 1 0xc0ba1000 5c304 acpi.ko > >> 3 1 0xc8093000 3000 fdescfs.ko > >> 4 1 0xc8106000 3000 pflog.ko > >> 5 1 0xc8109000 2d000 pf.ko > >> 6 1 0xc817b000 19000 linux.ko > >> > >> If you have any idea or you need more information to diagnosis the > >> problem please let me known. > >> > >> regards, > >> > >> --- > >> Danny Fullerton > >> Mantor Organization > >> _______________________________________________ > >> freebsd-smp@freebsd.org mailing list > >> http://lists.freebsd.org/mailman/listinfo/freebsd-smp > >> To unsubscribe, send any mail to "freebsd-smp-unsubscribe@freebsd.org" > >> > >> > >> -- > >> No virus found in this incoming message. > >> Checked by AVG Free Edition. 
From missmanp at cfw.com Wed Mar 5 11:58:47 2008
From: missmanp at cfw.com (Paul Missman)
Date: Wed Mar 5 11:58:53 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
References: <47CDFFFF.10507@mantor.org> <007e01c87e6d$1c9c00a0$0b28a8c0@a1000> <47CE1433.2020308@mantor.org>
Message-ID: <009101c87eb8$4728af30$0b28a8c0@a1000>

Danny,

From what I can reconstruct, it seems I was using the 64-bit version of
FreeBSD 6.4. It looks like the last responder to the list says that
version 7 is free of this problem.

Best of luck,

Paul

----- Original Message -----
From: "Danny Fullerton"
To:
Sent: Tuesday, March 04, 2008 10:32 PM
Subject: Re: Dual AMD MP unstable under heavy load when smp is active

> Hello Paul,
>
> I would like to know whether you ran those tests with the recent FreeBSD 7.0.
> I have seen lots of work in the SMP area of this release and I'm wondering
> if I could have a better chance with this version.
>
> thanks,

From northox at mantor.org Fri Mar 7 17:35:56 2008
From: northox at mantor.org (Danny Fullerton)
Date: Fri Mar 7 17:36:01 2008
Subject: Dual AMD MP unstable under heavy load when smp is active
In-Reply-To: <4ed61a046d15d39fc605e39fe6b02ce0@camisano.net>
References: <4ed61a046d15d39fc605e39fe6b02ce0@camisano.net>
Message-ID: <47D17CF9.5090403@mantor.org>

Hi guys,

Just to let you know, upgrading to 7.0 effectively fixed the problem.
Thanks for your help and for 7 years of hard work on the SMP optimization. ;)

---
Danny Fullerton
Mantor Organization
www.mantor.org

Daniel Ponticello wrote:
> Hi Danny,
> i made some tests with FreeBSD 7.0 Prerelease in december and the problem
> is no longer present.
> The crash and seg fault you see seems to be related to ACPI/SMP implementation
> of freebsd6.
> The problem is also present and more evident with VMWare virtual hardware.
> No problems if you are using Intel hardware.
>
>
> Hope this helps.
>
>
> Daniel

From daniel at dgnetwork.com.br Wed Mar 12 20:01:33 2008
From: daniel at dgnetwork.com.br (=?ISO-8859-1?Q?Daniel_Dias_Gon=E7alves?=)
Date: Wed Mar 12 20:01:45 2008
Subject: FreeBSD 6.3 fxp0 MBUF and PAE
Message-ID: <47D834AE.8080301@dgnetwork.com.br>

Hi,

When the fxp0 interface is used with PAE enabled in the kernel, the
following error occurs:

fxp0: can't map mbuf (error 12)
...

The message repeats over and over, and then communication is lost.

Information:
6.3-RELEASE

fxp0@pci14:4:0: class=0x020000 card=0x00708086 chip=0x12298086 rev=0x10 hdr=0x00
    vendor = 'Intel Corporation'
    device = '82550/1/7/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
    class = network
    subclass = ethernet

I await your reply.

Thanks.

Daniel

From pyunyh at gmail.com Thu Mar 13 01:55:04 2008
From: pyunyh at gmail.com (Pyun YongHyeon)
Date: Thu Mar 13 01:55:12 2008
Subject: FreeBSD 6.3 fxp0 MBUF and PAE
In-Reply-To: <47D834AE.8080301@dgnetwork.com.br>
References: <47D834AE.8080301@dgnetwork.com.br>
Message-ID: <20080313012741.GC16972@cdnetworks.co.kr>

On Wed, Mar 12, 2008 at 04:53:18PM -0300, Daniel Dias Gonçalves wrote:
> Hi,
>
> When using the interface fxp0 with PAE enable in kernel, occurs the
> following error:
>
> fxp0: can't map mbuf (error 12)
> ...
>
> it repeats, repeats and lost communication.
>

Error 12 means ENOMEM: bus_dmamap_load_mbuf_sg(9) failed due to insufficient
resources. I guess there is no way to overcome this situation in the driver.
The only remaining option I can think of would be reclaiming transmitted
frames, but how well that works would depend on circumstances.
Personally I don't see a reason to print these ENOMEM errors on a production
box without rate limiting.

> Information:
> 6.3-RELEASE
>
> fxp0@pci14:4:0: class=0x020000 card=0x00708086 chip=0x12298086 rev=0x10
> hdr=0x00
> vendor = 'Intel Corporation'
> device = '82550/1/7/8/9 EtherExpress PRO/100(B) Ethernet Adapter'
> class = network
> subclass = ethernet
>
> I wait reply.
>
> Thanks.
>
> Daniel

--
Regards,
Pyun YongHyeon

From alfred at freebsd.org Sat Mar 15 03:01:41 2008
From: alfred at freebsd.org (Alfred Perlstein)
Date: Sat Mar 15 03:01:44 2008
Subject: timeout/untimeout race conditions/crash [patch]
Message-ID: <20080315024114.GD67856@elvis.mu.org>

We think we tracked down a defect in timeout/untimeout in
FreeBSD.

We have reduced the problem to the following scenario:

2+ cpu system, one cpu is running softclock at the same time
another thread is running on another cpu which makes use of
timeout/untimeout.

CPU 0 is running "softclock"
CPU 1 is running "driver" with Giant held.

softclock: mtx_lock_spin(&callout_lock)
softclock: CACHES the callout structure's fields.
softclock: sees that it's a CALLOUT_LOCAL_ALLOC
softclock: executes this code:
        if (c->c_flags & CALLOUT_LOCAL_ALLOC) {
                c->c_func = NULL;
                c->c_flags = CALLOUT_LOCAL_ALLOC;
                SLIST_INSERT_HEAD(&callfree, c, c_links.sle);
                curr_callout = NULL;
        } else {

    NOTE: c->c_func has been set to NULL and curr_callout
    is also NULL.
softclock: mtx_unlock_spin(&callout_lock)
driver: calls untimeout(), the following sequence happens:
        mtx_lock_spin(&callout_lock);
        if (handle.callout->c_func == ftn && handle.callout->c_arg == arg)
                callout_stop(handle.callout);
        mtx_unlock_spin(&callout_lock);

    NOTE: untimeout() sees that handle.callout->c_func is not set
    to the function so it does NOT call callout_stop(9)!
driver: frees the backing structure for c->c_arg.
softclock: executes callout.
softclock: likely crashes at this point due to access after free.

I have a patch I'm trying out here, but I need feedback on it.

The patch treats CALLOUT_LOCAL_ALLOC (timeout/untimeout) callouts the
same as ~CALLOUT_LOCAL_ALLOC callouts, and moves the freelist
manipulation to the end of the callout dispatch.

With some light testing the system seems to work.

We are doing some testing in-house to also make sure this works.

Please provide feedback.

See attached delta.

--
- Alfred Perlstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kern_timeout.diff
Type: text/x-diff
Size: 1185 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-smp/attachments/20080315/a340c52a/kern_timeout.bin

From jhb at freebsd.org Mon Mar 17 16:44:10 2008
From: jhb at freebsd.org (John Baldwin)
Date: Mon Mar 17 16:44:17 2008
Subject: timeout/untimeout race conditions/crash [patch]
In-Reply-To: <20080315024114.GD67856@elvis.mu.org>
References: <20080315024114.GD67856@elvis.mu.org>
Message-ID: <200803171127.20561.jhb@freebsd.org>

On Friday 14 March 2008 10:41:14 pm Alfred Perlstein wrote:
> We think we tracked down a defect in timeout/untimeout in
> FreeBSD.
>
> We have reduced the problem to the following scenario:
>
> 2+ cpu system, one cpu is running softclock at the same time
> another thread is running on another cpu which makes use of
> timeout/untimeout.
>
> CPU 0 is running "softclock"
> CPU 1 is running "driver" with Giant held.
>
> softclock: mtx_lock_spin(&callout_lock)
> softclock: CACHES the callout structure's fields.
> softclock: sees that it's a CALLOUT_LOCAL_ALLOC
> softclock: executes this code:
>         if (c->c_flags & CALLOUT_LOCAL_ALLOC) {
>                 c->c_func = NULL;
>                 c->c_flags = CALLOUT_LOCAL_ALLOC;
>                 SLIST_INSERT_HEAD(&callfree, c,
>                     c_links.sle);
>                 curr_callout = NULL;
>         } else {
>
> NOTE: that c->c_func has been set to NULL and curr_callout
> is also NULL.
> softclock: mtx_unlock_spin(&callout_lock) > driver: calls untimeout(), the following sequence happens: > mtx_lock_spin(&callout_lock); > if (handle.callout->c_func == ftn && handle.callout->c_arg == arg) > callout_stop(handle.callout); > mtx_unlock_spin(&callout_lock); > > NOTE: untimeout() sees that handle.callout->c_func is not set > to the function so it does NOT call callout_stop(9)! > driver: free's backing structure for c->c_arg. > softclock: executes callout. > softclock: likely crashes at this point due to access after free. > > I have a patch I'm trying out here, but I need feedback on it. > > The way the patch works is to treat CALLOUT_LOCAL_ALLOC (timeout/untimeout) > callouts the same as ~CALLOUT_LOCAL_ALLOC allocs, and moves the > freelist manipulation to the end of the callout dispatch. > > Some light testing seems to have the system work. > > We are doing some testing in-house to also make sure this works. > > Please provide feedback. > > See attached delta. This is not a bug. Don't use untimeout(9) as it is not guaranteed to be reliable. Instead, use callout_*(). Your patch doesn't solve any races as the driver detach routine needs to use callout_drain() and not just callout_stop/untimeout anyways. Fix your broken drivers. -- John Baldwin From alfred at freebsd.org Mon Mar 17 20:26:53 2008 From: alfred at freebsd.org (Alfred Perlstein) Date: Mon Mar 17 20:26:58 2008 Subject: timeout/untimeout race conditions/crash [patch] In-Reply-To: <200803171127.20561.jhb@freebsd.org> References: <20080315024114.GD67856@elvis.mu.org> <200803171127.20561.jhb@freebsd.org> Message-ID: <20080317201014.GA67856@elvis.mu.org> * John Baldwin [080317 09:43] wrote: > > This is not a bug. Don't use untimeout(9) as it is not guaranteed to be > reliable. Instead, use callout_*(). Your patch doesn't solve any races as > the driver detach routine needs to use callout_drain() and not just > callout_stop/untimeout anyways. Fix your broken drivers. 
I understand that some old Giant-locked code can issue timeout/untimeout
without Giant held, which would certainly cause this issue to happen
and is uncorrectable; however, this is with completely Giant-locked
code.

We are not trying to use timeout(9) for mpsafe code; this is old
code and relies upon Giant.

Giant-locked code should be timeout/untimeout safe. As explained
in my email, there exists a condition where the Giant-locked code
can have a timer fire even though proper Giant locking is observed.

For a Giant-locked subsystem, one should be able to have the following
code work:

        mtx_lock(&Giant); /* formerly spl higher than softclock */
        untimeout(&func, arg, &sc->handle);
        free(sc);
        mtx_unlock(&Giant); /* formerly splx() */

Normally splsoftclock would completely block the timeout from firing
and this sort of code would be safe. It is no longer safe due to
a BUG in the way that Giant is used.

Please reread the original mail to better understand the synopsis
of the problem.

thank you,
-Alfred

From jhb at freebsd.org Mon Mar 17 22:29:51 2008
From: jhb at freebsd.org (John Baldwin)
Date: Mon Mar 17 22:29:55 2008
Subject: timeout/untimeout race conditions/crash [patch]
In-Reply-To: <20080317201014.GA67856@elvis.mu.org>
References: <20080315024114.GD67856@elvis.mu.org> <200803171127.20561.jhb@freebsd.org> <20080317201014.GA67856@elvis.mu.org>
Message-ID: <200803171659.33547.jhb@freebsd.org>

On Monday 17 March 2008 04:10:14 pm Alfred Perlstein wrote:
> * John Baldwin [080317 09:43] wrote:
> >
> > This is not a bug. Don't use untimeout(9) as it is not guaranteed to be
> > reliable. Instead, use callout_*(). Your patch doesn't solve any races as
> > the driver detach routine needs to use callout_drain() and not just
> > callout_stop/untimeout anyways. Fix your broken drivers.
>
> I understand that some old Giant locked code can issue timeout/untimeout
> without Giant held, which would certainly cause this issue to happen
> and is uncorrectable, however, this is with completely Giant locked
> code.
>
> We are not trying to use timeout(9) for mpsafe code, this is old
> code and relies upon Giant.
>
> Giant locked code should be timeout/untimeout safe. As explained
> in my email, there exists a condition where the Giant locked code
> can have a timer fire even though proper Giant locking is observed.
>
> For a Giant locked subsystem, one should be able to have the following
> code work:
>
>         mtx_lock(&Giant); /* formerly spl higher than softclock */
>         untimeout(&func, arg, &sc->handle);
>         free(sc);
>         mtx_unlock(&Giant); /* formerly splx() */
>
> Normally splsoftclock would completely block the timeout from firing
> and this sort of code would be safe. It is no longer safe due to
> a BUG in the way that Giant is used.
>
> Please reread the original mail to better understand the synopsis
> of the problem.

Hmm. My worry is about leaving the callout structure around while invoking
the timeout routine itself, but it is already off the callwheel so it
shouldn't be visible via untimeout() to any other code. I guess the patch is
ok, but I'll be happy when we can axe timeout/untimeout altogether.

--
John Baldwin

From alfred at freebsd.org Fri Mar 21 04:32:26 2008
From: alfred at freebsd.org (Alfred Perlstein)
Date: Fri Mar 21 04:32:29 2008
Subject: request for review, callout fix.
Message-ID: <20080321043225.GZ67856@elvis.mu.org>

Hi guys, attached is a fix for the timeout/untimeout race with
Giant-locked code.

Basically the old code would make the callout inaccessible
right before calling it inside of softclock.

However, only the callout lock is held, so when switching to
the callout's associated mutex (in this case Giant) there's
a race where a "local" (timeout/untimeout) callout would be
fired even if stopped.

This patch fixes that.
We've run several hours of regression
testing on a version of this for 6.x.

People internal to Juniper and iedowse@ helped with this.

Please review/comment.

thank you,
--
- Alfred Perlstein
-------------- next part --------------
A non-text attachment was scrubbed...
Name: kern_timeout.diff
Type: text/x-diff
Size: 1396 bytes
Desc: not available
Url : http://lists.freebsd.org/pipermail/freebsd-smp/attachments/20080321/d8e3c8d9/kern_timeout.bin

From jhb at freebsd.org Fri Mar 21 19:08:03 2008
From: jhb at freebsd.org (John Baldwin)
Date: Fri Mar 21 19:08:07 2008
Subject: request for review, callout fix.
In-Reply-To: <20080321043225.GZ67856@elvis.mu.org>
References: <20080321043225.GZ67856@elvis.mu.org>
Message-ID: <200803211314.04002.jhb@freebsd.org>

On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote:
> Hi guys, attached is a fix for the timeout/untimeout race with
> Giant locked code.
>
> Basically the old code would make the callout inaccessable
> right before calling it inside of softclock.
>
> However only the callout lock is held, so when switching to
> the callout's associated mutex (in this case Giant) there's
> a race where a "local" (timeout/untimeout) callout would be
> fired even if stopped.
>
> This patch fixes that. We've run several hours of regression
> testing on a version of this for 6.x.
>
> People internal to Juniper and iedowse@ helped with this.
>
> Please review/comment.

Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set?
Since it hasn't been enqueued on the callfree list, it isn't visible to any
other code, so nothing should be able to mark it active or pending. IOW, you
should be able to do this in your second hunk:

        if (c_flags & CALLOUT_LOCAL_ALLOC) {
                KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout"));
                c->c_func = NULL;
                SLIST_INSERT_HEAD(...);
        }

--
John Baldwin

From alfred at freebsd.org Fri Mar 21 20:30:40 2008
From: alfred at freebsd.org (Alfred Perlstein)
Date: Fri Mar 21 20:30:43 2008
Subject: request for review, callout fix.
In-Reply-To: <200803211314.04002.jhb@freebsd.org>
References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org>
Message-ID: <20080321203039.GJ67856@elvis.mu.org>

* John Baldwin [080321 12:08] wrote:
> On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote:
> > Hi guys, attached is a fix for the timeout/untimeout race with
> > Giant locked code.
> >
> > Basically the old code would make the callout inaccessable
> > right before calling it inside of softclock.
> >
> > However only the callout lock is held, so when switching to
> > the callout's associated mutex (in this case Giant) there's
> > a race where a "local" (timeout/untimeout) callout would be
> > fired even if stopped.
> >
> > This patch fixes that. We've run several hours of regression
> > testing on a version of this for 6.x.
> >
> > People internal to Juniper and iedowse@ helped with this.
> >
> > Please review/comment.
>
> Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set?
> Since it hasn't been enqueued on the callfree list, it isn't visible to any
> other code, so nothing should be able to mark it active or pending. IOW, you
> should be able to do this in your second hunk:
>
>         if (c_flags & CALLOUT_LOCAL_ALLOC) {
>                 KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout"));
>                 c->c_func = NULL;
>                 SLIST_INSERT_HEAD(...);
>         }

It's more hairy than that.

Actually, I think you're right...

The confusion is the race for "callout_stop_safe",
but now that I think about it, the softclock will only
"grab" this callout if untimeout has NOT yet been called.
If untimeout HAS been called, then softclock won't even see the callout. If untimeout HAS NOT been called and softclock grabs the callout, it will have cleared CALLOUT_PENDING and then untimeout (callout_stop_safe) will no longer free it. Therefore it is safe to omit the check for flags as you suggest. Is that right? -- - Alfred Perlstein From jhb at freebsd.org Fri Mar 21 20:55:12 2008 From: jhb at freebsd.org (John Baldwin) Date: Fri Mar 21 20:55:16 2008 Subject: request for review, callout fix. In-Reply-To: <20080321203039.GJ67856@elvis.mu.org> References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org> <20080321203039.GJ67856@elvis.mu.org> Message-ID: <200803211654.36299.jhb@freebsd.org> On Friday 21 March 2008 04:30:39 pm Alfred Perlstein wrote: > * John Baldwin [080321 12:08] wrote: > > On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > > > Hi guys, attached is a fix for the timeout/untimeout race with > > > Giant locked code. > > > > > > Basically the old code would make the callout inaccessable > > > right before calling it inside of softclock. > > > > > > However only the callout lock is held, so when switching to > > > the callout's associated mutex (in this case Giant) there's > > > a race where a "local" (timeout/untimeout) callout would be > > > fired even if stopped. > > > > > > This patch fixes that. We've run several hours of regression > > > testing on a version of this for 6.x. > > > > > > People internal to Juniper and iedowse@ helped with this. > > > > > > Please review/comment. > > > > Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? > > Since it hasn't been enqueued on the callfree list, it isn't visible to any > > other code, so nothing should be able to mark it active or pending. 
IOW, you > > should be able to do this in your second hunk: > > > > if (c_flags & CALLOUT_LOCAL_ALLOC) { > > KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); > > c->c_func = NULL; > > SLIST_INSERT_HEAD(...); > > } > > It's more hairy than that. > > Actually, I think you're right... > > The confusion is the race for "callout_stop_safe", > but now that I think about it, the softclock will only > "grab" this callout if untimeout has NOT yet been called. > > If untimeout HAS been called, then softclock won't even see > the callout. Yes. > If untimeout HAS NOT been called and softclock grabs the > callout, it will have cleared CALLOUT_PENDING and then > untimeout (callout_stop_safe) will no longer free it. Yes. > Therefore it is safe to omit the check for flags as you > suggest. > > Is that right? Yes, I believe so. The tricky case is the race you are originally trying to fix which is that the untimeout() comes in after the spin lock is dropped but before Giant is acquired, but that falls under your second case above. -- John Baldwin From alfred at freebsd.org Fri Mar 21 23:52:06 2008 From: alfred at freebsd.org (Alfred Perlstein) Date: Fri Mar 21 23:52:09 2008 Subject: request for review, callout fix. In-Reply-To: <200803211654.36299.jhb@freebsd.org> References: <20080321043225.GZ67856@elvis.mu.org> <200803211314.04002.jhb@freebsd.org> <20080321203039.GJ67856@elvis.mu.org> <200803211654.36299.jhb@freebsd.org> Message-ID: <20080321235205.GO67856@elvis.mu.org> * John Baldwin [080321 13:55] wrote: > On Friday 21 March 2008 04:30:39 pm Alfred Perlstein wrote: > > * John Baldwin [080321 12:08] wrote: > > > On Friday 21 March 2008 12:32:25 am Alfred Perlstein wrote: > > > > Hi guys, attached is a fix for the timeout/untimeout race with > > > > Giant locked code. > > > > > > > > Basically the old code would make the callout inaccessable > > > > right before calling it inside of softclock. 
> > > > > > > > However only the callout lock is held, so when switching to > > > > the callout's associated mutex (in this case Giant) there's > > > > a race where a "local" (timeout/untimeout) callout would be > > > > fired even if stopped. > > > > > > > > This patch fixes that. We've run several hours of regression > > > > testing on a version of this for 6.x. > > > > > > > > People internal to Juniper and iedowse@ helped with this. > > > > > > > > Please review/comment. > > > > > > Curious as to how c->c_flags could change if CALLOUT_LOCAL_ALLOC is set? > > > Since it hasn't been enqueued on the callfree list, it isn't visible to > any > > > other code, so nothing should be able to mark it active or pending. IOW, > you > > > should be able to do this in your second hunk: > > > > > > if (c_flags & CALLOUT_LOCAL_ALLOC) { > > > KASSERT(c->c_flags == CALLOUT_LOCAL_ALLOC, ("corrupted callout")); > > > c->c_func = NULL; > > > SLIST_INSERT_HEAD(...); > > > } > > > > It's more hairy than that. > > > > Actually, I think you're right... > > > > The confusion is the race for "callout_stop_safe", > > but now that I think about it, the softclock will only > > "grab" this callout if untimeout has NOT yet been called. > > > > If untimeout HAS been called, then softclock won't even see > > the callout. > > Yes. > > > If untimeout HAS NOT been called and softclock grabs the > > callout, it will have cleared CALLOUT_PENDING and then > > untimeout (callout_stop_safe) will no longer free it. > > Yes. > > > Therefore it is safe to omit the check for flags as you > > suggest. > > > > Is that right? > > Yes, I believe so. The tricky case is the race you are originally trying to > fix which is that the untimeout() comes in after the spin lock is dropped but > before Giant is acquired, but that falls under your second case above. Thank you, I'll be committing later tonight. 
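
The interleaving discussed in this thread can be replayed in a small
user-space model. This is only a sketch under stated assumptions, not the
kernel code and not the attached kern_timeout.diff: the names softc,
dispatch, handler, softclock_begin, softclock_finish, model_untimeout and
run_scenario are hypothetical, the locks are left out because the steps run
in a fixed order on one thread, and the "cancelled" flag checked before the
handler runs is an assumption of the model. The structure is never really
freed; an "alive" flag stands in for free() so the model stays well defined.

```c
#include <assert.h>
#include <stddef.h>

#define CALLOUT_LOCAL_ALLOC 0x1 /* callout belongs to the timeout(9) pool */
#define CALLOUT_PENDING     0x2 /* callout is still queued */

typedef void timeout_fn(void *);

struct callout {
    timeout_fn *c_func;
    void *c_arg;
    int c_flags;
};

struct softc { int alive; };    /* hypothetical driver state */

/* Fields softclock cached while it held the callout spin lock. */
struct dispatch { struct callout *c; timeout_fn *fn; void *arg; };

static struct callout *curr_callout;
static int curr_cancelled;
static int fired_after_free;

static void handler(void *arg)
{
    struct softc *sc = arg;

    if (!sc->alive)             /* in a kernel this is the use after free */
        fired_after_free++;
}

/* First half of a dispatch: runs with the spin lock held. */
static struct dispatch softclock_begin(struct callout *c, int deferred)
{
    struct dispatch d = { c, c->c_func, c->c_arg };

    c->c_flags &= ~CALLOUT_PENDING;
    if (deferred) {
        curr_callout = c;       /* still recognisable to untimeout */
    } else {
        c->c_func = NULL;       /* old ordering: recycled right away */
        curr_callout = NULL;
    }
    curr_cancelled = 0;
    return d;
}

/* Second half: runs after the spin lock was dropped and Giant was taken. */
static void softclock_finish(struct dispatch d, int deferred)
{
    assert(d.fn != NULL);
    if (!curr_cancelled)
        d.fn(d.arg);
    if (deferred) {             /* recycle only after the dispatch is over */
        d.c->c_func = NULL;
        curr_callout = NULL;
    }
}

/* Returns 1 if the callout was stopped or cancelled, 0 otherwise. */
static int model_untimeout(struct callout *c, timeout_fn *fn, void *arg)
{
    if (c->c_func != fn || c->c_arg != arg)
        return 0;               /* looks recycled already: nothing is stopped */
    if (c->c_flags & CALLOUT_PENDING) {
        c->c_flags &= ~CALLOUT_PENDING;
        return 1;
    }
    if (c == curr_callout) {
        curr_cancelled = 1;     /* dispatch in flight: ask it to skip */
        return 1;
    }
    return 0;
}

/* Replays the scenario; returns how often the handler ran after the "free". */
static int run_scenario(int deferred)
{
    struct softc sc = { 1 };
    struct callout c = { handler, &sc, CALLOUT_LOCAL_ALLOC | CALLOUT_PENDING };
    struct dispatch d;

    fired_after_free = 0;
    d = softclock_begin(&c, deferred);   /* CPU 0: softclock, spin lock held */
    model_untimeout(&c, handler, &sc);   /* CPU 1: driver, Giant held */
    sc.alive = 0;                        /* CPU 1: stands in for free(sc) */
    softclock_finish(d, deferred);       /* CPU 0: softclock, after Giant */
    return fired_after_free;
}
```

In this model run_scenario(0), the old ordering, returns 1 because the
handler still runs on the "freed" structure, while run_scenario(1), with the
freelist step deferred, returns 0.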
-Alfred

From mstogsdill at sycamore.us Thu Mar 27 15:51:28 2008
From: mstogsdill at sycamore.us (Michael Stogsdill)
Date: Thu Mar 27 15:51:33 2008
Subject: SMP/HTT and Beowulf cluster on FreeBSD 7.0-RELEASE
In-Reply-To: <10687817.342491206657081791.JavaMail.root@dcifs4>
Message-ID: <11970534.342511206657117969.JavaMail.root@dcifs4>

Hey, I have a question that I can't seem to find a decent answer to. The
mailing lists have had some similar topics, but they were mostly for
6.x-RELEASE and had other minor differences.

Here's my situation: I'm trying to create a Beowulf cluster running FreeBSD
7.0-RELEASE consisting of 8 systems, each a dual Xeon w/HTT with both
processors running at 2.8GHz. What exactly this system will be doing I can't
say, so let's just think about the performance of the cluster running a
benchmarking tool similar to sysbench. Because they are dual-processor
systems, I will be using the SMP kernel already, so I'm wondering if turning
on HTT will help or hinder my Beowulf's performance. The cluster is not yet
operational, otherwise I would do some tests myself!

Thanks!

--
Michael B. Stogsdill
Sycamore.US Inc
Software Engineer/Security Administrator

From kris at FreeBSD.org Thu Mar 27 16:01:19 2008
From: kris at FreeBSD.org (Kris Kennaway)
Date: Thu Mar 27 16:01:22 2008
Subject: SMP/HTT and Beowulf cluster on FreeBSD 7.0-RELEASE
In-Reply-To: <11970534.342511206657117969.JavaMail.root@dcifs4>
References: <11970534.342511206657117969.JavaMail.root@dcifs4>
Message-ID: <47EC2744.6010405@FreeBSD.org>

Michael Stogsdill wrote:
> Hey, I have a question that I can't seem to find a decent answer to. The mailing lists have had some similar topics, but they were mostly for the 6.x-RELEASE and other minor differences.
>
> Heres my situation; I'm trying to create a beowulf cluster running FreeBSD 7.0-RELEASE consisting of 8 systems all running on Dual Xeon w/HTT both running at 2.8Gz. What exactly this system will be doing I can't say, so lets just think about the performance of the cluster running a benchmarking tool similar to sysbench. Because they are dual processor systems, I will be using the SMP kernel already, so I'm wondering if turning on HTT will help or hinder my Beowulf' performance. The cluster is not yet operational, otherwise I would do some tests myself!

The answer is always "HTT performance depends on your workload. Try it
and see if it helps".

Kris

From julian at elischer.org Thu Mar 27 17:42:06 2008
From: julian at elischer.org (Julian Elischer)
Date: Thu Mar 27 17:42:10 2008
Subject: SMP/HTT and Beowulf cluster on FreeBSD 7.0-RELEASE
In-Reply-To: <47EC2744.6010405@FreeBSD.org>
References: <11970534.342511206657117969.JavaMail.root@dcifs4> <47EC2744.6010405@FreeBSD.org>
Message-ID: <47EC3BA1.6070808@elischer.org>

Kris Kennaway wrote:
> Michael Stogsdill wrote:
>> Hey, I have a question that I can't seem to find a decent answer to.
>> The mailing lists have had some similar topics, but they were mostly
>> for the 6.x-RELEASE and other minor differences.
>> Heres my situation; I'm trying to create a beowulf cluster running
>> FreeBSD 7.0-RELEASE consisting of 8 systems all running on Dual Xeon
>> w/HTT both running at 2.8Gz. What exactly this system will be doing I
>> can't say, so lets just think about the performance of the cluster
>> running a benchmarking tool similar to sysbench. Because they are dual
>> processor systems, I will be using the SMP kernel already, so I'm
>> wondering if turning on HTT will help or hinder my Beowulf'
>> performance. The cluster is not yet operational, otherwise I would do
>> some tests myself!
>
> The answer is always "HTT performance depends on your workload. Try it
> and see if it helps".

New processors are not quite as bad as the original HTT processors.
Also, if the workload includes a mix of FP and integer work, it will be
worth having HTT on.
>
> Kris
> _______________________________________________
> freebsd-smp@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-smp
> To unsubscribe, send any mail to "freebsd-smp-unsubscribe@freebsd.org"

From stephen at math.missouri.edu Thu Mar 27 17:45:54 2008
From: stephen at math.missouri.edu (Stephen Montgomery-Smith)
Date: Thu Mar 27 17:46:00 2008
Subject: SMP/HTT and Beowulf cluster on FreeBSD 7.0-RELEASE
In-Reply-To: <11970534.342511206657117969.JavaMail.root@dcifs4>
References: <11970534.342511206657117969.JavaMail.root@dcifs4>
Message-ID: <47EC3ADF.3010303@math.missouri.edu>

Michael Stogsdill wrote:
> Hey, I have a question that I can't seem to find a decent answer to. The mailing lists have had some similar topics, but they were mostly for the 6.x-RELEASE and other minor differences.
>
> Here's my situation: I'm trying to create a Beowulf cluster running FreeBSD 7.0-RELEASE consisting of 8 systems all running on Dual Xeon w/HTT both running at 2.8 GHz. What exactly this system will be doing I can't say, so let's just think about the performance of the cluster running a benchmarking tool similar to sysbench. Because they are dual processor systems, I will be using the SMP kernel already, so I'm wondering if turning on HTT will help or hinder my Beowulf's performance. The cluster is not yet operational, otherwise I would do some tests myself!
>
> Thanks!

My personal experience is that HTT did help. This was with an older Xeon dual processor system, and I was running multithreaded programs that were basically huge amounts of floating point calculations (a bit like FFT).

Also, recent advances in FreeBSD have made it extremely good at running multithreaded programs, but I still think it would be worthwhile trying Linux as well. In the old days, Linux did much better, and who knows, they might have advanced ahead again.
Since you are looking to get every ounce of performance out of your computers, I would try out all the possibilities and see what works best.

(And to answer your question, I found recent versions of FreeBSD slightly better than Linux at taking advantage of HTT in my particular applications.)

From petri at helenius.fi Thu Mar 27 22:17:00 2008
From: petri at helenius.fi (Petri Helenius)
Date: Thu Mar 27 22:17:05 2008
Subject: SMP/HTT and Beowulf cluster on FreeBSD 7.0-RELEASE
In-Reply-To: <47EC3ADF.3010303@math.missouri.edu>
References: <11970534.342511206657117969.JavaMail.root@dcifs4> <47EC3ADF.3010303@math.missouri.edu>
Message-ID: <47EC7AAD.5020107@helenius.fi>

Stephen Montgomery-Smith wrote:
>
> My personal experience is that HTT did help. This was with an older
> Xeon dual processor system, and I was running multithreaded programs
> that were basically huge amounts of floating point calculations (a bit
> like FFT).

In my experience, HTT helps 20-30% in a well-engineered workload. Extreme care needs to be taken to minimize synchronization primitive overhead, or that will take away your gain. (Don't have 4 or 8 threads hammering on one mutex a million times a second.)

Pete

From fbsdq at kuhl.co.uk Sun Mar 30 03:57:43 2008
From: fbsdq at kuhl.co.uk (Rob)
Date: Sun Mar 30 03:57:48 2008
Subject: SMP interrupt problem
Message-ID: <47EF6EC1.8040706@kuhl.co.uk>

Hi, I've got a problem with high interrupt load on a couple of dual CPU servers. I first loaded 6.3 onto each of them, and both displayed a constant load showing under interrupt when looking at top. I then loaded 7.0 onto one of them and found exactly the same problem. One of them has since had Fedora installed and works fine with that.

Below is the top -CS output and dmesg from the machine with 7.0 installed. If anyone wishes to see the same info from the 6.3 install, I have that as well.

Thanks!
top -CS

last pid:   752;  load averages:  0.02,  0.11,  0.09    up 0+00:09:49  06:49:17
67 processes:  6 running, 47 sleeping, 14 waiting
CPU states:  0.0% user,  0.0% nice,  0.0% system, 23.6% interrupt, 76.4% idle
Mem: 8400K Active, 5488K Inact, 21M Wired, 8512K Buf, 958M Free
Swap: 1024M Total, 1024M Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE  C   TIME    CPU COMMAND
   11 root        1 171 ki31     0K     8K CPU3   3   9:17 99.02% idle: cpu3
   12 root        1 171 ki31     0K     8K CPU2   2   9:17 99.02% idle: cpu2
   13 root        1 171 ki31     0K     8K RUN    1   9:15 99.02% idle: cpu1
   24 root        1 -52    -     0K     8K CPU0   0   7:18 85.40% irq9: acpi0
   14 root        1 171 ki31     0K     8K RUN    0   2:00 12.60% idle: cpu0
   15 root        1 -32    -     0K     8K WAIT   1   0:01  0.00% swi4: clock s
    4 root        1  -8    -     0K     8K -      1   0:00  0.00% g_down
    3 root        1  -8    -     0K     8K -      1   0:00  0.00% g_up
  728 root        1   4    0  8384K  3816K sbwait 3   0:00  0.00% sshd
  701 root        1   8    0  3596K  1580K wait   2   0:00  0.00% login
   31 root        1 -64    -     0K     8K WAIT   0   0:00  0.00% irq14: ata0
    2 root        1  -8    -     0K     8K -      1   0:00  0.00% g_event
  709 root        1   5    0  3472K  2176K ttyin  1   0:00  0.00% csh
  739 root        1  20    0  3472K  2204K pause  2   0:00  0.00% csh
   43 root        1 -32    -     0K     8K -      3   0:00  0.00% schedcpu
  752 root        1  96    0  3488K  1640K CPU1   1   0:00  0.00% top

dmesg

Copyright (c) 1992-2008 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
        The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 7.0-RELEASE #0: Sun Feb 24 19:59:52 UTC 2008
    root@logan.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(TM) CPU 2.80GHz (2799.22-MHz 686-class CPU)
  Origin = "GenuineIntel"  Id = 0xf29  Stepping = 9
  Features=0xbfebfbff
  Features2=0x4400
  Logical CPUs per core: 2
real memory  = 1073676288 (1023 MB)
avail memory = 1037078528 (989 MB)
ACPI APIC Table:
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
 cpu0 (BSP): APIC ID:  0
 cpu1 (AP): APIC ID:  1
 cpu2 (AP): APIC ID:  6
 cpu3 (AP): APIC ID:  7
MADT: Forcing active-low polarity and level trigger for SCI
ioapic0 irqs 0-23 on motherboard
ioapic1 irqs 24-47 on motherboard
ioapic2 irqs 48-71 on motherboard
kbd1 at kbdmux0
ath_hal: 0.9.20.3 (AR5210, AR5211, AR5212, RF5111, RF5112, RF2413, RF5413)
hptrr: HPT RocketRAID controller driver v1.1 (Feb 24 2008 19:59:27)
acpi0: on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
acpi0: reservation of 100000, 3ff00000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
cpu0: on acpi0
p4tcc0: on cpu0
cpu1: on acpi0
p4tcc1: on cpu1
cpu2: on acpi0
p4tcc2: on cpu2
cpu3: on acpi0
p4tcc3: on cpu3
pcib0: port 0xcf8-0xcff on acpi0
pci0: on pcib0
pcib1: at device 2.0 on pci0
pci2: on pcib1
pcib2: at device 29.0 on pci2
pci4: on pcib2
em0: port 0xd800-0xd83f mem 0xfe9e0000-0xfe9fffff irq 48 at device 1.0 on pci4
em0: Ethernet address: 00:e0:81:27:63:93
em0: [FILTER]
pcib3: at device 31.0 on pci2
pci3: on pcib3
uhci0: port 0xe800-0xe81f irq 16 at device 29.0 on pci0
uhci0: [GIANT-LOCKED]
uhci0: [ITHREAD]
usb0: on uhci0
usb0: USB revision 1.0
uhub0: on usb0
uhub0: 2 ports with 2 removable, self powered
pcib4: at device 30.0 on pci0
pci1: on pcib4
fxp0: port 0xc400-0xc43f mem 0xfe7fe000-0xfe7fefff,0xfe7a0000-0xfe7bffff irq 17 at device 1.0 on pci1
miibus0: on fxp0
inphy0: PHY 1 on miibus0
inphy0:
 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
fxp0: Ethernet address: 00:e0:81:27:63:92
fxp0: [ITHREAD]
vgapci0: port 0xc800-0xc8ff mem 0xfd000000-0xfdffffff,0xfe7ff000-0xfe7fffff irq 18 at device 2.0 on pci1
isab0: at device 31.0 on pci0
isa0: on isab0
atapci0: port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xffa0-0xffaf at device 31.1 on pci0
ata0: on atapci0
ata0: [ITHREAD]
ata1: on atapci0
ata1: [ITHREAD]
pci0: at device 31.3 (no driver attached)
acpi_button0: on acpi0
acpi_button1: on acpi0
pmtimer0 on isa0
orm0: at iomem 0xc0000-0xc7fff,0xc8000-0xc97ff pnpid ORM0000 on isa0
atkbdc0: at port 0x60,0x64 on isa0
atkbd0: irq 1 on atkbdc0
kbd0 at atkbd0
atkbd0: [GIANT-LOCKED]
atkbd0: [ITHREAD]
ppc0: parallel port not found.
sc0: at flags 0x100 on isa0
sc0: VGA <16 virtual consoles, flags=0x300>
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0: configured irq 4 not in bitmap of probed irqs 0
sio0: port may not be enabled
sio0 at port 0x3f8-0x3ff irq 4 flags 0x10 on isa0
sio0: type 8250 or not responding
sio0: [FILTER]
sio1: configured irq 3 not in bitmap of probed irqs 0
sio1: port may not be enabled
vga0: at port 0x3c0-0x3df iomem 0xa0000-0xbffff on isa0
ums0: on uhub0
ums0: 3 buttons and Z dir.
ukbd0: on uhub0
kbd2 at ukbd0
uhid0: on uhub0
Timecounters tick every 1.000 msec
hptrr: no controller detected.
ad0: 76319MB at ata0-master UDMA100
acd0: CDROM at ata1-master UDMA33
SMP: AP CPU #1 Launched!
SMP: AP CPU #2 Launched!
SMP: AP CPU #3 Launched!
Trying to mount root from ufs:/dev/ad0s1a