8.1 amd64 lockup (maybe zfs or disk related)

Greg Bonett greg at bonett.org
Sat Feb 12 03:24:30 UTC 2011

Thanks for all the help. I've learned some new things, but haven't fixed
the problem yet.

> 1) Re-enable both CPU cores; I can't see this being responsible for the
> problem.  I do understand the concern over added power draw, but see
> recommendation (4a) below.

I re-enabled all cores but experienced a lockup while running zpool
scrub.  I was able to run scrub twice with 4 of 6 cores enabled without
a lockup.  Also, when the lockup occurs I'm not able to break into the
debugger with ctrl-alt-esc.  To keep things straight: since I'm running
geli, more cores mean more I/O throughput during a scrub.
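Since geli's crypto worker threads scale with the core count, one way to test whether the extra crypto load is implicated is to cap the thread count and re-run the scrub with all cores enabled.  This is only a sketch: kern.geom.eli.threads is a loader tunable, so the cap takes effect after a reboot.

```shell
# Show how many crypto worker threads geli starts per provider:
sysctl kern.geom.eli.threads

# To cap it (e.g. at 2), add this to /boot/loader.conf and reboot:
# kern.geom.eli.threads=2
```

If the scrub then survives with all six cores enabled, that points at load (power/heat/crypto throughput) rather than a core-count bug.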

If I'm not able to use the kernel debugger to diagnose this problem,
should I disable it?  Could it be a security risk?
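If the debugger stays compiled in, the usual way to reduce the exposure is via the stock FreeBSD sysctls (a sketch; these can be made persistent in /etc/sysctl.conf):

```shell
# Prevent console users from breaking into ddb with the hotkey,
# while still allowing a panic to drop into the debugger:
sysctl debug.kdb.break_to_debugger=0

# Or keep the debugger out of the panic path entirely:
# sysctl debug.debugger_on_panic=0
```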

> 1) Disable the JMicron SATA controller entirely.
> 2) Disable the ATI IXP700/800 SATA controller entirely.
> 3a) Purchase a Silicon Image controller (one of the models I referenced
> in my previous mail).  Many places sell them, but lots of online vendors
> hide or do not disclose what ASIC they're using for the controller.  You
> might have to look at their Driver Downloads section to find out what
> actual chip is used.

This is on my to-do list, but as of now I'm still running the on-board
controllers.  I should have the replacement controller by next week.

> 3b) You've stated you're using one of your drives on an eSATA cable.  If
> you are using a SATA-to-eSATA adapter bracket[1][2], please stop
> immediately and use a native eSATA port instead.
> Adapter brackets are known to cause all sorts of problems that appear as
> bizarre/strange failures (xxx_DMAxx errors are quite common in this
> situation), not to mention with all the internal cabling and external
> cabling, a lot of the time people exceed the maximum SATA cable length
> without even realising it -- it's the entire length from the SATA port
> on your motherboard, to and through the adapter (good luck figuring out
> how much wire is used there, to the end of the eSATA cable.  Native
> eSATA removes use of the shoddy adapters and also extends the maximum
> cable length (from 1 metre to 2 metres), plus provides the proper amount
> of power for eSATA devices (yes this matters!).  Wikipedia has
> details[3].
> Silicon Image and others do make chips that offer both internal SATA and
> an eSATA port on the same controller.  Given your number of disks, you
> might have to invest in multiple controllers.

My motherboard has a native eSATA port and that's what I'm using (not
an adapter bracket).  Do you still recommend against it?  I figured one
fewer drive in the case would reduce the load on my PSU.

> 4a) Purchase a Kill-a-Watt meter and measure exactly how much power your
> entire PC draws, including on power-on (it will be a lot higher during
> power-on than during idle/use, as drives spinning up draw lots of amps).
> I strongly recommend the Kill-a-Watt P4600 model[4] over the P4400 model.
> Based on the wattage and amperage results, you should be able to
> determine if you're nearing the maximum draw of your PSU.

Kill-a-Watt meter arrived today.  It looks like during boot the system
doesn't exceed 200 watts.  During a zpool scrub it gets up to ~255 watts
(with all cores enabled).  So I don't think the problem is gross power
draw.
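Running the meter readings through the basic amps = watts / volts conversion as a sanity check (a sketch; ~120 V mains and the 255 W scrub figure above are the assumptions):

```shell
# Convert measured wall draw to amperage: amps = watts / volts.
# 255 W during a scrub at ~120 V mains:
awk -v watts=255 -v volts=120 'BEGIN { printf "%.3f A\n", watts/volts }'
```

That is a small fraction of a 15 A household circuit, which supports the conclusion that total draw isn't the issue; per-cable distribution (4b below) is a separate question.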

> 4b) However, even if you're way under-draw (say, 400W), the draw may not
> be the problem but instead the maximum amount of power/amperage/whatever
> a single physical power cable can provide.  I imagine to some degree it
> depends on the gauge of wire being used; excessive use of Y-splitters to
> provide more power connectors than the physical cable provides means
> that you might be drawing too much across the existing gauge of cable
> that runs to the PSU.  I have seen setups where people have 6 hard disks
> coming off of a single power cable (with Y-splitters and molex-to-SATA
> power adapters) and have their drives randomly drop off the bus.  Please
> don't do this.

Yes, this seems like it could be a problem.  I'll shut down and figure
out which drives are connected to which cables; maybe with some
rearranging I can even out the load.  Even if I do have a bunch of
drives on a single cable, would a voltage drop on that one heavily
loaded cable be enough to lock up the machine?  It seems like the
motherboard power, which comes in on separate cables, would be
unaffected.

> A better solution might be to invest in a server-grade chassis, such as
> one from Supermicro, that offers a hot-swap SATA backplane.  The
> backplane provides all the correct amounts of power to the maximum
> number of disks that can be connected to it.  Here are some cases you
> can look at that[5][6][7].  Also be aware that if you're already using a
> hot-swap backplane, most consumer-grade ones are complete junk and have
> been known to cause strange anomalies; it's always best in those
> situations to go straight from motherboard-to-drive or card-to-drive.

This would be nice, but it's not in my budget right now.  I'll keep it
in mind for my next major upgrade.  

> After reviewing your SMART stats on the drive, I agree -- it looks
> perfectly healthy (for a Seagate disk).  Nothing wrong there.
> > > > calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> > > > calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> > > > calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> > > > calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> > > > calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)
> > > 
> > > This is a problem that has plagued FreeBSD for some time.  It's usually
> > > caused by EIST (est) being used, but that's on Intel platforms.  AMD has
> > > something similar called Cool'n'Quiet (see cpufreq(4) man page).  Are
> > > you running powerd(8) on this system?  If so, try disabling that and see
> > > if these go away.
> > 
> > sadly, I don't know if I'm running powerd. 
> > ps aux | grep power gives nothing, so no I guess...
> > as far as I can tell, this error is the least of my problems right now,
> > but i would like to fix it.
> Yes that's an accurate ps/grep to use; powerd_enable="yes" in
> /etc/rc.conf is how you make use of it.

Is this recommended for desktop machines?  
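For reference, the check and the enable step described above look like this (a sketch; paths as on stock FreeBSD 8.x):

```shell
# Check whether powerd is running (same idea as the ps|grep above):
pgrep powerd >/dev/null && echo "powerd is running" || echo "powerd is not running"

# Enable it at boot and start it now:
echo 'powerd_enable="YES"' >> /etc/rc.conf
/etc/rc.d/powerd start
```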

> Could you provide output from "sysctl -a | grep freq"?  That might help
> shed some light on the above errors as well, but as I said, I'm not
> familiar with AMD systems.

$ sysctl -a | grep freq
kern.acct_chkfreq: 15
kern.timecounter.tc.i8254.frequency: 1193182
kern.timecounter.tc.ACPI-fast.frequency: 3579545
kern.timecounter.tc.HPET.frequency: 14318180
kern.timecounter.tc.TSC.frequency: 3491654411
net.inet.sctp.sack_freq: 2
debug.cpufreq.verbose: 0
debug.cpufreq.lowest: 0
machdep.acpi_timer_freq: 3579545
machdep.tsc_freq: 3491654411
machdep.i8254_freq: 1193182
dev.cpu.0.freq: 3000
dev.cpu.0.freq_levels: 3000/19507 2625/17068 2300/14500 2012/12687
1725/10875 1600/10535 1400/9218 1200/7901 1000/6584 800/6345 700/5551
600/4758 500/3965 400/3172 300/2379 200/1586 100/793
dev.acpi_throttle.0.freq_settings: 10000/-1 8750/-1 7500/-1 6250/-1
5000/-1 3750/-1 2500/-1 1250/-1
dev.cpufreq.0.%driver: cpufreq
dev.cpufreq.0.%parent: cpu0
dev.hwpstate.0.freq_settings: 3000/19507 2300/14500 1600/10535 800/6345
