8.1 amd64 lockup (maybe zfs or disk related)

Wed Feb 9 09:29:00 UTC 2011

On Tue, Feb 08, 2011 at 11:07:21PM -0800, Greg Bonett wrote:
> rebuilt my kernel with debug options, but thankfully I think I've
> learned how to avoid lockup for the time being.  I think I am asking too
> much of my 650 watt power supply.  I unplugged one hard drive and
> disabled another CPU core (now running 4 of 6).  I'm sad to lose the
> horsepower, but I was able to complete an entire zpool scrub and other
> high load tasks without a lockup.

Too much to reply to with regard to your disk setup, so I'll summarise
my recommendations at this point:

0) Re-enable both CPU cores; I can't see this being responsible for the
problem.  I do understand the concern over added power draw, but see
recommendation (4a) below.

1) Disable the JMicron SATA controller entirely.

2) Disable the ATI IXP700/800 SATA controller entirely.

3a) Purchase a Silicon Image controller (one of the models I referenced
in my previous mail).  Many places sell them, but lots of online vendors
hide or do not disclose what ASIC they're using for the controller.  You
might have to look at their Driver Downloads section to find out what
actual chip is used.

3b) You've stated you're using one of your drives on an eSATA cable.  If
you are using a SATA-to-eSATA adapter bracket[1][2], please stop
immediately and use a native eSATA port instead.

Adapter brackets are known to cause all sorts of problems that appear as
bizarre/strange failures (xxx_DMAxx errors are quite common in this
situation).  On top of that, with all the internal and external cabling,
people often exceed the maximum SATA cable length without even realising
it.  What counts is the entire length from the SATA port on your
motherboard, to and through the adapter (good luck figuring out how much
wire is used there), to the end of the eSATA cable.  Native
eSATA removes use of the shoddy adapters and also extends the maximum
cable length (from 1 metre to 2 metres), plus drives the link at the
signal levels eSATA requires (yes this matters!).  Wikipedia has
details[3].

Silicon Image and others do make chips that offer both internal SATA and
an eSATA port on the same controller.  Given your number of disks, you
might have to invest in multiple controllers.

4a) Purchase a Kill-a-Watt meter and measure exactly how much power your
entire PC draws, including on power-on (it will be a lot higher during
power-on than during idle/use, as drives spinning up draw lots of amps).
I strongly recommend the Kill-a-Watt P4600 model[4] over the P4400 model.
Based on the wattage and amperage results, you should be able to
determine if you're nearing the maximum draw of your PSU.

4b) However, even if you're way under-draw (say, 400W), total draw may
not be the problem; the limit may instead be how much current a single
physical power cable can carry.  I imagine to some degree it
depends on the gauge of wire being used; excessive use of Y-splitters to
provide more power connectors than the physical cable provides means
that you might be drawing too much across the existing gauge of cable
that runs to the PSU.  I have seen setups where people have 6 hard disks
coming off of a single power cable (with Y-splitters and molex-to-SATA
power adapters) and have their drives randomly drop off the bus.  Please
don't do this.

A better solution might be to invest in a server-grade chassis, such as
one from Supermicro, that offers a hot-swap SATA backplane.  The
backplane provides all the correct amounts of power to the maximum
number of disks that can be connected to it.  Here are some cases you
can look at[5][6][7].  Also be aware that if you're already using a
hot-swap backplane, most consumer-grade ones are complete junk and have
been known to cause strange anomalies; it's always best in those
situations to go straight from motherboard-to-drive or card-to-drive.
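
As a rough sketch of the arithmetic behind (4a) and (4b): a Kill-a-Watt
reads AC draw at the wall, while a PSU's wattage rating refers to its DC
output, so the reading has to be scaled by the PSU's efficiency before
the two are compared.  The efficiency, the per-drive spin-up current and
the per-cable limit below are placeholder assumptions (take real figures
from the PSU and drive data sheets), and the function names are made up
for illustration.

```python
# Rough power-budget check for recommendations (4a) and (4b).
# Apart from the 650W rating and the 400W example taken from the thread,
# every number here is a placeholder assumption, not a measurement.

PSU_RATED_WATTS = 650       # PSU rating mentioned in the thread
ASSUMED_EFFICIENCY = 0.80   # assumption: check the PSU's data sheet
ASSUMED_SPINUP_AMPS = 2.0   # assumption: 12V spin-up current per drive
ASSUMED_CABLE_LIMIT = 8.0   # assumption: hypothetical per-cable limit (amps)


def dc_load_watts(wall_watts, efficiency):
    """Estimate the DC load on the PSU from a wall-side (AC) reading."""
    return wall_watts * efficiency


def headroom_fraction(wall_watts, psu_rated_watts, efficiency):
    """Fraction of the PSU's rated output still unused (negative if over)."""
    return 1.0 - dc_load_watts(wall_watts, efficiency) / psu_rated_watts


def cable_amps(drive_count, amps_per_drive):
    """Total current on one cable if all of its drives spin up at once."""
    return drive_count * amps_per_drive


if __name__ == "__main__":
    wall = 400  # the "way under-draw" example figure from (4b)
    load = dc_load_watts(wall, ASSUMED_EFFICIENCY)
    spare = headroom_fraction(wall, PSU_RATED_WATTS, ASSUMED_EFFICIENCY)
    print(f"estimated DC load: {load:.0f} W, headroom: {spare:.0%}")

    amps = cable_amps(6, ASSUMED_SPINUP_AMPS)  # 6 disks on one cable
    verdict = "over" if amps > ASSUMED_CABLE_LIMIT else "within"
    print(f"one cable, 6 drives: {amps:.1f} A ({verdict} the assumed limit)")
```

With these placeholder numbers the PSU as a whole has plenty of headroom,
yet six drives spinning up on one cable would exceed the assumed
per-cable limit, which is the situation (4b) describes.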

[1]: http://www.cooldrives.com/newesiidebrf.html 
[2]: http://www.cooldrives.com/essaii3gbexp.html
[3]: http://en.wikipedia.org/wiki/Serial_ATA#eSATA
[4]: http://www.amazon.com/dp/B000RGF29Q
[5]: http://www.supermicro.com/products/chassis/4U/?chs=742
[6]: http://www.supermicro.com/products/chassis/4U/?chs=743
[7]: http://www.supermicro.com/products/chassis/4U/?chs=745

> I've attached the output of smartctl -a /dev/ad1.  I don't think this
> error is being caused by the disk though.

After reviewing your SMART stats on the drive, I agree -- it looks
perfectly healthy (for a Seagate disk).  Nothing wrong there.

> > > calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> > > calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> > > calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> > > calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> > > calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)
> > 
> > This is a problem that has plagued FreeBSD for some time.  It's usually
> > caused by EIST (est) being used, but that's on Intel platforms.  AMD has
> > something similar called Cool'n'Quiet (see cpufreq(4) man page).  Are
> > you running powerd(8) on this system?  If so, try disabling that and see
> > if these go away.
> 
> sadly, I don't know if I'm running powerd. 
> ps aux | grep power gives nothing, so no I guess...
> as far as I can tell, this error is the least of my problems right now,
> but i would like to fix it.

Yes, that's an accurate ps/grep to use, so powerd isn't running.  Setting
powerd_enable="yes" in /etc/rc.conf is how you would enable it.

Could you provide output from "sysctl -a | grep freq"?  That might help
shed some light on the above errors as well, but as I said, I'm not
familiar with AMD systems.
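
To see how large the backwards steps in the calcru messages quoted above
are, the lines can be parsed with a few lines of Python.  This is only a
sketch: the regular expression is written against the five messages in
this thread, and parse_calcru is a made-up helper name, not part of any
FreeBSD tool.

```python
import re

# Matches lines such as:
#   calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
CALCRU_RE = re.compile(
    r"calcru: runtime went backwards from (\d+) usec to (\d+) usec "
    r"for pid (\d+) \((.+?)\)"
)


def parse_calcru(line):
    """Return a dict describing one calcru message, or None if no match."""
    match = CALCRU_RE.search(line)
    if match is None:
        return None
    before = int(match.group(1))
    after = int(match.group(2))
    return {
        "pid": int(match.group(3)),
        "name": match.group(4),
        "before_usec": before,
        "after_usec": after,
        "lost_usec": before - after,  # how far the runtime stepped back
    }


if __name__ == "__main__":
    sample = [
        "calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)",
        "calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)",
    ]
    for line in sample:
        info = parse_calcru(line)
        print(info["pid"], info["name"], info["lost_usec"], "usec")
```

Leading "> " quote markers do not need to be stripped first, since the
pattern is searched for anywhere in the line.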

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |