8.1 amd64 lockup (maybe zfs or disk related)

Wed Feb 9 07:07:25 UTC 2011

OK, I think you're right - there is more than one problem with this
system, but I think I'm starting to isolate them and make some
progress.

> # Debugging options
> options         BREAK_TO_DEBUGGER       # Sending a serial BREAK drops to DDB
> options         KDB                     # Enable kernel debugger support
> options         KDB_TRACE               # Print stack trace automatically on panic
> options         DDB                     # Support DDB
> options         GDB                     # Support remote GDB
> 
> Documented here:
> http://www.freebsd.org/doc/en/books/developers-handbook/kerneldebug-options.html

I rebuilt my kernel with the debug options, but thankfully I think I've
learned how to avoid the lockups for the time being.  I think I am
asking too much of my 650 watt power supply.  I unplugged one hard drive
and disabled another CPU core (now running 4 of 6).  I'm sad to lose the
horsepower, but I was able to complete an entire zpool scrub and other
high-load tasks without a lockup.

> Let's look at your storage controller setup:
> 
> atapci0: <JMicron JMB361 UDMA133 controller> irq 18
> atapci1: <AHCI SATA controller> on atapci0
>    ata2: <ATA channel 0> on atapci1
>    ata3: <ATA channel 1> on atapci1
>    ata4: <ATA channel 0> on atapci0
> atapci2: <ATI IXP700/800 SATA300 controller> irq 19
> atapci2: AHCI v1.20 controller with 4 6Gbps ports, PM supported
>    ata5: <ATA channel 0> on atapci2
>    ata6: <ATA channel 1> on atapci2
>    ata7: <ATA channel 2> on atapci2
>    ata8: <ATA channel 3> on atapci2
> atapci3: <ATI IXP700/800 UDMA133 controller> port 0x1f0-0x1f7,0x3f6,0x170-0x177,0x376,0xff00-0xff0f at device 20.1 on pci0
>    ata0: <ATA channel 0> on atapci3
>    ata1: <ATA channel 1> on atapci3
> 
> There have been recent discussions about "problems" on the ATI
> IXP700/800 controllers.  I do not buy AMD systems, so I can't comment on
> this controllers' reliability.  Just a FYI point.  Here's the thread:
> 
> http://lists.freebsd.org/pipermail/freebsd-stable/2011-February/thread.html#61348
> 
> I also tend to avoid JMicron controllers like the plague.  I've seen too
> many problem reports with them over the years, regardless of OS.

I'll look into this.  I think the controller is the source of the
"FAILURE - READ_DMA48" errors.  I switched the disk/SATA port pairing
and the error stayed with the SATA port, not the disk.

> Now for the disk layout (I'm excluding da0, which is a USB flash disk of
> some kind).
> 
>  ad0: 953869MB <WDC WD10EARS-00Y5B1 80.00A80> at ata0-master UDMA133 SATA
>  ad1: 953869MB <Seagate ST31000333AS CC1H> at ata0-slave UDMA133 SATA
>  ad4: 1430799MB <WL1500GSA6472 05.00F.1> at ata2-master UDMA100 SATA 3Gb/s
>  ad8: 15279MB <TRANSCEND 20091215> at ata4-master UDMA66 
> acd0: CDRW <NEC CD-RW NR-7900A/1.08> at ata4-slave UDMA33 
> ad10: 953869MB <WL1000GSA1672 05.00J05> at ata5-master UDMA100 SATA 3Gb/s
> ad12: 953869MB <Seagate ST31000333AS CC1H> at ata6-master UDMA100 SATA 3Gb/s
> ad14: 953869MB <SAMSUNG HD103UJ 1AA01118> at ata7-master UDMA100 SATA 3Gb/s
> ad16: 953869MB <WL1000GSA1672 HA.00CHA> at ata8-master UDMA100 SATA 3Gb/s
> 
> You have a very large number of hard disks in this machine, so I sure
> hope you do have a decent enough PSU to handle it all.
> 
> If I had to make a recommendation, it would be to decrease the number of
> hard disks in the machine.  You have 8 of them -- one of which may be a
> RAM drive or something similar -- and that isn't including your CDRW
> drive.

Yes, I think this is the problem.  Though, for clarification, there are
only 6 spindle disks in the machine.  ad4 is an external drive over
eSATA (with its own power), and ad8 is a CF drive.

> I would also try getting rid of the JMicron controller; I would
> recommend investing in a Silicon Image controller to replace it,
> specifically one driven by the 3124, 3132, or 3531 chips.  Avoid the
> 3112, 3114, and 3512 chips:
> http://en.wikipedia.org/wiki/Silicon_Image#Product_alerts

Thanks for the recommendation.  I'll probably pick one of these up along
with a new power supply.

> Next we have this:
> 
> > ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
> > GEOM: ad1: partition 1 does not start on a track boundary.
> > GEOM: ad1: partition 1 does not end on a track boundary.
> > GEOM: label/1TBdisk5: partition 1 does not start on a track boundary.
> > GEOM: label/1TBdisk5: partition 1 does not end on a track boundary.
> 
> This doesn't look good, especially the READ_DMA timeout on ad1.  That's
> a different disk than the one you told me about before.  LBA 1 is
> literally the 2nd block on the disk, which is a little too close to
> block 0 for comfort.  I'd love to see "smartctl -a /dev/ad1" output
> here.

I've attached the output of "smartctl -a /dev/ad1".  I don't think this
error is being caused by the disk, though.  As I said above, I changed
the SATA port / drive pairing and this error stays with the SATA port,
not the drive.  (So, as you said, time for a new controller.)
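
One way to back up the swap test described above is to tally the ATA
error messages per device name before and after swapping disks.  The
sketch below is only an illustration: the function name is made up, and
it assumes the kernel messages look like the "adN: TIMEOUT - ..." line
quoted earlier and that adN names follow the channel (as in the disk
layout listed above), so a count that stays on the same adN after a
swap points at the port rather than the disk.

```python
import re
from collections import Counter

# Matches ata(4)-style error lines such as:
#   ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1
# Only lines that begin with the device name are counted, so GEOM
# notices that merely mention a device are ignored.
ERROR_RE = re.compile(r"^(ad\d+): (TIMEOUT|FAILURE) - (\w+)")


def count_errors_by_device(log_text):
    """Return a Counter mapping device name (e.g. 'ad1') to error count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "ad1: TIMEOUT - READ_DMA retrying (1 retry left) LBA=1\n"
        "GEOM: ad1: partition 1 does not start on a track boundary.\n"
    )
    print(count_errors_by_device(sample))
```

Feeding it the saved output of dmesg from before and after the swap
gives two tallies that can be compared directly.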

> > calcru: runtime went backwards from 82 usec to 70 usec for pid 20 (flowcleaner)
> > calcru: runtime went backwards from 363 usec to 317 usec for pid 8 (pagedaemon)
> > calcru: runtime went backwards from 111 usec to 95 usec for pid 7 (xpt_thrd)
> > calcru: runtime went backwards from 1892 usec to 1629 usec for pid 1 (init)
> > calcru: runtime went backwards from 6786 usec to 6591 usec for pid 0 (kernel)
> 
> This is a problem that has plagued FreeBSD for some time.  It's usually
> caused by EIST (est) being used, but that's on Intel platforms.  AMD has
> something similar called Cool'n'Quiet (see cpufreq(4) man page).  Are
> you running powerd(8) on this system?  If so, try disabling that and see
> if these go away.

Sadly, I don't know if I'm running powerd.
"ps aux | grep power" returns nothing, so I guess not.
As far as I can tell, this error is the least of my problems right now,
but I would like to fix it.
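
A second check, besides looking at the process list, is whether powerd
is enabled in rc.conf at all.  The sketch below is a rough illustration
with a made-up function name; it assumes the setting, if present, is a
plain powerd_enable="..." line, and it accepts the same "yes" values
that the rc scripts do (YES, TRUE, ON, 1).  It does not follow includes
or look at rc.conf.local.

```python
import re

# Values that the rc scripts treat as "enabled" (case-insensitive).
TRUTHY = {"yes", "true", "on", "1"}

# Uncommented assignment at the start of a line, quotes optional.
ASSIGN_RE = re.compile(r"^\s*powerd_enable=[\"']?([^\"'\s#]*)")


def powerd_enabled(rc_conf_text):
    """Return True if the last powerd_enable assignment is a 'yes' value."""
    enabled = False  # powerd stays off unless rc.conf turns it on
    for line in rc_conf_text.splitlines():
        match = ASSIGN_RE.match(line)
        if match:
            # A later assignment overrides an earlier one.
            enabled = match.group(1).lower() in TRUTHY
    return enabled
```

For example, powerd_enabled(open("/etc/rc.conf").read()) should come
back False on a machine where powerd was never switched on.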

> > GEOM_ELI: Device label/1tbgreendisk.eli created.
> > GEOM_ELI: Encryption: AES-CBC 256
> > GEOM_ELI:     Crypto: software
> > {...}
> 
> There was no mention of geli(8) being used on this system until now.
> There may be other complexities as a result of this; I don't know.

Yeah, geli is being used on this system; sorry I forgot to mention that.

> Good luck.
> 

Thanks for the help.  I'm at least able to keep the machine online now.
-------------- next part --------------
smartctl 5.40 2010-10-16 r3189 [FreeBSD 8.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.11 family
Device Model:     ST31000333AS
Serial Number:    9TE1MB10
Firmware Version: CC1H
User Capacity:    1,000,204,886,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Tue Feb  8 07:41:31 2011 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		 ( 617) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 208) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   120   099   006    Pre-fail  Always       -       243069120
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       84
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   079   060   030    Pre-fail  Always       -       83902794
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       14308
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       84
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   093   000    Old_age   Always       -       56
189 High_Fly_Writes         0x003a   017   017   000    Old_age   Always       -       83
190 Airflow_Temperature_Cel 0x0022   076   051   045    Old_age   Always       -       24 (Min/Max 24/24)
194 Temperature_Celsius     0x0022   024   049   000    Old_age   Always       -       24 (0 17 0 0)
195 Hardware_ECC_Recovered  0x001a   050   019   000    Old_age   Always       -       243069120
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       96619584305016
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       412576321
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2438661969

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     13221         -
# 2  Extended offline    Interrupted (host reset)      90%     13216         -
# 3  Short offline       Completed without error       00%     13207         -
# 4  Extended offline    Interrupted (host reset)      50%     13199         -
# 5  Extended offline    Completed without error       00%     13134         -
# 6  Conveyance offline  Completed without error       00%     13131         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
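
The attribute table in the report above can be checked mechanically.
The sketch below is only an illustration: the function names and the
list of watched attributes are assumptions chosen for this example, not
anything smartctl defines, and raw values are vendor specific, so a
non-zero count is a prompt to look closer rather than a verdict.

```python
# Attributes picked for this example; adjust to taste.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
    "Command_Timeout",
)


def parse_smart_attributes(smartctl_text):
    """Parse the 'ID# ATTRIBUTE_NAME ...' table from smartctl -a output.

    Returns a dict mapping attribute name to a dict holding integer
    'value', 'worst' and 'thresh', plus the raw value as a string.
    """
    attrs = {}
    in_table = False
    for line in smartctl_text.splitlines():
        if line.startswith("ID#"):
            in_table = True
            continue
        if not in_table:
            continue
        # Ten columns; the last (RAW_VALUE) may itself contain spaces.
        fields = line.split(None, 9)
        if len(fields) < 10 or not fields[0].isdigit():
            break  # a blank line or the next section ends the table
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "raw": fields[9].strip(),
        }
    return attrs


def nonzero_watched(attrs, watched=WATCHED):
    """Return {name: raw count} for watched attributes with a non-zero raw value."""
    result = {}
    for name in watched:
        if name in attrs:
            raw = int(attrs[name]["raw"].split()[0])
            if raw != 0:
                result[name] = raw
    return result
```

Run against the table above, the only watched attribute with a non-zero
raw value is Command_Timeout (56); the reallocated, pending,
uncorrectable and CRC error counts are all 0.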