Stable SATA pci card for FreeBSD 6.x/7.0

Cian Hughes Ci at nHugh.es
Fri Aug 15 00:44:21 UTC 2008


Sebastiaan,
Have you tried connecting your 250GB drives to the troublesome  
controller? If so, does "stressing" them cause the system to panic?
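
A minimal sketch of one way to "stress" a drive without writing to it, assuming sustained sequential reads are enough to provoke the fault (the failures reported in this thread happened during writes, so that is an open assumption). The function name stress_read and any device path passed to it are placeholders, not something from this thread.

```python
import time


def stress_read(path, chunk_size=1024 * 1024, max_chunks=None):
    """Read `path` sequentially; return (bytes_read, seconds, error_or_None)."""
    total = 0
    chunks = 0
    error = None
    start = time.monotonic()
    try:
        # buffering=0 gives unbuffered raw reads, one read() call per chunk.
        with open(path, "rb", buffering=0) as src:
            while max_chunks is None or chunks < max_chunks:
                data = src.read(chunk_size)
                if not data:  # end of file or device
                    break
                total += len(data)
                chunks += 1
    except OSError as exc:
        # An I/O error (for example a detached device) ends the run early.
        error = exc
    return total, time.monotonic() - start, error
```

Run as root against a disk device (for example stress_read("/dev/ad4"), a hypothetical path) it only reads, so the data on the disk is not touched. If the kernel panics, nothing is returned at all, which is itself the answer to the question.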

~Cian Hughes
--
University of Bristol Medical School

On 14 Aug 2008, at 10:37, Sebastiaan van Erk wrote:

> Thanks Jonathan,
>
> I'm starting to suspect it has to be the controller as well. About 20  
> minutes after I posted this message yesterday (and thus 20 minutes  
> after ad6 got disconnected - atacontrol list showed "no device  
> present" for it) the machine crashed while writing to the remaining  
> ad4 drive (kernel panic). I attached the logs below. I also ran the  
> long smart self test on both drives, and no errors were found on  
> either drive (logs also attached).
>
> Unfortunately I could not attach the new disks to my mainboard SATA  
> because my mainboard SATA somehow hangs trying to detect them. So I  
> cannot test if *not* using the controller is going to solve the  
> problems, though at the moment it seems logical that it has to be the  
> controller, especially if other people have had similar issues.
>
> I guess I'll be buying another controller.
>
> Regards,
> Sebastiaan
>
> Jonathan Groll wrote:
>> On Wed, Aug 13, 2008 at 03:10:56PM +0200, Sebastiaan van Erk wrote:
>>> Hi,
>>>
>>> Just an update on this issue.
>>>
>>> Quick summary: I fixed the BIOS issues, the hardware monitor  
>>> issues, and the rl0/rl1 watchdog timeout issues (it seems).  
>>> However I'm still having problems with my SATA drives (or at least  
>>> one of them). More info below.
>>>
>>> BIOS:
>>> I flashed my BIOS to the latest version about a year ago, and  
>>> never noticed that there was any problem, but it turns out there  
>>> was. I never reset the BIOS to default factory settings after the  
>>> upgrade, and it seems the settings were corrupt. After having  
>>> reset the BIOS to the "default optimized factory settings" it  
>>> stopped crashing when I go into the H/W monitor and also when  
>>> using healthd -d (output below):
>>>
>>> Temp.= 40.0, 36.0, 66.0; Rot.=    0,    0,    0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00,  1.95,  -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.=    0,    0,    0
>>> Vcore = 1.44, 3.14; Volt. = 3.33, 4.97,  1.95,  -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.=    0,    0,    0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 4.97,  1.95,  -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.=    0,    0,    0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00,  1.95,  -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.=    0,    0,    0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00,  1.95,  -0.11, -1.54
>>>
>>> This also seems to have fixed the rl0 watchdog timeout problems. I  
>>> no longer see those in my logs.
>>>
>>> SATA DRIVES:
>>>
>>> I'm still having problems with the SATA drives.
>>>
>>> I tried connecting the 1TB Samsung drives to my mainboard, but  
>>> then the box hangs when booting with the "Detecting IDE drives"  
>>> message. The regular (PATA) IDE drives are detected first, and  
>>> then it repeats the "Detecting IDE drives" message to detect the  
>>> SATA drives, and hangs. When I connect my 250GB SATA drives to my  
>>> mainboard they detect fine, and the box boots normally.
>>>
>>> I did another rsync of my old mirror (the 250GB disks) to the new  
>>> mirror (1TB disks), but again one of the disks got detached. This  
>>> time there are no other messages in the log, the only thing I see  
>>> is the following:
>>>
>>> Aug 13 14:35:27 piglet su: sebster to root on /dev/ttyp5
>>> Aug 13 14:55:38 piglet kernel: ad6: FAILURE - device detached
>>> Aug 13 14:55:38 piglet kernel: subdisk6: detached
>>> Aug 13 14:55:38 piglet kernel: ad6: detached
>>> Aug 13 14:55:38 piglet kernel: GEOM_MIRROR: Device gm1: provider ad6 disconnected.
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> Unfortunately the log file had just been rotated, but the new
>>> log file contains nothing except the one expected line:
>>>
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> So, nothing after the disconnect...
>>>
>>> The questions I have now are:
>>> 1) Could an upgrade to FreeBSD 7-STABLE fix the issue? (It's a LOT
>>> of work for me, but I'll do it if SATA driver issues have been
>>> fixed there.)
>> I suspect the problem may be the SiI driver in FreeBSD. As a reference
>> point, I've had a similar problem, even on 7-STABLE, but with sparc64
>> hardware (see earlier post in this thread).
>> It'll probably be simplest for you to just buy another controller of
>> another brand. On the other hand, it'll be worth knowing exactly what
>> is wrong with the SiI driver...
>> Cheers,
>> Jonathan
> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
> Aug 13 15:11:26 piglet su: sebster to root on /dev/ttyp4
> Aug 13 15:34:55 piglet kernel: mirror/gm1s1e[WRITE(offset=875450693632, length=2048)]error = 6
> Aug 13 15:34:55 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450695680, length=2048)]error = 6
>
> [snip 335750 similar lines]
>
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450931200, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450933248, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450935296, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450937344, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450939392, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450941440, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450943488, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450945536, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450947584, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450949632, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450951680, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450953728, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450955776, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450957824, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450959872, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450961920, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450963968, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450966016, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450968064, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450970112, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450972160, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450974208, length=2048)]error = 6
> Aug 13 15:42:23 piglet syslogd: kernel boot file is /boot/kernel/kernel
> Aug 13 15:42:23 piglet kernel: Copyright (c) 1992-2008 The FreeBSD Project.
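
With some 335750 near-identical lines in the log above, a summary is easier to read than the raw text. The following is a rough Python sketch, assuming each g_vfs_done() entry has first been rejoined onto a single line (the mail client wrapped them). The names ENTRY and summarize are made up for this example. Error 6 is ENXIO, which FreeBSD describes as "Device not configured".

```python
import errno
import re

# Matches one unwrapped g_vfs_done() kernel message of the form seen in the log.
ENTRY = re.compile(
    r"g_vfs_done\(\):(?P<provider>\S+?)"
    r"\[(?P<op>READ|WRITE)\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"error = (?P<error>\d+)"
)


def summarize(lines):
    """Summarize g_vfs_done() failures: count, offset range, errno names."""
    hits = [m for m in map(ENTRY.search, lines) if m]
    if not hits:
        return None
    offsets = [int(m["offset"]) for m in hits]
    codes = sorted({int(m["error"]) for m in hits})
    return {
        "count": len(hits),
        "providers": sorted({m["provider"] for m in hits}),
        "first_offset": min(offsets),
        "last_offset": max(offsets),
        # Translate numeric codes to symbolic names where the platform knows them.
        "errors": [errno.errorcode.get(c, str(c)) for c in codes],
    }


# Two entries copied from the log above, plus one unrelated line that is skipped.
sample = [
    "Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e"
    "[WRITE(offset=875450931200, length=2048)]error = 6",
    "Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e"
    "[WRITE(offset=875450933248, length=2048)]error = 6",
    "Aug 13 15:42:23 piglet syslogd: kernel boot file is /boot/kernel/kernel",
]
print(summarize(sample))
```

Fed the whole file (for example summarize(open(path)) with a log path of your choosing), this shows at a glance whether every failure hit the same provider with the same error code.
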
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8  
> Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ1BQ606865
> Firmware Version: 1AA01112
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Thu Aug 14 11:28:13 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual  
> for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x02)	Offline data collection  
> activity
> 					was completed without error.
> 					Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0)	The previous self-test  
> routine completed
> 					without error or no self-test has ever
> 					been run.
> Total time to complete Offline
> data collection: 		 (11811) seconds.
> Offline data collection
> capabilities: 			 (0x7b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before  
> entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					General Purpose Logging supported.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 ( 198) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (  21) minutes.
> SCT capabilities: 	       (0x003f)	SCT Status supported.
> 					SCT Feature Control supported.
> 					SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>  3 Spin_Up_Time            0x0007   076   076   011    Pre-fail  Always       -       8010
>  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       8
>  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10255
>  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       272
> 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
> 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
> 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
> 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   057   052   000    Old_age   Always       -       43 (Lifetime Min/Max 43/48)
> 194 Temperature_Celsius     0x0022   056   050   000    Old_age   Always       -       44 (Lifetime Min/Max 43/50)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       195799724
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision number = 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Offline             Completed without error       00%       261         -
> # 2  Offline             Aborted by host               40%       251         -
> # 3  Short offline       Aborted by host               00%       250         -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0)  
> should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data  
> structure revision number = 1
> SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute  
> delay.
>
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8  
> Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ1BQ607102
> Firmware Version: 1AA01112
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Thu Aug 14 11:28:39 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual  
> for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x02)	Offline data collection  
> activity
> 					was completed without error.
> 					Auto Offline Data Collection: Disabled.
> Self-test execution status:      (   0)	The previous self-test  
> routine completed
> 					without error or no self-test has ever
> 					been run.
> Total time to complete Offline
> data collection: 		 (12131) seconds.
> Offline data collection
> capabilities: 			 (0x7b) SMART execute Offline immediate.
> 					Auto Offline data collection on/off support.
> 					Suspend Offline collection upon new
> 					command.
> 					Offline surface scan supported.
> 					Self-test supported.
> 					Conveyance Self-test supported.
> 					Selective Self-test supported.
> SMART capabilities:            (0x0003)	Saves SMART data before  
> entering
> 					power-saving mode.
> 					Supports SMART auto save timer.
> Error logging capability:        (0x01)	Error logging supported.
> 					General Purpose Logging supported.
> Short self-test routine
> recommended polling time: 	 (   2) minutes.
> Extended self-test routine
> recommended polling time: 	 ( 203) minutes.
> Conveyance self-test routine
> recommended polling time: 	 (  22) minutes.
> SCT capabilities: 	       (0x003f)	SCT Status supported.
> 					SCT Feature Control supported.
> 					SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>  3 Spin_Up_Time            0x0007   077   077   011    Pre-fail  Always       -       7810
>  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       10
>  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>  8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       9978
>  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       272
> 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
> 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
> 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
> 13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 184 Unknown_Attribute       0x0033   100   100   099    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   059   054   000    Old_age   Always       -       41 (Lifetime Min/Max 41/46)
> 194 Temperature_Celsius     0x0022   058   053   000    Old_age   Always       -       42 (Lifetime Min/Max 41/47)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       31616
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision number = 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Offline             Completed without error       00%       261         -
> # 2  Offline             Aborted by host               40%       251         -
> # 3  Short offline       Aborted by host               00%       250         -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0)  
> should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data  
> structure revision number = 1
> SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>    1        0        0  Not_testing
>    2        0        0  Not_testing
>    3        0        0  Not_testing
>    4        0        0  Not_testing
>    5        0        0  Not_testing
> Selective self-test flags (0x0):
>  After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute  
> delay.
>
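
The two smartctl reports above are nearly identical, so it can help to list only the attributes whose raw values differ. Below is a rough Python sketch, assuming each attribute row has been rejoined onto one line. The names ROW, parse_attributes and differing are invented for this example. The sample rows are a small subset copied from the two reports (serial S13PJ1BQ606865 first, S13PJ1BQ607102 second).

```python
import re

# One unwrapped attribute row: ID, name, flag, value, worst, thresh,
# type, updated, when_failed, then the raw value (leading integer only).
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]{4}\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> (name, raw value) for each unwrapped table row."""
    table = {}
    for line in text.splitlines():
        m = ROW.match(line.lstrip("> "))  # tolerate mail quote prefixes
        if m:
            table[int(m["id"])] = (m["name"], int(m["raw"]))
    return table


def differing(a, b):
    """Return {id: (name, raw_a, raw_b)} for IDs whose raw values differ."""
    return {
        attr_id: (a[attr_id][0], a[attr_id][1], b[attr_id][1])
        for attr_id in sorted(a.keys() & b.keys())
        if a[attr_id][1] != b[attr_id][1]
    }


drive_a = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       195799724
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""
drive_b = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       10
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       31616
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
"""

for attr_id, (name, raw_a, raw_b) in differing(
    parse_attributes(drive_a), parse_attributes(drive_b)
).items():
    print(attr_id, name, raw_a, raw_b)
```

On these rows the Hardware_ECC_Recovered raw counts (195799724 versus 31616) stand out. Raw values are vendor-specific, so whether that difference means anything for this problem is not established here. Both drives passed their self-tests.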


