Stable SATA pci card for FreeBSD 6.x/7.0
Cian Hughes
Ci at nHugh.es
Fri Aug 15 00:44:21 UTC 2008
Sebastiaan,
Have you tried connecting your 250GB drives to the troublesome
controller? If so, does "stressing" them cause the system to panic?
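
For what it's worth, "stressing" here could be as simple as the rough
sketch below: write a large file of random data to the filesystem on
the suspect controller, fsync it, and read it back. The function name
stress_write and the sizes are my own assumptions, not anything from
this thread, and the read-back may be served from the buffer cache, so
treat a matching checksum as a weak signal only.

```python
import hashlib
import os


def stress_write(path, total_bytes, chunk_size=1 << 20):
    """Write total_bytes of random data to path, fsync, then read it back.

    Returns True if the SHA-256 of the data read back matches what was
    written. The name and defaults are illustrative assumptions.
    """
    written = hashlib.sha256()
    remaining = total_bytes
    with open(path, "wb") as f:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            f.write(chunk)
            written.update(chunk)
            remaining -= len(chunk)
        # Push the data out of the process and ask the kernel to flush it.
        f.flush()
        os.fsync(f.fileno())

    read_back = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            read_back.update(chunk)
    return written.digest() == read_back.digest()
```

Pointing it at a file on the mirror with a total size well above the
amount of RAM in the box should keep the controller busy for a while.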
~Cian Hughes
--
University of Bristol Medical School
On 14 Aug 2008, at 10:37, Sebastiaan van Erk wrote:
> Thanks Jonathan,
>
> I'm starting to suspect it has to be the controller as well. About 20
> minutes after I posted this message yesterday (and thus 20 minutes
> after ad6 got disconnected - atacontrol list showed "no device
> present" for it) the machine crashed while writing to the remaining
> ad4 drive (kernel panic). I attached the logs below. I also ran the
> long smart self test on both drives, and no errors were found on
> either drive (logs also attached).
>
> Unfortunately I could not attach the new disks to my mainboard SATA
> because my mainboard SATA somehow hangs trying to detect them. So I
> cannot test if *not* using the controller is going to solve the
> problems, though at the moment it seems logical that it has to be the
> controller, especially if other people have had similar issues.
>
> I guess I'll be buying another controller.
>
> Regards,
> Sebastiaan
>
> Jonathan Groll wrote:
>> On Wed, Aug 13, 2008 at 03:10:56PM +0200, Sebastiaan van Erk wrote:
>>> Hi,
>>>
>>> Just an update on this issue.
>>>
>>> Quick summary: I fixed the BIOS issues, the hardware monitor
>>> issues, and the rl0/rl1 watchdog timeout issues (it seems).
>>> However I'm still having problems with my SATA drives (or at least
>>> one of them). More info below.
>>>
>>> BIOS:
>>> I flashed my BIOS to the latest version about a year ago, and
>>> never noticed that there was any problem, but it turns out there
>>> was. I never reset the BIOS to default factory settings after the
>>> upgrade, and it seems the settings were corrupt. After having
>>> reset the BIOS to the "default optimized factory settings" it
>>> stopped crashing when I go into the H/W monitor and also when
>>> using healthd -d (output below):
>>>
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.14; Volt. = 3.33, 4.97, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 4.97, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>> Temp.= 40.0, 36.0, 66.0; Rot.= 0, 0, 0
>>> Vcore = 1.44, 3.12; Volt. = 3.34, 5.00, 1.95, -0.11, -1.54
>>>
>>> This also seems to have fixed the rl0 watchdog timeout problems. I
>>> no longer see those in my logs.
>>>
>>> SATA DRIVES:
>>>
>>> I'm still having problems with the SATA drives.
>>>
>>> I tried connecting the 1TB Samsung drives to my mainboard, but
>>> then the box hangs when booting with the "Detecting IDE drives"
>>> message. The regular (PATA) IDE drives are detected first, and
>>> then it repeats the "Detecting IDE drives" message to detect the
>>> SATA drives, and hangs. When I connect my 250GB SATA drives to my
>>> mainboard they detect fine, and the box boots normally.
>>>
>>> I did another rsync of my old mirror (the 250GB disks) to the new
>>> mirror (1TB disks), but again one of the disks got detached. This
>>> time there are no other messages in the log, the only thing I see
>>> is the following:
>>>
>>> Aug 13 14:35:27 piglet su: sebster to root on /dev/ttyp5
>>> Aug 13 14:55:38 piglet kernel: ad6: FAILURE - device detached
>>> Aug 13 14:55:38 piglet kernel: subdisk6: detached
>>> Aug 13 14:55:38 piglet kernel: ad6: detached
>>> Aug 13 14:55:38 piglet kernel: GEOM_MIRROR: Device gm1: provider ad6 disconnected.
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> (It is unfortunate that the log file just got rotated, but in the
>>> new log file there is nothing except the one expected line):
>>>
>>> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
>>>
>>> So, nothing after the disconnect...
>>>
>>> The questions I have now are:
>>> 1) Could an upgrade to FreeBSD 7-STABLE fix the issue (it's a LOT
>>> of work for me, but I'll do it if there are SATA driver issues
>>> fixed).
>> I suspect the problem may be the SiI driver in FreeBSD. As a reference
>> point, I've had a similar problem, even on 7-STABLE, but with sparc64
>> hardware (see earlier post in this thread).
>> It'll probably be simplest for you to just buy another controller of
>> another brand. On the other hand, it'll be worth knowing exactly what
>> is wrong with the SiI driver...
>> Cheers,
>> Jonathan
> Aug 13 15:00:00 piglet newsyslog[1800]: logfile turned over due to size>100K
> Aug 13 15:11:26 piglet su: sebster to root on /dev/ttyp4
> Aug 13 15:34:55 piglet kernel: mirror/gm1s1e[WRITE(offset=875450693632, length=2048)]error = 6
> Aug 13 15:34:55 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450695680, length=2048)]error = 6
>
> [snip 335750 similar lines]
>
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450931200, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450933248, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450935296, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450937344, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450939392, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450941440, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450943488, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450945536, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450947584, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450949632, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450951680, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450953728, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450955776, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450957824, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450959872, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450961920, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450963968, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450966016, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450968064, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450970112, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450972160, length=2048)]error = 6
> Aug 13 15:36:30 piglet kernel: g_vfs_done():mirror/gm1s1e[WRITE(offset=875450974208, length=2048)]error = 6
> Aug 13 15:42:23 piglet syslogd: kernel boot file is /boot/kernel/kernel
> Aug 13 15:42:23 piglet kernel: Copyright (c) 1992-2008 The FreeBSD Project.
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8
> Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model: SAMSUNG HD103UJ
> Serial Number: S13PJ1BQ606865
> Firmware Version: 1AA01112
> User Capacity: 1,000,204,886,016 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 8
> ATA Standard is: ATA-8-ACS revision 3b
> Local Time is: Thu Aug 14 11:28:13 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual
> for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x02) Offline data collection
> activity
> was completed without error.
> Auto Offline Data Collection: Disabled.
> Self-test execution status: ( 0) The previous self-test
> routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (11811) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before
> entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 198) minutes.
> Conveyance self-test routine
> recommended polling time: ( 21) minutes.
> SCT capabilities: (0x003f) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
> UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail
> Always - 0
> 3 Spin_Up_Time 0x0007 076 076 011 Pre-fail
> Always - 8010
> 4 Start_Stop_Count 0x0032 100 100 000 Old_age
> Always - 8
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail
> Always - 0
> 7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail
> Always - 0
> 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail
> Offline - 10255
> 9 Power_On_Hours 0x0032 100 100 000 Old_age
> Always - 272
> 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail
> Always - 0
> 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age
> Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
> Always - 8
> 13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age
> Always - 0
> 183 Unknown_Attribute 0x0032 100 100 000 Old_age
> Always - 0
> 184 Unknown_Attribute 0x0033 100 100 099 Pre-fail
> Always - 0
> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age
> Always - 0
> 188 Unknown_Attribute 0x0032 100 100 000 Old_age
> Always - 0
> 190 Airflow_Temperature_Cel 0x0022 057 052 000 Old_age
> Always - 43 (Lifetime Min/Max 43/48)
> 194 Temperature_Celsius 0x0022 056 050 000 Old_age
> Always - 44 (Lifetime Min/Max 43/50)
> 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age
> Always - 195799724
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age
> Always - 0
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age
> Always - 0
> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age
> Offline - 0
> 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age
> Always - 0
> 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age
> Always - 0
> 201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age
> Always - 0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision
> number = 1
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Offline Completed without error 00%
> 261 -
> # 2 Offline Aborted by host 40%
> 251 -
> # 3 Short offline Aborted by host 00%
> 250 -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0)
> should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data
> structure revision number = 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute
> delay.
>
> smartctl version 5.38 [i386-portbld-freebsd6.3] Copyright (C) 2002-8
> Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Device Model: SAMSUNG HD103UJ
> Serial Number: S13PJ1BQ607102
> Firmware Version: 1AA01112
> User Capacity: 1,000,204,886,016 bytes
> Device is: In smartctl database [for details use: -P show]
> ATA Version is: 8
> ATA Standard is: ATA-8-ACS revision 3b
> Local Time is: Thu Aug 14 11:28:39 2008 CEST
>
> ==> WARNING: May need -F samsung or -F samsung2 enabled; see manual
> for details.
>
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status: (0x02) Offline data collection
> activity
> was completed without error.
> Auto Offline Data Collection: Disabled.
> Self-test execution status: ( 0) The previous self-test
> routine completed
> without error or no self-test has ever
> been run.
> Total time to complete Offline
> data collection: (12131) seconds.
> Offline data collection
> capabilities: (0x7b) SMART execute Offline immediate.
> Auto Offline data collection on/off support.
> Suspend Offline collection upon new
> command.
> Offline surface scan supported.
> Self-test supported.
> Conveyance Self-test supported.
> Selective Self-test supported.
> SMART capabilities: (0x0003) Saves SMART data before
> entering
> power-saving mode.
> Supports SMART auto save timer.
> Error logging capability: (0x01) Error logging supported.
> General Purpose Logging supported.
> Short self-test routine
> recommended polling time: ( 2) minutes.
> Extended self-test routine
> recommended polling time: ( 203) minutes.
> Conveyance self-test routine
> recommended polling time: ( 22) minutes.
> SCT capabilities: (0x003f) SCT Status supported.
> SCT Feature Control supported.
> SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
> UPDATED WHEN_FAILED RAW_VALUE
> 1 Raw_Read_Error_Rate 0x000f 100 100 051 Pre-fail
> Always - 0
> 3 Spin_Up_Time 0x0007 077 077 011 Pre-fail
> Always - 7810
> 4 Start_Stop_Count 0x0032 100 100 000 Old_age
> Always - 10
> 5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail
> Always - 0
> 7 Seek_Error_Rate 0x000f 253 253 051 Pre-fail
> Always - 0
> 8 Seek_Time_Performance 0x0025 100 100 015 Pre-fail
> Offline - 9978
> 9 Power_On_Hours 0x0032 100 100 000 Old_age
> Always - 272
> 10 Spin_Retry_Count 0x0033 100 100 051 Pre-fail
> Always - 0
> 11 Calibration_Retry_Count 0x0012 100 100 000 Old_age
> Always - 0
> 12 Power_Cycle_Count 0x0032 100 100 000 Old_age
> Always - 10
> 13 Read_Soft_Error_Rate 0x000e 100 100 000 Old_age
> Always - 0
> 183 Unknown_Attribute 0x0032 100 100 000 Old_age
> Always - 0
> 184 Unknown_Attribute 0x0033 100 100 099 Pre-fail
> Always - 0
> 187 Reported_Uncorrect 0x0032 100 100 000 Old_age
> Always - 0
> 188 Unknown_Attribute 0x0032 100 100 000 Old_age
> Always - 0
> 190 Airflow_Temperature_Cel 0x0022 059 054 000 Old_age
> Always - 41 (Lifetime Min/Max 41/46)
> 194 Temperature_Celsius 0x0022 058 053 000 Old_age
> Always - 42 (Lifetime Min/Max 41/47)
> 195 Hardware_ECC_Recovered 0x001a 100 100 000 Old_age
> Always - 31616
> 196 Reallocated_Event_Count 0x0032 100 100 000 Old_age
> Always - 0
> 197 Current_Pending_Sector 0x0012 100 100 000 Old_age
> Always - 0
> 198 Offline_Uncorrectable 0x0030 100 100 000 Old_age
> Offline - 0
> 199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age
> Always - 0
> 200 Multi_Zone_Error_Rate 0x000a 100 100 000 Old_age
> Always - 0
> 201 Soft_Read_Error_Rate 0x000a 100 100 000 Old_age
> Always - 0
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 0
> Warning: ATA Specification requires self-test log structure revision
> number = 1
> Num Test_Description Status Remaining
> LifeTime(hours) LBA_of_first_error
> # 1 Offline Completed without error 00%
> 261 -
> # 2 Offline Aborted by host 40%
> 251 -
> # 3 Short offline Aborted by host 00%
> 250 -
>
> SMART Selective Self-Test Log Data Structure Revision Number (0)
> should be 1
> SMART Selective self-test log data structure revision number 0
> Warning: ATA Specification requires selective self-test log data
> structure revision number = 1
> SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
> 1 0 0 Not_testing
> 2 0 0 Not_testing
> 3 0 0 Not_testing
> 4 0 0 Not_testing
> 5 0 0 Not_testing
> Selective self-test flags (0x0):
> After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute
> delay.
>
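
As an aside on the g_vfs_done() flood quoted above: error = 6 is ENXIO
("Device not configured") on FreeBSD, which would fit the mirror's
remaining provider disappearing rather than bad sectors. With hundreds
of thousands of such lines it may be easier to summarise them than to
read them. The sketch below is only an illustration; the names LINE_RE
and summarize are my own, and it assumes the unwrapped lines as they
appear in /var/log/messages, not the mail-wrapped form.

```python
import re
from collections import Counter

# Matches one unwrapped kernel log line produced by g_vfs_done().
LINE_RE = re.compile(
    r"g_vfs_done\(\):(?P<dev>\S+?)\[(?P<op>READ|WRITE)"
    r"\(offset=(?P<offset>\d+), length=(?P<length>\d+)\)\]"
    r"error = (?P<errno>\d+)"
)


def summarize(lines):
    """Count failures per (device, op, errno) and track the byte range hit.

    Returns (counts, lowest_offset, end_of_highest_request); the two
    offsets are None if no line matched.
    """
    counts = Counter()
    lo = hi = None
    for line in lines:
        m = LINE_RE.search(line)
        if m is None:
            continue  # not a g_vfs_done() error line
        counts[(m["dev"], m["op"], int(m["errno"]))] += 1
        start = int(m["offset"])
        end = start + int(m["length"])
        lo = start if lo is None else min(lo, start)
        hi = end if hi is None else max(hi, end)
    return counts, lo, hi
```

Feeding it the lines of the log file (for example via
summarize(open("/var/log/messages"))) gives one count per device and
error number plus the range of offsets that failed.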