Critical issues with WD green drives

Andrea Venturoli ml at netfence.it
Thu Jun 2 12:16:12 UTC 2011


Hello.

In a server of mine (7.3p4/i386) I replaced a 1TB Hitachi SATA drive 
(which worked perfectly) with two brand new Western Digital 2TB disks.
Now I'm having critical problems, ranging from the disks getting stuck, 
to the box rebooting.
Those are not the main disks in the box, so they are currently 
unmounted; I wasn't even able to run newfs on them, since every process 
that tries to use these disks hangs after a while (and can't be 
killed either).

The box is based on an Intel S5000 motherboard and the drives are 
attached to the motherboard's SATA ports in a hot-swap enclosure.



First, what I think might be the relevant part of dmesg:

> FreeBSD 7.3-RELEASE-p4 #1: Wed Dec 15 11:53:13 CET 2010
> root at xxxxx.xxxxxxxx.xx:/usr/obj/usr/src/sys/XXXXX i386
> Timecounter "i8254" frequency 1193182 Hz quality 0
> CPU: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz (2004.99-MHz 686-class CPU)
> Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6
> Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
> Features2=0xce33d<SSE3,DTES64,MON,DS_CPL,VMX,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1>
> AMD Features=0x20100000<NX,LM>
> AMD Features2=0x1<LAHF>
> Cores per package: 4
> real memory  = 2143289344 (2044 MB)
> avail memory = 2090176512 (1993 MB)
> ACPI APIC Table: <INTEL  S5000PSL>
> FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
> ...
> acpi0: <INTEL S5000PSL> on motherboard
> acpi0: [ITHREAD]
> acpi0: Power Button (fixed)
> acpi0: reservation of 0, a0000 (3) failed
> Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
> acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
> acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
> Timecounter "HPET" frequency 14318180 Hz quality 900
> acpi_button0: <Sleep Button> on acpi0
> acpi_button1: <Power Button> on acpi0
> pcib0: <ACPI Host-PCI bridge> port 0xca2,0xca3,0xcf8-0xcff on acpi0
> pci0: <ACPI PCI bus> on pcib0
> pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
> pci1: <ACPI PCI bus> on pcib1
> pcib2: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci1
> pci2: <ACPI PCI bus> on pcib2
> pcib3: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci2
> pci3: <ACPI PCI bus> on pcib3
> pcib4: <PCI-PCI bridge> at device 0.0 on pci3
> pci4: <PCI bus> on pcib4
> ...
> pcib5: <PCI-PCI bridge> at device 0.2 on pci3
> pci5: <PCI bus> on pcib5
> pcib6: <ACPI PCI-PCI bridge> irq 18 at device 2.0 on pci2
> pci6: <ACPI PCI bus> on pcib6
> ...
> pcib7: <ACPI PCI-PCI bridge> at device 0.3 on pci1
> pci7: <ACPI PCI bus> on pcib7
> ...
> pcib8: <PCI-PCI bridge> at device 3.0 on pci0
> pci8: <PCI bus> on pcib8
> pcib9: <ACPI PCI-PCI bridge> at device 4.0 on pci0
> pci9: <ACPI PCI bus> on pcib9
> pcib10: <ACPI PCI-PCI bridge> at device 5.0 on pci0
> pci10: <ACPI PCI bus> on pcib10
> pcib11: <ACPI PCI-PCI bridge> at device 6.0 on pci0
> pci11: <ACPI PCI bus> on pcib11
> pcib12: <PCI-PCI bridge> at device 7.0 on pci0
> pci12: <PCI bus> on pcib12
> pci0: <base peripheral> at device 8.0 (no driver attached)
> pcib13: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
> pci13: <ACPI PCI bus> on pcib13
> ...
> pcib14: <ACPI PCI-PCI bridge> at device 30.0 on pci0
> pci14: <ACPI PCI bus> on pcib14
> ...
> atapci1: <Intel 63XXESB2 SATA300 controller> port 0x40d8-0x40df,0x40f4-0x40f7,0x40d0-0x40d7,0x40f0-0x40f3,0x4020-0x403f mem 0xb9000000-0xb90003ff irq 20 at device 31.2 on pci0
> atapci1: [ITHREAD]
> atapci1: AHCI called from vendor specific driver
> atapci1: AHCI Version 01.10 controller with 6 ports detected
> ata2: <ATA channel 0> on atapci1
> ata2: [ITHREAD]
> ata3: <ATA channel 1> on atapci1
> ata3: [ITHREAD]
> ata4: <ATA channel 2> on atapci1
> ata4: [ITHREAD]
> ata5: <ATA channel 3> on atapci1
> ata5: [ITHREAD]
> ata6: <ATA channel 4> on atapci1
> ata6: [ITHREAD]
> ata7: <ATA channel 5> on atapci1
> ata7: [ITHREAD]
> ...
> ad4: 1907729MB <WDC WD20EARS-00MVWB0 51.0AB51> at ata2-master SATA300
> ad8: 1907729MB <WDC WD20EARS-00MVWB0 51.0AB51> at ata4-master SATA300
> ...
> GEOM_STRIPE: Device backup created (id=912470894).
> GEOM_STRIPE: Disk ad4 attached to backup.
> GEOM_STRIPE: Disk ad8 attached to backup.
> GEOM_STRIPE: Device backup activated.
> ...



Following are some samples of the messages I get in the logs:
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SMART taskqueue timeout - completing request directly
> ad8: WARNING - SMART taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: FAILURE - SMART timed out
> ad8: WARNING - SMART freeing taskqueue zombie request
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad8: FAILURE - SMART timed out
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad8: WARNING - ATA_IDENTIFY taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: FAILURE - SET_MULTI timed out
> ad8: WARNING - ATA_IDENTIFY freeing taskqueue zombie request
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE requeued due to channel reset
> ad4: WARNING - SETFEATURES SET TRANSFER MODE requeued due to channel reset
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SMART taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad8: FAILURE - ATA_IDENTIFY timed out LBA=0
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad8: WARNING - ATAPI_IDENTIFY taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: FAILURE - SMART timed out
> ad8: WARNING - ATAPI_IDENTIFY freeing taskqueue zombie request
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
> ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
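
To compare how often each drive is affected, the messages above can be tallied per device and severity. The following is only a minimal sketch: the function name `tally` and the regular expression are illustrative assumptions based on the message format quoted above, and the three sample lines are copied from that log.

```python
import re
from collections import Counter

# Matches ata(4) messages in the form quoted above, with or without
# the leading "> " quote marker, e.g.
#   "ad4: WARNING - SMART taskqueue timeout - completing request directly"
LINE_RE = re.compile(r"^(?:>\s*)?(ad\d+): (WARNING|FAILURE) - (.+)$")


def tally(lines):
    """Count (device, severity) pairs in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the log sample above.
sample = [
    "ad4: WARNING - SMART taskqueue timeout - completing request directly",
    "> ad8: WARNING - SMART taskqueue timeout - completing request directly",
    "ad4: FAILURE - SMART timed out",
]

result = tally(sample)
for (device, severity), count in sorted(result.items()):
    print(device, severity, count)
```

Feeding it the full /var/log/messages (for example `tally(open("/var/log/messages"))`) would show whether one disk, or one ATA channel, accounts for most of the timeouts.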


smartctl -a gives:
> smartctl 5.40 2010-10-16 r3189 [FreeBSD 7.3-RELEASE-p4 i386] (local build)
> Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
>
> === START OF INFORMATION SECTION ===
> Model Family:     Western Digital Caviar Green (Adv. Format) family
> Device Model:     WDC WD20EARS-00MVWB0
> Serial Number:    WD-WMAZA4718261
> Firmware Version: 51.0AB51
> User Capacity:    2,000,398,934,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  Exact ATA specification draft version not indicated
> Local Time is:    Thu Jun  2 14:08:08 2011 CEST
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (  41) The self-test routine was interrupted
>                                         by the host with a hard or soft reset.
> Total time to complete Offline
> data collection:                 (38580) seconds.
> Offline data collection
> capabilities:                    (0x7b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 255) minutes.
> Conveyance self-test routine
> recommended polling time:        (   5) minutes.
> SCT capabilities:              (0x3035) SCT Status supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1058
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
>   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22
>  10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
>  11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
> 192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
> 193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
> 194 Temperature_Celsius     0x0022   118   114   000    Old_age   Always       -       32
> 196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       11
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended captive    Interrupted (host reset)      90%        21         -
> # 2  Extended captive    Interrupted (host reset)      90%        21         -
> # 3  Conveyance captive  Completed without error       00%        20         -
> # 4  Short captive       Completed without error       00%        20         -
> # 5  Short captive       Interrupted (host reset)      90%         1         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.

(This is for one drive; the output for the other is almost identical.)
Note that I can't complete a long test, since the box crashes, dumps and 
reboots.



Following is a backtrace from one of the crash dumps:
> # kgdb kernel.debug /var/crash/vmcore.14
> GNU gdb 6.1.1 [FreeBSD]
> Copyright 2004 Free Software Foundation, Inc.
> GDB is free software, covered by the GNU General Public License, and you are
> welcome to change it and/or distribute copies of it under certain conditions.
> Type "show copying" to see the conditions.
> There is absolutely no warranty for GDB.  Type "show warranty" for details.
> This GDB was configured as "i386-marcel-freebsd"...
>
> Unread portion of the kernel message buffer:
> ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
> ad8: WARNING - SET_MULTI requeued due to channel reset
> ad8: FAILURE - SET_MULTI timed out
>
>
> Fatal trap 12: page fault while in kernel mode
> cpuid = 0; apic id = 00
> fault virtual address   = 0x188
> fault code              = supervisor read, page not present
> instruction pointer     = 0x20:0xc05553d4
> stack pointer           = 0x28:0xe8efca8c
> frame pointer           = 0x28:0xe8efcaa4
> code segment            = base 0x0, limit 0xfffff, type 0x1b
>                         = DPL 0, pres 1, def32 1, gran 1
> processor eflags        = interrupt enabled, resume, IOPL = 0
> current process         = 8125 (smartctl)
> trap number             = 12
> panic: page fault
> cpuid = 0
> Uptime: 37m18s
> Physical memory: 2033 MB
> Dumping 151 MB: 136 120 104 88 72 56 40 24 8
>
> Reading symbols from /boot/kernel/splash_bmp.ko...Reading symbols from /boot/kernel/splash_bmp.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/splash_bmp.ko
> Reading symbols from /boot/kernel/geom_stripe.ko...Reading symbols from /boot/kernel/geom_stripe.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/geom_stripe.ko
> Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done.
> done.
> Loaded symbols for /boot/kernel/acpi.ko
> #0  doadump () at pcpu.h:196
> 196             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
> (kgdb) bt
> #0  doadump () at pcpu.h:196
> #1  0xc0563d48 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
> #2  0xc0564025 in panic (fmt=Variable "fmt" is not available.
> ) at /usr/src/sys/kern/kern_shutdown.c:574
> #3  0xc0732764 in trap_fatal (frame=0xe8efca4c, eva=392) at /usr/src/sys/i386/i386/trap.c:950
> #4  0xc07329b4 in trap_pfault (frame=0xe8efca4c, usermode=0, eva=392) at /usr/src/sys/i386/i386/trap.c:863
> #5  0xc0733351 in trap (frame=0xe8efca4c) at /usr/src/sys/i386/i386/trap.c:541
> #6  0xc0718abb in calltrap () at /usr/src/sys/i386/i386/exception.s:166
> #7  0xc05553d4 in _mtx_lock_sleep (m=0xc5d84acc, tid=3328172032, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:339
> #8  0xc056300b in _sema_post (sema=0xc5d84acc, file=0x0, line=0) at /usr/src/sys/kern/kern_sema.c:79
> #9  0xc047cf0c in ata_completed (context=0xc5d84a80, dummy=0) at /usr/src/sys/dev/ata/ata-queue.c:490
> #10 0xc047c7d5 in ata_queue_request (request=0xc5d84a80) at /usr/src/sys/dev/ata/ata-queue.c:112
> #11 0xc046439f in ata_device_ioctl (dev=0xc507d200, cmd=3224920420, data=0xc5cc12c0 "¡") at /usr/src/sys/dev/ata/ata-all.c:493
> #12 0xc04769e9 in ad_ioctl (disk=0xc53cac00, cmd=3224920420, data=0xc5cc12c0, flag=1, td=0xc65fe000) at /usr/src/sys/dev/ata/ata-disk.c:373
> #13 0xc050d83b in g_disk_ioctl (pp=0xc5572d00, cmd=3224920420, data=0xc5cc12c0, fflag=1, td=0xc65fe000) at /usr/src/sys/geom/geom_disk.c:231
> #14 0xc050cc3e in g_dev_ioctl (dev=0xc5556600, cmd=3224920420, data=0xc5cc12c0 "¡", fflag=1, td=0xc65fe000) at /usr/src/sys/geom/geom_dev.c:332
> #15 0xc0502dbf in devfs_ioctl_f (fp=0xc64aba18, com=3224920420, data=0xc5cc12c0, cred=0xc63ee100, td=0xc65fe000) at /usr/src/sys/fs/devfs/devfs_vnops.c:602
> #16 0xc059d075 in kern_ioctl (td=0xc65fe000, fd=3, com=3224920420, data=0xc5cc12c0 "¡") at file.h:269
> #17 0xc059d1ad in ioctl (td=0xc65fe000, uap=0xe8efccfc) at /usr/src/sys/kern/sys_generic.c:571
> #18 0xc0732cf5 in syscall (frame=0xe8efcd38) at /usr/src/sys/i386/i386/trap.c:1101
> #19 0xc0718b20 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:262
> #20 0x00000033 in ?? ()
> Previous frame inner to this frame (corrupt stack?)
> (kgdb)
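
The `cmd=3224920420` value that appears in frames #11 to #17 can be split into its fields using the ioctl encoding from FreeBSD's <sys/ioccom.h>. This is only a sketch: the function `decode_ioctl` is a name made up for illustration, and the constants are the ones I believe that header defines. The result (group 'a', number 100, read/write, 56-byte argument) looks like the ATA request ioctl from <sys/ata.h> (IOCATAREQUEST), which would fit smartctl being the current process, but that mapping is my assumption and not something the dump states.

```python
# Field layout assumed from FreeBSD's <sys/ioccom.h>.
IOCPARM_MASK = 0x1FFF      # argument length is stored in 13 bits
IOC_VOID = 0x20000000      # no parameters
IOC_OUT = 0x40000000       # kernel copies parameters out to userland
IOC_IN = 0x80000000        # kernel copies parameters in from userland


def decode_ioctl(cmd):
    """Split a FreeBSD ioctl command number into its encoded fields."""
    direction = []
    if cmd & IOC_VOID:
        direction.append("VOID")
    if cmd & IOC_IN:
        direction.append("IN")
    if cmd & IOC_OUT:
        direction.append("OUT")
    return {
        "hex": hex(cmd),
        "direction": "|".join(direction),
        "length": (cmd >> 16) & IOCPARM_MASK,  # size of the argument struct
        "group": chr((cmd >> 8) & 0xFF),       # subsystem letter
        "number": cmd & 0xFF,                  # command within the group
    }


# Value taken from the backtrace above.
info = decode_ioctl(3224920420)
print(info)
```

Similarly, `eva=392` in frames #3 and #4 is just the decimal form of the fault virtual address 0x188 reported in the trap message, i.e. a small offset from a NULL pointer.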



Please, I'm really desperate; any help is appreciated.
Is this a known problem? Should I upgrade? Are there any settings I can 
try? Patches?


  Bye & Thanks
	av.

