Panic 8.2 PRERELEASE WRITE_DMA48

Sun Jan 9 16:30:30 UTC 2011

On Sun, Jan 09, 2011 at 04:41:43PM +0100, Tom Vijlbrief wrote:
> I've run many fscks on /usr in single user because I had soft update
> inconsistencies,
> no DMA errors during those repairs.

There's no 1:1 correlation between running fsck on a filesystem and
seeing a DMA error.  To explain what I mean by that: just because you
receive a read or write error from a disk during normal operation
doesn't mean fsck will induce it.  fsck checks filesystem metadata
(superblocks, inodes, directories and so on) for integrity; it doesn't
do the equivalent of a bad block scan, nor does it read every data block
referenced by an inode.

So if you have a filesystem which has a bad block somewhere within a
data block, fsck almost certainly won't catch this.  ZFS, on the other
hand (specifically a "zpool scrub", which reads every allocated block),
would/should induce such an error.

The reason I advocated booting into single-user and running a fsck
manually is because there's confirmation that background fsck doesn't
catch/handle all filesystem consistency errors that a foreground fsck
does.  This is why I continue to advocate background_fsck="no" in
rc.conf(5).  That's for another discussion though.
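
For reference, the setting mentioned above is a single line in
/etc/rc.conf.  This is only a minimal sketch; the comment wording is
mine, and the variable is the stock rc.conf(5) knob:

```shell
# /etc/rc.conf
# Check filesystems in the foreground at boot rather than letting
# fsck run in the background once the system is already multi-user.
background_fsck="NO"
```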

Let's review the disk:

> === START OF INFORMATION SECTION ===
> Model Family:     SAMSUNG SpinPoint F1 DT series
> Device Model:     SAMSUNG HD103UJ
> Serial Number:    S13PJ9BQC02902
> Firmware Version: 1AA01113
> User Capacity:    1,000,204,886,016 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   8
> ATA Standard is:  ATA-8-ACS revision 3b
> Local Time is:    Sun Jan  9 16:40:24 2011 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> ... 

> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
>   3 Spin_Up_Time            0x0007   078   078   011    Pre-fail  Always       -       7580
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       399
>   5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
>   8 Seek_Time_Performance   0x0025   100   100   015    Pre-fail  Offline      -       10097
>   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       2375
>  10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
>  11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       392
>  13 Read_Soft_Error_Rate    0x000e   100   100   000    Old_age   Always       -       0
> 183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
> 184 End-to-End_Error        0x0033   100   100   000    Pre-fail  Always       -       0
> 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
> 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
> 190 Airflow_Temperature_Cel 0x0022   057   052   000    Old_age   Always       -       43 (Min/Max 42/45)
> 194 Temperature_Celsius     0x0022   056   050   000    Old_age   Always       -       44 (Min/Max 42/46)
> 195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age   Always       -       20728126
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       1
> 200 Multi_Zone_Error_Rate   0x000a   100   100   000    Old_age   Always       -       0
> 201 Soft_Read_Error_Rate    0x000a   100   100   000    Old_age   Always       -       0

Your drive looks fine.  Attribute 195 isn't anything to worry about
(vendor-specific encoding makes this number appear large).  Attribute
199 indicates one CRC error; again, nothing to worry about, though it
could explain a single error at some point during the lifetime of the
drive (it's impossible to determine when it happened).

> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%      2361         -
> # 2  Short offline       Completed without error       00%      2205         -
> # 3  Short offline       Completed without error       00%      2138         -
> # 4  Extended offline    Completed without error       00%      2109         -
> # 5  Short offline       Completed without error       00%      2105         -
> # 6  Short offline       Completed without error       00%      2092         -
> # 7  Short offline       Completed without error       00%      2083         -
> # 8  Short offline       Completed without error       00%      2057         -
> # 9  Extended offline    Completed without error       00%      2037         -
> #10  Short offline       Completed without error       00%      2033         -
> #11  Short offline       Completed without error       00%      2009         -
> #12  Short offline       Completed without error       00%      1974         -
> #13  Short offline       Completed without error       00%      1941         -
> #14  Extended offline    Completed without error       00%      1920         -
> #15  Short offline       Completed without error       00%      1916         -
> #16  Short offline       Completed without error       00%      1868         -
> #17  Short offline       Completed without error       00%      1810         -
> #18  Short offline       Completed without error       00%      1655         -
> #19  Short offline       Completed without error       00%      1638         -
> #20  Extended offline    Completed without error       00%      1596         -
> #21  Short offline       Completed without error       00%      1591         -

Not to get off topic, but what is causing this?  It looks like you have
a cron job or something very aggressive doing a "smartctl -t short
/dev/ad4" or equivalent.  If you have such, please disable this
immediately.  You shouldn't be doing SMART tests with such regularity;
it accomplishes absolutely nothing, especially the "short" tests.  Let
the drive operate normally, otherwise run smartd and watch logs instead.

If you want to scan the disk for bad blocks, you need to do a selective
LBA test.  Your drive does support selective scanning, as shown here:

> Offline data collection
> capabilities:
> ...
>                                         Selective Self-test supported.

You can do this with "smartctl -t select,0-max /dev/ad4", and safely
while the drive is in operation.  You can check the status of the scan
(assuming the Samsung supports it) by using "smartctl -c /dev/ad4" and
looking at the percentage of completion.
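
To make the above concrete, here is a rough sketch of how the scan
could be wrapped.  The function names (selective_scan, scan_status) and
the SMARTCTL override are made up for this example; the smartctl options
themselves (-t select,N-M, -c, -l selective) are the real ones.  I
haven't run this against a Samsung, so treat it as a starting point:

```shell
#!/bin/sh
# Sketch only: selective_scan, scan_status and the SMARTCTL override are
# names invented for this example; the smartctl options are real.
# Set SMARTCTL="echo smartctl" to print the commands instead of running them.
SMARTCTL="${SMARTCTL:-smartctl}"

# Start a selective self-test on device $1 over the LBA range $2
# (defaults to the whole disk, 0-max).
selective_scan() {
    $SMARTCTL -t "select,${2:-0-max}" "$1"
}

# Show the self-test execution status (-c) and the selective
# self-test log for device $1.
scan_status() {
    $SMARTCTL -c "$1"
    $SMARTCTL -l selective "$1"
}
```

"selective_scan /dev/ad4" covers the entire disk.  Passing a range as
the second argument, e.g. "selective_scan /dev/ad4 274799000-274800999",
would limit the test to the area around the LBA in the timeout message;
that particular range is just one I picked for illustration.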

However, I would expect that if your drive had bad blocks, or even blocks
which the drive considered suspect, that Attributes 196 and 197 would be
non-zero.  I'm more familiar with Western Digital and Seagate disks
though.

> dmesg was in the attachment of the original mail but I'll paste it here:

I apologise, I missed that -- sometimes the mailing list software
removes attachments, so I've grown accustomed to not looking for them.
My bad.

> atapci0: <SiI 3512 SATA150 controller> port 0xb400-0xb407,0xb000-0xb003,0xa800-0xa807,0xa400-0xa403,0xa000-0xa00f mem 0xf0800000-0xf08001ff irq 23 at device 11.0 on pci2
> atapci0: [ITHREAD]
> ata2: <ATA channel 0> on atapci0
> ata2: [ITHREAD]
> ata3: <ATA channel 1> on atapci0
> ata3: [ITHREAD]
> ad4: 953869MB <SAMSUNG HD103UJ 1AA01113> at ata2-master UDMA100 SATA 1.5Gb/s

Using that information and circling back to the original error:

> unknown: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=274799820
> ata2: timeout waiting to issue command
> ata2: error issuing WRITE_DMA48 command
> g_vfs_done():ad4s2f[WRITE(offset=28915105792, length=131072)]error = 6
> /usr: got error 6 while accessing filesystem
> panic: softdep_deallocate_dependencies: unrecovered I/O error
> cpuid = 0
> KDB: stack backtrace:

errno 6 is ENXIO, "device not configured".  ad4 is on a Silicon Image
controller (thankfully a reliable model).  Sadly AHCI (ahci.ko) isn't in
use here; I would advocate switching to it (your device names will
change however) and seeing if these errors continue (they'll appear as
SCSI CAM errors though).  ahci_load="yes" in /boot/loader.conf should be
enough.  smartmontools does know to talk ATA to /dev/adaX (that's not a
typo) disks.
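
Spelled out, the loader.conf change would look something like the
following.  This is a sketch which assumes the controller actually
attaches to ahci(4) on your system; I can't verify that from here:

```shell
# /boot/loader.conf
# Load the AHCI driver at boot.  Assumes the controller attaches to
# ahci(4); if it does, disks show up as adaX instead of adX.
ahci_load="YES"
```

Because the device names change, any /etc/fstab entries that refer to
ad4 (such as ad4s2f for /usr) would need updating before you reboot, or
the filesystems won't mount.  What the new name ends up being is
something to confirm from dmesg rather than guess.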

Am I advocating use of ahci.ko as a workaround for the problem?  Sort
of.  I know that Alexander Motin has a lot of good experience with the
Silicon Image controllers and would also advocate use of AHCI when one
has such.  Possibly what you're seeing is a bug or quirk of some kind in
the ata(4) driver.  These kinds of quirks ("I got an error but the disk
itself looks fine") have concerned me on FreeBSD for many, many years
now.

I would recommend using ahci.ko first, then doing the selective scan
only if more errors continue/show up after the fact.

So in summary, at this point your drive looks fine, but I'd feel better
after a selective scan had a chance to run.

Purely speculative: there's always the possibility the Samsung disks do
something similar to what IBM ATA drives circa 1999-2000 did: a feature
called "ADM" (Automatic Drive Maintenance), where the drive would
literally drop to standby mode to perform internal maintenance.  If it
received an ATA command from the controller while doing this, it would
spin back up and respond to the command.  The whole down/up process took
so long that
FreeBSD reported the issue as a timeout, as well as a DMA error if it
was trying to do a read/write operation.  You could literally hear the
drive powering down then going "thunk" and powering back up when it
received an ATA command.  I mailed IBM about this and they confirmed it.
The feature also existed on SCSI drives (and still does, I think), but
is disabled by default.  Here's relevant reading material:

http://jdc.parodius.com/freebsd/ibm_email_aware_of_adm.txt
http://www.mail-archive.com/freebsd-current@freebsd.org/msg07222.html

The ATA drives that came out in 2001 and beyond had this feature
*completely removed*, so it's pretty obvious it was causing problems,
probably as more people started using the drives in servers vs. standard
Windows desktops (well-known for hiding such I/O conditions).

I imagine if Samsung drives did this we'd be seeing a lot more reports
about it here on the lists.  I'd pay close attention to the timestamps
on the timeouts.
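
If it helps, something along these lines would pull the timestamps out
of the system log so they can be compared.  It's only a sketch:
timeout_times is a made-up helper name, and it assumes the usual syslog
line layout ("Mon DD HH:MM:SS host kernel: ...") and that the kernel
messages ended up in /var/log/messages:

```shell
#!/bin/sh
# Sketch only: timeout_times is a made-up helper name.  It assumes the
# usual syslog layout, where the first three fields are the timestamp.
# Prints the timestamp of every ATA "TIMEOUT - ..." message in the file
# given as $1 (default: /var/log/messages).
timeout_times() {
    awk '/TIMEOUT - / { print $1, $2, $3 }' "${1:-/var/log/messages}"
}
```

Timeouts that cluster at regular intervals, or at the same time of day,
would point more towards the drive (or a cron job) doing something on a
schedule than towards random media or cabling trouble.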

-- 
| Jeremy Chadwick                                   jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.               PGP 4BD6C0CB |