misc/136182: Heavy disk writes (e.g. ZFS resilver to a drive) can cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console.

Tue Jun 30 08:20:02 UTC 2009

>Number:         136182
>Category:       misc
>Synopsis:       Heavy disk writes (e.g. ZFS resilver to a drive) can cause "adX: TIMEOUT - FLUSHCACHE retrying (1 retry left)" on console.
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          change-request
>Submitter-Id:   current-users
>Arrival-Date:   Tue Jun 30 08:20:01 UTC 2009
>Closed-Date:
>Last-Modified:
>Originator:     Karl Pielorz
>Release:        7.2-STABLE
>Organization:
>Environment:
FreeBSD caladan.tdx.co.uk 7.2-STABLE FreeBSD 7.2-STABLE #54: Mon Jun 29 09:25:13 BST 2009     root@caladan.tdx.co.uk:/usr/src/sys/amd64/compile/CALADAN64-SMP  amd64
>Description:

While ZFS is doing a 'resilver' to a new drive (a Western Digital WD5000AAKS), a number of 'flushcache' timeouts are logged to the console.

The drive reports no SMART errors, and passes (in this case) a Western Digital drive check, with no errors.

The only error logged is for the 'flushcache' operation.

The mailing list archives contain past references to raising the timeout on the ATA 'flushcache' command from the default 5 seconds to 30 seconds, as apparently the ATA spec says a flushcache can take up to 30 seconds.

e.g. http://lists.freebsd.org/pipermail/freebsd-current/2009-April/005939.html

That patch doesn't apply to 7.2-STABLE, and doesn't seem to have been committed.

The patch included with this PR fixes the problem for me - I no longer get flushcache warnings while doing a resilver on this system.
>How-To-Repeat:

Saturate a drive with write I/O. In my case, take two 500 GB Western Digital WD5000AAKS SATA drives in a ZFS mirror set.

Fill the zpool that's created, then remove one drive of the mirrored pair and replace it with another 'blank' drive.

Tell ZFS to do a drive replace (which starts a 'resilver' to copy the data from the good drive, to the new drive).

At various points you'll keep getting:

ad34: TIMEOUT - FLUSHCACHE retrying (1 retry left)

This is logged to the console during the resilver, sometimes only once in a while and at other times quite often.
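
To count how often this happens during a resilver, the console lines can be matched programmatically. The following is only a minimal sketch: the helper name is_flush_timeout() is hypothetical (it is not part of FreeBSD), and it assumes the message always has the "adN: TIMEOUT - FLUSHCACHE" form shown above.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return 1 if `line` looks like the FLUSHCACHE timeout message quoted in
 * this PR and store the ad(4) unit number in *unit; return 0 otherwise.
 * Hypothetical helper for illustration only.
 */
int
is_flush_timeout(const char *line, int *unit)
{
	static const char msg[] = "TIMEOUT - FLUSHCACHE";
	int n = 0;

	/*
	 * Match the "adN: " prefix.  %n records how many characters were
	 * consumed; it stays 0 if the ':' after the unit number is missing.
	 */
	if (sscanf(line, "ad%d: %n", unit, &n) != 1 || n == 0)
		return (0);

	/* The rest of the line must begin with the timeout text. */
	return (strncmp(line + n, msg, sizeof(msg) - 1) == 0);
}
```

Feeding each console line to this function and counting the matches per unit gives a rough picture of how frequent the timeouts are, before and after applying a fix.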
>Fix:

The attached patch 'fixes it for me': it sets the timeout for flush commands to 30 seconds, instead of the default 5.

This has been running for the past couple of days, and I've not seen a single flush timeout.
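
The effect of the change can be modelled outside the kernel. This is a simplified sketch, not the real ata(4) code: the sketch_* names and the struct are stand-ins invented for illustration, and it assumes that every request starts with the 5 second default and that only the flush case overrides it.

```c
/* Simplified stand-ins for the driver's request fields; not struct ata_request. */
enum sketch_cmd { SKETCH_READ, SKETCH_WRITE, SKETCH_FLUSH };

struct sketch_request {
	enum sketch_cmd	command;
	int		timeout;	/* seconds */
};

#define SKETCH_DEFAULT_TIMEOUT	5	/* default timeout quoted in this PR */
#define SKETCH_FLUSH_TIMEOUT	30	/* value set by the proposed change */

/*
 * Fill in a request the way this PR proposes: every command gets the
 * default timeout, and a cache flush is given the longer one.
 */
void
sketch_setup(struct sketch_request *req, enum sketch_cmd cmd)
{
	req->command = cmd;
	req->timeout = SKETCH_DEFAULT_TIMEOUT;
	if (cmd == SKETCH_FLUSH)
		req->timeout = SKETCH_FLUSH_TIMEOUT;
}
```

Reads and writes keep the short timeout, so a genuinely hung drive is still noticed quickly on ordinary I/O; only the flush, which may have a large write cache to drain, waits longer.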

Patch attached with submission follows:

--- ata-disk.c	2009-06-30 08:55:56.000000000 +0100
+++ ata-disk.c.kp	2009-06-30 08:54:47.000000000 +0100
@@ -339,6 +339,7 @@
 	request->transfersize = 0;
 	request->flags = ATA_R_CONTROL;
 	request->u.ata.command = ATA_FLUSHCACHE;
+	request->timeout = 30;
 	break;
     default:
 	device_printf(dev, "FAILURE - unknown BIO operation\n");


>Release-Note:
>Audit-Trail:
>Unformatted: