svn commit: r224496 - head/sys/cam

Alexander Motin mav at FreeBSD.org
Sat Jul 30 21:09:35 UTC 2011


Kostik Belousov wrote:
> On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote:
>> Author: mav
>> Date: Fri Jul 29 20:30:28 2011
>> New Revision: 224496
>> URL: http://svn.freebsd.org/changeset/base/224496
>>
>> Log:
>>   In some cases failed SATA disks may report their presence, but don't
>>   respond to any commands. I've found that because of multiple command
>>   retries, each of which cause 30s timeout, bus reset and another retry or
>>   requeue for many commands, it may take ages to eventually drop the
>>   failed device. The odd thing is that those retries continue even after
>>   XPT considered device as dead and invalidated it.
>>   
>>   This patch makes cam_periph_error() to block any command retries after
>>   periph was marked as invalid. With that patch all activity completes in
>>   1-2 minutes, just after several timeouts, required to consider device
>>   death. This should make ZFS, gmirror, graid, etc. operation more robust.
>>   
>>   Reviewed by:	mjacob@ on scsi@
>>   
>>   Approved by:	re (kib)
>>
>> Modified:
>>   head/sys/cam/cam_periph.c
> Amusingly, this commit makes my test machine to not boot.
> This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1.
> 
> It seems that geom thread 100012 owns GEOM topology lock, while sleeping
> in adaclose->cam_periph_getccb() :
> 
> db> bt 100012
> Tracing pid 12 tid 100012 td 0xfffffe00028a2000
> sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157
> mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb
> sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123
> sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d
> _sleep() at 0xffffffff80329b59 = _sleep+0x3b9
> cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0
> adaclose() at 0xffffffff8182c484 = adaclose+0xc4
> g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4
> g_access() at 0xffffffff802c519a = g_access+0x1ba
> g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96
> g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284
> g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad
> g_run_events() at 0xffffffff802c0750 = g_run_events+0x250
> fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189
> fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe
> --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 ---
> 
> (gdb) list *cam_periph_getccb+0xa0
> 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883).
> 882
> 883             while (SLIST_FIRST(&periph->ccb_list) == NULL) {
> 884                     if (periph->immediate_priority > priority)
> 
> Reverting the rev. or not loading ahci.ko allows machine to boot.

After many experiments I believe that problem is not related to this
change. I've managed to reproduce it depending on GEOM modules
registration order. After I disabled all GEOM modules and only geom_dev
left, problem became persistent. Specifics of the geom_dev is that it
opens device and closes it back without doing any I/O. That caused race
condition between CCB allocation for FLUSHCACHE execution in adaclose()
and higher-priority commands of device initialization sequence. Any I/O
scheduled before adaclose() closed that race, making problem rare. The
problem is specific to the ada, as for no other driver initialization
and payload requests may intersect in time.

Attached patch solved the problem for me. Please try it and approve
commit if it works.

-- 
Alexander Motin
-------------- next part --------------
--- cam/ata/ata_da.c.prev	2011-07-30 22:22:21.000000000 +0300
+++ cam/ata/ata_da.c	2011-07-30 23:12:22.000000000 +0300
@@ -488,12 +488,20 @@ static void
 adaschedule(struct cam_periph *periph)
 {
 	struct ada_softc *softc = (struct ada_softc *)periph->softc;
+	uint32_t prio;
 
+	/* Check if cam_periph_getccb() was called. */
+	prio = periph->immediate_priority;
+
+	/* Check if we have more work to do. */
 	if (bioq_first(&softc->bio_queue) ||
 	    (!softc->trim_running && bioq_first(&softc->trim_queue))) {
-		/* Have more work to do, so ensure we stay scheduled */
-		xpt_schedule(periph, CAM_PRIORITY_NORMAL);
+		prio = CAM_PRIORITY_NORMAL;
 	}
+
+	/* Schedule CCB if any of above is true. */
+	if (prio != CAM_PRIORITY_NONE)
+		xpt_schedule(periph, prio);
 }
 
 /*


More information about the svn-src-head mailing list