svn commit: r224496 - head/sys/cam

Andriy Gapon avg at FreeBSD.org
Sat Jul 30 21:17:07 UTC 2011


on 30/07/2011 23:47 Alexander Motin said the following:
> Kostik Belousov wrote:
>> On Fri, Jul 29, 2011 at 08:30:28PM +0000, Alexander Motin wrote:
>>> Author: mav
>>> Date: Fri Jul 29 20:30:28 2011
>>> New Revision: 224496
>>> URL: http://svn.freebsd.org/changeset/base/224496
>>>
>>> Log:
>>>   In some cases failed SATA disks may report their presence, but don't
>>>   respond to any commands. I've found that because of multiple command
>>>   retries, each of which cause 30s timeout, bus reset and another retry or
>>>   requeue for many commands, it may take ages to eventually drop the
>>>   failed device. The odd thing is that those retries continue even after
>>>   XPT considered device as dead and invalidated it.
>>>   
>>>   This patch makes cam_periph_error() to block any command retries after
>>>   periph was marked as invalid. With that patch all activity completes in
>>>   1-2 minutes, just after several timeouts, required to consider device
>>>   death. This should make ZFS, gmirror, graid, etc. operation more robust.
>>>   
>>>   Reviewed by:	mjacob@ on scsi@
>>>   
>>>   Approved by:	re (kib)
>>>
>>> Modified:
>>>   head/sys/cam/cam_periph.c
>> Amusingly, this commit makes my test machine to not boot.
>> This is Ibex Peak PCH, with two SATA disks on the channels 0 and 1.
>>
>> It seems that geom thread 100012 owns GEOM topology lock, while sleeping
>> in adaclose->cam_periph_getccb() :
>>
>> db> bt 100012
>> Tracing pid 12 tid 100012 td 0xfffffe00028a2000
>> sched_switch() at 0xffffffff8034a0c7 = sched_switch+0x157
>> mi_switch() at 0xffffffff803291fb = mi_switch+0x2eb
>> sleepq_switch() at 0xffffffff803631f3 = sleepq_switch+0x123
>> sleepq_wait() at 0xffffffff80363eed = sleepq_wait+0x4d
>> _sleep() at 0xffffffff80329b59 = _sleep+0x3b9
>> cam_periph_getccb() at 0xffffffff817ffc50 = cam_periph_getccb+0xa0
>> adaclose() at 0xffffffff8182c484 = adaclose+0xc4
>> g_disk_access() at 0xffffffff802bea74 = g_disk_access+0x1e4
>> g_access() at 0xffffffff802c519a = g_access+0x1ba
>> g_dev_attrchanged() at 0xffffffff802bd1f6 = g_dev_attrchanged+0x96
>> g_dev_taste() at 0xffffffff802bd574 = g_dev_taste+0x284
>> g_new_provider_event() at 0xffffffff802c4ecd = g_new_provider_event+0xad
>> g_run_events() at 0xffffffff802c0750 = g_run_events+0x250
>> fork_exit() at 0xffffffff802f0d99 = fork_exit+0x189
>> fork_trampoline() at 0xffffffff804ee3be = fork_trampoline+0xe
>> --- trap 0, rip = 0, rsp = 0xffffff800025fd00, rbp = 0 ---
>>
>> (gdb) list *cam_periph_getccb+0xa0
>> 0x1c50 is in cam_periph_getccb (/usr/home/kostik/work/build/bsd/DEV/src/sys/modules/cam/../../cam/cam_periph.c:883).
>> 882
>> 883             while (SLIST_FIRST(&periph->ccb_list) == NULL) {
>> 884                     if (periph->immediate_priority > priority)
>>
>> Reverting the rev. or not loading ahci.ko allows machine to boot.
> 
> After many experiments I believe that problem is not related to this
> change. I've managed to reproduce it depending on GEOM modules
> registration order. After I disabled all GEOM modules and only geom_dev
> left, problem became persistent. Specifics of the geom_dev is that it
> opens device and closes it back without doing any I/O. That caused race
> condition between CCB allocation for FLUSHCACHE execution in adaclose()
> and higher-priority commands of device initialization sequence. Any I/O
> scheduled before adaclose() closed that race, making problem rare. The
> problem is specific to the ada, as for no other driver initialization
> and payload requests may intersect in time.

Alexander,

somewhat contradicting your conclusions... can the following issue be also
caused by similar mechanics?
http://lists.freebsd.org/pipermail/freebsd-current/2010-October/020336.html

> Attached patch solved the problem for me. Please try it and approve
> commit if it works.
> 


-- 
Andriy Gapon


More information about the svn-src-all mailing list