ahci-timeout regression in beta3
freebsd at omnilan.de
Sat Mar 5 12:11:19 UTC 2016
Bezüglich John Baldwin's Nachricht vom 02.03.2016 18:32 (localtime):
> On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
>> Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime):
>>> I have a remote machine with a probably defective ODD, but until r294989
>>> (from Jan 28th) I could boot with just these warnings:
>>> (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
>>> (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
>>> (cd1:ahcich1:0:0:0): SCSI status: Check Condition
>>> (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read
>>> (cd1:ahcich1:0:0:0): Error 5, Unretryable error
>>> (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
>>> beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
>>> ahcich2: Timeout on slot 11 port 0
>>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr
>>> 00000000 cmd 0004cb17
>>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00
>>> 00 00 00
>>> (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00
>>> 00 00
>>> (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
>>> (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
>>> ada1 detached
>>> The numbers (first ACB) and also the channel varies from time to time
>> I could narrow it down to r295480
>> Reverting that lets the machine boot again.
>> I captured verbose boot messages, finding out that problem relaxes with
>> verbose-booting, since ahci seems to recover:
>> TSC timecounter discards lower 1 bit(s)
>> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
>> ahcich2: Timeout on slot 12 port 0
>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
>> 00000000 cmd 0004cc17
>> ahcich2: AHCI reset...
>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00
>> 00 00 00
>> (ada1:ahcich2:0:0:0): CAM status: Command timeout
>> (ada1:ahcich2:0:0:0): Retrying command
>> ahcich2: SATA connect time=100us status=00000123
>> ahcich2: AHCI reset: device found
>> ahcich2: AHCI reset: device ready after 100ms
>> ahcich1: SNTF 0x0001
>> ahcich1: SNTF 0x0001
>> I have checked twice that r295480 introduces boot failure here.
>> I have absolutely no idea where/how/why/what race happens...
>> Thanks for any hints,
> That is most bizarre. Does HEAD boot fine on this machine? The change
> in question probably alters the timing of startup a bit since the random
> kthread is placed on the run queue later which might affect the relative
> order of kthreads as they start executing, but that would just mean it is
> exposting a race in some other part of the system.
Bizarre it is...
HEAD (r295683, 02/17/2016) boots fine.
This time I checked with vendor-ISO images, while before it was a custom
setup rollout with local patches and special hw adaptions. Now I'm sure
it's not site-related.
With BETA3-iso, where booting fails, "random: unblocking device."
happens after timecounter initialization and before attaching ses0/cdX/adaX.
With HEAD-iso, where booting succeeds, "random: unblocking device."
happens way after ses0/adaX/cdX attached, right before rc.
On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
Since r295480, cdX attaches before adaX on -stable and while searching
for the cluprit, I had observed that attaching-order was a clear
indicator whether machine boots or not.
Sorry, I can't provide more useful info at this time, I just can
describe simple symptoms :-(
While playing with that machine I remember having had such a bizarre
problem before, arising with r284665. Here's what I found in my kernel
# Don't build kernel with RACCT by default, which was enabled with r284665,
# since ahci(4) fails when using MSI with ahcichX timeout!
nooptions RACCT # Resource accounting framework
nooptions RACCT_DEFAULT_TO_DISABLED # Set
kern.racct.enable=0 by default
nooptions RCTL # Resource limits
Perhpas it's related?!
More information about the freebsd-stable