ahci-timeout regression in beta3
jhb at freebsd.org
Sat Mar 5 22:00:57 UTC 2016
On Saturday, March 05, 2016 01:11:13 PM Harry Schmalzbauer wrote:
> Bezüglich John Baldwin's Nachricht vom 02.03.2016 18:32 (localtime):
> > On Monday, February 29, 2016 07:29:03 PM Harry Schmalzbauer wrote:
> >> Bezüglich Harry Schmalzbauer's Nachricht vom 28.02.2016 20:55 (localtime):
> >>> Hello,
> >>> I have a remote machine with a probably defective ODD, but until r294989
> >>> (from Jan 28th) I could boot with just these warnings:
> >>> (cd1:ahcich1:0:0:0): READ(10). CDB: 28 00 00 38 85 e0 00 00 01 00
> >>> (cd1:ahcich1:0:0:0): CAM status: SCSI Status Error
> >>> (cd1:ahcich1:0:0:0): SCSI status: Check Condition
> >>> (cd1:ahcich1:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read
> >>> error)
> >>> (cd1:ahcich1:0:0:0): Error 5, Unretryable error
> >>> (cd1:ahcich1:0:0:0): cddone: got error 0x5 back
> >>> …
> >>> beta3 doesn't boot anymore, it's hanging with ahci-timeouts:
> >>> ahcich2: Timeout on slot 11 port 0
> >>> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00000800 tfd 40 derr
> >>> 00000000 cmd 0004cb17
> >>> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 01 ae a3 50 40 5d 01 00
> >>> 00 00 00
> >>> ...
> >>> (aprobe0:ahcich2:0:0:0) ATA_IDENTIFY. ACB eec 00 00 00 00 40 00 00 00 00
> >>> 00 00
> >>> (aprobe0:ahcich2:0:0:0) CAM status: Command timeout
> >>> (aprobe0:ahcich2:0:0:0) Error 5, Retry was blocked
> >>> ada1 detached
> >>> ...
> >>> The numbers (first ACB) and also the channel varies from time to time
> >> I could narrow it down to r295480
> >> (https://svnweb.freebsd.org/base?view=revision&revision=295480)
> >> Reverting that lets the machine boot again.
> >> I captured verbose boot messages, finding out that problem relaxes with
> >> verbose-booting, since ahci seems to recover:
> >> …
> >> TSC timecounter discards lower 1 bit(s)
> >> Timecounter "TSC-low" frequency 1746033500 Hz quality -100
> >> ahcich2: Timeout on slot 12 port 0
> >> ahcich2: is 00000008 cs 00000000 ss 00000000 rs 00001000 tfd 40 serr
> >> 00000000 cmd 0004cc17
> >> ahcich2: AHCI reset...
> >> (ada1:ahcich2:0:0:0): READ_FPDMA_QUEUED. ACB: 60 04 71 a3 50 40 5d 01 00
> >> 00 00 00
> >> (ada1:ahcich2:0:0:0): CAM status: Command timeout
> >> (ada1:ahcich2:0:0:0): Retrying command
> >> ahcich2: SATA connect time=100us status=00000123
> >> ahcich2: AHCI reset: device found
> >> ahcich2: AHCI reset: device ready after 100ms
> >> ahcich1: SNTF 0x0001
> >> ahcich1: SNTF 0x0001
> >> …
> >> I have checked twice that r295480 introduces boot failure here.
> >> I have absolutely no idea where/how/why/what race happens...
> >> Thanks for any hints,
> > That is most bizarre. Does HEAD boot fine on this machine? The change
> > in question probably alters the timing of startup a bit since the random
> > kthread is placed on the run queue later which might affect the relative
> > order of kthreads as they start executing, but that would just mean it is
> > exposting a race in some other part of the system.
> Bizarre it is...
> HEAD (r295683, 02/17/2016) boots fine.
> BETA3 fails.
> This time I checked with vendor-ISO images, while before it was a custom
> setup rollout with local patches and special hw adaptions. Now I'm sure
> it's not site-related.
> With BETA3-iso, where booting fails, "random: unblocking device."
> happens after timecounter initialization and before attaching ses0/cdX/adaX.
> With HEAD-iso, where booting succeeds, "random: unblocking device."
> happens way after ses0/adaX/cdX attached, right before rc.
Yes, HEAD's /dev/random has many more changes than were put into 10 for
> On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
> Since r295480, cdX attaches before adaX on -stable and while searching
> for the cluprit, I had observed that attaching-order was a clear
> indicator whether machine boots or not.
> Sorry, I can't provide more useful info at this time, I just can
> describe simple symptoms :-(
> While playing with that machine I remember having had such a bizarre
> problem before, arising with r284665. Here's what I found in my kernel
> # Don't build kernel with RACCT by default, which was enabled with r284665,
> # since ahci(4) fails when using MSI with ahcichX timeout!
> nooptions RACCT # Resource accounting framework
> nooptions RACCT_DEFAULT_TO_DISABLED # Set
> kern.racct.enable=0 by default
> nooptions RCTL # Resource limits
> Perhpas it's related?!
I think it's related in the sense that there is a timing race in ahci and
that the /dev/random and RACCT changes alter the timing enough to trigger
the race simply by changing the relative order of SYSINIT's during boot
(and/or the amount of time between the ahci driver doing its initial
probe and the second probe that is run for the interrupt config hooks that
actually probes the attached SATA devices).
More information about the freebsd-stable