ahci-timeout regression in beta3

Harry Schmalzbauer freebsd at omnilan.de
Sat Mar 5 22:41:09 UTC 2016

Regarding John Baldwin's message of 05.03.2016 22:50 (localtime):
> On Saturday, March 05, 2016 01:11:13 PM Harry Schmalzbauer wrote:
>> Regarding John Baldwin's message of 02.03.2016 18:32 (localtime):
>> With BETA3-iso, where booting fails, "random: unblocking device."
>> happens after timecounter initialization and before attaching ses0/cdX/adaX.
>> With HEAD-iso, where booting succeeds, "random: unblocking device."
>> happens way after ses0/adaX/cdX attached, right before rc.
> Yes, HEAD's /dev/random has many more changes than were put into 10 for
> BETA3.
>> On HEAD, ahci-devices attach in the same order as with -stable pre-r295480.
>> Since r295480, cdX attaches before adaX on -stable and while searching
>> for the culprit, I had observed that attaching-order was a clear
>> indicator whether the machine boots or not.
>> Perhaps it's related?!
>> https://lists.freebsd.org/pipermail/freebsd-stable/2015-July/082706.html
> I think it's related in the sense that there is a timing race in ahci and
> that the /dev/random and RACCT changes alter the timing enough to trigger
> the race simply by changing the relative order of SYSINIT's during boot
> (and/or the amount of time between the ahci driver doing its initial
> probe and the second probe that is run for the interrupt config hooks that
> actually probes the attached SATA devices).

Thanks for your comment. I had that kind of race in mind, but I don't
have the skills to debug it myself, neither then nor now, and unfortunately
not the time for an upgrade either ;-)

Meanwhile, I have deployed 10.3-RC1 without reverting r295480 (and I have
also removed "nooptions RACCT" (+ RCTL), since the ineffective
»kern.racct.enable« was fixed some time after that problem hit me).

The good news is that these ahci-timeouts haven't shown up elsewhere yet:
I've updated several _very_ similar setups (C200 chipsets, but none with
a suspected faulty ODD).

So it's clearly not a show stopper for 10.3.

But there is still a timing race to find, one that triggers ahci-timeouts,
the nastiest ones I have ever fought... And it's not very welcome to find
that a remote machine has stopped booting because of a faulty ODD one wasn't
aware of, since it boots fine with the previous FreeBSD release and other OSs.

Tell me if I can help out with my skills.


