Re: git: a58536b91ae3 - main - pci: Disable Electromechanical Interlock.

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Tue, 04 Oct 2022 23:32:39 UTC

On 04.10.2022 12:47, John Baldwin wrote:
> On 10/4/22 7:44 AM, Alexander Motin wrote:
>> The branch main has been updated by mav:
>>
>> URL: https://cgit.FreeBSD.org/src/commit/?id=a58536b91ae3931d222c3e4f1a949ff4a4927fb2
>>
>> commit a58536b91ae3931d222c3e4f1a949ff4a4927fb2
>> Author:     Alexander Motin <mav@FreeBSD.org>
>> AuthorDate: 2022-10-04 14:34:15 +0000
>> Commit:     Alexander Motin <mav@FreeBSD.org>
>> CommitDate: 2022-10-04 14:34:15 +0000
>>
>>      pci: Disable Electromechanical Interlock.
>>      Add sysctl/tunable to control Electromechanical Interlock support.
>>      Disable it by default since Linux does not do it either and it seems
>>      the number of systems having it broken is higher than having working.
>>      This fixes NVMe backplane operation on ASUS RS500A-E11-RS12U server
>>      with AMD EPYC 7402 CPU, where attempts to control reported interlock
>>      for some reason end up in PCIe link loss, while interlock status does
>>      not change (it is not really there).
>>      MFC after:      2 weeks
> 
> See also https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=256264, though
> that is more for the case where slots aren't really hotplug at all.
> 
> The root issue seems to be that there are generic HotPlug-capable
> bridges but that manufacturers fail to correctly wire up the various
> input pins such that the bridges can actually determine that there is no
> MRL or EI, etc.  The above PR (which I still can't get the reporter to
> test the patch for, but perhaps should just merge?) disables PCI-e hotplug
> if the link is up, but the other status bits claim that the device is
> partially inserted when attaching the bridge.

In my case the slots really are expected to be hot-pluggable; ASUS just
can't do things right.  For the case in the PR, your patch seems to make
sense.  I'd be more worried about the already present check for a broken
MRL -- if we see the MRL open but the device is still powered, we may wish
to quickly shut the device down.  But I agree that the probability of a
false negative here is much higher than of a positive.  I still haven't
had my hands on any hardware implementing all those cool bells and
whistles.
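
To make the MRL case concrete, here is a minimal sketch of the condition
described above. It is not the actual bridge driver code: the helper name
and the macro names are made up for illustration, and only the bit
positions are taken from the PCIe Slot Status and Slot Control register
layouts.

```c
#include <stdbool.h>
#include <stdint.h>

/* Slot Status register bits (PCI Express capability). */
#define SLOT_STA_MRLSS	0x0020	/* MRL Sensor State: 1 = latch open */
#define SLOT_STA_PDS	0x0040	/* Presence Detect State: 1 = card present */

/* Slot Control register bit. */
#define SLOT_CTL_PCC	0x0400	/* Power Controller Control: 1 = power off */

/*
 * Hypothetical helper: report whether the slot claims an open MRL while
 * slot power is still requested on, i.e. the case where we may want to
 * cut power quickly.  It cannot tell a real open latch from a sensor
 * input that was never wired up.
 */
static bool
slot_mrl_open_while_powered(uint16_t slot_sta, uint16_t slot_ctl)
{
	bool mrl_open = (slot_sta & SLOT_STA_MRLSS) != 0;
	bool powered = (slot_ctl & SLOT_CTL_PCC) == 0;

	return (mrl_open && powered);
}
```

On a board with a floating MRL input this would return true for a
perfectly healthy slot, which is exactly the false reading discussed
above.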

-- 
Alexander Motin