Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Pavel Timofeev <timp87_at_gmail.com>
Date: Sat, 09 Oct 2021 14:44:22 UTC
пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:

>
>
> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com> wrote:
>
>>
>>
>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
>>
>>>
>>>
>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com> wrote:
>>>
>>>>
>>>>
>>>>  Warner Losh <imp@bsdimp.com>:
>>>>
>>>>>
>>>>>
>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>  Pavel Timofeev <timp87@gmail.com>:
>>>>>>
>>>>>> >
>>>>>> > Chuck Tuffli <ctuffli@gmail.com>:
>>>>>> >
>>>>>> >> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <timp87@gmail.com>
>>>>>> wrote:
>>>>>> >> >
>>>>>> >> > Hello
>>>>>> >> > I've got a Dell Latitude 7400 and tried installing the latest
>>>>>> >> 14.0-CURRENT
>>>>>> >> > (main-n248636-d20e9e02db3) on it.
>>>>>> >> > Despite other things the weird one which concerns me is
>>>>>> >> >   nvme0: Missing interrupt
>>>>>> >> > message I get sometimes on the console.
>>>>>> >> > It seems like I get it only after the reboot of the laptop, i.
>>>>>> e. not
>>>>>> >> > getting that message if I power cycle the laptop, at least I
>>>>>> haven't
>>>>>> >> seen
>>>>>> >> > them for now in such cases.
>>>>>> >> > So when the laptop is rebooted I can't even take advantage of
>>>>>> >> > nvmecontrol(8) quickly.
>>>>>> >> > Well, it still works, but it takes tens of seconds to return the
>>>>>> output.
>>>>>> >> ...
>>>>>> >> > dmesg when power cycled -
>>>>>> >> >
>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
>>>>>> >> > dmesg when rebooted -
>>>>>> >> >
>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
>>>>>> >>
>>>>>> >> I'm sort of curious about the time stamps for the log messages in
>>>>>> the
>>>>>> >> failing case. Something like:
>>>>>> >>
>>>>>> >> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>> >>
>>>>>> >> --chuck
>>>>>> >>
>>>>>> >
>>>>>> > Well, I can't see timestamps in the verbose boot log. Am I missing
>>>>>> some
>>>>>> > configuration for that?
>>>>>> >
>>>>>> > $ grep "nv\(me\|d\)" /var/log/messages
>>>>>> > nvme0: <Generic NVMe Device> mem
>>>>>> > 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>>>>>> at device
>>>>>> > 0.0 on pci6
>>>>>> > nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
>>>>>> > nvme0: using IRQs 133-137 for MSI-X
>>>>>> > nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
>>>>>> > nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0, MPSMAX 0
>>>>>> > nvme0: Version: 0x00010300: 1.3
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvme0: Missing interrupt
>>>>>> > nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>> > GEOM: new disk nvd0
>>>>>> > nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>> >
>>>>>>
>>>>>>
>>>>>> Ah, sorry, provided wrong output.
>>>>>> Here is what you requested:
>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> mem
>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at
>>>>>> device
>>>>>> 0.0 on pci6
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5 MSI-X
>>>>>> vectors (17 supported)
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for MSI-X
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES 1023,
>>>>>> CQR,
>>>>>> TO 20
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD 0,
>>>>>> NSSRS,
>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix 512GB>
>>>>>> NVMe
>>>>>> namespace
>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512 byte
>>>>>> sectors)
>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
>>>>>>
>>>>>
>>>>> What happens if you set hw.nvme.use_nvd=0 and hw.cam.nda.nvd_compat=1
>>>>> in the boot loader and reboot? Same thing except nda where nvd was? Or
>>>>> does
>>>>> it work?
>>>>>
>>>>> Something weird is going on in the interrupt assignment, I think, but I
>>>>> wanted to get any nvd vs nda issues out of the way first.
>>>>>
>>>>> Warner
>>>>>
>>>>
>>>> Do you mean kern.cam.nda.nvd_compat instead of hw.cam.nda.nvd_compat?
>>>> kern.cam.nda.nvd_compat is 1 by default now.
>>>>
>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I still see
>>>>   nvme0: Missing interrupt
>>>> and now also
>>>>   Root mount waiting for: CAM
>>>> messages besides those
>>>>
>>>
>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by default
>>> these
>>> days.
>>>
>>> I'll take a look on monday starting at the differences in interrupt
>>> assignment that
>>> are apparent when you cold boot vs reboot.
>>>
>>> Thanks for checking... I'd hoped this was a cheap fix, but also didn't
>>> really
>>> expect it to be.
>>>
>>> Warner
>>>
>>>
>> I've recently upgraded to main-n249974-17f790f49f5 and it got even worse
>> now.
>> So clean poweron works as before.
>> But if rebooted nvme drive refuses to work, while before the code upgrade
>> it was just complaining about missing interrupts.
>>
>> currently dmesg show this:
>> nvme0: <Generic NVMe Device> mem
>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>> 0.0 on pci6
>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>> nvd0: 488386MB (1000215216 512 byte sectors)
>> nvme0: <Generic NVMe Device> mem
>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>> 0.0 on pci6
>>
>
> Why is this showing up twice? Or is everything above this line left over
> from the first, working boot?
>
>
>> nvme0: RECOVERY_START 9585870784 vs 9367036288
>> nvme0: timeout with nothing complete, resetting
>> nvme0: Resetting controller due to a timeout.
>> nvme0: RECOVERY_WAITING
>> nvme0: resetting controller
>> nvme0: aborting outstanding admin command
>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>> nvme0: nvme_identify_controller failed!
>> nvme0: waiting
>>
>
> Clearly something bad is going on with the drive here... We looked into
> the completion queues when we didn't get an interrupt and there was nothing
> complete there....
>
> The only thing I can think of is that this means there's a phase error
> between the drive and the system. I recently removed a second reset and
> made it an option NVME_2X_RESET. Can you see if adding
> 'options NVME_2X_RESET' to your kernel config fixes this?
>
> Warner
>
>
>> nvme0: <Generic NVMe Device> mem
>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>> 0.0 on pci6
>> nvme0: RECOVERY_START 9362778467 vs 9361830561
>> nvme0: timeout with nothing complete, resetting
>> nvme0: Resetting controller due to a timeout.
>> nvme0: RECOVERY_WAITING
>> nvme0: resetting controller
>> nvme0: aborting outstanding admin command
>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>> nvme0: nvme_identify_controller failed!
>> nvme0: waiting
>>
>>

Sorry, it's showing twice due to multiple reboots. For one boot it's like:
nvme0: <Generic NVMe Device> mem
0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
0.0 on pci6
nvme0: RECOVERY_START 9633303481 vs 9365971423
nvme0: timeout with nothing complete, resetting
nvme0: Resetting controller due to a timeout.
nvme0: RECOVERY_WAITING
nvme0: resetting controller
nvme0: aborting outstanding admin command
nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
nvme0: nvme_identify_controller failed!
nvme0: waiting

Well, neither Windows not Linux have any problems with the device. I
understand they may be hiding it or workaround somehow.

I'll try setting NVME_2X_RESET in the kernel config and report back in a
while.