Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Alexander Motin <mav_at_FreeBSD.org>
Date: Sun, 17 Oct 2021 17:25:42 UTC
It may be a noise, but comparing logs I see in reboot case also
"acpi_ec0: not getting interrupts, switched to polled mode".  I am
thinking whether the problem may be caused not by SSD, but by some
resource conflict/misconfiguration in the system.  Pavel, can you
compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any
differences?  Are you running the latest BIOS?

On 12.10.2021 15:56, Warner Losh wrote:
> One piece of data that would be good to have:
> 
> nvmecontrol identify nvme0
> 
> There's an optional feature that none of my drives have, but that the Linux
> driver does oddly
> weird things when enabled. The output of that command will help me
> understand if that may
> be in play. Maybe we need to do oddly weird things too :)
> 
> Warner
> 
> On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote:
> 
>>
>>
>> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com> wrote:
>>
>>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>:
>>>
>>>>
>>>>
>>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>  Warner Losh <imp@bsdimp.com>:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>>  Pavel Timofeev <timp87@gmail.com>:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Chuck Tuffli <ctuffli@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <
>>>>>>>>>>> timp87@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello
>>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the latest
>>>>>>>>>>>>> 14.0-CURRENT
>>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it.
>>>>>>>>>>>>>> Despite other things the weird one which concerns me is
>>>>>>>>>>>>>>   nvme0: Missing interrupt
>>>>>>>>>>>>>> message I get sometimes on the console.
>>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop,
>>>>>>>>>>> i. e. not
>>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least I
>>>>>>>>>>> haven't
>>>>>>>>>>>>> seen
>>>>>>>>>>>>>> them for now in such cases.
>>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage of
>>>>>>>>>>>>>> nvmecontrol(8) quickly.
>>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to return
>>>>>>>>>>> the output.
>>>>>>>>>>>>> ...
>>>>>>>>>>>>>> dmesg when power cycled -
>>>>>>>>>>>>>>
>>>>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
>>>>>>>>>>>>>> dmesg when rebooted -
>>>>>>>>>>>>>>
>>>>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm sort of curious about the time stamps for the log messages
>>>>>>>>>>> in the
>>>>>>>>>>>>> failing case. Something like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>>>>
>>>>>>>>>>>>> --chuck
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I
>>>>>>>>>>> missing some
>>>>>>>>>>>> configuration for that?
>>>>>>>>>>>>
>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>>>>>>>
>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>>>>>> 0.0 on pci6
>>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
>>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X
>>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
>>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,
>>>>>>>>>>> MPSMAX 0
>>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvme0: Missing interrupt
>>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>>>>>>> GEOM: new disk nvd0
>>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Ah, sorry, provided wrong output.
>>>>>>>>>>> Here is what you requested:
>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device> mem
>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>>>>>>>>>>> at device
>>>>>>>>>>> 0.0 on pci6
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate 5
>>>>>>>>>>> MSI-X
>>>>>>>>>>> vectors (17 supported)
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for
>>>>>>>>>>> MSI-X
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES
>>>>>>>>>>> 1023, CQR,
>>>>>>>>>>> TO 20
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030: DSTRD
>>>>>>>>>>> 0, NSSRS,
>>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix
>>>>>>>>>>> 512GB> NVMe
>>>>>>>>>>> namespace
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512
>>>>>>>>>>> byte
>>>>>>>>>>> sectors)
>>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and
>>>>>>>>>> hw.cam.nda.nvd_compat=1
>>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd
>>>>>>>>>> was? Or does
>>>>>>>>>> it work?
>>>>>>>>>>
>>>>>>>>>> Something weird is going on in the interrupt assignment, I think,
>>>>>>>>>> but I
>>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.
>>>>>>>>>>
>>>>>>>>>> Warner
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead
>>>>>>>>> of hw.cam.nda.nvd_compat?
>>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.
>>>>>>>>>
>>>>>>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I still
>>>>>>>>> see
>>>>>>>>>   nvme0: Missing interrupt
>>>>>>>>> and now also
>>>>>>>>>   Root mount waiting for: CAM
>>>>>>>>> messages besides those
>>>>>>>>>
>>>>>>>>
>>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by default
>>>>>>>> these
>>>>>>>> days.
>>>>>>>>
>>>>>>>> I'll take a look on monday starting at the differences in interrupt
>>>>>>>> assignment that
>>>>>>>> are apparent when you cold boot vs reboot.
>>>>>>>>
>>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also
>>>>>>>> didn't really
>>>>>>>> expect it to be.
>>>>>>>>
>>>>>>>> Warner
>>>>>>>>
>>>>>>>>
>>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even
>>>>>>> worse now.
>>>>>>> So clean poweron works as before.
>>>>>>> But if rebooted nvme drive refuses to work, while before the code
>>>>>>> upgrade it was just complaining about missing interrupts.
>>>>>>>
>>>>>>> currently dmesg show this:
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>>
>>>>>>
>>>>>> Why is this showing up twice? Or is everything above this line left
>>>>>> over from the first, working boot?
>>>>>>
>>>>>>
>>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288
>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>> nvme0: resetting controller
>>>>>>> nvme0: aborting outstanding admin command
>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>> cdw11:00000000
>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>> nvme0: waiting
>>>>>>>
>>>>>>
>>>>>> Clearly something bad is going on with the drive here... We looked
>>>>>> into the completion queues when we didn't get an interrupt and there was
>>>>>> nothing complete there....
>>>>>>
>>>>>> The only thing I can think of is that this means there's a phase error
>>>>>> between the drive and the system. I recently removed a second reset and
>>>>>> made it an option NVME_2X_RESET. Can you see if adding
>>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?
>>>>>>
>>>>>> Warner
>>>>>>
>>>>>>
>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>> 0.0 on pci6
>>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561
>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>> nvme0: resetting controller
>>>>>>> nvme0: aborting outstanding admin command
>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>> cdw11:00000000
>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>> nvme0: waiting
>>>>>>>
>>>>>>>
>>>>>
>>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's
>>>>> like:
>>>>> nvme0: <Generic NVMe Device> mem
>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>> 0.0 on pci6
>>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423
>>>>> nvme0: timeout with nothing complete, resetting
>>>>> nvme0: Resetting controller due to a timeout.
>>>>> nvme0: RECOVERY_WAITING
>>>>> nvme0: resetting controller
>>>>> nvme0: aborting outstanding admin command
>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001 cdw11:00000000
>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>> nvme0: nvme_identify_controller failed!
>>>>> nvme0: waiting
>>>>>
>>>>> Well, neither Windows not Linux have any problems with the device. I
>>>>> understand they may be hiding it or workaround somehow.
>>>>>
>>>>
>>>> Yea, I'm trying to figure out why your machine is different than mine,
>>>> and what Windows or Linux do that is different. It may be dodgy hardware,
>>>> but others have no trouble...
>>>>
>>>> I'll try setting NVME_2X_RESET in the kernel config and report back in a
>>>>> while.
>>>>>
>>>>
>>>> Thanks. If it helps, that tells me something. If it doesn't, that tells
>>>> me something else.
>>>>
>>>> I suspect that it is somewhere else in the system, tbh, but I need to
>>>> find it systematically.
>>>>
>>>> Warner
>>>>
>>>
>>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed
>>> anything. I. e it didn't help.
>>>
>>
>> While it would have been nice to have this be the fix, I'm not that
>> surprised either.
>> It was the biggest change of late, apart from the big re-arrangement that
>> I'd done.
>>
>> So the other changes have moved from the occasional missing interrupt
>> message
>> (which the old code would get when a command wasn't completed in the
>> timeout
>> period, but that we found to be done when we scanned the completion queue.
>> Now
>> the device is detected fine (as before), but then doesn't do I/O at all
>> (including not
>> completing the identify command!) and is worse. This is unexpected and I'm
>> trying
>> understand what happens on reboot that 'changes'the working state and why
>> the
>> new code behaves so differently.
>>
>> Warner
>>
> 

-- 
Alexander Motin