Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Pavel Timofeev <timp87_at_gmail.com>
Date: Sun, 17 Oct 2021 23:52:39 UTC
вс, 17 окт. 2021 г. в 11:19, Alexander Motin <mavbsd@gmail.com>:

> It may be a noise, but comparing logs I see in reboot case also
> "acpi_ec0: not getting interrupts, switched to polled mode".  I am
> thinking whether the problem may be caused not by SSD, but by some
> resource conflict/misconfiguration in the system.  Pavel, can you
> compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any
> differences?  Are you running the latest BIOS?
>
> On 12.10.2021 15:56, Warner Losh wrote:
> > One piece of data that would be good to have:
> >
> > nvmecontrol identify nvme0
> >
> > There's an optional feature that none of my drives have, but that the
> Linux
> > driver does oddly
> > weird things when enabled. The output of that command will help me
> > understand if that may
> > be in play. Maybe we need to do oddly weird things too :)
> >
> > Warner
> >
> > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote:
> >
> >>
> >>
> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com>
> wrote:
> >>
> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>:
> >>>
> >>>>
> >>>>
> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> wrote:
> >>>>
> >>>>>
> >>>>>
> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>  Warner Losh <imp@bsdimp.com>:
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <
> timp87@gmail.com>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>>  Pavel Timofeev <timp87@gmail.com>:
> >>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Chuck Tuffli <ctuffli@gmail.com>:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <
> >>>>>>>>>>> timp87@gmail.com> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello
> >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the
> latest
> >>>>>>>>>>>>> 14.0-CURRENT
> >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it.
> >>>>>>>>>>>>>> Despite other things the weird one which concerns me is
> >>>>>>>>>>>>>>   nvme0: Missing interrupt
> >>>>>>>>>>>>>> message I get sometimes on the console.
> >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop,
> >>>>>>>>>>> i. e. not
> >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least I
> >>>>>>>>>>> haven't
> >>>>>>>>>>>>> seen
> >>>>>>>>>>>>>> them for now in such cases.
> >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage
> of
> >>>>>>>>>>>>>> nvmecontrol(8) quickly.
> >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to return
> >>>>>>>>>>> the output.
> >>>>>>>>>>>>> ...
> >>>>>>>>>>>>>> dmesg when power cycled -
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
> >>>>>>>>>>>>>> dmesg when rebooted -
> >>>>>>>>>>>>>>
> >>>>>>>>>>>
> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log
> messages
> >>>>>>>>>>> in the
> >>>>>>>>>>>>> failing case. Something like:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> --chuck
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I
> >>>>>>>>>>> missing some
> >>>>>>>>>>>> configuration for that?
> >>>>>>>>>>>>
> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
> >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem
> >>>>>>>>>>>>
> >>>>>>>>>>>
> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
> >>>>>>>>>>>> 0.0 on pci6
> >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
> >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X
> >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
> >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,
> >>>>>>>>>>> MPSMAX 0
> >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvme0: Missing interrupt
> >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
> >>>>>>>>>>>> GEOM: new disk nvd0
> >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Ah, sorry, provided wrong output.
> >>>>>>>>>>> Here is what you requested:
> >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device>
> mem
> >>>>>>>>>>>
> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
> >>>>>>>>>>> at device
> >>>>>>>>>>> 0.0 on pci6
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate
> 5
> >>>>>>>>>>> MSI-X
> >>>>>>>>>>> vectors (17 supported)
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for
> >>>>>>>>>>> MSI-X
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES
> >>>>>>>>>>> 1023, CQR,
> >>>>>>>>>>> TO 20
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030:
> DSTRD
> >>>>>>>>>>> 0, NSSRS,
> >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300:
> 1.3
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix
> >>>>>>>>>>> 512GB> NVMe
> >>>>>>>>>>> namespace
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512
> >>>>>>>>>>> byte
> >>>>>>>>>>> sectors)
> >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and
> >>>>>>>>>> hw.cam.nda.nvd_compat=1
> >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd
> >>>>>>>>>> was? Or does
> >>>>>>>>>> it work?
> >>>>>>>>>>
> >>>>>>>>>> Something weird is going on in the interrupt assignment, I
> think,
> >>>>>>>>>> but I
> >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.
> >>>>>>>>>>
> >>>>>>>>>> Warner
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead
> >>>>>>>>> of hw.cam.nda.nvd_compat?
> >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.
> >>>>>>>>>
> >>>>>>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I still
> >>>>>>>>> see
> >>>>>>>>>   nvme0: Missing interrupt
> >>>>>>>>> and now also
> >>>>>>>>>   Root mount waiting for: CAM
> >>>>>>>>> messages besides those
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by
> default
> >>>>>>>> these
> >>>>>>>> days.
> >>>>>>>>
> >>>>>>>> I'll take a look on monday starting at the differences in
> interrupt
> >>>>>>>> assignment that
> >>>>>>>> are apparent when you cold boot vs reboot.
> >>>>>>>>
> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also
> >>>>>>>> didn't really
> >>>>>>>> expect it to be.
> >>>>>>>>
> >>>>>>>> Warner
> >>>>>>>>
> >>>>>>>>
> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even
> >>>>>>> worse now.
> >>>>>>> So clean poweron works as before.
> >>>>>>> But if rebooted nvme drive refuses to work, while before the code
> >>>>>>> upgrade it was just complaining about missing interrupts.
> >>>>>>>
> >>>>>>> currently dmesg show this:
> >>>>>>> nvme0: <Generic NVMe Device> mem
> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
> at device
> >>>>>>> 0.0 on pci6
> >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
> >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
> >>>>>>> nvme0: <Generic NVMe Device> mem
> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
> at device
> >>>>>>> 0.0 on pci6
> >>>>>>>
> >>>>>>
> >>>>>> Why is this showing up twice? Or is everything above this line left
> >>>>>> over from the first, working boot?
> >>>>>>
> >>>>>>
> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288
> >>>>>>> nvme0: timeout with nothing complete, resetting
> >>>>>>> nvme0: Resetting controller due to a timeout.
> >>>>>>> nvme0: RECOVERY_WAITING
> >>>>>>> nvme0: resetting controller
> >>>>>>> nvme0: aborting outstanding admin command
> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
> >>>>>>> cdw11:00000000
> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
> >>>>>>> nvme0: nvme_identify_controller failed!
> >>>>>>> nvme0: waiting
> >>>>>>>
> >>>>>>
> >>>>>> Clearly something bad is going on with the drive here... We looked
> >>>>>> into the completion queues when we didn't get an interrupt and
> there was
> >>>>>> nothing complete there....
> >>>>>>
> >>>>>> The only thing I can think of is that this means there's a phase
> error
> >>>>>> between the drive and the system. I recently removed a second reset
> and
> >>>>>> made it an option NVME_2X_RESET. Can you see if adding
> >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?
> >>>>>>
> >>>>>> Warner
> >>>>>>
> >>>>>>
> >>>>>>> nvme0: <Generic NVMe Device> mem
> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
> at device
> >>>>>>> 0.0 on pci6
> >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561
> >>>>>>> nvme0: timeout with nothing complete, resetting
> >>>>>>> nvme0: Resetting controller due to a timeout.
> >>>>>>> nvme0: RECOVERY_WAITING
> >>>>>>> nvme0: resetting controller
> >>>>>>> nvme0: aborting outstanding admin command
> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
> >>>>>>> cdw11:00000000
> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
> >>>>>>> nvme0: nvme_identify_controller failed!
> >>>>>>> nvme0: waiting
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's
> >>>>> like:
> >>>>> nvme0: <Generic NVMe Device> mem
> >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at
> device
> >>>>> 0.0 on pci6
> >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423
> >>>>> nvme0: timeout with nothing complete, resetting
> >>>>> nvme0: Resetting controller due to a timeout.
> >>>>> nvme0: RECOVERY_WAITING
> >>>>> nvme0: resetting controller
> >>>>> nvme0: aborting outstanding admin command
> >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
> cdw11:00000000
> >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
> >>>>> nvme0: nvme_identify_controller failed!
> >>>>> nvme0: waiting
> >>>>>
> >>>>> Well, neither Windows not Linux have any problems with the device. I
> >>>>> understand they may be hiding it or workaround somehow.
> >>>>>
> >>>>
> >>>> Yea, I'm trying to figure out why your machine is different than mine,
> >>>> and what Windows or Linux do that is different. It may be dodgy
> hardware,
> >>>> but others have no trouble...
> >>>>
> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back
> in a
> >>>>> while.
> >>>>>
> >>>>
> >>>> Thanks. If it helps, that tells me something. If it doesn't, that
> tells
> >>>> me something else.
> >>>>
> >>>> I suspect that it is somewhere else in the system, tbh, but I need to
> >>>> find it systematically.
> >>>>
> >>>> Warner
> >>>>
> >>>
> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed
> >>> anything. I. e it didn't help.
> >>>
> >>
> >> While it would have been nice to have this be the fix, I'm not that
> >> surprised either.
> >> It was the biggest change of late, apart from the big re-arrangement
> that
> >> I'd done.
> >>
> >> So the other changes have moved from the occasional missing interrupt
> >> message
> >> (which the old code would get when a command wasn't completed in the
> >> timeout
> >> period, but that we found to be done when we scanned the completion
> queue.
> >> Now
> >> the device is detected fine (as before), but then doesn't do I/O at all
> >> (including not
> >> completing the identify command!) and is worse. This is unexpected and
> I'm
> >> trying
> >> understand what happens on reboot that 'changes'the working state and
> why
> >> the
> >> new code behaves so differently.
> >>
> >> Warner
> >>
> >
>
> --
> Alexander Motin
>


Thanks for the reply.
It's using the latest firmware. This is the  first thing I do in such case.


Attaching devinfo and lspci output.
These are diffs showing the difference between clean boot and a reboot:

$ diff -u devinfo.ok devinfo.nok
--- devinfo.ok 2021-10-17 17:48:07.964346000 -0600
+++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600
@@ -214,10 +214,6 @@
             nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c
subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0
handle=\_SB_.PCI0.RP13.PXSX
                 Interrupt request lines:
                     0x85
-                    0x86
-                    0x87
-                    0x88
-                    0x89
                 pcib7 memory window:
                     0xcc100000-0xcc103fff
                     0xcc104000-0xcc104fff

$ diff -u lspci.ok lspci.nok
--- lspci.ok 2021-10-17 17:48:15.894470000 -0600
+++ lspci.nok 2021-10-17 17:48:15.341379000 -0600
@@ -132,7 +132,7 @@
  Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
  Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
- Address: 00000000fee06000  Data: 0033
+ Address: 00000000fee06000  Data: 0034
  Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0
  ExtTag- RBE- FLReset+