Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Pavel Timofeev <timp87_at_gmail.com>
Date: Fri, 14 Jan 2022 18:53:06 UTC
вс, 17 окт. 2021 г. в 17:52, Pavel Timofeev <timp87@gmail.com>:

>
>
> вс, 17 окт. 2021 г. в 11:19, Alexander Motin <mavbsd@gmail.com>:
>
>> It may be a noise, but comparing logs I see in reboot case also
>> "acpi_ec0: not getting interrupts, switched to polled mode".  I am
>> thinking whether the problem may be caused not by SSD, but by some
>> resource conflict/misconfiguration in the system.  Pavel, can you
>> compare `devinfo -vr` and `lspci -vvvvv` in both cases. looking for any
>> differences?  Are you running the latest BIOS?
>>
>> On 12.10.2021 15:56, Warner Losh wrote:
>> > One piece of data that would be good to have:
>> >
>> > nvmecontrol identify nvme0
>> >
>> > There's an optional feature that none of my drives have, but that the
>> Linux
>> > driver does oddly
>> > weird things when enabled. The output of that command will help me
>> > understand if that may
>> > be in play. Maybe we need to do oddly weird things too :)
>> >
>> > Warner
>> >
>> > On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote:
>> >
>> >>
>> >>
>> >> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com>
>> wrote:
>> >>
>> >>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>:
>> >>>
>> >>>>
>> >>>>
>> >>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com>
>> wrote:
>> >>>>
>> >>>>>
>> >>>>>
>> >>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:
>> >>>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com>
>> >>>>>> wrote:
>> >>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com
>> >
>> >>>>>>>> wrote:
>> >>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>>  Warner Losh <imp@bsdimp.com>:
>> >>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <
>> timp87@gmail.com>
>> >>>>>>>>>> wrote:
>> >>>>>>>>>>
>> >>>>>>>>>>>  Pavel Timofeev <timp87@gmail.com>:
>> >>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Chuck Tuffli <ctuffli@gmail.com>:
>> >>>>>>>>>>>>
>> >>>>>>>>>>>>> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <
>> >>>>>>>>>>> timp87@gmail.com> wrote:
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>>>> Hello
>> >>>>>>>>>>>>>> I've got a Dell Latitude 7400 and tried installing the
>> latest
>> >>>>>>>>>>>>> 14.0-CURRENT
>> >>>>>>>>>>>>>> (main-n248636-d20e9e02db3) on it.
>> >>>>>>>>>>>>>> Despite other things the weird one which concerns me is
>> >>>>>>>>>>>>>>   nvme0: Missing interrupt
>> >>>>>>>>>>>>>> message I get sometimes on the console.
>> >>>>>>>>>>>>>> It seems like I get it only after the reboot of the laptop,
>> >>>>>>>>>>> i. e. not
>> >>>>>>>>>>>>>> getting that message if I power cycle the laptop, at least
>> I
>> >>>>>>>>>>> haven't
>> >>>>>>>>>>>>> seen
>> >>>>>>>>>>>>>> them for now in such cases.
>> >>>>>>>>>>>>>> So when the laptop is rebooted I can't even take advantage
>> of
>> >>>>>>>>>>>>>> nvmecontrol(8) quickly.
>> >>>>>>>>>>>>>> Well, it still works, but it takes tens of seconds to
>> return
>> >>>>>>>>>>> the output.
>> >>>>>>>>>>>>> ...
>> >>>>>>>>>>>>>> dmesg when power cycled -
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
>> >>>>>>>>>>>>>> dmesg when rebooted -
>> >>>>>>>>>>>>>>
>> >>>>>>>>>>>
>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> I'm sort of curious about the time stamps for the log
>> messages
>> >>>>>>>>>>> in the
>> >>>>>>>>>>>>> failing case. Something like:
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>> --chuck
>> >>>>>>>>>>>>>
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> Well, I can't see timestamps in the verbose boot log. Am I
>> >>>>>>>>>>> missing some
>> >>>>>>>>>>>> configuration for that?
>> >>>>>>>>>>>>
>> >>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>> >>>>>>>>>>>> nvme0: <Generic NVMe Device> mem
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>> >>>>>>>>>>>> 0.0 on pci6
>> >>>>>>>>>>>> nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
>> >>>>>>>>>>>> nvme0: using IRQs 133-137 for MSI-X
>> >>>>>>>>>>>> nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
>> >>>>>>>>>>>> nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,
>> >>>>>>>>>>> MPSMAX 0
>> >>>>>>>>>>>> nvme0: Version: 0x00010300: 1.3
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvme0: Missing interrupt
>> >>>>>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>> >>>>>>>>>>>> GEOM: new disk nvd0
>> >>>>>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>> >>>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>>
>> >>>>>>>>>>> Ah, sorry, provided wrong output.
>> >>>>>>>>>>> Here is what you requested:
>> >>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device>
>> mem
>> >>>>>>>>>>>
>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>> >>>>>>>>>>> at device
>> >>>>>>>>>>> 0.0 on pci6
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to
>> allocate 5
>> >>>>>>>>>>> MSI-X
>> >>>>>>>>>>> vectors (17 supported)
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for
>> >>>>>>>>>>> MSI-X
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff:
>> MQES
>> >>>>>>>>>>> 1023, CQR,
>> >>>>>>>>>>> TO 20
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030:
>> DSTRD
>> >>>>>>>>>>> 0, NSSRS,
>> >>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300:
>> 1.3
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix
>> >>>>>>>>>>> 512GB> NVMe
>> >>>>>>>>>>> namespace
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
>> >>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216
>> 512
>> >>>>>>>>>>> byte
>> >>>>>>>>>>> sectors)
>> >>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
>> >>>>>>>>>>>
>> >>>>>>>>>>
>> >>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and
>> >>>>>>>>>> hw.cam.nda.nvd_compat=1
>> >>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd
>> >>>>>>>>>> was? Or does
>> >>>>>>>>>> it work?
>> >>>>>>>>>>
>> >>>>>>>>>> Something weird is going on in the interrupt assignment, I
>> think,
>> >>>>>>>>>> but I
>> >>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.
>> >>>>>>>>>>
>> >>>>>>>>>> Warner
>> >>>>>>>>>>
>> >>>>>>>>>
>> >>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead
>> >>>>>>>>> of hw.cam.nda.nvd_compat?
>> >>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.
>> >>>>>>>>>
>> >>>>>>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I
>> still
>> >>>>>>>>> see
>> >>>>>>>>>   nvme0: Missing interrupt
>> >>>>>>>>> and now also
>> >>>>>>>>>   Root mount waiting for: CAM
>> >>>>>>>>> messages besides those
>> >>>>>>>>>
>> >>>>>>>>
>> >>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by
>> default
>> >>>>>>>> these
>> >>>>>>>> days.
>> >>>>>>>>
>> >>>>>>>> I'll take a look on monday starting at the differences in
>> interrupt
>> >>>>>>>> assignment that
>> >>>>>>>> are apparent when you cold boot vs reboot.
>> >>>>>>>>
>> >>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also
>> >>>>>>>> didn't really
>> >>>>>>>> expect it to be.
>> >>>>>>>>
>> >>>>>>>> Warner
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even
>> >>>>>>> worse now.
>> >>>>>>> So clean poweron works as before.
>> >>>>>>> But if rebooted nvme drive refuses to work, while before the code
>> >>>>>>> upgrade it was just complaining about missing interrupts.
>> >>>>>>>
>> >>>>>>> currently dmesg show this:
>> >>>>>>> nvme0: <Generic NVMe Device> mem
>> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>> at device
>> >>>>>>> 0.0 on pci6
>> >>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>> >>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>> >>>>>>> nvme0: <Generic NVMe Device> mem
>> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>> at device
>> >>>>>>> 0.0 on pci6
>> >>>>>>>
>> >>>>>>
>> >>>>>> Why is this showing up twice? Or is everything above this line left
>> >>>>>> over from the first, working boot?
>> >>>>>>
>> >>>>>>
>> >>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288
>> >>>>>>> nvme0: timeout with nothing complete, resetting
>> >>>>>>> nvme0: Resetting controller due to a timeout.
>> >>>>>>> nvme0: RECOVERY_WAITING
>> >>>>>>> nvme0: resetting controller
>> >>>>>>> nvme0: aborting outstanding admin command
>> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>> >>>>>>> cdw11:00000000
>> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>> >>>>>>> nvme0: nvme_identify_controller failed!
>> >>>>>>> nvme0: waiting
>> >>>>>>>
>> >>>>>>
>> >>>>>> Clearly something bad is going on with the drive here... We looked
>> >>>>>> into the completion queues when we didn't get an interrupt and
>> there was
>> >>>>>> nothing complete there....
>> >>>>>>
>> >>>>>> The only thing I can think of is that this means there's a phase
>> error
>> >>>>>> between the drive and the system. I recently removed a second
>> reset and
>> >>>>>> made it an option NVME_2X_RESET. Can you see if adding
>> >>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?
>> >>>>>>
>> >>>>>> Warner
>> >>>>>>
>> >>>>>>
>> >>>>>>> nvme0: <Generic NVMe Device> mem
>> >>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>> at device
>> >>>>>>> 0.0 on pci6
>> >>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561
>> >>>>>>> nvme0: timeout with nothing complete, resetting
>> >>>>>>> nvme0: Resetting controller due to a timeout.
>> >>>>>>> nvme0: RECOVERY_WAITING
>> >>>>>>> nvme0: resetting controller
>> >>>>>>> nvme0: aborting outstanding admin command
>> >>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>> >>>>>>> cdw11:00000000
>> >>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>> >>>>>>> nvme0: nvme_identify_controller failed!
>> >>>>>>> nvme0: waiting
>> >>>>>>>
>> >>>>>>>
>> >>>>>
>> >>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's
>> >>>>> like:
>> >>>>> nvme0: <Generic NVMe Device> mem
>> >>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>> at device
>> >>>>> 0.0 on pci6
>> >>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423
>> >>>>> nvme0: timeout with nothing complete, resetting
>> >>>>> nvme0: Resetting controller due to a timeout.
>> >>>>> nvme0: RECOVERY_WAITING
>> >>>>> nvme0: resetting controller
>> >>>>> nvme0: aborting outstanding admin command
>> >>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>> cdw11:00000000
>> >>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>> >>>>> nvme0: nvme_identify_controller failed!
>> >>>>> nvme0: waiting
>> >>>>>
>> >>>>> Well, neither Windows not Linux have any problems with the device. I
>> >>>>> understand they may be hiding it or workaround somehow.
>> >>>>>
>> >>>>
>> >>>> Yea, I'm trying to figure out why your machine is different than
>> mine,
>> >>>> and what Windows or Linux do that is different. It may be dodgy
>> hardware,
>> >>>> but others have no trouble...
>> >>>>
>> >>>> I'll try setting NVME_2X_RESET in the kernel config and report back
>> in a
>> >>>>> while.
>> >>>>>
>> >>>>
>> >>>> Thanks. If it helps, that tells me something. If it doesn't, that
>> tells
>> >>>> me something else.
>> >>>>
>> >>>> I suspect that it is somewhere else in the system, tbh, but I need to
>> >>>> find it systematically.
>> >>>>
>> >>>> Warner
>> >>>>
>> >>>
>> >>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't
>> changed
>> >>> anything. I. e it didn't help.
>> >>>
>> >>
>> >> While it would have been nice to have this be the fix, I'm not that
>> >> surprised either.
>> >> It was the biggest change of late, apart from the big re-arrangement
>> that
>> >> I'd done.
>> >>
>> >> So the other changes have moved from the occasional missing interrupt
>> >> message
>> >> (which the old code would get when a command wasn't completed in the
>> >> timeout
>> >> period, but that we found to be done when we scanned the completion
>> queue.
>> >> Now
>> >> the device is detected fine (as before), but then doesn't do I/O at all
>> >> (including not
>> >> completing the identify command!) and is worse. This is unexpected and
>> I'm
>> >> trying
>> >> understand what happens on reboot that 'changes'the working state and
>> why
>> >> the
>> >> new code behaves so differently.
>> >>
>> >> Warner
>> >>
>> >
>>
>> --
>> Alexander Motin
>>
>
>
> Thanks for the reply.
> It's using the latest firmware. This is the  first thing I do in such case.
>
>
> Attaching devinfo and lspci output.
> These are diffs showing the difference between clean boot and a reboot:
>
> $ diff -u devinfo.ok devinfo.nok
> --- devinfo.ok 2021-10-17 17:48:07.964346000 -0600
> +++ devinfo.nok 2021-10-17 17:48:07.886881000 -0600
> @@ -214,10 +214,6 @@
>              nvme0 pnpinfo vendor=0x1c5c device=0x1639 subvendor=0x1c5c
> subdevice=0x1639 class=0x010802 at slot=0 function=0 dbsf=pci0:59:0:0
> handle=\_SB_.PCI0.RP13.PXSX
>                  Interrupt request lines:
>                      0x85
> -                    0x86
> -                    0x87
> -                    0x88
> -                    0x89
>                  pcib7 memory window:
>                      0xcc100000-0xcc103fff
>                      0xcc104000-0xcc104fff
>
> $ diff -u lspci.ok lspci.nok
> --- lspci.ok 2021-10-17 17:48:15.894470000 -0600
> +++ lspci.nok 2021-10-17 17:48:15.341379000 -0600
> @@ -132,7 +132,7 @@
>   Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA
> PME(D0+,D1-,D2-,D3hot+,D3cold+)
>   Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
>   Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
> - Address: 00000000fee06000  Data: 0033
> + Address: 00000000fee06000  Data: 0034
>   Capabilities: [40] Express (v2) Root Complex Integrated Endpoint, MSI 00
>   DevCap: MaxPayload 128 bytes, PhantFunc 0
>   ExtTag- RBE- FLReset+
>
>
Hi,
I hope everyone is doing well.
So several BIOS updates passed, now the BIOS version 1.15.1, but it works
the same. At least on CURRENT built several days ago
(main-n252414-0e8181c0123).
What is interesting iwn(4) and iwlwifi(4) work the same way - only full
power cycle makes wifi functional, simple reboot brakes it in most cases.
Does anybody have any idea?