Re: Dell Latitude 7400 - nvme0: Missing interrupt

From: Warner Losh <imp_at_bsdimp.com>
Date: Wed, 13 Oct 2021 02:46:52 UTC
On Tue, Oct 12, 2021 at 6:51 PM Pavel Timofeev <timp87@gmail.com> wrote:

>
>
> вт, 12 окт. 2021 г. в 13:56, Warner Losh <imp@bsdimp.com>:
>
>> One piece of data that would be good to have:
>>
>> nvmecontrol identify nvme0
>>
>> There's an optional feature that none of my drives have, but that the
>> Linux driver does oddly
>> weird things when enabled. The output of that command will help me
>> understand if that may
>> be in play. Maybe we need to do oddly weird things too :)
>>
>> Warner
>>
>> On Sun, Oct 10, 2021 at 11:00 PM Warner Losh <imp@bsdimp.com> wrote:
>>
>>>
>>>
>>> On Sun, Oct 10, 2021 at 10:48 PM Pavel Timofeev <timp87@gmail.com>
>>> wrote:
>>>
>>>> сб, 9 окт. 2021 г. в 14:59, Warner Losh <imp@bsdimp.com>:
>>>>
>>>>>
>>>>>
>>>>> On Sat, Oct 9, 2021, 8:44 AM Pavel Timofeev <timp87@gmail.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> пт, 8 окт. 2021 г. в 14:49, Warner Losh <imp@bsdimp.com>:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 8, 2021 at 2:42 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> сб, 21 авг. 2021 г. в 15:22, Warner Losh <imp@bsdimp.com>:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Aug 21, 2021 at 3:06 PM Pavel Timofeev <timp87@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  Warner Losh <imp@bsdimp.com>:
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Aug 20, 2021 at 10:42 PM Pavel Timofeev <
>>>>>>>>>>> timp87@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>  Pavel Timofeev <timp87@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>> >
>>>>>>>>>>>> > Chuck Tuffli <ctuffli@gmail.com>:
>>>>>>>>>>>> >
>>>>>>>>>>>> >> On Mon, Aug 16, 2021 at 7:43 PM Pavel Timofeev <
>>>>>>>>>>>> timp87@gmail.com> wrote:
>>>>>>>>>>>> >> >
>>>>>>>>>>>> >> > Hello
>>>>>>>>>>>> >> > I've got a Dell Latitude 7400 and tried installing the
>>>>>>>>>>>> latest
>>>>>>>>>>>> >> 14.0-CURRENT
>>>>>>>>>>>> >> > (main-n248636-d20e9e02db3) on it.
>>>>>>>>>>>> >> > Despite other things the weird one which concerns me is
>>>>>>>>>>>> >> >   nvme0: Missing interrupt
>>>>>>>>>>>> >> > message I get sometimes on the console.
>>>>>>>>>>>> >> > It seems like I get it only after the reboot of the
>>>>>>>>>>>> laptop, i. e. not
>>>>>>>>>>>> >> > getting that message if I power cycle the laptop, at least
>>>>>>>>>>>> I haven't
>>>>>>>>>>>> >> seen
>>>>>>>>>>>> >> > them for now in such cases.
>>>>>>>>>>>> >> > So when the laptop is rebooted I can't even take advantage
>>>>>>>>>>>> of
>>>>>>>>>>>> >> > nvmecontrol(8) quickly.
>>>>>>>>>>>> >> > Well, it still works, but it takes tens of seconds to
>>>>>>>>>>>> return the output.
>>>>>>>>>>>> >> ...
>>>>>>>>>>>> >> > dmesg when power cycled -
>>>>>>>>>>>> >> >
>>>>>>>>>>>> https://drive.google.com/file/d/1dB27oB1O2CcnZy6DvOOhmFO8SN8V8SwJ
>>>>>>>>>>>> >> > dmesg when rebooted -
>>>>>>>>>>>> >> >
>>>>>>>>>>>> https://drive.google.com/file/d/1DsKTMkihp_OmUcirByLaVO4o2mU38Bxh
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> I'm sort of curious about the time stamps for the log
>>>>>>>>>>>> messages in the
>>>>>>>>>>>> >> failing case. Something like:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> --chuck
>>>>>>>>>>>> >>
>>>>>>>>>>>> >
>>>>>>>>>>>> > Well, I can't see timestamps in the verbose boot log. Am I
>>>>>>>>>>>> missing some
>>>>>>>>>>>> > configuration for that?
>>>>>>>>>>>> >
>>>>>>>>>>>> > $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>>> > nvme0: <Generic NVMe Device> mem
>>>>>>>>>>>> >
>>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>>>>>> > 0.0 on pci6
>>>>>>>>>>>> > nvme0: attempting to allocate 5 MSI-X vectors (17 supported)
>>>>>>>>>>>> > nvme0: using IRQs 133-137 for MSI-X
>>>>>>>>>>>> > nvme0: CapLo: 0x140103ff: MQES 1023, CQR, TO 20
>>>>>>>>>>>> > nvme0: CapHi: 0x00000030: DSTRD 0, NSSRS, CSS 1, MPSMIN 0,
>>>>>>>>>>>> MPSMAX 0
>>>>>>>>>>>> > nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvme0: Missing interrupt
>>>>>>>>>>>> > nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>>>>>>> > GEOM: new disk nvd0
>>>>>>>>>>>> > nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>>>>>>> >
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Ah, sorry, provided wrong output.
>>>>>>>>>>>> Here is what you requested:
>>>>>>>>>>>> $ grep "nv\(me\|d\)" /var/log/messages
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: <Generic NVMe Device>
>>>>>>>>>>>> mem
>>>>>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff
>>>>>>>>>>>> at device
>>>>>>>>>>>> 0.0 on pci6
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: attempting to allocate
>>>>>>>>>>>> 5 MSI-X
>>>>>>>>>>>> vectors (17 supported)
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: using IRQs 133-137 for
>>>>>>>>>>>> MSI-X
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapLo: 0x140103ff: MQES
>>>>>>>>>>>> 1023, CQR,
>>>>>>>>>>>> TO 20
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: CapHi: 0x00000030:
>>>>>>>>>>>> DSTRD 0, NSSRS,
>>>>>>>>>>>> CSS 1, MPSMIN 0, MPSMAX 0
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Version: 0x00010300: 1.3
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: <PC611 NVMe SK hynix
>>>>>>>>>>>> 512GB> NVMe
>>>>>>>>>>>> namespace
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: GEOM: new disk nvd0
>>>>>>>>>>>> Aug 21 04:34:36 nostromo kernel: nvd0: 488386MB (1000215216 512
>>>>>>>>>>>> byte
>>>>>>>>>>>> sectors)
>>>>>>>>>>>> Aug 21 04:34:42 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>> Aug 21 04:35:36 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>> Aug 21 04:35:50 nostromo kernel: nvme0: Missing interrupt
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> What happens if you set hw.nvme.use_nvd=0 and
>>>>>>>>>>> hw.cam.nda.nvd_compat=1
>>>>>>>>>>> in the boot loader and reboot? Same thing except nda where nvd
>>>>>>>>>>> was? Or does
>>>>>>>>>>> it work?
>>>>>>>>>>>
>>>>>>>>>>> Something weird is going on in the interrupt assignment, I
>>>>>>>>>>> think, but I
>>>>>>>>>>> wanted to get any nvd vs nda issues out of the way first.
>>>>>>>>>>>
>>>>>>>>>>> Warner
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Do you mean kern.cam.nda.nvd_compat instead
>>>>>>>>>> of hw.cam.nda.nvd_compat?
>>>>>>>>>> kern.cam.nda.nvd_compat is 1 by default now.
>>>>>>>>>>
>>>>>>>>>> So I tried to set  hw.nvme.use_nvd to 1 as suggested, but I still
>>>>>>>>>> see
>>>>>>>>>>   nvme0: Missing interrupt
>>>>>>>>>> and now also
>>>>>>>>>>   Root mount waiting for: CAM
>>>>>>>>>> messages besides those
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> OK. That all makes sense. I'd forgotten that nvd_compat=1 by
>>>>>>>>> default these
>>>>>>>>> days.
>>>>>>>>>
>>>>>>>>> I'll take a look on monday starting at the differences in
>>>>>>>>> interrupt assignment that
>>>>>>>>> are apparent when you cold boot vs reboot.
>>>>>>>>>
>>>>>>>>> Thanks for checking... I'd hoped this was a cheap fix, but also
>>>>>>>>> didn't really
>>>>>>>>> expect it to be.
>>>>>>>>>
>>>>>>>>> Warner
>>>>>>>>>
>>>>>>>>>
>>>>>>>> I've recently upgraded to main-n249974-17f790f49f5 and it got even
>>>>>>>> worse now.
>>>>>>>> So clean poweron works as before.
>>>>>>>> But if rebooted nvme drive refuses to work, while before the code
>>>>>>>> upgrade it was just complaining about missing interrupts.
>>>>>>>>
>>>>>>>> currently dmesg show this:
>>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>> 0.0 on pci6
>>>>>>>> nvd0: <PC611 NVMe SK hynix 512GB> NVMe namespace
>>>>>>>> nvd0: 488386MB (1000215216 512 byte sectors)
>>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>> 0.0 on pci6
>>>>>>>>
>>>>>>>
>>>>>>> Why is this showing up twice? Or is everything above this line left
>>>>>>> over from the first, working boot?
>>>>>>>
>>>>>>>
>>>>>>>> nvme0: RECOVERY_START 9585870784 vs 9367036288
>>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>>> nvme0: resetting controller
>>>>>>>> nvme0: aborting outstanding admin command
>>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>>> cdw11:00000000
>>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>>> nvme0: waiting
>>>>>>>>
>>>>>>>
>>>>>>> Clearly something bad is going on with the drive here... We looked
>>>>>>> into the completion queues when we didn't get an interrupt and there was
>>>>>>> nothing complete there....
>>>>>>>
>>>>>>> The only thing I can think of is that this means there's a phase
>>>>>>> error between the drive and the system. I recently removed a second reset
>>>>>>> and made it an option NVME_2X_RESET. Can you see if adding
>>>>>>> 'options NVME_2X_RESET' to your kernel config fixes this?
>>>>>>>
>>>>>>> Warner
>>>>>>>
>>>>>>>
>>>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>>>> 0.0 on pci6
>>>>>>>> nvme0: RECOVERY_START 9362778467 vs 9361830561
>>>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>>>> nvme0: RECOVERY_WAITING
>>>>>>>> nvme0: resetting controller
>>>>>>>> nvme0: aborting outstanding admin command
>>>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>>>> cdw11:00000000
>>>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>>>> nvme0: nvme_identify_controller failed!
>>>>>>>> nvme0: waiting
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>> Sorry, it's showing twice due to multiple reboots. For one boot it's
>>>>>> like:
>>>>>> nvme0: <Generic NVMe Device> mem
>>>>>> 0xcc100000-0xcc103fff,0xcc105000-0xcc105fff,0xcc104000-0xcc104fff at device
>>>>>> 0.0 on pci6
>>>>>> nvme0: RECOVERY_START 9633303481 vs 9365971423
>>>>>> nvme0: timeout with nothing complete, resetting
>>>>>> nvme0: Resetting controller due to a timeout.
>>>>>> nvme0: RECOVERY_WAITING
>>>>>> nvme0: resetting controller
>>>>>> nvme0: aborting outstanding admin command
>>>>>> nvme0: IDENTIFY (06) sqid:0 cid:15 nsid:0 cdw10:00000001
>>>>>> cdw11:00000000
>>>>>> nvme0: ABORTED - BY REQUEST (00/07) sqid:0 cid:15 cdw0:0
>>>>>> nvme0: nvme_identify_controller failed!
>>>>>> nvme0: waiting
>>>>>>
>>>>>> Well, neither Windows not Linux have any problems with the device. I
>>>>>> understand they may be hiding it or workaround somehow.
>>>>>>
>>>>>
>>>>> Yea, I'm trying to figure out why your machine is different than mine,
>>>>> and what Windows or Linux do that is different. It may be dodgy hardware,
>>>>> but others have no trouble...
>>>>>
>>>>> I'll try setting NVME_2X_RESET in the kernel config and report back in
>>>>>> a while.
>>>>>>
>>>>>
>>>>> Thanks. If it helps, that tells me something. If it doesn't, that
>>>>> tells me something else.
>>>>>
>>>>> I suspect that it is somewhere else in the system, tbh, but I need to
>>>>> find it systematically.
>>>>>
>>>>> Warner
>>>>>
>>>>
>>>> Surprisingly, setting NVME_2X_RESET in the kernel config hasn't changed
>>>> anything. I. e it didn't help.
>>>>
>>>
>>> While it would have been nice to have this be the fix, I'm not that
>>> surprised either.
>>> It was the biggest change of late, apart from the big re-arrangement
>>> that I'd done.
>>>
>>> So the other changes have moved from the occasional missing interrupt
>>> message
>>> (which the old code would get when a command wasn't completed in the
>>> timeout
>>> period, but that we found to be done when we scanned the completion
>>> queue. Now
>>> the device is detected fine (as before), but then doesn't do I/O at all
>>> (including not
>>> completing the identify command!) and is worse. This is unexpected and
>>> I'm trying
>>> understand what happens on reboot that 'changes'the working state and
>>> why the
>>> new code behaves so differently.
>>>
>>> Warner
>>>
>>
>
> Sure, here it is:
>

Thanks. Good news / bad news based on this...


> Controller Capabilities/Features
> ================================
> Vendor ID:                   1c5c
> Subsystem Vendor ID:         1c5c
> Serial Number:               ND03N46381010423H
> Model Number:                PC611 NVMe SK hynix 512GB
>

OK. I have a PC511 256GB that I've been trying to get rotated
into my test harness... I'll make it next up tomorrow based on this...
With luck, it will have the same problem as yours. It was in a batch
of drives I got to try to sort out reported problems... (unluck of the
draw that it was pushed to the back of the list....)


> Firmware Version:            11000111
> Recommended Arb Burst:       4
> IEEE OUI Identifier:         2e e4 ac
> Multi-Path I/O Capabilities: Not Supported
> Max Data Transfer Size:      262144 bytes
> Sanitize Crypto Erase:       Not Supported
> Sanitize Block Erase:        Supported
> Sanitize Overwrite:          Not Supported
> Sanitize NDI:                Not Supported
> Sanitize NODMMAS:            Undefined
> Controller ID:               0x0001
> Version:                     1.3.0
>
> Admin Command Set Attributes
> ============================
> Security Send/Receive:       Supported
> Format NVM:                  Supported
> Firmware Activate/Download:  Supported
> Namespace Management:        Not Supported
> Device Self-test:            Supported
> Directives:                  Not Supported
> NVMe-MI Send/Receive:        Not Supported
> Virtualization Management:   Not Supported
> Doorbell Buffer Config:      Not Supported
> Get LBA Status:              Not Supported
>

Hmmm, I was hoping it was related to Doorbell Buffer Config.


> Sanitize:                    block,
> Abort Command Limit:         4
> Async Event Request Limit:   8
> Number of Firmware Slots:    3
> Firmware Slot 1 Read-Only:   No
> Per-Namespace SMART Log:     No
> Error Log Page Entries:      256
> Number of Power States:      5
> Total NVM Capacity:          0 bytes
> Unallocated NVM Capacity:    0 bytes
> Firmware Update Granularity: 00 (Not Reported)
> Host Buffer Preferred Size:  0 bytes
> Host Buffer Minimum Size:    0 bytes
>
> NVM Command Set Attributes
> ==========================
> Submission Queue Entry Size
>   Max:                       64
>   Min:                       64
> Completion Queue Entry Size
>   Max:                       16
>   Min:                       16
> Number of Namespaces:        1
>

These are all typical / normal as well...


> Compare Command:             Supported
> Write Uncorrectable Command: Supported
> Dataset Management Command:  Supported
> Write Zeroes Command:        Supported
> Save Features:               Supported
> Reservations:                Not Supported
> Timestamp feature:           Supported
> Verify feature:              Not Supported
> Fused Operation Support:     Not Supported
> Format NVM Attributes:       Per-NS Erase, Per-NS Format
> Volatile Write Cache:        Present
>

Also typical, except write uncorrectable (which we don't use).

Warner


> NVM Subsystem Name:
>
>
>