[Bug 211713] NVME controller failure: resetting (Samsung SM961 SSD Drives)

bugzilla-noreply at freebsd.org bugzilla-noreply at freebsd.org
Tue Nov 6 22:25:49 UTC 2018


https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=211713

--- Comment #68 from David <mentalbarcode at fastest.cc> ---
I'm testing FreeBSD 12.0-BETA3 r340039 GENERIC, and I have a PM961 PCIe NVMe
m.2 1TB drive that came with my Lenovo ThinkPad P50.
P/N: MZSLW1T0HMLH-000L1 Produced Oct 2016

That drive is recognized by FreeBSD 12, but is not usable whatsoever (can't
read/write to it).  I've used this drive with Debian testing since 2016 without
trouble on my ThinkPad P50.

I installed FreeBSD 12 on an internal 2TB HDD in the ThinkPad in order to test
FreeBSD, but the PM961 continued to cause boot delays -- I would see "nvme0:
Missing interrupt" messages until the system finally gave up and continued with
the boot process.

I attempted to install FreeBSD 11 on the 2TB HDD but the install failed when it
had trouble recognizing the nvme drive.

Initially I thought the missing interrupt problem with FreeBSD was caused by
the LUKS encryption on the nvme drive because I had not formatted that drive
yet since I was dual booting. So I purchased another Samsung NVMe SSD 960 PRO
m.2 1TB drive P/N: MZVKP1T0HMJP, and that drive works with FreeBSD 12. The new
nvme was installed in the ThinkPad along with the original nvme and HDD drive. 
The 2TB HDD and the new 1TB nvme drives are dedicated to FreeBSD using ZFS.  I
attempted to create a ZFS mirror using the two nvme drives and FreeBSD
successfully wrote to the original nvme drive (because it overwrote my Linux
partitions) but the overall `zfs_create_diskpart` process failed and I had to
start over using only the new nvme drive, which worked.  I eventually removed
the original nvme drive from my laptop because of the constant "missing
interrupt" delays.

However, after removing the original nvme drive and while installing a virtual
machine in VirtualBox on my new nvme, my laptop went into (what seemed to be)
ACPI S3 suspended mode, and after I woke the machine the laptop rebooted
itself.  Thinking the problem was VirtualBox, I removed that software and set
up Bhyve instead.  During a virtual machine install in Bhyve, the laptop went
into an S3-style suspended mode again, and this time when I woke the machine I
noticed the nvme0 resetting controller, write, read, and aborted-by-request
messages in `dmesg` (output attached above).

For the most part, the new nvme device seems stable with FreeBSD 12.  I haven't
tested it with FreeBSD 11.  I don't know if KDE's baloo service crashing and
creating a 256GB core dump every single time I log in is part of the problem
with this drive.  Today I disabled Baloo file indexing and installed another
virtual machine using Bhyve and the system hasn't reported any problems with
the nvme.  I also used `dd` to create some 10GB and 100GB files using input
from /dev/urandom, and that didn't cause any issues so far.
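
A write test of this kind can be sketched as below.  This is only an
illustration, not the exact commands that were run: TARGET_DIR and SIZE_MB are
hypothetical knobs, and the defaults are deliberately tiny so the script is
safe to try anywhere.

```sh
#!/bin/sh
# Write a file of random data with dd and confirm that its size is as expected.
# TARGET_DIR and SIZE_MB are hypothetical settings, not taken from the report.
target_dir="${TARGET_DIR:-${TMPDIR:-/tmp}}"
size_mb="${SIZE_MB:-8}"
out="$target_dir/nvme-ddtest.$$"

# bs is given in bytes (1 MiB) so the same line works with BSD and GNU dd.
dd if=/dev/urandom of="$out" bs=1048576 count="$size_mb" 2>/dev/null

expected=$((size_mb * 1048576))
actual=$(wc -c < "$out" | tr -d ' ')
rm -f "$out"

if [ "$actual" -eq "$expected" ]; then
    echo "wrote $actual bytes"
else
    echo "short write: $actual of $expected bytes" >&2
fi
```

To approximate the 10GB case, one would point TARGET_DIR at a dataset on the
nvme pool and set SIZE_MB=10240, then check `dmesg` afterwards for new nvme
messages.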

Lastly, cold boots on the new nvme (without the old nvme installed in the
laptop) are normal. However, reboots can take literally 2 minutes to complete. 
This includes an extended delay on the BIOS screen before reaching the GELI
password prompt, and a delay after loading the kernel before moving on to the
---<<BOOT>>--- screen, and the entire boot process is sluggish until finally
reaching the login prompt.  I've never experienced this with Debian testing and
I suspect the FreeBSD nvme driver is leaving the system in a weird state.

IIRC setting hw.nvme.enable_aborts=1 while the original nvme drive is still in
the laptop causes a kernel panic while booting.  I haven't tried setting
hw.nvme.per_cpu_io_queues=0 since the system is usable and not completely
unstable.
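
For reference, a minimal sketch of how the untried tunable would be set at
boot, assuming the usual /boot/loader.conf mechanism described in nvme(4).
Only per_cpu_io_queues is shown, since enable_aborts=1 appeared to panic with
the original drive installed.  This fragment has not been tested on this
machine.

```
# /boot/loader.conf (fragment, untested here)
# Use a single I/O queue pair shared by all CPUs instead of one per CPU.
hw.nvme.per_cpu_io_queues="0"
```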

Hardware details:

# nvmecontrol devlist
 nvme0: SAMSUNG MZSLW1T0HMLH-000L1
    nvme0ns1 (976762MB)
 nvme1: Samsung SSD 960 PRO 1TB
    nvme1ns1 (976762MB)

# pciconf -lbace nvme0
nvme0 at pci0:2:0:0:       class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xd4400000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit 
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR RO NS
                 link x4(x4) speed 8.0(8.0) ASPM L1(L1)
    cap 11[b0] = MSI-X supports 33 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error

# pciconf -lbace nvme1
nvme1 at pci0:62:0:0:      class=0x010802 card=0xa801144d chip=0xa804144d rev=0x00 hdr=0x00
    bar   [10] = type Memory, range 64, base 0xd4200000, size 16384, enabled
    cap 01[40] = powerspec 3  supports D0 D3  current D0
    cap 05[50] = MSI supports 32 messages, 64 bit 
    cap 10[70] = PCI-Express 2 endpoint max data 256(256) FLR RO NS
                 link x4(x4) speed 8.0(8.0) ASPM L1(L1)
    cap 11[b0] = MSI-X supports 8 messages, enabled
                 Table in map 0x10[0x3000], PBA in map 0x10[0x2000]
    ecap 0001[100] = AER 2 0 fatal 0 non-fatal 1 corrected
    ecap 0003[148] = Serial 1 0000000000000000
    ecap 0004[158] = Power Budgeting 1
    ecap 0019[168] = PCIe Sec 1 lane errors 0
    ecap 0018[188] = LTR 1
    ecap 001e[190] = unknown 1
  PCI-e errors = Correctable Error Detected
                 Unsupported Request Detected
     Corrected = Advisory Non-Fatal Error

# diskinfo -t /dev/nvme0ns1
/dev/nvme0ns1
        512             # sectorsize
        1024209543168   # mediasize in bytes (954G)
        2000409264      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        No              # TRIM/UNMAP support
        Unknown         # Rotation rate in RPM

Seek times:
        Full stroke:^C
Nov  6 18:58:09 fenixbsd kernel: nvme0: Missing interrupt
Nov  6 18:58:39 fenixbsd syslogd: last message repeated 1 times
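
To get a rough idea of how often the controller times out, the "Missing
interrupt" lines can be counted in a syslog-style file.  The sketch below
assumes a hypothetical LOG variable pointing at such a file; when it is unset,
it falls back to a sample built from the two lines quoted above.

```sh
#!/bin/sh
# Count "Missing interrupt" reports from nvme controllers in a log file.
# LOG is a hypothetical setting; without it, a two-line sample is used.
log="${LOG:-}"
cleanup=""
if [ -z "$log" ]; then
    log="${TMPDIR:-/tmp}/nvme-sample.$$"
    cat > "$log" <<'EOF'
Nov  6 18:58:09 fenixbsd kernel: nvme0: Missing interrupt
Nov  6 18:58:39 fenixbsd syslogd: last message repeated 1 times
EOF
    cleanup="yes"
fi

# grep -c prints the number of matching lines (0 if there are none).
count=$(grep -c 'nvme[0-9]*: Missing interrupt' "$log")

if [ -n "$cleanup" ]; then
    rm -f "$log"
fi
echo "missing-interrupt lines: $count"
```

The result is a lower bound, because syslogd collapses repeats into "last
message repeated N times" lines, which this pattern does not count.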

# diskinfo -t /dev/nvme1ns1
/dev/nvme1ns1
        512             # sectorsize
        1024209543168   # mediasize in bytes (954G)
        2000409264      # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        No              # TRIM/UNMAP support
        Unknown         # Rotation rate in RPM

Seek times:
        Full stroke:      250 iter in   0.011499 sec =    0.046 msec
        Half stroke:      250 iter in   0.010018 sec =    0.040 msec
        Quarter stroke:   500 iter in   0.015302 sec =    0.031 msec
        Short forward:    400 iter in   0.013087 sec =    0.033 msec
        Short backward:   400 iter in   0.012144 sec =    0.030 msec
        Seq outer:       2048 iter in   0.041548 sec =    0.020 msec
        Seq inner:       2048 iter in   0.042294 sec =    0.021 msec

Transfer rates:
        outside:       102400 kbytes in   0.066412 sec =  1541890 kbytes/sec
        middle:        102400 kbytes in   0.064908 sec =  1577618 kbytes/sec
        inside:        102400 kbytes in   0.064534 sec =  1586760 kbytes/sec

-- 
You are receiving this mail because:
You are the assignee for the bug.

