Re: nvme controller reset failures on recent -CURRENT
- In reply to: Mark Johnston : "Re: nvme controller reset failures on recent -CURRENT"
Date: Tue, 13 Feb 2024 04:06:21 UTC
On 12 Feb, Mark Johnston wrote:
> On Mon, Feb 12, 2024 at 04:28:10PM -0800, Don Lewis wrote:
>> I just upgraded my package build machine to:
>> FreeBSD 15.0-CURRENT #110 main-n268161-4015c064200e
>> from:
>> FreeBSD 15.0-CURRENT #106 main-n265953-a5ed6a815e38
>> and I've had two nvme-triggered panics in the last day.
>>
>> nvme is being used for swap and L2ARC. I'm not able to get a crash
>> dump, probably because the nvme device has gone away and I get an error
>> about not having a dump device. It looks like a low-memory panic
>> because free memory is low and zfs is calling malloc().
>>
>> This shows up in the log leading up to the panic:
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: resetting controller
>> Feb 12 10:07:41 zipper kernel: nvme0: Resetting controller due to a timeout and possible hot unplug.
>> Feb 12 10:07:41 zipper syslogd: last message repeated 1 times
>> Feb 12 10:07:41 zipper kernel: nvme0: Waiting for reset to complete
>> Feb 12 10:07:41 zipper syslogd: last message repeated 2 times
>> Feb 12 10:07:41 zipper kernel: nvme0: failing queued i/o
>> Feb 12 10:07:41 zipper kernel: nvme0: Failed controller, stopping watchdog timeout.
>
> Are you by chance using the drive mentioned here? https://github.com/openzfs/zfs/discussions/14793
>
> I was bitten by that and ended up replacing the drive with a different
> model. The crash manifested exactly as you describe, though I didn't
> have L2ARC or swap enabled on it.
Nope:
nda0 at nvme0 bus 0 scbus9 target 0 lun 1
nda0: <INTEL SSDPEKNW512G8 002C BTNH940617WE512A>
nda0: Serial Number BTNH940617WE512A
nda0: nvme version 1.3
nda0: 488386MB (1000215216 512 byte sectors)
I'm not seeing super high I/O rates. I happened to have iostat running
when the machine panicked:
  0  584   88.4   31  2.68   65.8  112  7.18   68.2  107  7.13  80  0 20  0  0
  0  565   99.1   32  3.06   27.9   74  2.01   30.5   70  2.08  80  0 20  0  0
  0  612   92.8   31  2.77   18.9  148  2.74   18.9  148  2.73  86  0 14  0  0
  0  618   88.6   13  1.17   25.0   59  1.44   24.2   61  1.44  89  0 11  0  0
  0  586   45.4    5  0.22   31.4   55  1.70   30.8   57  1.70  84  0 16  0  0
  0  598   12.7    3  0.03   38.1   64  2.40   37.1   66  2.40  84  0 16  0  0
  0  675   36.1    6  0.21   23.7  156  3.62   22.7  164  3.63  88  0 12  0  0
  0  641    6.9    6  0.04   25.7  243  6.10   25.3  246  6.08  71  0 29  0  0
  0  737   20.1    9  0.18   36.4  148  5.24   37.2  144  5.24  78  0 22  0  0
  0  578   44.7   23  1.03   25.1  164  4.01   25.5  161  3.99  86  0 14  0  0
  0  608   70.3   15  1.06   51.1   64  3.19   51.3   64  3.19  89  0 11  0  0
  0  624   38.6    9  0.35   32.3  121  3.80   32.2  121  3.79  90  0 10  0  0
  0  577   80.6   16  1.28   37.8   66  2.44   36.5   69  2.46  90  0 10  0  0
   tty          nda0              ada0              ada1             cpu
tin tout   KB/t  tps  MB/s   KB/t  tps  MB/s   KB/t  tps  MB/s  us ni sy in id
  0  566   87.7   16  1.39   27.2   60  1.60   25.3   66  1.62  87  0 13  0  0
  0  599   77.2   11  0.83   17.4  391  6.66   17.3  395  6.66  74  0 26  0  0
  0  660   45.0    7  0.31   18.7  575 10.51   18.6  578 10.49  76  0 24  0  0
  0  615   37.7    8  0.31   24.0  303  7.11   24.0  303  7.11  58  0 42  0  0
Fssh_packet_write_wait: ... port 22: Broken pipe
ada* are old and slow spinning rust.
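For anyone who wants to pull per-device peaks out of a capture like the one
above, here is a rough sketch. It assumes the column layout shown in the
iostat header (tin, tout, then KB/t, tps, MB/s for each of nda0, ada0, ada1,
then five cpu columns). The function names and the SAMPLE excerpt are made up
for illustration and are not part of any FreeBSD tool.

```python
# Assumed column layout, taken from the iostat header in the capture:
# tin tout | nda0: KB/t tps MB/s | ada0: ... | ada1: ... | us ni sy in id
DEVICES = ["nda0", "ada0", "ada1"]
EXPECTED_FIELDS = 2 + 3 * len(DEVICES) + 5


def parse_iostat(text):
    """Return {device: [(kb_per_xfer, tps, mb_per_s), ...]} for data rows."""
    samples = {dev: [] for dev in DEVICES}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != EXPECTED_FIELDS:
            continue  # device-name header or unrelated line
        try:
            values = [float(f) for f in fields]
        except ValueError:
            continue  # column-name header has the right width but no numbers
        for i, dev in enumerate(DEVICES):
            base = 2 + 3 * i
            samples[dev].append(tuple(values[base:base + 3]))
    return samples


def peak(samples, dev):
    """Return (max tps, max MB/s) seen for one device."""
    rows = samples[dev]
    return max(r[1] for r in rows), max(r[2] for r in rows)


# A few rows copied from the capture, including a header line.
SAMPLE = """\
0 584 88.4 31 2.68 65.8 112 7.18 68.2 107 7.13 80 0 20 0 0
0 565 99.1 32 3.06 27.9 74 2.01 30.5 70 2.08 80 0 20 0 0
tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id
0 660 45.0 7 0.31 18.7 575 10.51 18.6 578 10.49 76 0 24 0 0
"""

if __name__ == "__main__":
    parsed = parse_iostat(SAMPLE)
    for dev in DEVICES:
        tps, mbs = peak(parsed, dev)
        print(f"{dev}: peak {tps:.0f} tps, {mbs:.2f} MB/s")
```

Feeding it the whole capture instead of SAMPLE gives the peaks for the full
run.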
That report does mention something else that could also be a cause. I
upgraded the motherboard BIOS around the same time. When I get a
chance, I'll drop back to the older FreeBSD version and see if the
problem goes away.