[Bug 278958] zfs panic: page fault in sync_dnodes_task

From: <bugzilla-noreply_at_freebsd.org>
Date: Tue, 14 May 2024 15:45:55 UTC
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=278958

--- Comment #1 from nunziotocci2000@gmail.com ---
Another panic this morning at 3:33AM with an identical backtrace.
According to core.txt, a backup was running this time as well: there is a
running `zfs` process whose parent is `sudo`, with `sshd` as the next ancestor
in the process tree.

A `zpool status` shows that our jail pool experienced an error:
  pool: zsmtp_jail
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 92K in 00:23:52 with 0 errors on Tue May 14 03:53:36 2024
config:

        NAME        STATE     READ WRITE CKSUM
        zsmtp_jail  ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nda1p3  ONLINE       0     0     1
            nda2p3  ONLINE       0     0     0

It seems likely to me that the panic corrupted something on that drive. The
output of `smartctl -a /dev/nvme1` doesn't show anything out of the ordinary:
=== START OF INFORMATION SECTION ===
Model Number:                       Force MP600
Serial Number:                      2006823000012856205C
Firmware Version:                   EGFM11.3
PCI Vendor/Subsystem ID:            0x1987
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 2fc124183d
Local Time is:                      Tue May 14 10:42:58 2024 CDT
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     9.78W       -        -    0  0  0  0        0       0
 1 +     6.75W       -        -    1  1  1  1        0       0
 2 +     5.23W       -        -    2  2  2  2        0       0
 3 -   0.0490W       -        -    3  3  3  3     2000    2000
 4 -   0.0018W       -        -    4  4  4  4    25000   25000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        39 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    3%
Data Units Read:                    295,807,203 [151 TB]
Data Units Written:                 98,425,650 [50.3 TB]
Host Read Commands:                 1,499,250,766
Host Write Commands:                2,297,088,561
Controller Busy Time:               5,756
Power Cycles:                       108
Power On Hours:                     33,640
Unsafe Shutdowns:                   67
Media and Data Integrity Errors:    0
Error Information Log Entries:      543
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID Seg SCT Code
 0   Short             Completed without error               30728            -     -   -   -    -

I will run a SMART self test and report the results.
