Kernel panic and ZFS corruption on 11.3-RELEASE
mjose00 at optusnet.com.au
Thu Aug 29 07:18:59 UTC 2019
On 29/08/2019 4:37 pm, Victor Sudakov wrote:
> MJ wrote:
>> On 28/08/2019 12:57 pm, Victor Sudakov wrote:
>>> Dear Colleagues,
>>> Shortly after upgrading to 11.3-RELEASE I had a kernel panic:
>>> Aug 28 00:01:40 vas kernel: panic: solaris assert: dmu_buf_hold_array(os, object, offset, size, 0, ((char *)(uintptr_t)__func__), &numbufs, &dbp) == 0 (0x5 == 0x0), file: /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c, line: 1022
>>> Aug 28 00:01:40 vas kernel: cpuid = 0
>>> Aug 28 00:01:40 vas kernel: KDB: stack backtrace:
>>> Aug 28 00:01:40 vas kernel: #0 0xffffffff80b4c4d7 at kdb_backtrace+0x67
>>> Aug 28 00:01:40 vas kernel: #1 0xffffffff80b054ee at vpanic+0x17e
>>> Aug 28 00:01:40 vas kernel: #2 0xffffffff80b05363 at panic+0x43
>>> Aug 28 00:01:40 vas kernel: #3 0xffffffff8260322c at assfail3+0x2c
>>> Aug 28 00:01:40 vas kernel: #4 0xffffffff822a9585 at dmu_write+0xa5
>>> Aug 28 00:01:40 vas kernel: #5 0xffffffff82302b38 at space_map_write+0x188
>>> Aug 28 00:01:40 vas kernel: #6 0xffffffff822e31fd at metaslab_sync+0x41d
>>> Aug 28 00:01:40 vas kernel: #7 0xffffffff8230b63b at vdev_sync+0xab
>>> Aug 28 00:01:40 vas kernel: #8 0xffffffff822f776b at spa_sync+0xb5b
>>> Aug 28 00:01:40 vas kernel: #9 0xffffffff82304420 at txg_sync_thread+0x280
>>> Aug 28 00:01:40 vas kernel: #10 0xffffffff80ac8ac3 at fork_exit+0x83
>>> Aug 28 00:01:40 vas kernel: #11 0xffffffff80f69d6e at fork_trampoline+0xe
>>> Aug 28 00:01:40 vas kernel: Uptime: 14d3h42m57s
>>> after which the ZFS pool became corrupt:
>>> pool: d02
>>> state: FAULTED
>>> status: The pool metadata is corrupted and the pool cannot be opened.
>>> action: Recovery is possible, but will result in some data loss.
>>> Returning the pool to its state as of Tuesday, 27 August 2019, 23:51:20
>>> should correct the problem. Approximately 9 minutes of data
>>> must be discarded, irreversibly. Recovery can be attempted
>>> by executing 'zpool clear -F d02'. A scrub of the pool
>>> is strongly recommended after recovery.
>>> see: http://illumos.org/msg/ZFS-8000-72
>>> scan: resilvered 423K in 0 days 00:00:05 with 0 errors on Sat Sep 30 04:12:20 2017
>>> NAME        STATE    READ WRITE CKSUM
>>> d02         FAULTED     0     0     2
>>>   ada2.eli  ONLINE      0     0    12
>>> However, "zpool clear -F d02" results in error:
>>> cannot clear errors for d02: I/O error
>>> Do you know if there is a way to recover the data, or should I say farewell to several hundred GB of anime?
>>> PS I think I do have the vmcore file if someone is interested in debugging the panic.
>> Do you have a backup? Then restore it.
> No, it's much more interesting to try and recover the pool.
>> If you don't, have you tried
>> zpool import -F d02
> I've tried "zpool clear -F d02" with no success (see above).
> Later I tried "zpool import -Ff d02", but on an 11.2 system, as David
> Christensen advised, and this was a success.
>> Some references you might like to read:
>> Take note of this section:
>> "If the damaged pool is in the zpool.cache file, the problem is discovered when the system is booted, and the damaged pool is reported in the zpool status command. If the pool isn't in the zpool.cache file, it won't successfully import or open and you'll see the damaged pool messages when you attempt to import the pool."
>> I've not had your exact error, but in the case of disk corruption/failure, I've used import as the sledgehammer approach.
> What do you think made all the difference: 11.2 vs 11.3, or "import -F" vs "clear -F"?
> What is the difference between "import -F" and "clear -F" when fixing zpool errors?
Isn't it obvious? One worked, the other didn't! :-)
Why import over clear:
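Roughly, and as a sketch rather than a definitive account: "zpool clear -F" operates on a pool that is already imported, so it fails with an I/O error when the pool is FAULTED and cannot be opened at all. "zpool import -F" performs its recovery (rewinding to the last consistent transaction group) as part of the import itself, which is why it can succeed on a pool that refuses to open. The command names and flags below are real; the device layout is the one from this thread, with the GELI layer re-attached first since the vdev is ada2.eli:

```shell
# Hedged sketch of the recovery path described in this thread.
# Assumes pool d02 lives on the GELI-encrypted provider ada2.eli.

# Re-attach the GELI layer first (prompts for the passphrase);
# without it the pool's vdev /dev/ada2.eli does not exist.
geli attach /dev/ada2

# "zpool clear -F" needs an imported pool, which is why it failed
# here with "cannot clear errors for d02: I/O error".
# "zpool import -F" instead rewinds to the last good txg while
# importing; -f additionally forces import of a pool that appears
# to be in use by another system.
zpool import -F -f d02

# A scrub is strongly recommended after a rewind recovery;
# then verify the pool and take a backup.
zpool scrub d02
zpool status -v d02
```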