Re: fsck segfaults on rpi3 running 13-stable (and on 14-CURRENT analyzing the same file system that resulted from the 13-STABLE crash)
Date: Mon, 20 Feb 2023 05:50:45 UTC
On Feb 19, 2023, at 20:45, bob prohaska <fbsd@www.zefox.net> wrote:
> On Sun, Feb 19, 2023 at 02:35:15PM -0800, Mark Millard wrote:
>>
>> Kirk likely monitors the freebsd-fs list.
>
> I didn't notice there was such a list 8-\
>
>> Kirk likely does not monitor the freebsd-arm list.
>> None of us thought to switch to freebsd-fs at the
>> time. The only part of your context that ended up
>> being arm specific was the original buildworld crash.
>> You definitely started in an appropriate place
>> (freebsd-arm). After the crash, the rest was more
>> general relative to platforms and more specific
>> relative to file system handling (UFS support).
>>
>> I do not see any reason for any of this exchange
>> to go to any lists, given the current status.
>
> Alas, the story's not over yet 8-(
>
> After getting the disk fsck'd and booting once more,
> an attempt to buildworld using a fresh /usr/src
> and empty /usr/obj crashed again,
I'm confused. The original crash was reported to be
on an RPi2B using an armv7 kernel, or so I thought.
(The RPi3B was for the later fsck_ffs activity on the
media's UFS.)
This new material indicates an RPi3B arm64 (aarch64)
context for this buildworld failure. Is it the same
media as for the prior buildworld failure?
> in I think the
> same way. This time some notes have been collected
> at
> http://www.zefox.net/~fbsd/rpi3/scsi_status_error/readme
>
> To a casual glance, it looks like a hardware error.
> But, the machine seems to work fine until it's running
> buildworld, and then crashes during a relatively easy
> part of buildworld. The initial error message is:
>
> bob@pelorus:/usr/src % (da0:umass-sim0:0:0:0): READ(10). CDB: 28 00 43 29 d6 40 00 00 40 00
> (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
> (da0:umass-sim0:0:0:0): SCSI status: Check Condition
> (da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR asc:11,0 (Unrecovered read error)
> (da0:umass-sim0:0:0:0): Error 5, Unretryable error
A description of "Medium Error" from Seagate is:
Medium Error - Indicates the command terminated with a nonrecovered error condition, probably caused by a flaw in the medium or an error in the recorded data.
To compare/contrast with other alternatives, see:
https://www.seagate.com/support/kb/scsi-sense-key-chart-196259en/
A more extensive list, with asc/ascq involved as well, is at:
https://en.wikipedia.org/wiki/Key_Code_Qualifier/
That allows more comparison/contrast with other classifications.
It indicates:
3 11 00 Medium Error - unrecovered read error
(matching the reported text).
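As a minimal sketch of how the console text maps onto the "3 11 00" notation, the following Python parses the reported sense line and prints the key/asc/ascq triple. The function name and the regular expression are illustrative only, the sense key table is a partial excerpt, and it assumes the asc/ascq values in the console message are printed in hexadecimal.

```python
import re

# Partial table of SCSI sense key names and their numeric codes.
SENSE_KEY_CODES = {
    "NO SENSE": 0x0,
    "RECOVERED ERROR": 0x1,
    "NOT READY": 0x2,
    "MEDIUM ERROR": 0x3,
    "HARDWARE ERROR": 0x4,
    "ILLEGAL REQUEST": 0x5,
    "UNIT ATTENTION": 0x6,
}

# Assumption: asc and ascq are hexadecimal in the console message.
SENSE_LINE = re.compile(
    r"SCSI sense: (?P<key>[A-Z ]+?) asc:(?P<asc>[0-9a-f]+),(?P<ascq>[0-9a-f]+)"
)


def kcq_from_log(line):
    """Return the 'key asc ascq' triple for one 'SCSI sense:' console line."""
    m = SENSE_LINE.search(line)
    if m is None:
        raise ValueError("no SCSI sense data found in line")
    key = SENSE_KEY_CODES[m.group("key")]
    asc = int(m.group("asc"), 16)
    ascq = int(m.group("ascq"), 16)
    return f"{key:X} {asc:02X} {ascq:02X}"


line = ("(da0:umass-sim0:0:0:0): SCSI sense: MEDIUM ERROR "
        "asc:11,0 (Unrecovered read error)")
print(kcq_from_log(line))
```

The resulting triple is what gets looked up in the key code qualifier tables.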
> SCSI errors are not unknown, but they usually succeed on retry.
> It's not obvious why this is treated as un-retryable.
Because that is what the "3 11 00" combination
means. The drive is reporting that. It is not a FreeBSD
driver choice of handling.
(I'm no expert on drive internals, so I take it at face
value.)
> Are there any simple tests that might help decide what's wrong?
> It's likely that re-running buildworld will reproduce the crash.
The description material at
https://en.wikipedia.org/wiki/Key_Code_Qualifier/
may give some background information.
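One piece of information that can be read directly from the error message is which blocks the failed READ(10) covered. A minimal sketch, assuming the standard READ(10) layout (opcode 0x28 in byte 0, a big-endian LBA in bytes 2-5, a big-endian transfer length in bytes 7-8); the function name is illustrative only:

```python
def decode_read10(cdb_hex):
    """Return (starting LBA, block count) from a READ(10) CDB hex string."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks


# CDB copied from the reported console message.
lba, blocks = decode_read10("28 00 43 29 d6 40 00 00 40 00")
print(f"read of {blocks} blocks starting at LBA {lba}")
```

The unreadable sector would be somewhere in that range, not necessarily at its start. If the logical sector size is 512 bytes (an assumption; diskinfo -v da0 should report it), re-reading just that range with dd using bs=512, skip set to the LBA, and count set to the block count might show whether the error is repeatable without running a whole buildworld.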
> I've placed the results of smartctl -a at the end of the notes.
> The interpretation isn't self evident, hopefully someone else
> can lend an eye. I'll try smartctl -t after a good night's sleep.
man smartctl reports:
UNC: UNCorrectable Error in Data
The 3 examples of:
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f Error: UNC at LBA = 0x0fffffff = 268435455
indicate UNC. All 3 list the same LBA value.
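For reference, a minimal sketch of how that register row decodes, assuming the usual ATA layout: ER is the error register (bit 6 is UNC), ST is the status register, and the 28-bit LBA is assembled from SN, CL, CH and the low 4 bits of DH. The bit tables are partial and the function name is illustrative only.

```python
# Partial bit tables for the ATA error and status registers.
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}


def decode_registers(row):
    """Decode one 'ER ST SC SN CL CH DH' row of hex values from smartctl -a."""
    er, st, sc, sn, cl, ch, dh = (int(x, 16) for x in row.split())
    # 28-bit LBA: low nibble of DH is the top 4 bits, then CH, CL, SN.
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    errors = [name for bit, name in ERROR_BITS.items() if er & bit]
    status = [name for bit, name in STATUS_BITS.items() if st & bit]
    return errors, status, lba


errors, status, lba = decode_registers("40 51 00 ff ff ff 0f")
print(errors, status, hex(lba), lba)
```

Note that 0x0fffffff is 2^28 - 1, the largest value those 28 bits can hold, so the three identical LBA values may not identify one specific sector. That reading is a guess, not something the log states.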
Error 4 occurred at disk power-on lifetime: 11121 hours (463 days + 9 hours)
Error 3 occurred at disk power-on lifetime: 11098 hours (462 days + 10 hours)
Error 2 occurred at disk power-on lifetime: 11096 hours (462 days + 8 hours)
So the errors are spread over a little over a day overall
(25 power-on hours), with 2 and 3 only 2 hours apart.
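The arithmetic behind those spans can be checked with a short sketch; the hour values are copied from the smartctl output above and the variable names are illustrative only.

```python
# Power-on hours at which each logged error occurred (from smartctl -a).
power_on_hours = {4: 11121, 3: 11098, 2: 11096}

for err, hours in sorted(power_on_hours.items()):
    days, rem = divmod(hours, 24)
    print(f"Error {err}: {hours} hours = {days} days + {rem} hours")

# Span between the first and last logged error, and between errors 2 and 3.
overall_span = max(power_on_hours.values()) - min(power_on_hours.values())
gap_2_to_3 = power_on_hours[3] - power_on_hours[2]
print(f"overall: {overall_span} hours, errors 2 to 3: {gap_2_to_3} hours")
```

These are power-on hours, so they only correspond to wall-clock time if the drive was powered continuously.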
It suggests to me that the drive is no longer usable.
But I'm no expert.
===
Mark Millard
marklmi at yahoo.com