dump trying to access incorrect block numbers?

Sat Jul 8 16:45:11 UTC 2017

[I am adding notes about a problem that happens after the
"fsck -B". I also forgot to mention: production style
kernel and world builds were in use. And I tried a
powerpc64 build and it works the same.]

On 2017-Jul-7, at 11:09 PM, Mark Millard <markmi at dsl-only.net> wrote:

> [This note has more information than one sent with extra text
> in the subject but with a partially different "to" list.]
> 
> Peter Jeremy peter at rulingia.com wrote on
> Sat Jul 8 02:00:47 UTC 2017 :
> 
>> When did you first notice this (what SVN revision)?
>> Do you know what the last good SVN revision was?
>> Is this a new or old filesystem?
>> Is the filesystem mounted/active or not when you dump it?
>> What are the relevant parameters for the filesystem on ada0s3a?
>> Are you running softupdates, journalling etc?
>> Which dump(8) phase is reporting the errors?
>> What are the exact dump and fsck commands you ran?
> 
> I can add a little information with some contrast
> and only "fsck -B" in use (with an unclean file
> system from a prior crash), no dump use. Still:
> a snapshot is involved in the below.
> 
> Unfortunately, two problems with major consequences
> in my context limit the svn range that I can
> cover for this activity. The problem version
> ranges are:
> 
> -r319722 through -r320651 (fixed by -r320652)
> (actually this is why I had used "boot -s" 
> in what I report later: I could get to a
> shell prompt that way instead of crashing
> before any login prompt; the crashes left
> the file system in need of repair)
> 
> -r320509 through -r320561 (fixed by -r320570)
> 
> So I was using -r320570 to avoid one of the
> two problems.
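
The choice of -r320570 can be checked mechanically. The following is a minimal sketch, using only the two revision ranges quoted above; the dictionary and function names are hypothetical, chosen for illustration.

```python
# Revision ranges quoted in the message: (first bad, last bad), inclusive.
# The labels are just the ranges themselves; the message names no bug IDs.
BROKEN_RANGES = {
    "r319722-r320651": (319722, 320651),  # fixed by r320652
    "r320509-r320561": (320509, 320561),  # fixed by r320570
}


def problems_at(revision):
    """Return the labels of the quoted problem ranges that contain revision."""
    return [
        label
        for label, (first_bad, last_bad) in BROKEN_RANGES.items()
        if first_bad <= revision <= last_bad
    ]


if __name__ == "__main__":
    for rev in (319721, 320570, 320652):
        print(rev, problems_at(rev))
```

Under these ranges, -r320570 is outside the second range but still inside the first one, which matches the statement that it avoids only one of the two problems.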
> 
> 
> 
> Context: 32-bit powerpc FreeBSD used on a so-called
> "Quad-core" PowerMac G5. (So big-endian as well.)
> Softupdates, no journalling. Long-in-use file
> system that has seen lots of FreeBSD version updates
> and port rebuilds over time.
> 
> The following is from now, not from the time of the
> example messages:
> 
> # dumpfs / | more
> magic   19540119 (UFS2) time    Fri Jul  7 22:53:34 2017
> superblock location     65536   id      [ <OMITTED> ]
> ncg     158     size    25165823        blocks  24372006
> bsize   32768   shift   15      mask    0xffff8000
> fsize   4096    shift   12      mask    0xfffff000
> frag    8       shift   3       fsbtodb 3
> minfree 8%      optim   time    symlinklen 120
> maxbsize 32768  maxbpg  4096    maxcontig 4     contigsumsize 4
> nbfree  2130375 ndir    65518   nifree  11769796        nffree  425065
> bpg     20032   fpg     160256  ipg     80128   unrefs  0
> nindir  4096    inopb   128     maxfilesize     2252349704110079
> sbsize  4096    cgsize  32768   csaddr  5048    cssize  4096
> sblkno  24      cblkno  32      iblkno  40      dblkno  5048
> cgrotor 127     fmod    0       ronly   0       clean   0
> metaspace 6408  avgfpdir 64     avgfilesize 16384
> flags   soft-updates trim 
> fsmnt   /
> volname FBSDG4Srootfs   swuid   0       providersize    25165823
> . . .
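
As a sanity check on the geometry shown by dumpfs, several of the fields can be derived from one another. This is only a sketch: the values are copied from the output above, while the constants (512-byte device blocks, 256-byte UFS2 on-disk inodes, 8-byte UFS2 block pointers) are assumptions about the usual UFS2 layout, not something the output itself states.

```python
# Fields copied from the dumpfs output above.
sb = {
    "bsize": 32768, "bshift": 15, "bmask": 0xFFFF8000,
    "fsize": 4096, "frag": 8, "fsbtodb": 3,
    "size": 25165823, "ncg": 158,
    "bpg": 20032, "fpg": 160256,
    "nindir": 4096, "inopb": 128,
}

# Assumed constants (not shown in the dumpfs output).
DEV_BSIZE = 512          # device block size used by fsbtodb
UFS2_DINODE_SIZE = 256   # bytes per on-disk UFS2 inode
UFS2_BLKPTR_SIZE = 8     # bytes per UFS2 block pointer

checks = {
    "frag = bsize / fsize": sb["bsize"] // sb["fsize"] == sb["frag"],
    "bsize = 1 << shift": 1 << sb["bshift"] == sb["bsize"],
    "mask = ~(bsize - 1)": (~(sb["bsize"] - 1)) & 0xFFFFFFFF == sb["bmask"],
    "fsize = DEV_BSIZE << fsbtodb": DEV_BSIZE << sb["fsbtodb"] == sb["fsize"],
    "fpg = bpg * frag": sb["bpg"] * sb["frag"] == sb["fpg"],
    "ncg = ceil(size / fpg)": -(-sb["size"] // sb["fpg"]) == sb["ncg"],
    "inopb = bsize / inode size": sb["bsize"] // UFS2_DINODE_SIZE == sb["inopb"],
    "nindir = bsize / pointer size": sb["bsize"] // UFS2_BLKPTR_SIZE == sb["nindir"],
}

# File system size in bytes: "size" is counted in fragments.
size_bytes = sb["size"] * sb["fsize"]

if __name__ == "__main__":
    for name, ok in checks.items():
        print("ok  " if ok else "FAIL", name)
    print("size in bytes:", size_bytes)
```

All of the listed relations hold for the values shown, and the file system comes out just under 96 GiB.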
> 
> 
> 
> What I had done that produced the messages was:
> 
> <Prior failed multi-user boot from system problem
> leaves root (only) file system not marked clean
> so fsck -B will actually do something below>
> 
> boot -s (so: single user mode)
> # The next 3 lines are the content of a generic, manually-run script.
> mount -u /
> mount -a -t ufs (but there is no other file system)
> swapon -a       (there is a swap partition)
> #
> fsck -B
> 
> That "fsck -B" caused the same kinds of lines
> reported by Michael Butler, happening as fsck
> makes a snapshot for the background processing
> to use. (I have camera pictures and could type
> in some of the lines if needed.)
> 
> After those lines was text like (typed in from
> an example camera picture):
> 
> ** //.snap/fsck_snapshot
> ** Last Mount on /
> ** Root file system
> ** Phase 1 - Check Blocks and Sizes
> ** Phase 2 - Check Pathnames
> ** Phase 3 - Check Connectivity
> ** Phase 4 - Check Reference Counts
> ** Phase 5 - Check Cyl groups
> Reclaimed: 0 directories, 1 files, 22680 fragments
> 780914 files, 4797127 used, 19552199 free (443479 frags, 3288590 blocks, 1.8% fragmentation)
> 
> ***** FILE SYSTEM MARKED CLEAN *****
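
The fsck summary line can be cross-checked, since the free total should equal the free fragments plus the free blocks times the fragments per block (8 here, from the dumpfs "frag" field). The sketch below assumes that is how fsck_ffs computes the total; the variable names are hypothetical.

```python
import re

# Summary line as typed in above.
summary = ("780914 files, 4797127 used, 19552199 free "
           "(443479 frags, 3288590 blocks, 1.8% fragmentation)")
FRAGS_PER_BLOCK = 8  # dumpfs "frag 8"

# Pull out the five integer counts; the percentage is ignored here.
files, used, free, frags, blocks = (
    int(n) for n in re.findall(r"(\d+) (?:files|used|free|frags|blocks)", summary)
)

# Free total implied by the frags and blocks figures.
free_from_parts = frags + blocks * FRAGS_PER_BLOCK

# Block count implied by the free total and the frags figure.
blocks_from_free, remainder = divmod(free - frags, FRAGS_PER_BLOCK)

if __name__ == "__main__":
    print("free as typed:       ", free)
    print("free from parts:     ", free_from_parts)
    print("blocks as typed:     ", blocks)
    print("blocks implied:      ", blocks_from_free, "remainder", remainder)
```

With the figures as typed, the two sides do not agree: the free total corresponds exactly to 2388590 blocks rather than 3288590, so one of the numbers may have been mistyped when copying from the camera picture.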

[I forgot to mention that the context was a
production style kernel and world build,
no invariants or other such.]

Since I'm running a patched -r320570 for the
issue:

-r319722 through -r320651 (fixed by -r320652)

I went back and forced a power-off without
shutdown and did the sequence:

boot -s (so: single user mode)
# The next 3 lines are the content of a generic, manually-run script.
mount -u /
mount -a -t ufs (but there is no other file system)
swapon -a       (there is a swap partition)
#
fsck -B

but always waited briefly after the fsck -B finished.

As before, the following happens as it tries to trim
(typed in from a camera picture):

panic: ffs_blkfree_cq: freeing free block
cpuid = 2 (varies, of course)
time = (varies)
KDB: stack backtrace
(stack addresses can vary: just an example here)
0xd23b17e0: at kdb_backtrace+0x5c
0xd23b1850: at vpanic+0x1e8
0xd23b18c0: at panic+0x54
0xd23b1910: at ffs_blkfree_cq+0x278
0xd23b1980: at ffs_blkfree_trim_task+0x60
0xd23b19b0: at taskqueue_run_locked+0x10
0xd23b1a10: at taskqueue_thread_loop+0x174
0xd23b1a50: at fork_exit+0xf4
0xd23b1a80: at fork_trampoline+0xc
KDB: enter: panic
[ thread pid 0 tid 1000082 ]
Stopped at kdb_enter+0x70: addi r0,r0,0x0
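
For comparing such backtraces across crashes (the stack addresses vary, the call chain should not), the frame lines can be reduced to function names and offsets. This is a minimal sketch assuming the "address: at function+offset" layout shown above; the regular expression and function name are hypothetical helpers, not part of any FreeBSD tool.

```python
import re

# Frame lines as typed in above.
BACKTRACE = """\
0xd23b17e0: at kdb_backtrace+0x5c
0xd23b1850: at vpanic+0x1e8
0xd23b18c0: at panic+0x54
0xd23b1910: at ffs_blkfree_cq+0x278
0xd23b1980: at ffs_blkfree_trim_task+0x60
0xd23b19b0: at taskqueue_run_locked+0x10
0xd23b1a10: at taskqueue_thread_loop+0x174
0xd23b1a50: at fork_exit+0xf4
0xd23b1a80: at fork_trampoline+0xc
"""

# One frame: stack address, then "at", then symbol+hex offset.
FRAME_RE = re.compile(r"^(0x[0-9a-f]+): at ([A-Za-z_]\w*)\+(0x[0-9a-f]+)$")


def parse_frames(text):
    """Return (function, offset) pairs, dropping the varying stack addresses."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            frames.append((match.group(2), int(match.group(3), 16)))
    return frames


if __name__ == "__main__":
    for function, offset in parse_frames(BACKTRACE):
        print(f"{function}+{offset:#x}")
```

Two crashes can then be compared by checking that their parse_frames results are equal, regardless of the stack addresses or the cpuid.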

I've tried this on a powerpc64 and it works
the same, complete with the "freeing free
block" issue.

===
Mark Millard
markmi at dsl-only.net