releng/13 release/13.0.0 : odd/incorrect diff result over nfs (in a zfs file system context)
Mark Millard
marklmi at yahoo.com
Sun May 23 08:27:56 UTC 2021
On 2021-May-23, at 00:44, Mark Millard <marklmi at yahoo.com> wrote:
> On 2021-May-21, at 17:56, Rick Macklem <rmacklem at uoguelph.ca> wrote:
>
>> Mark Millard wrote:
>> [stuff snipped]
>>> Well, why is it that ls -R, find, and diff -r all get file
>>> name problems via genet0 but diff -r gets no problems
>>> comparing the content of files that it does match up (the
>>> vast majority)? Any clue how the problems could possibly
>>> be unique to the handling of file names/paths? Does it
>>> suggest anything else to look into for getting some more
>>> potentially useful evidence?
>> Well, all I can do is describe the most common TSO-related
>> failure:
>> - When a read RPC reply (including NFS/RPC/TCP/IP headers)
>> is slightly less than 64K bytes (many TSO implementations are
>> limited to 64K or 32 discontiguous segments, think 32 2K
>> mbuf clusters), the driver decides it is ok, but when the MAC
>> header is added it exceeds what the hardware can handle correctly...
>> --> This will happen when reading a regular file that is slightly less
>> than a multiple of 64K in size.
>> or
>> --> This will happen when reading just about any large directory,
>> since the directory reply for a 64K request is converted to Sun XDR
>> format and clipped at the last full directory entry that will fit within 64K.
>> For ports, where most files are small, I think you can tell which is more
>> likely to happen.
>> --> If TSO is disabled, I have no idea how this might matter, but??
>>
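To make the failure mode concrete, here is a minimal C sketch of
the kind of off-by-a-MAC-header check being described (the macro
names, the function, and the exact 64K limit are illustrative
assumptions, not taken from any real driver):

#define HW_TSO_MAX    65536  /* assumed hardware TSO limit: 64K */
#define MAC_HDR_LEN   14     /* Ethernet MAC header length */

/*
 * Hypothetical driver-side length check.  It validates the TSO
 * payload before the MAC header is prepended.
 */
static int
tso_len_ok(int tso_len)
{
        if (tso_len > HW_TSO_MAX)
                return (0);   /* rejected: clearly too large */
        /*
         * Accepted -- but a read reply of, say, 65530 bytes grows
         * to 65544 bytes once the 14-byte MAC header is added,
         * exceeding what the hardware handles correctly, so the
         * frame is silently mangled instead of being rejected.
         */
        return (1);
}
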
>>> I'll note that netstat -I ue0 -d and netstat -I genet0 -d
>>> do not report changes in Ierrs or Idrop in a before vs.
>>> after failures comparison. (There may be better figures
>>> to look at for all I know.)
>>>
>>> I tried "ifconfig genet0 -rxcsum -rxcsum -rxcsum6 -txcsum6"
>>> and got no obvious change in behavior.
>> All we know is that the data is getting corrupted somehow.
>>
>> NFS traffic looks very different from typical TCP traffic. It is
>> mostly small messages travelling in both directions concurrently,
>> with some large messages thrown into the mix.
>> All I'm saying is that testing a net interface with something like
>> a bulk data transfer in one direction doesn't verify that it works
>> for NFS traffic.
>>
>> Also, the large RPC messages are a chain of about 33 mbufs of
>> various lengths, including a mix of partial clusters and regular
>> data mbufs, whereas a bulk send on a socket will typically
>> result in an mbuf chain of a lot of full 2K clusters.
>> --> As such, NFS can be good at tickling subtle bugs in the
>> net driver related to mbuf handling.
>>
>> rick
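
To illustrate the mbuf-chain point above: hardware limited to 32
discontiguous segments forces the driver to count the pieces of a
chain before handing it to DMA. A rough C sketch of such a check
follows (HW_MAX_SEGS and chain_fits_hw are hypothetical; real
FreeBSD drivers typically rely on bus_dma(9) plus m_defrag() or
m_collapse() for this):

#include <sys/param.h>
#include <sys/mbuf.h>

#define HW_MAX_SEGS 32  /* assumed hardware segment limit */

/*
 * Walk an mbuf chain and decide whether it fits within the
 * hardware's segment limit.  A large NFS read reply can arrive
 * as a chain of about 33 mbufs of mixed sizes, so it can land
 * exactly on a boundary that a bulk sender's chain of full 2K
 * clusters never reaches.
 */
static int
chain_fits_hw(const struct mbuf *m)
{
        int nsegs = 0;

        for (; m != NULL; m = m->m_next)
                if (m->m_len > 0)
                        nsegs++;
        return (nsegs <= HW_MAX_SEGS);
}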
>>
>>>> W.r.t. reverting r367492...the patch to replace r367492 was just
>>>> committed to "main" by rscheff@ with a two-week MFC, so it
>>>> should be in stable/13 soon. Not sure if an errata notice can be
>>>> done for it for releng13.0?
>>>
>>> That update is reported to be causing "rack"-related panics:
>>>
>>> https://lists.freebsd.org/pipermail/dev-commits-src-main/2021-May/004440.html
>>>
>>> reports (via links):
>>>
>>> panic: _mtx_lock_sleep: recursed on non-recursive mutex so_snd @ /syzkaller/managers/i386/kernel/sys/modules/tcp/rack/../../../netinet/tcp_stacks/rack.c:10632
>>>
>>> Still, I have a non-debug update to main building and will
>>> likely do a debug build as well. llvm is rebuilding, so
>>> the builds will take a notable amount of time.
>
> I got the following built and installed on the two
> machines:
>
> # uname -apKU
> FreeBSD CA72_16Gp_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 root at CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013
>
> # uname -apKU
> FreeBSD CA72_4c8G_ZFS 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n246854-03b0505b8fe8-dirty: Sat May 22 16:25:04 PDT 2021 root at CA72_16Gp_ZFS:/usr/obj/BUILDs/main-CA72-dbg-clang/usr/main-src/arm64.aarch64/sys/GENERIC-DBG-CA72 arm64 aarch64 1400013 1400013
>
> Note that both are booted with debug builds of main.
>
> Using the context with the alternate Ethernet device that has
> not had an associated diff -r, find, or ls -R failure yet, I
> got a panic that looks likely to be unrelated:
>
> # mount -onoatime 192.168.1.187:/usr/ports/ /mnt/
> # diff -r /usr/ports/ /mnt/ | more
> nvme0: cpl does not map to outstanding cmd
> cdw0:00000000 sqhd:0020 sqid:0003 cid:007e p:1 sc:00 sct:0 m:0 dnr:0
> panic: received completion for unknown cmd
> cpuid = 3
> time = 1621743752
> KDB: stack backtrace:
> db_trace_self() at db_trace_self
> db_trace_self_wrapper() at db_trace_self_wrapper+0x30
> vpanic() at vpanic+0x188
> panic() at panic+0x44
> nvme_qpair_process_completions() at nvme_qpair_process_completions+0x1fc
> nvme_timeout() at nvme_timeout+0x3c
> softclock_call_cc() at softclock_call_cc+0x124
> softclock() at softclock+0x60
> ithread_loop() at ithread_loop+0x2a8
> fork_exit() at fork_exit+0x74
> fork_trampoline() at fork_trampoline+0x14
> KDB: enter: panic
> [ thread pid 12 tid 100028 ]
> Stopped at kdb_enter+0x48: undefined f904411f
> db>
>
> Based on the "nvme" references, I expect this is tied to
> handling of the 480 GiByte Optane that is in the PCIe slot
> and is the boot/only media for the machine doing the diff.
>
> "db> dump" seems to have worked.
>
> After reboot, zpool scrub found no errors.
>
> So, trying again . . .
>
> I got some "Expensive timeout(9) function" notices:
>
> Expensive timeout(9) function: 0xffff000000717b64(0) 1.210285924 s
> Expensive timeout(9) function: 0xffff000000717b64(0) 4.001010935 s
>
> 0xffff000000717b64 looks to be uma_timeout:
>
> ffff000000717b60 <uma_startup3+0x118> b ffff000000717b3c <uma_startup3+0xf4>
> ffff000000717b64 <uma_timeout> stp x29, x30, [sp, #-32]!
> ffff000000717b68 <uma_timeout+0x4> stp x20, x19, [sp, #16]
> . . .
>
> . . . Hmm. The debug kernel test context seems to take a
> very long time. It has not failed so far but is still
> going.
>
> So I stopped it and switched to testing with the genet0 device
> that was involved in the earlier failures. . . .
>
> It did not fail. Nor did the debug kernel report anything
> beyond:
>
> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
> if_delmulti_locked: detaching ifnet instance 0xffffa00000fc8000
> Expensive timeout(9) function: 0xffff00000050c088(0) 6.318652023 s
>
> on one machine and:
>
> if_delmulti_locked: detaching ifnet instance 0xffffa0000b56b800
>
> on the other.
>
> So I may reboot into the also-updated non-debug builds on both
> machines and try in that context.
>
The pair of machines booted into the non-debug builds got the problem:
# diff -r /usr/ports/ /mnt/ | more
Only in /mnt/devel/electron12/files:
Only in /usr/ports/devel/electron12/files: patch-chrome_browser_media_webrtc_webrtc__logging__controller.cc
Only in /usr/ports/devel/electron12/files: patch-components_previews_core_previews__features.cc
Only in /mnt/devel/electron12/files: <A0><CE><C8>֏<DC>ܦ2<B2><E2><AA>^H
Only in /mnt/www/chromium/files: patch-chrome_browser_chrome__browser
Only in /usr/ports/www/chromium/files: patch-chrome_browser_chrome__browser__main__posix.cc
I'll note that it turns out the debug build had more enabled
than is typical: DIAGNOSTIC, BUF_TRACKING, and
FULL_BUF_TRACKING were also enabled. I'd forgotten that
I'd previously had a reason to add those to what my debug
builds included (for a prior problem investigation). I'd
not done debug builds in some time.
===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)