Re: FYI for aarch64 main [14] running a mid March version: I ended up with [usb{usbus2}] stuck at (near) 100% cpu

From: Mark Millard via freebsd-arm <freebsd-arm_at_freebsd.org>
Date: Fri, 14 May 2021 21:51:27 UTC
On 2021-May-14, at 13:48, Rodney W. Grimes <freebsd-rwg at gndrsh.dnsmgr.net> wrote:

>> 
>> On 2021-May-14, at 05:52, Rodney W. Grimes <freebsd-rwg  at gndrsh.dnsmgr.net> wrote:
>> 
>>>> Note: The context was using a non-debug main build
>>>>     from mid-2021-Mar. (More details identified
>>>>     later.)
>>>> 
>>>> The issue happend while attempting a:
>>>> 
>>>> # zfs send -R zpold@for-copy | zfs recv -Fdv zpnew
>>>> 
>>>> where the drives involved in the command were:
>>>> 
>>>> zpold: a USB3 SSD, using /dev/da0p3
>>>> zpnew: an 480 GiByte Optane in the PCIe slot, using /dev/nda0p3
>>>> 
>>>> with:
>>>> 
>>>> # gpart show -pl
>>>> =>       40  468862048    da0  GPT  (224G)
>>>>        40     532480  da0p1  4C8GCA72EFI  (260M)
>>>>    532520       2008         - free -  (1.0M)
>>>>    534528   29360128  da0p2  4C8GCA72swp14  (14G)
>>>>  29894656    4194304         - free -  (2.0G)
>>>>  34088960   33554432  da0p4  4C8GCA72swp16  (16G)
>>>>  67643392  401217536  da0p3  4C8GCA72zfs  (191G)
>>>> 468860928       1160         - free -  (580K)
>>>> 
>>>> =>        40  2000409184    ada0  GPT  (954G)
>>>>         40      409600  ada0p1  (null)  (200M)
>>>>     409640  1740636160  ada0p2  FBSDmacchroot  (830G)
>>>> 1741045800    58720256  ada0p3  FBSDmacchswp0  (28G)
>>>> 1799766056   176160768  ada0p4  FBSDmacchswp1  (84G)
>>>> 1975926824    24482400          - free -  (12G)
>>>> 
>>>> =>       40  937703008    nda0  GPT  (447G)
>>>>        40     532480  nda0p1  CA72opt0EFI  (260M)
>>>>    532520       2008          - free -  (1.0M)
>>>>    534528  117440512  nda0p2  CA72opt0swp56  (56G)
>>>> 117975040   16777216          - free -  (8.0G)
>>>> 134752256  134217728  nda0p4  CA72opt0swp64  (64G)
>>>> 268969984  668731392  nda0p3  CA72opt0zfs  (319G)
>>>> 937701376       1672          - free -  (836K)
>>>> 
>>>> The system running was that on /dev/ada0p2 (FBSDmacchroot,
>>>> which is UFS instead of ZFS).
>>>> 
>>>> The [usb{usbus2}] process eventually got stuck-busy, no
>>>> more I/O:
>>>> 
>>>> CPU 0:  0.0% user,  0.0% nice,  100% system,  0.0% interrupt,  0.0% idle
>>>> CPU 1:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
>>>> CPU 2:  0.0% user,  0.0% nice,  0.0% system,  0.0% interrupt,  100% idle
>>>> CPU 3:  0.4% user,  0.0% nice,  0.0% system,  0.0% interrupt, 99.6% idle
>>>> 
>>>> PID USERNAME    PRI NICE     SIZE       RES STATE    C   TIME     CPU COMMAND
>>>>  15 root        -72    -       0B   262144B CPU0     0   8:51  99.95% [usb{usbus2}]
>>>> 
>>>> 1295 root         -8    0  20108Ki    8092Ki q->bq_   2   0:04   0.00% zfs recv -Fdv zpnew{receive_writer_thre}
>>>> 1295 root         48    0  20108Ki    8092Ki piperd   2   0:22   0.00% zfs recv -Fdv zpnew{zfs}
>>>> 1294 root         -8    0  17544Ki    7740Ki q->bq_   2   0:01   0.00% zfs send -R zpold@for-copy{send_reader_thread}
>>>> 1294 root         -8    0  17544Ki    7740Ki q->bq_   0   0:00   0.00% zfs send -R zpold@for-copy{send_merge_thread}
>>>> 1294 root         -8    0  17544Ki    7740Ki hdr->b   2   0:00   0.00% zfs send -R zpold@for-copy{send_traverse_threa}
>>>> 1294 root         52    0  17544Ki    7740Ki range-   3   0:20   0.00% zfs send -R zpold@for-copy{zfs}
>>>> 
>>>> 1036 root         -8    -       0B    1488Ki t->zth   0   0:00   0.00% [zfskern{z_checkpoint_discar}]
>>>> 1036 root         -8    -       0B    1488Ki t->zth   1   0:00   0.00% [zfskern{z_livelist_condense}]
>>>> 1036 root         -8    -       0B    1488Ki t->zth   2   0:00   0.00% [zfskern{z_livelist_destroy}]
>>>> 1036 root         -8    -       0B    1488Ki t->zth   1   0:00   0.00% [zfskern{z_indirect_condense}]
>>>> 1036 root         -8    -       0B    1488Ki mmp->m   3   0:00   0.00% [zfskern{mmp_thread_enter}]
>>>> 1036 root         -8    -       0B    1488Ki tx->tx   1   0:00   0.00% [zfskern{txg_thread_enter}]
>>>> 1036 root         -8    -       0B    1488Ki tx->tx   2   0:00   0.00% [zfskern{txg_thread_enter}]
>>>> 
>>>> I was unable to ^c or ^z the process where I
>>>> typed the command. I eventually stopped the
>>>> system with "shutdown -p now" from a ssh
>>>> session (that had already been in place).
>>> 
>>> Should this occur again before doing the shutdown run a
>>> zpool status &
>>> I have gotten in this state when the recv pool was a usb device
>> 
>> The USB device had the send pool in my example.
>> 
>>> and for some reason it had a timeout and gone offline.
>> 
>> No messages about timeouts or other such were made.
>> 
>>> The clue
>>> this occured are in dmesg, and zpool status.
>> 
>> No console, dmesg -a, or /var/log/messages output were
>> generated. (And the system was running from a SATA SSD
>> that was operating well.)
>> 
>> For reference, the USB EtherNet device that was in a
>> USB2 port continued to operate just fine, allowing the
>> use of existing ssh sessions that were displaying
>> gstat -spod and top -Samio -ototal until I started
>> looking at the problem. (I did not try making a new
>> ssh session).
>> 
>> I did not do the "zpool status" so I can not report
>> about it.
>> 
>>> Unplug/plug the USB device, check dmesg that it came online,
>>> and do a zpool clear.
>> 
>> Okay. I normally avoid unplugging USB storage media
>> if the system overall does not hang up: hopes of
>> a clean shutdown leaving things better.
> 
> Do the zpool status, and only if that indicates a device offline
> or other problem would you proceed to do the unplug/plug, as at
> that point zfs has stopped doing anything to the device and
> your shutdown wont do anything as far as zfs for that pool anyway.
> 
>> 
>> The system did appear to shutdown to completion.
> 
> Yes, you can shutdown a system with a zpool in failed state.

I still have the console output available and I was
wrong: the shutdown hung up and apparently I cut
power before "All buffers synced". All it got to
was:

                Stopping cron.
Waiting for PIDS: 858.
Stopping sshd.
Waiting for PIDS: 852.
fstab: /etc/fstab:6: Inappropriate file type or format
Stopping ntpd.
Waiting for PIDS: 807.
Stopping nfsd.
Waiting for PIDS: 781 782.
Stopping mountd.
Waiting for PIDS: 779.
Stopping rpcbind.
Waiting for PIDS: 741.
Stopping devd.
Waiting for PIDS: 431.
Writing entropy file: .
Writing early boot entropy file: .
.
Terminated
May 13 20:16:26 FBSDmacch syslogd: exiting on signal 15

It did not get to any of the usual sort of (from a different
shutdown):

Waiting (max 60 seconds) for system process `vnlru' to stop... done
Waiting (max 60 seconds) for system process `syncer' to stop... 
Syncing disks, vnodes remaining... 0 0 0 0 0 0 0 0 0 0 done
Waiting (max 60 seconds) for system thread `bufdaemon' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-1' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-3' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-4' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-5' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-0' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-2' to stop... done
Waiting (max 60 seconds) for system thread `bufspacedaemon-6' to stop... done
All buffers synced.
Uptime: 8m12s

>> 
>>>> 
>>>> When I retried after rebooting and scrubbing (no
>>>> problems found), the problem did not repeat.
>>>> 
>>>> I do not have more information nor a way to repeat
>>>> the problem on demand, unfortunately.
>>>> 
>>>> Details of the vintage of the system software and
>>>> such:
>>>> 
>>>> # ~/fbsd-based-on-what-freebsd-main.sh 
>>>> FreeBSD FBSDmacch 14.0-CURRENT FreeBSD 14.0-CURRENT mm-src-n245445-def0058cc690 GENERIC-NODBG  arm64 aarch64 1400005 1400005
>>>> def0058cc690 (HEAD -> mm-src) mm-src snapshot for mm's patched build in git context.
>>>> merge-base: 7381bbee29df959e88ec59866cf2878263e7f3b2
>>>> merge-base: CommitDate: 2021-03-12 20:29:42 +0000
>>>> 7381bbee29df (freebsd/main, freebsd/HEAD, pure-src, main) cam: Run all XPT_ASYNC ccbs in a dedicated thread
>>>> n245444 (--first-parent --count for merge-base)
>>>> 
>>>> The system was a MACCHIATObin Double Shot.

===
Mark Millard
marklmi at yahoo.com
( dsl-only.net went
away in early 2018-Mar)