From: Mateusz Guzik <mjguzik@gmail.com>
Date: Mon, 1 May 2023 02:44:10 +0200
Subject: Re: armv8.2-A+ tuned FreeBSD kernels vs. poudriere bulk and USB3 media: tx->tx_quiesce_done_cv related blocking of processes?
To: Mark Millard
Cc: FreeBSD Hackers, freebsd-arm
List-Id: Technical discussions relating to FreeBSD
List-Archive: https://lists.freebsd.org/archives/freebsd-hackers

can you redo zfs test with:
sysctl vfs.zfs.per_txg_dirty_frees_percent=5

On 5/1/23, Mark Millard wrote:
> As the evidence this time is largely independent of the details
> reported previously, I'm top posting this.
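For anyone following along, a sketch of applying the suggested tunable now and persisting it (the sysctl.conf entry is the stock FreeBSD persistence mechanism; the comment reflects my understanding of the OpenZFS tunable, not something verified on this machine):

```shell
# Cap the share of a txg's dirty-data budget that file frees may
# consume, so large deletes get spread over more txgs instead of
# stalling in txg_wait_open (as in the stacks quoted below).
sysctl vfs.zfs.per_txg_dirty_frees_percent=5

# Persist across reboots.
echo 'vfs.zfs.per_txg_dirty_frees_percent=5' >> /etc/sysctl.conf
```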
>
> The previous ZFS on USB3 results were based on poudriere using
> "USE_TMPFS=data", meaning that almost all file I/O was via ZFS
> to the USB3 media.
>
> The UFS on U2 960GB Optane via USB3 adapter results did not
> suffer the reported problems, despite "USE_TMPFS=data". (I let
> it run to completion.) But this had both a media and a file
> system difference.
>
> This time the results are for just changing poudriere to use
> "USE_TMPFS=all" instead, but back on the original ZFS on
> USB3 media. The implication is that the vast majority of the
> file I/O is not via ZFS to the USB3 media.
>
> The context has 32 GiBytes of RAM and about 118 GiBytes of
> swap/paging space. It would need to page if let run to
> completion.
>
> Observing, the load average is behaving normally: things are
> not stuck waiting. "gstat -spod" indicates the ZFS I/O is
> not sustained (no paging: no swap space in use yet).
>
> First 1 hr: 262 ports built.
>
> But this had both a media and a file system difference again.
>
> I'm stopping after this, in order to set up the next
> just-ZFS experiments.
>
>
> Next experiments:
>
> I expect to move the ZFS context to U2 960GB Optane media
> used with the USB3 adapter and to test both "USE_TMPFS=data"
> and "USE_TMPFS=all", probably letting USE_TMPFS=all run to
> completion.
>
> If the Optane-based USE_TMPFS=data context still has the
> problem, even the more performant media would not have been
> enough to avoid what would then appear to be a ZFS problem,
> two other file systems not having the problem.
>
> The Optane-based USE_TMPFS=all context should just handle the
> paging and the rarer ZFS I/O quicker. I do not expect problems
> for this combination, given the UFS on Optane USB3 results
> and the partial USE_TMPFS=all non-Optane USB3 results. Even
> with ZFS working, this is likely the more performant type of
> context for the Windows Dev Kit 2023, given that I'm leaving
> Windows 11 Pro in place on the internal media.
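For anyone reproducing this: the knob being varied lives in poudriere's configuration (typically /usr/local/etc/poudriere.conf). A sketch, with the semantics paraphrased from the descriptions above rather than checked against the exact poudriere version in use:

```shell
# /usr/local/etc/poudriere.conf (typical location)

# USE_TMPFS=data : only poudriere's data directories are tmpfs-backed,
#                  so almost all build file I/O hits the pool backing
#                  the jails (here: ZFS on USB3 media).
# USE_TMPFS=all  : the builders are tmpfs-backed, so the vast majority
#                  of build file I/O avoids ZFS entirely (at the cost
#                  of RAM pressure and, eventually, paging).
USE_TMPFS=all
```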
>
>
> Hypothesis for the original problem:
>
> I wonder if ZFS write activity to the USB3 media was largely
> blocking the ZFS read activity to the same media, causing
> lots of things to have to spend much time waiting for data
> instead of making progress, leading to long periods of low
> load averages.
>
>
> Older material:
>
> On Apr 30, 2023, at 00:50, Mark Millard wrote:
>
>> On Apr 29, 2023, at 19:44, Mark Millard wrote:
>>
>>> This is based on: main-n262658-b347c2284603-dirty, b347c2284603
>>> being from late Apr 28, 2023 UTC. (The "-dirty" is from some
>>> historical patches that I use.) The build is a non-debug build
>>> (but with symbols not stripped). World and kernel had been built
>>> using:
>>>
>>> -mcpu=cortex-a78C+flagm+nofp16fml
>>>
>>> just for testing purposes. (Worked nicely for -j8 buildworld
>>> buildkernel testing for the 4 cortex-a78c's plus 4 cortex-x1c's
>>> present.)
>>>
>>> Monitoring poudriere bulk related activity via top and gstat -spod,
>>> I see a lot of the odd result of one process doing something
>>> like:
>>>
>>> CPU4 4 1:39 99.12% /usr/local/sbin/pkg-static create -f tzst -r
>>> /wrkdirs/usr/ports/devel/cmake-core/work/stage
>>>
>>> while other processes sit in the likes of:
>>>
>>> tx->tx
>>> zcq->z
>>> zcw->z
>>> zilog-
>>> select
>>> wait
>>>
>>> But sometimes there is no CPU-bound process and the top CPU process
>>> is the likes of:
>>>
>>> 1.24% [usb{usbus0}]
>>>
>>> "gstat -spod" basically shows da0 dedicated to write activity most
>>> of the time.
>>>
>>> After: sysctl kern.tty_info_kstacks=1
>>> Then using ^T at various times, I see a lot of:
>>>
>>> load: 0.48 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7534.91r 11.06u 22.66s 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.34 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7566.69r 11.06u 22.66s 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>> load: 0.44 cmd: sh 93914 [tx->tx_quiesce_done_cv] 7693.52r 11.24u 23.08s 0% 3800k
>>> #0 0xffff0000004fd564 at mi_switch+0x104
>>> #1 0xffff000000463f40 at _cv_wait+0x120
>>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>>> #4 0xffff000001448254 at zfs_rmnode+0x64
>>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>>> #7 0xffff0000005fd6c0 at vgonel+0x450
>>> #8 0xffff0000005fde7c at vrecycle+0x9c
>>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>>> #10 0xffff0000005fc430 at vinactivef+0x180
>>> #11 0xffff0000005fba50 at vput_final+0x200
>>> #12 0xffff00000060c4d0 at kern_funlinkat+0x320
>>> #13 0xffff00015d6cbbf4 at filemon_wrapper_unlink+0x14
>>> #14 0xffff0000008f8514 at do_el0_sync+0x594
>>> #15 0xffff0000008d4904 at handle_el0_sync+0x40
>>>
>>>
>>> The system (Windows Dev Kit 2023) has 32 GiBytes of RAM. Example
>>> output from a top modified to show some "Max[imum]Obs[erved]"
>>> information:
>>>
>>> last pid: 17198; load averages: 0.33, 0.58, 1.06 MaxObs: 15.49, 8.73, 5.75  up 0+20:48:10  19:14:49
>>> 426 threads: 9 running, 394 sleeping, 1 stopped, 22 waiting, 50 MaxObsRunning
>>> CPU: 0.0% user, 0.0% nice, 0.2% system, 0.1% interrupt, 99.7% idle
>>> Mem: 282760Ki Active, 7716Mi Inact, 23192Ki Laundry, 22444Mi Wired,
>>> 2780Ki Buf, 848840Ki Free, 2278Mi MaxObsActive, 22444Mi MaxObsWired,
>>> 22752Mi MaxObs(Act+Wir+Lndry)
>>> ARC: 11359Mi Total, 3375Mi MFU, 5900Mi MRU, 993Mi Anon, 93076Ki Header,
>>> 992Mi Other
>>> 8276Mi Compressed, 19727Mi Uncompressed, 2.38:1 Ratio
>>> Swap: 120832Mi Total, 120832Mi Free, 2301Mi MaxObs(Act+Lndry+SwapUsed),
>>> 22752Mi MaxObs(Act+Wir+Lndry+SwapUsed)
>>>
>>>
>>> The poudriere bulk has 8 builders but has ALLOW_MAKE_JOBS=yes
>>> without any explicit settings for the likes of MAKE_JOBS_NUMBER.
>>> So it is a configuration that allows a high load average compared
>>> to the number of hardware threads (here: cores) in the system.
>>>
>>>
>>> I've rebooted to do a test with filemon not loaded at the time
>>> (here it was loaded from prior buildworld buildkernel activity).
>>> We will see if it still ends up with such problems.
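Collecting the monitoring procedure described above into one sketch (commands exactly as used in this thread; flag glosses are from my reading of gstat(8)):

```shell
# Have ^T (SIGINFO) reports include the kernel stack of the
# blocked process, as in the backtraces quoted above.
sysctl kern.tty_info_kstacks=1

# Per-disk I/O view: -s adds size statistics, -p shows physical
# providers only, -o shows "other" operations, -d shows deletes.
# Watch whether da0 stays dedicated to write activity.
gstat -spod

# Then press ^T in the shell running the stuck job to print the
# "load: ... cmd: ..." line plus the #0..#N kernel stack.
```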
>>
>> It still ends up with things waiting, but the detailed STATE
>> values generally listed are somewhat different.
>>
>> I also tried a chroot into a world from use of -mcpu=cortex-a72
>> and got similar results, suggesting that only the
>> -mcpu=cortex-a78c+flagm+nofp16fml kernel is required to see the
>> issues. This even got some examples like:
>>
>> load: 1.20 cmd: sh 95560 [tx->tx_quiesce_done_cv] 2022.55r 3.21u 9.24s 0% 3852k
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff000001518a34 at txg_wait_open+0xf4
>> #3 0xffff00000147d0bc at dmu_free_long_range+0x17c
>> #4 0xffff000001421254 at zfs_rmnode+0x64
>> #5 0xffff00000142e7c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff00000142e678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>> #12 0xffff00015d8ceab4 at null_reclaim+0x154
>> #13 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #14 0xffff0000005fd6c0 at vgonel+0x450
>> #15 0xffff0000005fde7c at vrecycle+0x9c
>> #16 0xffff00015d8ce8e8 at null_inactive+0x38
>> #17 0xffff0000005fc430 at vinactivef+0x180
>>
>> (The chroot use involves null mounts.)
>>
>> which is sort of analogous to the filemon related
>> backtraces I showed earlier. The common part
>> across the examples looks to be #0..#11:
>>
>> #0 0xffff0000004fd564 at mi_switch+0x104
>> #1 0xffff000000463f40 at _cv_wait+0x120
>> #2 0xffff00000153fa34 at txg_wait_open+0xf4
>> #3 0xffff0000014a40bc at dmu_free_long_range+0x17c
>> #4 0xffff000001448254 at zfs_rmnode+0x64
>> #5 0xffff0000014557c4 at zfs_freebsd_reclaim+0x34
>> #6 0xffff000000a1340c at VOP_RECLAIM_APV+0x2c
>> #7 0xffff0000005fd6c0 at vgonel+0x450
>> #8 0xffff0000005fde7c at vrecycle+0x9c
>> #9 0xffff000001455678 at zfs_freebsd_inactive+0x48
>> #10 0xffff0000005fc430 at vinactivef+0x180
>> #11 0xffff0000005fba50 at vput_final+0x200
>>
>> There were a lot more nanslp examples in all
>> the later testing (i.e., those that avoided filemon.ko
>> being loaded).
>>
>> Starting from having pkgclean -A'd the ports, the
>> experiments got about the same number of ports built
>> as of the end of the 1st hour.
>>
>>
>> UFS vs. ZFS? Different media types? . . .
>>
>> So I decided to create and try a UFS test context
>> instead of a ZFS one. But the media that was best
>> to update was a U2 960GB Optane in a USB3
>> adapter, something that would perform noticeably
>> better than my normal USB3 NVMe drives, even with
>> USB involved.
>>
>> This combination maintained reasonable load averages
>> (instead of having long periods of <1) and finished
>> building 172 ports in the 1st hour, far more than the
>> around 83 each time I tried the other device/ZFS
>> combination. No evidence of the earlier reported
>> oddities.
>>
>> I should also time a from-scratch buildworld
>> buildkernel.
>>
>> I'll look into setting up another U2 960GB Optane
>> for use in the USB3 adapter, but with ZFS. That
>> should help isolate media vs. file system
>> contributions to the varying behaviors.
>
>
> ===
> Mark Millard
> marklmi at yahoo.com
>
>

--
Mateusz Guzik