From: Mark Millard
Subject: main [so: 15] amd64: Rare poudriere bulk builder "stuck in umtxq_sleep" condition (race failure?) during high-load-average "poudriere bulk -c -a" runs
Date: Sat, 4 May 2024 09:59:15 -0700
To: Current FreeBSD, freebsd-amd64@freebsd.org
List-Id: Discussions about the use of FreeBSD-current
List-Archive: https://lists.freebsd.org/archives/freebsd-current

I recently did some of my rare "poudriere bulk -c -a" high-load-average
style experiments, here on a 7950X3D (amd64) system, and I ended up with
a couple of stuck builders (one per bulk run of 2 runs).

Contexts:

# uname -apKU
FreeBSD 7950X3D-UFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024 root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018

# uname -apKU
FreeBSD 7950X3D-ZFS 15.0-CURRENT FreeBSD 15.0-CURRENT #142 main-n269589-9dcf39575efb-dirty: Sun Apr 21 07:28:55 UTC 2024 root@7950X3D-ZFS:/usr/obj/BUILDs/main-amd64-nodbg-clang/usr/main-src/amd64.amd64/sys/GENERIC-NODBG amd64 amd64 1500018 1500018

So: One was in a ZFS context and the other was in a UFS context.

32 hardware threads, 32 builders, ALLOW_MAKE_JOBS=yes in use (no use of
MAKE_JOBS_NUMBER_LIMIT or the like), USE_TMPFS=all in use,
TMPFS_BLACKLIST in use, 192 GiBytes of RAM, 512 GiByte Swap partition in
use, so SystemRAM+SystemSWAP being 704 GiBytes.

I'll start with notes about the more recent UFS context experiment . . .

graphics/pinta in the UFS experiment had gotten stuck in threads of
/usr/local/bin/mono (mono-sgen):

[05] 15:31:47 graphics/pinta | pinta-1.7.1_4 stage 15:28:31 2.30 GiB 0% 0%

# procstat -k -k 93415
  PID    TID COMM      TDNAME          KSTACK
93415 671706 mono-sgen -               mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678651 mono-sgen SGen worker     mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678652 mono-sgen Finalizer       mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd __umtx_op_sem2_wait+0x49a sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678655 mono-sgen -               mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
93415 678660 mono-sgen Thread Pool Wor mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_lock_umutex+0x6a6 __umtx_op_wait_umutex+0x49 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8

So I did a kill -9 93415 to let the bulk run complete.

I then removed my ADDITION of BROKEN to print/miktex that had gotten
stuck in the ZFS experiment and tried in the now tiny-load-average UFS
context:

bulk print/miktex graphics/pinta

They both worked just fine, not getting stuck (UFS context):

[00:00:50] [02] [00:00:25] Finished graphics/pinta | pinta-1.7.1_4: Success ending TMPFS: 2.30 GiB
[00:14:11] [01] [00:13:47] Finished print/miktex | miktex-23.9_3: Success ending TMPFS: 3.21 GiB

I'll note that the "procstat -k -k" for the stuck print/miktex in the
ZFS context had looked like:

# procstat -k -k 70121
  PID    TID COMM           TDNAME KSTACK
70121 409420 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait+0x53 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646547 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae kqueue_scan+0x9f1 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xd6 sys_kevent+0x61 amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646548 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8

Note that, unlike the UFS context, the above also involves: kqueue_scan

It looks like there is some form of failing race(?) condition that can
occur on amd64 and that does occur, though rarely, in high-load-average
contexts.

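For sorting through "procstat -k -k" dumps like the ones quoted above, here is a minimal sketch in Python. The helper name umtx_waiters is hypothetical (it is not part of procstat or poudriere), and the sketch assumes that each thread occupies one line and that stack frames have the "symbol+0xoffset" form seen in the output above. It reports, per thread ID, which function called umtxq_sleep; the sample text is the ZFS-context dump from this message.

```python
import re

# Kernel stack frames in `procstat -k -k` output look like "symbol+0xoffset".
FRAME_RE = re.compile(r"(\S+)\+0x[0-9a-fA-F]+")


def umtx_waiters(procstat_text):
    """Return {tid: caller of umtxq_sleep} for threads sleeping in umtxq_sleep.

    Hypothetical helper: assumes one thread per line, starting with a
    numeric PID and a numeric TID, as in the dumps quoted in this message.
    """
    waiters = {}
    for line in procstat_text.splitlines():
        fields = line.split()
        # Skip the column header and any line without numeric PID and TID.
        if len(fields) < 3 or not (fields[0].isdigit() and fields[1].isdigit()):
            continue
        symbols = FRAME_RE.findall(line)
        if "umtxq_sleep" in symbols:
            idx = symbols.index("umtxq_sleep")
            # Frames are listed innermost first, so the next one is the caller.
            caller = symbols[idx + 1] if idx + 1 < len(symbols) else None
            waiters[int(fields[1])] = caller
    return waiters


# The ZFS-context dump for the stuck print/miktex builder, one thread per line.
SAMPLE = """\
  PID    TID COMM           TDNAME KSTACK
70121 409420 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait+0x53 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646547 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae kqueue_scan+0x9f1 kqueue_kevent+0x13b kern_kevent_fp+0x4b kern_kevent_generic+0xd6 sys_kevent+0x61 amd64_syscall+0x115 fast_syscall_common+0xf8
70121 646548 miktex-ctangle -      mi_switch+0xba sleepq_catch_signals+0x2c6 sleepq_wait_sig+0x9 _sleep+0x1ae umtxq_sleep+0x2cd do_wait+0x244 __umtx_op_wait_uint_private+0x54 sys__umtx_op+0x7e amd64_syscall+0x115 fast_syscall_common+0xf8
"""

if __name__ == "__main__":
    for tid, caller in sorted(umtx_waiters(SAMPLE).items()):
        print(tid, caller)
```

On the sample, threads 409420 and 646548 are reported with do_wait as the caller, while thread 646547 (the kqueue_scan one) is left out. This only classifies where threads sleep; it says nothing about why they are never woken.
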
I've no clue how to reduce this to a simple, repeatable context.

===
Mark Millard
marklmi at yahoo.com