git: ecb3a7d43dd6 - main - netmap: Disable a buggy and unsafe test (sync_kloop_conflict)

From: John Baldwin <jhb_at_FreeBSD.org>
Date: Thu, 06 Mar 2025 18:22:59 UTC
The branch main has been updated by jhb:

URL: https://cgit.FreeBSD.org/src/commit/?id=ecb3a7d43dd67809037f9066e7716a05c41d8d63

commit ecb3a7d43dd67809037f9066e7716a05c41d8d63
Author:     John Baldwin <jhb@FreeBSD.org>
AuthorDate: 2025-03-06 18:22:25 +0000
Commit:     John Baldwin <jhb@FreeBSD.org>
CommitDate: 2025-03-06 18:22:25 +0000

    netmap: Disable a buggy and unsafe test (sync_kloop_conflict)
    
    This test starts two threads to verify that two concurrent threads
    cannot enter the kernel loop on the same netmap context.  The test
    even has a comment about a potential race condition where the first
    thread enters the loop and is stopped before the second thread tries
    to enter the loop.  It claims it is fixed by the use of a semaphore.
    Unfortunately, the semaphore doesn't close the race.
    
    In the CI setup for CHERI, we run the test suite once a week against
    various architectures using single-CPU QEMU instances.  Across
    multiple recent runs of the plain "aarch64" test, the job ran for an
    entire day before QEMU was killed by a timeout.  The last messages
    logged were from this test:
    
    734.881045 [1182] generic_netmap_attach     Emulated adapter for tap3312 created (prev was NULL)
    734.882340 [ 321] generic_netmap_register   Emulated adapter for tap3312 activated
    734.882675 [2224] netmap_csb_validate       csb_init for kring tap3312 RX0: head 0, cur 0, hwcur 0, hwtail 0
    734.883042 [2224] netmap_csb_validate       csb_init for kring tap3312 TX0: head 0, cur 0, hwcur 0, hwtail 1023
    734.915397 [ 820] netmap_sync_kloop         kloop busy_wait 1, direct_tx 0, direct_rx 0, na_could_sleep 0
    736.901945 [ 820] netmap_sync_kloop         kloop busy_wait 1, direct_tx 0, direct_rx 0, na_could_sleep 0
    
    From the timestamps, the synchronous kloop was entered twice, 2 seconds
    apart.  This corresponds to the 2-second timeout on the semaphore in
    the test.  What appears to have happened is that th1 started and
    entered the kernel where it spun in an endless busy loop.  This
    starves th2 so it _never_ runs.  Once the semaphore times out, th1 is
    preempted to run the main thread which invokes the ioctl to stop the
    busy loop.  th1 then exits the loop and returns to userland to exit.
    Only after this point does th2 actually run and execute the ioctl to
    enter the kernel.  Since th1 has already exited, th2 doesn't error and
    enters its own happy spin loop.  The main thread hangs forever in
    pthread_join, and the process is unkillable (the busy loop in the
    kernel doesn't check for any pending signals so kill -9 is ignored and
    ineffective).
    
    I don't see a way to fix this test, so I've just disabled it.  There
    is no good way to ensure concurrency on a single-CPU system when one
    thread wants to sit in a spin loop.  Someone should fix the netmap
    kloop to respond to kill -9, in which case kyua could perhaps at least
    time out the individual test process and kill it.
    
    Reviewed by:    vmaffione
    Obtained from:  CheriBSD
    Sponsored by:   AFRL, DARPA
    Differential Revision:  https://reviews.freebsd.org/D49220
---
 tests/sys/netmap/ctrl-api-test.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/tests/sys/netmap/ctrl-api-test.c b/tests/sys/netmap/ctrl-api-test.c
index 8d33b4c58d2a..6b45dbb1cfea 100644
--- a/tests/sys/netmap/ctrl-api-test.c
+++ b/tests/sys/netmap/ctrl-api-test.c
@@ -1596,6 +1596,7 @@ sync_kloop_csb_enable(struct TestContext *ctx)
 	return sync_kloop_start_stop(ctx);
 }
 
+#if 0
 static int
 sync_kloop_conflict(struct TestContext *ctx)
 {
@@ -1640,6 +1641,14 @@ sync_kloop_conflict(struct TestContext *ctx)
 	/* Wait for one of the two threads to fail to start the kloop, to
 	 * avoid a race condition where th1 starts the loop and stops,
 	 * and after that th2 starts the loop successfully. */
+	/*
+	 * XXX: This doesn't fully close the race.  th2 might fail to
+	 * start executing since th1 can enter the kernel and hog the
+	 * CPU on a single-CPU system until the semaphore timeout
+	 * awakens this thread and it calls sync_kloop_stop.  Once th1
+	 * exits the kernel, th2 can finally run and will then loop
+	 * forever in the ioctl handler.
+	 */
 	clock_gettime(CLOCK_REALTIME, &to);
 	to.tv_sec += 2;
 	ret = sem_timedwait(&sem, &to);
@@ -1674,6 +1683,7 @@ sync_kloop_conflict(struct TestContext *ctx)
 	               ? 0
 	               : -1;
 }
+#endif
 
 static int
 sync_kloop_eventfds_mismatch(struct TestContext *ctx)
@@ -2079,7 +2089,9 @@ static struct mytest tests[] = {
 	decltest(sync_kloop_eventfds_all_direct_rx),
 	decltest(sync_kloop_nocsb),
 	decltest(sync_kloop_csb_enable),
+#if 0
 	decltest(sync_kloop_conflict),
+#endif
 	decltest(sync_kloop_eventfds_mismatch),
 	decltest(null_port),
 	decltest(null_port_all_zero),