Re: git: 8271d9b99a3b - main - libsys: remove usage of pthread_once and _once_stub

From: Konstantin Belousov <kostikbel_at_gmail.com>
Date: Thu, 22 Feb 2024 04:27:08 UTC
On Thu, Feb 22, 2024 at 01:23:08AM +0000, Jessica Clarke wrote:
> On 22 Feb 2024, at 01:11, Konstantin Belousov <kostikbel@gmail.com> wrote:
> > 
> > On Wed, Feb 21, 2024 at 05:23:10PM +0000, Jessica Clarke wrote:
> >> On 21 Feb 2024, at 14:17, Konstantin Belousov <kostikbel@gmail.com> wrote:
> >>> 
> >>> On Wed, Feb 21, 2024 at 12:51:04AM +0000, Jessica Clarke wrote:
> >>>> On 21 Feb 2024, at 00:29, Konstantin Belousov <kib@FreeBSD.org> wrote:
> >>>>> 
> >>>>> The branch main has been updated by kib:
> >>>>> 
> >>>>> URL: https://cgit.FreeBSD.org/src/commit/?id=8271d9b99a3b98c662ee9a6257a144284b7e1728
> >>>>> 
> >>>>> commit 8271d9b99a3b98c662ee9a6257a144284b7e1728
> >>>>> Author:     Konstantin Belousov <kib@FreeBSD.org>
> >>>>> AuthorDate: 2024-02-20 14:45:29 +0000
> >>>>> Commit:     Konstantin Belousov <kib@FreeBSD.org>
> >>>>> CommitDate: 2024-02-21 00:26:11 +0000
> >>>>> 
> >>>>>  libsys: remove usage of pthread_once and _once_stub
> >>>>> 
> >>>>>  that existed in auxv.c, use simple bool gate instead. This leaves a
> >>>>>  small window if two threads try to call _elf_aux_info(3) simultaneously.
> >>>>>  The situation is safe because auxv parsing is really idempotent. The
> >>>>>  parsed data is the same, and we store atomic types (int/long/ptr) so
> >>>>>  double-init does not matter.
> >>>> 
> >>>> You still need to load acquire and store release aux_once though,
> >>>> otherwise you can see aux_once as true yet read the pre-initialised
> >>>> data. In practice that’s surely very hard to hit, but the code as
> >>>> written is now wrong. Also, idempotence should probably be made
> >>>> unnecessary by using 0/1/2 state for uninitialised/initialising/
> >>>> initialised, as it’s still technically UB from a C AM perspective due
> >>>> to not being data race free if two threads initialise at the same time.
> >>>> Better to just do the correct thing rather than risk things going wrong.
> >>> 
> >>> There is too much to handle 'in process' state for loosing thread, I need
> >>> the whole libthr machinery.
> >> 
> >> What do you need libthr for? In pseudo-C:
> >> 
> >> x = load_acquire(&aux_once)
> >> if (__predict_true(x == 2))
> >>    return;
> >> if (x == 1 || !compare_exchange_strong_acquire(&aux_once, &x, 1)) {
> >>    while (x != 2) {
> >>        yield();
> >>        x = load_acquire(&aux_once)
> >>    }
> >>    return;
> >> }
> >> /* initialise as before */
> >> store_release(&aux_once, 2);
> >> 
> >> I believe that’s all you need. Or compare exchange 0 to 1 as the
> >> initial operation; makes the source code shorter at the expense of a
> >> more expensive fast path:
> >> 
> >> x = 0;
> >> if (__predict_true(!compare_exchange_strong_acquire(&aux_once, &x, 1)) {
> >>    while (__predict_false(x != 2)) {
> >>        yield();
> >>        x = load_acquire(&aux_once)
> >>    }
> >>    return;
> >> }
> >> /* initialise as before */
> >> store_release(&aux_once, 2);
> >> 
> >> I probably have bugs in the above, but you get the gist.
> > The bug in the fragment above is with the yield().  If low-priority thread
> > enters the '1' (in progress) block, and then is preempted by high-priority
> > thread also entering init_auxv(), the process would never make a progress.
> > 
> > This is why I said that we need libthr (or umtx), to use real locking and
> > move the waiting thread off cpu.  In kernel, yield can be used in similar
> > situations because we can bump the priority, although it is tricky.
> 
> Yes, priority inversion is an issue here. But this is (without all the
> C++ abstraction) how libcxxrt implements __cxa_guard_acquire, so if
> it’s good enough for C++ constructors for static storage duration
> objects declared at local scope, surely it’s also good enough for
> aux_once? And if it’s not good enough for aux_once then libcxxrt should
> be deemed broken...
> 
> One could easily adapt the above to use UMTX_OP_WAIT/WAKE though.

As I noted, umtx(2) should be the solution.  Might be we should provide
some park/unpark interface in libc, to hide the umtx interface.  Its use
started to proliferate recently.

> 
> Jess
> 
> >>> I added the fences, thanks for noting.
> >> 
> >> Thanks.
> >> 
> >> Jess
> >> 
> >>> WRT being UB from pure C, we already have much more assumptions about
> >>> atomicity.
> 
>