svn commit: r319720 - head/sys/dev/vt

Bruce Evans brde at optusnet.com.au
Sun Jun 11 10:27:04 UTC 2017


On Sat, 10 Jun 2017, Konstantin Belousov wrote:

> On Sat, Jun 10, 2017 at 09:33:48AM -0700, Jonathan Looney wrote:
>> Hi Konstantin,
>>
>> On Sat, Jun 10, 2017 at 2:12 AM, Konstantin Belousov <kostikbel at gmail.com>
>> wrote:
>>> No, acquire is only specified for loads, and release for stores.  In other
>>> words, on some hypothetical ll/sc architecture, the atomic_add_acq()
>>> could be implemented as follows, in asm-pseudocode
>>> atomic_add_acq(int x):
>>>         ll      x, r1
>>>         acq     x
>>>         add     1, r1
>>>         sc      r1, x
>>> Your use of the atomic does not prevent stores reordering.
>>
>> If this is true, it sounds like the atomic(9) man page needs some revision.
>> It says:
>>
>>      When an atomic operation has acquire semantics, the effects of the
>>      operation must have completed before any subsequent load or store (by
>>      program order) is performed.  Conversely, acquire semantics do not
>>      require that prior loads or stores have completed before the atomic
>>      operation is performed.
>>
>> Up until now, I have relied on this description of the way acquire barriers
>> work. If this description is incorrect, I think it is important to fix it
>> quickly.

I think it is correct, but a bit confusing.  Its first sentence says that
the acquire operation acts as if it completes before any subsequent load
or store operation (even non-atomic ones; even to different locations; is
it really that strong?) starts.  Its second sentence is redundant except
for "Conversely", and that word is misleading: completion of earlier
operations before the acquire operation is not the converse of later
operations starting after the acquire operation.

So the acquire operation gives a sort of read-write barrier that is
obviously not very useful by itself.  Using only acquire operations, I
don't see how to order anything except other acquire operations:

     acquire1
     load/store1    # can be any time after acquire1
     acquire2       # must be after acquire1
     load/store2    # can be any time after acquire2

Since load/store1 may be delayed until after acquire2, it may be mixed
with load/store2 in any order in general.

Release operations give an ordering that pairs correctly with acquire
operations.  That pairing is essentially what is misdescribed as the
"converse" above.
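
For example, the usual publish/consume pairing looks something like this
(a sketch with made-up names, just to show how rel and acq pair up):

    /* Writer: fill in the data, then publish it with a release store. */
    sc->data = compute_data();              /* plain store */
    atomic_store_rel_int(&sc->ready, 1);    /* all earlier stores complete first */

    /* Reader: see the flag with an acquire load, then read the data. */
    if (atomic_load_acq_int(&sc->ready) != 0)
            use_data(sc->data);             /* cannot move above the acquire load */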

> There are many issues with the atomic(9) man page, and they are not easy
> to fix.  In essence, a useful, implementable and rigorous specification of
> the behaviour of the atomics, or of the memory model as a whole (which
> cannot be avoided if atomics are specified), still seems to be a somewhat
> unsolved scientific problem.
>
> As is, I strongly suggest relying on the standard C or C++
> specifications, augmented with some code reading of
> arch/include/atomic.h.  Although it is unfeasible to use compiler-provided
> atomics in the kernel's C runtime, the intended behaviour as specified in
> the standards gives us a strong base to get something that does not have
> inherent design mistakes.  Unfortunately, this is the only way to get it
> correct now.

But the standards are even harder to read (since they are more formal and
more general).

>> I'm not claiming that this fixes all bugs in this area. (In fact, I
>> specifically disclaim this.) But, it does stop the crash from occurring.
>>
>> If you still feel there are better mechanisms to achieve the desired
>> ordering, please let me know and I'll be happy to fix and/or improve
>> this.
> See the pseudocode I posted in my original reply, which uses acq/rel pair.

Try to do it without using any atomic ops directly.  Atomic ops are too
hard to use.

Well, I tried it (just by reading the code).  The bug seems to be just
foot-shooting.  Normal mutex locking should work, but is not used for calls
to vt_resume_flush_timer().  I think this is not just to avoid LORs, since
initialization of the callout is done while holding the lock.
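
For reference, a fully locked version would be roughly the following (a
sketch only, assuming the current (vd, ms) argument list and that the
callout armed in vt_upgrade() can simply be rescheduled with
callout_schedule(); not a tested patch):

    void
    vt_resume_flush_timer(struct vt_device *vd, int ms)
    {

            if (ms <= 0)
                    ms = 1000 / VT_TIMERFREQ;       /* default period */

            VT_LOCK(vd);
            /* Both the VDF_ASYNC test and the arming are now under vd_lock. */
            if ((vd->vd_flags & VDF_ASYNC) != 0 && vd->vd_timer_armed == 0) {
                    vd->vd_timer_armed = 1;
                    callout_schedule(&vd->vd_timer, hz * ms / 1000);
            }
            VT_UNLOCK(vd);
    }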

>> The code you posted for vt_resume_flush_timer() would switch vd_timer_armed
>> from 0 to 1 even if VDF_ASYNC is not set; however, that is not what we
>> want. vd_timer_armed should be left untouched if VDF_ASYNC is not set.
>>
>> It sounds like what I want is some sort of fence in vt_upgrade() like jhb@
>> specified in his email. For example:
>>
>> vd->vd_timer_armed = 1;
>> atomic_thread_fence_rel();
>> vd->vd_flags |= VDF_ASYNC;
>>
>> I recognize a fence would be cleaner since I really only needed the barrier
>> and not the atomic operation. Do you agree the above would be sufficient to
>> ensure store ordering?
>
> No, it is not sufficient. The release barrier on write needs a dual
> acquire barrier on read to have an effect. So if you prefer to use
> fence_rel(), the reader should issue fence_acq() between reading
> vd_flags and doing cmpset on vd_timer_armed.
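
Spelled out, that pairing would be roughly (sketch only):

    /* Writer side (vt_upgrade()), as quoted above: */
    vd->vd_timer_armed = 1;
    atomic_thread_fence_rel();      /* order the store before VDF_ASYNC */
    vd->vd_flags |= VDF_ASYNC;

    /* Reader side (vt_resume_flush_timer()): */
    if ((vd->vd_flags & VDF_ASYNC) == 0)
            return;
    atomic_thread_fence_acq();      /* order the flag read before the cmpset */
    if (atomic_cmpset_int(&vd->vd_timer_armed, 0, 1) == 0)
            return;
    /* ... now safe to arm the callout ... */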

The complications seem to be mostly unnecessary since everything is under
the vt lock in vt_upgrade():

XX void
XX vt_upgrade(struct vt_device *vd)
XX {
XX 	...
XX 	VT_LOCK(vd);

The lock is actually the mutex vd->vd_lock.

XX 	if (vd->vd_curwindow == NULL)
XX 		vd->vd_curwindow = vd->vd_windows[VT_CONSWINDOW];
XX 
XX 	register_handlers = 0;
XX 	if (!(vd->vd_flags & VDF_ASYNC)) {

This flag is initially clear to tell other functions not to do any callout
stuff, because our callout handler and infrastructure are not initialized
yet.  The foot-shooting is that this flag is tested, and vd->vd_timer_armed
is tested and set, without holding the mutex.  This function doesn't do the
foot-shooting itself, but is affected by it, since it must be careful to set
the flags in an atomic and/or ordered way to support the foot-shooting.

XX 		/* Attach keyboard. */
XX 		vt_allocate_keyboard(vd);
XX 
XX 		/* Init 25 Hz timer. */
XX 		callout_init_mtx(&vd->vd_timer, &vd->vd_lock, 0);

Any callout will be locked by the same mutex that VT_LOCK() just acquired,
thus cannot race with us.  However, with EARLY_AP_STARTUP, ordinary output
can easily be running on another CPU (thread).  It uses foot-shooting to
race with us (and complicated atomic operations to avoid racing with
other instances of ordinary output).  Non-ordinary output like screen
switches might do the same (not very likely early).

XX 
XX 		/*
XX 		 * Start timer when everything ready.
XX 		 * Note that the operations here are purposefully ordered.
XX 		 * We need to ensure vd_timer_armed is non-zero before we set
XX 		 * the VDF_ASYNC flag. That prevents this function from
XX 		 * racing with vt_resume_flush_timer() to update the
XX 		 * callout structure.
XX 		 */
XX 		atomic_add_acq_int(&vd->vd_timer_armed, 1);

The previous version set this to 1 non-atomically after calling
callout_reset().

XX 		vd->vd_flags |= VDF_ASYNC;
XX 		callout_reset(&vd->vd_timer, hz / VT_TIMERFREQ, vt_timer, vd);
XX 		register_handlers = 1;
XX 	}
XX 
XX 	VT_UNLOCK(vd);

It would be more natural to complete our initialization before setting
VDF_ASYNC:

 		(void)atomic_cmpset_int(&vd->vd_timer_armed, 0, 1);
 		callout_reset(... vt_timer ...);
 		vd->vd_flags |= VDF_ASYNC;

The cmpset is not quite right, but it is the same as what the other
instances will do, and here it is sure to find 0 and set it to 1.  We just
have to set it before VDF_ASYNC.  Then I trust callout_reset() to complete
its setup for all threads, after which it is safe to set VDF_ASYNC.  That
setting becomes visible to other CPUs later, and it doesn't matter if this
is much later: the other instances can't touch the callout until they see
it change, callout handlers can't run until the lock is released, and
releasing the lock makes the setting of VDF_ASYNC visible.
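
Putting that together, the tail of the quoted block would become roughly
(sketch only, reusing the quoted statements):

            /*
             * Arm and start the timer before advertising VDF_ASYNC, so that
             * the unlocked callers of vt_resume_flush_timer() cannot touch
             * the callout before it is fully set up.
             */
            (void)atomic_cmpset_int(&vd->vd_timer_armed, 0, 1);
            callout_reset(&vd->vd_timer, hz / VT_TIMERFREQ, vt_timer, vd);
            vd->vd_flags |= VDF_ASYNC;
            register_handlers = 1;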

Grep shows at least the following places with the foot-shooting:
- vt_window_switch() drops the lock, then soon calls
   vt_resume_flush_timer()
- vt_mouse_event() calls vt_resume_flush_timer().  I doubt that it holds
   the lock.
- similarly in vt_mouse_state()
- vt_replace_backend() drops the lock, then later calls
   vt_resume_flush_timer().  It calls vt_upgrade() in between, so it can
   clearly regain full locking.

Use lock assertions to find all the places.
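
E.g., under INVARIANTS, something like this at the top of
vt_resume_flush_timer() (and the other helpers) panics in every caller that
doesn't hold the lock; VT_LOCK() is just vd_lock, so a plain mtx_assert()
is enough:

    /* Temporary debugging aid: catches every unlocked (foot-shooting) caller. */
    mtx_assert(&vd->vd_lock, MA_OWNED);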

Bruce

