rwlocks: poor performance with adaptive spinning

Wed Sep 26 11:51:09 PDT 2007

On Wed, 26 Sep 2007, John Baldwin wrote:

> On Wednesday 26 September 2007 11:54:57 am Jeff Roberson wrote:
>> On Wed, 26 Sep 2007, Attilio Rao wrote:
>>
>>> 2007/9/26, Jeff Roberson <jroberson at chesapeake.net>:
>>>> On Tue, 25 Sep 2007, John Baldwin wrote:
>>>>
>>>>>
>>>>> So I looked at this some more and now I don't understand why the
> trywait() and
>>>>> cancel() changes were done.  Some background:
>>>>
>>>> I wanted the turnstile chain lock to protect only the association between
>>>> the turnstile and lock.  This way it further reduces the contention
> domain
>>>> for the hard lock/unlock cases.  We sometimes see a lot of contention on
>>>> the turnstile chain that may be due to hash collisions.  If this reduced
>>>> contention doesn't help I think it would be possible to revert this
>>>> change.
>>>
>>> Well, the change jhb is suggesting won't reduce tc_lock contention for
>>> sure, but it will prevent the turnstile lookup/locking if not
>>> necessary.
>>> Mainly, we need a spinlock held in order to protect mtx_lock bitfield
>>> (as well as rw_lock) by setting waiters flags.
>>> While in the past code this lock was only tc_lock, currently mtx_lock
>>> is both protected by tc_lock and ts_lock, which it doesn't seem
>>> strictly necessary.
>>> if, for example, you lock tc_lock than you scan the chain and lock
>>> ts_lock (basically what turnstile_trywait() does) and later you find
>>> lock is released or that the state of lock changed and you cannot set
>>> a waiters flag, you did the ts_locking and hash table scanning even if
>>> it wasn't needed.
>>
>> Yes I understand, the question is whether the decreased complexity is a
>> bigger win than the reduced contention.  The contention is only reduced if
>> there are collisions in the table.  So probably jhb is right, we should
>> revert this part of the change.  I need to make sure there was not some
>> other reason for it.  I had convinced myself at the time that it was
>> necessary, however, I can't recall why.
>
> Actually, since you still hold the tc_lock the entire time (trywait() doesn't
> drop it), I don't think the trywait() / cancel() changes affect spin lock
> contention at all.  I think it's more just a matter of overhead.
>
> If you do want to drop the tc_lock after the lookup in trywait() then I think
> the added complexity you would need to add (queue'ing the thread in trywait()
> to avoid a race with concurrent contesters on the "real" lock and then
> backing all that out in cancel()) would be very hairy.

Yes it's not for concurrency in lock, but in unlock.  If the chain gets as 
deep as 10 there may be some practical value in reducing contention on 
that chain.  We should look at this more before we rush into anything. 
Right now unlock holds the chain lock for almost the entire duration but 
it's not really necessary.

I think more experimentation is warranted especially since I did not 
actually measure a slowdown in heavily contended workloads.  If we can 
reduce the number of hash collisions significantly then your approach is 
definitely a win.  However it may be less so if we do not.

Thanks,
Jeff

>
>>>> I'm not sure why you're calling this O(n).  It's O(n) with the number of
>>>> hash collisions, but it's a hash lookup.  We should consider using a real
>>>> hash algorithm here if the distribution is bad.
>>>
>>> Well, the hash lookup (worst case) has O(n) complexity.
>>> This linear complexity cames from the O(1) * O(n) which is the access
>>> to the list and the scanning of this one. So this is reduced to the
>>> same complexity of a linked list.
>>
>> sure, I just don't think it's common to call a hash O(n).
>
> Well, the lookup of the chain in the bucket is what I call O(n).  Getting to
> the bucket (which you always have to do to lock the turnstile chain) is O(1).
> The change I'm suggesting would only defer walking the chain to
> turnstile_wait(), not finding the index in the bucket.  It would also defer
> grabbing the turnstile spin lock which probably dwarfs the cost of the chain
> walk in practice.
>
>>> However, some time ago I saw empirically that starting from a 'struct
>>> thread' allocated address the highest level of entropy in the word
>>> came after a right shift of 10 bits (for ia32) and 12 bits for amd64.
>>> Currently, turnstile and sleepqueues only uses 8 bits of right
>>> shifting.
>>
>> I haven't studied the hash distribution in turnstiles, I know there is
>> some debugging code in there for it so jhb must have.  Perhaps he can say
>> how it does.
>
> I don't think I ever saw it go over 10.  Note that the comments about 'struct
> thread' aren't quite relevant as sleepqueue's and turnstile's don't use
> thread pointers for hashing, but wait channels and lock pointers.  That said,
> I only used the values they used based on what tsleep() in 4.x used for its
> hash for the 4.x sleep queues.
>
> -- 
> John Baldwin
>