Fatal error 'mutex is on list' at line 139 in file /usr/src/lib/libthr/thread/thr_mutex.c (errno = 35)

Mon Mar 21 11:23:04 UTC 2016

On Mon, Mar 21, 2016 at 12:15:15PM +0200, Oleg V. Nauman wrote:
>  OK, but please take a look what I have found ( it makes me thinking that 
> problem is within the compiled KDE code ):
>  The failure point within the KDE code is the same ( at least it is true for 
> coredumps generated today ):
> 
> #7  0x0000000805a2f6be in __pthread_mutex_timedlock (mutex=0x81b200008, 
>     abstime=0x7fffffffd458) at /usr/src/lib/libthr/thread/thr_mutex.c:583
> #8  0x000000080443c4b0 in pthreadTimedLock::lock (this=0x81777b680)
>     at 
> /usr/ports/x11/kdelibs4/work/kdelibs-4.14.3/kdecore/util/kshareddatacache_p.h:252
> ....
> (gdb) f 8
> #8  0x000000080443c4b0 in pthreadTimedLock::lock (this=0x81777b680)
>     at 
> /usr/ports/x11/kdelibs4/work/kdelibs-4.14.3/kdecore/util/kshareddatacache_p.h:252
> 252             return pthread_mutex_timedlock(&m_mutex, &timeout) == 0;
> (gdb) p &m_mutex
> $1 = (pthread_mutex_t *) 0x81b200008
> (gdb) p m_mutex
> $2 = (pthread_mutex_t &) @0x81b200008: 0x8000000000000001
This is correct.  The value is the special cookie set for process-shared
locks; the actual lock exists elsewhere.

> (gdb) p &timeout
> $3 = (timespec *) 0x6
This might be some gdb issue.  Anyway, the timeout value is not the problem.

> (gdb) p timeout
> Cannot access memory at address 0x6
> (gdb)
>  
> It seems that both m_mutex and timeout are wrong
m_mutex is fine, as I noted above.

> 
> The class which generates coredumps looks like:
> 
> #if defined(KSDC_THREAD_PROCESS_SHARED_SUPPORTED) && defined(KSDC_TIMEOUTS_SUPPORTED)
> class pthreadTimedLock : public pthreadLock
> {
> public:
>     pthreadTimedLock(pthread_mutex_t &mutex)
>         : pthreadLock(mutex)
>     {
>     }
> 
>     virtual bool lock()
>     {
>         struct timespec timeout;
> 
>         // Long timeout, but if we fail to meet this timeout it's probably a cache
>         // corruption (and if we take 8 seconds then it should be much much quicker
>         // the next time anyways since we'd be paged back in from disk)
>         timeout.tv_sec = 10 + ::time(NULL); // Absolute time, so 10 seconds from now
>         timeout.tv_nsec = 0;
> 
>         return pthread_mutex_timedlock(&m_mutex, &timeout) == 0;
>     }
> };
> #endif
> 
> It is called by:
> 
> (gdb) f 9
> #9  0x000000080443c8a8 in KSharedDataCache::Private::CacheLocker::cautiousLock 
> (
>     this=0x7fffffffd5f0)
>     at 
> /usr/ports/x11/kdelibs4/work/kdelibs-4.14.3/kdecore/util/kshareddatacache.cpp:1259
> 1259                while (!d->lock() && !isLockedCacheSafe()) {
> (gdb) p *d
> $4 = {m_cacheName = {static null = {<No data fields>}, static shared_null = 
> {ref = {
>         _q_value = 2731}, alloc = 0, size = 0, data = 0x6192ca 
> <QString::shared_null+26>,
>       clean = 0, simpletext = 0, righttoleft = 0, asciiCache = 0, capacity = 
> 0, reserved = 0,
>       array = {0}}, static shared_empty = {ref = {_q_value = 50}, alloc = 0, 
> size = 0,
>       data = 0x805105c3a <QString::shared_empty+26>, clean = 0, simpletext = 
> 0,
>       righttoleft = 0, asciiCache = 0, capacity = 0, reserved = 0, array = 
> {0}},
>     d = 0x8176e8180, static codecForCStrings = 0x0}, shm = 0x81b200000,
>   m_lock = {<QtSharedPointer::ExternalRefCount<KSDCLock>> = 
> {<QtSharedPointer::Basic<KSDCLock>> = {value = 0x81777b680}, d = 0x81777b6c0}, 
> <No data fields>}, m_mapSize = 10547304,
>   m_defaultCacheSize = 10485760, m_expectedItemSize = 0, m_expectedType = 
> LOCKTYPE_MUTEX}
> (gdb) p d
> $5 = (KSharedDataCache::Private *) 0x8176d2030
> 
> Well I understand that unwinding the KDE code it is a task not for humans..
> 
> The hardware is ASUS X552C notebook, Ivybridge, amd64
> I noticed massive coredumps after x11/kdelibs4 recompilation with clang 3.8.0 
> so it is possible that it is a problem with code generation.
> It is does not depend on optimization level ( at least it exhibits the same 
> behavior for both -O2 and -O0 )
> The only CPU/optimization/code generation specific setting is
> CPUTYPE?=nehalem
> in make.conf

In other words, there is no virtualization involved.

I think that the problem at hand is not related to the clang update. You
recently rebuilt the KDE libs, which probably triggered detection of the new
feature, process-shared locks in our libthr.  Before that, older HEAD
did not expose p/shared as an implemented option.  Somehow the implementation
and KDE expectations do not match, and asserts in libthr catch that.

Anyway, please apply the debugging patch I posted in the previous mail.