pthread_cond_timedwait() broken in 9-stable? (from JAN 10)

Julian Elischer julian at freebsd.org
Fri Feb 17 18:04:51 UTC 2012


On 2/17/12 3:28 AM, David Xu wrote:
> On 2012/2/17 16:06, Julian Elischer wrote:
>> On 2/16/12 11:41 PM, Julian Elischer wrote:
>>> adding jkim as he seems to be the last person working with TSC.
>>>
>>>
>>> On 2/16/12 6:42 PM, David Xu wrote:
>>>> On 2012/2/17 10:19, Julian Elischer wrote:
>>>>> On 2/16/12 5:56 PM, David Xu wrote:
>>>>>> On 2012/2/17 8:42, Julian Elischer wrote:
>>>>>>> Adding David Xu for his thoughts since he reqrote the code in 
>>>>>>> quesiton in revision 213098
>>>>>>>
>>>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
>>>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
>>>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
>>>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
>>>>>>>>>>> The program fio (an IO test in ports) uses pthreads
>>>>>>>>>>>
>>>>>>>>>>> the following code (from fio-2.0.3, but its in earlier 
>>>>>>>>>>> code too)
>>>>>>>>>>> has suddenly started misbehaving.
>>>>>>>>>>>
>>>>>>>>>>>          clock_gettime(CLOCK_REALTIME,&t);
>>>>>>>>>>>          t.tv_sec += seconds + 10;
>>>>>>>>>>>
>>>>>>>>>>>          pthread_mutex_lock(&mutex->lock);
>>>>>>>>>>>
>>>>>>>>>>>          while (!mutex->value&&  !ret) {
>>>>>>>>>>>                  mutex->waiters++;
>>>>>>>>>>>                  ret = 
>>>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t);
>>>>>>>>>>>                  mutex->waiters--;
>>>>>>>>>>>          }
>>>>>>>>>>>
>>>>>>>>>>>          if (!ret) {
>>>>>>>>>>>                  mutex->value--;
>>>>>>>>>>>                  pthread_mutex_unlock(&mutex->lock);
>>>>>>>>>>>          }
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> It turns out that 'ret' sometimes comes back instantly (on 
>>>>>>>>>>> my machine) with a
>>>>>>>>>>> value of 60 (ETIMEDOUT)
>>>>>>>>>>> despite the fact that we set the timeout 10 seconds into 
>>>>>>>>>>> the future.
>>>>>>>>>>>
>>>>>>>>>>> Has anyone else seen anything like this?
>>>>>>>>>>> (and yes the condition variable attribute have been set to 
>>>>>>>>>>> use the REALTIME clock).
>>>>>>>>>> But why?
>>>>>>>>>>
>>>>>>>>>> Just a hypothesis that maybe there is some issue with time 
>>>>>>>>>> keeping on that system.
>>>>>>>>>> How would that code work out for you with MONOTONIC?
>>>>>>>>>
>>>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and 
>>>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
>>>>>>>>> i.e. random early returns with ETIMEDOUT.
>>>>>>>>>
>>>>>>>>> I think we will try move out machine forward to a newer 
>>>>>>>>> -stable to see if it resolves.
>>>>>>>> Kan upgraded the machine today to today's 9.x branch tip and 
>>>>>>>> the problem still occurs.
>>>>>>>> 8.x does not have this problem.
>>>>>>>>
>>>>>>>> I have not got a 9-RELEASE machine to test on.. so I can not 
>>>>>>>> tell if this came in with the burst of stuff
>>>>>>>> that came in after the 9.x branch was unfrozen after the 
>>>>>>>> release of 9.0.
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>> I am trying to reproduce the problem,  do you have complete 
>>>>>> sample code to test ?
>>>>>
>>>>> I'm still looking the exact set
>>>>> but on my machine (4 cpus) the program from ports sysutils/fio 
>>>>> exhibits the problem when used with
>>>>> kern.timecounter.hardware=TSC-low and with the following config 
>>>>> file:
>>>>>
>>>>> pu05 # cat config.fio
>>>>>
>>>>> [global]
>>>>> #clocksource=cpu
>>>>> direct=1
>>>>> rw=randread
>>>>> bs=4096
>>>>> fill_device=1
>>>>> numjobs=16
>>>>> iodepth=16
>>>>> #ioengine=posixaio
>>>>> #ioengine=psync
>>>>> ioengine=psync
>>>>> group_reporting
>>>>> norandommap
>>>>> time_based
>>>>> runtime=60000
>>>>> randrepeat=0
>>>>>
>>>>> [file1]
>>>>> filename=/dev/ada0
>>>>>
>>>>> pu05 #
>>>>> pu05 # fio config.fio
>>>>> fio: this platform does not support process shared mutexes, 
>>>>> forcing use of threads. Use the 'thread' option to get rid of 
>>>>> this warning.
>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, 
>>>>> iodepth=16
>>>>> ...
>>>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, 
>>>>> iodepth=16
>>>>> fio 2.0.3
>>>>> Starting 15 threads and 1 process
>>>>> fio: job startup hung? exiting.
>>>>> fio: 5 jobs failed to start
>>>>> Segmentation fault (core dumped)
>>>>> pu05#
>>>>>
>>>>>
>>>>> The reason 5 jobs failed to start is because the parent timed 
>>>>> out on them immediately.
>>>>> It didn't time out on 10 of them apparently.
>>>>>
>>>>>
>>>>> if I set the timer to ACPI-fast it works as expected..
>>>> maybe following code can check to see if TSC-LOW works by let the 
>>>> thread run
>>>> on each cpu.
>>>>
>>>> gettimeofday(&prev, NULL);
>>>> int cpu = 0;
>>>> for (;;) {
>>>>      cpuset_t set;
>>>>      cpu = ++cpu % 4;
>>>>      CPU_ZERO(&set);
>>>>      CPU_SET(cpu, &set);
>>>>      pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>>>>      gettimeofday(&cur, NULL);
>>>>      if ( timercmp(&prev, &cur, >=)) {
>>>>         abort();
>>>>    }
>>>> }
>>>>
>>>>
>>
>> pu05# sysctl kern.timecounter.hardware=TSC-low
>> kern.timecounter.hardware: ACPI-fast -> TSC-low
>> pu05# ./test
>> ^C
>> pu05# cat test.c
>>
>> #include <stdlib.h>
>> #include <sys/param.h>
>> #include <sys/cpuset.h>
>> #include <pthread_np.h>
>>
>> #include <sys/time.h>
>>
>> main()
>> {
>>     int cpu = 0;
>>     struct timeval prev, cur;
>>
>>     gettimeofday(&prev, NULL);
>>     for (;;) {
>>          cpuset_t set;
>>          cpu = ++cpu % 4;
>>          CPU_ZERO(&set);
>>          CPU_SET(cpu, &set);
>>          pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>>          gettimeofday(&cur, NULL);
>>          if ( timercmp(&prev, &cur, >)) {
>>             abort();
>>        }
>>        prev = cur;
>>     }
>> }
>>
>> pu05# ./test
>>
>> minutes pass.......
>>
>> ^C
>> pu05#
>>
>> so it looks as if the TSC is working ok..
>> I'm just going to check that the program is actually moving CPU...
>> yes it is moving around but I can't tell at what speed. (according 
>> to top).
>>
>> so we are still left with a question of "where is the problem?"
>>
>> kernel TSC driver?
>> generic gettimeofday() code?
>> pthreads cond code?
>> the application?
>>
>>
> I am running the fio test on my notebook which is using TSC-low,
> it is on 9.0-RC3, I can not reproduce the problem for
> minutes, then I interrupt it with ctrl-c: looks mot
>
> http://people.freebsd.org/~davidxu/tsc_pthread/dmesg.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/tc.txt
> http://people.freebsd.org/~davidxu/tsc_pthread/fio.txt
>
>
looks normal to me..
I have to been able to test this on a 9-RELEASE machine.. just 9-stable..




More information about the freebsd-stable mailing list