pthread_cond_timedwait() broken in 9-stable? (from JAN 10)
Julian Elischer
julian at freebsd.org
Fri Feb 17 08:05:11 UTC 2012
On 2/16/12 11:41 PM, Julian Elischer wrote:
> adding jkim as he seems to be the last person working with TSC.
>
>
> On 2/16/12 6:42 PM, David Xu wrote:
>> On 2012/2/17 10:19, Julian Elischer wrote:
>>> On 2/16/12 5:56 PM, David Xu wrote:
>>>> On 2012/2/17 8:42, Julian Elischer wrote:
>>>>> Adding David Xu for his thoughts since he reqrote the code in
>>>>> quesiton in revision 213098
>>>>>
>>>>> On 2/16/12 2:57 PM, Julian Elischer wrote:
>>>>>> On 2/16/12 1:06 PM, Julian Elischer wrote:
>>>>>>> On 2/16/12 9:34 AM, Andriy Gapon wrote:
>>>>>>>> on 15/02/2012 23:41 Julian Elischer said the following:
>>>>>>>>> The program fio (an IO test in ports) uses pthreads
>>>>>>>>>
>>>>>>>>> the following code (from fio-2.0.3, but its in earlier code
>>>>>>>>> too)
>>>>>>>>> has suddenly started misbehaving.
>>>>>>>>>
>>>>>>>>> clock_gettime(CLOCK_REALTIME,&t);
>>>>>>>>> t.tv_sec += seconds + 10;
>>>>>>>>>
>>>>>>>>> pthread_mutex_lock(&mutex->lock);
>>>>>>>>>
>>>>>>>>> while (!mutex->value&& !ret) {
>>>>>>>>> mutex->waiters++;
>>>>>>>>> ret =
>>>>>>>>> pthread_cond_timedwait(&mutex->cond,&mutex->lock,&t);
>>>>>>>>> mutex->waiters--;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> if (!ret) {
>>>>>>>>> mutex->value--;
>>>>>>>>> pthread_mutex_unlock(&mutex->lock);
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> It turns out that 'ret' sometimes comes back instantly (on
>>>>>>>>> my machine) with a
>>>>>>>>> value of 60 (ETIMEDOUT)
>>>>>>>>> despite the fact that we set the timeout 10 seconds into the
>>>>>>>>> future.
>>>>>>>>>
>>>>>>>>> Has anyone else seen anything like this?
>>>>>>>>> (and yes the condition variable attribute have been set to
>>>>>>>>> use the REALTIME clock).
>>>>>>>> But why?
>>>>>>>>
>>>>>>>> Just a hypothesis that maybe there is some issue with time
>>>>>>>> keeping on that system.
>>>>>>>> How would that code work out for you with MONOTONIC?
>>>>>>>
>>>>>>> Jens Axboe, (CC'd) tried both CLOCK_REALTIME and
>>>>>>> CLOCK_MONOTONIC, and they both had the same problem..
>>>>>>> i.e. random early returns with ETIMEDOUT.
>>>>>>>
>>>>>>> I think we will try move out machine forward to a newer
>>>>>>> -stable to see if it resolves.
>>>>>> Kan upgraded the machine today to today's 9.x branch tip and
>>>>>> the problem still occurs.
>>>>>> 8.x does not have this problem.
>>>>>>
>>>>>> I have not got a 9-RELEASE machine to test on.. so I can not
>>>>>> tell if this came in with the burst of stuff
>>>>>> that came in after the 9.x branch was unfrozen after the
>>>>>> release of 9.0.
>>>>>>
>>>>>>
>>>>>
>>>> I am trying to reproduce the problem, do you have complete
>>>> sample code to test ?
>>>
>>> I'm still looking the exact set
>>> but on my machine (4 cpus) the program from ports sysutils/fio
>>> exhibits the problem when used with
>>> kern.timecounter.hardware=TSC-low and with the following config file:
>>>
>>> pu05 # cat config.fio
>>>
>>> [global]
>>> #clocksource=cpu
>>> direct=1
>>> rw=randread
>>> bs=4096
>>> fill_device=1
>>> numjobs=16
>>> iodepth=16
>>> #ioengine=posixaio
>>> #ioengine=psync
>>> ioengine=psync
>>> group_reporting
>>> norandommap
>>> time_based
>>> runtime=60000
>>> randrepeat=0
>>>
>>> [file1]
>>> filename=/dev/ada0
>>>
>>> pu05 #
>>> pu05 # fio config.fio
>>> fio: this platform does not support process shared mutexes,
>>> forcing use of threads. Use the 'thread' option to get rid of this
>>> warning.
>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16
>>> ...
>>> file1: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=psync, iodepth=16
>>> fio 2.0.3
>>> Starting 15 threads and 1 process
>>> fio: job startup hung? exiting.
>>> fio: 5 jobs failed to start
>>> Segmentation fault (core dumped)
>>> pu05#
>>>
>>>
>>> The reason 5 jobs failed to start is because the parent timed out
>>> on them immediately.
>>> It didn't time out on 10 of them apparently.
>>>
>>>
>>> if I set the timer to ACPI-fast it works as expected..
>> maybe following code can check to see if TSC-LOW works by let the
>> thread run
>> on each cpu.
>>
>> gettimeofday(&prev, NULL);
>> int cpu = 0;
>> for (;;) {
>> cpuset_t set;
>> cpu = ++cpu % 4;
>> CPU_ZERO(&set);
>> CPU_SET(cpu, &set);
>> pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
>> gettimeofday(&cur, NULL);
>> if ( timercmp(&prev, &cur, >=)) {
>> abort();
>> }
>> }
>>
>>
pu05# sysctl kern.timecounter.hardware=TSC-low
kern.timecounter.hardware: ACPI-fast -> TSC-low
pu05# ./test
^C
pu05# cat test.c
#include <stdlib.h>
#include <sys/param.h>
#include <sys/cpuset.h>
#include <pthread_np.h>
#include <sys/time.h>
main()
{
int cpu = 0;
struct timeval prev, cur;
gettimeofday(&prev, NULL);
for (;;) {
cpuset_t set;
cpu = ++cpu % 4;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
gettimeofday(&cur, NULL);
if ( timercmp(&prev, &cur, >)) {
abort();
}
prev = cur;
}
}
pu05# ./test
minutes pass.......
^C
pu05#
so it looks as if the TSC is working ok..
I'm just going to check that the program is actually moving CPU...
yes it is moving around but I can't tell at what speed. (according to
top).
so we are still left with a question of "where is the problem?"
kernel TSC driver?
generic gettimeofday() code?
pthreads cond code?
the application?
More information about the freebsd-stable
mailing list