CLOCK_MONOTONIC / CLOCK_UPTIME is not really monotonic between threads

Fri Jun 3 04:16:48 UTC 2016

Hi there, we have an application here which is trying to measure UDP
command/response round-trip-time. It runs two posix threads (more actually,
but that's probably irrelevant), one (let's call it A) that does high-level
logic and the second one (B) that does network packet I/O.

The sending side is done by first thread A forming the request, then
calling the function clock_gettime(CLOCK_MONOTONIC) and passing the packet
into the thread B. Obtained timestamp is stored with some logical
transaction ID allowing us to pull that stored value later on when response
arrives. Then we have a separate process that receives those requests,
processing them and sending back some form of response.

Upon receiving a response from the network, the network I/O thread (B)
timestamps it by running clock_gettime(CLOCK_MONOTONIC) and passes the
packet data along with that value via queue to the thread A for processing.

So if we put things into timeline, what our app does would probably look
something like the following:

1. Thread A generates request.
2. A calls clock_gettime(CLOCK_MONOTONIC), storing value as t1 internally
3. A passes packet to thread B
4. B sends out packet via sendto() to server process running on the same
box (fully separate, not a thread)

[some microseconds later]

5. B receives response from server with recvfrom()
6. B instantly calls clock_gettime(CLOCK_MONOTONIC), assigns returned value
to t2
7. B passes packet data along with t2 to the A via queue
8. A picks up packet, parses it and retrieves corresponding t1 stored at
step 2.
9. A calculates RTT by doing t2 - t1 assuming it's going to be positive...

As you might have guessed if you are still reading, from time to time t2 -
t1 comes out slightly negative! Provided it's not some obscure bug in our
app, there is no way this could happen if clock_gettime(CLOCK_MONOTONIC)
would work as advertised. Event (2) could not possibly happen earlier than
(6), which is guaranteed by the fact that the request needs to be processed
by the external entity first in order for the response to be seen by our
app at all. I've added some logs and it seems to be confirming that the
server only sees a single request, there is no chance for the client to
receive some other packet and confuse it. I've also confirmed with tcpdump,
which shows reasonable time delay between request and reply of few hundreds
microseconds.

I've checked all logic and I could not find any mistakes on my end here, so
I added some logging for such events. The distribution appears to be
centered around 0.00006s, but there are some events that go as far up as
0.012s.

I've also tried using CLOCK_UPTIME_PRECISE instead, but it makes no
difference whatsoever.

My questions therefore are:

1. Is it intended/expected behavior of the said API?
2. Has anyone else bumped into this?
3. I know we are doing some clever optimizations using TSC to speed those
APIs up, could be it related to that?
4. If the answer for (3) is yes, then what is the method to disable using
TSC and use slower but possibly more reliable method?
5. Is there any way/plan to fix this long-term?
6. If the behavior is intended/expected, what is the maximum discrepancy we
can expect from this effect?

In general some time ago we've spend quite lot of time switching our app
from using REALTIME into MONOTIME in the hope to eliminate any wizardry
needed to deal with realtime possibly jumping back and forth due to NTP,
leap seconds etc, it's shame that this is not working either. Apart from
measuring command processing delay, that app does lot of high-volume voip
call billing, so even such small discrepancy can easily build up into a
bigger problem, not to mention the fact that every time we deal with
duration we now need to have some check in place to make sure we don't have
some negative values popping our of nowhere.

Any hints, suggestions, pointers are appreciated. I can also give more
debug information as needed. Thanks!

-Max