kern/127545: POSIX (1003.1b) semaphores can become negative

Philip Semanchuk philip at semanchuk.com
Mon Sep 22 18:00:07 UTC 2008


>Number:         127545
>Category:       kern
>Synopsis:       POSIX (1003.1b) semaphores can become negative
>Confidential:   no
>Severity:       non-critical
>Priority:       low
>Responsible:    freebsd-bugs
>State:          open
>Quarter:        
>Keywords:       
>Date-Required:
>Class:          sw-bug
>Submitter-Id:   current-users
>Arrival-Date:   Mon Sep 22 18:00:07 UTC 2008
>Closed-Date:
>Last-Modified:
>Originator:     Philip Semanchuk
>Release:        7.0
>Organization:
>Environment:
FreeBSD whiskey.nc.rr.com 7.0-RELEASE FreeBSD 7.0-RELEASE #0: Sun Feb 24 19:59:52 UTC 2008     root at logan.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  i386
>Description:
When two processes contend to acquire a POSIX semaphore, the value of the semaphore can become negative.

In the sample code (see below), this manifests itself as an overflow error when the following sequence of events occurs:
1) The semaphore has the value 1.
2) Process 1 acquires the semaphore, value becomes 0
3) Process 2 acquires the semaphore, value becomes -1
4) Either process calls sem_post(). That function sees that the semaphore's value == SEM_VALUE_MAX (which is #defined as ~0U, the same bit pattern as -1 in an unsigned int), concludes that the semaphore is already at its maximum, and sets errno = EOVERFLOW.
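
The arithmetic behind steps 1-4 can be sketched in user space. This is only an illustration, not the code in uipc_sem.c: it assumes the counter is stored as an unsigned int, and the names toy_sem, toy_unchecked_wait, toy_post and TOY_SEM_VALUE_MAX are made up for the example.

```c
#include <errno.h>

/* Hypothetical stand-in for SEM_VALUE_MAX, which the report says is ~0U. */
#define TOY_SEM_VALUE_MAX (~0U)

/* Hypothetical toy semaphore; assumes the counter is an unsigned int. */
struct toy_sem {
    unsigned int value;
};

/* Decrement without checking for zero. This models the outcome of the
 * suspected bug, not the intended behaviour of a wait operation. */
static void toy_unchecked_wait(struct toy_sem *s)
{
    s->value--;   /* 0 - 1 wraps around to ~0U for an unsigned int */
}

/* Mirrors the overflow check described in step 4: returns 0 on success,
 * or EOVERFLOW if the counter already equals the maximum. */
static int toy_post(struct toy_sem *s)
{
    if (s->value == TOY_SEM_VALUE_MAX)
        return EOVERFLOW;
    s->value++;
    return 0;
}
```

Starting from a value of 1, two calls to toy_unchecked_wait() leave the counter at ~0U, and the next toy_post() reports EOVERFLOW instead of releasing the semaphore, which matches the failure seen in the sample code.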

I have two different test machines both running FreeBSD 6.0. Using the sample code, both exhibit the problem 100% of the time. One of those test machines also has FreeBSD 7.0 installed. It exhibits the problem 100% of the time. The same sample code never exhibits the problem under OS X 10.5.4 and various flavors of Linux.

This problem makes semaphores pretty useless for IPC, but I'm marking this as non-critical and low priority since /usr/src/sys/conf/NOTES says, "p1003_1b_semaphores are very experimental".

>How-To-Repeat:
The sample code, log, etc. are here:
http://semanchuk.com/philip/temp/freebsd_semaphore_test.tar.gz

If you can attach a copy of that tarball to this PR, that'd be great.

The tarball contains mk.sh which will build the applications premise and conclusion. (As in Mrs. Premise and Mrs. Conclusion from the Monty Python sketch. What else would you expect from a Python extension developer?) They talk with one another to demonstrate usage of shared memory and semaphores. Premise writes a random string (the current time) to shared memory; conclusion reads it, md5s it, and writes that back. Premise reads conclusion's message, verifies that it is the md5 of what she (Mrs. Premise) wrote, md5s that and writes it back. And so on, for up to 1000 iterations. However, on FreeBSD they fail almost immediately due to the aforementioned EOVERFLOW error.

I was curious what was happening so I hacked up uipc_sem.c a little to print more debugging info. (I had to #define SEM_DEBUG in opt_posix.h to get the debug messages to appear in /var/log/messages.) The changes I made are described in uipc_sem.diff (which was made against the version of uipc_sem.c that shipped w/7.0). With these additions to uipc_sem.c, I was able to see the semaphore's value go to -1 and even -2. I included my /var/log/messages. Have a look:
   grep "post-decrement" messages 


>Fix:
I think a clue to the source of the problem (if you're interested in my opinion) is visible with this:
   grep -n "pid=" messages

Line 18 of the grep output shows pid 58126 entering the critical section inside kern_sem_wait() and it doesn't leave until line 179. In the meantime, pid 58125 enters and leaves the critical section over and over. It looks like the mtx_lock() call isn't behaving as expected.
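
To show why overlapping critical sections would produce a negative value, here is a single-threaded replay of one possible interleaving. It is a sketch under the assumption that a wait consists of a "check value > 0" step followed by a "decrement" step; it is not taken from kern_sem_wait(), and every name in it (toy_sem2, toy_check, toy_take, toy_replay_*) is hypothetical.

```c
#include <stdbool.h>

/* Hypothetical toy semaphore; assumes an unsigned counter. */
struct toy_sem2 {
    unsigned int value;
};

/* Step 1 of a wait: can the semaphore be taken right now? */
static bool toy_check(const struct toy_sem2 *s)
{
    return s->value > 0;
}

/* Step 2 of a wait: take it. Only safe if no other process ran
 * between the check and this decrement. */
static void toy_take(struct toy_sem2 *s)
{
    s->value--;
}

/* Both processes are inside the critical section at once:
 * each one checks before either one decrements. */
static unsigned int toy_replay_unlocked(void)
{
    struct toy_sem2 s = { 1 };
    bool a_ok = toy_check(&s);   /* process A sees value 1 */
    bool b_ok = toy_check(&s);   /* process B also sees value 1 */
    if (a_ok)
        toy_take(&s);            /* 1 -> 0 */
    if (b_ok)
        toy_take(&s);            /* 0 -> wraps to ~0U, i.e. "-1" */
    return s.value;
}

/* With working mutual exclusion, check and decrement are one unit. */
static unsigned int toy_replay_locked(void)
{
    struct toy_sem2 s = { 1 };
    if (toy_check(&s))
        toy_take(&s);            /* process A: 1 -> 0 */
    if (toy_check(&s))
        toy_take(&s);            /* process B: sees 0, would block instead */
    return s.value;
}
```

The unlocked replay ends with the counter at ~0U, while the locked replay ends at 0. Whether this is the actual failure mode in uipc_sem.c is a guess; the log only shows that the two pids overlap.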

I noticed that uipc_sem.c changed a lot in rev 1.34, but an inspection didn't lead me to believe that the changes would alter the behavior I'm seeing.




>Release-Note:
>Audit-Trail:
>Unformatted:

