:: horsing around with the C++ programming language

The sad state of pthread_mutex_t on OS X

[25th August 2012]

Recently I came across the concept of a benaphore and decided to play around with it. A benaphore is a kind of mutex that should in theory be faster in very-low-contention scenarios.

The implementation is very simple:

class benaphore : uncopyable
{
public:
        benaphore() : sema(0 /* initial value */), count(0) { }

        void lock() { if (++count > 1) sema.acquire(); }
        void unlock() { if (--count > 0) sema.release(); }

private:
        semaphore sema;
        atomic<uint32_t> count;
};

When there's no contention, locking and unlocking is as simple as incrementing and decrementing an atomic integer, potentially making it suitable for the protection of rarely accessed resources.

In reality it shouldn't be needed, as any decent mutex implementation is going to have a fast-path that avoids interaction with the kernel, anyway. But out of curiosity I benchmarked it against a pthread_mutex_t on OS X.

The benchmark code was very simple. A bunch of threads run the same thread_body() function simultaneously:

template<typename Mutex>
struct shared_stuff
{
    shared_stuff(uint32_t increments) : increments(increments), total(0) { }

    const uint32_t increments;

    char cache_line_separation1[64]; // put the mutex on its own cache line
    Mutex mtx;
    char cache_line_separation2[64]; // put the mutex on its own cache line

    uint32_t total;
};

// ...

template<typename Mutex>
void *thread_body(void *opaque_arg)
{
    shared_stuff<Mutex> &stuff = *static_cast<shared_stuff<Mutex> *>(opaque_arg);

    for (uint32_t i = 0; i != stuff.increments; ++i)
    {
        stuff.mtx.lock();
        ++stuff.total;
        stuff.mtx.unlock();
    }

    return 0;
}

The semaphore I used in the benaphore implementation was dispatch_semaphore_t from libdispatch. Given the high degree of lock-contention, I expected the pthread_mutex_t implementation to out-perform the benaphore by a comfortable margin. So I was quite surprised to see this:

quad-core i5 iMac, OS X 10.7.4
Best of 5 runs.

g++ benchmark.cpp -o benchmark -O3 -Wall -Wextra -ansi -pedantic -DNDEBUG=1 -DNOCHECKS=1

            pthreads mutex    benaphore
1 thread        0.651s         0.291s
2 threads      84.964s         2.253s
4 threads     207.882s         5.962s
8 threads     547.533s        25.579s

In the high-contention situations where a mutex should out-perform a benaphore, we're actually seeing a performance difference of more than an order of magnitude in the opposite direction.

A new OS X mutex implementation

I'm not sure what's going on inside the OS X pthread_mutex_t, but these results suggest we might be able to implement a superior mutex without too much difficulty:

class mutex2 : uncopyable
{
public:
        mutex2() : sema(0), count(0) { }

        void lock()
        {
            for (unsigned spins = 0; spins != 5000; ++spins)
            {
                uint32_t unlocked = 0;
                if (count.compare_exchange(unlocked, 1u))
                    return;

                sched_yield(); // #include <sched.h>
            }

            if (++count > 1)
                sema.acquire();
        }

        bool try_lock()
        {
            uint32_t unlocked = 0;
            return count.compare_exchange(unlocked, 1u);
        }

        void unlock()
        {
            if (--count > 0)
                sema.release();
        }

private:
        semaphore sema;
        atomic<uint32_t> count;
};

Here, I've taken the benaphore and simply added a spin-lock prologue to its lock() method. The value of 5000 for the spin count is roughly tuned to this test, but it might not be such a bad value for the general case, as small critical sections — like the one in this benchmark — should be the goal wherever locks are needed.

New Hotness

Adding the results to the benchmarks table:

quad-core i5 iMac, OS X 10.7.4
Best of 5 runs.

g++ benchmark.cpp -o benchmark -O3 -Wall -Wextra -ansi -pedantic -DNDEBUG=1 -DNOCHECKS=1

            pthreads mutex    benaphore     mutex2
1 thread        0.651s         0.291s       0.330s
2 threads      84.964s         2.253s       0.905s
4 threads     207.882s         5.962s       5.768s
8 threads     547.533s        25.579s       4.274s

Here we're as fast as or faster than the benaphore in all cases. Somewhat curiously, when using mutex2, the benchmark program finishes faster with 8 threads than it does with 4. This may be a hint that further improvements are possible. Suggestions are very welcome, but I'm happy to leave it there for the time being.


One downside to moving away from the pthreads mutex is that we lose some of the debugging and scheduling features of pthread_mutex_t.

If condition variables are needed, we also have to reimplement those from scratch. I'll perhaps return to that in a future post, but for the impatient, I've been using the implementation from this paper (PDF) with success. It can also be used to build a condition variable on older versions of Windows, where the CONDITION_VARIABLE structure is not yet available.



Dave Johansen

[28/08/2012 at 20:28:53]

Very interesting. As far as another data point goes, I just ran the same test (well, at least the pthread_mutex_t part of it) on RHEL 5.5 and RHEL 6.3 and while it doesn't appear to be as bad, it definitely doesn't appear to scale well with increasing thread count.

Dave Johansen

[29/08/2012 at 20:39:23]

The flag that's passed to the compiler is NOCHECKS but in the code it's using NO_CHECKS, so I'm guessing that the tests were actually run with the checks still happening.

I also emailed you a version that's ported to build on Linux, so you could hopefully make it available as an alternate download.

Thanks for making the original test available.


[29/08/2012 at 22:28:39]

Thanks Dave!

Face-palming somewhat about NO_CHECKS/NOCHECKS...

I've attached your Linux benchmarks and modified the original file to use the same spelling of 'NOCHECKS'. I started to re-run them so that I could update the numbers, but while things were coming out a little different, qualitatively things were much the same. So I decided against it in the end. Also, I couldn't be bothered to wait another hour for the pthread_mutex_t runs to finish :P

Somewhat interesting is that some of the benaphore runs were a little slower after the checks were removed. I'm not sure I could speculate as to why without thinking on it for a while... bus traffic ... mumble mumble ... kernel timeouts ... mumble mumble ... :)

Dave Johansen

[30/08/2012 at 01:35:01]

No worries. Just chalk it up to the benefits of open source in action, right?

Here are the results from running on a Dell Precision m6500 (i7 X940) with Hyper Threading disabled on RHEL 5.5:

            pthreads mutex    benaphore     mutex2
 1 thread        0.53s          0.37s        0.34s
 2 threads       2.70s          7.00s        0.78s
 4 threads       6.78s         17.95s        2.25s
 8 threads      17.51s         27.94s        4.07s
16 threads      29.58s         56.09s        8.07s

Here's the results from running on the same machine with RHEL 6.3:

            pthreads mutex    benaphore     mutex2
 1 thread        0.43s          0.28s        0.28s
 2 threads       2.45s          4.94s        0.74s
 4 threads       9.24s         10.14s        3.48s
 8 threads      31.34s         52.45s        4.47s
16 threads      60.80s        108.37s        8.16s

I ran it with HyperThreading disabled, because if I don't then it appears that the scheduler will put the threads on the first logical processors, so the bulk of the work is being done by 1 or 2 of the processors instead of all 4.

Also, the degradation in performance from RHEL 5 to RHEL 6 was not expected, so I opened a bug against RHEL 6 which can be found here.

Dave Johansen

[02/12/2015 at 18:23:50]

In case anyone is interested, the code I used to do the above testing can be found here.
