Lock-free algorithms overview and (semi-) lock-free stack implementation

By Vladimir Pavluk.

In this article I would like to discuss the topic of lock-free algorithms, and particularly, a lock-free stack implementation. In fact, this implementation is lock-free only by name; there is a lock, it just is not obvious, and the question “why” will be clearly discussed further. All examples are given in the C programming language.

For those unaware of lock-free concepts, I’m going to briefly describe the matter in question and why is it useful. In multithreaded applications, it is very common that several concurrent threads require access to a single resource (memory, object, etc.): a processing queue would be a good example. To keep this shared data consistent we need to keep it from simultaneous uncoordinated changes. Usually, the means to coordinate such access, or synchronize it, are implemented using various kinds of locks: mutexes, spinlocks, etc.), which fully lock access to the data in question when one of the threads accesses it and unlock after the changes are completed.

I won’t go into differences between mutexes and spinlocks… just want to mention that the principle stays the same whatever kind of locks is used.

Contrarily to the lock principle, lock-free algorithms use atomic operations like CAS (Compare And Swap) to coordinate concurrent access and keep the data consistent. As a consequence, lock-free algorithms are usually faster than their mutex-based counterparts, and at every stage, the lock-free algorithms guarantee forward progress in some finite number of operations. And that would be ideal if it weren’t highly complex to make a correct implementation, starting from so famous ABA problem and ending with possible access to free memory segments and program crashes. So, solely because of the complexity, lock-free algorithms are still not used everywhere.

But that was just a preamble, so let’s dive into the topic. I’ve encountered a lot of lock-free stack implementations all over the Internet. Unfortunately, a lot of them are undoubtedly non-working.  Some of them are “naive” (the implementation is called “naive” if it does not take into account the ABA’ problem), and only few are really working and useful.

So let’s see why it is so and why finding problems in lock-free algorithms is a riddle.

One of the biggest problems in lock-free implementations is the dynamic memory management. We need stack nodes to be allocated on the heap, and if we don’t want memory leaked, to be deleted when they are not used anymore. Obviously allocation rarely causes issues, but deleting nodes can present a real problem. Let’s take a look at a “naive” implementation of a lock-free stack in the C programming language:

void push(void *data) {
  struct node *n = malloc(sizeof(struct node));
  n->data = data;

  do {
    n->next = head;
  } while(!compare_and_swap(&head, n->next, n);
}

void *pop() {
  struct node *n;
  void *result = NULL;

  do {
    n = head;
  } while(n && !compare_and_swap(&head, n, n->next);

  if(n) {
    result = n->data;
    free(n);
  }

  return result;
}

This implementation is called “naive” because it assumes that if CAS succeeds, we’ve got the data we intended to get. In fact, there are a lot of scenarios where head is the same as at the beginning of the iteration, but implies completely different data.

Let’s imagine the situation when one of the threads saved the head into its n variable, and took n→next, but have not yet called compare_and_swap. Then the thread yields control to other threads, one of which pops and deletes head, the other pops and deletes head→next, and the third one pushes new element on stack, and the memory allocates at the address of the old head (which was freed by the first thread). Then the control passes back to the first thread.

Now head will be the one compare_and_swap expects in n, but n→next will be pointing to the memory which was freed by the second thread. So when the pop operation succeeds here, head will be pointing to that deleted memory area, and the program will sooner or later crash.

Many sources call this problem an ABA’ problem. This name mirrors the sequence we see in the data, which is A, then B, then A’ (which looks like A, but in fact is not actually A). This is a real hassle for those starting to dig into lock-free algorithms implementation! The most common way to solve this problem is having a tag in addition to the data, which changes with each push, so even if the pointer to the data is the same, the tag will be different, which in the case of ABA’ would be what distinguishes A from A’.

This implementation is also prone to the other type of problem: the memory deallocation problem. To demonstrate it, let’s assume the situation where one of the threads saves head to n and yields control to other threads. The other thread successfully popped from the stack, so head is changed and the previous head element is deleted. When the control returns back to the first thread, n will be pointing to the freed memory area, and the attempt to take n→next will address the invalid memory, which is obviously not good for a program and can lead to a crash.

This problem can be solved different ways. Some use the hazard pointers list, which stores the pointers that can’t be deleted at the time. Some temporarily replace the head with some ‘magic’ value that prevents reading and deleting it from the other thread. But the implementation that I suggest uses the fact that access to the stack is done via a single element – top of the stack (head). So relative to the other containers like queues or lists, the gain from a lock-free approach is not that huge.

That is why I suggest a combined approach to the stack: the lock-free writing (pushing), using CAS, and spin-locked reading (popping) which prevents concurrent simultaneous reading. Spin-locked popping in a single thread means that we’re protected from accidentally accessing deallocated memory, and that also means that while we’re reading, we can’t remove and then insert removed elements, which means that the ABA’ problem is also solved. In other words, I suggest the semi-lock-free algorithm, which is lock-free in its easy part, and using lock in its most error-prone part.

One possible implementation is as follows:

void mt_stack_push(struct mt_stack *top, void *data) {
  struct mt_stack *tb, *old;
  tb = malloc(sizeof(struct mt_stack));
  tb->data = data;

  old = top->next;
  tb->next = old;

  while(!__sync_bool_compare_and_swap(&top->next, old, tb)) {
    usleep(1);
    old = top->next;
    tb->next = old;
  }
}

void* mt_stack_pop(struct mt_stack *top)
{
  struct mt_stack *current;
  void *result = NULL;

  // Acquire the spinlock
  while(!__sync_bool_compare_and_swap(&top->stack_mutex, 0, 1)) {
    usleep(1);
  }

  current = top->next;

  // We can't pop and delete one element, because it's read-locked
  // But it can change because the push operation is lock-free
  while(current && !__sync_bool_compare_and_swap(&top->next, current, current->next)) {
    usleep(1);
    current = top->next;
  }

  if(current) {
    result = current->data;
    free(current);
  }

  // Release spinlock
  while(!__sync_bool_compare_and_swap(&top->stack_mutex, 1, 0)) {
    usleep(1);
  }

  return result;
}

This implementation was tested along with a correctly-implemented purely lock-free algorithm and the algorithms using spin-lock and mutex locking. The time of the execution for all of the mentioned algorithms is following (and fluctuated only insignificantly with the number of runs):

mutex:
real 0m1.336s
user 0m1.173s
sys 0m3.628s

lock-free:
real 0m0.533s
user 0m0.792s
sys 0m0.046s

spinlock:
real 0m0.520s
user 0m0.630s
sys 0m0.018s

semi-locked:
real 0m0.353s
user 0m0.360s
sys 0m0.075s

The fact that lock-free and spinlock algorithms differ so little is explained by the fact that stack has a single point of access (top), which I mentioned before. So why is the lock-free approach slower? Because of the all the guards against deleted pointers and the ABA’ problem that I mentioned.

The conclusions follow: before implementing a lock-free algorithm, an analysis should be performed to determine whether or not the container allows multiple access points and simultaneous access from several threads. It is possible that in this particular case (like in the case of the stack), lock-free algorithms will only add hassle, not providing any significant gain in performance.

This article also shows how mixed locked + lock-free approaches could be reasonable in implementing thread-safe concurrent algorithms.