By Pavel Panchekha

Share under CC-BY-SA.

The Levels of Concurrency

Today, Hacker News brought me a talk by Rob Pike about concurrency versus parallelism. It seems a lot of people are confused about the distinction, even though it is well-trod ground. But it is also important to understand that concurrency and parallelism are two levels of a much larger tower.

Fundamentally, every level is about increasingly decoupled agents working together to get some work done. So at the lowest level of our tower we have synchronous code, where there is only one agent. Note that synchronous code can still be written with threads, coroutines, or events; what matters is that, conceptually, a single agent does all the work. At this level, problems are relatively simple, since they are isolated to that one agent.
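For concreteness, here's a tiny baseline sketch (Python is my own choice; the post doesn't pick a language, and the work function and item list are stand-ins): a single agent transforms a list of numbers one element at a time, and every question about the program can be answered by reading it top to bottom.

    # Synchronous: one agent, one step after another.
    def work(x):
        return x * x                        # stand-in for some per-item computation

    items = list(range(10_000))
    results = [work(x) for x in items]      # a plain sequential map
    print(results[:5])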

A step up from synchronous code is parallel code, where multiple agents work at the same time, but they do not need to communicate with one another. An example would be applying a filter to an image; usually each pixel in the resulting image can be computed separately, so you can have each agent compute some number of pixels without talking to any other agents. Problems here become more complex, but usually each agent executes the same code, in the same way, so bugs can still be localized relatively well. At this level, you worry about things like using parallel map constructs, building trees instead of lists, and properties like associativity. A good place to start learning about this would be Guy Steele's presentation, "How to Think About Parallel Programming – Not!". Functional programming is often proposed as a solution to problems of parallelism.
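A minimal sketch of that image-filter example (Python again; the toy brighten filter and the flat list of "pixels" are my own stand-ins for a real image): the same kind of per-element map, but farmed out to worker processes that never talk to each other.

    # Parallel: many agents, same code, no communication between them.
    from concurrent.futures import ProcessPoolExecutor

    def brighten(pixel):
        return min(pixel + 40, 255)

    if __name__ == "__main__":
        image = list(range(256)) * 100      # stand-in for grayscale pixel data
        with ProcessPoolExecutor() as pool:
            result = list(pool.map(brighten, image, chunksize=1000))
        print(result[:8])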

But problems that can be solved without communicating are relatively rare; once we allow communication, we're doing concurrent programming. For now, though, we'll still assume that communication is immediate for arbitrarily-sized data. This is what happens in a threaded environment, where all your agents share an address space. Concurrent programming can happen even with a single thread of execution: caching and buffering, for example, are standard designs in single-threaded programs, but they involve doing two conceptually different things at the same time. Concurrency is much harder than parallelism, since it is a semantic issue. Instead of having a program that is correct no matter how it is executed (and then trying to pick the best way to execute it), we now have a program that is only correct if the communication between agents is handled properly. If you're interested in solving problems at this level, you'll want to learn about locks and their associated problems: deadlock, livelock, priority inversion, and the performance cost of fine-grained locking. Then you'll want to learn about lock-free programming[1] and software transactional memory. There's also the "communicating sequential processes" model, closely related to process calculi like the π-calculus. A lot of programming language theory focuses on this level.
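As a minimal illustration of the semantic issue (a sketch of my own, not taken from any of the papers above): two threads share one counter, and the lock is the communication that makes the result well-defined. Remove it and the program still runs, but its meaning changes.

    # Concurrent: shared state, so agents must coordinate.
    import threading

    counter = 0
    lock = threading.Lock()

    def worker(n):
        global counter
        for _ in range(n):
            with lock:                      # the communication point between agents
                counter += 1

    threads = [threading.Thread(target=worker, args=(100_000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(counter)                          # 200000 with the lock; unpredictable without it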

But the assumption that communication is free is often false, which brings us into the scary world of asynchronous programming. In a distributed system, it takes time for information to propagate; in fact, you no longer have a unified notion of "now". If, for example, you want a lock, you need a dedicated "locking" server, or you need to communicate with the other agents to agree on who holds it. So caching becomes much more important, and with it come questions of consistency: how similar do the states of the various agents have to be? At one end, all agents have the same state because they resynchronize after every operation; at the other, there are only vague, eventually-consistent guarantees.
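Here's a rough sketch of why that hurts (everything in it — the pretend lock server, the delay constant — is invented for illustration): even a simple lock now costs round trips, because the agents and the "locking" server no longer share a "now".

    # Asynchronous: communication itself takes time.
    import asyncio

    NETWORK_DELAY = 0.05                        # pretend one-way latency, in seconds

    async def acquire(name, server_lock):
        await asyncio.sleep(NETWORK_DELAY)      # request travels to the lock server
        await server_lock.acquire()
        await asyncio.sleep(NETWORK_DELAY)      # grant travels back
        print(name, "holds the lock")

    async def release(name, server_lock):
        await asyncio.sleep(NETWORK_DELAY)      # release travels to the lock server
        server_lock.release()
        print(name, "released the lock")

    async def agent(name, server_lock):
        await acquire(name, server_lock)
        await asyncio.sleep(0.01)               # critical section
        await release(name, server_lock)

    async def main():
        server_lock = asyncio.Lock()            # state that only the "lock server" holds
        await asyncio.gather(agent("A", server_lock), agent("B", server_lock))

    asyncio.run(main())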

Since information takes time to spread, another agent may have failed without you knowing; your system can therefore suffer partial failure, where only some of the agents in the system work. This brings you into the realm of distributed programming. A good introduction is the Bayou project. Another place to start is the Paxos algorithm, which solves the problem of having agents agree on some value even when some of them fail[2]. Solving problems in distributed systems is very hard, simply because of the vast landscape of possible bugs: de-synchronization, distributed deadlock, and so on.
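This sketch is not Paxos (the real algorithm is considerably more subtle); it only shows the idea Paxos leans on, with names I made up: an operation succeeds once a majority of replicas acknowledge it, so some replicas can fail without stopping the system.

    # Distributed: some agents may simply not answer.
    import random

    class Replica:
        def __init__(self, name):
            self.name = name
            self.alive = random.random() > 0.3  # ~30% chance this replica is down

        def ack(self, value):
            if not self.alive:
                raise TimeoutError(f"{self.name} did not respond")
            return True

    def replicate(value, replicas):
        majority = len(replicas) // 2 + 1
        acks = 0
        for r in replicas:
            try:
                if r.ack(value):
                    acks += 1
            except TimeoutError as e:
                print("partial failure:", e)
        return acks >= majority

    replicas = [Replica(f"r{i}") for i in range(5)]
    print("committed" if replicate("x = 42", replicas) else "not enough replicas")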

We're not done! We can take our systems further still: we can stop trusting the code itself. If the agents can't necessarily trust each other, we have a federated system. Here, agents might lie and cheat to work toward their own ends, or they might be buggy in unexpected ways. Usually, federated systems involve either cryptographically ensuring that no one can lie, or checking with other agents that everyone is behaving honestly. Bitcoin is a good example of a federated system: we can't trust other people not to lie if lying gets them more money! A general way of building systems that tolerate malicious agents (often called "Byzantine faults", for obscure reasons) is described in Castro and Liskov's Practical Byzantine Fault Tolerance.
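As a toy of the Bitcoin flavor of this (proof of work, nowhere near the real protocol): instead of trusting an agent's claim, make the claim expensive to forge and cheap for everyone else to check.

    # Federated: don't trust agents; verify their claims.
    import hashlib

    def mine(message, difficulty=4):
        # Finding a nonce whose hash starts with `difficulty` hex zeros is costly.
        prefix = "0" * difficulty
        nonce = 0
        while True:
            digest = hashlib.sha256(f"{message}:{nonce}".encode()).hexdigest()
            if digest.startswith(prefix):
                return nonce, digest
            nonce += 1

    def verify(message, nonce, difficulty=4):
        # Checking the claim takes a single hash: cheap.
        digest = hashlib.sha256(f"{message}:{nonce}".encode()).hexdigest()
        return digest.startswith("0" * difficulty)

    nonce, digest = mine("Alice pays Bob 5")
    print(nonce, digest)
    print(verify("Alice pays Bob 5", nonce))    # True
    print(verify("Alice pays Bob 50", nonce))   # almost certainly False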

The layers are points along a continuum. A concurrent system still isn't necessarily a distributed one, since it may not tolerate partial failures or long latency in communication. A distributed system might still need all agents participating and trusting each other, and so not be federated. It is in fact rather sad that programming languages aim mostly to solve problems of concurrency and parallelism; why don't we have languages that try to make asynchronous, distributed, or federated programs easier to write?

What comes further up the ladder? Is there a way to weaken the assumptions still further? I don't see one, but maybe one day someone will. At least with the distinctions above, you can see where various technologies help you. Functional programming helps with parallel programming, functional-reactive programming with concurrent programming; neither helps with distributed or federated systems. Meanwhile, something like Paxos is necessary for distributed systems, and usable in federated ones (with modifications), but overkill for merely concurrent or parallel programs.

Edit: I’ve split “distributed” and “asynchronous” programming, to emphasize that failure brings with it a whole new can of worms.

Footnotes:

[1] The defining paper on the topic, Herlihy's Wait-free Synchronization, is surprisingly readable.

[2] Leslie Lamport gives a great description of Paxos in Paxos Made Simple, which also boasts one of the better abstracts among computer science papers.