Why Threads are a Bad Idea (for most purposes) (1995) [pdf] (stanford.edu)
175 points by erwan on June 13, 2017 | 148 comments

The morning paper had a nice set of blog posts about this:

Why they're equivalent (duals): https://blog.acolyer.org/2014/12/08/on-the-duality-of-operat...

Why threads are a bad idea: https://blog.acolyer.org/2014/12/09/why-threads-are-a-bad-id...

Why events are a bad idea: https://blog.acolyer.org/2014/12/10/why-events-are-a-bad-ide...

Unifying events and threads (in Haskell): https://blog.acolyer.org/2014/12/11/a-language-based-approac...

Unifying events and threads (in Scala): https://blog.acolyer.org/2014/12/12/scala-actors-unifying-th...

Over the last 5 years, during which I wrote LOTS of concurrent code, I went through various stages of thinking, and in each stage I had a different feeling about what the best model might be.

In the meantime I have the feeling that both threaded and single-threaded event-driven programming models work well, and that the main thing to avoid is mixing the two.

E.g. I found that using callback-based or push-style APIs (like Reactive Extensions) in multithreaded environments is not a great idea, since it often won't be obvious on which thread a callback will run and which kind of synchronization is needed. Not using callbacks and instead waiting/blocking for results works better. In a pure single-threaded environment it's generally easy to know what to expect from a callback or promise: it will be fulfilled from the same thread and the continuation will also run on the same thread. In a multithreaded environment it can get hard. E.g. a promise continuation (like Task.ContinueWith() in C#) could run on any thread depending on the setup. And blocking on promises (like with Task.Result) may be valid in some situations (where the promise is fulfilled from another thread) but not in others (where the promise is fulfilled from the same thread). Using only synchronous primitives in multithreaded environments (as is common in Go) avoids these pitfalls and questions.
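To make that last point concrete, here's a minimal Java sketch (class and variable names are mine, using CompletableFuture-era java.util.concurrent rather than C#'s Task) of why blocking on a promise is only safe when it's fulfilled from another thread. With a single-thread executor, blocking on a task that is queued behind you can never complete; the timeout below is only there so the demo terminates instead of deadlocking.

```java
import java.util.concurrent.*;

public class BlockingOnFutures {
    public static void main(String[] args) throws Exception {
        ExecutorService single = Executors.newSingleThreadExecutor();

        // Safe: the future is completed on another thread, so blocking here is fine.
        Future<Integer> fromOtherThread = single.submit(() -> 42);
        System.out.println(fromOtherThread.get()); // 42

        // Unsafe: the inner get() blocks the only thread that could ever run
        // the inner task, so this would deadlock forever without the timeout.
        Future<String> outer = single.submit(() -> {
            Future<String> inner = single.submit(() -> "never runs");
            return inner.get(100, TimeUnit.MILLISECONDS); // times out
        });
        try {
            outer.get();
        } catch (ExecutionException e) {
            System.out.println(e.getCause().getClass().getSimpleName()); // TimeoutException
        }
        single.shutdownNow();
    }
}
```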

Sometimes having a single-threaded event loop even in a multithreaded environment makes sense, e.g. for GUIs (and other applications which have lots of shared state). In that case it should be ensured that all objects living on the event loop are only accessed from other objects on the same thread, and are not involved directly in function calls from other threads. If that's required, posting events or function calls to the event-loop thread is a much saner approach than trying to add low-level multithreading synchronization primitives everywhere.
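Here's what "posting function calls to the event-loop thread" can look like, as a hand-rolled Java sketch (all names are mine): other threads never touch loop-owned state directly; they post runnables that only the loop thread executes.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class EventLoop {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private String title = "initial";   // owned by the loop thread only

    void post(Runnable task) { queue.add(task); }   // safe from any thread

    private static final Runnable STOP = () -> {};

    void runLoop() throws InterruptedException {
        while (true) {
            Runnable task = queue.take();
            if (task == STOP) return;
            task.run();                 // runs on the loop thread
        }
    }

    public static void main(String[] args) throws Exception {
        EventLoop loop = new EventLoop();
        Thread loopThread = new Thread(() -> {
            try { loop.runLoop(); } catch (InterruptedException ignored) {}
        });
        loopThread.start();

        // A worker thread mutates `title` only by posting to the loop.
        Thread worker = new Thread(() -> loop.post(() -> loop.title = "updated"));
        worker.start();
        worker.join();

        loop.post(() -> System.out.println(loop.title));
        loop.post(STOP);
        loopThread.join();
    }
}
```

GUI toolkits ship the same pattern built in (Swing's invokeLater, etc.); the point is that the queue serializes all access to the loop's state without any per-object locking.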

Another great read on threads: The Problem with Threads, by Edward Lee: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-....

He doesn't seem to have mentioned cooperative threads/green threads/non-preemptive threads in this series: http://wiki.c2.com/?CooperativeThreading

They are easier to program with than events. They still have problems: it's easy to run out of a very limited stack, and you have to yield your cooperative thread often to other tasks. But they are easier to maintain than a state machine (which is often implemented as a very big switch statement).
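For anyone who hasn't met the "very big switch statement" style: here's a tiny Java sketch of a made-up two-step protocol (event names and classes are mine) written as the event-driven state machine that cooperative threads let you avoid.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// A hand-rolled "read header, then body" protocol as a switch-based
// state machine. With a cooperative thread the same logic would be
// two sequential blocking reads.
public class SwitchMachine {
    enum State { WANT_HEADER, WANT_BODY, DONE }

    public static void main(String[] args) {
        Queue<String> events = new ArrayDeque<>();
        events.add("header:3");
        events.add("body:abc");

        State state = State.WANT_HEADER;
        int expected = 0;
        while (state != State.DONE) {
            String event = events.poll();
            switch (state) {            // all control flow lives in this switch
                case WANT_HEADER:
                    expected = Integer.parseInt(event.substring("header:".length()));
                    state = State.WANT_BODY;
                    break;
                case WANT_BODY:
                    String body = event.substring("body:".length());
                    System.out.println(body.length() == expected ? "ok" : "bad length");
                    state = State.DONE;
                    break;
            }
        }
    }
}
```

Two states are manageable; real protocol handlers written this way accumulate dozens, which is where the maintenance pain comes from.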

I'm using greenlets via gevent to handle a system that interacts with a number of http APIs, exposes an RPC API and has a data processing loop. I absolutely love the concurrency model. For cpu bound processes and short API timeouts, getting to control when my "threads" yield makes it very easy to avoid the common threading pitfalls.

It also means my application can remain very simple. Ie. I don't need to think about things like Celery or Redis or handling locking meticulously.

Cooperative multitasking is mentioned in "Why events are a bad idea" as counter-argument #3. It is considered under the threads category. It's unfortunate that only some languages support it.

> It's unfortunate that only some languages support them.

yes, you need runtime support for this feature. If it is impossible to change the runtime, then it can't be done for the language. The runtime may have problems working with a very limited stack, for example; or the runtime is big and difficult to change (like the JVM); or it has long library functions that cannot yield to other co-threads.

However, C/C++ or other languages that do not sit on top of a big runtime don't have that problem.

There's also this nice paper on this topic (unifying events and threads): http://salmito.com/papers/rb-resd2011/

Yes, might have worked in 1995. Now, however, when even your lowly phone has 4 (or more) processor cores and a full-fledged GPU...

Learn concurrent programming techniques — or perish. Threads and sync primitives are low-level, but important, and you have to understand them to figure out what compromises and biases were taken in higher-level models.

And, frankly, it isn't that bad (debugging existing code is bad, but playing with monitors and semaphores and critical sections is easy, as long as the code is small and isolated).

Yes, and no. Concurrency and parallelism are important, but threads (lightweight shared memory processes) are not the only solution.

Successful, popular concurrent platforms like Erlang/OTP, Nginx, and node.js eschew threads in favor of a single-threaded, async/non-blocking code model. These platforms simplify application-level code by avoiding the issues of thread synchronization and contention in a shared memory space, and instead provide/require isolated processes to exploit CPU-level parallelism.

The threading "isn't that bad" viewpoint generally comes from a limited understanding of the things that can go wrong - for instance, add "memory barriers" to the list of things to understand.

There's a good explanation of the problems with a common Java idiom "double-checked locking" at https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedL...

Of course, when things get complicated with lots of interactions between processes, or significant amounts of computation in a single-threaded process, some platforms require the user to manually yield, etc., basically trying to re-create preemptive threading provided by the OS.

Nginx and Erlang single threaded? Uhhhhh, that's just not true.

Nginx and Erlang most certainly use multiple threads by default. They both use thread pools. I'm not sure about Erlang but Nginx has a single master thread that does async select on the main socket. Once it receives a connection it hands off to worker threads. This is a traditional multithreading setup with control/worker threads.

Node doesn't support threads because JavaScript doesn't. Performance would certainly be better for many workloads if it followed a traditional multithreading approach like Nginx. Async != Single threaded. They're not related concepts. You get the best performance using both non-blocking calls and multithreading, an approach used by Nginx and Netty. They both use nonblocking IO on the main thread and thread pools.

Node requires you to manually yield by calling Response.end() because it doesn't support multithreading. This basically duplicates OS level pre-emption manually and is my single biggest criticism of node. The worker/master thread model allows the runtime to manage worker resources or just defer to the OS in a flexible way.

You are confusing threads with processes.

Processes are isolated and do not share memory. Threads within a process share memory. For example, see https://msdn.microsoft.com/en-us/library/windows/desktop/ms6...

Nginx is single-threaded, multiple process: "Each worker process is single-threaded and runs independently" https://www.nginx.com/blog/inside-nginx-how-we-designed-for-...

Erlang code is single threaded. Each Erlang "process" is an isolated, share-nothing single sequence of instructions, and you don't use semaphores, locks, or critical sections in writing Erlang code. The BEAM VM is implemented as a multithreaded process, but this is not exposed to the Erlang code, and Erlang code can be transparently distributed across multiple instances of the BEAM VM (processes) even on separate hosts.

You're correct on node.js, and "basically trying to re-create preemptive threading provided by the OS."

Nginx recently implemented thread pools for blocking operations. The distinction between threads and processes depends on the OS.

The "nothing shared" model used in Linux forking actually shares the underlying memory until the processes diverge (copy on write). The difference between this and threads is that threads expose the shared memory by default.

IMO the difference between threads and processes is somewhat academic.

I don't know much about BEAM.

> IMO the difference between threads and processes is somewhat academic.

Shared memory vs. isolated memory is a huge difference. You use completely different sync/protection/IPC approaches with each, so programs look very different. And the copy-on-write optimization of logically isolated memory doesn't affect that. The beauty of it is that it speeds up forking and saves memory without changing your IPC-based programming model.

> IMO the difference between threads and processes is somewhat academic.

At a linux-implementation level the difference is almost irrelevant. At a code level the difference between your variables being shared by default versus "thread"-local by default is huge.

> Erlang code is single threaded. Each Erlang "process" is an isolated, share-nothing single sequence of instructions, and you don't use semaphores, locks, or critical sections in writing Erlang code. The BEAM VM is implemented as a multithreaded process, but this is not exposed to the Erlang code

The same is true for PHP.

No wonder Facebook developed HHVM for PHP and also uses Erlang with WhatsApp and other parts.

Well, Whatsapp servers have always been Erlang-based (well before the acquisition by Facebook) but your point stands.

"Nginx has a single master thread that does async select on the main socket. Once it receives a connection it hands off to worker threads"

Nginx has a master process and worker processes. The master process doesn't do anything but manage worker processes in case they die or need to be restarted on configuration change. A worker process runs a single event loop that accepts connections. It doesn't hand them off anywhere; it just schedules connection handlers to be run in the same event loop. Worker processes don't share memory (almost; there are some sysv shared memory segments for a few things, but this is irrelevant). Thread pools in nginx were added only relatively recently to do blocking file I/O on linux, which is kind of slow otherwise in some scenarios.

Mentioned in another response, but Node.js does indeed use threads/threadpool... It's used internally for all async IO, and by a number of binary modules as well.

Node is single threaded only in Javascript land. Internally it does use threads!

>Successful, popular concurrent platforms like Erlang/OTP, Nginx, and node.js eschew threads in favor of a single-threaded, async/non-blocking code model.

Ugh. As someone suffering through a node.js project right now, I'm continually chafing under this limitation. It's great if you're making a simple CRUD website, but you'll be beating your head against the wall if you have a problem domain with centralized resource sharing that prevents you from spamming processes willy-nilly.

And node.js code is ugly.

>The threading "isn't that bad" viewpoint generally comes from a limited understanding of the things that can go wrong - for instance, add "memory barriers" to the list of things to understand.

Actually, threading isn't very hard to do right. It's more a question of choosing a reasonable architecture. Sure, you can get yourself into a lot of trouble trying to interleave threads in an attempt to use multiple processors to speed up a single operation. Don't do that.

>There's a good explanation of the problems with a common Java idiom "double-checked locking" at https://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedL....

Yes, and if you read to the end of the article you'll see the problem it's addressing was fixed 12 years ago in Java 5.

As someone who's implemented reliable, massively parallel threaded systems in Java, when I hear people say threading is too complicated to be used safely I feel like a blacksmith who's just been told horseshoes are impossible to make even though he's been making them his whole career.

As to the ugliness of node.js code, I find that if you're using a transpiler (babel) or the latest 8.x, that async/await with promises tends to clean things up a lot.

Yeah, the Koa2 framework for Node.js is a beautiful way to handle requests in my opinion and it's built on async/await functionality from the ground up.

Yeah, re-reading my comment above, it feels like I'm saying that single-threaded async is the one-true-way, and I don't think that. Both threaded and async/CSP models are useful, and both do have weaknesses.

I've also implemented reliable, parallel, threaded systems in Java. But I've also found defects in code that I thought had been thoroughly exercised, especially when it has moved to a new environment or new level of parallelism.

There's often "one more thing" that can cause problems, and generally it can be resolved by one more concept or workaround (like adding "volatile" in Java 5).

And I'd suggest Erlang/Go, which are designed for CSP-style concurrency, for cleaner code than node.js. JavaScript-based platforms often feel like, "Wow, you can even make JavaScript do X" rather than, "X is really natural in JavaScript."

> Successful, popular concurrent platforms like Erlang/OTP, Nginx, and node.js eschew threads in favor of a single-threaded, async/non-blocking code model.

Erlang, the language, is threading model agnostic, but the main implementation has been, by default, M:N threaded on systems with multiple logical processors since, AFAICT, the R12B release in 2008. So characterizing it as single threaded is inaccurate.

Erlang the language provides single-threaded processes that do not have shared memory.

The BEAM VM hosts Erlang processes in a multi-threaded single process. This is an implementation detail/performance optimization.

The Erlang processes don't care if they are hosted in a single OS process or multiple processes or even multiple host machines. Erlang-the-platform simplifies application code by providing a single-threaded, share-nothing, async/non-blocking code model. (Edit: Yes, dfox, it's probably more accurate to say, Erlang provides a synchronous, blocking code model, since that's what the Erlang programmer sees)

> Erlang the language provides single-threaded processes that do not have shared memory.

Erlang processes share memory, and some message sends (local sends of binaries > 64 bytes in size) are implemented by sharing references to shared, reference-counted objects. This is safe because (not considering the process dictionary, which isn't shared) Erlang processes have no mutable state, and all that is shared is immutable values. So if the implementation wanted to, all sends could be references to shared memory (for small objects, the indirection and reference-counting overhead isn't worth the memory savings, so it's not done).

> The Erlang processes don't care if they are hosted in a single OS process or multiple processes or even multiple host machines.

It's not possible to write a high-performance program without a good understanding of the costs of primitives. If something might vary in cost by several orders of magnitude, it's quite hard to design for.

Network transparency has never been a good fundamental for well-performing application architecture. Networks are inherently far less reliable than shared memory, and have massively more latency, particularly on the tail end of the distribution.

From the programmer's view, synchronous blocking I/O and IPC is the natural way. That it is mapped by the VM onto some asynchronous/event-triggered loop is also an implementation detail.

Message passing in Erlang is asynchronous, and this is one very important detail that makes it good at concurrency, and equivalent to event-driven programming to some extent.

Message passing in Erlang is asynchronous in the sense that messages can be queued, but from the point of view of the receiving process, receiving a message is a synchronous function call that happens at a well-defined place in the process. This is in contrast to truly asynchronous message receipt, which would interrupt the process or start a new thread at arbitrary places in the code; that is the model used by some microkernels and similar systems. (This can be emulated in userspace by starting a background thread which runs some event loop; for example SDL, and optionally ALSA, use this pattern for audio I/O.)
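An Erlang-style mailbox with these semantics can be sketched in Java with a BlockingQueue (names are mine; this ignores selective receive): sends are asynchronous in that they just enqueue and return, but the receive happens only where the owning thread explicitly asks for it.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class Mailbox {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> mailbox = new LinkedBlockingQueue<>();

        Thread process = new Thread(() -> {
            try {
                // The "process" decides exactly where it receives; nothing
                // interrupts it at arbitrary points in its code.
                String msg = mailbox.take();
                System.out.println("received: " + msg);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        process.start();

        mailbox.put("hello");   // asynchronous send: returns immediately
        process.join();
    }
}
```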

In its original application space, the primary appeal of Erlang (as well as of Ericsson's custom CPUs in their exchanges) was that a user's interaction with the system (e.g. one phone call) can be implemented by one thread/process running normal sequential synchronous code, as opposed to as state in some complex asynchronous state machine (i.e. what is done in node.js) that is orders of magnitude more complex to program and understand. Another solution to this problem involves inventing some monads-and-arrows-reinvented machinery (async/await...) that hides the state machine away.

Multi-threaded programs have to be concerned with low-level details like memory ordering and barriers only when they do synchronization without using OS-provided utilities, or when they implement synchronization wrong. pthread_mutex_lock() is both a memory and an optimization barrier, and that is probably all one has to know for the vast majority of MT code.

> when they implement synchronization wrong.

Yes, but there are lots of ways to implement synchronization wrong because you don't understand memory barriers.

For instance, the article linked above includes this very simple Java code, which breaks because it doesn't account for memory barriers:

  // Broken multithreaded version
  // "Double-Checked Locking" idiom
  class Foo {
    private Helper helper = null;
    public Helper getHelper() {
      if (helper == null)
        synchronized(this) {
          if (helper == null)
            helper = new Helper();
        }
      return helper;
    }
    // other functions and members...
  }
[dfox: can't reply, guess we're nesting too deeply. Per the above, the simple rule has to be modified to "do not touch OR READ shared state outside of relevant critical section". Easy to make errors...]

There is a simple rule for that: do not touch shared state outside of the relevant critical section.

To some extent the cited problem is unique to Java, because the whole monitor concept just creates confusion about what synchronized really means (and also creates significant overhead by making literally everything usable as a mutex). A pretty common misconception is that synchronized is akin to some kind of read-write lock (in which case the code above would work as expected).
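The rule ("do not touch or read shared state outside the relevant critical section") applied to a trivial Java counter (names are mine) looks like this; the easy mistake is skipping the lock on the read path.

```java
public class GuardedCounter {
    private final Object lock = new Object();
    private long count = 0;

    void increment() {
        synchronized (lock) { count++; }
    }

    long get() {
        synchronized (lock) { return count; } // reads need the lock too
    }

    public static void main(String[] args) throws Exception {
        GuardedCounter c = new GuardedCounter();
        Thread[] ts = new Thread[4];
        for (int i = 0; i < ts.length; i++) {
            ts[i] = new Thread(() -> { for (int j = 0; j < 10_000; j++) c.increment(); });
            ts[i].start();
        }
        for (Thread t : ts) t.join();
        System.out.println(c.get()); // 40000
    }
}
```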

AFAIK Ericsson never had custom CPUs.

> Successful, popular concurrent platforms like Erlang/OTP, Nginx, and node.js eschew threads in favor of a single-threaded, async/non-blocking code model.

All these platforms have some problems. E.g. Nginx is a PITA if your processes need to communicate with each other. Erlang/OTP has this nasty mailbox problem (O(N^2) behavior of selective receive). Cooperative multitasking is a piece of Victorian-era technology that is painfully bad for various reasons, but people still tend to believe that it solves something.

On the other hand, modern operating systems are awesome. The thread schedulers are awesome: they can dynamically adjust thread priorities depending on the load and solve some nasty synchronization problems (like priority inversion or starvation) for you. It's not 1995; OS schedulers can switch threads in O(1), and most server apps can use a thread-per-connection approach without any problems. The only thing you need to get right is synchronization, but there are a lot of tools that can help (like ThreadSanitizer). It's much harder to get synchronization right with cooperative multitasking (good luck finding that priority inversion on an implicit lock caused by a dumb channel use pattern) than with normal threads.

Double-checked locking has been solved in Java since 1.5, which arrived more than a decade ago. The article you link to even says it's going to be solved.

"Solved" if you remember to mark the static as volatile.

It still proves the OP's point that this stuff can become complicated.
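For reference, here's a sketch of the standard post-Java-5 fix (class names are mine): the volatile field makes the publication of the fully constructed Helper visible to other threads, and the local variable keeps the fast path to a single volatile read.

```java
public class FooFixed {
    static class Helper {}

    private volatile Helper helper = null;

    public Helper getHelper() {
        Helper result = helper;            // one volatile read on the fast path
        if (result == null) {
            synchronized (this) {
                result = helper;           // re-check under the lock
                if (result == null) {
                    helper = result = new Helper();
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        FooFixed foo = new FooFixed();
        System.out.println(foo.getHelper() == foo.getHelper()); // true
    }
}
```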

That doesn't seem that bad to me. It's an explicit, unambiguous recommendation, one that can even be recognized by static analysis. It's not much worse than an API that requires that you call some particular method; the only part that's bad is that you can fail to notice the problem because the code will usually work.

What I worry about are things that can't be easily caught. I trust myself to write embarrassingly parallel code, and I've written some code that has real risks of deadlock. But shared data structures that have complex locking schemes still scare me.

It all depends on what problem you're solving. Note that you're focusing on libraries and languages for writing servers. But what about, say, linear algebra routines? How would you implement BLAS in a performant manner without threads?

Sometimes, threads are the preferable solution.

> Successful, popular concurrent platforms like [...] node.js, eschew threads in favor a single-threaded, async/non-blocking code model

Except that Node (by way of libuv) uses thread pools internally for things that it deems to be CPU-intensive enough to need it, like crypto.

Threads are necessary and pretty much unavoidable, but in my experience cooperative sharing (e.g., co-routines) results in systems that are much more reliable and far easier to reason about. Structured worker threads ("go away and tell me when you're done") with no shared resources also go a long way towards keeping systems reliable.

Don't get me wrong; I love a nice, concurrent and efficient system as much as the next guy. I've been in various kernels and embedded systems doing the whole spinlock/reader-writer-lock/mutex scene, and it's fun as hell. But in the end, if I can spend a few more cycles and not have to worry about race conditions or debug hairy crashes all the time, then I'm far more productive, and ultimately the customer is happier.
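The "go away and tell me when you're done" structured-worker pattern can be sketched in a few lines of Java (names are mine): the worker gets its own copy of the input, shares nothing while running, and hands back a result exactly once.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class StructuredWorker {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Nothing is shared while the worker runs; the Future is the only
        // rendezvous point, so there's no locking to reason about.
        Future<Integer> done = pool.submit(() -> {
            int sum = 0;
            for (int i = 1; i <= 100; i++) sum += i;
            return sum;
        });

        System.out.println(done.get()); // 5050
        pool.shutdown();
    }
}
```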

Are there any coroutine systems that present a sensible call stack and the rest of the familiar debugging experience? I've only encountered systems that do not, and found them an utter nightmare to reason about.

Threads aren't particularly low-level; they're more like a mid-level concurrency abstraction. The whole point of a thread is that you get to use abstractions (like semicolons, variable access, function calls, and program stacks) that are familiar from sequential programming in a concurrent setting.

At the machine level, these are all done with interrupts, shared memory, CAS & LL/SC instructions, and memory barriers, which actually tend to map more closely to the event model.

In the middle, there's a whole family of concurrent data structures - semaphores, producer/consumer queues, mutexes, transactions, etc. - which really are worth learning about.
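Of that middle family, the producer/consumer queue is probably the most used; a minimal Java sketch (names are mine) with a bounded BlockingQueue, which gives you both the hand-off and the backpressure:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ProducerConsumer {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(4); // bounded: producer blocks when full
        final int N = 10;

        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= N; i++) queue.put(i); // blocks when full
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        int sum = 0;
        for (int i = 0; i < N; i++) sum += queue.take(); // blocks when empty
        producer.join();
        System.out.println(sum); // 55
    }
}
```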

Yup. Don't be scared of threads, just make sure to be diligent about using them properly. Learning decent basic multithreading techniques is something a lot of devs can pick up with some effort in a couple of weeks to months at most. It's like most things, don't dig yourself a hole and you won't have to dig yourself out of it.

Well, yes. I learnt memory management and can C/C++ like a boss, but for many types of applications, doing it yourself is a bad idea.

I've learnt and used low-level async primitives for years, but for many (most?) applications, as long as you know what you are doing (which I guess is your point), using them directly isn't a very good idea anymore.

You want to know what happens when you light up all 4 + GPU? Your battery life gets decimated.

You can't even do that on any modern mobile device without thermal limiting kicking in quickly (although GPU/SoC vendors love to turn that off in standard benchmarks).

Note that it's from 1995. Back then, many people still thought that machines with multiple processors were exotic things of little interest even in the enterprise. That list most notably included one Linus Torvalds. It also included the paper's author, John Ousterhout, who should have known better - even more so since Stanford was one of the places where such things were not so exotic. He even says, right up front, that threads still have their uses when you need true CPU concurrency. Now that's a common case. Generalizing from this presentation to the current day is probably a worse idea than threads ever were.

Your comment implies that the presentation says you shouldn't use threads. But that's not what it says. The conclusion is that you should only use threads when you are actually trying to achieve CPU concurrency.


In the context of that time, "only when you're trying to achieve CPU concurrency" was meant as "only if you're doing something exotic" (though it wasn't as exotic as Ousterhout thought - I'd already been working on MP machines for years). In the context of this time, it means "only" in an extremely common case. The difference matters. People who read the presentation without putting it in historical context will be misled, even if Ousterhout weasel-worded his conclusions because he kind of knew he was wrong already.

Today, there are many places that have big clusters of multi-processor machines, and none of their code is threaded. They get CPU concurrency by message passing between heavyweight processes and with other nodes.

If you're talking about MPI and its use in HPC, you might've been mostly correct 5 years ago. Now, HPC applications at the largest sites are hybrid MPI+OpenMP (or some other threading model, or they use GPUs). On Xeon Phi systems like the latest Crays, you pretty much need OpenMP to take advantage of all the cores. The PPC A2 processors on the (formerly #1, now #4) Blue Gene/Q system at LLNL won't get full instruction issue per cycle unless you use multiple hardware threads (the threads are exposed to the hardware and each core will only pull instructions from a single process at a time, so this is somewhat unlike hyperthreading on Intel chips).

There can be overheads to running very large numbers of MPI processes, depending on how scalable the MPI implementation is, and there are overheads to having too many threads contend for memory on a single node. At least in the current state of things in HPC, there is a delicate balance to choosing the right process-thread ratio, and it differs from application to application and depends on the node count for the job. People struggle to find the sweet spot, but we're definitely not running all MPI anymore.

I wasn't talking about MPI, but if you want to talk about MPI, note that OmniPath gets faster as you have more cores talking to it, and it takes full advantage of delivering messages to the memory near the receiving process, so you'd better not move threads or processes very far.

If you look in your HPC historybook, you'll notice that I was the system architect for InfiniPath.

> there are many places that have big clusters of multi-processor machines, and none of their code is threaded.

A bit over-general, don't you think? There are many places that have big clusters full of single-threaded processes. There are also many places that have big clusters full of multi-threaded processes. I don't know which group is the biggest, but it doesn't matter as far as Ousterhout's argument is concerned. What matters is that, contrary to the context in which that argument is framed, wringing performance out of multiple CPUs in a single system is an almost universal need nowadays. Many of them furthermore need high levels of intra-process as well as inter-process parallelism for that purpose. Threading is not the only way to achieve that, but he never even considers other alternatives. No matter what hairs are split or who tries to claim the biggest credentials, the OP remains seriously time-bound and its applicability to modern systems is weak.

I assume you're referencing MPI? It all depends on your problem domain. There are plenty of classes of problems where message passing incurs far more overhead than just sharing memory.

MPI is indeed something I used to do, but I was thinking more of large NoSQL setups like blekko's search engine, which was 100% event-driven code running in separate, heavy-weight processes.

(OK, in the guts, we did have some utility software that did async disk I/O using a few C threads per heavyweight process, but that was due to the limited kernel API more than us wanting threads.)

Sure, but the point is that outside of embarrassingly parallel problems you need shared memory access. You can simulate that with message passing, but for many problems the overhead of message passing is too great. I'm specifically thinking of linear algebra routines. I'd like to see someone write a performant BLAS implementation that doesn't make extensive use of threading.

Well, the example I was thinking of (from after I left HPC) was a search engine, in which most of the computation took place in a NoSQL database. And it did great with event-driven programming and heavy-weight processes. You can call it embarrassingly-parallel if you like, from an HPC point of view it is. And that's a fine thing.

Glad to see HPC so represented here on HN, but an HPC counter-example does not diminish an example. To put this another way: I wasn't saying that YOUR problem could be solved without threads, I said MY problem could be solved without threads.

Completely agree, and I prefer to use something like message passing whenever possible because for a lot of problems it is the natural solution and has fewer pitfalls than using raw threads.

Great. Nothing like a vigorous discussion that turns out to ... actually be an agreement.

Message passing lets you amortize synchronization and should always be faster if properly implemented.

Not really, no. You've still got the overhead of copying, which can be significant. There's no way you could implement something like, for example, Gaussian elimination with message passing in a way that's just as fast as using locks and barriers.

I totally agree that zzzcpan was over-generalizing, but so are you: is it really the case that the Linpack benchmark (which uses arrays as large as memory) on the latest SGI shared-memory thing uses locks and barriers?

I think the cost of locks is so high, that even some primitive runtime zero-copying mechanism won't add significant overhead. After all, if you have to do something on another core, you have to copy it there, it's not really a physical shared memory.

Maybe not locks and barriers specifically, but I'd be shocked if it's using message passing, particularly because of the overhead of copying. If anyone has evidence to the contrary I'd be happy to hear it.

Wow. I guess you aren't familiar with the history of large SGI Origin and Altix machines being mostly used as pure MPI machines because that's almost always fastest? Also, Linpack doesn't involve much data motion. Also, moving data over a shared memory interconnect is as much of a copy as moving data over a message passing interconnect.

I was going to look up the current details for you, but I don't see an SGI Ultraviolet system in the current Top500. How large of a machine do you work with?

I was thinking about machines with a shared memory bus (I'm 90% sure that's the right term) so yeah not things in the Top500. I'm thinking about your standard 64-core servers. I know that in the large MPI is best because getting that many cores to use the same memory bus just isn't feasible. Sorry for the confusion, I didn't read your comment carefully.

Clearly they never wrote for the butterfly[0] ;).

[0] https://en.wikipedia.org/wiki/BBN_Butterfly

John Ousterhout is the creator of Tcl, which embraces an event-driven programming model. I think that gives some perspective on his opinion.

That said, Tcl was one of the first scripting languages to get a very nice thread model - see AOLserver [1].

[1] https://en.wikipedia.org/wiki/AOLserver

We used Tcl threads in one of our programs to control various hardware things while the main UI responded to events sent from the threads. Everything worked very well, especially for a program in a scripting language like Tcl.

Twenty years later this seems to still be good advice. Martin Thompson talks about this in his Mechanical Sympathy talk.

He says the first thing he does, as a performance consultant, is turn off threading. Claims that's often all he needs to achieve the desired improvements...

It's a good talk, I highly recommend it.


Who creates threads just for fun? I only use them when I really need them (in C++), so there's really no alternative for me.

Yep. If all you have is a blocking sys call and you want performance you gotta have threads. If you want to use more than a single core at a time for computation, you gotta have threads. [And I use the word threads loosely, you could have multiple processes as well, or whatever OS primitive lets you get concurrency.]

A 20 core CPU can do some things 20 times faster than a single core (multiply matrices e.g.). If you're trying to do those things and you limit yourself to a single thread - good luck!
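A hedged toy illustration in Go (a parallel sum standing in for a real kernel like matrix multiplication; the function name and chunking strategy are mine): each worker writes only its own result slot, so the cores never contend.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// parallelSum splits the slice into one chunk per worker and sums the
// chunks concurrently. Each goroutine writes only its own slot in
// partial, so no locking is needed.
func parallelSum(xs []int, workers int) int {
	partial := make([]int, workers)
	var wg sync.WaitGroup
	chunk := (len(xs) + workers - 1) / workers
	for w := 0; w < workers; w++ {
		lo, hi := w*chunk, w*chunk+chunk
		if lo > len(xs) {
			lo = len(xs)
		}
		if hi > len(xs) {
			hi = len(xs)
		}
		wg.Add(1)
		go func(w, lo, hi int) {
			defer wg.Done()
			for _, x := range xs[lo:hi] {
				partial[w] += x
			}
		}(w, lo, hi)
	}
	wg.Wait() // join all workers before combining
	total := 0
	for _, p := range partial {
		total += p
	}
	return total
}

func main() {
	xs := make([]int, 1000)
	for i := range xs {
		xs[i] = i + 1
	}
	fmt.Println(parallelSum(xs, runtime.NumCPU())) // 500500
}
```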

The bad idea is taking a threaded language and retrofitting events, which Python is doing. This results in an even worse mess. Python now has two kinds of blocking, one for threads and one for events. If an async event blocks on a thread lock, the program stalls.

Or taking a event-driven language and retrofitting concurrency, which Javascript is doing. That results in something that looks like intercommunicating processes with message passing. That's fine, but has scaling problems, because there's so much per-process state.

Languages which started with support for both threads and events, such as Go, do this better. If a goroutine, which is mostly an event construct, blocks on a mutex, the underlying thread gets redispatched. There's only one kind of blocking.
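A minimal illustration of that "one kind of blocking" point (my own toy, pinned to a single OS thread via GOMAXPROCS to make the scheduling visible): when one goroutine blocks, the runtime simply runs another on the same thread.

```go
package main

import (
	"fmt"
	"runtime"
)

// Even with a single OS thread, a goroutine blocking on a channel (or
// a mutex) just yields; the runtime redispatches the thread to other
// goroutines, so the program never stalls.
func main() {
	runtime.GOMAXPROCS(1)
	ready := make(chan string)
	go func() { ready <- "worker ran" }() // blocks until main receives
	fmt.Println(<-ready)                  // main blocks too; runtime sorts it out
}
```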

Almost no one likes Tcl, but I think it used (uses) a nice model for concurrency. Much better than Python's.

Interpreter as a library, all state encapsulated in a reentrant interpreter object, different interpreters running in different threads that can send messages to each other, no need for expensive serialization/deserialization (like in Python) because the interpreters are in the same process, ... so much nicer.

From a Linux perspective, threads and processes are essentially the same construct. The major difference is the set of flags that are passed when the process is created. Oversimplifying, if shared memory is requested, then it's a thread. Otherwise, it's a process. Meaning forking servers are multi-threaded.

On other operating systems, specifically those starting with the letter W, there's a major distinction. There are other constructs as well, such as "fibers".

Now, today's world is different from what it was in 1995. We used to have a single core, so threads and multiple processes were only a logical construct. Now we have multiple cores, so we should, at a bare minimum, spawn multiple processes/threads. What's running inside them can then be debated as if it were 1995.

Obligatory CSP[0] reference.

There are only a handful of examples that I can think of where threading (multiprocessing, concurrency, and other names for it) is useful.

[0] http://www.usingcsp.com/

The key point is unstructured shared data is a source of errors in a threaded program.

If I had a dollar for every time I heard "thread programming is hard."

I've programmed using threads for 23 years. I've never had a non-trivial debug issue caused by trouble using semaphores, mutexes, and shared data. It's no harder than writing a hash table or balancing a tree.

I think Niko Matsakis addresses this well in his talk at C++Now: https://youtu.be/lO1z-7cuRYI

It's reasonable for one person to write a correct multi-threaded program, even a large one. But when many different people start working on the same large, multi-threaded program, correctness becomes much more difficult. People start forgetting which functions and data structures are and are not threadsafe.

> People start forgetting which functions and data structures are and are not threadsafe.

That sounds like something which could easily be solved with common sense commenting of code.

I have much less professional programming experience. Nevertheless I saw:

- Dozens of examples where access to a variable shared between threads was not synchronized. These can be nasty, because often you won't observe an issue at all. And maybe some months or years later weird issues are reported, which often can't be reproduced and where the root cause is hard to find.

- Deadlocks, due to locks taken in different orders. E.g. in cases where a user thread calls into a library and the library calls back into user code from another thread.

- Nasty memory corruption issues. E.g. there was a case where a timer was started (which executed on a background thread) that should manipulate some variable when it elapsed. When the timer was no longer needed, it was cancelled. Under some race conditions the code in the timer thread ran even after it was cancelled (from the user thread) and accessed memory which had already been freed in the main thread. The correct solution would have been for the main thread to wait until the timer was guaranteed to have stopped (with or without executing its code) and only release the memory afterwards. This happened in an embedded system where you get no segfault, just weird behavior. So it was definitely a non-trivial issue to debug, on which multiple days were wasted.

- In event-based systems, code which blocks the whole event loop (and thereby slows down the system immensely) by waiting for the result of a long-running operation. E.g. in .NET code you might often find people just using Task.Result without knowing its implications (or how it differs from awaiting the Task).

My personal experience is that while for some people concurrency and multithreading are quite easy, there are also lots of people who have a hard time understanding and utilizing them correctly.

Do you have proofs for your code that it can't deadlock or livelock? How strong are your claims - would you trust it in a life support system, or aircraft auto-pilot, or similar role?

It's easier when you restrict yourself to:

- take mutex

- read/write/modify shared data structure

- release mutex

There are plenty of multi-threaded RTOS and a lot of critical embedded software is multi-threaded. Typically there are clearer guarantees in those vs e.g. Linux. Multi-threaded code isn't fundamentally (from a theoretical perspective) different than single threaded code when it's running on a single core. Multiple cores communicating is sort of orthogonal.

Nope. Nor do I have any proofs that my balanced binary tree implementation is correct, or that my hash implementation is correct.

Given the number of stories about equator issues and negative altitude issues in autopilots, I'm unconvinced there are proofs involved in those either.

I would trust my own multi-threaded code too, but it requires much stronger knowledge of expected behaviors and code paths. I have had the misfortune to try and debug other's multithreaded code. Having a debugger connected to a production app, while waiting a week for it to deadlock is tedious.

Same here. You just stick to a few simple patterns that work. If you have a shared data structure you protect that with a mutex. You don't hold the mutex and call other functions. You take the mutex, you read or write the shared structure, and you release the mutex. That's the basic pattern.

People run into deadlocks and other issues when they don't realize what mutex protects which data, or they try to take multiple mutexes, or they hold a mutex while calling a callback, or they don't protect access at all.

From a performance perspective you need to be aware of how much contention you have on a mutex. Typically locking a mutex should never block. If you have contention there are patterns for dealing with that as well.

For most types of concurrent applications you're done. Yeah, lockless techniques, spin locks, etc. may have their place but for the most part keep it simple.
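The "take the mutex, touch the structure, release the mutex" pattern described here, as a small Go sketch (the inventory type is invented for illustration): one mutex guards one structure, and nothing else is called while it is held.

```go
package main

import (
	"fmt"
	"sync"
)

// inventory follows the basic pattern: one mutex protects one shared
// structure; every access takes the lock, and no other functions are
// called while holding it.
type inventory struct {
	mu     sync.Mutex
	counts map[string]int
}

func (inv *inventory) Add(item string, n int) {
	inv.mu.Lock()
	inv.counts[item] += n
	inv.mu.Unlock() // released before any callback or I/O could run
}

func (inv *inventory) Get(item string) int {
	inv.mu.Lock()
	defer inv.mu.Unlock()
	return inv.counts[item]
}

func main() {
	inv := &inventory{counts: make(map[string]int)}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			inv.Add("widget", 1)
		}()
	}
	wg.Wait()
	fmt.Println(inv.Get("widget")) // 100
}
```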

> I've never had a non-trivial debug issue caused by trouble using semaphores, mutexes

Sure, those are easy, but they don't scale. You'll often want non-blocking primitives instead of semaphores for higher levels of parallelism, and that's where you'll run into trouble.

I'd rather have real threads available in my language, and use shared state sparingly. Your N single-threaded processes have to talk to each other anyway, or at least to a master, and might even share memory through memory-mapping. Threads are just a tool, and they give you options. You can use them in a share-nothing way if you want.

As someone who grew up on the Java VM and started my start-up/web career on it, I've always felt like Java programmers have a different relationship with threads than C programmers. Java gives you cross-platform, usable, debuggable native threads; it basically makes them free if you want them. In C/C++, on the other hand, threads are a library, and using them is a grungy affair. If you grew up on Rails, meanwhile, threads don't exist ("when you say worker thread, do you mean worker process? I'm confused").

Node.js was created by C programmers and launched with a lot of anti-thread propaganda, much like the link. They equated threads with shared state, and also said threads were too memory inefficient to be practical for a high-performance server (they meant that holding a native thread for every open connection would require too much stack space, which is true, but that's not what they said).

>As someone who grew up on the Java VM and started my start-up/web career on it, I've always felt like Java programmers have a different relationship with threads than C programmers. Java gives you cross-platform, usable, debuggable native threads; it basically makes them free if you want them. In C/C++, on the other hand, threads are a library, and using them is a grungy affair.

The difference is Java threads work within the constraints of the JVM memory model. Which is very well specified and understood. It is implemented in the JVM and the JVM guarantees your platform will follow it.

While C/C++ do whatever the hardware dictates.


That's part of it, for sure.

> Where threads needed, isolate usage in threaded application kernel: keep most of code single-threaded.

This is where performance peaks: each CPU is kept filled with operations, and with operations that don't need to wait for the results of other threads.

I'm sure many people don't think this applies to them because they don't use threads. However, in the modern day, replace "thread" with "process" and "memory" with "database", and many web applications have very similar problems. They just never actually manifest because of the small number of requests per second.

No, the problems of threads don't carry over to processes; what you just said is exactly wrong and misunderstands the issues involved. Databases have transactions and processes don't share memory; those solve the problem with threading, which is shared, non-transactional memory.

Transactions lock and conflict with each other, so the metaphor holds fairly closely, though it's not completely equivalent, of course.

Yes, but those are recoverable states, whereas threads suffer from partial reads in the middle of updates, resulting in undefined, unpredictable behavior. It's not a good metaphor to compare threads to processes, as they suffer from vastly different failure modes and problems and are quite normally contrasted with each other as different solutions to concurrency problems.

The reason people don't want to use threads is that they're afraid of having to understand race conditions. At that level, "you will have to understand race conditions no matter what" is the wisdom I derive from this particular metaphor.

I agree. The actor model where actors can be scheduled on any thread is the best (Erlang, Goroutines). Second best is the node.js model of single threaded evented programming.

This presentation is not about kernel threads versus green threads, it's about threads vs events. In this regard, goroutines are as bad as threads since they require some synchronization witchcraft (channels or locks), and the author advocates for your «second best one» (events).

At the time that this was written, "threading" encompassed the idea of threads sharing and mutating basically all of their state. In practice not every thread would touch everything, of course, but the concept model was that it was at least possible, and that it was the responsibility of data structures to be ready for concurrent mutation.

While Go in theory is the same as a 1990s-style threading model, and serious Go programmers had best not forget that in the final analysis they aren't actually protected from multithreading problems simply "because Go", having a model where structures are owned by particular threads, and things are shared via messages rather than direct access, thus making it the thread's responsibility to be ready for incoming messages, makes it vastly, vastly easier to program in practice.

Yes, this absolutely could have been adopted back then, too. This once again goes back to the fact that a language is often more than just the language itself; it also includes the standard idioms and the standard and common libraries. You could have written your code in the 1990s in the Go way in C++, but you would have had to bring up basically an entirely new library stack and write a lot of wrapping code around OS functionality.

Despite the fact that Go programming is in practice much simpler than the threading being criticized in that paper, I agree strongly with the idea that Go would be enhanced by the ability to erect much stronger boundaries around goroutines, to ensure that they neither share nor reach into anything they shouldn't. Still, there are boundaries by convention, which is way better than nothing.

You should also consider Rust-style "synchronization witchcraft" as a very interesting alternative to "isolated actors" and "locks everywhere" that strikes out in a new direction entirely. In 50 years, what we currently consider all of our best sync options may in hindsight have proved to be simply a false dichotomy (albeit one with three or four choices rather than just 2, but the same principle holds).

> and things are shared via messages rather than direct access, thus making it the thread's responsibility to be ready for incoming messages, makes it vastly, vastly easier to program in practice.

I kind of disagree, the Go mantra «don't communicate by sharing memory, share memory by communicating» is merely a marketing pitch, especially due to the clumsy nature of Go's channel[1].

Unfortunately, the number of Go developers being bitten by data races in Go is too high … The main danger here is the illusion of safety that the Go community is spreading (and your comment is an example of that): all the Go devs I know (including me) never really worked on multi-threaded C or C++ code before, and if you come from a nodejs or Python background like most do, you aren't really prepared to understand the challenges of multi-threaded programs. Sometimes the best way to structure your program is just to have a shared `map` of items you need to process, but if you forgot to wrap it in a mutex, you have a data-race. And if you don't even know what a data-race is, you face strange and unpredictable bugs (true story).

[1] : http://www.jtolds.com/writing/2016/03/go-channels-are-bad-an...

If you have a public map that is being accessed by multiple goroutines, you've fought Go not once but twice; once for the multiple access, and once for it being public in the first place. At the very least make the map private and create an async-safe API around it.
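One hedged way to build such an async-safe API in Go, assuming a hypothetical kvStore type: keep the map private to a single owner goroutine and let the exported methods exchange messages with it, so the map is never touched from two goroutines.

```go
package main

import "fmt"

// kvStore keeps its map private; a single owner goroutine is the only
// code that ever touches it, and the exported methods just exchange
// messages with that owner. (All names here are illustrative.)
type kvStore struct {
	sets chan [2]string
	gets chan getReq
}

type getReq struct {
	key   string
	reply chan string
}

func newKVStore() *kvStore {
	s := &kvStore{
		sets: make(chan [2]string),
		gets: make(chan getReq),
	}
	go func() {
		m := make(map[string]string) // never escapes this goroutine
		for {
			select {
			case kv := <-s.sets:
				m[kv[0]] = kv[1]
			case r := <-s.gets:
				r.reply <- m[r.key]
			}
		}
	}()
	return s
}

func (s *kvStore) Set(k, v string) { s.sets <- [2]string{k, v} }

func (s *kvStore) Get(k string) string {
	r := getReq{key: k, reply: make(chan string)}
	s.gets <- r
	return <-r.reply
}

func main() {
	s := newKVStore()
	s.Set("lang", "Go")
	fmt.Println(s.Get("lang")) // Go
}
```

A mutex around the map would work just as well here; the point is only that the map stops being public.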

You definitely do have some responsibilities learning how to write truly multithreaded code after coming out of a single-threaded environment. However, having used quite a few environments, I don't know of any where that isn't true. You've fundamentally stepped up a level of complexity. In Haskell, you don't have to change too much about your style, excepting that you first had to write your code in Haskell, which most people consider quite a leap. In Erlang, you are forced to deal with process boundaries and there isn't really an equivalent of "just share a map"; even the obvious "just share a map" option (ets) is simply a prepared API around a process boundary. Rust's pretty cool, but you must write in accordance with its type system, and that's going to fundamentally change how you write programs if it's your foray into that sort of world, which is cool and good, but far more disruptive than learning how to write some Go.

If Go makes a particular error here, it is that it doesn't stick the problems in your face and makes it a bit too easy. I'm actually on the side of that being a problem (if you think I'm all "rah rah" Go, go back and read the grandparent post with an eye for counterexamples, there's at least three), but I think most people coming from Python or Node would at least initially believe they prefer this.

The reason I am a bit rah-rah on Go (just not unreservedly) is that, while I would still tweak some important aspects of it if I had my way, it's the first time that style of thinking (and libraries!) has been brought to the Algol-style languages. On the one hand, as a language dilettante I'm saddened that we lose some goodness from the non-Algol heritages, but on the other hand, it means I get to use it for my job without cringing. Haskell is... Haskell. Erlang's a neat language but it conflated a lot of stuff and forces a lot of adaptations on to you that I consider unnecessary. (I respect it and its developers deeply, but as is often the case the first effort in a certain space often makes errors that later generations will consider basic.) The C/C++/Java/C# space all pretty much went with 1990s-style threading. Go is probably still the easiest entry into that space you can have, so I have a hard time laying too much blame at the language's feet rather than the problem domain's feet. (The channel criticisms are mostly legit, but at the same time, Go channels are probably better primitives than 80% of programmers have ever had access to. To see the complete picture one must manage to hold all of these aspects in one's head at once.)

Replace data structures with actors and messages and you got it :)

Event driven programming is actually very close to actor model conceptually, just slightly less powerful. And basically switching to actor model is the only path forward for someone coming from events.

But yeah, people need to stop confusing the actor model with shared-memory multithreading. Threads/green threads/goroutines are all the same, and the criticism applies to all of them. They share nothing with the actor model (pun intended).

Goroutines are just threads with a particularly idiosyncratic implementation.

Edit (first line): actually maybe it's my definition of 'thread' that is different. I think of a process as a security context that has one or more CPU routines (complete sets of state). Goroutines are not threads in that sense; they lack that CPU state and are just segments of code (logic and data-state).

Goroutines are 'soft threads'. The underlying implementation typically defaults to number of CPU cores as max possible REAL thread count. The real threads will check the message queues for possible wakeups when the previous soft thread blocks or terminates.

In terms of safety I think fork-join is better than the actor model. The actor model is all about mutable state - the state of all the actors. Fork-join removes state entirely - it's just about passing values.

> The actor model is all about mutable state

One of the more popular actor model languages (Erlang) also generally avoids mutable state (it's available in the form of the process dictionary, but recognized best practice is to avoid using it.)

That's not quite what I mean.

I mean, an Erlang actor is a state machine, isn't it? It accepts messages, and each message causes the actor to either return some data based on its state, or to enter a new state.

State machine - so it's holding state isn't it? And it's mutable because you can cause it to enter a new state by sending messages.

And you have lots of these mutable state machines in your Erlang process - lots of mutable state.

See what I mean? I know they don't have shared memory, but each process is effectively a little piece of global shared mutable state. And that's bad!

In other parallelism models like fork-join, there is no mutable state. Tasks get input parameters, and produce a result. They don't have any state in any meaningful sense because you can't observe their state externally and you can't ask them to change state externally.

What is fork-join like? How is a program written in this model better? I don't remember seeing any concrete example of it (I've seen CSP, Actor, and Disruptor(?)).

In the actor model you can have non-determinism. Two actors are going to send you messages to update the state in a third. Depending on which message arrives first, you can arrive at a different final state in the third actor. And which can arrive first depends on scheduling issues that can change between runs. That's non-deterministic, and it's part of what makes multi-threading so hard.

The fork-join model doesn't have this problem because you can only join a determinate number of tasks, and you have to join them in a determinate order. You can't have anything arriving at any other order than you specified. Scheduling issues cannot change how a fork-join program runs.

In the actor model you also have a really complicated state situation. You've got n actors, all each in a possibly infinite number of states. Because of the non-determinism described above the states your actors will get into in each run of the program can be different. That's another part of what makes multi-threading so hard.

The fork-join model doesn't have this problem because tasks don't have any state. They just have input and output. There's no state, only computation.

The actor model is a big uncontrolled graph of state in my opinion. Fork-join is DAG of pure functional computation done in parallel.

Any code example?

In this model I think deadlock is possible, like in CSP. I wonder, is CSP closer to this model?

No, CSP is all about mutable state in each process.

Fork-join looks like, where we blur an image with two-way parallelism:

   a, b = divide image into halves
   fork {
     a' = blur a
     b' = blur b
   }
   result = combine halves a' b'
So all the tasks in the fork { } block run in parallel. The task that created those tasks stops and waits for them to all finish and to have all their results before it continues.

And there's no way the order of the tasks finishing can change the result because you always have to wait for all your tasks to finish.

See how there's no mutable state? You can only start a task, and accept its result. You can't see or change anything in between that.
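The pseudocode above might translate to Go roughly as follows (a sketch; blur here is a dummy halving kernel, and the two-way split is hard-coded): the forked tasks only read their inputs and produce fresh outputs, and the join always waits for both.

```go
package main

import (
	"fmt"
	"sync"
)

// blur is a stand-in for a real kernel: it only reads its input and
// returns a fresh slice, so the forked tasks share no mutable state.
func blur(pixels []int) []int {
	out := make([]int, len(pixels))
	for i, p := range pixels {
		out[i] = p / 2
	}
	return out
}

// forkJoinBlur mirrors the pseudocode: split, fork two tasks, join
// both before combining. The join order is fixed, so scheduling
// cannot change the result.
func forkJoinBlur(img []int) []int {
	a, b := img[:len(img)/2], img[len(img)/2:]

	var aOut, bOut []int
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); aOut = blur(a) }()
	go func() { defer wg.Done(); bOut = blur(b) }()
	wg.Wait() // join: both halves are complete before we continue

	return append(aOut, bOut...)
}

func main() {
	fmt.Println(forkJoinBlur([]int{8, 6, 4, 2})) // [4 3 2 1]
}
```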

You can get a deadlock if one task crashes. Some people really care about that, some people see it as an implementation detail and wouldn't really define it as a 'deadlock' in the same way.

So, you basically have what JS calls await Promise.all([a, b])?

Yes you can build fork-join that way, but you'd need to remember to manually follow some rules. You should create the tasks a and b immediately before you await, and do no other work between creating them and awaiting. And you should not cause any side effects in the two tasks.

Systems which have fork-join built-in would either help you to follow those rules, or force you to.

So, if I understand, your "tasks" are almost pure functions:

   fork -> send -> task1(input) & task2(input) -> joiner
And you must always(?) have a join, so I can't have a zombie task elsewhere?

I think I like how this sounds! Is there more detailed material to study on this?

Few would disagree with "better" (than either pure threads or pure events) but if you're going to say "best" the burden of proof is higher. Please explain why you think actors are strictly better than e.g. promises/futures or agent-based models.

Keep in mind that actors and futures can be used together, which produces an even more powerful result than either abstraction used alone. We use this programming model in Mesos via a library called libprocess. There will be some talks on this approach at upcoming MesosCons and we'll submit one to CppCon.

No time to write it up.

> The actor model where actors can be scheduled on any thread is the best (Erlang, Goroutines).

Yes, but only if there can be structural sharing between threads.

How do you handle long lasting operations like checking something from time to time? Seems to me this is a natural use for a thread.

Schedule a timer, and respond to the timer events. It's a waste of resources to have a thread hanging around sleeping all the time.

This is basically "There is no Thread":


There are times when you can offload an asynchronous task to non-CPU hardware, in which case the blocking thread can disappear. The thread reappears when the hardware asynchronously informs the CPU of task completion. (This idea is what Node.js is entirely reliant upon.)

Consider if we didn't have timing hardware to handle this task for us. We could implement a timer with a spin-loop thread which checks the clock and fires off events.

That's like saying chocolate cake is the best followed by blueberry muffins.

Can you back up your claims?

Yes, but I can't be bothered to.

How's that for an honest response lol.

IMO, the main problem with using threads is that they are such an 'all or nothing' approach to sharing data.

If you want to make use of multiprocessing, the traditional choice is either to use two separate processes (sharing nothing), or to use threads (and share everything). But for most tasks, these opposite ends of the spectrum are not what you need. There's plenty of data and state in most programs that doesn't need to be shared, and a huge source of threading bugs is through mistakenly altering some data that another thread was using.

The problem is that sharing partial state between processes is painful and many languages and OSs make it difficult to do. You have to play around with mmap() or other shared memory tools, and then pay great attention to not mix pointers or other incompatible data between the processes.

There's a link to a related discussion on HN entitled "Why Events Are a Bad Idea (for high-concurrency servers) (2003)"


Threads have been useful for me in my latest experiment. It's a C++ application that runs copies of the mame_libretro DLL, each in its own thread, while a BASIC interpreter runs in another thread and the main game engine (Irrlicht) runs in the main thread. Irrlicht isn't multithreaded, so I just put commands from the BASIC thread into an outgoing deque which I lock and unlock as I access it. Then there are mutexes for the video data.

I think that threads are definitely a bit tricky though, since it is easy to mess up locking/unlocking, or to not lock things that need it, and if you do then you have debugging headaches. So when not needed they should be avoided, I think.

I can not agree more. Most people should not write threaded code at all.

Whatever... But multithreaded systems are fun to design, develop, and most importantly troubleshoot... the harder the 'real time'-ness, the more the fun :)

Threads are a bad idea for the same reason manual handling of memory space is a bad idea. Languages should provide primitives that only allow for safe construction of expressions that are run concurrently by the runtime.

> Languages should provide primitives that only allow for safe construction of expressions that are run concurrently by the runtime.

Then you run into issues like that post recently where you can't have different threads running in different contexts, because the runtime is the one deciding/controlling when new threads are spawned.

So does that give people who write allocators license to use threads arbitrarily?

More seriously, in C++ apps (games, financial trading, etc.) it is common enough to need to carefully manage memory to get that last bit of performance, even after saturating all the CPUs and GPUs. Superficially your advice makes sense.

Going forward I think I might leave threads in the same mental bucket as allocators and see if that heuristic holds up.

What if I have a blocking function call (e.g. listening on a socket)? Seems like there is no choice but to put the blocking call in its own thread...

Why would you use a blocking mechanism for listening on a socket? Non-blocking I/O[1] (epoll on Linux) has been a thing for a long time now.

[1] : https://en.wikipedia.org/wiki/Asynchronous_I/O

Except when the non-blocking file I/O API[1] is actually a blocking one. Yes, I realize in this case it's the API that needs fixing, but it isn't looking like that will happen anytime soon.

[1] : https://lwn.net/Articles/723752/#724198

All modern operating systems support an event interface to sockets (like select and epoll).

And OP actually says that limiting usage of threads to handle blocking stuff is fine.

Anti-threading proponents would answer that you should never use blocking IO.

It's the same with every tool: The more powerful it is, the worse the consequences are when it's abused. That applies to programming in particular.

With great power comes a great need for formal verification.

Formal verification is pointless if there are no formal specifications of the program/product/whatever. And if there are formal specifications, then they might as well be in the form of code. And even formal specifications can contain errors. That is why verification is not more popular.

The need is there, but the tools are not.

It's not a tooling issue (see my comment above)

Loved the programmer comparison slide. Is today's python programmer 1995's visual basic programmer? :D

You've never worked with "wordpress programmers" have you?

It's possible to be a "wordpress programmer" in any language, but man, I have read a lifetime's worth of bad PHP code. WordPress is of course the worst of all possible PHP worlds since it was a fork of an even more ancient project and thus had to maintain backwards compatibility with a purely procedural codebase.

The phrase "wordpress programmer" conjures up the person for whom that is their sole form of programming expression. Some part of me acknowledges that people like that must exist: the rest recoils in horror.

"drupal programmer"

and still powers more than 25% of the interwebs... some claim.
