Erlang (BEAM) has schedulers that execute outstanding work on its huge number of user-space (green-thread) processes, metering each process's run in reductions.
For most of Erlang's history, there was a single scheduler per node: one OS thread on a physical machine running all the processes. Each (green-thread) process runs for a fixed budget of reductions, then the scheduler context-switches to a different process. Repeat.
A few years ago (2008), the schedulers were parallelized, so that multiple schedulers could cooperate on multi-core machines. The number of schedulers and the number of hardware threads are independent - you can run the schedulers on any number of OS threads on any physical machine. But by default, and in practice, the number of schedulers is configured to be one thread per core, where "core" means hardware-supported thread (e.g. Intel chips often have 2 hardware threads per physical core).
So yes, almost always and almost everywhere, there really is one OS thread per hardware-supported thread (usually 1x or 2x the physical CPU core count) running the schedulers.
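The preemption model described above - run a process for a fixed budget of reductions, then switch - can be sketched with a toy round-robin scheduler. This is plain Python, nothing BEAM-specific; the budget of 3 "reductions" is made up for readability (the real VM's budget is in the thousands), and generators stand in for green-thread processes:

```python
from collections import deque

def scheduler(processes, budget=3):
    """Round-robin over green-thread 'processes' (generators).

    Each process runs for a fixed budget of 'reductions' (here, yields),
    then is preempted and the next runnable process gets the CPU -- a toy
    version of how a single BEAM scheduler time-slices its run queue.
    """
    trace = []                          # which process ran each reduction
    ready = deque(processes)
    while ready:
        pid, proc = ready.popleft()
        for _ in range(budget):         # the fixed reduction budget
            try:
                next(proc)
                trace.append(pid)
            except StopIteration:       # process finished all its work
                break
        else:
            ready.append((pid, proc))   # budget spent: context switch
    return trace

def work(n):
    """A 'process' that needs n reductions of work."""
    for _ in range(n):
        yield

# Two processes with 5 units of work each, budget of 3 reductions:
# A runs 3, B runs 3, A finishes its last 2, B finishes its last 2.
print(scheduler([("A", work(5)), ("B", work(5))], budget=3))
# -> ['A', 'A', 'A', 'B', 'B', 'B', 'A', 'A', 'B', 'B']
```

Because no process can hog the scheduler past its budget, latency stays bounded even with badly behaved processes - which is the property the reduction-counting design buys.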
As the original article noted, one of the biggest problems with "thread per core" is the name itself, because it confuses people. It does not literally mean "one thread per core"; it refers to a specific kind of architecture in which message passing between threads (which is pervasive in Erlang) is avoided, or kept to the minimum possible. Instead, the processing for a single request happens, from beginning to end, on one single core.
This is done to minimize cache-line transfers between cores, and to keep each core's caches hot with the data for one request and not much else (at least, to the extent possible).
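A minimal sketch of that architecture, in Python for brevity (the worker count, queue-per-worker layout, and hash-based dispatch are all illustrative assumptions, not any particular runtime's API): each request is pinned to exactly one worker thread and processed there start to finish, with no cross-thread hand-off mid-request.

```python
import queue
import threading

NUM_WORKERS = 4                 # stand-in for "one thread per core"

# One input queue per worker: no shared run queue, no work-stealing.
queues = [queue.Queue() for _ in range(NUM_WORKERS)]
handled_on = {}                 # request id -> name of the thread that ran it

def worker(q):
    while True:
        req = q.get()
        if req is None:         # shutdown sentinel
            return
        # The whole request is processed here, start to finish;
        # it is never handed off to another thread.
        handled_on[req] = threading.current_thread().name

def dispatch(request_id):
    # Partition requests across workers (e.g. by a connection hash),
    # so each request is pinned to exactly one worker thread.
    queues[hash(request_id) % NUM_WORKERS].put(request_id)

threads = [threading.Thread(target=worker, args=(q,), name=f"worker-{i}")
           for i, q in enumerate(queues)]
for t in threads:
    t.start()
for r in range(20):
    dispatch(r)
for q in queues:
    q.put(None)
for t in threads:
    t.join()

print(f"{len(handled_on)} requests, each pinned to one worker")
```

A real thread-per-core runtime would additionally pin each worker thread to a CPU and run a non-blocking event loop inside it, but the shape is the same: dispatch once, then stay on that core.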
In the context of Rust async runtimes, this is roughly what Tokio would look like if work-stealing did not exist and every future were spawned only onto its local thread: coding gets easier (no Sync + Send + 'static constraints) and, the idea goes, the code gets more performant (which the article argues it does not).
For examples of thread-per-core runtimes, see glommio and monoio.
I am extremely familiar with Erlang and its history. You are misunderstanding what "Thread Per Core" means.
Again, the fact that data moves across threads in Erlang means it is not TPC, period. Erlang is practically the exact opposite of a TPC system: it is all about passing data between actors, which can live on any thread.
https://www.erlang.org/doc/man/erl.html
https://erlang.org/pipermail/erlang-questions/2008-September...