
How Erlang does scheduling - davidw
http://jlouisramblings.blogspot.dk/2013/01/how-erlang-does-scheduling.html
======
davidw
A very interesting read. I'd be interested to hear from people who use
languages like Haskell, which I'm not very familiar with, what their take on
this is.

Amongst a lot of positive things one can say about this scheduling system, I
think there are a couple of downsides:

* The scheduler needs to be able to interrupt things, so my guess is that it's harder to compile code down to native code; otherwise you might not be able to interrupt a tight loop.

* The bit about regular expressions being implemented in Erlang so as to be able to keep track of how much time they've taken up. This has a cost: slower regular expressions because you're doing them in Erlang rather than C, and not sharing nice C implementations. The one in Tcl is pretty good and very liberally licensed, for instance. Turtles all the way down means having to reimplement stuff instead of just borrowing it.

That said, for certain things, Erlang is superb.

~~~
larsberg
I worry about and benchmark this sort of stuff on a daily basis (in the
context of Manticore, a parallel dialect of Standard ML), so I have a few data
points to throw out:

\- Pinning to cores is not a good idea unless you have also pinned all other
threads in the system to a separate core. Even with Very Smart schedulers,
poor interrupts due to OS threads or even just intermittent services can end
up stealing a core that a thread is pinned to, requiring the system to notice
it and migrate. We have seen identical median execution times while removing
all of the really bad outliers by pinning to a package rather than a core and
relying on the OS scheduler to try not to blow out L1 cache by moving you
around, both on Intel and AMD architectures.

\- I am not as familiar with Haskell, but in our implementation of ML we have
interruption points the scheduler can hop in at, both at allocation locations
and at a variety of other points: any loop, a C library call, etc. By doing it
at known points but introducing extra ones, we keep it down to 10 or so
instructions before a check, with a check that's a single in-memory pointer
check (the allocation pointer gets zero'd out, for the curious), _and_ we then
produce all of the information that's also needed by the garbage collector if
some series of actions leads to the need for a GC at that point. Which can
happen even if you weren't allocating if the other threads in the system
determine there's a need for a synchronized global collection, for example.

\- Not mentioned in this article is anything about the Erlang GC's interaction
with the scheduler and NUMA memory. In systems with lots of threads, if you
don't have a fairly concurrent GC at the per-thread level combined with global
collections that maintain locality (remember each CPU package has a separate
chunk of memory!) _and_ your scheduler isn't aware of where the code is
running, it's easy to have a poor GC and scheduler decision lead to really bad
performance outliers.

Erlang is far more mature than Manticore, so I'm sure they've dealt with this,
but it's a key aspect of engineering such a runtime system on modern multicore
machines, so I was surprised not to see anything about it.

~~~
jerf
"a fairly concurrent GC at the per-thread level combined with global
collections that maintain locality"

Each Erlang process gets its own independent little memory space. GC is
trivial to run per-Erlang-process as a result. The downside is, no sharing
between processes, even if you want it for some reason. There is some sharing
of binaries with reference counting, though, with the usual problems of
potentially accidentally holding on to megabytes of memory because one process
is holding on to a four-byte slice. (Though you can make copies in advance if
you know that's going to happen, and usually you have or can get a pretty good
idea.)

I don't know anything about specific NUMA support. A quick Google turns up
little to nothing. It is possible that just by the way it is structured it
tends to work out reasonably well; less well than perhaps it could with deep
integration, but better than a program written in most languages with no
thought given to it. Processes and the RAM they use tend to end up tightly bound
together in small contiguous memory arenas just by the way Erlang works. (It
is not uncommon to have a system with thousands and thousands of processes
that each may only be holding on to a few hundred bytes of data that actually
belongs to them.)

I've been learning Go lately. I so badly want a decent Erlang replacement
whose syntax and basic libraries are so much more friendly. And I respect Go
and think it has a future, perhaps even a bright one. But I have to admit it
has only deepened my respect for Erlang. Erlang is the shy librarian of
computer languages; the outside may be superficially unappealing, socially
awkward, not like the other glitzier girls, but at the end of the movie she
takes her glasses off, lets her hair down, and dazzles the audience by being
the best dancer there is... if you stuck with her long enough to get to the
end of the movie.

------
kaeluka
What the article fails to mention is that BIFs can mess with your timing -- a
BIF call costs one reduction, no matter how long it takes. I'd say that this
is comparable in seriousness to the array-operation example, since BIF calls
are probably more common.

edit: nonetheless, I like Erlang; it is a brilliant tool for a small but
important class of problems. When comparing Erlang to Akka recently, I found
that Erlang indeed does quite a good job of delivering predictable
performance (all on a very limited benchmark), while Akka is less reliable in
that regard. However, I did construct a very simple network topology that was
as similar in both cases as possible, which excluded features like routers
in Akka.

Akka's performance on average tends to be better, though (roughly the same for
pure message passing; the more work you add, the more Akka tends to outrun
Erlang).

So, it's, as always, a tradeoff.

edit #2: I regard Erlang's scheduler as one of _the_ advantages Erlang has:
allowing all processes to progress at the same time is amazing. Compare this
with task-based parallelism backed by a thread pool of size N: only N tasks
are able to 'make progress' at once, while in Erlang, every process gets a
fair share. That's really, really cool.

~~~
ot
From the article:

    
    
      [n2] This section is also why one must beware 
      of long-running NIFs. They do not per default 
      preempt, nor do they bump the reduction counter. 
      So they can introduce latency in your system.
    

The OP calls them NIFs, but it is essentially what you are saying.

~~~
grating
A NIF isn't the same thing as a BIF. NIFs were introduced in R13B03 and are,
in short, user code written in C. BIFs, on the other hand, are built-ins
implemented in a completely different manner.

------
ot
I just want to point out that the comments in the article are almost as
interesting as the article itself.

Don't miss them.

------
jensnockert
Just wanted to point out that OS X has a thread affinity API,
[https://developer.apple.com/library/mac/#releasenotes/Perfor...](https://developer.apple.com/library/mac/#releasenotes/Performance/RN-
AffinityAPI/_index.html) in case anyone actually needed it.

