
Akka, Haskell, Erlang, Go and .NET Core compared on 1M threads - atemerev
https://github.com/atemerev/skynet
======
jakozaur
Looks misleading to me. The Scala code spawns real threads while the Go code uses
goroutines.

If we want to compare apples to apples, we should use a thread pool shared across
Scala actors:
[http://stackoverflow.com/a/1597942/341181](http://stackoverflow.com/a/1597942/341181)

Just add 7 lines of code and Scala should get at least 10x faster.

~~~
atemerev
Akka actors are not "real" threads; the Akka scheduler runs them on the available
resources.

~~~
merb
Still, why didn't you do this:

    
    
        import scala.concurrent.duration.Duration
        import scala.concurrent.{Await, Future}
        import scala.concurrent.ExecutionContext.Implicits.global
    
        object FutureGen {
    
          def caller(num: Int, size: Int, div: Int): Future[Long] = {
            if (size == 1) {
              Future.successful(num.toLong)
            } else {
              val futures = for (i <- 0 until div) yield caller(num + i * (size / div), size / div, div)
              Future.sequence(futures).map(_.sum)
            }
          }
    
        }
    
    
        object Root extends App {
          // Warmup
          Await.result(FutureGen.caller(0, 1000000, 10), Duration.Inf)
    
          val startTime = System.currentTimeMillis()
          val x: Long = Await.result(FutureGen.caller(0, 1000000, 10), Duration.Inf)
          val diffMs = System.currentTimeMillis() - startTime
    
          println(s"Result: $x in $diffMs ms.")
        }
    

This would give Scala 499 ms and Go 436 ms, which is pretty neat, since Go
actually has no type system. (And I'm using the global execution context,
which could also be changed.) Currently your Akka version is slow, since it's
creating a new class on every call, so you would have like a dozen more
methods to dispatch than my version.

~~~
KirinDave
Just gonna put this out there: Scala is full of these scenarios where very
small changes to the code have disproportionately large changes to the
outcome. It's one of the reasons I'm really not a fan of the language. There
are so many features that are mentioned in the literature and lore that you
have to learn to not use, for fear of pervasive performance penalties.

This is a great example. I've done a lot of Scala, and I found the original
version very obvious to read, but only experience tells me why the compiler
can't make it efficient. Your version is not really worse in any way, but I
think it's fair to say that it won't be obvious to everyone at first glance
why one is suboptimal.

~~~
merb
Even if you're downvoted, I think what you wrote is somewhat true. It's really
hard to optimize. But I don't think Scala is the only language which has
this downside. Also, you can write really, really awful code in Scala.

------
hharnisch
I think a lot of people are overlooking something really important here. There
are numerous comments about optimizing the code to perform better on this
task. I might even go as far as to say people are taking offense because their
language of choice didn't perform as well as they'd hoped. Many of these comments
add complexity to the solution and make it harder to maintain. I like the fact
that the author kept things simple. It's kind of like using the default
settings for each language and then comparing the results. Both the simple
version and the optimized version are important data points, and they give you a
better view of each language's limitations.

~~~
merb
Sorry, but the Scala version is actually pretty dumb.

Also, Go is not better than all of these languages. What we learned here is
actually that Go is easier, since you have only one common way to do this task.
Go has goroutines, but it doesn't have Runnables, Futures, actors,
CompletionStages, or ExecutionContexts. Go has GOMAXPROCS, not dispatchers,
thread pools, etc.

Also, on the JVM even Java-only code would actually beat Go in many ways, but
that's nothing you should be scared of. Go is 1.5 years old, while the JVM has
lived for 20+ years, and enormous effort has gone into its GC implementations
(they can even be swapped, so you can use a GC better suited for concurrency,
for parallelism, or for synchronous execution). Go is good for the task it
was created for. Other languages tend to be more general, so you can do
many things with them, and thanks to those years of accumulated knowledge they
are, or at least could be, more optimized than Go.

~~~
otterley
> go doesn't have runnables, futures, actors, completionstages,
> executioncontexts

Some of us (most of us, I would contend) who have been successfully
programming highly concurrent software without them for 20+ years are totally
OK with that.

~~~
pcwalton
> Some of us (most of us, I would contend) who have been successfully
> programming highly concurrent software without them for 20+ years are
> totally OK with that.

I work on parallel software every day and I would not be OK with having
threads and channels as my only tools. I would agree that they're usually the
best thing to reach for initially, but it's essential to be able to have low-
level control over the scheduling if you want to get good parallel performance
out of the hardware. I have workloads that today result in 3x speedups on 4
cores but would become slower than the sequential algorithm on 1 core if I
were to rip out the highly-tuned parallel scheduler and replace it with
threads and channels.

To be fair, though, I'm working on CPU-bound software and if my workload were
I/O bound I'd likely have a different outlook on this.

------
strmpnk
I'm pretty sure this doesn't measure what it claims. It's a mostly sequential
flow with very little meaningful concurrent work going on. I would imagine a
for loop would put all of those to shame. I'm not sure why anyone would
optimize their runtime for what's basically yet another process-ring benchmark,
but I know it's gotten slower over time in a couple of the measured cases,
because it's just not a real example, and making it fast trades off other
things like good scheduler balancing.

Also, check out Pony if you want something where spawn and send become basically
free. It'll run pretty close to for-loop speed for the number adding, which
should be a baseline calculated in all implementations.
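
That for-loop baseline is trivial to write down. A sketch in Go for illustration
(the function name is mine, not from the benchmark repo):

```go
// Sequential baseline: the same 0..999999 sum the benchmark computes,
// but in a plain loop with no concurrency at all.
package main

import "fmt"

// sumSequential adds the integers 0..n-1.
func sumSequential(n int64) int64 {
	var total int64
	for i := int64(0); i < n; i++ {
		total += i
	}
	return total
}

func main() {
	fmt.Println(sumSequential(1000000)) // 499999500000
}
```

Any spawn/send overhead an implementation adds shows up as its distance from
this number.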

~~~
wizzard0
> I would imagine a for loop would put all of those to shame.

Indeed, see ".net sync" version. The point is to measure process spawn/message
passing performance.
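
For reference, the Go version under test has roughly this shape: each goroutine
either reports its own number (leaf) or spawns 10 children and sums their
replies over a channel. A simplified sketch, not the exact repo code:

```go
// Skynet-style benchmark sketch: a tree of goroutines, fan-out 10,
// with results flowing back up through unbuffered channels.
package main

import "fmt"

func skynet(c chan int64, num, size, div int64) {
	if size == 1 {
		c <- num // leaf: just report our index
		return
	}
	rc := make(chan int64)
	for i := int64(0); i < div; i++ {
		// each child handles a contiguous slice of the index range
		go skynet(rc, num+i*(size/div), size/div, div)
	}
	var sum int64
	for i := int64(0); i < div; i++ {
		sum += <-rc
	}
	c <- sum
}

func main() {
	c := make(chan int64)
	go skynet(c, 0, 1000000, 10)
	fmt.Println(<-c) // 499999500000
}
```

Nearly all of the measured time is goroutine creation and channel sends; the
arithmetic itself is negligible.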

~~~
zzzcpan
He has a valid point though. Goroutines are more like for-loops than
processes. You can't kill them, can't manipulate them, can't store references
to them, etc. Comparing that to actors is incorrect.

~~~
platz
Bingo

~~~
wizzard0
There's an updated Akka version which is a lot faster, btw.

------
froh42
Another microbenchmark flamebait on top of HN? Seriously?

Those things never describe real workloads - in such a view a Porsche is
always better than a truck because it accelerates faster. Tough luck if the
workload is shipping 20 tons of manure. Put that in your Porsche.

------
CoryAmber
The Haskell version was not compiled or run as a multithreaded application.
Given sufficient heap to avoid a GC in the middle of the run, I'm seeing
~375 ms +/- 25.

[https://github.com/RayRacine/skynet/blob/haskell/haskell/Sky...](https://github.com/RayRacine/skynet/blob/haskell/haskell/Skynet.hs)

~~~
porges
This is hilarious.

------
nivertech
There is a major flaw in the Erlang version which causes unnecessary list
allocation and garbage collection multiplied 1 million times. I would try
replacing lists:seq(0,9) with tail recursion. Even though it's a small list
of 10 integers, it is still created 1 million times.

[https://github.com/atemerev/skynet/blob/master/erlang/skynet...](https://github.com/atemerev/skynet/blob/master/erlang/skynet.erl#L10)

EDIT: it seems that the lists:seq overhead is insignificant, so it doesn't
matter; maybe the Erlang compiler is already optimizing it. I even inlined the
calls to spawn, and it's still the same result. But I found another problem: it
seems that some processes are stuck, so if you run the benchmark a 3rd time you
run out of processes.

EDIT2: Note that, unlike the other solutions, Erlang has process isolation:
if there is a panic in one of your goroutines, your entire server will crash,
but if there is an exception in one of your Erlang processes, only that process
will crash, not affecting the other processes in your server (unless they are
linked to it).

~~~
masklinn
> there is a major flaw in the erlang version which causes unnecessary list
> allocation and garbage collection multiplied by 1 million times.

There's probably an alloc cost, but there should be no GC cost: erlang
processes get a 233 words heap by default, a list of 10 integers is 21 words,
so the process should die without needing to run GC.

~~~
nivertech
Yes, when a process ends/dies, all of its heap is reclaimed without a regular
GC. I also tried spawn_opt/5 with a reduced min_heap_size option; it doesn't help.

~~~
masklinn
> I also tried spawn_opt/5 with reduced min_heap_size option - it doesn't
> help.

Yeah, whether using spawn_opt or VM options (-hms and -hpds), it doesn't look
possible to _reduce_ the process heap (or process dict) below the initial
size, only to increase it.

------
travjones
Go is extremely performant. But then again, isn't the author comparing apples
to oranges, so to speak? A Go program runs as a native binary, whereas Erlang
and Scala/Akka run on their respective VMs. Wouldn't these results be
expected?

~~~
memset
The author also measures .NET Core, which has the best performance of them
all. I believe that is also running non-natively.

Moreover, the core assumption that native == faster is not true. One advantage
of the JVM is that, as a program runs, the runtime can internally profile and
adjust the bytecode on the fly in order to increase performance.

~~~
radicalbyte
If anything is unfair it's that they're using long ints. Java (and I assume
other JVM languages) have the overhead of allocation for most of the longs.

Testing with allocation in all languages will give a better idea of
performance (and I won't be surprised if .Net core fares very well).

~~~
dtech
JVM has primitives for longs

------
wscott
The point to understand here is that Go uses lightweight concurrency and has a
layer of indirection between goroutines and hardware threads. Goroutines
allocate a tiny stack and Go uses a software scheduler to jump between
routines. Then that Go scheduler starts hardware threads as workers to start
executing the routines.

The net result is that Go is relatively efficient for naive programs that
start hundreds or thousands of threads and let them all run.

I don't know Scala or Erlang, but based on the results I would assume both are
allocating a posix thread for each thread in the software and so it is using
the kernel scheduler to switch between them with all the overhead that
entails.

But a program can be refactored in all the languages to have a smaller number
of workers and a queue of work to be consumed. This would be faster than the
Go version, but require more code and complexity.
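
A minimal sketch of that workers-plus-queue refactor, in Go for illustration
(all names here are mine; the benchmark repo doesn't contain this variant):

```go
// Worker-pool refactor: a fixed number of workers consume a queue of work
// items instead of spawning one goroutine (or thread) per item.
package main

import (
	"fmt"
	"runtime"
	"sync"
)

func sumWithWorkers(n int64) int64 {
	workers := runtime.NumCPU()
	jobs := make(chan int64, 1024) // the work queue
	var wg sync.WaitGroup
	var mu sync.Mutex
	var total int64

	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var local int64
			for v := range jobs {
				local += v // the "work": just add the number
			}
			mu.Lock()
			total += local
			mu.Unlock()
		}()
	}
	for i := int64(0); i < n; i++ {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(sumWithWorkers(1000000)) // 499999500000
}
```

Each worker accumulates locally and merges once under the mutex, so contention
stays proportional to the number of workers, not the number of work items.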

I have been considering threading in Rust vs. Go, and that is the biggest
difference I see between the approaches. However, it should be possible to
implement Go-style goroutines in Rust using custom dispatch code, like old C
coroutines. I think I have already seen libraries like that, but I haven't dug
into the details to know for sure.

~~~
atemerev
No, Scala and Erlang don't allocate pthreads. Their threads are also
lightweight and scheduled inside the runtime.

~~~
emergentcypher
Scala does not have lightweight threading. It uses the same thread pools from
Java. All your millions of Actors are scheduled onto one of these thread
pools.

~~~
rozap
Which means you need to program with futures in order to not do blocking IO on
your limited thread pool that the actors are multiplexed across. So at that
point you need to write node.js style code.

Coming from an Erlang world to akka, this blew my mind.

------
vadiml
In the Go version you're using an unbuffered channel, which will cause a lot of
extra goroutine switches. Please try changing line 10 to rc := make(chan int, 10),
for example.

After this change my results changed from: Result: 1783293664 in 61586 ms. to:
Result: 1783293664 in 28401 ms.
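
Applied to a skynet-style function, the whole change is the channel's capacity.
A simplified sketch (not the exact repo code):

```go
// Buffered-channel variant: each parent creates its reply channel with
// capacity div, so its 10 children can send their result and exit without
// waiting for the parent goroutine to be scheduled.
package main

import "fmt"

func skynetBuffered(num, size, div int64) int64 {
	if size == 1 {
		return num // leaf: just our index
	}
	rc := make(chan int64, div) // buffered; was: make(chan int64)
	for i := int64(0); i < div; i++ {
		go func(n int64) { rc <- skynetBuffered(n, size/div, div) }(num + i*(size/div))
	}
	var sum int64
	for i := int64(0); i < div; i++ {
		sum += <-rc
	}
	return sum
}

func main() {
	fmt.Println(skynetBuffered(0, 1000000, 10)) // 499999500000
}
```

With an unbuffered channel every child send blocks until the parent receives,
forcing an extra scheduler handoff per message; the buffer removes that.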

------
dxhdr
How about running this on a 36+ core machine? Testing parallelism on a dual or
quad core isn't that exciting. Maybe we'd realize this benchmark isn't really
testing parallelism but primarily sequential thread creation/deletion. This
could very well be fast on a scheduler optimized for a single core.

And just to be clear, the interesting property to test is parallelism, not
concurrency.

~~~
ovi256
Except that schedulers optimized for single cores do not grow on trees,
either. Python 3.5, which I think is a language with very high incentives for
such single-core optimizations (because of the GIL), does almost two orders of
magnitude worse than Golang (1.7s vs 63s) on a single core.

I wrote a py3.5 asyncio coroutine implementation of this benchmark, and I hope
it's just my inexperience with asyncio that makes it get such horrible
results.

Here it is:
[https://github.com/ociule/skynet/tree/master/python35-asynci...](https://github.com/ociule/skynet/tree/master/python35-asyncio)

------
circlespainter
I think it would be interesting to try
[http://docs.paralleluniverse.co/quasar/](http://docs.paralleluniverse.co/quasar/)
too (fibers, and actors on top, for the JVM).

------
igouy
Ye olde thread-ring was less about startup ;-)

[http://benchmarksgame.alioth.debian.org/u64q/performance.php...](http://benchmarksgame.alioth.debian.org/u64q/performance.php?test=threadring)

------
JulianMorrison
I had thought from the title that this was about a benchmark on a
machine with one million hardware threads, which would have been interesting.
It turns out to be a job of spawning loads of unprioritized identical threads
that block, do negligible work, and exit. It looks like time costs are going to
be almost entirely dominated by setup and teardown, which is very nearly
useless for any situation in which you expect threads to exist for more than
an infinitesimal time. Setup and teardown are constants. They won't dominate
the runtime of most programs.

------
twic
MacBook Pro, 15-inch, Early 2011, 2.2 GHz Intel Core i7, OS X 10.8.5. Yeah
yeah, i know. At least it was more or less unloaded when i ran the tests.

    
    
        $ git show --no-patch --pretty=oneline HEAD
        82f318ad418aa10eb5553bcbfad1783e544c76cd Merge pull request #26 from bitemyapp/master
        
        $ java -version
        java version "1.8.0_11"
        
        $ go version
        go version go1.5.3 darwin/amd64
        
        $ (cd scala; sbt compile run)
        Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=384m; support was removed in 8.0
        [info] Set current project to skynet (in build file:/Users/twic/Documents/Code/skynet/scala/)
        [success] Total time: 1 s, completed 15-Feb-2016 22:59:34
        [info] Running Root
        Result: 499999500000 in 8832 ms.
        Result: 499999500000 in 4866 ms.
        Result: 499999500000 in 2457 ms.
        [success] Total time: 17 s, completed 15-Feb-2016 22:59:51
        
        $ (cd go; go run skynet.go)
        Result: 499999500000 in 399 ms.
        
        $ (cd java; ./gradlew run)
        :compileJava UP-TO-DATE
        :processResources UP-TO-DATE
        :classes UP-TO-DATE
        :run
        Streams
        -------
        Result: 499999500000 in 18 ms. (sequential)
        Result: 499999500000 in 38 ms. (parallel)
        
        RxJava
        ------
        Result: 499999500000 in 204 ms. (immediate)
        Result: 499999500000 in 216 ms. (computation)
        Result: 499999500000 in 188 ms. (io)
        
        BUILD SUCCESSFUL
        
        Total time: 10.784 secs
    

Get off my lawn.

------
spion
This kind of benchmark can show off what v8 is capable of (when used with a
library that has been carefully tuned for performance)

Code is pretty-much copy pasted from .NET Async version with bluebird promises
instead of Tasks. Ends up being almost 2 times faster:

[https://github.com/atemerev/skynet/pull/16](https://github.com/atemerev/skynet/pull/16)

(Only 50% faster if we return Promise.resolve(num) though)

~~~
wizzard0
Returning num instead of Promise.resolve(num) is unfair compared to .NET
version ;) but I merged your PR as-is

~~~
spion
I just tried it with Promise.resolve, and after the JIT warms up it's 350 ms,
while .NET (async) after JIT warmup is 300 ms.

------
Drup
Just to drive the whole "this comparison is misleading" point further:
[https://github.com/atemerev/skynet/pull/23](https://github.com/atemerev/skynet/pull/23)

Lwt is a cooperative multi-threading library with particular semantics: a
task that doesn't block doesn't even yield, it returns directly.

Yes, it's comparing apples to oranges.

------
lossolo
When I saw 1,111,111 threads I knew Go would be the winner here, because Go
uses green threads, which are not system threads: no syscalls needed, etc.

~~~
mijoharas
Erlang doesn't use system threads either.

~~~
vmsp
Yeah, isn't it weird that the Go implementation is so much faster than the
Erlang one?

~~~
masklinn
Probably not entirely. Erlang's lightweight processes have their own private
heap (~1.8KB by default on a 64-bit system), a process dictionary, and likely
more control structures, and the language is mainly interpreted. A 4x
difference in performance may be on the high side (idk), but I wouldn't expect
Erlang processes to be quite as fast to spawn as goroutines.

edit: and the Erlang bench allocates a bunch of lists on startup:
[https://github.com/atemerev/skynet/blob/master/erlang/skynet...](https://github.com/atemerev/skynet/blob/master/erlang/skynet.erl#L10)
`lists:seq()` will allocate a 10-element linked list; considering this is
called recursively and concurrently, it might contribute to the issue (or not).

~~~
dxhdr
Minor clarification -- Erlang is not interpreted, it's compiled to bytecode
and executed on the BEAM virtual machine.

~~~
masklinn
> Erlang is not interpreted, it's compiled to bytecode and executed on the
> BEAM virtual machine.

And Python is compiled to bytecode and executed on the CPython virtual
machine. That's still interpreted.

------
vadiml
For the Go version, changing line 10 to rc := make(chan int, 10) makes it 2x faster.

------
fidget
Another one for the list of meaningless benchmarks

------
dfkf
From the description, almost 90% of created tasks in this benchmark are tiny
"leaf" tasks, which are quite likely to be optimized away completely. So it
looks like the benchmark compares how well these compilers optimize one
particular case, and not general performance of goroutines/tasks/whatever...

------
ybroze
Though not unexpected, it's good to know what the order of difference is when
choosing tools for the job at hand.

------
ilcavero
Akka/Erlang actors are not equivalent to goroutines; a closer concurrency
mechanism would be futures/tasks.

~~~
atemerev
See .NET Core implementation with tasks :)

------
pepper__chico
This is a joke. OK, take this C++ version that takes 0 time to finish:
[https://github.com/atemerev/skynet/pull/26#issuecomment-1845...](https://github.com/atemerev/skynet/pull/26#issuecomment-184516070)

------
MaxGabriel
This test was performed on a 2-core machine. I'm not familiar enough with
multi-threading benchmarks to know what's standard, but that sounds suspicious
to me, like the benchmark might not show how well a language takes advantage
of a 32 core server.

Anyone know if that would be a problem?

------
merb
Actually, comparing JITted VMs without a warmup is extremely dumb.

Second: using the default thread dispatcher for this kind of scenario is
pretty dumb in Scala (also, I think Erlang could be made faster too, but since
I'm not an Erlang guy...). A fixed thread pool would be the closer match
(GOMAXPROCS vs. ForkJoin).

~~~
atemerev
All timers are started after the warmup.

~~~
mafribe
How do you know when warmup is finished?

Warmup's a tricky issue, see this brand new paper:
[http://arxiv.org/abs/1602.00602](http://arxiv.org/abs/1602.00602)

~~~
wizzard0
There's an update for the Akka version, which measures 12/8/4.4 seconds on
subsequent attempts, so yes, warmup is important... but somehow that seems
kinda unfair to the other languages too.

------
Ixiaus
The Haskell version is naive too.

My laptop runs the original in about 3.07s and my improvements bring it down
to 1.20s, though I think there's room for improvement.

[EDIT] My improvements were actually not the same algorithm as the original's;
with algorithm corrections, my version is actually slower!

~~~
sa1
A combination of tuning the runtime heap size, number of threads and compiling
with -threaded made the same implementation run in 0.5s for me.

------
lpgauth
The Erlang code is far from optimal... I'll open a pull req later today.

~~~
atemerev
You are most welcome! I am not an Erlang expert.

------
coolsunglasses
This is a terrible benchmark, but

[https://twitter.com/bitemyapp/status/698975974277271553](https://twitter.com/bitemyapp/status/698975974277271553)

------
wizzard0
Added Erlang on Windows numbers, added .NET synchronous numbers.

------
unfunco
The languages tab on GitHub (click on the multi-coloured bar) is another
interesting metric, is it somewhat representative of the conciseness of the
languages tested?

------
Jabbles
That division looks expensive.

He says, with absolutely no profiling whatsoever... I would like someone to
offer some evidence though :)

------
sedatk
"async (8 threads)" is confusing. Async and multithreaded are different
approaches to parallelization.

------
daxfohl
Is the Akka version doing reflection on each iteration or does it just look
like it?

------
ukd1
Just ran the checked in Crystal code on a 2015 Macbook (1.2ghz):

$ ./skynet

Result: 499999500000 in 294ms.

------
zobzu
There's a lot of nitpicking about the code and Go vs. Scala, but .NET is more
flexible and annihilated both by a fair margin as well.

------
foota
I like how the language banner is so colorful.

------
archonjobs
I think the .NET Core implementation is flawed... Task.FromResult doesn't
actually happen on another thread; it returns an already completed Task
synchronously. You might want to use Task.Run instead.

~~~
wizzard0
.NET core impl has 2 versions, async and sync. Task.FromResult is synchronous,
yes. Look closer :)

~~~
solutionyogi
Doesn't matter. Unless you do TaskFactory.StartNew or Task.Run, it's not going
to try to schedule work on another thread. Everything is happening in the main
thread in your code.

~~~
wizzard0
See the updated version. Runs about 15-20% slower, which is far less than the
difference between single-threaded version and task-based one. Also, it's
pretty easy to see in Task manager how all the cores get loaded.

~~~
solutionyogi
That looks better. However, I am really not sure what you are trying to
achieve here. You didn't specify any scheduler, which means your tasks will use
the default scheduler; that scheduler uses the thread pool [1] and will never
create 1M threads. And as there is no actual work performed in the Task,
things will run reasonably fast.

I am having a hard time understanding what you are trying to test here. I am
worried that the .NET code is not really testing what you think it is.

[1] [https://msdn.microsoft.com/en-us/library/dd997402(v=vs.110).aspx](https://msdn.microsoft.com/en-us/library/dd997402\(v=vs.110\).aspx)

~~~
wizzard0
You can't create 1M threads on Windows (or OS X, for that matter) because
there won't be enough RAM for stack space. (Also, running more threads than
physical cores on a CPU-bound task is pretty pointless.)

Maybe you mean we should schedule something which has a complete event loop?
That would be a lot more fair wrt the Akka version, but the Go version does
not create any threads either.

~~~
solutionyogi
Not sure what you mean by complete event loop. But I have a strong feeling
that you are comparing apples to oranges in your benchmark.

~~~
wizzard0
It's indeed apples vs oranges vs pies (Actors/Coroutines/Futures). Probably
should make a table :)

------
lazyjones
I'm surprised that none of the compilers optimized this benchmark away
completely...

~~~
ibejoeb
I don't follow. What optimization technique would fundamentally change this
program?

~~~
killercup
If a compiler could prove that creating the channels/goroutines/what-have-you
does not influence the arithmetic (e.g. because only associative operations
are used and the order doesn't matter), it could constant-fold the loops and
just output 499999500000 directly.

I don't know of any compiler (or language implementation) that can do that,
though.
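
For this particular benchmark the folded constant is just the Gauss closed
form for 0 + 1 + ... + 999999; a sketch in Go of what such an optimizer would
effectively emit:

```go
// The whole benchmark reduces to one constant expression:
// the sum of 0..n-1 equals n*(n-1)/2.
package main

import "fmt"

func main() {
	const n = int64(1000000)
	fmt.Println(n * (n - 1) / 2) // 499999500000
}
```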

~~~
danellis
Doesn't seem out of the question, but it would require a very good type
system. I think out of these languages, only Haskell is at that level.

~~~
sirclueless
Any of the VM languages could probably do this heuristically as well: the "if
size == 1" base case depends only on one of the parameters, so you could hoist
it out of the concurrent thread/coroutine and into the caller. Then the code
path that it hits only ever assigns to a future/promise/array/whatever and
immediately returns, so it won't block and has no other side effects, and it
can run synchronously for less than the cost of a context switch.

In general programmers are pretty good about not kicking off expensive
computations they don't need to, so the easy pickings here are probably pretty
small. And as soon as there are any side effects at all (which includes any
writes to shared memory) it becomes very hard to reason about moving
computation from one thread to another. So it's easy to see why compiler
writers are not super excited to start optimizing across concurrent threads.

