
TechEmpower Web Framework Benchmarks Round 10 - pneumatics
https://www.techempower.com/benchmarks/#section=data-r10
======
chrisan
Come for the stats, stay for the comedy

> The project returns with significant restructuring of the toolset and Travis
> CI integration. Fierce battles raged between the Compiled Empire and the
> Dynamic Rebellion and many requests died to bring us this data. Yes, there
> is some comic relief, but do not fear—the only jar-jars here are Java

Brilliant!

------
redstripe
There's something I don't understand in these comments. Why is everyone
interested in language comparisons instead of the huge difference between EC2
and bare metal?

~~~
sauere
Got to agree. Bare-metal performance isn't appreciated enough. I know a few
companies that run fine with a mixed metal/AWS combo: metal handles 80% of the
workload, and if it fails for some reason, EC2 instances are fired up to take
over until it's fixed. This setup doesn't work for every scenario, but it is
something to take into consideration.

~~~
ckluis
I remember being in a meeting discussing tax companies that use X bare-metal
servers for the whole year and scale up Y cloud instances for the 3 months
when they have excessive usage.

------
Joeri
Every time these benchmarks come out, I'm struck by how slow the big PHP
frameworks are compared to raw PHP. Either nobody cares about making those
tests perform better, or something is very wrong in the architecture of those
frameworks.

I expect the cause is that too much code is loaded on every request. PHP tears
down and rebuilds the world for each request, and the popular frameworks load
a lot of code and instantiate a lot of objects every time.
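
As a rough illustration of that per-request cost (a made-up Python sketch, not
code from any benchmarked framework), compare rebuilding an object graph on
every request against building it once:

    import time

    class Router:
        def __init__(self):
            # Stand-in for the object graph a big framework builds at boot.
            self.routes = {"/page/%d" % i: (lambda i=i: "page %d" % i)
                           for i in range(5000)}

        def dispatch(self, path):
            return self.routes[path]()

    def per_request_style(n):
        # PHP-style: the world is torn down and rebuilt for each request.
        for _ in range(n):
            Router().dispatch("/page/42")

    def persistent_style(n):
        # Long-lived process: build once, reuse across requests.
        router = Router()
        for _ in range(n):
            router.dispatch("/page/42")

    for fn in (per_request_style, persistent_style):
        start = time.perf_counter()
        fn(100)
        print(fn.__name__, "%.3fs" % (time.perf_counter() - start))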

~~~
panopticon
PHP frameworks generally rely on heavy amounts of caching at every level
(database, bytecode, Varnish, etc.) to make up for this.

Why this is the status quo is something I also question.

------
tolas
Still no Elixir/Phoenix inclusion? I'd really love to see how it stacks up.

~~~
bhauer
Coming in Round 11 [1]! We'll aim to have Round 11 out much quicker than 10.

[1]
[https://github.com/TechEmpower/FrameworkBenchmarks/pull/1510](https://github.com/TechEmpower/FrameworkBenchmarks/pull/1510)

~~~
perishabledave
Thanks! I'd like to see this as well.

------
DAddYE
A few notes:

* Impressive Dart

* JRuby > MRI (I'd like to see JRuby 9k)

* Padrino, which offers basically everything that Rails does, performs impressively well. [Shameless plug]

~~~
cheald
My experience has been that JRuby 9k is slightly slower than MRI in a number
of cases right now, mostly because its IR has been completely rewritten, and
is still pending a performance pass.

That said, it's still a huge step up from the 1.7 series, and once the team
starts knocking out performance problems it should be pretty magnificent.

------
aikah
I'd like to see the memory benchmarks as well. It's fine to have something
fast, but the more memory it uses, the more expensive the boxes are.

~~~
kainsavage
We HAVE the memory data, but we have not distilled it into a consumable form
to use on our website. Here, for example, is the cpu/ram usage report for
ULib's json test: [https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...](https://github.com/TechEmpower/TFB-Round-10/blob/master/peak/linux/results-2015-03-24-peak-final/20150324072137/json/ULib/stats.json)

------
vdaniuk
Okay, Nim is really fast in these benchmarks on EC2 servers, impressive.

Does anyone have experience with the Nim web stack? Is it ready for prime
time? How much effort is required to create a simple CRUD JSON API?

On a side note, I am really looking forward to comparing Rust and Elixir
results in the next round of benchmarks.

~~~
kainsavage
Rust and Elixir SHOULD be included in Round 11 - we have already accepted a
pull request for the Phoenix framework (Elixir) and had a pull request for
Rust back when it was in alpha. Hopefully, we will see another Rust pull
request soon.

~~~
vdaniuk
Pleased to hear that, thanks for doing the benchmarks!

------
saryant
The chart says Play Framework didn't complete, but looking at the output, the
logs say it did.

[https://github.com/TechEmpower/TFB-Round-10/blob/master/peak...](https://github.com/TechEmpower/TFB-Round-10/blob/master/peak/linux/results-2015-03-18-peak-preview7/latest/logs/play2-scala/out.txt)

What am I missing?

~~~
richdougherty
An error occurs, which is logged to stderr, but the benchmark logs don't
capture stderr, so it's hard to know what's happening. (Or maybe Play
redirects stderr to a log file?)

The test passes in the preview runs, in the TechEmpower continuous integration
tests, and in the EC2 tests, so it's probably some transient error that only
occurred in the final bare-metal test. Maybe there's a race condition in the
Play 2 test scripts that only shows up sometimes.

I've spent a fair bit of time maintaining the Play 2 benchmark tests so it's
very frustrating to get no result on the final test. Oh well!

~~~
saryant
Out of the box, the start script from "play stage" does _not_ redirect stderr.

Though I didn't think to check the classpath when I was poking around the
TechEmpower github repo. I wonder if a logback.xml slipped in somewhere that's
siphoning off stderr to some unknown destination?

------
sauere
Bottle handling 5x more requests than Flask. Impressive, but overall Python
framework performance is still... meh.

~~~
Cyph0n
Falcon is where the performance is at according to the benchmarks.

~~~
trentnelson
Or PyParallel, when they include Windows in the next round:
[https://speakerdeck.com/trent/pyparallel-pycon-2015-language...](https://speakerdeck.com/trent/pyparallel-pycon-2015-language-summit?slide=5)

Consistently orders of magnitude faster than everything else out there in the
Python landscape.

~~~
Cyph0n
Ok, that looks very interesting. Thanks for the heads up.

------
WoodenChair
Dart dominated the multiple queries test type.

~~~
Cyph0n
That is quite surprising. Anyone have an idea why that is?

~~~
kainsavage
Actually, not really. We checked the code to ensure there was no gaming of the
system, and it definitely APPEARS to be making separate database queries, as
our rules require. In fact, we had this same question in Round 9 and had a
number of people audit it. We cannot explain it other than it might be pretty
darn fast.

~~~
emn13
A better requirement would be to define some minimum level of durability, and
some minimal level of freshness in the face of concurrent modifications.

Frankly, who cares if a caching driver avoids some database queries entirely
if it still provides the same level of durability and freshness guarantees? If
mongo+redis are OK, what's wrong with a plain hashtable?

------
sker
I'd like to see some ASP.NET running on Owin. Perhaps I'll find the time to
add it myself before round 11.

~~~
kainsavage
I am not familiar with Owin; how does it differ from Mono?

~~~
meragrin
Owin is an interface between .NET web servers and web applications.

Mono is an implementation of the .NET runtime and framework.

------
hamiltont
I've been working with this project for a while; here are some unorganized
thoughts:

    1) Statistics
    2) Running Overhead
    3) Travis-CI
    4) Memory/Bandwidth/Other info
    5) Windows
    6) IRC
    7) Ease of Contributing

1) Currently, the TFB results are not statistically sound in any sense - for
each round you're looking at one data point. EC2 has higher variability in
performance, so that one data point is worth less than the bare-metal data
point. Re-running this round, I would expect to see _at least_ a 5 to 10%
difference for each framework permutation. See point (2) to understand why
we're not yet running 30 iterations and averaging (or something similar).
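
As a sketch of what sounder statistics could look like (Python, with made-up
numbers), repeated runs per framework would let us report a mean and a spread
instead of a single sample:

    import statistics

    # Hypothetical RPS samples from repeated runs of the same test.
    runs = {
        "framework-a": [31500, 29800, 33100, 30400, 32200],
        "framework-b": [30900, 31100, 30700, 31300, 30800],
    }

    for name, rps in runs.items():
        mean = statistics.mean(rps)
        stdev = statistics.stdev(rps)
        # With 5-10% run-to-run variance, single samples can swap rankings.
        print("%s: mean=%.0f rps, stdev=%.0f (%.1f%%)"
              % (name, mean, stdev, 100 * stdev / mean))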

2) Running a benchmark "round" takes >24 hours and still (sadly) requires a
nontrivial amount of manpower. It's currently really tough to do lots of
previews before an official round, and therefore tough to let framework
contributors "optimize" their frameworks iteratively. I'm working on
continuous benchmarking over at
[https://github.com/hamiltont/webjuice](https://github.com/hamiltont/webjuice)
- it's a bit early for PRs, but open an issue if you want to chat.

3) As you can imagine, our resource usage on Travis-CI is much higher than
other open source projects'. They have been nothing but amazing, and even
reached out to chat about mutual solutions to potentially reduce our usage.
Really great team.

4) We do record a lot of this using the dstat tool. dstat outputs a huge
amount of data, and no one has sent in a PR to help us aggregate that data
into something easy to visualize. If you want this info, it's available in
raw form in the results repo on GitHub.
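
If anyone wants a starting point for that PR: a rough Python sketch (not our
actual tooling) that aggregates one dstat capture, assuming it was recorded
with dstat --output stats.csv; the exact column layout varies by dstat
version:

    import csv
    import statistics

    with open("stats.csv") as f:
        rows = list(csv.reader(f))

    # dstat CSVs begin with metadata lines; find the header row with "usr".
    header_idx = next(i for i, r in enumerate(rows) if "usr" in r)
    usr_col = rows[header_idx].index("usr")

    samples = [float(r[usr_col]) for r in rows[header_idx + 1:]
               if len(r) > usr_col and r[usr_col]]
    print("user CPU: mean=%.1f%% max=%.1f%% over %d samples"
          % (statistics.mean(samples), max(samples), len(samples)))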

5) Sadly, Windows support is struggling at the minute. We need something set
up like Travis-CI, but for our Windows system. Currently, Windows PRs have to
be manually tested, and few of the contributors have either a) time to do it
manually in a responsive manner or b) Windows setups (a few do, but many of us
don't). Any takers to help set something up? FYI, we have put a _ton_ of work
into keeping Mono support just so we can at least verify that changes to the
C# tests run and pass verification, but naturally that isn't as nice as having
native Windows support.

6) Join us on Freenode at #techempower-fwbm - it's really fun meeting the
brilliant people behind the frameworks.

7) If I had to pick one big thing that's happened between R9 and R10, it
would be the drastically reduced barrier to entry. Running these benchmarks
requires configuring three computers, which is much harder than something like
_pip install_. Adding Vagrant support that can set up a development
environment in one command, or deploy to a benchmarking-ready AWS EC2
environment, has really reduced the barrier to getting involved. Adding
Travis-CI made it better - it will automatically verify that your changes
check out! Adding documentation at
[https://frameworkbenchmarks.readthedocs.org/en/latest/Projec...](https://frameworkbenchmarks.readthedocs.org/en/latest/Project-Information/)
made it even easier. Having a stable IRC community is even better! Tons of
changes have added up to mean that it's now easier than ever for someone to
get involved.

~~~
skrowl
Yeah, I clicked stats and tried to hide everything but IIS since that's the
web server I use. No results. Closed page.

~~~
hamiltont
I'm open to recommendations for systems similar to Travis-CI but supporting
Windows! Having some type of Windows CI would _really_ help bring Windows
support up to par.

EDIT: Actually, let me just link everyone to GitHub:

Here are the Windows compatibility issues:
[https://github.com/TechEmpower/FrameworkBenchmarks/issues?q=...](https://github.com/TechEmpower/FrameworkBenchmarks/issues?q=is%3Aopen+is%3Aissue+milestone%3A%22Windows+Compatibility%22)

Here is the specific issue asking for advice on which CI we should use to
support Windows:
[https://github.com/TechEmpower/FrameworkBenchmarks/issues/10...](https://github.com/TechEmpower/FrameworkBenchmarks/issues/1038)

------
MCRed
Reading through these tests, I'd note they measure database performance as
much as framework performance.

They are also single-node, which is great if your entire system is only ever
going to need one machine's worth of capacity (e.g., vertical scaling).

------
vinceyuan
Some frameworks I'd never heard of performed very well, but it looks like they
are not mature. Which framework do you recommend? I've used Node.js/Express,
Rails, and Sinatra but am not satisfied with them. I am learning Go.

------
cagenut
Since the "peak" hardware is a dual E5-2660v2 thats 32 threads, so a
c3.8xlarge would be a much more comparable instance.

~~~
kainsavage
We aren't trying to measure each hardware set as apples-to-apples, but rather
give the reader an idea of how performance characteristics for a chosen stack
are affected by hosting environment. Specifically, we wanted the middle-of-
the-road EC2 instances versus the extremely high-end Peak option to illustrate
that difference.

~~~
cagenut
thanks for the response

I've noticed a weird trend where amazon created various slices of instance
types a long time ago, and people have mentally gotten used to using larger
ones far slower than moores law adds cores. So people will refer to something
with 2 cores as "middle-of-the-road" and 32 as "extremely high-end" when in my
brain thats "a cell phone" and "a 2 year old server".

------
merb
Keep in mind that most of these numbers won't happen in production, especially
not the Netty and Lwan ones.

------
dilatedmind
What are the benefits of using this benchmark over using ab?

~~~
hamiltont
The main benefit is that this allows a rough comparison to a ton of other
frameworks. Just running _ab_ against your one server setup gives you one
RPS/latency result on one hardware setup - that's good to know as an absolute
metric, but it tells you very little about your performance relative to other
frameworks.

This project gives you RPS/latency metrics for many frameworks, on a few
hardware setups. This enables a rough comparison of "how does my framework
perform relative to all these other well-known or established frameworks".
Naturally, the comparison is not perfect - there are a ton of reasons that
measuring _just_ requests/sec and latency doesn't allow complete comparison
between two frameworks. However, once you accept that it is basically
impossible to fully compare any two frameworks using just quantitative
methods, and that these numbers should inform your choice of framework
(instead of totally controlling it), we can talk about why it's valuable.

Want to run a low-cost server in language X that you happen to love? This
project can provide guidance about which frameworks written in language X
perform the best. Want to ensure your service can support 50k requests per
second without latency suffering? This project can provide latency numbers for
you to examine that let you know which frameworks appear to maintain
acceptable latency even under high load.

If you wanted to, you could re-create this project by running _ab_ against
100+ frameworks - that's the cornerstone of what is happening here. Granted,
we currently use [https://github.com/wg/wrk](https://github.com/wg/wrk)
instead of _ab_, but the principle is the same - start up framework, run load
generation, capture result data. Most of the codebase is dedicated to ensuring
that these 100+ frameworks don't interfere with each other, setting up pseudo-
production environments with separate server/database/load generation servers,
and other concerns that have to be addressed.
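
If you're curious, the core of that loop is simple enough to sketch in a few
lines of Python (the start script and URL here are hypothetical placeholders,
not our actual scripts):

    import re
    import subprocess
    import time

    # Hypothetical script that boots one framework under test.
    server = subprocess.Popen(["./start-framework.sh"])
    time.sleep(5)  # crude wait for the server to come up
    try:
        out = subprocess.run(
            ["wrk", "-t", "8", "-c", "256", "-d", "15s",
             "http://127.0.0.1:8080/json"],
            capture_output=True, text=True, check=True).stdout
        m = re.search(r"Requests/sec:\s+([\d.]+)", out)
        print("RPS:", float(m.group(1)) if m else "parse failed")
    finally:
        server.terminate()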

Over time, this project has started to collect more statistics than just
requests/second and latency, which makes it more valuable than just running
_ab_. As more metrics and more frameworks are added, this becomes a really
valuable project for understanding how frameworks perform relative to one
another.

