

500,000 Requests/Sec – Modern HTTP Servers Are Fast - arete
http://lowlatencyweb.wordpress.com/2012/03/20/500000-requestssec-modern-http-servers-are-fast/

======
rdoherty
Sigh, another completely synthetic benchmark on big iron: a dual Intel Xeon
X5670 with 24GB of RAM from SoftLayer. The X5670 has 6 cores @ 2.93 GHz with 2
threads per core, so /proc/cpuinfo shows 24 CPUs.

Serving a plain HTML page over _localhost_ via nginx. A medium-length blog
post with pretty much no real-world information.

~~~
jules
Still, for comparison, a huge site like reddit serves around 1000 pageviews
per second. The hardware in this post isn't that good; nowadays you can do
quite a bit better. If you could get even 0.02% of this rate out of more
modern hardware, that's already 100 pageviews per second. Only a handful of
sites need more than that. For example, HN certainly does not. And AFAIK HN
runs on a single core in a language (Arc) that compiles to Scheme, not the
fastest in town. Since reddit is only about 70x larger than HN, we are
probably close to the point where, with a lot of optimization, a site like
reddit could be run on a single large box (if we're not there already).

~~~
ma2rten
Dude... I am sorry, but your reply does not make sense on many levels:

I don't know if reddit could run on a single server, but this benchmark does
nothing to prove it. A pageview on reddit consists of several HTTP requests,
and each HTTP request translates into several requests to a database backend.
In this synthetic benchmark he is requesting the same static file over and
over again, so nginx and/or Linux can cache it in memory (or even CPU cache).

Also, even if they could put reddit on a single server, it would require a lot
of optimization, and engineering time is more expensive than hardware. It
would also make adding new features a pain in the ass, because you would have
to be very frugal with CPU time (and other resources like memory, bandwidth,
etc.). You also want a multi-server setup for redundancy and for handling
extra capacity.

EDIT: I don't know if you're also planning to put the database on the same
server? Web servers are basically stateless, so they are easy to scale. It
would be interesting to calculate how many disks that server would need just
to handle the I/O throughput.

Full disclosure: This is someone speaking who used to work for a social
network with 9M users and 3,000 servers.

~~~
jules
This benchmark does not, but the fact that Hacker News, which is just 70x
smaller but conceptually similar, runs on a single core in a slow language
_does_. I completely agree that it's probably not a good idea, but the fact
that it's possible or almost possible is still interesting in my opinion. If
nothing else, the power of modern hardware puts the obsession with horizontal
scalability into perspective, since most sites are nowhere near the size of
reddit.

Even though a pageview on reddit translates to several HTTP requests, only one
of those requests has to serve dynamic HTML. The requests that get the JS, CSS
and images can go to a content delivery network (and they probably already
do). You'd probably indeed want to put the database on the same server, with
the vast majority of DB accesses hitting main memory (as you can easily get a
server with hundreds of gigabytes or even terabytes of RAM nowadays -- it
helps that reddit's access pattern is very heavily concentrated on new posts).
Of course you have to replicate to multiple servers to prevent data loss.

9M users and 3,000 servers sounds like a lot. What kind of servers are these?
What is the size of the database, roughly?

------
acidx
My toy web server can achieve similar performance on much more modest
hardware (a Core i7 2640 laptop), using far less RAM (a few dozen kilobytes).

Granted, this is also being tested on localhost with static content (no disk
I/O) -- but it shows that event-driven servers are not that novel or difficult
to write: my code weighs in at around 1700 LOC of (might I say) readable C.

Static file serving is also fast (using sendfile(), etc), but needs an
overhaul to achieve usable concurrency. Currently there's a ~4x performance
drop while serving files, but I'm working on this.

(The sources are at <http://github.com/lpereira/lwan> by the way.)

------
halayli
nginx saturates at ~18k req/sec per core with a latency of ~24ms. This
saturation comes not from nginx in particular but from OS limits (mode
switching, stack copying, etc. for read/write system calls).

There is nothing new in "modern HTTP servers". They are event-driven programs,
and those have existed for a long time.

~~~
fpp
As you rightly said, the key limit at such high I/O levels is first the OS
processing - the implementation of the network stack on the iron. Above 50k
the limit comes from the "pipes" moving the data up from the network card to
the OS layer where the web server sits (and providing the layer 4 / TCP
services).

Hence in telecom environments, where you get such requirements, the TCP/IP
stack is processed closer to the card - but no out-of-the-box web server could
handle such streams; most of that is handled with, e.g., C libs closely bound
to the networking hardware.

With such specialized network cards (16-core cards) plus libs you get to more
than 1,500,000 connection setups / teardowns per second and roughly 15 Gbps -
there are now 100 Gbps solutions at the top end of the market - and your
actual throughput depends mostly on what kind of processing you're doing. That
kind of equipment is of course not normally used for web serving, but rather
for (transparent) proxying and inspection / traffic shaping, etc., as more and
more telecom cores are completely TCP/IP based.

~~~
halayli
I think using a generic OS for I/O operations is not a good idea in general.
We all do it, but there's got to be a better solution. We don't need the
kernel/userland isolation that introduces mode switching, stack copying,
argument copying, etc. In such an OS the environment is under our supervision
and processes are trusted.

~~~
fpp
With the telecom example I provided above, no generic OS is used for that -
the protocol (e.g. TCP, IP or even HTTP) is (pre-)processed within the
hardware and not further up in the OS.

Generally Linux / Unix (carrier grade / HA) is used as the OS within such
environments. Context switches (userland / kernel) are some of the most
expensive operations, so they are to be avoided as often as possible.

What you're pointing to are features you generally find in RTOS (real-time OS)
solutions - and these kinds of cards / platforms allow using them. Look up the
Trillium platform from CCPU, for example, to see how such a stack / system is
layered - they use Wind River PNE-LE (the Linux edition for network equipment)
for the HA and non-HA protocols.

------
NeutronBoy
In a real world app, I think you'd be considered lucky if the web-server is
your bottleneck, rather than your DB or network connection.

~~~
RegEx
As someone who has only worked on extremely small sites, I've never had to
deal with scaling/bottlenecks. Do you have any recommended reads on handling
DB bottlenecks and the like so I can be prepared when that day comes? Thanks!

~~~
charliepark
Gregg Pollock made a series of videos (underwritten by RPM
(<https://rpm.newrelic.com/>)), called Scaling Rails
(<http://railslab.newrelic.com/scaling-rails>). It's _phenomenal_. That link
shows the "contents" in the left-hand sidebar. Start at the bottom and work
your way up the list.

~~~
beambot
That looks phenomenal. Is it equally-applicable to, say, Django development?
Or is there a sister-series?

~~~
Luyt
Jacob Kaplan-Moss did a talk on 'Django Deployment' which is really more about
scalability (reverse proxies in front of your webapp) and reliability
(multiple database backends etc).

<http://ontwik.com/python/django-deployment-workshop-by-jacob-kaplan-moss/>

------
alexlitov
Quickly looking over SoftLayer's price sheet, it looks to be a little over $1k
a month for a Xeon X5670 with 24GB of RAM.

~~~
StavrosK
I'm not really familiar with CPUs, but I got a hyperthreaded, quad-core Xeon
with 24 GB RAM, two 750 GB disks in RAID 1, and 10 TB of bandwidth from
Hetzner for 60 euros a month... Isn't that comparable to what you mention,
apart from being many times cheaper?

<http://www.hetzner.de/en/hosting/produkte_rootserver/ex5>

~~~
bbgm
Not quite. Check this out

<http://ark.intel.com/compare/47920,37147>

The SoftLayer server is a Xeon; the Hetzner one is a desktop-grade processor.

Key differences for me

* Cache

* QPI (especially in a 2P config that makes a difference)

* Max TDP ... that's a HUGE one.

* ECC memory. See [http://perspectives.mvdirona.com/2012/02/26/ObservationsOnEr...](http://perspectives.mvdirona.com/2012/02/26/ObservationsOnErrorsCorrectionsTrustOfDependentSystems.aspx) for why this is important

So, you're not comparing apples to apples

~~~
StavrosK
Ah, yes, sorry, it's an i7, not a Xeon. Still, you could get around 40 of
these machines for the price of one of those, so I'd take these.

------
stuhood
Since this is almost pure sendfile() work (aside from the headers), it really
doesn't seem like a very useful example... there isn't much static content
left on the web.

~~~
morsch
Just to give a different point of view: I'm sure you're right per request,
which I guess is what matters here. But per _volume_, most content on the web
is static.

------
krakensden
The flipside is that as time goes on, this benchmark becomes less impressive.
The server in question had 24GB of RAM and two 6-core CPUs (24 hardware
threads with hyperthreading).

------
jamesu
A comparison with older hardware would have been nice. Also using localhost is
not a very good indicator of real-world performance.

------
ghempton
The ironic part is that his blog is hosted on WordPress.

------
mthreat
I'm interested in more info on your Linux TCP tuning, especially how you
decided on tcp_tw_recycle and tcp_fin_timeout (and how you decided that
setting the former to 1 is safe).
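
For readers wondering what such tuning looks like: the post itself doesn't
reproduce its settings here, so the fragment below is a hypothetical example
in the same spirit, with the exact values being guesses to verify against your
own kernel. Notably, tcp_tw_recycle is known to break clients behind NAT (and
was eventually removed from Linux in 4.12), which is precisely why the
question about its safety is a good one:

```shell
# Hypothetical benchmark-style TCP tuning -- illustrative values only.
sysctl -w net.ipv4.tcp_fin_timeout=15               # reap FIN-WAIT-2 sockets sooner
sysctl -w net.ipv4.tcp_tw_reuse=1                   # reuse TIME_WAIT for outgoing conns;
                                                    # safer than tcp_tw_recycle
sysctl -w net.ipv4.ip_local_port_range="1024 65535" # more ephemeral ports for the client
sysctl -w net.core.somaxconn=4096                   # deeper accept() backlog
```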

