Add an endpoint with a poorly optimized SQL query: 20 concurrent connections.
I don't find these benchmarks very valuable. Typically the bottlenecks are in the database, not in the web server or application code. Even when the bottleneck is in the code, optimizations only become truly valuable when you are at scale or doing something either horribly wrong or algorithmically unique.
In an asynchronous/event-driven server, slow SQL queries won't limit your concurrency, just your latency. Memory consumption is what limits concurrency; CPU and bandwidth limit throughput.
That said, I agree it's not a particularly useful benchmark.
They're valuable if the server is used to implement a COMET server.
We maintain a large number of connections (doing overlapping long polling) and we send the same information to each connection (about 30 times an hour at peak). Even if the query to produce that data was poorly optimised it's only being performed once and then the results pushed out to every connection.
Yes, something like long polling does require a server capable of maintaining many connections.
http-kit provides `async-response` for this kind of usage; see http://http-kit.org for more details.
I am the author of http-kit, and also one of the authors of delicious China (http://meiweisq.com); I'm trying to do some useful things for our users (like realtime) with it. :)
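A minimal sketch of the long-polling idea with http-kit's channel-style API (names follow later http-kit releases: `with-channel`, `send!`, `on-close`; older releases exposed this through `async-response`), assuming a shared atom of waiting clients:

    (require '[org.httpkit.server :refer [run-server with-channel send! on-close]])

    (def waiting-clients (atom #{}))

    (defn poll-handler [request]
      (with-channel request channel                ; handler returns, channel stays open
        (swap! waiting-clients conj channel)
        (on-close channel (fn [_] (swap! waiting-clients disj channel)))))

    (defn broadcast! [event]
      ;; run the (possibly expensive) query once, push the result to every waiter;
      ;; for plain HTTP channels send! closes the connection after sending by default
      (doseq [ch @waiting-clients]
        (send! ch {:status  200
                   :headers {"Content-Type" "application/json"}
                   :body    event})))

    (run-server poll-handler {:port 8080})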
Certainly your web server isn't likely to be the performance bottleneck in the majority of real-world applications. Even without a db call, just generating the response Clojure-side will kill your numbers (and Clojure is pretty darn fast).
Still, the ability to cheaply handle large numbers of concurrent connections is great for things like WebSockets and long polling where you aren't crunching data for every connection all the time. This is an async server, so it's well suited to such tasks.
Interesting as the article is, it's difficult to disagree re: database bottlenecks. Whilst the NoSQL world is iterating away, I wonder how much is going on behind the scenes at Oracle, Microsoft et al. in regard to asynchronous SQL. I guess it's a trade-off between transactions and dirty reads?
Maybe I'm not understanding what is going on here but it appears the author's client is communicating over a socket on the same machine as the server. The author is seeing insanely high numbers because she/he is bypassing the entire TCP/IP stack.
I believe this test was run locally for simplicity (and so people can reproduce it easily). As you've pointed out, that's a pretty artificial context and can obviously have a big absolute effect on the numbers.
I will say that in my limited experience these numbers are proportionally representative of what you can get in a real-world environment.
Even testing locally, getting concurrency numbers like this is tough or impossible with many (most?) servers.
So, yes, it's an artificial test - but I think the results are interesting when taken in the correct context.
While true, given this setup I suspect some layer is just dropping down to a Unix socket when it notices localhost.
Most notably, 600k active real TCP connections in the kernel would use somewhere around 6GB of memory, assuming an average of ~10kB of memory per socket for the R/W buffers and other data. EDIT: That's just the server side; double it to include the client sockets.
Most attempts I've seen at this number of real TCP connections required a lot more tweaking of kernel TCP settings.
For large numbers (1M+) of connections (doing WebSockets or long polling) to replicate the functionality of a COMET server, I've been playing with rolling my own TCP handling via libnetfilter_queue, as I simply don't need the ~10kB of r/w buffers on each socket.
Linux (as far as I'm aware) doesn't allow tuning of r/w buffer sizes on a per-interface basis; otherwise I'd have one interface for the COMET server with drastically reduced r/w buffer sizes, and the remaining interfaces with 'normal' TCP r/w buffer sizes to ensure the other things running on that host run without problems.
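For reference, those buffer sizes and the related limits are system-wide sysctls, which is exactly the problem: shrinking the default r/w buffers affects every socket on the host, not just the COMET interface. Something like the following (values are purely illustrative, not recommendations):

    # /etc/sysctl.conf -- illustrative values only
    fs.file-max = 1200000                        # system-wide file descriptor limit
    net.ipv4.ip_local_port_range = 1024 65535    # more ephemeral ports on the client side
    net.ipv4.tcp_rmem = 4096 4096 16777216       # per-socket read buffer: min default max (bytes)
    net.ipv4.tcp_wmem = 4096 4096 16777216       # per-socket write buffer: min default max (bytes)
    net.ipv4.tcp_mem  = 786432 1048576 1572864   # total TCP memory budget (in pages)
    # plus the per-process open-file limit (ulimit -n / the nofile entry in /etc/security/limits.conf)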
It bypasses checksumming, but no not the whole stack. Although not sure in this case it bypasses checksumming as the interfaces were created as aliases on a physical interface, which may mean it does do checksumming still.
Many OS kernels use their local domain socket codepath/interface for 127.0.0.1 <-> 127.0.0.1 communications, completely bypassing TCP/IP... This is transparent to the application.
Linux's epoll and FreeBSD's kqueue are ridiculously scalable. Both http-kit and Netty take advantage of them; http-kit is all about HTTP, while Netty is a general-purpose framework.
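To make the readiness-selection point concrete, here is a toy accept/echo loop over java.nio from Clojure (nothing like http-kit's actual code, just the pattern): one selector watches every socket, and each pass only touches the connections that actually have events, which is why huge numbers of idle connections cost almost nothing.

    (import '[java.nio.channels Selector ServerSocketChannel SocketChannel SelectionKey]
            '[java.net InetSocketAddress]
            '[java.nio ByteBuffer])

    (defn toy-event-loop [port]
      (let [selector (Selector/open)
            server   (doto (ServerSocketChannel/open)
                       (.configureBlocking false)
                       (.bind (InetSocketAddress. port)))]
        (.register server selector SelectionKey/OP_ACCEPT)
        (loop []
          (.select selector)                         ; blocks until at least one channel is ready
          (let [ready-keys (.selectedKeys selector)]
            (doseq [^SelectionKey k ready-keys]
              (cond
                (.isAcceptable k)                    ; new connection: register it for reads
                (let [^ServerSocketChannel srv (.channel k)
                      ch (doto (.accept srv) (.configureBlocking false))]
                  (.register ch selector SelectionKey/OP_READ))

                (.isReadable k)                      ; bytes available: echo them back
                (let [^SocketChannel ch (.channel k)
                      buf (ByteBuffer/allocate 1024)]
                  (if (neg? (.read ch buf))
                    (.close ch)                      ; peer closed the connection
                    (do (.flip buf) (.write ch buf))))))
            (.clear ready-keys))                     ; handled keys must be removed
          (recur))))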
The server itself seems to be written in Java (https://github.com/http-kit/http-kit). I wonder what the factors were for not writing it in Clojure, given that Clojure is the target platform.
To clarify: the core server is indeed written in Java, but it's written to conform to the standard Clojure web server (Ring [1]) spec and to be consumed by Clojure applications.
The public API is all Clojure-side and essentially uses the Java parts as an implementation detail. This kind of interop is easy to do with Clojure and is usually motivated by performance.
The best way of describing http-kit would probably be as a "Clojure web server written in Java".
Author here.
Java part:
1. Java NIO's performance is amazing: event-driven reading/writing of bytes.
2. Maintaining the state machine while parsing HTTP out of a byte buffer needs many local variables, and I'm used to doing that in C-style code.
Clojure:
I like this language; it's brilliant. So I wrote a fast HTTP server/client for it. Clojure is also written in Java, so the interoperation is great.
Can the http-kit server be used from Java directly? Last I looked, all the async Java web servers seemed limited to thousands of connections, which largely defeats the purpose. I'm willing to give up the servlet spec if needed.
Maybe you can try tweaking the maximum allowed open files to a larger value. The default is about 1024, which is probably why you only got about 1000 (I'm guessing). Jetty is quite good at concurrency; you can double-check it.
Why not try Clojure? Your web development productivity will instantly increase several times over.
Looks good. My application (database-ish) is in Java and I haven't worked on any bindings yet. I'm just trying to put together a demo that shows off the concurrency.
Don't get me wrong, I really like what you are doing, but there are a lot of spelling and grammar errors on your web site. I'm not a native speaker (I also make a lot of errors), but maybe you can get some help from one? Not a big deal, just wanted to let you know. It makes sense if you want to get popular =)
The library author is Chinese, so English isn't his native language. We'll be cleaning up typos soon - this post caught us by surprise.
As for Compojure, etc.- those are libraries that operate on top of a Ring web server. Jetty is the default, http-kit is a drop-in replacement. So basically, you'd use both http-kit and whatever other libraries you normally would (like Compojure).
I swapped out a production Jetty+Compojure app to use http-kit+Compojure by changing ~20 lines of code.
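For anyone curious, the swap is essentially this (an illustrative sketch, not the actual app); everything Compojure-side stays the same and only the adapter call changes:

    (require '[compojure.core :refer [defroutes GET]]
             '[org.httpkit.server :as http-kit])
    ;; previously: (require '[ring.adapter.jetty :as jetty])

    (defroutes app
      (GET "/" [] "hello"))

    ;; previously: (jetty/run-jetty app {:port 8080 :join? false})
    (http-kit/run-server app {:port 8080})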
Benchmarks author here. I did my best to get representative numbers, but good benchmarking is tough and this isn't my area of expertise. There are almost certainly errors and room for improvement (suggestions very welcome!).
In particular, I'm pretty sure nginx could be tuned to take the lead here.
BTW I also ran benchmarks against a more realistic EC2 environment with proportionally similar results (EC2 is slow but its effect across servers was consistent).
1) Notice that in his tests 97% of these connections don't do anything, just idle. He maxes out at 18764 req/sec. If you google around, Apache and Nginx can do more than that on a beefy server.
2) Notice that they are "keep-alived", coming from the same IP, so not truly separate connections.
3) Keep in mind that 600K concurrent connections cannot possibly do anything useful at the same time for many reasons (CPU, bandwidth, server I/O), so they are not truly concurrent.
4) Max concurrent connections are also limited by the OS, and that limit is much, much lower than 600K by default (the per-process open-file limit is commonly 1024, for example).
If you have 600K concurrent connections they are concurrent. You could have them on a single core, limited severely on bandwidth and disk I/O.
What you don't have is parallelism, since they are not executing at the same time. I feel that this distinction is important to make. That most of them are waiting on a given event to happen does not make it any less concurrent, but it does limit the amount of parallelism which is possible.
On a single core machine, the operating system is concurrently executing processes, perhaps a hundred of those. But only one process can be on the CPU at a time, so the parallelism count is always either 0 or 1.
> Notice that in his tests 97% of these connections don't do anything, just idle. He maxes out at 18764 req/sec.
Yes, this is just testing how many concurrent connections can be held.
While the 600k connections are held, ab confirms the server can still do about 31,405 requests per second, with an HTTP body of 1024 bytes.
> Notice that they are "keep-alived", coming from the same IP, so not truly separate connections
Not from the same IP; from many IPs: 192.168.1.200~230.
> Keep in mind that 600K concurrent connections cannot possibly do anything useful at the same time for many reasons (CPU, bandwidth, server I/O), so they are not truly concurrent
They send a request to the server every 5s~30s and wait for the response.
You seem to be missing the point of the scenario they're testing.
Lots of idle connections (doing overlapping long polling) is exactly how many COMET servers work.
We send ~60 "events" via our COMET server (APE from www.ape-project.org) in a typical 2 hour period.
The server side work to decide when/what to send the clients is easy because it's the same information that gets sent whether there is 1 connection or 1,000,000.
The fact they're from just 31 different IP addresses isn't relevant. They're still individual connections from clients to the end server.
> The fact they're from just 31 different IP addresses isn't relevant. They're still individual connections from clients to the end server.
That's where you are wrong. Not only are they keepalive connections, they are completely local. Do it over an actual network from 50K different IPs and see how that performs.
Again you're missing the point. Just checking 31 sockets for data is much much much less work than checking 600k sockets, even if they are all via local IPs.
I agree that a connection from a local IP is not as much work for the kernel as from a remote IP, but it's the same amount of work for the server portion of the software to service each of the connections whether they are local or remote. Remember too that the host machine is running both the server and the process generating the client load. Generating the client traffic will be costlier than what is saved by the local traffic not traversing the full stack.
Yes, ideally two machines (one with a whole bunch of virtual IPs to fake the clients, and the other hosting the server) would be a better test, that way the machine hosting the server is going via the full network stack.
> Do it over an actual network from 50K different IPs and see how that performs.
And I don't see what difference having unique IPs or not will make to the networking performance of the server. Incoming connections (from a real network) are going to cause the same amount of work regardless of the remote IP (assuming there are no DNS lookups), and iptables or other firewall rules should have minimal impact even if you spray a huge number of unique IPs at them.
For my testing of a similar scenario I use a couple of old blade servers (2 chassis of 24 PIII 700MHz blades each) to generate the load. Each blade has a unique IP, and for 500,000 connections I need 1M sockets (each connection can have two open concurrently as they overlap), which works out to 41,666 sockets per blade; that fits with a tweak to the ephemeral port range.
My server keeps long polling connections for ~25 seconds. The total network cost of each poll is ~800 bytes[1] (TCP connection initiation, HTTP request, HTTP response, TCP teardown). 500,000 polls every 25 seconds = 20,000/sec.
Luckily each blade chassis has 3 x 100Mbps ethernet ports (Gigabit would have been nice but these are old blade servers) on separate backplanes (public, private, mgmt) so I split the 24 blades up with 8 on each interface to keep well below the 100Mbps limit of each port.
1. Which is why WebSockets are much more efficient; roll on adoption in the popular browsers (not just the few users who run relatively recent installs of Chrome/Firefox).
It seems to me it's not, because the Node.js test used a remote machine to drive the connections whereas this one is driving the connections from the same machine. Running the server and test harness on the same machine bypasses a whole lot of networking stack.
Also, it seems the limitation in Node.js is the maximum heap size of around 1.4GB [1]. This test used a heap size of 3GB which is just not possible in Node (until the limitation is removed, anyway).
Author here.
Yes, Clojure can use more than 1.4GB of heap, and it can use threads. http-kit is multithreaded (one thread just for IO, the others for computing responses). A strength of Clojure?
What is memory usage per connection? How much unnecessary data copying is going on? What is latency?
What happens under the real load of, say, a thousand concurrent TCP connections together with a few thousand pending calls to back ends or other data sources?
What will happen to memory usage and latency when the simple setup above serves the simplest remote requests (which means stalled connections, retransmissions, etc.) for 8 hours? 24 hours?
http-kit needs a few kilobytes of memory per connection (a buffer for parsing the HTTP request, maintaining state, etc.).
The thread model used by http-kit: a dedicated thread (the server loop) does only event IO and parsing; when a request is fully parsed, it is queued for a thread pool to pick up; the thread pool computes the response and queues it back for the server-loop thread to write to the client.
Since epoll's and kqueue's readiness selection is O(1), idle connections don't hurt latency at all.
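A rough schematic of that handoff (not http-kit's actual code, just the shape of the pattern, with made-up names):

    (import '[java.util.concurrent Executors ExecutorService LinkedBlockingQueue TimeUnit])

    (def ^ExecutorService worker-pool (Executors/newFixedThreadPool 4))  ; computes responses (app code)
    (def ^LinkedBlockingQueue write-queue (LinkedBlockingQueue.))        ; responses waiting for the IO thread

    (defn compute-response [parsed-request]
      ;; stand-in for calling the application's Ring handler
      {:status 200 :body (str "handled " parsed-request)})

    (defn io-loop-step [parsed-requests]
      ;; 1. the IO thread hands fully parsed requests off and never blocks on app code
      (doseq [req parsed-requests]
        (.submit worker-pool ^Runnable (fn [] (.put write-queue (compute-response req)))))
      ;; 2. ...then writes back whatever responses have become ready
      (loop []
        (when-let [resp (.poll write-queue 1 TimeUnit/SECONDS)]
          (println "write to socket:" resp)
          (recur))))

    (io-loop-step ["GET /a" "GET /b" "GET /c"])
    (.shutdown worker-pool)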