Add an endpoint with a poorly optimized SQL query: 20 concurrent connections.
I don't find these benchmarks very valuable. Typically the bottlenecks are in the database, not in the web server or application code. Even when the bottleneck is in the code, optimizations only become truly valuable when you are at scale or doing something either horribly wrong or algorithmically unique.
In an asynchronous/event-driven server, slow SQL queries won't limit your concurrency, just your latency. Memory consumption is what limits concurrency; CPU and bandwidth limit throughput.
That said, I agree it's not a particularly useful benchmark.
They're valuable if the server is used to implement a COMET server.
We maintain a large number of connections (doing overlapping long polling) and we send the same information to each connection (about 30 times an hour at peak). Even if the query to produce that data was poorly optimised it's only being performed once and then the results pushed out to every connection.
Yes, something like long polling does require a server capable of maintaining many connections.
http-kit provides `async-response` for this kind of usage; see http://http-kit.org for more details.
I am the author of http-kit, and also one of the authors of delicious China (http://meiweisq.com); I'm trying to do some useful things for our users (like realtime) with it. :)
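A minimal sketch of the long-polling idea with http-kit's channel-style API (names follow later http-kit releases: `with-channel`, `send!`, `on-close`; older releases exposed this through `async-response`), assuming a shared atom of waiting clients:

    (require '[org.httpkit.server :refer [run-server with-channel send! on-close]])

    (def waiting-clients (atom #{}))

    (defn poll-handler [request]
      (with-channel request channel                ; handler returns, channel stays open
        (swap! waiting-clients conj channel)
        (on-close channel (fn [_] (swap! waiting-clients disj channel)))))

    (defn broadcast! [event]
      ;; run the (possibly expensive) query once, push the result to every waiter;
      ;; for plain HTTP channels send! closes the connection after sending by default
      (doseq [ch @waiting-clients]
        (send! ch {:status  200
                   :headers {"Content-Type" "application/json"}
                   :body    event})))

    (run-server poll-handler {:port 8080})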
Certainly your web server isn't likely to be the performance bottleneck in the majority of real-world applications. Even without a db call, just generating the response Clojure-side will kill your numbers (and Clojure is pretty darn fast).
Still, the ability to cheaply handle large numbers of concurrent connections is great for things like WebSockets and long polling where you aren't crunching data for every connection all the time. This is an async server, so it's well suited to such tasks.
Interesting as the article is, it's difficult to disagree re: database bottlenecks. Whilst the NoSQL world is iterating away, I wonder how much is going on behind the scenes at Oracle, Microsoft et al. in regard to asynchronous SQL. I guess it's a trade-off between transactions and dirty reads?
Maybe I'm not understanding what is going on here but it appears the author's client is communicating over a socket on the same machine as the server. The author is seeing insanely high numbers because she/he is bypassing the entire TCP/IP stack.
I believe this test was run locally for simplicity (and so people can reproduce it easily). As you've pointed out, that's a pretty artificial context and can obviously have a big absolute effect on the numbers.
I will say that in my limited experience these numbers are proportionally representative of what you can get in a real-world environment.
Even testing locally, getting concurrency numbers like this is tough or impossible with many (most?) servers.
So, yes, it's an artificial test - but I think the results are interesting when taken in the correct context.
While true, given this setup I suspect some layer is just dropping down to a Unix socket when it notices localhost.
Most notably, 600k active real TCP connections in the kernel would use somewhere around 6GB of memory, assuming an average of ~10kB of memory per socket for the R/W buffers and other data. EDIT: That's just the server side; double it to include the client sockets.
Most attempts I've seen at this number of real TCP connections required a lot more tweaking of kernel TCP settings.
For large numbers (1M+) of connections (doing WebSockets or long polling) to replicate the functionality of a COMET server, I've been playing with rolling my own TCP handling via libnetfilter_queue, as I simply don't need the ~10kB of r/w buffers on each socket.
Linux (as far as I'm aware) doesn't allow tuning of r/w buffer sizes on a per-interface basis; otherwise I'd have one interface for the COMET server with drastically reduced r/w buffer sizes, and the remaining interfaces with 'normal' TCP r/w buffer sizes to ensure the other things running on that host run without problems.
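For reference, those buffer sizes and the related limits are system-wide sysctls, which is exactly the problem: shrinking the default r/w buffers affects every socket on the host, not just the COMET interface. Something like the following (values are purely illustrative, not recommendations):

    # /etc/sysctl.conf -- illustrative values only
    fs.file-max = 1200000                        # system-wide file descriptor limit
    net.ipv4.ip_local_port_range = 1024 65535    # more ephemeral ports on the client side
    net.ipv4.tcp_rmem = 4096 4096 16777216       # per-socket read buffer: min default max (bytes)
    net.ipv4.tcp_wmem = 4096 4096 16777216       # per-socket write buffer: min default max (bytes)
    net.ipv4.tcp_mem  = 786432 1048576 1572864   # total TCP memory budget (in pages)
    # plus the per-process open-file limit (ulimit -n / the nofile entry in /etc/security/limits.conf)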
It bypasses checksumming, but no not the whole stack. Although not sure in this case it bypasses checksumming as the interfaces were created as aliases on a physical interface, which may mean it does do checksumming still.
Many OS kernels use their local domain socket codepath/interface for 127.0.0.1 <-> 127.0.0.1 communications, completely bypassing TCP/IP... This is transparent to the application.
Linux's epoll and FreeBSD's kqueue are ridiculously scalable. Both http-kit and Netty take advantage of them; http-kit is all about HTTP, while Netty is a general-purpose framework.
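To make the readiness-selection point concrete, here is a toy accept/echo loop over java.nio from Clojure (nothing like http-kit's actual code, just the pattern): one selector watches every socket, and each pass only touches the connections that actually have events, which is why huge numbers of idle connections cost almost nothing.

    (import '[java.nio.channels Selector ServerSocketChannel SocketChannel SelectionKey]
            '[java.net InetSocketAddress]
            '[java.nio ByteBuffer])

    (defn toy-event-loop [port]
      (let [selector (Selector/open)
            server   (doto (ServerSocketChannel/open)
                       (.configureBlocking false)
                       (.bind (InetSocketAddress. port)))]
        (.register server selector SelectionKey/OP_ACCEPT)
        (loop []
          (.select selector)                         ; blocks until at least one channel is ready
          (let [ready-keys (.selectedKeys selector)]
            (doseq [^SelectionKey k ready-keys]
              (cond
                (.isAcceptable k)                    ; new connection: register it for reads
                (let [^ServerSocketChannel srv (.channel k)
                      ch (doto (.accept srv) (.configureBlocking false))]
                  (.register ch selector SelectionKey/OP_READ))

                (.isReadable k)                      ; bytes available: echo them back
                (let [^SocketChannel ch (.channel k)
                      buf (ByteBuffer/allocate 1024)]
                  (if (neg? (.read ch buf))
                    (.close ch)                      ; peer closed the connection
                    (do (.flip buf) (.write ch buf))))))
            (.clear ready-keys))                     ; handled keys must be removed
          (recur))))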
The server itself seems to be written in Java (https://github.com/http-kit/http-kit). I wonder what the factors were for not writing it in Clojure, given that Clojure is the target platform.
To clarify: the core server is indeed written in Java, but it's written to conform to the standard Clojure web server (Ring [1]) spec and to be consumed by Clojure applications.
The public API is all Clojure-side and essentially uses the Java parts as an implementation detail. This kind of interop is easy to do with Clojure and is usually motivated by performance.
The best way of describing http-kit would probably be as a "Clojure web server written in Java".
Author here.
Java part:
1. Java NIO's performance is amazing: event-driven reading/writing of bytes.
2. Maintaining the state machine while parsing HTTP out of a byte buffer needs many local variables, and I'm used to doing that in C-style code.
Clojure:
I like this language; it's brilliant. So I wrote a fast HTTP server/client for it. Clojure is also written in Java, so the interoperation is great.
Can the http-kit server be used from Java directly? Last I looked, all the async Java web servers seemed limited to thousands of connections, which largely defeats the purpose. I'm willing to give up the servlet spec if needed.
Maybe you can try tweaking the maximum allowed open files to a larger value. The default is about 1024, which is probably why you only got about 1000 (I'm guessing). Jetty is quite good at concurrency; you can double-check it.
Why not try Clojure? Your web development productivity will instantly increase several times over.
Looks good. My application (database-ish) is in Java and I haven't worked on any bindings yet. I'm just trying to put together a demo that shows off the concurrency.
Don't get me wrong, I really like what you are doing, but there are a lot of spelling and grammar errors on your web site. I'm not a native speaker (I also make a lot of errors), but maybe you can get some help from one? Not a big deal, just wanted to let you know. It makes sense if you want to get popular =)
The library author is Chinese, so English isn't his native language. We'll be cleaning up typos soon - this post caught us by surprise.
As for Compojure, etc.- those are libraries that operate on top of a Ring web server. Jetty is the default, http-kit is a drop-in replacement. So basically, you'd use both http-kit and whatever other libraries you normally would (like Compojure).
I swapped out a production Jetty+Compojure app to use http-kit+Compojure by changing ~20 lines of code.
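For anyone curious, the swap is essentially this (an illustrative sketch, not the actual app); everything Compojure-side stays the same and only the adapter call changes:

    (require '[compojure.core :refer [defroutes GET]]
             '[org.httpkit.server :as http-kit])
    ;; previously: (require '[ring.adapter.jetty :as jetty])

    (defroutes app
      (GET "/" [] "hello"))

    ;; previously: (jetty/run-jetty app {:port 8080 :join? false})
    (http-kit/run-server app {:port 8080})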
Benchmarks author here. I did my best to get representative numbers, but good benchmarking is tough and this isn't my area of expertise. There are almost certainly errors and room for improvement (suggestions very welcome!).
In particular, I'm pretty sure nginx could be tuned to take the lead here.
BTW I also ran benchmarks against a more realistic EC2 environment with proportionally similar results (EC2 is slow but its effect across servers was consistent).
1) Notice that in his tests 97% of these connections don't do anything, just idle. He maxes out at 18764 req/sec. If you google around, Apache and Nginx can do more than that on a beefy server.
2) Notice that they are "keep-alived", coming from the same IP, so not truly separate connections.
3) Keep in mind that 600K concurrent connections cannot possibly do anything useful at the same time for many reasons (CPU, bandwidth, server I/O), so they are not truly concurrent.
4) Max concurrent connections are also limited by the OS, and that limit is much, much lower than 600K by default (the per-process open-file limit is commonly 1024, for example).
If you have 600K concurrent connections they are concurrent. You could have them on a single core, limited severely on bandwidth and disk I/O.
What you don't have is parallelism, since they are not executing at the same time. I feel that this distinction is important to make. That most of them are waiting on a given event to happen does not make it any less concurrent, but it does limit the amount of parallelism which is possible.
On a single core machine, the operating system is concurrently executing processes, perhaps a hundred of those. But only one process can be on the CPU at a time, so the parallelism count is always either 0 or 1.
> Notice that in his tests 97% of these connections don't do anything, just idle. He maxes out at 18764 req/sec.
Yes, this is just testing how many concurrent connections can be held.
While the 600k connections are held, ab confirms the server can still do about 31,405 requests per second, with an HTTP body of 1024 bytes.
> Notice that they are "keep-alived", coming from the same IP, so not truly separate connections
Not from the same IP; from many IPs: 192.168.1.200~230.
> Keep in mind that 600K concurrent connections cannot possibly do anything useful at the same time for many reasons (CPU, bandwidth, server I/O), so they are not truly concurrent
They send a request to the server every 5s~30s and wait for the response.
You seem to be missing the point of the scenario they're testing.
Lots of idle connections (doing overlapping long polling) is exactly how many COMET servers work.
We send ~60 "events" via our COMET server (APE from www.ape-project.org) in a typical 2 hour period.
The server side work to decide when/what to send the clients is easy because it's the same information that gets sent whether there is 1 connection or 1,000,000.
The fact they're from just 31 different IP addresses isn't relevant. They're still individual connections from clients to the end server.
> The fact they're from just 31 different IP addresses isn't relevant. They're still individual connections from clients to the end server.
That's where you are wrong. Not only are they keepalive connections, they are completely local. Do it over an actual network from 50K different IPs and see how that performs.
Again you're missing the point. Just checking 31 sockets for data is much much much less work than checking 600k sockets, even if they are all via local IPs.
I agree that a connection from a local IP is not as much work for the kernel as from a remote IP, but it's the same amount of work for the server portion of the software to service each of the connections whether they are local or remote. Remember too that the host machine is running both the server and the process generating the client load. Generating the client traffic will be costlier than what is saved by the local traffic not traversing the full stack.
Yes, ideally two machines (one with a whole bunch of virtual IPs to fake the clients, and the other hosting the server) would be a better test, that way the machine hosting the server is going via the full network stack.
> Do it over an actual network from 50K different IPs and see how that performs.
And I don't see what difference having unique IPs or not will make to the networking performance of the server. Incoming connections (from a real network) are going to cause the same amount of work regardless of the remote IP (assuming there are no DNS lookups), and iptables or other firewall rules should have minimal impact even if you spray a huge number of unique IPs at them.
For my testing of a similar scenario I use a couple of old blade servers (2 chassis of 24 PIII 700MHz blades each) to generate the load. Each blade has a unique IP, and for 500,000 connections I need 1M sockets (each connection can have two open concurrently as they overlap), which works out to 41,666 sockets per blade; that fits with a tweak to the ephemeral port range.
My server keeps long polling connections for ~25 seconds. The total network cost of each poll is ~800 bytes[1] (TCP connection initiation, HTTP request, HTTP response, TCP teardown). 500,000 polls every 25 seconds = 20,000/sec.
Luckily each blade chassis has 3 x 100Mbps ethernet ports (Gigabit would have been nice but these are old blade servers) on separate backplanes (public, private, mgmt) so I split the 24 blades up with 8 on each interface to keep well below the 100Mbps limit of each port.
1. Which is why WebSockets are much more efficient; roll on adoption in the popular browsers (not just the few users who run relatively recent installs of Chrome/Firefox).
It seems to me it's not, because the Node.js test used a remote machine to drive the connections whereas this one is driving the connections from the same machine. Running the server and test harness on the same machine bypasses a whole lot of networking stack.
Also, it seems the limitation in Node.js is the maximum heap size of around 1.4GB [1]. This test used a heap size of 3GB which is just not possible in Node (until the limitation is removed, anyway).
Author here.
Yes, Clojure can use more than 1.4GB of heap, and it can use threads. http-kit is multithreaded (one thread just for IO, the others for computing responses). A strength of Clojure?
What is memory usage per connection? How much unnecessary data copying is going on? What is latency?
What happens under the real load of, say, a thousand concurrent TCP connections together with a few thousand pending calls to back ends or other data sources?
What will happen to memory usage and latency when the simple setup above serves the simplest remote requests (which means stalled connections, retransmissions, etc.) for 8 hours? 24 hours?
http-kit needs a few kilobytes of memory per connection (a buffer for parsing the HTTP request, maintaining state, etc.).
The thread model used by http-kit: a dedicated thread (the server loop) does only event IO and parsing; when a request is fully parsed, it is queued for a thread pool to pick up; the thread pool computes the response and queues it back for the server-loop thread to write to the client.
Since epoll's and kqueue's readiness selection is O(1), idle connections don't hurt latency at all.
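A rough schematic of that handoff (not http-kit's actual code, just the shape of the pattern, with made-up names):

    (import '[java.util.concurrent Executors ExecutorService LinkedBlockingQueue TimeUnit])

    (def ^ExecutorService worker-pool (Executors/newFixedThreadPool 4))  ; computes responses (app code)
    (def ^LinkedBlockingQueue write-queue (LinkedBlockingQueue.))        ; responses waiting for the IO thread

    (defn compute-response [parsed-request]
      ;; stand-in for calling the application's Ring handler
      {:status 200 :body (str "handled " parsed-request)})

    (defn io-loop-step [parsed-requests]
      ;; 1. the IO thread hands fully parsed requests off and never blocks on app code
      (doseq [req parsed-requests]
        (.submit worker-pool ^Runnable (fn [] (.put write-queue (compute-response req)))))
      ;; 2. ...then writes back whatever responses have become ready
      (loop []
        (when-let [resp (.poll write-queue 1 TimeUnit/SECONDS)]
          (println "write to socket:" resp)
          (recur))))

    (io-loop-step ["GET /a" "GET /b" "GET /c"])
    (.shutdown worker-pool)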