More than one million requests per second in Node.js (github.com/uwebsockets)
255 points by _e3th on Feb 4, 2017 | 86 comments

Happy µWS "customer" here. I've been using the C++ library standalone in production since October (i.e. without Node.js).

Crazy fast, ultra-low memory usage, and was easy to integrate into our codebase. Author is hilarious and deeply cares about performance.

Easily the best C++ WebSocket library. I'm not at all surprised Alex has managed to get some additional performance out of HTTP on Node.js as well.

We are also using the uWS C++ library in production and have been extremely pleased with the performance. Integrating it was trivial and we haven't had any issues.

Alex has always been very responsive and helpful and his focus on performance is always extremely refreshing in the wake of the webdev world's "eh, good enough" mentality.

I'll have to give µWS a try. I've been using websocket++ [0] which I've also found to be excellent and stable, but a bit verbose to use.

[0] https://github.com/zaphoyd/websocketpp

Oftentimes web server benchmarks are misleading because of how the tests were done.

nginx is a fully fledged webserver with logging enabled out of the box, and other bells and whistles. Just having logs enabled, for example, adds significant load on the server because of log formatting, writes to disk, etc.

At the very least include the configs of each server tested.

And details on the wrk (load gen) setup too, please.

The pipelining benchmark is identical to that of Japronto (another, very similar thing posted here on HN a few days ago). Japronto's repo on GitHub holds the wrk pipelining script used.

I haven't had the time to add configurations for every server tested (esp. Apache & NGINX) but the main point here is to showcase the Node.js vs. Node.js with µWS perf. difference.

How did you not have the time? Apologies, I might be missing something, but was this an emergency work assignment?

If not, then you should have taken the time to provide the information for a fair comparison with the other stacks.

As it is, you're just asking the community to take your word for it.

We don't need to take his word for it. It's open source, so we can run the tests ourselves.

I think it's completely understandable that he threw in the others, probably default config, without caring much about it since they weren't the point of the writeup.

Does this pass all the HTTP tests in Node.js repo? If not the perf diff is irrelevant.

It has a mostly-compatible API but strict conformance doesn't seem to be the goal here. If your application does not make use of obscure features provided by core http (it could probably be refactored to do without anyway), then it's a free boost in performance.

Is the req object a readable stream? Is the res object a writable stream? How do you handle backpressure with this mostly-compatible API?

Although this mirrors the point the other parent comments are making, I wish there were more information readily available (or maybe there is, and I'm just not aware of where to look for it?) about what real-world performance is like in different cases.

For example, in my job, since none of the frontend APIs need to handle that many requests at once, we're considering setting up a few Node "frontend APIs" to lift application complexities from our JS single-page app up one level. Stuff like having to hit multiple inconsistent APIs, dealing with formatting issues, etc. If you have a single API it seems much easier to deal with that, as well as to expand it as time goes on. But due to lack of knowledge and experience, I don't have as much confidence in pushing this decision as I'd like. We'll obviously end up investing time and effort in benchmarks to make sure it meets our requirements first, but since we're a startup that's not so large, we can't realistically afford to dump THAT much time into something that doesn't end up getting us some clear benefits.
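A minimal sketch of that "frontend API" idea. All names and backend response shapes here are hypothetical; the point is just that the fan-out and format normalization live in one place instead of in the SPA:

```javascript
// Backend A (hypothetical) returns users as { user_id, user_name }.
function normalizeUsers(raw) {
  return raw.map((u) => ({ id: u.user_id, name: u.user_name }));
}

// Backend B (hypothetical) returns orders as { OrderID, Items }.
function normalizeOrders(raw) {
  return raw.map((o) => ({ id: o.OrderID, items: o.Items }));
}

// One endpoint for the SPA: hit both inconsistent backends in
// parallel and return a single, consistent shape.
async function handleDashboard(fetchUsers, fetchOrders) {
  const [users, orders] = await Promise.all([fetchUsers(), fetchOrders()]);
  return { users: normalizeUsers(users), orders: normalizeOrders(orders) };
}
```

The fetchers are injected so the handler stays testable without the real backends.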

A bit related to the topic... I know it's not exciting and sexy, but I wish more people wrote about larger non-trivial applications and how they end up tackling the challenges they encountered and details of the kinds of scales they handled. Both with respect to architecture and scaling. Maybe it's my lack of experience, but I find it really difficult to guess at how much money certain things will end up costing before doing a "close-to-real-world implementation".

If you need a consistent API (based on existing APIs/endpoints) with formatting options, you should consider GraphQL. It's made for exactly that purpose.

Look into PostgREST instead of GraphQL as well, and use the database as a flexible middle layer to construct your API on demand.


We did that, due to a lack of experience with GraphQL. We use Postgres as a transactional key-value store (with a proper schema though). We implemented the filtering as simple params to the API; not as flexible as GraphQL, but it is straightforward to implement on the backend side. I am not sure what "inconsistent API" means here, though.

Sounds like GraphQL

Or a proxy such as AWS API Gateway or Apigee

you figure out what to optimize or scale out by measuring and identifying bottlenecks.

This looks interesting. I'm surprised there aren't many existing native HTTP modules for Node.js. Found websockets/ws as an alternative: https://github.com/websockets/ws

It would be a fun experiment to implement a native HTTP module in Rust using Neon. https://github.com/neon-bindings/neon

>HTTP pipelining (made famous by Japronto)

>Japronto's own (ridiculous) pipeline script

are you trolling? :)

He is (a little bit).

Really? 5x faster than plain nginx? That's .. remarkable, if true. I can't seem to find the sources for that benchmark however.

Yeah the thing is these benchmarks are measuring the bit that isn't slow. Think about it - if this were a significant benchmark and they could do a million requests a second then they could literally run 20 Googles on one machine.

Obviously they can't, and the reason is that it isn't a significant benchmark. It doesn't actually do anything.

Wholly irrelevant, just like the Python "benchmark" that was on the front page yesterday.

Irrelevant for what? It's opening and closing an unfathomable number of sockets in a short time span, so it would be limited to how many the kernel can handle. Maybe subject to limitations in the glibc epoll() wrapper as well. So it's not irrelevant if you want to benchmark some change in the kernel for example. (There is a http parser in there as well, but I don't think even replacing it with a dummy one would quadruple total throughput even. Which is why I'm skeptical. The Python thing didn't claim performance above nginx.)

This kind of benchmark is completely useless.

You tune the heck out of Node.js and then take another tool without tuning it (JVM, Apache, nginx, etc.), give it a ridiculous task that you'll never find in the real world, and present your results as if they are meaningful.

Why do people still waste time doing it?

These not-real-world microbenchmarks are definitely useless from an engineering perspective. However, they aren't completely useless. Their use is marketing. It gets the word out to developers that this product X is really good! So what if it doesn't translate to real-world scenarios, or even if the numbers are completely fabricated and you can't reproduce them under lab conditions. [1] Very few people care enough to look at things that closely. Just seeing a bunch of posts claiming product X is really good is enough to leave a strong impression that product X really is that great. Perception is reality, and perception is usually better influenced by massive claims (even if untrue) rather than realistic iterative progress. "Our product is 5% faster than state of the art!" just doesn't have that viral headline nature that you need to win over the hearts of the masses.

As for why someone would do this: maybe they don't know better, or maybe they are doing it because they have decided to invest in some technology tribe and thus profit from that tribe surviving, and even more from it growing. This is a pretty automatic behavior for humans. Take any tribal war, e.g. Xbox One vs. PS4. Those who happen to own an Xbox One (perhaps as a gift) can be seen at various places passionately arguing that the Xbox One is better than the PS4, even if objectively it has worse hardware and fewer highly acclaimed exclusive games. The person is in the Xbox One tribe, and working towards getting more users to own an Xbox One will mean that more developer investments are also made towards the Xbox One thanks to the bigger userbase. Thus even if the original claims to get users into the tribe were false, if growth is big enough it may work out well enough in the end.


[1] The RethinkDB postmortem [2] had a great paragraph about these microbenchmarks: "People wanted RethinkDB to be fast on workloads they actually tried, rather than 'real world' workloads we suggested. For example, they'd write quick scripts to measure how long it takes to insert ten thousand documents without ever reading them back. MongoDB mastered these workloads brilliantly, while we fought the losing battle of educating the market."

[2] http://www.defstartup.org/2017/01/18/why-rethinkdb-failed.ht...

Your example from RethinkDB really struck home to me. The idea that superior technology might lose out due to poor marketing or (in this case) a system that is optimized for the real world rather than being optimized for benchmarks really disturbs me.

And (this is just my personality) I don't like being disturbed about something without trying to "solve" it. So here's my best thought on how to handle the situation where a team feels that they have a superior product which is losing out to another product that is optimized for benchmarks:

> Provide a setting called something like "speed mode". In this mode it is completely optimized for the benchmarks, at the cost of everything else. Default to running without "speed mode", but for anyone who is running benchmarks ask them if they've tried it in "speed mode". A truly competent evaluator will insist on trying the system with the options that are really used in the real world, but then the competent evaluator won't be using an unreliable benchmark anyway. Anyone running the benchmarks just to see how well it works will be likely to turn on something named "speed mode", or at least to do so if asked to. Forums will eventually fill up with people recommending "for real-world loads, you should disable 'speed mode' as it doesn't actually speed them up".

Hmm... sounds cool, but I'm not so sure it would actually work. The danger is that you would instead develop a reputation for "cheating" on benchmarks. This is why I'm not very good at marketing.

Not totally useless. This shows the performance and overhead of the library/framework not the task.

Many readers should have a feel for their own use cases and be able to relate them to "hello world" benchmark responses. For example, I immediately divide the stated performance by 10 if a simple DB query is involved, etc.

Also, if today you are using one of the compared setups, you should know what performance you currently have and what tuning went into it, which gives you a point of reference.

Microbenchmark results don't linearly scale to everything else. Just because language X can print "Hello, World!" 2x faster than language Y doesn't mean that every other operation is also 2x faster. For example, a huge factor is algorithm quality. Language X may be fast at "Hello, World!", but then have QuickSort as its standard sorting function, while language Y has Timsort. [1] Language X may have a nicely optimized C library for hashing, while language Y has AVX2-optimized ASM. Some languages don't even have a wide, well-optimized standard library. Thus you can only really tell how good a language/library is for your use case if you test with an actual real-world scenario.

Additionally, the ultimate microbenchmark-winning code is code that does every trick in the book while not caring about anything else. This means hooking the kernel, unloading every kernel module / driver that isn't necessary for the microbenchmark, and doing the microbenchmark work at ring0 with absolute minimum overhead. Written in ASM, embedded in C code, launched by Node.js. Then, if there's any data-dependent processing in the microbenchmark, the winning code will precompute everything and load the full 2 TB of precomputed data into RAM. The playing field is even; JVM & Apache, or whatever else is the competition, will also be run on this 2 TB RAM machine of course. They just won't use it, because they aren't designed to deliver the best results in this single microbenchmark. The point is that not only do microbenchmark results not mean linear scaling for other work, but the techniques used to achieve the microbenchmark results may even be detrimental to everything else!


[1] For some data sets QuickSort is actually faster. Goes to show you that the best choice is highly dependent on actual use.

> Microbenchmark results don't linearly scale to everything else.

Certainly they don't. But when evaluating something like this, it is up to the reader to apply critical thinking skills and realistic expectations about the level of experimental design in an admittedly alpha implementation published on a GitHub wiki, vs. something published in a peer-reviewed journal.

Well, there is another aspect to it. Having 2,000,000 clients connected at the same time is a much more important scaling factor than needing 1M req/s. Usually people scale services along multiple dimensions (including financial aspects). On the point that we are just benchmarking the framework: sure, but I would also like to see tests against the other requirements. Visualising the results with p50..p99.99 latency would also be meaningful.

This is why they should benchmark against a raw C implementation with the same feature set. People seem to try to raise the bar while simply ignoring what is ahead of them.

If you hit the README for the project you'll see they've done comparisons against a number of their peers.

I don't understand why you would say this benchmark is useless. As we can see from the wiki, the core http module can handle just 65k requests/second. But the new websockets approach can handle a million requests/second. I think this is astonishing.

First off, it uses HTTP/1.1 pipelining. There's not a single actively developed browser in existence that supports HTTP/1.1 pipelining out of the box. [1] Thus you certainly can't use this for web development.

Secondly, the post doesn't seem to mention this, but I'm willing to bet that this microbenchmark, like all others like it, is doing all these million requests from a single client located on the same machine.

How many real world use cases are there where a single localhost client will do a million requests per second and also supports HTTP/1.1 pipelining?


[1] https://en.wikipedia.org/wiki/HTTP_pipelining#Implementation...

It's useless for other reasons. Pipelining and non-idempotent methods (e.g. POST) don't go together.

Basically all apps nowadays communicate via HTTP to avoid firewall issues.

Well, exactly that. They can open a million sockets a second. Handling that many requests is an entirely different ballpark. You have to account for a lot of other factors: transaction types, server loads, what kind of ramp-up the load had, etc.

Just because the word "million" seems impressive, it doesn't mean much. There is a difference between a million photons hitting a tree and a million meteors hitting a tree. The rest of the context is important.

>Well, exactly that. They can open a million sockets a second. Handling that many requests is an entirely different ballpark.

That's both trivial AND useless information. The request handler could do an expensive 2-hour operation that uses 100% of a core for all we know. That's up to the web programmer to optimize.

The http-lib programmer, on the other hand, should optimize, and give data, for exactly what it does, nothing more, and nothing less.

People seem to conflate those responsibilities all the time when they see a benchmark. An HTTP-parser benchmark's role is not to tell you how fast your app will serve requests.

But Node.js networking, last time I checked, runs on one thread. Opening sockets doesn't happen in a vacuum.

What you said may be true for multithreaded apps, but resources are shared in Node.js.

>But Node.js networking, last time I checked, runs on one thread.

It can multiplex operations at the event level however, and all its common libs follow that model. So while it might run "on one thread" it can leverage the CPU quite efficiently. And you can always run multiple processes.

With that logic you can't benchmark anything.

It's useless because a real world application will take some non trivial amount of system resources to actually serve the request and that is going to be the actual bottleneck, not the HTTP module.

So, for example, let's say your application has to parse a JSON POST body, talk to a database, and then serialize a JSON response. You'll be lucky to get 1k reqs/sec throughput. At that point it actually doesn't matter whether your http module can handle 65k reqs/sec or 1 million reqs/sec, because you will never be able to serve that many anyway. If your http module did manage to pick up 65k reqs/sec from clients, they would all just time out.

These benchmarks reach those numbers by doing nothing but serving a tiny static string, but that's not what happens in real life. In summary, these benchmarks are interesting, but it's optimization in an area which isn't actually the thing holding back most backend servers from serving more requests per second.

Having lower latency and bigger throughput is always good. But will it have any impact in real apps? Memory management and IO aren't gone just because your http stack is fast. The average node app will probably fall way behind just because of GC.

>Having lower latency and bigger throughput is always good. But will it have any impact in real apps?

Obviously yes.

>Memory management and IO aren't gone just because your http stack is fast.

Obviously yes again. But they are helped by it.

>This kind of benchmark is completely useless.

On the contrary, that's the only useful type of benchmark.

I don't care for "load simulation" full benchmarks, with loads and usage patterns that will be invariably different than mine, and which tell me nothing much.

Microbenchmarks on the other hand, are constrained to very specific situations (like the above query), and as such can be very precise in the numbers they give.

I know that if I use a similar machine and Node version, and have the same query, I will get the same performance.

And that's exactly what programmers use to identify pain points ("hmm, this kind of response handling is slow") and fix it. Isolated and targeted microbenchmarks.

I don't care for a "full" benchmark to tell me that "things will slow down with business logic and DB queries". Well, DUH!

> I don't care for a "full" benchmark to tell me that "things will slow down with business logic and DB queries". Well, DUH!

Yes but you don't know exactly how much those things will differ between languages, which is important. For instance, you may say "wow, Node smokes Java in this echo server benchmark, I'm sold!", only to later find out that e.g., DB queries run 3x slower in Node than Java. Suddenly a more real-world benchmark makes sense...

>only to later find out that e.g., DB queries run 3x slower in Node than Java. Suddenly a more real-world benchmark makes sense

No, then I just need an additional DB query microbenchmark.

I think it shows that the author(s) behind the library is/are committed to improving performance.

Sure, a lot of companies like to publish benchmarks like this to make their product look better than it is (E.g. only showing the good parts), but in this case I know the author and I can vouch that he is independent and uncompromising.

You could argue that the baseline performance of a library doesn't matter as much once you start adding lots of custom logic on top, but it's still highly relevant for lightweight workloads (which are actually quite common E.g. Basic chat systems).

>>I think it shows that the author(s) behind the library is/are committed to improving performance.

No, it shows that they care about benchmarks, which are rarely if ever representative of real-life scenarios.

Actually, it's kind of funny how the Node community is obsessed with benchmarks and speed, because Node is very slow on CPU and cannot even do parallelism properly.

That's completely incorrect. The well-known cluster module allows very easy parallelism at any level in your application. Plus, Node is among the fastest interpreted languages around today, coming close to the JVM in performance.

I am awfully wary of these statements which paint languages like Python (via PyPy), Javascript (via Node) as very close competitors of the JVM. Once the JIT engine kicks in, on "real" workloads, JVM beats the lights out of these carefully tuned interpreted languages on a CPU intensive workload.

>Once the JIT engine kicks in, on "real" workloads, JVM beats the lights out of these carefully tuned interpreted languages on a CPU intensive workload.

For one, JS is also JITed, and we have video players and other such tasks done in native JS, which would be impossibly slow in, say, Python.

Second, JS can also be compiled -- there's asm.js and WebAssembly coming down the road.

So, yes, it might be slower than the JVM, but not that much slower for most practical purposes.

> For one, JS is also JITed

But not all JITs are equal; that's like putting Brainfuck in the mix because it has a JIT. It is worth noting that the JVM JIT has years of research behind it, and being statically typed only adds to the benefits.

> So, yes, it might be slower than the JVM, but not that much slower for most practical purposes.

Sure, my point is that the "not that much slower" varies a lot depending on the kind of computation you run, and the notion that these dynamic languages are fast enough just perpetuates the misunderstanding that there exists a free lunch...

>It is worth noting that JVM JIT has years of research

I hear that a lot and it's a moot point. It's not like the same research is not available to those doing the JS JITs. Unless we're talking about patents, techniques for faster JITing are widely known, and get propagated to newer languages and runtimes all the time.

And in fact, even the people are usually the same (e.g. people that started the initial fast JITs in the days of Smalltalk, then went to JVM, and now work on V8).

>Sure, my point is that the "not that much slower" varies a lot depending on the kind of computation you run, and the notion that these dynamic languages are fast enough just perpetuates the misunderstanding that there exists a free lunch...

Well, certainly fast enough for web apps, where we have been using 10x slower languages with no JITs and huge overheads.

I remember the dev of uws got some hate last year. People were complaining that uws would only perform well on C++ stacks and be slow on Node. So I guess he simply wanted to prove them wrong.

IO-intensive (what this module would be used for) != CPU-intensive. When considering "pure" code in CPython, YARV, or Perl, JavaScript easily outperforms all of those, when thinking of the dynamic languages the Node community would mostly be comparing against. Obviously compiled and statically typed languages are a different ballpark in terms of both performance and culture.

Well isn't that a reason to compare against real competitors instead of showing off against the ones you can beat?

Yes, definitely; I'm just pointing out that the Node community probably has more in common with those languages' communities and motivations than with the C++, Java, or Haskell communities (just to pick some examples), for many differing reasons.

>because Node is very slow on CPU and cannot even do parallelism properly

Compared to what? Python? Ruby? PHP?

None of which can do "parallelism properly" and all of which are from 2 times to an order of magnitude slower than v8 (Node's engine).

Even Go can't do parallelism properly -- its model is full of manual handling of deadlocks and races. Erlang, yes.

Could you mention what this is based on, please?

> As http sockets in µWS build on the same networking stack as its websockets, you can easily scale to millions of long polling connections, just like you can with the websockets. This is simply not possible with the built-in Node.js http networking stack as its connections are very heavyweight in comparison.

I'm a bit confused by what's going on here. Are you saying the network stack required to do websockets is vastly superior to the network stack of http, and hence using a websockets network stack in http calls can produce superior results? (I didn't know the underlying networking would be different and any clarity would be helpful).

I'm not really understanding the differences but it is definitely interesting nonetheless.

What I mean by this is that any connection (socket) in Node.js builds on net.Socket, which builds on uv_tcp_t, which together require a significant amount of memory (bloat).

A socket in the networking stack of µWS is far more lightweight (which already has been shown when it comes to µWS's websockets). The "HttpSocket" of µWS is about as lightweight in memory usage as "WebSocket", which is far more lightweight than net.Socket in Node.js.

One million WebSockets require about 300 MB of user-space memory in µWS, while this number is somewhere between 8 and 16 GB of user-space memory using the built-in Node.js http server.
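Working those figures back to per-connection numbers:

```javascript
// Per-connection memory, derived from the figures quoted above.
const uwsBytes = (300 * 1024 * 1024) / 1e6; // ~315 bytes per connection
const nodeLow = (8 * 1024 ** 3) / 1e6;      // ~8.6 KB per connection
const nodeHigh = (16 * 1024 ** 3) / 1e6;    // ~17.2 KB per connection

console.log(Math.round(uwsBytes), Math.round(nodeLow), Math.round(nodeHigh));
// i.e. roughly a 27x-55x difference per connection
```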

µWS is a play on its "micro" (small) sockets.

I feel like "bloat" is thrown about so much these days, with little effort to actually define it in a per-situation context. It would be far more credible to me to skip such a handwavey term and instead talk about what the memory differences are, and why one might use much less memory than the other. Oftentimes one person's "bloat" is another's necessary feature for accomplishing their goals.

It's like saying Django has a lot of bloat in comparison to some super basic http lib, except it has all the features I'll need to build a non-trivial app.

Fantastic work, Alex. I sent you an email earlier when I saw this.

It really is stunning, and yes microbenchmarks are very important to me and my product. I personally really do want to know how much every piece costs so I can budget memory cycles and machines. So thanks for providing the data. Even if it is slightly "ballpark".

We use it in our server as well (and have done for ages), and uWS just plain rocks.

Highly recommended

I'd love to see a design document explaining the differences from, say, nginx, that enable this kind of performance results.

Thanks for this amazing work! Can't wait to use it.

As a community we have to work on addons and make node the true versatile and performant language it should be :)


I definitely agree the Node.js universe needs to take a better look at using addons. My opinion is that one should only use JS for the application logic, which requires high productivity, and only (or mostly) implement core modules as addons. It makes sense to use JS where productivity matters and to skip it where performance matters.

This is pretty great stuff. Please keep it up and don't pay attention to the naysayers. This type of optimization is great and will pay dividends down the road for a lot of projects if this can take off.

What if the standard http module were replaced with this in Express? Would it work?

uWS certainly reduces the overhead to a minimum, saving lots of memory that can be used to scale up and leaving more CPU for your app's code. I wrote this article a few months ago when I switched to uws in the WebTorrent tracker: https://hackernoon.com/%C2%B5ws-as-your-next-websocket-libra...

Did you ever try to disable permessage-deflate with ws? It will never be as lightweight as uws because ws is built on top of `net.Socket` but I think you hit this ws issue https://github.com/websockets/ws/issues/804 in the WebTorrent tracker.

I think ws will use 3-4 times more memory than uws with permessage-deflate disabled, which is a lot, but far different from the 47 times advertised.

I'm more interested in the TechEmpower-style benchmarks; at least those show some semblance of real-life usage. Do some queries, return encoded JSON, etc.

I think the key benefit is actually significantly reduced memory footprint.

I agree, this is one single factor that is constant in all apps: your long kept sockets will require far less memory which directly impacts the number of long polling clients you can have. Having fast throughput is just a bonus.

It's important to note that with all these "X requests per second" benchmarks, they're almost never testing actual performance, but rather just fewer features. The architecture (event loop, forking, threading, or any combination of those) also matters a lot, but they serve completely different purposes.

For example, they're using Apache as a reference point, but Apache does so much more than their code example. For one thing, you'll want to try disabling .htaccess support and static file serving so Apache doesn't actually hit the disk, like their code example doesn't.

I've found it trivial to make Python perform on the order of dozens of millions of requests per second, and I can keep scaling that basically indefinitely. But all I'm really testing, as is the given code example in the article, is a bit of looping and string manipulation.

> I've found it trivial to make Python perform on the order of dozens of millions of requests per second

Really curious. How did you achieve that? When you say "dozens of millions", it implies a minimum of 24+ million requests per second, which is quite unbelievable.

Through forking + gevent and then sleeping in each request handler. Of course it measures nothing other than a whole bunch of while loops running in one fork per CPU waiting for just about nothing. In other words, I'm benchmarking "how much memory do I have", which is pointless. But it sure does scale!

TCP handshakes->HTTP parsing->Sleep->Response writing. Can the overhead added by these (and more) possibly produce 24+ million requests / sec on a commodity machine?

Could you share some examples or snippets?

The way most Python web apps work is that they don't do the TCP handshake & HTTP parsing themselves; they leave that up to the front-end web server (nginx/Apache/etc.). Python only comes in via a FastCGI or WSGI proxy.

> I've found it trivial to make Python perform on the order of dozens of millions of requests per second [...]

No, no you haven't. This is not a cluster of servers, this is one single thread serving 1 million responses per second. Inside of Node.js.

Yes I have, and I'm not talking about a cluster of servers.

With HTTP pipelining then?

Show it to us
