Rob Pike has spent the last 25 years developing evented languages, and while the Hoare-style CSP approach he's settled on allows for physical concurrency, he doesn't give a shit about bare-metal performance. The fundamental purpose is to be able to write concise programs that directly model the parallelism of the real world, written in an expository manner so as to be more obviously correct.
Pike's point is that you should be getting the best true performance by working in an environment that helps you arrive at the ideal algorithm in its purest form. Making compromises to get more local physical concurrency is a fool's errand, since at scale you're going to far outgrow single machines anyway!
I think Clojure's concurrency model ends up more concisely and correctly expressing the "parallelism of the real world" when you consider the dimension of time. Rich Hickey has done some really important thinking there.
Finally, I think you'll find that anyone with a fixed budget for servers isn't going to think that making the most of "local physical concurrency" on every machine in their cluster is a "fool's errand". Hardware is cheaper than it used to be, but it isn't free, and deploying and maintaining it is costly and time consuming. If you can make the most of your hardware and operations investments with a little more thought-work, why not do so?
I find it funny that you're hung up on local physical concurrency — for me that's the prime signifier of "Scaling in the Small"! If you're going to have to distribute your workload across multiple machines, why not just run multiple copies of your single-cpu process on each machine, and let your network-level work distribution mechanism handle it?
We're not talking about shared memory supercomputing using OpenMP and MPI on special network topologies or NUMA hardware, just commodity machines running HTTP app servers in front of data stores. We aren't curing cancer, and our workload is already embarrassingly parallel (responding to discrete requests).
I've actually disabled some intra-request concurrency (some ImageMagick operations are multithreaded by default) on a system I'm working on now because it makes the workload wildly inconsistent — when independent requests try to take advantage of all the available CPUs, the latencies spike for everybody. It's just more software to have to monitor, and the ideal returns are slim.
Anyway, that's a good reply, and your ImageMagick case study is interesting. It just goes to show how individualized "scaling" really is. Thanks for taking the time.
If it's going to actually scale or have high-availability, a system is already going to have to have a socket pool to distribute across machines. Since that already works to distribute across processes on one machine, why bother adding another layer to pool native threads? It just adds to the complexity budget.
It could easily get you much lower (1/N) latencies on unloaded systems, but in most cases at high volume the gains aren't going to be very big compared to running another N-1 single-threaded processes per box and it would take more concerted effort to keep the request latencies consistent.
It seems obvious that it's worth compromising the model and adding locks or STM to get a N00% gain for a solitary request, but what if that's only a 10% gain when you're pumping out volume? Do stop to consider that not everyone wants their individual app processes to scale across cores — actual simultaneous execution within one address space comes at a cost.
There are probably some cases at scale where the gains from thread-pooling are substantial, but I could see a lot of them being where the work wasn't very event-ish to start with, like big-data batch-flavored stuff where Hadoop would work great (especially from data locality).
And a really icky default at that, for the performance reasons you mention and with the added bonus of a chance of hitting some pointless bug. I've been turning it off ever since I spent a pleasant evening staring at the stacks of wedged apaches that ended with a pile of ImageMagick functions culminating in futex-something-or-other.
Fucking programming like it's 1975 (http://varnish-cache.org/wiki/ArchitectNotes)
I also can't help but gush about varnish a little (a handy thing to put in front of an image processing server, for one thing), which has a number of post-1975 features beyond its memory management, such as an actually useful configuration language, statistics aggregation and reporting, and the inexplicably uncommon ability to manage and modify configuration without a restart.
You nailed it.
On my first try I was mostly underwhelmed with node, precisely because of the callback hell you end up in. I've already had my share of that in Twisted, and despite the arguments various people make for it: thanks, but no thanks.
That's not to bash node as a whole, mind you. I'll most certainly revisit it when and if it grows a stable co-routine wrapper or similar spaghetti prevention facilities.
It may be, it may not be. The idea that it always is, though, that's balderdash.
That's not saying threaded code doesn't have its own set of issues; but having written a tool that saw very high traffic, was written in an event-driven style, and suffered performance and reliability issues, I can say that the event-driven approach is no panacea.
That said, for certain applications, one or the other approach is clearly appropriate. But to make that decision requires an understanding of the strengths and drawbacks of each.
I would argue that your architecture for distributing load across machines is more important than evented vs threaded. If we think that the difference in performance is small (regardless of which is faster), then the question becomes one of which programming style you prefer. Node's implementation of events means you can have mutable state without worrying about locks or synchronization blocks or whatnot.
In other words, if 100 machines are 60 times more powerful than 1 machine, increasing your scaling factor is less important than doubling each machine's throughput (assuming you keep the same scaling factor).
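The arithmetic behind that can be made concrete. A quick sketch with illustrative, hypothetical numbers (nothing here is measured):

```javascript
// Total capacity = machines × per-machine throughput × scaling efficiency.
// 100 machines delivering a 60x aggregate speedup means 60% efficiency.
function capacity(machines, perMachine, efficiencyPct) {
  return (machines * perMachine * efficiencyPct) / 100;
}

const base = capacity(100, 1, 60);           // 60 units of work
const betterScaling = capacity(100, 1, 80);  // heroic effort: 60% -> 80% efficiency
const fasterMachines = capacity(100, 2, 60); // doubled per-machine throughput
// base = 60, betterScaling = 80, fasterMachines = 120
```

Even a sizable improvement in the scaling factor (80 units) loses to simply doubling per-machine throughput (120 units), which is the point above.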
Exactly. As he mentions, engineering any given system presents its own unique set of challenges, and should be addressed as such.
Use the proper tools to solve problems. There are no one-size-fits-all solutions in science. Computer or otherwise.
Node.js clearly wins to the extent that it is code of type 2 replacing code of type 1. But when you compare Node.js against other technologies of type 2, I find it hard to conclude that it really stands out on any front except hype. All the type 2 technologies tend to converge on equal performance on single processors, driven by the performance of the underlying select-like primitive, so what's left is two things: first, comparing the coding experience, where "evented" really loses out to thread-LIKE coding that under the hood is transparently transformed into select-like-primitive-based code but is not actually based on threads; and second, dispatching the tasks to multiple processors, where Node.js doesn't even play.
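That "thread-LIKE coding transparently transformed into select-based code" idea can be sketched with generators. This is a toy, synchronous stand-in, not any particular library's API:

```javascript
// Linear-looking code in a generator, mechanically driven by a dispatcher
// that could just as well be feeding it results from a select()-style
// event loop. Each `yield` is a suspension point.
function run(gen) {
  const it = gen();
  let input;
  for (;;) {
    const { value, done } = it.next(input);
    if (done) return value;
    input = value.perform(); // in a real system: suspend until I/O completes
  }
}

// Fake "I/O operation" that completes immediately, for the sketch.
const read = (s) => ({ perform: () => s.toUpperCase() });

const result = run(function* () {
  const a = yield read("hello"); // reads look blocking...
  const b = yield read("world"); // ...but each one yields to the dispatcher
  return `${a} ${b}`;
});
// result === "HELLO WORLD"
```

The application code reads top-to-bottom like threaded code, while the dispatcher owns the event machinery; this is roughly what coroutine-based frameworks do for you.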
Which leaves what great alternatives? I have to admit, no other popular environments come to my mind. Arc, maybe?
(Though part of the reason something like Twisted is crufty is that the design forces in the domain push you in that direction. Twisted is crufty, yes, but it's crufty for reasons that apply to Node.js and Node.js will naturally either have to gravitate in the same direction or hit the same program-size-scaling-walls that drove Twisted in that direction in the first place.)
None of the environments are "popular" by any reasonable meaning of the term. Don't let the hype fool you: Node.js isn't, either.
(And while the original article mentions it only in passing, that spaghetti code scaling wall for this sort of code is very real. I'd liken it to something O(n^1.3) as the code size grows; sure, it doesn't seem to bite at first and may even seem to win out over O(n log n) options like Erlang at first, but unfortunately you don't hit the crossover point until you're very invested in the eventing system, at which point you're really stuck.)
(Incidentally, why am I down on the whole idea of eventing? Because twice in my experience I have started on one of these bases and managed to blow out their complexity budget as a single programmer working on a project for less than a year. And while you'll just have to take the rest of this sentence on faith, I actually use good programming practices, have above-average refactoring and DRY skills, and use unit testing fairly deeply. And I still ended up with code that was completely unmanageable. Nontrivial problems become twice as nontrivial when you try to solve them with these callback systems; this is not the sign of scalable design practices. On the other hand, I've built Erlang systems and added to Erlang systems on the same time frame and the Erlang systems simply take it in stride, it hardly costs any complexity budget at all. And Erlang's actually sort of lacking on the code abstraction front, if you ask me, it's an inferior language but the foundational primitives hold up to complexity much better.
In fact, I'm taking time out from my task right now of writing a thing in Perl with POE, a Perl event-based framework, in which I have to cut the simple task of opening a socket, sending a plaintext "hello", waiting for a reply, upgrading to SSL, and sending "hello"s within the SSL session to various subcomponents into six or seven-ish sliced-up functions with terrifically distributed error handling, and even this simple bit of code has been giving me hassles. This would be way easier in Erlang. Thank goodness the other end is in fact Erlang, where the server side of that is organized into a series of functions arranged according to my needs, instead of my framework's needs. The Erlang side I damn near wrote correctly the first time, and I've just fiddled with the code organization a bit to make it flow better in the source; this is my third complete rewrite of the Perl side. And I've got at least a 5:1 experience advantage on the Perl side! The Perl side requires a bit more logic, but the difficulty of getting it right has been disproportionate to the difference in difficulty.)
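Translated into JavaScript terms (a hypothetical sketch, not POE's actual API), the shape of the problem looks like this: linear protocol logic chopped into nested continuations, each repeating its own error check.

```javascript
// Hypothetical callback-style handshake: connect, send "hello", wait for a
// reply, upgrade to TLS, send "hello" again. Every step nests another
// callback and duplicates the error handling.
function handshake(transport, done) {
  transport.connect((err) => {
    if (err) return done(err);
    transport.send("hello", (err) => {
      if (err) return done(err);
      transport.recv((err, reply) => {
        if (err) return done(err);
        transport.upgradeTLS((err) => {
          if (err) return done(err);
          transport.send("hello", (err) => {
            if (err) return done(err);
            done(null, reply);
          });
        });
      });
    });
  });
}

// A fake transport whose callbacks fire synchronously, so the sketch runs.
const fake = {
  connect: (cb) => cb(null),
  send: (msg, cb) => cb(null),
  recv: (cb) => cb(null, "ack"),
  upgradeTLS: (cb) => cb(null),
};
let out;
handshake(fake, (err, reply) => { out = reply; });
// out === "ack"
```

Five sequential steps already produce five levels of nesting; real protocols with retries and timeouts fragment much further, which is the complexity-budget complaint above.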
Could you please share a little more about this? I'm interested to know what it is you were building solo and the tools you were using, and what made it so difficult to maintain.
For the record, this is absolutely no problem given the generational GC in the Hotspot JVM; the structural sharing in Clojure's persistent data structures generate a little extra garbage, but it's barely a blip on the radar. Laziness can have more of an impact on memory usage, but it's usually really easy to spot.
I often see "scalable" and "fast" used interchangeably in these discussions. These are different concepts. Is the contention that threading is not as scalable as evented, not as fast as evented, or both?
A different issue was that OSes used to be very bad at scheduling large numbers of threads: if you spawned 10,000 threads, the system scheduler blew up. I believe that's mostly much improved these days.
Most of the threads-don't-scale arguments also assume OS threads; user-level threads are another story.
Threaded, done right (which isn't always easy), can scale on a single machine to use all the cores you can throw at it. This can be fairly optimal on the machine in question, depending on the app.
Traditionally when dealing with threaded programming you have a whole lot of locking and whatnot to keep track of, and things can get ugly fast. It's HARD.
And all that hard work, while it might help you scale up to the biggest multi-processor beast you can find, doesn't help you one bit when it comes to scaling horizontally to multiple nodes... I'd wager, depending on the situation, that it gets right up in your face and in the way. But that's back to comparing apples and oranges.
"scalable" is about overall design - what needs to talk to what, how will it scale, will it be a purely shared-nothing environment like erlang where we're passing messages and can scale out horizontally like mad, but where every function call is effectively a full memory copy, possibly over a network? Or do we nail it down to a precisely crafted C++ threaded app that our expert designers know down to it's finest detail, that's faster than sh*t througha goose, and hey, it uses some evented stuff as well.
It seems the real rise in interest in the event-driven nature of node.js is that it makes it obvious to even beginner programmers where the difference is: you are setting up some actions and letting the OS kernel handle listening for the events for you in the fastest possible way. You could do all that on your own in another language, but here it is all neatly wrapped up, in a language you already know.
The one-thread-per-connection model is conceptually simpler at the cost of a higher resource footprint, the event-driven approach adds complexity but allows you to process more concurrent connections.
Overly broad and simplified but that's about the gist of it. Like everything non-trivial, the real answer is: It depends.
Models like Java Servlet 3.0 allow threaded servers to handle a large number of concurrent connections. Are there other reasons why threaded servlets are slow or not scalable?
Most web applications don't have a large number of concurrent connections because their request handlers do not block for a long time. The resource use for thread stacks is very modest for these applications, even without a suspend & resume feature.
I guess one can claim that any suspend and resume logic in the application is a form of eventing, but it's different from node.js in that it's not pervasive through the application. It's also different from node.js in that the server infrastructure is not involved in the "events".
You cite two issues: slow and not scaling. Foremost, context switching. This is a non-trivial process: on-die caches get flushed, virtual memory has to get re-mapped, registers re-loaded. This is by definition lost time, time the OS spends not doing your workload. If the number of context switches gets too large, a system spends an inordinate amount of time performing context switches, and less time doing "real work" inside threads.
Second, memory usage. Each thread has its own stack space, which consumes a fixed amount of memory.
As to "slow" and "does not scale;" slow is relative; the amount of time spent context switching is very dependent on a variety of workload factors. Threading _can_ be slow. Scaling, on the other hand, is less unequivocal. A quarter million active threads, each doing very little, is not a good thing: too much memory, too many context switches, and too high a latency.
The problem is not threading per se. It's pre-emptive multi-threading, where a program's state is transparently pulled from the CPU by the OS and serialized to someplace else, and another program's state is de-serialized and loaded in transparently; that is, without the loaded or unloaded program knowing. Erlang processes and tasklets are examples of userspace or lightweight threads where pre-emption is not permitted and the context-switching penalty is negligible. Instead of threads being ignorant and preemptively swapped in and out, they are coordinated and actively yield themselves back to the scheduler, thereby reducing the context-switching penalty.
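A toy illustration of that cooperative model: tasks are never preempted, they run until they explicitly yield, and the "scheduler" is just a queue. (This is a sketch of the concept, not how any real runtime's scheduler is implemented.)

```javascript
// Cooperative round-robin scheduler over generator-based tasks. A task runs
// until it yields; yielding puts it back at the end of the run queue.
function schedule(tasks) {
  const trace = [];
  const queue = tasks.map((gen, id) => ({ id, it: gen() }));
  while (queue.length) {
    const t = queue.shift();      // pick the next runnable task
    trace.push(t.id);             // record who got the CPU
    const { done } = t.it.next(); // run until the task yields or finishes
    if (!done) queue.push(t);     // cooperative: the task re-queues itself
  }
  return trace;
}

// Two tasks, each yielding twice: execution interleaves without any
// preemption or saved/restored register state.
const trace = schedule([
  function* () { yield; yield; },
  function* () { yield; yield; },
]);
// trace → [0, 1, 0, 1, 0, 1]
```

The "context switch" here is just a function return and an array push, which is why userspace threads can be so much cheaper than OS-level preemption.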
I can only point to a stupid app of mine, which triggered Twitter searches for a list of search results via Ajax (so one page of search results would trigger, say, 10 search requests to Twitter via Ajax, proxied through the server). Since the app was written for a competition and was non-commercial, I hosted it on the free Heroku plan (it was a Rails app). That is how I found out that that Heroku plan apparently has only one process, and Rails doesn't use threads. So not only were the search requests to Twitter handled sequentially; while they were being proxied, no other visitors would get to use the site (the 10 Ajax requests alone would have tied up to 10 handlers, and I had only one).
Just a stupid example - with event based request handling, it would not have been an issue because the search requests to Twitter would not have blocked everything else.
Aside: haproxy, configured cleverly, can deal with limiting the number of concurrent connections permitted to your app on a URL basis (or just about any other part of the request you can deconstruct) to allow your app to not get hung up on slow queries and keep fast queries going where you want them.
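The sort of haproxy setup being described looks roughly like this (a hedged sketch: backend names, paths, and ports are made up, and directives should be checked against the haproxy manual for your version):

```
frontend www
    bind *:8080
    # route requests by URL: anything under /search is a "slow" query
    acl slow_path path_beg /search
    use_backend slow if slow_path
    default_backend fast

backend slow
    # at most 2 in-flight slow requests per process; the rest queue in haproxy
    server app1 127.0.0.1:3000 maxconn 2

backend fast
    server app1 127.0.0.1:3000 maxconn 50
```

The per-server `maxconn` is what keeps slow queries from monopolizing the app processes while fast ones keep flowing.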
Not sure how haproxy would help?
I haven't done a whole lot of concurrency programming, so before reading this article I didn't realize that the events vs. threads debate is still being had. The example that Ryan Dahl likes to use to illustrate the superiority of the event model whenever he talks about Node is nginx vs. Apache. That example did more in my mind to reinforce the idea of events being superior to threads (in terms of speed and memory consumption) than anything else.
Keep in mind when reading this article that Alex recently left a little company called Twitter. It's safe to say that relatively few companies will ever have to scale the way that Twitter has.
You mean the failwhale way? ;-)
Not meaning to discredit al3x but I really don't consider twitter a success story in terms of scaling. I don't know if it's incompetence or just some truly bad early decisions that they're still suffering from.
But one thing is for sure: other sites of much higher complexity have scaled much more smoothly to similar sizes (Facebook and Flickr, just two off the top of my head).
"Twitter is still fighting an uphill battle to scale in the large, because doing so is about much, much more than which technology you choose."
That's part of my point.
I'm curious because I've worked with various messaging systems myself. And though I've never pushed them to near twitter-scale, the principles of scaling those horizontally seem quite straightforward to me, unless complex routing comes into play. But I don't see complex routing at Twitter.
To be more precise: For all I know twitter could (should?) probably just append those tweets to one file per user and would scale beautifully from there. Auxiliary services like search are a different story, of course, but those don't need to trigger failwhales when they break...
What wall do they keep hitting?
At this point in time, I don't really want to comment any further on what issues they may or may not be having. I haven't worked there in over a couple of months, and I'll bet that big parts of the system have changed in that time. I'm no longer informed about what's under the hood at Twitter, and I also don't want to second-guess my former coworkers, who I'm sure are doing their best. Sorry!
Twitter's original codebase started as a "my first blog" tutorial in Rails — the polar opposite of a messaging platform. That's a hell of a ship to turn around, especially live on the world stage while the userbase and datastore are growing exponentially and you're rapidly approaching the second half of the chessboard.
Facebook both very intentionally limited its growth (new colleges) and avoided its most expensive features (like photos) for as long as possible. Flickr was originally an open-ended MMO/MUD built on user-contributed content, Game NeverEnding, which was far harder to scale than the photo-sharing feature in its chat client that was extracted to become the basis of their final business.
Where on earth did you get that idea? It's not true:
Moreover, the idea had been gestating for years:
Non-blocking I/O is no magic pony, so if you need access to more CPU cores, you start creating worker processes that communicate via unix sockets. I would argue that this is a superior scaling path to threads, because if you can already manage the coordination between shared-nothing processes, moving to multiple machines comes naturally.
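The coordination the parent needs can be tiny. Here's a sketch of round-robin dispatch over shared-nothing workers; in a real Node app the workers would be forked child processes reached over unix domain sockets (e.g. via child_process.fork or the cluster module), but plain objects stand in for them here so the wiring is visible:

```javascript
// The parent owns a list of shared-nothing workers and hands each job to
// the next one. No locks, no shared memory: each worker sees only its own
// jobs, so the same code generalizes to workers on other machines.
function makeDispatcher(workers) {
  let next = 0;
  return (job) => {
    const w = workers[next];
    next = (next + 1) % workers.length; // round-robin
    return w.handle(job);               // in reality: write to the worker's socket
  };
}

// Fake in-process "workers" standing in for forked child processes.
const workers = [0, 1, 2].map((id) => ({ handle: (job) => `${id}:${job}` }));
const dispatch = makeDispatcher(workers);

const routed = ["a", "b", "c", "d"].map(dispatch);
// routed → ["0:a", "1:b", "2:c", "0:d"]
```

Swapping the fake workers for real child processes changes `handle` into a socket write, but not the dispatch logic, which is the "moving to multiple machines comes naturally" argument.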
Otherwise I agree with the post. Nothing will allow you to scale a huge system "easily".
I'd much rather node have a multi-threaded Web Workers implementation that uses immutable shared memory for message passing. Not coincidentally, I'd much rather push the real distributed scaling problems (such as coordination and marshalling/demarshalling) up to a much, much higher level in my application stack. As it stands now, I need multiple marshallers/unmarshallers: one for the IPC layer and one for the app's scaling-out/distributed layer.
Even though I disagree with much of his technical argument, this is an extraordinarily important point that I find myself agreeing with more and more upon rereading. Nothing is scalable out of the box, anything can be fucked up, and there's no silver bullet.
Consider the plethora of means of doing inheritance that exist for JS.
To me as an outsider, it sounds more like Starling (or Kestrel) got its performance upgraded, not its scalability.
May sound pedantic, but it's important to differentiate performance from scalability.
"Thanks-some valid criticisms. The story doesn't stop at the process boundary. Concurrency through Spawning&IPC should be discussed"
"ie, "actors", although not explicitly given that name, are baked in. I just don't feel then need to put them inside the same process."
Asked about the implementation:
"Not really. It's not fancy: unix socks, no framing protocols, no RPC daemons. But using what's there is a valid concurrency solution."
Interested in al3x's thoughts on RoR vs. Python/Django as well.
I think you have his thoughts right there. Ruby and Python don't allow for a variety of concurrency approaches... at least not on the threading side. You have to resort to multi-process or evented schemes.