I like that they posted their unicorn config file too!
Wasn't there a heated debate here just the other day about the prefork model?
Guess it's at least back en vogue @github.
I believe Apache was '95; it came out when I was working at an ISP, and I graduated from high school in '94.
So it appears as if this setup replicates a small portion of what Apache does without all the bells and whistles.
1) "Threads are out -- processes are better than threads."
2) Process-per-connection architectures.
The first is demonstrably false -- for instance, look at Erlang, which maps lightweight erlang processes to operating system threads, providing SMP scalability at a low cost without running into GitHub's issues with mongrel "thread-killing". More broadly used, look at Servlets and the Servlet 3.0 support for async comet-style event-based request handling. Each request is handled on a thread as necessary. Inter-thread communication (where necessary) is cheap, and this scales just fine.
If you use the fork() model and a long-running request blocks the entire process, comet is basically a non-starter. This is why people are interested (and implement) lightweight threads, coroutines, and restartable request implementations.
For a conceptual challenge, consider how you would implement a live web chat system that can scale up to a considerable number of clients, with extremely low resource usage, and instant message distribution (no polling). Locally, we implemented this with async servlet support and M:N scheduled scala actors -- a blocking HTTP request doesn't hold a thread or process hostage in the web server or the application, and we can scale up to enormous number of live clients on one machine.
The second was merely a lack of understanding of the model. In implementations where fork() is used as an alternative to threads and multiple connections are not handled per sub-process, you quickly run into scaling issues with subprocess memory utilization. In cases where subprocesses handle multiple connections via an event mechanism, you're just using fork() instead of threads.
I'm wondering how many servers GitHub is using to keep up with load -- if each 8-core, 16-gigabyte (!!!) server can actually only handle 16 concurrent requests via a pool of 16 workers, that's an incredibly poor (and expensive) scaling model.
Also, our Ruby application code spends very little time blocking on external resources (compared to time spent in Ruby execution), so having async execution on these wouldn't give us that much more efficiency. It's true that running Ruby code is generally more expensive (dollar-wise) per given task than other languages, but we made a conscious decision to make that tradeoff for the productivity gains that Ruby affords us. We use Erlang and EventMachine in other pieces of the architecture where they make sense.
The simple fact is that Unicorn works better for us (and our specific use case) than anything else we've tried, and so we're using it.
At those numbers, it sounds like you'd max out at roughly 32 workers (so 32 concurrent requests), or am I missing something important?
It's true that running Ruby code is generally more expensive (dollar-wise) per given task than other languages, but we made a conscious decision to make that tradeoff for the productivity gains that Ruby affords us.
It's really easy to measure performance, but productivity claims seem difficult to support and highly susceptible to confirmation bias.
I think that Ruby, Scala, and Clojure (just for instance) would stack up against each other very well in terms of productivity, but I wouldn't know how to prove it.
You seem hung up on the number of concurrent requests. An 8-core machine can only do 8 things at once, whether you're using threads or processes.
You can't assume that the threads are 100% CPU bound. If they were 100% CPU bound, you would have a point -- 8 cores, 8 CPU-bound processes, no processing time left over -- that is where event-based and hybrid thread/event-based architectures excel.
In reality, webapp threads will sleep -- waiting on the network, waiting on disk, waiting on the database. When they sleep, the processor has nothing to do. If the OS can schedule another thread while one sleeps, work can move forward.
If you can run 500 threads to completion in 10ms, then you can serve 500 requests within 10ms.
If you can run 16 (or 32) processes to completion in 10ms, then you can only serve 16 (or 32) requests within 10ms.
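To make the sleeping-threads point concrete, here's a toy Ruby sketch (mine, not from anyone in this thread): each simulated request just blocks for 50ms, and the one-thread-per-request version finishes the whole batch in roughly the time of a single request, because the sleeps overlap.

```ruby
require "benchmark"

REQUESTS = 50   # simulated requests
WAIT     = 0.05 # each one "blocks on I/O" (here, sleep) for 50ms

# Serially, the waits add up: ~2.5 seconds for the batch.
serial = Benchmark.realtime { REQUESTS.times { sleep WAIT } }

# With one thread per request, the OS schedules another thread whenever
# one sleeps, so the batch finishes in roughly one request's time.
threaded = Benchmark.realtime do
  Array.new(REQUESTS) { Thread.new { sleep WAIT } }.each(&:join)
end

puts format("serial: %.2fs, threaded: %.2fs", serial, threaded)
```

(MRI's global lock doesn't change this picture -- it's released while a thread sleeps or waits on I/O, which is exactly the scenario being argued about.)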
In Linux 2.6, there's not a lot of difference between switching between processes and threads.
From the github dude: "Our Ruby application code spends very little time blocking on external resources." That is why using threads in this particular case won't help.
You might, but I don't think that's the general case.
From the github dude: "Our Ruby application code spends very little time blocking on external resources." That is why using threads in this particular case won't help.
It hits a database, loads content from memcached, and waits on HTTP requests to complete, does it not?
I'd be surprised if the usage modeled a pure event-driven sendfile() based static-only file server (such as lighttpd), for instance (where threads vs. processes is moot).
Moreover, a fair amount of effort has been expended on a complex architecture to push only a very specific type of request to Unicorn.
I'm also very curious about the 16-workers-per-box model. Frankly, that GitHub would ditch haproxy in favor of a pre-fork Ruby web server is shocking.
I'm just glad we have CherryPy in the Python world.
Actually, antonovka did an excellent job explaining it above. The most important aspect is that threads are "lighter weight" than processes. They use less memory and are quicker to context switch (usually). The result of this lighter weight is that you can spawn more threads than you could processes, on the same hardware. And when you're using one thread/process per connection, that means more concurrent connections on the same hardware. So, if Unicorn used its exact same architecture, but replaced the worker processes with worker threads, you could scale much better.
Secondly, haproxy is written in C, which generally means it's going to perform much better and use a lot less memory than a Ruby webserver. This translates, once again, to more output from the same amount of hardware.
That's why I was pretty surprised to see that GitHub would choose both Ruby and pre-fork over C and threads (or in the case of haproxy, async, which scales even better).
Assuming haproxy can work with unix sockets, GitHub could configure nginx/unicorn to use it, but with Unicorn, the current workload doesn't require load-balancing acrobatics beyond what the unix socket already provides.
Similarly, many rails sites use thin/nginx or mongrel/nginx and do not even have a workload that necessitates haproxy.
Plus, haproxy adds another layer of complexity, configuration, and management, which is nice to avoid if you can.
If the bottleneck of Ruby were fixed, then the next bottleneck would be the database.
Processes and threads have very similar execution models under most Unixes from what I understand. Threads don't use all that much less memory, either, given a copy-on-write friendly environment (e.g., not Ruby MRI). Perhaps you're confusing threads vs. processes under Windows or Java with threads vs. processes under Unix?
> The result of this lighter weight is that you can spawn more threads than you could processes, on the same hardware. And when you're using one thread/process per connection, that means more concurrent connections on the same hardware. So, if Unicorn used its exact same architecture, but replaced the worker processes with worker threads, you could scale much better.
You're crazy. If Unicorn were 1.) somehow able to take advantage of native threads (it's not), and 2.) moved to a thread-per-connection model instead of a process-per-connection model, it would have basically zero practical impact on the efficiency with which it processes requests. It's already sharing a great deal of its base memory footprint thanks to preloading in the master and fork, so processes don't have significant memory overhead. It's using the kernel to balance requests between processes, so it's not like there's a bunch of IPC between master and worker processes that could be removed with threads.
Say you're running 8 process-per-connection backends on eight cores and each process is 100% CPU bound processing requests. You have a load of 8.0, machine is 100% utilized. If you then change this to a single process with 8 native-thread-per-connection workers, absolutely nothing will change. The load will still be 8.0. You can start more threads to do work but it will have the near exact same effect as starting the same number of processes.
Process-per-connection doesn't fall down under these levels of concurrency. It does eventually -- you can't use process-per-connection to solve C10K problems, for instance. But we're talking about Ruby backends, which are always managed as multiple processes due purely to the way Ruby web apps are written (not efficient, not async). Requests execute within a giant request-sized Mutex lock.
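The "kernel balances requests between processes" point is easy to sketch with a minimal prefork toy in Ruby (illustrative only -- this is not Unicorn's actual code): the master binds one listening socket, forks workers, and each worker blocks in accept() on that shared socket, so the kernel hands each incoming connection to exactly one worker with no userland proxying or master/worker IPC.

```ruby
require "socket"

WORKERS = 4
listener = TCPServer.new("127.0.0.1", 0)   # port 0 picks any free port
port = listener.addr[1]

pids = WORKERS.times.map do
  fork do
    loop do
      client = listener.accept              # all workers block here; the kernel
      client.write("pid #{Process.pid}\n")  # wakes exactly one per connection
      client.close
    end
  end
end

# A few client connections, then clean up the workers.
responses = 3.times.map do
  sock = TCPSocket.new("127.0.0.1", port)
  line = sock.gets
  sock.close
  line
end
puts responses

pids.each { |pid| Process.kill(:TERM, pid) }
Process.waitall
```

Note there's no queueing or dispatch logic anywhere in the master -- the shared listener socket is the whole balancing mechanism.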
And even if threads were considerably more efficient than processes, you still don't want to run a lot of them because each consumes network resources, like database/memcached connections. Using native threads would not let Rails apps spawn thousands (or even hundreds) of thread-per-connection workers. A high concurrency threading model works for Apache (and Varnish and Squid) because the work performed in each thread is fairly simple and doesn't require the kind of network interaction and resource use that an app backend does.
Basically, this notion that threads (even native threads) would be a considerable improvement to Unicorn's design is just all wrong.
> Secondly, haproxy is written in C, which generally means it's going to perform much better and use a lot less memory than a Ruby webserver. This translates, once again, to more output from the same amount of hardware.
Your world seems considerably more simplistic than mine. I'm not even sure how haproxy and unicorn can be compared in any useful way. Unicorn is not a proxy. The master process does not do userland TCP balancing or anything like that. Unicorn is designed to run and manage single-threaded Ruby backends efficiently; HAproxy is a goddam high-availability TCP proxy.
> That's why I was pretty surprised to see that github would choose both Ruby and pre-fork over C and threads (or in the case of haproxy, async, which scales even better).
I think you're confusing the types of constraints you have when you're building high concurrency, special purpose web servers and intermediaries (like nginx and haproxy) with the types of constraints you have when you're building a container for executing logic-heavy requests in an interpreted language. Nginx and HAproxy have excellent designs for what they do, and so does Unicorn. They're different because they do different things.
Say that when all the workers are being utilized, the average dynamic request takes 50ms (which is on the high end). That means that each box can handle 320 dynamic requests per second. Which is a decent amount, and I'd be surprised if they see that much traffic.
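Spelling the arithmetic out (same numbers as above, nothing assumed beyond them):

```ruby
# A fixed pool of blocking workers: each worker clears
# (1000 / avg_request_ms) requests per second.
workers     = 16
avg_ms      = 50
per_worker  = 1000 / avg_ms        # 20 requests/sec per worker
per_box_rps = workers * per_worker
puts per_box_rps                   # => 320
```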
If they are using http caching and ESI properly, the average request time would be significantly lower, and the requests per second would probably be > 500/s on that box.
500 req/s is pretty abysmal for an 8-core, 16-gigabyte server. I realize that some of that has to do with Ruby performance, but yikes -- that is frighteningly bad scaling, regardless of the cause.
Github isn't a microbenchmark.
A concrete example:
I can say that one of my Rails sites handles 850 dynamic requests per second running on a single small 1-CPU server. That's because all that particular request does is look up 4k of data from memcached and return it. (i.e. http://www.tanga.com/feeds/current_deal.xml)
However, as a general rule, I know that each small server can handle about 30-50 pages per second, because each page takes a lot of data crunching to generate (and because I haven't bothered to make it as efficient as possible -- it's fast enough as is).
If all I was doing was returning a small bit of text that didn't require many lookups or much calculation, then sure, an 8-core CPU with Rails could probably do 4-5000 requests per second easily.
Erlang can use native threads, but they do not map to its lightweight processes. Erlang's lightweight processes are basically green threads. Native threads are used for SMP scheduling only. So on a four-core system, you might have thousands of Erlang lightweight processes running on just four OS threads in one OS process.
That's not what I said; I said: "[Erlang] maps lightweight erlang processes to operating system threads."
Erlang processes are green threads mapped to OS threads. They're called processes in Erlang, but they're M:N scheduled threads.
So on a four-core system, you might have thousands of Erlang lightweight processes running on just four OS threads in one OS process.
Or you could have thousands of Erlang lightweight processes mapped to 185 OS threads (not that it will, but it could). The fact that they're green threads is irrelevant in terms of scalability -- they're still threads, but even less resource intensive than a direct mapping to OS threads. M:N scheduled green threading is also very hard to do in the general case, which is why you don't see more of it (see the abandoned attempts to implement generic M:N OS:green thread scheduling in FreeBSD, for instance).
You could also be using Scala, where an actor only maps to a thread when it's blocked inside of its react loop, but in that case it will consume a whole thread.
What's your point? :)
The Erlang model is to utilize poll/select/whatever together with a language that contains its own scheduler.
If you execute arbitrary python/ruby/tcl/php code, there's a chance it will block. If you execute arbitrary Erlang code, the chance of that is very low. This means that you can handle all kinds of long running calculations in the same OS process, while you continue to handle incoming requests.
The fact that Erlang farms out a few threads is a later optimization added to the language several years back in order to take advantage of SMP. However, it only creates a number of threads to match the number of processors, IIRC. After that it shouldn't spawn any more, so I don't think you would particularly call it an example of a 'threaded model'. Erlang's big advantage is the internal scheduler, and a systematic approach to writing code that will never block the Erlang OS process.
Solaris 2-8 implemented a general purpose M:N thread scheduler mapping user-space threads to kernel threads. This is the exact same solution Erlang implements for the less general purpose use-case of actor messaging.
Is Solaris' M:N thread implementation somehow not 'threaded'? If it is threaded (and it is), then how is Erlang's implementation not also an argument for threads?
To elucidate rather than rely on Erlang as an example, the argument for threads over processes:
1) Extremely low-cost alternative to IPC. Threads allow (but do not require the use of) shared mutable state. This is a much, much cheaper way to communicate between concurrent entities. You can achieve the same effect with processes and shared memory, but it's significantly more complex to implement and subject to the disadvantages listed below.
2) Extremely low memory footprint. A thread costs a stack plus minor OS bookkeeping. A fork(2) can leverage COW pages, but almost invariably the number of non-shared pages will be significantly higher than with a thread. If an operation blocks a thread, it's cheap to create more. If an operation blocks a process, you'll hit resource constraints far more quickly trying to fork() more.
Of course, leveraging shared memory and other operating system tools, you can turn a fork(2) implementation into a thread alternative -- but then, you'd have ... threads. This is what Linux's clone(2) syscall was intended for -- a thread-implementation friendly fork(2) alternative -- and the pthread library was built on top of it.
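A small Ruby sketch of point 1 (my own toy, assuming MRI on a Unix with fork()): threads update one shared counter directly through a Mutex, while forked processes each get a copy-on-write copy of the counter and can only report results back over an explicit IPC channel such as a pipe.

```ruby
# Threads: shared mutable state, guarded by a Mutex.
counter = 0
lock = Mutex.new
4.times.map { Thread.new { 1_000.times { lock.synchronize { counter += 1 } } } }
       .each(&:join)
puts counter   # => 4000 -- every thread saw the same memory

# Processes: a child's increments are invisible to the parent,
# so results must cross an explicit pipe instead.
reader, writer = IO.pipe
4.times do
  fork do
    reader.close
    writer.puts 1_000    # explicit IPC in place of shared memory
  end
end
writer.close
total = reader.readlines.sum { |line| line.to_i }
Process.waitall
puts total     # => 4000, but only via the pipe
```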
... for a while, you could call setuid() in a Linux "thread" and the new uid would be a thread-local change, because the thread was actually a 'process'.
This is effectively not an argument against fork(2) (although it's an expensive route to the same solution offered by threads), but rather against scaling models that will block an entire OS process for the sake of a single request.
There's an important difference: in the threading model, all memory is shared by default, whereas in the processes + shm model, you need to explicitly allocate and access shared memory regions. Because sharing mutable state is such an important design consideration, being explicit about what is being shared improves reliability by reducing the chance of unintentional sharing. A related benefit is that processes are protected from one another by the virtual memory mechanism: it is much more feasible for a service to recover from a crashed process than to recover from a crashed thread (which typically takes down the entire process).
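The crash-isolation point fits in a few lines (a toy sketch, Unix-only): an unhandled exception in a forked child just becomes a nonzero exit status that the parent observes, whereas the same exception raised on a process's main thread would take the whole process down.

```ruby
# Child crashes; the OS contains the damage and the parent
# merely observes the exit status.
pid = fork do
  $stderr.reopen(File::NULL)   # silence the child's backtrace
  raise "worker blew up"       # unhandled -> child exits nonzero
end
_, status = Process.wait2(pid)
puts "parent survived; child exited with status #{status.exitstatus}"
```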
So, threads are really just an optimized case of launching N Erlang VMs, where N is the number of processors, and splitting the work between them. They were a late addition to the language, too -- this only happened several years ago, if I recall correctly.
In other words, threads are a nice boost to Erlang, but not really the secret to its success.
I think you missed my meaning. On FreeBSD 4.x and Linux 2.0.x, threads were implemented entirely in user-space. They didn't allow for true concurrent execution of multiple threads on multiple CPUs, but they were still very much threads.
Likewise, Erlang's lightweight processes are exactly the same -- threads. The fact that they can be mapped onto arbitrary numbers of operating system threads is irrelevant to the nature of the model -- one in which execution of a thread can proceed independently of others, implemented via context switching of execution state across those threads, able to access shared state.
However, in the fork model as described in the original article, processes are the only form of concurrency. A blocked request will, in turn, block a process, and unlike threads, far fewer processes may be run concurrently due to their significantly increased resource constraints. Furthermore, those processes are much more limited in their ability to implement low-cost inter-thread communication via shared mutable state.
If the processes actually relied on their own internal M:N scheduled green threads, then at least that part of the concurrency problem would be (mostly) solved, and fewer processes would be required. IPC is still an issue, and of course, there's the multi-decade demonstration of the high difficulty in implementing a 'performant' general-purpose M:N scheduled thread system.
> The Erlang virtual machine has what might be called 'green processes' - they are like operating system processes (they do not share state like threads do) but are implemented within the Erlang Run Time System (erts). These are sometimes (erroneously) cited as 'green threads'.
Going back to Ruby, sure, threads might be theoretically better than processes, but in practical terms, other things are the real bottleneck, so it ends up not really mattering that much.
Erlang processes' implementations share internal mutable state, despite the fact that the inter-process messaging API available via the runtime does not expose this. They have to -- internal cooperative context switching can't be implemented without shared mutable state.
I'm not sure I follow. Is this something like "ruby is really, really slow, so no need to worry about concurrency" ?
You're already running a reverse proxy in front of them! There's no reason each Unicorn couldn't be listening on a different port. Does that third layer of local load-balancing between the HTTP proxy and the event-driven app server actually get you anything?
It's a general result in queueing theory that you want one global queue as early as possible in the system. This is because requests don't take exactly the average service time to clear backends: it's a distribution. The more workers that can pull from a single queue, the less the worst-case service times affect the average service time.
One queue per worker with no global queue is the worst configuration and should be avoided if at all possible. Anyone who's run large reverse proxy installs knows this pain well.
The ideal system would be for the balancer machine(s) to hold the requests, and for backends to pull them in a sort of ping/pong fashion. I gather fuzed runs in a pattern like this, though I've not used it.
Telecom folks have analyzed this stuff in detail for the better part of a century. There's a lot of theory out there, and it's surprisingly practical and applicable to real-world web applications.
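That result shows up even in a toy deterministic simulation (my own sketch, not from the thread): with one pathologically slow request in the mix, a single shared queue lets the fast requests route around the stuck worker, while round-robin assignment to per-worker queues strands half of them behind it.

```ruby
jobs = [10, 1, 1, 1, 1, 1]   # service times; one slow outlier

# Round-robin to per-worker queues: job i goes to worker i % n,
# and waits behind everything already assigned to that worker.
def per_worker_finish(jobs, n)
  clocks = Array.new(n, 0)
  jobs.each_with_index.map { |t, i| clocks[i % n] += t }
end

# One global queue: each job goes to whichever worker frees up first.
def global_queue_finish(jobs, n)
  clocks = Array.new(n, 0)
  jobs.map { |t| clocks[clocks.index(clocks.min)] += t }
end

rr = per_worker_finish(jobs, 2)
gq = global_queue_finish(jobs, 2)
puts "per-worker queues, mean finish: #{rr.sum / rr.size.to_f}"  # 6.5
puts "global queue,      mean finish: #{gq.sum / gq.size.to_f}"  # ~4.17
```

Same workers, same total work -- only the queueing discipline differs, and the global queue wins on mean (and worst-case-sensitive) latency.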
We have not had success with them in the past, which is why we employed HAProxy with mongrel.
Aren't you going to have multiple machines each with their own blessing of unicorns? You're still going to have to use some kind of load balancer in front of independent sockets.
I never understood the complex HAProxy in front of Apache in front of Nginx in front of Mongrel type setups that seem to be popular in the Rails world. Why not just use Unicorn? What value is GitHub getting from having Nginx in front?
nginx also has features like ESI, serving from memcached, and rate limiting, which Unicorn does not have (and doesn't need).
Also, nginx is going to be more efficient at serving static files, though most larger apps will have broken such requests out to a separate set of domains, likely serviced by a CDN.
That will restart the Rails processes. No connections are dropped, and I've not seen any downtime.
If you have any questions or are looking for details or something, ask away - just stick a comment on the blog post.