Twitter switches from Mongrel to Unicorn (engineering.twitter.com)
47 points by mqt on March 31, 2010 | hide | past | favorite | 16 comments



This is a good engineering discussion. At the core of it is what we used to call, back in the 70s doing real-time medical work, "multiple server queues" versus "multiple queues, each with one server". The different performance implications of each were pretty well known well before we studied it. The reference book we used, as I recall, was "Real-Time Data Processing" by Stimmler (maybe also Robert Martin of Robert Martin fame).


I thought the coverage was pretty weak. Not because of Unicorn, but because there are many options (including Passenger (aka mod_rails) for Apache) that allow queueing to occur at the proxy instead of at the individual workers.

I wish they had discussed why Unicorn instead of Mongrel, and why solutions like haproxy or passenger or ... didn't work.

This discussion amounted to: our Apache configuration didn't work, so we switched the load-proxy mechanics and our app server at the same time.

I've had great success with Mongrels behind HAProxy (with maxconn=1 so each Mongrel handles only 1 request at a time) for years. I've also had great success with Passenger on Apache.
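For reference, the HAProxy side of that setup is just a per-server connection cap so requests queue at the proxy rather than at the Mongrels. A minimal sketch (names, ports, and server count are made up):

```
listen mongrel_pool
    bind 0.0.0.0:8000
    balance leastconn
    server mongrel1 127.0.0.1:8001 maxconn 1 check
    server mongrel2 127.0.0.1:8002 maxconn 1 check
```

With `maxconn 1`, HAProxy holds excess requests in its own queue and only dispatches to a Mongrel that is currently idle.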

I think it is a great step forward for Twitter's servers, I just wish the article had some meat. Isn't this the Twitter engineering blog, not a blog for a general audience about how Twitter works?


(HAProxy + Mongrel) and (Apache/Nginx + Passenger) work great. Don't get me wrong, I've seen lots of different Rails server architectures and I've spoken at RailsConf on this very topic. I would recommend either setup to a Rails startup in an instant.

But in the edge case of immense load, they simply don't keep up with Unicorn.

The thing that sets Unicorn apart is that it does its load balancing at the kernel level. All Unicorn worker processes listen on the same socket, and the OS takes care of getting each request to a single, available worker. So unlike Mongrel, you don't end up with per-worker queues. Though HAProxy is smart about distributing load well, Unicorn makes it seamless: the workers simply ask for a new request and the kernel gives them one.
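The shared-socket model can be sketched in a few lines of Ruby (this is illustrative only, not Unicorn's code; the names and ports are made up): the master opens one listening socket and forks, every worker inherits that same socket, and the kernel decides which worker's accept() gets each connection.

```ruby
require "socket"

# Master opens one listening socket (port 0 = pick any free port).
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

# Fork two "workers"; each inherits the same listening socket.
pids = 2.times.map do
  fork do
    client = server.accept            # kernel hands this connection to exactly one worker
    client.write("handled by pid #{Process.pid}")
    client.close
    exit!(0)
  end
end

# Two requests; each is served by whichever worker the kernel picked.
responses = 2.times.map do
  sock = TCPSocket.new("127.0.0.1", port)
  reply = sock.read
  sock.close
  reply
end
pids.each { |pid| Process.wait(pid) }
puts responses
```

There is no proxy and no per-worker queue in this picture: a busy worker simply isn't calling accept(), so pending connections sit in the one kernel listen queue until some worker is free.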

There are some other niceties too. Unicorn processes are forked from a master process. If you are using REE, they can keep Rails in shared memory. When we deployed it, we dropped memory usage by 30%.
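The fork-from-master setup is a few lines in the Unicorn config file. A rough sketch, assuming ActiveRecord and with illustrative worker counts (the `preload_app`/`before_fork`/`after_fork` hooks are Unicorn's documented API; everything else here is an assumption):

```ruby
# config/unicorn.rb -- illustrative values
worker_processes 8
preload_app true            # load Rails once in the master; with REE's
                            # copy-on-write-friendly GC, workers share those pages

before_fork do |server, worker|
  # The master's DB connection must not be shared across forks.
  ActiveRecord::Base.connection.disconnect! if defined?(ActiveRecord::Base)
end

after_fork do |server, worker|
  # Each worker opens its own connection after forking.
  ActiveRecord::Base.establish_connection if defined?(ActiveRecord::Base)
end
```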

On top of all that, Unicorn's flawless rolling restarts are a pretty big plus.

In conclusion, if you're in the top 10% of Rails apps by traffic, give Unicorn a look. It is likely that switching over is worth the dev risk and cost. Otherwise, keep it on your radar, but don't consider it a must-have.

[reference]

http://unicorn.bogomips.org/DESIGN.html

http://news.ycombinator.com/item?id=872283


(FYI: I'm a Phusion Passenger developer.) I find it interesting that you think the shared socket performs better than letting a proxy distribute the load. I've also done some tests, and I find that the shared socket harms performance in high-concurrency situations because of the so-called thundering herd problem: all Unicorn workers select() on the socket, but when a client comes in, all workers are woken up, all of them try to accept() the client, only one succeeds, and the rest go back to sleep.

We're working on some pretty heavy performance and scaling optimizations for the upcoming Phusion Passenger 3, and we've found that avoiding the shared socket gives us much better overall performance. It would be nice if the kernel provided an interface for performing select() and accept() in a single atomic operation, but until then I don't think the shared socket is that good.
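The losing side of that accept() race is easy to see in Ruby (a tiny illustration, not Passenger or Unicorn code): a worker that wakes up when no connection is left for it gets EAGAIN, which Ruby surfaces as `IO::WaitReadable`, and has to go back to sleep in select().

```ruby
require "socket"

server = TCPServer.new("127.0.0.1", 0)

# No pending connection: this is what the "losers" of the race see.
begin
  server.accept_nonblock
  outcome = :accepted
rescue IO::WaitReadable
  outcome = :would_block    # EAGAIN -- back to select() and sleep
end

# Once a client actually connects, a single accept succeeds.
client = TCPSocket.new("127.0.0.1", server.addr[1])
sleep 0.1                   # let the connection reach the listen queue
winner = server.accept_nonblock
winner.close
client.close
puts outcome
```

With N workers all woken per connection, N-1 of them take the `IO::WaitReadable` path every time, which is the wasted work the thundering herd refers to.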

From what I've seen so far, people's experiences with both Phusion Passenger and Mongrel/Unicorn can vary drastically. Some people noticed a huge response-time drop and performance increase when they switched from Mongrel/Unicorn to Phusion Passenger; others experienced the opposite. I guess it depends a lot on the server. Phusion Passenger has some pretty heavy users though: the New York Times' Obama real-time election results page was running on Phusion Passenger, and the Dutch national TV broadcasting organization runs all their Rails apps on Phusion Passenger and gets huge spikes of traffic whenever something is mentioned on TV.


Amusing quote taken out of context: "..called Stormcloud, to kill Unicorns when they ran out of control."

Tomorrow's headlines: "Twitter is killing Unicorns"


Haha! I was about to quote the same line, thinking that somewhere there's an organization protecting mythical animals and chimeras.


Seriously? Apache? There's another low-hanging fruit, guys :-). Tremendous speedups can be had by moving from Apache to nginx (something we did at Shopify about 2 years ago).


Define 'tremendous' and present hardware specs and numbers, please. Shopify runs Varnish (we run Varnish on Twitter search): are the speedups coming out of Varnish or nginx for you?

We've done plenty of simulations, load tests, and lots of graphing that show nginx's performance gains are negligible given our hardware configuration. We've also got huge dependencies on mod_rewrite right now and didn't want to convert to nginx for that very reason.

There seems to be this awful myth, completely unsupported by science, that unless you're running Rails with nginx you're doing the wrong thing. The prevalence of nginx in the Rails community is astounding.

It's a good server and it certainly has its place in the world, but it's just not for us and not supported by our benchmarks.


Our web nodes each use 6MB of memory for all the web serving. This frees up gigabytes compared to Apache, which we can use for more processes (web and app nodes are combined in Shopify's case). The speedup lies there. An added benefit is that nginx somehow manages to terminate SSL with much lower CPU load, which makes the web/app configuration very appealing. We experimented with terminating SSL in the Ciscos, but Cisco doesn't seem to ship a firmware that has both SSL termination and weighted load balancing working at the same time.

Tremendous, in this case, means that the extremely low resource usage of nginx allowed us to remove an entire layer from our server-farm flowchart, and now we can use our machines much more efficiently.


I'm not sure why they needed to develop their own process-monitoring script, as bluepill does the job just as well as monit or god, even managing the child processes. With my bluepill script, I have it set up so that if I send restart, it uses hot code reload. My deployments consist of git pull origin master && bluepill restart unicorn!
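For context, a bluepill config for Unicorn looks roughly like this. This is a hedged sketch from memory; all paths, names, and thresholds are made up, and bluepill's exact DSL may differ in detail:

```ruby
# unicorn.pill -- illustrative only
Bluepill.application("myapp") do |app|
  app.process("unicorn") do |process|
    process.pid_file      = "/var/www/myapp/tmp/pids/unicorn.pid"
    process.start_command = "unicorn -c /var/www/myapp/config/unicorn.rb -D"
    process.stop_command  = "kill -QUIT {{PID}}"
    process.restart_command = "kill -USR2 {{PID}}"   # USR2 = Unicorn hot reload

    # Restart a worker that leaks memory instead of letting it melt down.
    process.checks :mem_usage, every: 30.seconds, below: 200.megabytes, times: 3
  end
end
```

The key trick is pointing `restart_command` at SIGUSR2 so that `bluepill restart unicorn` triggers Unicorn's zero-downtime reload rather than a cold stop/start.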


When we deployed we didn't know about bluepill and wrote a quick script. Stormcloud does a fair number of other things such as tracking min/max lifetime of children and it attempts to sort out why the process died in the first place.

Thanks for the bluepill link though, we'll take a look at it and might consider a rewrite of our internal tool at some point.

I'd like to open source Stormcloud when it becomes more accessible. Unfortunately, it's laden with internal dependencies to our monitoring systems, and not publicly consumable at the moment.


Is there some technical reason I'm missing for why they wouldn't go the Passenger route? It seems to be by far the easiest way to deploy a Rack-capable web app, and it (correct me if I'm wrong) allows for "hot deploy": update the code on the server and it'll sit there unused until you touch tmp/restart.txt under your app root.


I had issues with Passenger at an event where a couple thousand people were relying on my app (about a thousand concurrent). I had a single dual-quad machine running a Rails 3-pre app on Ruby 1.9, which was leaking memory every 30-45 minutes and melting down the goods. On site, I switched from nginx + passenger to apache + passenger so I could use a max-requests-per-worker type directive (only available with apache + passenger), which didn't solve my problems. I made one more leap (on the spot) to nginx + unicorn, and problem solved. It also saved me at least 1GB of normal operational RAM.

I still think Passenger is great for low-demand sites. It took a while to figure out everything that had to be done for Unicorn: a proper Unicorn config file, a rake task to start/stop/restart it, and a matching init.d script for my Ubuntu box; I also wrote a rake task to install the init.d script. Someone should post that stuff online.
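In that spirit, here's a rough sketch of the kind of init.d-style wrapper described above. Every path and name is an assumption; the QUIT/USR2 signal behavior is Unicorn's documented handling (USR2 re-execs a new master next to the old one, whose pidfile is renamed to `unicorn.pid.oldbin`):

```sh
#!/bin/sh
# Illustrative init.d wrapper for Unicorn -- adjust paths to your app.
APP_ROOT=/var/www/myapp
PID=$APP_ROOT/tmp/pids/unicorn.pid
CMD="bundle exec unicorn_rails -c $APP_ROOT/config/unicorn.rb -E production -D"

case "$1" in
  start)   cd "$APP_ROOT" && $CMD ;;
  stop)    kill -QUIT "$(cat $PID)" ;;
  restart) kill -USR2 "$(cat $PID)"                 # hot restart: new master boots
           sleep 5 && kill -QUIT "$(cat $PID.oldbin)" ;;  # then retire the old one
  *)       echo "usage: $0 {start|stop|restart}"; exit 1 ;;
esac
```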


Shopping is even more efficient if you distribute all your items across all the cashiers. That way, the time from when you are first in line until you are done is n / num_cashiers instead of n.

Someone, please try this the next time you go shopping, and then blog about what happened.


For a moment the name made me think it was an early April Fools' joke, but it turned out to be very interesting.

These days, however, almost anything is faster and more stable than Apache under load. It's really starting to show its age, unfortunately.

(and darn I wish we had a Fry's here!)


Unicorn surprised us by dropping request latency 30% and significantly lowering CPU usage.

Wow.



