Ask HN: What is your preferred Python stack for high traffic webservices?
233 points by bigethan on Aug 22, 2011 | 67 comments
There's a large project on our roadmap to rework a big ole chunk of legacy code into something that is actually an asset for the company instead of an anchor.

I'm considering giving it a run with a base of gunicorn/gevent/nginx/pyramid. Seems that gunicorn/gevent gives us the ability to use threads where best, but without having to make everything callbacks. And Pyramid gives us a flexible framework to run our web service through (currently the main focus). Kicked around the idea of using M2/0MQ as a way to implement a SOA of sorts, but it feels like a bit too much.

So, if you were starting from scratch and wanted to build a robust high traffic web service (site, app, api, etc), what would you use?

  * haproxy - frontline proxy
  * varnish - app server cache
  * nginx - static files
  * uwsgi - app server
  * flask - primary framework
  * tornado - async/comet connections
  * bulbflow - graph database toolkit
  * rabbitmq - queue/messaging
  * redis - caching & sessions
  * neo4j - high performance graph database
  * hadoop - data processing

That is the hippest stack I've ever seen. :)

Makes me want to try a few of those that are new to me though.

Why Varnish in front of Nginx? AFAIK Nginx can pretty much handle the role of Varnish.

Varnish connects directly to the app servers -- it's not in front of nginx (nginx is to the side serving other content).

Any particular reason you chose Tornado over other options? I know that they each have their strengths, was wondering which strengths you valued.

And on the web frameworks side, have you found Flask to be too light at times?

Tornado is solid and proven, however we will explore gevent on uwsgi more in the future. Using gevent for comet/async would enable us to consolidate the Tornado code into Flask, but we have been focused on other stuff so we'll test gevent when we have more time.

Flask is at about the right level of abstraction for what a Web framework should be these days. In this era of the social graph, it can be more interesting to store your social graph in a graph database and use it as your primary datastore. And if you're not using a relational database as the primary datastore, why would you want a framework that's built around an ORM?

ORM-based frameworks are ok if you stay inside the box, but they can get in the way when you're not using the RDBMS for authentication and authorization. And when you strip out all the stuff that's tied to the ORM and auth, you end up with something that looks a lot like Flask. It's usually cleaner to start with something that was designed from the ground up to be a polyglot framework.

P.S. We chose uwsgi because it's high performance and high quality (Roberto is a really smart guy), and there's a little-known feature in the works -- binary connectors that will let you hook uwsgi directly to varnish and haproxy over a binary protocol and thus eliminate the HTTP overhead.

Varnish has a modern architecture that's superior to most of the other Web caches (https://www.varnish-cache.org/trac/wiki/ArchitectNotes) -- it was designed by Poul-Henning Kamp, the FreeBSD kernel hacker, and the code is so clean it was published as a book (http://phk.freebsd.dk/misc/_book2.pdf)

I don't hear varnish and haproxy being used together often, especially when nginx is also in the mix. Are they really complementary and worth the extra complexity?

At scale, yes. They're built to solve different problems, and each excels at its own. Nginx can proxy, but HAProxy is more flexible. Nginx can cache, but Varnish is much more flexible and efficient. We use all three technologies and route each kind of traffic to the service best suited for it.
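For a concrete picture of that routing, here's a hedged sketch of what the HAProxy side might look like (the backend names, ports, and addresses are made up for illustration):

```
frontend www
    bind :80
    # static assets go straight to nginx; everything else flows
    # through the varnish cache on its way to the app servers
    acl is_static path_beg /static /media
    use_backend nginx_static if is_static
    default_backend varnish_cache

backend nginx_static
    server nginx1 10.0.0.10:8080

backend varnish_cache
    server varnish1 10.0.0.20:6081
```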

I'm a fan of Django, so my ideal stack looks something like this:

  * puppet - managing server packages / infrastructure
  * monit - monitoring server processes / fixing things
  * django - primary web framework and ORM
  * amazon mysql - it's hosted, and works via plug-ins with Django
  * amazon s3 - storing static assets (images, css, javascript, etc.)
  * amazon elastic load balancer - for scaling incoming HTTP requests across multiple web app servers
  * amazon autoscale - for spinning up new web app servers to handle spikes in traffic
  * rabbitmq - message queueing
  * celery - processing async tasks in a robust fashion. must have
  * memcached - no explanation necessary
  * git
  * fabric for deploying software
  * jenkins for testing / building software
  * nginx for buffering elastic load balancing requests to web app servers

how many EC2 servers do you have running for all this?

I am considering this stack off the shelf in my next big project:

- uWSGI - performs better than gunicorn and has support for async apps using gevent

- nginx - front end server

- pyramid - web framework

- mongodb - database

- mongoengine - mongodb and python mapper

- zeromq - messaging and communication

- jinja2 - for template engine

- gevent - for async processing

- gevent-zeromq - to make zeromq non-blocking and gevent compatible

- socket-io - JS lib for realtime communication

I still need to develop robust session management. I considered various options and came to the conclusion that if I want something fast, truly distributed, and without sticky sessions, I should come up with my own session manager daemon hosted on each node. I would use ZeroMQ to communicate with it.

Have you considered using Beaker with one of these backend extensions? https://github.com/didip/beaker_extensions

I personally use Redis, but you could shove your sessions into Mongodb since you're already using it.

Otherwise my preferred stack is very similar to yours, except I use Mako for templates and PostgreSQL/Redis for backend storage.

Thanks. Actually I did consider Beaker, but it didn't fulfill my requirements. I wanted to replicate sessions across nodes actively and asynchronously, and also to persist them to MongoDB asynchronously.

So here is how I was considering to build:

- sessions updated and validated by a daemon process per node.

- each validation and update will be one call via ZeroMQ's Req/Rep pattern. With each call I can validate the session and reset its timestamp.

- after each validation I will asynchronously replicate the session to the other nodes via ZeroMQ's Pub/Sub (I don't care about the extra memory)

- sessions will also persist to MongoDB (async), so sessions survive a node restart.

Btw, I only want to validate/invalidate a session token and keep authorization information. Any other small values I could simply keep in the cookies, encrypted -- e.g. the user ID. Though I could keep session information in the cookie as well, that would let sessions live forever, which is not good if I want to kick a user out.
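As a sketch of that cookie idea: the usual stdlib approach is to HMAC-sign the small value so it can't be tampered with (signing shown here in place of encryption -- the value stays readable but can't be forged; the secret and format are made up):

```python
import base64
import hashlib
import hmac

SECRET = b"change-me"  # hypothetical server-side key; never goes in the cookie

def sign(value):
    """Pack a small value plus an HMAC tag into a cookie-safe string."""
    raw = value.encode()
    tag = hmac.new(SECRET, raw, hashlib.sha256).hexdigest().encode()
    return base64.urlsafe_b64encode(raw + b"." + tag).decode()

def verify(cookie):
    """Return the value if the tag checks out, else None (tampered/garbage)."""
    try:
        raw, _, tag = base64.urlsafe_b64decode(cookie.encode()).rpartition(b".")
        expected = hmac.new(SECRET, raw, hashlib.sha256).hexdigest().encode()
        return raw.decode() if hmac.compare_digest(tag, expected) else None
    except Exception:
        return None
```

Because the tag is computed server-side, a client can read the user ID but can't change it without invalidating the cookie.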

What's the use case that raises these requirements? Or, more politely, what problem are you solving with this setup?

Primarily I wanted it to be really fast, be able to invalidate sessions, and be available on any node via active replication. If I used Beaker, I'd have to use memcached, which is not replicated (in the default setup), and each validation would require two round trips to memcached (validate, update timestamp).

It's easier to write my own with ZeroMQ, and I can build custom logic at the daemon level.
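A minimal sketch of what that daemon's core logic might look like, assuming an in-memory dict per node (the ZeroMQ transport and the Pub/Sub replication are left out); the point is that validate-and-touch is a single call:

```python
import time

class SessionStore:
    """In-memory store a per-node session daemon might keep.

    validate_and_touch() does in one call what would otherwise take two
    memcached round trips: check the session, then reset its timestamp.
    """

    def __init__(self, ttl=1800):
        self.ttl = ttl
        self.sessions = {}  # token -> last-seen timestamp

    def create(self, token):
        self.sessions[token] = time.time()

    def validate_and_touch(self, token):
        last = self.sessions.get(token)
        if last is None or time.time() - last > self.ttl:
            self.sessions.pop(token, None)  # expired or unknown token
            return False
        self.sessions[token] = time.time()  # reset the timestamp
        return True
```

In the real design, each `validate_and_touch` would be the handler for one ZeroMQ Req/Rep message, with successful updates then published to the other nodes.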

You don't have to use memcached, the beaker extensions let you use whichever backend storage mechanism you want. Redis Cluster can be replicated and would work fine for this, for example.

I don't have a good grasp of what you're trying to achieve and why, but my overengineering alarms are going off.

Thanks for the suggestion. I haven't considered Redis Cluster, but I definitely will.

A while back I looked into Mongo for sessions, but it isn't really designed for this. Sessions are temporal, and AFAIK Mongo doesn't have a built-in mechanism for expiring them, plus session data is usually small -- there are other datastores better suited for this type of data.

Primarily sessions will live in RAM. And I was going to use MongoDB to persist sessions just in case the daemon dies or session is not found in RAM as a fallback. The reason for using MongoDB is because it's my primary database.

But your concern about growing table of sessions is valid, which I will handle by periodically archiving the old sessions so that index sizes remain small.

Are you worried about Mongo's reliability issues? I like Mongo, but I don't know that I could accept the risk on a project I really care about.

Initially I was, because I am coming from the Oracle world. But for the performance gain, I can give up ACID compliance, which is solvable with additional code complexity.

For extra reliability in MongoDB, I am thinking of using replica sets (active/passive) and journaling.

Interesting, thanks.

Hmm, would be interesting to benchmark uWSGI against Gunicorn with Meinheld workers.

Is the uWSGI choice purely based on performance? I like gunicorn for its simplicity; uWSGI has tons of options, though I wonder if they make management more difficult.

Yes, it was primarily because of the performance. According to these benchmarks, Gunicorn totally choked: "... At the bottom we have Twisted and Gunicorn ..."


Look at the comments: the Twisted code is wrong. By default the Twisted reactor is multiplatform, but you can improve performance for your specific platform -- it's in the docs. If you run Twisted under Linux (and you should), use the epoll reactor.

I use it for the performance. It takes a little time to learn but it's fairly easy to use.

This is what I am using currently:

  * haproxy - frontline proxy
  * nginx - static files and back proxy
  * supervisord - service uptime
  * gevent/meinheld - wsgi
  * django
  * gevent/eventlet - websockets/comet
  * postgresql - Database obviously
  * memcached - caching for django
  * rabbitmq - message queuing
  * celery - message processing
  * fabric - deploying
  * hudson - building

Just described the stack that I use to a "T".

The only things I would add are "solr" for search, and "redis" for miscellaneous speed improvements, such as statistics tracking and counting.

It's not a high traffic site, but I'm running an app that serves an average of 5 req/s with Mongrel2 + wsgid + MySQL + Django, and that's working pretty well.

Also, the benchmark of Python web servers that gets linked everywhere (http://nichol.as/benchmark-of-python-web-servers) is getting old. I'm planning on doing a new benchmark, probably this coming weekend. As of now, I'm planning to test gunicorn, uWSGI, tornado, bjoern, eventlet, and gevent over HTTP, flup over FCGI, and uWSGI and wsgid over ZeroMQ (behind Mongrel2). Come to think of it, I probably need to put all of the HTTP servers behind nginx for a fairer comparison. Am I forgetting any servers that people would like to see benchmarked?

Could you also try benchmarking uWSGI in async mode? Preferably with gevent?

Looking forward, thanks!

Also check out the gunicorn/meinheld combination.

Surprised at the low number of CherryPy posts in this thread. Not only is it a great framework, it supports Python3 out of the box. My stack:

- ubuntu/debian - apt ftw

- python 3

- haproxy - proxy

- nginx - w/ uwsgi

- cherrypy - framework that supports PY3

- sqlalchemy - orm and sql

- postgres - relational storage

- mongodb - "mandatory" NoSQL

- 0MQ - messaging

You should find Simon Willison's talk about Building Lanyrd very relevant.

Slides and video here: http://lanyrd.com/2011/brightonpy-building-lanyrd/

* nginx * gunicorn * Django * PostgreSQL * memcached * Whatever else I need to implement the logic of the site (redis, celery, etc.)

I am a newbie to using Python for web services. Would Django be better to start with, or should I consider pyramid/flask/uWSGI as suggested here?

Flask is great for beginners because it's well documented and easy to understand. You can become familiar with Flask in a weekend -- start with the Quickstart and then go through the tutorial (http://flask.pocoo.org/docs/).

You won't have to spend much time learning it or fighting with it -- you won't find yourself asking, "Will I be able to do what I want in the framework without hacking it?" Flask lets you program in Python rather than writing to the framework like you typically have to in larger, opinionated frameworks like Django and Rails.

Ironically, this also makes Flask an ideal choice for advanced Python programmers because it gives you flexibility rather than always wondering "will the framework allow me to easily do...?"

BTW, uwsgi is a production app server, not a framework like Django/Pyramid/Flask. For example, nginx has a built-in uwsgi connector, and you can use uwsgi to serve Flask apps (see http://flask.pocoo.org/docs/deploying/).
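To make the distinction concrete, this is essentially all an app server like uwsgi or gunicorn needs from you -- a WSGI callable (a minimal hello-world sketch; the name `app` is just convention):

```python
def app(environ, start_response):
    """A complete WSGI application. The framework (Flask, Django, ...)
    ultimately produces a callable like this; the app server loads and
    serves it."""
    body = b"Hello from WSGI\n"
    start_response("200 OK", [("Content-Type", "text/plain"),
                              ("Content-Length", str(len(body)))])
    return [body]
```

You'd point the server at it with something like `gunicorn myapp:app`, or the uwsgi equivalent.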

Use Flask to learn about web services in general, then move on to Django once you have a decent grasp of things.

Flask is still excellent for small-to-medium, toy-sized projects -- weekend hacks thrown together as a demo, for example. Beyond that, I've found the lack of structure and third-party applications in Flask to be a hindrance for more mature sites. I'd say that for any given feature I need on a site that isn't extremely specific to its purpose, 90% of the time I can find it on django-packages and have it dropped in and integrated within 10-15 minutes.

I'd venture to guess that for every hour I spend getting something to work that doesn't fall nicely into Django's structure, I save a hundred or so from not having to re-write a mature component that the community has solved ten times over.

I use http://cherrypy.org/ (behind an nginx revproxy) to my satisfaction. The data is stored either in http://www.mongodb-is-web-scale.com/ or in MySQL. Disclaimer: these websites are not designed for 50,000 hits per second, but during benchmarking I get consistent times of about 4 msec per page, and I'm confident that nginx can handle many slow clients simultaneously.

We are using Django for web services and it works for us.

  * varnish - frontline server; sends static media to nginx, other requests to the uwsgi cluster
  * nginx - static media serving
  * uwsgi - app servers
  * django - web framework
  * postgresql - relational database
  * redis - NoSQL / cache / sessions
  * rabbitmq - messaging queue

We use Varnish as a frontend server; it handles load balancing between our uWSGI servers, and if the request is for a static file it's sent to our nginx server. We then use Redis to store all of our cache and sessions. We cache everything: every time there's a read from our database via the Django ORM, our API grabs the whole object returned and stores it in Redis, so the next time we need it we just hit Redis.
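That read-through pattern can be sketched in a few lines (a plain dict stands in for the Redis connection, and the function names are made up):

```python
import json

cache = {}  # stand-in for a Redis connection

def get_object(pk, fetch_from_db):
    """Cache-aside read: try the cache first, fall back to the DB,
    then store the whole serialized object for next time."""
    key = "obj:%s" % pk
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    obj = fetch_from_db(pk)  # e.g. a Django ORM .get() in the real stack
    cache[key] = json.dumps(obj)
    return obj
```

With real Redis you'd also set an expiry on the key so stale objects age out.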

uWSGI, nginx, pyramid, sqlalchemy, postgresql, mako, beaker and fabric to deploy

My preferred setup that works for most cases. All reliable and fast.

FYI: Paster contains the CherryPy WSGI server, so development and production server launching can be (practically) equivalent. The CherryPy WSGI server doesn't benchmark too badly either.

That does sound lovely. Have you played with any of the evented/thread options of uWSGI?

Not really. I haven't had a need to yet, but it is on my list to check out.

At my job, we are running tornado w/ gunicorn and membase w/ haproxy to load balance (and not much else) and handling quite a bit of traffic. If I were to write my own from scratch I'd want to learn some erlang first ;)

nginx - frontline proxy, static files

tornado - web

memcache - cache

mysql - database

Does anyone have any opinions on web.py? I played around with this and it seemed pretty easy to use.

web.py development has stagnated lately. It's a great framework that many other Python frameworks used as inspiration, such as the webapp framework on Google App Engine and Tornado (and even Flask). However, Flask is more modern, under more active development, and is extensively documented. It's simple to use like web.py, and I would argue it's one of the cleanest Python Web frameworks out there.

Tornado has a very similar feel to web.py, yet can do much more when you are ready for it. If you like web.py you might as well use tornado.

Tornado is a different animal -- it's asynchronous so when you write code for it you have to program using callbacks rather than in a traditional style.
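To make "callbacks rather than a traditional style" concrete, here is a toy sketch of the same two-step page in both styles (the fetch functions are stand-ins, not real Tornado APIs):

```python
# Traditional (blocking) style -- roughly how Flask code reads:
def get_user_page(fetch_user, fetch_posts, uid):
    user = fetch_user(uid)     # blocks until the user is loaded
    posts = fetch_posts(user)  # then blocks again for the posts
    return user, posts

# Callback style -- roughly how the same flow reads in an async server:
def get_user_page_async(fetch_user, fetch_posts, uid, on_done):
    def on_user(user):
        def on_posts(posts):
            on_done(user, posts)     # finally hand back both results
        fetch_posts(user, on_posts)  # second step runs in a callback
    fetch_user(uid, on_user)         # kick off the first fetch
```

The nesting is what makes async code harder to unit test: assertions end up inside the callbacks rather than after a plain function call.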

My current project requires real-time/always-on connections so I started to develop it using Tornado and then decided to switch to the Quora model -- use a traditional Web framework like Flask for most things, and connect back to Tornado for the real-time stuff.

When I switched to this model, the development process sped up considerably. In addition to just being really-well designed, Flask has an amazing debugger that makes me more productive, and it's easier to write unittests for Flask because you can write them in a traditional way and don't have to contend with Tornado's IOLoop.

For real-time stuff, you could forgo Tornado altogether and instead use gevent to deploy your Flask app (http://flask.pocoo.org/docs/deploying/others/#gevent), like some have done with Django and Pyramid, but I haven't tried this yet.

You don't need to write Tornado code in an async fashion. You can use it like you would Flask, but it's got a feel more similar to web.py than Flask.

nginx + gunicorn + pgbouncer + postgres, S3/CloudFront for -all- media. The gunicorn app server sits behind one of Amazon's Elastic Load Balancers, but could just as easily be behind HAProxy.


Eventlet. Seriously, it's amazing. It's like gevent, but with better documentation.

I was under the impression that gevent was like a newer version of eventlet. Other than the docs (which aren't insignificant), why choose one over the other?

They're pretty similar. The main difference is that Eventlet supports more event loops, has a messier basic architecture, is slightly slower, and lets you defer blocking tasks to threads. Since I'm connecting to MySQL, which happens through blocking C code in libmysql, that's pretty important to me. Eventlet's db_pool module is very nice.
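The thread-deferral idea is easy to sketch with the standard library: hand the blocking call to a worker thread and wait on a future, which is roughly the shape of what eventlet's tpool/db_pool do (the function names here are made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)

def blocking_query(sql):
    """Stands in for a blocking C call, e.g. through libmysql."""
    time.sleep(0.05)
    return "rows for: " + sql

def run_in_thread(fn, *args):
    """Hand the blocking call to a worker thread and wait on the result,
    so an event loop could keep servicing other connections meanwhile."""
    return pool.submit(fn, *args).result()
```

The payoff is that blocking C code, which green threads can't interrupt, no longer stalls every other connection on the loop.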

Personally, I'm happy with eventlet and can't see why I would want to go to gevent. Does anybody using gevent want to chime in on what's better about it?

gevent has utilities to turn blocking Python calls into non-blocking ones.

In your case you could switch to a pure-Python MySQL driver and gevent will turn all your MySQL calls into async ones.

There are actually also gevent-compatible non-blocking MySQL drivers written in C.

Twisted/Twisted/Twisted/Twisted. >:3

More seriously, Twisted/Flask/SQLAlchemy has been the formula for the past two deployments I've done, and I'm happy with it.

Why Twisted over Tornado or other threaded solutions? I'm curious about the specific strengths that you enjoy.

Twisted is a tcp/udp[/ssl] framework, Tornado is an http one.

In other words, with Twisted you can write HTTP, FTP, SMTP, ... and you are free to write your own protocol.

Well, Twisted is an async framework, and that's the important thing. Back in the day I actually wrote a Twisted protocol to send/receive SMS messages from a modem over serial. In total the code was about 100 LOC. That's how flexible and well-organized it is.
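To give a feel for how little code a custom wire protocol takes, here is a toy line-based protocol -- sketched with stdlib socketserver rather than Twisted so it stays self-contained (in Twisted this would be a LineReceiver subclass of similar length):

```python
import socket
import socketserver
import threading

class UpperLineHandler(socketserver.StreamRequestHandler):
    """A toy protocol: every line the client sends is echoed back uppercased."""
    def handle(self):
        for line in self.rfile:
            self.wfile.write(line.upper())

def serve_once(host="127.0.0.1"):
    """Bind to an ephemeral port, serve one connection in a background
    thread, and return the (host, port) to connect to."""
    server = socketserver.TCPServer((host, 0), UpperLineHandler)
    threading.Thread(target=server.handle_request, daemon=True).start()
    return server.server_address
```

The protocol logic is the handler class; everything else is transport plumbing, which is exactly the part a framework like Twisted takes off your hands.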

Compared to Tornado, Twisted is:

~ Mature. Twisted has been around for a lot longer than Tornado and has learned from all of the available history of modern OS networking.

~ Tested. Twisted's been in production for a long, long time, and is covered with thousands of unit tests. It's official policy that code may not enter the Twisted tree without accompanying tests.

~ Flexible. Twisted can be used as a general-purpose networking library, it can integrate with Pygame and Pyglet, GTK+, Wx, Qt, Tk; it doesn't have to be used for servers.

~ Extensible. Twisted's connectors are explicit, and rely on interfaces and adapters rather than inheritance. As an example, Twisted's SSH library lets you separate the SSH server, SSH channel, and SSH shell from each other. Annoying if you want a standard SSH server, but terrific if you're building a custom SSH proxy or tunnel. (I did this a few weeks ago at work. A lot easier with Twisted than with Paramiko!)

I should note that it's not an either-or; there is a branch of Tornado which throws out all of the event loop and uses Twisted's event loop instead.

That branch/port of Tornado is https://github.com/fiorix/cyclone and it's pretty nice. I greatly prefer it to t.web or nevow, which I used to use.

The tornado and cyclone APIs are closer to what I am thinking about as a web developer. Virtually never do I make an object and then think about how I'd like to adapt it to return an HTML view or programmatically add it under a parent object.

PHP 4 running on IIS.

Did you miss Python in the title?
