* Byzantine fault tolerance: How does this system handle a node that fails in a way that prevents it from withdrawing its routes? When a node's haproxy fails, how is BIRD informed? What if the failure is one that the internal fault detectors can't see?
* How is the ECMP hashing problem handled? ECMP hashing on most gear is just a plain hash, which means that when a route is withdrawn, the remaining systems see their traffic rebalance. How does this not result in all existing connections being severed?
This two-stage process also allows for good health checking from the much simpler Linux IPVS L4 load balancers to the more complicated L7 load balancers.
This was described in this Velocity 2013 talk - http://velocityconf.com/velocity2013/public/schedule/detail/...
To your first point: use an external node, outside the data path, as your control plane.
To the second: the simplest approach is to manage layer 3 ECMP on top of mobile layer 2 address assignments. Think BGP to CARP'd next hops. Depending on your router implementation, you also have more choices than the simple 5-tuple for the ECMP hash inputs.
BGP has keepalive heartbeats, usually at 15 to 45 second intervals depending on configuration. If BIRD stops responding, the router will withdraw its routes once the hold timer expires.
BIRD can be controlled via a Unix socket. Usually people build a health-check daemon that queries the local app and communicates its findings to BIRD. Working through all of the failure modes here is tricky, but doable.
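A minimal sketch of such a daemon, assuming a hypothetical local health endpoint and BIRD protocol name, and driving BIRD through the stock birdc client rather than speaking the control-socket protocol directly:

    # Hypothetical health-check daemon: poll the local haproxy and
    # enable/disable a BGP protocol instance in BIRD accordingly.
    import subprocess
    import time
    import urllib.request

    HEALTH_URL = "http://127.0.0.1:8080/health"  # haproxy check endpoint (made up)
    BGP_PROTO = "bgp1"                           # BIRD protocol name (made up)

    def app_healthy():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                return resp.status == 200
        except OSError:
            return False

    def set_announced(up):
        # birdc relays the command to BIRD over its Unix control socket.
        subprocess.run(["birdc", "enable" if up else "disable", BGP_PROTO],
                       check=False)

    announced = True
    while True:
        healthy = app_healthy()
        if healthy != announced:
            set_announced(healthy)  # withdraw on failure, re-announce on recovery
            announced = healthy
        time.sleep(5)

A real daemon would add hysteresis (require N consecutive failures before withdrawing) so a single slow health check doesn't flap routes across the fabric.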
The ECMP hash is often implemented as something like a CRC-16 of (protocol, source IP/port, dest IP/port) modulo the number of next-hops. I suspect the trick to keeping TCP happy is to keep the number of next-hops (shards) constant for each route.
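A toy model of why plain modulo hashing is so disruptive, using the stdlib's CRC-16 as a stand-in for whatever hash the gear actually uses:

    # Hash the 5-tuple, mod the number of next-hops, and count how many
    # flows land somewhere new when one next-hop is withdrawn.
    import binascii
    import random
    import struct

    def next_hop(flow, n_hops):
        # flow = (proto, src_ip, src_port, dst_ip, dst_port)
        key = struct.pack("!BIHIH", *flow)
        return binascii.crc_hqx(key, 0) % n_hops

    random.seed(1)
    flows = [(6, random.getrandbits(32), random.getrandbits(16),
              random.getrandbits(32), random.getrandbits(16))
             for _ in range(10000)]

    before = [next_hop(f, 8) for f in flows]
    after = [next_hop(f, 7) for f in flows]  # one of eight next-hops withdrawn
    moved = sum(b != a for b, a in zip(before, after))
    print("flows remapped: %d / %d" % (moved, len(flows)))  # roughly 7/8

Going from mod 8 to mod 7 remaps about 7/8 of all flows, which is exactly the connection-severing behavior the parent asked about; keeping the shard count constant and repointing only the dead shard's slot at a survivor leaves the other flows untouched.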
The reason this isn't really a problem on the internet is that you're typically not using ECMP, just plain anycast.
The reason for my ambiguity is that our cache-hit ratio is actually a result of our application. This architectural design afforded us the ability to maintain our (very high) cache-hit ratio, even when we outgrew the total slab size of a single varnish node.
That ability to maintain the same cache-hit ratio, the motivation for this effort, is the result of not needing to evict cached content before its TTL expires.
So, if you have a low cache-hit ratio due to eviction (and you don't just have excessive TTLs), then your cache pool is probably too small. If so, then you might want to give this design a shot - it's an analog to using ketama for growing memcache beyond a single node.
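For the unfamiliar, ketama-style consistent hashing places each node at many points on a hash ring, so growing the pool only moves the keys that fall into the new node's slices. A minimal sketch (node names, vnode count, and the md5-based point function here are illustrative):

    import bisect
    import hashlib

    class Ring:
        def __init__(self, nodes, vnodes=100):
            # Each node owns many pseudo-random points on the ring.
            self.points = sorted(
                (self._hash("%s-%d" % (node, i)), node)
                for node in nodes for i in range(vnodes))
            self.keys = [h for h, _ in self.points]

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest()[:8], 16)

        def node_for(self, key):
            # Walk clockwise to the first point at or past the key's hash.
            i = bisect.bisect(self.keys, self._hash(key)) % len(self.points)
            return self.points[i][1]

    ring = Ring(["cache1", "cache2", "cache3"])
    print(ring.node_for("/post/12345"))  # stable while membership is stable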
My current stack: nginx/php-fpm/redis. Nginx and Redis serve me well, but php-fpm makes my website rather slow with high volumes of traffic, so I believe the solution for that would be Varnish(?).
I definitely agree that I wouldn't use Redis (or memcache for that matter) for storing entire pages; it should be used more as an object cache. Even then, we use memcache for "simple" data structures, and when we need more complex data structures, we'll use Redis.
Redis is great if you need some kind of persistence as well (and it's fairly tunable), whereas memcache and varnish are completely in memory (Varnish 4.0, I believe, is introducing persistence). So you kick the process, and that's all she wrote for your cache until it gets warm again (which has its own challenges).
Varnish also gives you a language called VCL to play around with to tune your caching strategies and optimize the way Varnish stores things. It's got an easy API to purge content when you need to, and it should support compression for your pages out of the box without too much tuning.
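For example, a purge is typically just an HTTP request with the PURGE method against the cached URL, assuming your VCL is written to accept PURGE from trusted hosts (the address, port, and path below are made up):

    # Ask Varnish to purge one cached URL. Requires VCL that handles PURGE.
    import http.client

    conn = http.client.HTTPConnection("127.0.0.1", 6081)
    conn.request("PURGE", "/some/cached/page", headers={"Host": "example.com"})
    resp = conn.getresponse()
    print(resp.status, resp.reason)  # 200 if the VCL accepted the purge
    conn.close()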
If you're having issues just speeding up static content, give varnish a whirl. Spend some time with it, and you won't be disappointed.
I believe you can also look into nginx's built-in response caching (proxy_cache) as an alternative, but I don't have too much experience with that. I've heard it used with some success though.
If you're using Redis as your database, I'd suggest not doing that anyway, as you'll start running into problems as your dataset gets bigger than available memory and it has to start swapping. I've found it works much better if you use it like memcache with a richer set of data structures.
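In practice that means giving every key a TTL and capping the server's memory with an eviction policy so it never swaps. A sketch with redis-py (key names and limits are made up; maxmemory is usually set in redis.conf rather than at runtime):

    # Treat Redis like memcache: TTL on every key, LRU eviction at a memory cap.
    import redis

    r = redis.Redis(host="localhost", port=6379)
    r.config_set("maxmemory", "2gb")                 # cap before the box swaps
    r.config_set("maxmemory-policy", "allkeys-lru")  # evict, don't error

    r.setex("user:42:profile", 300, "{...serialized object...}")  # 5 min TTL
    print(r.get("user:42:profile"))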
Ideally you have one caching model to rule them all, unless you're doing a lot of module-specific caching (i.e., this element should be cached longer than that one, etc.)
Or use Varnish, store keys in Redis with empty values and a TTL purely so their expiry generates events, and use keyspace notifications to automatically remove the data related to each expired key from the database and purge it from Varnish.
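A sketch of that pattern, assuming notify-keyspace-events is configured to emit expired-key events, and with made-up key naming and purge wiring:

    # Listen for Redis expiry events; on each one, clean up the database
    # row (hook left as a stub) and PURGE the matching page from Varnish.
    import http.client
    import redis

    r = redis.Redis()
    r.config_set("notify-keyspace-events", "Ex")  # enable expired-key events

    def purge_varnish(path):
        conn = http.client.HTTPConnection("127.0.0.1", 6081)
        conn.request("PURGE", path, headers={"Host": "example.com"})
        conn.getresponse().read()
        conn.close()

    p = r.pubsub()
    p.psubscribe("__keyevent@0__:expired")  # DB 0's expiry channel
    for msg in p.listen():
        if msg["type"] != "pmessage":
            continue
        key = msg["data"].decode()           # e.g. "page:/post/12345"
        # delete_related_rows(key)           # hypothetical DB cleanup hook
        purge_varnish(key.split(":", 1)[1])  # purge the matching URL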
The team responsible for all of Tumblr's perimeter (Perimeter-SRE) is made up of six people (including one manager).
This article describes the architecture of the portion of our perimeter responsible for blog serving, one of our more highly trafficked perimeter endpoints.
If you found this interesting, please check out the jobs page at Tumblr; we are constantly looking for new folks, specifically for positions on the teams that implemented everything described in the article.