> Some API requests need a higher priority than other API requests, and that is not taken care of.
This was one of our main problems with this architecture, especially with a mix of real-time clients and clients that send batch jobs. Haproxy lets you assign different requests to separate pools, each with its own priority and queue. At reddit, we had it broken down roughly into four quartiles based on the 95th-percentile response times of each API call.
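For illustration, a minimal haproxy config along those lines; the paths, pool names, addresses, and limits are all hypothetical (the real split would come from your own percentile data):

```
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend api
    bind *:80
    # Hypothetical split: batch-style endpoints go to their own pool so
    # they can't starve the real-time calls.
    acl is_batch path_beg /batch /reports
    use_backend slow_pool if is_batch
    default_backend fast_pool

backend fast_pool
    timeout queue 2s                        # fail fast for real-time clients
    server app1 10.0.0.1:8080 maxconn 50    # excess requests queue in haproxy

backend slow_pool
    timeout queue 60s                       # batch jobs can afford to wait
    server app1 10.0.0.1:8080 maxconn 10
```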
> This architecture assumes that there are enough Processing Servers to handle all requests (peak throughput). If there aren't, the Processing Server applies back pressure on the API server (error or timeout), which in turn returns an error to the API user, which in turn retries the API request (applying more pressure). To avoid this, the number of running Processing Servers needs to be enough to handle peak traffic.
With haproxy in the middle, that tier queues the requests, so all the back pressure builds up at the second load balancer. Whether that is good or not is questionable, but at least you won't return errors to clients right away. You'll still need a fairly long timeout, depending on how long it takes for new resources to come online, but then you can go back to the first point and set longer or shorter timeouts per API call, as appropriate for your application.
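A sketch of that queueing behavior, with illustrative values:

```
backend processing
    # Requests beyond maxconn wait in haproxy's queue rather than hitting
    # the servers; clients only see an error (503) once this expires.
    timeout queue 120s
    server worker1 10.0.0.2:8080 maxconn 20
```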
> ELB was not designed to handle huge traffic spikes since it takes a few minutes to internally scale. You should contact AWS support to warm your ELB if you have a planned traffic spike. We at Forter are in the eCommerce market where traffic spikes are rare.
This is still true, and there is no easy way around it unless you also make haproxy your front-end load balancer. If you make it the front end as well, you can keep a hot spare on standby and spin new ones up pretty quickly. That being said, I believe the ELB team is making improvements in this area in 2015.
> API Server needs to handle retries. Not provided by the ELB itself.
> Processing Server must respond within the HTTP timeout (configurable between 1 and 3600 seconds). Otherwise the protocol would need two phases, which adds more complexity.
Still true, although the API server will have to retry less often because of the queues in the middle layer. Again, though, if you use haproxy as the front end you can solve this as well by sending failed requests to a "retry pool".
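A rough sketch of how that could look with stock haproxy features (connection retries, redispatch, and a backup server standing in for the "retry pool"; addresses are made up):

```
backend processing
    retries 3             # re-attempt failed *connections* up to 3 times
    option redispatch     # allow the retry to go to a different server
    server worker1 10.0.0.2:8080 check
    server worker2 10.0.0.3:8080 check
    server spare1  10.0.0.9:8080 check backup   # only used when the others are down
```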
FWIW, I've been working on a framework for empirically testing queue performance for scaled-up, distributed deployments (https://github.com/tylertreat/Flotilla). Haven't gotten around to adding support for AWS services yet, but would be interesting to see how they compare.
I'm hoping to do some in-depth analysis of several brokers at some point, but I want to get the benchmark instrumentation right first.
> No message delivery guarantee in face of RabbitMQ server failure.
> The server MUST implement at least 2 priority levels for basic messages, where priorities 0-4 and 5-9 are treated as two distinct levels.
Depending on your client, it may be difficult to prioritize consumption of one queue over another, so message-level priorities like these could be the preferred solution.
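A minimal sketch with the pika Python client, assuming a RabbitMQ version with per-message priority support (the x-max-priority queue argument); the queue name and priority values are illustrative:

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# RabbitMQ needs the queue declared with a maximum priority for
# per-message priorities to take effect.
ch.queue_declare(queue="jobs", arguments={"x-max-priority": 10})

# Real-time request: high priority (the 5-9 band from the spec quote above).
ch.basic_publish(
    exchange="",
    routing_key="jobs",
    body=b"realtime-request",
    properties=pika.BasicProperties(priority=9),
)

# Batch job: low priority (the 0-4 band).
ch.basic_publish(
    exchange="",
    routing_key="jobs",
    body=b"batch-job",
    properties=pika.BasicProperties(priority=1),
)
conn.close()
```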
Really enjoyed reading your post by the way...
When you mention low latency, how low are we talking: milliseconds, seconds, minutes? The reason I ask is that you can use S3 as intermediate storage: ship your compressed logs/events there at a rollover interval and have the processing servers discover them (sketched below). Of course, this only works if latency is not a big deal.
You can also get rid of auto discovery and use config files.
It's not perfect but pretty damn close. And it runs on AWS et al.
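A sketch of the discovery side with boto3; the bucket, prefixes, and processing step are all hypothetical:

```python
import boto3

s3 = boto3.client("s3")

def discover_batches(bucket="events-bucket", prefix="incoming/"):
    """List log batches shipped since the last rollover."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

for key in discover_batches():
    body = s3.get_object(Bucket="events-bucket", Key=key)["Body"].read()
    # ... decompress and process the batch here ...
    # Move the object out of the way so it isn't picked up twice.
    s3.copy_object(Bucket="events-bucket",
                   CopySource={"Bucket": "events-bucket", "Key": key},
                   Key=key.replace("incoming/", "processed/"))
    s3.delete_object(Bucket="events-bucket", Key=key)
```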
SNS guarantees delivery to HTTP/S endpoints, retrying with back-off and handling HTTP status codes properly, asynchronously and with very low latency.
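For reference, a minimal boto3 sketch of that setup; the topic name and endpoint URL are hypothetical:

```python
import boto3

sns = boto3.client("sns")
topic = sns.create_topic(Name="api-events")["TopicArn"]

# SNS will POST each message to this endpoint and retry with back-off on
# non-2xx responses; the endpoint must confirm the subscription before
# delivery begins.
sns.subscribe(TopicArn=topic,
              Protocol="https",
              Endpoint="https://api.example.com/sns-handler")

sns.publish(TopicArn=topic, Message="hello")
```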