We plan to do a blog post about this at some point, but we had the pleasure of seeing exactly how elastic the ELB is when we switched Cronitor from Linode to AWS in February 2015. Requisite backstory: our API traffic comes from jobs, daemons, etc., which tend to create huge hot spots at the top of each minute, quarter hour, hour, and midnight of popular time zone offsets like UTC, US Eastern, etc. There is an emergent behavior to stacking these up, and we hit peak traffic many times our resting baseline. At the time, our median ping traffic was around 8 requests per second, with peaks around 25x that.
What's unfortunate is that in the first day after setting up the ELB we didn't have problems, but soon after we started getting reports of intermittent downtime. On our end, our metrics looked clean, and the ELB queue never backed up seriously according to CloudWatch. But when we started running our own healthchecks against the ELB, we saw what our customers had been reporting: in the crush of traffic at the top of the hour, connections to the ELB were rejected despite the metrics never indicating a problem.
Once we saw the problem ourselves it seemed easy to understand. Amazon provisions that load balancer elastically, and our traffic was more power law than normal distribution. We didn't have high enough baseline traffic to earn enough resources to service peak load. So, cautionary tale: don't just trust the built-in instruments when it comes to cloud IaaS -- you need your own. It's understandable that we ran into a product limitation, but unfortunate that we were not given enough visibility to see the obvious problem without our own testing rig.
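For anyone wanting to run the same kind of independent check, here's a minimal sketch of an external probe in Python. The hostnames and intervals are made up for illustration, not our actual rig; the idea is just to attempt raw TCP connections from outside AWS around a traffic hotspot and count refusals that CloudWatch's ELB metrics won't show you:

```python
import socket
import time

def probe(host, port, timeout=2.0):
    """Attempt a raw TCP connection; True if accepted, False if refused or timed out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def watch(host, port=443, interval=1.0, samples=60):
    """Poll around a traffic hotspot (e.g. the top of the hour), return refusal count."""
    failures = 0
    for _ in range(samples):
        if not probe(host, port):
            failures += 1
        time.sleep(interval)
    return failures
```

Running something like `watch("your-elb.example.com")` in the minute straddling the top of the hour is enough to surface the connection rejections described above.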
I was coming here to ask whether pre-warming is still an issue with the ALB service. Maybe jeffbarr can comment on whether that's changed?
GCE's load balancer does not use independent VM instances for each load balancer, instead balancing at the network level. So you can instantly scale from 0 to 1M req/s with no issues at all.
You can request pre-warming for additional ELB capacity, when you know far enough in advance that you will have a spike. AWS customer service responds by asking 10 clarifying questions via email. The thing is, we can't look under the hood to see currently provisioned and utilized ELB capacity, so we just have to trust that AWS engineers will properly allocate resources according to the answers to those questions. IMO, it's a rather cumbersome process that would be better implemented as a form.
It's weird that, given how many tools and how much self-help AWS gives you, pre-warming involves this manual process of a dozen questions, some of which are unanswerable, and there's no 'pre-warming' form to fill out -- the service rep gives them to you in free text.
Why wouldn't that be useful? The OP said they get hot spots at the top of each minute, quarter hour, hour, and midnight of popular time zone offsets like UTC, US Eastern, etc., so wouldn't he just tell Amazon, "Hey, we get xx requests/second at our peak, so we'd like the ELB scaled ahead of time to handle that load"?
Yes, Google's load balancer is the best one available today. The lack of native websockets support is probably the only disadvantage but made up for with the scaling abilities and cross-region support.
ALB pricing is strange too, classic AWS style complexity.
Yes, it's always possible by dropping down to the lower network level but it would be nice to have it as a natively supported protocol as part of an existing HTTPS LB.
bgentry, what do you mean by not needing VM instances? I believe that regardless of the layer at which you load balance (network or application), you still need compute instances to run the LB logic, host the connection tables, etc.
I think the general difference is: in AWS you provision your own "private" little load balancer instance, and they have logic for how little or big (in terms of compute, bandwidth, etc.) this specific load balancer needs to be, and resize it constantly for you.
Google runs a single gigantic distributed load balancer and simply adds some rules for your specific traffic to it. All of the compute and bandwidth behind this load balancer is able to help serve your traffic spike.
Google's load balancer is an SDN (software-defined networking) setup that basically runs as software on the normal computing clusters that power the rest of their services. They have plenty of capacity already handling all the other traffic, so there's no real difference in handling a few more requests, unlike AWS, which manages custom instances and resizing just for your ELB.
This is actually fairly common. ELB scales up as your traffic scales up, but it cannot handle tsunami levels of traffic, only incremental increases. You have to contact Amazon support to get them to pin your ELB at a larger instance size for the entire life of the instance.
250 rps in absolute terms is not enormous, and peaks of 25x base load are not unheard of.
What I think you are indicating is that you have a very unusual thing that ELB is not set up to handle: you go from base to peak load in seconds flat. Or even less? That's interesting and quite unlike the very common model of human visitors to a website ramping up, that ELB is likely designed around.
My biggest issue with ELB is how long it takes for the initial instances to get added to a new ELB... it takes f-o-r-e-v-e-r. I've seen it take as long as fifteen minutes, even with no load. I'm hoping ALB fixes that.
Re: adding to ELB, I haven't had that experience - for me it's been pretty reliably in line with the healthcheck config (2 healthchecks x 30 sec apart = 60ish seconds to add). Or are you including instance spin-up time in that number?
As someone who is apparently all-in on AWS, can you explain how you justify the cost? I understand the convenience of having all the pre-built services, but that is a limited benefit. The vendor lock-in of the entire infrastructure and deployment process being extremely AWS-specific means it's financially infeasible to migrate elsewhere once you are launched. Tack on the expensive-beyond-belief EC2 server pricing that gets you terrible allocations of shared hardware, the sky-high prices they charge for bandwidth, and the nickel-and-diming of all the other services. I continue to be baffled that any small, medium, or large company believes AWS serves them better than any dedicated or colocation alternative.
The vast, vast, vast majority (seriously, probably 95-98%) of companies do not build out the AWS infrastructure required to remain highly available, with failover and on-demand auto-scaling of all services, that would make AWS the go-to choice. I continue to come across individuals who maintain the fantasy that their business will remain online if a nuclear bomb wipes out their primary data centre, yet they all deploy to a single availability zone, the same way you'd deploy a cluster of servers anywhere else. I never cease to be amazed at businesses that spend $10k+ a month on AWS that would cost them half that with a colocated deployment.
Here are some cases I've handled with AWS that justify the cost:
- About a month ago, our database filled up, both in space and IOPS required. We do sizeable operations every day, and jobs were stacking up. I clicked a couple buttons and upgraded our RDS instance in-place, with no downtime.
- We were going through a security audit. We spun up an identical clone of production and ran the audit against that, so we didn't disrupt normal operations if anything crashy was found.
- Our nightly processing scaled poorly on a single box, and we turned on a bunch of new customers to find that our nightly jobs now took 30 hours. We were in the middle of a feature crunch and had no time to re-write any of the logic. We spun up a bunch of new instances with cron jobs and migrated everything that day.
100% worth it for a small business that's focused on features. Every minute I don't mess with servers is a minute I can talk to customers.
We're paying an agility premium, that's why. My company has both colocated and AWS assets, and while we save a bunch of money with the colocated assets over their AWS equivalents, we would much rather work with the AWS assets.
We don't have to bother ourselves with managing SANs, managing bare metal, managing hardware component failures and FRUs, managing PDUs, managing DHCP and PXE boot, managing load balancers, managing networks and VLANs, and managing hypervisors and VMs. We don't have to set up NFS or object stores.
Being on a mature managed service platform like AWS means that if we want 10 or 100 VMs, I can ask for them and get them in minutes. If I want to switch to beefier hardware, I can do so in minutes. If I want a new subnet in a different region, I can have one in minutes. There's simply no way I can have that kind of agility running my own datacenters.
Nobody disputes that AWS is expensive. But we're not paying for hardware or bandwidth qua hardware or bandwidth - we're paying for value added.
Curious, would you say this is an indication that there are not enough talented/competent sysadmin/infrastructure people to employ to manage those tasks in-house? Or is it your opinion that AWS provides so much value that in-house simply can't compete in terms of the man-hours it would require to manage the equivalent? The whole "spin up in minutes" thing is certainly not unique to AWS; most hosting providers, especially if you are a sizeable business, are going to be at your beck and call whenever you need them.
I still think the benefits of AWS are over-emphasized within most businesses. Of the 4 companies I've worked for that used AWS, 3 of them did absolutely nothing different than you'd do anywhere else. One-time setup of a static number of servers, with none of the scaling/redundancy/failure scenarios accounted for. The 4th company tried to make use of AWS's unique possibilities, but honestly we had more downtime due to poorly arranged "magical automation" than I've ever seen with in-house. I suppose it requires a combination of the AWS stack's offerings and knowledgeable sysadmins who have experience with its unique complexities.
Disclaimer: I'm a developer rather than a sysadmin, not trying to justify my own existence. :p
We have finite time to improve a product. Any minutes spent racking servers (physically or otherwise) are minutes spent not working on something that adds value for our users. Driving the top line is more important than optimizing expenses that are relatively small.
We have a pool of elastic IPs that we rotate with Route53 using latency based routing. The ability to move the IP atomically (by moving the ENI) gives us operational flexibility. We were pretty surprised ourselves that the (huge) hotspots in our traffic distribution alone were enough to "break" the ELB, despite overall traffic being fairly low. We had to see it ourselves to believe it. The current setup has worked out well for us as we've scaled over the last year.
Also I'll add here to another point made below: I don't blame the ELB for not being built to handle our traffic pattern, despite the fact that websites are probably a minority on EC2 vs APIs and other servers. My specific critique is that none of their instrumentation of the performance of your load balancer indicates to you that there is any problem at all. That is... unfortunate.
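For anyone curious what the latency-based Route53 setup described above looks like in practice, here's the shape of a record change for one region's elastic IP, sketched as a boto3 `change_resource_record_sets` payload. The hostname, region, and address are placeholders, not our real values, and the API call itself is shown commented since it requires credentials:

```python
# One latency-based record per region; Route53 answers with the record
# whose region has the lowest latency to the resolver.
change_batch = {
    "Changes": [{
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": "api.example.com.",       # placeholder hostname
            "Type": "A",
            "SetIdentifier": "us-east-1",     # unique ID for this region's record
            "Region": "us-east-1",            # latency-based routing key
            "TTL": 60,
            "ResourceRecords": [{"Value": "203.0.113.10"}],  # a placeholder elastic IP
        },
    }]
}

# With credentials configured:
#   import boto3
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZEXAMPLE", ChangeBatch=change_batch)
```

Rotating an IP out of the pool is then just another `UPSERT` (or `DELETE`) against the same record set.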
Please don't say "huge" when talking about your traffic. That is misleading.
The appropriate word to describe 8 requests/s is "nothing". Health checks and monitoring could do that much by themselves when there are no users.
200 requests/s is a very small site.
To give you some point of comparison: 200 HTTP requests/s could be processed by a software load balancer (usual pick: HAProxy) on a t2.nano and it wouldn't break a sweat, ever.
It might need a micro if it's HTTPS :D (that's likely to be generous).
To be fair, I hardly expect any performance issues from the load balancer before 1000 requests/s. The load is too negligible (unless everyone is streaming HD videos).
All the answers about scaling "ELB" are nonsense. There is no scale in this specific case. The "huge" peak being referred to would hardly consume 5% of a single core to be balanced.
I used to criticize ELB a lot and avoid them at all cost. So do many other people on the internet. But at your scale, all our hatred is irrelevant, you should be way too small to encounter any issues.
N.B. Maybe I'm wrong and ELBs have gotten so buggy and terrible that they are now unable to process even a little traffic without issues... but I don't think that's the case.
In my opinion, it is an HTTP and WS LB rather than an Application LB, as it supports just two protocols. In contrast, if you look at an F5 load balancer, it can look at LDAP packets or Diameter packets and do L7 load balancing. So ALB seems like misleading terminology to me, as HTTP and WS != all L7 applications.
Judging by how Amazon has handled other things, I wouldn't stick too hard to that conclusion. They tend to focus first on something generally useful and easy then come back to fill in more difficult/less popular options over time.
The real question: does this provide a faster elasticity component than ELBs?
At a previous employer, we punted on ever using ELBs at the edge because our traffic was just too unpredictable.
Combining together all of the internet rumors, I've been led to believe that ELBs were/are custom software running on simple EC2 instances in an ASG or something, hence being relatively slow to respond to traffic spikes.
Given that ALBs are metered, this seems to suggest shared infrastructure (bin-packing people's ALBs onto beefy machines), which makes me wonder if that is how it actually works now, because the region/AZ-level elasticity of ALBs could actually help the elasticity of a single ALB.
If you don't have to spin up a brand new machine, but simply configure another to start helping out, or spin up a container on another which launches faster than an EC2 instance... that'd be clutch.
Those underlined headers hurt the eyes. Could you switch to a different heading style, please? Maybe small caps?
EDIT: And while you're listening: AWS documentation is a mess in the sense that it's way too disorganized; things might be documented, but one cannot find them easily.
Google's load balancer does do HTTP/2. It doesn't have native WebSocket protocol support (you have to use the network LB for that traffic), but it does provide cross-region balancing.
Yes, saw that, but it requires setting up a whole new LB just for WebSockets. It would be nice if it were just a natively supported protocol on the existing HTTPS LBs; is there a reason why that can't be done?
We're using CloudFormation and ECS heavily for Convox, and just can't get off a CloudFormation custom handler (Lambda func) for managing ECS task definitions and services for small reasons like this.
Exciting! Disappointing that you can't route based on hostname yet, though. I've got 5 ELBs set up to route to different microservices for one app, and because we couldn't do path-based routing before, that's all segmented by hostname. As soon as ALB supports hostname routing, I can collapse those all into a single LB.
Yeah, but they only allow you to route via path matching right now. The ALB has access to the headers, but configuring it based on arbitrary headers isn't exposed to end users yet.
I excitedly set up an ALB as soon as I read the post, because I've needed it, only to find that support for what I want isn't available to me yet!
The article actually just states that layer 7 load balancers have access to the header information and typically allow you to route based on those headers. ALB, however, doesn't :(
Weird thing to highlight when the product being announced doesn't even have that feature.
This looks pretty sweet. The next big thing for API versioning would be header-based instead of URL-based routing; looking forward to 'give you access to other routing methods'.
It only allows you to route via path matching right now.
"each Application Load Balancer allows you to define up to 10 URL-based rules to route requests to target groups. Over time, we plan to give you access to other routing methods."
The ALB clearly has technical access to the headers, but use of them isn't exposed to users yet.
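For what it's worth, the path rules that are exposed map onto the new `elbv2` API's `create_rule` call. A sketch of defining one such rule with boto3 -- the ARNs and priority below are placeholders, and the actual call is shown commented since it needs real resources and credentials:

```python
# Parameters for a path-based forwarding rule on an ALB listener.
# Both ARNs are placeholders for illustration only.
rule_params = {
    "ListenerArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                   "listener/app/my-alb/0123456789abcdef/0123456789abcdef",
    "Priority": 10,  # lower numbers are evaluated first
    "Conditions": [{"Field": "path-pattern", "Values": ["/api/*"]}],
    "Actions": [{
        "Type": "forward",
        "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:"
                          "targetgroup/api-svc/0123456789abcdef",
    }],
}

# With credentials configured:
#   import boto3
#   boto3.client("elbv2").create_rule(**rule_params)
```

Note that `path-pattern` is the only condition field available today, which is exactly the limitation being discussed: the header data is there, but no rule field exposes it.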
Every backend architecture should account for having multiple machines. What are you going to do once you get enough traffic that one machine can't handle it?
These new features are cool... but they still pale in comparison to something like HAProxy.
I guess the tradeoff is that with ELB/ALB, like most PaaS, you don't have to "manage" your load balancer hosts. And it's probably cheaper than running an HAProxy cluster on EC2.
But for the power you get with HAProxy, is it worth it?
Does anyone have experience running HAProxy on EC2 at large scale?
ELB is HAproxy. :-) Sure, you get a lot of flexibility configuring HAproxy yourself, but you also have to run it yourself. 90% of the time it's easier to just use ELB (plus it has some direct integration with other services, like IAM-stored server certs/keys, ASG, etc).
I have swapped out ELB for HAproxy and/or nginx on a couple of occasions. If you know your load and feature requirements intimately, you might be able to do a better job. But it's work.
I'm curious if this will allow Convox to route to multiple services with just a single ALB instead of the historical default of one ELB per service. It would be a real cost savings for a microservices architecture.
It should allow Convox to do that, yes. Although it doesn't appear (from the screenshots) that you can route by Host header, so you'd have to put all your services on the same hostname with different path prefixes to make it work.
This is very good.
Recently my workflow has been ELB -> NGINX -> Cluster.
NGINX was a cluster of machines that did rule-based routing into the EC2 machines. Now that the ALB has some of those capabilities, it's time to evaluate it.
I love Elastic Beanstalk (minus its mediocre docs). Agreed here; does Elastic Beanstalk support setup of new-style ELBs (without me doing extensive customization)? If not, when will it?
It seems that routing is done in the following way: /api/* goes to the application as :8080/api/ rather than the root. It would be nice to have the option to direct traffic to just :8080.
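Until the ALB can rewrite paths, the usual workaround is to strip the prefix in the application itself. A minimal sketch as WSGI middleware -- the `/api` prefix here is just an example, not anything ALB-specific:

```python
class StripPrefix:
    """WSGI middleware that removes a path prefix before the app sees the request,
    since ALB forwards /api/* to the target with the /api prefix intact."""

    def __init__(self, app, prefix="/api"):
        self.app = app
        self.prefix = prefix

    def __call__(self, environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path.startswith(self.prefix):
            # /api/users -> /users ; bare /api -> /
            environ["PATH_INFO"] = path[len(self.prefix):] or "/"
        return self.app(environ, start_response)
```

Wrapping a Flask or Django WSGI app in `StripPrefix(app)` lets the backend serve from the root while the ALB rule still matches on /api/*.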
No, they replaced bandwidth costs with a new pricing component, $0.008 per LCU-hour [1]. If you have 1M idle websocket connections you will pay 100 times more for ALB vs. ELB (i.e. $2K/mo vs. $18/mo).
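The back-of-the-envelope behind those numbers, assuming the published LCU dimensions (3,000 active connections per minute per LCU, billed on the highest dimension) and classic ELB's $0.025/hour base rate:

```python
LCU_PRICE_PER_HOUR = 0.008     # ALB price per LCU-hour (from the announcement)
CONNS_PER_LCU = 3_000          # active connections per minute covered by one LCU
HOURS_PER_MONTH = 730

idle_websockets = 1_000_000
lcus = idle_websockets / CONNS_PER_LCU                   # ~333 LCUs held open
alb_monthly = lcus * LCU_PRICE_PER_HOUR * HOURS_PER_MONTH  # ~$1,947/mo

ELB_PRICE_PER_HOUR = 0.025     # classic ELB hourly rate (us-east-1)
elb_monthly = ELB_PRICE_PER_HOUR * HOURS_PER_MONTH         # ~$18/mo
```

So with a million idle connections the active-connection dimension alone dominates, and the roughly 100x gap quoted above falls straight out of the arithmetic (data transfer charges apply to the classic ELB on top of this, but idle connections move no data).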
Good thing ELB is still here, so you can choose between them depending on your workload.
I don't believe classic ELB supports websockets[1] making this a tenuous comparison. There might be workarounds that I'm not aware of (our production network isn't currently on AWS so I'm a couple of years behind in my day-to-day experience with them).
That said, I don't dispute that there might be use cases where classic ELB is a better option. And I'm glad it's still available (as opposed to ALB replacing classic).
Anybody know whether the new ALB handles client TLS (SSL) certificates when operating in HTTP mode?
I was trying to secure an API Gateway backend using a client certificate but found ELB doesn't currently support client-side certificates when operating in HTTP mode.
There was this complicated Lambda proxy workaround solution but I gave up halfway through...
I'm in the process of containerizing an app that includes a WebSockets service, and given ECS/ELB limitations, we'd just decided to go with Kubernetes as the orchestration layer.
This ALB announcement + the nicer ECS integration could tip the balance though.
Any thoughts on how likely it is that Kubernetes can/will take advantage of ALBs (as Ingress objects, I suppose) soon?
AWS Certificate Manager supports SNI, and you can request certificates with multiple hosts and even wildcard domains. I would imagine that if you upload your own multi-domain certificate it would work the same way, but I have never tested that.
> You're missing the use-case where you want to use a wildcard certificate and an EV certificate. You can't get an EV wildcard certificate.
Yes, that's not possible, as EV certs are not issued for wildcards.
My counter is that EV certs are for chumps and the entire concept is a scamola. The only justification I'd accept for getting one is proper A/B testing that an EV cert lead to increased revenue. There's no inherent security argument for them.
This is definitely nicer than having to create subdomains for microservices and mapping each subdomain URL to its own Elastic Load Balancer + Elastic Beanstalk instance. But I have already gone down this path, so I am unlikely to use the AWS Application Load Balancer. I wish I had this option a year ago.
I wouldn't consider this a full haproxy/nginx replacement just yet. It doesn't support host based routing, so you would need a different ALB for each host in order to get the same thing you would get with haproxy/nginx.
If your setup worked under classic, you didn't have to set up your VPC and worry about internet gateways, routing, and all of that stuff. It is better to use a VPC, and newer accounts don't even get the option, but some of us have instances that have been running without issue for 7 years and don't want to reconfigure what is working.
Where is this documented officially? There's very little mention of WebSockets in the guides, and none at all in the ALB creation wizard, which I didn't find reassuring.