We were doing exactly this - but we had a flaw: we didn't handle the case when the AWS API was actually down.
So we were constantly monitoring how many running instances we had - but when the API went down, just as we were ramping up for our peak traffic, the system thought that none were running (because the API was down), so it just kept launching instances.
The increased scale of instances pummeled the control plane, with thousands of instances all trying to come online and pull down the data they needed to get operational -- which then killed our DBs, pipeline, etc...
We had to reboot our entire production environment at peak service time...
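In hindsight the fix was conceptually simple: treat "the API call failed" as unknown state, not as zero running instances. A minimal sketch of that guard in Python/boto3 (illustrative only, not our actual code; launch() is a hypothetical helper):

```python
import boto3
from botocore.exceptions import ClientError, EndpointConnectionError

ec2 = boto3.client("ec2")

def running_instance_count():
    """Return the running-instance count, or None if the API is unreachable."""
    try:
        pages = ec2.get_paginator("describe_instances").paginate(
            Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
        )
        return sum(
            len(res["Instances"])
            for page in pages
            for res in page["Reservations"]
        )
    except (ClientError, EndpointConnectionError):
        return None  # unknown state is NOT the same as zero instances

def reconcile(desired):
    count = running_instance_count()
    if count is None:
        return  # fail safe: never launch when we can't see current state
    if count < desired:
        launch(desired - count)  # hypothetical launch helper
```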
The poster is aware of this, which is why they talked specifically about what they did wrong.
Think long and hard about why you felt it necessary to make your comment and what value it actually provided.
Yeah, that's what I was telling the other poster in a nicer way. Same goes to you. You can provide advice without repeating criticism for no reason when the person specifically said they did something wrong.
And save me the lecturing on "long and hard." My comment has 15 upvotes, so it's pretty unlikely I'm the one off the mark in this conversation.
That statement is a joke, here it is reworded:
> A person should be invulnerable to criticism as long as they make a humbling remark.
Doesn't sound so great now, does it?
Also, I'm not surprised you got 15 upvotes. This place ceased to be a hacker forum years ago. Too many eternal, politically correct Septembers.
You were the one using upvotes to validate your argument, when it's a fact that posting a political opinion in a left-leaning versus a right-leaning forum will net very different responses. Of course the environment plays a part.
Won't even bother dissecting the first shot. Your arguments have been weak at best up to now, and this one was the final straw, man.
You know what I think it's a fault of:
The lack of a canonical DevOps "university" stemming from SV startups.
DevOps at this point should not just be a degree -- it should be a freaking field of study, with classes offered by YC.... Look at their pedigree of companies at scale. We should all make an effort to give back in this regard...
Couldn't query, but could initiate.
Would love to hear what others are also tracking.
Goes to show what their business numbers are....
We also added smart loading across AZs due to spot instances getting whacked when our fleet was outbid and AWS took them back.
As well as other monitoring methods to be sure we weren't caught with a smart system doing dumb things.
Limits are malleable based on your use case. Speak with your rep.
You might not even know how limits came to be... I do.
There was a time when git suffered a flaw, and a junior dev made the mistake of checking in secrets.... thousands of instances across the globe were launched... for bitcoin mining... $700,000 in a few hours...
We all learned a bit that day.
Coupled with a horizontal pod autoscaler (which sets the replica count based on a metric), you get the best of both worlds.
For example, we have daemons reading messages from SQS; if you try to use auto scaling based on SQS metrics, you realize pretty quickly that CloudWatch is updated every 5 minutes. For most messages, this is simply too late.
In a lot of cases, you are better off updating CloudWatch yourself at your own interval using Lambda functions (for example) and letting the rest follow the path of AWS-managed auto scaling.
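A minimal sketch of that pattern (queue URL and metric namespace are placeholders): a Lambda run on a one-minute schedule that asks SQS for its backlog directly and publishes it as a custom metric for the scaling policy to track.

```python
import boto3

sqs = boto3.client("sqs")
cloudwatch = boto3.client("cloudwatch")

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/work-queue"  # placeholder

def handler(event, context):
    # Ask SQS directly instead of waiting for the 5-minute CloudWatch lag.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Publish as a custom metric; the scaling policy alarms on this instead.
    cloudwatch.put_metric_data(
        Namespace="Custom/Queues",  # placeholder namespace
        MetricData=[{"MetricName": "Backlog", "Value": backlog, "Unit": "Count"}],
    )
```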
There is also cascading auto scaling that you need to account for. If we take ECS as an example, you need auto scaling for the running containers (Tasks) AND, after that, auto scaling for the underlying EC2 resources. These have different scaling speeds: containers scale almost instantly while instances scale much more slowly. Even if you bake your own image, there is still a significant delay.
For example, imagine that each machine has 20 available threads for processing messages received from SQS. Then I'd track a metric which is the percent of threads that are in use. If I'm trying to meet a message processing SLA, then my goal is to begin auto-scaling before that in-use percentage reaches 100%, e.g., we might scale up when the average thread utilization breaches 80%. (Or if you process messages with unlimited concurrent threads, then you could use CPU utilization instead.)
The benefit of this approach is that you can begin auto-scaling your system before it saturates and messages start to be delayed. Messages will only be delayed once the in-use percent reaches 100% -- as long as there are threads available (i.e., in-use < 100%), messages will be processed immediately.
If you were to auto-scale on SQS metrics like queue length, the length stays approximately zero until the system starts falling behind, and by then it's too late: you can't preemptively scale while load is increasing. By monitoring and scaling on thread capacity, you can track your effective utilization as it climbs from 50% to 80% to 100%, and you can begin scaling before it reaches 100%, before messages start to back up.
The other benefit of this approach is that it works equally well at many different scales; a threshold like 80% thread utilization works just as well with a single host as with a fleet of 100 hosts. By comparison, thresholds on metrics like queue length need to be adjusted as the scale and throughput of the system change.
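To make the thread-utilization idea concrete, here's a rough sketch (the 20-thread pool, namespace, and handle() are illustrative, not anyone's production code) of a worker that publishes its in-use percentage as a custom CloudWatch metric:

```python
import threading
import boto3

MAX_THREADS = 20  # illustrative pool size
in_use = 0
lock = threading.Lock()
cloudwatch = boto3.client("cloudwatch")

def process(message):
    global in_use
    with lock:
        in_use += 1
    try:
        handle(message)  # hypothetical business logic
    finally:
        with lock:
            in_use -= 1

def publish_utilization():
    # Call this periodically (e.g. every 10s). The scaling alarm fires when
    # the fleet average breaches ~80%, well before saturation at 100%.
    with lock:
        pct = 100.0 * in_use / MAX_THREADS
    cloudwatch.put_metric_data(
        Namespace="Custom/Workers",  # placeholder namespace
        MetricData=[{"MetricName": "ThreadUtilization", "Value": pct, "Unit": "Percent"}],
    )
```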
For example (and I know nothing about the use-case of OP, can only estimate), you might be able to buffer requests into a queue and have it scale up slower.
You might have auto scaling that needs to be close to real time and auto scaling that can happen on a span of minutes.
Every auto-scaling design also needs to keep storage scaling in mind; often you are limited by DB write capacity or similar constraints.
Additionally, a major use case for ECS is machine learning workloads powered by GPUs, and Fargate does not yet support this. With ECS you can run p2 or p3 instances and orchestrate machine learning containers across them, even with GPU reservation and GPU pinning.
ECS GPU scheduling is production ready, and the initial getting-started workflow is streamlined quite a bit because we provide a maintained GPU-optimized AMI for ECS that already has your NVIDIA kernel drivers and Docker GPU runtime. ECS supports GPU pinning for maximum performance, as well as mixed CPU and GPU workloads in the same cluster: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/...
One is your options for doing forensics on Fargate. AWS manages the underlying host, so you give up the option of doing host-level investigations. It's not necessarily worse, as you can fill this gap in other ways.
Logging is currently only via CloudWatch logs so if you want to get logs into something like Splunk you’ll have to run something that can pick up these logs. You’ll have that issue to solve if you want logs from some other AWS services like Lambda to go to the same place. The bigger issue for us is that you can’t add additional metadata to log events without building that into your application or getting tricky with log group names. On EC2 we’ve been using fluentd to add additional context to each log event like the instance it came from, the AZ, etc. Support for additional log drivers on Fargate is on the public roadmap so there will hopefully be some more options soon.
 Fargate Log driver support v1 https://github.com/aws/containers-roadmap/issues/9
 Fargate log driver support v2 https://github.com/aws/containers-roadmap/issues/10
Edit: another one is that you can run ECS on spot fleet and save some money.
The choice of metric is important, but it needs to be a metric that predicts future traffic if you want to autoscale user facing services. CPU load is not that metric.
The best way to do autoscaling is to build a system that is unique to your business to predict your traffic, and then use AWS's autoscaling as your backup for when you get your prediction wrong.
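A hedged sketch of the shape that can take (the forecast here is a deliberately naive stand-in, and the ASG name is a placeholder); the point is that the prediction logic is yours, and AWS target tracking stays attached as the fallback:

```python
from datetime import timedelta
import boto3

autoscaling = boto3.client("autoscaling")

def predicted_instances_needed(now, history):
    """Deliberately naive stand-in forecast: same hour last week, padded 20%.
    The real model is whatever your business knows about its own traffic."""
    last_week = history.get(now - timedelta(days=7), 0)
    return int(last_week * 1.2) + 1

def scale_ahead(now, history):
    autoscaling.set_desired_capacity(
        AutoScalingGroupName="web-fleet",  # placeholder ASG name
        DesiredCapacity=predicted_instances_needed(now, history),
        HonorCooldown=False,
    )
    # A normal target-tracking policy stays attached to the same ASG as the
    # backup for when the prediction is wrong.
```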
If you're already invested in ECS it's a different story, of course.
AWS is committed to improving both ECS and EKS. You can see our public roadmap with many in progress improvements for both container orchestration platforms here: https://github.com/aws/containers-roadmap/projects/1
Feel free to add your own additional suggestions, or vote on existing items to help us prioritize what we should work on faster!
Some of the problems we're seeing: task placement and waiting is too hard (we had to write our own jittered waiter, sketched below, so as not to overload the ECS API endpoints when asking if our tasks are ready to place). Scaling the underlying EC2 instances is slow. The task definition => family => container definition hierarchy is not great. Log discovery is a bitch.
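For the curious, the jittered waiter is conceptually just exponential backoff with full jitter around DescribeTasks; a minimal sketch (not our actual code):

```python
import random
import time
import boto3

ecs = boto3.client("ecs")

def wait_for_running(cluster, task_arns, base=2.0, cap=30.0, attempts=20):
    """Poll DescribeTasks with exponential backoff plus full jitter so a
    fleet of deployers doesn't hammer the ECS API in lockstep."""
    for attempt in range(attempts):
        resp = ecs.describe_tasks(cluster=cluster, tasks=task_arns)
        if resp["tasks"] and all(t["lastStatus"] == "RUNNING" for t in resp["tasks"]):
            return True
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    return False
```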
Are these all solved under K8S? I've no experience with kubernetes, but if so, might need to rethink where we run our containers. ECS was just so easy, and then so hard.
Do you mind if I ask how long this takes for you right now, and how long it would have to take for you to consider it fast?
It's significantly more complex to host Kubernetes infrastructure, and EKS is significantly more expensive.
ECS would be a first choice, with EKS a second choice if my needs dictated it (perhaps a hybrid or multicloud scenario).
To clear up the confusion on the relationship between Fargate and ECS, think of Fargate as the hosting layer: it runs your container for you on demand and bills you for the amount of CPU and GB your container reserved per second. On the other hand ECS is the management layer. It provides the API that you use to orchestrate launching X containers, spreading them across availability zones, and hooking them up to other resources automatically (like load balancers, service discovery, etc).
Currently you can use ECS without using Fargate, by providing your own pool of EC2 instances to host the containers on. However, you cannot use Fargate without ECS, as the hosting layer doesn't know how to run your full application stack without being instructed to by the ECS management layer.
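You can see the split in the API itself: the same ECS RunTask call covers both, and the launch type just selects who provides the hosting layer. Roughly (cluster, task definition, and subnet are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Fargate launch type: ECS orchestrates, AWS provides the hosts.
ecs.run_task(
    cluster="my-cluster",       # placeholder
    taskDefinition="my-app:1",  # placeholder
    launchType="FARGATE",
    networkConfiguration={
        "awsvpcConfiguration": {"subnets": ["subnet-abc123"]}  # placeholder
    },
    count=1,
)

# EC2 launch type: same orchestration, but the task lands on EC2
# instances you registered into the cluster yourself.
ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-app:1",
    launchType="EC2",
    count=1,
)
```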
Which is why, I assume, Fargate is now listed as an integral feature of ECS on the product page.
Having two resources (such as DB and app) scale in concert can be exceedingly difficult.
This means your resources are too tightly coupled. If they are so tightly coupled that they need to scale together, then they are not really two resources: you should either look into restructuring them into two actual resources, or bind them more closely into a single resource.
But when I tried it, it turns out the docker support requires very specific base images.
Not really docker then is it?!?
Jelastic public cloud providers offer automatic vertical scaling with a pay-as-you-use billing model (please do not confuse it with pay-as-you-go). Besides this, our team helps related technologies become more elastic, for example Java: https://jelastic.com/blog/elastic-jvm-vertical-scaling/
Regarding the docker support, there are two flavors inside Jelastic:
1) Native Docker Engine - you can create a dedicated container engine for your project in the same way as you do on any IaaS today, for example “How to run Docker Swarm” https://jelastic.com/blog/docker-swarm-auto-clustering-and-s.... An advantage here is the vertical scaling feature: in Jelastic, unused resources are not billed, while at any other cloud provider you have to pay for the VM resource limits.
2) Enhanced System Containers based on Dockerfile - there is no need to provision a dedicated docker engine or swarm. This solution provides even better density, elasticity, multi-tenancy and security, plus more advanced integration with the UI and PaaS feature set compared to #1. It supports multiple processes inside a single container; you can get SSH access and use all the standard tools for app deployment, write to the local filesystem, use multicast and so on. It supports traditional or legacy apps, while images can be prepared in the same familiar Dockerfile format. Unfortunately it's not fully compatible with the Native Docker Engine due to specific limitations/requirements of the docker technology itself.
Thank you for pointing out this issue. In the upcoming release we will clarify the difference between the two and provide more tips on which one is better to use in various cases.
Regarding 2: Is there any fundamental reason why full compatibility will never work?
2 - As Docker, Open Containers and other related technologies evolve, the difference between system and application containers gets slimmer over time. As an example, the well-known issue with memory limits https://jelastic.com/blog/java-and-memory-limits-in-containe... can now be solved with the help of lxcfs https://medium.com/@Alibaba_Cloud/kubernetes-demystified-usi.... I hope at some point we will be able to use the benefits of both in one container engine.
2) In 'Surprise 3' the author claims that Terraform's aws_appautoscaling_policy 'is rather light on documentation'. Having used Terraform for several years, I find that inaccurate, mostly because of the several examples available in the documentation https://www.terraform.io/docs/providers/aws/r/appautoscaling... as well as the many, many more that a GitHub exact search for "aws_appautoscaling_policy" language:HCL reveals in open-source repos (some with permissive licenses too). I created a custom ecs-service TF module which creates for each service (optionally) an ALB along with listeners and the attached ACM-issued TLS certs and TGs, the scale-in/out CW alerts with configurable thresholds/policies, SGs, Route53, etc., allowing one to configure and launch an ECS service quickly and reliably.
Regarding scale-in, I typically also set it to intervals between 5 and 15 minutes to avoid an erratic scale-in/scale-out 'zig-zag', even at the cost of briefly over-provisioning.
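For what it's worth, the same knobs exist outside Terraform. Via boto3, asymmetric cooldowns (scale out fast, scale in slowly) look roughly like this (resource IDs and values are placeholders, and the service must already be registered as a scalable target):

```python
import boto3

aas = boto3.client("application-autoscaling")

aas.put_scaling_policy(
    PolicyName="cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",  # placeholder
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleOutCooldown": 60,   # react to load quickly
        "ScaleInCooldown": 900,   # shed capacity slowly to avoid the zig-zag
    },
)
```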
Read capacity is easy to scale to infinity with caches. But if a DB can only write 1000 updates per second, nothing will change that.
In many cases, it's OK not to process EVERYTHING right away. Process the important stuff RIGHT AWAY. Slowly process the unimportant stuff in your spare time.
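One simple shape for that with SQS (queue URLs are placeholders, handle() is hypothetical): always drain the important queue first, and only touch the bulk queue when the important one is empty.

```python
import boto3

sqs = boto3.client("sqs")
URGENT = "https://sqs.us-east-1.amazonaws.com/123456789012/urgent"  # placeholder
BULK = "https://sqs.us-east-1.amazonaws.com/123456789012/bulk"      # placeholder

def poll_once():
    # Always drain the urgent queue first.
    queue = URGENT
    msgs = sqs.receive_message(QueueUrl=URGENT, MaxNumberOfMessages=10)
    if not msgs.get("Messages"):
        # Spare capacity: chip away at the unimportant backlog.
        queue = BULK
        msgs = sqs.receive_message(QueueUrl=BULK, MaxNumberOfMessages=10)
    for m in msgs.get("Messages", []):
        handle(m)  # hypothetical handler
        sqs.delete_message(QueueUrl=queue, ReceiptHandle=m["ReceiptHandle"])
```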
Having to deploy updates to stopped instances would be complicated, and you'd have to pay EBS costs for stopped instances, but I'm curious if there are other issues. When launching an instance from an AMI, even after the instance comes up the disk tends to be very slow for some time, as if it's lazily loading the filesystem over the network.
This is a feature that is currently on our public roadmap for container services, in the "Researching" category: https://github.com/aws/containers-roadmap/issues/76
Feel free to drop a thumbs up on the roadmap item to show your support and boost its priority on the roadmap, or leave a comment to let us know more about your needs.
I have been in DevOps since before it had a name, and I see many companies solving the same problems.
Like this auto-scaling post. That's not the first company to deal with it (nor the last). Providing a set of tools can be very beneficial, but it's so hard to dial in.
I have a very big itch around solving this problem.
EDIT: I meant killing containers, not the Tasks themselves. Sorry.
I'm pretty confident AWS's product for this use case would be Lambda and the new on-demand DynamoDB.
Is there actually a use case in analytics that requires a server that accepts connections from multiple clients, and then has to maintain <60ms latency between those clients (including state over the wire and executing sophisticated business logic) for time periods longer than 5 seconds? I.e., something that resembles a video game?
Because if there isn't, if your goal is to scale, why have containers at all?
“Today the service runs for between $150/mo and $200/mo depending on burst traffic and other factors. If I factor that into my calculator (assuming a $200/mo spend), ipify is able to service each request for total cost of $0.000000007. That’s astoundingly low.
If you compare that to the expected cost of running the same service on a serverless or functions-as-a-service provider, ipify would cost thousands of dollars per month.”
Sadly so, because I can't stand lambda!
We have one service running that ingests data into a Kinesis stream published as an API GW endpoint. Preprocessing is done in Lambda in batches of 100 records, and the processed records get pushed to Firehose streams for batched loads into a Redshift cluster for analytics. So far we've been very happy with the solution - very little custom code, almost no ops required, and it performs and scales well.
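For anyone curious, the preprocessing Lambda in that kind of pipeline is a small amount of glue; a skeleton (stream name and transform() are placeholders, not the poster's code):

```python
import base64
import boto3

firehose = boto3.client("firehose")

def handler(event, context):
    # Kinesis hands Lambda up to the configured batch size (100 here).
    records = []
    for r in event["Records"]:
        raw = base64.b64decode(r["kinesis"]["data"])
        records.append({"Data": transform(raw)})  # hypothetical preprocessing

    # Firehose buffers these again before the batched COPY into Redshift.
    firehose.put_record_batch(
        DeliveryStreamName="analytics-stream",  # placeholder
        Records=records,
    )
```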
I'm surprised it can't use load average to estimate the true resource demand.
The number of task switches may be very high for a number of reasons, including a very large number of threads which each do a very small chunk of work and yield.
How many errors are in that sentence? The mysterious 95% -> 90 conversion. 100/90 is actually 1.111 (repeating, of course), not 1.1. And if it did equal 1.1, it would be 10%, not 11%.
They haven't yet matured enough, it seems. CloudFormation is the right way to code your infra, not the web console or anything else.
Good sign, though, is they use ECS, not Kubernetes :)