
So we had this cause a spectacular outage a few years ago.

We were doing exactly this - but we had a flaw: we didn't handle the case when the AWS API itself was down.

So we were constantly monitoring how many running instances we had - but when the API went down, just as we were ramping up for our peak traffic, the system thought that none were running (because the API was down), so it just kept continually launching instances.
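In spirit, the broken loop looked something like this hypothetical sketch (boto3, with a made-up AMI ID and fleet size - not the actual code). The bug is that an API failure is indistinguishable from "zero instances running", so the reconciliation pass launches the whole delta every cycle:

    # Hypothetical sketch of the anti-pattern: an API outage looks like an empty fleet.
    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    ec2 = boto3.client("ec2")
    TARGET_FLEET_SIZE = 200  # made-up number for illustration

    def count_running_instances():
        try:
            pages = ec2.get_paginator("describe_instances").paginate(
                Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
            )
            return sum(len(r["Instances"])
                       for page in pages for r in page["Reservations"])
        except (ClientError, EndpointConnectionError):
            return 0  # BUG: "API is down" gets reported as "nothing is running"

    def reconcile():
        missing = TARGET_FLEET_SIZE - count_running_instances()
        if missing > 0:
            ec2.run_instances(ImageId="ami-12345678", InstanceType="c5.large",
                              MinCount=missing, MaxCount=missing)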

The increased scale pummeled the control plane with thousands of instances all trying to come online and pull down the data they needed to get operational -- which then killed our DBs, pipeline, etc...

We had to reboot our entire production environment at peak service time...






That's not the right way to do it. You shouldn't monitor how many instances you're running. You just need to determine how many instances you should be running based on your scaling driver (cpu, # of users, database connections, etc). Then you call the Auto Scaling SetDesiredCapacity API with the number, and it is idempotent[1]. If the AWS API is down, your fleet size just won't change.

[1] https://docs.aws.amazon.com/autoscaling/ec2/APIReference/API...
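For example, a minimal sketch of that pattern with boto3 (the ASG name and users-per-instance ratio are made-up placeholders):

    # Minimal sketch: compute desired capacity from a scaling driver and let the ASG reconcile.
    import boto3

    asg = boto3.client("autoscaling")

    def scale(active_users, users_per_instance=500, min_size=2, max_size=100):
        desired = min(max(active_users // users_per_instance, min_size), max_size)
        # Idempotent: repeating the call with the same number is a no-op, and if the
        # API is unreachable the fleet simply stays at its current size.
        asg.set_desired_capacity(
            AutoScalingGroupName="web-fleet",  # hypothetical ASG name
            DesiredCapacity=desired,
            HonorCooldown=True,
        )

The key property is that you only ever declare the target; you never compute a delta from observed state.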


> That's not the right way to do it.

The poster is aware of this, which is why they talked specifically about what they did wrong.


While the poster was aware of it, he did not provide a solution, whereas DVassallo provided a valuable step-by-step on how to do it properly, which may help others in the future.

Think long and hard why you felt it necessary to make your comment and what value it actually provided.


> Think long and hard why you felt it necessary to make your comment and what value it actually provided.

Yeah, that's what I was telling the other poster in a nicer way. Same goes to you. You can provide advice without repeating criticism for no reason when the person specifically said they did something wrong.

And save me the lecturing on "long and hard." My comment has 15 upvotes, so it's pretty unlikely I'm the one off the mark in this conversation.


> You can provide advice without repeating criticism for no reason when the person specifically said they did something wrong.

That statement is a joke; here it is reworded:

> A person should be invulnerable to criticism as long as they make a humbling remark.

Doesn't sound so great now, does it?

Also, I'm not surprised you got 15 upvotes. This place ceased to be a hacker forum many years ago. Too many eternal politically correct Septembers.


A strawman argument, plus "when your view is not popular, the environment must be the problem." A classic undefeatable argument. I'm surprised you have problems getting along here.

> when your view is not popular, the environment must be the problem

You were the one using upvotes to validate your argument, when it's a fact that posting a political opinion in a left-leaning versus a right-leaning forum will net highly different responses. Of course the environment plays a part.

Won't even bother dissecting the first shot. Your arguments have been weak at best till now; this final one was the final straw, man.


If this is still a pitfall for users of AWS ~5 years in... then it's not a fault of my communication...

You know what I think it's a fault of:

Lack of a canonical DevOps "university" stemming from SV startups.

DevOps at this point should not just be a degree -- it should be a freaking field of study and classes offered by YC.... Look at their pedigree of companies at scale. We should all make an effort to give back in this regard...


devops as a field of serious study is pretty pathetic. I wouldn't trust a devops 'grad' to do anything or know anything. But with the resurgence in certs and boot camps and other snake oil making $$ why not?

Yeah, like I said - this was a few years ago, and the system wasn't designed to be able to scale using ASGs at the time (Fleet didn't yet exist, and a bunch of other reasons) - but scaling was based on users and load complexity for the data we were handling -- this wasn't a web service.

I don't understand--how were you launching instances if the API(s) was/were down? Your system was unable to determine that there were instances running, but it was able to send RunInstances requests to EC2?

Correct.

Couldn't query, but could initiate.


Queue them up?

Another thing to keep in mind is that AWS local capacity can run incredibly close to the wire at times. You might be surprised if you knew how much capacity for your instance type was actually available under the hood. I’ve personally seen insufficient capacity errors.
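For reference, that surfaces as an InsufficientInstanceCapacity error from RunInstances; one rough sketch of handling it is to fall back to a subnet in another AZ (the subnet IDs and AMI here are placeholders):

    # Rough sketch: retry in another AZ's subnet on InsufficientInstanceCapacity.
    import boto3
    from botocore.exceptions import ClientError

    ec2 = boto3.client("ec2")
    SUBNETS_BY_AZ = ["subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"]  # placeholders

    def launch_with_az_fallback(count):
        for subnet in SUBNETS_BY_AZ:
            try:
                return ec2.run_instances(ImageId="ami-12345678", InstanceType="c5.large",
                                         MinCount=count, MaxCount=count, SubnetId=subnet)
            except ClientError as e:
                if e.response["Error"]["Code"] != "InsufficientInstanceCapacity":
                    raise
        raise RuntimeError("no capacity in any configured AZ")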

I've always kept a mental capacity tracker in my head...

Would love to hear what others are also tracking.

Goes to show what their business numbers are....


What was your resolution to this issue? Did you fix your service to account for the API being down, or did you switch to an entirely different approach?

I can't recall the exact implementation details, but we then logged the number of running instances to a file, read back the last known quantity and the delta from when we launched, and made the system not get overly aggressive if it couldn't read the current set.
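Something in the spirit of this hypothetical sketch (the file path and cap are made up): persist the last count you managed to observe, and refuse to launch more than a small number of instances while flying blind.

    # Hypothetical sketch: cap launches using the last count we managed to persist.
    import json, time

    STATE_FILE = "/var/lib/scaler/last_known.json"   # made-up path
    MAX_LAUNCHES_WHEN_BLIND = 5                      # stay conservative without fresh data

    def save_state(running_count):
        with open(STATE_FILE, "w") as f:
            json.dump({"running": running_count, "ts": time.time()}, f)

    def allowed_launches(desired, current_count=None):
        if current_count is not None:             # API answered: record it, scale normally
            save_state(current_count)
            return max(desired - current_count, 0)
        try:                                      # API down: fall back to the persisted count
            with open(STATE_FILE) as f:
                last = json.load(f)["running"]
        except (OSError, ValueError, KeyError):
            return 0
        return min(max(desired - last, 0), MAX_LAUNCHES_WHEN_BLIND)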

We also added smart loading across AZs due to spot instances getting whacked when our fleet was outbid and AWS took them back.

As well as other monitoring methods to be sure we weren't caught with a smart system doing dumb things.


AWS has limits on the amount of resources you can have in a VPC. You can request increases through an out-of-band process, outside the API. This mechanism is there exactly for these kinds of things (and for malicious API calls, should you get hacked). Maybe someone at your company was thinking too big? Normally these are around 10-50 for each EC2 instance type.

You are aware that if you have a close enough relationship with AWS you can request and set your own limits?

Limits are malleable based on your use case. Speak with your rep.

You might not even know how the limits came to be... I do.

---

There was a time when git suffered a flaw, and a junior dev also slipped up by checking in secrets.... thousands of instances across the globe were launched... for bitcoin mining... $700,000 in a few hours...

We all learned a bit that day.



