> Can't imagine what a service refactor is like at A. I bet it sucks
It's not all that hard. AWS heavily focuses on Service Oriented Architecture approaches, with specific knowledge/responsibility domains for each. It's a proven scalable pattern. The APIs will often be fairly straight-forward behind the front end. With clearly lines of responsibility between components, you'll almost never have to worry about what other services are doing. Just fix what you've got right in front of you. This is an area where TLA+ can come in handy too. Build a formal proof and rebuild your service based on it.
I joined Glacier 9 months after launch, and it was in the band-aid stage. In cloud services your first years will roughly look like:
1) Band-aids, and emergency refactoring. Customers never do what you expect or can predict them to do, no matter how you price your service to encourage particular behaviour. First year is very much keep the lights on focused. Fixing bugs and applying band-aids where needed. In AWS, it's likely they'll target a price decrease for re:invent instead of new features.
2) Scalability, first new feature work. Traffic will hopefully be picking up by now for your service, you'll start to see where things may need a bit of help to scale. You'll start working on the first bits of innovation for the platform. This is a key stage because it'll start to show you where you've potentially painted yourself in to a corner. (AWS will be looking for some bold feature to tout at Re:Invent)
3) Refactoring, feature work starts in earnest. You'll have learned where your issues are. Product managers, market research, leadership etc. will have ideas about what new features to be working on, and have much more of a roadmap for your service. New features will be tied in to the first refactoring efforts needed to scale as per customer workload, and save you from that corner you're painted in to.
Year 3 is where some of the fun kicks in. The more senior engineers will be driving the refactoring work, they know what and why things were done how they were done, and can likely see how things need to be. A design proposal gets created and refined over a few weeks of presentations to the team and direct leadership. It's a broad spectrum review, based around constructive criticism. Engineers will challenge every assumption and design decision. There's no expectation of perfection. Good enough is good enough. You just need to be able to justify why you made decisions.
New components will be built from the design, and plans for roll out will be worked on. In Glacier's case one mechanism we'd use was to signal to a service that $x % of requests should use a new code path. You'd test behind a whitelist, and then very very slowly ramp up public traffic in one smaller region towards the new code path while tracking metrics until you hit 100%, repeat the process on a large region slightly faster, before turning it on everywhere. For other bits we'd figure out ways to run things in shadow mode. Same requests hitting old and new code, with the new code neutered. Then compare the results.
side note: One of the key values engineers get evaluated on before reaching "Principal Engineer" level is "respect what has gone before". No one sets out to build something crap. You likely weren't involved in the original decisions, you don't necessarily know what the thinking was behind various things. Respect that those before you built something as best as suited the known constraints at the time. The same applies forwards. Respect what is before you now, and be aware that in 3-5 years someone will be refactoring what you're about to create. The document you present to your team now will help the next engineers when they come to refactor later on down the line. Things like TLA+ models will be invaluable here too.
Indeed your side note should be the first point. I see that very often in practice. As a result, products get rewritten all the time because developers dont want to spend time understanding the current system believing they can do better job than the previous one. They will create different problems. And the cycle repeats.
The not-invented-here (or by-me) syndrome is probably also at play here.
The English guy, if that's enough of a clue. It has been about 4 1/2 years now since I left Glacier, so there's every chance our paths never overlapped.
> Dat soundz like a bank and not a cloud provider.
The first stage in making something reliable, sustainable, and as easy to run as possible is to understand the problem, and understand what you're trying to achieve. You shouldn't be writing any code until you've got that figured out, other than possibly to make sure you understand something you're going to propose.
It's good software engineering, following practices learned, overhauled, and refined over decades, that have a solid track record of success. It's especially vital where you're working on something like AWS, Azure, etc. cloud services.
If you leap feet first in to solving a problem, you'll just end up with something that is unnecessarily painful down the road, even in the near term. It's often quicker to do the proposal, get it reviewed, and then produce the product than it is to dive in and discover all the gotchas as you go along. The process doesn't take too long, either.
Every service in AWS will follow similar practices, and engineers do it often enough that whipping up a proposal becomes second nature and takes very little time. Just writing the proposal in and of itself is valuable because it forces you to think through your plan carefully, and it's rare for engineers not to discover something that needs clarified when they write their plan down. (side-note: All of this paperwork is also invaluable evidence for any promotion that they may be wanting, arguably as much as actually releasing the thing to production). It shouldn't take a day to write a proposal, and you'd only need a couple of meetings a few days apart to do the initial review and final review. Depending on the scope of what came up in the initial review, the final review may be a quick rubber stamp exercise or not even necessary at all.
Where I am now, we've got an additional cross-company group of experienced engineers that can also be called on to review these proposals. They're almost always interesting sessions because it brings in engineers who will have a fresh perspective, rather than ones with preconceived notion based on how things currently are.
An anecdote I've shared in greater detail here before: Years ago we had a service component that needed created from scratch and had to be done right. There was no margin for error. If we'd made a mistake, it would have been disastrous to the service. Given what it was, two engineers learned TLA+, wrote a formal proof, found bugs, and iterated until they got it fixed. Producing the java code from that TLA+ model proved to be fairly trivial because it almost became a fill-in-the-blanks. Once it got to production, it just worked. It cut down what was expected to be a 6 months creation and careful rollout process down to just 4 months, even including time to run things in shadow mode worldwide for a while with very careful monitoring. That component never went wrong, and the operational work for it was just occasional tuning of parameters that had already been identified as needing to be tune-able during design review.
In an ideal world, we'd be able to do something like how Lockheed Martin Corps did for the space shuttles: https://www.fastcompany.com/28121/they-write-right-stuff, but good enough is good enough, and there's ultimately diminishing returns on effort vs value gained.
It's not all that hard. AWS heavily focuses on Service Oriented Architecture approaches, with specific knowledge/responsibility domains for each. It's a proven scalable pattern. The APIs will often be fairly straight-forward behind the front end. With clearly lines of responsibility between components, you'll almost never have to worry about what other services are doing. Just fix what you've got right in front of you. This is an area where TLA+ can come in handy too. Build a formal proof and rebuild your service based on it.
I joined Glacier 9 months after launch, and it was in the band-aid stage. In cloud services your first years will roughly look like:
1) Band-aids, and emergency refactoring. Customers never do what you expect or can predict them to do, no matter how you price your service to encourage particular behaviour. First year is very much keep the lights on focused. Fixing bugs and applying band-aids where needed. In AWS, it's likely they'll target a price decrease for re:invent instead of new features.
2) Scalability, first new feature work. Traffic will hopefully be picking up by now for your service, you'll start to see where things may need a bit of help to scale. You'll start working on the first bits of innovation for the platform. This is a key stage because it'll start to show you where you've potentially painted yourself in to a corner. (AWS will be looking for some bold feature to tout at Re:Invent)
3) Refactoring, feature work starts in earnest. You'll have learned where your issues are. Product managers, market research, leadership etc. will have ideas about what new features to be working on, and have much more of a roadmap for your service. New features will be tied in to the first refactoring efforts needed to scale as per customer workload, and save you from that corner you're painted in to.
Year 3 is where some of the fun kicks in. The more senior engineers will be driving the refactoring work, they know what and why things were done how they were done, and can likely see how things need to be. A design proposal gets created and refined over a few weeks of presentations to the team and direct leadership. It's a broad spectrum review, based around constructive criticism. Engineers will challenge every assumption and design decision. There's no expectation of perfection. Good enough is good enough. You just need to be able to justify why you made decisions.
New components will be built from the design, and plans for roll out will be worked on. In Glacier's case one mechanism we'd use was to signal to a service that $x % of requests should use a new code path. You'd test behind a whitelist, and then very very slowly ramp up public traffic in one smaller region towards the new code path while tracking metrics until you hit 100%, repeat the process on a large region slightly faster, before turning it on everywhere. For other bits we'd figure out ways to run things in shadow mode. Same requests hitting old and new code, with the new code neutered. Then compare the results.
side note: One of the key values engineers get evaluated on before reaching "Principal Engineer" level is "respect what has gone before". No one sets out to build something crap. You likely weren't involved in the original decisions, you don't necessarily know what the thinking was behind various things. Respect that those before you built something as best as suited the known constraints at the time. The same applies forwards. Respect what is before you now, and be aware that in 3-5 years someone will be refactoring what you're about to create. The document you present to your team now will help the next engineers when they come to refactor later on down the line. Things like TLA+ models will be invaluable here too.