Cells are basically subsets or shards of customer resources, with infrastructure isolated from other cells. The reference deployment pipeline in the blog post [0] deploys to each cell individually, in waves configured by your team, ultimately limiting the blast radius of a deployment to a smaller subset of customers.
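(As a minimal sketch of how waves and cells can be expressed, here is a hypothetical CDK Pipelines setup; the CellStage class, repo name, accounts, and regions are placeholders for illustration, not the reference implementation itself.)

```ts
import { Stack, StackProps, Stage, StageProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import { CodePipeline, CodePipelineSource, ShellStep } from 'aws-cdk-lib/pipelines';

// Hypothetical stage wrapping one cell's copy of the application stacks.
class CellStage extends Stage {
  constructor(scope: Construct, id: string, props?: StageProps) {
    super(scope, id, props);
    // new ApplicationStack(this, 'App'); // each cell gets its own copy of the app stacks
  }
}

class PipelineStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const pipeline = new CodePipeline(this, 'Pipeline', {
      synth: new ShellStep('Synth', {
        input: CodePipelineSource.gitHub('my-org/my-repo', 'main'), // placeholder repo
        commands: ['npm ci', 'npm run build', 'npx cdk synth'],
      }),
    });

    // Wave 1: a single low-traffic cell, so a bad change only reaches a few customers.
    const wave1 = pipeline.addWave('Wave1');
    wave1.addStage(new CellStage(this, 'Cell1', { env: { account: '111111111111', region: 'us-east-1' } }));

    // Wave 2: the remaining cells deploy in parallel once wave 1 has gone out cleanly.
    const wave2 = pipeline.addWave('Wave2');
    wave2.addStage(new CellStage(this, 'Cell2', { env: { account: '222222222222', region: 'us-east-1' } }));
    wave2.addStage(new CellStage(this, 'Cell3', { env: { account: '333333333333', region: 'us-west-2' } }));
  }
}
```

Each wave starts only after the previous one has deployed, so a bad change caught in wave 1 never reaches the cells in later waves.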
I like having groups of customers segmented into copies of your architecture. It means you know how to spin a copy up independently, which is good for a lot of reasons.
This looks similar to the internal pipelines framework that all of Amazon uses to deploy code. Funny to see how much value Amazon has gotten out of developing frameworks for the retail business and then making those frameworks available publicly. The abstractions work because they have already been vetted across Amazon for a decade.
I think the last missing piece is reproducible builds, which I think you can get from buildpacks [0], NixOS [1], or Bazel [2] running in your CodeBuild.
I was thinking someone could probably build a manyrepo solution similar to Brazil on top of an idea like this. I don’t see many tools like Bazel built for manyrepos.
There’s no equivalent OSS implementation of a multi-language lock file like Brazil’s. You have to build your own or go with a monorepo tool like Bazel. Amazonians take reproducibility for granted because of Brazil.
I’d be happy to be wrong about this, but I don’t think you can get a reproducible Node.js or Python build from NixOS out of the box. You need to build something on top of it.
It's nice to see something like this. I maintain a reference implementation of some open-source software on AWS, but that was pieced together over the years from other people's work, now deleted blog posts (God bless the Internet Archive), and unmaintained sample code on GitHub. Hopefully, a fully worked example like this will fill in a lot of the blanks in my understanding, like how to best implement blue/green deployments (the CodeDeploy hook for CloudFormation has a lot of weird limitations), unit test automation, and monitoring.
Edit: It's blog spam, sadly. Here are direct links to the actual blog posts and tools.
Not mentioned in the article, but I’d recommend an end-to-end test pipeline as well. Deploy your tests like they’re your customers: independent and “adversarial”.
Also, add a one-box/one-cell deployment stage before your production stage. Beta/gamma are well and good, but they’ll never perfectly replicate an actual prod deployment.
What level of usage warrants these kinds of setups? Say, beyond a workflow that runs CI on PR/master off of GitHub (Actions or CircleCI, etc.) and deploys to something like Heroku or Vercel.
Do companies run apps that get 1k rps on stuff like cell-based architecture (mentioned in the top comment)?
Kinesis? Services operating at massive, hard-to-understand scales?
FWIW, I work at Amazon, and a lot of things are still solved with "boring" tech that's not "web scale". Even within our own ranks we've got to remind people that engineering is about fitting the architecture to the requirements and no more. It seems to be lost knowledge these days that an RDBMS gets you really, really far. The vast majority of apps -- even those that appear superficially "large" -- will never see usage patterns that justify anything more than a box and a database. You'd be surprised which systems are just humming along unceremoniously with a humble 3-tier architecture.
So, as you scale up and out, you stop being able to depend on higher-level abstractions from elsewhere.
You can't build a Tier-1 service like S3 or EBS on top of RDS. If you need a database solution to help you with building and running tools of that scale, it has to be something else that is a lot less complex and a lot more robust.
It's okay for RDS to be dependent on a Tier-1 service like S3 or EBS, but not the reverse. You don't want to get into priority inversion problems here.
And when your service has to be in every AZ in every region, and they all have to be kept separately running, then things get even more complex.
Then you have to ask yourself what is the bootstrap process for building a new region.
It is quite weird that "Build" and "Unit Test" are part of the Deployment Pipeline.
For me, the Deployment Pipeline starts with the Acceptance Tests that ensure business quality, running over already-created artifacts. There are advantages to this approach, as different deployment pipelines may be required to deploy to different platforms even if all artifacts are generated by one build pipeline.
AWS seems to mix the concepts of Build Pipeline with Deployment Pipeline.
Deployment has historically had a meaning in software development; I do not understand why change it now.
>> AWS seems to mix the concepts of Build Pipeline with Deployment Pipeline.
>> Deployment has historically had a meaning in software development; I do not understand why change it now.
Maybe because traditionally software shops would run a build server on premises and a deployment on AWS. Now Amazon wants to move the build process onto AWS as well, making them a little bit of extra money.
That may be the reason: to create an article that looks like professional best practices but is really just an advertisement. Microsoft used to do a lot of that in its war against Open Source back in the day.
So, at Whole Foods, we made heavy use of CodeBuild and CodePipeline. But I also had experience building and managing most of the Jenkins servers that we used before moving over to the native commercial AWS tooling.
Conceptually, it's easy enough to put these two sets of components together in the same "pipeline" tool. You just have to make sure that all the right steps and components are available and that they are properly used. If so, you can set things up so that all your code goes through multiple online checks when it gets uploaded to the repo, and you prevent it from being merged to "live" (or "head" or "master" or whatever you want to call it) unless all the checks have passed.
Once all the checks have passed, your pipeline can proceed to push that to live (possibly with a human approval required), then deploy it to beta, then gamma, then a canary "onebox" in one AZ in one region, and then finally into your first real multi-server production environment in one AZ in one region. Then you can chain on from there, with potential tests that have to be passed at each stage, baking-in periods that are required, and then on to the next stage with the next onebox in the next AZ in the next region.
Rinse and repeat.
Of course, you also need auto-rollback processes in case your deployments fail. And what happens if you need to roll back the rollbacks? Ad infinitum.
To that level, it's all just logical extensions of the initial concepts.
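(A rough sketch of those gates, assuming the CDK Pipelines style from the cell/wave example earlier in the thread; `pipeline`, `AppStage`, and the test commands are placeholders, and automatic rollback isn't shown.)

```ts
import { ManualApprovalStep, ShellStep } from 'aws-cdk-lib/pipelines';

// Inside a pipeline stack constructor, where `pipeline` is a CodePipeline and
// AppStage wraps the application stacks (as in the wave sketch above).
pipeline.addStage(new AppStage(this, 'Beta'), {
  post: [new ShellStep('BetaIntegTests', { commands: ['npm run integ-test -- beta'] })], // placeholder test command
});

pipeline.addStage(new AppStage(this, 'Gamma'), {
  post: [new ShellStep('GammaIntegTests', { commands: ['npm run integ-test -- gamma'] })],
});

// Canary/one-box in a single AZ/region, gated behind a human approval.
pipeline.addStage(new AppStage(this, 'ProdOneBox'), {
  pre: [new ManualApprovalStep('PromoteToOneBox')],
  post: [new ShellStep('OneBoxBakeCheck', { commands: ['npm run check-alarms -- prod-onebox'] })], // placeholder bake check
});

pipeline.addStage(new AppStage(this, 'ProdFull'));
```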
Is that still a common practice? I figured most companies run builds in their existing pipeline infrastructure, which in my experience is usually GitHub/Travis/GitLab/Azure DevOps/CircleCI/etc., and then push the results up into AWS.
In my experience it is - having an on-premises build server is a lot cheaper than running a very CPU-intensive task in the cloud. It can save thousands of dollars a year.
I tried to like it, but when it'd get stuck for a long time at a step with no good way to get information or control what was going on... I moved on to Terraform and have been relatively happy ever since.
My only nit with their build stages is that I don’t see them mentioning that the pipeline stages they outline are logical. When it comes to how the real pipeline actually runs, you should ignore the stages, specify the minimal dependency graph, and let everything that can run in parallel run. This usually means producing your build artifact first and then running everything else immediately after. If you’re feeling fancy, you can have the steps remember the logical stage they’re in and bail out of later stages on failure.
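(A tiny, tool-agnostic sketch of that idea; the step names and commands are made up. Each step declares only its real dependencies and starts as soon as they finish, rather than waiting for a whole logical stage.)

```ts
// Each step declares only the steps it truly depends on.
type Step = { name: string; deps: string[]; run: () => Promise<void> };

async function runGraph(steps: Step[]): Promise<void> {
  const started = new Map<string, Promise<void>>();
  const start = (step: Step): Promise<void> => {
    if (!started.has(step.name)) {
      // Wait for declared dependencies only, then run; unrelated steps proceed in parallel.
      const deps = step.deps.map((d) => start(steps.find((s) => s.name === d)!));
      started.set(step.name, Promise.all(deps).then(() => step.run()));
    }
    return started.get(step.name)!;
  };
  await Promise.all(steps.map(start));
}

// Example: the build artifact is produced first, then everything that only needs
// the artifact (unit tests, beta deploy) kicks off immediately and in parallel.
runGraph([
  { name: 'build',       deps: [],        run: async () => console.log('build artifact') },
  { name: 'lint',        deps: [],        run: async () => console.log('lint') },
  { name: 'unit-tests',  deps: ['build'], run: async () => console.log('unit tests') },
  { name: 'deploy-beta', deps: ['build'], run: async () => console.log('deploy to beta') },
]).catch(console.error);
```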
Not much. Having a functioning build pipeline takes a bit of work to get set up, but once it's up and running, and projects are built to depend on it, it takes away a fair bit of ops work.
That can help smaller teams become more agile, and being able to walk through a consistent process across multiple products for each step from commit against a dev branch through deployment in production makes it easier to account for team members being away, training new folks, etc. This is especially important for "side projects" or other small services that end up becoming critical infrastructure but are really maintained by one or two team members and only have a couple of internal customers.
Even small companies deploy to staging environments before production.
Your deployment pipeline most likely looks like the following (roughly sketched after the list):
- Build your language artifacts
- Build a container
- Deploy that container somewhere
- Run approvals like integ tests, unit tests, etc., and monitor metrics to decide whether to deploy to your next environment or roll back.
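(A very rough, tool-agnostic sketch of that flow; the deploy/rollback scripts and the metrics check are stand-ins for whatever your platform actually provides.)

```ts
import { execSync } from 'child_process';

const run = (cmd: string) => execSync(cmd, { stdio: 'inherit' }); // throws on non-zero exit

async function checksPass(env: string): Promise<boolean> {
  // Placeholder: run integration tests and poll metrics/alarms for this environment.
  return true;
}

async function release(environments: string[]): Promise<void> {
  run('npm ci && npm test && npm run build');   // build your language artifacts (plus unit tests)
  run('docker build -t myapp:candidate .');     // build a container

  for (const env of environments) {
    run(`./deploy.sh myapp:candidate ${env}`);  // deploy that container somewhere (placeholder script)
    if (!(await checksPass(env))) {
      run(`./rollback.sh ${env}`);              // roll back and stop promoting to later environments
      throw new Error(`release halted in ${env}`);
    }
  }
}

release(['staging', 'production']).catch(() => process.exit(1));
```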
This is just a structured way of building these CD pipelines. You could also use Jenkins, or Skaffold, which Google started building for Google Cloud Deploy: https://skaffold.dev/docs/
As your company gets larger, you'll want to limit the blast radius of new changes so that you can meet your SLAs, and you'll want to use multiple AZs and cloud regions for HA. That's where the idea of waves and cells comes from.
If you deploy to multiple regions or datacenters it doesn't matter how big you are -- you should orchestrate deployments so you don't deploy to all of them at the same time. If something breaks, you'd rather break a subset of regions than all of them.
You should also deploy new code to some kind of non-production environment before going to prod. Even companies with the best possible canarying in prod will still do that.
Some of the technology involved is definitely AWS-scale, but the principles of software rollout are broadly applicable.
The very first line of the article[1] calls it out, "....for enterprise-grade deployment pipelines." So I guess it's relevant for Fortune 500ish companies.
That said, a base version of this pipeline is very useful even for a small startup. I concede that it takes a bit of time (it took me about a week) to set up end to end. By end to end I mean beginning with an empty AWS account and setting up VPC, ECS, CloudFront, Fargate, CodePipeline, and so on. But once set up, you don't have to touch it for months on end. Scaling up/down, if needed, is as simple as a minor config change.
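(For a sense of scale, the core of that infrastructure can be sketched with a few CDK constructs; the sizing, sample image, and CloudFront settings below are placeholders, and the CodePipeline wiring is omitted.)

```ts
import { Stack, StackProps } from 'aws-cdk-lib';
import { Construct } from 'constructs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import * as cloudfront from 'aws-cdk-lib/aws-cloudfront';
import * as origins from 'aws-cdk-lib/aws-cloudfront-origins';

class WebServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    // VPC + ECS cluster.
    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

    // Fargate service behind an ALB; the pipeline would swap in the real image.
    const service = new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
      cluster,
      cpu: 256,
      memoryLimitMiB: 512,
      taskImageOptions: { image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample') },
    });

    // CloudFront in front of the load balancer (the ALB here serves plain HTTP).
    new cloudfront.Distribution(this, 'Cdn', {
      defaultBehavior: {
        origin: new origins.LoadBalancerV2Origin(service.loadBalancer, {
          protocolPolicy: cloudfront.OriginProtocolPolicy.HTTP_ONLY,
        }),
      },
    });
  }
}
```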
That's true, even if you're not at Amazon scale. Say you're building a multi-tenant SaaS product: cell architecture provides strong tenant isolation and avoids the noisy-neighbor problem. There are other benefits, like deploying changes to one cell at a time, so a tenant can request delaying an upgrade if they need to do prerequisite work beforehand.
For an ordinary service or web site, you most probably do not need this. It's over-engineering infrastructure when you should be focusing on customers and the problem domain - the things that increase the top or bottom line.
Source: principal SDE who launched several services at Amazon using cell architecture.
I've always seen AWS as a rather expensive answer to the build problem. Any other VPS provider or even a dedicated colo is likely much cheaper and less locked in than the Amazon ecosystem.
That having been said, the "reference architecture" being offered just feels like a slow day in the AWS marketing department during a recession. Nothing really special about it unless I've completely missed something?
At $dayjob, AWS consultants regularly "assist" with architectures where the final solution is invariably draped across every one of the AWS proprietary offerings. Whether on purpose or not, the result is that customers are locked in and the budgets are often a "surprise!".
In my experience when negotiating any discounts the Amazon sales guys are very aware exactly how locked in the customer is. If they use a ton of proprietary AWS features Amazon may offer 5% discount, on the other hand if the customer uses Kubernetes and other stuff that is easy to move to another cloud they may offer 50% off.
I'm just curious, have you actually built and operated a feature-equivalent build/deploy system on VPSes? If so, color me impressed, I'd love to see some source code or other artifacts if you're willing to share.
Very interestingly, Vercel [1] (which uses AWS for its deployment pipeline) does not use CodePipeline, CodeBuild, etc. It uses Fargate instead, for build performance reasons [2].
And a talk on what a cell based architecture is here: https://youtube.com/watch?v=HUwz8uko7HY
[0]: https://aws.amazon.com/blogs/aws/new_deployment_pipelines_re...