You can install Windows (proven with XP) in five minutes or less.
How many build pipelines have you worked with that take longer than installing Windows?
Granted, they're different things - one copies a giant set of files and creates a small DB, and the other runs tests and packages software.
Here's the thing though. Most of the time, these builds are slow due to (1) a lack of parallelism and (2) blindly following "best practices" instead of using our $170k+ Silicon Valley brains.
These are problems you can solve. You can get your build to be faster than installing Windows. Will most companies let their developers do things that may not be "best practices" to do that? Not today.
Not spending 9 months on-boarding might be worth extra minutes on deployment, IDK...
Even more importantly, the engineer is likely to stick with a truly high-tech workplace with much less "churn" than yet another dime-a-dozen startup with the same fashionable tech stack as the slightly better competition.
You can use Fargate for example, which should give you easy to manage and fast deployment times - you just have to figure out the fast build time.
For FastComments.com I can do a build in under two minutes. The second you merge to master:
1. E2E testing runs, which is massively parallelized. Each test suite's data gets namespaced so they don't step on each other.
2. The build runs, takes 30 seconds, because it's not a compiled language. Although if it was in Java I don't think it would take very long.
3. Deployment happens: Orcha sends the bundle to all the servers, stops the daemon, extracts the bundle, and restarts the systemd daemon.
4. E2E testing runs again against production as a smoke test.
This all happens in under a couple minutes. Not a very small app - package.json has 50 dependencies and the E2E test suite has ~40 tests. These tests are complicated - some will open two browsers for example to test the live commenting features...
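The namespacing trick in step 1 is what makes the massive parallelism safe. A minimal sketch of the idea, in Python - the suite names, tenant prefix, and harness here are hypothetical, not FastComments' actual test runner:

```python
import concurrent.futures
import uuid

def run_suite(suite_name):
    # Each suite gets a unique namespace so parallel runs can't step
    # on each other's data. All fixtures this suite creates are keyed
    # under its own tenant id (hypothetical naming scheme).
    namespace = f"{suite_name}-{uuid.uuid4().hex[:8]}"
    tenant_id = f"e2e-{namespace}"
    # ... create test data under tenant_id, drive browsers, assert ...
    return suite_name, tenant_id

suites = ["comments", "live-chat", "moderation", "sso"]
with concurrent.futures.ThreadPoolExecutor(max_workers=len(suites)) as pool:
    results = dict(pool.map(run_suite, suites))

# Every suite ran, and no two suites shared a namespace.
assert len(set(results.values())) == len(suites)
```

Because the data never collides, the suites can all run at wall-clock time of the slowest one, instead of the sum of all of them.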
Deploying faster means you can onboard faster, because you get quicker feedback on your changes, getting insight into code and systems faster.
Deploying faster means you can onboard faster, because you have fewer (ideally just one) pending changes to deploy to keep in your mind while working, increasing the amount of changes you can make in a work week.
Deploying faster means you can onboard faster, because you spend less time deploying and more time studying the system.
distcc/ccache also works. Years ago we had a rack full of machines devoted to this. Requires a little tuning but lets you do -j 50. Of course, that was in the days of dual-core machines and these days you can buy a lot of cores. Bottleneck is probably writing intermediate files to disk and flushing them.
Untangling the inter-dependencies that cause all the rebuilding is very hard and often forces design changes you don't want, such as use of the "pimpl" idiom.
and linting, and code quality checks, and compiling, and optimizing, and pushing the package up to package & image repositories, and deploying it to sandboxes
I agree there is room for improvement. Parallelism in particular is a step that could shave up to 30% off some of the build pipelines I've worked with. I disagree, however, that your example should "make your blood boil". I feel like "Blindly following 'best practices' instead of using our $170k+ Silicon Valley brains" is incendiary language that ignores the reality of the situation. We are all working with the tools we've got, and the majority of those tools don't have good models for parallelism. CircleCI, for example, supports "test splitting" parallelism, where tests can be run in parallel in different containers, but there's no workflow parallelism that I know of, so that style checking, deployment to an ephemeral environment, publishing, &c. can proceed once the build artifact is generated.
Yes we should try to adapt & grow. The very name, build pipelines, bespeaks a linear process. The ability to change is definitely at hand, & there are legions of us who would pick up the torch on this one & move forward.
But I disagree that our failure to do so is "blindly following 'best practices'". Yes, this is one of the "problems you can solve", but it's a bit of a big deal. Configuring build pipelines is a fairly advanced task, & to advance to parallelism & our maybe-30% gains we need not only the machine work to support parallelism, but also the language work to redefine more flexibly how work ought to flow through a system - to build new, graph-based models of computation that users can author, that let work spin up when its dependencies are defined & satisfied.
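That graph-based model can be sketched in a few lines: each step declares its dependencies and starts as soon as they are satisfied, instead of waiting for an entire linear "stage" to drain. The step names below are made up for illustration:

```python
import concurrent.futures

# Hypothetical pipeline: each step lists the steps it depends on.
STEPS = {
    "build":            [],
    "lint":             [],
    "style":            [],
    "unit-tests":       ["build"],
    "publish":          ["build"],
    "deploy-ephemeral": ["publish"],
}

def run_dag(steps, run_step):
    """Run every step whose dependencies are done, in parallel."""
    done, order = set(), []
    pending = dict(steps)
    with concurrent.futures.ThreadPoolExecutor() as pool:
        while pending:
            ready = [s for s, deps in pending.items() if set(deps) <= done]
            # Dispatch every ready step concurrently, then mark it done.
            for s, _ in zip(ready, pool.map(run_step, ready)):
                done.add(s)
                order.append(s)
                del pending[s]
    return order

order = run_dag(STEPS, lambda s: s)  # run_step would invoke the real tool
assert order.index("build") < order.index("unit-tests")
```

Here "lint" and "style" never wait on "build" at all, which is exactly the optimization a stage-based pipeline leaves behind.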
"Will most companies let their developers do things that may not be 'best practices' to do that?" Let's start with smaller steps, eh? Should companies let their employees venture forth & tackle new non-linear practices for software assembly? Yes. How many? Every company? No probably not. What responsibilities do you want to put on the team beyond delivery of something that works for them? What documentation do you need? Who reviews the proposed architecture to approve this new best practice you are trying to create? Is this an internal system, something you are going to make a core competency and sell, or are you going to try to really make it a best practice & open source it and try to grow it openly? What best practice tools will you use to develop this async build system, or should developers eskew convention & use their "170k+ Silicon Valley brains" on this (& other) layers too?
There are of course some build/integration/deployment technologies out there that do support wider realms of parallelism. Jenkins, with its Groovy syntax, at least supports creating stages that can be run in parallel. This still leaves possible optimizations behind, as all items in a stage have to finish before the next stage can start; it's not fully async. Compromises were made, but it works pretty well - I've seen 25% cycle-time reductions from parallelism here.
Overall, to me, this issue feels not so pressing. Engineers scrape together an enormous amount of their own time & effort to do what they can with what is available. They are extremely resourceful. I do indeed want companies that better let their employees pursue process & systems improvements, that spend more time stepping outside of product development & more time improving the house they live in together. Most of all, I hope that we are learning & sharing, openly, publicly, #LearningInPublic, so that the state of the art can improve, so that there is more & better material out there to decide what a best practice might be & how to head there. And how to have it make sense 2 years down the road when all the original engineers have found other opportunities & the new, less seasoned crew has to suss out how to proceed.
I also want to touch briefly on the original article, although I'm not sure how, because it has such a neat exploration of so many interesting, rich capabilities that let software manage software: software-defined software, software-defined systems. And in ways that are outside our newer conventions, where we manage polyglot systems via containerized processes (versus using powerful multi-system VMs to manage). The author's love of empowering the operator with online tools, letting them situate & operate from inside the running machine systems - this is such a glorious & important capability, about how we commune with machines, & the author is so right to call attention to what we've lost, to how much less operators are online with their systems & how much more distance there is. Being able to commune with the machines is important.
And yet, here too I think the situation has nuance & reason. It's not only a matter of best practices socially gumming us up & social groupthink making us dumb. Or, if they are gumming us up, it is in response to another form of getting gummed up. Systems used to be dealt with as pets, as carefully groomed & growing buckets of bits that had acquired various ad-hoc operations & scripts & subsystems over time, as operators went about working on the system & working on systems to help them better work on the system - the author leads with going in depth on the ad-hoc deployment systems that had grown up over the years internally on their system, in just such a manner. When anything is possible & at hand, many things happen. Over time, it's hard for newcomers (& seasoned veterans) to understand what has been done, why things were done, hard to even see what pieces are there. A system that is never killed, that's always carried forward, mutated, ongoingly & organically becomes an unintelligible bucket of bits, an opaque, intertwined, illegible historical artifact. In contrast, immutable containers don't attempt to strike much of a balance here, don't leave a lot of affordances for many of the wonderful Erlang capabilities, but they ensure that there is a well-known & iteratively improvable way to get from code to deployed code that everyone can observe & understand, that brings a predictability & understandability that often, in the past, had been sorely absent.
Please have sympathy. Things are complex. Figuring out not only how to move forwards but move forwards in a robust enduring well-thought-out & mindful manner is a big challenge, & retaining that hunger & appetite for moving forwards, for doing good things is equally important, & navigating ourselves while under this tension is something we have to learn to live with, hopefully peacefully.
Start from nothing. No shared state with previous requests. Plough right through as many errors as you can. Log them, but don’t stop. Keep executing code until you run out of things to do.
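A minimal sketch of that "log the errors and keep going" request model, in Python - the step names and handler shape are invented for illustration:

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("handler")

def handle_request(tasks):
    # Fresh state per request: nothing is shared with previous requests.
    results, errors = {}, []
    for name, fn in tasks:
        try:
            results[name] = fn()
        except Exception as exc:
            # Log and keep executing; one failed step shouldn't abort
            # everything else the request could still do.
            log.error("step %s failed: %s", name, exc)
            errors.append(name)
    return results, errors

# Hypothetical request where one step blows up mid-way:
tasks = [
    ("fetch",  lambda: "data"),
    ("enrich", lambda: 1 / 0),   # fails
    ("render", lambda: "page"),
]
results, errors = handle_request(tasks)
assert errors == ["enrich"]
assert set(results) == {"fetch", "render"}
```

The request still produces whatever it could, the error is on record, and the next request starts from a clean slate.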
Fail-fast leads to more stable systems.
WhatsApp famously runs on Erlang, as do large chunks of our telecoms infrastructure. It has an enormous number of users. What it doesn't have is an enormous number of developers. One of the good things about the Erlang (and Elixir) ecosystem is that there are a relatively small number of people implementing the bits you care about, and they all seem to be extremely competent engineers. The pieces are designed well, they are reliable and they fit together. They also have good abstractions, so if you use good pieces then you get good results (e.g. with Phoenix you'd have a hard time creating a XSRF or SQL injection problem).
PHP has a bit of a perfect storm of problems - it is so accessible that anyone can and did use it. This meant a lot of people implemented things using it, with varying levels of competence. This means a lot of problematic implementations of a lot of pieces, built on a language and framework that didn't really try to prevent security problems and so didn't really have the tools to do so. In contrast to Erlang/Elixir above, you can take a set of tools in PHP and easily implement a very insecure website.
I guess where they're similar is that an error is generally going to only affect that one request, the next request will have a fresh handler.
(Of course, these are generalizations, you can always do things differently in either language).
It throws out the context baby with the state bathwater.
One way to describe it, and I forgot where I read it, is that one advantage of interpreted languages used to be that you no longer had to grab a cup of coffee every hour while your work compiled. With containerization, things are even now: everyone gets to wait while the images get built.
Of course, I've also worked at places where the environments are set correctly and take advantage of the tech the team is using, instead of having to rebuild all the time. Which is why I don't hate containerization on its own. Everything can be badly applied or used.
It's funny to me that erlang (and Elixir) are so dogmatically immutable / no shared state "in the small" at the module and function level, but apparently not "in the large", at the system level. Whereas you tend to see the reverse with deploying most other languages.
At the systems level, immutability makes operating the system much harder. Especially when you consider Ericsson's basic design requirement of a telephone switch with a redundant pair of control computers; it's safer to deploy changes by making a temporary change to the live system than a permanent change to the backup system and switching over. In case of a bad deploy, you can live reload the old version on the live system, or if that's not possible, promote the backup system and restart the primary back to the known good state. If all goes well though, there's no period of reduced redundancy. In a restart-to-deploy situation, every deploy has a period with loss of redundancy, as well as a primary/secondary switch, and you can't deploy to degraded systems where one of the redundant pair is not working pending repair (these switches would most likely be deployed to towards the end of the cycle to reduce risk).
I get the sense something is missing, and I get the sense the missing thing was that the replacements were actually better for the business. A couple of highly-talented people on a small team where the original developers owned production operations could run it as designed, and it was "really neat," but the business was limited in how they could use the system. Perhaps you couldn't easily add developers and make them work on new features. Perhaps you couldn't integrate OSS components that worked fine, and you were left behind your competitors. Perhaps the "one or two developers" who operated the system got married and started a family and weren't able to be on-call 24/7/365 any longer. Perhaps this system was one of dozens run by the business, and all the other ones were using crash-only containers, and this system continued to be the odd one out.
I'd like to hear from the other side of this - I'd like to hear the stories of the people who pushed the redesign away from the systems. Until then, this seems entirely too much like "We built this awesome prototype with a bunch of neat properties in theory." My suspicion is that the other side of these migrations were also driven by highly-talented technical people who have technical stories to tell, but their story is that they cleared out a bunch of technical debt and replaced it with a higher-reliability system.
I'm certainly not one of those people, but they'll tell you things like how they standardized the new acquisition onto "best practices" and eliminated the "cowboy deployment" culture that was previously running rampant. I was part of an acquired Erlang team that went through this, although I mostly didn't deploy into the acquirer's systems, I managed the legacy systems, until they were mostly retired, then I retired.
They won't tell you that deploy times grew massively, requiring a larger team to manage (and that increasing the team size requires increasing the team size, because of increased time spent in coordination).
They won't tell you that because they've built their system to handle unreliable environments, they've neglected to make their environments reliable, and the system redundancy factor had to be increased to 3x instead of 2x because the churn rate on systems is too high to expect at least one out of two systems in a pair to be running. Of course, increasing the number of systems increases deployment complexity, too.
For the first system, it was deprecated without replacement, and just left running, maintained by managers and people who had moved on to other teams (but used to work on it) and did the minimal maintenance required; former employees were given emergency contracts in weird circumstances to deal with things. Roughly 3-4 years later, they finally replaced it, after 2-3 earlier rewrite attempts had failed. The old design, on minimal maintenance for years, only then approached limits to how far it could scale without bigger redesigns; I consider this to be extremely successful.
It wasn't exactly bulldozed so much as declared "done" and abandoned without adequate replacement, while major parts of the business were being rewritten to use Go and a more "standard" stack. Obviously these migrations always start with something easier, by replacing components that are huge pain points, and you're left at the end with the legacy stuff that is much harder to replace in the final pushes. I felt that the blog post would have veered off point if the whole thing became about that, though.
The people on these teams left in part because the hiring budget was redirected towards hiring on the new projects. The idea was that everything could be done in Go for these stacks (by normalizing on tools and libraries developed in another project and wanting to have one implementation for both the private and shared platforms), and the rewrites were to start with Ruby components.
You knew working on the Erlang side of things that no feature development would ever take place again, that no new hands would be hired to help, and that you would be stuck on call 24/7 with no relief for years. All efforts were redirected to Go and getting rid of Ruby, and your stuff fell between the cracks. I was one of the people who left on the long tail there. After my departure, I was brought back on a lucrative part-time contract as a sort of retainer for years to help them in case of major outages (got 1 or 2 in 4 years), since that was the only way they could get expert knowledge once they drove us all away.
I'm still on good terms with the people there, it's just that "maintaining a self-declared unmaintained legacy stack without budget or headcount until we get to rewrite it in many years" is not where any of us wanted to drive our careers.
Interestingly, we tried very hard to add new developers. We wrote manuals, tooling, a book on operating these systems (see https://www.erlang-in-anger.com), wanted to set up internal internships so developers from other teams could come and work with us for a while, etc. While our team was very willing, internal politics (which I can't easily get into in a public forum) made it unworkable and most attempts were turned down. These things were not always purely business decisions, and organizational dynamics can be very funny things.
On the other hand, it seems like a fairly common problem that using neat but niche technology makes it hard to hire for it and get continued development. My own employer has a pretty nice system in Clojure that is well-maintained and has people working on it but nonetheless is always on the wrong side of things because it's using a language (and a development and release workflow, in turn) that nobody else is using.
Is the problem that we should build systems that have all the advantages of the neat tech we like but still look like the more boring and less complicated things, operationally? Was it that you had the backstop in your Erlang system but it wasn't enough to convince the business that it could be treated like a normal system? Or was it separate (and could it have been shaped to look like the more boring thing)?
Alternatively, if we take the goal of reliability engineering broadly as building systems that work despite both dysfunctional computer systems and dysfunctional human systems, it seems like that argues in favor of picking the approach that doesn't line you up to be on the wrong side of organizational politics. I don't like this conclusion - I'm a huge fan of solving problems with small but well-applied amounts of technical expertise - but I don't really have a good sense of how to avoid it. (Maybe the answer is more blog posts like yours influencing expectations.)
But you'll also get issues with staffing and getting people interested in working in mainstream languages that are less cool than they used to be, frameworks that are on older versions and hard to migrate, deployed through systems that aren't as nice on your resume as newer ones, or on platforms that aren't seen as positively.
I don't have a very clear answer to give about why Erlang specifically wasn't seen positively. The VP of Eng at the time (now the github CTO) saw Erlang very positively (https://twitter.com/jasoncwarner/status/1287383578435780608) but I know that some specific product people didn't like it, and so on. To some extent a lot of the work pushing us aside was just done by very eager Go developers who just started doing work on replacing our components with new ones on the other side of the org, and then propagating that elsewhere.
Whether the roadmap or other policies ended up kneecapping our team on purpose or accidentally is not something I can actually know. I kept pushing for years to improve things for our team, but at some point I got tired and left for a different sort of role.
Of course a build and deploy system specifically designed for your product/environment is going to be more feature rich and performant, but once the initial developers leave or move on, all is lost. I have seen it happen multiple times.
Sticking with slower clunkier industry practices means you can hire someone off the street and they’ll have a much better chance of being able to understand the system.
If you want longevity, you use the standards. All engineering (mechanical, electrical, etc.) agrees on that.
I've built that once and it was hell to get right. But probably 90+% of programs I've ever worked on do not achieve any business advantage from graceful restarts (as opposed to drain and restart i.e. the stateless model).
Kubernetes is great for easily sharing and executing recipes, but it can make it tough for people that want to become chefs that are able to use reason and intuition to build and understand complex systems.
Acknowledging that I'm changing the goalposts a bit, I would consider a lack of understanding about the lifecycle of an application that causes development delays to be an abstract form of "noticing when it fails".
Kubernetes does a really good job at hiding this.
When it hides this, it also hides a couple other things with software — like failing containers, or apps that require churn or fail. I mean this is literally the reason why things like monit were built in the older days — things broke if you didn’t restart them when they broke.
In Kubernetes, it's also easy to accidentally change the system and make that behavior change (at a global level), resulting in stuff like your app not being restarted as often, which results in elevated latency and subsequent failures.
In a way, you're better off with a 99% reliable dependency than a 99.9999% one. If you expect it to be down some of the time, you'll plan for it in your architecture and (in theory) gracefully handle the failure, whether it's due to an insufficiently warmed container, a network issue, a bad deploy, or whatever else comes along.
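The "plan for it in your architecture" part can be as simple as never calling the dependency bare: a couple of retries, then a degraded-but-working fallback. A small sketch (the function names and retry counts are illustrative, not from any particular library):

```python
def call_with_fallback(dep, fallback, retries=2):
    # Assume the dependency can fail: retry a couple of times,
    # then fall back to something degraded but working (e.g. a cache).
    for _ in range(retries + 1):
        try:
            return dep()
        except ConnectionError:
            continue
    return fallback()

def always_down():
    raise ConnectionError("dependency down")

# Healthy dependency: we get the fresh answer.
assert call_with_fallback(lambda: "fresh", fallback=lambda: "cached") == "fresh"
# Dead dependency: the caller still gets a usable answer.
assert call_with_fallback(always_down, fallback=lambda: "cached") == "cached"
```

Teams that assume six nines of reliability tend not to write this wrapper at all, which is exactly the point being made above.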
It's not great for protocols with significant handshaking and potential for long connections. Things like chat services, shoutcast streaming, calling, IMAP - really anything realtime or notification driven would do better without forcing clients to reconnect. Especially since many will reconnect multiple times, as they may reconnect to servers which are running old code while the drain and restarts happen. If you need near-instant cutover (which is probably rare), in a drain-and-restart scenario you basically need double the capacity, where hotload doesn't need anything extra.
Changing the instance means having to re-transfer all of the data and re-establish all of that state, on all of the nodes. You could easily see draining of connections take 15-20 minutes, and booting back and scaling up to be taking 15-20 minutes as well, if you can do it for _all_ the instances at once (which may not be a guarantee, and you could need to stagger things to be more cost-effective).
You start with each deploy easily taking over an hour. If you deploy 2-3 times a day and your peak times line up with these, you can more than double your operating cost just to deploy, and that can run into four figures.
Some of the systems we maintained (not those we necessarily live deployed to, but still required rolling restarts) required over 5,000 instances and could not just be doubled in size without running into limits in specific regions for given instance types.
If a blue/green deploy takes a couple minutes, you're probably not having a workload where this is worth thinking about that much.
Did you look at any of the node-based options like checkpointing the state to a file on the node, and loading that into your newly started pods? Or using read-many persistent volumes? (Not sure if you needed to write to the state file from every process too?)
(This doesn’t help with connections of course, that’s a bit more thorny.)
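For the simpler, non-connection part of that suggestion, a checkpoint-to-file sketch looks something like this - the file path and state shape are hypothetical, and a real system would checkpoint periodically rather than once:

```python
import json
import os
import tempfile

# Hypothetical location on a node-local volume shared with the new pod.
STATE_FILE = os.path.join(tempfile.gettempdir(), "app-state.json")

def checkpoint(state):
    # Write to a temp file and rename, so a crash mid-write can't
    # leave a torn checkpoint behind (os.replace is atomic).
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, STATE_FILE)

def restore():
    # A freshly started pod loads the last checkpoint, if any.
    if not os.path.exists(STATE_FILE):
        return {}
    with open(STATE_FILE) as f:
        return json.load(f)

checkpoint({"sessions": 42, "cursor": "abc"})
assert restore() == {"sessions": 42, "cursor": "abc"}
```

As the parent notes, this recovers data but not live TCP connections, which is where the hard part lives.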
For some types of my nodes, the majority of the state was TCP connections and associated processes. I don't think there's a generally available system that is capable of transferring TCP connection state (although I'd love to build one! if you've got a need, funding, and a flexible timetable), which would be a prerequisite to moving the process state. All of those connections need to be ended, and clients reconnect to another server (where they might need to do it again, if they don't get lucky and get a new server to begin with).
The other nodes with more traditional state had up to half a terabyte of state in memory, and potentially more on disk, and a good deal of writes. That's seven minutes to transfer state on 10G ethernet, assuming you can use the whole bandwidth and sending and receiving that data is faster than the network.
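The arithmetic behind that "seven minutes" checks out; as a quick back-of-the-envelope calculation:

```python
# Back-of-the-envelope check of the transfer-time figure above.
state_bytes = 0.5e12        # half a terabyte of in-memory state
link_bits_per_s = 10e9      # 10G ethernet, assuming full utilization
seconds = state_bytes * 8 / link_bits_per_s   # bytes -> bits, then / rate
minutes = seconds / 60
assert 6 < minutes < 8      # about 6.7 minutes, i.e. roughly "seven"
```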
Although, in my experience, we didn't tend to explicitly replicate disk based storage for new nodes, all of our disk based data was transient, so replacing nodes meant writing to new nodes, read from new and old, and retiring the old nodes when their data had all been fetched and deleted, or the data retention cap was missed.
I/O volume meant networked filesystems would be a big stretch. You could probably do something with dual-ported SAS drives, and redundant pairs of machines on the same rack, but then that pair will both go down when that rack has an unforeseen problem - plus good luck getting dual-ported SAS drives hooked up properly when you're in someone else's bare-metal managed hosting.
(Yeah OK, maybe we had big performance requirements, but hotloading works just as well for stuff that fits on a single redundant pair, or even a single server in a pinch)
The best policy is for a deployment to lower the health check on the service, wait a few minutes per box, then bounce the process.
This is assuming the service is 100% stateless.
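The drain-then-bounce policy above is easy to express as a loop over the fleet. A sketch, with the health-check and restart hooks stubbed out (box names, timings, and hook signatures are made up):

```python
import time

def drain_and_bounce(boxes, drain_seconds, fail_health_check, bounce,
                     sleep=time.sleep):
    # One box at a time: mark it unhealthy so the load balancer stops
    # routing to it, wait for in-flight requests to finish, then
    # restart the process. Only safe if the service is stateless.
    for box in boxes:
        fail_health_check(box)
        sleep(drain_seconds)
        bounce(box)

events = []
drain_and_bounce(
    ["box1", "box2"],
    drain_seconds=0,  # a few minutes in production; 0 for this sketch
    fail_health_check=lambda b: events.append(("drain", b)),
    bounce=lambda b: events.append(("bounce", b)),
)
assert events == [("drain", "box1"), ("bounce", "box1"),
                  ("drain", "box2"), ("bounce", "box2")]
```

Every box is fully drained before it is bounced, and only one box is out of rotation at a time, which is what keeps the rollout invisible to clients.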
Complexity costs money.
I'd advance the theory that picking an off-the-shelf popular solution is going to be beneficial in that case because you externalize the costs of maintaining your expertise and knowledge to the rest of the ecosystem or industry, to other companies, and just never develop that fiber within your organization. It is, however, worthwhile to develop it for more things than just your tech stack. Everything having to do with on-boarding and dealing with legacy code is improved, along with broader dynamics if you try to be a learning organization that develops expertise in its people.
2. You always design your system in accordance with its deploy system, whether you realize it or not. If you ship binary artifacts and signed packages to customers who install them on their own devices, you will have a different development practice than if you do CI/CD with a single pipeline that always goes forwards. This will also be somewhat different if you work with open-source components that require paying attention to version schemes rather than just pushing a hash in a container image. If you use feature flags to help merging but also to control deployment, adopt A/B testing, and all these practices, you're intimately adapting your development approaches to the deployment mechanisms available to you. It's not an extra cost; it's a cost you already pay today.