Hacker News
A Pipeline Made of Airbags (ferd.ca)
125 points by bshanks 71 days ago | 57 comments

Here's a neat comparison that might make your blood boil.

You can install Windows (proven with XP) in five minutes or less.

How many build pipelines have you worked with that take longer than installing Windows?

Granted, they're different things - one is copying a giant set of files and creating a small DB and one is running tests and packaging software.

Here's the thing, though. Most of the time, these builds are slow due to (1) a lack of parallelism and (2) blindly following "best practices" instead of using our $170k+ Silicon Valley brains.

These are problems you can solve. You can get your build to be faster than installing Windows. Will most companies let their developers do things that may not be "best practices" to do that? Not today.

The problem we solve by religiously sticking to "best practice" is the extreme engineering churn in "Silicon Valley" companies.

Not spending 9 months onboarding might be worth extra minutes on deployment, IDK...

On the other hand, after 9 months of onboarding an engineer is likely to operate much more productively than the "best practice" would allow.

Even more importantly, the engineer is likely to stick with a truly high tech workplace with much less "churn" than in yet another dime a dozen startup with the same fashionable tech stack as the slightly better competition.

Or dig deeper and deeper into a rabbit hole, and come out almost unemployable 7 years later while the entire industry has moved on.

You don't have to have crazy onboarding. In fact, with systems like Kubernetes it can take me months to finally grasp the whole system... especially if you have your own Kubernetes Operator and custom YAML configuration that maps to Kube's... which I have seen.

You can use Fargate, for example, which should give you easy management and fast deployment times - you just have to figure out the fast build time.

For FastComments.com I can do a build in under two minutes. The second you merge to master:

1. E2E testing runs, which is massively parallelized. Each test suite's data gets namespaced so they don't step on each other.

2. The build runs and takes 30 seconds, because it's not a compiled language. Although even if it were in Java, I don't think it would take very long.

3. Deployment happens: Orcha sends the bundle to all the servers, stops the daemon, extracts the bundle, and restarts the systemd daemon.

4. E2E testing runs again against production as a smoke test.

This all happens in under a couple minutes. Not a very small app - package.json has 50 dependencies and the E2E test suite has ~40 tests. These tests are complicated - some will open two browsers for example to test the live commenting features...
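The namespacing trick in step 1 above is the part that makes the massive parallelism safe. A minimal sketch of the idea (the names and structure here are my own assumptions, not FastComments' actual code): each suite prefixes all of its test data with a unique key, so suites running concurrently can't step on each other.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def make_namespace(suite_name):
    """Give each suite its own data namespace so parallel runs don't collide."""
    return f"{suite_name}-{uuid.uuid4().hex[:8]}"

def run_suite(suite_name, store):
    ns = make_namespace(suite_name)
    # All test data is written under the suite's namespace key.
    store[f"{ns}:comment"] = f"test comment from {suite_name}"
    # Assertions only look inside this suite's namespace, so other suites
    # running at the same time can't interfere with the result.
    assert store[f"{ns}:comment"].endswith(suite_name)
    return ns

store = {}  # stands in for a shared test database
with ThreadPoolExecutor(max_workers=8) as pool:
    namespaces = list(pool.map(lambda s: run_suite(s, store),
                               [f"suite{i}" for i in range(8)]))

print(len(set(namespaces)))  # every suite got a distinct namespace
```

The same pattern works against a real database by prefixing table rows, queues, or tenant IDs with the namespace instead of dict keys.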

Deploying faster means you can onboard faster, because part of onboarding is deploying.

Deploying faster means you can onboard faster, because you get quicker feedback on your changes, getting insight into code and systems faster.

Deploying faster means you can onboard faster, because you have fewer (ideally just one) pending changes to deploy to keep in your mind while working, increasing the amount of changes you can make in a work week.

Deploying faster means you can onboard faster, because you spend less time deploying and more time studying the system.

I work with a project that takes an hour to build because it’s thousands of C++ files. What would you suggest I do to improve my build times?

Surprisingly, "concatenative builds" - where you chuck all the source files into a single file and compile that - tend to be a lot faster.

distcc/ccache also works. Years ago we had a rack full of machines devoted to this. Requires a little tuning but lets you do -j 50. Of course, that was in the days of dual-core machines and these days you can buy a lot of cores. Bottleneck is probably writing intermediate files to disk and flushing them.

Untangling the inter-dependencies that cause all the rebuilding is very hard and often forces design changes you don't want, such as use of the "pimpl" idiom.

Even without having programmed in C/C++ for years, it doesn't sound very surprising that "concatenative builds" are faster - I assume it has something to do with the preprocessor having dramatically less work to do the fewer translation units there are?

Our Qt project had about 400 compilation units (header+source, compiled to a separate object file). Many of them were only 100-200 lines of code. A full build used to take 40 minutes. Each compilation unit pulls in about 40k lines of code from Qt through headers, so the compiler has to process at least 400*40k = 16M lines of code. A concatenative build only processes the app code (about 50k lines) plus about 60k lines of cumulative framework code. This gives about a 100x speedup. Indeed, our concatenative build took less than 1 minute.
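The mechanics are simple enough to sketch. A "concatenative" (unity) build just generates one translation unit that #includes every source file, so the shared framework headers get preprocessed once instead of once per file. A hypothetical generator, not any project's real tooling:

```python
from pathlib import Path

def write_unity_file(src_dir, out_path):
    """Generate a single translation unit that includes every .cpp file.

    Heavy shared headers (e.g. the ~40k lines of Qt pulled in by each unit)
    now get preprocessed once instead of once per source file.
    """
    sources = sorted(Path(src_dir).glob("**/*.cpp"))
    lines = [f'#include "{p.as_posix()}"' for p in sources]
    Path(out_path).write_text("\n".join(lines) + "\n")
    return len(sources)

# Tiny stand-in for the 400 compilation units described above.
src = Path("app_src")
src.mkdir(exist_ok=True)
for i in range(3):
    (src / f"unit{i}.cpp").write_text("// ...\n")

count = write_unity_file(src, "unity.cpp")
print(count)  # number of units folded into one compile
```

The generated unity.cpp is then the only file handed to the compiler. The trade-off, as noted below, is that static names can now collide across files, and a change to any one file recompiles the whole unit.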

This is with unified builds :(

Do you need to recompile all multi-thousand files every time you make a change? Modularising the build would be my first step.

A good build system and caching. By "a good build system" I mean a hermetic one like Bazel that can be confident of knowing the inputs to each build step so that they can be run or not as appropriate. Sometimes you'll touch a core header file and a lot of things will rebuild, but it should still be a lot better.

A buddy slashed his CI build times by using a RAM disk. But yeah, usually faster to avoid compiling things.

We have a similar situation and use ccache, and distcc is set up on each dev's workstation. It also helps to use forward declarations instead of including header files whenever possible.

> and one is running tests and packaging software.

and linting, and code quality checks, and compiling, and optimizing, and pushing the package up to package & image repositories, and deploying it to sandboxes

I agree there is room for improvement. Parallelism in particular is a step that could shave up to 30% off some of the build pipelines I've worked with. I disagree, however, that your example should "make your blood boil". I feel like "blindly following 'best practices' instead of using our $170k+ Silicon Valley brains" is incendiary language that ignores the reality of the situation. We are all working with the tools we've got. The majority of those tools don't have good models for parallelism. CircleCI, for example, supports "test splitting" parallelism, where tests can be run in parallel in different containers, but there's no workflow parallelism that I know of, so that style checking, deployment to an ephemeral environment, publishing, &c could proceed once the build artifact is generated.

Yes we should try to adapt & grow. The very name, build pipelines, bespeaks a linear process. The ability to change is definitely at hand, & there are legions of us who would pick up the torch on this one & move forward.

But I disagree that our failure to do so is "blindly following 'best practices'". Yes, this is one of the "problems you can solve", but it's a bit of a big deal. Configuring build pipelines is a fairly advanced task, & to advance to parallelism & our maybe 30% gains we need not only the machine work to support parallelism, but also the language work to redefine more flexibly how work ought to flow through a system, to build new models for computation that users can author that are graph based, that let work spin up when its dependencies are defined & satisfied.

"Will most companies let their developers do things that may not be 'best practices' to do that?" Let's start with smaller steps, eh? Should companies let their employees venture forth & tackle new non-linear practices for software assembly? Yes. How many? Every company? No, probably not. What responsibilities do you want to put on the team beyond delivery of something that works for them? What documentation do you need? Who reviews the proposed architecture to approve this new best practice you are trying to create? Is this an internal system, something you are going to make a core competency and sell, or are you going to try to really make it a best practice & open source it and try to grow it openly? What best practice tools will you use to develop this async build system, or should developers eschew convention & use their "170k+ Silicon Valley brains" on this (& other) layers too?

There are of course some build/integration/deployment technologies out there that do support wider realms of parallelism. Jenkins, with its Groovy syntax, at least supports creating stages that can be run in parallel. This still leaves possible optimizations behind, as all items in a stage have to finish before the next stage can start; it's not fully async. Compromises were made, but it works pretty well; I've seen 25% cycle-time reductions from parallelism here.

Overall, to me, this issue feels not so pressing. Engineers scrape together an enormous amount of their own time & effort to do what they can with what is available. They are extremely resourceful. I do indeed want companies that better let their employees pursue process & systems improvements, that spend more time stepping outside of product development & more time improving the house they live under together. Most of all, I hope that we are learning & sharing, openly, publicly, #LearningInPublic, so that the state of the art can improve, so that there is more & better material out there to decide what a best practice might be & how to head there. And how to have it make sense 2 years down the road when all the original engineers have found other opportunities & the new, less seasoned crew has to suss out how to proceed.

I also want to touch briefly on the original article, although I'm not sure how. Because it has such a neat exploration of so many interesting, rich capabilities, capabilities that let software manage software: software defined software, software defined systems. And in ways that are outside our newer conventions, where we manage polyglot systems via containerized processes (versus using powerful multi-system VMs to manage). The author's love of empowering the operator with online tools, letting them situate & operate from inside the running machine systems, is such a glorious & important capability, about how we commune with machines, & the author is so right to call attention to what we've lost, to how much less operators are online with their systems & how much more distance there is. Being able to work with the machines directly is important.

And yet, here too I think the situation has nuance & reason. That it's not only a matter of best practices socially gumming us up & social group-think making us dumb. Or, if they are gumming us up, it is in response to another form of getting gummed up. Systems used to be dealt with as pets, as carefully groomed & growing buckets of bits that had acquired various ad-hoc operations & scripts & subsystems over time, as operators went about working on the system & working on systems to help them better work on the system - the author leads with going in depth on the ad-hoc deployment systems that had grown up over the years internally on their system, in just such a manner. When anything is possible & at hand, many things happen. Over time, it's hard for newcomers (& seasoned veterans) to understand what has been done, why things were done, hard to even see what pieces are there. A system that is never killed, that's always carried forward, mutated, ongoingly & organically, becomes an unintelligible bucket of bits, an opaque, intertwined, illegible historical artifact. In contrast, immutable containers don't attempt to strike much of a balance in this situation, don't leave a lot of affordances for many of the wonderful Erlang capabilities, but they ensure that there is a well-known & iteratively improvable way to get from code to deployed code that everyone can observe & understand, that brings predictability & understandability that often, in the past, had been sorely absent.

Please have sympathy. Things are complex. Figuring out not only how to move forwards but move forwards in a robust enduring well-thought-out & mindful manner is a big challenge, & retaining that hunger & appetite for moving forwards, for doing good things is equally important, & navigating ourselves while under this tension is something we have to learn to live with, hopefully peacefully.

At some point most of us (from engineers to CTO) fall for a "silver bullet". Then after a while, reality speaks up loudly and we decide things like, "not everything should be an object", or "not every function should be a microservice", or now "containers aren't the only solution".

I always admired PHP’s runtime for the way it serves web requests. I think it’s a similar ethos to that described in the post, even though Erlang and PHP are quite different.

Start from nothing. No shared state with previous requests. Plough right through as many errors as you can. Log them, but don’t stop. Keep executing code until you run out of things to do.
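That ethos can be sketched in a few lines (a toy request handler of my own invention, not PHP internals): every failing step is logged and execution keeps going until the request runs out of work.

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("request")

def handle_request(steps):
    """PHP-ethos handler: fresh state per request, plough through errors.

    Each request starts with no shared state; a failing step is logged
    and skipped rather than aborting the whole request.
    """
    results = []
    for step in steps:
        try:
            results.append(step())
        except Exception as exc:  # log it, but don't stop
            log.error("step failed: %s", exc)
            results.append(None)
    return results

def ok():
    return "ok"

def boom():
    raise RuntimeError("template missing")

out = handle_request([ok, boom, ok])
print(out)  # the request still completes: ['ok', None, 'ok']
```

The Erlang philosophy discussed below is the inverse: the equivalent handler would crash on the first error and be restarted fresh by its supervisor.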

This is probably the reason why it's so popular (meaning WordPress), but also why it's so full of security exploits.

Fail-fast leads to more stable systems.

Generally agree, but if Erlang were as popular as WordPress, we don't know what kinds of issues hackers would find.

I think it would be ok, but part of the question is what you mean by "popular".

WhatsApp famously runs on Erlang, as do large chunks of our telecoms infrastructure. It has an enormous number of users. What it doesn't have is an enormous number of developers. One of the good things about the Erlang (and Elixir) ecosystem is that there are a relatively small number of people implementing the bits you care about, and they all seem to be extremely competent engineers. The pieces are designed well, they are reliable, and they fit together. They also have good abstractions, so if you use good pieces then you get good results (e.g. with Phoenix you'd have a hard time creating an XSRF or SQL injection problem).

PHP has a bit of a perfect storm of problems - it is so accessible that anyone can and did use it. This meant a lot of people implemented things using it, with varying levels of competence. This means a lot of problematic implementation of a lot of pieces, built on a language and framework that didn't really try to prevent security problems and so didn't really have the tools to do so. In contrast to Erlang/Elixir above, you can take a set of tools in PHP and easily implement a very insecure website.

I love both Erlang and PHP, but the philosophy of what to do with errors in requests is diametrically opposed. For PHP, it's usually log, but keep going as best you can. For Erlang, it's log and stop, and the request handler will generally be restarted.

I guess where they're similar is that an error is generally going to only affect that one request, the next request will have a fresh handler.

(Of course, these are generalizations, you can always do things differently in either language).

I'd recommend reading the post about the difference between Erlang and PHP and then reading the article again. This isn't "silently swallow errors". It's saying: start by failing aggressively, but also don't forget to identify the situations resulting in failures in the first place and fix those bugs or handle the corner conditions properly. That's diametrically incompatible with "log and continue", as at no point is that desirable according to this post.

I've always thought that's precisely why PHP is bad.

It throws out the context baby with the state bathwater.

While I still love Docker and containerization and use them (judiciously), this article does describe a couple of situations I've witnessed.

One way to describe it, and I forgot where I read it, is that one advantage of interpreted languages used to be that you no longer had to grab a cup of coffee every hour while your work compiled. With containerization, things are even now: everyone gets to wait while the images get built.

Of course, I've also worked at places where the environments are set correctly and take advantage of the tech the team is using, instead of having to rebuild all the time. Which is why I don't hate containerization on its own. Everything can be badly applied or used.

We run a lot of production Elixir apps at work, but we deploy them all in Docker containers on AWS basically. The whole live-upgrade erlang philosophy has always been one I've been curious about and interested in, but also a little skeptical of.

It's funny to me that erlang (and Elixir) are so dogmatically immutable / no shared state "in the small" at the module and function level, but apparently not "in the large", at the system level. Whereas you tend to see the reverse with deploying most other languages.

Erlang language level immutability is a means to an end, it makes the runtime (especially garbage collection) simpler than a system with mutable state where you can have references to newer and older data.

At the systems level, immutability makes operating the system much harder. Especially when you consider Ericsson's basic design requirement of a telephone switch with a redundant pair of control computers: it's safer to deploy changes by making a temporary change to the live system than a permanent change to the backup system and switching over. In case of a bad deploy, you can live reload the old version on the live system, or if that's not possible, promote the backup system and restart the primary back to the known good state. If all goes well, though, there's no period of reduced redundancy. In a restart-to-deploy situation, every deploy has a period with loss of redundancy, as well as a primary/secondary switch, and you can't deploy to degraded systems where one of the redundant pair is not working pending repair (these switches would most likely be deployed towards the end of the cycle to reduce risk).

A large part of the philosophy is controlling state. Immutability in the small helps control where state lives. It’s difficult to control (or understand) state spread across hundreds of little objects (e.g. most OO languages) and be able to modify a running system. OTP Supervision trees and GenServers are all about partitioning state and organizing it. It’s similar in many ways to micro services, but more advanced (as the article mentioned regarding OTP Supervision trees and k8s).

Not sure why this wasn't marked a dupe of https://news.ycombinator.com/item?id=24577881 which was submitted ~10 hours earlier and I think was briefly on the front page.

This is a weird article. If the two systems the author described worked so well, how did the business ever decide to "bulldoze" it and replace it with something inferior? Even if you take the view that some non-technical person came in and demanded a particular technical change (??), any halfway-competent business is going to keep the old thing running while they build the new one so they can continue making money, and they'll notice that the new thing doesn't work as well as the old. Right?

I get the sense something is missing, and I get the sense the missing thing was that the replacements were actually better for the business. A couple of highly-talented people on a small team where the original developers owned production operations could run it as designed, and it was "really neat," but the business was limited in how they could use the system. Perhaps you couldn't easily add developers and make them work on new features. Perhaps you couldn't integrate OSS components that worked fine, and you were left behind your competitors. Perhaps the "one or two developers" who operated the system got married and started a family and weren't able to be on-call 24/7/365 any longer. Perhaps this system was one of dozens run by the business, and all the other ones were using crash-only containers, and this system continued to be the odd one out.

I'd like to hear from the other side of this - I'd like to hear the stories of the people who pushed the redesign away from the systems. Until then, this seems entirely too much like "We built this awesome prototype with a bunch of neat properties in theory." My suspicion is that the other side of these migrations were also driven by highly-talented technical people who have technical stories to tell, but their story is that they cleared out a bunch of technical debt and replaced it with a higher-reliability system.

> I'd like to hear from the other side of this - I'd like to hear the stories of the people who pushed the redesign away from the systems.

I'm certainly not one of those people, but they'll tell you things like how they standardized the new acquisition onto "best practices" and eliminated the "cowboy deployment" culture that was previously running rampant. I was part of an acquired Erlang team that went through this, although I mostly didn't deploy into the acquirer's systems; I managed the legacy systems until they were mostly retired, and then I retired.

They won't tell you that deploy times grew massively, requiring a larger team to manage (and that increasing the team size requires increasing the team size, because of increased time spent in coordination).

They won't tell you that because they've built their system to handle unreliable environments, they've neglected to make their environments reliable, and the system redundancy factor had to be increased to 3x instead of 2x because the churn rate on systems is too high to expect at least one out of two systems in a pair to be running. Of course, increasing the number of systems, increases deployment complexity, too.

Author here.

For the first system, it was deprecated without replacement and just left to run by managers and people who had moved on to other teams (but used to work on it), who did the minimal maintenance required; former employees were given emergency contracts in weird circumstances to deal with things. Roughly 3-4 years later, they finally replaced it, after 2-3 earlier attempts at rewrites had failed. The old design with minimal maintenance for years finally approached limits to how it could be scaled without bigger redesigns; I consider this to be extremely successful.

It wasn't exactly bulldozed so much as declared "done" and abandoned without adequate replacements while major parts of the business were being rewritten to use Go and a more "standard" stack. Obviously these migrations always start with something easier, by replacing the components that are huge pain points, and you're left with the legacy stuff that is much harder to replace for the final pushes. I felt that the blog post would have veered off point if the whole thing became about that, though.

The people on these teams left in part because the hiring budget was redirected towards hiring on the new projects. The idea was that everything could be done in Go for these stacks (by normalizing on tools and libraries developed in another project and wanting to have one implementation for both the private and shared platforms), and the rewrites were to start with Ruby components.

You knew working on the Erlang side of things that no feature development would ever take place again, that no new hands would be hired to help, and that you would be stuck on call 24/7 with no relief for years. All efforts were redirected to Go and getting rid of Ruby, and your stuff fell in between the cracks. I was one of the people who left on the long tail there. After my departure, I was brought back on a lucrative part-time contract as a sort of retainer for years to help them in case of major outages (got 1 or 2 in 4 years) since that was the only way they could get expert knowledge once they drove us all away.

I'm still on good terms with the people there, it's just that "maintaining a self-declared unmaintained legacy stack without budget or headcount until we get to rewrite it in many years" is not where any of us wanted to drive our careers.

Interestingly, we tried very hard to add new developers. We wrote manuals, tooling, a book on operating these systems (see https://www.erlang-in-anger.com), wanted to set up internal internships so developers from other teams could come and work with us for a while, etc. Whereas our team was very willing, internal politics (which I can't easily get in a public forum) made it unworkable and most attempts were turned down. These things were not always purely business decisions, and organizational dynamics can be very funny things.

Thanks, that's helpful. I think it is impressive, and actually kind of a selling point, that the system maintained itself without formal staffing. I was going to object that you didn't build a high-reliability system, you built a high-reliability system that worked as long as you had a couple of experts staffing it, but it sort of sounds like that's not actually what happened.

On the other hand, it seems like a fairly common problem that using neat but niche technology makes it hard to hire for it and get continued development. My own employer has a pretty nice system in Clojure that is well-maintained and has people working on it but nonetheless is always on the wrong side of things because it's using a language (and a development and release workflow, in turn) that nobody else is using.

Is the problem that we should build systems that have all the advantages of the neat tech we like but still look like the more boring and less complicated things, operationally? Was it that you had the backstop in your Erlang system but it wasn't enough to convince the business that it could be treated like a normal system? Or was it separate (and could it have been shaped to look like the more boring thing)?

Alternatively, if we take the goal of reliability engineering broadly as building systems that work despite both dysfunctional computer systems and dysfunctional human systems, it seems like that argues in favor of picking the approach that doesn't line you up to be on the wrong side of organizational politics. I don't like this conclusion - I'm a huge fan of solving problems with small but well-applied amounts of technical expertise - but I don't really have a good sense of how to avoid it. (Maybe the answer is more blog posts like yours influencing expectations.)

I think it's a problem of "not being the tech of choice". Obscure languages have a higher cost in that you can't just out-source the work of training people to the rest of the industry as easily.

But you'll also get issues with staffing and getting people interested when working in mainstream languages that are less cool than they used to be, frameworks that are on older versions and hard to migrate, deployed through systems that aren't as nice on your resume as newer ones, or on platforms that aren't seen as positively.

I don't have a very clear answer to give about why Erlang specifically wasn't seen positively. The VP of Eng at the time (now the github CTO) saw Erlang very positively (https://twitter.com/jasoncwarner/status/1287383578435780608) but I know that some specific product people didn't like it, and so on. To some extent a lot of the work pushing us aside was just done by very eager Go developers who just started doing work on replacing our components with new ones on the other side of the org, and then propagating that elsewhere.

Whether the roadmap or other policies ended up kneecapping our team on purpose or accidentally is not something I can actually know. I kept pushing for years to improve things for our team, but at some point I got tired and left for a different sort of role.

I agree with you. This article complains that generic industry practices are not as optimized as his niche solutions. That's no surprise.

Of course a build and deploy system specifically designed for your product/environment is going to be more feature rich and performant, but once the initial developers leave or move on, all is lost. I have seen it happen multiple times.

Sticking with slower clunkier industry practices means you can hire someone off the street and they’ll have a much better chance of being able to understand the system.

If you want longevity, you use the standards. All engineering (mechanical, electrical, etc.) agrees on that.

I can understand the technical argument against the stateless containerized model. It's not neat. But what is the business argument? That's where this perspective falls down for me. If you're building a database that manages a disk directly, I can see the argument for graceful shutdown and handover. In that case you really need the program to be colocated with the data and if you're down then the data is unavailable.

I've built that once and it was hell to get right. But probably 90+% of programs I've ever worked on do not achieve any business advantage from graceful restarts (as opposed to drain and restart i.e. the stateless model).

Kubernetes is a leaky abstraction that by default allows the deployment of sloppy, unreliable apps if you don't know what you're doing. If something is broken and it just restarts automatically and no one is paying attention, it will do that in perpetuity. Ideally, you would see those issues and iterate to make your app more stable, but Kubernetes makes it easy to just forget about them. The additional layer of complexity is also harder to understand for most devs that haven't spent the time to learn Kubernetes, and even for devs that have, it can still make understanding how everything works hard. Letting tech debt like unstable apps grow into mud balls can reduce a team's velocity in the long term, and in some cases the short term as well.

Kubernetes is great for easily sharing and executing recipes, but it can make it tough for people that want to become chefs that are able to use reason and intuition to build and understand complex systems.

On the flip side, why bother spending time optimizing a process if nobody notices when it fails? Aren't there more important problems to be solving in that scenario?

Tech debt filled mud balls where feature after feature is clumped on top of one another without worrying what came before it and what will come after it are not great to maintain. Sure if you don't plan on maintaining or building on the component and you don't care if it restarts, then let it restart forever. But if it's a component that you plan on adding features to for a long period of time and stability issues are ignored, those types of issues can compound over time and make it more difficult to reason about and maintain.

I agree, things that change frequently need more attention to make sure you're not spending more time than you need to make changes.

Acknowledging that I'm changing the goalposts a bit, I would consider a lack of understanding about the lifecycle of an application that causes development delays an abstract form of "noticing when it fails".

Maybe people do notice, but it’s not easy to report back? Or they notice the system runs slowly? Or they just accept the fact that all systems (extrapolating from several banking, public service, and publishing sites) are crap, and need to live with it?

When you have two systems that need to talk to each other, and one system expects the other system to be reliable, it’s difficult to completely toss out all graceful shutdown and handover. Usually this logic is handled at the service discovery or load balancing layer with the deployment bits, so people don’t notice.

Kubernetes does a really good job at hiding this.

When it hides this, it also hides a couple other things with software — like failing containers, or apps that require churn or fail. I mean this is literally the reason why things like monit were built in the older days — things broke if you didn’t restart them when they broke.

In Kubernetes, it's also easy to accidentally change the system (at a global level) in a way that changes that behavior, resulting in stuff like your app not being restarted as often, which leads to elevated latency and subsequent failures.

No system is 100% reliable, though. Some systems are close enough that we grit our teeth and pretend they are and wave away the inevitable outages that have nothing to do with our app and everything to do with an unreliable dependency.

In a way, you're better off with a 99% reliable dependency than a 99.9999% one. If you expect it to be down some of the time, you'll plan for it in your architecture and (in theory) gracefully handle the failure, whether it's due to an insufficiently warmed container, a network issue, a bad deploy, or whatever else comes along.
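"Planning for it in your architecture" usually boils down to something like retry-with-backoff plus a degraded fallback, rather than passing the dependency's failure straight through to the user. A rough sketch (the function names are made up for illustration):

```python
import random
import time

def call_with_retry(dep, attempts=3, base_delay=0.01):
    """Call an unreliable dependency, retrying with jittered exponential
    backoff, and fall back to a degraded answer if it stays down."""
    for i in range(attempts):
        try:
            return dep()
        except ConnectionError:
            if i == attempts - 1:
                return None  # degraded fallback instead of a user-facing 500
            # Jitter spreads retries out so callers don't stampede together.
            time.sleep(base_delay * 2 ** i * random.uniform(0.5, 1.5))

calls = {"n": 0}
def dependency_99_percent():
    calls["n"] += 1
    if calls["n"] == 1:  # first call hits the 1% failure window
        raise ConnectionError("dependency briefly down")
    return {"status": "ok"}

print(call_with_retry(dependency_99_percent))  # -> {'status': 'ok'}
```

The point of the comment holds here: if the dependency were "always" up, nobody would bother writing the fallback branch, and the first real outage would take the caller down with it.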

I'd argue the biggest reason for a graceful restart would be reliability and latency.

But if you have the app behind a load balancer that can discover new tasks, rolling restarts for updates do not negatively affect those parameters.
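The invariant a rolling restart maintains is simple: never take more instances out of rotation than the batch size, so the load balancer always has capacity to route to. A toy simulation of that loop (hypothetical code, not any real orchestrator's API):

```python
def rolling_restart(pool, new_version, batch=1):
    """Rolling restart sketch: update instances one batch at a time so the
    load balancer always has in-service capacity to route to."""
    history = []  # in-service instance count observed at each step
    for i in range(0, len(pool), batch):
        for inst in pool[i:i + batch]:
            inst["healthy"] = False          # LB stops routing to it
        history.append(sum(1 for p in pool if p["healthy"]))
        for inst in pool[i:i + batch]:
            inst["version"] = new_version    # restart with the new code
            inst["healthy"] = True           # passes health checks again
    return history

pool = [{"version": "v1", "healthy": True} for _ in range(4)]
print(rolling_restart(pool, "v2"))  # -> [3, 3, 3, 3]: never below n-1
print(all(p["version"] == "v2" for p in pool))  # -> True
```

For short, stateless HTTP requests this is all you need, which is the parent's point; the trouble starts when "healthy again" involves minutes of state transfer or long-lived connections, as discussed below.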

This works pretty well for request oriented protocols with short response times.

It's not great for protocols with significant handshaking and the potential for long-lived connections. Things like chat services, Shoutcast streaming, calling, IMAP, really anything realtime or notification-driven, would do better without forcing clients to reconnect. Especially since many will reconnect multiple times, as they may reconnect to servers still running old code while the drain and restarts happen. If you need near-instant cutover (which is probably rare), in a drain-and-restart scenario you basically need double the capacity, whereas hotloading doesn't need anything extra.

So you pay a public cloud provider for 2x capacity for a couple of minutes while you do the rollout, and in exchange you get a much simpler design because the individual servers are stateless. I think that's better for the business.

Some services may require gigabytes of state to be downloaded to work with acceptable latency on local decisions, and that state is replicated in a way that is constantly updated. These could include ML models, large routing tables, or anything of the kind. Connections could be expected to be active for many minutes at a time (not everything is a web server serving short HTTP requests that are easy to resume), and so on.

Changing the instance means having to re-transfer all of the data and re-establish all of that state, on all of the nodes. You could easily see draining of connections take 15-20 minutes, and booting back up and scaling up take another 15-20 minutes, and that's if you can do it for _all_ the instances at once (which may not be guaranteed; you might need to stagger things to be more cost-effective).

Each deploy easily ends up taking over an hour. If you deploy 2-3 times a day and your peak times line up with these deploys, you can more than double your operating cost just to deploy, and the extra cost can easily run into four figures.
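A back-of-the-envelope version of that cost claim, with entirely made-up numbers (the instance price, fleet size, and deploy duration below are illustrative assumptions, not figures from this thread):

```python
# Hypothetical fleet: 1000 instances at an assumed $0.50/hour each.
instances = 1000
hourly_price = 0.50          # assumed on-demand price, not from the source
deploy_hours = 1.0           # drain + boot + stagger, per the comment above
deploys_per_day = 3

# During a blue/green deploy you run old and new fleets side by side,
# so the extra cost is roughly one full fleet-hour per deploy.
extra_cost_per_deploy = instances * hourly_price * deploy_hours
extra_cost_per_day = extra_cost_per_deploy * deploys_per_day
print(extra_cost_per_day)  # -> 1500.0, i.e. four figures per day
```

At 5,000+ instances (the scale mentioned below), the same arithmetic lands well past four figures per day before you even hit regional capacity limits.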

Some of the systems we maintained (not necessarily those we live-deployed to, but they still required rolling restarts) ran over 5,000 instances and could not just be doubled in size without running into per-region limits for given instance types.

If a blue/green deploy takes a couple of minutes, you probably don't have a workload where this is worth thinking about that much.

This sounds like an interesting usecase.

Did you look at any of the node-based options like checkpointing the state to a file on the node, and loading that into your newly started pods? Or using read-many persistent volumes? (Not sure if you needed to write to the state file from every process too?)

(This doesn’t help with connections of course, that’s a bit more thorny.)

> checkpointing the state to a file on the node, and loading that into your newly started pods

For some types of my nodes, the majority of the state was TCP connections and associated processes. I don't think there's a generally available system capable of transferring TCP connection state (although I'd love to build one, if you've got a need, funding, and a flexible timetable), which would be a prerequisite to moving the process state. All of those connections need to be ended, and clients reconnect to another server (where they might need to do it again, if they don't get lucky and hit a new server to begin with).

The other nodes, with more traditional state, had up to half a terabyte of state in memory, and potentially more on disk, with a good deal of writes. That's seven minutes to transfer the state over 10G Ethernet, assuming you can use the whole bandwidth and that producing and consuming the data is at least as fast as the network.
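The seven-minute figure checks out; the arithmetic is just state size over link speed:

```python
# Sanity check on the transfer-time figure: half a terabyte over 10GbE.
state_bytes = 0.5 * 1e12            # 0.5 TB of in-memory state
link_bits_per_s = 10 * 1e9          # 10 Gb/s, assuming the full link is usable
seconds = state_bytes * 8 / link_bits_per_s
print(seconds / 60)  # -> ~6.7 minutes, matching the figure above
```

And that's the best case: real transfers share the NIC with live traffic, so the practical number per node is higher, multiplied across every node you're replacing.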

Although, in my experience, we didn't tend to explicitly replicate disk-based storage for new nodes; all of our disk-based data was transient, so replacing nodes meant writing to new nodes, reading from both new and old, and retiring the old nodes once their data had all been fetched and deleted, or the retention cap was hit.

I/O volume meant networked filesystems would be a big stretch. You could probably do something with dual-ported SAS drives and redundant pairs of machines on the same rack, but then that pair will both go down when that rack has an unforeseen problem, plus good luck getting dual-ported SAS drives hooked up properly when you're in someone else's bare-metal managed hosting.

(Yeah, OK, maybe we had big performance requirements, but hotloading works just as well for stuff that fits on a single redundant pair, or even a single server in a pinch.)

Interesting case study, thanks for sharing!

If a request takes 100 ms, and a portion of them are murdered halfway through, then either clients retry for reliability in exchange for extra latency, or reliability takes a hit.

The best policy is for a deployment to lower the health check on the service, wait a few minutes per box, then bounce the process.

This is assuming the service is 100% stateless.
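That policy (fail the health check first, wait out the load balancer's probe cycles, drain in-flight work, then bounce) can be sketched as a sequence; this is toy code simulating the ordering, not a real deploy tool:

```python
import time

def drain_and_bounce(instance, lb_interval=0.01, checks=3):
    """Deploy policy sketch: fail the health check first, wait long enough
    for the load balancer to notice and stop routing, then bounce."""
    instance["healthy"] = False          # health endpoint now returns 503
    time.sleep(lb_interval * checks)     # wait out a few LB probe cycles
    while instance["in_flight"] > 0:     # let in-flight requests finish
        instance["in_flight"] -= 1       # (stand-in for real completions)
    instance["process"] = "restarted"    # now it's safe to bounce
    instance["healthy"] = True
    return instance

inst = {"healthy": True, "in_flight": 2, "process": "old"}
print(drain_and_bounce(inst))
```

Skipping the wait step is what produces the murdered-mid-request failures described above: the LB keeps routing to a process that's already going down.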

> But what is the business argument?

Complexity costs money.

Not sure I follow. The solutions proposed in the article - designing your software so it can be live-reloaded, having people who understand the system deeply etc. - all seem more complex and more expensive than just stuffing everything in Kubernetes and letting pods restart.

1. You always need people who understand the system. Not everyone, and not all of it, but overall you want people to be able to understand and figure out what is going on. You similarly hope to always have people who understand the customers, the environment you operate in, and so on. If you can afford to do your thing without having people who understand how it works, you can probably afford to do your thing without an actually good solution (just scrape by, it's alright) and go on with whatever.

I'd advance the theory that picking an off-the-shelf popular solution is going to be beneficial in that case because you externalize the costs of maintaining your expertise and knowledge to the rest of the ecosystem or industry, to other companies, and just never develop that fiber within your organization. It is, however, worthwhile to develop it for more things than just your tech stack. Everything having to do with on-boarding and dealing with legacy code is improved, along with broader dynamics if you try to be a learning organization that develops expertise in its people.

2. You always design your system in accordance with its deploy system, whether you realize it or not. If you ship binary artifacts and signed packages to customers who install them on their own devices, you will have a different development practice than if you do CI/CD with a single pipeline that always goes forwards. This will also be somewhat different if you work with open-source components that require paying attention to version schemes, rather than just pushing a hash in a container image. If you use feature flags to help with merging but also to control deployment, adopt A/B testing, and all these practices, you're intimately adapting your development approaches to the deployment mechanisms available to you. It's not an extra cost, it's a cost you already pay today.
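The feature-flag point is a good illustration of deploy mechanism shaping development practice: merged code ships dark, and the flag store, not the deploy pipeline, decides who sees the new path. A minimal sketch (the flag names and store layout are made up):

```python
import hashlib

# Hypothetical in-process flag store; real systems fetch this dynamically.
flags = {"new_checkout": {"enabled": True, "rollout_pct": 25}}

def flag_on(name, user_id):
    flag = flags.get(name)
    if not flag or not flag["enabled"]:
        return False
    # Stable bucketing: the same user always lands in the same bucket,
    # so a 25% rollout shows the same 25% of users the new path.
    digest = hashlib.md5(f"{name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < flag["rollout_pct"]

def checkout(user_id):
    # Both code paths live in the same deployed binary; the flag picks one.
    return "new flow" if flag_on("new_checkout", user_id) else "old flow"

print(sorted({checkout(u) for u in range(500)}))
```

Once you work this way, "deploy" and "release" are separate events, which is exactly the kind of design decision the deploy system imposed on the code.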

If you really need stateful services in k8s, why not just use a StatefulSet and keep your old deploy process?

I thought this was going to be about Kramerica Industries.
