Lessons Learned from Writing Over 300k Lines of Infrastructure Code (gruntwork.io)
344 points by jbkavungal 4 months ago | 153 comments

All of the points are very valid and valuable. But by reading only this post, you will have basically no idea how to go ahead and do all these things. You have to scour the internet, collect every blog post and video ever produced on the subject, internalize them all, start experimenting, and just sort of create your own version of what these things are.

I'm going to interviews right now where people are trying to hire a "Senior DevOps Engineer". They seem to think hiring this somewhat low-level employee will magically imbue their organization with DevOps principles, and suddenly everything will be cheaper and faster. I tell them it may take them 2 years to build a solid DevOps practice, assuming they do everything right, and they frown. Why can't we just have the DevOps right now? To that I reply, go hire a company that specializes in DevOps. And again they frown.

I don't think it should be this hard. I think we can consolidate all the important knowledge and pre-packaged work, collaboratively, to empower teams to actually get shit done quicker and better. I think we can make it so nobody ever has to go through the painful process of learning by trial and error ever again. And, of course, I think it should all be free. So I've started a wiki, and I hope I can convince a few more people to help contribute content.

GruntWork, I love what you're doing, but I'm also hoping to make your business obsolete. :)

The fundamental truth I've seen is that DevOps (the practices, not the role) is anti-Taylorist organizational design, while most corporations are run in completely the opposite way. The successes happen only when, from the top down, the organization has come to the sober realization that it really needs to break down barriers and enable individuals, in addition to hiring people capable of working across multiple functions (most people used to being coddled with purely focused work, using managers as shields and go-betweens, don't fare well in this transition). Most organizations only want to give people just enough to be productive to a certain capacity, not just because of least-access principles in security but also because of a cargo cult of manager worship that is antithetical to both Agile and DevOps.

Unfortunately, I've also seen plenty of consulting firms selling "DevOps Solutions" and productizing everything, because their customers look at everything as products and solutions to acquire and presume that their culture is fine as-is. While I don't doubt that these firms make plenty of money, I don't think the practice will be viable long-term, because the problems are entirely due to culture rather than tools or even people.

Pardon my ignorance, but I thought Taylorism was primarily about measurement and standardization of work practices. Could you elaborate on how it views organizational structures? Or do you just mean a mentality where managers end up viewing themselves as scientists and viewing their workers as various dials to turn?

Sure, Taylorism is indeed about measurement and standardization as well, but where it differs greatly from the approach Deming espoused is in where authority and the primary allocators of resources are positioned. In Taylorism, it’s much more centralized: managers are the dedicated resources who decide the allocation of resources, and they need information constantly to make decisions (hence a culture in most corporations of constantly asking for reports that are usually outdated / irrelevant by the time corrective actions are enacted). In the approaches toward quality that Deming talked about, much more emphasis is placed upon the end product, with responsibilities and authority granted closer to the work being performed (this is not Communism - quite the opposite!). The infamously centralized US Army has started to abandon many Taylorist principles to help fight in the Middle East against a decentralized enemy force. In a really well done “devops” culture, while managers are still valuable, their roles are a lot more limited in day-to-day scope than in traditionally structured organizations. Other approaches include “flat” organizations (fully peer-to-peer network mesh) and holacracy. Thus, in Deming’s view there are far fewer managers, and you may not need them at the smallest unit of organizational structure (you may report to someone as a formality, but you have more autonomy and KPIs attached to you).

Deming’s work was not widely accepted in the US, but his approach was accepted at a place many Americans do recognize for superior quality and consistency - Toyota. Every other devops process of feedback and organizational structure is derived squarely from Deming’s principles. I always thought it’s rather ironic that a society known for being relentlessly individualist adopted corporate practices of conformity and a society known for being conformist took on an organizational philosophy that put more control into individuals. Maybe it was done this way to counter natural social inclinations in wider society? I’m not sure honestly. My gut feeling is that Taylorism is precisely how large militaries historically work and following WW2 this was easier for American workers to adopt.

Oh the irony... When someone's proposing that responsibilities should be allocated differently you are the biggest proponent of it unless someone's proposing that your responsibilities should be allocated somewhere else (like I proposed below that engineers could do what devops engs do).

Not that I am surprised when I see hypocrisy. I just like to call it out.

I've been meaning to read some of his work for quite a while. Could you recommend a book? There seem to be quite a few on Amazon.

Out Of The Crisis.

>I think we can consolidate all the important knowledge and pre-packaged work, [...]. I think we can make it so nobody ever has to go through the painful process of learning by trial and error ever again.

Just an fyi... that goal is only possible for ideas that have stabilized and broad consensus has converged into widely accepted best practices. Yesterday's "hacks" that are insanely complex get vetted (or modified and replaced) and eventually get folded into the baseline of Normal Things Everybody Knows. But... the world isn't standing still and it evolves in complexity. Therefore, there's always new complexity that requires new hacks that everybody reinvents.

At a high level, the cycle looks like:

  chaos --> create order --> new chaos again
We see a chaos of hacks and complexity. We then notice recurring patterns in the disparate hacks and extract some unifying principle that can "simplify" the complexity. If we stop here, it seems we've tamed disorder and everything should be able to be documented and people can stop wasting time repeating everyone else's past mistakes. But all we've really done is raise the baseline of systems sophistication. There's always a new delta of chaos and complexity.

E.g. In the 1980s, sysadmins might implement a rudimentary sync of userids with a shell script and copying "/etc/passwd" files around. But we notice that's chaotic and stupid and doesn't scale. So we rationalize the system with a more unified technology such as a LDAP identity server. But utopia is still not achieved because while LDAP might be ok for employees' sign-on accounts, it's not suitable for millions of customer accounts of a B2C web business. The complexity treadmill is neverending!

>GruntWork, I love what you're doing, but I'm also hoping to make your business obsolete.

If companies like GruntWork and AWS are properly doing their job, they will never be obsolete because the changing world keeps throwing new demands at us and this constant delta of new chaos is always there that needs to be "solved".

Well sure, but it's not like they're in a vacuum. Anyone working in the industry sees the changes. We all could use a place to share information, and a place to discuss and remember over time. I think a forum (or mailing list?) could be used to manage the chaos, and its conversations used to help govern the order of the Wiki. Not that Wikis need order... It just sort of emerges over time.

> I don't think it should be this hard.

Totally agree. Unfortunately, you are up against an industry that is not interested in getting shit done quicker and better.

"Institutions will try to preserve the problem to which they are the solution"


Couldn't believe my ears when I heard an Ops guy say "we can't automate stuff, then what would we do?" He came around after he had 3x more work than he could ever do himself.

> we can consolidate all the important knowledge and pre-packaged work

Noble intent, but the time for that is not yet here. The state of cloud infrastructure practices today is a primordial soup of ideas, marketing, and a myriad of tools trying to survive on the market, like Terraform vs Ansible vs CloudFormation :)

Good infrastructure is hard: there is global networking with CDNs, DNS, CI/CD, Git integration, HA, monitoring, security, you name it. That's why you will not find the code on GitHub, only blog posts on Medium. The code brings bread and butter to so many people, including me :) - why share it?

And most important: creating infra code that is truly reusable by many is even harder. You need to feel which features should be there for everyone and which are bloat. It'll take many times the effort of writing non-reusable architecture for one company.

> .. And again they frown

Made me smile :)

> So I've started a wiki, and I hope I can convince a few more people to help contribute content.

Where can this be found?

Right now, DevOps.yoga (and SRE.pizza). Shamefully little content so far. If you'd like to add something, just send a PR and I'll merge it!

> I don't think it should be this hard

The difficulty arises from the underlying complexity. The underlying layers are necessarily complex, and the law of conservation of complexity applies. Documentation can be improved, and it will improve with time - and managed services will get more powerful. But yeah.

How would you relate that to Cloud Native, then? https://landscape.cncf.io/format=landscape

Took me a while before I found that, would have been helpful earlier on.

I recently replaced ~2000 lines of code written by my predecessor with about 200 lines. The resulting code also performed 1000 times faster.

After many years of real world experience I developed the following rule of thumb. A competent, experienced developer should produce about 20 lines of code per day, on average, to have any hope for decent code quality. I am talking about high level languages with big standard libraries and rich ecosystems, like Python. At that rate 300,000 lines of code is ~41 man-years.
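Taking the rule at face value, the back-of-the-envelope arithmetic works out like this (using calendar days rather than workdays, which is my assumption - with ~250 workdays/year the figure would be closer to 60):

```python
# Back-of-the-envelope check of the 20 LOC/day rule of thumb.
total_loc = 300_000
loc_per_day = 20

days = total_loc / loc_per_day   # 15,000 developer-days
man_years = days / 365           # ~41 man-years, using calendar days

print(f"{days:.0f} days, ~{man_years:.0f} man-years")
```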

> A competent, experienced developer should produce about 20 lines of code per day, on average, to have any hope for a decent code quality.

That's measuring the wrong thing altogether. Lines of code produced per day is simply not a valid metric; it's a bad metric no matter what number you put on it. You cannot measure the quality of a dev by the lines of code he produces, at all.

The OP isn't saying "20loc/day implies good developer", his implication is the other way around. That is, at least how I read it, a good developer -- with high-productivity tools -- nets about 20loc/day.

> That is, at least how I read it, a good developer -- with high-productivity tools -- nets about 20loc/day.

And I'm saying that's still a terrible measure and not accurate. LOC is simply not a valid statistic to even look at. It's an example of lying with statistics; you could take a thousand good developers and you'd probably find their LOC per day counts all over the board. The amount of code you write is not a measure of how good a developer you are whether it's high or low. That's not how you measure a good developer.

The OP is trying to say that a high LOC count is not the measure of a good developer, and he's right; but he's failed to realize that the metric itself is what's bad, not the values derived from it.

Good developers don't have to be highly productive, but they might be; what matters are the results of their efforts. Do they produce good programs, with few bugs, that age well over time and are easy to adapt to changing circumstances? Good developers write good code; some write more than others, but LOC tells you nothing about how good a developer is, whether that number is high or low.

Like any writing, it's the quality of the code that matters, not the quantity. You cannot measure quality by looking at quantity.

Of course it's quality over quantity. 20 lines is not a lot and I think that's the point; whether you're measuring LOC -- which is a well-established bad metric -- or something equally spurious (e.g., semicolons per kilobyte hour) the fact that a very low number was chosen is telling. LOC is just used as a proxy here for something quantifiable; accuracy is somewhat irrelevant.

Writing is not the best analogy -- because of the typical churn-edit cycle -- but to use your example, of course churning out pages of text in one day will, on average, produce lower quality than a virtuoso author who carefully chooses his/her words and maybe produces, on average, a paragraph a day. If you were to tell a layperson that a good writer can output a paragraph a day, they'd probably find that counter-intuitive because it's such a minute amount; "Surely anyone can write a few sentences in eight hours? Even I could do that!" They're making the same false entailment, without seeing the craft that went into those paragraphs per day, or realising that there will be days when said author writes nothing and others where they're in the zone and write an almost-perfectly-formed chapter. That doesn't stop "paragraphs/day" from being a bad metric, nor does it imply that anyone who writes a paragraph a day will end up with anything good.

  if (false) {
    // my daily 20 lines here
    // ...
  }

The most impactful change I made in 6 years of professional programming was 1 LOC in one file and about 5 LOC in another file. It was a 10x reduction of queries at a billion queries per day scale. I fixed a broken cache and it took me over two weeks to find the issue.

On the other hand I put out hundreds of LOC per hour, when writing the basic HTML for an internal crud app.

This metric is as useless as it can be. Even as an average. It has too much to do with your role and tasks.

It's not uncommon to refine things as a project evolves. I've had a few cases where I ended up replacing some huge messy code with a simpler and better performing alternative, however in my case I was also often the one to have written the original. Not that I consider this a bad thing, since it highlights the experience increase and skill progression that took place in-between.

I can't say I agree with your rule of thumb. How'd you reach that number? In my experience the number fluctuates wildly depending on what tasks you're working on. Your output will usually shoot up while a project is young or you're working on a new well-defined feature, and it'll go down as you focus on fixing bugs or improving performance. The best days are when you write negative lines of code.

Lines of code is usually not a good signal for code quality. If I pull in a huge new dependency in order to quickly solve a problem, then it might seem like my project is maintaining high code quality, but that doesn't paint a complete picture. For each included dependency you should probably have at least one person on your team responsible for maintenance, which includes things like: keeping it updated, tracking security issues, checking releases for breaking changes or bugs, and updating your code to handle changes (or notifying others if the scope is too large). Ideally, they should also be sufficiently familiar with the code-base to fix bugs or other serious issues which affect you. These points are only meant to serve as a rough example of some of the responsibilities one might be expected to shoulder; requirements vary based on a large number of factors.

I remember when I was in school and first heard that professional developers average about 20 LOC/day. I thought it was ridiculous, because as a student I regularly banged out much more than that in a few hours for class assignments. Little did I know what I didn't yet know.

It's implied in the article that 300,000 lines of code is the total written by the company (of which the author is a founder), not by the individual author. With a company that's 3 years old, 41 man-years implies a staff of roughly 15-20, which is quite reasonable for a startup.

And written for "hundreds" of clients according to the author, which basically implies a good chunk of the 300k lines are duplicates, with a fair amount of customization.

That sounds reasonable for application code. But 15-20 people working on infrastructure code for 3 years? That sounds like an unreasonable number of developers dedicated to devops for a startup.

The startup's whole product is DevOps - their elevator pitch is literally "DevOps as a service".

You should probably scope and clarify your rule of thumb. For example, writing a function that renders readable error messages to a UI will be far more than 20 lines of code.

It’s an average. Novel problems: low SLOC/day. Common or “solved” (but not yet implemented) problems: high SLOC/day.

EDIT: “Autocorrect” got me.

So? That means it will take more than one day to write, by this metric.

If you have your list of possible errors, it really shouldn't take that long to write an `errorObj -> string` function.
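For what it's worth, a minimal sketch of such a function (the error codes and message wording here are invented for illustration):

```python
# Hypothetical error-to-message mapping; codes and wording are made up.
ERROR_MESSAGES = {
    "ECONNREFUSED": "Could not reach the server. Check your connection.",
    "EAUTH": "Your session has expired. Please sign in again.",
}

def render_error(err: Exception) -> str:
    """Turn a known error object into a readable UI message."""
    code = getattr(err, "code", None)
    return ERROR_MESSAGES.get(code, f"Unexpected error: {err}")
```

Of course, the real work is deciding the list of possible errors and their wording, not the dispatch itself.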

I think when writing UI or 3D code 20 LoC/day isn't enough, but yes there are probably parts of the stack where this holds true.

It’s an average. It’s also 20 lines of tested and debugged code per day. Once you factor those things in, it becomes a lot more reasonable as an average.

Would be interesting to check to what degree that was true for LLVM, Swift or other high impact projects while they were still in their infancy.

That's looking at it completely wrong. The important thing is staying up to date with best practices in your field, and automating away anything you do more than twice.

Beyond that, if you're only writing 20 lines of code a day each and every single day, then you're either wasting a lot of your and your employer's time, or your role isn't just programming and you have other things to attend to.

Terraform code is easily 10x as verbose as normal code, and it involves a lot of copy-and-paste.
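A hypothetical sketch of what that looks like in practice (resource names and AMI id are placeholders) - even a single instance takes a screenful of HCL, and each environment tends to get a near-identical copy:

```hcl
resource "aws_instance" "web_prod" {
  ami           = "ami-0abc123example"  # placeholder
  instance_type = "t3.micro"

  tags = {
    Name        = "web-prod"
    Environment = "prod"
  }
}

# ...and then the same block again, lightly edited, for staging, dev, ...
```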

> ... as normal code

True. ... as Cloudformation template code.

True. Maybe it's a general property of infrastructure-as-code code.

I stopped coding long before I ever got proficient enough to say I was a “coder”, but have spent enough time struggling with “intro to x” beginner code writing courses to get a hint of what goes on when writing programs.

I’m presently going through something similar to the author’s situation as I’m migrating my leads into Microsoft Dynamics and learning to adapt the platform to my use case (real estate brokerage). It’s not “infrastructure” but it’s the foundation for my business.

I can say that even something as simple as correct implementation of software is not something that should be done at speed, or without serious consideration towards portability.

For example, because I’m solo and also working another full-time job, my emphasis is on completing the migration ASAP even though I’m aware of small errors in the database that might bite me later.

Also, while I’m comfortable with MS’s record of maintaining enterprise software, it’s slightly suffocating to feel that I will effectively lose my time investment should they decide to abandon Microsoft Dynamics.

They're not gonna abandon Dynamics. If anything, they're doubling down on the whole 365 suite. (And they should, SharePoint and OneDrive for Business are really good.)

I'd like to hear the logical conclusion of "Large Modules Considered Harmful." Is he implying infrastructure code (e.g., terraform or ansible or cloud formation) should be split between the repositories for each component? Or just saying, "be smart about how you layout the infrastructure repo"?

More of the latter. Don't put all your eggs in one basket. Don't create a single "module" (i.e., single deployable thing) with 100,000 lines of code and all your infrastructure in it. Break things up into small, reusable, composable pieces. This is what you typically do in any general purpose programming language, and it turns out it's a very good idea with infrastructure-as-code languages too.
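As a hypothetical illustration (directory names invented), the layout often ends up looking something like:

```
modules/
  vpc/              # one small, reusable module per concern
  ecs-cluster/
  rds/
live/
  prod/main.tf      # composes the small modules
  stage/main.tf
```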

So much this. I've gotten so tired of seeing the same story again and again: Someone sees something that's done by multiple pieces of code and goes "I'm gonna write a framework for that!". Two months later you've got an unmaintainable behemoth of a god class that does everything under the sun and drives everyone insane who has to look at it (a Cthulhu class, if you will).

So, yes, I got to the same conclusion. Basically, solve every part of the problem:

- in isolation/as focused as possible

- as a library

- with the easiest to use and most composable API you can come up with

After that, if you have to, you can still glue them together into a framework-like thing. But at least, anyone can pick any feature out of it without succumbing to madness. Also, this is a great way to avoid the whole "Big Ball of Mud" problem that forces you to pull a whole ecosystem into your project although all you wanted to do was log something.

I would say it's be smart about it.

I use Ansible quite a bit. When I first started, based on some examples I found, my playbooks were pretty complicated, with a lot of conditional steps, doing things (or not) based on results of previous steps, etc.

What I found is that it's a lot easier to have a lot of roles that each do one little thing, and then use them in a playbook as needed.

A role might be as simple as installing a package, templating a configuration file, or maybe even just changing just one line in a config file.

When roles are small and do just one thing, they are easy to combine in many different ways, and it's more obvious what's going on.
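As a hypothetical illustration (role names invented), a playbook then becomes little more than a list of those small roles:

```yaml
# site.yml - composes small, single-purpose roles
- hosts: webservers
  roles:
    - install_nginx    # just installs the package
    - nginx_config     # just templates the config file
    - motd_banner      # just changes one line in a config file
```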

Having been struck by this a couple of times myself: if one component changes too much, you end up with a large piece of code that has too much responsibility, never gets rewritten (because it takes too much time), and is bug-prone because it's tough to know exactly how each commit affects everything.

If you go with largish modules you need to be smart when you build it, if you build smaller modules you don't have to be that smart. Dumb is good, dumb is easier to explain and read for others (including yourself in 3 years).

So, be dumb and keep it simple, whatever that means in your case. If the code is easy, maybe a giant repo is good, if the team is small. If the team is big, maybe you have other means of validating access to your repo; if not, you need to split it up. But then commits need to be synced when pushed across several repos..

Edit: make sure it's possible to make clean commits and rewrite the whole code in small steps. Things Will Change(tm)

I think Segment’s terraform repos and blog show the logical conclusion of this pretty well. I’ve gone through the process of splitting up a single cfn stack for the entire environment, it was less about repo structure and more about deployment units for us


Writing infrastructure code is a pain because it's so much more than everyone thinks it is. I think this article puts it well and hopefully makes it easier for folks to explain it to their team/boss. There are many solo devops engineers out there facing this challenge.

I'd say it's mostly a pain because it interfaces with parts of the system that you only touch a few times a year if it's working correctly, yet it's totally outside of your domain expertise. Nobody works directly with Ansible or EC2 instances enough to be an expert at it.

Like how every time I need to renew an SSL cert, I need to reread the man pages for how to make certs and cert requests. You'd think a replicable procedure could be documented, but there are just enough variables that it's slightly different every time.

Definitely appreciate the prod-ready checklist; it's super useful to have a sanity check that you didn't miss anything.

Here is their full checklist: https://www.gruntwork.io/devops-checklist/

There is a simple way to make it much easier. Use hosted platforms (eg PaaS) instead of self-maintained infrastructure.

Unfortunately, most enterprises are still not willing to outsource their IT platform to the real pros and thereby make it someone else’s problem to keep that infrastructure up and running. They want to build their own in-house competency. But the problem is, if you’re a true DevOps pro, why would you work for some random company in which IT is considered a cost center, rather than working for either a hosted platform provider or a (for lack of a better phrase) “independent provider of DevOps”?

IT and security managers are able to convince business executives that IT infrastructure needs to be kept in-house, even though today’s typical enterprise has no chance of actually building and maintaining an infrastructure that is 1% as good as what they could rent from a dedicated provider.

The problem with your theory is that even in hosted platforms, IT is a cost center.

"The real pros" you speak of are also under the same cost pressures as everyone else. I have encountered many such services where after the "real pros" have set up a system, they are replaced by low cost staff who can maintain and minimally extend while billing...

Is it just me or are we constantly repeating the same lessons, the same messages, over and over and over again?

Same cycles over and over again? Because people in tech don't learn from history.

We do and do again and do it again again.

And if you finally have it more or less done in your local environment / company, you switch and suddenly you start again.

But you know, my salary still increases anyway so shrug (still looking for a solution)

So if I made an open source solution that automates a lot of this ... I guess it would help a lot of people.

The vast majority of developers don’t know what those details are

Let me fix this for you: the vast majority of developers are actively denied access to the infrastructure by their managers. They pretty much know everything an infra eng knows, or even more (after all, all these catchy cutting-edge apps like Kubernetes/Elasticsearch/etc. were written by developers).

Blame the managers and no-one else.

Having been able to provide all the infrastructure to developers, I find a lot of things on the 'production grade checklist' are not even on their radar. And I think we have pretty good devs. Logs, backups, HA, scalability, testing with production sized datasets, testing with production sized load. We found it better to empower our sysadmins to get involved with development, rather than the devops approach of empowering developers to be involved with deployment (we do both).

Are the developers-who-are-denied-access that you describe also on-call and watching monitoring alerts for when something goes awry?

I've worked in companies where "DevOps" was thought to mean: "Devs have a lot of access, even root/dba/etc. access, and while they're sleeping or if they get stuck on something, they should punt it over half-finished or half-broken to the infra eng / devops group"

^^ This is so incredibly common, especially in small/startup environments. It's a nightmare I actively aim to fix when I join / consult at companies.

There's instability in the environment, inconsistencies, or just a mess, and you parachute in there to fix it. Step 1, kick all devs out of any systems where they have too much access (unless they're special, knowledgeable ones who know and understand what you're trying to do). Step 2, stabilize the system. Step 2 involves environment separation, real automation that more than one person can use, setting up monitoring, and much much more.

> while they're sleeping or if they get stuck on something, they should punt it over half-finished or half-broken to the infra eng

By the description of it, the root cause of the problem is your setup. Dismantle the infra eng team and make the dev team monitor their own apps.

From reading your explanation, devs should have more access and responsibilities.

You set up a half baked process and you are blaming the devs for it? Not cool.

Ha ha. That's assuming the dev team wants to monitor apps. I've met many, many devs who have 0 desire to be woken up at night for crap they coded during the day. They just want to code.

It's incredibly common.

Not that I blame them, if you can work just 9-5, why wouldn't you?

Neither I nor my dev and DevOps buddies have ever seen anything like what you just suggested.

Dismantle infra eng and make devs do everything? I support that, for one reason: a year from now I can charge consulting dollars when I come to fix these unstable, poorly documented, environments with minimal automation and HA.

Well I did that in the startup I worked in previously. And no, they are not gonna call you for your shitty service. They are pretty stable.

Guess what? We knew how to configure private clouds, load balancers, sharding, and high availability. We pretty much knew everything about Kubernetes/Amazon ECS et al. and related technologies, and we didn't have any problems with monitoring or low-level networking, among other things, either.

Anyone can learn this shit. There is nothing special about it.

Don't hold your breath for seeing your consulting dollars, LOL. :D

Yeah, I recently left a job in which a third of the group was sysadmins, there was no 'dev system' to test applications on, and I was not allowed to touch any of the data systems, despite developing applications for our clients that worked on that data. Go figure. A colleague of mine at a previous employer would refer to this as 'self-imposed denial of service'...

> developers are actively denied access to the infrastructure

As it should be. Devs make apps; infra guys and SREs make those apps run in production. Different skills - PHP vs a sysadmin's work.

Very outdated thinking. Per Martin Fowler's advice, if you write microservices you should organize your code around products and use vertical teams. The guy who writes the frontend should be able to write the deployment code for it too and maintain it for the complete lifecycle. If a devops eng can and wants to do only sysadmin work, he should travel back maybe 20 years in time. That was common back then.

I mentioned sysadmin skills just for example of what is out of scope of apps or web devs skills. I may add networking, load balancing, monitoring, autoscaling, DB sharding, DR and so on. And all the above requires coding skills to make it work in the cloud.

Ask your frontend guy about SR-IOV, just for fun :)

These are all integral parts of the dev skill set. So what's your point?

I worked in all kinds of companies. The lamest and slowest were the ones with this outdated thinking. The most productive teams, with the highest-quality products, were in companies where the team was responsible for their apps' networking, load balancing, monitoring, autoscaling, and DB sharding.

You think these are somehow special skills. But you didn't provide any arguments.

I’ve spent hundreds and even thousands of man-hours undoing the lack of planning that developers have inflicted upon their early-stage start-up environments, because they simply haven’t ever managed networks or systems beyond the scale of their laptops. They have no context for setting up VLANs or subnetting beyond their cute home networks. Putting everything into huge /8 subnets, extremely loose IAM permissions, overlapping VPC CIDRs, etc. has caused a great, great deal of issues that make growth slow or even impossible. For example, we’ve had to do a number of DB migrations resulting in some maintenance periods (when our product is sold as a zero-downtime solution as a competitive differentiator) because we had to re-IP our AWS VPCs from the ground up to accommodate multiple regions and accounts. If you read through the best practices of AWS networking, none of this would have been a big deal.
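The overlapping-CIDR problem in particular is cheap to check up front; a small sketch (the ranges here are made up for illustration):

```python
import ipaddress

# Two VPCs lazily carved out of the same default range overlap,
# which blocks VPC peering and forces a painful re-IP later.
vpc_a = ipaddress.ip_network("10.0.0.0/16")
vpc_b = ipaddress.ip_network("10.0.128.0/17")  # falls inside vpc_a
vpc_c = ipaddress.ip_network("10.1.0.0/16")    # planned up front

print(vpc_a.overlaps(vpc_b))  # True  -> trouble down the road
print(vpc_a.overlaps(vpc_c))  # False -> peering-friendly
```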

You wouldn’t hire an operations engineer to architect your software, so why would you have a software architect set up your network? Software may be tough to change, but I can assure everyone that the millions of miles of cabling in legacy data centers out there would be much better if someone with a sufficient networking background had gotten involved earlier. Operations is a cost center, but it can also be a force multiplier of less than 1.0 for your revenue centers.

If you don’t have the knowledge and experience to lay out a network and expect to grow the footprint beyond more than a handful of instances, you should probably at least get a 1 hour session over coffee with a moderately experienced engineer that has actually thought about these things before. Heck, I’d do it practically for just beer money because I’d consider it an act of goodwill for any future infrastructure engineer that has to work on the environment.

Here is what I see happening today: we have a bunch of hype induced by cloud companies and their marketing departments around their solutions, which makes their solutions look desirable and inevitable. It also makes them seem cutting-edge.

We also have another bunch of actors in the industry: devops engineers and sysadmins who, because their old cushy position was made kind of unnecessary by the emergence of these new services (after all, ANYONE can create an AWS account), realized that they have to come up with new smoke and mirrors to rationalize why their role is so important.

I am very very much against this stance and proposition and I will fight this thinking in every possible forum.

The possibility is real and the risk/cost is very low to empower dev teams in today's cloud landscape, so there is really no need to prevent this from happening.

And this discussion also reminds me of an old Uncle Bob article [0] where he summarizes the situation back then as follows:

> I witnessed the rise of a new job function. The DBA! Mere programmers could not be entrusted with the data – so the marketing hype told us. The data is too precious, too fragile, too easily corrupted by those undisciplined louts. We need special people to manage the data. People trained by the database companies. People who would safeguard and promulgate the giant database companies’ marketing message: that the database belongs in the center. The center of the system, the enterprise, the world, the very universe. MUAHAHAHAHAHAHA!

Now replace data in the above excerpt with infrastructure and you will arrive at the same conclusion as I did. Q.E.D.

[0] https://blog.cleancoder.com/uncle-bob/2012/05/15/NODB.html

I view the democratization of infrastructure similar to democracy- the best part of it is that anyone can do it, and the worst part of it is that anyone can do it. On the flipside of specialists getting involved, I also see an awful lot of bad / inappropriate networks and security layouts in cloud environments created by traditional infrastructure engineers because they carried too many principles from managing physical networks. I'm just happy that I shouldn't need to be hassled by anyone to create a random VM for them to test something quick out with such flexible infrastructure.

The devops / Agile philosophy of everyone being empowered to do most things works pretty well when people want to do all these things, are invested in the outcome together, and are at least vaguely competent in their tasking. However, the approach has limitations when it comes to tasks that nobody wants or can do but is still important to the business. It's even worse when something is important but nobody even knows it because of groupthink blindness.

I don't think I'm being hypocritical in recommending specialists for topics I know something about while advocating for empowerment in other functions because if my previous employers / clients knew what they were doing with infrastructure and healthy software development practices, they could have grown much more before needing to hire someone to do it full-time for them. I really don't want to have to re-IP another awful network again and have to tell leadership that you have to incur downtime to do it because their software can't handle database hiccups like when failing over to a hot replica. It is boring, unfulfilling, stressful work to me that - even worse - offers no tangible business value when done well but when done inappropriately is an albatross.

Compute infrastructure across different industries is in an overall state of health where everyone loves junk food but is starting to recognize its harm, some vaccines have been developed for the flu but nobody gets it or the vaccine costs $20k per shot for some people, doctors for Hollywood actors and pro athletes debate publicly over which lifting program is more optimal, and the two fitness trends are competitive decathlons and walking from their car to their desk instead of taking a Bird. In comparison, software is much further along with at least a vague sense of a board of medicine in different states (that is determined through a pageant and feats of strength, not experience in Mississippi), people are taught about the dangers of junk food, there is a debate on GMOs (Uncle Bob is strictly against it, I see some positives although DBs are much more controversial now than the well-researched topic of GMOs), and only the literally crazy people don't believe in use of vaccines.

You and I both agree that tons of people needlessly hire a personal trainer when the information to start exercising is out there and basically free now. What I think you're suggesting is to "just start jogging and it'll work out - everyone can run without a trainer helping you" but I think it's mistaken not because I think trainers are required. Right now most cloud providers don't give you shoes for free because they want to sell you Air Jordans or hiking shoes to recuperate their substantial investments, the roads are totally unpaved except for paths through lemonade stands charged by how fast you run, and I've seen a lot of people hit by cars while running because they kept stopping to pick glass out of their feet all because the common theme is they started running with socks and they were "forced" to keep running. I don't think I'm being unreasonable in saying that by default people start walking with socks on because they think personal trainers are too much when flip flops can work really well until you need to start running. By your view every other former sysadmin is now a personal trainer trying to get people into some shoes when we can do fine without one, and while I can see that I'm personally not the typical sysadmin type because I started off as a developer only caring about running fast and have learned starting off on the wrong foot can cause serious problems that can be very cheaply and easily corrected. Perhaps we are disagreeing over how much those flip flops cost or how difficult bad footwear is to discard?

I totally agree with the last paragraph but would point out that as a front-end software engineer exposed to all aspects of business-critical systems front-to-back and troubleshooting them alongside folks whose job it was to fix things (but maybe couldn’t...) I’ve been trying since mid-2014 to follow a certain amount of network infrastructure best practices through the PacketPushers podcasts—-what started with trying to learn more about the buzzword of Software Defined Networking led to independent study by doing my own research into things like FD.io, virtual machine networking, and the various networking solutions and benefits that Kubernetes and cloud providers can offer and the important role of usually-centralized control plane abstractions in distributed systems like networking — be it serverless or self-serve, efficient routing, or troubleshooting/monitoring. I’ve found that if you don’t have an experienced cloud-native software engineering resource to learn from, you’ll have to follow many different sources (and source code!) in order to pick up how some of this works. In the end, I’m left both with a greater appreciation for the sysadmin skills required, but also for the disruption new technologies can bring. And I’ll leave looking up the details on my next implementation to, as noted above, looking up best practices in a just-in-time sort of way. (One of my favourite methods to pick up SRE tidbits is to search recent HN comments for advice from the trenches, I really appreciate these mentions and resources...)

That said, every time I use terms that aren’t directly related to code or software architecture, I feel imposter syndrome creeping up on me—as in writing much of this post. (I’d welcome comments or additional learning resources!)

Also, I’m not sure AWS is the best example, their network terms and best practices seem far more oddly named and unique to AWS than what I’ve heard from Google’s recent conference presentations for enterprise adoption of Google Cloud.

For your specific case, you're already far, far ahead of the curve on my experiences with developers familiar with infrastructure concerns and spreading yourself too thin does nobody any good job or career-wise. I'm in the camp myself - I'm building my depth back and dropping whatever curiosities and novelties show up to get my head on straight. I might argue even that you've gone too far - your value as someone that can code beyond a rudimentary level is much more generalized by default than any network engineer, and context-switching between coding and infrastructure work is almost as bad as switching from coding to meetings. Computers are really fast these days and unless you're typically working with massive scale infrastructures I have trouble seeing an advantage in a developer understanding much more than how to troubleshoot things effectively so that a network engineer can determine what's wrong (being able to identify when packet fragmentation and inappropriate MTU is causing app slow-downs on top of using MTR rather than traceroute will make you a great friend to many network engineers already).

Your infrastructure / SRE peers will appreciate you more for writing software that is easier to deploy and maintain when you're clueless about networking than if you understand networks and designed a system that is extremely stateful when it offers no technical advantage to be that way (databases / caches get a pass; everyone else writing software in 2018 has no excuse to keep state on a machine for longer than a business transaction window). High performing software teams deploy often and have the culture to encourage it in a healthy manner - there is no excuse to write software that is deployed every few weeks or months as huge chunks of changes unless you fall into very niche enterprisey domains, and even then you should at least be making production-releasable software daily.

And like it or not, AWS is the enterprise cloud by fiat now, so it'll determine the bar and terminology for other vendors to meet and (hopefully) exceed.

In my experience, this is pretty far from true. I've worked with developers who tried to cover all of these skill sets and instead of being good developers they end up being pretty shallow generalists. Research is a zero sum game and picking "everything" just means you have small amounts of surface knowledge for many topics instead of deep knowledge of a few.

The difference is that today it is possible to get "enough" generalization on top of managed services and cloud infrastructure that you can be deep on cloud deployments + full-stack webdev, or cloud deployments + game engines, or whatever. Avoiding the need for horizontal separation of teams gives significant leverage: the teams that own their stack end-to-end (on top of other external systems and teams, true) can move faster than those that require internal coordination, and they can do it without losses in quality/availability/stability.

The hard part for me is finding the time to do it all myself—even if I can learn it all, it simply doesn’t scale to wait for me to do all the work, and somehow fix my own bugs/problems with my ideas. But knowing something about everything does make it easier to work with others and defer to their expertise while also offering suggestions or doing an extra bit of code review. My personal preference is to look up best practices as I need to, just-in-time, or just get it done if I’ve done something similar recently. It’s really important to work with others though, especially when they’ve specialized knowledge and experience to share. Building it yourself or as a generalist alone often means a lot of learning-it-the-hard-way. But it’s always possible, if you can accept the tradeoffs or build a smaller subset of a system.

"Research is a zero sum game" - I don't agree with that. Having insights into both processes, i.e. DevOps + Dev, will most likely make you better at both!

> make you better at both

You'll get insights, as you correctly worded it, good. Thing is, designing production infra takes much more than that: years of telecom experience, for example, where downtime is not an option, deep Linux and network knowledge, HA designs, and lessons from past mistakes. That cannot be learned overnight, and you develop a sixth sense for whether a design looks right or not.

Add to that AWS SA Pro cert, Python and you'll be getting somewhere :)

LOL. I worked in Nokia Networks for quite some time.

> downtime is not an option

HAHAHA - sure. Of course.

> all integral part of the dev skills

Then you found a bunch of unicorns, congrats :) In real life, good app devs and good infra devs are different people. I mean good not as in nice and easy personalities, but in skills. Do all your app/web devs have AWS certificates?

And by production I mean Netflix like infrastructure.

Microservices are overkill for a lot of things; right now they are a useful but extremely overrated technology.

I wouldn't allow most engineers I know to touch infra code. It's just a very different skillset.

Nope. It is not. There you go. I gave a similarly strong argument to yours.

Or did you really want to give arguments but you forgot it?

The arguments are well laid out by other responses, but the gist of it is that 90% of devs are wasting their time thinking about the details of infrastructure code, as it does not pertain to their skillset and it takes _years_ to master it.

You cannot get junior engineers contributing to infra, or most Semi-senior engineers either. You need a ton of upfront thinking about releases, policies, backouts, risk, etc.

And tbh, if you're thinking about those things it makes no sense that you also focus on user-facing features.

This sorta assumes specialization isn't a thing, or isn't valuable.

Maybe I should clarify what I mean. I think specialization is useful in some contexts. Let's say you are an IaaS or PaaS provider of some sort. Then your employees will probably only work on devops and nothing else, and I can see that people would need special skills then. But then there is no need to rant about how developers don't know the details, because there are no developers involved. Also, if you are working for a cloud provider, for example, you can't escape installing hardware and maintaining physical infrastructure. I am aware of that too.

But we are talking about devops in the context of gruntwork. Which is targeting developers/dev teams and then making unsubstantiated claims about those devs' understanding. That's why I say what I say.

Are you trying to solve the problems caused by an insufficiently sized ops group? The other posts in this sub-thread imply devs are going to make a destructive mess of your whole system architecture, but I find myself blocked on more mundane things like being unable to fix deployments or builds, or toy around with a new Amazon service in dev because every new service requires new IAM roles and guess who exclusively owns that ability?

If I read what you mean correctly, then we are probably on the same side. Yes, the problem you described is the exact problem I am trying to prevent in every organization that I join, and I actively call out outdated development/management practices and shame/blame them.

People who propose that managing their precious infrastructure cannot be bestowed on those inferior developers are the real hindrance to any business which wants to move forward fast. They simply don't realize that the lack of decentralized project management and of empowered teams creates terrible communication overhead and grotesque, half-baked solutions.

grotesque, half baked workarounds are among the most long-term damaging things I've seen, yea. nobody understands them, so they limp along until they take other things down with them.

Like: the emphasis on tests, with real pointers to testing Terraform plans. That's legit, and hard advice to come by.

Dislike: 300K lines of infrastructure code, but all he talked about was DevOps issues? Surely he learned something about writing good code, too, and I hope he didn't write 300K lines of Terraform code.

> hope he didn't write 300K lines of Terraform code.

If you look at the company, their product is providing infrastructure code, it was not a byproduct.

... , then replace all that Terraform code with CloudFormation templates. You will get 5x less code, a native tool vs. a third-party tool at v0.11, and my respect :)

I've written tens of thousands of lines of both and find Terraform requires me to write many, many fewer lines of code because I don't need to break out Lambda functions nor deploy a random web service to serve as a Custom Resource. I don't need to write code to search for AMIs in Terraform; I have a data provider for those kinds of things. CloudFormation is sometimes substantially behind Terraform in support for certain flags. And lastly, Terraform supports a lot more different services than just a single cloud provider - it truly is an infrastructure description language in comparison to the much more tightly constrained CloudFormation. However, CF is my go-to over Terraform when we're talking about application deployments based upon instances - that's when Terraform starts falling apart and requiring tons of code over CF. In fact, I think the Terragrunt author solved his app deployment problem in Terraform by generating CloudFormation stacks from his Terraform - good luck doing the reverse from CloudFormation!
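The AMI-lookup point can be illustrated with a small sketch using Terraform's `aws_ami` data source. The owner ID and name filter shown are the commonly cited values for Canonical's Ubuntu images, included here as an illustration rather than a recommendation:

```hcl
# Look up the newest matching AMI at plan time - no Lambda-backed
# custom resource needed, unlike the CloudFormation equivalent.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical's AWS account ID

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*"]
  }
}

resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t3.micro"
}
```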

Furthermore, transitioning from Cloudformation to Terraform is much, much easier than the reverse - you can't import existing resources into CloudFormation management, period (you can't even write the aws-prefixed tags to try to confuse it unless you're an AWS employee or something).

Nothing wrong with starting with CloudFormation or any other provider-native deployment description language if you're 100% in AWS and will stay that way. For the rest of us, Terraform is basically the only choice that can work besides paying for eye-watering expensive boutique tools that were developed a decade ago and have more limitations than Terraform.

You are the first person I've encountered who preferred cloudformation over terraform. CF has far worse ability to cope with out-of-band changes or problems in general (in contrast to TF, which can usually just `apply` things back into compliance), and somehow gets feature support slower than TF in spite of being 1st party.

> just `apply` things back into compliance

If your things are wandering out of compliance by themselves, you have bigger problems, imho.

Likely the commenter is referring to development environments, where it is common for infra to be put in bizarre states by actors performing changes. Terraform alone does not constitute change control.

Then don’t do that. But CF at least now has drift detection

But what I prefer about CF is the “easy button”. If something weird happens or I can’t figure out something, I can use our business support plan with AWS.

> drift detection

As of 2 weeks ago[0] drift detection exists, but coverage is poor[1].

[0] https://aws.amazon.com/about-aws/whats-new/2018/11/aws-cloud...

[1] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

Terraform is multi cloud, which has its benefits for a few of us.

Also multi-service. We are an all AWS shop, but integrating things like DataDog monitors and alerting into our infrastructure, or github organization membership linked to IAM makes things much more maintainable.
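As a hedged sketch of that multi-service point: one Terraform codebase can manage AWS resources alongside, say, a DataDog alert in the same plan/apply cycle. The resource and field names below come from the AWS and DataDog Terraform providers; the metric query and tags are made up for illustration:

```hcl
# An AWS security group and a DataDog monitor for the same service,
# version-controlled and reviewed together.
resource "aws_security_group" "web" {
  name = "web"
}

resource "datadog_monitor" "elb_5xx" {
  name    = "High 5xx rate on web ELB"
  type    = "metric alert"
  query   = "sum(last_5m):sum:aws.elb.httpcode_elb_5xx{app:web}.as_count() > 100"
  message = "5xx spike on the web load balancer. @slack-ops"
}
```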

That multicloud thing is just a marketing trick. You'll never get it; different clouds have different resource types, each with a different set of parameters. Want two clouds? Write two independent infra codebases.

Only if you use aws

> Make sure your team has the time to master these tools

What a luxury!

Really though, if you can't master the tools, maybe you should avoid them.

There are degrees here. It depends what kind of Scotsman we are talking about. Having a general understanding of the uses of the tool, understanding what's feasible in general, some best-practice rules of thumb, and knowing how to google for details gets you most of the way there. You could call that mastery. It certainly requires greater expertise than "I read a blog post and changed a few things until it seemed to work." But the bar is much lower than being familiar with the source code, or even being able to use the tool for common use cases without recourse to the documentation.

I think you do need to invest in getting over that initial learning curve such that your tools aren't "magic." You have a conceptual mental model of what is going on. And I agree that takes much more than a token effort. Yet for me, that also stops short of mastery.

Well, it should at least be possible to become a master. "I know how this works but someone else did the grunt work."

I always aim for simple to use tools, if I need to I can dig deeper and just-fix-the-problem. Or replace it at a whim.

So I agree with you.

Infrastructure code is really tough when it's scaling, so keep your modules in shape/order and small/easy.

Wow, thanks for the insight! All this time I thought I was supposed to be using tools with a half-assed understanding. It makes so much sense now! :)

Seriously though, what I was suggesting is that some companies make promises to their customers, on behalf of their engineering team without understanding (or properly estimating) the impact and technology involved. Then it's a mad rush to get something delivered. I understand this is a bad way to do business but it's a sad reality.

You made the perfect target for my point, sorry about that :) I think it's a good point to really push at though. Something we as techs need to push up in the agenda of management.

Yes, and then everybody is upset that things aren't delivered on time / are riddled with bugs. There's no way to escape out-of-sync management.

It feels like there's a personal vendetta here?

> Something we as techs need to push up in the agenda of management.

And you're absolutely right, we should. It just seems like a losing battle at times, that's all I was trying to convey.

> .. we as techs need to push up in the agenda of management

> .. It just seems like a losing battle at times

Bad management. Managers in tech companies must have a tech background. I was burned before, joining big banking corps where managers are just bureaucrats; I quit, with regret for the lost time.

Next time, I will vet the management layer before joining.

has the time to master =/= can't master

At a lot of companies, asking "can I have a week to play around with these new tools/frameworks before I get started with them" will be met with "no way". Of course, not saying that's good: often the new tools are strong enough that you would still long term get massive development time savings even with the upfront costs. But a lot of managers and PMs are only focused on "shipping" ASAP, never mind the technical debt or maintenance costs (and to be fair, sometimes you need to ship ASAP... but sometimes you don't)

I think the comment still stands though: if you can't master the new tools, including for reasons like the company not giving you time to learn them, then maybe you (the company) should stick to the old tools.

Doesn't apply if the new tools are forced by management, but often it's the developers driving new tech.

> Doesn't apply if the new tools are forced by management

That's the reality I was trying to capture with my original comment. For example, if you're moving from on-premise servers or a simpler cloud provider (Rackspace, DigitalOcean, or Joyent) to AWS, there is no way to escape learning AWS, and the last thing you want to do is manage everything through the web console, so you will need to learn a new tool to manage the infrastructure when your old tools don't work with the new stuff.

Management and sales types hear about how great, fast, and agile the cloud is and don't consider that it also takes some time to do things right, and there is no way around learning lots of AWS-specific bits at the very least.

And according to the bus factor, there should be more than one person who could master it. Or it should at least be possible to do another hire.

Indeed; work with tools you CAN master. Often you have to make the decision between managed services, PaaS and IaaS, each rung involving a lot more specialized knowledge you need to know. Each one also having more potential savings - because managed services (Firebase, DynamoDB, things like that) trade cost for ease of use / lack of setup / maintenance.

Can't we all just keep using "bash"? :) I never want to work with Ansible again! Having to edit YAML files and learn Ansible concepts and language... Bash is still very understandable and the common denominator on Linux.

Bash development is hard to scale. You'd need quite a bit of rigor, and there are few best practices available for when you want/need to write shell scripts. You can do it, but at the end of the day you still end up with a bespoke solution.

A bespoke solution, but in a well-understood and widely used language...

You'd be surprised at the "well-understood" part.

Agreed :D

Are there any good "worked examples" of infrastructure as code out there? Something one can learn good practices from?

The https://github.com/gruntwork-io & https://github.com/cloudposse examples spring to mind.

As for example implementations review cloudposse in depth or take a look at the https://github.com/travis-ci/terraform-config repo.

This answer is skewed towards infrastructure as code. Often conflated are things such as configuration management & provisioning.

Several good tips in here, but slides 60-72 are probably the most valuable - don't do anything by hand, find some tools and automate everything using those tools.

LOC is a very, very bad metric. How exactly is it counted? Does it include mass refactoring? Is there a better metric? Yes: a long list of testimonials from happy customers.

One glaring omission is metrics.

In the production grade checklist, there is nothing mentioned about metrics.

Infact there is no mention in the post of metrics or graphs.

This implies that everything is done via logs, which is just horrific at scale.

Everything should emit metrics:
o Hits per second? Metric.
o Memory use? Metric.
o Upstream service response time? Metric.
o That new lib you wrote? Metrics.

This is especially important with microservices. OpenTracing is grand, but that's for after you've found where the problem is. Your metrics should be your single pane of glass that indicates the health and performance of your system.

i think it's called monitoring in the checklist.

Lots of people in this thread bashing on LOC as a metric, which is fair. I'd just like to point out that infrastructure code is incredibly verbose as is, so the number is way overstated to start.

It's a pretty good example of doctoring headline worthy titles. IIRC the author gave a talk of a similar name at the hashiconf recently.

Is there a version of this that is //just text// instead of a whole bunch of images and other resources that don't load without scripts enabled?

There's a transcript of the talk on the HashiCorp website: https://www.hashicorp.com/resources/lessons-learned-300000-l...

Try using Mercury Reader or Outline if you don't like the way they chose to present the article.

The text loads without problem right?

anyone care to take a stab at defining what the author means by 'infrastructure code' ?

Infrastructure as code is starting to be a big thing. Let's say you want to provision a new AWS EC2 instance. You could go into the UI and click around to do it, but at a large company that's probably not the best idea: it's not really scalable and leaves a lot of room for human error.

You could instead use something like terraform, which allows you to write code specifying your requirements and then run that code. This allows other devs to review your code, takes a lot of human error out of the equation, and is much more sustainable.
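A minimal sketch of that Terraform approach might look like this (the AMI ID is a placeholder):

```hcl
provider "aws" {
  region = "us-east-1"
}

# The same EC2 instance you could click together in the console,
# but reviewable in a pull request and reproducible on every apply.
resource "aws_instance" "example" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t2.micro"

  tags = {
    Name = "example-instance"
  }
}
```

Running `terraform plan` shows exactly what will change before anything is touched, which is where the code-review benefit comes from.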

When he says infrastructure code, I think he means code that keeps the infrastructure running.

Some examples that come to mind would be Terraform files, Dockerfiles/docker-compose files, Jenkinsfiles, bash scripts, etc. Basically, code that keeps the servers running.

So a JCL then?

The largest code base I worked on (a map-reduce-based billing system) also had, for the time (the early 80's), a fairly complex set of JCL that could compile and build all the system's modules on dev and also push them out to the 15-16 or so live systems.

This also handled all the glue that held together the map reduce.

It’s not procedural: you write what you want your infrastructure to look like, in the case of CloudFormation in YAML or JSON.

The first time you run your template it creates all of your infrastructure and is usually smart enough to figure out dependencies.

After you make changes to your template and run it again it knows based on the changes in your template whether it can modify the existing resources or whether it needs to delete and recreate your resources.

Procedural certainly can be "Infrastructure as code." It just isn't the most modern way to do it, due to additional complexity and the potential to be more error prone. I'd certainly prefer CloudFormation over writing a bunch of python/boto code, but it could be done.

Interestingly, Dockerfiles brought back a bunch of procedural configuration management. We had migrated to Ansible for all our server-level configuration, but as we've adopted Docker/containerization in recent years, simplifying our applications (now separate containers, rather than services on common servers) has reduced complexity so much that simple Dockerfiles with `apt-get install foo` are much preferred.

Most people probably think of Terraform[0] or Fugue[1] here, but it also includes the more venerable likes of CloudFormation[2] and OpenStack Resources[3], both writeable in Troposphere[4].

It also includes any random mess of shell scripts and the like that manage the lifecycle of infrastructure.

[0] https://www.terraform.io/

[1] https://www.fugue.co/

[2] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGui...

[3] https://docs.openstack.org/heat/latest/template_guide/openst...

[4] https://github.com/cloudtools/troposphere

The first sentence links to gruntwork.io[0] where 29 different tools/projects are listed. Seems to be provisioning and configuration for existing services (most of them Amazon, from the looks of it). There are several Hashicorp projects as well, including Vault, Nomad.

[0]: https://gruntwork.io/infrastructure-as-code-library/

It's code that provisions or configures VMs, containers, cloud resources like RDS, etc. Usually this is accomplished using tools like Terraform, Ansible playbooks, SaltStack, Kubernetes, and the many other similar tools.

Code that defines what your servers/services are, what they should contain, where they should be put, how they should interact.

E.g. a simple Terraform plan lets you declare how many instances of which type should be running in which zones.
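A minimal sketch of that in HCL (the zone list, AMI ID, and instance type are placeholders):

```hcl
# Hypothetical example: one identical instance per listed zone.
variable "zones" {
  default = ["us-east-1a", "us-east-1b", "us-east-1c"]
}

resource "aws_instance" "web" {
  count             = length(var.zones)
  ami               = "ami-0123456789abcdef0"  # placeholder
  instance_type     = "t3.micro"
  availability_zone = var.zones[count.index]
}
```

Change the zone list and re-apply, and Terraform works out which instances to create or destroy to match.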

Edit: every piece of tool used here is used for infrastructure, controlling them is infrastructure code: (it's massive!) https://landscape.cncf.io/format=landscape

The issue with configuration management is that it creates a DAG on top of a build system that already has one; in other words, it's superfluous.

Configuration management provides opinionated abstractions over inconsistent and frequently user-unfriendly systems. You might as well say that C is superfluous because it all maps down to assembly.

I'm saying creating two DAGs is superfluous; FreeBSD, OpenBSD, NetBSD, etc. all don't do this. If you want to abstract the CLI tools, why do it in a non-portable, implementation-specific language?

nice. it's rather on-point.

That you're writing too much code?

Are you criticizing the abstractions, or amount of infrastructure?

That's code on top of frameworks, to do roughly similar things. As a software engineer I wouldn't be happy writing, let alone maintaining, that much code.

His points are fine, but, shocker, it's the same principles that apply to all code. Infra isn't that special; I thought we established that years ago.

Also starting by boasting about the number of lines of code you have written to achieve something is asking for trouble.

It feels like many in the 'devops' community who come from an ops background are rediscovering software principles; this chap isn't alone in that!

Infrastructure is special - if I make a change to my code, it usually won’t kill a whole database, load balancers, knock out connectivity, etc.

A load balancer is not much use if the application on the other side is borked because the application code is faulty, so I don't really see your point.

What I do see is 'devops' folks typically changing code in critical areas without taking sufficient care, but that's more a cultural thing than infra code being special.

Agreed and I'm from the "devops" side of the house.

I don't see a lot of build pipelines with automated tests and deployment for IaC. Standardized processes like pull requests with code review are also not ubiquitous.

With Terraform one typically writes a template, does a plan, then applies and hopes for the best. What I've seen, and the OP mentions the same in his talk, is that things occasionally go terribly awry in ways that common software development practices would prevent.
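One low-cost way to put a guardrail in that plan-then-apply workflow is to lint the plan itself in CI before anything is applied. A sketch in Python (the guard function and sample resources are hypothetical; the field names follow the JSON that `terraform show -json plan.out` emits):

```python
# Hypothetical CI check: inspect a Terraform plan's JSON representation
# and fail the pipeline before `apply` if it would delete or replace
# any resource. The sample resources below are made up.

def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    bad = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as a delete plus a create,
        # so checking for "delete" covers both cases.
        if "delete" in actions:
            bad.append(rc["address"])
    return bad

# A plan fragment in the shape Terraform emits:
sample_plan = {
    "resource_changes": [
        {"address": "aws_instance.web",
         "change": {"actions": ["update"]}},
        {"address": "aws_db_instance.main",
         "change": {"actions": ["delete", "create"]}},
    ]
}

if __name__ == "__main__":
    doomed = destructive_changes(sample_plan)
    if doomed:
        raise SystemExit("refusing to apply; plan destroys: " + ", ".join(doomed))
```

A check like this won't catch everything, but it turns "apply and hope" into "apply only after an automated review of what the plan intends to do."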

If you make a minor mistake in application code, it usually doesn’t affect the whole site. Besides, when you are deploying code on a group of servers, hopefully you have sense enough to at least do a rolling deployment and are using automated health checks to make sure that your whole site isn’t down during a deployment.

Nor does my infra code, because I write tests and have a pre-prod environment to test it in, as part of a controlled CI approach.
