
Lessons Learned from Writing Over 300k Lines of Infrastructure Code - jbkavungal
https://blog.gruntwork.io/5-lessons-learned-from-writing-over-300-000-lines-of-infrastructure-code-36ba7fadeac1
======
peterwwillis
All of the points are very valid and valuable. But by reading only this post,
you will have basically no idea how to go ahead and do all these things. You
have to scour the internet, collect every blog post and video ever produced on
the subject, internalize them all, start experimenting, and just sort of
create your own version of what these things are.

I'm going to interviews right now where people are trying to hire a "Senior
DevOps Engineer". They seem to think hiring this somewhat low-level employee
will magically imbue their organization with DevOps principles, and suddenly
everything will be cheaper and faster. I tell them it may take them 2 years to
build a solid DevOps practice, assuming they do everything right, and they
frown. Why can't we just have the DevOps right now? To that I reply, go hire a
company that specializes in DevOps. And again they frown.

I don't think it should be this hard. I think we can consolidate all the
important knowledge and pre-packaged work, collaboratively, to empower teams
to actually get shit done quicker and better. I think we can make it so nobody
ever has to go through the painful process of learning by trial and error ever
again. And, of course, I think it should all be free. So I've started a wiki,
and I hope I can convince a few more people to help contribute content.

GruntWork, I love what you're doing, but I'm also hoping to make your business
obsolete. :)

~~~
jasode
_> I think we can consolidate all the important knowledge and pre-packaged
work, [...]. I think we can make it so nobody ever has to go through the
painful process of learning by trial and error ever again._

Just an fyi... that goal is only possible for ideas that have stabilized and
broad consensus has converged into widely accepted best practices. Yesterday's
"hacks" that are insanely complex get vetted (or modified and replaced) and
eventually get folded into the _baseline of Normal Things Everybody Knows_.
But... the world isn't standing still and it evolves in complexity. Therefore,
there's always _new complexity_ that requires new hacks that everybody
reinvents.

At a high level, the cycle looks like:

    chaos --> create order --> new chaos again

We see a chaos of hacks and complexity. We then notice recurring patterns in
the disparate hacks and extract some unifying principle that can "simplify"
the complexity. If we stop here, it seems we've tamed disorder and everything
should be able to be documented and people can stop wasting time repeating
everyone else's past mistakes. But all we've really done is raise the
_baseline_ of systems sophistication. There's always a _new delta_ of chaos
and complexity.

E.g., in the 1980s, sysadmins might implement a rudimentary sync of userids
with a shell script and copying "/etc/passwd" files around. But we notice
that's chaotic and stupid and doesn't scale, so we rationalize the system with
a more unified technology such as an LDAP identity server. But utopia is still
not achieved, because while LDAP might be OK for employees' sign-on accounts,
it's not suitable for the millions of customer accounts of a B2C web business.
The complexity treadmill is neverending!

 _> GruntWork, I love what you're doing, but I'm also hoping to make your
business obsolete._

If companies like Gruntwork and AWS are properly doing their job, they will
_never_ be obsolete, because the changing world keeps throwing new demands at
us, and there is always a constant delta of new chaos that needs to be
"solved".

~~~
peterwwillis
Well sure, but it's not like they're in a vacuum. Anyone working in the
industry sees the changes. We all could use a place to share information, and
a place to discuss and remember over time. I think a forum (or mailing list?)
could be used to manage the chaos, and its conversations used to help govern
the order of the Wiki. Not that Wikis need order... It just sort of emerges
over time.

------
perfunctory
I recently replaced ~2000 lines of code written by my predecessor with about
200 lines. The resulting code also performed 1000 times faster.

After many years of real-world experience I developed the following rule of
thumb: a competent, experienced developer should produce about 20 lines of
code per day, on average, to have any hope of decent code quality. I am
talking about high-level languages with big standard libraries and rich
ecosystems, like Python. At that rate, 300,000 lines of code is ~41 man-years.
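For what it's worth, the arithmetic behind that estimate works out as follows (a quick sketch; note the ~41 figure only holds if you divide by 365 calendar days, not ~250 working days per year):

```python
# Back-of-the-envelope check of the "~41 man-years" claim above.
total_lines = 300_000
lines_per_day = 20

developer_days = total_lines / lines_per_day   # 15,000 developer-days
calendar_years = developer_days / 365          # ~41.1 (calendar days)
working_years = developer_days / 250           # 60.0 (working days)

print(round(calendar_years), round(working_years))
```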

~~~
zorga
> A competent, experienced developer should produce about 20 lines of code per
> day, on average, to have any hope for a decent code quality.

That's simply measuring the wrong thing altogether. Lines of code per day is
not a valid metric at all. It's just a bad metric, period, no matter what
number you put on it. You cannot measure the quality of a dev by the lines of
code he produces, at all.

~~~
Xophmeister
The OP isn't saying "20loc/day implies good developer"; his implication is the
other way around. That is, at least as I read it, a good developer -- with
high-productivity tools -- nets about 20loc/day.

~~~
zorga
> That is, at least how I read it, a good developer -- with high-productivity
> tools -- nets about 20loc/day.

And I'm saying that's still a terrible measure and not accurate. LOC is simply
not a valid statistic to even look at. It's an example of lying with
statistics; you could take a thousand good developers and you'd probably find
their LOC per day counts all over the board. The amount of code you write is
not a measure of how good a developer you are whether it's high or low. That's
not how you measure a good developer.

The OP is trying to say that a high LOC count is not a measure of a good
developer, and he's right; but he's failed to realize that the metric itself
is what's bad, not the values derived from it.

Good developers don't have to be highly productive, but they might be; what
matters are the results of their efforts. Do they produce good programs, with
few bugs, that age well over time and are easy to adapt to changing
circumstances? Good developers write good code; some write more than others,
but LOC tells you nothing about how good a developer is, whether that number
is high or low.

Like any writing, it's the quality of the code that matters, not the quantity.
You cannot measure quality by looking at quantity.

~~~
Xophmeister
Of course it's quality over quantity. 20 lines is not a lot and I think that's
the point; whether you're measuring LOC -- which is a well-established bad
metric -- or something equally spurious (e.g., semicolons per kilobyte hour)
the fact that a very low number was chosen is telling. LOC is just used as a
proxy here for something quantifiable; accuracy is somewhat irrelevant.

Writing is not the best analogy -- because of the typical churn-edit cycle --
but to use your example: of course churning out pages of text in one day will,
on average, be of a lower quality than a virtuoso author who carefully chooses
his/her words and maybe produces, on average, a paragraph a day. If you were
to tell a layperson that a good writer can output a paragraph a day, they'd
probably find that counter-intuitive because it's such a minute amount;
"Surely anyone can write a few sentences in eight hours? Even I could do
that!" They're making the same false-entailment, without seeing the craft that
went into those paragraphs/day, or realising that there will be days when said
author writes nothing and others where they're in the zone and write an
almost-perfectly-formed chapter. That doesn't stop "paragraphs/day" from being
a bad metric, nor does it imply that anyone who writes a paragraph a day will
end up with anything good.

------
mlacks
I stopped coding long before I ever got proficient enough to say I was a
“coder”, but have spent enough time struggling with “intro to x” beginner code
writing courses to get a hint of what goes on when writing programs.

I’m presently going through something similar to the author’s situation, as
I’m migrating my leads into Microsoft Dynamics and learning to make the
platform work for my use case (real estate brokerage). It’s not
“infrastructure”, but it’s the foundation for my business.

I can say that even something as simple as correct implementation of software
is not something that should be done at speed, or without serious
consideration towards portability.

For example, because I’m solo and also working another full-time job, my
emphasis is on completing the migration ASAP, even though I’m aware of small
errors in the database that might bite me later.

Also, while I’m comfortable with MS’s record of maintaining enterprise
software, it’s slightly suffocating to feel that I would effectively lose my
time investment should they decide to abandon Microsoft Dynamics.

~~~
Jorge1o1
They're not gonna abandon Dynamics. If anything, they're doubling down on the
whole 365 suite. (And they should, SharePoint and OneDrive for Business are
really good.)

------
Boxxed
I'd like to hear the logical conclusion of "Large Modules Considered Harmful."
Is he implying infrastructure code (e.g., Terraform or Ansible or
CloudFormation) should be split between the repositories for each component?
Or just saying, "be smart about how you lay out the infrastructure repo"?

~~~
brikis98
More of the latter. Don't put all your eggs in one basket. Don't create a
single "module" (i.e., single deployable thing) with 100,000 lines of code and
all your infrastructure in it. Break things up into small, reusable,
composable pieces. This is what you typically do in any general purpose
programming language, and it turns out it's a very good idea with
infrastructure-as-code languages too.
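As a rough sketch of what that composition looks like in Terraform (module names, paths, and values here are hypothetical, not Gruntwork's actual layout):

```hcl
# Root configuration composes small, single-purpose modules rather than
# declaring everything in one 100,000-line blob.
module "vpc" {
  source     = "./modules/vpc" # networking only
  cidr_block = "10.0.0.0/16"
}

module "app" {
  source   = "./modules/asg-service"  # one deployable service
  vpc_id   = "${module.vpc.vpc_id}"   # compose via module outputs
  min_size = 2
  max_size = 10
}
```

Each module can then be versioned, tested, and reused independently, and a mistake in one module's `apply` has a much smaller blast radius than a change to one giant configuration.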

~~~
btschaegg
So much this. I've gotten so tired of seeing the same story again and again:
someone sees something that's done by multiple pieces of code and goes "I'm
gonna write a framework for that!" Two months later you've got an
unmaintainable behemoth of a god class that does _everything under the sun_
and drives everyone who has to look at it insane (a Cthulhu class, if you
will).

So, yes, I got to the same conclusion. Basically, solve every part of the
problem:

- in isolation/as focused as possible

- as a library

- with the easiest-to-use and most composable API you can come up with

After that, if you have to, you can still glue them together into a framework-
like thing. But at least, anyone can pick any feature out of it without
succumbing to madness. Also, this is a great way to avoid the whole "Big Ball
of Mud" problem that forces you to pull a whole ecosystem into your project
although all you wanted to do was log something.

------
isodude
Writing infrastructure code is a pain because it's so much more than everyone
thinks it is. I think this article puts it well and hopefully makes it easier
for folks to explain it to their team/boss. There are many solo DevOps
engineers out there facing this challenge.

~~~
vinceguidry
I'd say it's mostly a pain because it interfaces with parts of the system that
you only touch a few times a year if it's working correctly, yet is totally
outside of your domain expertise. Nobody works directly with Ansible or EC2
instances enough to be an expert at it.

Like how every time I need to renew an SSL cert, I need to reread the man
pages for how to make certs and cert requests. You'd think a replicable
procedure could be documented, but there are just enough variables that it's
slightly different every time.
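That cert-request step is exactly the kind of thing worth writing down once. A sketch of a non-interactive key + CSR generation (the subject fields and file names are placeholders; real renewals vary in key size, SANs, and CA requirements, which is the "just enough variables" problem):

```shell
# Generate a fresh private key and certificate signing request (CSR)
# without the interactive openssl prompts.
workdir=$(mktemp -d)
openssl req -new -newkey rsa:2048 -nodes \
  -keyout "$workdir/renewal.key" \
  -out "$workdir/renewal.csr" \
  -subj "/CN=example.com/O=Example Org" 2>/dev/null

# Sanity-check the CSR before shipping it to the CA.
openssl req -in "$workdir/renewal.csr" -noout -verify
```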

------
yjftsjthsd-h
Definitely appreciate the prod-ready checklist; it's super useful to have a
sanity check that you didn't miss anything.

~~~
tarp
Here is their full checklist: [https://www.gruntwork.io/devops-
checklist/](https://www.gruntwork.io/devops-checklist/)

------
kgilpin
There is a simple way to make it much easier: use hosted platforms (e.g.,
PaaS) instead of self-maintained infrastructure.

Unfortunately, most enterprises are still not willing to outsource their IT
platform to the real pros and thereby make it someone else’s problem to keep
that infrastructure up and running. They want to build their own in-house
competency. But the problem is, if you’re a true DevOps pro, why would you
work for some random company in which IT is considered a cost center, rather
than working for either a hosted platform provider or a (for lack of a better
phrase) “independent provider of DevOps”?

IT and security managers are able to convince business executives that IT
infrastructure needs to be kept in-house, even though today’s typical
enterprise has no chance of actually building and maintaining an
infrastructure that is 1% as good as what they could rent from a dedicated
provider.

~~~
mrunkel
The problem with your theory is that even in hosted platforms, IT is a cost
center.

"The real pros" you speak of are also under the same cost pressures as
everyone else. I have encountered many such services where, after the "real
pros" have set up a system, they are replaced by low-cost staff who can
maintain and minimally extend it while billing...

------
w_t_payne
Is it just me or are we constantly repeating the same lessons, the same
messages, over and over and over again?

~~~
sigi45
We do and do again and do it again again.

And once you finally have it more or less done in your local environment /
company, you switch companies and suddenly you start again.

But you know, my salary still increases anyway so _shrug_ (still looking for a
solution)

~~~
w_t_payne
So if I made an open source solution that automates a lot of this ... I guess
it would help a lot of people.

------
fogetti
_The vast majority of developers don’t know what those details are_

Let me fix this for you: the vast majority of developers are actively denied
access to the infrastructure by their managers. They pretty much know
everything an infra eng knows, and often more -- after all, catchy
cutting-edge apps like Kubernetes/Elasticsearch/etc. were written by
developers.

Blame the managers and no-one else.

~~~
mancerayder
Are the developers-who-are-denied-access that you describe also on-call and
watching monitoring alerts for when something goes awry?

I've worked in companies where "DevOps" was thought to mean: "Devs have a lot
of access, even root/dba/etc. access, and while they're sleeping or if they
get stuck on something, they should punt it over half-finished or half-broken
to the infra eng / devops group"

^^ This is so incredibly common, especially in small/startup environments.
It's a nightmare I actively aim to fix when I join / consult at companies.

There's instability in the environment, inconsistencies, or just a mess, and
you parachute in there to fix it. Step 1: kick all devs off any systems where
they have too much access (unless they're special, knowledgeable ones who know
and understand what you're trying to do). Step 2: stabilize the system. Step 2
involves environment separation, real automation that more than one person can
use, setting up monitoring, and much, much more.

~~~
fogetti
> _while they 're sleeping or if they get stuck on something, they should punt
> it over half-finished or half-broken to the infra eng_

By the sound of it, the root cause of the problem is your setup. Dismantle the
infra eng team and make the dev team monitor their own apps.

From your explanation, devs should have more access and responsibilities.

You set up a half-baked process and you are blaming the devs for it? Not cool.

~~~
mancerayder
Neither I nor my dev and DevOps buddies have ever seen anything like what you
just suggested.

Dismantle infra eng and make devs do everything? I support that, for one
reason: a year from now I can charge consulting dollars when I come to fix
these unstable, poorly documented, environments with minimal automation and
HA.

~~~
fogetti
Well I did that in the startup I worked in previously. And no, they are not
gonna call you for your shitty service. They are pretty stable.

Guess what? We knew how to configure private clouds, load balancers, sharding,
and high availability. We pretty much knew everything about Kubernetes/Amazon
ECS et al. and related technologies, and we didn't have any problems with
monitoring or low-level networking either.

Anyone can learn this shit. There is nothing special about it.

Don't hold your breath for seeing your consulting dollars, LOL. :D

------
sulam
Like: the emphasis on tests, with real pointers to testing Terraform plans.
That's legit, and hard advice to come by.

Dislike: 300K lines of infrastructure code, but all he talked about was DevOps
issues? Surely he learned something about writing good code, too, and I hope
he didn't write 300K lines of Terraform code.

~~~
cataflam
> hope he didn't write 300K lines of Terraform code.

If you look at the company, their product is providing infrastructure code, it
was not a byproduct.

------
auslander
..., then replace all that Terraform code with CloudFormation templates. You
will get 5x less code, a native tool instead of a third-party v0.11 one, and
my respect :)

~~~
yjftsjthsd-h
You are the first person I've encountered who preferred cloudformation over
terraform. CF has far worse ability to cope with out-of-band changes or
problems in general (in contrast to TF, which can usually just `apply` things
back into compliance), and somehow gets feature support slower than TF in
spite of being 1st party.
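The drift-handling difference is also easy to script around with Terraform. A sketch (it assumes an already-initialized working directory with `terraform` on the PATH; the wrapper function is hypothetical):

```shell
# `terraform plan -detailed-exitcode` exits 0 when real infrastructure matches
# the code, 2 when something has drifted, and 1 on error -- which makes
# "apply things back into compliance" straightforward to automate in CI.
check_drift() {
  terraform plan -detailed-exitcode >/dev/null 2>&1
  case $? in
    0) echo "in sync" ;;
    2) echo "drift detected: run terraform apply" ;;
    *) echo "plan failed" ;;
  esac
}
```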

~~~
auslander
> just `apply` things back into compliance

If your things are wandering out of compliance by themselves, you have bigger
problems, imho.

~~~
ucarion
Likely the commenter is referring to development environments, where it is
common for infra to be put in bizarre states by actors performing manual
changes. Terraform alone does not constitute change control.

------
actionowl
> Make sure your team has the time to master these tools

What a luxury!

~~~
isodude
Really though, if you can't master the tools, maybe you should avoid them.

~~~
jacobr1
There are degrees here. It depends what kind of Scotsman we are talking about.
Having a general understanding of the uses of the tool, understanding what's
feasible in general, knowing some best-practice rules of thumb, and knowing
how to google for details gets you most of the way there. You could call that
mastery. It certainly requires greater expertise than "I read a blog post and
changed a few things until it seemed to work." But the bar is much lower than
being familiar with the source code, or even being able to use the tool for
common use cases without recourse to the documentation.

I think you do need to invest in getting over that initial learning curve such
that your tools aren't "magic." You have a conceptual mental model of what is
going on. And I agree that takes much more than a token effort. Yet for me,
that also stops short of mastery.

~~~
isodude
Well, it should at least be possible to become a master: "I know how this
works, but someone else did the grunt work."

I always aim for simple-to-use tools; if I need to, I can dig deeper and just
fix the problem. Or replace the tool on a whim.

So I agree with you.

Infrastructure code gets really tough as it scales, so keep your modules in
order, small, and easy.

------
pmiller2
Are there any good "worked examples" of infrastructure as code out there?
Something one can learn good practices from?

~~~
dom_hutton
The [https://github.com/gruntwork-io](https://github.com/gruntwork-io) &
[https://github.com/cloudposse](https://github.com/cloudposse) examples spring
to mind.

As for example implementations, review cloudposse in depth or take a look at
the [https://github.com/travis-ci/terraform-config](https://github.com/travis-
ci/terraform-config) repo.

This answer is skewed towards _infrastructure_ as code. Often conflated are
things such as configuration management & provisioning.

------
mattbillenstein
Several good tips in here, but slides 60-72 are probably the most valuable -
don't do anything by hand, find some tools and automate everything using those
tools.

------
deboflo
LOC is a very, very bad metric. How exactly is it counted? Does it include
mass refactoring? Is there a better metric? Yes: a long list of testimonials
from happy customers.

------
KaiserPro
One glaring omission is metrics.

In the production grade checklist, there is nothing mentioned about metrics.

In fact, there is no mention in the post of metrics or graphs at all.

This implies that everything is done via logs, which is just horrific at
scale.

Everything should emit metrics:

  o Hits per second? Metric.
  o Memory use? Metric.
  o Upstream service response time? Metric.
  o That new lib you wrote? Metrics.

This is especially important with microservices. OpenTracing is grand, but
that's for after you've found where the problem is. Your metrics should be
your single pane of glass that indicates the health and performance of your
system.

~~~
bsaul
I think it's called monitoring in the checklist.

------
dom_hutton
Lots of people in this thread are bashing LOC as a metric, which is fair. I'd
just like to point out that infrastructure code is incredibly verbose as is,
so the number is way overstated to start with.

It's a pretty good example of doctoring headline-worthy titles. IIRC the
author gave a talk of a similar name at HashiConf recently.

------
mjevans
Is there a version of this that is //just text// instead of a whole bunch of
images and other resources that don't load without scripts enabled?

~~~
brikis98
There's a transcript of the talk on the HashiCorp website:
[https://www.hashicorp.com/resources/lessons-
learned-300000-l...](https://www.hashicorp.com/resources/lessons-
learned-300000-lines-code)

------
itronitron
anyone care to take a stab at defining what the author means by
'infrastructure code' ?

~~~
superfrank
Infrastructure as code is starting to be a big thing. Let's say you want to
provision a new AWS EC2 instance. You could go into the UI and click around to
do it, but at a large company that's probably not the best idea: it's not
really scalable and leaves a lot of room for human error.

You could instead use something like terraform, which allows you to write code
specifying your requirements and then run that code. This allows other devs to
review your code, takes a lot of human error out of the equation, and is much
more sustainable.

When he says infrastructure code, I think he means code that keeps the
infrastructure running.

Some examples that come to mind would be Terraform files, Dockerfiles/docker-
compose files, Jenkinsfiles, bash scripts, etc. Basically, code that keeps the
servers running.
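A minimal Terraform file for the EC2 example above might look like this (the region, AMI ID, and names are placeholders, not a recommended configuration):

```hcl
provider "aws" {
  region = "us-east-1"
}

# One declarative resource block replaces the click-through provisioning
# in the AWS console -- and can be code-reviewed like any other change.
resource "aws_instance" "web" {
  ami           = "ami-0123456789abcdef0" # placeholder AMI ID
  instance_type = "t2.micro"

  tags {
    Name = "provisioned-via-code-review"
  }
}
```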

~~~
walshemj
So JCL, then?

The largest code base I worked on (a map-reduce-based billing system) also
had, for the time (early '80s), a fairly complex set of JCL that could compile
and build all the system's modules on dev and also push them out to the 15-16
or so live systems.

This also handled all the glue that held the map-reduce together.

~~~
scarface74
It’s not procedural. You write what you want your infrastructure to look like
-- in the case of CloudFormation, in YAML or JSON.

The first time you run your template it creates all of your infrastructure and
is usually smart enough to figure out dependencies.

After you make changes to your template and run it again it knows based on the
changes in your template whether it can modify the existing resources or
whether it needs to delete and recreate your resources.

~~~
jacobr1
Procedural certainly can be "Infrastructure as code." It just isn't the most
modern way to do it, due to additional complexity and the potential to be more
error prone. I'd certainly prefer CloudFormation over writing a bunch of
python/boto code, but it could be done.

Interestingly, Dockerfiles brought back a bunch of procedural configuration
management. We had migrated to Ansible for all our server-level configuration.
But as we've adopted Docker/containerization in recent years, simplifying our
applications (now separate containers, rather than services on common servers)
has reduced complexity to the point that simple Dockerfiles with `apt-get
install foo` are much preferred.

------
nwmcsween
The issue with configuration management is that it creates a DAG on top of a
build system that does the same, i.e., it's superfluous.

~~~
ezrast
Configuration management provides opinionated abstractions over inconsistent
and frequently user-unfriendly systems. You might as well say that C is
superfluous because it all maps down to assembly.

~~~
nwmcsween
I'm saying creating two DAGs is superfluous; FreeBSD, OpenBSD, NetBSD, etc.
all don't do this. If you want to abstract the CLI tools, why do it in a
non-portable, implementation-specific language?

------
justaaron
nice. it's rather on-point.

------
mehh
That you're writing too much code?

~~~
yjftsjthsd-h
Are you criticizing the abstractions, or amount of infrastructure?

~~~
mehh
That's code on top of frameworks, to do roughly similar things. As a software
engineer I wouldn't be happy writing, and definitely not maintaining, so much
code.

His points are fine, but shocker: they're the same principles that apply to
all code. Infra isn't that special; I thought we established that years ago.

Also, starting by boasting about the number of lines of code you have written
to achieve something is asking for trouble.

It feels like many in the 'devops' community who are from an ops background
are rediscovering software principles; this chap isn't alone in that!

~~~
scarface74
Infrastructure is special - if I make a change to my code, it usually won’t
kill a whole database, load balancers, knock out connectivity, etc.

~~~
mehh
A load balancer is not much use if the application on the other side is borked
because the application code is faulty; I don't really see your point.

What I do see is 'devops' typically changing code in critical areas without
taking sufficient care, but that's more a cultural thing than infra code being
special.

~~~
scarface74
If you make a minor mistake in application code, it usually doesn’t affect the
whole site. Besides, when you are deploying code on a group of servers,
hopefully you have sense enough to at least do a rolling deployment and are
using automated health checks to make sure that your whole site isn’t down
during a deployment.

~~~
meh2frdf
Nor does my infra code, because I write tests and have a pre-prod env to test
it in, in a controlled CI approach.

