Don't Make My Mistakes: Common Infrastructure Errors I've Made (matduggan.com)
136 points by todsacerdoti on Dec 4, 2021 | 36 comments



I was hoping this would've been a more surprising list, but it was titled 'common'.

> Don't migrate an application from the datacenter to the cloud

Good advice that follows the pattern of "don't do a rewrite with a plan to cut over". Instead run both and gradually transition to the newer one.

> Don't write your own secrets system

Follow-on to 'don't do your own crypto'

> Don't run your own Kubernetes cluster

> Are you a Fortune 100 company? If no, then don't do it.

> Don't Design for Multiple Cloud Providers

This seems like an extension of YAGNI. If you're not using it, don't build it. If you really will need it, use it as you're building it. A good example is if you're doing multi-datacentre redundancy using standard tech, it might be worthwhile to do multi-cloud. But have an actual threat model in mind that this mitigates, e.g. cloud provider 'A' may become a conflict-of-interest competitor in the foreseeable future.

> Don't let alerts grow unbounded

> Don't write internal cli tools in python

> Nobody knows how to correctly install and package Python apps. If you write an internal tool in Python, it either needs to be totally portable or just write it in Go or Rust. Save yourself a lot of heartache as people struggle to install the right thing.

+1 This one rarely gets mentioned.

One that I would add, that I haven't experienced but can certainly foresee, is "don't switch major datastores". Switching between MySQL and PostgreSQL might be rough but doable. Switching between MySQL and CockroachDB for a large, heavily-used app could stall new feature development for a long time or make everything take 5x longer. The reason isn't the query syntax, it's the query characteristics. Using a standard RDBMS gives you the ability to do many relatively quick round-trip queries (and though you should avoid N+1s, some can be tolerated). A distributed DB can have high throughput but will have high latency for simple queries.


I get what you mean with the Python tools, but I've had a good experience distributing tools with an internal pypi and pipx. Then users do "pipx install <package>" from the README, voila, commands work.
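For example, something like this (the index URL and package name here are made up):

    pipx install --index-url https://pypi.internal.example.com/simple mytool
    mytool --help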

At that point I think it's pretty competitive with "go get". In both cases users have to install one thing, then use it to get your thing.

Haven't seen an internal rust tool yet.


IMHO the real answer isn't to use "go get" instead of "pipx install"; the answer is to use Go to build statically-linked binaries and then to distribute those binaries with an OS-level, language-agnostic package manager (e.g. Homebrew or apt-get).
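Roughly this shape, with a made-up "mytool" standing in for the real tool:

    # build a fully static Linux binary (no libc dependency)
    CGO_ENABLED=0 GOOS=linux GOARCH=amd64 go build -o mytool .

    # publish it to an internal apt repo or Homebrew tap, then users
    # install it the same way they install everything else:
    sudo apt-get install mytool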


Why do you need a package manager if you have a statically linked executable? Can't you just save it in ~/bin and have that in PATH? ("Users who can't do that shouldn't be using a CLI" would be the argument, I guess.)


So that people can install and update the tools via the same habits they’ve already learned.

> Users who can’t do that shouldn’t be using a CLI

Users who can do that will still find it an odd departure from convention.


I second having an internal PyPI. Not requiring your customers to build your platform wheels for you simplifies Python distribution quite a bit.

It'll never be as clean as a static binary build, but it saves us from having to build out two language ecosystems when the rest of the company uses Python for everything.


And when your client has a slightly different Python install, or some other library that depends on an older, incompatible version of a lib... then what?

Python is awful for this stuff


Don't disagree. It's a problem. As the parent poster and others have mentioned, you can use something like pipx or a bare virtualenv. At least once each, I've deployed via dh-virtualenv, rpmvenv, pyinstaller, or cx_freeze, or tested for compatibility with the system Python when I could control the client machines.
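For reference, the two lightest-weight of those routes, with a hypothetical "mytool" package:

    pipx install mytool               # isolated venv per tool, binary on PATH
    pyinstaller --onefile mytool.py   # single self-contained executable in dist/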

It's never fun. It's never pleasant. But to be fair, if I have a CLI tool that needs a deployed SSH client, or Tensorflow, or SDL or Qt or something else, I'm not convinced packaging gets much easier no matter what language we're talking about. If your use case is simple, Python is easy enough to deploy, and Go is even easier. If you can't disable CGO or need a third party component, I imagine the fun is just getting started anyway.

As a counterpoint, a while back I discovered that Golang had a minimum kernel version requirement. That pretty much eliminated it as a possibility for writing tools for legacy systems. Python was viable though, Bash more so. :) Couldn't tell you if that's still a requirement for Go today.


All good comments. I should have noted that my comment was not about Python (with which I have little real-world experience) and was meant to apply more generally: utility or in-house tooling should be as easy to install and use as possible. Since it's not the main focus of development, it will bitrot, so anything you can do to guard against that is worthwhile, and a single compiled Go binary is hard to beat in that regard.


> Don't let alerts grow unbounded

This seems an infrastructure extension of treating compile warnings as errors.


Which is probably a mistake when some warnings are compiler bugs and some are difficult-to-impossible to fix. Or if you don't control the compiler being used.


That's why we have 'ignore' type annotations/pragmas/etc.


"Page an engineer for any blip in the radar" is the equivalent of treating compile warnings as error.


If you as a company refuse to ignore the warnings, your only choice is to drive them down to as low a level as possible, which maximizes the value of future warnings.


> don't switch major datastores

Agree with you strongly. Also by committing to a database you can take advantage of things it does for you rather than trying to stay "generic" and portable. Pick a database (I like Postgres), and then wring everything you can out of it.


Aside from the alert point, everything else is basically "I don't know how to do it properly so you shouldn't do it either":

> Don't migrate an application from the datacenter to the cloud [..] Instead port the application to the cloud.

You can certainly do it without an app rewrite if you have good infrastructure engineers with the support of a small, competent dev migration team. The real question is when and why you should do it. One valid scenario: you want to sell the app, and an AWS setup is a lot more appealing to a buyer than a custom self-owned or colocated setup.

> maybe even doing something terrible like connecting your datacenter to AWS with direct connect in an attempt to bridge the two environments seamlessly

You can certainly do that too, and AWS can be used as a cold disaster recovery option.

> Don't write your own secrets system [...] how do you keep from hitting this service a million times an hour but still maintain some concept of using the server as the source of truth

Simple: you get the secrets very rarely, at deployment. If you need to change anything, you redeploy that part, which should be very easy and fast if your orchestration/config management is in top shape. Why would you do it? To avoid paying the Vault enterprise license and still get a highly available, version-controlled, and even simpler and more stable service.
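A sketch of what that deploy-time fetch can look like (host, paths, and service name all invented):

    # at deploy time: fetch the encrypted bundle for this service/environment
    curl -fsSO https://secrets.internal.example.com/myapp/prod.env.gpg
    gpg --decrypt prod.env.gpg > /etc/myapp/env
    systemctl restart myapp   # rotating a secret = rerunning this step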

> Don't write internal cli tools in python [...] Nobody knows how to correctly install and package Python apps.

This one is a particularly sad snapshot of the state of industry expertise. Debian packages solve this easily and completely. Oh, you don't understand your distro's packaging system? Stop wasting time on blog posts and start reading documentation!
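And it really is easy: a minimal binary package is just a control file plus the files to install, something like this sketch (tool name invented):

    # layout: DEBIAN/control plus the files to be installed
    mkdir -p mytool_1.0.0/DEBIAN mytool_1.0.0/usr/bin
    cp mytool mytool_1.0.0/usr/bin/
    # DEBIAN/control needs Package, Version, Architecture,
    # Maintainer and Description fields
    dpkg-deb --build mytool_1.0.0
    sudo dpkg -i mytool_1.0.0.deb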


I've seen Python CLI apps for internal tooling be very maintainable, and setting up a Makefile with an "install" task makes it easy for everyone on the team to get started.
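Something along these lines, assuming a hypothetical "mytool" console script (recipe lines are tab-indented):

    install:
    	python3 -m venv .venv
    	.venv/bin/pip install .
    	ln -sf $(CURDIR)/.venv/bin/mytool ~/bin/mytool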

> Simple, you get the secrets very rarely: at deployment.

Do you mean writing custom scripts to get secrets? Where do you store secrets in such a scenario? What if you need to change secrets at runtime?

Vault is nice in that you can fine grain access:

- ACLs which allow some team members to be responsible for setting secrets and other team members for using them,

- temporary credentials for well-known DBMSs, which helps protect against credential leaks via logs,

- PKI Management,

- temporary SSH keys.

And more. Not sure how you'd get that with a deploy time script.


> Where do you store secrets in such a scenario?

Anywhere, basically; an HTTP server with directory listing is enough. You don't need to roll your own security management, you can just encrypt secrets with ssh or pgp keys.
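i.e. the publishing side is roughly this (recipient and paths invented):

    # encrypt for the deploy key's owner and publish; the plain
    # HTTP directory listing is the whole "API"
    gpg --encrypt --recipient deploy@example.com prod.env
    cp prod.env.gpg /var/www/secrets/myapp/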

> What if you need to change secrets at runtime

I certainly don't need that as an infrastructure engineer. If you want that for development, be my guest: pay the exorbitant Vault license and don't come crying to me when Vault is down. Vault is always down; you just introduced a critical runtime dependency on an immature solution. Your problem, and I have the email to prove you took responsibility for this decision against my recommendation.


You can totally have multiple secret sources: one for development that takes secrets from a plain-text local file and one for production that takes secrets from Vault. I don't see an issue here.

Also, what's immature about Vault?


Have you used it in production at a decent scale for a significant amount of time? It's an SLA killer.


> Oh, you don't understand your distro packaging system? Stop wasting time on blog posts and start reading documentation!

I had read them, and I'd do anything just to not have to deal with that mess. I believe there should be special people with the proper mental constitution and a high salary to deal with it. Not me.


I don't want anybody to get the wrong idea from your post. Linux package managers solved the software distribution problem 20 years ago and are the most stable, feature-rich and elegant way to ship into production. You have reproducible builds, dependency handling, clean removal, rollback, exception handling, logs, automated upgrades, etc.

The problem is complex, so the solution is not one YouTube view away from being understood. Deb packages helped Debian ship tens of thousands of software packages with frequent updates for years and years with few maintainers.

I am sorry to say this, but calling it a mess says a lot more about you than about the package formats.


You said it yourself: they solve a damn great heap of problems. The heap is so great that it cannot be explained clearly. You forgot to say that it was not a single solution but a long, painful road; not every solution you mentioned was available 20 years ago.

So it is a mess of a lot of solutions condensed over a period of 30 years.

I did look into two such package management systems (deb and ebuild), and I'm happy to use either. To keep my system up to date they are fine, but not for anything else. It takes a lot of domain-specific knowledge to make a package, knowledge that is useless outside the world of packaging. Knowledge deteriorates when not used, so every time feels like the first time. Grr.

> I am sorry to say this, but calling it a mess says a lot more about you than about the package formats.

You shouldn't feel sorry for saying it. I said it already, though in different words, but it is essentially the same idea: there should be special people to make .deb packages.

Perhaps you were trying to say that it is me who is special in that regard? May I advise you to check your intuition? How many developers bother with preparing .deb packages? Is it 10% or 90%? How many GitHub repos contain a .deb? Most developers who don't prepare .deb packages are like me: they would use Flatpak, or some other format that allows capturing an environment easily. But they wouldn't use deb.


I agree with your facts, I disagree with your perspective and I respect your good faith.

As such I will stop arguing and offer something that you might find useful in the future: the Arch Linux PKGBUILD (https://wiki.archlinux.org/title/PKGBUILD) - an order of magnitude simpler than debs. You can learn it in a day and you can use it on any distribution.
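To give a flavour of the size, a minimal PKGBUILD for shipping a prebuilt binary might look like this sketch ("mytool" is a placeholder):

    pkgname=mytool
    pkgver=1.0.0
    pkgrel=1
    arch=('x86_64')
    source=("mytool")
    sha256sums=('SKIP')

    package() {
        # install the binary as /usr/bin/mytool with mode 755
        install -Dm755 "$srcdir/mytool" "$pkgdir/usr/bin/mytool"
    }

Build and install it with "makepkg -si".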


I prefer cargo. It works with Rust only and has no great heap of solved problems, but it just works. In any case, I stopped doing system-wide installs of packages from outside my distro's official repository. No alien packages, no overlays. If I need something that is not in Portage, I build it myself and install it locally into ~/.local. This way is even better with cargo.
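e.g.:

    # binaries land in ~/.local/bin; nothing touches the system
    cargo install --root ~/.local ripgrep    # from crates.io
    cargo install --root ~/.local --path .   # from a local checkout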


I agree with this, with one big exception: if you are a SaaS company and want enterprise clients you need to be multi-cloud and able to silo customer data in flexible providers and regions.

At $previousjob I watched in glee as a conference room full of sales people smugly told us the whole product was "cloud native" on AWS, and then the realization slowly spread from face to face that they were pitching Amazon's biggest competitor and we wouldn't allow our data to reside on their systems.



Great advice.

There’s a temptation, when the sun is shining, to make extra work for oneself. Let’s add a compatibility layer around our app so we can port it from AWS to GCE on a whim.

Then the storms come and you realise the last thing you want, when under pressure, is the added complexity in your business logic. Oof.

On the final point, pip install . and a tiny setup.py have worked really well for me. Putting these all in one place and having a house style is nice too. There's probably even a debhelper to turn them into native packages, though again, that seems like extra make-work compared to just throwing your junk onto automatically provisioned production hosts. Vive l'/opt!
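i.e. something like this, with a made-up "mytool" project:

    # one venv per tool under /opt, then a symlink onto the PATH
    python3 -m venv /opt/mytool
    /opt/mytool/bin/pip install .
    ln -s /opt/mytool/bin/mytool /usr/local/bin/mytool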


One thing that I wish was mentioned in this article is that Kubernetes already provides a pretty good compatibility layer between cloud providers.

If you don't completely bake your app into AWS by using the AWS SDK all over the place or using a database that only exists in AWS or something, moving individual apps is just not that bad. You still gotta solve the cross-cutting things like logging and metrics, but you gotta do that anyway, and that shouldn't (!) require code changes to your app.

To be fair, that's all moot if you're not using Kubernetes in the first place. As well, things like EKS pod roles add great value that you'll have to sacrifice to truly call your app "portable".


What does "pip" resolve to in your suggested solution?


> Don't Design for Multiple Cloud Providers

Some industries legally require you to plan for the case where a cloud provider cuts you off. So sometimes you don't have a choice.


What are some actual examples of industries (and in which geographies) that have multi-cloud as a requirement?


I'm not sure on the legal particulars, but it was mentioned when I worked at JP Morgan Chase. I think most of it revolved around vendor management and vendor relations. It wasn't just AWS--for such a big business, a single provider anywhere represented a lot of risk.

In those cases, it wasn't "This app needs to be AWS/GCP active/active". It usually meant: if you're building out core functionality to use AWS, you should build out on one of GCP, Azure, etc, too so there's an alternative available. The long term strategy was always to just stick things where they were cheapest to run (on prem, AWS, etc)

(So the same thing came up with physical equipment like Dell and HP servers)

From the business side, there was also concern about Amazon becoming a competitor in the Fintech space but that wasn't regulatory related


On the one hand, "don't add an additional layer unless it's clearly and instantly beneficial" is good advice. You can do this at any point in time. On the other hand, "allow people to use any feature on AWS they ask for" is wrong reasoning. When you allow this you get a ton of dependencies that are not required and that in turn make your system hard to maintain and sometimes extremely costly. Keeping your dependencies controlled is always good advice. The real problem is how to do this without creating internal tools that become costly dependencies of their own.


> On the other hand, "allow people to use any feature on AWS they ask for" is wrong reasoning. When you allow this you get a ton of dependencies that are not required and that in turn make your system hard to maintain and sometimes extremely costly.

I agree with the spirit of what you wrote, but I think that not allowing it is ultimately even worse than the unnecessary dependencies.

Limiting your dependencies should be a choice the developers themselves are meant to make, otherwise you risk estrangement and having unmotivated workers that don't feel responsible for their own work. That ends up costing more in the long run I think.


> Don't run your own Kubernetes cluster

Note that there is one big Kubernetes consultancy (namely, Flant) that requires you to run your own Kubernetes (managed by them) rather than the managed offering from a cloud provider. They do it because they know how to run Kubernetes and don't know the quirks of a zillion managed Kubernetes providers.




