
Computers Can Be Understood - miked85
https://blog.nelhage.com/post/computers-can-be-understood/
======
m0ther
I am a senior software architect. My job is to balance performance against
complexity.

If my system is slow, it's my fault.

If debugging or expanding the system is too difficult, it's my fault.

If someone wants to know how the system or business works in depth, I am the
one that they should come to.

I spend the majority of my time chasing down, enforcing, and simplifying the
universal theory of our business (the core of our software solution).

The universal theory of our business is a living collection of concepts,
designed to accurately model non-virtual concerns in virtual space. If there
are too many edge cases, it is a sign that the universal theory is inaccurate,
or not robust enough in some areas. If there are too many bugs, it is a sign
that the universal theory has not been communicated or enforced well enough,
is inaccurate, or is too complex in some areas. As our business grows, or our
understanding of the business expands, the universal theory will change; at
times dramatically. Malleability (the ability to adapt our software to these
changes effectively and efficiently) is one of my top two concerns; the other
is latency (how long it takes for any one request to get a complete response).

The theory shrinks and becomes better documented as it evolves; the goal is to
move from describing behaviors as correlation to describing them as accurate
causation. To fill in the blanks.

I should mention here that this does not mean every line of code in my
projects is easy to understand. Writing a fast system of high complexity
requires at least some components that are written exclusively for the
computer's benefit (that is, highly optimized and inherently difficult to
read). These components should be written with clear documentation, clearly
defined public members, written discussions of why it works the way it does
(and common ways to accidentally break it), redundant ownership, and regular
auditing to ensure code rot is avoided.

I have yet to meet another architect that sees their job the way I see mine.

If the system you're working on defies complete understanding, you can
probably blame your architect.

~~~
gen220
Thank you for a beautiful description of your philosophy, and your philosophy
itself. I wholeheartedly agree with you, and see things the same way; though I
haven't seen it expressed so clearly before.

I'm curious – and a little saddened – that you haven't met anyone else
thinking this way?

Anecdotally, there are maybe 10 out of 200 engineers at my current company who
demonstrably take this "Quality of Craft" part seriously and explicitly: they
think in a similar way to you.

It's a rare set of skills: empathy for future humans, understanding how the
universal theory is likely to evolve, caring enough about it to enforce it
consistently, and effectively actuating on all of these thoughts and emotions.
Those add up to somebody special, I suppose.

Anyways, I think I'll link to this comment many times in the future; thank you
again.

~~~
m0ther
I am not aware of a standard (and in depth) definition of what a software
architect is; other than it being the next step after team lead, and the last
step before pure management (unless your company offers a research track -
which nearly no one does). The interpretations I've run into are pretty
varied. My way is certainly not the only way, and my way is an amalgamation of
aspects of others that I admired.

For reference, I'm 36, have been programming professionally for 20 years, and
have been an architect for 10. I seek out difficult projects.

Here are some (likely idealized) descriptions of my favorite past architects:

"Never say no"

My first architect took immense pride that at 40 something, he had never had
to say no to a product manager. We were going to build whatever the business
wanted, with no push-back (only what he called lateral guidance, or "yes,
and"). The business would arrive with specifications, and he'd treat them as
if they were rough drafts; "OK, let's get to work". He'd disappear into
meetings with the business, and emerge days later with a nearly completely
rewritten specification (which he had written himself and was usually more
feature rich than the original), and everyone was happy. He explained to me
after a few years that most bugs and development problems come from the blind-
spot the business has for how software works, and the blind-spot development
has for the business. A specification with no blind-spots is much easier to
turn into software. Every project I worked with him on went smoothly, and was
delivered on time. I straight up stole and expanded his technique as the
foundation of my architecture philosophy.

"Make me one"

I worked with an architect that insisted everything be built in-house (no
unnecessary external dependencies). He'd look at the features of some other
framework or tool, copy the bits he liked out of their documentation (as if
they were requirements), and turn to us and say "now we're going to build our
own". I loved this guy, he pushed me so far outside of my comfort zone that
every day was a frantic adventure. "Write me a sketch; you have 4 hours". He
didn't like whiteboards, he liked code and rapid iteration. He was brutally
honest and ripped my code apart at least twice a day. We'd play code tennis
for a couple of days, then a polish phase before QA. I was in hyper-focus for
8 hours every work day, and the practice at rapidly building (and rebuilding
and rebuilding) all sorts of difficult components (that all made it to
production) made me fearless. I lost that job to the 2008 bubble burst, but
the year I spent there easily advanced my skill set by 5 years.

"If it's not fast it's useless."

I worked with an architect who came from systems programming. He had an
incredible resume. He spent a lot of time refactoring for performance. If he
liked you, he'd explain what he did to your code and why; if not, that code
became his now. He spent a lot of his time in instrumentation, testing this
solution against that solution for execution time and memory usage, merging
the best bits, and testing against a different approach. His favorite phrase
was "prove it" (and I spent a lot of down time trying to prove things to him
through examples and instrumentation). I learned a ton in an effort not to
disappoint him. It was no longer enough to know how to do something; I had to
know many ways to do it, and which performed better in what scenario. "What,
as an individual programmer, do you bring to the table if not performance? If
your solution doesn't perform, you let everyone else down." He dramatically
and permanently changed the way I code and the way I design applications.

"If it's not predictable I hate it."

I worked with an architect whose main focus was the application as it ran in
production. He spent a significant amount of his time combing through debug
logs and recreating log messages locally to understand what was going on,
tracking down run-time issues that customers called in with, building
instrumentation and internal tools, looking at characteristics of the
application as it ran with network operations, and treating the production
application as a living thing. I learned a ton from him; he changed the way I
look at applications.

I've also worked with a number of architects who I did not like, and who I
learned nothing from. Among them:

- The guy that misinterpreted the book Clean Code and turned the code-base
into a ridiculous mess

- The guy who made decisions like we were Google, when we most certainly were
not

- The "lead by Lint" guy

- The design by committee guy

- The "non technical" architect

- The "I rebuilt most of the application last night" guy who just made it
worse

- The "I read it in a blog so it must be both true and universal" guy

~~~
LeifCarrotson
I love it! I wish I could have some of those difficult projects. I've sadly
been wearing down my curiosity and idealism in favor of pragmatism and economy
for the last few years.

How do you find work where this quality of code is valued? I can produce an
entire machine in a couple hundred hours of work that will make my client
money for a decade, until the product line is cancelled and the machine, along
with the time and money I put into it, is depreciated. In that couple hundred hours,
there's some time for problem modeling and architecture considerations at the
start, and usually at least one or two really hard problems where nothing off-
the-shelf will work and you have to "make yourself one" that might take a few
days each, but the vast majority of the time is just taking in business logic,
translating it to code, and testing/documenting.

I treasure the times I get to solve an interesting problem or can develop a
system that's beautiful and accurately models the system required. But the sad
truth that I'm coming to understand is that in a non-software company you can
get a lot done for not a lot of money by duct-taping together purchased
components and copy-pasting a simple, versatile component a couple times. I
believe there are businesses where this pragmatic approach won't work -
there's not much done in the space yet, the project will live long enough and
evolve so it needs to support rework, and you're somewhere between Google and
a one-hit wonder. But industrial automation is not one of those businesses.

------
empath75
What shifted my career was actually an almost opposite revelation. I'm a self-
taught programmer who never went to college, and bounced around in entry level
sort of "IT jobs" like support for close to a decade -- and I assumed that
"real" software engineers and sysadmins were geniuses who did understand
everything their systems were doing on a deep level and just typed out perfect
code off the top of their head.

Then I went to a large tech company, and got some attention for hacking away
at a few bash scripts to automate things, got promoted to a sys admin job, and
realized that hardly anyone understood systems, or anything at all on a deep
level, and that everyone was just hacking away like I was.

What I've eventually realized is that while, yes, everything in a complex
system can in principle be understood, human beings have limited
bandwidth and memory, and it will absolutely paralyze you if you wait
until you understand something thoroughly before diving in to work on it. I
jumped into AWS without understanding it, Chef without understanding it,
Jenkins without understanding it, and Kubernetes without understanding it,
and every time I did that, my salary went up by 20-50%, to the point where I'm
making something like 3-4 times what I was making 5 years ago.

There are two reasons to spend the time to learn something on the job:

1) You need to learn it to do the task that you're working on right now.
Whether it's troubleshooting a problem or trying to install a new system. If
you're trying to get a kubernetes cluster running, you absolutely do _not_
need to know all the details of how docker containers work. While it's
interesting, and it will probably be necessarily to learn it eventually, it
would be a waste of time to do that before just getting started deploying
kubernetes.

2) Satisfying your own personal curiosity -- sometimes things are just
inherently interesting to you, and it's usually worth chasing those things
down, I've found. It may not be immediately applicable, but usually I'll
figure out how to apply it somewhere.

~~~
jcranmer
> 2) Satisfying your own personal curiosity -- sometimes things are just
> inherently interesting to you, and it's usually worth chasing those things
> down, I've found. It may not be immediately applicable, but usually I'll
> figure out how to apply it somewhere.

A few months ago, while searching for a personal project, I stumbled across a
mention that the performance events on Linux can be used to set hardware
watchpoints. Literally the very next day at work, we had a problem where we
needed a watchpoint, but gdb perturbed the timing too much for the bug to occur,
so I was able to fall back on perf to confirm the stack trace of the site
corrupting memory.
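
As a rough sketch of what that perf invocation looks like (the address, size,
and options below are invented placeholders, not details from the original
debugging session -- in practice you'd pull the variable's address from a
debugger or map file):

```python
# Sketch: using Linux `perf record` to program a hardware watchpoint
# (a "mem" breakpoint event) on an address, recording the stack of
# whoever writes to it. We only build the command here, not run it.
import shlex

def perf_watchpoint_cmd(addr, size=8, access="w", pid=None):
    """Build a `perf record` command watching `size` bytes at `addr`.
    `access` is "r", "w", or "rw"; `pid` optionally scopes to one process."""
    event = f"mem:{addr:#x}/{size}:{access}"
    cmd = ["perf", "record", "-e", event, "--call-graph", "dwarf"]
    if pid is not None:
        cmd += ["-p", str(pid)]
    return cmd

print(shlex.join(perf_watchpoint_cmd(0x7f0000001000, pid=1234)))
```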

Debugging is ultimately about hypothesizing which part of the system is
failing and why, so understanding more of how your system works is likely to
be beneficial towards future debugging sessions.

------
umvi
Counterexample:

Convolutional Neural Networks (CNNs) (or any ML model, really)

We still don't fully understand how or why they work so well. If you have
software that queries a CNN and the CNN returns some prediction, there's
absolutely no way of knowing or understanding the reasoning behind the
decision. It's a black box of magical layers and weights that have been
"trained" to make good predictions.

If you have fraud detection software using ML and it flags something as fraud,
that's the extent of the information you get. In this case the computer cannot
be understood - it cannot explain to you how it arrived at the conclusion that
something is fraud, nor can you explain how the computer arrived at the
decision. It just has huge tables of weights and biases that magically "know"
when something is fraud.
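
To make the point concrete, here is a toy model, not any real fraud system,
with made-up weights: the entire "reasoning" is a handful of opaque numbers,
and all you get back is the verdict.

```python
# Toy illustration: a "trained" model is just a table of weights,
# and nothing in that table reads as a reason. Weights are invented.
import math

weights = [0.83, -1.92, 2.47]   # opaque numbers "learned" from data
bias = -0.51

def flag_as_fraud(features):
    # features, e.g. [amount_norm, distance_from_home_norm, merchant_risk]
    z = sum(w * x for w, x in zip(weights, features)) + bias
    score = 1 / (1 + math.exp(-z))   # logistic squashing to (0, 1)
    return score > 0.5               # the verdict is all you get

print(flag_as_fraud([0.2, 0.9, 1.0]))
```

A real CNN has millions of such weights across many layers, which only makes
the table bigger, not more explicable.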

~~~
hinkley
I love my bank but hate their fraud detection (someone suggested last time I
brought this up that it might be outsourced).

I went way across town to save a couple hundred bucks on a new TV. By the time
I got home they had flagged my card. Oh, yeah, TV. I should call them and tell
them it was me.

No. They flagged it because I spent <$10 at a carwash on the way home.

My old bank would flag me for buying gas on a road trip I took 3 times, every
time. When your bank thinks that going out of town is exceptional behavior,
well, it feels a little judgemental.

~~~
cellularmitosis
Back when I bought a 250cc motorcycle, my card was constantly getting flagged
(lots of $1.50 gas stops ~50 miles from home). It took months for this to let
up.

------
mjw1007
I think this is a good attitude.

I wish that people who publish advice on writing documentation would put more
emphasis on writing for people who have this attitude.

I see too much emphasis on providing usage examples, and on "write for a
reader who is trying to get something specific done", rather than explaining
the system for someone who wishes to understand it.

~~~
api
The emphasis on getting things done vs. understanding is another manifestation
of our age's extreme time poverty. People don't have time to understand
anything.

Add to that the fact that computing is so trendy. Take the time to learn
something -- web development, mobile development, the latest trendy language
-- and it won't be hip anymore in 5-10 years.

As an experienced programmer I only take the time to deeply understand things
that my intuition tells me will be around for a while and for which
understanding is a major benefit to me. My intuition is guided by a mix of
market share, speed of change in the target market, and robustness of the
thing in question.

~~~
EvanAnderson
It's another manifestation of the classic "it's expensive to be poor"
situation. My personal struggle is with making the call re: understanding vs.
getting things done compared with the return on investment. Understanding
things is fun, though, and our systems are built upon enough analogy that
understanding usually transfers (if only in part) to something else down the
road.

~~~
thisisnico
Realistically it depends on how invested I am in the system. If it's something
that I'm using all-the-time, it absolutely makes sense for me to have a deeper
understanding because that understanding could save me time and give me a
better perspective on why certain aspects of the system are the way they are.
If it's something I use only as an auxiliary system, I don't invest as much
time in understanding the system, only what the system can do for me.

A prime example of understanding how to use the system, but not the system
itself is e-commerce. There are 17+ year old kids making 100K+ a month with
e-commerce, having no programming skills or a deep understanding of the web or
computers. They are experts at the tools they use, but honestly are not
"computer" people. We've abstracted e-commerce to the point that it's not even
necessary to deeply understand the web, with tools that exist now (Shopify
etc.)

------
troughway
From the post:

"My advice for a practical upshot from this post would be: cultivate a deep
sense of curiosity about the systems you work with. Ask questions about how
they work, why they work that way, and how they were built."

This is the poignant thing about it all. The individual for whom this post is
written does not need to be told to cultivate this. The individual who does
need to be told, on the other hand, is unlikely to.

In some ways the post reads as a "keep doing what you are doing" reminder.

~~~
hinkley
Sometimes, it's good for people who engage in persuasive speaking to talk to
each other and compare notes.

"Preaching to the choir" is known in some circles as rehearsing.

Similarly, I have owned a couple of books I got almost nothing out of, other
than something I can hand to people who ask a lot of questions. They're good
books to own. One is quite beat up at this point, but worth it.

------
mwcampbell
I’d like to add another pitfall: obsessing over optimizing or eliminating
intermediate layers when one should be focusing on meeting business
requirements in higher layers. I’m prone to this one myself. This one is tough
in the current move-fast-and-disrupt-things environment of startups, or even
teams within larger companies that are trying to be nimble like a startup.

~~~
daze42
This is the hardest for me. It's almost like those two forces are in
opposition to each other. I'd love to find an industry where they could align.

~~~
chubot
Yeah, I think there is a tension there. If you care about systems and internal
quality and want to get paid for it, I'd suggest software performance or
security as promising fields.

Performance and security are both cross-cutting concerns that span layers,
i.e. you're not just thinking at the application level. It really is a
separate kind of thinking and a separate kind of work.

Though it seems that it's mostly large companies (e.g. "big tech" and a few
other places) that care enough about performance and security to have dedicated
staff for them. Most other shops are too busy with "business stuff", and they
may reasonably want to outsource that work to experienced consultants.

------
bitwize
Again -- "Walk into the fire, Morty. Embrace the suck."

[https://news.ycombinator.com/item?id=22601623](https://news.ycombinator.com/item?id=22601623)

------
adrianmonk
I agree, and I vastly prefer this mindset.

But saying "computers can be understood" is a bit like saying "loans can be
paid off". Both absolutely true in principle. In practice, much to my chagrin,
it isn't always feasible.

I've had to learn the art of strategically choosing which things shall remain
black boxes for the time being. My curiosity doesn't disappear, but I can't go
doing a depth-first search through all knowledge all the time.

~~~
outworlder
> I can't go doing a depth-first search through all knowledge all the time

No-one can, not even the author.

What we have to understand is:

1) Which components exist. Can't understand what you don't know exists.

2) How those components interface, at a high level.

Once you know which black boxes exist and how they are interconnected, you may
then choose to open them.

For instance, we know that a keypress is detected by the keyboard and sent to
the computer, and it will eventually appear somewhere (graphical terminal?).
Do we need to know how the wiring works? Only if we want to understand why
multiple simultaneous keypresses don't always work. Do you need to understand
how information is encoded? Or how the USB protocol works (assuming it's USB)?
Or what keyboard interrupts are, and so on? You'll end up learning about font
rendering starting from a keypress.

You may not need to know how these things work, but it's useful to know they
are there, so you can easily find information when you need to. You are more
likely to worry about terminal escape codes rather than keyboard wiring, but
still.

That goes for anything else. Do you need to worry how compilers work? Maybe
not. But a high level understanding is important, so you won't immediately say
stuff like "language X is faster than Y because it's compiled". Yeaahh, sure,
it's possible that it is true. Since we know what compilers do, and they are
not a magical "get faster code box", we'll probably say... "it depends, I
don't know. Let's benchmark/disassemble/etc".
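
A minimal version of that "let's benchmark it" reflex, measuring two ways of
building a list instead of arguing from first principles (the workload and
repeat counts here are arbitrary):

```python
# Compare two equivalent implementations empirically rather than by folklore.
import timeit

def with_append(n=1000):
    out = []
    for i in range(n):
        out.append(i * i)
    return out

def with_comprehension(n=1000):
    return [i * i for i in range(n)]

t_append = timeit.timeit(with_append, number=200)
t_comp = timeit.timeit(with_comprehension, number=200)
print(f"append: {t_append:.4f}s  comprehension: {t_comp:.4f}s")
```

Which one wins can vary by interpreter version and workload, which is exactly
why you measure.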

------
gpanders
> A belief in the understandability of software systems can very easily become
> a need to understand the systems you work with.

I relate to this point especially, as this is how I often am as well. I do a
lot of work with FPGAs and embedded systems, and the tools offered by the
vendors in this space are often huge and complex in an attempt to create an
easy "happy path" for hardware engineers who don't want to/can't learn
software engineering and just want to get something working.

I, however, stubbornly chose to eschew these tools in favor of "doing it from
scratch". In the end I think it was 100% worth it as this process deepened my
understanding of the field and has left me with some valuable skills, but the
end result was that it took me weeks to do something that another colleague
could have done in a day or two, maybe less.

I understand the underlying system and implementation better, but that may or
may not be considered valuable depending on the project and your company's
culture.

------
petrogradphilos
> You will never understand every detail of the implementation of every level
> on that stack; but you can understand all of them to some level of
> abstraction, and any specific layer to essentially any depth necessary for
> any purpose.

As David Deutsch argues in _The Beginning of Infinity_, this is true not just
for computers, but for _everything_.

------
Koshkin
But sometimes trying to understand something just isn't worth it.

~~~
umvi
This happened to me.

We were seeing a USB issue between a Marvell SoC we purchased and a Qualcomm
SoC. The more I tried to understand this issue, the deeper the rabbit hole
went until we were in the weeds of the XHCI controller looking at
scatter/gather operations, etc. Eventually we bought a USB protocol analyzer
that seemed to indicate that the actual host controller had a bug. We
contacted Marvell, who in turn contacted the vendor of the USB core
(Synopsys). They bisected daily builds of the USB core on an FPGA until they
found the day a breaking change landed. Unfortunately, it appears Synopsys didn't use version
control with their source verilog a few years ago (they just had archived
daily builds), so while they could say "yep, looks like something that changed
on 13 Jan 2015 broke it" they didn't know what change that was. However,
before it could be resolved the company I was working for went bankrupt and so
I never did get closure on the issue.

I sunk many, many months into understanding that issue over the course of
several years only to give up because it was near impossible to understand
enough to come up with a workaround (and because the company went bankrupt).

------
jeffrallen
The author has clearly never chased a heisenbug in a 25 year old Makefile for
embedded systems. The kind where a timing bug, plus NFS automounter, plus WAN
weather, equals an unexpected PATH, resulting in a different minor version of
a linker than expected, thereby changing the order of static initializers on
some targets, thereby resulting in sshd not checking passwords, but only on
MIPS and only with the 16 mbps flash config.

~~~
geofft
All of those can be understood, no? (I'd expect to go at least that deep in a
postmortem of that problem, and I regularly do go about that deep in similar
postmortems.)

~~~
ken
That might be stretching the definition of "deterministic". If it depends on
implicit state, precise timing, external conditions, etc., I would say it's
closer to fitting the definition of "chaotic" than "deterministic".

With perfect observation and recording, it may be possible to see a tornado
and work backwards to figure out which butterflies flapped their wings to
cause it, but that would still be a long ways off from being able to train all
butterflies to move only in ways which won't cause tornados.

~~~
geofft
Sure, but you don't need to know _which_ butterflies flapped their wings - you
just need to know "if there's a tornado, the house falls down, and we're in
tornado country."

In the example above, it seems sufficient and entirely testable to conclude
that, if for whatever reason NFS is flaky, $PATH resolution will skip the
directory with the right linker and fall back to the old one, and that if for
whatever reason you call the old linker, it has different behavior around
static initializer order. That's enough to describe the problem, reproduce it, and fix it.
No, you're not going to be able to identify which particular packet was lost,
but (and I say this as someone who's maintained production systems that rely
on cross-continent NFS) you expect that packets _can_ be lost and you figure
out how to make the system robust against it. Don't put multiple files with
the same name on $PATH, or call the linker by its full path, or something.
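
That failure mode can be sketched in a few lines; the directory names here are
invented, and a real shell's lookup also involves caching and permission
checks, but the fall-through behavior is the same:

```python
# Sketch of $PATH fall-through: if the dir holding the new linker drops
# out of view (say, a flaky NFS mount), lookup silently finds the old one.
def resolve(name, path_dirs, available):
    """Return the first dir on `path_dirs` that currently provides `name`.
    `available` maps dir -> set of binaries visible right now."""
    for d in path_dirs:
        if name in available.get(d, set()):
            return f"{d}/{name}"
    return None

PATH = ["/tools/new/bin", "/usr/bin"]
healthy = {"/tools/new/bin": {"ld"}, "/usr/bin": {"ld"}}
nfs_flaky = {"/usr/bin": {"ld"}}   # NFS-backed dir vanished from view

print(resolve("ld", PATH, healthy))    # /tools/new/bin/ld
print(resolve("ld", PATH, nfs_flaky))  # /usr/bin/ld  (old linker, silently)
```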

And that leaves you a very far distance from "something spooky is happening
with the computer and so passwords weren't being checked today, who knows."

