
Systems that defy detailed understanding
https://blog.nelhage.com/post/systems-that-defy-understanding/
======
Ididntdothis
I often wonder if things would be better if systems were less forgiving. I bet
people would pay more attention if the browser stopped rendering on JavaScript
errors or malformed HTML/CSS. This forgiveness seems to encourage a culture of
sloppiness which tends to spread. I have the displeasure of looking at
quite a bit of PHP code. When I point out that the authors should fix the hundreds of
warnings, the usual answer is “why? It works.” My answer usually is “are you
sure?”

On the other hand maybe this forgiveness allowed us to build complex systems.

~~~
yuliyp
This often devolves into extremely fragile systems instead. For instance,
let's say you failed to load an image on your web site. Would you rather the
web site still work with the image broken or just completely fail? What if
that image is a tracking pixel? What if you failed to load some experimental
module?

Being able to still do something useful in the face of something not going
according to plan is essential to being reliable enough to trust.

~~~
twic
Systems need to be robust against uncontrollable failures, like a cosmic ray
destroying an image as it travels over the internet, because we can never
prevent those.

But systems should quickly and reliably surface bugs, which are controllable
failures.

A layer of suffering on top of that simple story is that it's not always clear
what is and what is not a controllable failure. Is a logic error in a
dependency of some infrastructure tooling somewhere in your stack controllable
or not? Somebody somewhere could have avoided making that mistake, but it's
not clear that you could.

An additional layer of suffering is that we have a habit of allowing this
complexity to creep or flood into our work and telling ourselves that it's
inevitable. The author writes:

> Once your system is spread across multiple nodes, we face the possibility of
> one node failing but not another, or the network itself dropping,
> reordering, and delaying messages between nodes. The vast majority of
> complexity in distributed systems arises from this simple possibility.

But somehow, the conclusion isn't "so we shouldn't spread the system across
multiple nodes". Yo Martin, can we get the First Law of Distributed Object
Design a bit louder for the people at the back?

[https://www.drdobbs.com/errant-architectures/184414966](https://www.drdobbs.com/errant-architectures/184414966)

And let us never forget to ask ourselves this question:

[https://www.whoownsmyavailability.com/](https://www.whoownsmyavailability.com/)

~~~
andai
> systems should quickly and reliably surface bugs, which are controllable
> failures

I was thinking, if the error exists between keyboard and chair, I want the
strictest failure mode to both catch it and force me to do things right the
first time.

But once the thing is up and running, I want it to be as resilient as
possible. Resource corrupted? Try again. Still can't load it? At this point,
in "release mode" we want a graceful fallback -- also to prevent eventual bit
rot. But during development it should be a red flag of the highest order.
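
A minimal PHP sketch of that split, with invented names (the `APP_ENV` variable and the `loadResource` function are illustrative, not from the article or the thread):

```php
<?php
// Sketch of "strict in development, resilient in release".
// APP_ENV and the empty-string fallback are illustrative assumptions.
function loadResource(string $url): string
{
    // Suppress the warning so we can handle the failure ourselves.
    $data = @file_get_contents($url);

    // "Resource corrupted? Try again." -- one retry before deciding how to fail.
    if ($data === false) {
        $data = @file_get_contents($url);
    }

    if ($data === false) {
        if (getenv('APP_ENV') === 'development') {
            // During development, a failed load is a red flag of the highest order.
            throw new RuntimeException("Failed to load resource: $url");
        }
        // In release mode, log it and degrade gracefully instead of breaking the page.
        error_log("Resource unavailable, using empty fallback: $url");
        return '';
    }

    return $data;
}
```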

~~~
numpad0
Are robustness and loose engineering the same, or overlapping, quality
measurements?

If so, it makes sense not to be strict; if not, it’s you (and all of us) rolling up
two different modes of failure into a single classification.

------
smitty1e
Great article.

Recalls Gall's Law[1]. "A complex system that works is invariably found to
have evolved from a simple system that worked."

Also, TFA invites a question: if handed a big ball of mud, is it riskier to
start from scratch and go for something more triumphant, or try to evolve the
mud gradually?

I favor the former, but am quite often wrong.

[1]
[https://en.m.wikiquote.org/wiki/John_Gall](https://en.m.wikiquote.org/wiki/John_Gall)

~~~
ssivark
> _if handed a big ball of mud, is it riskier to start from scratch and go for
> something more triumphant, or try to evolve the mud gradually?_

Reminiscent of Chesterton’s fence. But then, we end up in such a “complex”
situation only when one thing can have multiple causes & effects — which is
difficult to model correctly in a clean slate formulation.

The simplest solution seems to be to avoid making software that complex in the
first place (we can exert far more control than in the physical world).

But then if we think about Peter Naur’s perspective about programming as a
mode of theory building (of the domain) (unsurprising, given the basic
cybernetics principles such as the law of requisite variety & the good
regulator theorem), then the answer seems to be — unless your domain is really
complex, think hard before you implement, and keep refactoring as your
understanding improves (and try to pick problem formulations / frameworks /
languages which make that feasible; easier said than done, of course). The key
point is to _keep refactoring “continuously”_ to match our understanding of
the domain, rather than just “adding features”.

Aside: In my experience, software built on a good understanding of the domain
will function well, untouched, for a long time — so long as it is suitably
decoupled from the less-well-understood parts. The latter kind, though,
generates constant churn, while also being an annoying fit. Really brings home
the adage _“A month in the laboratory can save a day in the library.”_

~~~
BurningFrog
> _The key point is to keep refactoring “continuously“ to match our
> understanding of the domain, rather than just “adding features”._

This is also what I wanted to say.

One important part of that is that refactoring is a pretty difficult skill,
and many programmers do not have it.

So... for those people, some other advice is probably better.

~~~
karmakaze
I wish this process were called 'factoring' and you had to be able to name the
concept that was being isolated. Often 'refactoring' just means moving code
around or isolating code for its own sake. If a factor was properly isolated
you shouldn't have to do that one again. Sometimes you choose different
factors, but that's much less common.

~~~
mntmoss
"Factoring" is sometimes used in the Forth world, since code being factored
into small words is of such eminence.

And it offers good lessons about what's worth factoring and how. Forth words
that are just static answers and aliases are OK! They're lightweight, and the
type signatures are informal anyway. "Doing Forth" means writing it to exactly
the spec and not generalizing, so there's a kind of match of expectations of
the environment to its most devoted users.

On the other hand, in most modern environments the implied goal is to
generalize and piling on function arguments to do so is the common weapon of
choice, even when it's of questionable value.

Lately I've cottoned on to CUE as a configuration language, and the beauty of
it lies in how generalization is achieved with a minimum of explicit branches
and checks: the data specification is defined around pattern matching, and a
solver finds the logical incoherencies.

I believe that is really the way forward for a lot of domains: Get away from
defining the implementation as your starting point, define things instead in a
system with provable qualities, and a lot of possibilities open up.

------
mannykannot
Big balls of mud result from a process that resembles reinforcement learning,
in that modifications are made with a goal in mind and with testing to weed
out changes that are not satisfactory, but without any correct, detailed
theory about how the changes will achieve the goal without breaking anything.

~~~
bitwize
Sounds like all of Agile, really. One can characterize Agile as a ball-of-mud
maintenance process that scales desirably with the amount of mud.

------
carapace
"Introduction to Cybernetics" W. Ross Ashby

[http://pespmc1.vub.ac.be/ASHBBOOK.html](http://pespmc1.vub.ac.be/ASHBBOOK.html)

> ... still the only real textbook on cybernetics (and, one might add, system
> theory). It explains the basic principles with concrete examples, elementary
> mathematics and exercises for the reader. It does not require any
> mathematics beyond the basic high school level. Although simple, the book
> formulates principles at a high level of abstraction.

~~~
AndrewKemendo
I find it really sad that cybernetics completely evaporated as a field, with
the closest remnant being cognitive science. I think there is a huge need for
more interdisciplinary fields.

~~~
carapace
A lot of it was incorporated or duplicated in feedback control theory, but
mostly in the context of industry, so it didn't really feed back (heh, sorry)
into other, more academic, areas. And, on the other hand, it spun off into
(IMO) fluffy "second-order" cybernetics and became a kind of toy philosophy.

I find it sad too. PID controllers are great but from my POV they're barely
the first step.

However, another way to look at it is, you can study and apply "Intro to Cyb"
and leapfrog into the future.

------
xyzzy2020
I think this is useful even for systems (SW stacks) that are much smaller and
"knowable": you start by observing, trying small things, observing more,
trying different things, observing more still, and slowly building a mental
model of what is likely happening and where.

His defining characteristic is whether you can only permanently work around a
bug (not know it, but know _of_ it) versus find it, know it, and fix it.
Very interesting.

------
naringas
I firmly believe that _in theory_ all computer systems can be understood.

But I agree with him that it has become impractical to do so. I just don't
like it personally; I got into computing because it was supposed to be the
most explainable thing of all (until I worked with the cloud and it wasn't).

I highly doubt that the original engineers who designed the first microchips
and wrote the first compilers, etc... relied on 'empirical' tests to
understand their systems.

Yet he is absolutely correct that it can no longer be understood, and when I
wonder why, I think the economic incentives of the industry might be one of the
reasons.

For example, the fact that chasing crashes down the rabbit hole is "always a
slow and inconsistent process" will make any managerial decision maker feel
rather uneasy. That makes sense.

Imagine if the first microprocessors were made by incrementally and
empirically throwing together different logic gates until they just sort of
worked?

------
jborichevskiy
> If you run an even-moderately-sophisticated web application and install
> client-side error reporting for Javascript errors, it’s a well-known
> phenomenon that you will receive a deluge of weird and incomprehensible
> errors from your application, many of which appear to you to be utterly
> nonsensical or impossible.

...

> These failures are, individually, mostly comprehensible! You can figure out
> which browser the report comes from, triage which extensions might be
> implicated, understand the interactions and identify the failure and a
> specific workaround. Much of the time.

> However, doing that work is, in most cases, just a colossal waste of effort;
> you’ll often see any individual error once or twice, and by the time you
> track it down and understand it, you’ll see three new ones from users in
> different weird predicaments. The ecosystem is just too heterogenous and
> fast-changing for deep understanding of individual issues to be worth it as
> a primary strategy.

Sadly far too accurate.

------
woodandsteel
From a philosophical perspective, I would say this is an example of the
inherent finitudes of human understanding. And I would add that such finitudes
are deeply intertwined with many other basic finitudes of human existence.

------
lucas_membrane
I suspect that systems that defy understanding demonstrate something that
ought to be a corollary of the halting problem, i.e. just as you can't figure
out for sure how long an arbitrary system will take to halt, or even figure
out for sure whether or not it will, neither can you figure out how long it
will take to figure out what's going on when an arbitrary system reaches an
erroneous state, or even figure out for sure whether or not you can figure it
out.

~~~
nil-sec
I’m not sure about this. Define your “erroneous” state as “halt”. Now the
question becomes, for a system that halts, find out how it reached this
state. The mathematical answer to this is simply the description of the Turing
machine that produced this state. Whether you can understand this description
or not isn’t relevant.

------
natmaka
Postel's Robustness principle seems pertinent, along with "The Harmful
Consequences of the Robustness Principle".
[https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03.html](https://tools.ietf.org/id/draft-thomson-postel-was-wrong-03.html)

------
INTPnerd
Even if you can reason about the code enough to come to a conclusion that
seems like it must be true, that doesn't prove your conclusion is correct.
When you figure something out about the code, whether through reason and
research, or tinkering and logging/monitoring, you should embed that knowledge
into the code, and use releases to production as a way to test whether you were
right or not.

For example, in PHP I often find myself wondering if perhaps a class I am
looking at might have subclasses that inherit from it. Since this is PHP and
we have a certain amount of technical debt in the code, I cannot 100% rely on
a tool to give me the answer. Instead I have to manually search through the
code for subclasses and the like. If after such a search I am reasonably sure
nothing is extending that class, I will change it to a "final" class in the
code itself. Then I will rerun our tests and lints. If I am wrong, eventually
an error or exception will be thrown, and this will be noticed. But if that
doesn't happen, the next programmer who comes along and wonders if anything
extends that class (probably me) will immediately find the answer in the code:
the class is final. This drastically narrows down what can happen, which makes
it much easier to examine the code and to refactor or make necessary changes.
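
A rough PHP sketch of that step (the class name is made up for illustration):

```php
<?php
// After searching the codebase and finding no subclasses, encode that fact
// directly in the code by declaring the class final.
final class InvoiceFormatter
{
    public function format(float $amount): string
    {
        return number_format($amount, 2);
    }
}

// If someone later tries to extend it anyway, PHP raises a fatal error at
// load time, so tests or production will quickly surface the wrong assumption:
// class CustomInvoiceFormatter extends InvoiceFormatter {}
```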

Another example: often you come across some legacy code that seems like it can
no longer run (dead code). But you are not sure, so you leave the code in
there for now.
there for now. In harmony with this article, you might log or in some way
monitor if that path in the code ever gets executed. If after trying out
different scenarios to get it to run down that path, and after leaving the
monitoring in place on production for a healthy amount of time, you come to
the conclusion the code really is dead code, don't just add this to your
mental model or some documentation, embed it in the code as an absolute fact
by deleting the code. If this manifests as a bug, it will eventually be
noticed and you can fix it then.
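
A sketch of that monitoring step, again with invented names:

```php
<?php
// Instrument a branch you suspect is dead before deleting it. If this log line
// never appears after a healthy amount of time in production, delete the branch
// and let any resulting bug surface the truth.
function calculateShipping(array $order): float
{
    if (($order['country'] ?? '') === 'LEGACY_REGION') {
        // Suspected dead code path: log so we can confirm whether it ever runs.
        error_log('Suspected-dead shipping branch hit for order ' . ($order['id'] ?? 'unknown'));
        return 0.0;
    }

    return 5.0 + 0.5 * count($order['items'] ?? []);
}
```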

By taking this approach you are slowly narrowing down what is possible and
simplifying the code in a way that makes it an absolute fact, not just a
theory or a model or a document. As you slowly remove this technical debt, you
will naturally adopt rules like: all new classes must start out final, and are
only changed to not be final when you need to actually extend them.
Eventually you will be in a position to adopt new tools, frameworks, and
languages that narrow down the possibilities even more, further embedding
the mental model of what is possible directly into the code.

------
jerzyt
Great read. A lot of hard-earned wisdom!

------
drvortex
What a long-winded article on what has been known to scientists for decades as
"emergence". Emergent properties are system-level properties that are not
obvious or predictable from the properties of individual components. Observing
one ant is unlikely to tell you that several of these creatures can build an
anthill.

~~~
svat
Your comment was very puzzling to me, as I couldn't figure out what kind of
misunderstanding about this article would prompt a comment such as this. But
finally a possibility occurred to me: perhaps you think the point of this
article was simply to say that there exist "systems that defy detailed
understanding". It is possible that one could think that, if one went in with
preconceived expectations based only on the title of the post. (But this is a very
dangerous habit in general, as outside of personal blogs like this one,
headlines in publications are almost never chosen by the author.)

But we all know such systems already: for instance, _people_! No, this post is
a supplement/subsidiary to the previous one ("Computers can be understood" —
BTW here's another recent blog post making the same point:
[https://jvns.ca/blog/debugging-attitude-matters/](https://jvns.ca/blog/debugging-attitude-matters/)), carving out
exceptions to the general rule, and illustrating concretely _why_ these are
exceptions (and what works instead). It is useful to the practitioner as a
rule-of-thumb for having a narrow set of criteria for when to avoid aiming to
understand fully (and alternative strategies for such cases). Otherwise, it's
very easy to throw up one's hands and say "computers are magic; I can't
possibly understand this".

(The point of the article here is obvious from even just the first or last
paragraphs of the article IMO.)

