Seeing Like an SRE: Site Reliability Engineering as High Modernism (usenix.org)
160 points by zdw 4 days ago | 25 comments





It's an interesting comparison. Looking back at the history of software, "A Pattern Language" was an architectural treatise that inspired the concept of "software design patterns".

Similarly, I can see that considering the known issues with top-down vs. bottom-up city planning/evolution could be beneficial for software-centric organizations too; the issues with badly-fit top-down city plans seem to match very well with the pains of an ill-fit software architecture that's mandated from an ivory tower, complete with users using the planned cities "wrong".

I'm sure there are differences though. You have a lot more observability into your software systems, and at the end of the day, they are orders of magnitude less complicated than cities, so you can comprehend more of the system at once, and truly find common use cases to standardize around. This is in contrast to cities, where it's impossible to really know every citizen's unique needs, temperament, and usage patterns.

Worth thinking about more; given the relatively low cross-pollination rates between the fields, I suspect there are more lessons that software engineers could glean from architecture and city planning.


A key underlying assumption in Scott’s perspective in “Seeing like a State” is that diversity is critically important to healthy functioning of biological/human/cultural ecosystems. In large computing system fleets we’re often okay with the opposite — simplifying by fiat because the understanding/control of the architect is more important than the diversity of individual machine configurations. Yes, the monoculture could lead to correlated failures (Eg: all machines are vulnerable to the same exploit), but the common perspective is that the simplicity/controllability and efficiency gains are worth it.

I think we might be able to get by with this perspective so long as we’re seeing computers/systems only as inert tools. It’s interesting to consider whether there’s any motivation for that to change, as we move towards more ubiquitous & intelligent computing. (Eg: should IoT devices be thought of as akin to insects?)


One of the key differences is that (the various components of) nature has no common goal except that each individual component wants to reproduce, while large computing systems are almost always constructed to achieve some particular objective. Thus, nature is OK with it if predators randomly kill some percent of the population, while most factories would frown very much if a random employee started sabotaging lathes or something.

You could argue that Netflix-style chaos engineering is an attempt to introduce more resilience into the system precisely by mimicking nature's "anything can die at any moment" principle, but even then it typically only applies to computers. Netflix is known for firing fast but I don't think even they would consider randomly firing employees to make sure there are no single points of failure in the employee makeup. Would be interesting though: tax filings need to be submitted next Tuesday but the CFO was just fired, what is your recovery plan?
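To make the "anything can die at any moment" idea concrete, here is a toy sketch of a single chaos round. It is not Netflix's tooling; the function names and the `min_healthy` threshold are made up for illustration, and the "fleet" is just a set of strings.

```python
import random

def kill_random_instance(instances, rng):
    """Remove one randomly chosen instance, mimicking an unplanned failure."""
    victim = rng.choice(sorted(instances))  # sorted() so a seeded rng is reproducible
    return victim, instances - {victim}     # returns a new set; the input is untouched

def service_is_healthy(instances, min_healthy):
    """Hypothetical health rule: enough instances are still alive."""
    return len(instances) >= min_healthy

def chaos_round(instances, min_healthy, rng):
    """Kill one instance and report whether the service would survive it."""
    victim, survivors = kill_random_instance(instances, rng)
    return victim, service_is_healthy(survivors, min_healthy)

if __name__ == "__main__":
    rng = random.Random()
    print(chaos_round({"web-1", "web-2", "web-3"}, min_healthy=2, rng=rng))
    # A "fleet" of one is a single point of failure by construction.
    print(chaos_round({"cfo"}, min_healthy=1, rng=rng))
```

The point of the sketch is only that redundancy gets checked by actually removing something, rather than by assuming the spare capacity works.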


I've encountered the idea of a Chaos HR Simian. People get random, unplanned, multi-week vacations.

Mentioned here: https://www.cognitect.com/blog/2016/3/24/the-new-normal-embr...

I know I've been on teams that were significantly disrupted by jury duty, medical incidents, traffic accidents, etc. So it seems like a reasonable way to simulate this.


Something the author didn't touch on specifically is the limit on languages at Google. When I left, the officially supported languages were Java, C++, Python, and Go. That limited the scope of CI/CD, tracing and monitoring, and debugging to something tractable for the developer tools teams. It also made it tractable for SRE teams to be able to engage with new product teams without having to learn a whole new language.

A really useful thing my team did (and I think it was a moderately successful trend on other SRE teams) was to role play recent outages. The oncall who had seen a particularly interesting outage would DM using the graphs, error messages, and logs they encountered when debugging an alert for a chosen victim (ahem, role-player) who would have to choose which graphs, dashboards, and logs to look at and which remediation actions to take to track down and fix the actual problem. It was perfect for building metis since it was done in a team setting so everyone benefited from the insights into the system architecture and behavior and the role-player learned practical oncall skills. Things like escalating to other teams and running incident management were built into the RP.


Python is “supported” but if you want to write a new program in Python you need approval from the area tech lead ¯\_(ツ)_/¯

This is such a great idea. I struggle to see it being adopted at my current, OKR-driven organization where literally any work is debated until death lol.

The author touches on knowledge management, which is one of the most interesting subjects I was able to study at uni (part of CS/SS). A kind of analogy to the techne/metis is the concept of explicit and implicit knowledge.

We codify knowledge or information into explicit knowledge such as documentation, expert systems or design. Not all knowledge lends itself to this.

Implicit knowledge is that which often requires experience and learning by doing. It is hard to capture explicitly: on the one hand the skilled individual might be unaware of the skill in action, on the other they may be unable to express it.

Various hacks are then tried to pry this valuable asset out into the open, so it can be recorded on a corporate wiki.


My (admittedly limited) experience is that systems aren't maintainable except by people that are very familiar with them. The basic principles of SRE don't ignore that, they embrace it. Rather than trying to manage a system from the top, they encourage the admin to delve in and craft it themselves. By bringing infrastructure close to the users of that infrastructure, everyone gets a chance to gain hands-on knowledge. Is that how it actually turns out? Maybe, maybe not.

> systems aren't maintainable except by people that are very familiar with them.

I think that a consequence of "two sorts of knowledge: techne and metis" is that standardisation is good, but it only gets you so far. Past that point, you need to be familiar with the system.

This should not devalue our efforts to standardise, e.g. get systems to all log to the same aggregator, and emit the same basic stats, agree on naming and forwarding of correlation ids that will allow us to cross-reference related log entries.

But we should also recognise that those efforts will never cover everything.

e.g. If I changed over to working on an unfamiliar system in the same organisation, I would know where it should be logging to, what the field naming and general structure of those log entries should be, but I would not know what healthy operation should look like in those logs.
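A minimal sketch of the standardisable (techne) part, assuming Python's standard `logging` module. The field names, the `make_logger` and `handle_request` helpers, and the "reuse the incoming id or mint a new one" rule are assumptions for illustration, not anyone's actual convention.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Hypothetical shared convention: every log entry carries these same fields.
correlation_id = ContextVar("correlation_id", default="-")

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with an agreed field layout."""
    def format(self, record):
        return json.dumps({
            "service": record.name,
            "level": record.levelname,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        })

def make_logger(service, stream):
    """Build a logger for one service that writes JSON lines to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

def handle_request(logger, incoming_id=None):
    """Reuse the caller's correlation id if one was forwarded, else mint one."""
    cid = incoming_id or uuid.uuid4().hex
    token = correlation_id.set(cid)
    try:
        logger.info("request handled")
    finally:
        correlation_id.reset(token)
    return cid
```

This gets every service logging the same shape with cross-referenceable ids. What it cannot encode is the metis: which of those entries are normal for a given system on a given day.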


The author makes Kubernetes sound like it's a technocratic regime controlled by a political class of anyone who's ever held the title SRE at Google. They do control the means of production. Me however, I'm just a member of the typing pool.

Perhaps everyone who was ever an SRE at Google added one new configuration option to Kubernetes, and that's how it ended up this way.

You joke but that's what happened with TensorFlow at Google. Everyone wanted a "contributed to TensorFlow" on their resume.

Well I think what they wanted was for their work to be used. It was a great big bag of things.

Poor Corbusier, getting blamed for the architectural errors of Mies van der Rohe's sadly untalented copycats, pseudo-intellectual ideologues, and greedy developers.

For the record, Corbusier's Ville Radieuse (Radiant City) predates the Cold War by a rather hot World War II (1930). Interestingly enough, it was a very Googly impulse -- "organize all the world's" bipeds -- that motivated the relatively young control freak aka architect. After WWII, Corbu mellowed. And his collective residential structures, Unité d'habitation, were the result of his synthesis of a generative measuring system and modularity. OP and fellow SREs have quite a lot to learn from the mature thoughts of Le Corbusier.

Over here in America, we had our own native genius, Frank Lloyd Wright, who devised his vision of an urbanism for a democracy - The Broadacre City:

https://franklloydwright.org/revisiting-frank-lloyd-wrights-...

But of course, the "high modernism" clique (run by the moneyed set of the East Coast (think MoMA), and the "ex-Nazi", Philip Johnson) did everything to marginalize Wright. And it was this clique, having imported wholesale (ironically) the leftist architects of Europe escaping Fascism, that gifted us with the "high modernism" dystopia.

If you want to learn about modern architecture, I recommend Ken Frampton's Modern Architecture: A Critical History. He was one of the very few actual teachers I had in architectural school worthy of the designation.

https://en.wikipedia.org/wiki/Kenneth_Frampton

https://www.goodreads.com/book/show/70140.Modern_Architectur...

https://en.wikipedia.org/wiki/Philip_Johnson#Controversy_ove...

https://en.wikipedia.org/wiki/Ludwig_Mies_van_der_Rohe (His own works were exquisite gems.)


This is an extremely well written article. The concepts of techne and metis, I hope these become part of tech vocabulary and allow us to talk about differences in perspectives on infrastructure and especially infrastructure migrations more effectively without hating each other.

> Techne is universal knowledge: things like the boiling point of water, Pythagoras’ theorem, the rule that all RPCs should have deadlines, or that we should probably alert if no instances of our jobs are running.

> metis, is local, specific, and practical. It’s won from experience. It can’t be codified in the same way that techne can. The comparison that Scott gives is between navigation and piloting. Deepwater navigation is a general skill, but a pilot knows a specific port — a ‘local and situated knowledge,’ as Scott puts it, including tides, currents, seasonal changes, shifting sandbars, and wind patterns. A pilot cannot move to another port and expect to have the same level of skill and local knowledge.


It might be worth noting that we don't need to rely on this particular book as a source for this distinction. It is essentially congruent with the necessary/contingent distinction in philosophy.

Other expressions of it include the strategy/tactics distinction and the nomothetic/idiographic distinction. The idea is based on the very ancient observation that phenomena involve both general laws and specific circumstances.


Great article. Nice insights on techne & metis.

This relates directly to automated testing. Unit test coverage is important, but equally important are functional tests from the perspective of a user executing real workflows.

The full picture of app behavior is invaluable to the new or learning engineer, or even experienced engineers learning some unfamiliar subsystem.


I agree with this post 200%.

> I recently spent some time trying to write a set of general guidelines for what to monitor in a software system

Reframe as "Shit That Needs To Run To Make The Customer Happy" and you get closer to what you want. Which is to say, it's completely product-specific. A general list of technical things to monitor is about as useful as monitoring the cotton thread fiber integrity of a pair of shoelaces. Is it the cotton thread fiber integrity that you care about, or the general quality of the shoelaces? Are they shitty laces, or just decent, or great laces? Quantify that.

> Typically, the former kind of PRR will take a quarter or more, because invariably, new large services have a significant amount of tasks to do to get them production-ready. The SRE team onboarding the service will spend significant time finding gaps, understanding what happens when dependencies of the service fail, improving monitoring and runbooks and automation.

I deal with these a lot of the time, and I hate them because they are so stupid. We could make these reviews completely self-service and automated and they'd move a lot faster, and could even be on-going as the product is actually released to customers. But SRE and Architecture remain their own silos, and neither of them work closely enough with the product team or core engineering groups to find the streamlined, agile ways of doing these things. Basically, none of them grok the concept of finding quicker and better ways to get this shit done. Or they just don't care to.

> The second kind of PRR typically does not uncover much to be done, and devolves into a tick-box exercise where the developers attempt to demonstrate compliance with the organisation’s production standards.

Architecture and SRE don't explain to the product team WTF they are going on about, so of course they just tick boxes mindlessly. Nobody wants to stop and understand the whole picture, so you end up with empty formalism.

The way to "formalize" and "standardize" the operationalizing of a product is to make it clear what the fuck is going on at each stage of your product. Who the fuck are my customers? What the fuck are they doing with the product? How the fuck does the product work for them, and internally? What the fuck are the external dependencies and how do they work? You need simple, practical ways to express these things.

And you also need to train people as to why everyone needs to understand these things. Why you cannot just allow someone to sit in their little corner of a room and jerk off and collect a paycheck. I often hear it from developers ("I just want to write code") but literally everyone else in the organization does it too.


It's baffling to see a majority of the tech industry adopt the job title that a giant advertising company invented as some sort of panacea, because they wrote a book.

SRE has some good ideas in principle. But in practice, unless you are Google, it often leads to over-engineering.


It's an industry specialization with no formal training. You can get a degree in computer science. You can't get a degree in running large-scale computer systems. Plumbers have better training than we do. Garbage men have better training than we do.

If any organization is bold enough to write a book on it, that book becomes the de facto standard. (It helps that it's 10000% easier than buying and reading a million disparate ISO standards on Information Technology.)


That's sort of my point. The book outlining the practices used at a company with 130k employees doesn’t translate directly to a company with 100 employees, and yet here we are.

What I get from the SRE book is an approach, or even, god help us, a pattern for how to do operations.

So yes, it doesn't translate from 130K to 100 people, but the concepts do.



