Agreed, the devil is in the details for SRE functions, and the organizational details of how to leverage this framework are largely absent from this writeup. With so many teams struggling to get the organizational components right just for traditional SRE (due to budget constraints, internal politics, misunderstanding of SRE by leadership, etc.), I'd imagine implementing the changes needed to leverage the ideas in this writeup will be impossible for all but extremely deep-pocketed tech companies.
Nonetheless, there are lots of interesting concepts here, so I would like to see a Google SRE handbook-style writeup with more detail that might be of more practical value.
It has potential to be incredible from a safety perspective and will save lives (I have an inReach, but there are many people in the backcountry who don't, which makes sense given the cost and pricing model). I look forward to seeing the reliability data -- reliability is the one thing that kept me on inReach when the iOS sat feature came out.
Would love to see this for the SF Bay Area - lots of great cycling around here and I'm unsatisfied with using Strava to create routes!
One of the most frustrating things for me with snapping to official paths is not being able to modify the route for common workarounds (for instance, going across the Golden Gate Bridge, everyone takes a shortcut through a parking lot, but every map routing platform I have used forces me to take the official route and messes up my nav).
I'm around Hill Country, TX now, where there is typically a patchwork of bike lane segments that start and stop without much attention given to continuity of (sub)urban planning, walkability, safety, or design consideration for non-motorized users.
It feels like the approach OP is taking won't be able to take this shortcut into account.
I too am interested in linking up good bike trails, mostly for the East Bay gravel systems. Today, I save GPX or GeoJSON from routes I find on Strava and import them into a map client (CalTopo). It's an okay solution, but my problem is in finding more alternative routes.
It's the parking lot near Lincoln Blvd and Merchant Rd. If you come from Crissy Field, you wouldn't cut through it, but my route is Arguello -> Washington -> Lincoln, then through the parking lot onto the path by the battery and onto the bridge.
Agents do not necessarily need to do the entire job of an SRE. They can deliver tremendous value by doing portions of the job, e.g., bringing in relevant data for a given alert or writing post-mortems. There are aspects of the role that are ripe for LLM-powered tools.
I really can’t think of anything more counter-productive than AI post-mortems.
The whole point of them is for the team to understand what went wrong and how processes can be improved in order to prevent issues from happening again. The idea of throwing away all the detail and nuance and having some AI-generated summary will only make the SRE field worse.
I also really don't understand the benefit of an LLM for bringing in relevant data. I would much prefer that be statically defined, e.g., when a database query takes 10x longer, bring me the dashboards for the OS and hardware as well.
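Roughly the kind of thing I have in mind, as a sketch of a statically defined alert-to-dashboard mapping; the alert names and Grafana URLs here are made up:

```bash
#!/usr/bin/env bash
# Hypothetical alert-webhook helper: map an alert name to the dashboards
# that should be pulled up alongside it. No model in the loop -- the
# mapping lives in git and is reviewed like any other change.
alert="$1"   # e.g. "db_query_latency_10x"

case "$alert" in
  db_query_latency_10x)
    # Slow queries: also surface OS and hardware dashboards for the DB hosts.
    echo "https://grafana.example.com/d/db-overview"
    echo "https://grafana.example.com/d/os-metrics"
    echo "https://grafana.example.com/d/hardware-health"
    ;;
  cache_hit_rate_drop)
    echo "https://grafana.example.com/d/cache-overview"
    ;;
  *)
    echo "no static mapping for '$alert'" >&2
    exit 1
    ;;
esac
```

The point is that the mapping is explicit and auditable, rather than regenerated by a model on every page.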
As someone who has worn the software engineer, DevOps engineer, platform engineer, and SRE hats, I would say don't build monoliths -- instead build a microservice that is slightly larger but still easily cloneable, scalable, and fault tolerant. A mix of monolith and microservice, you might say, and I would like to call that a "siloservice".
Silo: A silo is a cylindrical tower used for bulk storage, like grain silos that stand tall near farms. Another kind of silo is harder to see — military silos are underground.
Obviously, you don't need 10 fragmented microservices all depending on each other -- that's one of the biggest overengineering traps with microservices in real-world practice -- but you can build multiple "siloservices" that do the same stuff more effectively while being easier to maintain. I got this inspiration from working with monorepos in the past.
While I agree that there are certainly cases of microservices being used in places they shouldn’t be, I have trouble imagining that monoliths are strictly better in every case. Do you have suggestions for running monoliths at scale?
I think the big problem is that it tries to do too much. We used to have many tools as SREs, but now teams are really limited. We handed the keys to the engineers, which I think was a good intention overall. But we didn't set them up with sensible defaults, which left them open to making really bad decisions. We made it easy to increase the diversity in the fleet and we removed observability. I think things are more opaque, more complicated, and I have fewer tools to deal with it.
I miss having lots of tools to reach for. Lots of different solutions, depending on where my company was and what they were trying to do.
I don’t think one T-shirt size fits all. But here are some specific things that annoy me.
Puppet had a richer change management language than Docker. When I lost Puppet, we had to revert to shitty bash scripts and nondeterminism from the CI/CD builds. The worst software in your org is always the build scripts. But now that is the whole host state! So SREs are held captive by nonsense in the CI/CD box. If you were using Jenkins 1.x, the job config wasn't even checked in! With Puppet I could use git to tell me what config changed, for tracked state anyway. Docker is nice in that the images are consistent, which is a huge pain point with bad Puppet code. So it's a mixed bag.
The clouds and network infrastructure have a lot of old assumptions about hosts/IPs/ports. This comes up a lot in network security, service discovery, and cache infrastructure. Dealing with this in the k8s world is so much harder, and the cost and performance are so much worse. It's really shocking to me how much people pay because they are using these software-based networks.
The hypervisors and native cloud solutions were much better at noisy-neighbor protection, and a better abstraction for carving up workloads. When I worked at AWS I got to see the huge effort the EBS and EC2 teams put into providing consistent performance. VMware has also done a ton of work on QoS. The OS kernels are just a lot less mature on this. Running inside a single VM in the cloud removed most of the value of this work.
In the early 2010s, lots of teams were provisioning EC2 instances, and their costs were easy to see in the bill as dollars and cents. At my last company, we were describing workloads as replicas/GBs/CPUs/clusters on a huge shared cluster: thousands of hosts, a dozen data centers.
This added layer of obfuscation hides the true cost of a workload. I watched a presentation from a large, well-known software service company saying that their k8s migration increased their cloud spend because teams were no longer accountable for spend. At my company, I saw the same thing. Engineers were given the keys to provisioning but were not in the loop for cost cutting. That fell to the SREs, who were blamed for exploding costs. The engineers are really just not prepared to handle this kind of work. They have no understanding of the implications in terms of cost and performance. We didn't train them on these things. But we took the keys away from the SREs and handed them to the engineers.
The debugging story is particularly weak. Once we shipped on Docker and k8s, we lost SSH access to production. Ten years into the Docker experiment, we now have a generation of senior engineers who don't know how to debug. I've spent dozens of hours on conference calls while the engineers fumbled around. Most of these issues could have been diagnosed with netstat/lsof/perl -pe/ping/traceroute. If the issue didn't appear in New Relic, then they were totally helpless. The loss of the bash one-liner is really detrimental to engineers' progress.
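For what it's worth, these are the kinds of one-liners I mean; the port and hostnames are placeholders:

```bash
# Which process is listening on the port, and who is it talking to?
sudo lsof -iTCP:8080 -sTCP:LISTEN
netstat -tnp | grep 8080          # or: ss -tnp | grep 8080

# Is the dependency reachable, and where does the path break?
ping -c 3 db.internal.example
traceroute db.internal.example

# Quick log surgery without waiting on a vendor dashboard.
perl -pe 's/\s+/ /g' app.log | grep -c timeout
```

None of this is exotic, but you need a shell on (or next to) the box to do it.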
There is too much diversity in the Docker base images, and too many of them stick around. The tool encourages every engineer to pick a different one. To solve this, my org promised to converge on Alpine. But if you use a Docker distribution, now you are shipping all of user mode to every process. I was on the hook for fixing a libc exploit for our fleet. I had everyone on a common base image, so fixing all 80 of my host classes took me a few days. But my coworkers in other orgs who had hundreds of different Docker images were working on it a year later. Answering the question "which libc am I on?" became very difficult.
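The closest thing to an answer I know of is brute-forcing it across the image list; a rough sketch, assuming an images.txt you maintain and images that actually ship a shell:

```bash
# Print the first line of `ldd --version` for each image. glibc images
# report their version; musl-based images (e.g. Alpine) usually identify
# themselves as musl, though the exact output varies. Won't work for
# shell-less images (distroless, scratch).
while read -r img; do
  printf '%s: ' "$img"
  docker run --rm --entrypoint sh "$img" -c 'ldd --version 2>&1 | head -n1'
done < images.txt
```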
Terraform has a better provisioning/migration story. Use that to design your network and perform migrations. Use the cloud-native networking constructs. Use them for security boundaries. Having workloads move seamlessly between these "anything can run on me" hosts makes security a real nightmare.
I left being an SRE behind when I saw management get convinced Docker/k8s was a cancer treatment, a dessert topping, and a floor wax. It's been five years and I think I made the right call.
Thanks! We've also been impressed with the performance of out-of-the-box LLMs on this use case. I think that's in part because k8s is a significantly more constrained problem space than coding, and because of that we'll get to a much more complete solution with the existing state of LLMs than we would for a product like a general software engineering agent.
also open-sourced: https://github.com/wilson090/ride-render