Thanks for sharing. As a data engineer who dabs into docker and k8s and terrafor...

_siis · on June 2, 2023

In many areas, DevOps means different things to different companies.

I'm not in a DevOps position, nor have I ever been in one, but the people I've met who were focused primarily on providing value and automating repetitive work.

This usually took the form of creating tooling or instrumentation that made everyone else's jobs easier or more reliable (eliminating human error).

That is easier said than done. The tooling they created needed to be rock solid, and certain operational aspects are often overlooked because Operations rarely has as much input as Developers in more classical teams. As long as it continues running without problems, few people care.

Volume 2 of TPOSNA for Cloud breaks some of these operational requirements down. I'd also add that it's important to understand the limitations of computation, and maybe a little compiler/automata theory on the types of problems and what leads to halting/undecidability.

In that kind of position, you will often be dealing with automation, and you may need to verify requirements of computation, such as whether an output interface is injecting non-determinism into subsequent forked processes. Any automation after that will fail in unpredictable ways, and it happens in a lot of places (take a close look at ldd coreutils output sometime).
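A rough way to test for that kind of injected non-determinism is to run the same command several times and compare the output. The sketch below is only an illustration: the helper names and the two sample commands are made up for this example, and it assumes a Python interpreter is available to act as the command under test. On systems with address space layout randomization, the load addresses that ldd prints typically differ between runs, which is the sort of thing this check would flag.

```python
import subprocess
import sys


def distinct_outputs(cmd, runs=5):
    """Run cmd several times and collect the distinct stdout values seen."""
    outputs = set()
    for _ in range(runs):
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        outputs.add(result.stdout)
    return outputs


def looks_deterministic(cmd, runs=5):
    """True if every run produced identical stdout.

    A pass is not proof of determinism, only a lack of evidence against it.
    A failure, however, is proof that something varies between runs.
    """
    return len(distinct_outputs(cmd, runs)) == 1


# Hypothetical commands standing in for a real tool in a pipeline.
stable = [sys.executable, "-c", "print('hello')"]
unstable = [sys.executable, "-c", "import uuid; print(uuid.uuid4())"]

print(looks_deterministic(stable))
print(looks_deterministic(unstable))
```

If a command fails this check, anything downstream that parses or diffs its output inherits the variation, so it is worth normalizing or stripping the varying fields before feeding it onward.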

There are a few core requirements that programs assume are true, and programs break down when those assumptions stop holding. An intuitive knowledge of Systems and Signals is very valuable with this in mind, because those properties can be tested. Effective troubleshooting only works when certain properties are present, which lets you characterize problem classes quickly and give better time estimates. Non-deterministic problems take exponentially longer since you can only guess and check.

As for certifications, if you want a background in System Administration, RHCSA was good; I haven't seen it since Red Hat was acquired by IBM.

You should look at who issues the certification and see how, and for how long, they validate it. With deprecated certifications, issuers often stop confirming that the certification was ever issued, so it's as though you never had it.

markus_zhang · on June 2, 2023

Thanks for the long reply. Your observation about the automation part agrees with mine, and that's exactly the reason I love the job. I did a lot of automation for the team.

But I'm missing a lot. I know absolutely nothing about networking. I don't know how to build infra-as-a-service as our DevOps does -- I only use it and develop on top of it. I'm especially wearing a blank face reading your "system and signal" paragraph.

I guess my best shot is to learn by earning a certification and go from there. I found all companies need sort of a senior DevOps who can immediately begin work. I need to fill at least some of the gaps.

_siis · on June 3, 2023

To elaborate, Systems and Signals is a course usually reserved for EE majors. You can watch lectures about it at MIT OCW, and there's a textbook which is relatively cheap. While it's very math heavy (probably requiring Calc 3 plus a bit of abstract math for EE work), an intuitive understanding of what the properties are and how to test for their presence is all you really need to get some basic use from it as a System Admin/DevOps.

A barebones understanding of networking is pretty simple (it's an onion). It can get complicated as soon as you need to start worrying about time to convergence, multi-homed networks, dealing with BGP and ASNs, or reimplementing network stacks. CCENT/CCNA/CCNP are certs that cover the material exceptionally well, though it's dry, to put it mildly. It's also what I would expect of a Network Admin, not a System Admin/DevOps.

I've found Systems and Signals provides a useful paradigm/filter for characterizing certain classes of operational problems (by whether the properties are present or not in computational contexts).

Computation, for example, in general requires three system properties to function and do work: Step (usually associated with a clock signal), Time Invariance (the same inputs give the same output regardless of time shifts), and Determinism (a fixed set of input states going into a function yields a single unique output).

The latter property must hold across any series of intermediate steps as well, mainly to fulfill the requirements of the theory of computation (afaik) regarding Turing machines (finite state automata/automata theory, with implications for halting and decidability when it is not present; this is mainly covered in the dragon book or a compiler design course, and I'm pretty sure video courses for this are available on MIT OCW).

Time Invariance, for example, is affected by Memory and fails in its presence (such as a cache, which is why, from an operational perspective, there must always be a way to clear it to allow troubleshooting).
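As a toy illustration of how Memory breaks Time Invariance, here is a minimal sketch of a read-through cache in front of a value that changes. Everything in it (the `CachedLookup` class, the `backing` dict, the addresses) is hypothetical and only meant to show the shape of the problem, not any real caching library.

```python
class CachedLookup:
    """Hypothetical read-through cache in front of a mutable backing store."""

    def __init__(self, backing):
        self.backing = backing
        self._cache = {}

    def get(self, key):
        # Only consult the backing store the first time a key is requested.
        if key not in self._cache:
            self._cache[key] = self.backing[key]
        return self._cache[key]

    def clear(self):
        # The operational escape hatch: forget everything remembered so far.
        self._cache.clear()


backing = {"host": "10.0.0.1"}
early = CachedLookup(backing)

first = early.get("host")               # fills the cache
backing["host"] = "10.0.0.2"            # the underlying value changes
stale = early.get("host")               # answer now depends on history
late = CachedLookup(backing).get("host")  # same input, asked of a fresh instance

early.clear()
fresh = early.get("host")               # clearing the cache removes the history

print(first, stale, late, fresh)
```

The same query gives different answers depending on when the component first saw it, so two observers can disagree about the system while both are reporting honestly. Clearing the cache makes the answers line up again, which is what makes the problem tractable.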

Troubleshooting interconnected systems relies heavily on Time Invariance. As a rule of thumb, if you can demonstrate the property isn't there, you can estimate that certain issues will take much longer to resolve, given the guess-and-check nature of the problem.

While these are mainly rules of thumb without formal verification (at least for me), they align with observations I've made and have served me well as an additional tool. Most of what I know in this regard is self-taught, aside from the initial exposure from going for an undergraduate degree in engineering, where I took but withdrew from this course. I don't have a degree, though I know quite a bit about computer science and related theory at this point.

If you can verify a property is missing, you don't spin your wheels. Expectations are also set appropriately upfront, well before investing a lot of resources to work the problem. Sometimes the problem isn't worth digging into in terms of cost, and recognizing that allows you to offer value in potential cost savings.

As for Infra, Scaling, and such, Volume 2 of TPOSNA will give you a solid foundation, as it's about Cloud. Volume 1 is all about in-house IT.

Edit: Being able to compose state graphs, along with a cursory amount of graph theory, is also important if you have to verify determinism. Most commonly, people miscount the absence of something as a state, and it ends up borking the automation in weird edge cases, since the absence often gets mapped to two or more potential (non-unique) output nodes, breaking determinism on a state graph.
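To make the state graph point concrete, here is a small sketch that checks a transition table for determinism. The states, events, and function names are invented for illustration; the assumption is simply that a graph is deterministic when each (state, event) pair leads to exactly one next state.

```python
from collections import defaultdict


def nondeterministic_edges(transitions):
    """Return {(state, event): targets} for every pair with more than one target."""
    targets = defaultdict(set)
    for state, event, target in transitions:
        targets[(state, event)].add(target)
    return {pair: t for pair, t in targets.items() if len(t) > 1}


def unhandled_pairs(transitions, states, events):
    """Return the (state, event) pairs that have no outgoing edge at all."""
    seen = {(state, event) for state, event, _ in transitions}
    return {(s, e) for s in states for e in events} - seen


# Hypothetical deploy step. The absence of a reply was handled in two places,
# so "no_reply" while waiting can lead to either retry or failed.
transitions = [
    ("waiting", "reply_ok", "done"),
    ("waiting", "reply_error", "failed"),
    ("waiting", "no_reply", "retry"),
    ("waiting", "no_reply", "failed"),
    ("retry", "reply_ok", "done"),
]

ambiguous = nondeterministic_edges(transitions)
missing = unhandled_pairs(
    transitions,
    states={"waiting", "retry"},
    events={"reply_ok", "reply_error", "no_reply"},
)

print(ambiguous)
print(sorted(missing))
```

Here the check reports one ambiguous edge (the absence case while waiting) and two pairs the retry state never handles. Both kinds of finding tend to show up only as rare, hard-to-reproduce failures if they are left in the automation.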