Kubernetes made writing poor code a breeze. At work we have microservices crashing 20 times a week but SLOs are not affected since traffic is routed to surviving pods. So we can concentrate on churning features fast instead of writing good code.
K8s might save you from crashes affecting availability, but what about data corruption, logical errors, unsound APIs, etc.? I'm guessing that if code quality is neglected, all of these suffer.
Code quality is not neglected to the point that we have logical errors. The architecture is reasonable and we have an extensive suite of end-to-end and integration tests.
No funding needed, we are corporate owned. The business guys demand features like there's no tomorrow, so we have to make some trade-offs. If the dev team owned the thing, we would have made different decisions.
The basic pattern is a load balancer sitting in front of N services, with a service manager keeping an eye on each service and restarting it if it crashes. Kubernetes can get the job done, but you can do essentially the same thing with a load balancer, some VMs, and the service manager that comes with your operating system (even Windows has one).
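As a rough sketch of the service-manager half of that pattern, here's a minimal restart-on-crash loop in Java (the command and restart delay are made up; a real setup would lean on systemd, the Windows service manager, or a Kubernetes liveness probe instead):

```java
import java.time.Duration;

// Minimal "service manager" sketch: keep one service process running,
// restarting it whenever it exits. Real service managers (systemd,
// Windows SCM, kubelet) add backoff, health checks and logging on top.
public class TinySupervisor {
    public static void main(String[] args) throws Exception {
        // Hypothetical service command; substitute whatever you actually run.
        ProcessBuilder service = new ProcessBuilder("java", "-jar", "my-service.jar")
                .inheritIO();

        while (true) {
            Process p = service.start();
            int exitCode = p.waitFor();          // block until the service dies
            System.err.println("service exited with code " + exitCode + ", restarting");
            Thread.sleep(Duration.ofSeconds(5).toMillis()); // crude fixed restart delay
        }
    }
}
```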
> At work we have microservices crashing 20 times a week but SLOs are not affected since traffic is routed to surviving pods.
This probably makes a lot of people have strong negative emotions, but at the same time i feel that you're not wrong, and it's the only way to deal with modern web dev, where clients/business push for features instead of quality, versus something like kernel/system software development, where there is more pushback against this for historical and cultural reasons.
At work, we have this one monolith system that's at the center of everything else within a particular project - it's not really scalable, it has multiple scheduled processes within it, serves a lot of external API requests, and also has an administrative UI. So far, my attempts to warn people against having a single point of failure like this have fallen on deaf ears, and we still have regular outages where the JVM misbehaves or scheduled processes gobble up all of the server's memory and GC slows everything down.
Contrast this to me finally getting to implement something more like microservices in another project - the services are containerized and run on servers that have been configured with Ansible, are horizontally scalable and have proper load balancing. Furthermore, the scheduled process functionality and others can sit behind feature flags and be enabled within a particular instance, all while not having multiple separate projects and keeping things simple with a single, modular codebase. I actually dubbed this approach "moduliths", horizontally scalable and modular monoliths, since there is no way that this org can handle "proper" microservices, about which i wrote more here: https://blog.kronis.dev/articles/modulith-because-we-need-to...
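To make the part about enabling functionality per instance concrete, here's a minimal sketch of the feature-flag idea (the environment variable and module names are hypothetical, not what we actually use):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch of the "modulith" idea: one codebase, with modules toggled per
// instance via configuration instead of separate deployable projects.
public class ModulithBootstrap {
    public static void main(String[] args) {
        // e.g. ENABLED_MODULES=api,scheduler on one instance, ENABLED_MODULES=admin-ui on another
        String raw = System.getenv().getOrDefault("ENABLED_MODULES", "api");
        Set<String> enabled = new HashSet<>(Arrays.asList(raw.split(",")));

        if (enabled.contains("api")) {
            startHttpApi();
        }
        if (enabled.contains("scheduler")) {
            startScheduledJobs();
        }
        if (enabled.contains("admin-ui")) {
            startAdminUi();
        }
    }

    static void startHttpApi()       { System.out.println("HTTP API module up"); }
    static void startScheduledJobs() { System.out.println("Scheduler module up"); }
    static void startAdminUi()       { System.out.println("Admin UI module up"); }
}
```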
That said, even the older monolith projects can benefit from modern approaches like Ansible for configuration management (which also prevents situations where environment configuration diverges over time and no one has any idea why) as well as being put into containers - the horrible monolith application now also lives within a container (not yet in prod, sadly) and has built-in health checks. Should it ever break and fail to recover within a set time, it will automatically restart, turning an outage that lasts an hour and possibly gets someone paged in the middle of the night into a minute-long interruption before everything comes back up.
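A built-in health check can be as simple as an HTTP endpoint that the container runtime (or a Kubernetes liveness probe) polls and restarts on failure - a minimal sketch, with the port and the actual check left as placeholders:

```java
import com.sun.net.httpserver.HttpServer;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Minimal health endpoint: the container HEALTHCHECK or liveness probe
// hits /health and the orchestrator restarts the container when it
// stops answering or reports unhealthy.
public class HealthEndpoint {
    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/health", exchange -> {
            boolean healthy = selfCheck(); // e.g. DB ping, queue depth, heap headroom
            byte[] body = (healthy ? "OK" : "UNHEALTHY").getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(healthy ? 200 : 503, body.length);
            try (var out = exchange.getResponseBody()) {
                out.write(body);
            }
        });
        server.start();
    }

    static boolean selfCheck() {
        return true; // replace with real checks
    }
}
```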
Personally, i think that with the direction the industry is headed, all services and even servers should be restarted every now and then anyways, since with the JVM/CLR you sometimes get weird things happening after a service has been up for months or years. Knowing why that happens would be nice, of course, but no one actually has the time to address those.
> Knowing why that happens would be nice, of course, but no one actually has the time to address those.
More than likely bad user code, perhaps even race conditions. But in the case of the JVM, with Flight Recorder and other forms of logging, you have a pretty good chance of tracking down the problem.
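Flight Recorder doesn't even need an external tool to get going - besides the -XX:StartFlightRecording JVM flag, a recording can be started from inside the application. A rough sketch (the rolling window, sleep, and dump path are arbitrary):

```java
import jdk.jfr.Configuration;
import jdk.jfr.Recording;
import java.nio.file.Path;
import java.time.Duration;

// Start an in-process Flight Recorder session and dump it to a file
// that can be opened in JDK Mission Control. Settings are illustrative.
public class FlightRecorderSketch {
    public static void main(String[] args) throws Exception {
        Configuration profile = Configuration.getConfiguration("profile");
        try (Recording recording = new Recording(profile)) {
            recording.setMaxAge(Duration.ofHours(6)); // keep a rolling window of events
            recording.setToDisk(true);
            recording.start();

            // ... let the application run and reproduce the weirdness ...
            Thread.sleep(10_000);

            recording.dump(Path.of("suspect-behaviour.jfr"));
        }
    }
}
```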
> ...you have a pretty good chance of tracking down the problem.
It's not that it's impossible due to technical limitations. Even without JFR, there's still VisualVM and any number of APM solutions, like JavaMelody, Apache Skywalking, Stagemonitor, etc.
It's rather a problem of telling the clients/business:
Hey, look, for the next X days/weeks i won't be developing any new features or tending to your user stories, but will instead attempt to track down this persistent, yet somewhat hard-to-reproduce problem.
And because of the limitations in place around accessing production environments, this process will likely take much longer than it otherwise should, especially when it involves blocking synchronous communication while asking for production logs or heap dumps, which are sometimes wrongly exported only after a server restart, making them meaningless.
Alternatively, i will spend a similar amount of time attempting to first get the application instrumented and then we'll run into similar challenges regarding the access permissions for those, before returning to the aforementioned attempts to debug and solve the application issues, because adding instrumentation doesn't magically solve those.
Depending on the environment you work in, this proposition might be accepted, you might find yourself fighting an uphill battle, or people might just look at you like you have two heads; without the backing of the other engineers, you'll find yourself critiqued both for wasting time on debugging with no guarantee of an actual payoff, and for the application quite possibly still not working in the end.
I'm actually in the middle of implementing an APM solution to hopefully give better insights into how the application works, but in many of the environments out there this will be a Catch 22: https://www.merriam-webster.com/dictionary/catch-22
So, if you have control over the application from day one, instead of being onboarded onto a maintenance project where SRE was never a concern throughout its development, consider building for failure - treat it as a question of "when?" rather than "whether?", and do what you can to mitigate the actual user impact when components do fail.
Horizontal scaling is one way to achieve that, and a pretty decent one, as long as you don't attempt to scale your single source of truth.