Keep the monolith, but split the workloads (incident.io)
230 points by kiyanwang on April 24, 2023 | 157 comments



I quite like the article and the advice it presents: building what I'd call modular monoliths (that can have modules enabled or disabled based on feature flags) is indeed a good approach for both increasing resiliency and decreasing the blast radius of various issues.

However, this bit stuck out to me:

> When a bad Pub/Sub message was pulled into the binary, an unhandled panic would crash the entire app, meaning web, workers and crons all died.

I've never seen a production ready web framework or library that would let your entire process crash because of a bad request. There's always some error handling logic wrapping around the request handling, so that even if you let something slip by, the fallout will still be limited to the threads handling the requests and its context.

Now, of course there can still be greater failures, such as issues with DB access stalling for whatever reason, leading to the HTTP requests not being processed, leading to the corresponding thread pool for those filling up, as well as any request queues maxing out and thus all new requests getting rejected. But for the most part, requests with bad data and such, short of CVEs, always have limited impact.

I can imagine very few situations where things working differently wouldn't be jarring.


The most dangerous kinds of mistakes in that area are bugs or oversights that cause massive resource usage or endless loops. Those can easily bring down parts of the application that are otherwise robustly separated in the way you mentioned. Consuming too much memory will kill the app at some point, just using lots of CPU or IO will make everything else grind to a halt. Holding onto resources that are pooled like DB connections is a good way to break everything as well.

There are some languages and frameworks that are designed around handling and isolating even issues like these to some extent, Erlang/Elixir for example.

But in general I would expect a web application backend to isolate services the way you mentioned. But I don't think this is really a preconfigured default in every case, and it's not safe to catch everything and keep running. Isolating web requests and only failing the individual request is of course reasonable and should be the default. But for background services your framework can't know which ones are required, which ones are optional, and when to crash the process. Especially as most languages allow shared state there, so the framework can't know how independent they are.


Indeed, I remember a significant degradation of service that happened when some data files were corrupted.

Each request would need to work with one or more of them. There was a cache, and the code was written to avoid a thundering herd. But in the case where the data was corrupt, an exception was thrown, and nothing was put in the cache. So the application sat dedicating multiple cores to a cycle of loading and parsing these data files then erroring out.


Have you worked with Go codebases before?

Standard practice is to wrap all your entrypoints - the start of a web request, the moment you begin to run a job - in a defer recover() which will catch panics.

Sadly, recover won’t apply to any subsequently goroutine’d work. That means even if your entrypoints recover, if anything go func()s down the call stack and that function then panics, it will bring down the entire process.

We were aware of this but ended up including one by accident anyway. It’s very sad to me that you can’t apply a global process handler to ensure this type of thing doesn’t happen, but to my knowledge that isn’t possible.
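
A minimal sketch of the failure mode (all names are made up, not our actual code): the entrypoint's recover catches a panic in the job itself, but not in anything the job hands off to a new goroutine.

    package main

    import "log"

    // runJob is a hypothetical job entrypoint wrapped in the usual
    // defer/recover so a panic only fails this one job.
    func runJob(payload map[string]string) {
        defer func() {
            if err := recover(); err != nil {
                log.Printf("job panicked, dropping it: %v", err)
            }
        }()

        process(payload) // a panic here is caught by the recover above

        // Work handed off to a new goroutine escapes that recover: a panic
        // inside it crashes the whole process - web, workers and crons.
        go func() {
            process(nil) // writing to a nil map panics, and nothing recovers it
        }()
    }

    func process(payload map[string]string) {
        payload["seen"] = "true" // panics if payload is a nil map
    }

    func main() {
        runJob(map[string]string{})
        select {} // keep the process alive long enough to see the crash
    }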

Worth mentioning Go doesn’t really encourage ‘frameworks’, and most Go apps compose together various libraries to make their app, rather than using something that packages everything together. Failures like this are an obvious downside to not having the well reviewed nuts-and-bolts-included framework to handle this for you.


> Have you worked with Go codebases before?

Several but ...

> Standard practice is to wrap all your entrypoints [...] in a defer recover()

... I've never seen that. Is there some literature pointing to this as best practice?


You’ll find some libraries do this for you, such as HTTP servers.

They do this because if your server code makes a mistake such as accessing a nil pointer, a segfault or panic would bring down the entire process. That’s why you want to recover(), to avoid your process dying.


I mean, net/http does it. That's the standard library.

A go convention that I've made up, or that is perhaps a real one, is to always look at the standard library for guidance on how to write Go. net/http suffers a little bit from being a very early library and thus you might not want to emulate its API surface. But in general, the Go team thought "you know what every HTTP handler in Go needs? recovery from panics" and that is worth some weight when considering your own design.

I would personally recommend fuzz testing your code, including HTTP handlers. The more panics you find in development, the fewer customers you lose from panics in production. Remember that recovering from a panic in an HTTP handler still means your user's request didn't get processed. They are not happy about that, even if your program can still service other users.
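
For anyone who hasn't seen it in action, a tiny sketch (standard library only): the per-connection goroutine recovers the panic, logs "http: panic serving ...", and keeps serving other requests, while the user who hit the panicking handler still gets a failed request.

    package main

    import (
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        // A nil-pointer dereference panics; net/http recovers it per
        // connection, so only this request fails.
        http.HandleFunc("/panic", func(w http.ResponseWriter, r *http.Request) {
            var p *int
            fmt.Fprintln(w, *p)
        })
        // Other requests keep being served after the panic above.
        http.HandleFunc("/ok", func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintln(w, "still serving")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }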


> They do this because if your server code makes a mistake

This is neither good nor best practice.

My take:

- Know that there are simple-mistake panics, and internal-state-just-went-bonkers panics. For the latter, you can guess the boundary of impact (one request, one connection, one job, one userID, one process, ...) but exit(1) is much more reliable than guessing.

- Tests can easily catch simple mistakes like accessing a nil pointer.

- Know that the kind of programmer who does not care about simple tests, will also not care about concurrency bugs which introduce the more dangerous types of state corruption. This corruption would likely be not limited to a single request.

Good luck to frameworks who assume that nothing bad would happen if they ignore a panic and continue serving more requests.


> - Tests can easily catch simple mistakes like accessing a nil pointer.

No, they can't. This is exactly what is hard to test for: complex state that can occur from some combination of many variables with many values.


Respectfully, this is a really bad take. And one that flies in the face of the Go stdlib, given the default HTTP server will catch panics for you by default.


To me, it's unclear what the best solution is here. Other languages solve this differently with tradeoffs, e.g. Java defaults to threads silently dying when an exception isn't caught. Your program will continue to run, but it's probably in some undefined state at that point. There are mechanisms for propagating exceptions elsewhere, but they have to be explicitly set up (like in Go). You can set a default uncaught exception handler, but that's effectively a global variable with all the subsequent "fun", and the uncaught exception handler has to know how to clean up and restore state if the exception was thrown from anywhere, which seems generally difficult to do correctly.


Erlang/Elixir have a great story here: “let it crash”. Because each slice of activity in an application is wrapped in its own process (think single-threaded loop, but you can run a million at a time, almost free to create and destroy), if it crashes it only takes down that web request/process. Recovery mechanisms are built in to get back to a known good state.


I haven't used Erlang extensively, what happens if you crash in the middle of e.g. holding a lock, or during a coordinated dance with other processes?

My concern isn't really "does the program keep running?", it's "does the program keep running correctly?".


That sort of problem is beyond the scope of the runtime in any case, isn't it? In either of the examples you offered (holding a lock, coordinating with other processes), there must be timeouts enforced by the lock or the other processes so that, if something goes wrong, the system isn't waiting for a crashed process to continue the work.

Erlang/Elixir do make this pretty easy to manage, including the scenario where the process does recover by reverting back to a known good state. It won't do it for you automatically, but it exposes enough surface area to make problems like that solvable without reaching for a lot of extra tools - it's built into the runtime.


> That sort of problem is beyond the scope of the runtime in any case, isn't it?

Yes, which is why Go's outright crashing also makes sense to me...both Go and Erlang's behavior seem conceptually the same, with some architectural tradeoffs. It's not really that different for a process to die and restart. If some shared resource reaches an undefined state, then you have to kill everything and reset your state anyway. I suppose Go's behavior lends itself better to "microservices", whereas Erlang's behavior is better suited for "monolith" processes that do a lot of different things.

IMO either of these are better than Java's default behavior of silently swallowing the exception and allowing the thread to quietly die.


The key to Erlang error handling is that crashes should bubble up to a high level which then restarts everything below it in a known good state.

If you're in a coordinated dance with another process you link to that process. If a process you're linked to crashes then you crash too. There's no way to block yourself in Erlang such that you can't be told to crash.

After you crash your supervisor might restart you, if that's what you configured. Or you might give up on your specific task.


Pm2 for nodejs is the same.


> when an exception isn't caught

Not catching all exceptions is a glaring P0 bug.


You should almost never catch all exceptions, i.e. Throwable on the JVM. That is one of the few things that Scala really got right. The `catch NonFatal(e) =>` idiom does that nicely. It will catch all throwables except for a selected set of special cases, e.g. OutOfMemoryError and all the other VirtualMachineErrors. Catching those in a framework just extends the time until a crash follows a serious issue. Crashing early is often beneficial in such a situation. Together with a process watchdog (systemd, Kubernetes, dockerd, whatever), crashing early increases uptime.


Node.js changed behaviors over time.


I probably misunderstand what you wrote. Because I think a (wrapped) panic will only result in crashing the one request that caused it?

For example Gin provides a middleware wrapper for handling panics: CustomRecoveryWithWriter returns a middleware for a given writer that recovers from any panics and calls the provided handle func to handle it.

(https://github.com/gin-gonic/gin/blob/master/recovery.go)
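
Roughly like this, if I remember the Gin API right (a sketch, not production code): the middleware recovers the panic, logs it, and fails only the request that caused it.

    package main

    import (
        "net/http"
        "os"

        "github.com/gin-gonic/gin"
    )

    func main() {
        r := gin.New()

        // Recover panics from handlers: write them to stderr and abort only
        // the request that caused them.
        r.Use(gin.CustomRecoveryWithWriter(os.Stderr, func(c *gin.Context, err any) {
            c.AbortWithStatus(http.StatusInternalServerError)
        }))

        r.GET("/panic", func(c *gin.Context) {
            panic("bad request data")
        })

        r.Run(":8080")
    }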


The post-mortem we published about this outage explains the details of how this crashed the binary.

It gives some code examples and explanations that should clear this up: https://incident.io/blog/intermittent-downtime#mitigation-1-...

^ link should go to the relevant section


Will have a look, thanks!


We had a single supervisor for both an Elixir/Phoenix web app and all the tasks it spawned to do long-running jobs and scheduled tasks. That was very naive. A job crashing too quickly too many times would bring down the supervisor and the web app. We moved the web app to its own supervision tree and isolated tasks to their own supervisor too. I don't remember the details but in the end everything could crash and everything else would keep running.


That's why you use a separate `Task.Supervisor`.


> I've never seen a production ready web framework or library that would let your entire process crash because of a bad request.

It's rare, but I've seen linked C libraries trigger a segfault that takes the whole thing down; no global try/catch strategy can help you when that happens. There should generally be something that supervises and restarts the whole server process, but it's definitely painful and can affect threads other than the one handling the bad request.


Depends on the underlying framework and how it isolates processes... I could see this happening in a monolithic JVM application where it pretends there are separate containers, but a fatal error in the JVM will crash the world on that server.

A related example I lived through with a Perl application, someone decided to use this library 'Storable' that would serialize the memory in a binary format. We upgraded the library and started seeing "slow" performance across the server farm.

We recognized the processes were intermittently crashing and after decoding core dumps... figured out it was this upgrade to the Storable library. Apache httpd server chugged along just fine restarting processes. So different run-time, different type of crash resiliency.

Long-term lesson... be extra cautious with memory-serialized objects. Newer libraries have better protection here, parsing a header to detect compatibility issues before loading the raw object into memory, but the potential is there, especially with distributed systems today.


> There's always some error handling logic wrapping around the request handling

Yeah, some languages make this easier than others.

The way you described it made me think it was Rust, a language where those handlers are not trivial, but not incredibly hard either. But it seems that Go developers have it worse (no big surprise here).

Of course, the champions of unhandled failures are always C and C++, where almost any issue is impossible to recover from. It's no coincidence that those are low level languages (and Go).

When you move from the more controlled languages into those, you tend to also move from in-language error recovery to system-wide error recovery.


I actually found that in practice my go applications were much more resilient. First one in production was handling well over 100k req/s and it was extremely reliable. As a beginner it was very satisfying.


My team has experienced _exactly_ the same cyclical panicking in production before because of the same library.

The Google Pub/Sub Go library does not handle panics by default, so if any message payload plays badly with your code and panics, you cyclically panic the service.

That's because the message keeps getting retried because you don't `Ack` it. Non acknowledgements get retried automatically, you get the picture.
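
A sketch of how you can guard against this (simplified, assuming the cloud.google.com/go/pubsub client; handler names are illustrative): recover inside the Receive callback so a poison message becomes a logged failure instead of a process crash.

    package main

    import (
        "context"
        "log"

        "cloud.google.com/go/pubsub"
    )

    // handle is a hypothetical message handler that may panic on a bad payload.
    func handle(m *pubsub.Message) {
        if len(m.Data) == 0 {
            panic("malformed payload")
        }
    }

    func main() {
        ctx := context.Background()
        client, err := pubsub.NewClient(ctx, "my-project") // illustrative project ID
        if err != nil {
            log.Fatal(err)
        }
        sub := client.Subscription("my-subscription")

        err = sub.Receive(ctx, func(ctx context.Context, m *pubsub.Message) {
            // The library does not recover panics for you: without this
            // defer, one bad message panics the process, gets redelivered,
            // and panics it again.
            defer func() {
                if r := recover(); r != nil {
                    log.Printf("panic on message %s: %v", m.ID, r)
                    m.Nack() // redeliver (ideally to a dead-letter topic after N attempts)
                }
            }()
            handle(m)
            m.Ack()
        })
        if err != nil {
            log.Fatal(err)
        }
    }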


> I've never seen a production ready web framework or library that would let your entire process crash because of a bad request.

Node does this (regardless of framework/server) if you have some unhandled promise rejection/error in an asynchronous function.


You can catch these with Node but I wouldn’t do it - if it’s not explicitly handled then it’s either a bug or something really bad happened and the app needs to get alarm bells ringing.


Error recovery in long-running dynamic-y processes is nontrivial, especially for “programming errors”. While it can be done, it's much easier to stick to crash-only practices. It also helps in production, where you can safely crash and restart without worrying about stale something.


which can be handled easily with unhandledrejection.

Though ideally, web frameworks such as NestJS already handle the rejection at the request level, and that can be overridden easily with Filters. Yeah, I know Express doesn't handle it; you'll need to handle it yourself.


Erlang and any Actors models I think


Yes and no. Erlang does apply the "fail fast" idea and recommends letting the actor crash instead of trying to handle the error (with some pragmatism, of course). But we are not talking about crashing the process of the whole app. In most cases in Erlang (and you find the same idea in other massively parallel systems or with microservices), your actor will be handled by a supervisor, whose job is to keep it alive and so restart it when it crashes. Also, most of the time in Erlang, having an actor crash and spawning a new one is really fast, so you can afford to crash even for something as basic as a malformed JSON request or similar. Not something you can really afford when spawning a new system is more in the order of several hundred ms.


One critique I have is that this presents a binary option of either full monolith or microservices.

The truth is somewhere in between, where you split off dependent services that are large enough to warrant being their own monolith. Breaking up a monolith is almost never (anecdotal observation after 15 years of seeing this argument surface in every company I've been in) a technical need, but a combination of organizational and business requirements. This is not to say it's never technical, just that it's rare for it to be a reason.

Organizational: The teams are so huge that they prefer their own development cycle, deployment, and rollout safeguards.

Business: Partitioning of products for different SLAs, which is difficult to guarantee when you have a different team introducing some bug every other week that takes down the servers.


A monolith doesn’t have to be one giant ball of mud. It can be discrete, well-factored services all by itself. I recently worked on decomposing a monolith into microservices, but it felt like we were just spreading one big problem over multiple services. All of the services ended up being tightly coupled, even though the goal was to avoid that. We created a macrolith.


And it’s incredibly difficult to work with a distributed monolith. I find that the joy I get from development just fades completely when even trivial changes involve too much faff.


Exactly! All this debate seems to stem from people being forced to work with a big ball of mud and then concluding that no one process should ever be allowed to become that big, because big processes are balls of mud. And so, having missed the real lesson (which isn’t ‘don’t make it big’ but rather ‘don’t make it out of mud’!) they build a big ball of little balls of mud.


> they build a big ball of little balls of mud

Nah, they make a lot of sturdy, well structured rocky balls.

And let them move around a big mud soup that is outside of their sight. So they don't even see mud.


It's not about monoliths always being a ball of mud. Even the most well-composed monolith still has problems with teams wanting to do conflicting release cycles, needing clearer ownership over who has responsibility for what part of the codebase, and knowing who should be responsible for on-call for which services. And there is, of course, dependency hell, since everything in your monolith probably should depend on the same version of third-party libraries.


> everything in your monolith probably should depend on the same version of third-party libraries.

And definitely runs on the same underlying language/compiler/runtime version


Yep. Want to upgrade from Python 3.8 to 3.10? Good luck, cause you've gotta get every engineering team in the organization to buy in and schedule time to do upgrade testing, not to say anything about fixing any breaking changes.


Agreed, transforming a monolith into a 'distributed monolith' just creates a different set of even worse problems, the root issues are still unsolved.

Still, thanks to AWS it's in vogue, expensive, and probably keeps half of us employed when we have to untangle it all!


You’re right, that’s a very fair point. I should’ve made it clear that the middle ground of moderately-sized but logically separate services is very much a good world to be in.

I’d actively encourage splitting a monolith into services when the candidate split would be:

1. A service that has a clear purpose that can be explained either separately from the monolithic whole or as a complementary piece

2. It will receive enough active development that the cost of upgrading/on-going investment is an acceptable percent of total work time

3. It can improve the organisation's state of ‘ownership’ and better align the codebase with the team structure that owns it

There is always a middle ground, and I should’ve done better to explain that.


And one technical/business reason that's very valid: error-proofing the service and database. To prevent the database from being populated incorrectly (financial data, for example), you put it behind a set of well-defined APIs, which also prevents leakage of sensitive data (PII or credit-card info) and makes it easy to audit.


Do you think it's good idea to migrate from monolith to microservices before scaling the team?

We are essentially doing this right now. I personally don't think the teams are yet large enough to warrant this migration, but leadership says that we can't scale further before we have microservices. Imo most of the problems could be solved with increased testing and knowledge about the system. The tech could also definitely still scale just by boosting the hardware and occasional performance passes on slow endpoints/queries.


> but leadership says that we can't scale further before we have microservices

These are depressingly common alarm bells.

There are companies with valuations over 12 figures which don't do microservices, so the scaling justification is dubious at best.


> can't scale further before we have microservices

Scale what? Load? (Seems not.)

Scale the team? This is an often-overlooked aspect of microservices: you're able to isolate the domain knowledge to smaller services and hire more specialized engineers who can focus on a specific silo of the app and only have to work to discrete APIs.

Scale features? There's a legit argument to be made that if you see yourself going to a microservice architecture in the future based on your current codebase or product roadmap, doing it sooner rather than later is always going to be preferable. The bigger the monolith the harder it is to unwind. It could also be that there's some tech debt that's something of a limiting factor, and making the change in architecture could be an opportunity to address some of that.


Are you doing “extreme” microservices? A separate database for each microservice, one endpoint for each microservice, etc. Or the slightly more sensible: split the monolith up into obviously independent systems (very much like the article talks about).


>extreme microservices

>separate databases

Isn't that like the basic thing?


If your database isn't the performance issue and those two services spend 99% of the response time on calculating the 15755th digit of pi, not separating the database isn't a problem. Similarly, if both of your services need to fetch some user information, but you haven't created a user information service with its own database they could both talk to, having a shared database is fine.

Splitting each and every service with its own database, talking to other services to access their database by default is cargo-cult behavior. All it's going to do is make it hell for you to work on those services.


But if one database dies, then all the microservices don't work.

So why are you even using them?

The benefit of microservices is also reliability, which you just threw away.

Isn't this basically a distributed monolith?

So you've combined the bad things of both worlds!

The single point of failure of a monolith,

and the deployment difficulties and need for saga-like patterns to deal with network trickiness from distributed designs.

What's the point?


It wouldn't be unreasonable to say database products are an order of magnitude or two more reliable than code written by many teams.

There's some advantage in having code separation first, but it is admittedly not a benefit of reliability.

These half-measures are generally done when converting an older architecture to a newer one. I don't think many places actually do big bang conversions/releases going from monolith -> microservices where you can do everything correctly. That runs counter to the CI/CD trends that we taught them.

The other issue might be a conflict in database table constraints (or worse, triggers/sprocs) that don't match your microservice boundaries, so the business ends up wanting wins faster than your team can resolve those conflicts.


Once again, cargo-cultish reasoning, where it's all or nothing.

1/ Replication & read-only copies for resilience exist, to allow you to function in degraded state if your database goes down.

2/ If the database for one service goes down, any service that calls it ends up being down anyway. No, being able to respond with a 503 that says that megatron-service is down and you can't fetch the data isn't different from responding with a regular 500 saying that the database is down and you can't fetch the data.

3/ All of your calls do not necessarily end up calling the database. Your service starts even if the DB is down, some calls will simply fail. You're providing a degraded service.

4/ If they need similar data, you really don't want to have two databases holding that data. Because I can promise you, the cost of having to handle replication, de-duplication, synchronization, configuration, GDPR compliance, etc on X databases is infinitely higher than having a database where you simply turn on a write replica that fails over.

5/ SPOF-fear is bullshit, you always have a single point of failure in your stuff. 2FA service is down? Sure, your entire service isn't dead, but hey, users can't log in; sounds like a pretty fucking big point of failure. Microservices just distribute your single point of failure into a dozen equally bad, equally hard-to-track-down problems. And even in a monolith, you can easily work around these problems: what kind of fucking code have you seen where a whole server will not start because it can't connect to a database?

6/ You do not need your servers to be distributed. That is bullshit, and if you are at the scale where you actually need to, it'll be the least of your problems.


The 1st point is really good, makes sense!

Your 3rd point shows a scenario where the 2nd point is not applicable.

>what kind of fucking code have you seen that a whole server will not start because it can't connect to a database ?

Server? Idk. App? Sure, CRUD app that requires config/filling cache dictionaries from db

>Once again, cargo-cultish reasoning, where it's all or nothing.

That's a fair point, so what are microservices doing for you then?

Allowing teams to deploy independently?


The third point is directly linked to the second point: if you are doing a call that requires access to that database, whether it's going through two services or one, it will lead to an error (just a different one). If you are doing a call that doesn't require it, it'll go through smoothly.

>Server? Idk. App? Sure, CRUD app that requires config/filling cache dictionaries from db

Not once in my life have I written either a server or an app that will completely fail upon not having a database (the .php files written in early college years don't count.). Either of these can function in a degraded way without the database; In the case of an app, still display your UI, still display everything that can be displayed without the presence of the database, and warn that said database is down. The same way you'd do with micro services.

>That's a fair point, so what are microservices doing for you then? >Allowing teams to deploy independently?

Where I currently work, our services are gathered in various bills of materials to have a set of components that we know works, and that the client requires (government work is always fun). Multiply this by 20 different environments, and you're quite happy to have those BoMs. However, it could very much be done without microservices. Microservices allow us to bypass some problems: one part of the server really, really, really needs to read a lot of data and to build a graph before starting, which can in some cases take upwards of two hours. Anything that doesn't depend on this would be very happy to start up without waiting for that to happen, so this is when you split it.

Don't do microservices because it's the hip thing to do. Micro services are just one more tool to alleviate large problems. Sometimes, splitting it into two services is the right thing to do. Sometimes, sharing the database is the right thing to do.


Why even bother then? If your code shares the same database then it’s a single service. Anything else is just unnecessary overhead.


Wut? You can share databases and simply not write to the shared database because it’s a read only replica. Saves a bunch of overhead…

There’s a dozen reasons to share a db.


Then you're depending on the schema of the other service, which means you can't deploy them independently. You might as well have kept everything in a single service.


Aren’t you doing that anyway with json, protobufs, or whatever and using RPC? There’s always some agreed upon contract between services and changing that contract can always be delicate.


If you depend both on schema and RPC, it becomes very difficult to gradually deploy changes without downtime. You’re forced to be both forwards and backwards compatible.

Worse, schemas are often only part of the contract. It’s very common to have logic and constraints in code, maybe even a majority. To share that across services you end up with shared models, then you end up with some utils with logic, before you know it you have yet another deployment difficulty.

Might as well just not split into services and make it trivial to maintain compatibility and deploy changes.


Ah, I agree with your conclusion but not about sharing DBs. Sometimes it's literally the only way to share data (particularly very large sets of data).


I generally think micro services are a bad idea, partly because it’s so useful to share data and code implicitly.

Somewhat more macro services can make sense, especially when each team owns a service. That’s also when sharing data is harder.


Yep. The only example I have from my career was a team having their own read replica and ETLing that to Hadoop. The schema hadn’t changed in 10+ years, so it wasn’t even considered a real risk.


We had microservices that shared a database, but each got a separate schema. We decided to treat cross-schema queries the same as RPC: we export an API from a schema (a view or stored procedure) that other microservices can depend on.

The microservice that exposes the database-level API can do any updates as long as the API stays the same (exactly the same as with classical RPC).


> This is not to say it's never technical, just that it's rare for it to be a reason.

It will be a technical reason a lot more from now on, as we begin to utilize more "AI" stuff, like language models.

All of those are slow to start, because they need time to load the huge model, so it is better to make a service that's dedicated to the AI stuff, so the development of the business side doesn't get held back by the slow restart cycle.

And AI is just an example; there are a lot of cases where it's more reasonable to split things into another service too, for example some CPU-intensive tasks. Even Erlang/BEAM can't do anything about it if the code is written in C/Zig/Rust and gets called using NIFs.


One often finds derogatory remarks about PHP, or that the popularity of the language is declining. Interestingly, the concept of PHP prevents exactly such problems, as a monolith written in PHP only ever executes the code paths that are necessary for the respective workload. The failure that the Rails app experienced in the post simply wouldn't have happened with PHP. Especially for web applications, PHP's concept of "One Request = One freshly started process" is absolutely brilliant.

So, OP is right, write more monoliths, and I would add: write them in languages that fit the stateless nature of http requests.


> "One request = one freshly started process"

This isn't true for all (most?) PHP applications today. PHP installations include the FastCGI Process Manager (php-fpm). According to Wikipedia (https://en.wikipedia.org/wiki/FastCGI),

> Instead of creating a new process for each request, FastCGI uses persistent processes to handle a series of requests.

According to the PHP Internals book (https://www.phpinternalsbook.com/php7/memory_management/zend...), PHP is close to a "share-nothing architecture" thanks to custom implementations of `malloc()` and `free()` that know about the lifecycle of a request.


It depends on the configuration.

One can set pm.max_requests=1 and have the process respawn per request.

https://www.php.net/manual/en/install.fpm.configuration.php#...

And the main point still stands: If a request manages to crash a php-fpm child process, other requests are unaffected and another process is spawned to replace the crashed one.


PHP is amazing; yes, the syntax is a little awkward. I love working with it.

Especially with libraries/frameworks that are 10+ years old. Their documentation is extremely detailed.

So you have more confidence that you'll deliver a working product, instead of reading GitHub issues to fix an obscure error message.

Old does not always mean outdated.


> Interestingly, the concept of PHP prevents exactly such problems, as a monolith written in PHP only ever executes the code paths that are necessary for the respective workload.

This is just lovely when you have hundreds of PHP endpoints written by your predecessors and each endpoint has rewritten an arbitrary slice of the stack (usually data model layer) because there is no common code path required. Refactoring anything below the html layer becomes impossible.

In fact, calling it a software monolith is misleading, because each PHP script is its own little microservice with poorly-defined API boundaries.


Hmmm. When was the last time you wrote php? One file per route has been outdated for over 10 years.


This particular codebase I am maintaining is 10-20 years old.

The point stands. Isolated code paths are great until you have to care about the ways they are not actually isolated (e.g. common database, platform-specific code, etc.).


I wasn’t arguing that, just arguing that that is quite an unusual approach in modern php.


Small note that the app in the post was not Rails, it’s a Golang app.

You’re right in that PHP couldn’t fail like this though, as the isolation boundary is a Unix process, where the OS should contain all your errors.


The irony is that in a normal Rails app workloads are already split by default (requests and background jobs are handled by different processes that share the same codebase).


Yep, it's a very normal thing in the Rails world to separate things like this. For Django too.

I'd encourage people to look a past just the splitting of workloads though. The other point about resource limits for shared resources across workload tiers is really key, such as having very granular database connection pools even inside of a single workload deployment.

That's less common – though not unheard of – in Rails-land.
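
For the Go folks, a rough sketch of what I mean by granular pools (the driver choice and numbers are illustrative, not our real config): give each workload its own capped database/sql pool so, for example, a burst of background jobs can't exhaust the connections the web tier needs.

    package main

    import (
        "database/sql"
        "time"

        _ "github.com/lib/pq" // Postgres driver; illustrative choice
    )

    // newPool returns a connection pool capped for one workload.
    func newPool(dsn string, maxOpen int) (*sql.DB, error) {
        db, err := sql.Open("postgres", dsn)
        if err != nil {
            return nil, err
        }
        db.SetMaxOpenConns(maxOpen)
        db.SetMaxIdleConns(maxOpen / 2)
        db.SetConnMaxLifetime(30 * time.Minute)
        return db, nil
    }

    func main() {
        dsn := "postgres://localhost/app?sslmode=disable" // illustrative DSN
        webDB, _ := newPool(dsn, 20)   // interactive requests get their own budget
        workerDB, _ := newPool(dsn, 5) // background jobs get a smaller, separate one
        _ = webDB                      // error handling elided in this sketch
        _ = workerDB
    }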


It comes down to how you want to do concurrency I guess? Processes? Threads? Event loop? Is there a clear winner? I often see debates or even flamewars about what is the best approach.


This is a good point. Having a single process to handle all requests is asking for trouble.


Author here, thanks for sharing!

This is a strategy I’ve used across many projects in several languages, from big Rails apps to the Go monolith we deploy at incident.io today.

It’s just one way you can make a monolith much more robust without moving into separate services, which helps you keep the benefits of a monolithic application for longer.

Hope people find it interesting.


While I completely agree that the efforts to build and maintain a set of micro-services are often better leveraged by a single monolith (even one that has several "run modes" as you've done here), a few questions inevitably come up:

How do you coordinate the efforts of 200 engineers on a single repo/code base?

Do engineers frequently get into long/drawn-out merge sessions as common code may be modified by a number of engineers who are all trying to merge around the same time? This is actually one of the reasons I really like GOLANG: "A little copying is better than a little dependency."


I have worked in a company with a monolith and about ten teams working on it. This is what helped:

- Merging was automated (a robot tried to run tests against fresh master and merge only if green).

- Deploy was fully automated and limited to working hours.

- We added tests for problematic parts. For example, static analysis for database migrations to allow only safe actions in an automated fashion.

However, if something goes wrong in some component, you have to revert and stop the deploys for everyone, which sucks. I'd say at around 8-10 deploys per day it makes sense to start splitting the components, or at least not adding new teams to the same monolith.


I worked on a monolith with 100+ devs working on it daily. Merging was automated as well for us, and a few unit tests were run on a production env — basically tests that asserted an engineer didn’t do anything stupid like add an infinite loop, or any of the other dozens things engineers had done that caused downtime. Deployment was done by running a script that took a lock, monitored the deployment progress, then released the lock. Sometimes you’d run the script and see people who hadn’t deployed yet, so you’d send them a message and ask if it was good to deploy. Sometimes they would say no if they found a last minute bug, so they would deploy your code for you.

Surprisingly, coordination isn’t that difficult. Humans are pretty good at talking to each other.


Thank you for sharing this article!

> But having spent half a decade stewarding a Ruby monolith from 20 to 200 engineers and watched its modest 10GB Postgres database grow beyond 5TB, there’s definitely a point where the pain outweighs the benefits.

Woah!

I for one would also love to hear your insights on scaling personnel from 20 to 200.

We’re in a similar boat / anticipated growth phase (at 20 odd engineers), and whilst there’s a lot of content on this topic, I’d appreciate your practical “from the coal face” take (much like what you’ve done with this monolith article).

Perhaps a follow up article? ^_^


Glad you enjoyed the article!

The experience I reference was from my time at GoCardless, where I was a Principal SRE leading the team who were on-call for the big Rails monolith.

I’ll put the topic of “what does 10x eng team scaling feel like” on my todo list for posts, but if you’re interested there’s a bunch of articles on my personal blog from this time.

One that might be interesting is “Growing into Platform Engineering” about how the SRE function evolved:

https://blog.lawrencejones.dev/growing-into-platform-enginee...


Question: Have you considered one separate binary (but still sharing the entire codebase) for each "workload" instead of using a flag to switch? If so, can you highlight the advantage of the flag way?


Yep: this is definitely a valid alternative.

The reasons we went with one binary are:

- We can ensure common runtime setup by only having one codepath that runs the initialisation. It's nice that every deployment definitely instantiates the Prometheus server and initialises secrets from Google Secret Manager at the same point and in the same way.

- Compiling three targets adds time to our build process. Even when Go will cache much of the compilation, linking separate binaries still adds another ~30s and ~300MB of binary to our resulting docker image.

- It's very convenient in development to run all deployments inside of a single binary for hot-reloading purposes. This is how we develop the app locally.

I don't think it makes much difference which you choose in terms of the operational elements. But for the reasons above, there are clear advantages to the single binary for our use cases.
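
In shape, it's something like this (a heavily simplified sketch; the flag names and setup functions are illustrative, not our real code):

    package main

    import (
        "flag"
        "log"
        "os"
        "os/signal"
    )

    // Placeholder type and functions standing in for the real components.
    type deps struct{}

    func initRuntime() deps { log.Println("metrics, secrets, DB pools ready"); return deps{} }
    func runWeb(deps)       { log.Println("web listening") }
    func runWorkers(deps)   { log.Println("workers consuming") }
    func runCron(deps)      { log.Println("crons scheduled") }

    func main() {
        web := flag.Bool("web", false, "serve HTTP traffic")
        workers := flag.Bool("workers", false, "consume async work")
        cron := flag.Bool("cron", false, "run scheduled jobs")
        flag.Parse()

        // Shared setup runs exactly once, on one code path, for every
        // deployment regardless of which workloads it enables.
        d := initRuntime()

        if *web {
            go runWeb(d)
        }
        if *workers {
            go runWorkers(d)
        }
        if *cron {
            go runCron(d)
        }

        // Block until interrupted; real code would propagate shutdown.
        sig := make(chan os.Signal, 1)
        signal.Notify(sig, os.Interrupt)
        <-sig
    }

Each deployment then runs the same image, just with different flags enabled.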


Thanks!


* edit: I am an idiot who can’t recognize golang code (facepalm)

So it’s a Ruby* app, which means code you put in the container but never use doesn’t do you any harm - it doesn’t bloat a binary or take up space in memory.

but…

… whenever you make a change in one of your pub-sub handlers, you still have to redeploy your web workers.

… code and dependencies for your pub-sub handlers are sitting idle in your web workers, increasing your security risk and attack surface

Why not just turn this into a monorepo that publishes three artifacts? Having just one container image seems like a weird thing to optimize for


I think that publishing 3 artifacts causes a number of other changes which expand the difficulty of managing the backend of the app. While they do allow, hopefully, independent scaling and separate recovery processes, separate artifacts also make it a bit more difficult to do local development, and deployments become more complex, so it's not a completely clean tradeoff.

I know systems like Kubernetes / Docker Swarm handle the management of multiple artifacts, but that is now another tool needed in order to do development.

So I guess what I'm saying is that "just" publishing multiple artifacts has more knock-on effects which need to be considered. Even though I do agree with you that it's probably for the best in the long term if these processes are in separate containers.


In this model there are already three different command lines to run the thing in three different modes. Deployment complexity of moving that decision one notch left to buildtime instead of startup time seems like it doesn’t increase the accidental complexity at all.

In fact now you can eliminate variation in how your three different processes are started up and health checked and monitored, so instead of code that knows how to operate three different things, you have code that knows how to start and operate an arbitrary thing, and you parameterize it with your three different containers.

Oh hey - that’s what kubernetes is for.

Maybe all this ‘complexity’ of tooling isn’t dumb after all.


> … whenever you make a change in one of your pub-sub handlers, you still have to redeploy your web workers.

We have a similar model with Django. We deploy at least four times a day, so it's much easier to just redeploy everything than reason about what needs and what doesn't need to be redeployed.


It’s a Go app, not a Ruby one. Hence the Go code examples.

And the rationale behind one binary rather than several is to reduce build time and deployment artefact size, along with benefits in development from one binary. All of which mean deploying is much quicker/easier, which means it’s less of a big deal that we deploy everything whenever any part changes (we deploy anywhere up to 30 times a day).

Hope that helps it make more sense.


Apologies - read the ‘Ruby monolith’ up top and then glossed over the code without noticing the language mismatch. Mondays, am I right?

I don’t understand how bundling three sets of functionality into one binary reduces build time or artifact size.

Surely the binary containing all the web and pubsub and scheduled tasks is strictly bigger than the binary for just the web would be, and takes longer to build and test than the binary for just the web?


No problem, easy to skim by it!

In terms of how this reduces artifact size: Go statically compiles everything, and the majority of that weight is from dependencies. Compiling into three different binaries would:

1. Increase build time, because even while the dependencies are cached between builds, linking isn't free. It takes about 20s to link each binary, so the two extra binaries add about 40s+ to the build.

2. Increases build size, because each binary is about 150MB, so we go from 150 -> 450MB, which increases time to upload to container registry and download into the infrastructure that runs things.

In answer to the security surface, the code between cron/worker/web is very interlinked. There would be little practical reduction in surface area from splitting the binary, though perhaps for some applications there might.


You only have to rebuild the artifacts that are affected by each change.

And each change (assuming CI/CD) only causes the changed build artifacts to be pushed to prod.

So the amortized build/push cost depends on how often you are making changes that affect multiple artifacts versus fewer artifacts.

If an individual change only affects either the web binary or one of the cron job binaries then those build times will be shorter than they would be to build the big binary, and the amount of data pushed to prod will be smaller.

If an individual change affects multiple binaries then building all of them will take longer, and more data will need to be pushed (although maybe it nets out the same in terms of fan-out to multiple servers: your cron servers only need the cron binary, your web servers only need the web binary).

So, if the vast majority of the time you are only impacting a single component build, you save build time and deploy bytes by building smaller artifacts.

At the cost that making cross-system changes requires a slightly longer build time.


Maybe the three binaries have shared deps, by bundling them on one you get to not duplicate those shared deps?

I dunno as I have never used Go, but this would make sense with node

Also thinking how all the related tasks a la CI/CD might be faster in total time with one process than three, since they're doing less total work, both during tasks and to wake up the process, etc. Dunno, just speculation on my part.


> reduce build time and deployment artefact size

I'm confused... doesn't this plan _maximize_ your build time and deployment size? You can't do worse than building and deploying every line of code you own on every build and deploy; it's the worst case scenario. Go is fast enough in compilation that it doesn't matter, right? But then what's the point?


This is standard practice on monolithic apps. It is effectively an N-tier architecture. On Rails it is very expected that you would deploy the same codebase to two tiers (web and worker) and simply run rails s in one and sidekiq in the other.


We're doing this with our monolithic Node.js / Express application as well, but I wasn't aware that it's a common pattern. For us having a monolithic app greatly simplifies development, deployment, updating dependencies, shared code, etc. Just having a single set of third-party dependencies to update makes things so much easier.


It’s ‘standard practice’ for many frameworks like Rails which encourage the pattern via a separate deployment of the async worker tier.

But outside of these frameworks, you won’t often see it. And especially in languages like Go which tend to avoid opinionated frameworks it’s common not to see a split like this, even when it might be really beneficial.

As with most programming concepts, standard practice varies a lot depending on the tools you use. It’s why I thought this post was worth writing, in that there are many people who haven’t run Rails/etc frameworks before and may not have considered this approach!


It's so common Heroku has worker dynos and cron support via the scheduler add-on.


Ditto in Django + Celery


Anyone using a more modern/lightweight alternative to Celery?


If you're using PostgreSQL, then

django-postgres-queue: https://github.com/gavinwahl/django-postgres-queue

procrastinate: https://github.com/procrastinate-org/procrastinate/


Fantastic, thanks.


We use RQ[0], it has Redis as a dependency. It’s pretty straightforward and we’re very happy with it. If you are using Django you may want to look at Django RQ[1] as well. RQ has built in scheduling capabilities these days, but historically it did not so we used (and still use) RQ Scheduler[2] which I think still has some advantages over the built in stuff.

[0] https://python-rq.org/ [1] https://github.com/rq/django-rq [2] https://github.com/rq/rq-scheduler


Thanks, will check it out!


Yes, dramatiq is quite good, but I ditched it for something even more lightweight -- pubsub subscribers via push notification so workers are web workers responding to http requests that might take up to 10 minutes


That's a very interesting approach. Do you have anything written about it or any public repo? Thanks!


ditto Laravel (PHP)


I appreciate the keep your monolith argument. It seems like a lot of people have the attitude of let's start with a monolith, and switch to microservices when the monolith becomes painful.

Having seen how much effort goes into getting microservices running well for any non trivial setup, I wonder what could have been achieved if 25% of that resource had gone into improving the monolith.


> Bad code in a specific part of the codebase bringing down the whole app, as in our November incident.

This is a non-issue if you're using a Elixir/Erlang monolith given its fault tolerant nature.

The noisy neighbour issue (resource hogging) is still something you need to manage though. If you use something like Oban[1] (for background job queues and cron jobs), you can set both local and global limits. Local being a single node, and global across the cluster.

Operating in a shared cluster (vs split workload deployments) give you the benefit of being much more efficient with your hardware. I've heard many stories of massive infra savings due to moving to an Elixir/Erlang system.

1. https://github.com/sorentwo/oban


We do something similar with Reclaim.ai:

It's one monolith Java repo, split into two major workloads that have the same runtime code but are configured differently. In our case, the two workloads are "api" and "jobs". This allows for operational issues on, say, jobs while still allowing user-facing APIs to be less likely to be impacted. The source code itself has many modules, but we deploy it as a single unified runtime.

We're very happy with this approach and so far have only been tempted a few times to break out micro/mini-services. I suspect we will eventually break out a few services, but by the time we do, it'll be for very good reasons.


When you deploy a new version does it go out to all nodes? Or are the distinct workloads versioned separately?


For now, it goes out to all nodes, though the pace/rollout is different. For our jobs cluster, it swaps them out pretty quickly. For our API cluster, it takes longer so that in-flight requests gracefully get handled by our load balancer before pulled out of the cluster rotation. So far that has worked well enough for us.


We used this approach in a previous project, and it can work quite well in my experience. It's wonderful to be able to avoid the insanity of microservices.


Front end gateways make this so easy. Stand up the service a couple of times, then send different routes to different pools. This is the easiest implementation of the Bulkhead Pattern one can see, it's great. https://learn.microsoft.com/en-us/azure/architecture/pattern...

I've had so many jobs where we either have some awful slow routes, or we have some latency-critical routes. Thus far I have yet to convince a single company what an incredible & vital win this would be, to create different service pools for certain classes of routes, but wow have I tried to make it happen & man what a win it would be for users.
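
As a sketch of what that looks like with a plain Go reverse proxy (the pool URLs and route prefixes are made up): the same app image runs in two pools, and the gateway sends the known-slow routes to their own pool so they can't starve the latency-critical ones.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
        "strings"
    )

    func mustProxy(target string) *httputil.ReverseProxy {
        u, err := url.Parse(target)
        if err != nil {
            log.Fatal(err)
        }
        return httputil.NewSingleHostReverseProxy(u)
    }

    func main() {
        // Two pools running the same monolith image; URLs are illustrative.
        fastPool := mustProxy("http://app-latency-critical.internal:8080")
        slowPool := mustProxy("http://app-slow-reports.internal:8080")

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            // Route the known-slow endpoints to their own pool so they
            // can't starve the latency-critical ones (the bulkhead pattern).
            if strings.HasPrefix(r.URL.Path, "/reports") || strings.HasPrefix(r.URL.Path, "/exports") {
                slowPool.ServeHTTP(w, r)
                return
            }
            fastPool.ServeHTTP(w, r)
        })

        log.Fatal(http.ListenAndServe(":80", nil))
    }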


Splitting the workloads is a good, natural progression. The next pain point will be data boundaries. What tables are shared and joined against in different areas of the code? The ability to scale data access can be hard. As teams grow, shared understanding goes down and eventually data will get tangled together. Scaling systems is scaling data access. Scans, joins and aggregations will put pressure on the db and moving segments of data to their own data stores will be called for. Will you be able to untangle the monolith?


Doesn’t mean microservices is a good fit. You’ll need a team per service, but you don’t have the people who know the boundaries. Better keep it as a monolith in that case.

It’s far far far worse to split something up with bad understanding about the data and access patterns than to just struggle along.


It’s odd that there’s still one binary. You can have a mostly monolithic library but produce a web binary and a worker binary, etc. Go can produce them efficiently with a single build command.


It's more convenient to have a single binary that does all related operations depending on how it is run.


Agreed, you can even put all the binaries in the same image.

Ultimately not a big deal.


Leave the Gun, Take the Cannoli

> When a bad Pub/Sub message was pulled into the binary, an unhandled panic would crash the entire app, meaning web, workers and crons all died.

Several years ago while working on a Message Queue based application, we had a poison message: one where the consumer process would read the message and then die because the message was malformed, then the message would go back to the queue, and due to the FIFO nature of the queue the retries would also fail.

Ideally, there should be enough checks to ensure that a poison message can never be created. But if edge cases cause this issue, the handler should have some way of knowing that it is retrying the message for the n-th time and then if it fails, move it to a dead-letter-queue and report failure on handling that message.

Unfortunately, this pattern of issues is quite common. So, we fully expect this and design around it.

Then, any message in a dead-letter-queue gives you an example of a crash-inducing message which can then be debugged for root-cause.
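
The retry-then-dead-letter logic is simple to sketch (a library-agnostic Go sketch; the message shape and limits are illustrative):

    package main

    import (
        "errors"
        "log"
    )

    // message is a simplified stand-in for whatever the queue library delivers.
    type message struct {
        ID       string
        Body     []byte
        Attempts int // how many times delivery has been tried so far
    }

    const maxAttempts = 5

    // consume handles one delivery: success acks, failure retries, and a
    // message that keeps failing is parked on a dead-letter queue for later
    // debugging instead of poisoning the queue forever.
    func consume(m message, handle func([]byte) error, deadLetter func(message)) {
        err := handle(m.Body)
        switch {
        case err == nil:
            // ack here
        case m.Attempts >= maxAttempts:
            log.Printf("message %s failed %d times, moving to DLQ: %v", m.ID, m.Attempts, err)
            deadLetter(m)
        default:
            log.Printf("message %s failed (attempt %d), leaving it for retry: %v", m.ID, m.Attempts, err)
            // nack here so the broker redelivers, ideally with backoff
        }
    }

    func main() {
        poison := message{ID: "42", Body: nil, Attempts: 5}
        consume(poison,
            func(b []byte) error {
                if len(b) == 0 {
                    return errors.New("malformed payload")
                }
                return nil
            },
            func(m message) { log.Printf("dead-lettered %s", m.ID) },
        )
    }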


I suppose PubSub is at least slightly better than this because it's not a perfect queue. I.E. messages that are not acknowledged will be retried with exponential backoff.

That at least gives some time to process some stuff before you encounter the poison message again.

Whereas with a FIFO queue you're completely screwed.


> No complex dev setup required, or RPC framework needed, it’s the same old monolith, just operated differently.

It seems to me like "operated differently" is doing a lot of heavy lifting that often involves those same frameworks or dev/testing environments. If a monolith used to communicate between workloads in-process, now there needs to be some way for those workloads to communicate between processes, and it has to continue to work in dev. The example in the article mentions roundtripping through something like Redis or Postgres, but now doesn't your dev environment need to spin up a database? What if the communication pattern isn't a shared cache, but instead a direct function call that, say, spins up a new goroutine to process some work asynchronously? Now you need to operate and maintain either an RPC protocol or an event queue, and make sure those things can be started up and discoverable during testing and local development.


It's pretty normal to have a local dev database on your machine. It's how we did it for decades. Are there really developers in the wild now that have never been exposed to working with a locally installed SQL instance?

These days you can even set it up in docker, automatically migrate the database to the latest version and install a bunch of test data. So you can wipe the whole thing and go back to a good state if you muck it up with a bad or ill-thought-out local migration, etc.

Same with redis, etc.

And it's still much, much simpler than a microservice architecture.


You should already be spinning up caches and databases in dev anyway?

I agree though that the article is missing some explicit insight into how this change is handled on the local dev environment. I'm assuming the local dev environment run commands were also updated to be these three commands, one per workload.

Basically, this distinction should be represented throughout all environments, dev/test/prod


Keep the monolith but break your system up into virtual actors. https://learn.microsoft.com/en-us/dotnet/orleans/overview


I think this is a very pragmatic and common sense approach to getting the most out of a monolith.

Whether there is 1 binary or N binaries I think really depends on context of the code, and is more of an implementation detail than something vital.


I never see a blog post looking at the specific requirements of an application and fitting the architecture, data model, and implementation to those requirements. "Monolith vs Microservices" is like saying "18-wheeler vs 20 Toyota Corollas". There are other forms of transportation, and using just one mode may not serve your business well.

By the way, it's 2023... if you have a greenfield project, you should be using event-driven data processing. We got rid of the horse and buggy, let's please get rid of cron jobs.


>if you have a greenfield project, you should be using event-driven data processing. We got rid of the horse and buggy, let's please get rid of cron jobs.

I don't see why event driven processing and cron-jobs (or any other time-based task scheduling mechanism), are strictly mutually exclusive.


They're not mutually exclusive. But the vast majority of tasks run by cron jobs are not time-sensitive.


Any examples of jobs people run via cron that are not time sensitive? In every shop I've worked in, event-driven stuff happens on a producer/consumer queue mechanism, whereas stuff that needs to happen at a certain time (i.e., time sensitive) uses a scheduler/cron, or an idempotent scheduling system that's consistently polled for things that should be happening “now”.


This is very good advice. I've been writing an article that touches on splitting workloads as part of a strategy for an easily maintainable monolith; it goes a bit further than splitting workloads by type (worker/server/publisher), also splitting them by domain (literally redirecting some endpoints to one server and others to another).

It also details some rules for restricting access between domains via clear boundaries. I should finish it and post it; I think it's a great balance between team separation and maintainability.
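
A minimal sketch of that kind of domain split in Go (the flag and the register functions are made up; the load balancer would then send /billing/* only to instances started with -domains=billing, and so on):

    var domains = flag.String("domains", "all", "comma-separated domains this instance serves")

    func main() {
        flag.Parse()
        mux := http.NewServeMux()
        for _, d := range strings.Split(*domains, ",") {
            switch d {
            case "billing":
                registerBillingRoutes(mux) // hypothetical per-domain route packages
            case "oncall":
                registerOnCallRoutes(mux)
            case "all":
                registerBillingRoutes(mux)
                registerOnCallRoutes(mux)
            }
        }
        log.Fatal(http.ListenAndServe(":8080", mux))
    }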


Monolith is not a best practice.

With most web applications you'll end up with a web part and, at some point, some cron jobs. In this case they also have a service that processes a queue.

They should be in the same repo; that's a monorepo.

A web app with some cronjobs is best deployed as separate executables (ideally docker containers). In this case they had everything in one exe, including threads to run the queue etc.

So it looks like their solution was made more complicated by aiming to be a monolith.


I came to the same approach in a past job.

One minor problem was that some functionality of that monolith was based on keeping big chunks of the DB in memory. So with separated workloads, the instances dedicated to serving other functions were keeping unnecessary data in memory, putting unneeded load on the DB during startup. This could be solved by introducing some startup parameters, but to begin with the problem can be ignored.
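
Something like a per-workload startup flag, roughly (the names and the loadReferenceData helper are made up):

    var warmReferenceCache = flag.Bool("warm-reference-cache", false,
        "load the big reference dataset into memory at startup")

    func main() {
        flag.Parse()
        if *warmReferenceCache {
            // Only the workloads that actually serve this functionality pay
            // the memory cost and the startup load on the DB.
            if err := loadReferenceData(db); err != nil {
                log.Fatalf("warming cache: %v", err)
            }
        }
        // ... start whichever workload this instance is responsible for
    }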


At my current job we have a rule: we write no more than four layers of abstraction deep in our codebase. The thinking is that anything more than that is too much for a single person to keep in their head.

I feel like this slays the monolith and forces developers to really think out stuff before they write it. There are of course exceptions to everything, including this, but it's served us pretty well.


I'm curious as to how this plays out in practice (though I love the theory). Some questions:

1. Is this measured in _depth_, in breadth, or in both directions (and if both is the total calculated with `MAX` or `SUM`)?

2. Do you run multiple `main`s, each with this restriction?

3. Do you count third-party frameworks and libraries as layers of abstraction?


I just don't get the obsession with microservices (by which I mean a complete tech stack for each service). I can only see the relevance in a suboptimal org structure - huge org, merger, etc.

SOA, sure (hello CICS :-)).

DB sharding if needed for load / resources.

Otherwise logical division into sets of related services that can be their own little monolith or go full monorepo.

Perhaps I'm missing something …


The difference with SOA being that all the services use roughly the same stack?


I always think of the difference between SOA and microservices as being that in SOA the services often share access to the same database.


If I have a Python monolith with a specific set of heavy dependencies, like a Chromium browser, I don't want my webserver to be built with them, only the async tasks. Is there a straightforward way to build a separate environment in Python that just has the extra dependencies while still getting the benefits described in the article?


I've found this works well in practice.

A couple of other somewhat common workloads are long polling/WebSockets and web-but-slow. The latter is especially relevant when handlers make further requests to slow external services, since splitting it out helps deal with load balancing problems.


This post suggests two main rules:

Rule 1: Never mix workloads

Rule 2: Apply guardrails

Applying those two rules helps decompose a monolith into smaller components, keeping the benefits of a monolith while avoiding the problems of microservices.
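
The article's actual guardrails aren't reproduced here, but one generic example is a per-workload concurrency cap, e.g. in Go:

    // Generic guardrail sketch (not taken from the article): a buffered
    // channel used as a semaphore caps how many jobs this workload runs at
    // once, so a backlog here can't starve other workloads of shared
    // resources such as DB connections.
    var workerSlots = make(chan struct{}, 10)

    func runJob(ctx context.Context, job func(context.Context) error) error {
        select {
        case workerSlots <- struct{}{}:
            defer func() { <-workerSlots }()
            return job(ctx)
        case <-ctx.Done():
            return ctx.Err() // shed load instead of queueing forever
        }
    }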



Yep, that article is about very similar concepts but grounded in Spring as the framework.

I like what they do around package imports and it looks a lot like what we do at incident.io, with some rules about which packages can import what.

For people in the Ruby world who want a similar solution, Shopify provide an open-source framework called packwerk that is designed just for this:

https://github.com/Shopify/packwerk


Also affectionately called macroservices.

No idea if I stole it somewhere or came up with it while falling into slumber after losing a few days to a P0 incident, but I think it describes the N-tier architecture, and what I adopt at work, quite well.


Yes! Having worked on both monoliths and microservice spaghetti this is how I'd do it.

I wonder if at scale you could see additional benefits by further splitting the web tier by route?


I often feel the word monolith is as poorly defined as the term microservices.

What’s in between a monolith and microservices? Is it SOA?


> Writing code is hard enough without each function call requiring a network request

And eating food is difficult enough without all that time between plate and mouth, which is why only nutrient slurry forced through a tube makes sense! Why, to scale from 20 to 200 engineers, we only need to add more tubes and none of that pesky dinnerware!

This is just silly.

It's ok to like monoliths (or tubed slurry), but there's no reason to make up nonsense to justify it.


We have a similar setup with TS/Node.js app and it is very easy to work with.


Isn’t a monolith with split workloads just a monorepo?


Rule 3: Consider using Elixir


Best of luck to the author on their tiered monolith journey! Some things to think about:

(1) How many distinct total binary versions of your monolith will you permit to run in production? Some options might include

At most two globally ("current" and "new" canary/blue-green deployment, no special code on particular tiers)

At most two globally, but sometimes you're willing to deploy a special build to a single tier to mitigate an emergency, with eventual convergence

At most two per tier, but with no attempt to keep each tier running the same binary code (maybe you don't want to redeploy your async consumers as frequently as you redeploy your http handlers)

An unlimited number (maybe you deploy customer-specific binary code to specific instances within a tier)

Would you like an alert when there are too many distinct versions running in production? Who should get that alert and what should they do when it fires?

(2) Does this thing deploy simultaneously everywhere?

If so, is any specific person or team responsible for making sure the deployment worked ok on every tier, declaring an incident if not, and rolling back and finding an owner to resolve the issue? Will every team who owns a part of the monolith contribute someone to a shared rotation for release monitoring?

(3) Suppose there is a blocking problem in one part of the monolith, for example async message processing stops working reliably. Should this block deployment or development for other teams whose changes are outside this blast radius?

(4) Suppose some low-level intermittent compilation error prevents the binary from starting up 10% of the time after a certain build revision for every tier. What team will work to resolve this kind of problem? Is there a team writing telemetry and common logging for your monolith everywhere? Is there a team who will implement common operational concerns like feature flags to gate binary changes?

(5) Does your monolith run in any non-production environments? Is every tier running in each environment? Does somebody publish an SLO for those environments? Is one team allowed to break everyone else in pre-prod by deploying experimental code to the monolith in some environment? Who deploys to the pre-production environment and how?

(6) Suppose you discover you need to split up your workload (one kind of http request is much slower than all others and you want to separate failure domains). How much work does it take to create an additional tier -- updating deployment jobs, quality gates, and CI/CD pipelines throughout various environments, provisioning resources, setting up graphs and alerts, creating new tests? Who does this work?

(7) How will you manage configuration for your monolith? Will configuration directives be delivered to every tier simultaneously? Can someone accidentally break the behavior of another team's tier with a typo or logic error in a configuration change?

(8) When it comes time to split this thing into microservices or macroservices for a few years before a successor team looks at the mess and decides to reimplement a monolith, how do you set up your architecture to successfully allow a split?

(9) Are you absolutely sure your tiers do what you think they do? Can API customers bypass rate limiting by pointing to the hostname of your async-worker tier? If a security vulnerability in a particular http route affects your monolith, will you remember to block the route on every tier (even the ones you think don't normally serve web traffic)?



