We need to support an order of magnitude more daily active users than we have today. I'm not sure exactly how close the current system would get to that, but my gut feeling is it wouldn't hold up. It does OK as is, but only OK.
It's a combination of three different problems working against us in concert.
1) The compute layer is multitenant but the databases are single-tenant (so one physical DB server can hold several hundred tenant databases, with each customer having their own).
2) We're locked into some very old dependencies we can't upgrade, because upgrading one thing cascades into needing to upgrade everything. This holds us back from leveraging the benefits of more modern tech.
3) Certain entities in the system have known limits: when a customer exceeds a certain threshold, loading certain screens or reports becomes unacceptably slow. Most customers don't come near those limits, but a few do, and those few wind up blowing up a database server from time to time, affecting other clients.
For most of the domain stuff, to be honest I'd like to fix the performance problems and deadlocks by just making data access as efficient as possible in those spots. I think that could get us quite a bit more mileage if we took it seriously and pushed it.
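To make that concrete, here's the flavor of targeted fix I mean for the threshold-bound screens. This is only a rough sketch with made-up table and column names, but swapping OFFSET paging for keyset pagination keeps page-load cost flat no matter how many rows a big customer accumulates, and shorter queries also hold locks for less time, which helps with the deadlocks:

```python
# Rough sketch, not our actual schema: keyset (a.k.a. seek) pagination for a
# heavy list screen. Cost stays proportional to the page size instead of
# growing with how deep the user pages or how big the customer's data is.
import psycopg2

PAGE_SIZE = 50

def fetch_orders_page(conn, after_id=0):
    """Fetch the next page of orders, starting after the given order id."""
    with conn.cursor() as cur:
        cur.execute(
            """
            SELECT id, created_at, status
            FROM orders            -- hypothetical entity near its limit
            WHERE id > %s          -- seek past the end of the previous page
            ORDER BY id
            LIMIT %s
            """,
            (after_id, PAGE_SIZE),
        )
        return cur.fetchall()

# Usage: each customer has their own database, so connect to that tenant's
# DB and pass the last id of the previous page as the cursor.
# conn = psycopg2.connect("host=db-shared-01 dbname=tenant_42")
# page = fetch_orders_page(conn, after_id=0)
```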
For the single-tenant database situation, I don't really know how to approach fixing it. I don't see us ever having enough resources to re-engineer it as it stands. Maybe it's possible for us as a team, maybe it's not. The thinking is that, for the parts of the domain we're able to split out, we could make those datastores multitenant.
There's also a bunch of integration code stuck in the monolith that causes various noisy neighbor problems that we are trying to carve out. I think that's a legitimate thing to do and will be quite beneficial.
But yeah... it's a path we're dipping our toes into this year, in an effort to address all of these problems, which are too big for us to tackle one by one.
Sounds like you would do best solving the problem you have clearly identified first: the need for a multitenant database solution with decent performance. To limit the re-engineering work, look for a coherent functional area where query performance analysis shows the heaviest contention, carve that out as its own service, and follow the usual multitenant best practices: add a tenant_id key to every table and use a system that shards intelligently on it, e.g. Citus.
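As a rough sketch of what the carved-out service's store could look like (table names are hypothetical, assuming Postgres with the Citus extension): every table carries tenant_id, every key leads with it, and Citus shards on it so each tenant's rows co-locate on one node:

```python
# Rough sketch (hypothetical table names): a Citus-sharded, multitenant
# store for one carved-out functional area.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS reports (
    tenant_id  bigint NOT NULL,
    report_id  bigint NOT NULL,
    payload    jsonb,
    PRIMARY KEY (tenant_id, report_id)  -- tenant_id leads every key/index
);
-- Shard by tenant_id: all of a tenant's rows land on one node, so
-- per-tenant queries stay single-shard and fast.
SELECT create_distributed_table('reports', 'tenant_id');
"""

def setup(conn_str):
    with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
        cur.execute(DDL)

# Every query then filters on tenant_id so the router prunes to one shard:
#   SELECT ... FROM reports WHERE tenant_id = %s AND ...
```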
That's reasonable. Also consider going the other way: keep per-tenant logical databases, and split some or all of the compute layer into single tenancy or bounded tenancy. For example, if your compute layer is web servers, you can stand up multiple sets of them with something in front routing requests to a given set based on a tenant identifier; that chunks your many-noisy-neighbors problem into at least multiple noisy "neighborhoods", with the (expensive) extreme of a server per tenant. If your compute layer is e.g. a service bus/queue worker, the same principle applies: multiple sets of workers deciding what to work on based on a tenant ID, or per-tenant/per-group topics or queues. You can put the cross-cutting/weird workloads onto their own hardware as well.
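A minimal sketch of the routing piece (pool names and the pinned overrides are invented for illustration): deterministically hash each tenant into one of a few "neighborhoods", with an override map for tenants that have earned dedicated hardware:

```python
# Minimal sketch: deterministic tenant -> compute pool routing.
import hashlib

POOLS = ["pool-a", "pool-b", "pool-c"]          # shared "neighborhoods"
PINNED = {"tenant-9001": "pool-dedicated-1"}    # noisy tenant, own hardware

def pool_for_tenant(tenant_id: str) -> str:
    """Pinned override if present, otherwise a stable hash into a pool."""
    if tenant_id in PINNED:
        return PINNED[tenant_id]
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return POOLS[digest[0] % len(POOLS)]

# The same function can sit in a routing proxy in front of the web servers,
# or in the publisher deciding which per-pool queue/topic a job goes to.
print(pool_for_tenant("tenant-42"))    # always the same pool for a tenant
print(pool_for_tenant("tenant-9001"))  # -> pool-dedicated-1
```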
I propose this because I think having database instances split up by tenant (even if multiple DBs share the same physical server) is actually a pretty good place to be, especially if you can shuffle per-tenant databases around onto new hardware and play "tetris" with the noisiest tenants' DBs. Moving back to multitenant-everything seems like a regression, and using (message|web|request) routing to break the compute layer up into per-tenant or per-domain clusters of hardware can often unlock some of the main benefits of microservices without a massive engineering effort.
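The "tetris" gets much easier if there's a small catalog mapping each tenant's database to its current server, which the connection layer consults. A toy sketch, with made-up names:

```python
# Toy sketch (made-up names): a tenant -> database-server catalog, so a
# tenant's database can move to new hardware with one catalog update.
CATALOG = {
    "tenant-42":   ("db-shared-01.internal", "tenant_42"),
    "tenant-9001": ("db-big-01.internal",    "tenant_9001"),  # isolated
}

def dsn_for_tenant(tenant_id: str) -> str:
    """Build a connection string from the catalog entry for this tenant."""
    host, dbname = CATALOG[tenant_id]
    return f"host={host} dbname={dbname}"

# Restore the noisy tenant's DB onto a new server, flip its catalog row,
# and the stateless compute layer follows it on the next connection.
print(dsn_for_tenant("tenant-9001"))
```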
>That's reasonable. Also consider going the other way: keep per-tenant logical databases, and split some or all of the compute layer into single tenancy or bounded tenancy. For example, if your compute layer is web servers, you can stand up multiple sets of them with something in front routing requests to a given set based on a tenant identifier; that chunks your many-noisy-neighbors problem into at least multiple noisy "neighborhoods", with the (expensive) extreme of a server per tenant. If your compute layer is e.g. a service bus/queue worker, the same principle applies: multiple sets of workers deciding what to work on based on a tenant ID, or per-tenant/per-group topics or queues. You can put the cross-cutting/weird workloads onto their own hardware as well.
This pretty much describes exactly where we are right now. We've been able to migrate the big customers to a new, less overloaded database server, and we could keep doing that. I believe it's what you'd call a "bridge" architecture: the compute layer is stateless and can serve any tenant, and there's a queue/service bus to offload a lot of work the web servers shouldn't be doing. All of that autoscales, but even autoscaling isn't a panacea.
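If we push further down that road, I imagine the worker side looking roughly like this sketch (the grouping and queue names are invented, and in-process queues stand in for our real service bus): each group of tenants gets its own queue and worker set, so one runaway tenant can only back up its own neighborhood:

```python
# Toy sketch: per-group queues and workers as bulkheads, so one tenant's
# flood of jobs only backs up its own group.
import hashlib
import queue
import threading
import time

GROUPS = ("group-a", "group-b", "big-tenants")
GROUP_QUEUES = {g: queue.Queue() for g in GROUPS}
BIG_TENANTS = {"tenant-9001"}  # invented: tenants known to blow past limits

def group_for(tenant_id: str) -> str:
    """Stable tenant -> group assignment, with big tenants isolated."""
    if tenant_id in BIG_TENANTS:
        return "big-tenants"
    digest = hashlib.sha256(tenant_id.encode()).digest()
    return GROUPS[digest[0] % 2]  # group-a or group-b

def publish(tenant_id: str, job: dict) -> None:
    GROUP_QUEUES[group_for(tenant_id)].put(job)

def worker(group: str) -> None:
    q = GROUP_QUEUES[group]
    while True:
        job = q.get()
        print(group, "processing", job)  # real work would happen here
        q.task_done()

for g in GROUPS:
    threading.Thread(target=worker, args=(g,), daemon=True).start()

publish("tenant-42", {"task": "rebuild-report"})
publish("tenant-9001", {"task": "rebuild-report"})  # lands in big-tenants
time.sleep(0.2)  # let the daemon workers drain the queues in this demo
```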