It is amazing how many large-scale applications run on a single or a few large RDBMS. It seems like a bad idea at first: surely a single point of failure must be bad for availability and scalability? But it turns out you can achieve excellent availability using simple replication and failover, and you can get huge database instances from the cloud providers. You can basically serve the entire world with a single supercomputer running Postgres and a small army of stateless app servers talking to it.
The big names started using NoSQL-type stuff because their instances got 2-3 orders of magnitude larger than that, and a single RDBMS didn't work anymore. All the denormalization adds a lot of overhead and problems of its own, but if you literally have multi-PB metadata stores, it's not like you have a choice.
Then everyone started copying them without knowing why.... and then everyone forgot how much you can actually do with a normal database.
And hardware has been getting better and cheaper, which only makes this more true.
Still not a good idea to store multi-PB metadata stores in a single DB though.
> Then everyone started copying them without knowing why
People tend to have a very bad sense of what constitutes large scale. It usually maps to "larger than the largest thing I've personally seen". So they hear "Use X instead of Y when operating at scale", and all of a sudden we have people implementing a distributed datastore for a few MB of data.
Having gone downward in scale over the last few years of my career, it has been eye-opening how many people tell me X won't work due to "our scale", when I can point out that I have already used X in prior jobs at a scale much larger than what we have.
100% agree. I've also run across many cases where no one bothered to even attempt benchmarks or load tests on anything (either the old or the new solution), compare latencies, optimize anything, etc.
Sometimes they're making $10M+ decisions off that gut feel, with literally zero data on what is actually going on.
It rarely works out well, but hey, have to leave that opening for competition somehow I guess?
And I'm not talking about "why didn't they spend 6 months optimizing that one call to save them $50" type stuff. I mean literally zero idea of what is going on, what the actual performance issues are, etc.
Yep. I've personally been in the situation where I had to show someone that I could do their analysis in a few seconds using the proverbial awk-on-a-laptop when they were planning on building a hadoop cluster in the cloud because "BIG DATA". (Their Big Data was 50 gigabytes.)
I remember going to a PyData conference in... 2011 (maybe off by a year or two)... and one of the presenters making the point that if your data was less than about 10-100TB range, you were almost certainly better off running your code in a tight loop on one beefy server than trying to use Hadoop or a similar MapReduce cluster approach. He said that when he got a job, he'd often start by writing up the generic MapReduce code (one of the advantages is that it tends to be very simple to implement), starting the job running, and then writing a dedicated tight loop version while it ran. He almost always finished implementing the optimized version, got it loaded onto a server, and completed the analysis long before the MapReduce job had finished. The MapReduce implementation was just there as "insurance" if, eg, he hit 5pm on Friday without his optimized version quite done, he could go home and the MR job might just finish over the weekend.
It's self reinforcing too. All the "system design" questions I've seen have started from the perspective of "we're going to run this at scale". Really? You're going to build for 50 million users -from the beginning-? Without first learning some lessons from -actual- use? That...seems non-ideal.
A place I recently left had a 10M-record MongoDB collection without indexes, which would take tens of seconds to query. Celery was running in cron mode every 2 seconds or so, meaning jobs would just pile up and Redis eventually ran out of memory. No one understood why this was happening, so the fix was to restart everything after the PagerDuty alert…
Yikes. Don't get me wrong, it's always been this way to some extent: not enough people who can look into a problem and understand what is happening to make things actually work correctly, so people just iterate with the new shiny thing.
It seems like the last 4-5 years though have really made it super common again. Bubble maybe?
Huge horde of newbs?
Maybe I’m getting crustier.
I remember it was SUPER bad before the dot-com crash, all the fake-it-'til-you-make-it too. I even had someone claim 10 years of Java experience who couldn't write out a basic class on a whiteboard at all, and tons of folks starting out who literally couldn't write a hello world in the language they claimed experience in, and this was before decent GUI IDEs.
> It seems like the last 4-5 years though have really made it super common again. Bubble maybe?
Cloud providers have successfully redefined the baseline performance of a server in the minds of a lot of developers. Many people don't understand just how powerful (and at the same time cheap) a single physical machine can be when all they've used is shitty overpriced AWS instances, so no wonder they have no confidence in putting a standard RDBMS in there when anything above 4GB of RAM will cost you an arm and a leg, therefore they're looking for "magic" workarounds, which the business often accepts - it's easier to get them to pay lots of $$$$ for running a "web-scale" DB than paying the same amount for a Postgres instance, or God forbid, actually opting for a bare-metal server outside of the cloud.
In my career I've seen a significant amount of time and effort wasted on workarounds such as deferring very trivial tasks onto queues or building an insanely distributed system, where the proper solution would've been to throw more hardware at it (even expensive AWS instances would've been cost-effective if you count the developer time spent working around the problem).
Just to give a reference for those who don't know: I rent a dedicated server with 128 GB of RAM, a 16-core processor (32 threads), 2 TB of local SSD storage, and virtually unlimited traffic for $265 USD a month. A comparable VM on AWS would be around $750 a month (if you reserve it long term), and then of course you will pay out the nose for traffic.
The one of those most likely to be humming along fine is Redis, in my experience. I once SSH'd to the Redis box (EC2), which was hugely critical to the business: 1 core, instance up for 853 days, just chilling and handling things like a boss.
This is funny, because I suffer from the opposite issue... every time I try to bring up scaling issues on forums like HN, everyone says I don't actually need to worry because it can scale up to size X... but my current work is with systems at 100X size.
I feel like sometimes the pendulum has swung too far the other way, where people deny that there ARE people dealing with actual scale problems.
In this case it might be helpful to mention the solutions you’ve already tried/evaluated and the reasons why they’re not suitable. Without those details you’re no different from the dreamers who think their 10GB database is FAANG-scale so it’s normal that you get the usual responses.
I mean what percentage of companies are web scale or at your scale? I would guess around 1% being really generous. So it makes sense that the starting advice would be to not worry about scaling.
I get it, and I can't even say I blame the people for responding like that.
I think it is the same frustration I get when I call my ISP for tech support and they tell me to reboot my computer. I realize that they are giving advice for the average person, but it sucks having to sit through it.
Nothing is quite as anger-inducing as knowing WHY it is that way, but also knowing you are stuck, that it makes no sense for you, and that it sucks ass.
My new fav rant is the voice phone system for Kaiser, which makes me say "Yes" or "No" constantly, but somehow can only hear me if I'm yelling. And they don't tell you that you can press a number for yes or no until after you've failed several times with the voice system.
All human conversations have zero issues, not even a little faintness.
Probably true - hopefully you can prefix your question with 'Yes, this is 10 Exabytes - no, I'm not typo'ng it' to save some of us from foot-in-mouth syndrome?
That is probably a good idea, get that out of the way up front.
I feel similar frustrations with commenters saying I am doing it wrong by not moving everything to the cloud… I work for a CDN, we would go out of business pretty quickly if we moved everything to the cloud. Oh well.
Yes, exactly. When people cite scaling concerns and/or big data, I start by asking them what they mean by scale and/or big. It's a great way to get down to brass tacks quickly.
Now when dealing with someone convinced that their single TB of data is google scale, the harder issue is changing that belief. But at least you know where they stand.
That sounds like you're not giving enough detail. If you don't mention the approximate scale that you have right now, you can't expect people to glark it from context.
Same. I think there's this idea that 5 companies have more than 1PB of data and everyone else is just faking it. My field operates on many petabytes of data per customer.
Yes, the set of people truly operating "at scale" is more than FAANG and far, far less than the set of people believing they operate "at scale". This means there are still people in that middle ground.
One gotcha here is not all PBs are equal. My field also is a case where multi-PB datastores are common. However for the most part those data sit at rest in S3 or similar. They'll occasionally be pulled in chunks of a couple TB at most. But when you talk to people they'll flash their "do you know how big our storage budget is?" badge at the drop of a hat. It gets used to explain all sorts of compute patterns. Meanwhile, all they need is a large S3 footprint and a machine with a reasonable amount of RAM.
With Postgres you'll want to tune those cost parameters, however. E.g. lowering the random page cost will change how the planner handles some queries. But don't just blindly change it: modify the value and run the benchmark again. The point is that a random read on an SSD is nowhere near as expensive, relative to a sequential one, as the defaults assume. In our case a few queries had the planner switch to what I always assumed was the slower sequential scan, but whether that's actually slower depends on your table size (how tall and how wide). PG works great without tweaking this stuff, but if you've got a few days to play with these values it's quite educational.
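Roughly, the experiment looks like this (the table, query, and values here are invented for illustration, not a recommendation):

    -- Postgres ships with random_page_cost = 4.0 vs seq_page_cost = 1.0, which models spinning disks.
    SHOW random_page_cost;

    -- On SSDs a lower value is common, but benchmark rather than setting it blindly.
    ALTER SYSTEM SET random_page_cost = 1.1;
    SELECT pg_reload_conf();

    -- Re-run a representative query and compare the plan and timings before/after.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT * FROM events WHERE customer_id = 42 ORDER BY created_at DESC LIMIT 50;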
Ideally you offer developers both a relational data store and a fast key-value data store. Train your developers to understand the pros and cons and then step back.
There’s nothing inherently wrong with a big db instance. The cloud providers have fantastic automation around multi-az masters, read replicas and failover. They even do cross region or cross account replication.
Even when it’s not a cloud provider, in fact, especially when it’s not a cloud provider: you can achieve insane scale from single instances.
Of course these systems have warm standbys, dedicated backup infrastructure and so on, so it's not really a "single machine"; but I've seen 80TiB Postgres instances back in 2011.
We are currently pushing close to 80 TB MSSQL on-prem instances.
The biggest issue we have with these giant DBs is that they require pretty massive amounts of RAM. That's currently our main bottleneck.
But I agree. While our design is pretty bad in a few ways, the amount of data we are able to serve from these big DBs is impressive. We have something like 6 dedicated servers for a company with something like 300 apps. A handful of them hit dedicated DBs.
Were I to redesign the system, I'd have more tiny dedicated DBs per app to avoid a lot of the noisy-neighbor/scaling problems we've had. But at the same time, it's impressive how far this design has gotten us, and it appears to have a lot more legs in it.
Can I ask you how large tables can generally get before querying becomes slower? I just can't intuitively wrap my head around how tables can grow from 10 GB to 100 GB and why this wouldn't worsen query performance by 10x. Surely you do table partitions or cycle data out into archive tables to keep up the query performance on the more recent table data, correct?
> I just can't intuitively wrap my head around how tables can grow from 10 GB to 100 GB and why this wouldn't worsen query performance by 10x
SQL Server data is stored as a B-tree structure, so a 10 GB -> 100 GB growth costs you at most roughly a halving of query performance rather than a 10x slowdown (lookup cost grows with log n), assuming good indexes are in place.
Filtered indexes can work pretty well for improving query performance. But ultimately we do have some tables which are either archived when we can, or partitioned when we can't. SQL Server native partitioning is rough if the query patterns are all over the board.
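For reference, a filtered index in SQL Server looks something like this (table and predicate made up):

    -- A filtered index covers only the rows most queries touch, so it stays small
    -- and cheap to maintain even as the underlying table keeps growing.
    CREATE NONCLUSTERED INDEX ix_orders_open
        ON dbo.Orders (CustomerId, CreatedAt)
        WHERE [Status] = 'open';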
The other thing that has helped is we've done a bit of application data shuffling. Moving heavy hitters onto new database servers that aren't as highly utilized.
We are currently in the process of getting read-only replicas (Always On) set up and configured in our applications. That will allow for a lot more load distribution.
The issue with b-tree scaling isn't really the lookup performance issues, it is the index update time issues, which is why log structured merge trees were created.
Eventually, yes, even read query performance would degrade, but typically the insert/update load on a typical index is the first limiter.
If there is a natural key and updates are infrequent then table partitioning can help extend the capacity of a table almost indefinitely. There are limitations of course but even for non-insane time series workloads, Postgres with partitioned tables will work just fine.
A lot depends on the type of queries. You could have tables the size of the entire internet and every disk drive ever made, and they'd still be reasonably fast for queries that just look up a single value by an indexed key.
The trick is to have the right indexes (which includes the interior structures of the storage data structure) so that queries jump quickly to the relevant data and ignore the rest. Like opening a book at the right page because the page number is known. Sometimes a close guess is good enough.
In addition, small indexes and tree interior nodes should stay hot in RAM between queries.
When the indexes are too large to fit in RAM, those get queried from storage as well, and at a low level it's analogous to the system finding the right page in an "index book", using an "index index" to get that page number. As many levels deep as you need. The number of levels is generally small.
For example, the following is something I worked on recently. It's a custom database (written by me) not Postgres, so the performance is higher but the table scaling principles are similar. The thing has 200GB tables at the moment, and when it's warmed up, querying a single value takes just one 4k read from disk, a single large sector, because the tree index fits comfortably in RAM.
It runs at approximately 1.1 million random-access queries/second from a single SSD on my machine, which is just a $110/month x86 server. The CPU has to work quite hard to keep up because the data is compressed, albeit with special query-friendly compression.
If there was very little RAM so nothing could be kept in it, the speed would drop by a factor of about 5, to 0.2 million queries/second. That shows you don't need a lot of RAM, it just helps.
Keeping the RAM and increasing table size to roughly 10TB the speed would drop by half to 0.5 million queries/second. In principle, with the same storage algorithms a table size of roughly 1000TB (1PB) would drop it to 0.3 million queries/second, and roughly 50,000TB (50PB) would drop it to 0.2 million. (But of course those sizes won't fit on a single SSD. A real system of that size would have more parallel components, and could have higher query performance.) You can grow to very large tables without much slowdown.
The current application is Ethereum L1 state and state history, but it has useful properties for other applications. It's particularly good at being small and fast, and compressing time-varying blockchain-like or graph data.
As it's a prototype I'm not committing to final figures, but measurement, theory and prototype tests project the method to be significantly smaller and faster than other implementations, or at least competitive with the state of the art being researched by other groups.
> Why did you reinvent the wheel?
Different kind of wheel. No storage engine that I'm aware of has the desired combination of properties to get the size (small) and speed (IOPS, lower read & write amplification) in each of the types of operations required. Size and I/O are major bottlenecks for this type of application; in a way it's one of the worst cases for any kind of database or schema.
It's neither a B-tree nor an LSM-tree, (not a fractal tree either), because all of those are algorithmically poor for some of the operations required. I found another structure after being willing to "go there" relating the application to low-level storage behaviour, and reading older academic papers.
These data structures are not hard to understand or implement, once you get used to them. As I've been working on and off for many years on storage structures as a hobby (yeah, it's fun!), it's only natural to consider it an option when faced with an unusual performance challenge.
It also allowed me to leverage separate work I've done on raw Linux I/O performance (for filesystems, VMs etc), which is how random-access reads are able to reach millions/s on a single NVMe SSD.
> Is it as durable as Postgres?
Yes.
Modulo implementation bugs (because it won't have the scrutiny and many eyes/years of testing that Postgres does).
The important point is that many (though not all) queries are executed by looking things up in indexes, as opposed to searching through all of the data in the table. The internal pages of a B-Tree index are typically a fraction of 1% of the total size of the index. And so you really only need to store a tiny fraction of all of the data in memory to be able to do no more than 1 I/O per point lookup, no matter what. Your table may grow, but the amount of pages that you need to go through to do a point lookup is essentially fixed.
This is a bit of a simplification, but probably less than you'd think. It's definitely true in spirit - the assumptions that I'm making are pretty reasonable. Lots of people don't quite get their head around all this at first, but it's easier to understand with experience. It doesn't help that most pictures of B-Tree indexes are very misleading. It's closer to a bush than to a tree, really.
At my old workplace we had a few multi-TB tables with several billion rows in a vanilla RDS MySQL 5.7 instance (although it was obviously a sizable instance type). Simple single-row SELECT queries on an indexed column (i.e. SELECT * FROM table WHERE external_id = 123;) would be low single-digit milliseconds.
Proper indexing is key of course, and metrics to find bottlenecks.
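For illustration, something along these lines (table name invented); the point is that the lookup touches only a few B-tree pages regardless of table size:

    -- Assuming the lookup column isn't indexed yet:
    CREATE INDEX idx_orders_external_id ON orders (external_id);

    -- The point lookup then reads only a handful of B-tree pages, even with billions of rows.
    EXPLAIN SELECT * FROM orders WHERE external_id = 123;
    -- Expect the plan to show the idx_orders_external_id key with type "ref" and ~1 row examined.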
Well, any hot table should be indexed (with regards to your access patterns) and, thankfully, the data structures used to implement tables and indexes don't behave linearly :)
Of course, if your application rarely makes use of older rows, it could still make sense to offload them to some kind of colder, cheaper storage.
Think of finding a record amongst many as, e.g., a binary search. It doesn't take 10 times as many tries to find a thing (row/record) amongst 1000 as it does amongst 100.
React makes us believe everything must have 1-2s response to clicks and the maximum table size is 10 rows.
When I come back to web 1.0 apps, I'm often surprised that they do a round-trip to the server in less than 200ms and reload the page seamlessly, including a full 5ms SQL query returning 5k rows right in the page (a full 1MB of data, with basically no JS).
There’s shit tons of money to be made for both startups and developers if they convince us that problems solved decades ago aren’t actually solved so they can sell you their solution instead (which in most cases will have recurring costs and/or further maintenance).
Scaling databases vertically, like Oracle DB, in the past was the norm. It is possible to serve a large number of users, and data, from a single instance. There are some things worth considering though. First of all, no matter how reliable your database is, you will have to take it down eventually to do things like upgrades.
The other consideration that isn't initially obvious, is how you may hit an upper bound for resources in most modern environments. If your database is sitting on top of a virtual or containerized environment, your single instance database will be limited in resources (CPU/memory/network) to a single node of the cluster. You could also eventually hit the same problem on bare metal.
That said, there are some very high-density systems available. You may also not need the ability to scale as large as I am talking about, or you may choose to shard and scale your database horizontally at a later time.
If your project gets big enough you might also start wanting to replicate your data to localize it closer to the user. Another strategy might be to cache the data locally to the user.
There are positives and negatives with a single node versus a cluster. If Retool's database had been clustered, they would have been able to do a rolling upgrade, though.
You can scale quite far vertically and avoid all the clustering headaches for a long time these days. With EPYCs you can get 128C/256T, 128PCIe lanes (= 32 4x NVMes = ~half a petabyte of low-latency storage, minus whatever you need for your network cards), 4TB of RAM in a single machine. Of course that'll cost you an arm and a leg and maybe a kidney too, but so would renting the equivalent in the cloud.
It's all fun and games with the giant boxen until a faulty PSU blows up a backplane, you have to patch it, the DC catches on fire, support runs out of parts for it, network dies, someone misconfigures something etc etc.
Not saying a single giant server won't work, but it does come with its own set of very difficult-to-solve-once-you-build-it problems.
I agree in principle. But one major headache for us has been upgrading the database software without downtime. Is there any solution that does this without major headaches? I would love some out-of-the-box solution.
The best trick I know of for zero-downtime upgrades is to have a read-only mode.
Sure, that's not the same thing as pure zero-downtime but for many applications it's OK to put the entire thing into read-only mode for a few minutes at a well selected time of day.
While it's in read-only mode (so no writes are being accepted) you can spin up a brand new DB server, upgrade it, finish copying data across - do all kinds of big changes. Then you switch read-only mode back off again when you're finished.
I've even worked with a team that used this trick to migrate between two data centers without visible end-user downtime.
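In Postgres, one rough way to approximate that read-only switch looks like this (just a sketch; sessions can override the setting, so a real setup usually needs application-level support too):

    -- Make read-only the default for new transactions cluster-wide.
    ALTER SYSTEM SET default_transaction_read_only = on;
    SELECT pg_reload_conf();

    -- ... upgrade / finish copying data to the new primary ...

    -- Flip writes back on (or simply point the app at the new server instead).
    ALTER SYSTEM SET default_transaction_read_only = off;
    SELECT pg_reload_conf();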
A trick I've always wanted to try for smaller changes is the ability to "pause" traffic at a load balancer - effectively to have a 5 second period where each incoming HTTP request appears to take 5 seconds longer to return, but actually it's being held by the load balancer until some underlying upgrade has completed.
Depends how much you can get done in 5 seconds though!
>The best trick I know of for zero-downtime upgrades is to have a read-only mode.
I've done something similar, although it wasn't about upgrading the database. We needed to not only migrate data between different DB instances, but also between completely different data models (as part of refactoring). We had several options, such as proper replication + schema migration in the target DB, or by making the app itself write to two models at the same time (which would require a multi-stage release). It all sounded overly complex to me and prone to error, due to a lot of asynchronous code/queues running in parallel. I should also mention that our DB is sharded per tenant (i.e. per an organization). What I came up with was much simpler: I wrote a simple script which simply marked a shard read-only (for this feature), transformed and copied data via a simple HTTP interface, then marked it read-write again, and proceeded to the next shard. All other shards were read-write at a given moment. Since the migration window only affected a single shard at any given moment, no one noticed anything: for a tenant, it translated to 1-2 seconds of not being able to save. In case of problems it would also be easier to revert a few shards than the entire database.
Yes, it simply looped over shards, we already had a tool to do that.
The app handled it by proxying calls to the new implementation if the shard was marked as "post-migration", the API stayed the same. If it was "in migration", all write operations returned an error. If the state was "pre-migration", it worked as before.
I don't remember the details anymore, but it was something about the event queue or the notification queue which made me prefer this approach over the others. When a shard was in migration, queue processing was also temporarily halted.
Knowing that a shard is completely "frozen" during migration made it much easier to reason about the whole process.
Depends on the database - I know that CockroachDB supports rolling upgrades with zero downtime, as it is built with a multi-primary architecture.
For PostgreSQL or MySQL/MariaDB, your options are more limited. Here are two that come to mind; there may be more:
# The "Dual Writer" approach
1. Spin up a new database cluster on the new version.
2. Get all your data into it (including dual writes to both the old and new version).
3. Once you're confident that the new version is 100% up to date, switch to using it as your primary database.
4. Shut down the old cluster.
# The eventually consistent approach
1. Put a queue in front of each service for writes, where each service of your system has its own database.
2. When you need to upgrade the database, stop consuming from the queue, upgrade in place (bringing the DB down temporarily) and resume consumption once things are back online.
3. No service can directly read from another service's database. Eventually consistent caches/projections serve the reads, both during normal operation and during the upgrade.
A system like this is more flexible, but suffers from stale reads or temporary service degradation.
Dual writing has huge downsides: namely you're now moving consistency into the application, and it's almost guaranteed that the databases won't match in any interesting application.
I'd think using built-in replication (e.g. PostgreSQL 'logical replication') for 'dual writing' should mostly avoid inconsistencies between the two versions of the DB, no?
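For concreteness, the built-in mechanism between an old and a new cluster looks roughly like this (names and connection string invented):

    -- On the old (source) cluster:
    CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

    -- On the new (target) cluster, with the same schema already created:
    CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-db dbname=app user=replicator password=secret'
        PUBLICATION upgrade_pub;

    -- Once it has caught up, stop writes briefly, let replication drain,
    -- then point the application at the new cluster and drop the subscription.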
The way I've done it with MySQL since 5.7 is to use multiple writers of which only one is actively used by clients. Take one out, upgrade it, put it back into replication but not serving requests until caught up. Switch the clients to writing to the upgraded one then upgrade the others.
This is such a huge problem. It's even worse than it looks: because users are slow to upgrade, changes to the database system take years to percolate down to the 99th-percentile user. This decreases the incentive to do certain kinds of innovation. My opinion is that we need to fundamentally change how DBMSs are engineered and deployed to support silent in-the-background minor version upgrades, and probably stop doing major version bumps that incorporate breaking changes.
The system needs to be architected in a certain way to allow upgrades without downtime. Something like Command and Query Responsibility Segregation (CQRS) would work. An update queue serves as the explicit transaction log keeping track of the updates from the frontend applications, while the database at the end of the queue applies updates and serves queries. Upgrading the live database then means: have a standby database running the new version replay all the changes from the queue, pause the live database from taking new changes from the queue once the new DB has caught up, switch all client connections to the new DB, and shut down the old DB.
Cassandra can do it since it has cell level timestamps, so you can mirror online writes and clone existing data to the new database, and there's no danger of newer mutations being overwritten by the bulk restored data.
Doing an active no-downtime database migration basically involves having a coherent row-level merge policy (assuming you AT LEAST have a per-row last-updated column), or other tricks. Or maybe you temporarily write cell-level timestamps and then drop them later.
Or if you have data that expires on a window, you just do double-writes for that period and then switch over.
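As a sketch of that row-level merge policy in SQL terms (a Postgres-flavored last-writer-wins backfill; assumes every row carries an updated_at column and the old data has been staged into a local table, all names hypothetical):

    INSERT INTO accounts (id, email, updated_at)
    SELECT id, email, updated_at
    FROM accounts_backfill            -- staged copy of the old database's rows
    ON CONFLICT (id) DO UPDATE
    SET email      = EXCLUDED.email,
        updated_at = EXCLUDED.updated_at
    WHERE accounts.updated_at < EXCLUDED.updated_at;  -- never clobber a newer online write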
> It is amazing how many large-scale applications run on a single or a few large RDBMS. It seems like a bad idea at first: surely a single point of failure must be bad for availability and scalability?
I'm pretty sure that was the whole idea of RDBMS, to separate application from data. You lose badly the moment some of your data is in a different place (on transactions, query planning, security, etc.), so Codd thought "what if even different applications could use a single company-wide database?" Hence "have everything in a single database" should be the last compromise you're forced to make, not the first one.
I have been on so many interview loops where interviewers faulted the architecture skill or experience of candidates because they talked about having used relational databases or tried to use them in design questions.
The attitude “our company = scale and scale = nosql” is prevalent enough that even if you know better, it’s probably in your interest to play the game. It’s the one “scalability fact” everyone knows, and a shortcut to sounding smart in front of management when you can’t grasp or haven’t taken the time to dig in on the details.
And a lot of applications can be easily sharded (e.g. between customers). So you can have a read-heavy highly replicated database that says which customer is in which shard, and then most of your writes are easily sharded across RDBMS primaries.
NewSQL technology promises to make this more automated, which is definitely a good thing, but unless you are Google or have a use case that needs it, it probably isn't worth adopting it yet until they are more mature.
I would love to have stats of real world companies on this front.
Stuff like “CRUD enterprise app. 1 large-ish Postgres node. 10k tenants. 100 tables with lots of foreign key lookups, 100gb on disk. Db is… kinda slow, and final web requests take ~1 sec.”
The toughest thing is knowing what is normal for multi-tenant data with lots of relational info (compared to larger, more popular companies that tend to have relatively simple data models).
plus caching, indexes and smart code algorithms go a looooooooong way
a lot of "kids these days" don't seem to learn that
by that I mean young folks born into this new world with endless cloud services and scaling-means-Google propaganda
a single modern server-class machine is essentially a supercomputer by 80s standards and too many folks are confused about just how much it can achieve if the software is written correctly
> To resolve this, we ended up choosing to leave foreign key constraints unenforced on a few large tables.
> We reasoned this was likely safe, as Retool’s product logic performs its own consistency checks, and also doesn’t delete from the referenced tables, meaning it was unlikely we’d be left with a dangling reference.
I was holding my breath here and I'm glad these were eventually turned back on.
Nobody should ever rely on their own product logic to ensure consistency of the database.
The database has features (constraints, transactions, etc.) for this purpose which are guaranteed to work correctly and atomically in all situations, such as database-initiated rollbacks that your application will never have control over.
It's difficult to make a blanket statement like this.
I've built some very high-throughput Postgres-backed systems over the years, and doing application-side foreign key constraints (FKC) does have its benefits. Doing this client-side will result in constraints that are usually, but not always, in sync with the data. However, this kind of almost-consistency lets you do much higher query throughput. An FKC is a read on every write, for example, and does limit write throughput. Of course, this isn't OK for some workloads, and then you do proper FKC in the DB, but if you don't need absolute consistency, you can make writes far cheaper.
The trade-offs between foreign key constraints and none are almost identical to the trade-offs between static typing and dynamic typing. Nowadays people realize that when they turn off these features, they'll eventually have to re-implement them later.
You make this claim as if this happens to every company sooner or later, but if a company the size of GitHub can still do without (https://github.com/github/gh-ost/issues/331#issuecomment-266...) it does become a little bit of a "you do not have google problems" type discussion.
(Perhaps you do have such problems, I don't know where you work! But 99%+ of companies don't have such problems and never will.)
>But 99%+ of companies don't have such problems and never will.
Not sure where you get your metrics, but I would say a more general rule is: the more people work on an evolving product with code and schema changes, the more you need DB constraints to enforce what it means to have correct data.
If only 1 or 2 people are involved in a db that changes pretty infrequently then possibly in the long term you can get away with it.
But if you have a constantly evolving product which must carry new features, logic changes and additions to the schema, then I would say you definitely need db constraints - FK and column. It only takes a few different developers to decide that T,F,Y,N,TRUE,FALSE,YES,NO,Null,NULL,None all mean the same thing, and you've got a slowly evolving mess on your hands.
Does that pattern you describe require any considerations when writing code? I’m thinking of applications I’ve worked on where events are triggered by change, and so a database rolling back independent of my application would be a nightmare. I treat the database as a place to store data, not an authority: the application is the authority. Do you approach it differently? Thanks!
The database is the only place that can be the authority because the application can have race conditions. It’s the only way to guarantee data integrity.
There's no way to specify every single application-specific constraint directly in the database. Race conditions are not present when using locking reads (SELECT ... FOR UPDATE, or DB-specific shared-locking selects) or the serializable isolation level, which are the typical ways of enforcing application-level constraints.
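A minimal sketch of that locking-read pattern (hypothetical schema: limit active sessions per user):

    BEGIN;
    -- Serialize concurrent writers for this user by locking the parent row first.
    SELECT id FROM users WHERE id = 42 FOR UPDATE;

    -- The check-then-insert below can no longer race with another transaction for the same user.
    SELECT count(*) FROM sessions WHERE user_id = 42 AND active;
    -- (application verifies the count is below its limit, then:)
    INSERT INTO sessions (user_id, active) VALUES (42, true);
    COMMIT;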
ON DELETE CASCADE can be dangerous when used with applications that expect to be notified of deletions, like in your case.
Ideally, everything that needs to change when a row is deleted would be changed automatically and atomically using database-side constraints and triggers. In practice, applications often need to sync state with external services that the database knows nothing about, so I understand your concerns.
ON DELETE RESTRICT, on the other hand, will result in errors just like any other query error that you can handle in your application. Nothing happened, so there's nothing to be notified of.
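For illustration, with made-up table names:

    ALTER TABLE orders
        ADD CONSTRAINT orders_customer_fk
        FOREIGN KEY (customer_id) REFERENCES customers (id)
        ON DELETE RESTRICT;

    DELETE FROM customers WHERE id = 42;
    -- ERROR:  update or delete on table "customers" violates foreign key constraint
    --         "orders_customer_fk" on table "orders"
    -- Nothing was deleted, so there is no external state to reconcile.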
You'd be surprised. I used to work on a product where the lead developer made sure foreign keys were NOT enabled on production. They were only "allowed" in dev. Some teams have a strict "no foreign keys" rule.
Does this mental giant know that modern databases use foreign keys to optimize queries, sometimes even eliding JOIN operations entirely if the schema, foreign keys, and other indexes make it safe to do so?
I’ve seen certain large-table queries improve by 10x when foreign keys were added.
Hah. In his defense, this was MySQL 5.x...
He also refused to allow "datetime" or "timestamp" datatypes anywhere. All timestamps were stored as bigints.
These are actually very common restrictions at many of the companies with the largest/busiest MySQL installations in the world, typically OLTP workloads with extreme query rates.
Avoiding FKs enables sharding, reduces write locking time, and greatly simplifies online schema changes. These are essential requirements at extreme scale. (and in response to GP's point, at least in MySQL, FK constraints strictly harm performance; although the underlying index improves performance, you can have that without the FK constraint)
As for bigints for time values: storing UTC unix timestamps in a bigint avoids some issues with unexpected timezone conversions and DST conversions, as well as (in older mysql versions) unwanted auto-update behavior for the first timestamp column in a table. This one tends to be more of a "reduce support burden on the database team" type of requirement -- the risk of datetime or timestamp issues goes up as the number of product engineers massively outnumbers the db engineers, or once you have product engineers in many different timezones, data centers in many timezones, etc.
Of course, there are major trade-offs with these decisions. And they may or may not make sense for your former company's size and situation. But in any case these restrictions are not particularly unusual for MySQL.
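To make the bigint-epoch convention concrete (schema invented for illustration):

    CREATE TABLE events (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        created_at BIGINT NOT NULL   -- UTC unix epoch seconds, no implicit TZ/DST conversion
    );

    -- Convert explicitly at the edges instead of relying on session time zones:
    INSERT INTO events (created_at) VALUES (UNIX_TIMESTAMP());
    SELECT id, FROM_UNIXTIME(created_at) AS created_at_local FROM events;  -- rendered in the session TZ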
Totally... This is all valid. This MySQL install was over 10 years ago! At a previous company (about 15 years ago) we used FKs with MySQL and they did cause issues doing all the things you described.
They're limited in what they can express; your application often has invariants you want to enforce/maintain that can't be (performantly) expressed with DB constraints, and must be validated another way.
As great as it can be to enforce rules within the database, a lot of them usually end up needing to be enforced at the application layer instead. Especially when performance at scale comes into play.
I think it’s a balance. Transactions + Constraints can enforce most things but there will certainly be things that can only be verified in the app.
My goal is always to verify what I can in the database to minimize potential data cleanup. In my experience, app only verification always leads to future time investments to clean up the mess.
DB constraints can verify an important but inherently limited, simplified subset of data integrity.
For a crude example, it's trivial for DB constraints to verify (via a foreign key constraint) that all your contracts belong to some customer, but very difficult for DB constraints to verify that all your currently active contracts belong to a currently active customer, even if the definition of 'active' is some relatively simple business logic.
So in my experience it's not that rare to have some code-based data integrity tests that run various sanity checks on production data to verify things that DB constraints can not.
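For example, a periodic sanity check along these lines (schema hypothetical) catches what the foreign key alone cannot:

    -- Active contracts whose customer is no longer active: easy as a query,
    -- awkward or impossible as a declarative constraint.
    SELECT c.id AS contract_id, c.customer_id
    FROM contracts c
    JOIN customers cu ON cu.id = c.customer_id
    WHERE c.status = 'active'
      AND cu.status <> 'active';
    -- Any rows returned are integrity violations to investigate.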
Depends on the database, sometimes the database config, as to whether they'll actually be enforced or not, or in what situations data might evade enforcement of the constraints…
Applies to vendors, too. Had some data in Rackspace Files where "list files" would say X existed, but "GET X" got you a 404. Had an AWS RDS instance; query on it returned no results. Adding the "don't use the index" index hint caused it to return data. (Allegedly, this bug was fixed, but we had migrated off by that point, so I never got to confirm it.)
Conversely, I do like DB constraints, because if the DB constraint doesn't exist, then I guarantee you the production DB has a row that is a counter-example to whatever constraint you think the data should obey…
> Had some data in Rackspace Files where "list files" would say X existed, but "GET X" got you a 404.
Well yes, Rackspace Files (aka OpenStack Swift) is eventually consistent. It says so literally in the first sentence of the documentation [1]. But this discussion is about relational databases with ACID guarantees, where the C is literally "consistent".
Hey folks—I wrote the post! This was my biggest Postgres project to date, and it proved quite tricky since I didn't rehearse with a test database of the same size. I learned a bunch about Postgres, not least the incredibly powerful NOT VALID option for safely and quickly adding constraints.
Happy to stick around and answer any questions you have.
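For reference, the NOT VALID pattern looks roughly like this (table and constraint names are just for illustration):

    -- Adding the constraint as NOT VALID skips the full-table scan, so the lock is brief:
    ALTER TABLE audit_events
        ADD CONSTRAINT audit_events_user_fk
        FOREIGN KEY (user_id) REFERENCES users (id)
        NOT VALID;

    -- Validation then runs later and only needs a weak lock while it scans existing rows:
    ALTER TABLE audit_events VALIDATE CONSTRAINT audit_events_user_fk;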
Not in my experience. You can use the --link flag to get it to use hard links so it doesn't need to copy the data at all. I have been through this process myself a few times and it only took seconds on a 50-100GB db. I'm always a little surprised by how well it works.
I'm curious why a test run on a proper sized replica database wasn't in the testing plans. That is something I've been ensuring happens for a while now for similar projects.
I'm probably missing something, but it sounds like using Warp has a bunch of downsides vs "just" creating a read only replica using logical replication and then failing over. Did you choose Warp only because of Azure's limitations or were there other reasons?
Great post! Did you migrate your old database to Azure Flexible Server, Hyperscale (Citus) or a standalone VM, as Postgres 13 does not seem to be available for Azure Postgres Single Server.
This technique saved us from seriously increasing the cost of our Heroku Postgres instance. Thank goodness it exists and works so well. Multiple 80+ GB indexes shrank down to less than 10GB after just a couple of hours.
Exact strategy to be determined—we're looking at various data layers at the moment. I wish we could do something simple like a rotating log file, but we want to be able to query it in the app (for instance, to show recent logins).
Have you considered an OLAP database like Clickhouse or QuestDB? An OLAP database would be a much better fit for audit tables given the read and append-only writing requirements, would compress better, and you might be able to fit it directly without changing any app code with a Postgresql foreign data wrapper.
You can partition your audit/event table by time period and archive old events [1] or you can avoid the records hitting the database in the first place by generating the events elsewhere to begin with [2].
I'm not OP, but they were upgrading from Postgres 9.6, which at least implies that this initial db was from ~2017.
This is barely past the initial release of Cockroach. It would have been kind of crazy for the Retool team to use an experimental db with a lack of history when building up their product (that was not dependent on experimental new features)
It's written in Python, spins up a queue.Queue object, populates it with ranges of rows that need to be copied (based on min < ID < max ranges), starts up a bunch of Python threads and then each of those threads uses os.system() to run this:
psql "{source_url}" -c "COPY (SELECT * FROM ...) TO STDOUT" \
| psql "{dest_url}" -c "COPY {table_name} FROM STDIN"
This feels really smart to me. The Python GIL won't be a factor here.
For ETL out of Postgres, it is very hard to beat psql. Something as simple as this will happily saturate all your available network, CPU, and disk write. Wrapping it in Python helps you batch it out cleanly.
Thanks Simon! I can indeed confirm that this script managed to saturate the database's hardware capacity (I recall CPU being the bottleneck, and I had to dial down the parallelism to leave some CPU for actual application queries).
Sounds to me like this is the exact thing that the regular parallel command was made for; not sure Python is needed here if the end result is shelling out to os.system anyway.
For listennotes.com, we did a postgres 9.6 => 11 upgrade (in 2019), and 11 => 13 upgrade (in 2021). ~0 downtime for read ops, and ~45 seconds downtime for write ops.
Our database is less than 1TB. One master (for writes + some reads) + multiple slaves (read-only).
Here's what we did -
1, Launched a new read-only db with pg9.6, let's call it DB_A.
2, Stopped all offline tasks, and only maintained a minimal fleet of online servers (e.g., web, api...).
3, Changed all db hosts (no matter master or slave) in /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to use old read-only db with pg9.6, let's call it DB_B. From this point on, all write ops should fail.
4, Ran pg_upgrade (with --link) on DB_A to upgrade to pg11, and promoted it to be a master db.
5, Changed /etc/hosts on the minimal fleet of online servers (e.g., web, api...) to use DB_A for all db hosts. By this point, DB_A is a master db. And write ops should be good now.
6, Changed /etc/hosts for all other servers and brought back all services.
Step 4 is the most critical. If it fails or runs too long (e.g., more than 10 minutes), then we had to rollback by changing /etc/hosts on those online servers.
We carefully rehearsed these steps for an entire week, and timed each step. By the time we did it on production, we knew how many seconds/minutes each step would take. And we tried to automate as many things as possible in bash scripts.
We did something similar recently jumping from 10 to 13. We took measurements, did some dry runs, and came up with strategies to ensure our read-only followers would work fine and we’d have a minimum downtime for writes.
We missed one or two pieces of reconnecting things afterwards, and some of that seems to be limitations of Heroku Postgres that we couldn’t change. Hopefully those keep improving.
By the way, Google Cloud recently launched in-place upgrades of Postgres instances. A few days ago we used it to upgrade the multi-TB database at my company as well.
What is "in-place" about this? According to the docs you'll have ~10min downtime and you'll lose table statistics, which is exactly what happens when you run pg_upgrade manually.
The biggest problem with all the cloud providers is that you won't know exactly when this 10 minute downtime window will start
I guess the only advantage here is that you don't have to do 9->10->11->12->13->14 like in the past, and maybe that was one of the blockers Azure had. AWS allows skipping some major versions, but 9->14 is not possible.
In-place is a separate concept from zero downtime. Similarly, an inplace upgrade of your OS doesn't mean you can continue using the OS during the upgrade; it means you get to keep your data without restoring from an external backup.
The benefit of an inplace upgrade for postgres is you don't have to spin up another server, restore from backup, and run pg_upgrade yourself.
Azure doesn't really offer migration paths, and their database migration tool has a ton of edge cases (not supported over 1 TB, etc.), so while pg_upgrade is nice, Azure doesn't really have a path to use it.
On top of that, Azure Postgres (limited to pg11) has essentially been deprecated in favor of their v2 Flexible tier, with no official migration path.
Woah, this is great. Been waiting for this, since Cloud SQL has been very reliable in the past few years I've been using it, but upgrading was always a pain.
If you're running a 4 TB Postgres database, but you still have to worry about this level of maintenance, what's the value proposition of using a hosted service? There's usually insane markup on any hosted Postgres instance.
If you want to pay multi-thousands dollars a month for a database server, it's WAY cheaper just to slap a server with a ton of drives in colocation.
Might be an accounting scam^H^H^H^H trick to book the costs as an operating expense vs a capital expenditure. In general capital expenditures have to be planned carefully, held on the books for years, and show a return on investment. Perhaps an accountant can provide more detail.
Yeah, my petty 6 TB also went fine with pg_upgrade and practically no downtime. Upgrade the slave, promote it to master, upgrade the old master and then promote it back. It's a marvelous piece of technology.
It's really just a handful of core people who did most of the work, crafting it so thoughtfully over the years and it has such a huge impact on the world.
Doing huge part of it before postgresql was as popular as it is today, spending countless hours on making some great design choices and implementing them carefully.
It seems unlikely any of them will read that but I'm so deeply grateful to these people. It allowed so many things to flourish on top thanks to it being open source and free.
It's DAS. Mostly it's in one box of RAIDed 32x5.5Tb drives with a couple of tablespaces/WAL elsewhere. The DB is mostly read-only and not many concurrent users, so that's probably not the most typical case.
I am generally a fan of using as few moving parts as possible, but if >60-70% (2TB + a "few hundred GB") of your prod database is an append-only audit log, surely it would make sense to split that part into a separate DB server? Especially when you are using a hosted service. It sounds like both uptime and consistency requirements are very different between these two parts of the production data.
Postgres makes a lot of sense with append only tables. You can easily partition them by time (usually) and thus have an easy way to break up the index trees as well as using a cheap indexing scheme like BRIN and being able to just drop old chunks as they become irrelevant.
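A rough sketch of that setup (names invented): a time-partitioned audit table with a cheap BRIN index, where retiring old data is just dropping a partition:

    CREATE TABLE audit_log (
        happened_at timestamptz NOT NULL,
        user_id     bigint,
        payload     jsonb
    ) PARTITION BY RANGE (happened_at);

    CREATE TABLE audit_log_2021_10 PARTITION OF audit_log
        FOR VALUES FROM ('2021-10-01') TO ('2021-11-01');

    -- BRIN stays tiny and works well because rows arrive in roughly timestamp order.
    CREATE INDEX ON audit_log_2021_10 USING brin (happened_at);

    -- Retiring old data is a metadata operation, not a massive DELETE:
    DROP TABLE audit_log_2021_09;   -- assuming last month's partition exists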
I used to work at a company that had MongoDB as the main database. Leaving a lot of criticism aside, the replica-set model for Mongo made upgrades much easier than in other types of databases.
While that's true, managed services on e.g. AWS provide hot replicas as well, which you can use to upgrade the database and do a failover to the new version.
We actually migrated from vanilla Postgres to Aurora that way with minimal risk / downtime, it was a really smooth process.
Low / zero downtime is totally achievable with pg_logical and really boils down to whether or not you want to try to adopt bi-directional writes (and the conflict management / integrity issues that come with them), or whether you're willing to just have a brief session termination event and swap over. To me, the latter has generally been preferable, as conflict management tends to be more complicated in reality (based on business logic / state) than what pg_logical provides. Interested if people here have had success with bi-directional writes though.
Having successfully built (and sold!) a technology startup myself, I would always, always opt for a managed database service. Yes, it’s more expensive on paper and you want to run the numbers and choose the right offering. But nothing beats the peace of mind of storing your customers’ data on a system that others (Google Cloud in our case) look after. Not to mention that you’re better off focussing on your core value proposition and spending your scarce resources there than to have a database administrator on your payroll.
In SQL Server you just... do the upgrade. You install the upgrade on your nodes starting with the passive nodes, and it will automatically failover from the old version to the new version once half the nodes have been upgraded. No downtime, but your redundancy drops when some nodes have been upgraded but the cluster hasn't fully been upgraded yet. You certainly don't have to dump and restore your database. Without giving private numbers, our database is much bigger than OP's 4TB; dump and restore would be wildly unacceptable.
The idea that you don't get a seamless upgrade of the database itself with PostgreSQL is absurd to me. The part about "maximizing the amount of time this upgrade buys is" is only necessary because of how difficult PostgreSQL makes upgrades. We upgrade to every new version of SQL Server. It's not that big of a deal.
With every PostgreSQL blog article I read, I become more and more of an SQL Server fanboy. At this point it's full-blown. So many "serious business" PostgreSQL ops posts are just nothingburgers in the SQL Server world.
PG is far behind SQL Server on ease of upgrade but the method described in this post is not the best practice right now, which I think is:
- physical restore to new cluster
- pg_upgrade the new cluster
- catch up on logical wal logs from old cluster
- failover to new cluster
- STONITH
I think the above was not open to them because of the limitations of their managed PG instance. I haven't used Azure, but GCP managed SQL has loads of limitations. This seems very common and I think it is a major (and undiscussed) drawback of these managed instances.
But the truth is that very few of the people who use PG want to hear that things are better in the MS SQL community for reasons of prejudice and as a result you're being downvoted unfairly for pointing out PGs relative backwardness here.
I'm curious how this works for on-prem. Our SQL Server cluster is on-prem; we can't just spin up another cluster. An important aspect of the SQL Server upgrade process is it doesn't require any extra nodes. What did people do for pgsql upgrades before everyone moved to the cloud?
Here's a nice thing about PostgreSQL over SQL Server to satiate the fans: SQL Server is absurdly expensive to run in the cloud. I can't believe anyone uses RDS for SQL Server. Even in EC2 it's horrifically expensive. That's the main reason we have an on-prem cluster.
That was more-or-less the process the last time I did this. We have only 500GB of data, and I think pg_upgrade ran in 15 seconds or so.
If a minute of downtime isn't acceptable, then it presumably isn't acceptable in case of unexpected hardware failure either, and you'd be using one of the commercial multi-master extensions.
They created a replica database running the new version, then switched over to it. Not too dissimilar to what you describe, although more work since they started out with only a single instance without replication support.
They _ultimately_ didn't dump and restore, but it was the first thing they tried. It didn't work; it actually failed catastrophically for them. They describe this under the "Implementing logical replication" section. Their ultimate solution is what they tried after the dump-and-restore based DMS method failed and they took an unplanned outage due to yet more PostgreSQL-specific issues (unvacuumed tuples).
All of this is exactly what I'm talking about. This blog post describes kind of a nightmare process for something that is trivial in SQL Server. They actually needed a third party product from Citus just to successfully upgrade PostgreSQL! Stunning.
I don't think they "needed" the Citus tool, per se, it was just the easiest option. I don't know much about MS-SQL, but no doubt PostgreSQL has areas that can be improved, or even that outright suck.
The main barrier against adopting MS-SQL is just the pricing and that it's not open source. Another thing that PostgreSQL seems to do a lot better than MS-SQL is in the extensibility department, hence we have things like TimescaleDB, Citus, EdgeDB, and a whole lot more. I can't really find anything like that for MS-SQL, but perhaps I missed it?
You're absolutely right. A serious part of managing SQL Server is keeping your costs down. RDS for SQL Server is so unbelievably expensive that I can't believe anyone uses it. I'm not aware of any meaningful extensions to MSSQL in the sense of TimescaleDB and friends either. I'll make the claim that we don't need Citus because everything they offer is built into MSSQL, and we don't need TimescaleDB because column stores are built in too, but if you did find some kind of deep extension you wanted to build, you can't do it. Simply not an option with MSSQL. You either build it outside of SQL Server, or you don't build it.
Postgres dump and restore tooling is very poor performance-wise, easily 10x slower than SQL Server's. I love Postgres dearly and prefer to use it despite that, but I wish the Postgres devs renewed their interest in improving the neglected dump/restore tooling.
It’s been a long time since I used SQL Server so I don’t know that upgrade process well (I’m willing to believe it’s smoother though, especially wrt to replication / failover).
Keep in mind that they’re upgrading from a database version that’s almost 6 years old. Postgres has improved a lot in the last 5 major versions since then.
Another thing here is that I’m pretty sure they could have just done the in-place upgrade and it would have been fine. I’ve run pg_upgrade myself for a bunch of major versions now and it’s easy and doesn’t require dumping / restoring anything. Maybe there’s something else going on that I’m missing though.
What setup are you running with sql server to have it automatically failover? Is it a multi master configuration or are the additional nodes just read replicas?
These days Postgres actually allows logical replication so your servers can be running different versions at the same time, which allows for much smoother upgrades (haven’t tried that myself yet, don’t quote me on it!)
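For what it's worth, the moving parts for that are small; a hedged sketch using psycopg2, assuming both servers are on version 10 or later, wal_level = logical on the old primary, the schema already loaded on the new server (e.g. via pg_dump --schema-only), and hypothetical hostnames and credentials:

    import psycopg2

    # On the old-version primary: publish every table.
    pub = psycopg2.connect("host=old-primary dbname=app user=admin")
    pub.autocommit = True
    pub.cursor().execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

    # On the new-version server: subscribe. CREATE SUBSCRIPTION cannot run inside
    # a transaction block (it creates a replication slot), hence autocommit.
    sub = psycopg2.connect("host=new-primary dbname=app user=admin")
    sub.autocommit = True
    sub.cursor().execute("""
        CREATE SUBSCRIPTION upgrade_sub
        CONNECTION 'host=old-primary dbname=app user=replicator'
        PUBLICATION upgrade_pub;
    """)
    # Once the new server has caught up, cut traffic over and drop the subscription.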
I believe pg_upgrade isn't guaranteed to always work; it's possible they might change the table storage such that it's unreadable in a new version, and pg_upgrade is documented to fail if so. However, I don't think it's ever happened. That may just be an abundance of caution in the documentation. I wonder why the author of this article didn't mention this possibility.
SQL Server is designed to run in a Windows Server Failover Cluster; the SQL Server-side feature is called "Always On availability groups" in an HA configuration. It's a single-master arrangement, you can either have a fully passive secondary (that's what we do) or read-only replicas. The WSFC handles managing quorum, that's what causes the automatic failover as soon as >50% of the nodes are running the new version.
For what it's worth, it's worked for my upgrades this far (9.6 -> 13); though I'll be looking to go the logical replication route for the next round.
I suspect the way I'll be setting it up is much the same as what you describe in your WSFC configuration (with a little more manual wrangling, no doubt).
What facts are you looking for? I just described the steps from this document: https://docs.microsoft.com/en-us/sql/sql-server/failover-clu... -- specifically, the "Perform a rolling upgrade or update" section. There's nothing else to my post other than contrasting the SQL Server upgrade process to the one described in the article, and musing about my growing appreciation for SQL Server; I apologize if it seemed like it was going to be deeper than that.
EDIT: I realized you're looking for the other PostgreSQL blog posts. Here's an example of two recent HN posts about PostgreSQL issues that I pulled out of my comment history. Both of these blog posts exist because PostgreSQL doesn't have query hints. SQL Server has them; I've dealt with issues like these blog posts describe but they have trivial fixes in the SQL Server world. Nothing to write a blog post about. I don't have a link handy regarding PostgreSQL's txn id wraparound problem, but SQL Server doesn't have that problem, either.
The first upgrade strategy is not the normal or easy one for anything in production, btw.
Very small companies might be able to do this on a 'no load' day, but from a pure business perspective, running your DB twice is easier and way less risky.
You could have done this even without downtime by letting your connection proxy handle the switch.
Hey @mrbabbage! Retool customer here! Great service :)
A non-distributed RDBMS is a great (yet underrated) choice. Thank you for the good writeup.
I was thinking you could have a much less delicate migration experience next time (some years from now), so that you can go for a quicker, parallel "dump and restore" migration.
For example:
- Client-side sharding in the application layer: you could shard your customers' data across N smaller DB instances (consistent hashing on customer ID); see the sketch after this list.
- Moving the append-only data somewhere other than Postgres prior to the upgrade. You don't need RDBMS capabilities for that stuff anyway. Look at Elasticsearch, Clickhouse, or any DB oriented to time-series data.
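For the first bullet, a minimal sketch of the client-side routing piece, assuming a fixed set of shard DSNs (the shard names and connection strings below are hypothetical). A ring with virtual nodes keeps most customers mapped to the same shard when a shard is added or removed:

    import bisect
    import hashlib

    def _hash(key: str) -> int:
        # Stable hash so every app server maps a given customer to the same shard.
        return int(hashlib.sha1(key.encode("utf-8")).hexdigest()[:16], 16)

    class ShardRing:
        """Tiny consistent-hash ring mapping customer IDs to shard DSNs."""

        def __init__(self, shards, vnodes=64):
            # shards: {shard name: DSN}
            self.shards = shards
            self._ring = sorted(
                (_hash(f"{name}#{i}"), name) for name in shards for i in range(vnodes)
            )
            self._keys = [h for h, _ in self._ring]

        def dsn_for(self, customer_id: str) -> str:
            # Walk clockwise to the first virtual node at or after the key's hash.
            idx = bisect.bisect(self._keys, _hash(customer_id)) % len(self._ring)
            return self.shards[self._ring[idx][1]]

    ring = ShardRing({
        "shard-0": "postgres://db0.internal/app",
        "shard-1": "postgres://db1.internal/app",
        "shard-2": "postgres://db2.internal/app",
    })
    print(ring.dsn_for("customer-42"))  # connect here for customer-42's rows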
The second bullet point is underway! Getting audit events out of the main database will be a major headache saver.
The first bullet point is on our radar for the near term. We have a very natural shard key in our schema (the customer ID), with AFAIK no relationships across that shard key. And once we start horizontally sharding, we can do cool things like putting your data in a shard geographically close to you, which will greatly increase app performance for our non US customers. Exciting stuff coming down the pike!
>"Last fall, we migrated this database from Postgres version 9.6 to version 13 with minimal downtime."
I thought it was interesting that they upgraded 4 major version numbers in one go. I kept expecting to read something about version compatibility and configuration but was surprised there was none. Are major upgrades like this just less of an issue with Postgres in general?
I think so? PostgreSQL is very well written software, AFAICT.
I've run into version incompatibilities before, but it was my fault – they were expertly documented in the release notes and I just hadn't read them (or sufficiently tested the upgrade before the live performance of it).
Nifty. But I can't help thinking this was harder than it needed to be in the cloud. Because frankly, 4 TB is not big: my home Synology backup server is 4 TB. Making a pair of standalone Linux servers to rehearse this locally, and with full control of which Postgres modules and other software to use, would have made things easier, it seems.
I wouldn't think using "standalone" versus whatever would make much of a difference.
If you're using a hosted DB service, you're (probably) stuck in needing/wanting to rehearse using the hosted service (which is what the blog post describes).
If they were running the DB on 'regular server' cloud instances, it seems just as good to me to rehearse with other cloud server instances versus "standalone" servers.
It's crazy how many apps opt for an RDBMS for append-only data like audit events. It's so tantalizing at the beginning, but it turns into a nightmare as time marches forward.
audit events -> queue -> elastic -> blob storage
is so easy to maintain and we save TBs from living in the DB.
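A hedged sketch of the producer end of that pipeline, assuming the queue is SQS (via boto3) and a hypothetical queue URL; downstream consumers would index events into Elasticsearch and archive them to blob storage:

    import json
    import time
    import boto3

    sqs = boto3.client("sqs")
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/audit-events"  # hypothetical

    def emit_audit_event(actor, action, record_id, before, after):
        # Fire-and-forget: the event never touches the RDBMS.
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({
                "ts": time.time(),
                "actor": actor,
                "action": action,
                "record_id": record_id,
                "before": before,
                "after": after,
            }),
        )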
Actually, I've seen more problems from folks mixing lots of different tools together than from folks keeping an append-only audit table in an RDBMS.
When your audit trail is in the DB, you can pretty easily surface audit events to your customers. Who changed what, when, is just another feature. Capturing audit events is also usually pretty smooth.
The folks doing the blob storage route, you would not BELIEVE the complexity they have to spin up to expose very simple histories etc. This matters a LOT in some spaces (financial etc), less so in others.
In my RDBMS model, "who changed this field when, from what to what" is a basic select. You can even shard by recordID or similar if you want to reduce table scans, and a good selection of indexes can be a huge help as well. In most cases users don't mind a bit of latency on these queries.
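A hedged sketch of that "basic select", against a hypothetical audit_events table shaped the way the parent describes (an index on (record_id, changed_at) keeps it from being a table scan); the DSN and column names are assumptions:

    import psycopg2

    conn = psycopg2.connect("host=db dbname=app")  # hypothetical DSN
    cur = conn.cursor()

    # Who changed this field, when, from what to what.
    cur.execute("""
        SELECT changed_at, changed_by, field, old_value, new_value
          FROM audit_events
         WHERE record_id = %s AND field = %s
         ORDER BY changed_at DESC
    """, ("order-1234", "status"))
    for row in cur.fetchall():
        print(row)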
My only experience in the financial sector indicates the opposite. The firm held its trading history for tens of thousands of accounts, going back 60 years, in SQL Server. Anyone who wanted a question answered had to submit it for overnight analysis and get the answer the next day. But in an optimal non-RDBMS representation, said trading history could be condensed to a single 200MB flat file that could be queried interactively, in microseconds. Dumping the RDBMS for most use cases pretty much revolutionized the daily experience for the people at that firm.
This seems borderline impossible? I'd be curious whether there were missing indexes, bad index selection, or attempts to do things like joins. Normally, for very large data sets, you can shard by account if that's the common filter and some insane table scan is the problem. Audit stuff tends to be 'sparse', so if you can get an index which just tells you which pages have results, that is usually a 100X speedup with a pretty small index size.
But agreed, a daily dump to something can go a long way toward unlocking other tools. In govt this is especially true, not because the DBMS is hard to use, but because with so many layers of consultants and others in the way, it's not usable.
SQL Server can easily handle such datasets if columnstore is employed. I wouldn't be surprised if a single weekend of building a proper index (...being very generous here) made their DB go literally 100x faster.
The problem at that firm wasn't SQL Server; it was the RDBMS mindset, which led them to do joins between tables of journaled trades and much larger tables of corporate actions going back decades, instead of materializing a more useful intermediate result. This is my main beef with RDBMSs: they lead to cognitive hazards of this type. It's not that the databases themselves are naturally bad systems.
It was "optimized" for the aesthetic concerns of RDBMS purists and for ease of implementation of certain other systems, in other words it was optimal for the developers and severely sub-optimal for the users, which is a problem endemic to RDBMS fandom.
One of the biggest things that keeping audit records in your DB gives you is transactionality around your audit logs. Sending audit events to an external system (quite often) loses this, and the engineering resources it takes to get that back before you're forced to are way larger than a slightly larger AWS/GCP/Azure/<insert-computer-provider-here> bill.
We're implementing something similar to what OP describes, but we'll keep the "queue" in the DB in order to insert the application audit event in the same transaction as the data change. A background process then uploads to secondary storage.
We won't have billions of rows though, so once uploaded to secondary storage we'll just clear the blob field and set a "processed" flag.
This way we can find all the relevant keys for a given order, invoice etc quickly based on a partial key search in the database, and transparently fetch from either db directly or secondary storage as needed.
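A hedged sketch of that pattern, with hypothetical table and column names (orders, audit_events, blob, processed) and a placeholder upload_to_secondary_storage function: the audit row commits in the same transaction as the business change, and a background worker later ships the blob out and clears it:

    import psycopg2

    def place_order(conn, order_id, payload, actor):
        # Business change and audit event commit (or roll back) together.
        with conn:
            with conn.cursor() as cur:
                cur.execute("UPDATE orders SET status = 'placed' WHERE id = %s", (order_id,))
                cur.execute(
                    "INSERT INTO audit_events (record_id, actor, blob, processed) "
                    "VALUES (%s, %s, %s, false)",
                    (order_id, actor, psycopg2.Binary(payload)),
                )

    def upload_batch(conn, upload_to_secondary_storage):
        # Background worker: push unprocessed blobs out, keep the searchable keys in the DB.
        with conn:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT id, record_id, blob FROM audit_events "
                    "WHERE NOT processed LIMIT 100 FOR UPDATE SKIP LOCKED"
                )
                for event_id, record_id, blob in cur.fetchall():
                    upload_to_secondary_storage(record_id, event_id, bytes(blob))
                    cur.execute(
                        "UPDATE audit_events SET blob = NULL, processed = true WHERE id = %s",
                        (event_id,),
                    )

FOR UPDATE SKIP LOCKED lets several workers drain the queue without stepping on each other, which is one of the nicer side effects of keeping the queue in Postgres.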
We (Retool) are going to be doing this very soon and cut our database size by 50%+. And you're exactly right: it's so easy to get started by sticking audits (or other append-only data schema) in an RDBMS, but it quickly becomes a headache and bottleneck.
Up until some size, the headache from maintaining a separate datastore is bigger. Everything should be in the RDBMS until proven otherwise, for the sake of simplicity. It's actually amazing how much you can squeeze out of 'old school' databases.
I worked for a company that did a similar "upgrade by replication", but with MySQL. It's quite a few years ago, so I don't remember the versions involved, but it was quite straight-forward once we had done _weeks_ of test runs on a dev environment.
One invaluable thing, though: our application was from the beginning designed to do 100% of all the reads from a read-only slave _if the slave was up to sync_ (which it was 95% of the time). We could also identify testers/developers in the application itself, so we had them using the upgraded slave for two weeks before the actual upgrade.
This made it possible for us to filter out problems in the application/DB layer, which were few (which suggests it was probably only a minor version upgrade).
But upgrading by replication is something I can recommend.
MySQL's built-in replication has always been logical replication, and it officially supports replicating from an older-version primary to newer-version replicas. So similar concept to what's described here, but much simpler upgrade process.
Generally you just upgrade the replicas; then promote a replica to be the new primary; then upgrade the old primary and turn it into a replica.
The actual "upgrade" step is quite fast, since it doesn't actually need to iterate over your tables' row data.
At large scale, the painful part of major-version MySQL upgrades tends to be performance testing, but that's performed separately and prior to the actual upgrade process. Third-party tools (pt-upgrade, proxysql mirroring, etc) help a lot with this.
"However, on our 4 TB production database, the initial dump and restore never completed: DMS encountered an error but failed to report the error to us."
THERE IS NO CLOUD: It’s just someone else’s computer