Hacker News
Updating a 50 terabyte PostgreSQL database (2018) (adyen.com)
168 points by Maksadbek 7 days ago | 60 comments

This article mentions that a key piece of software they use in all this is Slony[0], software which allows replicating one primary/master to many secondaries/slaves, even when those replicas have different versions. That's pretty cool, but I'd like to instead draw your attention (dear reader) to the phenomenal writing happening in Slony's description of itself. It describes, in plain but technical language, what Slony is, then immediately jumps into addressing "why should you use this?" and does so very directly. It reiterates the problem it solves, mentions other potential solutions by name to contextualize the current solution space, then explains in a nice bulleted list exactly the situations and niches that Slony is tailored for.

Oh my heavens, I wish every piece of technology described itself this way!

[0] - Slony -- https://www.slony.info/

Archived link, since it appears to be down at the moment: https://web.archive.org/web/20211031124351/https://www.slony...

One of the things I always wonder about giant relational databases is: how much of the "typical relational stuff" are they actually using?

Do they have constraints on rows? Are they using views or do they just denormalize and duplicate? Do they use joins at all? Are they even doing more than 1 thing in a transaction?

My only experience with databases of that size is for data analysis so yeah, constraints are relaxed. But even at that point ideas like normalization are critical to extracting performance out of large datasets.

Normalization is a performance optimization. Denormalization is a development shortcut. Neither is right or wrong but I would be surprised if a 50TB OLTP database wasn’t already highly normalized.

If it isn’t then my next guess is that it could be made smaller or more performant if it was.

We used to be proud of server uptime, back when we gave them names. Today if you have a server up for 900 days you’re going to be accused of malpractice.

Similar for data. We used to be proud of being able to keep “big” data online, but I’m no longer impressed. You’re just spending money. Did you actually solve the business case as efficiently as possible given available resources? Do you need 50TB in the first place?

Normalization is pretty independent of optimization strategy, and generally that's not presented as one of its advantages. Normalizing data can improve write performance but will make reads slower, so if you have a write-intensive database then you will see some performance gains.

There are performance benefits to normalization at scale. Take a look at the 2005 schema changes at Wikipedia [1] for a real-world example.

> Normalizing data can improve write performance but will make reads slower, so if you have a write-intensive database then you will see some performance gains.

In OLTP this might be true, depending on access patterns. Normalization may improve both read and write performance. It could also make it worse if you take it too far; domain-key normal form could kill you in unique indices alone.

In OLAP, different levels of normalization actually help both read and write performance. Take a look at the star schema to see how that works. In general, dimensions as they relate to fact tables are normalized, but dimensions themselves are denormalized.

Normalization is a tool like anything else. You can’t make absolute statements about it. The appropriate application will depend on your use case and may change over time.
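
To make the star-schema point concrete, here is a toy sketch in Python, with tables as plain dicts and every name invented for illustration: fact rows are normalized down to integer keys, while each dimension row is denormalized (city and country live together in one row).

```python
# Toy star schema. dim_date is shown only to illustrate the shape:
# a fact table typically references several small dimension tables.
dim_date = {1: {"day": "2018-03-01", "month": "2018-03"},
            2: {"day": "2018-03-02", "month": "2018-03"}}
dim_store = {10: {"name": "Amsterdam", "country": "NL"},
             11: {"name": "Berlin", "country": "DE"}}

# Fact rows stay narrow: (date_key, store_key, amount).
facts = [(1, 10, 100.0), (1, 11, 50.0), (2, 10, 25.0)]

def revenue_by_country(facts, dim_store):
    """Aggregate fact rows, resolving the store dimension per row."""
    totals = {}
    for _, store_key, amount in facts:
        country = dim_store[store_key]["country"]
        totals[country] = totals.get(country, 0.0) + amount
    return totals

print(revenue_by_country(facts, dim_store))  # {'NL': 125.0, 'DE': 50.0}
```

A real star schema does the same thing in SQL, joining one huge fact table to a handful of small dimension tables; keeping the fact table narrow is where the read and write wins come from.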

[1]: https://en.m.wikipedia.org/wiki/MediaWiki

>Normalization is a tool like anything else. You can’t make absolute statements about it.

That's exactly my point. I specifically said that normalization is independent of optimizations: it might help, it might not. It was specifically in response to your absolute statement about normalization being critical to performance, and I quote:

>Normalization is a performance optimization.

Emphasis taken directly from your quote.

Normalization has very little to do with optimization, and it's one of the least-cited reasons (if cited at all) for doing it.

In data(bases) performance and normalization are intimately related.

Denormalization adds to the volume of data to read and write, and often reduces the global hit ratio of caches.

On most servers the most plentiful and cheap resource is the CPU (not the storage!), and an adequate level of normalization loads CPU while relieving storage.

Normalization often adds storage 'seeks', and therefore doesn't cost much on some storage (SSDs...).

You have that backwards.

Denormalization is a performance optimization. By duplicating data you reduce the need for costly joins at the expense of data consistency.

Denormalization is only a performance optimization if it matches exactly to your queries. If your data is denormalized in some way, but later on you discover you want to query it differently, it may be way more expensive to query it then. Even if you know beforehand that you want to access the data in multiple different ways and you denormalize it for all those ways, it might be faster to use normalization because of better memory-locality. So, there are a lot of tradeoffs.

Both of you can be wrong, but in practice it's more like: finding the right level of normalization is a performance optimization.

Aren't a lot of NoSQL databases, especially document databases like Mongo, based around the idea that denormalization is a performance optimization (for read-heavy loads)?

Well, no. Normalization benefits DynamoDB too, if you understand the nature of the database. It’s all a spectrum.

In the last year my team did a lot of iterating on our DynamoDB schema and eventually just re-discovered normalization.

From what I can tell the benefit of databases like DynamoDB is that you can shard your workload over many hosts mostly transparently. So you get the benefit of more resources than fit in one box. But it’s not magic and you pay the price in other areas, such as hot partitions, and implementing join logic in your application.

Also Postgres is the subject at hand and is relational so normalization is an unavoidable design decision.

Said another way normalization is a fundamental concept to modeling data in general. It might even be the fundamental question. You have to balance your requirements to arrive at an answer.

Particularly in the dynamo case, you're working outside of a common buffer pool. One of the key benefits of normalization in a typical db is that you can fit more stuff into memory if you normalize - dynamo renders that point largely moot.

That's only one benefit of normalization.

In our case many of our objects had redundant data (aka denormalized) so updates required multiple calls to the DynamoDB service. By normalizing we saw throughput gains in our application and reduced service calls by taking fewer trips. Additionally we had conflated a couple of our domain-specific concepts in the data model and by splitting what was actually two independent entities that had been modeled as one we reduced the absolute record count.

I describe these optimizations as "making the data smaller" and "normalization".

> Normalization is a performance optimization

That is a stupid statement, because it's way too generic and it depends on the use case. Read-heavy data that needs a lot of joins is most often denormalized IF updating a lot is not a problem. Sometimes you need to create views that pull in different stuff with different queries, which makes them not really performant, especially not on Postgres, which is just super slow when it comes to listing/filtering data.

As a former Adyen employee, I can confirm that they use the relational features a lot. It's a ledger with millions (probably billions by now) of financial records, and it's heavily optimized for flexible querying. You have raw tables of transactions that are linked to many other tables (accounts, exchange rates, risk, transaction accept/capture/settle records, batches etc.), and you can ideally query the state of any transaction by just joining those tables on the db. What they also have is plenty of aggregate tables materialized by consumers and compactors, which are used a lot for reporting and auditing - for instance, daily transacted amount broken down by account and currency, or monthly revenue reports broken down by marketplace and currency. Additionally, being a financial institution they are also compelled to hold these records for several years, so there aren't many options for reducing the size of the db by aggregating/moving old records.

One question to ask is if they truly need all the 50 TB data in one database instance at once.

Some old inactive data can be moved into partitions, then detached and moved to a different DB instance, as an archiving step.

It reads like their processed payments (5000 tx/second) are very suitable for daily, monthly, yearly archiving.

For analytics, a summary stub can be kept in place of the detailed (archived) transactions.

For statutory reporting, the slow archives can be accessed on demand.

I didn't understand how they avoid downtime during the upgrade - they start with "Stop traffic to the database cluster" and then perform a bunch of steps that each take 5-10 minutes or longer before starting traffic again - so do they just accept the downtime and schedule it in advance, or did I miss a detail?

Whilst others have mentioned their application is architected to cope with this, I wanted to echo the sentiment. This post feels very much like a "draw the rest of the owl" meme.

Yeah, they architect their application to accept DB downtime – but I'm sure their services are still degraded to some degree or another during this, and they aren't clear how much total DB downtime they need for this (and how that time scales across various axes).

Overall my takeaway is basically "if you want to upgrade a large Postgres db, you'll need like an hour of planned downtime and a lot of careful work" which… doesn't make me excited to upgrade big Postgres DBs in production.

> Overall my takeaway is basically "if you want to upgrade a large Postgres db, you'll need like an hour of planned downtime and a lot of careful work"

HA is one of those things where MySQL wins hands down, sadly. Sadly in the sense that PostgreSQL HA still looks like a couple of hacks held together with duct tape, at least when compared to MySQL solutions.

The Percona MySQL distribution has multi-master HA (Percona XtraDB Cluster), which is great for HA, and vanilla MySQL 8.x has group replication, which IIRC is also multi-master. MariaDB instead has multi-master based on Galera Cluster (although I'm not very knowledgeable about MariaDB).

In PostgreSQL world there are many HA solutions, none of which is both multi-master and open source (sadly).

Only people who haven't run MySQL multi-master HA at scale praise it. In reality, it's a world of pain, where one single seemingly innocent query can take down the entire cluster.

Speaking from operation experience with MariaDB Galera and Percona Xtradb Cluster (which is also Galera).

> Only people who haven't run MySQL multi-master HA at scale praise it.

This statement (and the phrases that follow) is so vague that the only sensible thing I can reply is that you're probably using it wrong, and that you should go back to the fine manual.

We run (several) Percona XtraDB Cluster (PXC) and even though there is some occasional crash (usually a node that goes into OOM) it's very stable. We never had, so far, a whole cluster go down. In the worst case a single node was still up, and was operative while we recovered the other nodes.

Of course stability doesn't come for free: we spent some time using PMM (Percona Monitoring & Management) to find bottlenecks, and we keep PMM around for developers to troubleshoot performance issues on their own.

At the same time we just don't grant risky permissions to developers. One of the riskiest grants when dealing with PXC is ALTER TABLE, which requires the appropriate tooling (pt-online-schema-change from the Percona Toolkit) and also requires following a procedure. Need an ALTER? Open a ticket. Ez.

What can I say?

If PXC doesn't work well for you, you're probably using it wrong.

In practice, you should typically only send write traffic to 1 galera node, and only fall back to "multi-master" for a split-second when promoting another galera node as the new writer node, either due to scheduled maintenance or unexpected downtime on the old writer node. This setup works great when being fronted by a fleet of stateless proxysql nodes. The whole DBA team can sleep well, without having to worry about whether the automated failover logic would actually work or not.

Also note that Galera is just a set of APIs (https://github.com/codership/wsrep-API), which can be implemented against other databases to bring the same replication capability.

> In practice, you should typically only send write traffic to 1 galera node

Yes, we did that too; it's the only way to reduce the deadlocks and certification failures with Galera. However, it won't prevent a single `ALTER TABLE` from locking up the entire cluster. You can use `pt-online-schema-change` to work around that, but it's not perfect.

We've switched to simple Asynchronous Replication with Openark Orchestrator for the majority of our clusters and the DBA team is much happier now.

I did a little googling, and still find the answers are "logical replication, which is slow for large DBs" or "pg_upgrade, which requires nontrivial downtime".

Gitlab wrote a much more in-depth rundown in 2020 of a pg_upgrade, which in their case took 2h of downtime and 4h total time: https://about.gitlab.com/blog/2020/09/11/gitlab-pg-upgrade/ (submitted to HN here: https://news.ycombinator.com/item?id=29926316). They included a video of the whole thing which is pretty incredible.

> One other detail to note is that we built our software architecture in such a way that we can stop traffic to our PostgreSQL databases, queue the transactions, and run a PostgreSQL update without affecting payments acceptance.

Depending on the exact specifics, that's downtime by any reasonable measure. Is writing to a journal while the DB is down "downtime"?

Site-wise no, but DB-wise yes.


was wondering why they were using old PG's :-)

Irony. I think the HN post took down the Slony website.

I've not seen that website before, going back to the home page and clicking through the database options literally had me laughing out loud - thank you :-)

50TB is not so big these days. I read that in 2008 (!) Yahoo had a 2+ PB PG database. What is the largest you know of, 14 years later?

50TB is big. Bigger is possible I'm sure, but I'd guess 99.something% of all PG databases are less than 50TB.

If someone here commented they had a 2PB database, I guarantee someone else here would be like "pfft, that's not big"...

The OP could have better said that 50TB databases are common these days, when a single bare-metal server or a 24xlarge i3en or i4* instance on AWS can hold 60TB raw.

It's more than big enough to cause big problems / risk days of downtime to change, yeah. 50GB is not big. 50TB is at least touching big: you can do it on one physical machine if needed, but it's the sort of scale that benefits from bigger-system architecture. 50PB would be world-class big, hitting exciting new problems every time they do something.

With 50TB, and if you were doing a full text search, wouldn't the entirety of the index have to be held in memory?

No. Full-text indexes exist.

You can also do an incremental/streaming search. Lots of ways to avoid loading it all into memory at once, yeah.
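
One way to picture the incremental/streaming approach mentioned above is a generator that scans fixed-size chunks and yields hits, so memory use stays bounded regardless of corpus size. A minimal sketch, with every name hypothetical:

```python
def streaming_search(lines, needle, chunk_size=1000):
    """Scan an iterable of lines in fixed-size chunks, yielding matches.

    Only one chunk is held in memory at a time, so this works even
    when the full corpus is far larger than RAM.
    """
    chunk = []
    for lineno, line in enumerate(lines):
        chunk.append((lineno, line))
        if len(chunk) >= chunk_size:
            yield from ((n, l) for n, l in chunk if needle in l)
            chunk = []
    yield from ((n, l) for n, l in chunk if needle in l)

# A generator stands in for a huge table; nothing is materialized up front.
corpus = (f"record {i}" for i in range(10_000))
hits = list(streaming_search(corpus, "record 42", chunk_size=256))
```

A real full-text index (tsvector/GIN in Postgres, for instance) avoids even this linear scan, which is the parent's point; the streaming fallback just bounds memory, not work.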

Around ~2005 I took a tour of the [a well known government organization] and they were bragging about several-PB-sized databases at the time. Interestingly, there were a TON of server racks there in a bomb-proof building with tons of security, and they were all IBM servers (a supercomputer maybe?), if I remember correctly. Also, there was one small server rack that was painted differently from the rest (it looked like something made in-house), and we asked what it was, and the tour guide (a PhD computer scientist) said that technically it didn't exist and he couldn't talk about it even though it was super cool. Now that I know what they were doing around that time (and probably still today) I am kinda scared by the implications of that tour guide's statement and what that one tiny rack was for. I'm glad I never went to work in their organization, since that tour was meant to recruit some of us a few years down the road.

This comment contains no information other than an ego boost for yourself, AFAICT.

I need every ego boost I can get these days, friend. Either way, I was intending to tell a story directly relevant to the OP about how there were very large databases even back then. Interestingly, the same size databases are probably run on much less hardware today.

Was that a three letter US government agency?

How are people dealing with databases this large? At work we have a MySQL DB with a table that has 130M records in it, and a count(*) on the table takes 100 seconds. Anything but a simple lookup by id is almost unworkable. I assumed this was normal because it's too big. But am I missing something here? Are SQL databases capable of actually working fast at 50TB?

count(*) is always going to be slow. The database doesn't store the number of live tuples, just an estimate, so an exact count is a full table scan. The secret is to use indexes to get down to the small bit that you care about. If you're filtering on 3 columns, the goal is to get the index to wipe out at least half the results you don't care about, and so on and so forth.

A 130M record table with no indexes is going to be crazy slow. Although if all you need are primary key updates, then that's the way to go.
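
The difference an index makes shows up directly in the query plan. A toy sketch using Python's built-in sqlite3 as a stand-in for MySQL/Postgres (table, column, and index names are all made up):

```python
import sqlite3

# Same query, before and after adding an index on the filtered column.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer INTEGER, total REAL)")
conn.executemany(
    "INSERT INTO orders (customer, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(10_000)])

query = "SELECT * FROM orders WHERE customer = ?"

# Before indexing: the plan reports a scan over the whole table.
plan_before = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")

# After indexing: the plan reports a search using the index instead.
plan_after = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()

print(plan_before[0][-1])
print(plan_after[0][-1])
```

The exact plan wording varies by engine and version, but the scan-vs-index-search distinction is the thing to look for with `EXPLAIN` on a real 130M-row table too.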

Even at the 130M-row range, you should still be able to take advantage of indexes for fast queries beyond just the primary key. It's been a while since I used MySQL, but around 2010 I was working on MySQL 5.something and we had several >100M row tables that could still serve indexed queries very quickly (sub-ms, or a couple ms, iirc). If you are not able to do this, I suggest looking into the MySQL config and adding/tuning indexes. But yes, count(*) will be slow; I'm not aware of good workarounds for that other than caching, or using table stats with Postgres (if you don't need perfect accuracy). Not sure if MySQL supports something similar.
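
The caching workaround can be as simple as a trigger-maintained counter. A hedged sketch with sqlite3 standing in for the real database (all names hypothetical; Postgres and MySQL triggers can do the same):

```python
import sqlite3

# Keep an exact row count in a side table, maintained by triggers, so
# "how many rows?" becomes an O(1) lookup instead of a full scan.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payments (id INTEGER PRIMARY KEY, amount REAL);
    CREATE TABLE row_counts (tbl TEXT PRIMARY KEY, n INTEGER);
    INSERT INTO row_counts VALUES ('payments', 0);

    CREATE TRIGGER payments_ins AFTER INSERT ON payments
    BEGIN UPDATE row_counts SET n = n + 1 WHERE tbl = 'payments'; END;

    CREATE TRIGGER payments_del AFTER DELETE ON payments
    BEGIN UPDATE row_counts SET n = n - 1 WHERE tbl = 'payments'; END;
""")
conn.executemany("INSERT INTO payments (amount) VALUES (?)",
                 [(float(i),) for i in range(1000)])
conn.execute("DELETE FROM payments WHERE id <= 10")

# Constant-time read of the maintained counter:
(n,) = conn.execute(
    "SELECT n FROM row_counts WHERE tbl = 'payments'").fetchone()
print(n)  # 990
```

The tradeoff is write amplification: every insert and delete now also updates the counter row, which can itself become a hot spot under heavy concurrent writes.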

It depends on the queries you run. In Postgres we use stuff like materialized views, partial indexes, and HyperLogLog, and if you are using Citus (Postgres for adults), you can even have columnar tables to accelerate OLAP stuff.

Security and incident-response systems ingesting log files from other systems can get big; add in "must store for $x years" compliance fuzz and you might hit some big numbers.

Was it a single server?

Thanks for sharing this. Really interesting. But a basic question: why not upgrade to PG 13 instead? I am curious about the reasons for staying on an older version of PostgreSQL.

I see it now that it's a 2018 article. Makes sense!

Interesting. Does PG 13 let them do this without downtime?

Not from 9.4, but now there is a path using logical replication between versions: https://www.percona.com/blog/2019/04/04/replication-between-...

Upgrading, not updating.

While others are saying a 50TB DB is relatively normal, personally I can't imagine myself dealing with that sort of data always being online. If I were them, I'd ask myself: could we archive this data into split chunks? Is this archive even required to be online? What is the size of the subset that is actually current? And so on. Of course they have answers to that, and they are generally "no", but my imagination just stops here.

Some time ago, when I studied the Bitcoin ledger structure, I was confused about how it sums up all transactions to get a balance so quickly, like we in accounting usually do with the help of some periodic caching, which is another point of failure and maintenance. Bitcoin is a large enough database to not do that easily. A few docs later I realized that it doesn't do this:

  mining -> A 1
  A -> B 0.2
  (sum(A) == 0.8)
And instead it does this:

  mining -> A 1
  A -> (B 0.2; A 0.8)
  (last(A) == 0.8)
No sums required; all balances are checked by essentially "look up the latest appearance in the db", where the lookup also involves some Merkle tree optimization, which I didn't understand well enough to remember.
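
A toy model of that idea, with the dict-based "ledger" and all names invented for illustration: every transaction fully spends its inputs and creates new outputs (including change back to the sender), so a balance is just the sum of an owner's currently-unspent outputs, with no replay of history.

```python
utxos = {}          # (txid, output_index) -> (owner, amount)
next_txid = 0

def tx(inputs, outputs):
    """Spend the given UTXOs entirely and create new ones.

    Real Bitcoin enforces sum(outputs) <= sum(inputs) except for the
    coinbase (mining) transaction; that check is omitted here.
    """
    global next_txid
    for i in inputs:
        del utxos[i]                      # inputs are consumed completely
    txid = next_txid = next_txid + 1
    for idx, (owner, amount) in enumerate(outputs):
        utxos[(txid, idx)] = (owner, amount)
    return txid

def balance(owner):
    return sum(a for o, a in utxos.values() if o == owner)

coinbase = tx([], [("A", 1.0)])                   # mining -> A 1
tx([(coinbase, 0)], [("B", 0.2), ("A", 0.8)])     # A -> (B 0.2; A 0.8)
print(balance("A"), balance("B"))                 # 0.8 0.2
```

In the real system a node maintains exactly this kind of UTXO set, so checking a spend never requires summing the whole chain.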

> I was confused about how it does sum up all transactions to get the balance so quickly, like we in accounting usually do

For what it's worth, in a system design context, this is called event sourcing (in more general terms it's just a fold, though the idea of event sourcing doesn't preclude caching). I worked at a bank for a few years and this was how we calculated balances too.
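
The fold (and the optional snapshot caching) can be sketched in a few lines; event names and shapes here are made up:

```python
from functools import reduce

# Event sourcing for an account balance: the balance is a fold over the
# event log. A snapshot (cached balance + log position) lets you resume
# the fold without replaying the whole history.
events = [("deposit", 100), ("withdraw", 30), ("deposit", 5)]

def apply(balance, event):
    kind, amount = event
    return balance + amount if kind == "deposit" else balance - amount

balance = reduce(apply, events, 0)    # full replay from zero

# Resume from a snapshot taken after the first two events:
snapshot = {"position": 2, "balance": 70}
resumed = reduce(apply, events[snapshot["position"]:], snapshot["balance"])
print(balance, resumed)  # 75 75
```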

> And instead it does this:

I don't quite follow you here. It does something distinct from either a fold or caching the current total? (Do you mean that each transaction encodes the resultant balance?)

Not the entire balance, because a single wallet may have many "inputs", but that's just an implementation detail, as far as I understand it.


If the input is worth 50 BTC but you only want to send 25 BTC, Bitcoin will create two outputs worth 25 BTC: one to the destination, and one back to you (known as "change", though you send it to yourself). Any input bitcoins not redeemed in an output is considered a transaction fee; whoever generates the block can claim it by inserting it into the coinbase transaction of that block.

I.e. any "input" is spent completely by a transaction, and this makes it irrelevant to future calculations.

See also: https://bitcoin.stackexchange.com/questions/13069/how-does-t...
