Lessons learned scaling a PostgreSQL database to 1.2bn records/month (medium.com)
259 points by willvarfar 18 days ago | 73 comments

> The primary reason we migrated away from Google SQL for PostgreSQL is that we discovered a bug that was causing data corruption. This was a known bug that is fixed in newer PostgreSQL versions. However, Google SQL for PostgreSQL is several versions behind. The lack of response from support acknowledging the issue was a big enough red flag to move on. I am glad we did move on, because it has been 8 months since we raised the issue, and the version of PostgreSQL has not been updated to date.

We have had similar experiences with various DB-related issues. AWS is miles ahead on anything to do with storage and databases.

Google cloud support is PATHETIC.

We had a production Kubernetes outage caused by a bug in Google Cloud. We were paying for gold level support.

They de-escalated and re-escalated our ticket 3 times.

In the end they closed the ticket and sent me a link to "Architecting distributed application" docs. I think they were trying to be condescending. That idiot should have been canned.

Eventually I threw a big fit in the Google Cloud slack channel and thockin ended up looking at the issue and finding the bug.

LOL. Thanks for the shout out. I am sorry you had a bad experience. Please be assured that we are working on it, and that we take this seriously.

If I may be so bold, where should I direct this complaint so that people hear it:

When you delete a CloudSQL instance, it also deletes the backups associated with that instance.

There is also no way to mark a CloudSQL instance as "protected" so one bad keypress can lose you your production database and all backups.

This is... cartoonishly dumb. Clownish.

- CloudSQL instance deletion protection should exist

- Those backups should be preserved somewhere when I delete the instance.

agreed - backups have nothing to do with an instance. i should be able to spin up and down postgresql instances, but backups should be stored (and billed) as per cloud storage rates.

Tim Hockin is my freakin hero.

There's a Google cloud slack channel? Can you link ?

What's even stranger is that GCloud's PostgreSQL offering is relatively simple. It's just a https://cloud.google.com/compute/docs/disks/#repds replicated disk, probably inside a K8s cluster.

I mean, setting this up is extremely simple in a high-availability setting on K8s. So I would guess that a company the size of Google can actually figure out how to add a way to run pg_upgrade to upgrade a cluster (inject something into the image that runs the task...). Even AWS will run pg_upgrade with downtime (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...).
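For anyone who hasn't run it, a minimal pg_upgrade invocation looks roughly like this; all the paths below are illustrative and depend on your distribution's layout:

```shell
# Stop both clusters first, then run pg_upgrade from the *new* version's bin dir.
# Paths are illustrative -- substitute your actual bin/data directories.
/usr/lib/postgresql/11/bin/pg_upgrade \
  --old-bindir=/usr/lib/postgresql/9.6/bin \
  --new-bindir=/usr/lib/postgresql/11/bin \
  --old-datadir=/var/lib/postgresql/9.6/main \
  --new-datadir=/var/lib/postgresql/11/main \
  --link   # hard-link data files instead of copying, to minimize downtime
```

With --link the cutover is typically minutes of downtime even for large clusters, which is exactly the kind of task a managed provider could automate.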

Unless you make the ill-advised choice to use AWS Aurora :(

We did that and only then found out that, despite documentation suggesting the contrary [1], there is no upgrade path for major versions of AWS Aurora PostgreSQL. You're stuck with whatever you had when you started. (Unless you want to do a pg_dump/pg_restore, which could take weeks.)

[1] https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...

In PostgreSQL performance, Google Cloud Platform "beats AWS by a factor of 2 in all [Aiven.io] tests".

  > A program that produces incorrect results twice as fast is infinitely slower.
    — John Ousterhout

Not sure if this comment was about the data corruption bug in Cloud SQL's PostgreSQL version, but the benchmarks we ran at https://aiven.io/blog/postgresql-cloud-performance/ used Aiven's PostgreSQL service on top of VMs from the cloud infrastructure providers and weren't affected by Cloud SQL issues.

The same Linux, PostgreSQL, etc versions were used on all platforms. We didn't use RDS, CloudSQL or other managed services from the infrastructure providers, but the PDF version of the presentation does include comparison of our managed PostgreSQL service with RDS PG and Aurora PG if you're interested.

The article doesn't say the GCP results were incorrect. What's your source?

The article said they suffered data corruption. The Ousterhout quote is apropos.

Aiven is a managed service vendor and uses cloud VMs. This is not a test of AWS's or GCP's directly managed Postgres.

So my PostgreSQL deployment processes quite a bit more than 1.2bn records / month (closer to 10bn / month).

What I will say for AWS (which I use) is that backup is why I use it. The ease of backing up, rolling back, connecting additional services, and scaling makes it killer.

That being said, I've seen around 5x-10x speed improvements for the exact same cost using bare-metal servers. I've been using online.net/scaleway and similar services. However, the real savings are limited, as you need to do a lot more work and often pay more for backups. Overall, the real savings might be a 50% cost reduction, with some additional risk. Need more space? Launch a new server, copy, check, switch over, pray. With AWS it's one click.

The real question is whether you want to manage the database yourself or not, and how much time you have.

The author mentioned using https://www.pgbarman.org for this, with a freelance DBA.

Does anyone else have experience between the two options?

I can attest that pgbarman works great. cough I accidentally nuked a production server once. However thanks to barman the downtime was only tens of minutes.

It works by making a full backup at fixed intervals (say, weekly) and then "pretending" to be a replica, which makes it receive all the changelogs in realtime (basically, it receives and archives WALs). This makes point-in-time recovery possible.

Highly recommended.
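For the curious, a barman setup in streaming mode boils down to a small config file like the following; the server name, host, and retention window here are illustrative:

```ini
; /etc/barman.d/pg.conf -- illustrative streaming-backup configuration
[pg]
description = "Main PostgreSQL server"
conninfo = host=db.example.com user=barman dbname=postgres
streaming_conninfo = host=db.example.com user=streaming_barman
backup_method = postgres        ; take full backups via pg_basebackup
streaming_archiver = on         ; receive WAL continuously (pg_receivewal)
slot_name = barman              ; replication slot so WAL is retained for us
retention_policy = RECOVERY WINDOW OF 14 DAYS
```

After that, `barman check pg` verifies the setup, `barman backup pg` takes the periodic full backup, and `barman recover` replays the base backup plus archived WAL up to any point in time.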

I've used Barman for physical backups (with the streaming backup option). It works well; it's a bit of a slog to get set up the first time (the docs can be confusing in some areas), but once it's up it's great.

Add replication across regions into the mix and things swing even further in favor of AWS. Dealing with cross datacenter replication of MySQL was eating our lunch and preventing us from getting anything else done before we moved to an early version of RDS.

Can you share some information on the problems with cross-region replication? Does AWS use a replication protocol/transport other than MySQL's built-in replication?

The real question is also what happens when AWS becomes the "authorized, secure" source for whatever, while independent network nodes are marked as "unauthorized, untrusted" via some opaque mechanism.

The article confirms that the best cost-to-performance ratio is colocated hardware you build yourself.

Next is managed server.

Last: any of the cloud database providers.

The key piece is knowing what you are doing. The clouds don't care about your tiny problems. They won't fix things for you.

You must fix things for you.

Yep. We've tried to run compute-intensive tasks on the cloud, but it was a total disaster. With 24U worth of Supermicro we're able to do it with ease, at a 10x reduction in costs. The cloud has its place, but nothing beats bare metal for speed (and in some cases, reliability).

Exactly my experience. And not only with databases.

At a previous job I was able to cut our infrastructure expenses by a lot (70%) by essentially reducing our infra to 1 load-balancer, 2 beefy application servers, and 1 (+1 failover) PostgreSQL database. Each application server costs ~300 Euros p/m for 24 cores and 256GB of RAM. I have not experienced the same cost-to-perf ratio anywhere else.

What was your salary, and the salary of additional people to increase the bus factor to an acceptable level? There is a lot more than raw perf numbers that go into these calculations for any responsible business owner.

FWIW I'm in a similar situation, running a moderately sized site (10 servers, ~2gbps) on leaseweb hosts, costing ~$1,500/mo - every year or so I look at cloud offerings, and see that the bandwidth alone would be something like $40,000/mo.

I also have ~10 smaller hobby projects (<100 monthly active users) which could easily fit inside the smallest cloud services, but the smallest on offer seems to be $5/mo, so $50/mo in total. For $25/mo I can get a pair of bare metal servers which host all of them with failover and a ton of capacity to spare.

Then my actual day-job is at the opposite end of the scale, where we build our own datacenters...

I am very confused by the movement towards using the cloud for everything, because I work at all sorts of scales, and can't find a cloud host who works well for any of my use-cases :S

> I am very confused by the movement towards using the cloud for everything, because I work at all sorts of scales, and can't find a cloud host who works well for any of my use-cases

Just a couple off the top of my head:

1 - Incredibly large infrequently run / ad-hoc workloads

2 - Companies with such a horrendous internal work environment that they can't retain decent technical talent, where managed services can substitute to some degree.

Do you mind if I ask what service you're using to host your hobby projects?

Kimsufi - AFAIK they get their low prices by using second-hand hardware, but I've not found that to be an issue (hard drive failures are a little more common than the leaseweb servers, but "more common" is still only "one failure per server per 5 years" - and that's easily mitigated by setting up a pair of servers with failover)

This misconception is frequently brought up, but it's untrue. Whether you're on-prem or in the cloud, the skills are transferable and the staff is needed regardless, unless you're a dev who just knows API calls for cloud services (my condolences). Cloud doesn't mean you don't need to know the infrastructure; it's workload management. You still need to know how the underlying fundamentals operate, or you get burned (through excessive cost, unexpected performance degradation, or data loss).

Disclaimer: 18 years in different ops roles

Absolutely. It's not like you will have one less person in the cloud. You'll still need the same number of people (maybe more!), but they'll just be doing different things.

At my current role we looked at moving from Rackspace to AWS or self-hosted. AWS was easily 2x the cost of self-hosted. And self-hosted was faster! Plus you get better control.

Sure there are challenges, but when there are issues, we can figure it out. We don't have to wait for support.

And I have built a very similar, perhaps even slightly larger, system, only in MySQL with TokuDB :)

These free old-fashioned RDBMSs really can go fairly big and stay really fast; much bigger and faster than the noSQL stuff I also use.

One thing that pushes people towards noSQL is that it is easier to get started; but I think that I've saved buckets of time in the long run by going with a conventional relational database.

ymmv, but it's always worth asking 'is this problem too big for a classic free RDBMS?' and the answer is almost always 'hell, no!'

Thanks for the read! A couple of questions:

> Now we are renting our own hardware and maintain the database. We have a lot better hardware than any of the cloud service providers could offer

Which provider are you using?

> That 30% we can use to hire a freelance DBA to check in on the servers once a day.

Where did you find this freelancer?

> hire a freelance DBA to check in on the servers once a day

Not sure whether you were just interested in the author's specific freelancer but--

The term for this is "Remote DBA" and there are lots of them. I suppose the difficult part is confirming trustworthiness and skills before giving them direct access to your database.

Great example of a job queue correctly implemented inside a transactional relational database (with SKIP LOCKED and a table dedicated to the job queue), instead of using a specialized tool.

yes. the linked 2ndquadrant blog post explaining how to implement this using SKIP LOCKED is a good read: https://blog.2ndquadrant.com/what-is-select-skip-locked-for-...

i used this a few years ago to implement my own idiosyncratic job queue in postgres (toy-scale: 2 workers, 1 db, running on a potato-class thin client computer). i have a number of recurring jobs that need to run periodically, and the jobs are not independent: different jobs depend upon shared resources, so if lots of jobs that depend on resource R just ran, we might need to wait a while before other jobs that also depend on R are deemed feasible to schedule and run. it was reasonably easy to cobble something together of the form:

each worker process does the following:

  0. within a transaction
  1.   select a shortlist of rows for jobs that are due and currently feasible, subject to constraints;
  2.   attempt to acquire one of those rows using "select ... for update skip locked limit 1"
  3.   do work, update job result and status in db, commit the transaction
the only part of this that is slightly clever is that i don't actually have a hard requirement for mutual exclusion between worker processes to mediate access to shared resources, which means each worker process can maintain its own view of the 'fair share' of resources it is consuming, and throttle itself without needing to coordinate with other workers.
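for anyone who hasn't seen the pattern, the dequeue step looks along these lines; the table and column names here are made up for illustration:

```sql
-- Hypothetical jobs table; all names are illustrative.
CREATE TABLE jobs (
  id         bigserial PRIMARY KEY,
  run_after  timestamptz NOT NULL,
  status     text NOT NULL DEFAULT 'pending',
  payload    jsonb
);

-- Each worker claims one due job. SKIP LOCKED means concurrent
-- workers silently skip rows already claimed by someone else,
-- so no two workers ever grab the same job.
BEGIN;
SELECT id, payload
FROM jobs
WHERE status = 'pending' AND run_after <= now()
ORDER BY run_after
FOR UPDATE SKIP LOCKED
LIMIT 1;
-- ... do the work, then mark it done (id taken from the row above) ...
UPDATE jobs SET status = 'done' WHERE id = /* claimed id */ 1;
COMMIT;
```

the row lock is held for the duration of the transaction, so if a worker crashes mid-job the row simply becomes claimable again on rollback.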

When you don't need insane scalability, having a transactional queue is a godsend.

For the record, it's possible to have scalability and transactional enqueuing at the same time, with something like Google Cloud Datastore: https://cloud.google.com/appengine/docs/standard/python/data...

Yeah, with sharding. And global transactions. Which are costly. As long as you pass the cost to the customer. But if you're doing a lot, then you can't use them.

Very interesting read. Having worked on large Postgres databases myself, I would have loved to know about these tuning opportunities. It's disappointing to see Google providing such poor support for Postgres in Google Cloud, though. Even Microsoft provides first-class support. Google needs to up its game.

What is Aiven supposed to do about a bug in Timescale that is still open?

Aiven is great if you don't need the superuser access. They will run some bigger plans on local SSDs and allow you to create replicas and migrate across clouds and regions easily. It's a great way to run multi-cloud without shipping the data yourself. True that they could use more alerting features on resource usage.

GCP's managed services for open products are not recommended, for this exact reason: they are very behind or even obsolete, with very limited customization and control. Stick with BigTable, BigQuery, and Dataflow, and run the rest yourself on GCE/GKE.

Unfortunately we were not able to offer a concrete workaround in this case, but as mentioned to the author of the post, we were and are in contact with our friends at Timescale to resolve this, and an update with the fix will be available in Aiven shortly.

We are not the upstream for any of the engines running in Aiven, so we work with the upstream developers and communities to resolve any issues found. Often our involvement in fixing the issues is through providing logs, stack traces and other debugging information to the original authors of the code, but in some cases we also fix the issues ourselves and always contribute such fixes back upstream, see e.g. this commit for a PostgreSQL bug that was found by our customers and fixed by us: https://github.com/postgres/postgres/commit/07ef035129586ca2...

> In general, I didn’t understand what added value Aiven.io provides – we weren’t even warned when the database was running out of storage.

We have been considering using Aiven on GCP. Could you comment on the statement above? Because that's a little worrying.

It's impossible to comment on the specific circumstances of this report, but we do generally monitor and warn our users about running low on resources, be it storage, CPU, or memory. I admit that the alerting is by no means foolproof; we may miss rapid or sudden changes in usage patterns, but it works quite well on steadier, more common workloads.

Should the worst-case scenario happen and storage run out, there's always the option to upgrade to the next resource tier. This will restore the DB state to the latest successfully recorded transaction.

Just FYI - a fix for this bug was merged relatively quickly (within 4 days), and was released just this morning.


Congrats on 1.2, the analytical functions are a good addition.

Thank you :)

Very interesting read. I'm not convinced that Postgres should be used for every aspect of a data processing pipeline yet but the author has certainly proved it is possible. You also seem to get a lot of features for free that people using Kafka can only dream of.

It will be interesting to see how people will take advantage of new Postgres features in the future.

I will never complain about my minor pg issues again. Whatever else, the beast running here makes my problems look small.

I'm interested in learning more about sharded design, time series and json data if anyone else reading this or the OP wants to write more.

>Column order matters. We have tables with 60+ columns. Ordering columns to avoid padding saved 20%+ storage

I need to look into this

This was an informative read: https://blog.2ndquadrant.com/on-rocks-and-sand/
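The gist of the padding effect can be sketched in DDL; the types here are chosen purely to illustrate alignment, and actual savings depend on your schema:

```sql
-- Poor ordering: each narrow column forces alignment padding before
-- the following bigint, so every row carries wasted bytes.
CREATE TABLE padded (
  a boolean,     -- 1 byte + 7 bytes padding (b must start on an 8-byte boundary)
  b bigint,      -- 8 bytes
  c smallint,    -- 2 bytes + 6 bytes padding before d
  d bigint       -- 8 bytes
);

-- Better: order columns from widest alignment to narrowest,
-- so the narrow columns pack together at the end of the row.
CREATE TABLE compact (
  b bigint,
  d bigint,
  c smallint,
  a boolean
);
```

On a 60+ column table with billions of rows, reordering like this is where the 20%+ storage saving comes from.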

> The primary reason we migrated away from Google SQL for PostgreSQL is because we discovered a bug that was causing data corruption

Is anyone familiar with the details of the data corruption bug? I'm using Cloud SQL and would like to understand the risk.

My Cloud SQL instance is currently on 9.6.6, I just got an email from GCP saying they are upgrading to 9.6.11 next week. There is a data corruption bug mentioned in the 9.6.7 release notes, not sure if it's the one the author is referring to.


"Fix vacuuming of tuples that were updated while key-share locked (Andres Freund, Álvaro Herrera)

In some cases VACUUM would fail to remove such tuples even though they are now dead, leading to assorted data corruption scenarios."

> The takeaway here is that Google and Amazon prioritise their proprietary solutions (Google BigQuery, AWS Redshift)

Except this is nonsense and there is no evidence to back it up.

AWS Redshift is a replacement for an enterprise data warehouse. It is not suitable for a website, OLTP, or many analytics use cases, and AWS has never claimed it to be. That's why they offer a suite of different-paradigm databases, e.g. Aurora, DocumentDB, DynamoDB, Neptune, and Redshift: no one database fits every use case.

Also using a database as a job queue really is a strange thing to do when you are running on the cloud and there are dedicated services available specifically for this use case.

I don't think that's what he's implying. I think he means that the cloud providers are not very friendly to devs who will take the time to customize their RDBMS to squeeze out more performance, instead of switching over to a more expensive solution capable of handling the kind of high-volume OLAP he had to deal with.

Thanks for the writeup. Yes, PostgreSQL is an incredible database, except for its materialized views; IMO this is its biggest weakness vs Oracle (SQL Server's materialized views are also poor, with umpteen restrictions). With PG any view can be a materialized view; the problem is they don't have incremental or auto refresh, so they're not very useful. I roll my own using custom tables and transaction IDs (not too difficult, since they are all auto-generated), but it would be nice if Postgres added incremental refresh. I think this is their most urgent missing capability and should be top of the list for v12.

For the large tables, what about using timestamp-based table partitioning to split the data into smaller units, and then materializing each of them one by one?

So you get to keep materialized data and a single logical table.
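A sketch of what that could look like with PostgreSQL 10+ declarative partitioning; all the table and column names here are illustrative:

```sql
-- Range-partition an events table by timestamp, one partition per month.
CREATE TABLE events (
  id         bigserial,
  created_at timestamptz NOT NULL,
  payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2019_02 PARTITION OF events
  FOR VALUES FROM ('2019-02-01') TO ('2019-03-01');
CREATE TABLE events_2019_03 PARTITION OF events
  FOR VALUES FROM ('2019-03-01') TO ('2019-04-01');

-- Closed-out months can be materialized/archived and detached individually:
-- ALTER TABLE events DETACH PARTITION events_2019_02;
```

Queries against `events` are routed to the relevant partitions automatically, so you keep the single logical table while each month stays a small, independently manageable unit.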

You ended up co-locating your own hardware, saying "We have a lot better hardware than any of the cloud service providers could offer." Can you clarify then what kind of hardware you're using? All SSDs, HDDs, how much capacity, are you going for all in-memory transactions, etc., etc.?

I'm confused why hosting PostgreSQL themselves on GCP or AWS was not considered an option?

my understanding was that they were trying to outsource as much of the ops & infrastructure side of things as possible. hence trying a third service provider after GCP & AWS before giving up and doing it themselves.

I got that. :) But they went from managed-db, managed-host to unmanaged-db, unmanaged-host when the clear middle ground is unmanaged-db, managed host.

Maybe I missed it.

Hi, author here scanning through the comments.

There are quite a few things that I had to leave out as the post was getting really long. I am writing a follow up article now titled "How Kubernetes and twelve-factor methodology saved us USD 300k in 24 months" where I am going to touch on this.

The short version is that we never intended to stay with a cloud provider due to the costs of operating a large cluster (we continue to use GCP for critical components of our infrastructure, such as monitoring and alerting). It was convenient while we could afford it thanks to their startup programs, but migrating our data aggregation workers from cloud providers to dedicated servers cut costs by 40% (which adds up to a lot, as we are using ±0.5k vCPUs and _a lot_ of RAM).

> Hi, author here scanning through the comments.

Small note, there's a typo in the heading of "Google Cloud SQL for PosetgreSQL". ;)

PosetgreSQL -> PostgreSQL

Very interesting design. It looks like they are capturing events and trying to predict things. I wonder why they went with an uber-database and didn't use cheaper S3 for storage, some processing for prediction, and something simple and cheap for serving.

GCP beats out AWS by a factor of 2? Interesting

1.2bn/s? Try that, it's far more challenging.

that's the kind of problem i work on (millions per sec at peak). everything is a queue and we stream from servers via kafka, into processors, into redshift, more or less. it's free once it's moving around in AWS, but redshift and the processors get expensive, and i'm interested in reliable ad-hoc solutions... but haven't seen any.

can you elaborate on the process of copying data into Redshift? How often do you run COPY commands? On a continuous basis? Asking since Redshift as an OLAP isn't really optimized for heavy writes, so if you've figured out a way to do it, please bottle that magic!

What kind of database are you writing 1+ GB/s to...

i mean 1.2 billion records per second, not month. now that was a tough problem, back when i was sane.
