We have had similar experiences with various DB-related issues. AWS is miles ahead on anything to do with storage and databases.
We had a production Kubernetes outage caused by a bug in Google Cloud. We were paying for gold level support.
They de-escalated and re-escalated our ticket three times.
In the end they closed the ticket and sent me a link to "Architecting distributed application" docs. I think they were trying to be condescending. That idiot should have been canned.
Eventually I threw a big fit in the Google Cloud slack channel and thockin ended up looking at the issue and finding the bug.
When you delete a CloudSQL instance, it also deletes the back-ups associated with that instance along with it.
There is also no way to mark a CloudSQL instance as "protected" so one bad keypress can lose you your production database and all backups.
This is... cartoonishly dumb. Clownish.
- CloudSQL instance deletion protection should exist
- Those backups should be preserved somewhere when I delete the instance.
I mean, setting this up is extremely simple in a high-availability setting on K8s. So I would guess that a company the size of Google can figure out how to add a way to run pg_upgrade to upgrade a cluster (inject something into the image that runs the task...). Even on AWS it will run pg_upgrade with downtime (https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/USER_...).
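For reference, a self-managed major-version upgrade is roughly one pg_upgrade invocation. A minimal sketch, assuming an 11-to-12 upgrade with example paths (both versions' binaries must be installed and the old cluster stopped first):

```sh
# Sketch: in-place major-version upgrade, 11 -> 12 (paths are illustrative)
pg_upgrade \
  --old-bindir=/usr/lib/postgresql/11/bin \
  --new-bindir=/usr/lib/postgresql/12/bin \
  --old-datadir=/var/lib/postgresql/11/main \
  --new-datadir=/var/lib/postgresql/12/main \
  --link    # hard-link data files instead of copying, to shorten downtime
```

The `--link` mode is what keeps the downtime to minutes rather than hours on large clusters, at the cost of making rollback to the old cluster harder once the new one starts.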
We did that and only then found out that, despite documentation suggesting the contrary, there is no upgrade path for major versions of AWS Aurora PostgreSQL. You're stuck with whatever you had when you started. (Unless you want to do a pg_dump/pg_restore, which could take weeks.)
> A program that produces incorrect results twice as fast is infinitely slower.
— John Ousterhout
The same Linux, PostgreSQL, etc versions were used on all platforms. We didn't use RDS, CloudSQL or other managed services from the infrastructure providers, but the PDF version of the presentation does include comparison of our managed PostgreSQL service with RDS PG and Aurora PG if you're interested.
What I will say for AWS (which I use) is that backups are why I use it. The ease of backing up, rolling back, connecting additional services, and scaling makes it killer.
That being said, I’ve seen around 5x-10x speed improvements for the exact same cost using bare-metal servers. I’ve been using online.net/scaleway and similar services. However, the real savings are limited, as you need to do a lot more work and often pay more for backups. Overall, real savings might be a 50% cost reduction, with some additional risk. Need more space? Launch a new server, copy, check, switch over, pray. With AWS it’s one click.
The real question is whether you want to manage the database yourself, and how much time you have.
Does anyone else have experience between the two options?
It works by making a full backup at fixed intervals (say, weekly) and then "pretending" to be a replica, which makes it receive all the change logs in real time (basically, it receives and archives WAL segments). This makes point-in-time recovery possible.
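On the Postgres side, that scheme is driven by a couple of settings; a minimal sketch (the archiver command name here is a placeholder — real tools like WAL-E or pgBackRest supply their own):

```
# postgresql.conf (sketch): ship WAL out for point-in-time recovery
wal_level = replica                     # enough WAL detail for archiving/replication
archive_mode = on
archive_command = 'archiver-put %p %f'  # placeholder; %p = file path, %f = file name
```

Postgres runs `archive_command` once per completed 16 MB WAL segment; point-in-time recovery then replays the base backup plus archived segments up to a chosen timestamp.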
Next is a managed server.
Last: any of the cloud database providers.
The key piece is knowing what you are doing. The clouds don't care about your tiny problems, and they won't fix things for you. You must fix them yourself.
At a previous job I was able to cut our infrastructure expenses by a lot (70%) by reducing our infra to essentially one load balancer, two beefy application servers, and one (+1 failover) PostgreSQL database. Each application server costs ~300 euros per month for 24 cores and 256 GB of RAM. I have not experienced the same cost-to-performance ratio anywhere else.
I also have ~10 smaller hobby projects (<100 monthly active users) which could easily fit inside the smallest cloud services, but the smallest on offer seems to be $5/mo, so $50/mo in total. For $25/mo I can get a pair of bare metal servers which host all of them with failover and a ton of capacity to spare.
Then my actual day-job is at the opposite end of the scale, where we build our own datacenters...
I am very confused by the movement towards using the cloud for everything, because I work at all sorts of scales, and can't find a cloud host who works well for any of my use-cases :S
Just a couple off the top of my head:
1 - Incredibly large infrequently run / ad-hoc workloads
2 - Companies with such a horrendous internal work environment that they can't retain decent technical talent, where managed services can substitute to some degree.
Disclaimer: 18 years in different ops roles
Sure there are challenges, but when there are issues, we can figure it out. We don't have to wait for support.
These free, old-fashioned RDBMSs really can go fairly big and stay really fast; much bigger and faster than the NoSQL stuff I also use.
One thing that pushes people towards NoSQL is that it is easier to get started, but I think I've saved buckets of time in the long run by going with a conventional relational database.
YMMV, but it's always worth asking 'is this problem too big for a classic free RDBMS?' and the answer is almost always 'hell, no!'
> Now we are renting our own hardware and maintain the database. We have a lot better hardware than any of the cloud service providers could offer
Which provider are you using?
> That 30% we can use to hire a freelance DBA to check in on the servers once a day.
Where did you find this freelancer?
Not sure whether you were just interested in the author's specific freelancer but--
The term for this is "Remote DBA" and there are lots of them. I suppose the difficult part is confirming trustworthiness and skills before giving them direct access to your database.
I used this a few years ago to implement my own idiosyncratic job queue in Postgres (toy-scale: 2 workers, 1 DB, running on a potato-class thin client computer). I have a number of recurring jobs that need to run periodically, and the jobs are not independent: different jobs depend on shared resources, so if lots of jobs that depend on resource R just ran, we might need to wait a while before other jobs that also depend on R are deemed feasible to schedule and run. It was reasonably easy to cobble something together of the form:
each worker process does the following:
0. within a transaction
1. select a shortlist of rows for jobs that are due and are currently feasible, subject to constraints;
2. attempt to acquire one of those rows using "select ... for update skip locked limit 1"
3. do work, update job result and status in db, commit the transaction
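The steps above can be sketched in SQL, assuming a hypothetical `jobs` table with `status`, `run_at`, and `payload` columns (the feasibility check is simplified to a status/time filter here):

```sql
BEGIN;

-- Steps 1 + 2: shortlist due jobs and claim exactly one of them.
-- SKIP LOCKED makes concurrent workers silently pass over rows that
-- another worker has already locked, instead of blocking on them.
SELECT id, payload
FROM jobs
WHERE status = 'pending'
  AND run_at <= now()
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- Step 3: after doing the work, record the result; COMMIT releases the
-- row lock so the job is never visible to two workers at once.
UPDATE jobs SET status = 'done', finished_at = now() WHERE id = :claimed_id;
COMMIT;
```

The row lock held for the duration of the transaction is what gives the at-most-one-worker-per-job guarantee, with no extra queueing infrastructure.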
Aiven is great if you don't need the superuser access. They will run some bigger plans on local SSDs and allow you to create replicas and migrate across clouds and regions easily. It's a great way to run multi-cloud without shipping the data yourself. True that they could use more alerting features on resource usage.
GCP's managed versions of open-source products are best avoided for this exact reason: they tend to be far behind or even obsolete, with very limited customization and control. Stick with BigTable, BigQuery, and Dataflow, and run the rest yourself on GCE/GKE.
We are not the upstream for any of the engines running in Aiven, so we work with the upstream developers and communities to resolve any issues found. Often our involvement in fixing the issues is through providing logs, stack traces and other debugging information to the original authors of the code, but in some cases we also fix the issues ourselves and always contribute such fixes back upstream, see e.g. this commit for a PostgreSQL bug that was found by our customers and fixed by us: https://github.com/postgres/postgres/commit/07ef035129586ca2...
We have been considering using Aiven on GCP. Could you comment on the statement above? That's a little worrying.
Should the worst-case scenario happen and the storage run out, there's always the option to upgrade to the next resource tier size. This will restore the DB state to the latest successfully recorded transaction.
It will be interesting to see how people will take advantage of new Postgres features in the future.
I'm interested in learning more about sharded design, time series and json data if anyone else reading this or the OP wants to write more.
I need to look into this
Is anyone familiar with the details of the data corruption bug? I'm using Cloud SQL and would like to understand the risk.
"Fix vacuuming of tuples that were updated while key-share locked (Andres Freund, Álvaro Herrera)
In some cases VACUUM would fail to remove such tuples even though they are now dead, leading to assorted data corruption scenarios."
Except this is nonsense and there is no evidence to back it up.
AWS Redshift is a replacement for an enterprise data warehouse. It is not suitable for a website, OLTP, or many analytics use cases, and AWS has never claimed it to be. That's why they offer a suite of databases with different paradigms, e.g. Aurora, DocumentDB, DynamoDB, Neptune, and Redshift: no one database fits every use case.
Also, using a database as a job queue is a strange thing to do when you are running on the cloud and there are dedicated services available specifically for this use case.
So you get to keep materialized data and a single logical table.
Maybe I missed it.
There are quite a few things that I had to leave out as the post was getting really long. I am writing a follow up article now titled "How Kubernetes and twelve-factor methodology saved us USD 300k in 24 months" where I am going to touch on this.
The short version is that we never intended to stay with a cloud provider due to the costs of operating a large cluster (we continue to use GCP for critical components of our infrastructure, such as monitoring and alerting). It was convenient while we could afford it thanks to their startup programs, but migrating our data aggregation workers from cloud providers to dedicated servers cut costs by 40% (which adds up to a lot, as we are using ±0.5k vCPUs and _a lot_ of RAM).
Small note, there's a typo in the heading of "Google Cloud SQL for PosetgreSQL". ;)
PosetgreSQL -> PostgreSQL