This is a killer feature. But it is hard to adopt on its own while you're on RDS, which is where most people end up.
If you can build a management layer that gives better-than-RDS features ...while allowing me to bring my own cloud, you are on your way to winning.
At my last company, we built a very similar tool in-house, using LVM w/ thin snapshots, instead of ZFS. The database snapshot host would receive streaming logical replication from our UAT and Staging environments.
Users (developers, QA, etc.) could then take an instant snapshot of a multi-TB database, provide a Git hash/branch, and we'd instantly clone the DB, spin up the application, and migrate it to the latest version.
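A minimal sketch of the LVM side of such a setup, assuming a thin volume vg0/pgdata holding the Postgres data directory (volume group and names are hypothetical, and the LVM commands are guarded so this is a no-op on machines without LVM):

```shell
#!/usr/bin/env bash
# Sketch: instant clone of a Postgres data volume using LVM thin snapshots.
# Assumes a thin volume vg0/pgdata backed by a thin pool (hypothetical names).
SNAP_NAME="pgdata_clone_$(date +%s)"
if command -v lvcreate >/dev/null 2>&1; then
  # A thin snapshot shares blocks with the origin, so creation is near-instant
  # regardless of database size.
  lvcreate --snapshot --name "$SNAP_NAME" vg0/pgdata
  lvchange -ay -Ky "vg0/$SNAP_NAME"   # activate (thin snapshots skip activation by default)
  # The snapshot can then be mounted and a throwaway Postgres started on it.
fi
echo "snapshot volume: vg0/$SNAP_NAME"
```

From there, spinning up the application against the snapshot and running migrations for a given Git branch is ordinary orchestration on top of that mount.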
That worked great, and the same thing could work with RDS. You could use something like AWS Database Migration Service to replicate from your RDS installation to your Database Lab host.
This way, you get to keep RDS for your actual test and staging environments, but you get streaming replication into your Database Lab instance for dev/QA.
My guess is that your market will then be restricted to those who are already running their own databases, because RDS restore-from-snapshot is good enough that most won't go through the DMS hackery (which is exceedingly nasty), and it would probably negate some of the "instant" features that you have.
Just my $0.02 though.
pg_dump -C -h remotehost -U remoteuser dbname | psql -h localhost -U localuser dbname
1) The dump/restore approach is one of the "thick cloning" methods. It takes a really significant amount of time: roughly, thick-cloning speed on modern hardware and networks is ~1 TiB/hour just for transferring over the network and writing to disk. Additionally, in the case of dump/restore, the timing depends on the number of indexes that need to be built, materialized views to be refreshed, and so on.
2) Dump/restore is "logical copying", meaning the physical structure is not preserved. Logically, you'll get the same data; physically, a completely different database: bloat is gone, indexes are "fresh and good", and so on.
With Database Lab, unless you're on RDS, initial copying is preferably done using "physical" methods (pg_basebackup, plain rsync, or – the best option – restoration from WAL-G/pgBackRest/Barman/etc. backups). The physical structure is preserved: you have an identical copy of the original database. Then you request "thin clones", which really takes only a couple of seconds even for a multi-terabyte database, and you can do various kinds of performance troubleshooting and experiments before you deploy your changes to production. Request 5 thin clones and check your idea 5 times, each time adjusting it a little bit. This is what is usually really needed in development.
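The thin-clone step itself is plain ZFS; a rough sketch, assuming a pool dblab_pool with the physically restored data directory in dblab_pool/pgdata (names are hypothetical, and the ZFS calls are guarded so this is a no-op without ZFS):

```shell
#!/usr/bin/env bash
# Sketch: serve a writable thin clone of a restored Postgres data directory.
# dblab_pool/pgdata is assumed to hold the physical copy (e.g. from pg_basebackup
# or a WAL-G/pgBackRest restore). Names are illustrative only.
SNAP="dblab_pool/pgdata@baseline"
CLONE="dblab_pool/clone_exp1"
if command -v zfs >/dev/null 2>&1; then
  zfs snapshot "$SNAP"         # point-in-time, copy-on-write snapshot
  zfs clone "$SNAP" "$CLONE"   # writable clone; seconds even for multi-TB data
  # A throwaway Postgres can then be started on the clone's mountpoint, e.g.:
  # pg_ctl -D "/$CLONE" -o '-p 6001' start
fi
echo "thin clone: $CLONE"
```

Each additional clone is just another `zfs clone` of the same snapshot, which is why requesting five of them costs almost nothing in either time or disk.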
By the way, I recognized you guys from the other tool you publish, Nancy. While I don't have a direct use case for it myself, I thought the idea was super valuable as well.
I am a huge Postgres guy (did you know you can build an entire web app in JUST Postgres?) so I had bookmarked a couple of your projects.
Thank you for the response, you guys do some fantastic work.
I work at a small company, and currently we clone our production database to our dev machines for testing. However, certain information in the database is sensitive and we don't want to include it on our dev machines. (This specific sensitive data is also stored in an encrypted format and the key is not included, but we'd still prefer that it not be included at all.)
Basically, I'd like to be able to clone the database and run some SQL to replace the data in some tables with mock data. But I can't think of an easy way to do this other than cloning the database, starting a temporary instance, running the SQL, then cloning it again – and distributing that final build to the devs.
This is requested quite often – for example, when we copy the database from production, sometimes all personal data needs to be removed so as not to break regulations.
It is possible in Database Lab, but it's not a very user-friendly feature yet. Briefly, the process is as follows.
The "sync" Postgres instance is configured as a production replica (preferably using WAL shipping from the WAL archive). Then, periodically, a new snapshot is created; currently this is done using this Bash script: https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc.... (We are going to make it part of the database-lab server in upcoming releases.)
Here https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc... you can place any data transformations, so the final snapshot used for thin cloning has adjusted data sets – for example, with all personal data removed or obfuscated.
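For illustration, the transformation step could be as simple as a psql script run against the pre-snapshot instance before the thin-clone snapshot is taken. All table and column names below are made up, and the psql call is guarded so this is a no-op without the Postgres client tools:

```shell
#!/usr/bin/env bash
# Sketch: obfuscate personal data on the "sync" instance before snapshotting.
# Table/column names (users, email, full_name, phone, audit_log) are hypothetical.
MASK_SQL="
UPDATE users
SET email     = 'user_' || id || '@example.com',
    full_name = 'User ' || id,
    phone     = NULL;
TRUNCATE audit_log;  -- sensitive data that is not needed for testing
"
if command -v psql >/dev/null 2>&1; then
  # Port and database name are illustrative for the pre-snapshot instance.
  psql -h localhost -p 6000 -U postgres -d appdb -c "$MASK_SQL"
fi
echo "masking SQL prepared"
```

Everyone who later requests a thin clone then gets the already-masked data set, so the obfuscation cost is paid once per snapshot, not once per clone.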
Of course, if you do this, you need to keep in mind that physically, you'll have a different database. It may affect some kinds of testing (for example, troubleshooting bloat issues or some cases of index performance degradation). There are various choices to be made here. If interested, we'll be happy to help, please join our community Slack which is mentioned in the docs and README.
There are masking tools out there that can mask data from production in-flight before dropping it into a dev environment for testing (so you get the same data with the sensitive parts changed), but this doesn't appear to be one of them. I know people who work in the field; it's a tough nut to crack to keep the database good enough to test with while making sure you don't have devs running around with customer data on their laptops/servers.
Thank you so much! "Masking" was the phrase I needed to find some potential solutions – I wasn't able to find much before. That points me in the right direction. Thanks!
- https://blog.taadeem.net/english/2019/01/03/8_anonymization_... – description of methods, and a tool for Postgres, postgresql_anonymizer
- https://habr.com/en/company/yandex/blog/485096/ – not for Postgres but for ClickHouse (an open-source DBMS for analytics); still, it covers the topic very well.
But thanks for the suggestion!
Any comments from people using it? I designed a CI system using Delphix at an old client and it was awesome!
We switched our staging databases to ZFS and saw inexplicable, very noticeable decreases in performance. Those trade-offs are fine for developer machines, but ultimately too much for our staging server, where we needed to run UAT.
But it could have just been quirks of how we had things set up.
Delphix mostly uses the power of ZFS and copy-on-write to give you instant database copies with a minimal storage footprint. We used it to give every developer their own database that they could do anything to without worrying about locks, etc. Refreshing to a more recent copy was a very quick operation that required a developer to run one command. It had a nice web interface and a REST interface for automation.
I'd recommend it, with the caveat that I've no idea how much it costs, as it was a dev tool a client was encouraging us to use. It was really nice to work with, though.
As for NetApp, they probably do other things, but their most popular product is their NAS, which internally uses a copy-on-write filesystem similar to ZFS.
Maybe that's the kind of tech Arthur C. Clarke had in mind when he said that any sufficiently advanced technology is indistinguishable from magic.
Really great idea here!
We have PII, analytics, catalog, transaction, etc. databases. We only let people clone the non-sensitive ones; the others are cloned at the schema level only, with the data left out.
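The schema-only path can be done with stock pg_dump; a sketch, assuming a hypothetical source database catalog_prod and dev target catalog_dev (the commands are guarded so this is a no-op without the Postgres client tools):

```shell
#!/usr/bin/env bash
# Sketch: clone structure but not data for a sensitive database.
# --schema-only emits DDL (tables, indexes, constraints) with no rows.
SRC_DB="catalog_prod"   # hypothetical source database on the prod host
DEV_DB="catalog_dev"    # hypothetical target database on the dev host
if command -v pg_dump >/dev/null 2>&1; then
  pg_dump --schema-only -h prod-host -U app "$SRC_DB" \
    | psql -h dev-host -U app "$DEV_DB"
fi
echo "schema-only clone: $SRC_DB -> $DEV_DB"
```

Since no rows are transferred, this is fast even for the largest sensitive databases, and the dev copy can be seeded with mock data afterwards.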
This also allows us to move and scale independently across different applications – but that's just a side effect of this design.
There are some complications with this approach, but it's a trade-off, and we try to come out on top using a foreign-data-wrapper approach and a database application router for reads and writes. We have open-sourced a Django DB router which routes queries on the basis of tags/app name/table name, etc.