If you're interested in how it works (and how you might implement it yourself), I wrote about the idea a few years ago. [1] To Postgres, starting up on a clone looks the same as recovering after someone suddenly unplugged the computer's power cable.
The shell script that implements the idea with LVM snapshots (it relies on an existing Postgres physical replica) is not too long; it's invoked over SSH.
$ wc -l /usr/local/bin/snapshot-*
164 /usr/local/bin/snapshot-create
16 /usr/local/bin/snapshot-drop
180 total
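A minimal sketch of the core steps, assuming the replica's data directory sits on an LVM logical volume vg0/lv_pgdata (names, mount point, and port are made up; the real scripts also handle cleanup and error cases):
$ # pause WAL replay so the snapshot is taken at a quiet moment (optional;
$ # crash recovery would cope anyway, exactly like a pulled power cable)
$ sudo -u postgres psql -c "SELECT pg_wal_replay_pause();"
$ sudo lvcreate --snapshot --size 10G --name pgdata_snap /dev/vg0/lv_pgdata
$ sudo -u postgres psql -c "SELECT pg_wal_replay_resume();"
$ # mount the copy-on-write snapshot and start a throwaway Postgres on it
$ sudo mount /dev/vg0/pgdata_snap /mnt/pgdata_snap
$ sudo rm /mnt/pgdata_snap/standby.signal    # PG 12+: make the clone standalone and writable
$ sudo -u postgres pg_ctl -D /mnt/pgdata_snap -o "-p 5433" start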
Had this tool existed at the time, I'd probably have used it (the monitoring and REST API might be handy). Still, the core idea can be implemented very easily.
[1]: https://www.sedlakovi.org/blog/2019/03/fast-postgres-snapsho...
It's much more sophisticated than plain cloning: it only stores the changes between the clones and the source database instead of full copies. I.e., ten clones of a 1 TB database take only about 1 TB on disk (plus whatever each clone changes).
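For example, with ZFS snapshots and clones (one way to get this copy-on-write behavior; pool and dataset names below are made up), every clone starts at essentially zero extra space and grows only by what gets written to it:
$ zfs snapshot tank/pgdata@baseline
$ zfs clone tank/pgdata@baseline tank/clone1
$ zfs clone tank/pgdata@baseline tank/clone2
$ zfs list -o name,used,refer    # clones show ~0 "used" until they diverge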
Does this have anything to handle potential PII? One of the problems with using prod data for development and testing is that it often contains sensitive strings which need to be scrubbed somehow.
There are several ways to process PII for use in the Database Lab, including masking and data obfuscation.
The main approaches can be found here: https://postgres.ai/docs/database-lab/masking.
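As a trivial illustration of one such approach, a hand-rolled masking view on a clone (table and column names are made up; the docs above describe the options that are actually supported):
$ psql -d clone_db -c "
  CREATE SCHEMA masked;
  CREATE VIEW masked.users AS
    SELECT id,
           'user_' || id::text || '@example.com' AS email,
           left(full_name, 1) || '***'           AS full_name
    FROM public.users;"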
This is simply great for master data & content management. Fast clone, override/mask data for QA/dev/test.
All that's left is to diff what's changed between the clone and the main database, and merge those changes back up; that would make for a full, safe master data and content management workflow.
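A crude way to diff a single table between the main database and a clone (connection details and table name are made up; a real merge would need keys and conflict handling):
$ psql "host=main-db dbname=app" -c "\copy (SELECT * FROM products ORDER BY id) TO 'main.csv' CSV"
$ psql "host=dblab-host port=6000 dbname=app" -c "\copy (SELECT * FROM products ORDER BY id) TO 'clone.csv' CSV"
$ diff -u main.csv clone.csv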
This looks like a really useful tool for staging and development environments, and it's even OSS that can be self-hosted! Does anybody here have any real-life experience with it they could share?
I'm using it (mostly locally at the moment) for semi-automating testing of django db data migrations, and it's really great.
I've previously spent a fair bit of time tuning Postgres with fast-but-less-durable settings to speed up testing, and this puts all of that to shame when you need to start with any non-trivial schema/data and revert back to it.
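For context, the kind of durability trade-offs I mean are things like this (only for disposable test instances):
$ psql -c "ALTER SYSTEM SET fsync = off;"
$ psql -c "ALTER SYSTEM SET synchronous_commit = off;"
$ psql -c "ALTER SYSTEM SET full_page_writes = off;"
$ psql -c "SELECT pg_reload_conf();"    # these three can be picked up without a restart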
I had been testing out some ideas with `docker commit` to save derived images which included a premade db, then reverting to that, but I don't think it's worth bothering with since I found dblab.
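For the record, that docker commit experiment looked roughly like this (image tag and file names are made up; the official postgres image declares a volume for its default data directory, which docker commit doesn't capture, so PGDATA has to point elsewhere):
$ docker run -d --name seed-pg -p 5432:5432 -e POSTGRES_PASSWORD=pw \
    -e PGDATA=/var/lib/postgresql/seeded postgres:14
$ # ...wait for init, then load the schema and fixtures...
$ psql -h localhost -U postgres -f schema_and_fixtures.sql
$ docker stop seed-pg && docker commit seed-pg myapp/pg-seeded
$ # every test run then starts from the premade database
$ docker run -d --rm -p 5432:5432 myapp/pg-seeded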
Haven't yet tried using it in CI, and I suspect it might need a more-custom-than-usual VM setup to get ZFS than you could easily manage on most hosted CI runners, but it's on my list of things to investigate/set up eventually.
I implemented a similar idea for a few companies. The response was always very positive.
In one company, it grew into a cluster of a few servers for higher capacity and availability. That load-balancing and management code is quite a bit more complex than the original single-server snapshotting script. Unfortunately, it's not open source. For several years, the single server was enough. Go for it. :-)
I'm doing something similar by cloning the data directory using Btrfs snapshots for backups and testing. Apart from the UI, is there a reason why this tool might be worth looking at?
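Roughly what I do, for comparison (paths simplified; the data directory has to live on its own subvolume):
$ sudo systemctl stop postgresql
$ sudo btrfs subvolume snapshot /var/lib/postgresql /snap/pg-before-test
$ sudo systemctl start postgresql
$ # ...run the risky migration or tests...
$ sudo systemctl stop postgresql
$ sudo mv /var/lib/postgresql /var/lib/postgresql.broken
$ sudo btrfs subvolume snapshot /snap/pg-before-test /var/lib/postgresql
$ sudo systemctl start postgresql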
I just came across testcontainers this week, which seems to solve similar problems. Has anyone tried both solutions and can comment about their experiences with them?
>How do you specify data types for columns or schema names for tables?
For now, to keep it simple, all columns are created as varchar and foreign keys as int; to declare any other column as int, you need to include the id keyword somewhere in its name when writing the definition.
> For modern Postgres versions it's also recommended to use standard compliant identity columns rather than the proprietary serial "types"
I have run it on Postgres 14 using psql (shell) and it is working fine, but I shall look into it.
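For reference, the standard-conforming form mentioned above is just a different column definition (table names here are only examples):
$ psql -c "CREATE TABLE t_old (id serial PRIMARY KEY);"                            # proprietary serial
$ psql -c "CREATE TABLE t_new (id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY);"  # SQL-standard identity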
Not sure about webapps, but in analytics this is a common practice. Real data has too many exceptions and edge cases. If you don't develop and test with real data, you will end up effectively developing or testing on production.
I have always made clones of production data on my dev machine for testing -- it has the added benefit of also regularly testing my DB backup restoration process. How else do you test and optimize SQL queries that are only slow with production-size data?
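Concretely, the kind of check that only means anything at production scale is something like this (made-up table and query, run against the cloned data):
$ psql -d prod_clone -c "EXPLAIN (ANALYZE, BUFFERS)
    SELECT count(*) FROM orders WHERE created_at > now() - interval '30 days';"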
If the data is too big to fit on my machine, I might clone to a nearby colocated server. Testing your DB backup and restoration mechanism becomes even MORE important if you have huge amounts of data.
I agree that this is a possible way. The main difficulty with generated data is its quality and structure, namely how well the artificial data corresponds (quantitatively and qualitatively) to the real data.
Random data may give incorrect results when optimizing a query.
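A tiny illustration with stock Postgres: fill one table with uniformly random values and one with skewed values, and the planner's estimates, and often the chosen plan, for the very same predicate come out completely different:
$ psql -c "CREATE TABLE t_uniform AS SELECT (random()*100000)::int AS customer_id FROM generate_series(1,1000000);"
$ psql -c "CREATE TABLE t_skewed  AS SELECT (random()^20*100000)::int AS customer_id FROM generate_series(1,1000000);"
$ psql -c "CREATE INDEX ON t_uniform (customer_id); CREATE INDEX ON t_skewed (customer_id);"
$ psql -c "ANALYZE t_uniform; ANALYZE t_skewed;"
$ psql -c "EXPLAIN SELECT * FROM t_uniform WHERE customer_id = 0;"   # only a handful of matching rows, expect an index scan
$ psql -c "EXPLAIN SELECT * FROM t_skewed  WHERE customer_id = 0;"   # over half the table matches, expect a seq scan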
You can buy a hardware appliance that encrypts credit card numbers for database storage. Then if someone accidentally (or maliciously on purpose) copies your database, you are not exposed to PCI liability.
At least with our DB we tokenize protected information, so all you have in the database is a GUID, the actual sensitive information is stored elsewhere.
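I.e. the database side is just something like this (illustrative schema only; the real card data lives in a separate, locked-down vault):
$ psql -d app -c "CREATE TABLE payments (
      id           bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      amount_cents bigint NOT NULL,
      card_token   uuid   NOT NULL  -- opaque GUID; resolving it to real data requires the vault
  );"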
I generate "backtests" of a stock trading model, i.e. I see how it would have done historically. Each backtest reads lots of vendor data from Postgres and writes some output to Postgres. I can't run a backtest on a "dev" version of the vendor data; I need all of it to make my results correct. The vendor data is too big to easily have two copies.
To make this more complex, it's not a simple process where I read vendor data then generate analysis. My model is generated in several steps, which involves reading vendor data, generating output, then combining the output I generated in a previous step with more vendor data to generate the next step of output.
Backtests are "prod-like" in that they must use real vendor data and must generate correct results that drive business decisions. But they are "dev-like" since I don't want my backtest to interfere with my production system, which generates the version of the model that I currently trade. For a backtest I might want to make a new database table, or change the behavior of an existing process that generates data.
I've tried two solutions to this. One is to have classes that handle all data reads and writes, plus configuration that makes them read from or write to a dev or production DB as needed. This is a pain to set up, but works well.
The other solution is DB clones. One advantage of DB clones is that they let me write SQL that joins data I've generated against vendor data. I don't love having business logic in SQL for the obvious reasons, but it can be very performant and easy to maintain. Using classes for data access means that I can't easily do a SQL join between vendor data (which is always on the production DB) and data I generate (which might be on the dev DB).
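The kind of query the clone approach makes trivial (schema and column names are made up; vendor data and my generated data sit side by side on the clone):
$ psql -d backtest_clone -c "
  SELECT v.ticker, v.trade_date, v.close_price, m.predicted_return
  FROM vendor.daily_prices v
  JOIN model.predictions   m USING (ticker, trade_date);"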
I think it will work well for verifying DB changes and optimizing SQL on real data.
This seems like a great tool to eliminate the risk of database-related downtime.
You are absolutely right. Depending on your environment, you should think about protecting sensitive data (for example, with data masking, obfuscation, and access control).