Thin PostgreSQL Clones (github.com/postgres-ai)
114 points by samokhvalov on Jan 25, 2022 | 36 comments



If you're interested in how it works (and how you might implement it yourself), I wrote about the idea a few years ago. [1] To Postgres, recovering as a clone looks the same as if someone had suddenly unplugged the computer's power cable.

The shell script that implements the idea with LVM snapshots (it relies on an existing Postgres physical replica) is not too long. It's invoked over SSH.

    $ wc -l /usr/local/bin/snapshot-*
     164 /usr/local/bin/snapshot-create
      16 /usr/local/bin/snapshot-drop
     180 total
Had this tool existed at the time, I'd have probably used it (monitoring and REST API might be handy). Still, the core idea can be implemented very easily.
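
The core of it looks roughly like this (a sketch only; paths and names here are illustrative, not the actual script):

    # take a copy-on-write snapshot of the volume holding the replica's data directory
    lvcreate --snapshot --name pg_snap --size 10G /dev/vg0/pg_replica
    mkdir -p /mnt/pg_snap
    mount /dev/vg0/pg_snap /mnt/pg_snap

    # make the snapshot standalone (for a physical replica, drop standby.signal /
    # recovery.conf first), then let Postgres run its usual crash recovery on start
    pg_ctl -D /mnt/pg_snap/pgdata -o "-p 5433" start

    # clean up when done
    pg_ctl -D /mnt/pg_snap/pgdata stop
    umount /mnt/pg_snap
    lvremove -f /dev/vg0/pg_snap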

[1]: https://www.sedlakovi.org/blog/2019/03/fast-postgres-snapsho...


It’s a tool that clones Postgres databases. It’s not a clone of Postgres the application.


The postgres binary is only around 6MB, which is astonishingly small.


It's more sophisticated than plain copying: it only stores the changes between the databases rather than full copies. I.e., ten clones of a 1 TB database take only about 1 TB on disk.
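
With a copy-on-write filesystem like ZFS, that sharing looks roughly like this (a sketch; dataset names are made up):

    # one dataset holds the full 1 TB of data
    zfs snapshot tank/pgdata@base

    # each "copy" is a clone of that snapshot and costs almost nothing up front
    zfs clone tank/pgdata@base tank/clone1
    zfs clone tank/pgdata@base tank/clone2

    # only blocks that a clone actually changes consume extra space
    zfs list -r -o name,used,refer tank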


I had that exact idea!


Does this have something to handle potential PII? One of the problems with using prod data for development and testing is that it often contains sensitive strings which need to be scrubbed somehow.


There are several ways to process PII for use in the Database Lab, including masking and data obfuscation. The main approaches can be found here: https://postgres.ai/docs/database-lab/masking.
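
In the simplest case it can be a plain SQL script run against the data before it is exposed as clones, something like this (table and column names are examples only):

    psql -d clone_source -c "
      UPDATE users
      SET email     = 'user_' || id || '@example.com',
          full_name = 'User ' || id,
          phone     = NULL;
    "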


This blueprint about Database Lab from GitLab is quite interesting for understanding the use case and the benefits it brings: https://docs.gitlab.com/ee/architecture/blueprints/database_...


This is simply great for master data & content management. Fast clone, override/mask data for QA/dev/test.

All that's left is to diff what's changed between the clone and the main database, and merge the changes back up, for a full, safe workflow for master data and content management.


Exactly! Great thought, it aligns with our vision.

(Postgres.ai founder here)


This looks like a really useful tool for staging and development environments, and it's even OSS that can be self-hosted! Does anybody here have any real-life experiences with it they could share?


I'm using it (mostly locally at the moment) for semi-automating testing of django db data migrations, and it's really great.
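
The loop is roughly this (a sketch; the exact dblab flags are from memory, check `dblab clone --help`, and the port/credentials are whatever the engine hands back):

    # create a throwaway thin clone
    dblab clone create --username test --password test --id migr-test

    # point Django at the clone (assuming DATABASE_URL-style settings) and run the
    # data migration under test
    DATABASE_URL=postgres://test:test@localhost:6001/mydb \
        python manage.py migrate myapp 0042

    # throw the clone away and start again from the same snapshot in seconds
    dblab clone destroy migr-test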

I've previously spent a fair bit of time tuning Postgres with faster, less durable settings to speed up testing, and this puts all of that to shame when you need to start with any non-trivial schema/data and revert back to it.

I had been testing out some ideas with `docker commit` to save derived images which included a premade db, then reverting to that, but I don't think it's worth bothering with since I found dblab.

Haven't yet tried using it in CI, and I suspect it might need a more custom VM setup (to use ZFS) than you could easily get on most hosted CI runners, but it's on my list of things to investigate/set up eventually.


There are several public Case Studies of real users (from small teams to quite large companies): https://postgres.ai/resources/

For more real-life feedback, welcome to the Database Lab Community Slack: https://slack.postgres.ai/


I implemented a similar idea for a few companies. The response was always very positive.

In one company, it grew into a cluster of a few servers for higher capacity and availability. The load-balancing and management code is quite a bit more complex than the original single-server snapshotting script. Unfortunately, it's not open source. For several years, though, the single server was enough. Go for it. :-)


I'm doing something similar by cloning the data directory with BTRFS snapshots for backups and testing. Apart from the UI, is there a reason why this tool might be worth looking at?
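
For context, that's basically (paths are illustrative):

    # snapshot the subvolume that holds the data directory
    btrfs subvolume snapshot /var/lib/postgresql /srv/pg-snapshots/test1

    # run a throwaway instance off the snapshot (Debian-style layout assumed)
    pg_ctl -D /srv/pg-snapshots/test1/14/main -o "-p 5433" start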


I just came across testcontainers this week, which seems to solve similar problems. Has anyone tried both solutions and can comment about their experiences with them?


It's also a great tool. However, as far as I know, Testcontainers just manages container-based dependencies but doesn't prepare the data in any way.


A little off topic, but you can convert simple text to PostgreSQL scripts:

https://text2db.com/


Interesting.

How do you specify data types for columns or schema names for tables?

For modern Postgres versions, it's also recommended to use standard-compliant identity columns rather than the proprietary serial "types".
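
E.g. (illustrative DDL):

    psql -c "
      -- old style: proprietary pseudo-type
      CREATE TABLE t_old (id serial PRIMARY KEY);

      -- standard-compliant identity column (Postgres 10+)
      CREATE TABLE t_new (id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY);
    "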


Thanks for reviewing it.

> How do you specify data types for columns or schema names for tables?

For now, to keep it simple, all data types are created as varchar and foreign keys as int; to declare any other column as int, you need to include the "id" keyword somewhere in the column name when writing the definition.
> For modern Postgres versions, it's also recommended to use standard-compliant identity columns rather than the proprietary serial "types".

I have run it on Postgres 14 using psql (shell) and it works fine, but I shall look into it.

Thanks again for reviewing it.


Is this advocating the use of production data in dev/test/qa environments?

I am struggling to come up with scenarios where that would be a good idea.


Not sure about webapps, but in analytics this is common practice. Real data has too many exceptions and edge cases. If you don't develop and test with real data, you will end up effectively developing or testing on production.


Absolutely agreed. In addition, I know some data scientists who use thin clones from Database Lab to get data as of a specific point in time for their models.


I have always made clones of production data on my dev machine for testing -- it has the added benefit of also regularly testing my DB backup restoration process. How else do you test and optimize SQL queries that are only slow with production-size data?

If the data is too big to fit on my machine, I might clone to a nearby colocated server. Testing your DB backup and restoration mechanism becomes even MORE important if you have huge amounts of data.


> How else do you test and optimize SQL queries that are only slow with production-size data?

With something like this https://www.getsynth.com/docs/blog/2021/03/09/postgres-data-... (disclaimer: no affiliation with them, I've not used their product but it appears to be fully open source)
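
For quick one-offs, plain generate_series also goes a long way (the table and columns below are made up):

    psql -c "
      -- fill a test table with 10 million synthetic rows
      INSERT INTO orders (customer_id, amount, created_at)
      SELECT (random() * 100000)::int,
             round((random() * 500)::numeric, 2),
             now() - random() * interval '365 days'
      FROM generate_series(1, 10000000);
    "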


I agree that this is a possible way. The main difficulty with generated data is its quality and structure, namely how well the artificial data corresponds (quantitatively and qualitatively) to the real data.

Random data may give incorrect results when optimizing a query.


Yep! The idea there is to generate data that isn't just random but has the right structure. But I don't know if they deliver.


Do you also store customer personal details and credit cards locally too?


You can buy a hardware appliance that encrypts credit card numbers for database storage. Then if someone accidentally (or maliciously on purpose) copies your database, you are not exposed to PCI liability.


At least with our DB we tokenize protected information, so all you have in the database is a GUID; the actual sensitive information is stored elsewhere.
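
I.e. something along these lines (the schema is illustrative):

    psql -c "
      -- the application table only holds an opaque token
      CREATE TABLE customers (
        id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name       text,
        card_token uuid NOT NULL  -- resolves to the real card data only in the external vault
      );
    "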


Not every database has customer secrets in it. But you are not wrong that it should at least be understood what is sensitive before cloning.


You can clone the "depersonalized" data for your developer teammates.

In a team of 10 developers working on 32 issues, you need at least 32 separate environments.


I generate "backtests" of a stock trading model, IE I see how it would have done historically. Each backtest reads lots of vendor data from Postgres, and writes some output to Postgres. I can't run a backtest on a "dev" version of the vendor data, I need all of it to make my results correct. The vendor data is too big to easily have two copies.

To make this more complex, it's not a simple process where I read vendor data then generate analysis. My model is generated in several steps, which involves reading vendor data, generating output, then combining the output I generated in a previous step with more vendor data to generate the next step of output.

Backtests are "prod-like" in that they must use real vendor data and must generate correct results that drive business decisions. But they are "dev-like" since I don't want my backtest to interfere with my production system, which generates the version of the model that I currently trade. For a backtest I might want to make a new database table, or change the behavior of an existing process that generates data.

I've tried two solutions to this. One is to have classes handle all data reads and writes, plus configuration that makes functions read from or write to a dev or production DB as needed. This is a pain to set up, but it works well.
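
Concretely, the configuration side of that first approach ends up being something like this (names are made up):

    # the data-access layer reads these and routes queries accordingly
    export VENDOR_DSN="postgresql://ro_user@prod-db/vendor"        # always the real vendor data
    export OUTPUT_DSN="postgresql://me@dev-db/backtest_scratch"    # where this backtest writes
    python run_backtest.py    # hypothetical entry point that reads both DSNs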

The other solution is DB clones. One advantage of DB clones is that it lets me write SQL that joins data I've generated on vendor data. I don't love having business logic in SQL for the obvious reasons, but it can be very performant and easy to maintain. Using classes for data access means that I can't easily do a SQL join between vendor data (which is always on the production DB) and data I generate (which might be on the dev DB.)


I think it will work well for verifying DB changes and optimizing SQL on real data. This seems like a great tool to eliminate the risk of database-related downtime.


It is, but it introduces new risks. Like your test code emailing all of your production users. And more generally data security issues.


You are absolutely right. Depending on the environment, you should think about protecting sensitive data (for example, with data masking, obfuscation, and access control).

Database Lab Engine provides some possible solutions to protect sensitive data: https://postgres.ai/docs/database-lab/masking

By the way, the Database Lab Engine maintainer is here.



