Looks interesting. I am struggling mightily with a problem at work where we have a very large production DB and need a subset of it for local dev environments. You can't just randomly sample the data as that will break foreign keys and other relationships.
Anyone have a solution to this problem that works well for them?
At each employer I now start by writing a data extractor that takes one client's data and loads it into my local DB. It usually includes some 'skip' or filter options to speed the process up, and some transformations like anonymizing PII.
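As a rough sketch of the kind of query such an extractor runs (the users table, the client_id filter, and the anonymization rule are all hypothetical):

```sql
-- Hypothetical one-client extract with PII scrubbed on the way out; the table and
-- columns (users, client_id, email) are illustrative only.
-- Run via e.g.: psql "$PROD_URL" -f extract_users.sql > client_42_users.csv
COPY (
  SELECT id,
         client_id,
         'user_' || id || '@example.test' AS email,  -- swap real PII for a synthetic value
         created_at
  FROM users
  WHERE client_id = 42                               -- the one client being extracted
) TO STDOUT WITH (FORMAT csv, HEADER);
```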
With PG you can bypass the FK checks locally by temporarily switching session_replication_role from origin to replica. Though if you don't eventually copy the referenced records, you may still run into FK errors during normal operations.
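A minimal sketch of doing that for a local load (dev use only; changing the setting typically requires superuser):

```sql
-- Turn off FK (and other) trigger enforcement for this session only, so a partial
-- dump can be loaded out of order or with missing parent rows.
SET session_replication_role = replica;

-- ... load the partial dump here (\i partial_dump.sql, a batch of INSERTs, etc.) ...

-- Back to normal enforcement. Rows whose parents were never copied will still
-- trigger FK errors later, the next time those relationships are checked.
SET session_replication_role = DEFAULT;
```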
I'm using https://github.com/ankane/pgsync for that; it has a very easy YAML config and you can just define queries for what gets pulled from the DB.
I have a simple group like "base_tables" that just pulls all the fixtures into my local DB, plus entity-specific ones that pull an entity with a given ID and all related rows in other tables for debugging. As long as you can express it as a query, you can set everything up very easily.
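For reference, a .pgsync.yml for that kind of setup looks roughly like this; the group and table names are placeholders, and the exact syntax is in the pgsync README:

```yaml
# Source (ideally a prod replica) and destination (local dev).
from: postgres://readonly@prod-replica:5432/app
to: postgres://localhost:5432/app_dev

groups:
  # `pgsync base_tables` pulls the shared fixture/lookup tables as-is.
  base_tables:
    - countries
    - plans
    - feature_flags

  # `pgsync order:123` pulls one order plus its related rows; {1} is the CLI argument.
  order:
    orders: "where id = {1}"
    order_items: "where order_id = {1}"
    payments: "where order_id = {1}"
```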
I would distinguish between "local dev environment" and "debugging specific data-dependent customer issue".
My experience is you can usually write the application with no dependencies on customer data, and do local development entirely with synthetic data.
For debugging specific customer escalations or understanding data distributions for your synthetic data modeling, you can use a replica of production. No need to extract just a subset or have devs host it themselves. All access can be controlled and logged this way.
Better than a replica would be something like [1] but that isn't available everywhere.
I agree the problem of "dump a DB sample that respects FK constraints" comes up sometimes and is interesting but I'm not sure I'd use it for the problem you describe.
We solve this by managing referential integrity across relational databases when we subset. Effectively you have to build a graph where the nodes are tables and the edges are constraints, then follow those edges from node to node to maintain the constraints. It's certainly not trivial, and it gets very complicated when people use sensitive data (e.g. email) as a primary or foreign key: then you need a way to anonymize those emails and shelve those primary keys until you can re-insert them later so you don't break the constraints.
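For anyone wanting to build that constraint graph by hand, the edges can be read out of the standard information_schema views; a sketch that covers single-column FKs (composite keys need pg_constraint):

```sql
-- One row per FK edge: child table/column -> parent table/column.
SELECT tc.table_schema  AS child_schema,
       tc.table_name    AS child_table,
       kcu.column_name  AS child_column,
       ccu.table_name   AS parent_table,
       ccu.column_name  AS parent_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name   = tc.constraint_name
 AND kcu.constraint_schema = tc.constraint_schema
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name   = tc.constraint_name
 AND ccu.constraint_schema = tc.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY';
```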
(disclaimer - co-founder of Neosync: github.com/nucleuscloud/neosync, an open source tool to generate synthetic data and orchestrate and anonymize data across databases)
And I'm updating the documentation for v2.0.0 in real time at https://docs.dbsnapper.com. The new configuration is there, but I still have some work to do adding docs for the new features.
Works on macOS and Linux (Windows soon).
If you're interested in giving the alpha a try, I'd love feedback and can walk you through any issues you might encounter.
Built something like this a few years ago for work, but it has become a monstrosity that is difficult to enhance and maintain. We've been debating whether to build a more extensible v2 since I haven't seen anything that fits our needs on the market. I'm excited to check this out!
It's closed-source / proprietary at the moment, but I'm seriously considering opening up parts or all of it, at minimum as a Go library, or doing a free/pro split to support development. Any thoughts or suggestions?
Postgres can do a pg_dump with RLS (Row Level Security) enabled to get only one tenant's data. Very convenient if you already use RLS; if you don't, adding it wouldn't take too long as long as you have tenant_id on every table.
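A minimal sketch of that setup, assuming every table carries a tenant_id column; the policy name, the app.current_tenant setting, and the tenant_dumper role are all made up for the example:

```sql
-- Restrict each table to the tenant named in a session setting.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_only ON orders
  USING (tenant_id = current_setting('app.current_tenant')::bigint);

-- Dump as a role that is actually subject to RLS (not a superuser, not the table
-- owner, no BYPASSRLS) and tell pg_dump to honor the policies instead of erroring:
--
--   PGOPTIONS="-c app.current_tenant=42" \
--     pg_dump --enable-row-security --role=tenant_dumper --data-only app > tenant_42.sql
```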