Looks interesting. I am struggling mightily with a problem at work where we have a very large production DB and need a subset of it for local dev environments. You can't just randomly sample the data as that will break foreign keys and other relationships.
Anyone have a solution to this problem that works well for them?
At each employer I now start by writing a data extractor that takes one client's data and loads it into my local DB. It usually includes some 'skip' or filter options to speed the process up, and some transformations like anonymizing PII.
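As a rough sketch of the kind of query such an extractor runs (the users table, the client_id filter, and the anonymization rule are all hypothetical):

```sql
-- Hypothetical one-client extract with PII scrubbed on the way out; the table and
-- columns (users, client_id, email) are illustrative only.
-- Run via e.g.: psql "$PROD_URL" -f extract_users.sql > client_42_users.csv
COPY (
  SELECT id,
         client_id,
         'user_' || id || '@example.test' AS email,  -- swap real PII for a synthetic value
         created_at
  FROM users
  WHERE client_id = 42                               -- the one client being extracted
) TO STDOUT WITH (FORMAT csv, HEADER);
```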
With PG you can bypass the FK checks locally by temporarily switching session_replication_role from origin to replica. Though if you don't eventually copy the referenced records, you may still run into FK errors during normal operations.
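A minimal sketch of doing that for a local load (dev use only; changing the setting typically requires superuser):

```sql
-- Turn off FK (and other) trigger enforcement for this session only, so a partial
-- dump can be loaded out of order or with missing parent rows.
SET session_replication_role = replica;

-- ... load the partial dump here (\i partial_dump.sql, a batch of INSERTs, etc.) ...

-- Back to normal enforcement. Rows whose parents were never copied will still
-- trigger FK errors later, the next time those relationships are checked.
SET session_replication_role = DEFAULT;
```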
I'm using https://github.com/ankane/pgsync for that; it has a very easy YAML config and you can just define queries for what gets pulled from the DB.
I have a simple group like "base_tables" that just pulls all the fixtures into my local DB, plus entity-specific ones that pull an entity with a given ID and all related rows in other tables for debugging. As long as you can express it as a query, you can set everything up very easily.
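For reference, a .pgsync.yml for that kind of setup looks roughly like this; the group and table names are placeholders, and the exact syntax is in the pgsync README:

```yaml
# Source (ideally a prod replica) and destination (local dev).
from: postgres://readonly@prod-replica:5432/app
to: postgres://localhost:5432/app_dev

groups:
  # `pgsync base_tables` pulls the shared fixture/lookup tables as-is.
  base_tables:
    - countries
    - plans
    - feature_flags

  # `pgsync order:123` pulls one order plus its related rows; {1} is the CLI argument.
  order:
    orders: "where id = {1}"
    order_items: "where order_id = {1}"
    payments: "where order_id = {1}"
```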
I would distinguish between "local dev environment" and "debugging specific data-dependent customer issue".
My experience is you can usually write the application with no dependencies on customer data, and do local development entirely with synthetic data.
For debugging specific customer escalations or understanding data distributions for your synthetic data modeling, you can use a replica of production. No need to extract just a subset or have devs host it themselves. All access can be controlled and logged this way.
Better than a replica would be something like [1] but that isn't available everywhere.
I agree the problem of "dump a DB sample that respects FK constraints" comes up sometimes and is interesting but I'm not sure I'd use it for the problem you describe.
We solve this by managing referential integrity across relational databases when we subset. Effectively you have to build a graph where the nodes are tables and the edges are constraints, then follow those edges from node to node to maintain the constraints. It's certainly not trivial, and it gets very complicated when people use sensitive data (e.g. email) as a primary or foreign key: then you need a way to anonymize those emails and shelve those primary keys until you can re-insert them later so you don't break the constraints.
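For anyone wanting to build that constraint graph by hand, the edges can be read out of the standard information_schema views; a sketch that covers single-column FKs (composite keys need pg_constraint):

```sql
-- One row per FK edge: child table/column -> parent table/column.
SELECT tc.table_schema  AS child_schema,
       tc.table_name    AS child_table,
       kcu.column_name  AS child_column,
       ccu.table_name   AS parent_table,
       ccu.column_name  AS parent_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON kcu.constraint_name   = tc.constraint_name
 AND kcu.constraint_schema = tc.constraint_schema
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name   = tc.constraint_name
 AND ccu.constraint_schema = tc.constraint_schema
WHERE tc.constraint_type = 'FOREIGN KEY';
```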
(disclaimer - co-founder of Neosync: github.com/nucleuscloud/neosync, an open source tool to generate synthetic data and orchestrate and anonymize data across databases)
And I'm updating the documentation for v2.0.0 in real time at https://docs.dbsnapper.com. The new configuration is there, but I still have some work to do adding docs for the new features.
Works on macOS and Linux (Windows soon).
If you're interested in giving the alpha a try, I'd love feedback and can walk you through any issues you might encounter.
Built something like this a few years ago for work, but it has become a monstrosity that is difficult to enhance and maintain. We've been debating whether to build a more extensible v2 since I haven't seen anything that fits our needs on the market. I'm excited to check this out!
It's closed-source / proprietary at the moment, but I'm seriously considering opening up parts or all of it, at minimum as a Go library, or doing a free/pro split to support development. Any thoughts or suggestions?
Postgres can do a pg_dump with RLS (Row Level Security) enabled to get only one tenant's data. Very convenient if you already use RLS; if you don't, adding it wouldn't take too long as long as you have tenant_id on every table.
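A minimal sketch of that setup, assuming every table carries a tenant_id column; the policy name, the app.current_tenant setting, and the tenant_dumper role are all made up for the example:

```sql
-- Restrict each table to the tenant named in a session setting.
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_only ON orders
  USING (tenant_id = current_setting('app.current_tenant')::bigint);

-- Dump as a role that is actually subject to RLS (not a superuser, not the table
-- owner, no BYPASSRLS) and tell pg_dump to honor the policies instead of erroring:
--
--   PGOPTIONS="-c app.current_tenant=42" \
--     pg_dump --enable-row-security --role=tenant_dumper --data-only app > tenant_42.sql
```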