
Ask HN: What tools to grab slices of production? - was_boring
Where I work, it has become infeasible to grab the entire production database for local testing, debugging, or dev. Not to mention the changing regulatory climate, which discourages this.

What tools do you use to grab just slices of it? We use Postgres as our db.
======
jrowley
I have this problem too and don’t have a good solution. One approach I’ve been
considering is to automatically deploy yesterday’s RDS snapshot to a dedicated
test instance that rolls over and is replaced every 24 hours. It would be
dedicated to testing tricky migrations and support cases.

------
devm0de
We just take a weekly snapshot, run a script that drops most customer data,
sanitizes the little that is left, adds our internal users and data, and
applies the grants each person needs internally. This image is built and
everyone can pull down the container. New grants preferably wait until the
next weekly snapshot happens.
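In Python, the "drop most, sanitize the rest" step of a script like this might look something like the sketch below. The table and column names (id, email, full_name, phone, internal) are made up for illustration, not the commenter's actual schema:

```python
import hashlib

def sanitize_user(row: dict) -> dict:
    """Return a copy of a user row with PII replaced by stable fakes."""
    uid = row["id"]
    # Deterministic fakes keyed on the id, so joins and foreign keys still line up.
    digest = hashlib.sha256(str(uid).encode()).hexdigest()[:8]
    return {
        **row,
        "email": f"user{uid}@example.test",
        "full_name": f"User {digest}",
        "phone": None,  # fields with no testing value get dropped outright
    }

def keep_row(row: dict, keep_every: int = 100) -> bool:
    """Keep roughly 1-in-N customers; always keep internal accounts."""
    return row.get("internal", False) or row["id"] % keep_every == 0

if __name__ == "__main__":
    rows = [
        {"id": 100, "email": "a@real.com", "full_name": "Alice", "phone": "555-1234", "internal": False},
        {"id": 101, "email": "b@real.com", "full_name": "Bob", "phone": "555-5678", "internal": True},
        {"id": 102, "email": "c@real.com", "full_name": "Carol", "phone": "555-9999", "internal": False},
    ]
    slim = [sanitize_user(r) for r in rows if keep_row(r)]
    for r in slim:
        print(r["id"], r["email"])
```

The deterministic hashing matters more than it looks: if the same id always maps to the same fake name, the sanitized slice stays internally consistent across tables.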

------
bradwood
If you have the cash in your firm you could look at Delphix. It only works on
RDBMSs, but it allows databases to be "forked", branched, committed to, rolled
back, and data-masked.

------
davismwfl
It depends on whether you have a shared dev environment or a truly local
laptop environment, and whether you have any restrictions on using prod data.
I have had to deal with this a lot over my career, as most of the time the
production databases have been quite large, and in many cases privacy or
security rules prevented us from just copying the data straight over. If you
have no restrictions, the easiest thing is to snapshot prod on a regular
interval and restore a snapshot into staging, local, etc.

Outside of that, the other methods I have seen and used have varied over the
years, but I'll share the most common pattern, which seemed to work the best,
though it does require some work to set up and maintain.

For SQL specifically, we used to write a number of functions (stored
procedures or the equivalent) where, for each record type, a function would
anonymize the data and return the new record. Then we combined these
procedures into modules, so we could run a single function and specify the
number of records to pull and recreate in our destination. We would also
write special-case procedures: if certain records had broken production in
the past, we always recreated those records in the dev/test data to make sure
new code doesn't fail on them. This worked really well. Yes, it takes time to
set up initially and some time to maintain, but it means you can reproduce a
meaningful dataset into any environment for testing or development quickly
and automatically.
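A minimal sketch of that pattern, in Python rather than stored procedures — one anonymizer per record type, composed into a "module" you run with a target record count, plus a hard-coded list of known production-breakers. Every name here is illustrative, not from the commenter's system:

```python
import random

def anonymize_customer(row: dict) -> dict:
    """Per-record-type anonymizer: strip PII, keep the shape."""
    return {**row, "name": f"Customer {row['id']}"}

def anonymize_order(row: dict) -> dict:
    return {**row, "card_last4": "0000", "notes": ""}

# Records that once broke prod; always regenerated so new code
# is exercised against them on every rebuild.
KNOWN_BREAKERS = [
    {"id": -1, "name": "", "card_last4": None},  # hypothetical edge case
]

def build_dataset(source_rows, anonymize, n, breakers=()):
    """Pull n random rows, anonymize each, and append the edge cases."""
    sample = random.sample(source_rows, min(n, len(source_rows)))
    return [anonymize(r) for r in sample] + list(breakers)
```

In the real version described above these would be stored procedures reading from prod and writing to the destination; the structure (per-type function, composable runner, pinned edge cases) is the part that carries over.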

The best one I was involved with had a SQL Server backend, so we wrote a
C#.NET component to take the place of most of the stored procedures, which
made it faster and let us use RAM more efficiently for some of the
translations and lookups. The same could be done for Postgres, MySQL, Mongo,
or just about any database, in a language of your choosing.

For what it is worth, I have tried most of the database tools out there that
claim to solve this problem and never found them to be as good as spending a
little time building a custom toolset.

Our .NET version (which could easily be done in Java or another language for
Postgres) could itself be run from a stored procedure, so even repopulating
staging environments could be scheduled and 100% automated (which also gave
the DBAs comfort). We made it sophisticated enough that it could enumerate
tables through the SQL APIs and pull data from them generically. This worked
great for all the lookup tables and anything else that didn't need extensive
translation. As we got better, more of the code became generic, and we moved
the translations and details into metadata about the tables, stored in SQL.

In one case this ran against a database spread across 4 physical DB servers
and ~700 tables, with huge amounts of data. We could pull a reasonably sized
local dev set and be working in roughly 20 minutes, and that included all the
weird edge cases that had to be created. This was the closest we came to
making something fully generic. For our shared staging environment we could
use the same code and just specify a larger dataset, which typically took
many hours to run, but we'd just run it overnight. We also built in ways to
pull a specific client's dataset if we were struggling with one client having
unique issues.
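The metadata-driven part is the key trick: per-table transform rules live in data (in the real system, a SQL table), and a generic copier applies them, passing untouched lookup tables straight through. A toy sketch, with invented table and column names:

```python
# table -> {column: transform(old_value, full_row)}.
# Tables with no entry here are copied verbatim -- that's what makes
# lookup tables "free" once the generic copier exists.
TABLE_RULES = {
    "customers": {
        "email": lambda v, row: f"user{row['id']}@example.test",
    },
}

def copy_table(name: str, rows: list) -> list:
    """Apply this table's metadata rules to every row; default is pass-through."""
    rules = TABLE_RULES.get(name, {})
    out = []
    for row in rows:
        new = dict(row)
        for col, fn in rules.items():
            new[col] = fn(row.get(col), row)
        out.append(new)
    return out
```

In the real version you would enumerate table names from the catalog (e.g. Postgres's information_schema.tables) instead of hard-coding them, and read TABLE_RULES from a config table rather than source code.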

FWIW, at the startup I'm at now we primarily use Postgres, and I have been
rebuilding a system similar to the .NET one, because we have strict
requirements around the data — it's medical data. We can't just give people
access to production or let them copy data to staging or local machines; that
would be a compliance and legal nightmare. I recently tried a couple of tools
that again claimed to solve this, but the time and money you'd have to invest
in them, plus the features you'd still have to supplement, just don't make
sense to me. I'd much rather buy than build in most cases, but this is one
where that is hard to do.

