
Database Lab – Full-size staging databases as a service - stansler
https://gitlab.com/postgres-ai/database-lab
======
sandGorgon
My advice - go all the way into RDS territory.

This is a killer feature. But it is hard to adopt it on its own while you're on
RDS, which is where most people end up.

If you can build a management layer that gives better-than-RDS features
...while allowing me to bring my own cloud, you are on your way to a win.

~~~
jlmorton
I don't think that's necessary, and it might even detract a bit.

At my last company, we built a very similar tool in-house, using LVM w/ thin
snapshots, instead of ZFS. The database snapshot host would receive streaming
logical replication from our UAT and Staging environments.

Users (developers, QA, etc) could then take an instant snapshot of a multi-TB
database, provide a Git hash/branch, and we'd instantly clone the DB, spin up
the application and migrate it to the latest version.

That worked great, and the same thing could work with RDS. You could use
something like AWS Database Migration Service to replicate from your RDS
installation to your Database Lab host.

This way, you get to keep RDS for your actual test and staging environments,
but you get streaming replication into your Database Manager instance for
Dev/QA.
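
Roughly, the thin-snapshot half of that setup looks something like the sketch
below (volume group, volume names, mount points and the port are all made up
for illustration; the replication and app-spin-up pieces are separate):

    # one-time setup: a thin pool, plus a thin volume holding the replica's PGDATA
    lvcreate --type thin-pool -L 2T -n pgpool vg_data
    lvcreate --thin -V 4T -n pg_replica vg_data/pgpool
    
    # per-clone: thin snapshots are instant, copy-on-write metadata operations
    lvcreate -s -n clone_feature_x vg_data/pg_replica
    lvchange -ay -K vg_data/clone_feature_x     # activate (thin snapshots skip activation by default)
    mount /dev/vg_data/clone_feature_x /var/lib/clones/feature_x
    
    # start a throwaway Postgres against the clone on its own port
    pg_ctl -D /var/lib/clones/feature_x -o "-p 6001" start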

~~~
sandGorgon
> _You could use something like AWS Database Migration Service to replicate
> from your RDS installation to your Database Lab host._

My guess is that your market will then be restricted to those who are already
running their own databases, because RDS restore-from-snapshot is good enough
that people won't go through the DMS hackery (which is exceedingly nasty), and
DMS would probably negate some of the "instant" features that you have.

Just my $0.02 though.

------
gavinray
Love the idea, but is this essentially then just a faster version of
_pg_dump_?

    
    
      pg_dump -C -h remotehost -U remoteuser dbname | psql -h localhost -U localuser dbname

~~~
samokhvalov
Well, not really. Two key differences:

1) The dump/restore approach is a "thick cloning" method. It takes a really
significant amount of time: roughly, thick-cloning speed on modern hardware
and networks is ~1 TiB/hour just for transferring over the network and writing
to disk. Additionally, with dump/restore the timing also depends on the number
of indexes that need to be built, materialized views to be refreshed, etc.

2) Dump/restore is "logical copying", meaning the physical structure is not
preserved. Logically, you'll get the same data. Physically – a completely
different database: bloat is gone, indexes are "fresh and good", and so on.

With Database Lab, unless you're on RDS, initial copying is preferably done
using some "physical" method (pg_basebackup, or just rsync, or – the best
option – restoration from WAL-G/pgBackRest/Barman/etc. backups). The physical
structure is preserved: you get an identical copy of the original database.
Then you request "thin clones", which really takes only a couple of seconds
even for a multi-terabyte database, and you can do various kinds of
performance troubleshooting and experiments before you deploy your changes to
production. Request 5 thin clones and check your idea 5 times, each time
adjusting it a little bit. This is what's usually really needed in development.
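
To make the "thin clone" part concrete – not the project's actual commands,
just a minimal sketch of what thin cloning on ZFS boils down to, with made-up
dataset names and port (Database Lab automates this and adds an API/CLI on
top):

    # PGDATA of the physically-restored copy lives on a ZFS dataset
    zfs list dblab/pgdata
    
    # snapshot + writable clone are metadata operations: seconds, regardless of DB size
    zfs snapshot dblab/pgdata@before_migration
    zfs clone dblab/pgdata@before_migration dblab/clone1
    
    # run a throwaway Postgres against the clone on its own port
    pg_ctl -D /dblab/clone1 -o "-p 6001" start
    
    # experiment, then throw everything away
    pg_ctl -D /dblab/clone1 stop
    zfs destroy dblab/clone1
    zfs destroy dblab/pgdata@before_migration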

~~~
gavinray
Ahh okay, that makes sense. I wasn't aware that using different methods of
copying + backing up data had tangible effects on it later. Super interesting,
and vital if you're looking to gather accurate data from experiments.

By the way, I recognized you guys from the other tool you publish, Nancy.
While I don't have a direct use case for it myself, I thought the idea was
super valuable as well.

I am a huge Postgres guy (did you know you can build an entire web app in JUST
Postgres?[0]) so I had bookmarked a couple of your projects.

Thank you for the response, you guys do some fantastic work.

[0]:
[https://github.com/aquametalabs/aquameta](https://github.com/aquametalabs/aquameta)

~~~
samokhvalov
Thank you! I'll check out aquameta, looks interesting.

------
penagwin
Question: Can this clone a database, but also apply certain operations to it?

I work at a small company, and currently we clone our production database to
our dev machines for testing. However, certain information in the database is
sensitive and we don't want to include it on our dev machines. (This specific
sensitive data is also stored in an encrypted format and the key is not
included, but we'd still prefer that it not be included at all.)

Basically I'd like to be able to clone the database and run some SQL to
replace the data in some tables with mock data. But I can't think of an easy
way to do this without cloning the database, starting a temporary instance,
running the SQL, cloning it again, and distributing that final build to the
devs.

~~~
jedieaston
It doesn't appear to be able to at this point, no. Just clones the existing
DB.

There are masking tools out there that can mask data from production in flight
before dropping it into a dev environment for testing (so you can get the same
data with the sensitive parts changed), but this doesn't appear to be one of
them. I know people who work in the field; it's a tough nut to crack to keep
the database good enough to test with while making sure you don't have devs
running around with customer data on their laptops/servers.

~~~
penagwin
> There are masking tools out there that can mask data from production

Thank you so much! "Masking" was the phrase I needed to find some potential
solutions! I wasn't able to find much before, but that points me in the right
direction. Thanks!

~~~
samokhvalov
There are various approaches here, and there are some FOSS tools that you can
use.

Some links:

- [https://blog.taadeem.net/english/2019/01/03/8_anonymization_...](https://blog.taadeem.net/english/2019/01/03/8_anonymization_strategies_with_postgres)
– description of methods, and a tool for Postgres, postgresql_anonymizer

- [https://habr.com/en/company/yandex/blog/485096/](https://habr.com/en/company/yandex/blog/485096/)
– not for Postgres (it's for ClickHouse, an open-source analytics DBMS), but it
covers the topic very well.
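
And if you just need something quick on a fresh clone and don't want a
dedicated tool yet, even a hand-rolled masking pass can work; a rough sketch
(host, table and column names here are made up for illustration):

    psql -h clone-host -U postgres dbname <<'SQL'
    BEGIN;
    -- replace direct identifiers with deterministic fakes
    UPDATE customers
       SET email     = 'user' || id || '@example.com',
           full_name = 'Customer ' || id,
           phone     = NULL;
    -- wipe free-text fields that may contain PII
    UPDATE support_tickets SET body = '[redacted]';
    COMMIT;
    SQL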

------
satyrnein
Would it be practical to use this in an automated regression testing context
to give each test its own copy of the database that it could mutate, check
results, then throw away?

------
u02sgb
This looks like open source Delphix using Postgres. It even seems to be using
ZFS under the hood.

Any comments from people using it? I designed a CI system using Delphix at an
old client and it was awesome!

~~~
takeda
I've seen them contributing often to ZFS on FreeBSD; what is their product? Is
it equivalent to NetApp?

~~~
u02sgb
I haven't used NetApp, but from a quick Google, Delphix seems to be more about
intelligent database management.

Delphix is more about using the power of ZFS and copy-on-write to give you
instant database copies with a minimal storage footprint. We used it to give
every developer their own database that they could do anything to without
worrying about locks, etc. Refreshing to a more recent copy was a very quick
operation and only required a developer to run one command. It had a nice web
interface and a REST interface for automation.

I'd recommend it, with the caveat that I've no idea how much it costs, as it
was a dev tool a client was encouraging us to use. It was really nice to work
with, though.

~~~
takeda
Ah, so this indeed looks like an open-source version of Delphix.

As for NetApp, they probably do other things too, but their most popular
product is their NAS, which internally uses a copy-on-write filesystem similar
to ZFS.

------
diminish
This helps a lot with developing and testing for Postgres.

Maybe that's one of the technologies "John Carmack" meant when he said
sophisticated tech is indistinguishable from magic.

~~~
sergioro
John Carmack was not the first to say those words; see [Clarke's third
law]([https://en.wikipedia.org/wiki/Clarke%27s_three_laws](https://en.wikipedia.org/wiki/Clarke%27s_three_laws))

------
francislavoie
Does such a thing exist for MySQL? If so I'd appreciate a link. We have weekly
snapshots that we use for development, but it would be amazing to have a way
to reset/throw away any changes quickly, especially to repeatedly test a
specific use-case against some prod data.

------
veritas3241
Snowflake does something similar: since it's a metadata store, cloning is
mainly copying the metadata. Clone time does scale with data volume, though,
so it doesn't seem like it'd be as quick.

Really great idea here!

------
arithma
A bit tangential, but this reminded me of an idea I had when I was learning
about containerization and overlayfs. Would Postgres work with its data stored
in an overlay and allow a similar workflow?
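
For what it's worth, a rough sketch of what I had in mind (paths are
hypothetical, and whether Postgres behaves well long-term with its data
directory on overlayfs is a separate question):

    # read-only base copy of PGDATA plus a writable layer per experiment
    mkdir -p /overlay/upper /overlay/work /overlay/merged
    mount -t overlay overlay \
          -o lowerdir=/base/pgdata,upperdir=/overlay/upper,workdir=/overlay/work \
          /overlay/merged
    
    # start a throwaway Postgres against the merged view
    pg_ctl -D /overlay/merged -o "-p 6001" start
    
    # discard the experiment: stop Postgres, unmount, delete the upper layer
    pg_ctl -D /overlay/merged stop
    umount /overlay/merged
    rm -rf /overlay/upper /overlay/work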

------
geoffbp
What are people using these days for data sanitization? I.e. to remove PII
data from dev snapshots

~~~
purrpit
Short answer: Different databases.

We have separate PII, analytics, catalog, transaction, etc. databases. We only
let people clone the non-sensitive databases; the others are cloned only at
the schema level, with the data left out.

This also allows us to move and scale independently across different
applications. But that's just a symptom of this design.

There are some complications with this approach, but it's a trade-off, and we
try to come out on top using a foreign-data-wrapper approach and a database
application router for reads and writes. We have open-sourced a Django DB
router which routes queries on the basis of tags/app name/table name, etc.

------
johnmarcus
Is this just using ZFS to copy the data directory? Why do I need a tool to do
that?

