Hacker News new | past | comments | ask | show | jobs | submit login
Database Lab – Full-size staging databases as a service (gitlab.com/postgres-ai)
137 points by stansler on Feb 7, 2020 | hide | past | favorite | 31 comments

My advice - go all the way into RDS territory.

This is a killer feature. But it is hard to adopt it by itself, while being on RDS, which is what most people end up doing.

If you can build a management layer that gives better-than-RDS features ...while allowing me to BYOB my own cloud, you are on your way to win.

I don't think that's necessary, and it might even detract a bit.

At my last company, we built a very similar tool in-house, using LVM w/ thin snapshots, instead of ZFS. The database snapshot host would receive streaming logical replication from our UAT and Staging environments.

Users (developers, QA, etc) could then take an instant snapshot of a multi-TB database, provide a Git hash/branch, and we'd instantly clone the DB, spin up the application and migrate it to the latest version.

That worked great, and the same thing could work with RDS. You could use something like AWS Database Migration Service to replicate from your RDS installation to your Database Lab host.

This way, you get to keep RDS for your actual test and staging environments, but you get streaming replication into your Database Manager instance for Dev/QA.

> You could use something like AWS Database Migration Service to replicate from your RDS installation to your Database Lab host.

my guess is that then your market will be restricted to those who are already running their own databases. Because RDS restore-from-snapshot is good enough to not go through the DMS hackery (which is exceedingly nasty) and probably will negate some of the "instant" features that you have.

Just my $0.02 though.

Love the idea, but is this essentially then just a faster version of pg_dump?

  pg_dump -C -h remotehost -U remoteuser dbname | psql -h localhost -U localuser dbname

Well, not really. Two key differences:

1) dump/restore approach is one of "thick cloning" methods. It will take a really significant amount of time. Roughly, thick cloning speed for modern hardware and networks is somewhat ~1 TiB / hour just for transferring over network and writing to disk. Additionally, in the case of dump/restore, the timing will depend on the number of indexes that need to be built, materialized views to be refreshed, etc.

2) Dump/restore is "logical copying", it means that physical structure is not preserved. Logically, you'll get the same data. Physically – a completely different database. Bloat is gone, indexes are "fresh and good", and so on.

With Database Lab, unless you're on RDS, initial copying is preferably done using some "physical" methods (pg_basebackup, or just rsync, or – the best option – restoration from WAL-G/pgBackRest/Barman/etc backups). The physical structure is preserved. You have an identical copy of the original database. Then you request "thin clones", it really takes a couple of seconds for a multi-terabyte database. And can do various kinds of performance troubleshooting and experiments before you deploy your changes to production. Request 5 thin clones and check your idea 5 times, each time adjusting it a little bit. This is what usually really needed in development.

Ahh okay, that makes sense. I wasn't aware that using different methods of copying + backing up data had tangible effects on it later. Super interesting, and vital if you're looking to gather accurate data from experiments.

By the way, I recognized you guys from the other tool you publish, Nancy. While I don't have a direct usecase for it myself, I thought the idea was super valuable as well.

I am a huge Postgres guy (did you know you can build an entire web app in JUST Postgres?[0]) so I had bookmarked a couple of your projects.

Thank you for the response, you guys do some fantastic work.

[0]: https://github.com/aquametalabs/aquameta

Thank you! I'll check out aquameta, looks interesting.

Can you explain what a thin clone is?

Sure. Thin cloning is a way to obtain a clone of data that looks fully independent (you can modify, add, delete data), but physically multiple clones share physically the same data -- except those blocks that were changed. This is implemented using copy-on-write -- see, for example, https://en.wikipedia.org/wiki/Copy-on-write#In_computer_stor...

Question: Can this clone a database, but also apply certain operations to it?

I work at a small company, and currently we clone our production database to our dev machines for testing. However certain information in the database is sensitive and we don't want to include it on our dev machines. (This specific sensitive data is also stored in an encrypted format and the key is not included, but we'd still prefer it not included).

Basically I'd like to be able to clone the database, and run some SQL to replace the data in some tables with some mock data. But I can't think of an easy way to do this without cloning the database, starting a temporary one, run the sql, then clone it again - and distribute that final build to the devs.

Hi! Postgres.ai founder here.

This is requested quite often – for example, if we copy the database from production, sometimes it's needed to remove all personal data not to break regulations.

It is possible in Database Lab, but it's not a very user-friendly feature yet. Briefly, the process is as follows.

The "sync" Postgres instance is configured to be a production replica (better using WAL shipping from the WAL archive). Then, periodically, a new snapshot is created, currently it's done using this Bash script: https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc.... (We are going to make it a part of the database-lab server in the upcoming releases).

Here https://gitlab.com/postgres-ai/database-lab/-/blob/master/sc... you can place any data transformations, so the final snapshot that will be used for thin cloning has adjusted data sets. For example, all personal data is removed or obfuscated.

Of course, if you do this, you need to keep in mind that physically, you'll have a different database. It may affect some kinds of testing (for example, troubleshooting bloat issues or some cases of index performance degradation). There are various choices to be made here. If interested, we'll be happy to help, please join our community Slack which is mentioned in the docs and README.

It doesn't appear to be able to at this point, no. Just clones the existing DB.

There are masking tools out there that can mask data from production in-flight before dropping it into a dev environment for testing (so you can get the same data with the sensitive parts changed), but this doesn't appear to be one of them. I know people who work in the field, it's a tough nut to crack to keep the database good enough to test with while making sure you don't have devs running around with customer data on their laptop/server.

> There are masking tools out there that can mask data from production

Thank you so much! "Masking" was the phrase I needed to find some potential solutions! I wasn't able to find much before, that points me in the correct direction. Thanks!

There are various approaches here, and there are some FOSS tools that you can use.

Some links:

- https://blog.taadeem.net/english/2019/01/03/8_anonymization_... – description of methods, and a tool for Postgres, postgresql_anonymizer

- https://habr.com/en/company/yandex/blog/485096/ – not for Postgres, it's for ClickHouse (open-source DBMS for analytics) but covers the topic very well.

Other common terms for this are sanitization or redaction

If you're willing to pay for it Delphix would do what you want. I've not used those features but their tech guys were describing that very scenario in a presentation I saw.

Our company is pretty small (We only have 5 developers) and judging by the lack of pricing info (you can only request a demo) - I'm guessing it's too expensive for us haha.

But thanks for the suggestion!

This looks like open source Delphix using Postgres. It even seems to be using ZFS under the hood.

Any comments from people using it? I designed a CI system using Delphix at an old client and it was awesome!

Also curious what the performance is like?

We switched our staging databases to be on ZFS and saw inexplicable, very noticeable decreases in performance. Those trade-offs are fine for developer machines, but ultimately too much for our staging server where we needed to run UAT.

But, could of just been quirks of how we had things set up.

I saw them contributing often to ZFS on FreeBSD, what is their product? Is it equivalent to NetApp?

Not seen NetApp but from a quick Google it seems to be more intelligent database management.

Delphix is more using the power of ZFS and Copy On Write to give you instant database copies with minimal storage footprint. We used it to give every developer their own Database that they could do anything to and not worry about locks etc. Refreshing to a more recent copy was a very quick operation and required a developer running one command. Had a nice web interface and a REST interface for automation.

I'd recommend it but with the caveat that I've no idea how much it costs as it was a Dev thing a client was encouraging use of. Was really nice to work with though.

Ah, so this indeed looks like Open Source version of Delphix.

As for NetApp, probably the do other things but most popular is their NAS which internally uses Copy On Write filesystem similar to ZFS.

Would it be practical to use this in an automated regression testing context to give each test its own copy of the database that it could mutate, check results, then throw away?

This helps a lot in developing and testing for postgres.

Maybe that's one of the tech which "John Carmack" meant when he said sophisticated tech is indistinguishable from magic.

John Carmack was not the first to say those words, see [Clarke's law](https://en.wikipedia.org/wiki/Clarke%27s_three_laws)

Does such a thing exist for MySQL? If so I'd appreciate a link. We have weekly snapshots that we use for development, but it would be amazing to have a way to reset/throw away any changes quickly, especially to repeatedly test a specific use-case against some prod data.

Snowflake does something similar since it's a metadata store and cloning is mainly copying the metadata. It does scale with data volumes though so it doesn't seem like it's be as quick.

Really great idea here!

A bit tangential, but this reminded of an idea I had when I was learning about containerization and overlayfs. Would postgres work with data stored in an overlay and allow a similar workflow?

What are people using these days for data sanitization? I.e. to remove PII data from dev snapshots

Short answer: Different databases.

We have PII, analytics, catalog, transaction etc databases. We only let people clone non-sensitive databases. Others are cloned only at schema level and data is left out.

This also allows us to move and scale independently across different applications. But that's just a symptom of this design.

There are some complications with this approach but it's a trade off and we try to come on top using foreign-data-wrapper approach and using a database application router for read and writes. We have open sourced a Django db router which routes queries on the basis of tags/app name/table name etc

Is this just using zfs to copy the data directory? Why do o need a tool to do that?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact