
Two things stick out here that I don't fully understand.

Declarative schemas, and a well-defined update process (that isn't human-written DDL), are essential at any sort of organizational scale.

Isn't this impossible because some schema changes require data migration? A data migration cannot be declaratively automated as far as I know.

Queue support

Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?




>Isn't this impossible because some schema changes require data migration?

In the most general case, sure - although there are workarounds for some specific cases (e.g., recording previously-known-as names in the declarative schema so renames can be planned automatically). But the vast majority of the time, you're adding and removing tables and columns in ways that are very well defined. This is one of those areas where the best solution is to legislate away the hard problems - a tool that handles 98% of schema changes automatically (the provably safe ones) and fails, requiring a human to approve the last 2%, is still dramatically better than having humans manually write and sequence change plans all the time.
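As a rough sketch of what that looks like (the table and column names here are made up), the repo holds only the desired end state, and the tool derives the DDL:

    -- schema/users.sql: the desired end state, checked into the repo
    CREATE TABLE users (
        id         bigint PRIMARY KEY,
        email      text NOT NULL,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- the tool diffs this against the live database and plans the DDL itself;
    -- if the live table lacks created_at, it emits something like:
    --   ALTER TABLE users ADD COLUMN created_at timestamptz NOT NULL DEFAULT now();
    -- changes it can't prove safe (type changes, ambiguous renames) get flagged
    -- for a human instead of being applied automatically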

Data migrations will require human effort, but you can sequence the changes and isolate the parts that need different kinds of work. If you're changing a string to an integer, for instance, you can make it clear in your change history that you (1) add the new column, (2) start dual writes, (3) backfill, (4) move readers to the new column, (5) stop the dual write, (6) drop the old column. You can do that with checked-in migration code, but think about what you end up with at the end - with imperative migrations, you have the clutter from all of that history; with declarative schema definitions, you just have the final state, just like how it works for code.
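To make that sequence concrete, a hypothetical string-to-integer change (the orders/quantity names are invented) might look like this at the SQL level, with the dual-write and backfill steps driven by application code:

    -- (1) expand: add the new column next to the old one
    ALTER TABLE orders ADD COLUMN quantity_int integer;

    -- (2) application starts dual-writing quantity (text) and quantity_int

    -- (3) backfill existing rows (in batches, in practice)
    UPDATE orders SET quantity_int = quantity::integer WHERE quantity_int IS NULL;

    -- (4) move readers to quantity_int, then (5) stop writing the old column

    -- (6) contract: drop the old column once nothing references it
    ALTER TABLE orders DROP COLUMN quantity;

In the declarative world, the schema file simply ends up with quantity_int and without quantity; the intermediate steps live in the change history rather than as permanent migration scripts.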

Declarative schemas also usually come with nice ancillary properties - for instance, they can give you automatic rollbacks, because you can pack up the whole schema definition as a static artifact.

>Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?

Atomicity. It's really, really powerful to be able to write a row and send or receive a message as a single exactly-once transaction. Imagine a queue of events. The writer writes the event data, and sends a message as a notification that it exists; the consumer will keep some sort of aggregate (count, filtered total, etc.). With a separate queueing system, you have to explicitly manage the timing and error cases bridging different atomicity domains, usually reimplementing some sort of idempotency log on your own. If it's all in one place, you just write a transaction that reads the old value, consumes the queue message, and writes a new value.
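Here's a minimal Postgres-flavored sketch of that consumer (table and column names are invented); the claim of the message and the update of the aggregate commit or roll back as one unit:

    BEGIN;

    -- claim one message; SKIP LOCKED keeps concurrent consumers from blocking
    WITH msg AS (
        DELETE FROM event_queue
        WHERE id = (
            SELECT id FROM event_queue
            ORDER BY id
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        RETURNING event_type
    )
    UPDATE event_counts c
    SET total = c.total + 1
    FROM msg
    WHERE c.event_type = msg.event_type;

    COMMIT;  -- consume + aggregate update land together, exactly once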


I'm doing a project that's perhaps relevant to this. The talk of declarative schemas (or what you imply they offer) is very interesting and I'd like to know more, but I can't find anything relevant (just Magento and SQLAlchemy). Indeed, searching for <<<"Declarative schemas" "sql">>> on Google gets this very thread on the first results page.

Any links to clear, actionable and reasonably comprehensive examples of these would be most helpful. Obviously abstract statements of the required semantics are also needed, but I also need to see what actual code would look like.

TIA


Magento (the new XML-based version, not the old create/update/revert-script version) gives a lot of these properties. Making it part of the database instead of a third-party tool would be better, though - it lets you cover more features, and with deeper integration you can get transactional updates and history logs.


> Why not use a dedicated and feature-rich queue such as RabbitMQ or, if you want to get really fancy, Kafka?

If your queue is in the same database as your data, you can do "ack message/mark job as complete" and "commit result of job to database" in the same transaction. That simplifies a lot of the harder parts of things like materialization pipelines or BI systems.
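As a hypothetical example (table names invented), the worker's final step becomes a single transaction:

    BEGIN;

    -- record the job's output and mark it complete atomically
    INSERT INTO report_results (job_id, payload)
    VALUES (42, '{"row_count": 1093}');

    UPDATE jobs SET status = 'done', finished_at = now() WHERE id = 42;

    COMMIT;  -- no window where the result exists but the job still looks pending

With a separate broker you'd have to handle the crash-between-the-two-steps case yourself, typically with an idempotency key or a reconciliation pass.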



