
Truth First, or Why You Should Mostly Implement Database-First Designs - MarkusWinand
https://blog.jooq.org/2018/06/06/truth-first-or-why-you-should-mostly-implement-database-first-designs/
======
ben509
This is a really important piece for developers to read.

We get very used to the ability to blow away everything and start from
scratch, and that's very often the right answer for build tools, but the
database is fundamentally different from that, and you _need_ to internalize
this.

A database, in most cases, is the business. It's essentially a collection of
all the contracts made with customers, and your mode of thinking "I can blow
this away and start over" that you've learned from all sorts of build tools is
completely contrary to that.

An infamous example of this is in Active Record migrations[1], and possibly
some other migration tools.

> Note that we define the change that we want to happen moving forward in
> time. Before this migration is run, there will be no table. After, the table
> will exist. Active Record knows how to reverse this migration as well: if we
> roll this migration back, it will remove the table.

These are smart people, and indeed there are some _rare_ cases where this is
the right thing to do. But only our strongly ingrained belief that data is
ephemeral would cause us to think this should be the normal behavior, that
we'd want to blow away customer data to force the database to track what
version X of the code thinks it should.

[1] http://edgeguides.rubyonrails.org/active_record_migrations.html

~~~
DanHulton
There are very, very common cases where this is the right thing to do -
development. You'll want to be able to check out a branch, apply the
migrations, test the code, and roll back the migrations before you switch off
to another branch all the dang time.

When you're limiting yourself to discussion of production environments,
absolutely you're correct, rollbacks should be essentially prevented, ideally
through tooling that makes them nearly impossible to run, especially without
performing a backup first.

~~~
megaman22
I don't think I would be comfortable doing database schema related work in a
branch unless I was also essentially branching my database at the same time.
Even for my mostly garbage personal developer database that's filled to the
brim with nonsense names and keyboard mashing. The chance of breaking
something and wasting a large amount of time and data is too great. Of
course, Thou Shalt Have Backups, but it's not exactly a frictionless
experience if you fry things and have to go back to them.

Unfortunately it's not quite as quick and easy to clone off a copy every time
you want to run some tests or fiddle with a spike as it is to jump between Git
branches.

~~~
paloaltokid
_> branching my database_

This sounds like a great product idea! Someone please do this. :)

~~~
isotropy
Check out Delphix ([https://www.delphix.com](https://www.delphix.com)). It
provides a ZFS-backed copy of your production database per developer using
snapshots, with rollback. Disclaimer: a friend worked there, I haven’t used it
personally, and it’s not free.

------
Diggsey
No, you shouldn't use a code-first _or_ a database-first design.

The article explains the problems with the code-first approach very well, but
there are just as many problems with the database-first approach:

\- There is never just one database. There will be your production database, a
staging database, one or more testing databases, local development databases,
etc. This makes the database useless as a single source of truth.

\- When you write code against your development database, there's no guarantee
that it will be the same as the production database, and you shouldn't be able
to connect to your production database while developing locally to verify
that. Usually you want to deploy identical build artefacts to staging and
production, so there's a danger that your staging database gets out of sync
with production (eg. after an issue was caught and improperly rolled back on
staging) and then you deploy your fully tested code from staging to production
and your entire system breaks down.

\- When you need to restore a backup of your database, you need to somehow
find a compatible version of the code to deploy as well.

My recommendation is to use your migration scripts as your source of truth.
Each service should access separate schemas within the database, or at the
very least separate tables. If you really need multiple services to share
tables then those services should be versioned and deployed as one.

The database should have a table which stores a list of the migrations which
have been applied, and you should use a migration tool such that when you
deploy, outstanding migrations are applied as required.

If you want something easier to read than the history of migration scripts,
then you should keep an up-to-date schema _as well_ next to your code, and
verify as part of CI that running all the migrations results in that exact
schema.
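The bookkeeping described above can be sketched in a few lines. This is a minimal illustration only, with made-up migration names and SQL, using Python's sqlite3 as a stand-in; real tools (Flyway, Liquibase, Rails, Django) implement the same idea far more robustly:

```python
import sqlite3

# Hypothetical migration list; in practice each entry would be a .sql file
# in version control, named so the files sort in application order.
MIGRATIONS = [
    ("0001_create_users",
     "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)"),
    ("0002_add_users_name",
     "ALTER TABLE users ADD COLUMN name TEXT"),
]

def apply_outstanding(conn):
    """Apply only the migrations not yet recorded in schema_migrations."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for name, sql in MIGRATIONS:
        if name not in applied:
            conn.execute(sql)
            conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (name,))
    conn.commit()

def current_schema(conn):
    """Dump the resulting DDL so CI can diff it against a checked-in snapshot."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master "
        "WHERE type = 'table' AND name <> 'schema_migrations' ORDER BY name")
    return "\n".join(row[0] for row in rows)

conn = sqlite3.connect(":memory:")
apply_outstanding(conn)
apply_outstanding(conn)  # idempotent: nothing outstanding the second time
```

On deploy, the same routine runs against whichever database it is pointed at, bringing it to the same schema regardless of where it started.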

~~~
andreimackenzie
It can help to establish rules such as "new DB migrations must always be
compatible with old code" aka "migrations first". This gets everyone into the
mindset that a db+code change must be carefully staged in a way that both (new
DB, old code) and not just (new DB, new code) are tested. If the migration
goes wrong due to different data in prod, it's easier to roll that back before
new code is deployed.

~~~
hinkley
Anything faster than what you’re suggesting here is trying to rush the
process.

Create the data before it’s needed, clean up old records, verify the quality,
make it required. And only then do you use it.
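That staged sequence (create, backfill, verify, only then require) can be sketched as follows. The table and column are invented, and SQLite stands in for a real server even though the final step needs something like PostgreSQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total INTEGER)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(10,), (25,)])

# 1. Create the data before it's needed: add the column as nullable,
#    so old code that never mentions it keeps working unchanged.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT")

# 2. Clean up old records: backfill (real backfills often run in batches).
conn.execute("UPDATE orders SET currency = 'USD' WHERE currency IS NULL")

# 3. Verify the quality before tightening anything.
nulls = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE currency IS NULL").fetchone()[0]
assert nulls == 0

# 4. Make it required. In PostgreSQL this would be:
#      ALTER TABLE orders ALTER COLUMN currency SET NOT NULL;
#    SQLite can't add the constraint in place, so this sketch stops here.
#    Only after this step does new code start relying on the column.
```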

------
nikisweeting
From a startup perspective I strongly disagree; I think the UI should almost
always be designed first. The HTML/CSS/JS frontend that the users see comes
first, hooked up to an empty backend that does nothing. Then, once the UI
requirements are fleshed out and the final specs nailed down, only then should
the database schema be designed. The backend glue, e.g. Django/Rails, comes
last.

I find this approach has led to the fewest code rewrites and
database migrations needed to get from initial mockup to finished product,
because UI requirements can change much faster and are more fickle than DB
schemas and backend code.

~~~
nicodjimenez
I agree that the UI / user experience needs to be spec'ed out first. However,
in terms of actual development, writing the DB schemas first is usually the
right call. I typically feel that things should be designed outside-in, and
developed inside out. Of course, those are just general principles.

~~~
sethammons
> [T]hings should be designed outside-in, and developed inside out.

This is gold, and so many do not do this.

------
WorldMaker
My two cents: always source control your databases.

I don't care if you are code first or database first, but I do think database
schema and migration scripts are code artifacts and need to be source
controlled _somewhere_.

You should be able to track who made a change and when and what for, and
maintaining by hand a "commit list" in a comment at the top of a stored proc
doesn't count.

You shouldn't need a full backup of a Production database to spin up a new
test server, and you shouldn't need to rifle through every developer's
documents and desktop folders looking for random .sql script files to figure
out if you are missing a migration/update/refactor script somewhere.

(The argument that you rarely spin up a new Production server is I think
tempered by how many developers and testing environments you need. Setup time
for developer/testing instances indicates a part of how long onboarding new
developers may take; and here I lean to the side of developer productivity and
good backups.)

Most code first methods give you source control "for free", which makes them
easy to recommend to junior developers / small startup projects.

Visual Studio has decent SQL Server Project support to help with the database-
first crowd, as one option I've relied on heavily. Unfortunately, I've yet to
see an open-source, more database-agnostic relative of it.

------
ams6110
First supervisor I had in my 2nd job out of school had a mantra: get the data
model right and everything else falls into place.

Has worked for me for 30+ years.

~~~
mavelikara
"Show me your flowcharts and conceal your tables, and I shall continue to be
mystified. Show me your tables, and I won’t usually need your flowcharts;
they’ll be obvious."

\-- Fred Brooks

~~~
jessaustin
I've gotten the feeling that flowcharts used to be considered quite important
by lots of coders, while today they are only used in particular subcultures. A
flowchart is sort of an ungainly artifact. It repeats information that is in
code, but is more difficult to produce and save than code is.

~~~
lovich
I've found them worth the effort pretty much only when trying to explain to
stakeholders that they have given me conflicting requirements. When they can
follow the logic path with their own eyes they are usually able to figure out
what they actually want

------
sixdimensional
One use case for when you may consider code-first - pure experimentation and
fast iteration around the data model that supports your application.

I have used both models, and did database-first extensively, for at least 15
years. However, having been working in .NET lately and having EF Code First
migrations (painful, but read on), I did discover a sort of hybrid model that
worked.

Eventually, I believe you will always end up at database-first, even if you
start code-first, unless your database is relatively small (by number of
tables/procs/views) or limited in scope. As applications grow, multiple
databases become a reality, the database might grow truly huge, and you might
need customizations specific to your database platform - at that point,
limiting the database to what the application / code-first tooling can express
becomes impossible.

Where code-first works well, for me, is limited scope OR, more importantly,
when you need really fast iteration for prototyping and don't want to think
about building the tables. In that use case, code-first can be really helpful:
think in code and let the system generate the database, at least enough to get
to an MVP or PoC level of work product. Then, once it's working enough, to
really take it all the way, switching to a database-first approach to exploit
a particular database platform's strengths can work - I've done it myself on
several projects successfully.

The tooling is almost there to do this... just start code-first, and then once
you have your system bootstrapped, generate the database-first model and
switch. But you have to be careful with naming things so that the names in
your code match up (or can easily be updated) if you switch from code-first to
database-first - something that not all code generators let you control
completely (most template-driven ones do, though).

------
steve-chavez
If you're using PostgreSQL, I invite you to try PostgREST
[https://postgrest.org/en/v5.0/intro.html](https://postgrest.org/en/v5.0/intro.html)
which takes the database-first approach even further: with PostgREST you
develop all your business logic in SQL. There's no code generation and no ORM,
which, no matter how elegant, will always impose a performance and abstraction
overhead (to do an aggregate you have to read the ORM docs, learn/understand
the abstractions, and add the code, all of this in addition to what you
already know how to do in SQL).

Check the API
[https://postgrest.org/en/v5.0/api.html](https://postgrest.org/en/v5.0/api.html)
for more details on all the features that work out of the box, no ad-hoc code
needed.

------
mnm1
I couldn't agree more. The code first design is what kept me away from Rails.
I'm sure they have sensible approaches as well, but unfortunately I couldn't
get over my prejudice that anyone who teaches the creation of the database
through the app's code instead of actually designing the database is someone I
don't want to follow or learn from and whose framework I don't want to use
even if it provides alternatives (I hope it does). Database design in a CRUD
app is by far the single most important task. Show me your code and I know
nothing, show me your data structures and I understand the whole app never
applied more.

The consequences of not following this advice I see everyday at my current job
where we are stuck with an ORM that needs at least 3-4 files modified for each
entity change and that requires management using an external GUI tool. The ORM
adds 10x overhead to already-unoptimized SQL queries, so optimizing the
queries is beside the point: even if it were possible to make the queries run
in zero time, the overhead would still kill us. 10x. Query is 10ms and
overhead is 100ms. And the trade-off for this?
Nothing but a shitty ORM that makes no sense (Doctrine 2/PHP) and slows down
development a few times over as well. Changing even the name of a column
requires editing multiple files. We have camelCase, snake_case, and kebab-case
for all our entities and depending on where you are in the system, a different
case is required. Using the ORM to generate entities after you're in
production? Don't make me laugh. That's impossible. And this isn't even the
worst ORM I've used (though it's close). Everything the author mentions and
then a whole bunch of stuff that's a million times worse has resulted from
this drive to abstract away the most important component, the database, and
not have it be the single source of truth. Frankly, it's fucking stupid and
beyond maddening, not to mention slow and due for a rewrite. Never again will
I tolerate this kind of design that ORMs encourage, or the use of ORMs at all,
especially if they replace solid database design. I'll leave that kind of
fuckery and stupidity to the rookies.

~~~
rimliu
What’s stopping you from designing your database in Rails? You can spend all
the time you need designing it, and then just write your design down in
migrations/models.

~~~
jessaustin
I've never used Rails, but I'd be surprised to learn that this
"migrations/models" abstraction allows the full range of foreign keys,
indexes, triggers, constraints, stored procs, etc., without which we might as
well just be on NoSQL... Not that there is anything wrong with that in
principle, but in many situations that's not what we want.

~~~
senotrusov
Rails provides a Ruby DSL for migrations, but you can always put plain SQL in
those migration files. I do this most of the time, and Rails conveniently
tracks which scripts need to be executed against a particular database state
and runs them for me.

I rarely find myself writing triggers. Not because I can't - it's just handier
to do that at the ActiveRecord layer, with access to all my Ruby code.

Indexes, indexes on PL/pgSQL functions, foreign keys, and constraints – they
absolutely have their place in Rails applications.

Although you can do rather complicated SELECTs with ActiveRecord, sometimes,
for particular queries, I prefer to create a database VIEW and then point the
ActiveRecord model at it.
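That pattern can be sketched roughly as below; the table and view names are invented, sqlite3 stands in for Postgres, and a plain query stands in for the ActiveRecord model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total INTEGER);
INSERT INTO orders (customer, total) VALUES ('alice', 30), ('alice', 12), ('bob', 5);

-- The complicated SELECT lives in the database, as source-controlled DDL...
CREATE VIEW customer_totals AS
    SELECT customer, SUM(total) AS lifetime_total
    FROM orders
    GROUP BY customer;
""")

# ...and application code reads the view like an ordinary table, which is
# essentially what pointing a model at a view gives you.
rows = conn.execute(
    "SELECT customer, lifetime_total FROM customer_totals ORDER BY customer"
).fetchall()
```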

~~~
toasterlovin
> Although you can do rather complicated SELECTs with ActiveRecord,
> sometimes, for particular queries, I prefer to create a database VIEW and
> then point the ActiveRecord model at it.

Database views + ActiveRecord are a super power.

------
vinceguidry
What I want is a doctrine that gives an ideal approach for data management,
similar to how 12 factor gives an ideal approach for overall application
architecture management. Data architecture, even though it's a subfield of
application architecture, is important and vulnerable enough to need it's own
doctrine.

What I like about 12 factor is that you don't have to get it all done at once.
The concepts start high level and you can dive into the details as they become
necessary. Initially, you can get by with dotenv for configuration management,
and slowly build out more tooling _as you need it_.

But data management seems to need a new lexicon and set of organizing
principles. How do I think about the varied aspects of database design in a
way that I can be reasonably sure won't bite me in the ass before I can get
around to scaling it? How do I pick a database, what are the relevant decision
criteria? At what point do you want to introduce different kinds of databases,
like say a graph database or ElasticSearch?

~~~
pgwhalen
Undoubtedly, Designing Data-Intensive Applications is a great place to start:
[https://dataintensive.net/](https://dataintensive.net/)

It’s not as purely distilled as the 12 factor app by any means, but amongst
all the details about B trees and such, a number of very solid principles
around storing data creep out.

~~~
vinceguidry
Looks great, thanks!

------
soulnothing
jOOQ has become my favorite ORM with how bare-metal it is. I have it set up in
GitLab pipelines to auto-build my schema into an ORM: just kick off a build on
a SQL change, and then inherit the new package.

I saw an ancillary remark about prod connections being bad. I generally
agree. I usually build mine off of my staging environment, assuming the
staging database schema was promoted to production. Then I have a sister build
job that kicks off my jOOQ build, triggered by the promotion task.

The biggest thing is that several databases require the professional license.
But it's not that bad, really.

It also goes to one of my thoughts as well: the data model is crucial to the
application. It is like the foundation of your home. I always try to hammer
home talking about the data model first: where did it come from, where is it
going next, and finally, where do you want it to go?

~~~
willtim
If jOOQ is anything like LINQ-to-SQL (and it certainly looks inspired by it),
then it isn't an ORM. It's a typed query language with a one-to-one mapping to
SQL database tables/views. This gives us the low-level control of SQL, but
with the ability to build our own abstractions and compositions using the host
language. ORMs are frameworks (bad); jOOQ is a library (good).

~~~
soulnothing
You're exactly right, it's very reminiscent of LINQ. It's pretty much raw SQL
exposed as methods. I just say ORM because saying "near bare-metal SQL"
bothers a lot of developers.

------
sriku
Sometimes, I get the feeling that we're all in the story of the "blind ones
and the matter of the elephant". Design UI first? Design data model first?
Design code first? Write documentation first? Write tests first?

How many things can we actually do "first"? All of them are important. Writing
documentation tells you about what you might need implemented. So does
designing UI, but that might still leave out what users need. Which means we
probably don't know enough about the data model anyway. Or the code.

So ... here is one more blind person's view - iterate as quickly as you can on
everything.

------
TimJYoung
I'm not sure the statement "All generated DDL is only an approximation of
that." is necessarily correct; it's database engine/server-dependent.

It's typically pretty easy to reverse-engineer the production database schema,
including any vendor-specific extensions, into a SQL file that can be used to
recreate an empty version of the same database schema. Furthermore, it's just
as simple with most database engines/servers to use an automated tool to
compare two schemas and generate a DDL script that will upgrade the old schema
to the new schema. It helps immensely because such tools will avoid circular
dependencies with foreign keys, etc. that may trip you up if you're trying to
manually write such upgrades.
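The reverse-engineering step is essentially "extract the DDL, not the data". Here is a toy version against SQLite's catalog, with invented table names; real tools do full dependency ordering to avoid the circular foreign-key problem, where this sketch only puts tables before indexes:

```python
import sqlite3

source = sqlite3.connect(":memory:")
source.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL);
CREATE INDEX idx_customers_email ON customers (email);
INSERT INTO customers (email) VALUES ('a@example.com');
""")

def dump_schema(conn):
    """Extract DDL only -- structure, not data -- so an empty copy of the
    schema can be recreated elsewhere."""
    rows = conn.execute(
        # type DESC sorts 'table' before 'index'; a crude dependency order.
        "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL "
        "ORDER BY type DESC, name")
    return ";\n".join(row[0] for row in rows) + ";"

ddl = dump_schema(source)
empty_copy = sqlite3.connect(":memory:")
empty_copy.executescript(ddl)  # same tables and indexes, zero rows
```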

Other than that, this article is spot-on. The database should be the source of
all truth and yes, you need to become proficient in SQL to be a good general
application developer.

------
exabrial
I know in the HN comments there'll be a whole bunch of "you don't do design
when you're trying to iterate fast". I wholly disagree. Design phases expose
problems in your thinking that you'll be tripping over very shortly, except
now you didn't waste the time implementing something that wouldn't work
anyway. Skipping the design phase is "productivity theater", where it appears
you're being productive because code is being written, but in reality, you're
just being busy.

------
drblast
The database really is an API, in that client code depends on it and its
structure will be difficult to change once you go live, whether you want that
to be true or not. Writing the API before code that depends on it seems like
an inherently good idea to me.

You can pretend like you don't need to figure these things out first and get
away with it for a while, but at some point the lack of a solid foundation is
going to bite you.

I think all devs should answer the question, "What if they don't want to use
my UI or interface, or they want to automate something?" That's a good
scenario to support for any client, including yourself.

~~~
wwweston
I think it's really interesting that I find this insightful:

> The database really is an API, in that client code depends on it and its
> structure will be difficult to change once you go live, whether you want
> that to be true or not.

And then have a different conclusion than this:

> Writing the API before code that depends on it seems like an inherently good
> idea to me.

I'd guess it comes down to differences about the value of top-down and bottom-
up approaches to systems.

My experience is that top-down thinking usually is a better guide to creating
the API you actually want to call -- writing the API calls you wish you had in
the flow of the code, and then making them real.

But then again, that's application-specific context, and may not lead one to
answer the question "What if they don't want to use my UI or interface, or
they want to automate something?"

------
Fradow
Perhaps I am too new to back-end development, but what strikes me in this
article is the misconception that code-first is going to give you a bad DB
schema, and that migrations will lose data.

I think part of it could be blamed on the tools: from the code I see, the ORM
doesn't seem well-built, if it needs that much code to generate a proper DB
schema.

For the record, I'm using Django (for about a year; I was developing apps
before), and my workflow is:

1) code, while thinking about what the DB schema will be as a result

2) generate the migrations

3) apply them and check whether what's generated is good and usable when you
need to directly query the database (mostly for dev purposes). If need be,
either change your code or even manually modify the migrations. Migrations are
code too; they are there to be overridden when required.

Sure, your rollbacks can lose data, but only if you don't pay attention to
them and don't change them to avoid that. As a developer, it's your
responsibility to think about that if it's of critical importance to your
business.

Feel free to enlighten me if something I say seems very wrong, after all I'm
still quite junior at this, and have not experienced huge scales / large team.

------
skybrian
The article seems to advocate generating code from a live connection to the
production database.

This seems very risky. Typically you don't want developers to even have access
to the production database most of the time, let alone make it a hard
dependency to do a build. Development work should be done using temporary
databases.

Also, database schema migrations need to be tested somewhere other than the
prod database before going live. All the code changes needed (including to
applications) should be written and tested _before_ actually doing the
migration.

But this approach would be reasonable with a bit of indirection: one developer
(DBA?) makes a database schema change, resulting in a change to a schema file
that's checked in (after testing), and code gets generated from the schema
file. That way developers all have access to a history of the prod database
schema, without access to the database itself.

(This is pretty close to how things are done using protocol buffers.)
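A minimal sketch of "generate code from a checked-in schema file": the schema below is hypothetical and the regex parser deliberately naive; real generators (jOOQ, protoc) read a proper catalog or IDL instead:

```python
import re
from dataclasses import make_dataclass, fields

# Hypothetical checked-in schema file; a real setup would dump this from the
# database catalog rather than hand-rolling a regex over DDL.
SCHEMA_SQL = """
CREATE TABLE users (
    id INTEGER,
    email TEXT,
    created_at TEXT
);
"""

SQL_TO_PY = {"INTEGER": int, "TEXT": str}

def record_class_from_schema(sql):
    """Generate a record class from the column list of one CREATE TABLE."""
    table = re.search(r"CREATE TABLE (\w+)", sql).group(1)
    cols = re.findall(r"^\s+(\w+)\s+(INTEGER|TEXT)", sql, re.MULTILINE)
    return make_dataclass(table.capitalize(),
                          [(name, SQL_TO_PY[typ]) for name, typ in cols])

Users = record_class_from_schema(SCHEMA_SQL)
row = Users(id=1, email="a@example.com", created_at="2018-06-06")
```

Because the schema file is in version control, every developer gets the history of the schema without ever touching the production database.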

~~~
NovaX
I don't think anyone advocates for generating based on a production database
connection. I can only tell you how I've used jOOQ, using the same general
setup for 6 years despite job changes. I wrote the original Gradle plugins for
jOOQ and Flyway, the latter being a migration tool. These were then
contributed back and I use the official variants now.

I write SQL migrations for H2, an in-memory db. This is used by the build tool
to code generate prior to compilation. It is also used by unit tests, when the
Postgres queries can be emulated (majority of the time). The Postgres
migrations are written separately, with similar code but minor dialect
differences. That might be less duplicated (but trivial) work if I used
LiquiBase instead of Flyway. When the application starts the migrations are
run. Someday I might drop H2 for a containerized Postgres, especially if they
add in-memory support for faster testing. This would avoid some codegen
differences, like jsonb or (soon to be released) enum support.

The work setting things up wasn't too bad. It was more when I originally
started, since I had to build part of the ecosystem. But repeating the process
elsewhere has been straightforward. It definitely pays off and is a joy to use
day to day.

------
geebee
Fantastic.

There's an earlier article from the same blog that addresses this at a deep
level

[https://blog.jooq.org/2014/01/02/why-your-data-will-outlast-your-sexy-new-technology/](https://blog.jooq.org/2014/01/02/why-your-data-will-outlast-your-sexy-new-technology/)

A couple of fundamental principles here: systems tend to outlast their
developers; data tends to outlast the systems operating on it.

This is why I'd almost always go with database first design.

As a general rule (and test), a well designed database is useful outside of
the context of the application that accesses it for persistence.

As an example from the kind of apps I've worked on: suppose you are writing a
system for tracking and managing inventory, which will be used through a web
interface. If you have direct access to the database, can you get useful
information about orders, lead times, existing inventory levels, and so forth?
If not, you don't have a useful database, and you are at risk. Migrating to a
new system, which is probably inevitable down the road, will be very difficult
as you no longer have control over your data.

Now, apps are useful, and code is useful. I wouldn't want to try to code up
various supply chain management algorithms in SQL directly. The estimated lead
time can be quite a calculation, and yeah, you are probably going to want to
write python or ruby or java code to do that. But in the ruby world (with
rails in particular), I have noticed databases used almost as a kind of object
serialization and persistence, where the information makes almost no sense and
can't be queried in any meaningful manner through the DB. It is simply used to
store and retrieve information that needs to be drawn into and reassembled
through various bits of code and configuration files in a rails app before it
takes a form that makes any sense.

In short, the database has no meaning, almost, outside the rails app.

Now, I'm not saying this to knock rails. I've used it, and I enjoyed the
experience. Metaprogramming, in fact, makes it very easy to map an object to a
database table without a ton of irritating configuration or extra code. You
don't have to generate your schema through rails to map models to tables, you
can create them independently of the app and use rails to do the mapping for
you. Also, it's entirely possible to use rails generators to create a very
sensible database design. The problem is probably that a lot of people using
these generators never considered databases or SQL or schemas as something
that had meaning outside the context of a rails app. My guess is that this
anti-pattern happens in almost all frameworks that allow developers to quickly
push objects with relationships off into a database schema that they never
took the time to consider or design. It's almost no different than if they
serialized the objects for persistence, and used a hash to find and retrieve
them.

I also think this emphasizes why clarity is so important in code and database
design. Sometimes people talk about upgrading an existing code base, and say
"well, it's already written in [java/ruby/python/...] , so it'll be easier if
we stick with a [java/ruby/python/...] based framework". Nah. If the code
lacks clarity, it'll be hell to port it from python to python. If it has
clarity, it won't be hell to port it from python to ruby.

If you can use your database to answer meaningful questions about your data
outside the context of an app, your database is not only useful, it also
probably has clarity of design and purpose. If you can easily read your code
and know what it is intended to do and why, you have clarity.

I'm amazed with how often developers are willing to sacrifice clarity to meet
the demands of a testing framework, or a particular architecture, or to
accommodate a cutting edge UI/javascript approach. These can be good things,
but you _know_ that shit ain't lasting, and then, without clarity, where will
you be?

Well, we all know, the developer who gained experience with the nifty new
framework will be: at a new, better paid job.

Beyond that... eh, I suppose a code generator from a database can be useful
and save you some typing, but if it's all cluttered and clunky and lacks
clarity I'd rather just do it myself. It's not really that hard.

------
ngrilly
Really great post. Code generation is seen by many as old school, but this
post shows it's still alive and well. We are seeing a revival of code
generation with tools like jOOQ, Protocol Buffers, static website generators,
and some Go tools, for example.

------
lucidguppy
There are some who say that architectures that are able to postpone choice of
front end and DB are flexible and well designed.

You should design your app to make it desktop _or_ web... and choose any DB
you want.

------
nickpeterson
My favorite version of this, is when a new version of something is greenlit by
the business, and a bunch of enthusiastic devs look at the old system and
decide to rewrite. Instead of starting with the existing data, and determining
how it's going to fit in the new design, they just make a new design and
figure, "we accounted for most business requirements, so the data should be
able to fit...". Then they set a launch date with the business, and just need
to do that data migration. What could go wrong?

------
crb002
This goes with any system. Get the data pipelines hot then drive in the
business logic. Otherwise you are surprised with latency and have to do major
refactors.

