
Old, Good Database Design - jelnur
https://relinx.io/2020/09/14/old-good-database-design/
======
commandlinefan
Some people choose nosql alternatives because they've spent time analyzing the
performance of a proper relational model and have determined that an RDBMS
will generate too much overhead for their data load and consciously accept the
tradeoffs involved in giving up automated referential integrity.

Most people, though, choose nosql alternatives because they're too lazy to
learn how to model data.

~~~
SamuelAdams
I am forever grateful that I took a full semester of database design in my
undergrad. This single skill has stayed with me for my entire career so far and
has enabled me to figure out the root cause of many production issues. Plus
people really like it when you can answer ad-hoc questions like "what
inspections are still open and when were they first opened?"

If y'all can understand Angular / React / Vue there's no reason to not learn
databases.

~~~
throw345hn
Could you suggest resources (books, articles, videos, MOOCs or others) to
learn good database design? I am picking up SQL skills but want to better
understand and learn about databases. As someone who doesn't have that
background, a lot of the time I am just googling for stuff and trying out
bits and pieces.

~~~
mcny
I'd like to know this as well. I think you'll just have to build things
(potentially horribly) and fail.

I took three semesters of database (granted, baby database classes) and I
still have no idea how you can do something pretty straightforward like
creating a room reservation system.

If there is a reservation beginning at 10:15 AM and ending at 12:30 PM and
someone tries to book a reservation from 10:00 AM to 10:30 AM, the transaction
should fail.

and before someone screams db2! yes, db2 can. but then you'd have to use db2
[https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/co...](https://www.ibm.com/support/knowledgecenter/SSEPGG_11.1.0/com.ibm.db2.luw.sql.ref.doc/doc/r_overlaps_predicate.html)

Why is this so difficult...

~~~
sk5t
The documentation for Postgresql range types describes how to do exactly this.

[https://www.postgresql.org/docs/11/rangetypes.html#RANGETYPE...](https://www.postgresql.org/docs/11/rangetypes.html#RANGETYPES-CONSTRAINT)

Edit: and if you didn't want to use postgres, you could have "starttime" and
"endtime" columns and reject any bad bookings with a before insert / before
update trigger.
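
For the record, the range-type version really is tiny. A sketch with invented
table/column names (it needs the btree_gist extension so the plain-equality
room_id column can share a GiST index with the range):

    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE reservation (
        id      serial PRIMARY KEY,
        room_id int NOT NULL,
        during  tstzrange NOT NULL,
        -- No two rows may have the same room_id AND overlapping ranges:
        EXCLUDE USING gist (room_id WITH =, during WITH &&)
    );

    -- The 10:15-12:30 booking goes in fine; a second booking for 10:00-10:30
    -- in the same room then fails with a constraint violation, no trigger needed.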

~~~
gavinray
This approach is the best and works really well if you don't need to do a join
on a related table to look up information. If you need to use data outside of
the current table for exclusions/check constraints, you have to write a
trigger function (as far as I know).

I had to solve this recently, where the actual start/end times were stored on
a related table. I'm no SQL wizard, but I'd love to share my solution in case
it helps others (it might be terrible).

 _Note: I changed the actual tables/domain to be generic. This is a poor
example and it made more sense for my use case, but it shows the general
approach._

    
    
      -- Let's pretend we have these tables (awful design, but for the sake of example):
      -- room <-> reservation <-> reservation_info
      -- Where "reservation_info" has "start_time" and "end_time"
      CREATE FUNCTION check_for_overlapping_reservations()
        RETURNS trigger
        LANGUAGE plpgsql AS
      $$ BEGIN
        IF (
            -- Find the newly created reservation and join it with the info record
            -- to grab "start_time" and "end_time" for the check below
            with this_reservation as (
                  select reservation.id, reservation.room_id,
                         reservation_info.start_time, reservation_info.end_time
                  from reservation
                  inner join reservation_info on reservation_info.id = reservation.reservation_info_id
                  where reservation.id = NEW.id
              ), bookings_for_timerange as (
                  -- Select every reservation happening in the same room...
                  select other_reservation.id
                  from reservation as other_reservation
                  inner join reservation_info as other_reservation_info
                    on other_reservation_info.id = other_reservation.reservation_info_id
                  cross join this_reservation
                  where other_reservation.room_id = this_reservation.room_id
                  -- ...where the timerange from start to end overlaps the newly created record
                    and tstzrange(this_reservation.start_time, this_reservation.end_time) &&
                        tstzrange(other_reservation_info.start_time, other_reservation_info.end_time)
              )
              -- Get a count of all the records. The new reservation always matches
              -- itself, so a count greater than 1 means there's overlap.
              select count(*) from bookings_for_timerange
          ) > 1
        THEN
            RAISE EXCEPTION 'Room is already reserved during this time period';
        END IF;
        RETURN NEW;
      END;$$;
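
For completeness, the function does nothing until it's attached (trigger name
invented); it has to be an AFTER trigger so the count-includes-itself logic
holds:

    CREATE TRIGGER reservation_overlap_check
      AFTER INSERT OR UPDATE ON reservation
      FOR EACH ROW
      EXECUTE FUNCTION check_for_overlapping_reservations();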

~~~
sk5t
TBH it's a bit more verbose and less performant than need be, but hey, if it
works, rock on! But don't use 'select *' in production code.

Consider next time something like

    
    
        if exists (select from sometable t1
          join sometable t2 on
            t1.resource_id = t2.resource_id
            and t1.res_id <> t2.res_id
            and tstzrange(t1.starttime, t1.endtime) && tstzrange(t2.starttime, t2.endtime)
          where t1.res_id = new.res_id )

then ... raise exception

~~~
gavinray
Ahh, that looks much cleaner. Thanks for the knowledge =)

------
throwaway894345
If one of the purposes of relational databases is data modeling, I've always
wondered why there aren't good semantics for sum types. The real world is full
of them, but databases can't express them. When I bring this up, some people
respond that this is the purpose of ORMs; however, this implies that we have
an arbitrary bifurcation in which some of the processing happens efficiently
in SQL and anything that depends on sum types has to get hoisted up and over a
network to application code. Further, it allows for different clients to
behave differently, possibly having different notions of what any given sum
type's variants are (which leads invariably to data corruption). I really wish
databases did better here, but maybe I'm missing something.
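
(The closest plain-SQL workaround I know is a tag column plus CHECK constraints
that force the right fields per variant; it's verbose compared to a real sum
type, which is sort of the point. A sketch with invented names:)

    -- A "payment method" sum type: Card(number, expiry) | BankTransfer(iban)
    CREATE TABLE payment_method (
        id          serial PRIMARY KEY,
        kind        text NOT NULL CHECK (kind IN ('card', 'bank_transfer')),
        card_number text,
        card_expiry date,
        iban        text,
        -- Exactly the columns for the chosen variant must be non-null:
        CHECK (
            (kind = 'card' AND card_number IS NOT NULL
                           AND card_expiry IS NOT NULL AND iban IS NULL)
         OR (kind = 'bank_transfer' AND iban IS NOT NULL
                           AND card_number IS NULL AND card_expiry IS NULL)
        )
    );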

~~~
alextheparrot
Can you give an example of real world data modeling where you want more
expressive sum types over just using enums? Enums are technically a subclass
of sum types, but even those are non-trivial to use at a data format level
(Try evolving them in an on-the-wire message format like Avro or Protobuf).

~~~
talaketu
How do you model "postal address"? Some postal addresses are PO Boxes, some
are street addresses, etc. There are canonical representations of these
different cases. Do we just shove it all in a string, and let the application
perform domain validation?

~~~
gfody
postal address is one of those cases where you probably do just want to shove
it all in a string as most structural constraints eventually backfire -
especially if you support international:
[http://www.columbia.edu/~fdc/postal/](http://www.columbia.edu/~fdc/postal/)

the most common schema I've seen is usually something like line1, line2,
line3, city, state, country, zip, etc. if it's a reporting database then
city/state/country/zip is often mashed into some sort of location id.

~~~
outadoc
Watch out for `state` as well, please don't make it mandatory like so many
websites. No, a region is not the equivalent of a state in France, you don't
need it for my package to get there!

------
gopalv
> A well-thought design can save us many hours of coding, testing, and
> troubleshooting.

That is the very definition of a waterfall design model.

I've turned into a fluid-design advocate over the years, where every design
principle follows a next question - "okay, this is good but how would I change
it?".

So you start with a unique constraint and four months later, you find out that
it is not actually unique (like "two patients with the same email, because the
2nd one is a newborn on day of birth").

Or you normalize a data-set only to find out that your 1:N relationship turns
into a 2:N relationship from before/after dates (like "UK" goes from "EU" to
"UK").

The lost-time work of a design is usually the "okay, we did it in a world
where UK was in EU - but we undo it & here's how" notes.

Having a plan to decommission the nuclear plant you're building is super
useful and often more relevant than designing it for efficiency alone.

~~~
nightski
Good design as advocated by the author is generally conducive to change and is
saying pretty much the same thing as you are.

It's a lot easier to take away unique constraints later on instead of adding
them in. It's easier to de-normalize some data for performance than to
normalize it later on. The list goes on.

The reason the waterfall method received so much bad press is because of
requirements gathering, not the software or data design phases. Requirements
are hard to get right the first time and they also change over time. But I'd
be surprised to find someone argue that good architecture and design is a bad
thing (being defined as the ability to adapt to changes in requirements).

~~~
dragonwriter
> The reason the waterfall method received so much bad press is because of
> requirements gathering, not the software or data design phases.

No, it was because all three were done wrong.

Requirements gathering is the biggest problem, true. But even if requirements
were both knowable and fixed, for most projects, big up front requirements
gathering, design, and then implementation would have lots of waste in the
lean sense of effort expended that spends time not delivering customer value.

Now, that gets made worse with the rework created by the fact that
requirements gathering without validation by use gets lots of stuff wrong and
that the context is often evolving such that requirements will drift between
gathering, design, and implementation in a waterfall project, so that lots of
work is done which never delivers value and needs to be reworked before it can do
so, but the problem exists even without that exacerbation.

------
simonw
My least favorite part of database design is the bit where you have to pick
lengths for your char columns.

Twenty years in and I'm still picking these pretty much by guessing. And when
I guess wrong it causes really annoying problems further down the line.

I love how SQLite doesn't make me do this - it just has a TEXT type which is
always unlimited in length.

~~~
brobinson
Postgres also has a TEXT type like this.

Edit: I see mpolun left a similar comment here, but it looks like he has been
mostly shadowbanned for about 8 years.

~~~
codetrotter
> Edit: I see mpolun left a similar comment here, but it looks like he has
> been mostly shadowbanned for about 8 years.

Yeah same. I don't have enough karma to vouch for their comment but I see no
big reason that said comment should be dead. Their account is from 2012 and
the vast majority of their few comments seem to be contributing to the
conversation. Their first dead comment is also from 2012 but at a glance said
comment is the only one that stands out as not contributing much to the
conversation. And that's being harsh even – I've probably made less
substantial comments in the past myself. Yet like 30% of their 3 pages of
comments are dead. And looking at their submissions they have ever only
submitted 3 stories, 2 of which appear to be from a domain that they
themselves control. Hardly enough to be subjected to having so many of their
comments killed I think. Though of course there might be other factors at
play, but from what I see on their profile page I see nothing bad enough to
warrant this.

~~~
brobinson
I have 1500 karma, but I didn't see an option to vouch for it... I have seen
that option on other comments, though. Maybe dang can take a look at their
account.

------
60secz
Haven't you seen ghostbusters? "Don't cross the streams. It would be bad." If
your data model has lines crossing over, it's the first smell of a bad design.
Seriously tho, 5 minutes of untangling would make that data model diagram 100x
better.

~~~
klysm
I wonder if there’s actually any significance to if your data model forms a
planar graph

------
janvdberg
Reminds me of this great Derek Sivers post:
[https://sive.rs/pg](https://sive.rs/pg)

If your design is good, you need less code.

~~~
nemothekid
> _If your design is good,_

"Draw the rest of the fucking owl"

Good design is _hard_. There are arguments to be made for both, but the
problem with "Old, Good Database Design" is when the design changes it either
devolves into

1. Downtime trying to move X billion rows

2. Some ad-hoc K/V store on top of your RDBMS

And most companies tend to opt for (2) rather than (1). It's no surprise that
some systems just decide to choose (2) from the onset.

------
markus_zhang
I'm wondering what you guys think about columnar databases and wide tables.

We use Vertica and from senior colleagues and even Vertica developers I got
the impression that big wide tables are good because it eliminates the needs
of join. Thus we don't use star schema and just wide tables.

However I think data modelling is also about embedding proper business logic
and it would be a lot more confusing if two unrelated columns are stuffed in
the same table.

For example, let's say my event-based game has a base for each player and he
can do a lot of things in his base. If we use a wide table then we will see
columns related to factory building side by side with pig feeding, and because
each small feature has some unique columns, a lot of those columns are NULL
simply because this event JSON doesn't even have the fields.

I'm wondering if we should use Vertica for a transactional type table and then
use, say, SQL Server for the DWH and build more traditional data modelling.
But this could be awfully wrong maybe...

~~~
teej
A few comments based on lots and lots of experience:

- Wide tables in columnar DBs can make some analytics queries easier to write
and sometimes more performant.

- Wide tables can come at high storage cost and make other queries less
performant (like SELECT *)

- How much of these things happen is extremely DB dependent. How does the
DB's underlying storage mechanism work? How is the data partitioned and
distributed? How sophisticated and storage-aware is the query planner? How
large is your data? How fast is your data growing? How fast do you need a new
data point to be reflected in your dashboard?

There's no free lunch when it comes to this stuff. A perfect solution doesn't
exist, but the technology is getting better all the time. I've personally
never used sql server as a data warehouse but plenty of folks do.

The stuff I use that I recommend very highly - Snowflake, TimescaleDB, vanilla
Postgres. Also, use dbt.

~~~
markus_zhang
Thanks teej for the answer.

>How much of these things happen is extremely DB dependent. How does the DB's
underlying storage mechanism work? How is the data partitioned and
distributed? How sophisticated and storage-aware is the query planner? How
large is your data? How fast is your data growing? How fast do you need a new
data point to be reflected in your dashboard?

I think most of my frustration comes from not knowing this stuff. I work as
a BA-BI hybrid as I found my data analysis skills are good complements to data
modelling/airflow type of work, so I persuaded my manager to give me some BI
work. But that also means I don't have the technical knowledge such as DB
internals (and TBH I can't even find a book for Vertica on that matter).

Our DB and DBAs and all ETL processes are located in HQ and we actually don't
own our databases. This, I guess, adds an extra layer of discomfort as we are
effectively cut off from database-level optimization. Our data engineer is
about to leave because he has no DE work to do (every ETL has to go through
HQ's process and we only need to submit some configuration files).

We don't have access to the databases you recommended (again, HQ controls
that), but I do believe I could try dbt. May I ask how you use it? From my
understanding it is mostly a transformation tool, but what makes it stand out?

~~~
teej
No amount of technology can fix a dysfunctional org.

A large amount of the work of BI is getting data into the right shape in order
to ask questions about it.

\- What’s level on average do new players drop off?

\- How are sales of the new skins we introduced performing vs the last set?

\- How often are players engaging with feature A vs B?

You might be able to answer all of these questions with just event data in a
wide table, but you’ll write very different queries to answer them. dbt is a
way to write queries on your source data that leads to systematic, repeatable,
and reusable new tables for answering analytics questions.
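
For a rough feel (model and column names invented): a dbt model is just a
SELECT saved in a file. dbt materializes it as a table or view, rebuilds it on
each run, and `ref()` wires models together into a dependency graph:

    -- models/player_dropoff_by_level.sql
    SELECT
        last_level_reached,
        count(*) AS players
    FROM {{ ref('stg_player_events') }}  -- another dbt model, resolved by dbt
    GROUP BY last_level_reached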

------
tarkin2
JSONB objects with SQL relations in Postgresql is my happy-medium between the
joy of schema-less JSON and the reassurance of SQL relations.

~~~
c17r
Would you explain a little more? I'm intrigued. Are you saying SQL relations
between fields inside JSONB columns between tables? A field in the JSONB
column has foreign key to another table? And you can do a join?

~~~
tarkin2
You can do both. You can make relations between jsonb fields.

Or, as I do: have a jsonb blob in one column, and a normal sql field in
another, and do the sql joins with that.

So I have some data fields as sql columns, and some as jsonb properties.
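
Something like this, with invented names:

    CREATE TABLE customer (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    CREATE TABLE orders (
        id          serial PRIMARY KEY,
        customer_id int NOT NULL REFERENCES customer (id),  -- normal SQL relation
        details     jsonb NOT NULL                          -- schema-less payload
    );

    -- Join on the relational column, filter on the JSON:
    SELECT customer.name, orders.details->>'sku'
    FROM orders
    JOIN customer ON customer.id = orders.customer_id
    WHERE orders.details @> '{"status": "shipped"}';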

------
slifin
Most relational databases aren't relational enough

For example in Drupal you can have a node table which is a foreign key
relationship to many other custom tables

In SQL I can't say get me all the nodes that have a start date without
explicitly left joining to a potentially dynamic number of other tables using
a higher level language to modify the query

In Datomic or Datascript or Datahike or Datalevin or Crux this query would be
trivial these are not nosql databases they're EAVT datalog engines, that in
some cases also support SQL for compatibility

~~~
petepete
You're judging all relational databases having looked at Drupal's approach?
Ok...

------
whirlwin
This is useful for most small apps or truly monolithic apps. ...But
normalization in a distributed environment with 50+ apps? Really? You need to
have some sort of duplication unless you want big bottlenecks, performance
penalties, and hot headaches

From the linked MS article: "Redundant data wastes disk space and creates
maintenance problems"

Made me laugh well

~~~
flowerlad
> _You need to have some sort of duplication unless you want big bottlenecks,
> performance penalties_

I think that's called a cache. You can still have solid database design at the
core. Then add redis on top of it.

~~~
Raidion
It can be non-cached data as well. Imagine data that needs to be sorted by
data that's owned by another service. You need to replicate the data as you
can't effectively do a join+sort between data coming from two APIs. You don't
treat that data as a source of truth, but you do use it for some UI purposes.

------
TheRealSteve
This seems contrary to what I have learned in my career as an application
developer on data heavy platforms. Namely the first section that concludes: >
Having stressed the importance of good database design...

I'm not in agreement with the author's concept of good design. I don't want
other "doors" to edit the data that bypass the application logic. That's the
mess SQL enables for DBAs and scripts that think it's okay to change data
without executing business logic.

I think it's better to accept your data and application layers are coupled and
plan to evolve and refactor them together rather than teach your database to
defend itself from the evil business logic in applications.

I also don't think this dictates relational DBs vs nosql. Your data model is
probably relational but how you choose to store and access the data depends
more on what use cases you are trying to enable.

~~~
nine_k
When you see that SQL access can ruin data and make them invalid, it's usually
a strong indicator of an inadequate DB design. (Maybe you did not have a
chance to see a good one?)

~~~
TheRealSteve
The pure, normalized models weren't performant enough so you are correct - the
design wasn't all roses in the SQL systems. We had largely flattened down some
table groups to be able to get data fast enough. But when you hit that point
you may have outgrown RDBMS

I'm not saying you can't get one to run fast enough but the ROI starts to
decline significantly.

------
unnouinceput
When creating a visualization for your database design DO NOT under any
circumstance cross relations as they go from one table to another. The client
will get confused and frustrated. Looking at this article's visualizations
hurts my eyes. Very bad choice for what's supposed to be a "good" design.

~~~
jelnur
Thanks for feedback. I just included it as a picture, but you're right.

------
roywiggins
> we should keep numeric data in “integer”, “numeric” fields

I end up keeping numeric data as text when I'm ingesting an external data
source that I don't trust not to change ID format on me. They're all numeric
_now_ , but the format could change, and the actual numeric value of the ID is
not important at all.

~~~
bcrosby95
Yes, be careful conflating numeric data with data that looks numerical. IDs
fall into the latter for sure. Unless you have complete control over it, then
it might always be numeric.

E.g. back when I thought I was super smart, on one project I made the credit
card cvv a number. Except they can start with 0. Whoops.

~~~
barnabask
Same here, but with US postal codes. Having grown up on the west coast I
didn't realize some started with a zero until embarrassingly late in life.

~~~
roywiggins
The best rule of thumb I've seen for this is: unless incrementing the value
makes sense, use a text field, barring some other really good reason. So:
auto-incremented IDs yes; arbitrary IDs that you don't control, no; phone
numbers, no; SSNs/PINs/ZIP codes, also no.
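
To make the leading-zero failure mode concrete (invented table; the regex
CHECK is Postgres syntax):

    CREATE TABLE address (
        zip_bad  int,             -- 02134 the integer is just 2134
        zip_good char(5) CHECK (zip_good ~ '^[0-9]{5}$')
    );

    INSERT INTO address VALUES (02134, '02134');
    -- zip_bad now reads back as 2134; the leading zero is gone for good.
    -- zip_good keeps the format without pretending it's a number.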

------
geebee
Nice link. Nothing controversial, but sometimes simplicity is controversial in
our field.

I've slowly come around to seeing proper database design as the most essential
foundation of an IT system. I remember reading "your data will outlast your
application", and I've been around as a developer long enough to have lived
it.

One big anti-pattern I've seen with ORMs is that developers who don't really
think in terms of data and relationships use the ORM as a kind of object
serialization usable only from the application. Rather than thinking of the
database as something useful that could be queried and accessed outside the
context of the application, they write objects out to various tables and then
re-import and re-construct them once they're needed again in the app, often
with dependencies that are in code or even yaml configuration files. The
upshot is that you simply can't use the database as anything other than a
persistence tier for an app. It really would be no different if they had
simply given the objects an id and pickled them to disk. The resulting data
store really is that inaccessible and meaningless outside the context of the
app.

As a result, if an analyst wants a report, they can't write SQL to get those
reports, even though the persistence tier is, if perhaps in name only, a
"relational database". And when the app goes away (as it inevitably does),
they'll still want to know, say, how many beakers and test tubes were ordered
by a lab tech who participated in 100 or more experiments per year with at
least one faculty member from radiology between June and December in 2007.

But because they don't really know SQL, they see no value to it, and they're
honestly just kinda irritated that they don't have an object database, which
is what they understand a database to be - a way to pickle and reimport
objects. The application outlasts the developer, and the data outlasts the
application.

Yeah, if you're writing an app that will hold potentially useful data,
definitely think about how this will be accessed outside the context of the
app that perhaps inspired its creation, and think about how you'd get at the
data if the application went away. That'll probably lead you to old, good
database design.

Now, I actually do think that you can create a decent relational database
through an ORM. I saw this go sideways with Rails a bunch of times, but I
think that's probably because Rails made it so easy to start developing that a
lot of people new to application development skipped the design stage of the
data backend. You can construct a pretty robust database with migrations
rather than CREATE TABLE statements, as long as you stick to the basic
principles - but the technology makes it relatively easy to do the opposite
and never really get into the mindset of data.

~~~
Supermancho
> Nothing controversial

 _ahem_

> Foreign Key constraint is the king of the relational database design

Amazon does not use FK constraints and I have rarely run into systems that do
(since 1996ish). Most people with big enough datasets learn not to use them.
The overhead for orphaned data is far less than the consequences of using
them.

~~~
PaulStatezny
Can you clarify about the overhead you're speaking of?

I assume it only comes into play at super massive scale like Amazon-level
datasets.

~~~
dalyons
Not even super massive like Amazon; even in an app with a few million users
you'll run into the performance problems of FKs. They are really overrated in
their usefulness; as the parent says, orphan records are really not a big
deal. As soon as you get to any moderate scale, dealing with a small level of
data inconsistency is inevitable, especially as you grow into a
multiple-services, multiple-databases type of architecture where you
fundamentally have to handle broken FKs between systems.

------
jrms
Hi, my apologies if it's a bit off topic, but I wonder if you have any advice
or can point me in some direction on the way to becoming a DBA, or something
similar enough? I've been working as an old-school sysadmin for the last 17
years and counting. Even though I can do the nowadays devops stuff, I always
liked DBA work, based on my experience as a sysadmin for DB servers (MySQL,
Oracle, and friends...), and I'm considering moving my career in that
direction. So is there any "DBA certification" for which I could take a
course and all that? Thanks!!

~~~
hobs
You've hit on the major problem with DBA work, which is that you are touching
some of the most expensive prod stuff and even if you have "that cert" many
places just want to see that somewhere else trusted you as a DBA for that type
of product.

I managed to transition in a company in dire straits that had no other
options, and then getting new DBA jobs was fairly easy.

~~~
jrms
Thanks for your feedback, and I pretty much understand what you mean; I've
worked with a DBA known as the "million dollar error guy". But I think that's
a second step, and anyway we were maintaining the servers where those DBs
were running... So we deserve some trust, maybe?

------
rapfaria
Which software is used to make those diagrams? The ones I use don't have good
arrows like that

------
mooneater
Is the diagram a joke? That's total ERD spaghetti.

~~~
jelnur
It's included only as a picture :)

