
We do not use foreign keys (2016) - Scarbutt
https://github.com/github/gh-ost/issues/331#issuecomment-266027731
======
conroy
When posts like these come up, I'd like to remind people that context matters
when making technical decisions. What works for large companies with huge
scale (GitHub, Google, Facebook) may not work for you.

As a counterpoint to the linked issue, I operate a few small applications.
Foreign keys (and constraints in general) are great at ensuring that invalid
data doesn't find its way into your database. Yes, they have a performance
cost. Yes, they make sharding more difficult. In my experience, at smaller
scale the trade-offs are worth it. YMMV

~~~
taurath
In the same vein, I'd like to remind people that you are probably not a
"temporarily low-scale big-data company", akin to a temporarily
embarrassed millionaire. In lots of cases going for the very long term
scalable solution will be an impediment to your growth, and I'd suggest
dealing with those issues when the chance that you need them is on the
horizon, rather than across the globe.

CQRS is one of the biggest examples I've seen personally for this - your
flexibility will go down, work to do "basic" things will go up, and you will
really lose a lot of speed solving for 10 y/o company problems with technology
rather than 1-5 y/o company problems in terms of product market fit.

~~~
corebit
The biggest issue with CQRS I've seen is people thinking CQRS means you need
multiple, duplicate data structures, mappers, a few Kafka topics and a PhD,
when IN REALITY all it means is you put methods that return data without
modifying it in one interface/class and methods that have side effects in
another interface/class - which is really just a good application of interface
segregation.

Moreover, you now have a great rule of thumb for your brand new junior
developers: "If you're handling a GET request, use this class, and if not use
this other class" making code reviews easier to do and helping to enforce
better structure in your code. Plus tons of other benefits.

The rest of that stuff that isn't CQRS you might eventually need if you scale,
but is easy to add once you have the bones in place.

~~~
Quekid5
Like you, I've found the general idea of CQRS/ES[0] incredibly valuable
because it both enforces a functional approach to state (e.g. 'current state'
is a fold over events), and it forces people to really think about the domain.
(E.g. which bits are really _atomic_ units, or what can be fixed up if it goes
wrong in some way.)

It also forces people to think about something that's usually glossed over:
Consistency. If you have an RDBMS backing you, you tend to not think about the
fact that what the user on a web page sees is _already_ out of date when
they see it, so any responsible application _should_ track when that data was
read and reject updates if the backing data has changed since it was read...
but _very_ few applications even attempt to do this (it's really hard and a
huge amount of boilerplate with most APIs). With CQRS/ES you are really forced
to think about these things up front -- and you can actually avoid most of the
issues. Whether that increases or decreases 'productivity' I don't know, but I
do know that _thinking_ about these things increases correctness.

For me, correctness is paramount. If you don't mind bugs, I can give you an
infinitely fast solution.

[0] I do think ES is an integral component.

~~~
trhway
> any responsible application should track when that data was read and reject
> updates if the backing data has changed since it was read... but very few
> applications even attempt to do this (it's really hard and a huge amount of
> boilerplate with most APIs).

It is done very simply with versioning, usually implemented transparently for
the data-layer API clients.

~~~
Quekid5
It's simple. I'm just saying that (for most frameworks) it's a lot of _work_
and boilerplate. It's also easy to miss individual cases during code review,
etc.

~~~
lensopra
You kind of miss the part where he says "Usually implemented transparently for
the data layer API clients." Transparently as in no work, no boilerplate.

I guess your mileage might vary, but Java has JPA/Hibernate, and .NET has
Entity Framework, and they both make it easy, so I'm going to be surprised if
any major framework or language doesn't make this easy.

The concept is also called optimistic locking, if that makes Googling it
easier. Using that term, I easily found that Node.js's Sequelize supports it
too: [https://sequelize.org/v5/manual/models-definition.html#optimistic-locking](https://sequelize.org/v5/manual/models-definition.html#optimistic-locking)
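
For reference, what those ORMs do under the hood is roughly a compare-and-swap on a version column. A hand-rolled sketch (table and column names invented; SQLite standing in for the database):

```python
import sqlite3

# Optimistic locking by hand: each row carries a version counter, and an
# UPDATE only matches if the version is still the one the client read.
# A zero rowcount means someone else changed the row first.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, "
             "version INTEGER NOT NULL DEFAULT 0)")
conn.execute("INSERT INTO docs (id, body) VALUES (1, 'draft')")

def update_body(conn, doc_id, new_body, version_read):
    cur = conn.execute(
        "UPDATE docs SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_body, doc_id, version_read))
    return cur.rowcount == 1  # False: stale read, caller must retry

# Two clients both read version 0; only the first write wins.
assert update_body(conn, 1, "edit A", 0) is True
assert update_body(conn, 1, "edit B", 0) is False  # rejected as stale
```

The ORM versions hide the version column and the retry, but the mechanism is the same single conditional UPDATE.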

------
spookthesunset
The number of times I’ve seen serious data corruption because “foreign keys
are bad and we can just enforce it in code” is amazing. There is zero excuse
to not use FK’s

Any database that doesn’t use FK’s is almost guaranteed to have crap in it
that didn’t get cleaned up, resulting in data corruption (and yes, dangling
stuff in tables count as data corruption).

Developers aren’t perfect. Shit _will_ slip through even with the most
rigorous process. The database should always provide tools to keep itself
from getting corrupted. This is why good database systems have constraints
like FK’s. If your database software makes using those tools painful (cough
MySQL), get a better database system.

Not using FK’s is just plain old ignorance.

~~~
dkhenry
Every time I have seen a database use foreign keys there has been data
corruption, because everyone thought foreign keys were declarative and not
procedural. Just because you have a foreign key doesn't mean it was always
there or that it applied on every transaction. You can turn them off at the
connection level and you in fact must turn them off for almost any kind of
bulk data load.

You should read Shlomi's post; he gives good reasons. Specifically, this is a
GitHub issue on his tool gh-ost, which is for online schema migration, and FKs
pose lots of problems for online schema migrations.

~~~
spookthesunset
Maybe shitty toy database systems like MySQL let you disable constraints on a
per session basis. A real database system might let you defer them until the
end of a transaction—which is the only correct way to operate. Once you commit
that transaction, what you put into the system better fucking make sense and
it is the job of the database system (and only the database system) to enforce
that.

Like I said before if your database system makes using constraints hard or
lets you shoot yourself in the foot (lol at disabling constraints per
session), don’t just walk but run away from that system.
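
For the record, deferred constraint checking is not exotic; even SQLite supports it, so the behavior is easy to demonstrate (schema invented for illustration):

```python
import sqlite3

# A deferred FK is checked at COMMIT, not per statement: the data may be
# temporarily inconsistent inside the transaction, but nothing
# inconsistent can ever be committed.

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent(id)
            DEFERRABLE INITIALLY DEFERRED
    )""")

# Child inserted before its parent: allowed, because the parent exists
# by the time we commit.
conn.execute("BEGIN")
conn.execute("INSERT INTO child VALUES (1, 10)")
conn.execute("INSERT INTO parent VALUES (10)")
conn.execute("COMMIT")

# A genuine orphan is still rejected, but at COMMIT time.
conn.execute("BEGIN")
conn.execute("INSERT INTO child VALUES (2, 99)")  # no parent 99
try:
    conn.execute("COMMIT")
    committed = True
except sqlite3.IntegrityError:
    conn.execute("ROLLBACK")  # a failed COMMIT leaves the txn open
    committed = False
assert committed is False
```

This is exactly the "defer until end of transaction" behavior described above: inconsistency is tolerated mid-transaction and forbidden at the commit boundary.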

~~~
rhinoceraptor
> MySQL

What other database has been used at such giant scale at so many companies
as MySQL? I'm sure Oracle and SQL Server are used at big companies, but
nowhere near Facebook scale.

~~~
papln
You aren't Facebook. Facebook has put massive work into adding layers on top
of MySQL that your startup has not.

Also, you seem to be forgetting PostgreSQL.

------
davidw
How about this: if your service gets as big as github, then maybe consider
doing odd things to eke out more performance or shard or whatever.

Otherwise: use FK's to maintain stronger data integrity.

As hesk mentions below, in Postgres, you can do all kinds of table ALTERing if
needs be.

~~~
spookthesunset
FK’s don’t just maintain “stronger” data integrity, they are the only way to
maintain relational data integrity. Application code cannot maintain that
relational integrity, period.

Any developer who thinks application code can enforce relational integrity is
naive and does not understand relational database systems. It is impossible
for any layer above the database itself to keep things from getting corrupt.

~~~
Diggsey
This is not remotely true. You can do this with pretty much any database that
supports SERIALIZABLE isolation-level transactions correctly.

The point is that application code has bugs, and it's a lot easier to specify
your constraints declaratively in a single place using a purpose-built DSL
(SQL) than it is to enforce them procedurally every time you access the
database.

~~~
lmm
I'd argue that any decent application programming language can express these
constraints better than SQL can.

~~~
spookthesunset
How? FK constraints are trivially expressed in SQL. It is just a "REFERENCES
table_blah ON DELETE DO BLAH ON UPDATE DO BLAH".
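
A hedged illustration of that one-liner doing its job, with SQLite from Python (table names invented; the REFERENCES syntax is near-identical in PostgreSQL and InnoDB):

```python
import sqlite3

# The declarative constraint is the whole integrity check: an insert
# that references a missing parent row is rejected by the database.

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite quirk: off by default
conn.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
conn.execute("""
    CREATE TABLE children (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL
            REFERENCES parents(id) ON DELETE RESTRICT ON UPDATE CASCADE
    )""")

conn.execute("INSERT INTO parents VALUES (1)")
conn.execute("INSERT INTO children VALUES (1, 1)")  # OK: parent 1 exists

try:
    conn.execute("INSERT INTO children VALUES (2, 42)")  # no parent 42
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
assert rejected
```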

------
jagged-chisel
Worked at a bank years ago. One of their DBAs eschewed FKs in production.
Databases were designed with FKs which were enforced in Dev and QA. If your
app survived testing (automated, QA team ... the entire gamut) without
producing FK violation exceptions, your app could be promoted to production,
where FKs were not enforced, making things pretty fast.

I feel like this was a stopgap on the way to eliminating FKs. I have no idea
what his roadmap was because I wasn't in that group. But his process made for
good thought fodder.

~~~
munk-a
This approach is quite scary to me and I would have argued very vocally
against their rejection of FKs. There are ways to address the performance
issues inherent in FKs: relying on replicas for read serving, periodically
disabling FKs during batch inserts, FK integrity checks on replicas, and so
on. It's also (generally) quite possible to architecturally disentangle
overly large networks of interdependent data in an effort to reduce how
large any single data store you are relying on will get. But disabling FKs
in production is... eh.

I'd much rather see the opposite: declare the FKs on a testing environment
NOT VALID, run application code, VALIDATE CONSTRAINT those you'd marked, and
see if anything is in violation. Ensuring that application code won't (in the
normal course of operation) hit FK constraints can greatly reduce the
performance cost of having them... then in prod you can turn them on and be
confident in your comparably lighter performance loss for data integrity.

~~~
barrkel
FKs add read locks to referenced rows. This limits concurrency, and it is
observable. FKs also constrain your ability to incrementally widen 32-bit
keys once you go over 2 billion rows, if you start out with 32-bit PKs. Two
concrete reasons to avoid FKs in production, or at least to disable them for
longer-running transactions.

I think this perspective is something you only get once you've run bigger
databases in production.

~~~
jeltz
These sound like implementation details. In PostgreSQL FKs just add a shared
write lock to the foreign key columns of the referenced table (i.e. the locks
prevent anyone from updating the primary key of the referenced row). Also what
you said about 32-bits is not true in PostgreSQL either as far as I can
remember.

The costs of having FKs in PostgreSQL are:

1) If you update the primary key of the referenced table you might see issues
with locking. But this is rare in real world applications.

2) Performance overhead from having to check all FKs and taking the locks.
This can be a big issue on some workloads and may force you to add extra
indexes. A PostgreSQL specific issue is that FK checks cannot be done in
batches.

3) Adding new FKs blocks writes to the referenced table, and dropping FKs
locks out both readers and writers of the referenced table. This is a
limitation of the implementation.

~~~
barrkel
And that updates on tables with FKs block 'for update' locks on the referenced
tables.

(Let's not forget, amid all this Postgres-specific chatter, that the article
is about GitHub, who use MySQL.)

~~~
wbolster
not sure about mysql, but in postgresql an update on a table with a foreign
key will take 'FOR KEY SHARE' locks on the referenced table, which is a weaker
type of lock. updates on the referenced table that do not update (primary) key
columns (changing a pk is very uncommon anyway) only need a 'FOR NO
KEY UPDATE' lock, which does _not_ get blocked by 'FOR KEY SHARE' locks. in
fact, the main reason postgresql has these weaker 'FOR KEY SHARE' and 'FOR NO
KEY UPDATE' lock types is for handling of foreign keys.

~~~
Volundr
Also worth noting here, if you ARE updating the primary key on the referenced
table, then locks MUST be taken to ensure data consistency. If that PK isn't
locked, then by the very problem definition you have open transactions relying
on the original PK value.

Those locks would be very challenging to accomplish at the application level.

------
SamuelAdams
That's fine when one application is using the database. I work in the
enterprise space, and we have at least 5 different applications all working
with the same database. Each of these applications has their own team.
Expecting each team to handle constraint checking uniformly in their
applications isn't always feasible, so we use foreign keys.

GitHub's approach might work fine for small, focused teams who have exclusive
control over a product. However, that's not often the reality of most
organizations.

~~~
spookthesunset
> That's fine when one application is using the database.

It's not even fine for a single application. Said application has bugs in it.
Said application can (and will) crash in unexpected ways that leave the
database in an invalid state. And if you think "oh, that will never happen to
me", you just haven't been around long enough. It _will_ happen, every single
time. Any system without foreign keys _will_ have corrupt data in it. Said
system _will_ have user-visible issues as a result.

Just like only the backend can truly validate user input from a webform, only
the database can validate its input.

~~~
Mahn
There are ways to prevent invalid states in a database without foreign keys,
for example enforcing "all or nothing" operations via transactions.
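
A minimal sketch of that "all or nothing" shape, assuming a toy accounts table (SQLite standing in for the database):

```python
import sqlite3

# "All or nothing": if any statement in the transaction fails (here a
# CHECK catching an overdraft), every earlier write is rolled back too,
# so no half-applied transfer survives a bug or a crash.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, "
             "balance INTEGER NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")

def transfer(conn, src, dst, amount):
    try:
        with conn:  # commits on success, rolls back on any exception
            conn.execute("UPDATE accounts SET balance = balance - ? "
                         "WHERE id = ?", (amount, src))
            conn.execute("UPDATE accounts SET balance = balance + ? "
                         "WHERE id = ?", (amount, dst))
        return True
    except sqlite3.IntegrityError:
        return False

assert transfer(conn, 1, 2, 60) is True
assert transfer(conn, 1, 2, 60) is False   # would overdraw: rolled back
assert dict(conn.execute("SELECT id, balance FROM accounts")) == {1: 40, 2: 60}
```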

~~~
Volundr
Until a programmer forgets to wrap the new feature in a transaction, or just
has a bug in their logic that breaks the integrity. Or a db admin working
directly in the database breaks it. Or a schema migration. It's not
impossible, sure, but it requires you and everyone working with the data to
be very, very disciplined, and it's only as effective as the weakest link.

------
SkyBelow
I fixed one too many data corruptions to do away with FKs. If we are large
enough that FKs are causing performance issues, then we are large enough to
invest in and deploy another performance boosting solution while still
protecting the integrity of our database from contractors/employees running
around with direct SQL access (another nightmare unto itself, but one that I
as a developer have less control over than if there are FKs or not).

------
geophile
Wow. No discussion at all regarding the correctness of their databases. Is all
the data correct, or are there FKs referencing missing PKs? Actually, even
discovering this would be difficult, since any apparent FK violation could
just be a view of the middle of an otherwise-consistent update in flight.

I get that cross-shard FKs are especially difficult. But any discussion of
this topic without addressing correctness concerns is woefully incomplete.

~~~
Hamuko
It's not like you can avoid incorrect data just with foreign keys. Say you
have an invoice model. There's a boolean value indicating that the invoice is
final and a numerical value for invoice number. Final invoices must have
invoice numbers. But a bug in your system manages to update an invoice so that
it's final, but missing an invoice number. That row is incorrect and it's
gonna cause an issue somewhere.

I can only imagine how many production databases contain incorrect data.

~~~
CharlesColeman
> There's a boolean value indicating that the invoice is final and a numerical
> value for invoice number. Final invoices must have invoice numbers. But a
> bug in your system manages to update an invoice so that it's final, but
> missing an invoice number.

Couldn't you catch that with a CHECK constraint?

[https://en.wikipedia.org/wiki/Check_constraint](https://en.wikipedia.org/wiki/Check_constraint)
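
A sketch of that CHECK, using the invoice example above (schema invented, SQLite syntax):

```python
import sqlite3

# The invoice invariant as a CHECK constraint: a final invoice must
# carry an invoice number, no matter which code path wrote the row.

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        number INTEGER,
        final INTEGER NOT NULL DEFAULT 0,
        CHECK (NOT final OR number IS NOT NULL)
    )""")

conn.execute("INSERT INTO invoices (id) VALUES (1)")  # draft: fine
conn.execute("UPDATE invoices SET number = 1001, final = 1 WHERE id = 1")

conn.execute("INSERT INTO invoices (id) VALUES (2)")
try:
    # The bug described above: finalized without a number.
    conn.execute("UPDATE invoices SET final = 1 WHERE id = 2")
    caught = False
except sqlite3.IntegrityError:
    caught = True
assert caught
```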

~~~
Volundr
Yes, in fact you could and should.

------
hesk
The main criticism seems to be that the FK relationship makes migrating the
referenced table difficult. But why not remove the FK with ALTER TABLE before
the migration, migrate, and add the FK back again (which will catch any
missing primary keys), preferably inside a transaction?

~~~
mumblemumble
I think the emphasis there is the word "online".

The method you propose might work, but not necessarily well. Catching missing
primary keys at the end is a problem because then you have a bunch of data
integrity issues to sort out, preventing you from re-applying the constraint,
in a live system where apps are relying on the database to do integrity
checks, meaning that there's a decent chance that the running system is
creating integrity issues faster than you can sort them out. Possibly orders
of magnitude faster.

Doing everything inside a transaction may invite concurrency issues around
locking, maybe even deadlocks. Which is again a problem in a live system,
because you might be causing services to fail.

I'm not sold on FKs being something you should never do, ever, but I will say
that I used to work at a place that pushed their database servers very hard on
real-time tasks, and also tended to avoid foreign keys, and my biggest
complaint was not data integrity, it was just that there was extra effort that
needed to go into documenting the logical foreign keys (the keys still exist,
they're just not enforced by a database constraint), since you couldn't just
read them from the database schema.

I'd like to say that this can be something where you default to having FKs,
and remove them as you hit performance problems, except that the poster is
right: Developers have a habit of leaning on database constraints instead of
doing their own sanity checks, and they simply don't know all the places
they're doing it - they're not necessarily even aware they're doing it when
they're doing it - meaning that removing FK's after the fact doesn't get you
to "constraints are enforced by the apps", it gets you to "constraints aren't
enforced". I think you probably want to try and anticipate ahead of time if
you'll have a problem, so that everyone can know from the get-go whether
they'll be working without a safety net.

~~~
vbezhenar
Is it even possible to realistically do FK checks in app code? Nobody uses
serializable transactions, and with other isolation levels it's possible to
check the FK, see that it's fine, and insert the new row while another
transaction simultaneously deletes the referenced row, and both transactions
happily commit. It's called a phantom read. Probably it's a very rare event,
but it'll happen.

~~~
mumblemumble
At the place I worked that skipped FK checks, they used append-only database
schemata. If you limit yourself to CR operations, a lot of things get easier
than they are with CRUD - ensuring (a somewhat weak form of) database
consistency without transactions, yeah, but also things like audit trails and
slurping new data into the BI system.

------
gigatexal
Ahh to fkey or not to fkey. The problem with doing referential integrity in
the app is what happens if a change is made to the database outside of the
app?

I guess it depends. If you’re at the scale where you can’t refactor your
database to keep up and you, after profiling or some other method, have proven
that moving integrity enforcement into the model of your app is faster then by
all means do that.

That being said: foreign keys help the query optimizer make better decisions.
If you're using an ORM, though, that doesn't matter, as you'll get generic
queries and you're likely not profiling your queries anyway.

They also are a way of self documenting the relationship between things.

If you’re going to get rid of foreign keys then why not denormalize so that
the data is easier to query and less likely to get corrupted? Reduce the
number of tables, say, into a single table, à la the DynamoDB approach.

------
fdr
There seems to be a certain mysql tinge to this value calculation. I use
foreign keys and like it (I use Postgres). I do not encounter the problems the
author does. I omit them when they are a problem for some reason. This is rare
in routine work.

~~~
Hamuko
I'm gonna wager a guess and say that most MySQL users do use foreign keys
without a problem. Most MySQL users are not running at GitHub scale.

~~~
fdr
That's probably true. The only tell for that, really, was the complexity
around migrations/schema changes, which seems a lot more painful on MySQL.

~~~
cryptonector
I don't understand why anyone would choose MySQL over PG, honestly.

------
PopeDotNinja
Because I've never thought about this, I'll ask the dumb question...

A shopping cart has many items. An item belongs to a shopping cart. In a
relational database, without foreign keys, how do you associate the shopping
cart with the items?

~~~
greggyb
The suggestion is to not use foreign key _constraints_ in the database.

The suggestion is not to eschew columns which might be JOINed on in a query.

I.e. "Don't enforce FK relationships with a constraint in the DB. Make sure
the values in both tables which may be joined are consistent by using
application code to enforce this."

~~~
spookthesunset
> Don't enforce FK relationships with a constraint in the DB. Make sure the
> values in both tables which may be joined are consistent by using
> application code to enforce this.

For those reading at home who aren't good with databases: this is exactly the
same statement as:

"Don't enforce valid inputs on the backend. The javascript front-end will
enforce it for us".

Always validate your inputs at the appropriate layer. Your backend can never
trust input from a web front-end (even if there is "only one web front-end").
Likewise, your database should never trust input from its clients (even if
there is "only one client"). Ignore this rule at your own peril.

~~~
Aeolun
Except hopefully your customers do not have direct access to your backend.

~~~
spookthesunset
Are your engineers omnipotent beings who never fail, never make mistakes and
understand the system entirely (even all the parts they don't know about)? Is
your QA process perfect? Your servers never crash in weird states?

Lucky for you!

------
devit
All of these are mostly issues with the database engine implementation:

1. "FKs are in your way to shard your database." => not a problem for
distributed databases that can query and duplicate data across shards/servers;
alternatively, if the referenced data just isn't there in the database, a
foreign key is not usable by definition

2. "FKs are a performance impact" => the app either just got the foreign key
value from the database, so it should be cached in memory in a properly
implemented database engine, or otherwise, the application relies on the
database checking so you need the foreign key for correctness

3. "FKs don't work well with online schema migrations." => not a problem with
database engines that properly support online schema changes without locking,
downtime or table renaming hacks

~~~
evanelias
The performance impact is at write time, and having the data cached in memory
is irrelevant. It's the locking overhead that's non-trivial, especially if
each table has several FKs, when operating in a high-volume OLTP environment.

As for "database engines that properly support online schema changes without
locking, downtime or table renaming hacks", please name some. In my
experience, every major DBMS has cases where certain ALTERs block concurrent
DML, which is extremely problematic on very large tables. Or even cases where
there's no locking, but the operation still takes a very long time, which is
conceptually problematic for logical replication.

------
MadWombat
> FKs are a performance impact. The fact they require indexes is likely fine,
> since those indexes are needed anyhow. But the lookup made for each
> insert/delete is an overhead.

You have to check referential integrity _somewhere_. You can do it in the
application or you can let the RDBMS do it for you, but it will have to happen
somewhere. So the overhead argument is kind of bullshit. Yes, managing the
integrity in the app might provide more flexibility, but it comes with a price
of, well, having to do your own integrity checking and introducing more bug
space into your code.

> FKs don't work well with online schema migrations. ... Doing schema
> migration of P is just not going to work. Recall that gh-ost renames the
> table at the end.

So basically, the problem is not foreign keys, the problem is gh-ost
implementation. Since they don't use FKs, they implemented their migrations in
such a way as to make it impossible to support foreign keys.

Yes, in the parent-child example, to migrate the parent you would have to
migrate the child. But it is not an impossible thing to do. Create the new
parent, import the data, map the record ids from the old parent to the new
parent, create new child, drop the foreign key, import the data, update the FK
field using the map and create a new FK. Rename the tables and you are done.
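
That procedure, compressed into a runnable offline sketch with SQLite standing in for MySQL (gh-ost would do the copy online with triggers, but the shape of the work is the same):

```python
import sqlite3

# The parent/child migration above: build the new parent, keep an
# old-id -> new-id map, rebuild the child against it, then swap names.

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child  (id INTEGER PRIMARY KEY, parent_id INTEGER, note TEXT);
    INSERT INTO parent VALUES (7, 'a'), (9, 'b');
    INSERT INTO child  VALUES (1, 7, 'x'), (2, 9, 'y');

    -- 1. new parent with fresh ids; old_id doubles as the id map
    CREATE TABLE parent_new (
        id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, old_id INTEGER);
    INSERT INTO parent_new (name, old_id)
        SELECT name, id FROM parent ORDER BY id;

    -- 2. new child, with parent_id remapped through the old_id column
    CREATE TABLE child_new (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES parent_new(id),
        note TEXT);
    INSERT INTO child_new
        SELECT c.id, p.id, c.note
        FROM child c JOIN parent_new p ON p.old_id = c.parent_id;

    -- 3. drop the originals and rename the replacements into place
    DROP TABLE child;
    DROP TABLE parent;
    ALTER TABLE parent_new RENAME TO parent;
    ALTER TABLE child_new RENAME TO child;
""")

rows = conn.execute(
    "SELECT c.id, p.name FROM child c JOIN parent p ON p.id = c.parent_id "
    "ORDER BY c.id").fetchall()
assert rows == [(1, 'a'), (2, 'b')]
```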

> When eventually you want to shard or extract data out, you need to change &
> test the app to an unknown extent.

Sharding your data is not something you do on a whim. And you will need to
test your app. But you were going to test your app anyway, weren't you? Or did
you think that you could split your database into multiple databases without
any testing? As for "unknown extent", that is kinda bullshit. Look for places
that catch integrity exceptions. That's what you need to look at. Look at your
database schema, and if any of your tables are "ON DELETE CASCADE", go and
punch your DBA in the face. If you are the DBA, you might want to start
running before someone else finds out.

~~~
cryptonector
> > FKs are a performance impact. The fact they require indexes is likely
> > fine, since those indexes are needed anyhow. But the lookup made for each
> > insert/delete is an overhead.
>
> You have to check referential integrity somewhere. You can do it in the
> application or you can let the RDBMS do it for you, but it will have to
> happen somewhere. So the overhead argument is kind of bullshit. Yes,
> managing the integrity in the app might provide more flexibility, but it
> comes with a price of, well, having to do your own integrity checking and
> introducing more bug space into your code.

Perhaps they meant "latency" rather than "performance". With eventual
consistency you can have lower latency, and spend more cycles fixing
referential integrity later.

Of course, sometimes things fail in eventually-consistent systems (e.g., you
order some item from a vendor who claims to have stock but oops! they don't,
so your order will take a few more days to ship than they claimed and now
you're a bit sad but hey, it all works out in the end, right?).

------
malvosenior
So basically they don't use foreign keys because they're into premature
optimization.

If you need to shard your database, you're big enough that you'll have the
resources to engineer the new solution.

The vast, vast majority of use cases will benefit from the database ensuring
the integrity of your data.

~~~
tomlu
Github is probably past the stage of premature optimization.

Presume (for argument's sake) that we accept the premise that foreign keys do
not scale. Then the question is whether there exists some sort of realistic
migration strategy once usage grows past that. If it does, then using FK makes
total sense until we hit that point, doesn't it? And if true this should apply
to 99% of users.

But if there exists no such migration strategy, then maybe the guy has a
point?

~~~
malvosenior
In all of my years of development, I’ve never seen data that can’t be migrated
to a new system. I’m not even sure what that would mean tbh.

------
muglug
I work at Vimeo. We don't have any foreign key constraints in our database
either.

~~~
Mahn
This thread makes it sound like it's impossible to make a medium-scale
application work without FK constraints. I work on a commercial application
that relies on relational data without FKs, and the number of data corruption
cases that we run into on a regular basis is zero. Data integrity is handled
on the application side, unexpected crashes are accounted for by heavy use of
transactions, and everything works just fine.

To be clear, I'm not advocating _against_ the use of foreign keys. But not
using them is perfectly doable and not at all what this thread would have you
believe.

------
brlewis
> It may even rely on FK to cascade deletes (shudder)

Is this just the author's personal taste, or is there something about mysql
that makes ON DELETE CASCADE a bad idea? In postgresql it's a useful tool for
maintaining database consistency.

~~~
Hamuko
I'm guessing it's the possibility for data loss. So someone accidentally
deletes a user and it cascade deletes all invoices referenced to that user
(deleting invoices is a big no-no in accounting).

~~~
brlewis
If the author is thinking of that kind of situation, he's right that ON DELETE
CASCADE would be inappropriate, but he's also incredibly wrong to not want a
foreign key enforcing that relationship. Deleting a user associated with
invoices that should not be deleted is a big no-no too. A foreign key would
not let the user be deleted until after the associated invoices are deleted.

There are plenty of other situations where ON DELETE CASCADE is the best
option.
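
Both behaviors side by side, as a sketch (SQLite, invented schema): RESTRICT blocks the accidental user delete while invoices exist; CASCADE cleans up owned sub-rows automatically.

```python
import sqlite3

# RESTRICT protects invoices from a user delete; CASCADE removes owned
# sub-rows (the address book) when the owner goes away.

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY);
    CREATE TABLE invoices (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id) ON DELETE RESTRICT);
    CREATE TABLE addresses (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(id) ON DELETE CASCADE);
    INSERT INTO users VALUES (1);
    INSERT INTO invoices VALUES (1, 1);
    INSERT INTO addresses VALUES (1, 1);
""")

try:
    conn.execute("DELETE FROM users WHERE id = 1")  # invoices protect it
    blocked = False
except sqlite3.IntegrityError:
    blocked = True
assert blocked

# Once the invoices are handled, the user delete goes through and the
# address book goes with it.
conn.execute("DELETE FROM invoices WHERE user_id = 1")
conn.execute("DELETE FROM users WHERE id = 1")
assert conn.execute("SELECT COUNT(*) FROM addresses").fetchone()[0] == 0
```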

~~~
papln
The problem with ON DELETE CASCADE isn't the CASCADE, it's the DELETE. Almost
always, you want to mark an entity disabled, not really DELETE it. If you
want to DELETE it (for GDPR?) you should have something in place to fail your
delete unless you've properly defined how to clean up dangling keys. (Perhaps
what you need to do is delete the non-primary-key fields containing user data,
but keep the row for relational integrity.)

ON DELETE CASCADE is used for deleting logical sub-components (a row owned by
another row) in a well normalized database -- it's for deleting your address
book when you delete your account. It's not for deleting all your friends when
your account is deleted.

~~~
brlewis
> It's not for deleting all your friends when your account is deleted.

Very nice summary up until that point. But in a well normalized db the friends
relationship (many-to-many) would be its own table. Friend relationships with
the deleted user would be deleted, as they should be. Friends would not. (You
still have time to edit for a better example.)

------
doctor_eval
I had a database with no RI and it was a nightmare, filled with bugs.

So we added RI and it was awesome, and the bugs did go away with time.

But then we became dogmatic and had to have RI for everything, including
things like maintaining local RI lookup tables for non-local data. This just
bogged us down so much that we couldn't move.

Today, we use RI strictly within a single database. For foreign keys to remote
databases, we assume the remote database knows what it’s talking about and
that the incoming data has already been checked.

This works really well for us and in particular sharding is not a problem
(sharding being a design decision not an implementation decision IMO)

Someone said elsewhere that Rails does the RI so the DB doesn’t need to. I
don’t use Rails, and I don’t bet, but I’d wager that it’s faster to do RI
inside Postgres than inside Rails.

The idea that we should throw out ALL RI is just dogmatic nonsense. I should
know :)

~~~
mceachen
Rails in 2010 was just coming around to adopting foreign keys, and the rubygem
Foreigner made it happen before it was a proper rails feature.

The idea of rails enforcing unique constraints and FKs back then was...
Optimistic. The design constraints assumed exactly one rails instance would
access the database at any given time to ensure consistency.

Not a great assumption.

------
jackcviers3
I thought foreign keys were necessary for the consistency part of ACID
compliance. Once you remove the integrity constraint, the database cannot
guarantee consistency, even with isolation and atomicity.

Sure, removing the constraint will get you a huge burst in performance, but
now you are in charge of guaranteeing consistency outside of transactions in a
distributed system involving at least one client and one server. That's a tall
order. As Rúnar Bjarnason put it: "Constraints liberate, liberties
constrain"[1]. If you give something the freedom to be inconsistent at any
extreme scale, Murphy's law requires that anything that can go wrong, will go
wrong. It is hubris to think that the average developer who hasn't spent years
becoming a consistency expert is going to be able to guarantee that data won't
become corrupted better than the engineers who build and maintain, and the
academics who study, RDBMS systems full-time. Even experts get this wrong all
the time [2]. To all junior devs out there, until you know you need to abandon
a guarantee available to you in a given system, exhaust every other possible
option.

1\. [https://youtu.be/GqmsQeSzMdw](https://youtu.be/GqmsQeSzMdw) 2\.
[https://aphyr.com/posts/282-jepsen-
postgres](https://aphyr.com/posts/282-jepsen-postgres)

~~~
mumblemumble
Consistency, in the context of ACID, just means that the integrity constraints
that exist are enforced. It does _not_ mean that constraints that don't exist
should be enforced.

------
wisnesky
Many people on this thread have asked if there is a rigorous way to migrate
data between relational schemas that contain non-trivial foreign keys, and the
answer is yes: In the open-source categorical query language CQL, data
migrations necessarily respect foreign keys - a property guaranteed by the
mathematics of category theory and the automated theorem prover that ships
with CQL. [http://categoricaldata.net](http://categoricaldata.net)

------
CGamesPlay
Oh man, that migrations point. I remember at my last company needing to handle
this. We had some (huge) read-only tables that were referenced by other tables
and we needed to update the read-only tables in batches from raw mysql dumps.
The keys in the dumps were stable, so we could swap out the table and the
references would still be valid. Since the referenced tables were huge, they
took a long time to upload the batches, and we needed a migration that was
ACID and fast.

The process ended up looking like this:

1\. Create a `foo_load` and a `foo_unload` database, dropping any that already
exist.

2\. Populate `foo_load` with all of my new data.

3\. Rename `foo.table` to `foo_unload.table` and `foo_load.table` to
`foo.table`. This is atomic, fast, and updates references to point to
`foo_unload.table`.

At this point the application reads from the new data, but referential
integrity checks still verify against the old data.

4\. Update each reference to `foo_unload.table` to point to `foo.table` using
ALTER TABLE DROP FOREIGN KEY, ADD FOREIGN KEY. Again this is atomic and runs
quickly.

5\. Drop the `foo_unload` database.

Now this process is slow, but the switchover is 2 atomic steps and requires
locking the tables for only a few milliseconds. The process is safe as well:
if it crashes at any step, running the script again will cause it to safely
recover. Well, with one more step zero:

0\. If `foo_unload` exists, run the reference-update step (step 4) first.

This solution worked for us for a few years. Eventually we swapped out the
whole architecture around these tables and ended up with a more traditional
ETL flow that didn't involve moving around huge mysql dumps.
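In MySQL terms, the swap described above might look roughly like this (database, table, and constraint names are hypothetical, and the bulk load is elided):

```sql
-- Fresh scratch databases.
DROP DATABASE IF EXISTS foo_load;
DROP DATABASE IF EXISTS foo_unload;
CREATE DATABASE foo_load;
CREATE DATABASE foo_unload;

-- ... bulk-load foo_load.t from the dump here ...

-- Atomic swap: one statement, so readers never see a missing table,
-- and existing FK references follow the renamed table into foo_unload.
RENAME TABLE foo.t TO foo_unload.t, foo_load.t TO foo.t;

-- Re-point each referencing table at the swapped-in data.
ALTER TABLE bar
    DROP FOREIGN KEY fk_bar_t,
    ADD CONSTRAINT fk_bar_t FOREIGN KEY (t_id) REFERENCES foo.t (id);

-- Discard the old data.
DROP DATABASE foo_unload;
```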

------
redact207
One great thing about not enforcing FKs is that it makes integration testing a
lot easier. You can load just the data you need for the test, without having
to satisfy FKs that point to rows outside the testing scope.

This approach isn't for everyone. It works well with DDD, where aggregates
form contextual table boundaries and the system is eventually consistent by
default.

~~~
yellowapple
If you ain't testing against the full system state, then I'd be hard-pressed
to call that "integration testing".

If you're generating the test data from scratch, then it shouldn't be hard to
generate the dependent data while you're at it. If you're testing against
(anonymized) production data, then it shouldn't be hard to pull the dependent
data while you're at it. In either case, you should be validating the
integrity of those data relationships as part of the test criteria.

~~~
Ma8ee
It isn’t hard, but often very cumbersome. Oh, so you want to test invoicing,
for this Customer, which must have a Delivery Address and an Invoicing Address
which both need valid Postal Codes. And the Customer must have a Contact
person. Then we need the Product, which must consist of at least one Article,
and each Article must be connected to the Company that we bought it from, with
Addresses and Contact Persons and half a dozen other entities. But we have
forgotten that the Article also needs to be in a Category so we know which
Sales Tax or VAT needs to be applied, for the State or Country of Sale. Then
of course, which Sales Person, belonging to a Sales Office, with contact
person, addresses etc., should get credited for the sale. And we still haven’t
been able to actually create the order yet, because we haven’t created the
delivery options.

I think you get my point. I’ve easily used more than a full day just to get
enough data into a naked system to make just the simplest test. I’m very
grateful for tools like tSQLt that make unit testing possible (by temporarily
turning off FKs).

~~~
yellowapple
That all seems pretty reasonable to me, and darn well should be included when
"test[ing] invoicing":

\- You surely want to make sure invoice creation depends on a valid billing
address at the very least, right? If that breaks, you're gonna have a lot of
rather irate AR clerks.

\- You surely want to make sure the Contact is aware of the new invoice on the
order, right? In fact, that might very well be part of the Contact's
performance metrics, so if that breaks, you're gonna have a lot of rather
irate sales/support reps (whatever "Contact" means in this context).

\- You surely want to make sure your invoices are against valid items, right?
And you surely want to make sure that expected revenue correctly ties back to
an inventory movement, which in turn ties back to an inventory receipt, which
in turn ties back to a paid invoice to a vendor, right? If that chain breaks,
you're gonna have a lot of rather irate accountants and financial auditors.

\- You surely want to make sure you're getting the right sales tax
calculations, right? If that breaks, you're gonna have a lot of rather irate
accountants, financial auditors, _and_ tax collectors.

\- You surely want to make sure the Sales Person and Sales Office both get
credit for the invoice, right? If that breaks, you're gonna have a lot of
rather irate sales reps and managers thereof.

So... no, I ain't exactly getting your point, lol. If you're changing
invoicing, then _all of the dependencies and dependents of invoicing ought to
be tested_.

But more to my point:

"very grateful for tools like tsqlt that make unit testing possible"

Unit testing != integration testing. If you're _unit_ testing, then sure, turn
off foreign keys and practice your quick draw while you cowboy it up. If
you're _integration_ testing, then that inherently means testing the whole
system _as a whole_; what you call "cumbersome" I call "the bare minimum of
comprehensiveness".

"I’ve easily used more than a full day just to get enough data in a naked
system to make just the simplest test."

That's usually pretty easy to script, even with foreign key constraints. It
might take you _a_ day, but future days should be able to call upon that same
script, saving you quite a bit of time :)
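As a sketch of that kind of script (tables and columns invented for illustration), each fixture helper can create its own FK dependencies, so individual tests stay short even with constraints turned on:

```python
import sqlite3

def make_invoice_fixture(db):
    """Create the minimal FK-satisfying chain needed to test an invoice.

    Each helper inserts its own dependencies, so a test only asks for the
    thing it is actually testing.
    """
    cur = db.execute("INSERT INTO customer (name) VALUES ('Test Co')")
    customer_id = cur.lastrowid
    cur = db.execute(
        "INSERT INTO address (customer_id, postal_code) VALUES (?, ?)",
        (customer_id, "90210"),
    )
    address_id = cur.lastrowid
    cur = db.execute(
        "INSERT INTO invoice (customer_id, billing_address_id) VALUES (?, ?)",
        (customer_id, address_id),
    )
    return cur.lastrowid

db = sqlite3.connect(":memory:")
db.execute("PRAGMA foreign_keys = ON")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE address (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        postal_code TEXT NOT NULL
    );
    CREATE TABLE invoice (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        billing_address_id INTEGER NOT NULL REFERENCES address(id)
    );
""")
invoice_id = make_invoice_fixture(db)  # all FKs satisfied, no pragma tricks
```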

~~~
Ma8ee
I don't think we disagree about much. My only objection was to

> then it shouldn't be hard to generate the dependent data while you're at it

which makes it sound like somewhat light work. My point was only that it
isn't. And of course we script a lot of this test-data creation, but then you
need different kinds of data for different tests, and need to add some
flexibility. And after a while, just generating test data becomes somewhat
complex in itself. And all of the test-data scripts need to be maintained and
changed whenever the model changes. None of this is a breeze.

~~~
yellowapple
One trick here is that if you're integration-testing, your tests are hopefully
already generating this data for you (if not, then you should probably be
writing more tests, lol), in which case you'd make the tests you _do_ want to
run dependent on the tests which would generate that data (for example, the
Invoice tests would be dependent on the Sales Order and Customer and Product
and Address and Contact and Sales Rep and etc. tests, so those tests will
generate the requisite data, and the Invoice tests will use that data to
create and test Invoices).

------
rukuu001
Let's just consider some of the 'non-technical' issues surrounding the use of
foreign keys.

I've encountered resistance because

\- Using them would require up-skilling existing devs

\- They'd highlight crap data coming in from other parts of the org, which
would be politically uncomfortable

\- DB would need to be re-configured

\- We're agile now and we can't be locked into those kind of constraints

edit: formatting

------
anshul008
The performance of my team's application at Myntra (Bangalore) was badly
affected by foreign keys on MySQL and we decided to drop them.

The trick we used was to drop them only in production, not in test
environments. In test environments they acted as guardrails to ensure that our
application did not break the constraint.

So in a way it gave us the best of both worlds.

------
uses
I'm confused, how do you design a database where things in A can't reference
things in B? It's not relational at that point right?

Or is this just about the problems of DBs where you enforce referential
integrity? In which case, if that's such a problem, like just _don't enforce
referential integrity_... right?

------
dekhn
I have been perpetually annoyed at the SQL/RDBMS/relational calculus model. It
always feels like a huge context shift from imperative programming. After many
years of writing SQL, I noticed that many other people end up writing SQL
statements that look more or less like computer programs (CASE statements,
subselects, etc.).

It all came to a head when I naively asked an experienced SQL developer how to
represent a tree in SQL and learned one way to do it was to have CHILD nodes
with FK references to PARENT nodes. So any time you want to get all the
CHILDREN of a PARENT you have to query all children to see if they have a FK
to the appropriate PARENT.

This always seemed strange and backwards to me and when I did a deeper dive I
learned that tree representations in SQL are a rabbit hole of insanity all
arranged around referential integrity.

~~~
camnora
An alternative is using a nested set
[https://en.m.wikipedia.org/wiki/Nested_set_model](https://en.m.wikipedia.org/wiki/Nested_set_model)
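A minimal sketch of the nested-set idea (bounds assigned by hand here; real implementations maintain them on insert and move): each node stores `lft`/`rgt` bounds, and fetching a subtree becomes a single range query instead of a walk over parent pointers.

```python
import sqlite3

# A tiny nested-set tree: root -> (a -> a1, b). A node's descendants are
# exactly the rows whose bounds fall strictly inside its own.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (name TEXT, lft INTEGER, rgt INTEGER);
    INSERT INTO node VALUES
        ('root', 1, 8),
        ('a',    2, 5),
        ('a1',   3, 4),
        ('b',    6, 7);
""")
descendants = db.execute("""
    SELECT name FROM node
    WHERE lft > (SELECT lft FROM node WHERE name = 'root')
      AND rgt < (SELECT rgt FROM node WHERE name = 'root')
    ORDER BY lft
""").fetchall()
print([n for (n,) in descendants])  # ['a', 'a1', 'b']
```

The trade-off: reads are one range scan, but inserting or moving a node means renumbering bounds across the tree.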

~~~
cryptonector
No. Just use an RDBMS that doesn't suck -- one that supports WITH RECURSIVE
CTEs.
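For example, with the parent-pointer layout the grandparent comment describes, a recursive CTE fetches a whole subtree in one query (sqlite shown here; PostgreSQL and modern MySQL accept the same shape):

```python
import sqlite3

# CHILD rows carry an FK to their PARENT row, as described above.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE node (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES node(id),
        name TEXT
    );
    INSERT INTO node VALUES
        (1, NULL, 'root'),
        (2, 1,    'a'),
        (3, 1,    'b'),
        (4, 2,    'a1');
""")

# All descendants of node 1, in a single query.
descendants = db.execute("""
    WITH RECURSIVE subtree(id, name) AS (
        SELECT id, name FROM node WHERE id = ?
        UNION ALL
        SELECT n.id, n.name
        FROM node n JOIN subtree s ON n.parent_id = s.id
    )
    SELECT name FROM subtree
""", (1,)).fetchall()
print([name for (name,) in descendants])
```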

------
ken
GitHub is a special case, perhaps in all of computerdom, in that most of their
public data is stored not in (SQL) databases, but in git repositories. That's
rather unique!

So we know they must already have a great indexing and search system in place,
and they can't use foreign key constraints on most of their data, anyway. Git
simply doesn't have that. It's not that kind of database.

Given this environment, it's not surprising they don't use FKs. It wouldn't
surprise me if they don't use JOINs much! Throw it all in a repo, and let the
indexer sort it out.

That's great for them, but my data isn't shaped like that. Data types and
referential integrity are more important to me than history and bisection.

------
edlebert
Also, to be noted, GitHub has been a Rails app since long before there was
native support for adding foreign keys. I've worked on a lot of large
monorails that were created in Rails' early days and none of them have foreign
keys.

------
amingilani
How would a Rails app be impacted by this? How would you create one-to-many,
or many-to-many relationships?

I'm sorry if this question sounds silly, but I've never heard this line of
reasoning before and I'm fascinated by it.

~~~
steveklabnik
GitHub is a Rails app. [1] Rails does not create foreign keys by default, so
most Rails apps are doing exactly this.

I _think_ you may be missing the distinction between the database feature
"foreign keys", and the general concept of an entry that refers to another
entry. Rails does the latter, but not the former, unless you explicitly do it
yourself.

(Maybe Rails has changed dramatically in the last few years, but up until at
least Rails 4, this was the case)

1: Obviously it's also way more, but the main codebase is a Rails app, and has
been forever. This thread is from 2016, so that was probably even more true
then than now.

~~~
rexpop
Thank you, this was helpful for me. I still don't, frankly, recall (did I ever
know) how Rails implements the latter without the former.

~~~
steveklabnik
No problem!

The summary is basically this: if you say that a User has_many foos, then the
foos table will have a user_id in it. When you ask for the foos a user has,
the SQL will be emitted that has the "user_id=1" or whatever clause on it.

If you add a foreign key constraint, the database itself will verify that the
foos have valid user_ids. If you don't, then it won't. You're basically giving
up the ability for the database to verify the relationship for you. Like any
kind of check, this is more restrictive, but you get some value out of it. The
question is whether the validation is worth the restrictiveness. The OP
comment argues no. Many would argue yes. It really just depends.

~~~
rexpop
Yes, the word "constraint" is confusingly absent (or, rather, used
colloquially) in the OP.

> FKs impose a lot of constraints on what's possible and what's not possible.

Thanks again for the further clarification.

------
wwarner
To me the issue is that the data model should be defined in a single place. If
you're using foreign keys, then the development process needs to take that
into account and provide a sensible way to modify the data model in a way that
the FKs keep up with the evolution of the product. It's going to mean a lot
more migrations for sure.

I personally have never seen foreign keys implemented correctly -- they've
captured an outdated version of the data model and are just in the way. For
me, they're similar to stored procedures, usually to be avoided.

------
chase-seibert
This decision makes sense from the perspective of the person who owns doing
database migrations, or who owns the performance of the database.

However, the other 99% of the engineers in the company end up dealing with the
fallout of not having FKs -- more bugs, and slower development velocity.

One alternative is to have FKs in smaller, sharded databases, where you don't
have to migrate between shards. For example, all the database rows related to
one repo could be co-located in a shard with strong FKs between them.

------
nickpeterson
My advice, put constraints that make sense as you're developing an
application, and marvel at how often you accidentally violate them while
writing trivial code. You'll realize pretty quickly that even when you know
the whole area of what you're working on, you still screw up. FKs might not
add a lot of value on bet-the-company, mission-critical, heavily tested code.
But on less rigorous codebases they will save your ass.

------
ctataryn
While we’re at it: should databases handle transactions or should we let the
application deal with it all?

IMHO The answer is the same for this as with foreign keys.

~~~
cryptonector
Ahh, but what is your answer to both of those then?

------
dylanz
Application vs Integration databases:
[https://martinfowler.com/bliki/IntegrationDatabase.html](https://martinfowler.com/bliki/IntegrationDatabase.html).
If you're going to have one application writing to your database, having your
constraints live in your application is definitely a viable option.

~~~
tiglionabbit
Not if you have multiple processes accessing the database simultaneously
without some additional coordination layer.

------
jerzyt
Granted, I have never worked with a database the size of GitHub's, so I've
never had to deal with sharding issues. But even though foreign keys can be a
PITA in some situations, I prefer the DBMS to do it for me rather than rely on
the application. Foreign keys have always had my back. I've been burned a few
times when they weren't used.

------
jayd16
I'm curious what these comments think they're saying. What effort went into
GitHub such that they feel they can make the claim that FKs are an
anti-pattern? Do they think they're just better than everyone else, or is
there some other practice that makes this claim more interesting than simple
hubris?
------
arpa
Last time a DBA suggested dropping foreign keys for performance reasons, I
suggested he write the code ensuring data consistency with distributed
parallel writes. Haven't heard anything about FKs from him since.

------
megapatch
Confusing. Foreign keys are a vital part of RDBMS and very useful. He is
talking about foreign key _constraints_, which can be problematic as
described.

------
desc
Er, what? Foreign keys are not necessarily a net performance _cost_. Depending
on the scale of the database, they can be essential for fast queries.

------
daveisfera
FKs are a tool. Sometimes it's great to use a hammer for a nail, but sometimes
a screwdriver just seems to work better when working with screws.

------
barrkel
Enable FKs for test and development, disable them in production, for tables
that see billions of rows.

FKs have other downsides. They add locks on the referenced rows, to ensure
they don't disappear before the transaction commits. If you start out with
32-bit primary keys, you have a major pain when you go over 2 billion records
- incrementally upgrading all your FKs before your PK breaks your referential
integrity, since FKs generally need to have the same type on each side.

------
postit
I'm working with a company that also refuses to use FK and indexes.

We have a huge DB cluster instead.

~~~
nwallin
> I'm working with a company that also refuses to use [...] indexes.

... ?

------
baby
Never used them in my personal projects either, even though I learned about
them in school. It just didn’t make sense to me: why would I constrain myself
in the database when I could manage all of this in the application logic?

And if you want to experiment with different schema it’s just a nightmare.

------
kenshaw
Foreign keys should definitely be used in most schema designs. While I can
agree that they are problematic when doing a schema migration, a schema
migration would only happen "very rarely" whereas inserts/deletes happen
regularly. I believe this comment is more of a reflection of the specific
database (MySQL) which has relatively poor FK performance and adjunct issues
when compared to other SQL databases.

Also, from a process perspective of migrating a schema, the ideal mechanism
would be to incrementally copy from the existing schema to the new schema
(meaning there is no breakage in the interim). Granted (and somewhat
obviously) for databases as large as what GitHub is likely using, this is not
possible -- and so a "full migration" at once is their only option. However,
part of that "full migration", to me, would be to _remove_ the original
foreign keys from the schema, rename/delete/alter tables/fields/indices, and
then after all of that is complete, as a final step, re-add referential
integrity (ie, foreign keys).

I'd specifically like to point out the following:

1) This comment gives advice about all SQL databases based on the conclusion /
operation of only one database (MySQL), whereas the same base facts do not
apply to others (eg, PostgreSQL, Microsoft SQL Server, ...), and should not be
used to evaluate whether using FKs makes sense for other databases or not.

2) Schema migrations should be done rarely, if ever. A better approach would
have been to start with a schema design that was less prone to _needing_
change. Regardless, even if schema migrations are done with _regularity_, that
frequency would still be orders of magnitude less regular than, eg, code
changes or CRUD operations. As such, throwing away FKs simply to accommodate
either a poor or not-thoroughly-thought-out schema is throwing the baby out
with the bathwater, especially since the purpose of FKs is to prevent other
tiers of the application from losing or otherwise corrupting the structured
data storage.

3) FKs are useful when doing CRUD operations. They are not useful during
schema changes. As such, the proper process in this case would be to
temporarily remove FKs, perform the schema migration, and then re-institute
the FKs. This may not always be possible, due to, eg, active business logic
components or some such that can't allow the database to be taken offline to
perform a migration (see point #2). In that instance, there are a number of
other mitigations that can be put in place in the logic tiers, such as being
able to recognize different schema revisions, and deploying that _prior_ to
doing a schema change. As such, one would first migrate an active "smart"
logic tier that can recognize different schema revisions, and allow for
incremental migration of schemas without enacting a cluster-wide
lock/disabling of reads/writes.

~~~
cryptonector
2b) And keep your schema changes as incremental and backwards-compatible as
possible. Add new tables, add columns to tables, but don't drop or rename
anything. If you need a flag day, in PG terms just setup a new schema, use
triggers to sync the two, and migrate your apps to use the new schema -- when
they're all migrated, drop the sync triggers and the old schema.

------
bfrog
Well when you use MySQL, you end up with MySQL hacks like this

~~~
spookthesunset
Dunno why you are getting downvoted. MySQL is such a poor database that it
always leads to hacks. The people who use MySQL almost always get burnt by
something and figure all DBs are like that. Then people who have never worked
with non-toy databases assume all databases are like MySQL and dismiss the
whole field, which is a complete shame.

The fact is, a real database system like PostgreSQL is one of the coolest,
most underutilized tools developers have in their toolbox. Good databases can
do _so much cool shit_ to help you. It is amazing how well a good database
system can take your weird crazy query and return something in mere
milliseconds.

------
hootbootscoot
while most rdbms-based projects will continue using foreign key constraints
(and all will be referentially cast in stone), and postgressives will enjoy
100000x query speedups by remembering to include your
hornsnaggle_tree_index_lookup_speeder_upper index,

there is a certain fanatical mindset from the rdbms world that sounds a bit
like:

never forget always include jquery with every web app you make. it's made by
real profeshunalz who know more than you and lots of people already included
it and you should simply just learn to accept that all front-end web problems
have already all been solved by jqueremy and you should simply refactor it all
in terms of jqueremy. anything else is korrupshun. In other words, always only
ever consider the same tools and techniques. LOL

A mere search for other database types produces pages upon pages of "why sql
is beating nosql" and all stack overflow answers always doom you to the 7th
level of hell for anything but 3rd normalized form. hey let's all use lookup
tables for many to many. it's a cult. it's nuts. anyone who has spent any
amount of time with database-backed applications is surely aware of the
tradeoffs they are making when choosing non-relational databases of any kind.

The obsession with forcing any and all use cases into the rdbms mindset
makes me say "well then, let's just flatten everything even further into
mapped key-value a la ohm." in which one encodes the key with path info and
the value with the value, etc.

if the act of suffering for the sake of suffering and breaking things into
iddy biddy pieces is virtuous in and of itself, surely this is a more true
pure and rigorous religion to bend every single thing to.

(before smarty pants says "um, querying" may I remind you of adding another
entire secondary index of any kind via adding it with the id of the first key
as its value.)

I'm not advocating everyone do this. no freakouts pls.

ideally we would just persist our objects, which is coming with persistent
memory, but i digress...

This Lance fellow seems to have taken plenty of "you have no idea what you are
talking about moron!" abuse, despite having a phd in computation theory and
writing erp software since the dinosaurs roamed the earth. (I'm also not
advocating his solutions for any and all problems, but not taking his ideas
seriously is a mistake IMO) [https://codeburst.io/databases-the-future-buggy-
whips-of-sof...](https://codeburst.io/databases-the-future-buggy-whips-of-
software-31a716b0ed2b)

[https://codeburst.io/doing-without-databases-in-the-21st-
cen...](https://codeburst.io/doing-without-databases-in-the-21st-
century-6e25cf495373)

mongodb is an incredibly powerful tool for when you want the whole json, all
the json, and packed as bson. When you want to store/dish out blobs and not
join nor have no use for the parts separately, then you can do that nicely.

(besides this usual story about "for when you don't know your schema or have
lots of irregular semi-structured data, stripe, your mailgun webhooks,
whatever". It's a great DB, for some things, despite the people saying "since
it didn't do this one thing i really obsess over and think all dbs should
always do, therefore it's crap".)

(well, copying the same master-slave with single-write master failover or
vertically scale story as the rdbms people was probably a weak point...)

pg as a jsonb store with secondary indices on actual columns LOL...

I guess we just basically need to accept that the same tools can be used as
primitives/components in different contexts in terms of the actual
application.

Some people use Redis as a primary datastore.

zealotry is best avoided, I guess..

look, this silo-ing stuff is also silly, and a product of the corporate
dysfunctional mentality: split everyone up and get DBA's warring with Devs...

~~~
hootbootscoot
[http://www.dbta.com/Columns/Database-Elaborations/The-
Refere...](http://www.dbta.com/Columns/Database-Elaborations/The-Referential-
Integrity-Workaround-117422.aspx)

------
shayanjavadi
Genuine question, what are the alternatives?

~~~
Hamuko
You have columns that refer to the PKs of other tables, just without
constraints. So your invoice table would have a column `user_id`, which tells
you which user the invoice is for. But if you enter an invoice for user ID
999,999 when you have no users, the database eats it happily, since it has no
opinion as to what that data should be.
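A toy demonstration with sqlite (schema invented to mirror the example above): the same orphan insert is accepted without the constraint and rejected with it.

```python
import sqlite3

results = []
for with_fk in (False, True):
    db = sqlite3.connect(":memory:")
    db.execute("PRAGMA foreign_keys = ON")
    ref = "REFERENCES users(id)" if with_fk else ""
    db.executescript(f"""
        CREATE TABLE users (id INTEGER PRIMARY KEY);
        CREATE TABLE invoice (id INTEGER PRIMARY KEY, user_id INTEGER {ref});
    """)
    try:
        # invoice for user 999,999 when no users exist
        db.execute("INSERT INTO invoice (id, user_id) VALUES (1, 999999)")
        results.append("accepted")  # no constraint: the database eats it
    except sqlite3.IntegrityError:
        results.append("rejected")  # constraint: invalid reference refused

print(results)  # ['accepted', 'rejected']
```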

------
sarah180
"We do not use foreign workers" (2019) (gitlab.com)

(Yes, I know it's more nuanced than that.)

------
andreimatei1
At GitHub they should be using CockroachDB (:P), a scalable relational
database that also supports schema changes. Then their FKs would work just
fine (albeit with some performance implications).

~~~
hw
Or, just not use FKs at all. There are many factors to consider when deciding
on a DB, and being able to use FKs is probably not high on the list.

~~~
brlewis
>being able to use FKs is probably not high on the list.

No, that's not at all probable. For many, "database" is synonymous with "ACID
database", for which foreign keys are important.
[https://en.wikipedia.org/wiki/ACID#Consistency](https://en.wikipedia.org/wiki/ACID#Consistency)

------
bregma
Build a firewall to keep them foreign keys out. Make the database great again.

