Hacker News new | past | comments | ask | show | jobs | submit login

This approach is quite scary to me and I would have argued very vocally against their rejections of FKs. There are ways to rely on replication for read serving, periodic disabling of FKs during batch inserts, FK integrity checks on replications, ... these can all address the performance issues inherent in FKs, it's also (generally) quite possible to attempt to architecturally disentangle too large networks of interdependent data in an effort to reduce how large any single data store you are relying on will get. But disabling FKs in production is... eh.

I'd much rather see the opposite, declare the FKs on a testing environment NOT VALID, run application code, VALIDATE CONSTRAINT those you'd marked and see if anything is in violation - ensuring that application code won't (in the normal course of operation) generally hit FK constraints can greatly reduce your performance hit from having them... then in prod you can turn them on and be confident in your comparably lighter performance loss to data integrity.

> This approach is quite scary to me...

No kidding. My first thoughts went to referential integrity, and getting misbehaving apps successfully past testing...

FKs add read locks to referenced rows. It limits concurrency and it is observable. FKs constrain your ability to incrementally widen 32-bit FKs once you go over 2 billion rows, if you start out with 32-bit PKs. Two concrete reasons to avoid FKs in production, or at least disable them for longer running transactions.

I think this perspective is something you only get once you've run bigger databases in production.

These sound like implementation details. In PostgreSQL FKs just add a shared write lock to the foreign key columns of the referenced table (i.e. the locks prevent anyone from updating the primary key of the referenced row). Also what you said about 32-bits is not true in PostgreSQL either as far as I can remember.

The costs of having FKs in PostreSQL are:

1) If you update the primary key of the referenced table you might see issues with locking. But this is rare in real world applications.

2) Performance overhead from having to check all FKs and taking the locks. This can be a big issue on some workloads and may force you to add extra indexes. A PostgreSQL specific issue is that FK checks cannot be done in batches.

3) Adding new FKs block writes from the referenced table and dropping FKs lock out both readers and writers from the referenced table. This is a limitation of the implementation.

And that updates on tables with FKs block 'for update' locks on the referenced tables.

(Let's not forget, amid all this Postgres-specific chatter, that the article is about GitHub, who use MySQL.)

not sure about mysql, but in postgresql an update on a table with a foreign key will take 'FOR KEY SHARE' locks on the referenced table, which is a weaker type of lock. updates on the referenced table that do not update (primary) key columns (changing a pk is very uncommon anyway) will suffice with a 'FOR NO KEY UPDATE' lock, which does not get blocked by 'FOR KEY SHARE' locks. in fact, the main reason postgresql has these weaker 'FOR KEY SHARE' and 'FOR NO KEY UPDATE' lock types is for handing of foreign keys.

Also worth noting here, if you ARE updating the primary key on the referenced table, then locks MUST be taken to ensure data consistency. If that PK isn't locked, than by the very problem definition you have open transactions relying on the original PK value.

Those locks would be very challenging to accomplish at the application level.

In which case you can run FOR NO KEY UPDATE.

This sounds like an implementation specific detail (no doubt for mysql) that for all you know has been fixed in later versions but gets passed around as if it is a permeant truth that applies to all database servers.

FK consistency can't be guaranteed without ensuring the referenced rows don't disappear before the transaction has been committed. Think about it.

Yes, but it can be done without locking out reads. The only thing you need to lock against is someone changing the primary key of the referenced row or deleting the row.

PostgreSQL has implemented this minimum level of necessary locking for quite many years now.

I never said anything about locking out reads. Did I?

FWIW, I just tested in Postgres. Locks like I said it does:

    A: create table parent(id int, value int, unique(id));
    A: create table child(id int, parent_id int references parent(id));
    A: insert into parent values (1, 10);
    A: begin;
    A: insert into child values (1, 1);
    B: begin;
    B: select 1 from parent where id = 1 for update;
    B: (blocks)
The situation in MySQL is worse because it'll block updates on any field, not just row locks.

I have specific experience of this due to use of database locks at the application level to avoid deadlocks (different lock orders) and inconsistent updates (updates based on reads across multiple tables that may have separate racing updates) by locking rows up front. For understandable schema reasons, what is logically a parent entity is the natural thing to lock, but for understandable performance reasons, FKs to the parent entity are distributed through some fairly large tables.

Ok, now think this the rest of the way through. Without FKs, how do you, at the application level, ensure that A and B don't commit separate changes (A to child, B to parent) that break consistency?

If A tries to insert a child for 1, and B changes the id to 2.... OOPS! And from both's perspective it looks perfectly safe.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact