Hacker News new | past | comments | ask | show | jobs | submit login
We do not use foreign keys (2016) (github.com/github)
349 points by Scarbutt on Nov 8, 2019 | hide | past | favorite | 320 comments

When posts like these come up, I'd like to remind people that context matters when making technical decisions. What works for large companies with huge scale (GitHub, Google, Facebook) may not work for you.

As a counter point to the linked issue, I operate a few small applications. Foreign-keys (and constraints in general) are great at ensuring that invalid data doesn't find its way into your database. Yes, they have a performance cost. Yes, they make sharding more difficult. In my experience, at smaller scale the trade-offs are worth it. YMMV

In the same vein, I'd like to remind people that you are probably not a "temporarily low-scale big-data company", in the same vein as a temporarily embarrassed millionaire. In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.

CQRS is one of the biggest examples I've seen personally for this - your flexibility will go down, work to do "basic" things will go up, and you will really lose a lot of speed solving for 10 y/o company problems with technology rather than 1-5 y/o company problems in terms of product market fit.

> In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.

My favorite example is a bootstrapped startup that I co-founded. We had a couple thousand users and less than a gigabyte of data. The app was a run-of-the-mill CRUD app. I built the entire back-end API as a single service that could be built and deployed on heroku in minutes.

At some point, I left the project, and my co-founder hired an expensive consultant to review the system and provide feedback. One of his suggestions was to break up the entire service into a suite of microservices. And this was for a service that barely averaged a handful of active users at any one time. And a team of ~2 engineers working on the back end code. Microservices.

> At some point, I left the project, and my co-founder hired an expensive consultant to review the system and provide feedback.

People get pretty unhappy if they hire a consultant and don't get some drastic change recommendations. "Everything is good" doesn't sit well when handing over cash.

As a consultant I have never been hired by someone whose system was working. I am willing to bet the consultant was hired to add a feature, could not figure the installed code, and proposed to redo it the only way he or she was used to.

I have been hired a few times to review working systems. Basically get an external set of eyes on specific things. And “yes, this all looks good, you might want to tweak a little here and keep an eye on that once you grow significantly.” is an entirely accepted outcome of such reviews.

I thought people hire consultants so that they can sell unpopular changes as recommendations from an external authority instead of letting the blame hit management directly.

That's often more what you'd bring management consultants in for. A technology consultant is more likely just to tell you to dump whatever tech you are using in favor of whatever the flavor of the month is.

> People get pretty unhappy if they hire a consultant and don't get some drastic change recommendations. "Everything is good" doesn't sit well when handing over cash.

I suggest that this is largely untrue when the consultant being hired is a security consultant.

There is demand out there for security consultants who will rubber-stamp your existing software.

The biggest issue with CQRS I've seen is people thinking CQRS means you need multiple, duplicate data structures, mappers, a few Kafka topics and a PhD, when IN REALITY all it means is you put methods that return data without modifying it in one interface/class and methods that have side effects in another interface/class - which is really just a good application of interface segregation.

Moreover, you now have a great rule of thumb for your brand new junior developers: "If you're handling a GET request, use this class, and if not use this other class" making code reviews easier to do and helping to enforce better structure in your code. Plus tons of other benefits.

The rest of that stuff that isn't CQRS you might eventually need if you scale, but is easy to add once you have the bones in place.

Like you, I've found the general idea of CQRS/ES[0] incredibly valuable because it both enforces a functional approach to state (e.g. 'current state' is a fold over events), and it forces people to really think about the domain. (E.g. which bits are really atomic units, or what can be fixed up if it goes wrong in some way.)

It also forces people to think about something that's usually glossed over: Consistency. If you have an RDBMS backing you, you tend to not think about the fact that what the user on a web page sees is already out of date when they see it, so any responsible application should track when that data was read and reject updates if the backing data has changed since it was read... but very few applications even attempt to do this (it's really hard and a huge amount of boilerplate with most APIs). With CQRS/ES you are really forced to think about these things up front -- and you can actually avoid most of the issues. Whether that increases or decreases 'productivity' I don't know, but I do know that thinking about these things increases correctness.

For me, correctness is paramount. If you don't mind bugs, I can give you an infinitely fast solution.

[0] I do think ES is an integral component.

I’ll tell you this - it decreases it, especially for small teams. You have to do a lot of work to make your query engine actually good and reliable, and reporting is also a huge bear to deal with. You have to cache somewhere to get your queries anywhere near real-time, and it takes a lot of reinvention.

I don't understand this response. Most of our use cases have always done fine with a simple JSON document store -- no reinvention or anything. Most of our Queries are just a simple PostgreSQL table with an Aggregate ID and a JSON column.

Remember: For smallish applications you can still have your application itself be a monolith, so no need for complicated setups with message queues (Kafka, whatever), etc. etc. In this case you can basically rely on near-instant communication from the Command side of things to the Query side of things. You can also have the frontend (Web) subscribe to updates.

(We have our own in-house library to do all of this so YMMV. I'm not sure what the commodity CQRS/ES libraries/frameworks are doing currently.)

> any responsible application should track when that data was read and reject updates if the backing data has changed since it was read... but very few applications even attempt to do this (it's really hard and a huge amount of boilerplate with most APIs).

it is done very simple by versioning. Usually implemented transparently for the data layer API clients.

It's simple. I'm just saying that (for most frameworks) it's a lot of work and boilerplate. It's also easy to miss individual cases during code review, etc.

You kind of miss the part where he says "Usually implemented transparently for the data layer API clients." Transparently as in no work, no boilerplate.

I guess your mileage might vary, but Java has JPA/Hibernate, and .NET has Entity Framework, and they both make it easy, so I'm going to be surprised if any major framework or language doesn't make this easy.

The concept is also called optimistic locking, if that makes Googling it easier. Using that term, I easily found that Node.js's Sequelize supports it too https://sequelize.org/v5/manual/models-definition.html#optim...

SQL Server has a built in ROWVERSION type which plays really nicely with EF Core. Made implementing concurrency checks easy.

Exactly, CQRS gets such a bad rep because of all the pieces that (typically) surround it. But really as just a way of organizing code, it's so handy.

The freedom and flexibility to create your read models and write models the way they are needed just completely sidesteps a lot of design issues that creep up. Plus, it makes the code so easy to follow when everything follows the pattern; picking out where state changes happen becomes trivial for example.

To be clear, I'm referring to the (in my circles) usual parlance of not "pure" CQRS, which is as you describe and could entirely be in a OOP "Model" layer, but what usually goes with it - things like event sourcing for auditing, cached read services (eventual consistency), etc. Should've been more specific but was just pattering off the first big complexity thing that came into my mind.

This seems to overstate the difficulty in implementing CQRS, and by that I assume you mean event sourcing. Kafka is definitely not needed, or even necessarily a good idea, nor is a PhD.

I'm not sure what you experience with CQRS has been but CQRS is not a solution for scale it is a solution for long-term maintenance costs of a growing codebase by enforcing the decoupling of services.

I like where you're going for that. Why don't we just call it 'temporarily embarrassed big data company'?

On one project where we seemed to handle engineering strategy well, one of the considerations we'd make is whether paying the tech debt would accompany an inflow of cash or not. If it didn't we'd better tackle it now so we don't bleed out.

It's probably another way to say "spend money to make money". Management is more willing to spend out of a problem when they've had a fresh fix.

The trick is, though, can you pay down that debt fast enough that the customers don't get upset while waiting. If someone's firing up a big ad push you'd better be prepared. And people, like me, who get caught flat-footed once (or have friends who were), don't want it to happen a second time. So they get opinions on 'last responsible moment' that might differ from the idealists.

If your requests come in hard and fast then you’ll have a big problem with tech debt, but if you can schedule it properly you can do it as a part of adding functionality and people tend to be happy, because once the functionality is in it can be enhanced much more quickly.

>In lots of cases going for the very long term scalable solution will be an impediment to your growth, and I'd suggest dealing with those issues when the chance that you need them is on the horizon, rather than across the globe.

My attitude is:

"Man if this takes off and I have to make some changes on how we do this....I should celebrate!"

YAGNI - learn to embrace it.

Thing I've tried and mostly failed to get across. There is usually plenty of money and resources to throw at products that are making money. Vs the amount for those that aren't.

As long as you're not painting yourself into a corner a lot of things can be fixed 'later'. Later when your not desperate to get the thing off the ground. Later when you can throw two engineers at it for six months. Later when the pain points are well understood.

> CQRS is one of the biggest examples I've seen personally for this: your flexibility will go down, work to do "basic" things will go up

The only time I've worked on a codebase that had implemented CQRS, it was when I was given to maintain a crud application to manage maybe 20 different records. Rarely used. The people who had developed it in .net mvc were of such technical prowess that, besides the CQRS pattern, the frontend still had a functional shopping cart (yes, you clicked on "save" and your "order" went to the shopping cart, relic of the mvc demo app).

Sometimes I wonder if I've just been unlucky or the world is all like this.

> I'd like to remind people that you are probably not a "temporarily low-scale big-data company", in the same vein as a temporarily embarrassed millionaire.

This point is valid in the same way as telling a startup founder to give up the startup and invest in the S&P 500. It's trivially true, but ultimately useless advice.

Building technology is hard precisely because there are so many tradeoffs. Speed to market vs scalability is one of many. It's silly to pretend that there is some kind of rote obviousness to never caring about scale in the early stage, anymore than there is rote obviousness to just investing in the S&P.

Ex post, most startup founders/investors should have just invested in the S&P, and most early stage tech leads should have just used Wordpress or Rails.

I didn't say never care about scale in the early stage. I've found it a more common problem that people in startups are building systems with too much complexity to solve problems they faced at previous employers, in order to "save themselves time" in the future that ends up never coming. There are certainly some technology companies for which scalability is the most important thing - Instagram being the biggest example I can think of. The point of me saying so was to remind people - despite what you may think, you are not an Instagram, and more than likely your business doesn't require you to be one.

Plenty of early stage tech leads DO use wordpress and rails because their company doesn't need something more. You just don't hear about them often because they won't be sexy enough for HN. YAGNI.


OK so then does YAGNI apply to investment? Why not let existing firms handle it in the S&P?

> YAGNI (continued)

over-engineering is definitely a problem, but so is hindsight bias.

I think that most non-trivial tech builds involve some smart investments and some dumb ones. It's essentially a portfolio of decisions made in the face of uncertainty. YAGNI is cynicism and hindsight bias, and over-generalization.

I think this is less obvious than it should be because of the popularity of the highly cynical "agile" approaches that consider a project that does not plan beyond the next sprint as somehow not taking on code debt.

>> OK so then does YAGNI apply to investment? Why not let existing firms handle it in the S&P?

Yes, don't rebuild your own balanced portfolio or pay someone to "beat the market" for you; just buy a COTS ETF, put it in the server closet and forget about it for a decade.

As someone currently fighting a battle with an LoB application written without foreign-key constraints with hundreds of thousands of rows of corrupted data because of bugs in sprocs that assigned the wrong value to the wrong foreign key column because they were similarly named - THIS!

The reply in the GitHub thread we’re talking about makes it clear that they still perform FK validation - it’s just performed in the application code rather than the DBMS.

I note there is another alternative: deferred constraints - or just run a query to check for invalid rows at 3am every morning.

I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app. And the “Rails Way” is to create these constraints in the model. Foreign key constraints weren’t a first-class member in Rails until v4 (if memory serves correctly).

I was once a full-time Rails dev and really loved the framework (I don’t write as many user facing applications these days). Most of the Omakase trade offs didn’t bother me. But I never understood the disdain for foreign keys. For 99.99% of web apps, you want them (dare I say, need them).

Even in Rails 3 I would add them by hand in the migration files. Very few applications will ever actually care about sharding. We were pulling in millions at the last Rails company I worked for and were just fine with a master and a single replica.

If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition. Your goal should be to outgrow what MySQL (or Postgres) can do for you in master/replica mode. If you do, you’ll likely be independently wealthy...

If data integrity matters at all, model-based checks (or periodic queries for orphaned data) will not suffice. Just put a foreign key in the table where it belongs and let your DB do what it does best. ACID is an amazing thing...

> I’m pretty confident that GitHub doesn’t use foreign keys because it was built as a Rails app

Maybe originally, but lack of foreign key usage is certainly not Rails specific today. Large MySQL shops generally don't use foreign keys, full stop, for the exact reasons Shlomi described in the original comment.

Facebook does not use foreign keys either. In my experience, same thing is true at all the other large MySQL-based companies. And those companies make up a majority of the largest sites on the net btw -- including most of the consumer-facing social networking and user-generated content products/apps out there.

This does not mean that, for example, Facebook is full of data that would violate constraints. There are other asynchronous processes that detect and handle such problems.

> If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition.

Do you mean, stay with your current RDBMS of choice and shard while also removing the FKs? Or do you mean transition to some other system? (and if so, what?)

The former path is bad: sharding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.

The latter path is also bad: only very recent NewSQL DBs have sharding built-in, and you probably don't want to bet your existing rocket ship startup on your ability to transition to one under pressure. Especially given much higher latency profiles of those DBs, meaning a very thorough caching system is also required, and now you have all the fun of figuring out multi-region read-after-write cache consistency while also transitioning to an entirely new DB :)

“[S]harding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.”

I typically espouse app logic to check state and use foreign keys to ensure enforcement (eg, race conditions that are very hard to ensure at the app level but are built into many RDBMS). Foreign key failures are just treated like any other failure mode.

But honestly, I haven’t been part of a company that have really hit those upper limits that require sharding. They exist, yes. But most companies will never need to worry about it. Which is my point.

I agree that most companies won't ever need to shard, and pre-sharding (or worrying about extreme scalability challenges in general) is usually unwise premature optimization. But it does depend on the product category.

Social networking and user-generated content products (including GitHub) need to be built with scale somewhat in mind: if the product reaches hockey-stick growth, a lot of scalability work will need to be completed very quickly or the viral opportunity is lost and the company will fail. I'm not saying they should pre-shard, but it does make sense to skip FKs with future sharding in mind.

This was nearly a decade ago, but in one case I helped shard my employer's DBs (and roll out the related application-level changes) with literally only a few hours to spare before the main unsharded DB's drives filled up. If we also had to deal with removing FKs at the same time, we definitely wouldn't have made it in time and the product literally would have gone read-only for days or weeks, probably killing the company. Granted, these situations are incredibly rare, but they really do happen!

It’s at its core a case by case decision, I think FK are also a net negative in data ingestion scenari where the data set is big enough.

Trying to make sure everything is where it needs to be at any given time, everything is inserted in the right order and the data is always consistent brings exponential amount of conplexity when it could all be checked at the end and pruned for invalid data. And usually DB integrity will not be enough, you’ll want business level validation that all is OK, so there will be app level checks anyway.

PostgreSQL supports deferring constraints. https://begriffs.com/posts/2017-08-27-deferrable-sql-constra...

This looks nice, it’s still limit to a single commit though. That’s where it’s a PITA for anything that won’t (or we don’t want to) fit a single commit.

In particular splitting commits allows to ingest data in parralel (for instance if we import stores and store owners, both could be ingested separately without caring at first if each references a valid entity)

At least with mysql, and probably with postgres as well, you can temporarily turn off foreign key checks for a set of statements. So you can still get the benefits of foreign key constraints by default but when they do more harm than good you can turn them off. With the added benefit that turning off FK constraints screams "I am doing something unusual and dangerous - this requires extra caution."

App level data integrity is usually critical. Ingestion into a reporting database is an entirely different construct (where I agree FK constraints are burdensome).

It's been a while, but IIRC at the time Rails got started MySQL actually did not even support foreign key constraints. Since that was the DBMS of choice, it wasn't much of choice.

InnoDB's foreign key support predates the existence of Rails by several years. However, InnoDB wasn't the default storage engine for MySQL at the time, so that may be a factor.

I wonder how different the state of database application development would be today if all those cheap whitelabel webhosts powered by cPanel or Plesk (where most of us got started, I imagine) opted for PostgreSQL instead of MySQL - which would have influenced the major MySQL adopters like phpNuke, phpBB, WordPress, etc.

The application code isn’t enforcing FK constraints because that is impossible for the application to do. Their database is almost 100% guaranteed to be corrupt as a result.

Application code has bugs. Application code can fail in ways that result in corruption. A primary job of the database is to keep itself from getting corrupted. Enforcing foreign key violations is only something the database can do correctly. Punting the responsibility to a higher layer will result in corruption.

> A primary job of the database is to keep itself from getting corrupted

That would be where the road splits. If you handle the database as a very fast and structured storage application, a lot of these assumptions go away, with then a different set of tradeoffs.

Some data corruption could be fine if you can guarantee the critical cases, just as bugs in the code are fine as long as the useful cases are covered.

I remember a database with scheduling entries in it that could get duplicated depending on the sharding, but it didn’t matter because the app handled the case gracefully.

I think understanding the tradeoffs that match the best the use case is the most important point, always assuming that a DB has to guarantee integrity can be a burden preventing from looking at all the options.

There are 100% foreign key violations in their database. That is not the same as their database being corrupt.

They have engineered for, and understand the implications of, foreign key violations. Typically, it's as simple as "This row can be deleted", and that can cascade - at a totally different rate than you'd find in a database and with totally different performance characteristics.

Foreign key violations are data corruption!!! It violates the rules of how the data relates and can and will screw up any number of things that depend on the rules being enforced.

Reporting and bi data might get hosed.

Account management might get hosed.

Who knows what happens when FK rules are violated because by definition they should never be violated. It puts all applications on top into a undefined state, leading to bugs and god knows what else.

Foreign key violations are 100% data corruption.

> It violates the rules of how the data relates

You're missing an important point, it violates a certain set of rules of how the data relates. That certain set of rules typically makes it easier to write most general data applications.

If your rules define a foreign key as optional, you can easily engineer an application around it.

Great examples include NoSQL and Ruby duck-typing. In both cases, you infer actions based on the data structure, rather than making explicit assumptions.

The whole fact that you made a foreign key implies that that is now a rule of your database. If anything is in that column that does not adhere to the foreign key, that means your database is corrupted.

Yes, and what if you don't make a foreign key?

Then you have an implicit rule that is enforced by a hope and a prayer that some junior dev never makes a mistake, your senior engineers are clairvoyant and understand every single aspect of your systems 100% with zero off-days, your code review process catches every single possible edge case (especially the edge cases that you never knew existed), your QA process is 100% and never makes mistakes, your servers never crash in ways that leave things in an inconsistent state, etc.

Or you could, you know, simply add a foreign key constraint and never, ever, ever have corrupt table relations.

Why people fight their tools is beyond me. There is almost zero reason to defend not using foreign keys.

or you never wanted it in the first place.

i suggest being open to the concept that other perfectly capable humans may, in fact, design systems with different underpinning assumptions than those you appear to presuppose always must apply to everyone and everything.

It's important to remove the DBA or Developer hat and realize that you are working together on a singular system (or maybe it's even the same person, etc.)

"corruption" implies that the desired results have not been achieved, that essential data has been lost or compromised, and this clearly is not the case with such a deliberate design decision. one should be prepared to accept this possibility in order to not be merely a fanatic and unreasonable.

That's exactly my experience and viewpoint.

Mine is exactly the opposite, most use cases I'm exposed to do not require FK at all, so I simply never use them.

> Foreign key violations are 100% data corruption

Sure, yes, of course, who cares?

These are self-imposed rules, so breaking them is wholly up to yourself. And there are trade-offs involved that may often make it acceptable relax them.

You seem to be rather invested in one set of arbitrary definitions. If people do fine even in situations where they "by definition" shouldn't, it's the definitions that have been found lacking, not the people.

> Sure, yes, of course, who cares?

The stakeholder from whom you got the requirement on which the logical FK constraint is based, for one.

which may not exist.

Why would you identify an FK relationship other than requirements?

This assumes things depend on the rule being enforced.

Nobody suggests taking an app / db with foreign keys and removing them that is a recipe for disaster.

You can keep saying that, but it does not make it true. Properly modeled data does not need those constraints. Well written software handles these correctly. Mediocrely written software fails and complains loudly. Badly written software might get hosed. You're at just as much risk (or more, in my opinion) of that with bad schema changes as with not having foreign key constraints.

Do you not check input from your javascript front-end before you save it? Even if that front-end does its own validation? (No, you see, but there is only one web front-end.... we don't need the backend to validate the input!!!!)

In what way is letting the database ensure it isn't getting fed crap any different?

Why do developers constantly think it is okay to let unvalidated user input hit their database? Any client calling your database is a hostile client that will feed your database bad data. Arguing otherwise is complete ignorance.

> In what way is letting the database ensure it isn't getting fed crap any different?

Because one canonical validation layer is generally enough. You can see how having two separate partial validation layers could cause problems, right? And you can't put all the validation in the database, for non-trivial apps.

And "clients" shouldn't be talking to the database, no matter if you have foreign keys or not. That's a totally separate issue.

> Reporting and bi data might get hosed.

> Account management might get hosed.

LOL, if I read the data directly from the db instead of via the application's API then sure, I lose the application's guarantees. But, y'know, same might be said if I just go and read the DBs files from a sidecar shell script or something.

> It puts all applications on top into a undefined state

...that's just an assumption that's false.

If you want to operate assuming your data is always corrupt because your engineers don't understand how to use the tools provided by their database.... Seems like an awful lot of work to re-invent a wheel that your DB server can solve for you. I guess that is on you though....

Will result in corruption. Yes.

Will result in more fault tolerant software, also yes.

It's a trade off and one I make willingly at every scale.

I stopped using foreign keys after university and have never wanted them since.

Non nullable database fields are far more useful than worrying about fks.

> Will result in more fault tolerant software

More fault tolerant because you have to waste time debugging the faults and monkeypatching them in code just to avoid FKs?

> Non nullable database fields are far more useful than worrying about fks

Can’t even count how many non-nullable fields I’ve seen packed with “” empty strings to get around that requirement

1. Over my career almost none of it has been debugging problems due to a lack of foreign keys.

Further to that, instead of having lazy cascading deletes moving the goal posts on you. Data missing its parent for example is an indication something is wrong. It's a useful diagnostic.

2. Most string data is safe to represent as empty. Handling null on the other hand is a different situation and often induces warnings or other side effects depending on the language. Forcing the null checking to your boundaries is much like forcing your state/mutation to the boundaries in FP. Hell eveb though I think it's insane, hexagonal arch works on this idea as well.

> or just run a query to check for invalid rows at 3am every morning

One of the nice things you get from database FK constraints is when something messes up the data, you get an exception thrown in the code that's messing it up, complete with a backtrace and whatever inputs and timestamps you care to log.

I'll jump in here as well in case someone is considering throwing out foreign keys due to this. I would argue, never make compromises like this unless you have a scaling problem that is so bad, that there is no other way around it. Don't get me wrong, there are places where this is the case, but 99% of software does not have this issue. Use foreign keys and save yourself the headache in the future. Also bear in mind, that if you ever hit that scaling problem, you are making so much money, that it's a nice problem to have.

I would add to this if you have the scaling problem, and are planning to throw out foreign keys, and start sharding: you have essentially moved to a distributed data store, but you are using a query language not designed for it. You will face challenges learning what parts of the database system you can and cannot use safely, and enforcing these constraints will be fraught.

You are essentially moving to NoSQL. Maybe it makes more sense to just own up to being on NoSQL, and using a data store and data access patterns that were actually built for the task? It should be something to think about, anyway.

Certainly consider this option if you're planning for that scale from the start as a more meaningful alternative to simply saying "no foreign keys" from the start. I won't necessarily say it's overengineering; there are entire problem classes where tracking millions and billions of records are on the table, particularly in event monitoring.

Such an excellent post. Thank you so much.

When you get to a point where you need to start sharding your database, congratulate yourselves for the enormous success and toast the night to come.

The next morning you can embark upon the engineering effort to start the sharding re-architecture.

Two of Google's big internal databases don't use foreign keys (Spanner[0] and F1 [1]), rather they use hierarchies (parent/child tables). It definitely limits you a lot more than foreign keys, but seems to make sharding easier. It works well for some use cases, but takes some getting used to. Cockroachdb also has this functionality [2].

[0] https://cloud.google.com/spanner/docs/schema-and-data-model

[1] https://ai.google/research/pubs/pub41344

[2] https://www.cockroachlabs.com/docs/stable/interleave-in-pare...

I've worked on exactly one project where performance concerns lead to removing Fkeys.

The compromise we came to was to enforce them in dev and qa in order to catch bugs, and relax them in prod.

I still strongly believe that database constraints are a developer's best friend, in that you can trivially make your data structures fight back against misuse. This makes several classes of bugs obvious. But like anything, there are times they are not optimal.

I'll also say I think there are very few cases where a small performance advantage outweighs the costs, and would be hesitant to head down this path with a team less competent than the folks at Github.

I've worked on projects where adding foreign keys increased performance tremendously. A good relational database can take advantage of them in all sorts of ways in execution planning. Even scenarios such as some types of sharding, a relational database sometimes can use foreign keys for useful data proximity information, to better co-locate "families" of records together.

You need to know your database engine, how normalized you are or are not, how appropriate your data is for its relational model. It's not always your FK constraints that are the bottleneck you might think they are. Sometimes their "slowness" is hiding a problem elsewhere in your schema, a bad shard or an index that could be better or a key-value pool with soft version controlled enums pretending to relational data.

Sometimes "optimal" isn't a simple spectrum but a checklist of trade-offs and no "right" answer.

Someone two degrees of separation from me coined the term "medium data". I don't remember their name, but I absolutely love the term.

You are "big data" when black swan events (like hardware failure taking down a database node) become regular enough that you begin to statistically model it. Before that, you might just have a lot of data, but you aren't having "big data" problems.

Or, as a wise man once said, "Premature optimization is the root of all evil."

It's a different part of the field of course, but so many people yield opportunity cost chasing some difficult architectural problem that will likely never impact you, and if it does you'll have resources to throw at it.

Many startups won't even have slave production databases, much less have to worry about network distributed sharding, complex caching strategies, database optimization beyond query analysis.

I will add to the pile of agreements on this.

Also, if you do end up sharding a relational database, it is often by identifying subgraphs of document-like structures that you can shard on. Need to ensure these subgraphs do not have foreign key relations with one another, but you can maintain the valuable foreign key relations WITHIN the subgraphs.

Practical example: database of user profiles where suddenly you have billions? You can skill keep foreign keys on user to email-addresses or user-to-comments while eliminating cross-user foreign keys.

I would also add a good rule of thumb: when you make the design decision to remove a database feature, you need to assume that you now need to handle that feature yourself, or your data will get corrupted. For example, when removing FK constraints, or transactional boundaries, or introduce irregular checkpointing, you now need to implement a data repair system, because your data will get broken. At which point you are probably going to end up using a change event log (your own transaction log) and a system that replays the logs in order to repair and rebuild the database.

Even at very large companies, hacks can work surprisingly well for a surprising amount of time. As much as we harp on technical debt (which includes myself), we rarely appreciate how much time can be saved by implementing the most straightforward solution so we can focus on hard things.

I'm not sure if Guido was just being amicable, but in the recent Dropbox retirement post he said, “There was a small number of really smart, really young coders who produced a lot of very clever code that only they could understand,” said van Rossum. “That is probably the right attitude to have when you're a really small startup.”

In addition to context + YMMV, I want to mention, sharding is NOT THE ONLY solution for scaling[1], albeit it's a popular technique. So that makes Shlomi Noach's first point weaker.

[1]: https://www.quora.com/What-are-alternatives-to-sharding-data...

Additionally, at least in Sql Server, trusted foreign keys can give the CBO more, good options because of the guaranteed referential integrity on reads. FKs are killer on larger writes, though.

+1 million times this. Seems like everywhere I work with are concerned about google size data problems when the apps I'm working on are at best 1000 concurrent users...

I agree. No two systems are comparable. Must be very confusing for the people ramping up. A solution fits the problem, not the other way around. Keep it simple stupid. KISS

Came here to say something like this, though you said it more eloquently. Also a couple replies to you below add great context.

Who needs foreign keys... or relations?

Documents do well for a wide variety of applications ;-)

The number of times I’ve seen serious data corruption because “foreign keys are bad and we can just enforce it in code” is amazing. There is zero excuse to not use FK’s

Any database that doesn’t use FK’s is almost guaranteed to have crap in it that didn’t get cleaned up, resulting in data corruption (and yes, dangling stuff in tables count as data corruption).

Developers aren’t perfect. Shit will slip through even with the most rigorous process. The database should always provide tools to keep its self from getting corrupted. This is why good database systems have constraints like FK’s. If your database software makes using those tools painful (cough MySQL), get a better database system.

Not using FK’s is just plain old ignorance.


"As a C developer, I never check exit codes of child processes. We can just enforce it by ensuring child processes don't have bugs"

`malloc` won't fail, right?

the javascript will enforce the user input. no need to have the backend check it again!

This should be a joke, but considering how often it's actually the case it's unfortunately not that funny.

Not as much as developers who can't find their own coding errors might have you think!

It won't on Linux! But hey, random processes will get OOM-killed.

You can turn off overcommit, in which case you'll get a null return instead of OOM killing.

But also, you can still get a malloc failure without actually be running out of memory if the allocator can't find a big enough contiguous chunk of address space.

This is highly unlikely on a 64 bit system, but if you try to malloc gigabytes on a 32 bit machine you might see it.

And also, you never know when this theoretical case actually happens but it did happen to me: at one point someone may use your code on an Arduino and unexpectedly, it works! But memory allocation could fail more than you expect!

On Unix it won't. You can overcommit memory and only get into trouble when you actually try to page it. But not at malloc time.

Iirc that's configurable.

I think this is a bit hyperbolic of an expression, but I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.

Things like RDBMS's provide some really nice utilities for data consistency enforcement in the form of FKs, these are not the most performant approaches in all cases but they're optimized to be good for nearly all use cases.

If your business need /Requires/ moving off of FKs onto some other data integrity guarantee, then just understand that you're going to be reinventing the wheel. It might turn out to be a better wheel for your car, but it's going to be expensive. And... unless you specifically hire specialized people who comprehend data guarantees and ACID properties, you'll probably do it wrong. This sort of an undertaking is for mature companies only.

> I do want to reinforce that all applications need to confirm data integrity, you either confirm it when saving the data, when extracting the data, or have to very carefully balance various consistency concerns and either enforce consistency before saving data or enforce consistency after saving (but before extracting) data.

There are two kinds of error handling. The “happy error” and the “fuck you asshole” error handing. Good systems have both.

Yes, the application should check it isn’t about to violate a FK constraint. Just like it usually checks that it isn’t about to violate a unique constraint. That way the application can fail in happy, controlled way. But if the application doesnt do the right thing than the DB server still should puke all over the transaction with a “fuck you, don’t feed me bullshit” error.

Just like data entry over the web. The javascript can return all kinds of nice happy messages to the user about missing fields and stuff. But the backend should always validate and enforce correctness even if it’s mode of failure is some ugly “piss off, don’t feed me crap” message.

Just like you should never trust user provided data coming from an HTTP post, a database should never trust the INSERT or UPDATE is valid and won’t corrupt the database. Both backend systems can return mean old ugly errors when shit is bad and let the front end do pre-validation that can do happy nice errors. The backend always has to enforce its own data integrity. Period.

Oh I absolutely agree and even when a DB is properly configured with references all cleared defined and constrained it's absolutely a good UX thing to pre-check as much as possible.

But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity. It is stupidly expensive and unless you have a few billion in the bank there is absolutely no reason to even consider it, but if you're dealing with data volumes like GitHub then it's conceivable that all the other salves for enforcing data integrity fall short. In that case there are ways to approach removing FKs, but, when you do so, you're not (logically speaking) giving up the data-integrity from FKs, you are replacing FKs as a tool for data-integrity with another tool for data-integrity (one that will probably be very similar to FKs) - under this guise DB FKs can stop making sense (though also having any sort of RDBMS engine likely also stops making sense as you're essentially adopting the functional responsibility for being an RDBMS into the primary application).

Being pedantic in this case doesn’t help the cause. Too many developers don’t understand database theory at all and will read this

> But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity

And not the rest of your post. Yes technically you are right but it is stupidly expensive and nobody should do it.

The problem with being technically correct is, again, people will stop at the sentence I quoted and go build yet another FK free system. Said system will undoubtedly fill up with corrupt bullshit data that eventually leads to exciting mystery bugs in production that has everybody scratching their heads. I’ve seen it time and time again....

I feel like reading HN should come with a warning on the tin that "If you're reading a long technical comment, taking away just part of it is dangerous" - were I speaking to someone in the business side of a corp that asked "Hey do we need these FK things, some developers have been saying they're slow" I'd say 'Yes, we absolutely do need FKs' then go on to talk to the developers, double check I wasn't at one of the about dozen of companies with data at a scale that FKs as implemented in RDBMSes (especially postgres, mysql tends to drop off in performance much easier without heavy tweaking) is insufficient, and then tell them that FKs do work and they probably really just need to read up a lot on indexes and stop throwing around table locks like it's christmas.

Hey there, you and I still need work. Let them speak!

Unique constraint is a good example, because it reminds us about race conditions.

The app can check that it won't violate a unique constraint before doing an insert/update, but in between that check and actually doing the insert/update, some other process may have changed data, such that unique constraint can be violated.

So when the rdbms catches this, it'snot just a "fuck you for giving me bad data" condition if the implication there is that it was a bug in app code, and it's a failsafe. It isn't necessarily a bug at all -- unless you intended it to be the app's responsibility to use db-level locks and/or transactions to guarantee this can't happen without the uniqueness constraint -- but then why not just use a uniqueness constraint, the tool the db is actually giving you for this?

Mature rdbms's sometimes don't get the recognition they deserve for being amazing at enforcing data consistency and integrity under concurrency, with pretty reasonable performance for a wide variety of usage patterns.

Foreign key constraint can be similar; you can do all the checking you want, but some other process can delete the object with the pk right before you set the fk to it.

If you have app usage patterns such that you really can't afford database data integrity enforcement (what level of use that is of course depends on your rdbms)... you are forced to give up a lot of things the rdbms is really good at, and reinvent them (in a way that is more performant than the db??) or figure out how to make sure all code written never actually assumes consistent data.

The disdain some developers tend to have for their data storage system is pretty unreal. It’s like they see it as a necessary evil instead of their friend.

They often react to FK’s as if they cramp their style, where their style is to just insert whatever into the database without regards for what And where it is.

I had a job where the database was intentionally lacking FKs so that users could input data out of order, for example create a shipping route Foo that goes to a warehouse Bar, before the warehouse existed. Let's just say it was a suboptimal design.

this just sounds like eventual consistency which can form part of a perfectly optimal design

Or "eventual" never comes and you end up with a bunch of inconsistent data.

The engineer https://en.wikipedia.org/wiki/Niki_Lauda from Rush movie fame, if he's working on a race-car-level-system where every little bit of performance you can tweak out counts, like wringing a towel dry, every last bit... and in that mindset, FKs are removed NOT from ignorance!

Not using FKs essentially means eventual consistency, and it's on you to ensure it. Obviously, eventual consistency is something a lot of systems have to settle for, often for reasons that have nothing to do with architecture, but just because of circumstances where you have to synchronize disparate systems and there's no way to eliminate all but one.

I'll note that there are very large systems that do provide ACID-like consistency (e.g., Spanner, Lustre), so it's not at all that "large -> eventual consistency".

I won't pass judgment on projects where eventual consistency was a given from the start, even if I might think they could have done better, but too, I wouldn't make either approach a hard and fast rule: circumstances vary. What bothers me is the extent to which I see "triggers bad", "FKs bad", "business logic in the DB bad" blanket statements out there.

Thanks. As a mere occasional user of databases I was confused and trying to figure out how you could not use FKs without recreating essentially the same thing in code.

At the scale of GitHub, FKs may very well being more problems than benefits. For everyone else, FKs are a benefit.

> and yes, dangling stuff in tables count as data corruption

No. There are exceptions like financial systems, but most of the time it doesn't matter if your DB has dead links to some removed entities.

> Not using FK’s is just plain old ignorance.

Nice level of argumentation, but I prefer the way how author from github can prove his opinion.

Eh. The issue arises when you have a very poorly designed schema and missing application-level operations to perform cleanup and deletion.

You don’t want to be caught in that situation because then you are handcuffed and cannot clean the database properly. For example, from the perspective of business operations you might have something that creates 1 single thing where under the hood 5+ DB objects are all created in a transactional manner and a very specific order. So how do you unroll that? You gotta carefully construct the rollback commands or just get rid of the bowling lane bumpers and slash and burn.

So you gotta commit and go all-in. If you make the choice to do it all within the app later, at least you keep that flexibility.

It all depends on your processes and your team. If you don’t design things properly and institute the correct procedures for designing and operating the DB, having or not having constraints at the DB level is inconsequential to the other issues you will face.

Bullshit. Even with the most perfect developers and perfect system, letting application code enforce constraints will lead to data corruption period. Some application will crash or something and leave dangling garbage behind and boom you are fucked.

Would you ever trust that form data is valid because the JavaScript front end “validated” it? No! Why the hell are you going to trust that everything sent to the database is valid? No constraints like FK’s is exactly like not validating input because “the JavaScript layer got it”.

It is out of ignorance that people argue otherwise, sorry. Too many people don’t understand relational database...

Well, what's true is that if you leave it to the application, you now have two problems: you have an RDBMS you use half-way, and an app that needs to implement the RDBMS features that you're not using from the RDBMS. Since most app devs are not RDBMS devs and don't want to be, they're bound to get a few things wrong.

The most likely thing to go out the window in an ORM-type app is concurrency. Take a big lock around DB ops and you're good, right? But then, too, there goes performance.

Or, if it's not concurrency that goes out the window, you have to implement eventual consistency...

The worst thing I've seen too much of is application-level JOINs and such. Next are recursive queries: done in the app. All that absolutely destroys performance. Then you have custom query languages that are just never good enough.

So I am mostly in agreement with you, but "b.s." is too strong for me. You can get all of this right in the app, though at a terrible cost, and there are times when maybe it's actually for the best.

Let me give you an example of how I might build something that's somewhere in the middle. Think of Uber or Lyft. I might have a single widely-replicated RDBMS for users, and maybe another for drivers, and maybe per-city in-memory-only ride DB for tracking requests and current rides, and one more, possibly sharded DB for billing. The ride DBs would send billing updates via some pub/sub scheme (e.g., like PG's NOTIFY, or something else). Recovering from reboots of the ride DB is easy: rely on the apps to recover the state. Most operations only have to hit one DB at a time. The whole thing is eventually consistent for user/driver ratings and billing, which seems rather fine to me given that the operations to synchronize are essentially just aggregation updates. In such an app there's not a lot of room for application logic in the DB -- some yes, and I'd put as much as possible in the DB.

There are plenty of apps where some degree of eventual consistency makes sense. What doesn't make sense is to end up hand-coding JOINs, GROUP BYs, and so on. And my preference is to use FKs and triggers as much as possible, but within limits (see all the above).

Yikes. Someone got up on the wrong side of bed today!

Sorry. I have just seen way to many systems fail in way too many mysterious ways because "foreign keys are bad" and "the application code can do the validation". In fact, I wager that 100% of all databases that don't have FK's have some kind of data corruption that is causing at least some kind of user-visible issues.

It is frustrating to me because even the most green behind the error developer would cringe at somebody saying "we don't need the backend to validate user input because the front-end javascript does it for us". This is the same thing.

Never trust your inputs. Your application code calling the database server is exactly like some javascript client calling your backend system. Your database has tools to keep it from being fed bullshit input. For some extremely frustrating reason, people think it is perfectly okay to have the database trust its user input.

The result is in almost every single instance where those tools aren't used, the inevitable corrupt data will cause some kind of hard to reproduce user-impacting issue. I've seen it so many times it makes my head hurt.

I should have clarified that I don't think FK's are bad ... just that if you are going to use them you need to plan for that.

Like you, I have been bitten by this in many different environments on both sides of the fence. In certain environments the absence of FK's has caused major headache and likewise in other environments the presense of them and the lack of a robust db design has meant the FK's cause more harm than good.

Every time I have seen a database use foreign keys there has been data corruption, because everyone thought foreign keys were declarative and not procedural. Just because you have a foreign key doesn't mean it was always there or that it applied on every transaction. You can turn them off at the connection level and you in fact must turn them off for almost any kind of bulk data load.

You should read shlomi's post he gives good reasons. Specifically this is a github issue on his tool gh-ost which is for online schema migration, and FK's pose lots of problems for online schema migrations.

Maybe shitty toy database systems like MySQL let you disable constraints on a per session basis. A real database system might let you defer them until the end of a transaction—which is the only correct way to operate. Once you commit that transaction, what you put into the system better fucking make sense and it is the job of the database system (and only the database system) to enforce that.

Like I said before if your database system makes using constraints hard or lets you shoot yourself in the foot (lol at disabling constraints per session), don’t just walk but run away from that system.


What other database has been used at such a giant scale at so many companies, than MySQL? I'm sure Oracle and SQL Server are used at big companies, but nowhere near Facebook scale.

You aren't Facebook. Facebook has put massive work into adding layers on top of MySQL that your startup has not.

Also, you seem to be forgetting PostgreSQL

> Maybe shitty toy database systems like MySQL

While appropiate to define in few words some of MySQL's colossal mistakes, this isn't the kind of language that will sway heads that have been comfortably using MySQL because those defects are just "what DB's do".

Any database that would let you disable constraints on a session basis is a toy database. Such an operation doesn’t even make sense because at some point the relational integrity has to be enforced for the entire table. You can’t just have parts of a table be relationally correct. That is like saying 1 + 1 = 3. It is a completely illogical statement.

However I would not at all be surprised to learn MySQL supports such a thing. Which supports my assertion it is a toy used (or at least installed by) people who have no understanding of relational database architecture.

So are you asserting that the following products are all built on top of a "toy" database, and their engineers have no idea what they're doing:

Facebook, YouTube, Wikipedia, Pinterest, Slack, GitHub, Etsy, Yelp, LinkedIn, Shopify, Dropbox, Wordpress, Wix, Tumblr, Square, Uber, Booking.com, Box, Venmo, SendGrid, Okta, SurveyMonkey, WePay, Alibaba, SoundCloud, among countless others...

An alternative view is that your statements are incorrect. Do you have much direct experience with high-volume OLTP database workloads, or are you basing your views of MySQL on something else?

Once you are stuck with MySQL it is very, very, very hard to get an organization to switch--not only from a technical standpoint but a political one.

I bet you any competent engineer who knows their shit about DB in those companies regrets using MySQL. I bet their code is full of hacks, crappy schemas, and all kinds of work arounds because they chose mysql. I've seen it in every company that uses MySQL. The lengths people go to avoid schema changes is astonishing.

It is much, much better to start with a real database like PostgreSQL because whatever you pick is going to be what your entire org uses from now until eternity.

Cool, so I'm going to assume that means your answer to my question of "Do you have much direct experience with high-volume OLTP database workloads?" is "no". Given your "bet" as well as comments about schema change difficulty, I'm also going to assume you did not click through to my profile...

And to clarify, I'm not saying that to toot my own horn. My point was, I primarily work on open source schema management and related consulting. I talk to large companies about MySQL schema changes literally every single day. The comment about "the lengths people go to avoid schema changes is astonishing" simply does not gel with reality among large-scale MySQL users.

As for the random blind accusations about bad code, engineer incompetence, etc that's just rude, mean-spirited, and misinformed. I personally know a lot of exceptional database engineers who work on MySQL-related infrastructure at the companies I listed above. Why crap on other people's work that you haven't seen and know nothing about?

Agreed. I think MySQL is still dragging the bad reputation it got in the early 2000s, which is unfair considering how much it improved. And I write that I someone that used to hate MySQL for all its shortcuts. I have one app in production based on MySQL. I have been thinking of switching to PostgreSQL for years. But the truth is that, the more MySQL improves, the less the switch is justified :) The engineering effort put by Google, Facebook, Oracle, etc. in MySQL and InnoDB during the last ten years is impressive.

Qualifying MySQL as a "toy" database in 2019 is obviously wrong.

But I think most companies you mentioned don't use MySQL in the usual way, as they would use a "standard" enterprise database like Oracle, SQL Server or PostgreSQL.

These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture. For example, YouTube uses Vitess "over" MySQL.

Companies like Instagram are known to do something similar with PostgreSQL.

My point is that maybe you and the parent comment are not thinking about the same use case.

> These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture.

Yes and no. Often it's both. I say this first-hand, having performed significant work on the database tier for two of the companies I listed, and consulted for several others.

For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.

And in any case, why does it matter if there's another layer involved? It's still MySQL powering mission-critical global-scale use-cases. And for example with YouTube, literally the primary benefit of Vitess is that your application can treat it as a single normal unsharded MySQL installation, so those interactions are still very MySQLy.

> For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.

I didn't know about that. That's interesting!

> And in any case, why does it matter if there's another layer involved?

I was writing that in the context of the parent comment about "disabling constraints". I can see why disabling constraints makes sense in a sharded environment, with an intermediate layer like Vitess. But the benefit of disabling constraints is less clear when using MySQL directly in a non-sharded environment.

Since you're here, I'd like to ask why you would use MySQL over PostgreSQL in a new project nowadays?

Regarding MySQL, the two main advantages I can think of are that tables are organized as clustered index (instead of a heap in PostgreSQL, which can be an advantage or a drawback depending on the workload) and the replication tooling.

On the other hand, PostgreSQL has a lot of useful features that I miss in MySQL: table elimination/join removal (exists in MariaDB but not in MySQL), indexes bitmap scan (to combine indexes efficiently), partial indexes, transactional DDL, LISTEN/NOTIFY, materialized views, row-level security, table functions like generate_series.

To explain more re: FB and having another layer on top of MySQL, there are a bunch of separate sharded MySQL tiers there. It's split by workload -- for example, the access pattern, schema, and sharding key differs completely between the main social graph, Messenger data, ad market, financial transaction data, etc. And then there's also the internal MySQL database-as-a-service, which allows any engineer to provision one or many databases for any other purpose. Overall, some of these things have services on top that use MySQL more as low-level storage, and others use MySQL in a more traditional fashion.

re: "disabling constraints", that's kind of orthogonal. The large MySQL users simply don't create foreign key constraints in the first place; there's nothing to disable :) Whereas MySQL's ability to disable constraints for a single session is a feature intended to make things like logical dump/restore easier and faster, schema management easier, etc. Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables), which is needlessly complex if the tables are new/empty, and very slow if restoring a logical dump which is already known to be referentially consistent.

As for MySQL vs Postgres, IMO both databases are close enough in major functionality that for many use-cases it's best to just go with what you already know, can hire for, and can operate. There are special cases where one is better than the other, for example personally I'd go with MySQL for social networking / UGC / very high volume OLTP, and go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores.

Ideally with MySQL you're just using it for OLTP, and deferring to separate systems for OLAP, fulltext search, etc. In a way that's "more UNIXy" but it's also potentially an operational headache.

In terms of specific feature comparison, you already have a great list there. A couple other things on the MySQL side I'd mention are InnoDB's buffer pool (smarter caching than relying on the OS cache as pg does) as well as the existence of MyRocks storage engine (LSM-based system offering better compression than pretty much anything else of comparable performance level for OLTP).

That all said -- Postgres is an awesome database, and I'd say that Postgres is more closely aligned with the textbook definition of a proper relational database. But then again I'd also say something similar about FreeBSD (vs Linux) for server operating systems, yet for practical purposes I always go with Linux anyway :)

Thanks for sharing your experience here!

> Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables)

This problem alone justifies the ability to temporarily disable constraints. I notice PostgreSQL, which is a toy too ^__^, offers something similar.

> Go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores

I agree that PostgreSQL is a really good match for these use cases.

Great point about Linux and FreeBSD ;)

PostgreSQL is not a toy database either, and yet you can do:


    SET session_replication_role = 'replica'

This really makes it seem like you have never used a database system at scale. There are reasons why systems like MySQL let you turn them off, and they are some of the same reasons why pretty much everyone who uses a database at scale has settled on MySQL. Also its why as is mentioned in the issue once you actually scale your database foreign keys become a nightmare. If all you have is a toy project you can feel free to use any database you want, but if you actually need to serve traffic you might want to rethink your priorities.

> everyone who uses a database at scale has settled on MySQL


I've only every seen people using MySQL at scale if they started with MySQL in prototype and never had the energy to migrate.

Well here is one case of a large scale user migrating


And there are tons more. In fact here is a tool to help you do it


Uber also has a giant team dedicated to re-inventing slack. I'd take their engineering prowess with a grain of salt.

Lots of people do dumb things for dumb reasons.

Most developers I interact with, even the really good ones, are profoundly stupid when it comes to databases.

I am a bit doubtful of this, especially within PostgreSQL you need to specifically go out of your way to create a NOT VALID constraint. I know that MySQL of old would default to an engine that didn't actually enforce key relationships (which was terrible but at least well documented) but in the modern world DBs will tend toward enforcement unless you specifically work against it.

How about this: if your service gets as big as github, then maybe consider doing odd things to eke out more performance or shard or whatever.

Otherwise: use FK's to maintain stronger data integrity.

As hesk mentions below, in Postgres, you can do all kinds of table ALTERing if needs be.

FK’s don’t just maintain “stronger” data integrity, they are the only way to maintain relational data integrity. Application code cannot maintain that relational integrity, period.

Any developer who thinks application code can enforce relational integrity is naive and does not understand relational database systems. It is impossible for any layer above the database itself to keep things from getting corrupt.

This is not remotely true. You can do this with pretty much any database that supports SERIALIZABLE isolation-level transactions correctly.

The point is that application code has bugs, and it's a lot easier to specify your constraints declaratively in a single place using a purpose-built DSL (SQL) than it is to enforce them procedurally every time you access the database.

if you are avoiding the use of foreign keys for the high and lofty goal of "performance" and 100% uptime while you run your fancy online migration utilities, SERIALIZABLE isolation is the last thing you would be using as it does what it says, serialiazes transactions and locks things like crazy. lots of waiting on locks, lots of deadlock potential. Throw in InnoDB's quirky implementations of isolation levels and you'd be in for a world of fun.

I'd argue that any decent application programming language can express these constraints better than SQL can.

How? FK constraints are trivially expressed in SQL. It is just a "REFERENCES table_blah ON DELETE DO BLAH ON UPDATE DO BLAH".

I agree with your general gist, but it's not impossible, you just, essentially, are writing the engine yourself.

I've seen a few applications that keep some data hot-loaded into a shared slice of memory and have tight controls over the altering and saving of that data - and that's sort of one of the easier ways to partially avoid a full reliance on FKs, if you want to use write-through caching then you're essentially accepting the cost of implementing data integrity checks in the application layer which can be done - if you're very very careful.

Unless you're working with immutable primary keys (i.e. the values can neither be updated nor deleted), it is actually not possible to reliably enforce a foreign key constraint outside the database.

Speaking in terms of theory, it absolutely is. RDBMSs aren't magic boxes, they stage data for insertion, validate it, execute that insertion - all while littering the WAL (or equivalent) with the steps necessary for the guarantee of that operation. You absolutely can write this logic into your application, at that point your application may basically be Postgres but it's possible.

If you attempt this and don't exhaust a small rainforest's worth of notepad paper and a few truckloads of dry erase markers then you're probably doing it wrong - but the problem is solvable.

There are some unsolvable problems in this realm (like holding to ACID while also ensuring that all valid communications from a client are properly processed) but you can accomplish anything your RDBMS does, you just need to be really really absurdly careful and know what you're doing at a theoretical level.

But, if you want to ensure the data is never visible in an inconsistent state, you either need to use db-level concurrency-related features like transactions or locks; or some kind of lock or other concurrency-control features at the app level while guaranteeing the db has no clients other than your app.

It seems difficult to wind up with better performance characteristics by doing this than using the higher-level abstractions the database is already supplying for this purpose, like constraints. I suppose it's possible with enough hours by enough experts, as you suggest. I doubt that's what github actually did though, they probably instead decided to write app logic that was resilient to visible data inconsistency. Which sounds difficult and dangerous to me, but github is a pretty reliable app, so I guess something worked.

Transactions are a pretty widely used RDBMS feature AFAIK.

Right, so are foreign key constraints.

Ensuring foreign key consistency without foreign key constraints is not just "use transactions", you have to be careful and intentional about how you are using them (and ensure all clients do). Why would you choose this over just using the foreign key constraint which takes care of it for you, using the same underlying technology?

Perhaps that's a clearer way to say what I was trying to say originally.

You can implement FKs with the same cost as FKs. Look up the referenced rows and add read locks. It's not cheap. FKs are by no means free.

There are many cases where relational integrity does NOT need to be enforced and in fact can happen in the application level.

One of the big reasons NoSQL got popular was because it broke the norms around strict data relationships.

NoSQL got popular because Google needed it for their billion-user planet-scale system, and then a million startups pretended they were Google. As soon as Google built their planet-scale NoSQL system, they immediately started replacing it with a planet-scale SQL system (Spanner)

And one of the reasons NoSQL didn't revolutionize anything is because people realized it's worthwhile to put on your seatbelt. It's either arrogant or naive to think that automated safety measures aren't going to catch any of one's mistakes.

You don't need a service as big as GitHub; you just need billions of rows.

I think you should enable FKs for test and dev, and possibly QA, but disable them in production.

Depending on your database, you'll get surprising interference in concurrent operations via other means than FKs, but FKs don't help.

"Just" need a billion rows.

Sure, some people get there, but it makes sense to do things the right way first, then figure out how to cut corners only when and if you need to.

Worked at a bank years ago. One of their DBAs eschewed FKs in production. Databases were designed with FKs which were enforced in Dev and QA. If your app survived testing (automated, QA team ... the entire gamut) without producing FK violation exceptions, your app could be promoted to production, where FKs were not enforced, making things pretty fast.

I feel like this was a stopgap on the way to eliminating FKs. I have no idea what his roadmap was because I wasn't in that group. But his process made for good thought fodder.

This approach is quite scary to me and I would have argued very vocally against their rejections of FKs. There are ways to rely on replication for read serving, periodic disabling of FKs during batch inserts, FK integrity checks on replications, ... these can all address the performance issues inherent in FKs, it's also (generally) quite possible to attempt to architecturally disentangle too large networks of interdependent data in an effort to reduce how large any single data store you are relying on will get. But disabling FKs in production is... eh.

I'd much rather see the opposite, declare the FKs on a testing environment NOT VALID, run application code, VALIDATE CONSTRAINT those you'd marked and see if anything is in violation - ensuring that application code won't (in the normal course of operation) generally hit FK constraints can greatly reduce your performance hit from having them... then in prod you can turn them on and be confident in your comparably lighter performance loss to data integrity.

> This approach is quite scary to me...

No kidding. My first thoughts went to referential integrity, and getting misbehaving apps successfully past testing...

FKs add read locks to referenced rows. It limits concurrency and it is observable. FKs constrain your ability to incrementally widen 32-bit FKs once you go over 2 billion rows, if you start out with 32-bit PKs. Two concrete reasons to avoid FKs in production, or at least disable them for longer running transactions.

I think this perspective is something you only get once you've run bigger databases in production.

These sound like implementation details. In PostgreSQL FKs just add a shared write lock to the foreign key columns of the referenced table (i.e. the locks prevent anyone from updating the primary key of the referenced row). Also what you said about 32-bits is not true in PostgreSQL either as far as I can remember.

The costs of having FKs in PostreSQL are:

1) If you update the primary key of the referenced table you might see issues with locking. But this is rare in real world applications.

2) Performance overhead from having to check all FKs and taking the locks. This can be a big issue on some workloads and may force you to add extra indexes. A PostgreSQL specific issue is that FK checks cannot be done in batches.

3) Adding new FKs block writes from the referenced table and dropping FKs lock out both readers and writers from the referenced table. This is a limitation of the implementation.

And that updates on tables with FKs block 'for update' locks on the referenced tables.

(Let's not forget, amid all this Postgres-specific chatter, that the article is about GitHub, who use MySQL.)

not sure about mysql, but in postgresql an update on a table with a foreign key will take 'FOR KEY SHARE' locks on the referenced table, which is a weaker type of lock. updates on the referenced table that do not update (primary) key columns (changing a pk is very uncommon anyway) will suffice with a 'FOR NO KEY UPDATE' lock, which does not get blocked by 'FOR KEY SHARE' locks. in fact, the main reason postgresql has these weaker 'FOR KEY SHARE' and 'FOR NO KEY UPDATE' lock types is for handing of foreign keys.

Also worth noting here, if you ARE updating the primary key on the referenced table, then locks MUST be taken to ensure data consistency. If that PK isn't locked, than by the very problem definition you have open transactions relying on the original PK value.

Those locks would be very challenging to accomplish at the application level.

In which case you can run FOR NO KEY UPDATE.

This sounds like an implementation specific detail (no doubt for mysql) that for all you know has been fixed in later versions but gets passed around as if it is a permeant truth that applies to all database servers.

FK consistency can't be guaranteed without ensuring the referenced rows don't disappear before the transaction has been committed. Think about it.

Yes, but it can be done without locking out reads. The only thing you need to lock against is someone changing the primary key of the referenced row or deleting the row.

PostgreSQL has implemented this minimum level of necessary locking for quite many years now.

I never said anything about locking out reads. Did I?

FWIW, I just tested in Postgres. Locks like I said it does:

    A: create table parent(id int, value int, unique(id));
    A: create table child(id int, parent_id int references parent(id));
    A: insert into parent values (1, 10);
    A: begin;
    A: insert into child values (1, 1);
    B: begin;
    B: select 1 from parent where id = 1 for update;
    B: (blocks)
The situation in MySQL is worse because it'll block updates on any field, not just row locks.

I have specific experience of this due to use of database locks at the application level to avoid deadlocks (different lock orders) and inconsistent updates (updates based on reads across multiple tables that may have separate racing updates) by locking rows up front. For understandable schema reasons, what is logically a parent entity is the natural thing to lock, but for understandable performance reasons, FKs to the parent entity are distributed through some fairly large tables.

Ok, now think this the rest of the way through. Without FKs, how do you, at the application level, ensure that A and B don't commit separate changes (A to child, B to parent) that break consistency?

If A tries to insert a child for 1, and B changes the id to 2.... OOPS! And from both's perspective it looks perfectly safe.

A neat idea for sure. However I'd be concerned that suddenly you have two different applications. Example, if you delete a row with a cascading deletion elsewhere, that will work on dev, but on prod it would leave dangling data.

Though, I imagine you could simply not use cascading behaviors.. Neat idea though. Scary, but neat hah.

No DELETEs - it's banking!

But seriously, there are a host of issues one must overcome for this plan to work. And there were indeed managed permissions for DELETE. But ultimately, I wasn't aware of much of the implementation detail ...

Hm, that is 'good thought fodder'.

My feeling on it though is that it just shifts all the faith to your test data being true to not only does but conceivably could happen in prod.

All very well testing with a bunch of constraints, but if it turns out one actually isn't exercised and then happens in production?

It also means you can't use them for validation or whatever. Which, I don't know, maybe you shouldn't anyway, but if you turn them off in prod you can't.

That's fine when one application is using the database. I work in the enterprise space, and we have at least 5 different applications all working with the same database. Each of these applications has their own team. Expecting each team to handle constraint checking uniformly in their applications isn't always feasible, so we use foreign keys.

GitHub's approach might work fine for small, focused teams who have exclusive control over a product. However, that's not often the reality of most organizations.

> That's fine when one application is using the database.

It's not even fine for a single application. Said application has bugs in it. Said application can (and will) crash in unexpected ways that leave the database in an invalid state. And if you think that "oh, that will never happen to me". You just haven't been around long enough. It will happen, every single time. Any system without foreign keys will have corrupt data in it. Said system will have user-visible issues as a result.

Just like only the backend can truly validate user input from a webform, only the database can validate its input.

There are ways to prevent invalid states in a database without foreign keys, for example enforcing "all or nothing" operations via transactions.

Until a programmer forgets to wrap the new feature in a transaction, or just has a bug in their logic that breaks the integrity. Or a db admin working directly in the database breaks it. Or a schema migration. It's not impossible, sure, but it requires you and everyone working with the data two be very, very disciplined, and is only as effective as the weakest link.

I fixed one too many data corruptions to do away with FKs. If we are large enough that FKs are causing performance issues, then we are large enough to invest in and deploy another performance boosting solution while still protecting the integrity of our database from contractors/employees running around with direct SQL access (another nightmare unto itself, but one that I as a developer have less control over than if there are FKs or not).

Wow. No discussion at all regarding the correctness of their databases. Is all the data correct, or are there FKs referencing missing PKs? Actually, even discovering this would be difficult, since any FK violation could be just viewing the middle of what would be a consistent update in flight.

I get that cross-shard FKs are especially difficult. But any discussion of this topic without addressing correctness concerns is woefully incomplete.

It's not like you can avoid incorrect data just with foreign keys. Say you have an invoice model. There's a boolean value indicating that the invoice is final and a numerical value for invoice number. Final invoices must have invoice numbers. But a bug in your system manages to update an invoice so that it's final, but missing an invoice number. That row is incorrect and it's gonna cause an issue somewhere.

I can only imagine how many production databases contain incorrect data.

> There's a boolean value indicating that the invoice is final and a numerical value for invoice number. Final invoices must have invoice numbers. But a bug in your system manages to update an invoice so that it's final, but missing an invoice number.

Couldn't you catch that with a CHECK constraint?


Yes, in fact you could and should.

What is your point? That database constraints cannot catch all errors, so don’t even bother with the ones they can catch?

How would the system update the invoice at all without an invoice number?

I'm not sure if I understand.

If you're finalizing the invoice, you're hopefully doing something like this, right?

    UPDATE invoices SET
        final = TRUE,
        invoice_number = @InvoiceNumber
    WHERE id = @InvoiceId;
(Where @InvoiceNumber is some variable the application's substituting into the query)

If so, then the problem you present should never happen (unless the DB doesn't do atomic updates by default, but wrapping the update in a transaction should provide the necessary guarantees to prevent the problem from happening; if your DB doesn't support transactions, then you really should be switching to a different DB yesterday).

If that somehow could happen, though (i.e. you don't trust the application to be bug-free, which is a reasonable attitude), a CHECK constraint (as others have pointed out) would make the database enforce that:

    ALTER TABLE invoices
    ADD CONSTRAINT final_invoices_have_invoice_number
    CHECK (final = FALSE OR invoice_number IS NOT NULL);
Of course, I'd also be wondering why an invoice would ever exist without an invoice number (it's easy enough to just make the invoice number NOT NULL - or, better yet, make it the primary key), but hey, if that's the business requirement, then that's the business requirement.

>you're hopefully doing something like this, right?

Well, in an ideal case, yes, you would be doing it like that. The reality might be different, especially when using an ORM.

>Of course, I'd also be wondering why an invoice would ever exist without an invoice number

Draft invoices do not have an invoice number, since they don't really exist anywhere. You can delete a draft invoice and nothing has happened. But if you have an invoice number, that's a record that must be kept.

> Well, in an ideal case, yes, you would be doing it like that.

I mean, it'd be either that specific case or a bug. And if it's a bug, then the check constraint as described previously would catch it and prevent it entirely.

Another option, though, would be to not even bother with a separate field for "final"; if the only two states are "draft" and "final", and finality is conditional on there being an invoice number, then the application logic can be greatly simplified:

        WHEN invoice_number IS NULL THEN 'Draft'
        ELSE 'Final'
    END AS state
    FROM invoices WHERE -- yadda yadda yadda
Most databases can cache this as a computed value column on the table itself or in a view or what have you.

> Draft invoices do not have an invoice number, since they don't really exist anywhere.

Sure they do: they exist in your database. Unless your company has a strict rule about invoice numbers always being sequential with no gaps (and that'd be pretty darn strict, in my experience), you might as well pre-assign it.

> Draft invoices do not have an invoice number, since they don't really exist anywhere. You can delete a draft invoice and nothing has happened. But if you have an invoice number, that's a record that must be kept.

Must invoices be sequential or something? Why can't I just assign the draft it's to-be number, and know it's a draft because FINAL=false/0

The main criticism seems to be that the FK relationship makes migrating the referenced table difficult. But why not remove the FK with ALTER TABLE before the migration, migrate, and add the FK back again (which will catch any missing primary keys), preferably inside a transaction?

I think the emphasis there is the word "online".

The method you propose might work, but not necessarily well. Catching missing primary keys at the end is a problem because then you have a bunch of data integrity issues to sort out, preventing you from re-applying the constraint, in a live system where apps are relying on the database to do integrity checks, meaning that there's a decent chance that the running system is creating integrity issues faster than you can sort them out. Possibly orders of magnitude faster.

Doing everything inside a transaction may invite concurrency issues around locking, maybe even deadlocks. Which is again a problem in a live system, because you might be causing services to fail.

I'm not sold on FKs being something you should never do, ever, but I will say that I used to work at a place that pushed their database servers very hard on real-time tasks, and also tended to avoid foreign keys, and my biggest complaint was not data integrity, it was just that there was extra effort that needed to go into documenting the logical foreign keys (the keys still exist, they're just not enforced by a database constraint), since you couldn't just read them from the database schema.

I'd like to say that this can be something where you default to having FKs, and remove them as you hit performance problems, except that the poster is right: Developers have a habit of leaning on database constraints instead of doing their own sanity checks, and they simply don't know all the places they're doing it - they're not necessarily even aware they're doing it when they're doing it - meaning that removing FK's after the fact doesn't get you to "constraints are enforced by the apps", it gets you to "constraints aren't enforced". I think you probably want to try and anticipate ahead of time if you'll have a problem, so that everyone can know from the get-go whether they'll be working without a safety net.

Is it even possible to realistically do FK checks in app code? Nobody uses serializable transactions and with other isolation levels I think it's possible to check for FK, it's OK now, insert new row, but another transaction simultaneously deletes that row and both transactions happily commit. It's called phantom read. Probably it's very rare event, but it'll happen.

At the place I worked that skipped FK checks, they used append-only database schemata. If you limit yourself to CR operations, a lot of things get easier than they are with CRUD - ensuring (a somewhat weak form of) database consistency without transactions, yeah, but also things like audit trails and slurping new data into the BI system.

I'm not sure "you can't add checks because they might fail!" is sound reasoning...

It's not "you can't add checks because they might fail"; it's "you can't temporarily remove checks in a live system where the database's users are used to leaning on them, because when you try to re-apply them you just might find yourself in a world of fail".

I didn't quite understand the migration issue TFA talks about, but at least with PG, there's more than enough clever ways to make schema changes incrementally.

>preferably inside a transaction?

MySQL does not support transactional DDL, unfortunately.

So then don't do it inside a transaction. It still wouldn't be any less safe than their current approach of just not using FKs at all

Adding a foreign key to a massive table can take a long time, since all existing row data must be validated against the constraint. With logical replication, long operations like this can be quite disruptive, even with a multi-threaded logical replication scheme.

That’s no longer true since MySQL8 https://dev.mysql.com/doc/refman/8.0/en/atomic-ddl.html

Sadly I don't think this is actually true. From that page:

DDL statements, atomic or otherwise, implicitly end any transaction that is active in the current session, as if you had done a COMMIT before executing the statement. This means that DDL statements cannot be performed within another transaction, within transaction control statements such as START TRANSACTION ... COMMIT, or combined with other statements within the same transaction.

Whereas with real transactional DDL, the DDL can be done within large transactions, multiple ALTER's could be rolled back, no changes are visible outside the transaction...

MySQL 8 was extremely new at the time this comment was posted. It's likely that mysql limitations were a big part of the thinking. The mention of pt-online-schema-change at the bottom makes me think so.

MySQL 8 DDL is atomic at the statement level, but cannot be combined with other SQL DDL or DML statements in transactions yet.

Ahh to fkey or not to fkey. The problem with doing referential integrity in the app is what happens if a change is made to the database outside of the app?

I guess it depends. If you’re at the scale where you can’t refactor your database to keep up and you, after profiling or some other method, have proven that moving integrity enforcement into the model of your app is faster then by all means do that.

That being said: foreign keys help the query optimizer make better decisions. If you’re using and ORM though that doesn’t matter as you’ll get generic queries and you’re likely not profiling your queries anyway.

They also are a way of self documenting the relationship between things.

If you’re going to get rid of foreign keys then why not denormalize so that the data is easier to query and less likely to get corrupted? Reduce the number of tables say into a single table a-la the DynamoDB approach

There seems to be a certain mysql tinge to this value calculation. I use foreign keys and like it (I use Postgres). I do not encounter the problems the author does. I omit them when they are a problem for some reason. This is rare in routine work.

I'm gonna wager a guess and say that most MySQL users do use foreign keys without a problem. Most MySQL users are not running Github scale.

That's probably true. The only tell for that, really, was the complexity around migrations/schema changes, which seems a lot more painful on MySQL.

I don't understand why anyone would choose MySQL over PG, honestly.

True, but the people I know who run PostgreSQL at that scale do not complain about foreign keys. Not since PostgreSQL added separate lock levels to improve concurrency with foreign keys.

Because I've never thought about this, I'll ask the dumb question...

A shopping cart has many items. An item belongs to a shopping cart. In a relational database, without foreign keys, how do you associate the shopping cart with the items?

The suggestion is to not use foreign key constraints in the database.

The suggestion is not to eschew columns which might be JOINed on in a query.

I.e. "Don't enforce FK relationships with a constraint in the DB. Make sure the values in both tables which may be joined are consistent by using application code to enforce this."

> Don't enforce FK relationships with a constraint in the DB. Make sure the values in both tables which may be joined are consistent by using application code to enforce this.

For those reading at home who aren't good with databases. This is exactly the same statement as:

"Don't enforce valid inputs on the backend. The javascript front-end will enforce it for us".

Always validate your inputs at the appropriate layer. Your backend can never trust input from a web front-end (even if there is "only one web front-end"). Likewise, your database should never trust input from its clients (even if there is "only one client"). Ignore this rule at your own peril.

Except hopefully your customers do not have direct access to your backend.

Are your engineers omnipotent beings who never fail, never make mistakes and understand the system entirely (even all the parts they don't know about)? Is your QA process perfect? Your servers never crash in weird states?

Lucky for you!

A foreign key is a hard-constraint that must be enforced.

There's no stopping you from storing item IDs in a theoretical shopping cart table without FK constraints though. It's just that the onus is on your application to provide the guarantee that the item exists.

This becomes a little clearer if your item IDs are not simply auto-incrementing IDs (which have their own challenges in a large distributed system), but instead are SKUs, or things which have actual meaning.

If you have a need for a ID that is unique but otherwise meaningless in a distributed system, feeding a concatenated node ID, timestamp (to ms), and looping transaction counter (mod a sufficiently large number) into a secure hash like SHA-256 should take care of it.

Sure, that satisfies the uniqueness aspect if that's all you're looking for in an ID. But it doesn't satisfy the predictability aspect.

You can't know what the ID will be ahead-of writing the item to the database, nor store the item in a shopping cart without retrieving the item via a different piece of information (which is pretty likely to be SKU in this case)

When there aren't any connections to known unique keys (like SKUs for shopping), PK predictability isn't really possible (or necessary). The above formula is just an alternative to using auto-incrementing integers when in a distributed environment.

In a non-relational model, you'd generally copy the items/products and their quantity into the shopping cart.

That's a huge change from the relational model, and takes much more space than a normalized approach, but it also comes with some advantages. For example, it makes sure that the price of the items in the cart remain constant, even when you change product prices. That way, you don't have to inform the user "oh, your cart just got 8% more expensive" before checkout.

Agreed. Invoices, for example, are a representation of history so they need the exact data that was used at the time. They can have a reference to a product id but at any time it's possible for the data associated with that product to be completely changed from the meaning at the time of the invoice. This happens with, e.g., UPCs, which can be reused. On the other hand, if you have assemblies of products then you probably want those items to update themselves automatically.

It's confusion over the terminology. There are two definitions of "foreign key" in wide use:

(1) When one table's key occurs as a value in another table, in that second table that column is sometimes called a foreign key. For example, table A has id, table B has A_id, and people refer to A_id as "a foreign key".

(2) When you define this relationship explicitly in the database, so that the database can enforce it (and/or to document it), that's called a "foreign key constraint" but also sometimes just a "foreign key".

This article just means they don't use #2. It doesn't mean they don't use #1.

The phrasing was probably clearer in the context of the discussion they were having at the time because the comment above it linked some docs for the software in question (qh-ost), and under the "Limitations" section, those docs said, "Foreign key constraints are not supported." So to them it was clear they were talking about database constraints.

I agree, whether or not a given data model has foreign keys is a math question with a correct answer (up to isomorphism of the data model); if it does, where in a system they should or should not be enforced is an entirely separate, implementation (vs semantic) level issue.

you are talking about relation, in this context fk is an abstract concept

linked article describes fk as an dbms construct - a schema constraint

relations and conceptual foreign keys may exist without dmbs constraints, in such scenario dmbs does not guard validity of conceptual "foreign keys" and treats them as usual data

for example consider schema:

user{id,name}; book{id,title}; library_borrow{id,user_id,book_id,date}

if you want you may define a constraint:

ALTER TABLE library_borrow ADD CONSTRAINT FOREIGN KEY (user_id) REFERENCES user(id) CONSTRAINT fk_library_borrow_user_id;

if you do not define such constraint then dbms will not verify existence of records in `user` table when you insert data in library_borrow

absence of constraints affects also indexes - if you do not define a constraint you usually have to manually define an index on fk column

The association would still exist implicitly in the code that uses the database. The columns could stay the same, and the code could still use the item's hypothetical cart_id column to get the items in a cart.

You lose the explicitness and safety of referential integrity in the database, but it's a trade off that can enable other things, like sharding, as mentioned by the GitHub employee.

In what way does sharing make it impossible to have database enforced referential integrity?

You do not have to have a foreign key defined to do a join across tables. Foreign keys just enforce that all the values in the given column of a table are present in another given column on a specified table (usually a second table). They prevent you from entering data that does not match up.

The way I've usually seen it done is that the foreign key columns still exist. The relationships are there, the database just doesn't enforce them.

The items table might have a shopping_cart_id column

A FK declaration uses named field(s) in your table. You can omit the FK but the associative fields remain. An FK is a constraint on what values can be in those fields.

There is no difference when not having foreign key constraints.

In both cases you can't delete items from your catalog or you'd break old carts. So you just don't do it.

Another way is to copy the product from the product table to the cart table. This saves you from making references to versions of products.

An item should never be deleted from a catalog. That would break a lot more than carts.

If you need to track the inventory status/availability of an item, then you store that in the database. The specific implementation (field, list of current items, etc) depends on your needs.

All of these are mostly issues with the database engine implementation:

1. "FKs are in your way to shard your database." => not a problem for distributed databases that can query and duplicate data across shards/servers; alternatively, if the referenced data just isn't there in the database, a foreign key is not usable by definition

2. "FKs are a performance impact" => the app either just got the foreign key value from the database, so it should be cached in memory in a properly implemented database engine, or otherwise, the application relies on the database checking so you need the foreign key for correctness

3. "FKs don't work well with online schema migrations." => not a problem with database engines that properly support online schema changes without locking, downtime or table renaming hacks

The performance impact is at write time, and having the data cached in memory is irrelevant. It's the locking overhead that's non-trivial, especially if each table has several FKs, when operating in a high-volume OLTP environment.

As for "database engines that properly support online schema changes without locking, downtime or table renaming hacks", please name some. In my experience, every major DBMS has cases where certain ALTERs block concurrent DML, which is extremely problematic on very large tables. Or even cases where there's no locking, but the operation still takes a very long time, which is conceptually problematic for logical replication.

> FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead.

You have to check referential integrity somewhere. You can do it in the application or you can let the RDBMS do it for you, but it will have to happen somewhere. So the overhead argument is kind of bullshit. Yes, managing the integrity in the app might provide more flexibility, but it comes with a price of, well, having to do your own integrity checking and introducing more bug space into your code.

> FKs don't work well with online schema migrations. ... > Doing schema migration of P is just not going to work. Recall that gh-ost renames the table at the end.

So basically, the problem is not foreign keys, the problem is gh-ost implementation. Since they don't use FKs, they implemented their migrations in such a way as to make it impossible to support foreign keys.

Yes, in the parent-child example, to migrate the parent you would have to migrate the child. But it is not an impossible thing to do. Create the new parent, import the data, map the record ids from the old parent to the new parent, create new child, drop the foreign key, import the data, update the FK field using the map and create a new FK. Rename the tables and you are done.

> When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.

Sharding your data is not something you do on a whim. And you will need to test your app. But you were going to test your app anyway, weren't you? Or did you think that you could split your database into multiple databases without any testing? As for unknown extent, that is kinda bullshit. Look for places that catch integrity exceptions. Thats what you need to look at. Look at your database schema and if any of your tables are "ON CASCADE DELETE", go and punch your DBA in the face. If you are the DBA you might want to start running before someone else finds out.

> > FKs are a performance impact. The fact they require indexes is likely fine, since those indexes are needed anyhow. But the lookup made for each insert/delete is an overhead. > > You have to check referential integrity somewhere. You can do it in the application or you can let the RDBMS do it for you, but it will have to happen somewhere. So the overhead argument is kind of bullshit. Yes, managing the integrity in the app might provide more flexibility, but it comes with a price of, well, having to do your own integrity checking and introducing more bug space into your code.

Perhaps they meant "latency" rather than "performance". With eventual consistency you can have lower latency, and spend more cycles fixing referential integrity later.

Of course, sometimes things fail in eventually-consistent systems (e.g., you order some item from a vendor who claims to have stock but oops! they don't, so your order will take a few more days to ship than they claimed and now you're a bit sad but hey, it all works out in the end, right?).

So basically they don't use foreign keys because they're into premature optimization.

If you need to shard your database, you're big enough that you'll have the resources to engineer the new solution.

The vast, vast majority of use cases will benefit from the database ensuring the integrity of your data.

Github is probably past the stage of premature optimization.

Presume (for argument's sake) that we accept the premise that foreign keys do not scale. Then the question is whether there exists some sort of realistic migration strategy once usage grows past that. If it does, then using FK makes total sense until we hit that point, doesn't it? And if true this should apply to 99% of users.

But if there exists no such migration strategy, then maybe the guy has a point?

In all of my years of development, I’ve never seen data that can’t be migrated to a new system. I’m not even sure what that would mean tbh.

I work at Vimeo. We don't have any foreign key constraints in our database either.

This thread makes it sound like it's impossible to make a medium scale application work without FKs constrains. I work on a commercial application that depends on relational relationships without FKs and the number of data corruption cases that we run into on regular basis is zero. Data integrity is handled at the application side, unexpected crashes are accounted for by heavy use of transactions, and everything works just fine.

To be clear, I'm not advocating against the use of foreign keys. But not using them is perfectly doable and not at all what this thread would have you believe.

Sorry to hear it!

> It may even rely on FK to cascade deletes (shudder)

Is this just the author's personal taste, or is there something about mysql that makes ON DELETE CASCADE a bad idea? In postgresql it's a useful tool for maintaining database consistency.

I'm guessing it's the possibility for data loss. So someone accidentally deletes a user and it cascade deletes all invoices referenced to that user (deleting invoices is a big no-no in accounting).

These kinds of things are exactly why foreign keys are so helpful to keep things from getting corrupted. Invoices should always reference a user in the user table.

The relationship should not be "ON DELETE CASCADE" but probably "ON DELETE NO ACTION". Then when some junior dev tries to issue a DELETE on the users table, the DB will correctly tell them to get bent and throw an error because there are dangling invoices still associated with the user. Because you are a smart person, you've also prohibited DELETE and UPDATE's against the invoices table for the db user that is used. The application cannot delete the invoice, and it cannot delete the user either--which is the only correct course of action. The database can now protect itself from hostile input from all your junior devs. Exactly why you want FK's. You can't trust user input.

The fact that the author of the article shudders about cascading deletes demonstrates that they have zero authority to speak about databases--something that is very common in the developer community. DB's aren't mysterious black boxes. They are really fucking cool tools that can do all kinds of neat shit to keep your data nice, safe, cozy and warm (they do way more than that too... a good database is perhaps one of the most underutilized tools developers have in their toolbox)

If the author is thinking of that kind of situation, he's right that ON DELETE CASCADE would be inappropriate, but he's also incredibly wrong to not want a foreign key enforcing that relationship. Deleting a user associated with invoices that should not be deleted is a big no-no too. A foreign key would not let the user be deleted until after the associated invoices are deleted.

There are plenty of other situations where ON DELETE CASCADE is the best option.

the problem with ON DELETE CASCADE isn't the CASCADE, it's the DELETE. Almost always, you want to to mark an entity disbled, not really DELETE it. If you want to DELETE it (for GDPR?) you should have something in place to fail your delete unless you've properly defined how to clean up danglig keys. (Perhaps what you need to do is delete the non-primary-key fields containing user data, but keep the row for relational integrity)

ON CASCADE DELETE is used for deleting logical sub-components (a row owned by another row) in a well normalized database -- it's for deleting your dadress book when you delete your account. It's not for deleting all your friends when your account is deleted.

> It's not for deleting all your friends when your account is deleted.

Very nice summary up until that point. But in a well normalized db the friends relationship (many-to-many) would be its own table. Friend relationships with the deleted user would be deleted, as they should be. Friends would not. (You still have time to edit for a better example.)

Okay, then don't do ON DELETE CASCADE. If you really want to allow the user to be deleted entirely, then ON DELETE SET NULL (or ON DELETE SET DEFAULT pointing to the ID of some "null" user) should do the trick. Otherwise, if you want to retain the customer info (which if you're saving the invoices you almost certainly do), then ON DELETE RESTRICT is what you want. This is all defined in the context of the invoice's schema, mind you, so you'd have to make a conscious effort to cascadedly delete your invoices on user deletion when defining that foreign key constraint.

Use the database to do your bidding. Easier to stay sane that way :)

I mean, sure, don't use on delete cascade in that case. But there's a big difference between "sometimes bad" and "always bad". The link seems to argue "always bad".

If I have data which much always be deleted if its owner is deleted, what's the process? Manually issue exhaustive delete statements in a transaction?

Is on delete cascade always bad or just a bad default? (It's obviously a bad default, I'm not arguing that.)

Applications are open for YC Summer 2021

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact