As a counterpoint to the linked issue, I operate a few small applications. Foreign keys (and constraints in general) are great at ensuring that invalid data doesn't find its way into your database. Yes, they have a performance cost. Yes, they make sharding more difficult. In my experience, at smaller scale the trade-offs are worth it. YMMV
CQRS is one of the biggest examples I've seen personally of this - your flexibility will go down, the work to do "basic" things will go up, and you will really lose a lot of speed by solving 10-year-old-company problems with technology rather than 1-5-year-old-company problems of product-market fit.
My favorite example is a bootstrapped startup that I co-founded. We had a couple thousand users and less than a gigabyte of data. The app was a run-of-the-mill CRUD app. I built the entire back-end API as a single service that could be built and deployed on Heroku in minutes.
At some point, I left the project, and my co-founder hired an expensive consultant to review the system and provide feedback. One of his suggestions was to break up the entire service into a suite of microservices. And this was for a service that barely averaged a handful of active users at any one time. And a team of ~2 engineers working on the back end code. Microservices.
People get pretty unhappy if they hire a consultant and don't get some drastic change recommendations. "Everything is good" doesn't sit well when handing over cash.
I suggest that this is largely untrue when the consultant being hired is a security consultant.
There is demand out there for security consultants who will rubber-stamp your existing software.
Moreover, you now have a great rule of thumb for your brand new junior developers: "If you're handling a GET request, use this class, and if not use this other class" making code reviews easier to do and helping to enforce better structure in your code. Plus tons of other benefits.
The rest of that stuff that isn't CQRS you might eventually need if you scale, but is easy to add once you have the bones in place.
It also forces people to think about something that's usually glossed over: Consistency. If you have an RDBMS backing you, you tend to not think about the fact that what the user on a web page sees is already out of date when they see it, so any responsible application should track when that data was read and reject updates if the backing data has changed since it was read... but very few applications even attempt to do this (it's really hard and a huge amount of boilerplate with most APIs). With CQRS/ES you are really forced to think about these things up front -- and you can actually avoid most of the issues. Whether that increases or decreases 'productivity' I don't know, but I do know that thinking about these things increases correctness.
For me, correctness is paramount. If you don't mind bugs, I can give you an infinitely fast solution.
 I do think ES is an integral component.
Remember: For smallish applications you can still have your application itself be a monolith, so no need for complicated setups with message queues (Kafka, whatever), etc. etc. In this case you can basically rely on near-instant communication from the Command side of things to the Query side of things. You can also have the frontend (Web) subscribe to updates.
(We have our own in-house library to do all of this so YMMV. I'm not sure what the commodity CQRS/ES libraries/frameworks are doing currently.)
It is done very simply with versioning, usually implemented transparently for the data-layer API clients.
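A minimal sketch of that versioning (optimistic locking) at the SQL level, assuming a hypothetical documents table that carries a version column:

-- read the row and remember the version the user saw
SELECT id, body, version FROM documents WHERE id = 42;

-- later, apply the edit only if nobody wrote to the row in between
UPDATE documents
SET body = 'new text', version = version + 1
WHERE id = 42 AND version = 7;   -- 7 = the version originally read

-- 0 rows affected means the backing data changed since it was read:
-- reject the update and ask the user to re-read/merge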
The freedom and flexibility to create your read models and write models the way they are needed just completely sidesteps a lot of design issues that creep up. Plus, it makes the code so easy to follow when everything follows the pattern; picking out where state changes happen becomes trivial for example.
I'm not sure what your experience with CQRS has been, but CQRS is not a solution for scale; it is a solution for the long-term maintenance costs of a growing codebase, by enforcing the decoupling of services.
On one project where we seemed to handle engineering strategy well, one of the considerations we'd make is whether paying the tech debt would accompany an inflow of cash or not. If it didn't we'd better tackle it now so we don't bleed out.
It's probably another way to say "spend money to make money". Management is more willing to spend their way out of a problem when they've had a fresh fix.
The trick is, though, can you pay down that debt fast enough that the customers don't get upset while waiting. If someone's firing up a big ad push you'd better be prepared. And people, like me, who get caught flat-footed once (or have friends who were), don't want it to happen a second time. So they get opinions on 'last responsible moment' that might differ from the idealists.
My attitude is:
"Man if this takes off and I have to make some changes on how we do this....I should celebrate!"
As long as you're not painting yourself into a corner, a lot of things can be fixed 'later'. Later, when you're not desperate to get the thing off the ground. Later, when you can throw two engineers at it for six months. Later, when the pain points are well understood.
The only time I've worked on a codebase that had implemented CQRS was when I was given a CRUD application to maintain, one that managed maybe 20 different records. Rarely used. The people who had developed it in .NET MVC were of such technical prowess that, besides the CQRS pattern, the frontend still had a functional shopping cart (yes, you clicked on "save" and your "order" went to the shopping cart, a relic of the MVC demo app).
Sometimes I wonder if I've just been unlucky or the world is all like this.
This point is valid in the same way as telling a startup founder to give up the startup and invest in the S&P 500. It's trivially true, but ultimately useless advice.
Building technology is hard precisely because there are so many tradeoffs. Speed to market vs scalability is one of many. It's silly to pretend that there is some kind of rote obviousness to never caring about scale in the early stage, any more than there is rote obviousness to just investing in the S&P.
Ex post, most startup founders/investors should have just invested in the S&P, and most early stage tech leads should have just used Wordpress or Rails.
Plenty of early stage tech leads DO use WordPress and Rails because their company doesn't need something more. You just don't hear about them often because they won't be sexy enough for HN. YAGNI.
OK so then does YAGNI apply to investment? Why not let existing firms handle it in the S&P?
> YAGNI (continued)
over-engineering is definitely a problem, but so is hindsight bias.
I think that most non-trivial tech builds involve some smart investments and some dumb ones. It's essentially a portfolio of decisions made in the face of uncertainty. YAGNI is cynicism and hindsight bias, and over-generalization.
I think this is less obvious than it should be because of the popularity of the highly cynical "agile" approaches that consider a project that does not plan beyond the next sprint as somehow not taking on code debt.
Yes, don't rebuild your own balanced portfolio or pay someone to "beat the market" for you; just buy a COTS ETF, put it in the server closet and forget about it for a decade.
The reply in the GitHub thread we’re talking about makes it clear that they still perform FK validation - it’s just performed in the application code rather than the DBMS.
I note there is another alternative: deferred constraints - or just run a query to check for invalid rows at 3am every morning.
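Both are easy to sketch, assuming hypothetical orders/customers tables:

-- the 3am check: an anti-join that lists child rows whose parent is gone
SELECT o.id, o.customer_id
FROM orders o
LEFT JOIN customers c ON c.id = o.customer_id
WHERE o.customer_id IS NOT NULL
  AND c.id IS NULL;

-- the deferred-constraint variant (Postgres syntax; MySQL/InnoDB has no deferrable FKs):
-- the check runs at COMMIT rather than per statement
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id)
  REFERENCES customers (id) DEFERRABLE INITIALLY DEFERRED;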
I was once a full-time Rails dev and really loved the framework (I don’t write as many user facing applications these days). Most of the Omakase trade offs didn’t bother me. But I never understood the disdain for foreign keys. For 99.99% of web apps, you want them (dare I say, need them).
Even in Rails 3 I would add them by hand in the migration files. Very few applications will ever actually care about sharding. We were pulling in millions at the last Rails company I worked for and were just fine with a master and a single replica.
If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition. Your goal should be to outgrow what MySQL (or Postgres) can do for you in master/replica mode. If you do, you’ll likely be independently wealthy...
If data integrity matters at all, model-based checks (or periodic queries for orphaned data) will not suffice. Just put a foreign key in the table where it belongs and let your DB do what it does best. ACID is an amazing thing...
Maybe originally, but lack of foreign key usage is certainly not Rails specific today. Large MySQL shops generally don't use foreign keys, full stop, for the exact reasons Shlomi described in the original comment.
Facebook does not use foreign keys either. In my experience, same thing is true at all the other large MySQL-based companies. And those companies make up a majority of the largest sites on the net btw -- including most of the consumer-facing social networking and user-generated content products/apps out there.
This does not mean that, for example, Facebook is full of data that would violate constraints. There are other asynchronous processes that detect and handle such problems.
> If you get to a point where sharding is best for your company you hopefully have enough revenue coming in to fund a data transition.
Do you mean, stay with your current RDBMS of choice and shard while also removing the FKs? Or do you mean transition to some other system? (and if so, what?)
The former path is bad: sharding alone is very painful even without introducing a ton of new application-level constraint logic at the same time.
The latter path is also bad: only very recent NewSQL DBs have sharding built-in, and you probably don't want to bet your existing rocket ship startup on your ability to transition to one under pressure. Especially given much higher latency profiles of those DBs, meaning a very thorough caching system is also required, and now you have all the fun of figuring out multi-region read-after-write cache consistency while also transitioning to an entirely new DB :)
I typically espouse app logic to check state and use foreign keys to ensure enforcement (eg, race conditions that are very hard to ensure at the app level but are built into many RDBMS). Foreign key failures are just treated like any other failure mode.
But honestly, I haven't been part of a company that has really hit those upper limits that require sharding. They exist, yes. But most companies will never need to worry about it. Which is my point.
Social networking and user-generated content products (including GitHub) need to be built with scale somewhat in mind: if the product reaches hockey-stick growth, a lot of scalability work will need to be completed very quickly or the viral opportunity is lost and the company will fail. I'm not saying they should pre-shard, but it does make sense to skip FKs with future sharding in mind.
This was nearly a decade ago, but in one case I helped shard my employer's DBs (and roll out the related application-level changes) with literally only a few hours to spare before the main unsharded DB's drives filled up. If we also had to deal with removing FKs at the same time, we definitely wouldn't have made it in time and the product literally would have gone read-only for days or weeks, probably killing the company. Granted, these situations are incredibly rare, but they really do happen!
Trying to make sure everything is where it needs to be at any given time, everything is inserted in the right order, and the data is always consistent brings an exponential amount of complexity, when it could all be checked at the end and pruned for invalid data. And usually DB integrity will not be enough; you'll want business-level validation that all is OK, so there will be app-level checks anyway.
In particular, splitting commits allows you to ingest data in parallel (for instance, if we import stores and store owners, both could be ingested separately without caring at first whether each references a valid entity).
Application code has bugs. Application code can fail in ways that result in corruption. A primary job of the database is to keep itself from getting corrupted. Enforcing foreign key constraints is something only the database can do correctly. Punting the responsibility to a higher layer will result in corruption.
That would be where the road splits. If you handle the database as a very fast and structured storage application, a lot of these assumptions go away, with then a different set of tradeoffs.
Some data corruption could be fine if you can guarantee the critical cases, just as bugs in the code are fine as long as the useful cases are covered.
I remember a database with scheduling entries in it that could get duplicated depending on the sharding, but it didn’t matter because the app handled the case gracefully.
I think understanding the tradeoffs that best match the use case is the most important point; always assuming that a DB has to guarantee integrity can be a burden that prevents you from looking at all the options.
They have engineered for, and understand the implications of, foreign key violations. Typically, it's as simple as "This row can be deleted", and that can cascade - at a totally different rate than you'd find in a database and with totally different performance characteristics.
Reporting and BI data might get hosed.
Account management might get hosed.
Who knows what happens when FK rules are violated, because by definition they should never be violated. It puts all applications on top into an undefined state, leading to bugs and god knows what else.
Foreign key violations are 100% data corruption.
You're missing an important point: it violates a certain set of rules of how the data relates. That certain set of rules typically makes it easier to write most general data applications.
If your rules define a foreign key as optional, you can easily engineer an application around it.
Great examples include NoSQL and Ruby duck-typing. In both cases, you infer actions based on the data structure, rather than making explicit assumptions.
Or you could, you know, simply add a foreign key constraint and never, ever, ever have corrupt table relations.
Why people fight their tools is beyond me. There is almost zero reason to defend not using foreign keys.
i suggest being open to the concept that other perfectly capable humans may, in fact, design systems with different underpinning assumptions than those you appear to presuppose always must apply to everyone and everything.
It's important to remove the DBA or Developer hat and realize that you are working together on a singular system (or maybe it's even the same person, etc.)
"corruption" implies that the desired results have not been achieved, that essential data has been lost or compromised, and this clearly is not the case with such a deliberate design decision. one should be prepared to accept this possibility in order to not be merely a fanatic and unreasonable.
Sure, yes, of course, who cares?
These are self-imposed rules, so breaking them is wholly up to yourself. And there are trade-offs involved that may often make it acceptable to relax them.
You seem to be rather invested in one set of arbitrary definitions. If people do fine even in situations where they "by definition" shouldn't, it's the definitions that have been found lacking, not the people.
The stakeholder from whom you got the requirement on which the logical FK constraint is based, for one.
Nobody suggests taking an app/db with foreign keys and removing them; that is a recipe for disaster.
In what way is letting the database ensure it isn't getting fed crap any different?
Why do developers constantly think it is okay to let unvalidated user input hit their database? Any client calling your database is a hostile client that will feed your database bad data. Arguing otherwise is complete ignorance.
Because one canonical validation layer is generally enough. You can see how having two separate partial validation layers could cause problems, right? And you can't put all the validation in the database, for non-trivial apps.
And "clients" shouldn't be talking to the database, no matter if you have foreign keys or not. That's a totally separate issue.
> Account management might get hosed.
LOL, if I read the data directly from the db instead of via the application's API then sure, I lose the application's guarantees. But, y'know, same might be said if I just go and read the DBs files from a sidecar shell script or something.
> It puts all applications on top into a undefined state
...that's just an assumption that's false.
It is not, though, no matter how much you keep shouting it is.
> because your engineers don't understand how to use the tools provided by their database....
...I'd be careful waving accusations of incompetence if I were you, it's not me or my coworkers declaring databases to be black magic.
> Seems like an awful lot of work to re-invent a wheel that your DB server can solve for you.
...try to run a car company without reinventing the wheel every few years, I dare you.
> I guess that is on you though....
Will result in more fault tolerant software, also yes.
It's a trade off and one I make willingly at every scale.
I stopped using foreign keys after university and have never wanted them since.
Non nullable database fields are far more useful than worrying about fks.
More fault tolerant because you have to waste time debugging the faults and monkeypatching them in code just to avoid FKs?
> Non nullable database fields are far more useful than worrying about fks
Can’t even count how many non-nullable fields I’ve seen packed with “” empty strings to get around that requirement
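If that becomes a problem, the fix is again declarative rather than more app-side policing; a one-line sketch against a hypothetical users.email column (CHECKs are enforced by Postgres, and by MySQL as of 8.0.16):

ALTER TABLE users
  ADD CONSTRAINT email_not_blank CHECK (email <> '');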
Further to that, instead of having lazy cascading deletes moving the goal posts on you: data missing its parent, for example, is an indication something is wrong. It's a useful diagnostic.
2. Most string data is safe to represent as empty. Handling null, on the other hand, is a different situation and often induces warnings or other side effects depending on the language. Forcing the null checking to your boundaries is much like forcing your state/mutation to the boundaries in FP. Hell, even though I think it's insane, hexagonal arch works on this idea as well.
One of the nice things you get from database FK constraints is when something messes up the data, you get an exception thrown in the code that's messing it up, complete with a backtrace and whatever inputs and timestamps you care to log.
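Concretely, the failing statement blows up right at the call site; the exact wording varies by database, but a Postgres FK violation looks roughly like this (table and constraint names are illustrative):

INSERT INTO children (parent_id) VALUES (999);
-- ERROR:  insert or update on table "children" violates foreign key constraint "children_parent_id_fkey"
-- your driver/ORM turns that into an exception, complete with an application backtrace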
You are essentially moving to NoSQL. Maybe it makes more sense to just own up to being on NoSQL, and using a data store and data access patterns that were actually built for the task? It should be something to think about, anyway.
Certainly consider this option if you're planning for that scale from the start, as a more meaningful alternative to simply saying "no foreign keys" from the start. I won't necessarily say it's overengineering; there are entire problem classes where tracking millions and billions of records is on the table, particularly in event monitoring.
The next morning you can embark upon the engineering effort to start the sharding re-architecture.
The compromise we came to was to enforce them in dev and qa in order to catch bugs, and relax them in prod.
I still strongly believe that database constraints are a developer's best friend, in that you can trivially make your data structures fight back against misuse. This makes several classes of bugs obvious. But like anything, there are times they are not optimal.
I'll also say I think there are very few cases where a small performance advantage outweighs the costs, and would be hesitant to head down this path with a team less competent than the folks at Github.
You need to know your database engine, how normalized you are or are not, and how appropriate your data is for its relational model. It's not always your FK constraints that are the bottleneck you might think they are. Sometimes their "slowness" is hiding a problem elsewhere in your schema: a bad shard, an index that could be better, or a key-value pool with soft version-controlled enums pretending to be relational data.
Sometimes "optimal" isn't a simple spectrum but a checklist of trade-offs and no "right" answer.
You are "big data" when black swan events (like hardware failure taking down a database node) become regular enough that you begin to statistically model it. Before that, you might just have a lot of data, but you aren't having "big data" problems.
It's a different part of the field of course, but so many people incur opportunity costs chasing some difficult architectural problem that will likely never impact you, and if it does, you'll have the resources to throw at it.
Many startups won't even have slave production databases, much less have to worry about network distributed sharding, complex caching strategies, database optimization beyond query analysis.
Also, if you do end up sharding a relational database, it is often by identifying subgraphs of document-like structures that you can shard on. Need to ensure these subgraphs do not have foreign key relations with one another, but you can maintain the valuable foreign key relations WITHIN the subgraphs.
Practical example: a database of user profiles where suddenly you have billions? You can still keep foreign keys on user-to-email-addresses or user-to-comments while eliminating cross-user foreign keys.
I would also add a good rule of thumb: when you make the design decision to remove a database feature, you need to assume that you now need to handle that feature yourself, or your data will get corrupted. For example, when removing FK constraints or transactional boundaries, or introducing irregular checkpointing, you now need to implement a data repair system, because your data will get broken. At which point you are probably going to end up using a change event log (your own transaction log) and a system that replays the logs in order to repair and rebuild the database.
I'm not sure if Guido was just being amicable, but in the recent Dropbox retirement post he said, “There was a small number of really smart, really young coders who produced a lot of very clever code that only they could understand,” said van Rossum. “That is probably the right attitude to have when you're a really small startup.”
Documents do well for a wide variety of applications ;-)
Any database that doesn’t use FK’s is almost guaranteed to have crap in it that didn’t get cleaned up, resulting in data corruption (and yes, dangling stuff in tables count as data corruption).
Developers aren't perfect. Shit will slip through even with the most rigorous process. The database should always provide tools to keep itself from getting corrupted. This is why good database systems have constraints like FK's. If your database software makes using those tools painful (cough MySQL), get a better database system.
Not using FK’s is just plain old ignorance.
"As a C developer, I never check exit codes of child processes. We can just enforce it by ensuring child processes don't have bugs"
But also, you can still get a malloc failure without actually running out of memory, if the allocator can't find a big enough contiguous chunk of address space.
This is highly unlikely on a 64 bit system, but if you try to malloc gigabytes on a 32 bit machine you might see it.
Things like RDBMSes provide some really nice utilities for data-consistency enforcement in the form of FKs; these are not the most performant approach in all cases, but they're optimized to be good for nearly all use cases.
If your business need /Requires/ moving off of FKs onto some other data integrity guarantee, then just understand that you're going to be reinventing the wheel. It might turn out to be a better wheel for your car, but it's going to be expensive. And... unless you specifically hire specialized people who comprehend data guarantees and ACID properties, you'll probably do it wrong. This sort of an undertaking is for mature companies only.
There are two kinds of error handling. The "happy error" and the "fuck you asshole" error handling. Good systems have both.
Yes, the application should check it isn't about to violate a FK constraint. Just like it usually checks that it isn't about to violate a unique constraint. That way the application can fail in a happy, controlled way. But if the application doesn't do the right thing, then the DB server still should puke all over the transaction with a "fuck you, don't feed me bullshit" error.
Just like you should never trust user provided data coming from an HTTP post, a database should never trust the INSERT or UPDATE is valid and won’t corrupt the database. Both backend systems can return mean old ugly errors when shit is bad and let the front end do pre-validation that can do happy nice errors. The backend always has to enforce its own data integrity. Period.
But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity. It is stupidly expensive and unless you have a few billion in the bank there is absolutely no reason to even consider it, but if you're dealing with data volumes like GitHub then it's conceivable that all the other salves for enforcing data integrity fall short. In that case there are ways to approach removing FKs, but, when you do so, you're not (logically speaking) giving up the data-integrity from FKs, you are replacing FKs as a tool for data-integrity with another tool for data-integrity (one that will probably be very similar to FKs) - under this guise DB FKs can stop making sense (though also having any sort of RDBMS engine likely also stops making sense as you're essentially adopting the functional responsibility for being an RDBMS into the primary application).
> But, beyond that, it is quite possible to remove FK checks and still have strong guarantees about data integrity
And not the rest of your post. Yes technically you are right but it is stupidly expensive and nobody should do it.
The problem with being technically correct is, again, people will stop at the sentence I quoted and go build yet another FK free system. Said system will undoubtedly fill up with corrupt bullshit data that eventually leads to exciting mystery bugs in production that has everybody scratching their heads. I’ve seen it time and time again....
The app can check that it won't violate a unique constraint before doing an insert/update, but in between that check and actually doing the insert/update, some other process may have changed data, such that unique constraint can be violated.
So when the RDBMS catches this, it's not just a "fuck you for giving me bad data" failsafe for bugs in app code. It isn't necessarily a bug at all -- unless you intended it to be the app's responsibility to use db-level locks and/or transactions to guarantee this can't happen without the uniqueness constraint -- but then why not just use a uniqueness constraint, the tool the db is actually giving you for this?
Mature rdbms's sometimes don't get the recognition they deserve for being amazing at enforcing data consistency and integrity under concurrency, with pretty reasonable performance for a wide variety of usage patterns.
Foreign key constraint can be similar; you can do all the checking you want, but some other process can delete the object with the pk right before you set the fk to it.
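To make the window concrete, a rough sketch with hypothetical tables and two concurrent sessions:

-- session A: the application "checks first"
SELECT 1 FROM parents WHERE id = 7;            -- row exists, so the app decides the insert is safe

-- session B: in the meantime
DELETE FROM parents WHERE id = 7;

-- session A: the insert lands in the gap
INSERT INTO children (parent_id) VALUES (7);   -- succeeds with no FK constraint; orphan created
-- with a declared FK, that INSERT simply fails no matter how the race plays out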
If you have app usage patterns such that you really can't afford database data integrity enforcement (what level of use that is of course depends on your rdbms)... you are forced to give up a lot of things the rdbms is really good at, and reinvent them (in a way that is more performant than the db??) or figure out how to make sure all code written never actually assumes consistent data.
They often react to FK's as if they cramp their style, where their style is to just insert whatever into the database without regard for what and where it is.
I'll note that there are very large systems that do provide ACID-like consistency (e.g., Spanner, Lustre), so it's not at all the case that "large -> eventual consistency".
I won't pass judgment on projects where eventual consistency was a given from the start, even if I might think they could have done better, but too, I wouldn't make either approach a hard and fast rule: circumstances vary. What bothers me is the extent to which I see "triggers bad", "FKs bad", "business logic in the DB bad" blanket statements out there.
No. There are exceptions like financial systems, but most of the time it doesn't matter if your DB has dead links to some removed entities.
> Not using FK’s is just plain old ignorance.
Nice level of argumentation, but I prefer the way the author from GitHub actually backs up his opinion.
You don’t want to be caught in that situation because then you are handcuffed and cannot clean the database properly. For example, from the perspective of business operations you might have something that creates 1 single thing where under the hood 5+ DB objects are all created in a transactional manner and a very specific order. So how do you unroll that? You gotta carefully construct the rollback commands or just get rid of the bowling lane bumpers and slash and burn.
So you gotta commit and go all-in. If you make the choice to do it all within the app later, at least you keep that flexibility.
It all depends on your processes and your team. If you don’t design things properly and institute the correct procedures for designing and operating the DB, having or not having constraints at the DB level is inconsequential to the other issues you will face.
It is out of ignorance that people argue otherwise, sorry. Too many people don't understand relational databases...
The most likely thing to go out the window in an ORM-type app is concurrency. Take a big lock around DB ops and you're good, right? But then, too, there goes performance.
Or, if it's not concurrency that goes out the window, you have to implement eventual consistency...
The worst thing I've seen too much of is application-level JOINs and such. Next are recursive queries: done in the app. All that absolutely destroys performance. Then you have custom query languages that are just never good enough.
So I am mostly in agreement with you, but "b.s." is too strong for me. You can get all of this right in the app, though at a terrible cost, and there are times when maybe it's actually for the best.
Let me give you an example of how I might build something that's somewhere in the middle. Think of Uber or Lyft. I might have a single widely-replicated RDBMS for users, and maybe another for drivers, and maybe per-city in-memory-only ride DB for tracking requests and current rides, and one more, possibly sharded DB for billing. The ride DBs would send billing updates via some pub/sub scheme (e.g., like PG's NOTIFY, or something else). Recovering from reboots of the ride DB is easy: rely on the apps to recover the state. Most operations only have to hit one DB at a time. The whole thing is eventually consistent for user/driver ratings and billing, which seems rather fine to me given that the operations to synchronize are essentially just aggregation updates. In such an app there's not a lot of room for application logic in the DB -- some yes, and I'd put as much as possible in the DB.
There are plenty of apps where some degree of eventual consistency makes sense. What doesn't make sense is to end up hand-coding JOINs, GROUP BYs, and so on. And my preference is to use FKs and triggers as much as possible, but within limits (see all the above).
The result is in almost every single instance where those tools aren't used, the inevitable corrupt data will cause some kind of hard to reproduce user-impacting issue. I've seen it so many times it makes my head hurt.
Like you, I have been bitten by this in many different environments on both sides of the fence. In certain environments the absence of FK's has caused major headaches, and likewise in other environments the presence of them, combined with the lack of a robust db design, has meant the FK's cause more harm than good.
You should read Shlomi's post; he gives good reasons. Specifically, this is a GitHub issue on his tool gh-ost, which is for online schema migration, and FK's pose lots of problems for online schema migrations.
Like I said before if your database system makes using constraints hard or lets you shoot yourself in the foot (lol at disabling constraints per session), don’t just walk but run away from that system.
What other database has been used at such a giant scale at so many companies, than MySQL? I'm sure Oracle and SQL Server are used at big companies, but nowhere near Facebook scale.
Also, you seem to be forgetting PostgreSQL
While it's apt to describe some of MySQL's colossal mistakes in a few words, this isn't the kind of language that will sway heads that have been comfortably using MySQL because those defects are just "what DBs do".
However I would not at all be surprised to learn MySQL supports such a thing. Which supports my assertion it is a toy used (or at least installed by) people who have no understanding of relational database architecture.
ALTER TABLE <tablename> DISABLE TRIGGER ALL
SET session_replication_role = 'replica'
Facebook, YouTube, Wikipedia, Pinterest, Slack, GitHub, Etsy, Yelp, LinkedIn, Shopify, Dropbox, Wordpress, Wix, Tumblr, Square, Uber, Booking.com, Box, Venmo, SendGrid, Okta, SurveyMonkey, WePay, Alibaba, SoundCloud, among countless others...
An alternative view is that your statements are incorrect. Do you have much direct experience with high-volume OLTP database workloads, or are you basing your views of MySQL on something else?
I bet you any competent engineer who knows their shit about DB in those companies regrets using MySQL. I bet their code is full of hacks, crappy schemas, and all kinds of work arounds because they chose mysql. I've seen it in every company that uses MySQL. The lengths people go to avoid schema changes is astonishing.
It is much, much better to start with a real database like PostgreSQL because whatever you pick is going to be what your entire org uses from now until eternity.
As for the random blind accusations about bad code, engineer incompetence, etc that's just rude, mean-spirited, and misinformed. I personally know a lot of exceptional database engineers who work on MySQL-related infrastructure at the companies I listed above. Why crap on other people's work that you haven't seen and know nothing about?
But I think most companies you mentioned don't use MySQL in the usual way, as they would use a "standard" enterprise database like Oracle, SQL Server or PostgreSQL.
These companies don't use MySQL directly. They use it indirectly as the storage component of a larger architecture. For example, YouTube uses Vitess "over" MySQL.
Companies like Instagram are known to do something similar with PostgreSQL.
My point is that maybe you and the parent comment are not thinking about the same use case.
Yes and no. Often it's both. I say this first-hand, having performed significant work on the database tier for two of the companies I listed, and consulted for several others.
For example, while Facebook's largest db tier goes through a dao / writethru cache, there's plenty of other use-cases that are direct MySQL usage.
And in any case, why does it matter if there's another layer involved? It's still MySQL powering mission-critical global-scale use-cases. And for example with YouTube, literally the primary benefit of Vitess is that your application can treat it as a single normal unsharded MySQL installation, so those interactions are still very MySQLy.
I didn't know about that. That's interesting!
> And in any case, why does it matter if there's another layer involved?
I was writing that in the context of the parent comment about "disabling constraints". I can see why disabling constraints makes sense in a sharded environment, with an intermediate layer like Vitess. But the benefit of disabling constraints is less clear when using MySQL directly in a non-sharded environment.
Since you're here, I'd like to ask why you would use MySQL over PostgreSQL in a new project nowadays?
Regarding MySQL, the two main advantages I can think of are that tables are organized as clustered index (instead of a heap in PostgreSQL, which can be an advantage or a drawback depending on the workload) and the replication tooling.
On the other hand, PostgreSQL has a lot of useful features that I miss in MySQL: table elimination/join removal (exists in MariaDB but not in MySQL), indexes bitmap scan (to combine indexes efficiently), partial indexes, transactional DDL, LISTEN/NOTIFY, materialized views, row-level security, table functions like generate_series.
re: "disabling constraints", that's kind of orthogonal. The large MySQL users simply don't create foreign key constraints in the first place; there's nothing to disable :) Whereas MySQL's ability to disable constraints for a single session is a feature intended to make things like logical dump/restore easier and faster, schema management easier, etc. Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables), which is needlessly complex if the tables are new/empty, and very slow if restoring a logical dump which is already known to be referentially consistent.
As for MySQL vs Postgres, IMO both databases are close enough in major functionality that for many use-cases it's best to just go with what you already know, can hire for, and can operate. There are special cases where one is better than the other, for example personally I'd go with MySQL for social networking / UGC / very high volume OLTP, and go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores.
Ideally with MySQL you're just using it for OLTP, and deferring to separate systems for OLAP, fulltext search, etc. In a way that's "more UNIXy" but it's also potentially an operational headache.
In terms of specific feature comparison, you already have a great list there. A couple other things on the MySQL side I'd mention are InnoDB's buffer pool (smarter caching than relying on the OS cache as pg does) as well as the existence of MyRocks storage engine (LSM-based system offering better compression than pretty much anything else of comparable performance level for OLTP).
That all said -- Postgres is an awesome database, and I'd say that Postgres is more closely aligned with the textbook definition of a proper relational database. But then again I'd also say something similar about FreeBSD (vs Linux) for server operating systems, yet for practical purposes I always go with Linux anyway :)
> Without that feature, these tools would need to construct a dependency graph and create tables in a specific order (and/or defer FK creation until after the tables)
This problem alone justifies the ability to temporarily disable constraints. I notice PostgreSQL, which is a toy too ^__^, offers something similar.
> Go with Postgres for use-cases where solid geospatial, OLAP, or fulltext are core requirements and/or there's a desire to minimize the number of different data stores
I agree that PostgreSQL is a really good match for these use cases.
Great point about Linux and FreeBSD ;)
I've only ever seen people using MySQL at scale if they started with MySQL in the prototype and never had the energy to migrate.
And there are tons more. In fact here is a tool to help you do it
Lots of people do dumb things for dumb reasons.
Most developers I interact with, even the really good ones, are profoundly stupid when it comes to databases.
Otherwise: use FK's to maintain stronger data integrity.
As hesk mentions below, in Postgres, you can do all kinds of table ALTERing if needs be.
Any developer who thinks application code can enforce relational integrity is naive and does not understand relational database systems. It is impossible for any layer above the database itself to keep things from getting corrupt.
The point is that application code has bugs, and it's a lot easier to specify your constraints declaratively in a single place using a purpose-built DSL (SQL) than it is to enforce them procedurally every time you access the database.
I've seen a few applications that keep some data hot-loaded into a shared slice of memory and have tight controls over the altering and saving of that data - that's sort of one of the easier ways to partially avoid a full reliance on FKs. If you want to use write-through caching, then you're essentially accepting the cost of implementing data integrity checks in the application layer, which can be done - if you're very, very careful.
If you attempt this and don't exhaust a small rainforest's worth of notepad paper and a few truckloads of dry erase markers then you're probably doing it wrong - but the problem is solvable.
There are some unsolvable problems in this realm (like holding to ACID while also ensuring that all valid communications from a client are properly processed) but you can accomplish anything your RDBMS does, you just need to be really really absurdly careful and know what you're doing at a theoretical level.
It seems difficult to wind up with better performance characteristics by doing this than by using the higher-level abstractions the database is already supplying for this purpose, like constraints. I suppose it's possible with enough hours by enough experts, as you suggest. I doubt that's what GitHub actually did, though; they probably instead decided to write app logic that was resilient to visible data inconsistency. Which sounds difficult and dangerous to me, but GitHub is a pretty reliable app, so I guess something worked.
Ensuring foreign key consistency without foreign key constraints is not just "use transactions", you have to be careful and intentional about how you are using them (and ensure all clients do). Why would you choose this over just using the foreign key constraint which takes care of it for you, using the same underlying technology?
Perhaps that's a clearer way to say what I was trying to say originally.
One of the big reasons NoSQL got popular was because it broke the norms around strict data relationships.
I think you should enable FKs for test and dev, and possibly QA, but disable them in production.
Depending on your database, you'll get surprising interference in concurrent operations via other means than FKs, but FKs don't help.
Sure, some people get there, but it makes sense to do things the right way first, then figure out how to cut corners only when and if you need to.
I feel like this was a stopgap on the way to eliminating FKs. I have no idea what his roadmap was because I wasn't in that group. But his process made for good thought fodder.
I'd much rather see the opposite, declare the FKs on a testing environment NOT VALID, run application code, VALIDATE CONSTRAINT those you'd marked and see if anything is in violation - ensuring that application code won't (in the normal course of operation) generally hit FK constraints can greatly reduce your performance hit from having them... then in prod you can turn them on and be confident in your comparably lighter performance loss to data integrity.
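For anyone unfamiliar, the Postgres syntax for that is roughly the following, with hypothetical table and constraint names:

-- add the constraint without scanning existing rows (new writes are still checked)
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_fk FOREIGN KEY (customer_id)
  REFERENCES customers (id) NOT VALID;

-- later, scan the existing data and promote it to a fully validated constraint
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;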
No kidding. My first thoughts went to referential integrity, and getting misbehaving apps successfully past testing...
I think this perspective is something you only get once you've run bigger databases in production.
The costs of having FKs in PostgreSQL are:
1) If you update the primary key of the referenced table you might see issues with locking. But this is rare in real world applications.
2) Performance overhead from having to check all FKs and taking the locks. This can be a big issue on some workloads and may force you to add extra indexes. A PostgreSQL specific issue is that FK checks cannot be done in batches.
3) Adding a new FK blocks writes to the referenced table, and dropping an FK locks out both readers and writers of the referenced table. This is a limitation of the implementation.
(Let's not forget, amid all this Postgres-specific chatter, that the article is about GitHub, who use MySQL.)
Those locks would be very challenging to accomplish at the application level.
PostgreSQL has implemented this minimum level of necessary locking for quite many years now.
FWIW, I just tested in Postgres. Locks like I said it does:
A: create table parent(id int, value int, unique(id));
A: create table child(id int, parent_id int references parent(id));
A: insert into parent values (1, 10);
A: insert into child values (1, 1);
B: select 1 from parent where id = 1 for update;
I have specific experience of this due to use of database locks at the application level to avoid deadlocks (different lock orders) and inconsistent updates (updates based on reads across multiple tables that may have separate racing updates) by locking rows up front. For understandable schema reasons, what is logically a parent entity is the natural thing to lock, but for understandable performance reasons, FKs to the parent entity are distributed through some fairly large tables.
If A tries to insert a child for 1, and B changes the id to 2.... OOPS! And from both's perspective it looks perfectly safe.
Though, I imagine you could simply not use cascading behaviors.. Neat idea though. Scary, but neat hah.
But seriously, there are a host of issues one must overcome for this plan to work. And there were indeed managed permissions for DELETE. But ultimately, I wasn't aware of much of the implementation detail ...
My feeling on it, though, is that it just shifts all the faith onto your test data being true not only to what does happen in prod but to what conceivably could.
All very well testing with a bunch of constraints, but if it turns out one actually isn't exercised and then happens in production?
It also means you can't use them for validation or whatever. Which, I don't know, maybe you shouldn't anyway, but if you turn them off in prod you can't.
GitHub's approach might work fine for small, focused teams who have exclusive control over a product. However, that's not often the reality of most organizations.
It's not even fine for a single application. Said application has bugs in it. Said application can (and will) crash in unexpected ways that leave the database in an invalid state. And if you think "oh, that will never happen to me", you just haven't been around long enough. It will happen, every single time. Any system without foreign keys will have corrupt data in it. Said system will have user-visible issues as a result.
Just like only the backend can truly validate user input from a webform, only the database can validate its input.
I get that cross-shard FKs are especially difficult. But any discussion of this topic without addressing correctness concerns is woefully incomplete.
I can only imagine how many production databases contain incorrect data.
Couldn't you catch that with a CHECK constraint?
UPDATE invoices SET
final = TRUE,
invoice_number = @InvoiceNumber
WHERE id = @InvoiceId;
If so, then the problem you present should never happen (unless the DB doesn't do atomic updates by default, but wrapping the update in a transaction should provide the necessary guarantees to prevent the problem from happening; if your DB doesn't support transactions, then you really should be switching to a different DB yesterday).
If that somehow could happen, though (i.e. you don't trust the application to be bug-free, which is a reasonable attitude), a CHECK constraint (as others have pointed out) would make the database enforce that:
ALTER TABLE invoices
ADD CONSTRAINT final_invoices_have_invoice_number
CHECK (final = FALSE OR invoice_number IS NOT NULL);
Well, in an ideal case, yes, you would be doing it like that. The reality might be different, especially when using an ORM.
>Of course, I'd also be wondering why an invoice would ever exist without an invoice number
Draft invoices do not have an invoice number, since they don't really exist anywhere. You can delete a draft invoice and nothing has happened. But if you have an invoice number, that's a record that must be kept.
Must invoices be sequential or something? Why can't I just assign the draft its to-be number, and know it's a draft because FINAL=false/0?
I mean, it'd be either that specific case or a bug. And if it's a bug, then the check constraint as described previously would catch it and prevent it entirely.
Another option, though, would be to not even bother with a separate field for "final"; if the only two states are "draft" and "final", and finality is conditional on there being an invoice number, then the application logic can be greatly simplified:
SELECT
  CASE
    WHEN invoice_number IS NULL THEN 'Draft'
    ELSE 'Final'
  END AS state
FROM invoices WHERE -- yadda yadda yadda
> Draft invoices do not have an invoice number, since they don't really exist anywhere.
Sure they do: they exist in your database. Unless your company has a strict rule about invoice numbers always being sequential with no gaps (and that'd be pretty darn strict, in my experience), you might as well pre-assign it.
The method you propose might work, but not necessarily well. Catching missing primary keys at the end is a problem because then you have a bunch of data integrity issues to sort out, preventing you from re-applying the constraint, in a live system where apps are relying on the database to do integrity checks, meaning that there's a decent chance that the running system is creating integrity issues faster than you can sort them out. Possibly orders of magnitude faster.
Doing everything inside a transaction may invite concurrency issues around locking, maybe even deadlocks. Which is again a problem in a live system, because you might be causing services to fail.
I'm not sold on FKs being something you should never do, ever, but I will say that I used to work at a place that pushed their database servers very hard on real-time tasks, and also tended to avoid foreign keys, and my biggest complaint was not data integrity, it was just that there was extra effort that needed to go into documenting the logical foreign keys (the keys still exist, they're just not enforced by a database constraint), since you couldn't just read them from the database schema.
I'd like to say that this can be something where you default to having FKs, and remove them as you hit performance problems, except that the poster is right: Developers have a habit of leaning on database constraints instead of doing their own sanity checks, and they simply don't know all the places they're doing it - they're not necessarily even aware they're doing it when they're doing it - meaning that removing FK's after the fact doesn't get you to "constraints are enforced by the apps", it gets you to "constraints aren't enforced". I think you probably want to try and anticipate ahead of time if you'll have a problem, so that everyone can know from the get-go whether they'll be working without a safety net.
MySQL does not support transactional DDL, unfortunately.
DDL statements, atomic or otherwise, implicitly end any transaction that is active in the current session, as if you had done a COMMIT before executing the statement. This means that DDL statements cannot be performed within another transaction, within transaction control statements such as START TRANSACTION ... COMMIT, or combined with other statements within the same transaction.
Whereas with real transactional DDL, the DDL can be done within large transactions, multiple ALTER's could be rolled back, no changes are visible outside the transaction...
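A tiny illustration of the difference (Postgres syntax; the table name is made up):

BEGIN;
ALTER TABLE invoices ADD COLUMN internal_note text;
-- more DDL/DML can follow here, invisible to other sessions until COMMIT
ROLLBACK;
-- in Postgres the column never existed; in MySQL the ALTER would have already
-- implicitly committed, and the column would remain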
I guess it depends. If you’re at the scale where you can’t refactor your database to keep up and you, after profiling or some other method, have proven that moving integrity enforcement into the model of your app is faster then by all means do that.
That being said: foreign keys help the query optimizer make better decisions. If you're using an ORM, though, that doesn't matter, as you'll get generic queries and you're likely not profiling your queries anyway.
They also are a way of self documenting the relationship between things.
If you’re going to get rid of foreign keys then why not denormalize so that the data is easier to query and less likely to get corrupted? Reduce the number of tables say into a single table a-la the DynamoDB approach
A shopping cart has many items. An item belongs to a shopping cart. In a relational database, without foreign keys, how do you associate the shopping cart with the items?
The suggestion is not to eschew columns which might be JOINed on in a query.
I.e. "Don't enforce FK relationships with a constraint in the DB. Make sure the values in both tables which may be joined are consistent by using application code to enforce this."
For those reading at home who aren't good with databases. This is exactly the same statement as:
Always validate your inputs at the appropriate layer. Your backend can never trust input from a web front-end (even if there is "only one web front-end"). Likewise, your database should never trust input from its clients (even if there is "only one client"). Ignore this rule at your own peril.
Lucky for you!
There's no stopping you from storing item IDs in a theoretical shopping cart table without FK constraints though. It's just that the onus is on your application to provide the guarantee that the item exists.
This becomes a little clearer if your item IDs are not simply auto-incrementing IDs (which have their own challenges in a large distributed system), but instead are SKUs, or things which have actual meaning.
You can't know what the ID will be ahead of writing the item to the database, nor store the item in a shopping cart without retrieving the item via a different piece of information (which is pretty likely to be the SKU in this case).
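A rough sketch of that shape, with hypothetical tables; cart_items.item_sku is "logically" a foreign key and perfectly fine to JOIN on, there's just no REFERENCES clause for the database to enforce:

CREATE TABLE items (
  sku         VARCHAR(32) PRIMARY KEY,
  name        VARCHAR(255) NOT NULL,
  price_cents INT NOT NULL
);

CREATE TABLE cart_items (
  cart_id  BIGINT NOT NULL,
  item_sku VARCHAR(32) NOT NULL,   -- no REFERENCES items(sku): the app must guarantee it exists
  quantity INT NOT NULL
);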
That's a huge change from the relational model, and takes much more space than a normalized approach, but it also comes with some advantages. For example, it makes sure that the price of the items in the cart remain constant, even when you change product prices. That way, you don't have to inform the user "oh, your cart just got 8% more expensive" before checkout.
(1) When one table's key occurs as a value in another table, in that second table that column is sometimes called a foreign key. For example, table A has id, table B has A_id, and people refer to A_id as "a foreign key".
(2) When you define this relationship explicitly in the database, so that the database can enforce it (and/or to document it), that's called a "foreign key constraint" but also sometimes just a "foreign key".
This article just means they don't use #2. It doesn't mean they don't use #1.
The phrasing was probably clearer in the context of the discussion they were having at the time because the comment above it linked some docs for the software in question (gh-ost), and under the "Limitations" section, those docs said, "Foreign key constraints are not supported." So to them it was clear they were talking about database constraints.
linked article describes fk as a dbms construct - a schema constraint
relations and conceptual foreign keys may exist without dbms constraints; in such a scenario the dbms does not guard the validity of conceptual "foreign keys" and treats them as usual data
for example consider a schema like this (a minimal sketch, since only the user_id column matters here):

CREATE TABLE user (id INT PRIMARY KEY);
CREATE TABLE library_borrow (id INT PRIMARY KEY, user_id INT);

if you want you may define a constraint:

ALTER TABLE library_borrow ADD CONSTRAINT FOREIGN KEY (user_id) REFERENCES user (id);

if you do not define such a constraint then the dbms will not verify the existence of records in the `user` table when you insert data into library_borrow
absence of constraints also affects indexes - if you do not define a constraint you usually have to manually define an index on the fk column
You lose the explicitness and safety of referential integrity in the database, but it's a trade off that can enable other things, like sharding, as mentioned by the GitHub employee.
The items table might have a shopping_cart_id column
In both cases you can't delete items from your catalog or you'd break old carts. So you just don't do it.
Another way is to copy the product from the product table to the cart table. This saves you from making references to versions of products.
If you need to track the inventory status/availability of an item, then you store that in the database. The specific implementation (field, list of current items, etc) depends on your needs.
1. "FKs are in your way to shard your database." => not a problem for distributed databases that can query and duplicate data across shards/servers; alternatively, if the referenced data just isn't there in the database, a foreign key is not usable by definition
2. "FKs are a performance impact" => the app either just got the foreign key value from the database, so it should be cached in memory in a properly implemented database engine, or otherwise, the application relies on the database checking so you need the foreign key for correctness
3. "FKs don't work well with online schema migrations." => not a problem with database engines that properly support online schema changes without locking, downtime or table renaming hacks
As for "database engines that properly support online schema changes without locking, downtime or table renaming hacks", please name some. In my experience, every major DBMS has cases where certain ALTERs block concurrent DML, which is extremely problematic on very large tables. Or even cases where there's no locking, but the operation still takes a very long time, which is conceptually problematic for logical replication.
You have to check referential integrity somewhere. You can do it in the application or you can let the RDBMS do it for you, but it will have to happen somewhere. So the overhead argument is kind of bullshit. Yes, managing the integrity in the app might provide more flexibility, but it comes with a price of, well, having to do your own integrity checking and introducing more bug space into your code.
> FKs don't work well with online schema migrations.
> Doing schema migration of P is just not going to work. Recall that gh-ost renames the table at the end.
So basically, the problem is not foreign keys, the problem is gh-ost implementation. Since they don't use FKs, they implemented their migrations in such a way as to make it impossible to support foreign keys.
Yes, in the parent-child example, to migrate the parent you would have to migrate the child. But it is not an impossible thing to do. Create the new parent, import the data, map the record ids from the old parent to the new parent, create new child, drop the foreign key, import the data, update the FK field using the map and create a new FK. Rename the tables and you are done.
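A rough MySQL-flavored sketch of those steps, simplified by assuming the parent ids are preserved so the id-mapping pass is a no-op (all table, column, and constraint names are made up):

CREATE TABLE parent_new LIKE parent;
ALTER TABLE parent_new ADD COLUMN extra INT NULL;          -- the schema change being migrated
INSERT INTO parent_new SELECT p.*, NULL FROM parent AS p;  -- import the data

ALTER TABLE child DROP FOREIGN KEY fk_child_parent;        -- drop the old FK

RENAME TABLE parent TO parent_old, parent_new TO parent;   -- atomic swap

ALTER TABLE child ADD CONSTRAINT fk_child_parent
    FOREIGN KEY (parent_id) REFERENCES parent (id);        -- re-create the FK against the new parent

In reality you'd copy the data in batches and keep the tables in sync while the copy runs, which is the part gh-ost automates.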
> When eventually you want to shard or extract data out, you need to change & test the app to an unknown extent.
Sharding your data is not something you do on a whim. And you will need to test your app. But you were going to test your app anyway, weren't you? Or did you think that you could split your database into multiple databases without any testing? As for "unknown extent", that is kinda bullshit. Look for places that catch integrity exceptions. That's what you need to look at. Look at your database schema, and if any of your tables are "ON DELETE CASCADE", go and punch your DBA in the face. If you are the DBA, you might want to start running before someone else finds out.
Perhaps they meant "latency" rather than "performance". With eventual consistency you can have lower latency, and spend more cycles fixing referential integrity later.
Of course, sometimes things fail in eventually-consistent systems (e.g., you order some item from a vendor who claims to have stock but oops! they don't, so your order will take a few more days to ship than they claimed and now you're a bit sad but hey, it all works out in the end, right?).
If you need to shard your database, you're big enough that you'll have the resources to engineer the new solution.
The vast, vast majority of use cases will benefit from the database ensuring the integrity of your data.
Presume (for argument's sake) that we accept the premise that foreign keys do not scale. Then the question is whether there exists some sort of realistic migration strategy once usage grows past that. If it does, then using FK makes total sense until we hit that point, doesn't it? And if true this should apply to 99% of users.
But if there exists no such migration strategy, then maybe the guy has a point?
To be clear, I'm not advocating against the use of foreign keys. But not using them is perfectly doable and not at all what this thread would have you believe.
Is this just the author's personal taste, or is there something about mysql that makes ON DELETE CASCADE a bad idea? In postgresql it's a useful tool for maintaining database consistency.
The relationship should not be "ON DELETE CASCADE" but probably "ON DELETE NO ACTION". Then when some junior dev tries to issue a DELETE on the users table, the DB will correctly tell them to get bent and throw an error, because there are dangling invoices still associated with the user. Because you are a smart person, you've also prohibited DELETEs and UPDATEs against the invoices table for the DB user the application connects as. The application cannot delete the invoice, and it cannot delete the user either--which is the only correct course of action. The database can now protect itself from hostile input from all your junior devs. Exactly why you want FKs. You can't trust user input.
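In SQL terms, roughly (Postgres-flavored; the constraint name, the user_id column, and the app_user role are invented for the example):

ALTER TABLE invoices
    ADD CONSTRAINT fk_invoices_user
    FOREIGN KEY (user_id) REFERENCES users (id) ON DELETE NO ACTION;

-- the application's role may read and insert invoices, but never change or delete them
REVOKE UPDATE, DELETE ON invoices FROM app_user;
GRANT SELECT, INSERT ON invoices TO app_user;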
The fact that the author of the article shudders about cascading deletes demonstrates that they have zero authority to speak about databases--something that is very common in the developer community. DBs aren't mysterious black boxes. They are really fucking cool tools that can do all kinds of neat shit to keep your data nice, safe, cozy and warm (they do way more than that too... a good database is perhaps one of the most underutilized tools developers have in their toolbox)
There are plenty of other situations where ON DELETE CASCADE is the best option.
ON DELETE CASCADE is used for deleting logical sub-components (a row owned by another row) in a well normalized database -- it's for deleting your address book when you delete your account. It's not for deleting all your friends when your account is deleted.
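A minimal sketch of that distinction (hypothetical tables):

CREATE TABLE account (
    id BIGINT PRIMARY KEY
);

CREATE TABLE address_book_entry (
    id         BIGINT PRIMARY KEY,
    account_id BIGINT NOT NULL,
    FOREIGN KEY (account_id) REFERENCES account (id) ON DELETE CASCADE   -- owned row: goes away with the account
);

CREATE TABLE invoice (
    id         BIGINT PRIMARY KEY,
    account_id BIGINT NOT NULL,
    FOREIGN KEY (account_id) REFERENCES account (id) ON DELETE NO ACTION -- merely referenced: blocks the delete
);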
Very nice summary up until that point. But in a well normalized db the friends relationship (many-to-many) would be its own table. Friend relationships with the deleted user would be deleted, as they should be. Friends would not. (You still have time to edit for a better example.)
Use the database to do your bidding. Easier to stay sane that way :)
If I have data which must always be deleted if its owner is deleted, what's the process? Manually issue exhaustive delete statements in a transaction?
Is on delete cascade always bad or just a bad default? (It's obviously a bad default, I'm not arguing that.)
So we added RI and it was awesome, and the bugs did go away with time.
But then we became dogmatic and had to have RI for everything, including things like maintaining local RI lookup tables for non local data. This just bogged us down so much that we couldn’t move.
Today, we use RI strictly within a single database. For foreign keys to remote databases, we assume the remote database knows what it’s talking about and that the incoming data has already been checked.
This works really well for us and in particular sharding is not a problem (sharding being a design decision not an implementation decision IMO)
Someone said elsewhere that Rails does the RI so the DB doesn’t need to. I don’t use Rails, and I don’t bet, but I’d wager that it’s faster to do RI inside Postgres than inside Rails.
The idea that we should throw out ALL RI is just dogmatic nonsense. I should know :)
The idea of rails enforcing unique constraints and FKs back then was... Optimistic. The design constraints assumed exactly one rails instance would access the database at any given time to ensure consistency.
Not a great assumption.
Sure, removing the constraint will get you a huge burst in performance, but now you are in charge of guaranteeing consistency outside of transactions in a distributed system involving at least one client and one server. That's a tall order.
As Rúnar Bjarnason put it: "Constraints liberate, liberties constrain." If you give something the freedom to be inconsistent at any extreme scale, Murphy's law requires that anything that can go wrong, will go wrong.
It is hubris to think that the average developer who hasn't spent years becoming a consistency expert is going to be able to guarantee that data won't become corrupted better than the engineers who build and maintain, and the academics who study, RDBMSs full-time. Even experts get this wrong all the time. To all junior devs out there: until you know you need to abandon a guarantee available to you in a given system, exhaust every other possible option.
It's an easy mistake to make though.