When I wrote Reddit Gold (now called Reddit Premium) I intentionally left the paying-user table index-free, looking forward to the day it would become necessary.
It hadn’t happened by the time I left the company, even though sales were exceeding expectations.
That goes against my intuition, because an index is most beneficial precisely when very few rows have the boolean set.
If you have 10m users and only 100 have "IsPremium = 1", then an index massively speeds up anything with WHERE IsPremium = 1, compared to data that's fairly evenly split between IsPremium = 1 and IsPremium = 0, where an index won't be much benefit.
So low sales would increase the need for an index.
That said, I'm assuming searching for "Users WHERE IsPremium = 1" is a fairly rare thing, so an index might not make much difference either way.
You wouldn't need any of these WHERE clauses if you have all the premium users in one table. Even managing the users in both tables is pretty trivial when you just add someone as they pay and remove them as the premium account expires.
Two relations with a one-to-one relationship implies identical primary keys. Most implementations default to creating an index on the primary key.
In this case, the premium user table only needs the user id primary key/surrogate key because it only contains premium users. A query starting with this table is naturally constrained to only premium user rows. You can think of this sort of like a partial index.
One consequence of this approach is that queries filtering only for non-premium users will be more expensive.
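The separate-table approach can be sketched next to its flag-column equivalent (SQLite syntax; schema names are made up for illustration — SQLite has supported partial indexes since 3.8):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Option A: premium membership is its own table; the primary key
    -- doubles as the index, and mere presence in the table is the flag.
    CREATE TABLE users (
        id INTEGER PRIMARY KEY,
        name TEXT,
        is_premium INTEGER NOT NULL DEFAULT 0          -- option B: flag column
    );
    CREATE INDEX idx_premium ON users (id)
        WHERE is_premium = 1;                          -- option B: partial index

    CREATE TABLE premium_users (                       -- option A: membership table
        user_id INTEGER PRIMARY KEY REFERENCES users(id),
        expires_at TEXT
    );
""")

# Option A's bookkeeping: insert on purchase, delete on expiry.
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")
conn.execute("INSERT INTO premium_users VALUES (1, '2030-01-01')")
conn.execute("DELETE FROM premium_users WHERE expires_at < date('now')")

# Either way, listing premium users only ever touches premium rows.
premium = [r[0] for r in conn.execute("SELECT user_id FROM premium_users")]
print(premium)  # [1]
```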
Having a separate table containing all the premium users is different from an extra column in the normal user table. In the separate-table case you don't really need an index on the premium user table if you have only 100 premium users.
Well, my guess is that updates and inserts are much more frequent than searches in their use case. You're assuming a balanced frequency for these operations, and that hardly ever happens.
My experience is that it's usually better to add indices beforehand where they're expected to be needed. Adding indices to large production tables with millions of rows can bring down databases for hours, or even days in the worst case. It's tricky to manage.
Consider the example from the post of searching your shell history. If I don't need indexes I can just have it all in one flat file and use grep. Switching to a tool that supports indexing adds a lot of potential complexity and ways things could go wrong.
Or consider the example of a developer querying frontend logs: queries are many orders of magnitude less common than writes, and (at the scale I used to work at) an index would be incredibly expensive.
I take the point of your examples as written in the post, but I think both of those are a bad comparison to the Reddit Premium subscriber table being discussed, because:
- We’re already using a database, so there’s very minimal added complexity
- This is an extremely hot table getting read millions of times a day
- The scale of the data isn’t well-bounded
- Usage patterns of the table are liable to change over time
It wasn't an extremely hot table, and the scale of the data was well-bounded insofar as the only way for it to become non-negligible would be for us to have had revenue beyond our then-wildest dreams.
Indexes can have catastrophic consequences depending on the access patterns of your table. I use indexes very sparingly and only when I know for sure the benefit will outweigh the cost.
I built something inspired by this very post in 2013/2014. Not sure how the scale compares, but we insert ~10 million “things” with an average of 30 data attributes per day with a 30 day window. It definitely uses primary and foreign keys. It took some time to tune. Had to support an additional access pattern using a non-unique index. Had to work with a DBA to get partitioning right to handle the large write volume and manage expiration efficiently. It worked out great and is still chugging along. It does suck not having all the tools of an RDBMS at your disposal, but it was a good trade off.
FK constraints however are a pretty common gotcha. You have a table that allows deletes, and every table that has a column pointing to that table gets scanned on each delete operation.
So you have to add an index on that column, or start talking about tombstoning data instead of deleting it, but in which case you may still need the FK index to search out references to an old row to replace with a new one.
An FK also adds the burden of keeping a copy of each related row during the lifespan of a transaction. This means small-but-frequently-updated tables that are joined to a lot can be a source of unexpected headaches.
I had read a Joel Spolsky post about how they used to get an email every time they made a sale (“Ding!”) and the day they finally turned it off was full of pride. (I did that too, originally not even filtered out of my inbox.)
Speculating, but presumably this table only needed consultation during creation of a new user session. That’s probably a pretty heavyweight operation to begin with, so adding a scan of a few KB (user IDs being modestly sized integers) for every thousand currently paying users is NBD.