Hacker News

When I wrote Reddit Gold (now called Reddit Premium) I intentionally left the paying-user table index-free, looking forward to the day it would become necessary.

It hadn’t happened by the time I left the company, even though sales were exceeding expectations.



That goes against my intuition: an index is most beneficial precisely when very few rows have the boolean field set.

If you have 10m users and only 100 have "IsPremium = 1", then an index massively speeds up anything with WHERE IsPremium = 1. If the data is spread fairly evenly between IsPremium = 1 and IsPremium = 0, an index won't provide much benefit.

So low sales would increase the need for an index.

That said, I'm assuming searching for "Users WHERE IsPremium = 1" is a fairly rare thing, so an index might not make much difference either way.
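To make the selectivity point concrete, here's a sketch using SQLite (table and index names are made up for illustration). With only ~100 of 100k rows flagged, a partial index covers just the flagged rows and turns the full scan into an index search:

```python
import sqlite3

# Hypothetical schema: a user table where only ~100 of 100k rows are premium.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, is_premium INTEGER)")
con.executemany(
    "INSERT INTO users (id, is_premium) VALUES (?, ?)",
    ((i, 1 if i < 100 else 0) for i in range(100_000)),
)

# Without an index, WHERE is_premium = 1 is a full table scan.
before = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE is_premium = 1"
).fetchone()[-1]

# A partial index covers only the premium rows, so it stays tiny while
# making this highly selective lookup an index search.
con.execute("CREATE INDEX idx_premium ON users (is_premium) WHERE is_premium = 1")
after = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE is_premium = 1"
).fetchone()[-1]

print(before)  # a SCAN of users
print(after)   # a SEARCH using idx_premium
```

In databases that support them (SQLite, Postgres), a partial index like this stays tiny no matter how large the user table grows.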


We had lots of indexes on the user table; there was a separate table with details of paying users.


You wouldn't need any of these WHERE clauses if you have all the premium users in one table. Even managing the users in both tables is pretty trivial when you just add someone as they pay and remove them as the premium account expires.


> paying-user table index-free

Presumably, that means they created a new table intended to be straight-joined back to the user table. No need to search a column.
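A minimal sketch of that design in SQLite (hypothetical table names): the premium table's primary key doubles as the join key, and INTEGER PRIMARY KEY is indexed implicitly, so a membership check is an indexed lookup even though no index was explicitly created:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Separate table holding only paying users, keyed by the same user id.
# INTEGER PRIMARY KEY is indexed implicitly; no extra index is defined.
con.execute("CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)", ((i, f"u{i}") for i in range(1_000))
)
con.execute("INSERT INTO premium_users VALUES (7)")

# The membership check is a primary-key lookup, not a WHERE on a flag column.
is_premium = con.execute(
    "SELECT EXISTS(SELECT 1 FROM premium_users WHERE user_id = ?)", (7,)
).fetchone()[0]
print(is_premium)  # 1
```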


Joining a table on a column without an index will require a linear scan of the column. You almost always want an index on JOIN columns.


The primary key (userid?) will almost always be indexed implicitly. You’d usually have to go out of your way to avoid it.

So it was probably being joined by an indexed column, but without an explicitly defined index.


Two relations with a one-to-one relationship implies identical primary keys. Most implementations default to creating an index on the primary key.

In this case, the premium user table only needs the user id primary key/surrogate key because it only contains premium users. A query starting with this table is naturally constrained to only premium user rows. You can think of this sort of like a partial index.

One consequence of this approach is that queries filtering only for non-premium users will be more expensive.
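That last point shows up in the query plan. In this SQLite sketch (hypothetical schema), finding non-premium users is an anti-join that has to visit every row of the user table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY)")

# "Not premium" means absence from premium_users, which no index can
# answer directly: the plan has to scan every row of users.
plan = [row[-1] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users u WHERE NOT EXISTS "
    "(SELECT 1 FROM premium_users p WHERE p.user_id = u.id)"
)]
print(plan)
```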


Having a separate table containing all the premium users is different from an extra column in the normal user table. With the separate table you don't really need an index (in the premium user table) if you have only 100 premium users.


Indexes are such an easy thing to add. I don't get it. Seems like under-optimization when you consider the tradeoffs.


They add overhead to updates and inserts. If you don't need them, why take that hit? Know your data.


If the table is small enough that indexes aren't critical for reads, they probably don't impact the rate of writes either.

Of course, if the object is large-scale bulk inserts with only very occasional selects, then yes.


Wait a second, the whole thing is you don't mind O(N) overhead in searching on every request, but you mind O(log N) overhead for updates and inserts?


Well, my guess is that updates and inserts are much more frequent than searches in their use case. You're assuming a balanced frequency for these operations and it hardly ever happens.


This is a read-heavy workload per the OP: https://news.ycombinator.com/item?id=36071799


It was neither read-heavy nor write-heavy.


Ah, thanks for the correction; I’m guessing I read your comment too literally.

Would strikethrough my prior comment if it weren’t past the edit window…


My experience is that it's usually better to add indices where they're expected to be needed beforehand. Adding indices on large production tables with millions of rows can bring down databases for hours, or even days in the worst case. It's tricky to manage.


Considering how much data is written to support redo/undo/wal logs, a single index for user id is very little overhead.


Consider the example from the post of searching your shell history. If I don't need indexes I can just have it all in one flat file and use grep. Switching to a tool that supports indexing adds a lot of potential complexity and ways things could go wrong.

Or consider the example of a developer querying frontend logs: queries are many orders of magnitude less common than writes, and (at the scale I used to work at) an index would be incredibly expensive.


I take the point of your examples as written in the post, but I think both of those are a bad comparison to the Reddit Premium subscriber table being discussed, because:

- We’re already using a database, so there’s very minimal added complexity

- This is an extremely hot table getting read millions of times a day

- The scale of the data isn’t well-bounded

- Usage patterns of the table are liable to change over time


It wasn't an extremely hot table, and the scale of the data was well-bounded insofar as the only way for it to become non-negligible would be for us to have had revenue beyond our then-wildest dreams.


If you have nothing to optimize yet, how will you know if you’re optimizing it right?


Typically, you create a large number of test records and see how your expected access patterns perform.
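As a rough sketch of that workflow in SQLite (synthetic table and numbers, all made up for illustration): seed a few hundred thousand rows, time the expected access pattern, then compare after adding the candidate index:

```python
import sqlite3
import time

# Synthetic load: 500k rows, ~100 rows per user_id.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
)
con.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    ((i % 5_000, "click") for i in range(500_000)),
)

def timed(sql, *args):
    """Run a query and return (row count, elapsed seconds)."""
    t0 = time.perf_counter()
    rows = con.execute(sql, args).fetchall()
    return len(rows), time.perf_counter() - t0

n_scan, t_scan = timed("SELECT id FROM events WHERE user_id = ?", 42)
con.execute("CREATE INDEX idx_events_user ON events (user_id)")
n_idx, t_idx = timed("SELECT id FROM events WHERE user_id = ?", 42)
print(f"rows={n_scan} scan={t_scan:.4f}s indexed={t_idx:.4f}s")
```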


> Indexes are such an easy thing to add.

Indexes can have catastrophic consequences depending on the access patterns of your table. I use indexes very sparingly and only when I know for sure the benefit will outweigh the cost.


So no primary key or uniqueness constraints?


Unlikely, given Reddit's past schema design: one table of "things", and then another table of attributes of those things in an entity/key/value format.

https://kevin.burke.dev/kevin/reddits-database-has-two-table...


I built something inspired by this very post in 2013/2014. Not sure how the scale compares, but we insert ~10 million “things” with an average of 30 data attributes per day with a 30 day window. It definitely uses primary and foreign keys. It took some time to tune. Had to support an additional access pattern using a non-unique index. Had to work with a DBA to get partitioning right to handle the large write volume and manage expiration efficiently. It worked out great and is still chugging along. It does suck not having all the tools of an RDBMS at your disposal, but it was a good trade off.


That doesn't mean those tables didn't have primary or unique keys.

According to that post, in 2010, they had about 10 million users. At a conservative 10 fields per user, you're looking at 100 million records.

I'm a bit skeptical that they table scanned 100 million records anytime they wanted to access a user's piece of data back in 2010.


Modern DB engines can enforce Unique or PK constraints without indexes. Yes, they perform scans.


Every major RDBMS requires an index for both constraints.


FK constraints however are a pretty common gotcha. You have a table that allows deletes, and every table that has a column pointing to that table gets scanned on each delete operation.

So you have to add an index on that column, or start talking about tombstoning data instead of deleting it, but in which case you may still need the FK index to search out references to an old row to replace with a new one.
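A sketch of the fix in SQLite (hypothetical schema): explicitly index the referencing column so the cascade check on a parent delete is an index probe rather than a scan of the whole child table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
con.execute(
    "CREATE TABLE children ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES parents(id) ON DELETE CASCADE)"
)
# Without this index, every DELETE on parents scans all of children
# looking for referencing rows; with it, the check is an index probe.
con.execute("CREATE INDEX idx_children_parent ON children (parent_id)")

con.execute("INSERT INTO parents VALUES (1)")
con.executemany("INSERT INTO children (parent_id) VALUES (?)", [(1,)] * 3)
con.execute("DELETE FROM parents WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM children").fetchone()[0]
print(remaining)  # 0
```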


An FK also adds the burden of keeping a copy of each related row during the lifespan of a transaction. This means small-but-frequently-updated tables that are joined to a lot can be a source of unexpected headaches.


After checking the docs, you are right, I stand corrected. I was pretty sure Oracle and DB2 don't.


It's not really index-free if a separate table has constraints of any kind.


What was the reasoning?


I had read a Joel Spolsky post about how they used to get an email every time they made a sale (“Ding!”) and the day they finally turned it off was full of pride. (I did that too, originally not even filtered out of my inbox.)


Wow really? This seems really surprising to me.


Speculating, but presumably this table only needed consultation during creation of a new user session. That’s probably a pretty heavyweight operation to begin with, so adding a scan of a few KB (user IDs being modestly sized integers) for every thousand currently paying users is NBD.


Nope; it was used every time we needed to look up whether someone was a subscriber, which was cached for some things but not all of them.


Self-replying to add: This wasn’t a particularly frequent occurrence.


What sort of events would trigger this lookup if it was infrequent?


It's hard to remember after 13 years, but for example if someone made a purchase and redeemed it, or checked the status of their subscription.


Makes sense :)



