Hacker News

When I wrote Reddit Gold (now called Reddit Premium) I intentionally left the paying-user table index-free, looking forward to the day it would become necessary.

It hadn’t happened by the time I left the company, even though sales were exceeding expectations.



That goes against my intuition: an index is most beneficial precisely when very few rows have the boolean field set.

If you have 10m users and only 100 have "IsPremium = 1", then an index massively speeds up anything with WHERE IsPremium = 1. If the data is spread fairly evenly between IsPremium = 1 and IsPremium = 0, an index won't provide much benefit.

So low sales would increase the need for an index.

That said, I'm assuming searching for "Users WHERE IsPremium = 1" is a fairly rare thing, so an index might not make much difference either way.
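To make the selectivity point concrete, here's a sketch using SQLite (table and index names are made up for illustration). With only ~100 of 100k rows flagged, a partial index covers just the flagged rows and turns the full scan into an index search:

```python
import sqlite3

# Hypothetical schema: a user table where only ~100 of 100k rows are premium.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, is_premium INTEGER)")
con.executemany(
    "INSERT INTO users (id, is_premium) VALUES (?, ?)",
    ((i, 1 if i < 100 else 0) for i in range(100_000)),
)

# Without an index, WHERE is_premium = 1 is a full table scan.
before = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE is_premium = 1"
).fetchone()[-1]

# A partial index covers only the premium rows, so it stays tiny while
# making this highly selective lookup an index search.
con.execute("CREATE INDEX idx_premium ON users (is_premium) WHERE is_premium = 1")
after = con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users WHERE is_premium = 1"
).fetchone()[-1]

print(before)  # a SCAN of users
print(after)   # a SEARCH using idx_premium
```

In databases that support them (SQLite, Postgres), a partial index like this stays tiny no matter how large the user table grows.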


We had lots of indexes on the user table; there was a separate table with details of paying users.


You wouldn't need any of these WHERE clauses if you have all the premium users in one table. Even managing the users in both tables is pretty trivial when you just add someone as they pay and remove them as the premium account expires.


> paying-user table index-free

Presumably, that means they created a new table intended to be straight-joined back to the user table. No need to search a column.
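A minimal sketch of that design in SQLite (hypothetical table names): the premium table's primary key doubles as the join key, and INTEGER PRIMARY KEY is indexed implicitly, so a membership check is an indexed lookup even though no index was explicitly created:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
# Separate table holding only paying users, keyed by the same user id.
# INTEGER PRIMARY KEY is indexed implicitly; no extra index is defined.
con.execute("CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY)")
con.executemany(
    "INSERT INTO users VALUES (?, ?)", ((i, f"u{i}") for i in range(1_000))
)
con.execute("INSERT INTO premium_users VALUES (7)")

# The membership check is a primary-key lookup, not a WHERE on a flag column.
is_premium = con.execute(
    "SELECT EXISTS(SELECT 1 FROM premium_users WHERE user_id = ?)", (7,)
).fetchone()[0]
print(is_premium)  # 1
```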


Joining a table on a column without an index will require a linear scan of the column. You almost always want an index on JOIN columns.


The primary key (userid?) will almost always be indexed implicitly. You’d usually have to go out of your way to avoid it.

So it was probably being joined by an indexed column, but without an explicitly defined index.


Two relations with a one-to-one relationship implies identical primary keys. Most implementations default to creating an index on the primary key.

In this case, the premium user table only needs the user id primary key/surrogate key because it only contains premium users. A query starting with this table is naturally constrained to only premium user rows. You can think of this sort of like a partial index.

One consequence of this approach is that queries filtering only for non-premium users will be more expensive.
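That last point shows up in the query plan. In this SQLite sketch (hypothetical schema), finding non-premium users is an anti-join that has to visit every row of the user table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE premium_users (user_id INTEGER PRIMARY KEY)")

# "Not premium" means absence from premium_users, which no index can
# answer directly: the plan has to scan every row of users.
plan = [row[-1] for row in con.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM users u WHERE NOT EXISTS "
    "(SELECT 1 FROM premium_users p WHERE p.user_id = u.id)"
)]
print(plan)
```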


Having a separate table containing all the premium users is different from an extra column in the normal user table. With the separate table you don't really need an index (in the premium user table) if you have only 100 premium users.


Indexes are such an easy thing to add. I don't get it. Seems like under-optimization when you consider the tradeoffs.


They add overhead to updates and inserts. If you don't need them, why take that hit? Know your data.


If the table is small enough that indexes aren't critical for reads, they probably don't impact the rate of writes either.

Of course, if the object is large-scale bulk inserts with only very occasional selects, then yes.


Wait a second, the whole thing is you don't mind O(N) overhead in searching on every request, but you mind O(log N) overhead for updates and inserts?


Well, my guess is that updates and inserts are much more frequent than searches in their use case. You're assuming a balanced frequency for these operations and it hardly ever happens.


This is a read-heavy workload per the OP: https://news.ycombinator.com/item?id=36071799


It was neither read-heavy nor write-heavy.


Ah, thanks for the correction; I’m guessing I read your comment too literally.

Would strikethrough my prior comment if it weren’t past the edit window…


My experience is that it's usually better to add indices where they're expected to be needed beforehand. Adding indices on large production tables with millions of rows can bring down databases for hours, or even days in the worst case. It's tricky to manage.


Considering how much data is written to support redo/undo/wal logs, a single index for user id is very little overhead.


Consider the example from the post of searching your shell history. If I don't need indexes I can just have it all in one flat file and use grep. Switching to a tool that supports indexing adds a lot of potential complexity and ways things could go wrong.

Or consider the example of a developer querying frontend logs: queries are many orders of magnitude less common than writes, and (at the scale I used to work at) an index would be incredibly expensive.


I take the point of your examples as written in the post, but I think both of those are a bad comparison to the Reddit Premium subscriber table being discussed, because:

- We’re already using a database, so there’s very minimal added complexity

- This is an extremely hot table getting read millions of times a day

- The scale of the data isn’t well-bounded

- Usage patterns of the table are liable to change over time


It wasn't an extremely hot table, and the scale of the data was well-bounded insofar as the only way for it to become non-negligible would be for us to have had revenue beyond our then-wildest dreams.


If you have nothing to optimize yet, how will you know if you’re optimizing it right?


Typically, you create a large number of test records and see how your expected access patterns perform.
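As a rough sketch of that workflow in SQLite (synthetic table and numbers, all made up for illustration): seed a few hundred thousand rows, time the expected access pattern, then compare after adding the candidate index:

```python
import sqlite3
import time

# Synthetic load: 500k rows, ~100 rows per user_id.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, kind TEXT)"
)
con.executemany(
    "INSERT INTO events (user_id, kind) VALUES (?, ?)",
    ((i % 5_000, "click") for i in range(500_000)),
)

def timed(sql, *args):
    """Run a query and return (row count, elapsed seconds)."""
    t0 = time.perf_counter()
    rows = con.execute(sql, args).fetchall()
    return len(rows), time.perf_counter() - t0

n_scan, t_scan = timed("SELECT id FROM events WHERE user_id = ?", 42)
con.execute("CREATE INDEX idx_events_user ON events (user_id)")
n_idx, t_idx = timed("SELECT id FROM events WHERE user_id = ?", 42)
print(f"rows={n_scan} scan={t_scan:.4f}s indexed={t_idx:.4f}s")
```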


> Indexes are such an easy thing to add.

Indexes can have catastrophic consequences depending on the access patterns of your table. I use indexes very sparingly and only when I know for sure the benefit will outweigh the cost.


So no primary key or uniqueness constraints?


Unlikely, given Reddit's past schema design: one table of "things", and then another table of attributes of those things in an entity/key/value format.

https://kevin.burke.dev/kevin/reddits-database-has-two-table...


I built something inspired by this very post in 2013/2014. Not sure how the scale compares, but we insert ~10 million “things” with an average of 30 data attributes per day with a 30 day window. It definitely uses primary and foreign keys. It took some time to tune. Had to support an additional access pattern using a non-unique index. Had to work with a DBA to get partitioning right to handle the large write volume and manage expiration efficiently. It worked out great and is still chugging along. It does suck not having all the tools of an RDBMS at your disposal, but it was a good trade off.


That doesn't mean those tables didn't have primary or unique keys.

According to that post, in 2010, they had about 10 million users. At a conservative 10 fields per user, you're looking at 100 million records.

I'm a bit skeptical that they table scanned 100 million records anytime they wanted to access a user's piece of data back in 2010.


Modern DB engines can enforce Unique or PK constraints without indexes. Yes, they perform scans.


Every major RDBMS requires an index for both constraints.


FK constraints however are a pretty common gotcha. You have a table that allows deletes, and every table that has a column pointing to that table gets scanned on each delete operation.

So you have to add an index on that column, or start talking about tombstoning data instead of deleting it, but in which case you may still need the FK index to search out references to an old row to replace with a new one.
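A sketch of the fix in SQLite (hypothetical schema): explicitly index the referencing column so the cascade check on a parent delete is an index probe rather than a scan of the whole child table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")
con.execute("CREATE TABLE parents (id INTEGER PRIMARY KEY)")
con.execute(
    "CREATE TABLE children ("
    "id INTEGER PRIMARY KEY, "
    "parent_id INTEGER REFERENCES parents(id) ON DELETE CASCADE)"
)
# Without this index, every DELETE on parents scans all of children
# looking for referencing rows; with it, the check is an index probe.
con.execute("CREATE INDEX idx_children_parent ON children (parent_id)")

con.execute("INSERT INTO parents VALUES (1)")
con.executemany("INSERT INTO children (parent_id) VALUES (?)", [(1,)] * 3)
con.execute("DELETE FROM parents WHERE id = 1")
remaining = con.execute("SELECT COUNT(*) FROM children").fetchone()[0]
print(remaining)  # 0
```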


An FK also adds the burden of keeping a copy of each related row during the lifespan of a transaction. This means small-but-frequently-updated tables that are joined to a lot can be a source of unexpected headaches.


After checking the docs, you are right, I stand corrected. I was pretty sure Oracle and DB2 don't.


It's not really index-free if a separate table has constraints of any kind.


What was the reasoning?


I had read a Joel Spolsky post about how they used to get an email every time they made a sale (“Ding!”) and the day they finally turned it off was full of pride. (I did that too, originally not even filtered out of my inbox.)


Wow really? This seems really surprising to me.


Speculating, but presumably this table only needed consultation during creation of a new user session. That’s probably a pretty heavyweight operation to begin with, so adding a scan of a few KB (user IDs being modestly sized integers) for every thousand currently paying users is NBD.


Nope; it was used every time we needed to look up whether someone was a subscriber, which was cached for some things but not all of them.


Self-replying to add: This wasn’t a particularly frequent occurrence.


What sort of events would trigger this lookup if it was infrequent?


It's hard to remember after 13 years, but for example if someone made a purchase and redeemed it, or checked the status of their subscription.


Makes sense :)



