Scaling a social network is inherently a very hard problem, especially if you have a large userbase with a few very popular users. StackShare recently did a nice blog post about how we at Stream solve this for 300 million users with Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...
I think the most important part is using a combination of push and pull: you keep the most popular users' feeds in memory and pull from them at read time, while for everyone else you use the traditional fanout-on-write approach. The other thing that helped us scale was Go + RocksDB; the throughput is just so much higher than with traditional databases like Cassandra.
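To make the split concrete, here's a minimal in-memory sketch of that hybrid in Go. All the names here (Service, Post, the celebrity set) are made up for illustration, and a real system persists the inboxes (we use RocksDB) instead of holding them in maps; this is not Stream's actual code:

    package feed

    import (
        "sort"
        "time"
    )

    type Post struct {
        AuthorID  string
        CreatedAt time.Time
    }

    // Service implements a hybrid fanout: posts by very popular
    // ("celebrity") accounts are pulled at read time, everyone else
    // is pushed into follower inboxes at write time.
    type Service struct {
        celebrity map[string]bool     // accounts above the follower threshold
        inbox     map[string][]Post   // materialized per-user inboxes (push)
        recent    map[string][]Post   // recent posts per celebrity (pull)
        follows   map[string][]string // userID -> followed account IDs
    }

    func NewService(celebrities []string) *Service {
        s := &Service{
            celebrity: make(map[string]bool),
            inbox:     make(map[string][]Post),
            recent:    make(map[string][]Post),
            follows:   make(map[string][]string),
        }
        for _, id := range celebrities {
            s.celebrity[id] = true
        }
        return s
    }

    // Follow records that userID follows targetID.
    func (s *Service) Follow(userID, targetID string) {
        s.follows[userID] = append(s.follows[userID], targetID)
    }

    // AddPost fans out to follower inboxes, unless the author is a
    // celebrity: then the post is only kept on the author's own list,
    // so one post never triggers millions of inbox writes.
    func (s *Service) AddPost(p Post, followers []string) {
        if s.celebrity[p.AuthorID] {
            s.recent[p.AuthorID] = append(s.recent[p.AuthorID], p)
            return
        }
        for _, f := range followers {
            s.inbox[f] = append(s.inbox[f], p)
        }
    }

    // Feed merges the user's materialized inbox with a read-time pull
    // from any followed celebrities, newest first.
    func (s *Service) Feed(userID string, limit int) []Post {
        merged := append([]Post(nil), s.inbox[userID]...)
        for _, followed := range s.follows[userID] {
            if s.celebrity[followed] {
                merged = append(merged, s.recent[followed]...)
            }
        }
        sort.Slice(merged, func(i, j int) bool {
            return merged[i].CreatedAt.After(merged[j].CreatedAt)
        })
        if len(merged) > limit {
            merged = merged[:limit]
        }
        return merged
    }

The threshold between push and pull gets tuned per workload; the point is that a celebrity post costs one write instead of one write per follower, while everyone else keeps cheap reads.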
It's also interesting to note how other companies solved it. Instagram used a fanout-on-write approach, first with Redis, later with Cassandra, and eventually a flavor of Cassandra with RocksDB as the storage engine. They managed to make a full fanout approach work through a combination of heavy optimization, a relatively low posting volume (compared to Twitter, at least), and a ton of VC money.
Friendster and Hyves are two stories of companies that never really managed to solve this and went out of business (there were probably other factors as well, but still). I also heard one investor mention how Tumblr struggled with technical debt related to their feed. A more recent example is Vero, which basically collapsed under scaling issues.
I used to work at Hyves. Hyves overcame its scalability issues but went out of business for other reasons. Hyves used MySQL and Memcached, similar to Facebook at that time.
By the way, RocksDB, which is now Facebook's main storage engine (afaik), started as a fork of Google's LevelDB. So both Google and Facebook run on software originally written by Jeff Dean and Sanjay Ghemawat...
> I also heard one investor mention how Tumblr struggled with technical debt related to their feed
Not sure I'd agree with that, but I suppose it depends on the context and timing of the statement.
Tumblr's solution for the reverse-chrono activity feed is, at its core, <1000 lines of PHP and a few heavily optimized sharded MySQL tables. It's creaky and old, but its relatively small code footprint means it isn't terrible on the tech-debt scale.
Tumblr's feed is computed entirely at read time; there's no write fanout and no materialized inboxes. The key insights that make the system fast (under 10ms to identify which posts go in the feed) even at scale (a rough sketch follows the list):
* If you are grabbing the first page of a feed (the most recent N posts from followed users), the worst case is N posts all by different authors. If you have a lookup table of (user, most recent post time), you can quickly find the N followed users who posted most recently and only examine their content, rather than looking at content from all followed users.
* For subsequent pages, use a timestamp (or a monotonically increasing post id) as a cursor, i.e. the id or time of the last post on the previous page. To figure out which followed users to examine for the new page, you only need to look at followed users who posted since that cursor (since they may also have older posts that land on the current page), plus N more users (to satisfy the worst case of all N posts on the current page being by different authors who were not on previous pages).
* InnoDB's clustered primary key index makes PK range scans very fast. This is true even for disjoint sets of range scans, e.g. with a PK of (a, b) you can run a query like "WHERE a IN (...long list of IDs...) AND b >= x AND b <= y" and it is still extremely fast.
* You can optimize away the access-pattern edge cases by making the limits fuzzier. Most non-bot users only ever scroll so far; even power users who follow a lot tend to check the feed very often, so they don't go deep each time. At the extreme edge, users who follow thousands don't comprehensively try to view every piece of content in their feed anyway. This means you can measure user activity to determine where to place limits in the system that help query performance but are completely unnoticeable to users.
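Putting those pieces together, here's a rough sketch of the read path. I'm writing it in Go for illustration (the real system is PHP), against a hypothetical schema: last_post_times(user_id, last_post_id) as the lookup table, and a posts table with a clustered PK of (author_id, post_id), where post_id is monotonically increasing so it doubles as the time cursor. This is not Tumblr's actual code:

    package feed

    import (
        "database/sql"
        "fmt"
        "strings"
    )

    // inList builds a "?,?,?" placeholder list for n bound values.
    func inList(n int) string {
        return strings.TrimSuffix(strings.Repeat("?,", n), ",")
    }

    func int64Args(ids []int64) []interface{} {
        args := make([]interface{}, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        return args
    }

    // candidateAuthors reads the small lookup table and keeps only the
    // followed users who could contribute to the page below `cursor`:
    // everyone who posted since the cursor (they may also have older
    // posts landing on this page) plus the next `limit` users, covering
    // the worst case of a page whose posts all have distinct authors.
    // For the first page, pass cursor = math.MaxInt64.
    func candidateAuthors(db *sql.DB, followed []int64, cursor int64, limit int) ([]int64, error) {
        q := fmt.Sprintf(
            `SELECT user_id, last_post_id FROM last_post_times
              WHERE user_id IN (%s) ORDER BY last_post_id DESC`,
            inList(len(followed)))
        rows, err := db.Query(q, int64Args(followed)...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var out []int64
        below := 0 // authors whose latest post is already below the cursor
        for rows.Next() {
            var userID, lastPostID int64
            if err := rows.Scan(&userID, &lastPostID); err != nil {
                return nil, err
            }
            if lastPostID < cursor {
                if below == limit {
                    break // we already have the N worst-case authors
                }
                below++
            }
            out = append(out, userID)
        }
        return out, rows.Err()
    }

    // pagePosts runs the actual page query. With the clustered PK
    // (author_id, post_id), InnoDB turns the IN + range predicate into
    // a handful of fast PK range scans, one per candidate author.
    func pagePosts(db *sql.DB, authors []int64, cursor int64, limit int) ([]int64, error) {
        q := fmt.Sprintf(
            `SELECT post_id FROM posts
              WHERE author_id IN (%s) AND post_id < ?
              ORDER BY post_id DESC LIMIT ?`,
            inList(len(authors)))
        args := append(int64Args(authors), cursor, limit)
        rows, err := db.Query(q, args...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var postIDs []int64
        for rows.Next() {
            var id int64
            if err := rows.Scan(&id); err != nil {
                return nil, err
            }
            postIDs = append(postIDs, id)
        }
        return postIDs, rows.Err()
    }

The point is that both queries touch only a few index ranges, no matter how much total content the followed users have produced.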
Social network scaling is not a technology problem.
Technology has imposed scaling on societies and patted itself on the head for the unintended benefits derived, while burying its head in the sand about the unintended consequences.
The Roman Empire, Genghis Khan, the East India Company, Napoleon, Hitler, etc., all achieved scale, and then what happened?