Scaling a social network is inherently a very hard problem, especially if you have a large userbase with a few very popular users. StackShare recently did a nice blog post about how we at Stream solve this for 300 million users with Go, RocksDB and Raft: https://stackshare.io/stream/stream-and-go-news-feeds-for-ov...
I think the most important part is using a combination of push and pull: you keep the most popular users' feeds in memory and pull from them at read time, while for everyone else you use the traditional fanout-on-write approach. The other thing that helped us scale was Go + RocksDB; the throughput is just so much higher than with traditional databases like Cassandra.
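To make the split concrete, here's a minimal in-memory sketch of that hybrid in Go. All the names here (Service, Post, the celebrity set) are made up for illustration, and a real system persists the inboxes (we use RocksDB) instead of holding them in maps; this is not Stream's actual code:

    package feed

    import (
        "sort"
        "time"
    )

    type Post struct {
        AuthorID  string
        CreatedAt time.Time
    }

    // Service implements a hybrid fanout: posts by very popular
    // ("celebrity") accounts are pulled at read time, everyone else
    // is pushed into follower inboxes at write time.
    type Service struct {
        celebrity map[string]bool     // accounts above the follower threshold
        inbox     map[string][]Post   // materialized per-user inboxes (push)
        recent    map[string][]Post   // recent posts per celebrity (pull)
        follows   map[string][]string // userID -> followed account IDs
    }

    func NewService(celebrities []string) *Service {
        s := &Service{
            celebrity: make(map[string]bool),
            inbox:     make(map[string][]Post),
            recent:    make(map[string][]Post),
            follows:   make(map[string][]string),
        }
        for _, id := range celebrities {
            s.celebrity[id] = true
        }
        return s
    }

    // Follow records that userID follows targetID.
    func (s *Service) Follow(userID, targetID string) {
        s.follows[userID] = append(s.follows[userID], targetID)
    }

    // AddPost fans out to follower inboxes, unless the author is a
    // celebrity: then the post is only kept on the author's own list,
    // so one post never triggers millions of inbox writes.
    func (s *Service) AddPost(p Post, followers []string) {
        if s.celebrity[p.AuthorID] {
            s.recent[p.AuthorID] = append(s.recent[p.AuthorID], p)
            return
        }
        for _, f := range followers {
            s.inbox[f] = append(s.inbox[f], p)
        }
    }

    // Feed merges the user's materialized inbox with a read-time pull
    // from any followed celebrities, newest first.
    func (s *Service) Feed(userID string, limit int) []Post {
        merged := append([]Post(nil), s.inbox[userID]...)
        for _, followed := range s.follows[userID] {
            if s.celebrity[followed] {
                merged = append(merged, s.recent[followed]...)
            }
        }
        sort.Slice(merged, func(i, j int) bool {
            return merged[i].CreatedAt.After(merged[j].CreatedAt)
        })
        if len(merged) > limit {
            merged = merged[:limit]
        }
        return merged
    }

The threshold between push and pull gets tuned per workload; the point is that a celebrity post costs one write instead of one write per follower, while everyone else keeps cheap reads.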
It's also interesting to note how other companies solved it. Instagram used a fanout-on-write approach, first with Redis, later with Cassandra, and eventually a flavor of Cassandra with RocksDB as the storage engine. They managed to make a full fanout approach work through a combination of heavy optimization, a relatively low posting volume (compared to Twitter, at least), and a ton of VC money.
Friendster and Hyves are two stories of companies that never really managed to solve this and went out of business (there were probably other factors as well, but still). I also heard one investor mention how Tumblr struggled with technical debt related to their feed. A more recent example is Vero, which basically collapsed under scaling issues.
I used to work at Hyves. Hyves overcame its scalability issues but went out of business for other reasons. Hyves used MySQL and Memcached, similar to Facebook at that time.
By the way, RocksDB, which is now Facebook's main storage engine (afaik), started as a fork of Google's LevelDB. So both Google and Facebook run on software originally written by Jeff Dean and Sanjay Ghemawat...
> I also heard one investor mention how Tumblr struggled with technical debt related to their feed
Not sure I'd agree with that, but I suppose it depends on the context and timing of the statement.
Tumblr's solution for the reverse-chrono activity feed is, at its core, <1000 lines of PHP and a few heavily optimized sharded MySQL tables. It's creaky and old, but its relatively small code footprint means it isn't terrible on the tech-debt scale.
Tumblr's feed is computed entirely at read time; there's no write fanout and no materialized inboxes. The key insights that make the system fast (under 10ms to identify which posts go in the feed) even at scale (a rough sketch follows the list):
* If you are grabbing the first page of a feed (the most recent N posts from followed users), the worst case is N posts all by different authors. If you have a lookup table of (user, most recent post time), you can quickly find the N followed users who posted most recently and only examine their content, rather than looking at content from all followed users.
* For subsequent pages, use a timestamp (or a monotonically increasing post id) as a cursor, i.e. the id or time of the last post on the previous page. To figure out which followed users to examine for the new page, you only need to look at followed users who posted since that cursor (since they may also have older posts that land on the current page), plus N more users (to satisfy the worst case of all N posts on the current page being by different authors who were not on previous pages).
* InnoDB's clustered primary key index makes PK range scans very fast. This is true even for disjoint sets of range scans, e.g. with a PK of (a, b) you can run a query like "WHERE a IN (...long list of IDs...) AND b >= x AND b <= y" and it is still extremely fast.
* You can optimize away the access-pattern edge cases by making the limits fuzzier. Most non-bot users only ever scroll so far; even power users who follow a lot tend to check the feed very often, so they don't go deep each time. At the extreme edge, users who follow thousands don't comprehensively try to view every piece of content in their feed anyway. This means you can measure user activity to determine where to place limits in the system that help query performance but are completely unnoticeable to users.
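Putting those pieces together, here's a rough sketch of the read path. I'm writing it in Go for illustration (the real system is PHP), against a hypothetical schema: last_post_times(user_id, last_post_id) as the lookup table, and a posts table with a clustered PK of (author_id, post_id), where post_id is monotonically increasing so it doubles as the time cursor. This is not Tumblr's actual code:

    package feed

    import (
        "database/sql"
        "fmt"
        "strings"
    )

    // inList builds a "?,?,?" placeholder list for n bound values.
    func inList(n int) string {
        return strings.TrimSuffix(strings.Repeat("?,", n), ",")
    }

    func int64Args(ids []int64) []interface{} {
        args := make([]interface{}, len(ids))
        for i, id := range ids {
            args[i] = id
        }
        return args
    }

    // candidateAuthors reads the small lookup table and keeps only the
    // followed users who could contribute to the page below `cursor`:
    // everyone who posted since the cursor (they may also have older
    // posts landing on this page) plus the next `limit` users, covering
    // the worst case of a page whose posts all have distinct authors.
    // For the first page, pass cursor = math.MaxInt64.
    func candidateAuthors(db *sql.DB, followed []int64, cursor int64, limit int) ([]int64, error) {
        q := fmt.Sprintf(
            `SELECT user_id, last_post_id FROM last_post_times
              WHERE user_id IN (%s) ORDER BY last_post_id DESC`,
            inList(len(followed)))
        rows, err := db.Query(q, int64Args(followed)...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var out []int64
        below := 0 // authors whose latest post is already below the cursor
        for rows.Next() {
            var userID, lastPostID int64
            if err := rows.Scan(&userID, &lastPostID); err != nil {
                return nil, err
            }
            if lastPostID < cursor {
                if below == limit {
                    break // we already have the N worst-case authors
                }
                below++
            }
            out = append(out, userID)
        }
        return out, rows.Err()
    }

    // pagePosts runs the actual page query. With the clustered PK
    // (author_id, post_id), InnoDB turns the IN + range predicate into
    // a handful of fast PK range scans, one per candidate author.
    func pagePosts(db *sql.DB, authors []int64, cursor int64, limit int) ([]int64, error) {
        q := fmt.Sprintf(
            `SELECT post_id FROM posts
              WHERE author_id IN (%s) AND post_id < ?
              ORDER BY post_id DESC LIMIT ?`,
            inList(len(authors)))
        args := append(int64Args(authors), cursor, limit)
        rows, err := db.Query(q, args...)
        if err != nil {
            return nil, err
        }
        defer rows.Close()

        var postIDs []int64
        for rows.Next() {
            var id int64
            if err := rows.Scan(&id); err != nil {
                return nil, err
            }
            postIDs = append(postIDs, id)
        }
        return postIDs, rows.Err()
    }

The point is that both queries touch only a few index ranges, no matter how much total content the followed users have produced.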
Social network scaling is not a technology problem.
Technology has imposed scaling on societies and patted itself on the head for the unintended benefits derived, while burying its head in the sand about the unintended consequences.
The Roman Empire, Genghis Khan, the East India Company, Napoleon, Hitler, etc., all achieved scale, and then what happened?