
Ask HN: Facing scaling issues with news feeds on Redis.  Any advice? - dave1619
We just released a social section in our iOS app several days ago and we are already facing scaling issues with the users' news feeds.

We're basically using a fan-out-on-write (push) model for the users' news feeds (posts of people and topics they follow) and we're using Redis for this (backend is Rails on Heroku). However, our current 60,000 news feeds have ballooned our Redis store to almost 1GB in just a few days (it's growing way too fast for our budget). Currently we're storing the entire news feed for each user (post id, post text, author, icon url, etc.) and we cap the entries at 300 per feed.

I'm wondering if we should store just the post IDs of each user feed in Redis, and store the rest of the post information somewhere else. Would love some feedback here. In this case, our iOS app would make an API call to our Rails app to retrieve a user's news feed. The Rails app would retrieve the news feed list (just post IDs) from Redis, and then it would need additional queries to get the rest of the info for each post. Should we query our Postgres DB directly? That would mean a lot of calls to our DB. Should we create another Redis store (so at least it's in memory) where we keep all of the posts from our DB and query that for the post information? Or should we forget Redis and go with MongoDB or Cassandra so we can have higher storage limits?

Thanks for your help in advance.
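
For concreteness, here's roughly the ID-only approach I'm considering, sketched with the redis-rb gem (the key names and the 50-item page size are just placeholders, not what we have in production):

    require "redis"

    redis = Redis.new(url: ENV["REDIS_URL"])

    # On write: fan out only the post ID to each follower's feed, capped at 300.
    def push_to_feed(redis, follower_id, post_id)
      key = "feed:#{follower_id}"
      redis.lpush(key, post_id)
      redis.ltrim(key, 0, 299)
    end

    # On read: fetch a page of IDs from Redis, then hydrate each post from a
    # shared post store (a Redis hash per post here, but it could be Postgres).
    def read_feed(redis, user_id, page = 0, per_page = 50)
      first = page * per_page
      ids = redis.lrange("feed:#{user_id}", first, first + per_page - 1)
      ids.map { |id| redis.hgetall("post:#{id}") }
    end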
======
itsprofitbaron
First of all, I'm glad you have chosen fan-out-on-write over fan-out-on-read
because, reading how your news feed works, it appears to operate much like
Twitter's, and I believe I have a nice solution for you :)

I'd use a BigTable-style data model: data is distributed across the cluster by
row key, and individual records are stored as columns, sorted by column name.
BigTable-style stores are designed to handle a relatively unbounded number of
columns. Give each timeline a unique row key (e.g. the username) and insert a
column to represent each event; the column name is a time-sortable unique ID
(this becomes important in a second) and the column value contains the event's
data.
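
In Cassandra terms that schema is just a partition per user with a timeuuid clustering column. Roughly, using the DataStax cassandra-driver gem (the keyspace, table and column names are only illustrative):

    require "cassandra"   # DataStax cassandra-driver gem

    cluster = Cassandra.cluster(hosts: ["127.0.0.1"])
    session = cluster.connect("feeds")

    # One partition (row key) per user; events are clustering columns ordered
    # by a time-sortable ID, newest first.
    session.execute(<<-CQL)
      CREATE TABLE IF NOT EXISTS timelines (
        username   text,
        event_id   timeuuid,
        event_data text,
        PRIMARY KEY (username, event_id)
      ) WITH CLUSTERING ORDER BY (event_id DESC)
    CQL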

As a result, a timeline that could contain hundreds of events (even though
you're currently capping it at 300) can be read from a single node with a few
disk I/Os, and the time-sorted ordering (which, as I mentioned above, becomes
important here) lets you paginate it efficiently with a range slice operation.
Writes land in an append-only commit log, so insertions are cheap (after all,
that's what matters here!) and you can usually do tens of thousands per second
per node.
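
A paginated read against the table sketched above is then a simple range query on the clustering column (again just a sketch; the 50-row page size is arbitrary):

    # Newest page of a user's timeline, or the page before a given event ID.
    def timeline_page(session, username, before_id = nil)
      if before_id
        session.execute("SELECT event_id, event_data FROM timelines " \
                        "WHERE username = ? AND event_id < ? LIMIT 50",
                        arguments: [username, before_id])
      else
        session.execute("SELECT event_id, event_data FROM timelines " \
                        "WHERE username = ? LIMIT 50",
                        arguments: [username])
      end
    end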

I believe a BigTable-style store is best suited for this, which really means
you're looking at Cassandra, as it can stay fully available for both writes
and reads during a network partition. I did consider MongoDB, but its document
size limit ruled it out. Anyway, back to Cassandra: the additions and
deletions happening on your users' timelines are commutative, so you could
just as easily use the likes of Amazon DynamoDB [1], and it also means you can
fully exploit the weakest consistency guarantees Cassandra offers.
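
Exploiting that just means issuing the fan-out writes at the lowest consistency level; with the same illustrative table and Ruby driver as above, something like:

    # Fan-out write at consistency ONE: cheap, and safe because timeline
    # additions/deletions commute.
    insert = session.prepare("INSERT INTO timelines (username, event_id, event_data) " \
                             "VALUES (?, ?, ?)")
    event_id = Cassandra::Uuid::Generator.new.now   # time-sortable ID
    session.execute(insert,
                    arguments: [follower_name, event_id, event_json],
                    consistency: :one)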

It's important to note that the system would require queues and workers to
perform the fan-out operation, event deletions, and updates to the feeds of
each user's timeline, but you can make this part of the background system.
Similarly, since the workers won't immediately get around to permanently
mutating your users' timelines, I'd recommend you also cache the pending
mutations in memcached so you can scrub the data at read time before the
workers have done the work, which will considerably improve consistency.
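
As a rough sketch of that background fan-out (assuming a Sidekiq-style worker, an already-connected SESSION, and a hypothetical followers_of lookup; the memcached read-repair part is left out):

    require "sidekiq"

    # Background fan-out: append one event to every follower's timeline row.
    class FanOutPost
      include Sidekiq::Worker

      def perform(author_name, event_json)
        insert = SESSION.prepare(
          "INSERT INTO timelines (username, event_id, event_data) VALUES (?, ?, ?)"
        )
        event_id = Cassandra::Uuid::Generator.new.now   # time-sortable ID
        followers_of(author_name).each do |follower|
          SESSION.execute(insert, arguments: [follower, event_id, event_json])
        end
      end

      # Placeholder: however you store the follower graph (Postgres, a Redis set, ...).
      def followers_of(author_name)
        []
      end
    end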

[1] <http://aws.amazon.com/dynamodb/>

