The name definitely seems to distract the discussion away from the code and the project.
We started this project over a year ago before Feedly was famous. Now it's indeed confusing, but changing an open source project's name is quite a pain for us and people using Feedly.
Think the 1.0 release will be a good time to clean this up.
Feedly allows you to build newsfeed and notification systems using Cassandra and/or Redis.
Suggestion: That the library is a Python library is just as important as, if not more important than, the fact that it works with Cassandra and/or Redis. You should mention that in the first sentence, for example, "Feedly is a Python library that allows you to easily build newsfeed and notification systems on top of Cassandra and/or Redis."
-----
More notes:
It's also not particularly clear what approach Feedly actually implemented. You mention "push" and "push/pull", and say what Fashiolista used to use, but never actually mention the approach Feedly has taken.
Without looking at the code, i.e. just from the README, I'd guess you're doing full fanout (i.e. "push"), with no pull for inactive users (i.e. what Twitter does for users inactive for 30 days). That's fine, except...
...you don't really discuss how Feedly handles the power law problem (publishing to Lady Gaga's feed), which is the only difficult engineering issue with these kinds of systems. If Feedly's approach is to ignore it, and just push all 30 million subtasks into Celery, how will that impact all of the updates to the other users' feeds, which are happening concurrently? How will that impact memory usage? I couldn't tell just by reading the docs.
Finally, the stream itself is actually the easiest part (other than the power law problem). Production systems need monitoring, along with some kind of recommendation system for getting users connected to each other. Integrating all that is obviously a separate project, but I'd be reluctant to build on top of Feedly until I saw at least some hooks for incorporating that stuff.
Otherwise, good work. I've been implementing this stuff recently, and the only other semi-full-featured open source project out there for activity streams when I started was written in PHP, so it's nice to see something a bit more hacker-friendly. Bonus points for Redis and Cassandra, two of the coolest NoSQL databases out there!
A.) In the early days we simply pulled everything together by querying the database, which worked quite well.
B.) After that we switched to Redis and, similar to Twitter's approach, pushed only to active users and pulled the feed for inactive users.
C.) Since then our requirements slowly changed and it became harder to fit everything in memory (or fall back), which is why we switched to Cassandra.
Feedly offers you a framework, but allows you to make your own decisions regarding the tradeoffs.
- you can change the fanout to hit only active users (the approach from Twitter and the Yahoo paper)
- you can choose to store the full activity in the feed or only an id (memory usage vs extra lookups)
- you can customize the priority of tasks
These are all things we've done at one point or another. You can choose your own approach with Feedly.
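To make the first two tradeoffs concrete, here is a rough sketch in plain Python. It does not use Feedly's actual API: `FeedStore`, `fanout`, and `is_active` are names made up for illustration, and the in-memory dicts only stand in for Redis or Cassandra.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Activity:
    id: int
    actor: str
    text: str


@dataclass
class FeedStore:
    """In-memory stand-in for a Redis/Cassandra backend (hypothetical)."""
    store_full: bool = False  # True: keep the whole activity in each feed
    feeds: dict = field(default_factory=lambda: defaultdict(list))
    activities: dict = field(default_factory=dict)

    def push(self, user, activity):
        # Always keep one canonical copy so id-only feeds can be resolved.
        self.activities[activity.id] = activity
        entry = activity if self.store_full else activity.id
        self.feeds[user].insert(0, entry)  # newest first

    def read(self, user):
        # Id-only feeds pay for their smaller size with a lookup per entry.
        entries = self.feeds.get(user, [])
        return [e if self.store_full else self.activities[e] for e in entries]


def fanout(store, activity, followers, is_active):
    """Push only to followers considered active; return feeds written."""
    targets = [f for f in followers if is_active(f)]
    for follower in targets:
        store.push(follower, activity)
    return len(targets)


store = FeedStore(store_full=False)
written = fanout(
    store,
    Activity(1, "gaga", "new single"),
    ["ann", "bob", "cy"],
    is_active=lambda user: user != "bob",  # assumed activity check
)
```

Inactive users like `bob` get nothing pushed here, so their feed would have to be built by a pull query when they come back. Flipping `store_full` to `True` trades memory for fewer lookups on read.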
For Fashiolista we currently use Celery with RabbitMQ and have different queues and worker clusters for low and high priority fanout tasks.
In your example with Lady Gaga, memory usage would go up on the RabbitMQ server. Since RabbitMQ also stores to disk, it won't run out easily. After that the Celery machines will handle the updates, and autoscaling will bring up more machines if needed.
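A minimal sketch of the routing idea, again in plain Python rather than real Celery code. The chunk size, the follower threshold, and the function names are all assumptions for illustration, and the two deques stand in for separate RabbitMQ queues with their own worker clusters.

```python
from collections import deque
from itertools import islice

CHUNK_SIZE = 1000             # followers per subtask (assumed value)
HIGH_PRIORITY_LIMIT = 10000   # larger audiences go to the low queue (assumed)


def chunks(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def enqueue_fanout(queues, activity_id, follower_ids, follower_count):
    """Split one fanout into subtasks and route them by audience size.

    Huge fanouts land on the low-priority queue so they cannot starve
    the small, latency-sensitive ones.
    """
    name = "high" if follower_count <= HIGH_PRIORITY_LIMIT else "low"
    enqueued = 0
    for batch in chunks(follower_ids, CHUNK_SIZE):
        queues[name].append((activity_id, batch))
        enqueued += 1
    return name, enqueued


queues = {"high": deque(), "low": deque()}
small = enqueue_fanout(queues, 1, range(2500), 2500)
large = enqueue_fanout(queues, 2, range(30000), 30000)
```

With these assumed numbers the small fanout becomes 3 subtasks on the high queue and the large one becomes 30 subtasks on the low queue, so a celebrity post queues up behind itself instead of in front of everyone else's updates.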
Yes we are considering adding HBase. In general adding new backends is quite easy.
It does, however, take a while to figure out the backend-dependent performance tweaks. So a company running HBase and Feedly in production would be of great help.
So we've built Feedly for our startup Fashiolista.com.
It's quite a large project for a team of 4 though. Therefore we are actively looking for more contributors. Let us know if you're interested in helping out.
You have a PR problem with this name, with a risk of it escalating into a legal problem.