As far as I understand, since user most likely voted only on a small subset of those 100 comments (say 3) and negative lookups are very fast because of bloom filters , therefore all lookups combined are fast.
> Can you get into more details about how this is used? If reddit needs to display a page that has 100 comments, do they query Cassandra on the voting status of the user on those 100 comments?
Sort of, yeah. There are two versions of this, the old way and the new way.
Old way: keep a cache of the last time they voted. Remove any comments from those 100 that are younger than the last vote (since they can't possibly have voted on them). Then look up the keys containing the remaining ones. Most of them hit the bloom filter, those that pass the bloom filter actually get looked up. In the worst case this is all 100 comments, which can hit up up to 1/RF of your Cassandra nodes. The worst case doesn't happen often.
The new way is a little different, you have one Cassandra row (so only one machine) containing all of the votes for the user (perhaps further limited to a given link ID or date). You hit that one node for all 100 comments. If you have per-row bloom filters, see Old Way for the rest.
> I thought Cassandra was pretty slow in reads (slower than postgres) so how does using Cassandra make it fast here?
"Fast" and "slow" as used here is very naive, performance is usually more complicated than simple statements like this. I guess if you had 1 postgres node with 1 key and one Cassandra node with 1 key, and only 1 client, you could answer simple generalisations like this. But reddit has thousands of concurrent hits spread over hundreds of servers with varying amounts of RAM, I/O performance, network bottlenecks, and usage profiles.
The single biggest win is that you can easily horizontally scale Cassandra. Just add more nodes until it's "fast". But even that's a gross simplification.
For another example, if you scale Postgres by adding a bunch of replicas and choose a random one to read from for a given key, then they all have the same data in RAM, so your block cache miss rate is very high (that is, your effective block cache is the amount of RAM in one machine). Additionally, your write performance is capped to the write performance of your one master. Your replication throughput is capped to his outbound network bandwidth.
So you want a subset of the data on all of them so that whichever machine you ask has a very high likelihood of having that data in block cache. So you shard it by some function. But then you want to add more postgres machines, you have to migrate the data somehow, without shutting down the site. You've now more or less written Cassandra.