

Reddit joins Digg, Twitter on Cassandra - jbellis
http://blog.reddit.com/2010/03/she-who-entangles-men.html

======
jayair
I have read that Cassandra needs good IO performance to do well. So I am
wondering what kind of performance you guys are getting on EC2 and what sort
of things you have done specifically to help with it (like RAID).

We are looking into Cassandra for thecadmus.com and any info would be greatly
appreciated. Thanks.

~~~
jbellis
> I have read that Cassandra needs good IO performance to do well

If you're writing to disk, _anything_ will need good i/o performance to do
well. That's sort of the name of the game.

EBS is kind of ass for i/o, not so much on average as because of the huge
latency spikes you get for no reason. Most Cassandra installs "in the cloud"
are on Rackspace Cloud Servers instead (where you get persistent, RAIDed local
disk) for this reason.

But, you do see people like reddit and simplegeo and openx running it on ec2
and it seems to work well enough.

~~~
jayair
Thanks for clearing that up. We have been dealing with the weird spikes as
well and so we were a little iffy about Cassandra on EC2.

~~~
bd
For the moment, Reddit is using Cassandra just as a persistent cache, which
implies a high read/write ratio.

Also, it seems most reads are served from memory anyway (they mentioned
6GB of RAM holding 94% of their database in the old memcachedb setup [1]).

So much less stress on I/O for this particular use-case.
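
To put a rough number on that, here is a back-of-the-envelope sketch. It assumes reads are spread uniformly across the data set (real traffic is usually skewed toward hot keys, which would help further), and the 1000 reads/s figure is made up purely for illustration; only the 94% comes from the reddit post.

```python
def disk_reads_per_second(total_reads_per_second, fraction_in_memory):
    """Reads that miss memory and have to go to disk.

    Assumes uniform access across keys, so the fraction of reads served
    from memory equals the fraction of the data set held in memory.
    """
    return total_reads_per_second * (1.0 - fraction_in_memory)


# Hypothetical load of 1000 reads/s with 94% of the data in RAM:
# only about 6% of reads (roughly 60 per second) would touch the disk.
reads_hitting_disk = disk_reads_per_second(1000, 0.94)
```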

[1] [http://blog.reddit.com/2010/03/and-fun-weekend-was-had-by-al...](http://blog.reddit.com/2010/03/and-fun-weekend-was-had-by-all.html)

~~~
jayair
That's a good point. My understanding is that Cassandra is ideal for
write-heavy scenarios, so it is interesting to see a slightly different use case.

------
mark_l_watson
I spent several hours Thursday night re-reading Cassandra docs and doing
another test install. I am trying to decide between MongoDB (which I have a
fair amount of experience with) and Cassandra for a 2 node setup where I care
less (at least right now) about latency than about redundancy and convenience.
I want to (initially) use an EC2 instance and a non-Amazon VPS and since I
can't tell from reading the docs (and I have not seen any published
comparisons), I am going to have to set up both a 2 node Cassandra install and
a MongoDB replica pair to see which does better given the hit of one service
running in an East Coast Amazon availability zone and the other in a data
center in Texas. If I am ever fortunate enough to have more than a few users,
so that performance becomes an issue, I would like to add another EC2 instance
(or two) to the mix, but still keep a service running in a non-Amazon data
center.

If anyone has any links to Cassandra/MongoDB comparisons for my desired setup
I would appreciate seeing them.

------
pan69
What exactly is the difference between Cassandra and a document store like
CouchDB? Or are they similar?

Just trying to get my head around this nosql thing and the different
approaches...

~~~
spidaman
CouchDB is an MVCC document store; it isn't concerned with scaling the way
Cassandra, HBase and Voldemort are. This is a pretty good overview of the
landscape:
[http://www.vineetgupta.com/2010/01/nosql-databases-part-1-la...](http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html)

------
davidw
It sort of looks like Cassandra is where all the people with A LOT of data are
going. Any particular reason(s) for that?

(Edit: as compared to other 'non traditional' database systems).

~~~
mrduncan
<http://en.wikipedia.org/wiki/Cassandra_(database)> has a pretty good overview
if you're not familiar with Cassandra.

Cassandra only guarantees eventual consistency, whereas most RDBMSs are
focused on always showing a consistent view of the data. Because of this (and
other trade-offs), Cassandra is able to provide much higher IO rates as well
as improved fault tolerance and linear scalability.

Twitter, Digg, and Facebook all happen to have usages which fit perfectly with
this model. If you see a status update a few seconds (or even minutes) late
occasionally it's no big deal. If your bank funds aren't matching for that
long, it can be a big deal.
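
A toy sketch of what "eventual" means here, in case it helps. This is not Cassandra's API or its actual replication protocol; the `Replica` class, the `sync` function, and the last-write-wins rule keyed on a plain integer timestamp are all assumptions made up for illustration. The point is just that a read from a lagging replica can return old data until the replicas have exchanged their writes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One node holding its own copy of the data (hypothetical toy model)."""
    name: str
    data: dict = field(default_factory=dict)  # key -> (timestamp, value)

    def write(self, key, value, timestamp):
        # Last-write-wins: only accept a value newer than what we hold.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(replicas):
    """Naive anti-entropy pass: offer every replica's entries to all replicas."""
    for source in replicas:
        for key, (timestamp, value) in list(source.data.items()):
            for target in replicas:
                target.write(key, value, timestamp)


def demo():
    a, b = Replica("a"), Replica("b")
    for replica in (a, b):
        replica.write("status:alice", "first post", timestamp=1)
    # A newer write lands on replica a only; b has not heard about it yet.
    a.write("status:alice", "second post", timestamp=2)
    stale = b.read("status:alice")   # still the old status
    sync([a, b])
    fresh = b.read("status:alice")   # b has caught up after the sync
    return stale, fresh
```

Calling `demo()` shows the window in which replica `b` serves the old status. For a status update that window is harmless; for an account balance it is exactly the problem described above.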

Edit: I just started looking into Cassandra a few days ago so someone please
correct me if I've mis-stated anything.

~~~
davidw
I'm more interested in the comparison with Redis, MongoDB, et alia.

------
oomkiller
Reddit also investigated (and maybe is still investigating) Riak, as they
asked some detailed questions on the list. It will be interesting to see
another high profile deployment of Cassandra, but I want them to start putting
things that actually matter in there.

~~~
jbellis
Yes, reddit said they evaluated riak + the other usual suspects:
[http://www.reddit.com/r/programming/comments/bcqhi/reddits_n...](http://www.reddit.com/r/programming/comments/bcqhi/reddits_now_running_on_cassandra/c0m3rs9)

