Cassandra at Twitter Today (engineering.twitter.com)
86 points by abraham on July 10, 2010 | 22 comments

I don't know about you, but to me this seems like a big blow to the reputation of Cassandra. I know a bunch of people (including myself) whose primary motivation for investigating Cassandra as a large-scale data store was "Twitter is using it, therefore it's got to be fine for what we want".

In my experience this was not the case -- after quite a lot of prototyping, we moved away from Cassandra. It scales brilliantly, but its query time has a pretty solid floor that I was unable to shift to speeds approaching what I needed: it never went up, but nor did it go down.

Of course, I am a total noob at distributed systems -- perhaps it's simple to get it to go faster, but Twitter's move away suggests not. (Facebook's use-case is quite different, and in any case they still use an awful lot of MySQL too.) The other systems they cite as currently using Cassandra in production are nowhere near the scale of the primary tweet store.

I don't think they're avoiding Cassandra as their tweet store because it's not capable of performing well. It's probably more because it would be a really long process migrating every tweet into Cassandra. I'm sure if they could, they would go back to the first day Twitter launched, and use Cassandra as the tweet store.

As for query time, maybe your problem was I/O throughput? I've read Cassandra can be really slow if you have poor I/O throughput. This happens a lot in the cloud, where I/O is usually shared.

Don't they hide and archive old tweets anyway, rather than show them? Seems like they could transition Cassandra in really simply if they wanted to.

On a small scale, yes. On Twitter scale (in February they were averaging 50 million tweets/day [1]), it's probably not so simple.

[1] http://blog.twitter.com/2010/02/measuring-tweets.html

Edit: As of mid June, it was about 65 million per day.
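
For a rough sense of scale, the two daily figures quoted above can be converted to average per-second rates. This is only back-of-envelope arithmetic; it assumes tweets are spread evenly over the day, which real traffic is not, so peaks would be considerably higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def per_second(tweets_per_day: int) -> float:
    """Average tweets per second, assuming an even spread over the day."""
    return tweets_per_day / SECONDS_PER_DAY

for label, daily in [("February 2010", 50_000_000), ("mid June 2010", 65_000_000)]:
    print(f"{label}: ~{per_second(daily):.0f} tweets/sec on average")
# February 2010: ~579 tweets/sec on average
# mid June 2010: ~752 tweets/sec on average
```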


My point was that they aren't importing legacy data at all if they aren't showing old stuff. Just start writing the new ones to the new DB, right?
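
The "just start writing the new ones to the new DB" idea can be sketched as follows. This is a minimal illustration, not how Twitter does it: the class name is hypothetical, and plain dicts stand in for the legacy store and the new one. Reads that miss in the new store fall back to the legacy one, so no bulk import is needed.

```python
from typing import Optional

class CutoverTweetStore:
    """Hypothetical wrapper: new writes go to the new store only,
    reads check the new store first and fall back to the legacy one."""

    def __init__(self, legacy: dict, new: dict):
        self.legacy = legacy  # stand-in for the existing store (e.g. MySQL)
        self.new = new        # stand-in for the new store (e.g. Cassandra)

    def write(self, tweet_id: int, text: str) -> None:
        # Nothing is ever written to the legacy store after the cut-over.
        self.new[tweet_id] = text

    def read(self, tweet_id: int) -> Optional[str]:
        if tweet_id in self.new:
            return self.new[tweet_id]
        # Old tweets stay where they are; no migration required.
        return self.legacy.get(tweet_id)

store = CutoverTweetStore(legacy={1: "old tweet"}, new={})
store.write(2, "new tweet")
print(store.read(1), "|", store.read(2), "|", store.read(3))
```

The cost of this shortcut is that every read path has to know about two stores for as long as old tweets remain reachable.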

I think that apart from the speculation about why Twitter does not want to move to a new store for tweets, the important takeaway here is that what they (and other companies) do is experiment with different platforms, and then use what they feel is right for them.

Don't use what others are using or suggesting. Download a few stores, take your time to understand how they work, and then make your pick. Time consuming? Not as time consuming as starting a big project with a system that is not a good fit.

Or you could just use MySQL.

1. Cassandra is very young! In particular, the design and implementation of local storage and local indexing are immature and not good.

2. Poor read performance is also due to the poor local storage implementation.

3. The local storage, indexing and persistence structures are not stable. They need to be re-designed/re-implemented. If Twitter moves its data to the current Cassandra, it will have to do another move later to a new local storage, indexing and persistence structure.

4. There are many good techniques in Cassandra and other open-source projects (such as Hadoop and HBase). But they are not ready for production. Understand the details of these techniques and implement them in your own projects/products.

Schubert Zhang

Every time I use Twitter, it is malfunctioning in a very noticeable and bizarre new way. I don't care about scale, how difficult it is to be twitter, etc - I would not take their word about any technical matter without a tsp of salt.

Ha ha. Everything I say on this site has been downvoted recently. As if I fucking care.

Twitter:

- can't find my @ replies. Shows zero. Not correct.

- has been showing 2-5 copies of each message in my own history for the past week

- gave me 30 error screens in a row, then an hour later on their status blog "we've noticed elevated rates of errors". Like a constant stream of errors is normal for them, but elevated rates - time to do something!

- can't get their widgets/external code straight, redirects people to twitter.com/undefined and twitter.com/null

Etc, etc. Give me a break. Twitter is hopeless, and has always been hopeless in the tech department.

In the middle of the hype cycle, they are now getting real.

Interesting to see that they use two different db systems. I'm a bit worried about this. I believe it is just postponing an important decision. In the future they will have to choose one db system. With the growth of Twitter, wouldn't it be better to make a decision now and focus all resources on optimizing one system?

Why do they have to choose 1 database system? I never understand this reasoning. Pick the best tool for the job!

I for one whole-heartedly agree with their decision. Considering the issues they've had recently with their internal networks, it would be insanity to move the majority of their stack to a new technology.

MySQL scales just fine, provided you are good at partitioning and at least it's a known quantity.
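
As a rough illustration of what "being good at partitioning" involves, here is a minimal sketch of hash-based shard routing. The shard names and the choice of user ID as the partition key are assumptions made for the example; nothing in this thread describes Twitter's actual scheme.

```python
import hashlib

# Hypothetical shard names; in practice these would be connection settings.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(user_id: int, shards=SHARDS) -> str:
    """Map a user ID to one shard, deterministically."""
    # Hash the key so that sequential IDs spread evenly across shards.
    digest = hashlib.md5(str(user_id).encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]

for uid in (1, 2, 3, 1001):
    print(uid, "->", shard_for(uid))
```

Plain modulo routing like this remaps most keys whenever the number of shards changes, which is one reason real deployments tend to put a lookup table or a consistent-hashing layer in front of the shards.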

There's no reason a company can't use different tools for different jobs. When I worked at Amazon.com we used many storage and query systems: Oracle, MySQL, S3, BDB (and various simple distributed systems built on BDB), Dynamo, SimpleDB, memcached, etc. etc.

One size doesn't fit all. MySQL and Cassandra are radically different systems in both their query models and their distribution models. Cassandra and HBase are similar in their data models (the differences are mandated by their respective distribution models), but are radically different in their distribution models. Entirely different systems are suited for entirely different applications.

To my mind, the whole NoSQL thing is primarily about "polyglot persistence" - building systems on more than one storage system. I think large scale apps that use more than one db system are becoming the norm. NoSQL means acknowledging that different storage systems are good for different features.

Meh. Many, many applications have at least two database systems: a caching layer, such as memcached; and a persistence layer, such as PostgreSQL. Twitter's basically just using Cassandra in place of memcached. This doesn't strike me as a dangerous or bizarre architecture.
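
The two-layer setup described above is commonly wired together with a cache-aside pattern. The sketch below is illustrative only: the class name is hypothetical, and plain dicts stand in for the cache (memcached) and the persistence layer (PostgreSQL).

```python
class CacheAside:
    """Hypothetical cache-aside wrapper over a cache and a database."""

    def __init__(self, cache: dict, database: dict):
        self.cache = cache        # stand-in for memcached
        self.database = database  # stand-in for PostgreSQL

    def get(self, key):
        if key in self.cache:
            return self.cache[key]          # cache hit
        value = self.database.get(key)      # cache miss: go to the database
        if value is not None:
            self.cache[key] = value         # populate the cache for next time
        return value

    def put(self, key, value):
        self.database[key] = value          # the database is the source of truth
        self.cache.pop(key, None)           # invalidate any stale cached copy

layer = CacheAside(cache={}, database={"user:1": "alice"})
print(layer.get("user:1"))    # first read comes from the database
print("user:1" in layer.cache)  # the value is now cached
```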

"Twitter's basically just using Cassandra in place of memcached"

Not sure about that. They also said they are using it for analytics. Actually, since Cassandra persists to disk, using it as a memcached replacement doesn't sound right.

Since Cassandra 0.6 there is a way to keep data in memory, acting as a cache.

They're also using their own FlockDB (http://github.com/twitter/flockdb) to store their social graph.

FlockDB isn't a data store itself. They use FlockDB on top of MySQL AFAIK.

For actual data storage they build this layer on top of MySQL (and redis, and all other things that need sharding): http://github.com/twitter/gizzard
