
Twitter Drops MySQL For Cassandra - chanux
http://www.informationweek.com/news/software/open_source/showArticle.jhtml?articleID=223100894
======
jbellis
I tried to explain to the reporter why Cassandra's data model (particularly
that it supports an arbitrary number of columns per row) makes it support
denormalization better than traditional rdbmses, and somehow that got turned
into "cassandra supports an arbitrary number of rows."

This and a few other poor explanations make me wince reading this. So I don't
particularly recommend this article, although I've seen worse. :)

TFA does at least link the original interview w/ Ryan King of Twitter
([http://nosql.mypopescu.com/post/407159447/cassandra-twitter-...](http://nosql.mypopescu.com/post/407159447/cassandra-twitter-an-interview-with-ryan-king)),
which is much better for people at HN level.

My own article at
[http://www.rackspacecloud.com/blog/2010/02/25/should-you-swi...](http://www.rackspacecloud.com/blog/2010/02/25/should-you-switch-to-nosql-too/),
although high level, also has some useful links for those who want to drill
down for more details.
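
To make the "arbitrary number of columns per row" point concrete, here is a minimal sketch of denormalizing into wide rows. It uses a plain in-memory dict as a stand-in for a column family, not the Cassandra API; the names `timeline`, `add_tweet`, and `read_timeline` are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for a column family: row key -> {column name: value}.
# A row can hold as many columns as needed; there is no fixed schema per row.
timeline = defaultdict(dict)

def add_tweet(user, tweet_id, text, followers):
    """Denormalize on write: copy the tweet into one column of each follower's row."""
    for follower in followers:
        timeline[follower][tweet_id] = f"{user}: {text}"

def read_timeline(user, limit=20):
    """Read a single row, newest tweet id first; no join is needed."""
    row = timeline[user]
    return [row[k] for k in sorted(row, reverse=True)[:limit]]

add_tweet("alice", 1, "hello", followers=["bob", "carol"])
add_tweet("dave", 2, "hi there", followers=["bob"])
print(read_timeline("bob"))
```

The write path does more work (one copy per follower) so that the read path touches only one row, which is the trade denormalization makes.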

~~~
JulianMorrison
I wish Ryan King had said _why_ the other DBs named were rejected. That could
be useful info.

~~~
socratees
Sites similar to Twitter are worried about reads/writes and uptime rather than
concurrency and locking mechanisms. All flavors of NoSQL databases are best
suited for this. Sites like amazon.com, or any transaction processing site,
can't think of using NoSQL. tl;dr: NoSQL is good until transaction processing
& concurrency come into the picture.

~~~
andrewtj
Scalaris (<http://code.google.com/p/scalaris/>) is consistent and has
transactions.

As an aside, three sentences don't really warrant a "tl;dr".

~~~
z8000
Not being snarky here but how can anyone seriously consider Scalaris for any
data that's even remotely important?

[http://code.google.com/p/scalaris/wiki/FAQ#Is_the_store_pers...](http://code.google.com/p/scalaris/wiki/FAQ#Is_the_store_persisted_on_disk)?

~~~
andrewtj
It certainly limits it from being useful if you don't have dollars to throw at
high-end infrastructure, but it's not quite as terrible as it sounds.
Effectively all it means is that you can't cold start it. How big a problem
this is depends like anything else on the application in question.

~~~
z8000
As I write this comment I just _know_ that in 5 years I will look at it and
chuckle at myself but... it boggles my mind to consider a pure-RAM system that
relies on at least one copy of the dataset being available _forever_ in some
system's "memory grid".

[http://highscalability.com/are-cloud-based-memory-architectu...](http://highscalability.com/are-cloud-based-memory-architectures-next-big-thing)

------
justinsb
As Jonathan has pointed out, the article has lots of inaccuracies, but it also
has a very good explanation of the NoSQL Faustian pact: "Cassandra doesn't do
joins... doesn't guarantee referential integrity, where the user knows the
data being used reflects the latest updates... can't process transactions,
with a guarantee that the transaction will either be completed or discarded,
the way relational systems do" because it focuses on "more immediate goals
than the pristine data handling rules of relational systems."

As the expression goes: at the poker table, if you don't know who the sucker
is, it's you. Guess who's responsible for those fuddy-duddy, old-fashioned
things like querying your data, or making sure that you've written data, or
that your database isn't corrupted... it's you.

If you're Twitter, and you're fail-whaling every day, then maybe the work
required to make this trade-off work makes sense. But I can't help but feel
there's got to be a better way.

~~~
raganwald
I won't defend Cassandra's design choices here, but I will say that the
relational model is only about joins and normal forms. The other things you
mention happen to be built into all modern relational databases but are
orthogonal concepts.

A distributed hash table can be built with transactions, isolation, and so
forth. Such a system would offer a different set of trade-offs that might
satisfy a different set of users.

~~~
justinsb
Completely agree with you on the theoretical level; it's a (lazy) shorthand to
contrast the typical NoSQL trade-offs with the ACID model that most relational
databases employ.

In practice though, I think if you introduce (multi-'row') ACID into any of
today's NoSQL databases, you'd just end up with a bad traditional database (a
'relational' database without a strong theoretical grounding, without the
ability to do joins and without a powerful query language.) This whole NoSQL
movement feels like a re-run of the evolution of the relational database -
those that don't know their history are doomed to repeat it.

~~~
jbellis
> I think if you introduce (multi-'row') ACID into any of today's NoSQL
> databases, you'd just end up with a bad traditional database

Strongly disagree, although there is truth to the converse (adding sharding to
a traditional database yields a bad nosql implementation :).

There's a sketch of how Google added multirow transactions to bigtable for app
engine (via a layer above bigtable called megastore):
[http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore....](http://perspectives.mvdirona.com/2008/07/10/GoogleMegastore.aspx)

The key idea is that "a per-entity-group transaction log is used. One of the
rows that stores the entity group is the entity group’s root. The log is
stored with the root, which is replicated like all rows in Big Table."

This is basically a version of the approach advocated by Pat Helland in his
paper on "Life Beyond Distributed Transactions" --
<http://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf> -- namely, that the
most sane approach to distributed transactions is to redefine the problem, and
restrict transactions to be within "entities" that fit on a single machine.
Which, as App Engine demonstrates, turns out to be enough to do an awful lot.
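
A rough sketch of the idea described above, assuming a single-process toy model: each entity group keeps its own transaction log with the root, and a transaction may only touch rows inside one group. This is not Megastore's implementation (there is no replication or concurrency control here), and `EntityGroup`, `transact`, and `transact_across` are hypothetical names.

```python
class EntityGroup:
    """Hypothetical entity group: rows plus a transaction log kept with the root."""

    def __init__(self, root_key):
        self.root_key = root_key
        self.rows = {root_key: {}}
        self.log = []  # per-group transaction log, stored alongside the root

    def transact(self, mutations):
        """Apply every (row_key, column, value) mutation to this group, or none.

        A value of None is rejected so the sketch has a way to fail.
        """
        if any(value is None for _, _, value in mutations):
            raise ValueError("invalid mutation; nothing applied")
        self.log.append(list(mutations))  # record the transaction first
        for row_key, column, value in mutations:
            self.rows.setdefault(row_key, {})[column] = value


def transact_across(groups_and_mutations):
    """Refuse anything spanning more than one group: the problem is redefined."""
    if len(groups_and_mutations) > 1:
        raise NotImplementedError("transactions are limited to one entity group")
    (group, mutations), = groups_and_mutations.items()
    group.transact(mutations)


user = EntityGroup("user:42")
user.transact([("user:42", "name", "Ann"), ("user:42/post:1", "title", "Hi")])
```

Because a group is assumed to fit on one machine, the all-or-nothing check and the log append need no cross-machine coordination.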

~~~
justinsb
Interesting links - thanks. There's definitely a continuum here - the full
ACID model at one end; key-value stores at the other. It looks like Google is
moving in the right direction - adding more ACID - which certainly plays to my
personal bias :-)

------
RyanMcGreal
>in what's becoming a not-uncommon move.

Or as Orwell put it: "A not unblack dog was chasing a not unsmall rabbit
across a not ungreen field."

~~~
telemachos
Litotes: my favorite figure of speech.

<http://en.wikipedia.org/wiki/Litotes>

~~~
DanielBMarkham
Mine too.

And when combined with a thinly veiled insult, it gives the wonderful
"backhanded compliment":

"He's not as unintelligent as he looks" "She's not the liar she might have
been given her poor upbringing"

------
freshfunk
I'm a fan of the NoSql movement and have been exploring Cassandra as an option
for data storage.

I had a conversation with an engineer who works at a pretty well-known company
here in SV and their sys admins are dropping Cassandra and pushing all the
engineers back to MySql. I don't know the whole story but it seemed to be
implied that open-sourced Cassandra had issues and supposedly Facebook had a
much different version they were using.

Of course this is all second hand, so I tried to search on the experiences of
other people using Cassandra (with decent volume). Unfortunately most of the
threads I found had people just like me, at the exploratory stage. Or they
hadn't been live with it for long.

If there were any pitfalls or hairy parts with maintaining Cassandra, that would
be good to know. Also, examples of clients who have decent load and have been
using it for a while.

~~~
jbellis
> their sys admins are dropping Cassandra and pushing all the engineers back
> to MySql

I'm curious what you are thinking of, because I have a better picture of
companies using Cassandra than most. :)

I do know of one company that fits your description, where some of the mysql
DBAs were very anti-cassandra because, frankly, it's not mysql and that's what
they were used to. But that has been resolved (the most vocal DBA left) and
the Cassandra migration is continuing.

> If there were any pitfalls or hairy parts with maintaining Cassandra, that
> would be good to know

Other than the obvious (it's not a relational database), we've documented the
main limitations here: <http://wiki.apache.org/cassandra/CassandraLimitations>

------
physcab
I'm curious why a lot of people / companies seem to be picking up Cassandra
lately. I'm not one to peg one "NoSQL" software system against another, but
I've been using HBase for a few months (albeit on a rather limited cluster)
and feel that it fits great especially with its compatibility with Hadoop. We
use Hadoop + Map/Reduce extensively at Grooveshark. Does anyone have
experience in using both systems and can offer a candid account of both?

~~~
simonw
I have no experience with either, but here's a recent blog entry about moving
from HBase to Cassandra:

[http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-wh...](http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/)

------
bjclark
More accurate title: Twitter migrates parts of its system away from MySQL to
other data stores including Cassandra.

------
sown
So what kind of performance can you get from Cassandra? How big can the values
be? I was thinking of using it as a backend for a mail system.

~~~
jbellis
> So what kind of performance can you get from Cassandra?

~10k ops/second per quad-core node. Scales roughly linearly w/ node and core
counts. ("Roughly" means, obviously there is network overhead as you move from
single node to multiple, that kind of thing.)

> How big can the values be?

2 GB, although you probably don't want to max that out.

Cassandra is mostly used for smaller pieces of data, although I do know at
least one person using it as an S3 replacement by chunking files into 64MB
chunks; each file is one row consisting of columns that each contain one such
chunk.
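
A minimal sketch of that chunking layout, assuming one row per file and one column per chunk. The column naming scheme (`chunk-00000000`, ...) and the function names are assumptions for illustration; a plain dict stands in for the row.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size mentioned above

def file_to_row(data: bytes, chunk_size: int = CHUNK_SIZE) -> dict:
    """Split a file's bytes into columns named by zero-padded chunk index."""
    return {
        f"chunk-{i // chunk_size:08d}": data[i:i + chunk_size]
        for i in range(0, len(data), chunk_size)
    }

def row_to_file(row: dict) -> bytes:
    """Reassemble the file by concatenating columns in column-name order."""
    return b"".join(row[name] for name in sorted(row))

# Tiny chunk size so the example is easy to follow.
row = file_to_row(b"abcdefghij", chunk_size=4)
print(sorted(row))
print(row_to_file(row))
```

Zero-padding the index keeps lexical column order equal to chunk order, so reading the row back in name order reconstructs the file.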

> I was thinking of using it as a backend for a mail system.

It should work fine for that.

------
davidw
Whoever does a thorough comparison of a number of these new "nosql" systems,
including features and some benchmarks, will have him or herself an extremely
popular article.

~~~
jbellis
One I wrote:
[http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosyste...](http://www.rackspacecloud.com/blog/2009/11/09/nosql-ecosystem/)

Another, that goes into even more detail (imo, too much for one article, but a
good article all the same):
[http://www.vineetgupta.com/2010/01/nosql-databases-part-1-la...](http://www.vineetgupta.com/2010/01/nosql-databases-part-1-landscape.html)

Benchmarking systems w/ very different data models is difficult to impossible,
which is why you don't see that in this kind of survey piece. You're best off
by picking one segment and focusing on that. Yahoo did that with the
ColumnFamily stores (cassandra, hbase, and one they wrote internally) here:
<http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf> (note that Cassandra 0.5
results are on page 16 and 17, not inline w/ the rest)

~~~
davidw
Maybe one way to do it would be to pick a problem, or several problems, and
implement solutions in each one.

Hrm, maybe it's a book more than an article, although I don't think I'd buy
it, given that it'd be out of date so soon.

The point being, as a casual observer of these things, I don't yet have a good
feel at all for which ones might be good for what.

------
Roridge
informationLastweek

~~~
mahmud
Not everyone is plugged into the "scene" RSSes. A publication's purpose is to
discover trends in news signals, and distill that into a coherent format
supported by argument.

~~~
Roridge
fair point, I cheerfully withdraw my comment.

------
lallysingh
about...fracking...time

