That paper is very useful, so thanks for posting the link, but it has a number of issues as I see it.
1) It considers Cassandra, Redis, VoltDB, Voldemort, HBase and MySQL. It does not cover either MongoDB or Couchbase.
2) Latency values are given as averages and do not show p95/p99. In my experience, Cassandra in particular is susceptible to high latency at those percentiles.
3) Even considering average values, the read latency of Cassandra is higher than you would see with either MongoDB or Couchbase.
4) Cassandra does not deal well with ephemeral data. GC'ing large numbers of tombstones, for example, causes issues that will hurt a long-running system.
The long and short of it is that Cassandra is a fantastic system for write-heavy situations. What it is not good at are read-heavy situations where deterministic low latency is required, which is pretty much what the Pinterest guys were dealing with.
Another reason it is marketing is because it lacks essential information on the setup of each benchmarked system. E.g. for Cassandra I don't even know which version they used, what the replication factor was, what consistency level they read data at, or whether they enabled the row cache (which decreases latency a lot). Cassandra improved read throughput and latency by a huge factor since version 0.6 and is constantly improving, so the version really matters.
The p95 latency issues were largely caused by GC pressure from having a large amount of relatively static data on-heap. In 1.2, the two largest of these, bloom filters and compression data, were moved off-heap. It's my experience that with 1.2, most of the p95 latency is now caused by network and/or disk latency, as it should be.
I'm not going to compare it with other data stores in this comment, but I'd encourage people to consider that Cassandra is designed for durable persistence and larger-than-RAM datasets.
As for #4, this is mostly false. Tombstones (markers for deleted rows/columns) CAN cause issues with read performance, but "issues while GC'ing large number of tombstones" is a bit of a hand-wavey statement. The situation in which poor performance would result from tombstone pile-up is if you have rows where columns are constantly inserted and then removed before GC grace (10 days by default). Tombstones sit around until GC grace expires, so effectively consider data you insert to live for at least 10 days, unless of course you do something about it.
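To make the GC-grace point concrete, here's a toy Python sketch (not Cassandra's actual compaction code; the names and the simplified purge rule are mine) of why a tombstone, and hence the delete it marks, hangs around for at least gc_grace:

```python
from dataclasses import dataclass

# Cassandra's conservative default gc_grace_seconds: 10 days
GC_GRACE_SECONDS = 10 * 24 * 3600

@dataclass
class Tombstone:
    key: str
    created_at: int  # write time, seconds since epoch

def purgeable(t: Tombstone, now: int) -> bool:
    """Compaction may only drop a tombstone once gc_grace has elapsed;
    until then it must survive so the delete can still propagate to any
    replica that missed it (otherwise deleted data could "resurrect")."""
    return now - t.created_at >= GC_GRACE_SECONDS

t = Tombstone("row1", created_at=0)
print(purgeable(t, now=5 * 24 * 3600))   # 5 days in: tombstone still kept
print(purgeable(t, now=11 * 24 * 3600))  # 11 days in: compaction may drop it
```

So a workload that inserts and deletes the same columns far faster than gc_grace accumulates tombstones on the read path, which is exactly the pile-up scenario described above.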
Usually people just tune the GC grace, as it's extremely conservative. It's also much better to use row-level deletes if possible. If the data is time-ordered and needs to be trimmed, a row-level delete with the timestamp of the trim point can improve performance dramatically. This is because a row-level tombstone will cause reads to skip any SSTables with max_timestamp < the tombstone. It also means compaction will quickly obsolete anything superseded by a row-level tombstone.
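The SSTable-skipping effect of a row-level tombstone can be sketched like this (again a toy model, not Cassandra's read path; the types and function are illustrative only):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SSTable:
    name: str
    max_timestamp: int  # newest write timestamp contained in this sstable

def sstables_to_read(sstables: List[SSTable],
                     row_tombstone_ts: Optional[int]) -> List[SSTable]:
    """With a row-level tombstone, the read can skip any SSTable whose
    max_timestamp is older than the tombstone: everything in it for this
    row is shadowed by the delete, so there's no point touching it."""
    if row_tombstone_ts is None:
        return sstables  # no row tombstone: every sstable may hold live data
    return [s for s in sstables if s.max_timestamp >= row_tombstone_ts]

tables = [SSTable("old-data", 100), SSTable("recent-data", 500)]
print([s.name for s in sstables_to_read(tables, row_tombstone_ts=300)])
```

That's why trimming time-ordered rows with a single row-level delete at the trim point is so much cheaper than thousands of column-level tombstones: one marker prunes whole SSTables from the read instead of being merged column by column.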
Here's a graph of P99 latency as observed from the application for wide row reads (involving ~60 columns on average, CL.ONE) from a real 12-node hi1.4xlarge Cassandra 1.2.3 cluster running across 3 EC2 availability zones. The p99 RTTs between these hosts is ~2ms.
This also happens to be on data that is "ephemeral" as our goal is to keep it bounded at ~100 columns. The read:write ratio is about even. It has a mix of row and column-level deletes, LeveledCompactionStrategy, and the standard 10 day GC grace.