
Almost every Cassandra feature has some surprising behavior - dragonne
http://blog.parsely.com/post/1928/cass/
======
Tloewald
The thing about encoding multiple fields within a cell using \x01 somewhat
bugs me -- not because it's a hack but because it's yet another example of
needless reinvention of wheels. Good old ASCII has characters specifically
devoted to separating fields, keys, etc. that no-one uses for anything else.
Why not use them instead of inventing a new character that does the same
thing? (By the same token, there was never any need for CSV _or_ tab-delimited
text, since the ASCII separators can't be typed into spreadsheet cells, whereas
parsing CSV can be a pain and TABs can be typed.)

If you care, the characters are:

FS -- file separator 1C

GS -- group separator 1D

RS -- record separator 1E

US -- unit separator 1F

As you can see, they also have the benefit of being _self-describing_ (unlike
\x01, as the article points out).
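For anyone who wants to try it, here is a minimal sketch of the idea in Python: fields joined with US and records joined with RS. The `encode`/`decode` helper names and the sample rows are made up for illustration; this is not what the article's authors actually did.

```python
# ASCII separator characters listed above (hex 1C through 1F).
FS, GS, RS, US = "\x1c", "\x1d", "\x1e", "\x1f"


def encode(records):
    """Join the fields of each record with US, and the records with RS."""
    for record in records:
        for field in record:
            # Refuse data that already contains a delimiter instead of escaping it.
            if US in field or RS in field:
                raise ValueError("field contains a separator character")
    return RS.join(US.join(record) for record in records)


def decode(blob):
    """Split on RS to get records, then on US to get fields."""
    if blob == "":
        return []
    return [record.split(US) for record in blob.split(RS)]


# Commas, tabs and quotes in the data need no escaping at all.
rows = [["alice", "likes, commas"], ["bob", "tabs\tand \"quotes\""]]
blob = encode(rows)
assert decode(blob) == rows
```

The trade-off is that the sketch rejects fields containing the separators rather than escaping them, which only works if your data really never contains those characters.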

(The sad thing is I only learned about these characters because I had to parse
files in a 1960s format originally designed to be stored on tape drives -- and
they used these delimiters and they worked great.)

~~~
delluminatus
I only found out about these recently because my company uses them in their
custom RPC protocol, which was created about 15 years ago I think.

IMO, there are ups and downs to using unprintable characters in your protocol.
On one hand, yeah, you don't need to worry about someone putting a tab in a
TSV field and messing up the format. On the other hand, it can make debugging
a lot harder because what you see isn't necessarily what you get, and
obviously it becomes harder to hand-write requests.
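One way to soften the debugging problem, sketched in Python under the assumption that you control the logging code: map each control character to its visible stand-in from the Unicode "Control Pictures" block before printing. The `show_controls` name and the sample message are hypothetical.

```python
def show_controls(text):
    """Replace C0 control characters (below 0x20) with Unicode Control Pictures.

    The Control Pictures block starts at U+2400, so the picture for a control
    character is found by adding its code point to 0x2400.
    """
    return "".join(
        chr(0x2400 + ord(ch)) if ord(ch) < 0x20 else ch
        for ch in text
    )


# A made-up message with unit (1F) and record (1E) separators.
message = "user\x1f42\x1eorder\x1f7"
print(show_controls(message))
```

This only helps with reading; hand-writing requests still needs an editor or tool that can insert the raw characters.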

Of course, OP ended up using unprintable characters anyway, so I think they
might as well use the ASCII ones. But even if you know about and use those
characters, I think there is still a place for [CT]SV.

I'm glad you brought them up though, because I think the nonprintable ASCII
characters occupy a very interesting place in computer science -- (almost)
universally supported, but (almost) never used.

~~~
Tloewald
You can make a case for CSV because it's text editor friendly, and tabs
because it conceptually maps to typewriters, but we don't even have the option
to use these characters. And arguing that typing escape sequences to represent
common characters is better is pretty difficult to swallow.

------
dvirsky
The thing is that contrary to many other tools, you can't run Cassandra with
almost any of its default settings (maybe besides ports). In most tools you'll
need to tweak a few defaults, with Cassandra you need to thoroughly read the
docs on every little configuration in the server and schema definitions
(especially if you're working in multiple DCs), you'll always find a little
surprise if you skim through it.

It's also almost mandatory to read the internal design docs of cassandra even
if you're not the admin but just working with it. And modelling data is a lot
less trivial than it looks - and almost always not what you assume.

Anyway, this is the best talk I've seen on Cassandra data modeling even if you
know Cassandra but not on an expert level. I made sure everyone on my team saw
it at least once.
[https://www.youtube.com/watch?v=qphhxujn5Es](https://www.youtube.com/watch?v=qphhxujn5Es)

~~~
threeseed
I have to disagree. But again given how scalable Cassandra is we could live in
different worlds.

I've run multi-gigabyte data sets in Cassandra without any config changes. And
when I was responsible for a 40 node Cassandra cluster we didn't do any tweaks
other than a few JVM settings here and there.

Cassandra I have to admit though is very sensitive to how you model and store
your database. The whole tombstone saga is never a fun one to go through.

~~~
easytiger
> I've run multi-gigabyte data sets

WOW. You've run several GB of data through it? Wow.

~~~
threeseed
That was the small one. The 40 node one had hundreds of terabytes.

------
zenogais
It's been posted in the comments below, but I'll reiterate it because it needs
to be said. This article is more about the issues of naive database adoption
strategies than any deep fundamental flaw with Cassandra.

I've deployed C* on some similarly large datasets and encountered none of
these issues - even when storing terabytes of time-series data. The difference
- I read through the documentation from head to foot before getting started,
did numerous dry-runs to figure out the data modeling, and when I wasn't sure
about something asked the community (who are usually incredibly quick to
respond on IRC, twitter, or elsewhere).

------
hobofan
We used C* for a similar use case and encountered some of the same issues.

The article doesn't even mention that there are a lot of places in the docs
where it says: "If you configure a keyspace like this, your node will most
likely crash". Not just degraded performance, any developer with some access
might crash a node (and maybe the whole cluster) with a userspace error.

The main lesson I took away from Cassandra is that
battle-tested@Netflix(/other bigcorp) doesn't mean resilient and might require
an engineer on standby at all times to run correctly.

~~~
rohansingh
> The main lesson I took away from Cassandra is that battle-
> tested@Netflix(/other bigcorp) doesn't mean resilient and might require a
> engineer on standby at all times to run correctly.

This is entirely true. We are big users/consumers/fans of Cassandra at
Spotify. We have approximately one crap-ton of Cassandra clusters with several
metric crap-tons of data. But we have an entire team that provides support and
tooling around Cassandra. We contribute upstream, and have employed core
Cassandra contributors in the past.

In return we get a datastore that scales pretty much infinitely with our data
sets, has performance characteristics that we are now well aware of and are
able to reason about, and provides us cross-DC replication and topology-
awareness. It took years to get to this point though, and to build the
operational expertise required to run Cassandra. Only recently have we gotten
to the point where teams are able to self-service provision their own
Cassandra clusters.

This is a resilient, scalable solution, but if I were to quit tomorrow and
start a five-person startup, there is no way I would consider C* as a workable
solution.

~~~
skorgu
What would you use?

~~~
rohansingh
Really depends on what the startup was. If it's a big data startup, then I'd
put a lot of thought into it based on the type of problem. If we were just
building an app that needed some kind of data store, who knows, might even use
something like Parse.

No need to prematurely optimize as long as you set yourself up to prevent
lock-in.

------
arthursilva
Bashing non-COMPACT storage is nonsense.

Non-COMPACT storage by itself doesn't add any significant overhead (only 2
bytes per cell, and even less after compression). What really wastes space is
using collection types. Even using those, I doubt it will ever reach the 30x
mark the author stated.

~~~
chisleu
The article is full of nonsense and the OP is here so criticism is being
downvoted.

~~~
pixelmonkey
huh?

~~~
chisleu
-3 every time I post in this thread. I guess I'll keep posting.

------
serialpreneur
I think you can sum up some of the grievances of the OP as "felt misled by
Cassandra/DataStax marketing". But, some of the criticism is not really the
fault of Cassandra or DataStax at all IMO.

\- CQL stands for Cassandra Query Language. As a user of C*, I don't remember
the docs claiming anywhere that it conforms to the SQL standard. They do
mention it being "similar to SQL", which is a fair comparison.

As a counter-example, take another somewhat similar database, AWS DynamoDB:
the absence of a CQL-like syntax there really frustrates me.

\- Data Modelling: Effective data modelling is difficult in all DB systems,
inherently more so in NOSQL or non-ACID distributed DBs.

\- Counters & Collections: I feel that criticism is legit. I felt similar
pains too. I've learned the lesson there not to trust all marketing claims.

------
pixelmonkey
OP here. Surprised to find this article on the HN front page, a month after we
originally posted it.

Glad to answer questions. Ask me anything!

~~~
samkone
So are you still running Cassandra? And why rely on counters for analytics
when it's widely known to be a bad idea?

~~~
pixelmonkey
We are still running Cassandra. It's used in a more restricted way than we
originally thought we'd use it, but we're following the recommendations we
wrote up in linked article.

We don't use counters -- at all. (This is discussed in one of the sections.)
Counters are one of the few Cassandra features that can offer you some form of
value aggregation inside the data store, but we decided they weren't worth it,
due to the quirks.

What do you mean by, "when it's widely known to be a bad idea?" What are you
referencing? Lots of people use counters in Redis and MongoDB for analytics,
for example.

~~~
nemothekid
Counters in Cassandra seem to have been politically driven (something about
the design being influenced by Twitter), and given the fact that Cassandra
values availability over all else, they are best-effort counters at best. So you can't
trust them to be 100% accurate (or so I've heard). This is another case of "to
run Cassandra you really have to know Cassandra" as Cassandra counters provide
a different set of guarantees than Redis (which is single instance) and Mongo
(which chooses consistency).

------
manishsharan
Cassandra or most NoSQL databases seem to require a lot more internals
knowledge than what a development or an ops team would have. Most sysops teams
have a cadre of certified DBAs and administrators and they get on by without
any surprises. The one thing Oracle got right was their certification program
which ensures there are no gotchas -- most DBAs can deliver smooth operations.
I guess it is because these NOSQL products are relatively new that they
require ninja-level experts to fiddle with the controls.

I would be interested in hearing more from someone who has used a third-party
managed Cassandra service like Google's Cassandra product
[https://cloud.google.com/solutions/cassandra/](https://cloud.google.com/solutions/cassandra/).
Did you still need deep internal knowledge to use Cassandra? This is important to
know for me because my organization will not have the resources to manage
Cassandra clusters but we need a Cassandra like store.

~~~
lobster_johnson
On the other hand, if you tried to fit the article's use case -- at that scale
-- into something like PostgreSQL you would still have to start learning the
internals.

I think the article author's problem was that they assumed Cassandra would be
_similar_ to the technology they were already familiar with, which was not the
case; it's different in so many areas, especially with regard to its
performance profile, and to someone used to relational databases it's outright
alien.

At the scale they're describing, they _do_ exceed the threshold point at which
one has to learn the internals of the technology. In engineering, scale is
everything. Riding a bicycle doesn't require that you know how bicycles work,
but sending a rocket to the moon demands a lot of knowledge about a wide range
of subjects.

------
zikzikzik
This Team Used Apache Cassandra Without Reading The Fine Manual… You Won’t
Believe What Happened Next

    “You honestly expected that adopting a data store at your scale would not require you to learn all of its internals?”

These aren't even internals, these are basic facts about Cassandra.

~~~
dvirsky
I do believe Cassandra requires a much more intimate knowledge of its
internals if you want to do anything serious with it - more so than any other
database I've worked with.

~~~
threeseed
I have to agree. But with the clarification that it isn't significantly more.
Spend a few days and you will be an expert.

And I think it's worth noting that the "serious" things you would do on
Cassandra more often than not you couldn't do on most other databases.

------
sidonie
I always assumed Cassandra to be the easiest-to-operate distributed database.
What other choices could I make? I'm primarily interested in using it as a
distributed, on-disk, datacenter-aware hash table.

------
20kleagues
Thank you for changing the name of the title. The default title is clickbait
which I find very surprising given this is a tech post and IMO the audience
for such posts hates these titles.

~~~
EdwardDiego
It's a parody of clickbait.

~~~
pixelmonkey
OP here. Indeed, it is a parody of click bait, although in retrospect I think
I should have avoided it.

Parse.ly's customers are media companies so we deal with news / media /
headlines all the time, so we were being cute in referencing this practice in
the article. I used the headlines as a mechanism to break up the prose, too,
and add some levity, since this article weighs in at over 4,000 words.

But I underestimated the ability of techies to scan the headline and paragraph
one of an article and say, "not worth my time" and bounce instinctively. In
fact, when I tried to submit this post to r/programming, the moderators
instantly deleted it, referencing the headline, despite it being a parody.

One thing I shouldn't be surprised by: The Internet adapts!

------
dmarklein
SQL is "trendy"?

~~~
NDizzle
It depends on the phase of the moon.

------
smoyer
I love that the article pokes fun at click-bait articles ... I'm not a
Cassandra user but having used several other NoSQL solutions, I'm not at all
surprised that this team, who dove into a new technology without caution, was
surprised.

~~~
threeseed
It's not just dove in without caution but without fully checking design
decisions.

For example, using Maps for anything other than storing a handful of values has always
been discouraged.

------
chisleu
"We were actually hoping that Cassandra could store things more compactly than
our raw data"

Why in hell would you think that?

CQL is not SQL? No shit. Actually, the docs pretty clearly spell that out.
Anyone with even a cursory knowledge of Cassandra or any big data store would
know you need to understand how the system works, and what its limitations
are before you model your data for the system.

COMPACT STORAGE is for backwards compatibility. Turn it on and you are turning
off CQL3:
[http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create...](http://docs.datastax.com/en/cql/3.0/cql/cql_reference/create_table_r.html)
[http://www.datastax.com/dev/blog/whats-new-in-cql-3-0](http://www.datastax.com/dev/blog/whats-new-in-cql-3-0)

Counters are only usable for things where you don't care if the counter gets
increased exactly one time. It is for stuff that doesn't really matter, like
"likes" on a page or something.
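To make the "exactly one time" point concrete, here is a toy Python simulation. It is not Cassandra's counter implementation or any driver's API; `FlakyCounter` and `increment_with_retry` are invented names. It only shows the general problem: an increment is not idempotent, so a client that retries after a lost acknowledgement can count the same event twice.

```python
import random


class FlakyCounter:
    """Toy in-memory counter whose acknowledgements are sometimes lost."""

    def __init__(self, ack_loss_rate, seed=0):
        self.value = 0
        self.ack_loss_rate = ack_loss_rate
        self.rng = random.Random(seed)

    def increment(self):
        self.value += 1  # the write is always applied...
        if self.rng.random() < self.ack_loss_rate:
            # ...but the client may never hear that it succeeded.
            raise TimeoutError("acknowledgement lost")


def increment_with_retry(counter, max_attempts=5):
    """Retry on timeout, as a client that wants 'at least once' would."""
    for _ in range(max_attempts):
        try:
            counter.increment()
            return
        except TimeoutError:
            continue  # cannot tell whether the write landed, so try again


counter = FlakyCounter(ack_loss_rate=0.2)
intended = 1000
for _ in range(intended):
    increment_with_retry(counter)

# Every retry after a lost acknowledgement adds an extra count.
print("intended:", intended, "stored:", counter.value)
```

The opposite policy (never retry) avoids overcounting against this toy counter, but against a real store where the write itself can be lost it trades overcounting for undercounting. Either way the total is approximate.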

I'm not even reading "Check Your Row Size: Too Wide, Too Narrow, or Just
Right?" because measuring the maximum number of columns against the ~10MB
performance limit is trivial to do, and this is a datastore designed by data
scientists for people who are willing to do the small amount of learning required.
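A back-of-envelope version of that check, as a Python sketch. Only the ~10MB figure comes from this comment; the per-column name, value and overhead sizes below are hypothetical placeholders, so measure your own data before trusting any result.

```python
ROW_LIMIT_BYTES = 10 * 1024 * 1024  # the "~10MB" figure quoted above


def estimated_row_bytes(columns, name_bytes, value_bytes, overhead_bytes):
    """Rough size of one wide row: each column pays for name, value and overhead."""
    return columns * (name_bytes + value_bytes + overhead_bytes)


def max_columns(name_bytes, value_bytes, overhead_bytes, limit=ROW_LIMIT_BYTES):
    """Largest column count whose estimated size still fits under the limit."""
    return limit // (name_bytes + value_bytes + overhead_bytes)


# Hypothetical sizes: 16-byte column names, 64-byte values, 15 bytes of overhead.
size = estimated_row_bytes(columns=100_000, name_bytes=16, value_bytes=64,
                           overhead_bytes=15)
print("estimated bytes:", size, "fits:", size <= ROW_LIMIT_BYTES)
print("max columns:", max_columns(name_bytes=16, value_bytes=64, overhead_bytes=15))
```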

This whole article reads like someone who decided to incorporate Cassandra
without actually spending any time learning how to do it correctly before
trying to do it.

"He said, “You honestly expected that adopting a data store at your scale
would not require you to learn all of its internals?” He has a point."

No shit.

~~~
dang
Hey, it sounds like you know quite a bit about this topic, and passion for
technical matters is something everyone here understands. But please follow
the HN guidelines when commenting. It takes some effort, but it's effort we
all need to make.

[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

------
baldfat
I still think of Cassandra as the tool the Digg engineers used to kill Digg as
we all loved/hated it in its time.

~~~
samkone
If a company blames its demise on a piece of tech, then either they need
better engineers to change that tech, or the product had no future at all. And
Cassandra has changed a lot since then.

~~~
baldfat
What I said is still true. Digg USED Cassandra to destroy itself. They changed
business plans and drove off the community with a ton of downtime and a loss
of data.

I agree it wasn't tech's fault that Kevin Rose allowed VCs to control Digg, but
there were issues with Cassandra, especially while the SQL vs NoSQL debate
was at its height.

