
A Newbie’s Guide to Cassandra - ddrum001
https://blog.insightdatascience.com/the-total-newbies-guide-to-cassandra-e63bce0316a4
======
nemothekid
One thing that Cassandra doesn't have a good story for, and that intro guides
continue to gloss over, is the ops situation. I've recently moved some of our
largest Cassandra tables to BigTable for this reason. The compaction / repair
/ garbage collection death cycle is probably the most difficult thing to
manage, and in the past 3 years of using Cassandra, managing it has gotten
worse. Tools have been deprecated (like OpsCenter) and new features can
exacerbate the problem. There is still no reliable way to detect when repairs
have finished, and if you have a large enough table, repairs can take a week
to finish. Combine that with the fact that if a table is that large, it
probably has a high write volume, meaning it has a lot of compactions as
well. So you have repairs and compactions going on which thrash the heap, and
now you also have a GC tuning problem.

It took a lot of experimentation to get right, but once I did, scaling started
to mean smaller drives and more nodes, which meant a more expensive cluster
where I was largely paying for my CPUs to repair and garbage collect data.

Other than ops, however, Cassandra is a great tool and does everything it says
it does on the box.

~~~
Boxxed
Completely mirrors my experience with Cassandra. I think they'd have a real
contender on their hands if operating a Cassandra cluster didn't basically
take a full-time engineer. Its backup story is absolutely abysmal, and the
tooling is atrocious -- during a support incident a DataStax guy suggested I
dump a table with sstable2json (or something like that), which generated a
100GB JSON file. When I pointed out that basically nothing could consume it
because it was one 100GB hash object, he said, "Yeah, I guess no one ever
uses this stuff."

~~~
jjirsa
As a long-time Cassandra user: people use sstable2json all the time, but most
people don't have 100GB sstables (or 20GB sstables that expand to 100GB of
JSON).

Certainly something we can do better - how would you break it up? Adding a key
to dump an individual partition to JSON?

~~~
Boxxed
It wasn't that large a database if I remember correctly -- maybe 1TB? The
sstable sizes seemed reasonable at the time; I think it was just explosion due
to JSON.

Anywhoo, one huge file is fine; what's _not_ fine is having one huge JSON
object -- streaming parsers might be ubiquitous in the XML world, but
definitely not in JSON land. Something simple like small JSON documents
separated by newlines would work.
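To make the suggestion concrete: here's a minimal, hypothetical sketch (plain Python, not any actual Cassandra tooling; the records are made up) of writing and then streaming newline-delimited JSON, where each line parses independently with constant memory:

```python
import json
import os
import tempfile

# Hypothetical records standing in for dumped table rows.
records = [
    {"key": "user:1", "name": "alice"},
    {"key": "user:2", "name": "bob"},
]

path = os.path.join(tempfile.gettempdir(), "dump.jsonl")

# Write one JSON document per line instead of one giant JSON object.
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# A consumer can now stream the dump line by line; memory use is
# bounded by the largest single record, not the whole file.
parsed = []
with open(path) as f:
    for line in f:
        parsed.append(json.loads(line))
```

The point is that a 100GB dump in this shape can be processed by any tool that reads lines, with no streaming JSON parser required.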

~~~
jjirsa
[https://issues.apache.org/jira/browse/CASSANDRA-13848](https://issues.apache.org/jira/browse/CASSANDRA-13848)
Created just for you

~~~
Boxxed
What a guy slash gal; cool!

------
schmichael
This article does a massive disservice by using the pre-CQL Column Family and
Row terminology. While it's the Cassandra data modelling I'm the most
accustomed to personally, it causes endless confusion for users who find
themselves in the CQL documentation trying to understand how it all maps to
Primary Keys, Partition Keys, static columns, etc.

This transition has been causing confusion for at least 5 years now, and it
appears people are still using the old terminology!
[https://www.datastax.com/dev/blog/thrift-to-cql3](https://www.datastax.com/dev/blog/thrift-to-cql3)

~~~
pfarnsworth
As a recent user of Cassandra, I found exactly this to be a huge problem. Any
kind of googling would return too many different terms, and the relevance or
context was completely missing, so I was confused for a while until I
realized that the terms had changed. The unintended consequence of such a
quick change in terminology is that it makes for a very hard experience for
newbies.

~~~
jjirsa
It's not really that quick of a change - it's been in flight for something
like 4 years? Maybe 5? And Thrift is still supported until 4.0, so you can use
the old style for quite some time.

------
mi100hael
_> Cassandra’s data model is a partitioned row store with tunable consistency
where each row is an instance of a column family that follows the same schema_

"Total Newbie" apparently means well-versed in database paradigms and
terminology.

~~~
theflork
Agreed. If anyone has recommended resources for someone coming from the SQL
world and wanting to learn more about databases in the Cassandra and HBase
space, please share!

~~~
jjirsa
DataStax Academy is probably the best free source.

Searching YouTube for Cassandra Summit talks is probably second.

There was a push to do some better docs on the ASF website, but the manpower
is currently spent writing code instead - we have no real full-time doc
writers who focus on the open source product. Maybe some day someone will
volunteer (and if you want to volunteer, I'll commit the docs for you - the
site has a how-to-contribute guide, but honestly I'll take GitHub PRs if
they're nontrivial, even though it's an annoying workflow for our non-GitHub
master).

------
errantmind
These days I work with Cassandra on a daily basis. The company I am
contracting with switched to Cassandra a while back for their primary data
store. A few poor decisions later, they were spending tens of thousands of
dollars a month running Cassandra in Azure. The cost was high because they
modeled and queried their data as if they were still using a SQL database,
which was incredibly inefficient.

The lesson here is to think long and hard about how you are going to access
your data before switching to a database like Cassandra. This will help you
decide if Cassandra is the right database to fit your use-cases. If so, be
sure to model your data appropriately.
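A toy sketch of what "model your data appropriately" means in practice - denormalizing into one structure per query pattern rather than normalizing and joining (the data and names here are invented for illustration, not from any real schema):

```python
# Each "table" is pre-built to answer exactly one query, so every read
# is a single lookup by its key -- the Cassandra approach -- instead of
# a relational join computed at query time.
events = [
    {"user": "alice", "day": "2017-09-01", "action": "login"},
    {"user": "bob",   "day": "2017-09-01", "action": "purchase"},
    {"user": "alice", "day": "2017-09-02", "action": "logout"},
]

events_by_user = {}  # serves the query: "all events for a given user"
events_by_day = {}   # serves the query: "all events on a given day"

# Writes are duplicated into every table that needs the data.
for e in events:
    events_by_user.setdefault(e["user"], []).append(e)
    events_by_day.setdefault(e["day"], []).append(e)
```

Storage is traded for read speed: each access pattern you care about gets its own copy of the data, keyed so the query is a single lookup.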

In this case, based on how the company wants to query the data, they would
have been better off with PostgreSQL.

~~~
bulldoa
Any recommendations or resources to read up on for when to use Cassandra and
how to design the schema?

~~~
jjirsa
Use Cassandra when you need real-time HA across datacenters without having to
manually fail over.

Use Cassandra when you're going to need to grow your database cluster often
and don't have tooling to handle resharding.

Use Cassandra when you do millions of simple queries (per second), not a
handful of complex JOINs.

I've used Cassandra at 3 different employers now, and I can't imagine using
anything else for many use cases, but there will always be some where it's the
wrong choice.

~~~
boondaburrah
I like that your comment's denormalised for better use with Cassandra.

------
bkeroack
CQL is the best and worst thing about Cassandra. The pro is that it is
obviously very similar to SQL, so it's easy to understand; the con is that C*
is nothing like an RDBMS, so you can easily be fooled into doing
dumb/inefficient things with the nice CQL syntax.

I think Cassandra is best thought of as a fancy K/V store that lets extra
data ride along with query results. Don't think in rows/columns at first; it
will just screw up your modeling. Also keep in mind that the cost of very
fast queries is a lot of extra time spent figuring out how to model new data
access patterns in the future.
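As a loose illustration of that K/V framing (a deliberate simplification, not Cassandra's actual storage engine): the partition key picks a bucket, and clustering columns keep rows sorted within it so contiguous ranges are cheap to read. A hypothetical Python sketch:

```python
from bisect import insort

# Roughly: {partition_key: sorted list of (clustering_key, value) pairs}
table = {}

def insert(pk, ck, value):
    # Keep each partition sorted by clustering key on write.
    insort(table.setdefault(pk, []), (ck, value))

def slice_partition(pk, ck_from, ck_to):
    # Queries must name the partition key; within a partition,
    # a contiguous clustering-key range is an efficient scan.
    return [(ck, v) for ck, v in table.get(pk, []) if ck_from <= ck <= ck_to]

insert("sensor-1", 3, "temp=21")
insert("sensor-1", 1, "temp=19")
insert("sensor-1", 2, "temp=20")
```

What you can't do cheaply in this model - query across partitions by value, or by anything other than the keys you chose up front - is exactly the modeling cost described above.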

~~~
Boxxed
I'd say it's absolutely the worst. For about ten seconds it seems like a nice
higher-level abstraction, but then you realize that the abstraction is hiding
exactly what you care about: how your data maps to the underlying storage.
Making it look like SQL was also a huge mistake, because it gives the
impression of a level of expressivity that it by no means has.

