In most relational databases, the schema is cheap to alter, and altering it doesn't impose even a temporary performance impact.
In fact, all of their requirements (aside from linear scalability) could also be met with a relational database. Doing so would give you much more flexible querying for various reports, and it would reduce the engineering effort required to retrieve data as they add more features (relational databases are really good at being queried with arbitrary constraints).
I think people tend to dismiss relational databases a bit too quickly these days.
b) You dismissed scalability rather casually there, but it's the most important requirement for a company like this. People don't choose Cassandra lightly, given how significant its tradeoffs are.
c) Most companies offload analytics/reporting workloads into Hadoop/Spark and then export the results back to their EDW. This allows far more functionality and keeps your primary data store free of ad hoc internal workloads.
d) Nobody dismisses relational databases quickly. In almost all cases they are the first choice because they are so well understood. The issue is that most of them do have problems with linear scalability, and the cost to support the ones that can scale (e.g. Teradata, Oracle) is quite prohibitive.
Re c): this depends strongly on the use case. I've seen companies use a) to avoid split-brain problems and the burden of managing two data pipelines, to great success at similar scales. You might find https://www.youtube.com/watch?v=NVl9_6J1G60 interesting.
1) "Altering schema" is vague. It was used vaguely in the article (although, given the clarity of the article, I suspect the authors knew exactly what they meant). Some alterations on relational database tables are fine, even on hot tables with billions of rows. Others are not.
Adding a new column: fine. Indexing that new (empty) column: fine. Creating a new index on an existing column with billions of rows: definitely not fine.
But the index plan described in the article was very specific about what they wanted. It doesn't sound like they had to add any new indices.
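To make the distinction concrete, here's a toy sketch using SQLite as a stand-in (the exact costs vary by engine, and the table and column names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT)")
con.executemany("INSERT INTO messages (body) VALUES (?)",
                [(f"msg {i}",) for i in range(100_000)])

# Cheap: adding a column is essentially a metadata-only change in most
# engines, independent of how many rows the table already has.
con.execute("ALTER TABLE messages ADD COLUMN edited_at INTEGER")

# Expensive: building an index must read every existing row -- this is
# the "definitely not fine" case on a hot table with billions of rows.
con.execute("CREATE INDEX idx_messages_body ON messages (body)")
```

The first DDL statement finishes in constant time regardless of table size; the second is O(n) in the row count, which is exactly why the two get conflated under "altering schema" even though their costs are wildly different.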
2) Mostly agree here. Linear scalability is a big deal, and it's fucking hard to do well in most RDBMSes. I slightly disagree, however, because the article explicitly states that they are willing to trade C for A in the CAP theorem. This is important. The hardest part of linearly scaling an RDBMS is enforcing C. Think of transactional systems that absolutely must be consistent, like your bank account. This isn't that, and the blog post clearly says so. That takes a lot of pressure off the relational database when it comes to scaling.
3) Strongly disagree. Most companies don't have the resources or manpower to do that. It takes a lot of time and a lot of effort. Hell, most companies don't even have an EDW. Let alone a pipeline from the OLTP server to Spark/Hadoop to the non-existent EDW.
4) We seem to run in different circles. Almost everyone I know dismisses relational databases without question. Mongo is the way to go. And I get called out as the resident old fart/luddite who insists on using postgres. Speaking of which, if the first things you think of with relational DBs are Teradata and Oracle, we are definitely operating in different contexts.
If your opinion is that relational databases are generally well understood by--and therefore often the first choice for--developers . . . I want to know where you work.
Because that's not a different context from where I am.
That's a different universe.
The reality is that storing and retrieving data is a hard problem, and there's no set answer that works for everyone in every situation. If you're building a new product from scratch, you should go with what your team knows, provided the team knows enough not to put yourself in a situation where you're just losing data in a partition scenario (a point well made in the original article: Mongo is fine on one node; scale it out, and you might as well write your data to /dev/null).
Almost any datastore will serve the needs of a new product until it needs to scale horizontally. Relational, NoSQL, object store, whatever. When it comes to scaling linearly, you have to take a few factors into account.
1) Which part of CAP theorem are you willing to sacrifice? You always have to let go of one.
1a) If you want a CP system, you have no choice but to deal with scaling problems of relational databases. You must have transactional guarantees for this to work.
1b) If you need an AP system, you have choices, but the choices lean in favor of systems like cassandra. It's just easier than setting up multi-node postgres and doing sharding.
It's also worth pointing out that people very often dismiss vertical scaling too soon. Take a look at Joel Spolsky's articles about infrastructure at StackOverflow. You can do quite a lot with the available firepower of modern technology by just buying bigger and better hardware.
I'm not suggesting that going bigger would have been the right choice for Discord. But sometimes it can be the right choice.
If there's something I fundamentally disagree with about the article, it's this: trying to do everything in a single data store. I think--much like what you suggested above--that it's better to have separate systems for reading and writing. Since the use case is definitively AP, I can't see a reason not to have a transactional system in an RDBMS and a streaming pipeline to a cassandra cluster for reading.
Use the right tools for the right job, is basically my point.
I suppose I run with more...sensible devs? I mean, a lot of my co-workers are Millennial Hipster Rubyist types, and they'll pick a Postgres or MySQL database literally every time; you'd have to pry it from their cold, dead hands. One team here even built their own queuing system on top of some Ruby and MySQL. (Please don't ask. They had...reasons, but they basically reinvented Kafka.)
These same teams really try to avoid Redis, also.
Of course, these teams are writing REST APIs with very strict SLAs. Most of the time, when I see MongoDB and other "NoSQL" DBs used, it's when front end JS devs are writing the Node backend code. >.>
The hardest part of linear scaling in an RDBMS is actually doing the scaling - it's "what do I do when I'm about to outgrow a master and need to add a bunch of capacity", and "what do I do when the master crashes". At Crowdstrike we would add 60-80 servers to a cassandra cluster AT A TIME, with no downtime, no application changes, and no extra work on our side - just bootstrap them in, they copy their data, and they answer queries. The tooling to do that in the RDBMS world probably exists at FB/Tumblr/YouTube, and almost nowhere else.
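What makes that kind of no-downtime expansion possible is the token ring: each node owns arcs of the hash space, so a bootstrapping node only takes over the keys that land on its new arcs. Here's a minimal sketch of the idea (toy code, not Cassandra's actual implementation; real rings also handle replication, gossip, and streaming):

```python
import bisect
import hashlib

def token(key: str) -> int:
    # Stand-in for Cassandra's partitioner.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        self.tokens = []  # sorted (token, node) pairs
        for node in nodes:
            self.add(node, vnodes)

    def add(self, node, vnodes=64):
        # Bootstrapping a node just inserts its tokens; existing
        # assignments elsewhere on the ring are untouched.
        for i in range(vnodes):
            bisect.insort(self.tokens, (token(f"{node}#{i}"), node))

    def owner(self, key):
        # A key belongs to the first node clockwise from its token.
        i = bisect.bisect(self.tokens, (token(key), ""))
        return self.tokens[i % len(self.tokens)][1]

ring = Ring([f"node{i}" for i in range(4)])
keys = [f"user:{i}" for i in range(2000)]
before = {k: ring.owner(k) for k in keys}

ring.add("node4")  # "bootstrap them in"
moved = sum(before[k] != ring.owner(k) for k in keys)
# Only roughly 1/5 of the keys change owners; the rest never move,
# which is why the application doesn't need any changes.
```

Contrast this with range-sharded RDBMS setups, where adding capacity typically means re-partitioning data and updating application routing by hand.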
> Think transactional systems that absolutely must be consistent. Like your bank account
Most banks use eventual consistency, with running ledgers reconciled over time.
> It takes a lot of time and a lot of effort. Hell, most companies don't even have an EDW. Let alone a pipeline from the OLTP server to Spark/Hadoop to the non-existent EDW.
In the cassandra world, it's incredibly common to set up an extra pseudo-datacenter, which is only queried by analytics platforms (Spark et al.). Much less work, and it doesn't impact the OLTP side.
> 1a) If you want a CP system, you have no choice but to deal with scaling problems of relational databases. You must have transactional guarantees for this to work.
This is fundamentally untrue - you can query cassandra with ConsistencyLevel:ALL and get strong consistency on every query (and UnavailableException anytime the network is partitioned or a single node is offline). Better still, you can read and write with ConsistencyLevel:Quorum and get strong consistency and still tolerate a single node failure in most common configs.
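The arithmetic behind that claim: with replication factor N, any read quorum and any write quorum must overlap in at least one replica whenever R + W > N, so a quorum read always sees the latest quorum-acknowledged write. A quick sketch:

```python
# With replication factor n, a quorum is floor(n/2) + 1 replicas.
def quorum(n: int) -> int:
    return n // 2 + 1

for rf in (3, 5, 7):
    q = quorum(rf)
    # R + W > N means every read quorum intersects every write quorum,
    # which is what makes CL=QUORUM reads strongly consistent.
    assert q + q > rf
    # ...while still tolerating rf - q down nodes (1 for rf=3).
    print(f"rf={rf}: quorum={q}, tolerates {rf - q} failed node(s)")
```

ConsistencyLevel:ALL is the degenerate case W = N (or R = N): maximal overlap, zero tolerance for node failure, hence the UnavailableException whenever anything is offline.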
> Use the right tools for the right job, is basically my point.
And this is the real point, with the caveat that you need to know all the tools in order to choose the right one.
1) scaling is easy . . . oh cassandra. Where you can't have C and don't care about P.
2) Let me tell you about banks. I used to work for banks. Banks do not use systems that are eventually consistent; they use systems--however old and outmoded--that are strongly consistent. The one exception is ACH transfers, and that's not a database. That's a flat file.
3) There is no cassandra world that you speak of. This is utter bullshit.
4) No, it's not untrue. Cow--as we call it on my team--absolutely sucks at C when you're talking about scaling horizontally.
Make up your mind. Is this good at single node guarantees or is it good at sharded guarantees?
We know for a fact that of C, A, and P, you can't have all three. You can have AP or CP, but not all of them. If you're arguing that you can have C and A, you have failed at P.
Maybe that's a thing you're willing to trade-off. But it doesn't in any way relate to my point.
My point, if you missed it, was this: if you want strong consistency, you need a relational database, and you need transactional guarantees.
That is hard to do, and no one does it well yet. You're just lying to people if you say otherwise.
> 1) scaling is easy . . . oh cassandra. Where you can't have C and don't care about P.
This isn't about teaching me the CAP theorem. I know the CAP theorem. I know the tradeoffs. I've built and managed systems that handle hundreds of billions of events a day, writing millions of writes a second into a thousand cassandra nodes. You can have C if you want it - you don't get transactions with rollbacks, but that doesn't mean you don't have consistency.
> 2) Let me tell you about banks. I used to work for banks. Banks do not use systems that are eventually consistent
All this time, I thought ING was a bank: https://www.youtube.com/watch?v=EiqdX23u_Mk
> 3) There is no cassandra world that you speak of. This is utter bullshit.
I see, lame troll or wholly clueless. Guess I'm done.
> if you want strong consistency, you need a relational database
You can have consistency without transactions.
Overall, it really sounds like you've never used Cassandra and don't know much about it, and are possibly confused about the CAP theorem.
Got a citation for that, skipper?
Never mind, I found one: http://bfy.tw/9bcO
But you'd think they'd read about CAP theorem and cloud architectures while they're hiding from everyone.
In too many cases, people conflate RDBMS with MySQL, where any schema mod on a large table is time-consuming, even adding a nullable column with no other constraints.
I call it the "MySQL Effect", aka NoMySQL.