With that said, it's yet another piece of tech that bloats our stack, and I would love to reduce our tech debt. We are much more familiar with relational databases like MySQL and Postgres, but we fear they can't handle our analytics workloads, hence Cassandra and Spark. We use these technologies out of necessity, not out of love for them.
Ultimately, the question I'm interested in trying to answer is: would it help if there were more ways to make Spark feel like a traditional relational database? (e.g. being able to talk to the Spark driver over the MySQL or Postgres wire protocol)
It's interesting. 10 years ago I would probably have said something like "relational DBs will just get better as data grows". Quite the opposite happened: relational has been pushed to the side, and we now have to learn a lot of new technologies, in my case Cassandra, Spark, and Pandas (Python). This whole stack used to be just MySQL :)... and I miss those days!
At the same time, of course there are some very good points to be made for this sort of storage agnosticism -- mainly from an efficiency standpoint (i.e. being able to choose the right storage format for the occasion). I'm just not sure that argument is strong enough to justify completely sacrificing the simplicity of a traditional database.
Sometimes I think MPP engines like Spark should adopt a "batteries included but replaceable" philosophy -- that is, serve as an all-in-one "database" that provides a default storage engine (e.g. a basic columnstore and a basic rowstore), while still letting users plug in other data sources to join against, but only if they want to.
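To make that idea concrete with something small: SQLite's ATTACH is a miniature version of this philosophy -- one SQL engine with its own default storage, plus extra stores you can plug in and join against in a single query. (In Spark, the data-source API plays the "attach" role, just at much larger scale.) The tables and values below are made up purely for illustration:

```python
import sqlite3

# "Batteries included": the engine's own default storage
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (user_id INTEGER, country TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(1, "de"), (2, "fr"), (1, "de")])

# "Replaceable": plug in a second, independent store and make it
# part of the same logical database
db.execute("ATTACH DATABASE ':memory:' AS ext")
db.execute("CREATE TABLE ext.users (user_id INTEGER, name TEXT)")
db.executemany("INSERT INTO ext.users VALUES (?, ?)",
               [(1, "alice"), (2, "bob")])

# One query spanning both stores, as if it were all one database
rows = db.execute("""
    SELECT country, COUNT(*) AS n
    FROM events JOIN ext.users USING (user_id)
    GROUP BY country ORDER BY country
""").fetchall()
print(rows)  # [('de', 2), ('fr', 1)]
```

That's roughly the shape I'd want from Spark: the default storage is there when you just want a database, and the ATTACH-style escape hatch is there when you don't.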