
No limitations. Spark works and does a good job, and it has many features that I can see us using in the future too.

With that said, it's yet another piece of tech that bloats our stack. I would love to reduce our tech debt: we are much more familiar with relational databases like MySQL and Postgres, but we fear they won't solve the analytics problems we have, hence Cassandra and Spark. We use these technologies out of necessity, not out of love for them.

Ah, I see -- so a thing that I'm curious about is, what do you miss about relational databases? Is it mainly the operational side, or the usability/API side?

Ultimately, the question that I'm interested in trying to answer is: would it help if there were more ways to make Spark feel like a traditional relational database? (e.g. being able to interact with the Spark driver using MySQL or Postgres wire protocol)

Spark already does a good job at that, imo. It's increasingly easy to query information; at this point we are basically writing SQL-like queries with it. BUT, Spark isn't a relational db, or even a storage solution. What I miss is just having one piece of technology that deals with both storing and querying: an actual relational database.

It's interesting. 10 years ago I would probably have said something like "relational dbs will just get better as data grows". Quite the opposite happened... Relational has been pushed to the side and we now have to learn a lot of new technologies, in my case: Cassandra, Spark, and Pandas (Python). This whole stack used to be just MySQL :)... And I miss those days!

Ah, yeah. I have so many mixed thoughts on this. I also think that the open source world copying Google's super-decoupled GFS-Bigtable-MapReduce-Dremel-etc... architecture was really not great for operational complexity. Few teams can operate like Google, and maintain so many moving parts in production all at the same time.

At the same time, of course there are some very good points to be made for this sort of storage agnosticism -- mainly from an efficiency standpoint (i.e. being able to choose the storage format for the occasion). I'm really not sure this argument is strong enough to justify completely sacrificing the simplicity of a traditional database.

Sometimes I think that MPP engines like Spark should take the philosophy of "batteries included but replaceable" -- that is, basically serving as an all-in-one "database" that provides a default storage engine (e.g. a basic columnstore and a basic rowstore), but still letting the user plug in other data sources to join against, if they want.
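
To make the "batteries included but replaceable" idea a bit more concrete, here is a toy sketch in plain Python. It is not how Spark or any real engine works: `Engine`, `InMemoryRowStore`, `ListSource`, and the `scan()` interface are all hypothetical names made up for illustration. The engine ships with a default row store, and anything else that exposes `scan()` can be registered and joined against it.

```python
from typing import Dict, Iterable, List, Protocol

Row = Dict[str, object]


class DataSource(Protocol):
    """Hypothetical interface: anything that can yield rows can be queried."""

    def scan(self) -> Iterable[Row]: ...


class InMemoryRowStore:
    """The included 'battery': a trivial row store bundled with the engine."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def insert(self, row: Row) -> None:
        self._rows.append(dict(row))

    def scan(self) -> Iterable[Row]:
        return iter(self._rows)


class ListSource:
    """Stand-in for an external, user-supplied source (files, another DB, ...)."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self._rows = list(rows)

    def scan(self) -> Iterable[Row]:
        return iter(self._rows)


class Engine:
    """Toy engine: default storage out of the box, other sources pluggable."""

    def __init__(self) -> None:
        self._tables: Dict[str, DataSource] = {}

    def create_table(self, name: str) -> InMemoryRowStore:
        # No storage decision required from the user: use the default store.
        store = InMemoryRowStore()
        self._tables[name] = store
        return store

    def register(self, name: str, source: DataSource) -> None:
        # Optional: plug in any other source under a table name.
        self._tables[name] = source

    def join(self, left: str, right: str, key: str) -> List[Row]:
        # Simple hash join: index the right side by key, then probe with the left.
        index: Dict[object, List[Row]] = {}
        for r in self._tables[right].scan():
            index.setdefault(r[key], []).append(r)
        out: List[Row] = []
        for l in self._tables[left].scan():
            for r in index.get(l[key], []):
                out.append({**r, **l})  # left values win on column name clashes
        return out


engine = Engine()
users = engine.create_table("users")  # stored in the default engine
users.insert({"id": 1, "name": "ann"})
engine.register("events", ListSource([  # external source, only if you want one
    {"id": 1, "event": "click"},
    {"id": 2, "event": "view"},
]))
joined = engine.join("users", "events", "id")
```

The point of the sketch is just that `create_table` needs no storage decision at all, while `register` keeps the door open for other sources.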

Have you looked at memsql.com? It's the "better" relational database you're talking about, especially for data warehousing.
