Hacker News new | past | comments | ask | show | jobs | submit login
Choose SQL (stateofprogress.blog)
126 points by koevet on March 21, 2017 | hide | past | favorite | 120 comments



I've been using Postgres for the past 20 years, and it's basically almost always been the right choice. Even when it's not, it's a good starting point to explore the problem until everyone's really, really sure that something else might be best for that niche.


Short lived, high frequency/throughput data (most likely some kind of non-essential logging), that gets accumulated on an interval in a different database would be such a case. Something like Redis can be the right choice there.

Also PostgreSQL is much more than a SQL database, it rivals an application server with its flexibility through extensions and supported languages.


Yes; antirez originally created redis with a logging thing in mind if memory serves... Indeed:

https://groups.google.com/forum/#!msg/redis-db/DHBqQ7x4sOU/e...


If memory serves? Redis serves everything directly from memory.


Yes, but it has Append Only Files (AOF) that are similar to postgres' WAL and allow it to be durable. It can also fork periodically and write a copy of the database to disk. (It's single threaded and linux will COW when forking so it's safe and efficient iirc.)


PG offers unlogged tables for non-acid ephemeral data.


Although he author doesn't take into account that not all web apps are created equal and the SQL/NoSQL argument isn't black and white... I do find myself slightly nodding in agreement with what he says.

NoSQL doesn't reduce development effort. What you gain from not having to worry about modifying schemas and enforcing referential integrity, you lose from having to add more code to your app to check that a DB document has a certain value. In essence you are moving responsibility for data integrity away from the DB and in to your app, something I think is quite dangerous.

NoSQL has its place, but I do feel that it is a bit more hyped up than it should be.


> What you gain from not having to worry about modifying schemas and enforcing referential integrity, you lose from having to add more code to your app to check that a DB document has a certain value.

This × 100. The app I'm working on has so many `if model.value` calls scattered throughout the code that it becomes difficult to follow and make sure no spots were missed. Simply adding validation to the app before writing data isn't sufficient since there are other ways to persist data and NoSQL provides no sense of data integrity.


> In essence you are moving responsibility for data integrity away from the DB and in to your app, something I think is quite dangerous.

This is only necessarily true of specific "instances" of NoSQL. MongoDB has supported server-side schema enforcement since Dec 2015, for example. Most likely other relatively mature NoSQL systems have similar features. There's certainly nothing about NoSQL which makes server-side schemas impossible.

The issue is social, not technical. I believe that people that want to use NoSQL dislike schemas - or believe that their use case is too hard to describe with a schema. Most NoSQL databases won't have schema validation set up even if the feature is available.


Database modeling should be the foundation of any non-trivial application. With a proper data model, there are ways to reason about the data and hence the application as well. Also, with the advent of new client-server communication like GraphQL and Hasura, the frontend developer needs to rely on schema specifications for designing the applications.


> In essence you are moving responsibility for data integrity away from the DB and in to your app, something I think is quite dangerous.

Why is this a bad thing? The code in my app can be tested and is much more expressive than some contraints in the DB.


The problem there is what if you have two different apps connecting to the same DB. When the DB enforces referential integrity both apps can be certain the data in the DB is correct and they wont be able to mess with it. When the apps have to take on that role each app must implement (i.e. unnecessarily duplicate) the same integrity checking functionality. In a complex environment where you might have different apps using the same NoSQL data store that becomes very difficult to manage.


Even if your app is the only app accessing the database, the code in your app can be changed. When you change the validation criteria, do you reread every row from the database and revalidate it?

When validation criteria are changed on an SQL server, it revalidates all the rows - either immediately, or at the end of the transaction. With the DB doing the validation, the default is that all rows are valid.

Also, do you ever side-step your database abstraction layer to send SQL directly? If you ever insert/update/delete rows that way, perhaps for performance reasons, now you have an extra place in your app to make sure the relevant validations are kept up-to-date.

https://robots.thoughtbot.com/validation-database-constraint...

Relatedly, this article suggests that your app should only be validating user input - so there should be no app validation of fields not set by end users - whilst the database itself should validate what your app gives it.


Yes! As a data scientist, an important point of my job is evangelizing the data consumer's point of view. It's easy for data producers (e.g. the developers who write the logging code that generates the data I use) to make decisions that make their lives easier, at the expense of making my life much harder.

Only by talking to each other can we find the balance that mutually optimizes our joint effort level. A little more effort up-front to log data in a format with well-defined schemas, translates into huge savings in processing, consuming, and understanding the data down the road.


> evangelizing the data consumer's point of view

Note that this is a fundamental principle in software development, not just for data science:

    Prefer data structures that are simple to interpret over those simple to generate.
This principle follows from two basic observations:

1. Data is usually produced by one application, but consumed by many different applications.

2. Writing a serializer is almost always easier than writing a parser. [1]

However, it is not always clear how to apply this principle. For example, as a library or web service author, you define the API on your own, rather than having your users define it. It is extra effort to get feedback from your library users, and to design your API accordingly. That's why most APIs are geared towards the data producers rather than consumers.

There is even an enterprise design pattern for this:

https://martinfowler.com/articles/consumerDrivenContracts.ht...

[1] The former usually involves a template system and a few helper functions. The latter usually involves regexes embedded in ugly code, parser generators with all their shortcomings (and the fact that most programmers don't bother to learn using them), complex DOM or JSON traversals that fail to handle all special cases, or masses of autogenerated classes automatically derived from some XML schema (or JSON schema).


> Second, no; NoSQL does not scale better than SQL. At least not in the manner you think. MySQL is being used from top-traffic websites like Facebook and Twitter, to high traffic websites like GitHub and Basecamp and almost every PHP installation out there. If this is not scaling, then what is?

YouTube, mentioned in the article, also uses MySQL, but the author forgets to mention those big names don't use a plain simple Vanilla SQL. In the case of YouTube, it's Vitess [1]. It would be nice for the author to mention this.

[1] http://vitess.io/


All those big sites do a lot of things to make MySQL scale.

The point is just fighting the perception that MySQL can't scale.


> The point is just fighting the perception that MySQL can't scale.

Which is a misplaced point that'd misleading an entire generation of engineers.

MySQL doesn't scale. The companies mentioned had to throw huge amount of engineering resources to make it work, whatever the costs and the drawbacks. They had no choice left, hack it or die out of your own traffic.


How many systems are out there that will run at YouTube/Facebook/etc scale without "huge amount of engineering resources to make it work"? Zero. Zilch. Nada.


System that will run at hundreds of times the scale of MySQL without requiring the equivalent hassle in engineering:

Cassandra, ElasticSearch, Riak, RedShift, BigQuery, DynamoDB, BigTable, Spanner, Terradata, S3...


systems that are designed as distributed datastores run as managed, hosted services on MASSIVE cloud datacenters...yea those will scale.

if you're trying to keep your hosting overhead down though none of those is available (except S3 I suppose, which is super cheap, but also not really a datastore solution) but a basic app server running MySQL (or similar) is. now, I'm not saying anybody should expect scaling for free. I'm just saying you're comparing apples to oranges here.

The apples to apples comparison is basic app server running MySQL vs. basic app server running MongoDB. the myth is that the MongoDB instance inherently scales better than the MySQL instance. it doesn't, particularly. it has better write performance, but not much else that gives it better scalability.

and the real point of the article is to combat some of the magical thinking that has lead to many bad decisions from misinformed developers who heard a rumor that a NoSQL datastore has orders of magnitude better performance characteristics than an RDBMS.


When we talk about scaling (not necessarily to the size of reddit/youtube) that implies you're already way beyond an app running with a mysql.

> MongoDB

MongoDB is a pile of crap and a shame to all the NoSQL datastores. Please don't bring MongoDB in a talk about databases. If that is your only experience of NoSQL, it is perfectly normal that you think of NoSQL as a mistake to never use ;)


No need to convince me of Mongo's problems. I struggle to think of a use case for it where I wouldn't prefer to use the JSON store functionality of Postgres instead.

My best experience with NoSQL (in terms of it working ok and not otherwise fucking up the system) has been with HBase on a cluster running map/reduce jobs. That was well suited for the purpose (mostly). I'm sure glad nobody had any illusions about using that as a production datastore for a user facing web app though.


It is really a great problem to have one day.


SQL databases are very powerful and capable tools, but they also hide some complexities behind a declarative language. That makes easy to get into a bad place, performance-wise, if the DB mutates quickly. You don't need millions of accesses to the DB to have unbearable slow queries if schemas are not carefully designed, which is difficult in early stages of development.

I usually consider the "I had to migrate from DB X to DB Y" a success story. You discovered what was required and later in the process, improved your system. Painful, but probably unavoidable.

Databases are not easy to handle and operate. Is easy to underestimate them. We used to have people just devoted to them (can you remember DBAs?). It is worth to spend time learning them and knowing their goods and bads. And sometimes making mistakes in unavoidable...


I'm a dba so yes I can remember DBAs. I don't understand the condescension from you comment. As a DBA, I have never understood the argument of a mutating database as a solid justification. It just sounds lazy/unprepared, at best. Sure, there may be a valid complaint for the unicorns out there, but they represent such a small percentage of situations that I hardly see how ther experiences necessarily translate to everyone else's situation. "It's hard" or "it's difficult" sound like blame-shifting to cover for a lack of actually understanding requirements.

I don't mean to single you out in this comment. The reality is that your comment just really reiterates so many of the same arguments that are regularly trotted out as justification.

NoSQL already came and went. It was called ISAM. It has a use case, but it's not the silver bullet everyone wants it to be.

Oh yeah, as a DBA, I've loved hearing how NoSQL was going to do away with my job and that I am a 39 year old dinosaur who'd better catch up and get on board with DevOps. Guess who runs data CI where I am? Guess who manages our Mongo and Redis servers. Yep. The ancient DBAs.


How about some vitriol?

You know what the #1 feature of nosql is? NoDBA.

I can hear you screaming you'll still need a DBA. You don't get it. You won't need an Enterprise DBA.

Enterprise DBAs are almost invariably horrible blockades to data storage in any enterprise. All they do is lock away the database from developers, cram Oracle down their throats, do things as slowly as possible behind ticket walls. You lock the data away pining for your halcyon mainframe days of white coats and clean rooms, and force your slow ponderous overtenanted infrastructure.

Your profession, since you have proudly identified as a DBA, you are an Enterprise DBA, is a disgrace.


I'm sorry you've had such poor experiences with Enterprise DBAs. It sounds like you're drawing upon your personal experiences to make generalizations about a profession. We all do it, but professionalism requires us to step back from our frustrations and work together for the good of the team/organization/company/purpose.

As an Enterprise DBA, I know I have to be ready to be an SME on any number of topics at a moment's notice. I have to understand my data structures and logical modeling, I have to understand IO patterns and concerns (RAM and physical storage are the ones that crop up the most often for me), I have to understand security, I have to understand networking. I'm sure that you also have to consider many of these things as well so I'm not better than anyone, I'm just doing my job.

On top of that, I also have to make sure that my organization's best interests are considered. So if I'm slow to fulfill your request, or even if I say, "no, I can't do that" please don't take it personally. It's just that there are other things to consider than what you need right now and what you think is what you need to do what you need to do as fast as possible. Some times the "fast and easy" way to do things isn't the correct way to do them. Let's figure out a way to work together to do what needs to be done correctly the first time so that we're not plastered on an HN article that says, "why I moved from MySQL to Mongo to PostgreSQL."


How about some vitriol?

How about not? Bad actors can be found in all domains, DBAs included. Painting all with such sweeping brushes does little to suggest people should take your comment seriously or advance the conversation constructively. There are also plenty of DBAs who know the strengths and weaknesses of many different data models, engines, and platforms, and work with developers to make decisions to ensure efficient data storage and IO for a successful application.

Edit to add:

Your profession, since you have proudly identified as a DBA, you are an Enterprise DBA, is a disgrace.

If you identify yourself as a DBA you must be an enterprise DBA? There are plenty of DBAs who are happy to dig into backend or frontend code as well, but have a lot of experience working with database systems. Many frontend and backend devs have experience working with databases, and similarly describe themselves to focus on their strengths. Are they enterprise frontend and backend devs?


I'm afraid most NoSQL usage forces programmers to become, for all practical purposes, expert DBAs, or performance would be bad.

Using SQL requires competence too, but learning nonstandard exotic idiosyncrasies of a certain configuration of a certain bleeding edge system (the blood is yours) is much less useful than principled and somewhat standard techniques that can be readily adapted to any good RDBMS.

For example, there's a substantial difference between improving performance of a slow query by trial and error, rewriting application code to figure out what the current version of a bug-ridden interpreter/optimizer likes, and improving performance of a slow query by ENGINEERING, creating an index according to the reports of analysis tools, back-of-the-envelope estimates of size and performance and true understanding of different index types.


To add to this point as well; agile methodologies work best when everyone is a generalist and the business is onboard with being involved in the development process.

DBAs are specialists which makes them inherently siloed.


I find your perception of DBA work to be quite narrow. Too many comments on here are based purely on colloquial evidence and not enough actual communication and understanding.


I have several times done the "I had to migrate from DB X to DB X but designed and indexed properly" re-write and success story...

(I've also got many many more stories of my team saying "We'll need to use $NoSQL-de-jour here for performance and scalability!" where I've talked everyone into "Let's just go with MySQL/RDS/Postgres as a prototype, and see where the speed or scaling problems arise before we rewrite things." and never coming close to bumping into scale or performance problems...)


> Is easy to underestimate them.

VERY true.

> Databases are not easy to handle and operate.

I think instead are the easiest of all available options. With a SQL database you have at least available some kind of GUI to build the schema and do query and stuff.

And have see not-very-technical people building databases just fine.

Try the same with Cassandra "Hey, secretary, please build a Cassandra database for the contacts, please!".

Yep, not easy.

------

You only get into trouble when:

- You schema is super-bad. But is the same with NoSql

- You don't create that 1-2 indexes per-table (and some Guis/RDBMS auto-create them for the ID and surrogate keys, that is almost enough)

- You have a very large database. Or, you have a bad schema and is leaking as a very large resultset (or SQL query) that is overcomplicated everything.

- You forgot you have a powerfull RDBMS and insist to use it below the capabilities of Acces, yet, have proud in learning C++. Because C++ is powerful. But why dedicate a few hours learning the basic of Sql??

Most people have very easy problems to solve, and in contrast with the nightmare that is debug a NoSql store that badly try to re-create what a RDBMS have and you know some data is lost but good look finding it...

---

Where the situation get out of proportion is with the startup-mindset of "scalability". A problem that few have, even startups.


> You only get into trouble when:

You have more than 10 TB of data.


That's likely true, but honestly how many people really have TB of data?


About anyone who doesn't run a toy project. Data can goes really really quick when you have some users.


Choose the right tool for the right job. We are all building unique systems solving unique problems. SQL is good for some things and NoSQL is good for some things.

If you are taking the time and (someone else's) money to build something do it right and research how various data stores work up front. You should be at the point where it is obvious what the right decision is in your mind and if it isn't...get your learn on until it is.


And the thesis of this article is that SQL is overwhelmingly the more versatile and better choice.


Choice for what? The article talks about a very specific domain (web applications solving a certain class of problem). Many organizations use multiple data stores and handle different problems.

At Goldman Sachs there is almost every imaginable type of data store in use. All carefully vetted decisions that are largely the right choice much of the time. Almost all of it flows into some form of web application at some point in time in a giant interconnected system.

I am very happy we don't use a single class of data store. If we did we'd be fucked.

The "thesis" may apply very narrowly to a small web app of a certain size. But honestly if you are building something that cookie cutter are you really building something unique? What you are doing has probably been commoditized.


We had key-value stores before there were relational databases. RDBMSs are built upon them, in fact. Now a new generation discovers the reason that RDBMSs became popular: they offer a more general purpose abstraction that is more useful outside of certain specific cases.


> MySQL is being used from top-traffic websites like Facebook and Twitter, to high traffic websites like GitHub and Basecamp and almost every PHP installation out there.

Is it just me, or does this seem rather misleading? My understanding is that they can use MySQL because they've carefully designed their architectures to direct most hits to caches or Memcache/Redis, which minimizes the load on the database.


How is that misleading? Using RDBMS as your primary data store is the decision that people reading this are going to benefit from.

When it comes time to cache and optimize for things like search, that's when you use the additional technologies.

It's a huge mistake to think that companies that are using Redis/ElasticSearch/MapReduce are doing so to cover up for the wrong decision they made of going with an RDBMS. The RDBMS is their source of truth, everything else is just an indexing mechanism.


The RDBMS doesn't have to be the source of truth. Other databases can do that just fine.


It at least sounds misleading. Saying "Tool X is used at company Y which serves $tons_of_traffic" is not the same as saying "tool X is directly involved in serving $tons_of_traffic for company Y and replacing it with tool X' would not allow to do the same but would result in much worse results". But on a cursory reading, the former reads just like the latter. For almost all really high-traffic sites, and many not-facebook-scale-but-also-high-traffic sites, there are multiple layers of content generation and caching. Nobody serves Facebook-scale content directly from MySQL (or any other backend DB) to the wide web, that'd be insane. So when talking about backend performance, one always has to remember the layer involved and look for what happens in those layers.


I'm currently at my second startup doing a migration from Mongo to Postgres.


https://github.com/stripe/mosql was very useful in my conversion at work. (That said, we only had one small service on Mongo at the time, but it trivialized the conversion effort entirely nonetheless.)


I've been thinking about this a lot recently, and while I used to unambiguously agree with this, I'm slightly more uncertain now.

First, to be sure, in the year of 2017, it's definitely the safe path to go with Postgres or MySQL. Traditional relational databases aren't distributed out of the gate, which is both a good thing and a bad thing. They're mature, which means that lots of bugs have been found and fixed, and there's a Stack Overflow answer for every basically error possible. And yes, Postgres will remain the "best" choice probably for at least another decade.

Very broadly, traditional relational databases are chosen because of two advantages: features (rich data model, query language, ACID transactions, etc.), and maturity. NoSQL datastore generally sacrifice the former for the sake of scalability, and lack the latter simply because they're too new.

First, the features. I think most people are in agreement that stripping down your data model to fit into a basic but highly scalable key-value store ultimately hurts productivity. But it's definitely not impossible for a scalable distributed database to provide most or all of the features that a traditional relational database provides; that's exactly what Google did[1], and what some open source projects[2] are trying to emulate.

Sharding Postgres is safe, but it's also hard. It does messy things to your application code, and also hurts productivity. And at the end of the day, you still don't get ACID transactions and joins if you ever have to do anything that touches more than one shard. We really do need a better solution that combines the best of both worlds.

Sure, traditional relational databases will probably always be the most mature databases around. But there will come a time when those new databases are mature enough. And I think along the way, we can rethink some of the properties of traditional SQL that make it hard to work with sometimes.

1: https://static.googleusercontent.com/media/research.google.c...

2: http://db.cs.cmu.edu/papers/2016/pavlo-newsql-sigmodrec2016....


Agree mostly with the article; nitpicks:

>In the first case there is a language using words that humans use to talk to each other, while in the second one there is pure JavaScript, where you have to build your query using a JSON object and even convert the 24 hours to milliseconds on your own.

I have never personally implemented anything using either MySQL or Mongo. But even I can see that this is a bad example. In the first case, you have an SQL query raw. This obviously has to be passed to some sort of API to be used in code, so it's an incomplete example. In the second case, you're blaming what appears to be a weakness in Javascript's "Date" library on MongoDB. In Go this becomes time.Now().AddDate(0,0,1) which is relatively straightforward. nb. Go time lib becomes less pretty whenever there's an error involved and I sort of hate it honestly.

>First, scaling is not and most probably won’t be your problem. Unless you start having a few thousand of queries per second and terrabytes of data in your database, this won’t be a problem at all.

...true...

>And in case it does, you can go with a fully managed solution like AWS’ RDS.

...but the freshman delusion is "I have scaling problems", so the sophomore delusion is "I will never have scaling problems" :p

The job of a businessperson includes managing things. You can't outsource all of the management. It's how you add value after all. I've seen plenty of cases where people choose not to go with AWS because they can save money by doing more work and they made off with some dough.


> you're blaming what appears to be a weakness in Javascript's "Date" library on MongoDB

But this is the point: that SQL is a better query language than Javascript (which is what MongoDB uses).


The biggest problem with Mongodb is that it loses freaking data all the damn time. I never fully realized why it's so hated before taking my current job, which uses it as the primary database, distributed among a bunch of servers. There are all sorts of concurrency issues all the time. And no, write-concern:majority doesn't help.


I use both. I even use text indexing, redis, queues and other things.

Sometimes RDB is better. For other things it's easier and more scalable with document store.

They need to know about each other using your application logic (join by GUID for instance), but you're doing that anyway even if you religiously staying in one camp or the other.


Warning, rant incoming: I still want to do uncivilized things with the <insert swear word here> who took away WebSQL and left us with the <insert more swear words> NoSQL <...> that is IndexedDB.

No JOINs, no natural way to express a ORDER BY x LIMIT y,z... no way to e.g. mirror your back-end database in order to implement stuff like full-featured offline/online apps... and don't get me started on the fact that IndexedDB is async which makes even emulation of the features I stated an open invitation into callback hell. Or, basically, ANYTHING more complex than a SELECT x FROM y.

How could this ever end up being a freakin' standard and WebSQL being dropped?!

(If anyone here decides to implement a WebSQL equivalent backed by IndexedDB, here are my monies, take them! Please!)


I'm not sure WebSQL was but maybe it just died off as nobody was using it?


WebSQL was a browser-embedded SQL database engine.

It was killed by the standards committee because a key part of the standard was essentially "do exactly what SQLite version x.yy does", because the only implementation was by embedding that version of SQLite in the browser and putting a thin API on top of it.


And Mozilla said they would never ever support such a thing


WebSQL was basically SQLite accessible from JS - which also, unfortunately, was the reason for it being dropped... apparently the standards committees only want standards with multiple competing implementations.

Which is total bananas, given the fact that there's only one (free, ultra portable, tiny) embeddable SQL database - SQLite. Which every browser except IE/Edge ships, anyways, for internal data storage.


regarding apparently the standards committees only want standards with multiple competing implementations., yes, that is a quite basic principle, but it is not directly what killed WebSQL. (Non-)passing of a standard depends a lot on what is going inside the specific Working Group: If everybody had agreed that SQL was the right answer/important to have, WebSQL probably could have gotten through pretty close to what it was, but they didn't.

Mozilla published some blog posts about why they didn't like WebSQL and backed IndexedDB instead: https://hacks.mozilla.org/2010/06/beyond-html5-database-apis...

If someone else had put in the work to write a "proper specification" of an SQL dialect, shown that it works with (modified) SQLite and was reasonable to implement in a different implementation, WebSQL might have passed nevertheless.

It happens that W3C Working Groups release standards without 2 implementations, so this isn't some strict rule that can't be broken, but then everybody involved agrees.

In the other direction, I think it would have been good to demonstrate by implementation that you can build good SQL-like APIs on top of IndexedDB during the standards process, to make sure it fulfills that need.


I'm a big fan of sqlite and would have loved to see WebSQL succeed, but I think HTML adopting the de-facto behaviour of sqlite is "bananas" - it would be the equivalent of "formatLikeWord95" of the OOXML spec. They should have defined a proper ISO SQL subset/profile instead.

What remains of WebSQL is the commonly implemented SQL client driver API for Node.js which isn't so good a fit IMHO.


> They should have defined a proper ISO SQL subset/profile instead.

That wouldn't have helped. Google Chrome (and, by extension, Safari and Opera) and Firefox would just use the already present SQLite library in the engine... and I think MS would also have chosen SQLite instead of risking to develop a full SQL database (and opening a rich source of bugs and security issues, in contrast with using battle-tested SQLite). Again, just one implementation of the standard.


Microsoft has several embedded DB engines of their own (not Free, but they already own them), so I don't know that they would have used SQLite.


Or, MS would have blessed us with their Access/JET DB.


Well if MS could build a safe interface to Access (also managing translation from ISO SQL to Access SQL!), I'd be perfectly fine with it! I've seen people building really nice stuff based on it, a pity it's only available on "higher end" Office variants and even there only on Windows :(


...and somehow the browser adopted WebGL which is even more blatantly OpenGL.


OpenGL has multiple, independent implementations (including backends) though. Being based on another API standard is not the problem.


Tools that we use influence our decision making, shape our experience and ultimately architectures of our systems. Different constraints lead to very different designs.

And using databases designed in the nineties just cannot get you to fast resilient services or to any relevant experience necessary to build them at some point in the future. But it will waste just as much time if not more. CRDTs and eventual consistency are actually easier to understand and reason about, than transactions. Don't get fooled by familiarity. Don't choose MySQL, choose Cassandra, Riak, DynamoDB, don't choose POSIX storage for the same reasons, choose object storages, Swift, S3, etc.


> CRDTs and eventual consistency are actually easier to understand and reason about, than transactions.

I suspect we’re going to have to agree to disagree on this one. Without spending too much time on this, it’s pretty easy to cite the Google F1 paper (http://static.googleusercontent.com/media/research.google.co...):

"The system must provide ACID transactions, and must always present applications with consistent and correct data. Designing applications to cope with concurrency anomalies in their data is very error-prone, time-consuming, and ultimately not worth the performance gains."


Since the title is a bit misleading (talks about database choice: SQL vs NoSQL), I comment about the big data/data analytics point of view in case anyone comes to this post for the same reason.

The only problem I have with SQL (I use Postgres, in the form of Redshift almost daily: it's excellent when it works as expected) is the not-great readability of the code and the abysmal testing/debugging/checking of SQL code. Have had to write lots of boilerplate (Python, mostly) for that, and it feels like we are reinventing a wheel someone else is using, somewhere.


> is the not-great readability of the code

Isn't that more of a poor-SQL-programmer problem, than a SQL-Database-technology problem?

> and the abysmal testing/debugging/checking of SQL code

I haven't used Postgres, but MS Sql Server has a ton of testing, debugging tools including SQL Trace, Explain Plans, and even a Stored Procedure debugger wherein you can set breakpoints and step through your code.

Of course, one cannot expect the same level of rich debugging features that one gets with traditional IDE for programming languages, since SQL programming is a somewhat different beast in that sense.


You can only go so far with SQL as far as readability goes, and is ensuring everyone in the team writes readable queries. Problem is, inheriting a big lump of queries is totally different from inheriting a big lump of code: you can write tests (even if a little black-boxed) and refactor the lump until it makes sense. It is far more convoluted in SQL.

I don't mean testing/debugging as in stepping through code that has been just written, but code that has been automatically submitted to the database during the night and for some reason has failed. The goal is to never be in that situation: with SQL you need to be way more careful than with other languages.


> You can only go so far with SQL as far as readability goes

This. Exactly this is the problem. SQL as a language fails to provide you with simple means of composition.

And this is where libraries like Python's SQLAlchemy come into play. Note that it is not necessary to use SQLAlchemy as an ORM. Even if you use it "just" as a query builder, it will simplify a lot.

Composition is really where the syntax of SQL fails utterly. This is quite surprising, because the foundation of SQL, relational algebra, excels at composability: You have many operations which all operate on the same type of data (namely, sets of tuples, or in math/cs speak: "relations"). As such, you can combine them in any way you want. For example:

    * "Filter" takes a relation and returns a relation, just with fewer rows.
    * "Projection" takes a relations and returns a relation, just with fewer columns
    * "Cross Join" takes two relations and returns one relation.
    * "Inner Join" is really just a combination of "Cross Join" and "Filter"
    * ... and so on.
SQL's attempt to be more "human readable" than those nested operations fails to preserve that. Yes, we have sub selects, but can't just stick together the operations we want.

For example, assume you have an existing query (perhaps more complex than this one):

    select a, b from t where a > 0
and want to apply a simple filter "b > 0" on top of it:

    (select a, b from t where a > 0) where b > 0
This kind of composition is not allowed. You either have to either give up composition (no reuse of the existing query) and combine the filters by hand:

    select a, b from t where a > 0 and b > 0
Or, you have to write a larger sub select:

    select * from (select a, b from t where a > 0) as temp where b > 0
In relational algebra, the first statement would have been:

    Projection[a,b](Filter[a>0](t))
Or, using ">" for nested function calls (function composition):

    t > Filter[a>0] > Projection[a,b]
For the task at hand, you just compose it with your additional filter and be done with it, reusing 100% of your existing query:

    t > Filter[a>0] > Projection[a,b] > Filter[b>0]
However, SQL forces you to either rewrite this query:

    t > Filter[a>0 and b>0] > Projections[a,b]
Or to apply a sub select, which means adding nonsense operations such as naming a purely temporary intermediate result and projecting to all columns:

    t > Filter[a>0] > Projection[a,b] > Name[temp] > Filter[b>0] > Projection[*]
In my view, the task of SQL query builders (such as the one in SQLAlchemy) is to restore the ability for programmers to form their query in relational algebra, without having to worry about the quirks added by the SQL language.


SQL Server's tools are awesome! Too bad the licensing fees are not.


All that tools are included for free.

You are only limited if you try to use a higher-end feature (like analytics) but the overall tooling is all there.


Not true in software, but IMO is in this case: you get what you pay for.


>>> I haven't used Postgres, but MS Sql Server has a ton of testing, debugging tools including SQL Trace, Explain Plans, and even a Stored Procedure debugger wherein you can set breakpoints and step through your code.

Let's make it simple. Imagine a database with no tools whatsoever.

Welcome to postgre!


Do you use a ORM?

What kind of boilerplate?

---

RDBMS have a less-than-ideal programming API, and also is de-coupled from the front-end (good) but also mean it requiere to re-implement some stuff in the client (because is a agnostic api). ORMs make this task harder than it must.

--

For testing/debugging not help that a Database is a huge mutable store, and thigh testing was not a thing before.

But more of this is a trouble with the disconnect between the database guys and the app guys.

For example, Firebird (http://firebirdsql.org/) allow to embed the database similar to sqlite and also use it in a separate server process. It mean is easy to create a test environment easily without docker-alike craziness.

Another disconnect is that is impossible to avoid sql. SQL is a poor developer API and is not decoupled so we can use something else.

The advantage of some NoSql is that them not recreate SQL (at first) so them can do some nice stuff as post a JSON and call some basic commands.


We use it for analytics/data related analysis, so we use straight (or templated) SQL queries. The queries are a bit too thick for using an ORM. The boilerplate is a set of helpers to make sure what the SQL's intention is, is actually fulfilled without problems (lots of sanity checks to make sure the queries are correct before running, column stability, etc). Basically to ensure anything depending on the output still can trust it after the changes.


Note that the well-designed ORMs have a good query builder that can be used on its own. For example, I've heard many people use Python's SQLAlchemy just to have a more sane SQL syntax.

(See also my other comment: https://news.ycombinator.com/item?id=13929111 )


So, basically a kind of static type system?

Anyway, reporting is a very hard problem. I don't know if exist a truly good way to solve it in a general case.

---

Is part of this not solve with views? Any example to showcase the trouble?


Don't have examples (that I can show). It's trying to be a static check (more than type, but it's close enough) system, indeed. With views, we can isolate the used data, as long as we don't modify too much the underlying structure.

Also, the biggest chunks of SQL we handle are essentially "computational": take this, join with that, mangle a lot what we have now, repeat several times and get an output we can use. Can't avoid the big part in the middle


Even though Redshift is based on PostgreSQL 8.0, hasn't it diverged enough that it's not really PostgreSQL anymore?


It has lots of proprietary extensions and breaks lots of things from Postgres, but it's not far enough (at least for me) to consider it a completely different beast. You can still trust the Postgres documentation when the AWS documentation is lacking


Yes. Nit: PostgreSQL 8 doesn't really describe a particular version. 8.0, 8.1, 8.2, 8.3, and 8.4 are all major PostgreSQL versions. Edited to add: Redshift was based on 8.0.2.


Edited.


Yes so much yes. Be sure you need NoSQL when you replace SQL. SQL is so powerful, mature and reliable. If your problem can be solved with plain old SQL, it's probably the best choice.


NoSQL developers can command higher pay because there are fewer of them. Same with any other shiny new tech, really.


Yeah I've also heard good COBOL developers can command huge pay on legacy banking systems. That would be a good reason to learn COBOL, but I don't think it makes COBOL a good technology choice.

There are times where NoSQL is a better choice. When massive scale is involved I understand that NoSQL can come into it's own. I'm sure there are other cases as well. But when it gets used for CRUD, it tends to be a bad call.


COBOL is good at what it was intended for. Just like NoSQL.


But there are so many comments in here saying that "SQL is hard" by comparison... should that not command higher pay?


They can command more pay because NoSQL is required by bigger companies who've got more clients and more money to pay DBA.

Also, NoSQL is not a database, it's 10 different databases which have little in common with each other. A rare speciality pays more.


And RDBMS developers can command even higher pay converting platforms.


Glad to see this post.

One other thing I'd add - I read a very good mantra for software development, that I've adapted a bit in my head.

Your data will outlast your application

Your application will outlast your framework

Your framework will outlast your developers.

For many of the apps I write, the database is the core of the application. One question I like to ask myself is: if my users were well versed in UNIX, SQL, and a programming language, how well could they get along without the web application?

The answer to this question reveals how well I've designed my app.

For instance, is the backend database useful on its own? Or is it merely a persistence tier for scattered bits of information that only become useful once assembled by code? If it's the latter, then no wonder people start questioning the value of SQL. It isn't functioning as it should, there's no relational structure to the data. Persistence for scattered bits of objects that need to be reassembled through code means that the relational database is almost just getting used to serialize objects back and forth to disk. If you're doing that, then of course you start to see SQL as just a bunch of stuff that complicates things! I mean, at that point, why not just write and retrieve objects? There's almost no difference. Thing is, this sometimes reveals a mismatch between SQL and the problem at hand, but just as often, it reveals a developer who doesn't really think in terms of databases and relations, and who just sort of crams things into a SQL database because it is the default persistence tier.

Is the code that does essential analysis and logic on the data easy to run and understand on its own? If my users were capable of logging onto the machine, running this code as command line scripts, and reading the output, how useful would it be to them? How clear and concise is the code?

If both tiers are clear and easy to understand for someone grounded in SQL, Unix, and a programming language, I have control over my code base. If not, there better be a really good reason. Eventually, the developer will leave, the framework will be out of date, and someone new will need to work on the code base. If you can't make sense of the database, in particular, you are in serious trouble.


Please someone have this reply pinned somewhere always visible. Great mantra/wisdom.


This is just another article telling me about what I should do with my web application without asking me anything whatsoever about my web application. Just because you made a bad choice for your situation doesn't necessarily mean everyone else did too.


It's not impossible that your app is one of the special snowflake web apps that are not fundamentally just another CRUD app underneath (at least on the backend) - but even if yours is not, you've got to admit 99.9% or more of all web applications are _not_ special snowflakes and generalised advice for CRUD apps is often precisely on-point for most web developers.


This is just another comment assuming that the article is talking directly to me. Just because it makes some generalisations that are useful to many, doesn't necessarily mean it is tailored to my exact situation.


Come on... The article is clearly written for a general web dev audience (if not, please let me know the more specific audience you've inferred) and starts with this completely unqualified first sentence:

> Let’s get straight to the point; choose an SQL database for your web application. I think I can’t make my self clearer.

Yeah there is better qualification later, but there are whole paragraphs of railing against nosql databases for all kinds purposes that they actually CAN make sense in, for certain applications.


We don't need to know anything about your web application to know that you need a RDBMS.

If you happen to have an exception to this rule, you are a VERY VERY rare case indeed.


With things like Citus and Memsql readily available, it really makes the scaling part of relational databases so much easier.


What are your thoughts on InfluxDB or Cassandra to model and use for apps? They are two "NoSQL" solutions that seem to strike a balance of simplicity and query capability. The only NoSQL solution I'm experienced in is RIAK which is purely key/value and I find SOLR (which aggregates for RIAK) not that friendly.


There are alternative syntaxes to SQL that I find easier to maintain. Like sequence of piped commands like bash or Splunk/Sumologic/Azure analytics language and similar.

Order in syntax for SELECT is weird. I usually look FROM then WHERE then the fields being selected. Feels like reading backwards.


Although I agree with most of the author's points I think he underestimates the value of ease of use by programmers. SQL syntax can definitely be daunting to some and difficult to understand complex queries. There is room for improvement in that regard.


<devil's advocate>Perhaps those programmers should take a little time out of their "learn the latest Javascript framework" budget and spend a little time learning to use SQL properly? Just because a powerful and industry standard tool is "daunting" that's possibly the _worst_ reason to choose something less well suited to the problem...


But then I'll be behind in JS, by like 3 frameworks.


Take your favorite complex query. Write code to accomplish the same thing with raw data in whatever format you consider convenient.

You will be AMAZED at how compact SQL actually is by comparison!


Indeed.

An often overlooked point is that good SQL optimizers generate different versions of queries depending on the current state of the database and the actual parameters to the query. That is, the optimal query plan for finding the sales to "school teachers" is probably different than for "astronauts" and the optimal query plan for reporting on "todays orders" is probably different at the start of the day than late in the afternoon. SQL query planners try to do the right thing with this sort of data variability and generally (but not always!) do a good job. It would take an immense amount of work and insight into the data and all the potential use cases to approach this with custom coded queries.

Before there were relational databases these sorts of access path, query strategy decisions had to be made by developers and were reflected in the schema and physical design of the database. This was inflexible and very labor intensive and it made many kinds of applications and or even routine changes uneconomical.


Oh trust me, I don't overlook that point. I've spent too much time cajoling optimizers into the query plan that I know will work when they think they know better. And I know from painful experience how much more work it is to cajole developers in the same way.

That said, there is little point discussing this subject people who clearly have no experience about how to do what a little bit of SQL does. If you're convinced that manipulating data is easy, you're not going to appreciate that manipulating data efficiently is even more tricky.


The comparison of the code in the article is a little unfair, though. You have to prepare a statement, bind some values to it, and then convert engine-specific data in a loop into your language's representation. Of course DB-modules help with that, but the boilerplate complexity remains.


The boilerplate is manageable. And with the right DB-modules, it practically disappears.

That said, do try to rewrite a moderately complex query without SQL. Pick one with multiple joins, several where conditions, and a group by. You will generally wind up with a fragile program that is many times as big as the equivalent SQL, and is a lot harder to read and maintain.


It's not a matter of SQL/something else, but what you're storing in your DB, how you model data.

If you're storing state rather than information (append-only), you'll have a hard time in the future regardless of query language.


Agreed, but for god's sake don't choose MySQL.


I think SQL is a great basis point for awesome databases, but I don't think it's necessarily the only option we should have. Options are good!


It's good to have options; it's dangerous to believe all options are equivalent. For 99.9999% of applications, you'll have a much better time starting with a SQL database. If you've somehow end up with an application in the 0.0001% you can probably safely convert to something else if needed.


Isn't comparing NoSQL and SQL servers a little simplistic? Surely there is more than one type of NoSQL server...


And more than one kind of SQL server.

Redshift and SQLite are pretty different.


[deleted]


NoSql solves a different problem set. Modern SQL servers definitely have a lot limitations at huge scale. You can only write-scale so far before needing to resort to the variety of fragile SQL scaling solutions. (Read scaling has more options).

That's not to say they lose usefulness at scale or should be avoided because they'll be a "bottleneck" at some point (if a SQL database is ever a bottleneck, you're super lucky!) but there are definitely cases where a non-SQL database is the right choice.

In some cases, it's not a scaling but just a use-case difference (Elasticsearch)


Hipster Driven Development is really expensive.


Not much when combined with StackOverflow-Based Programming.


[flagged]


You've been posting a lot of unsubstantive comments to HN, and (worse) also uncivil ones. We ban accounts that do that, so please stop doing that.

If you have a substantive point to make, please make it thoughtfully; if you don't, please don't comment until you do.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: