SQL is 43 years old – Here’s why we still use it today (sqlizer.io)
588 points by d4nt on May 2, 2017 | 382 comments

Been programming for 30+ years, and 99% of my projects use SQL databases. I've tried and dropped NoSQL many times. I still wake up in a cold sweat thinking about the earliest version of Firebase where I had a project that tried to join three tables together to get some meaningful data.

I still remember the response of a Firebase team member to my forum question about it - "These days, storage and computing power is cheap - just duplicate the child table as nodes in the main table and do it all that way for every parent/child relationship you have. Don't worry that you have to duplicate the same set of data for every parent that is related to the same child...That's how NoSQL works..." <shudder>

Even though I use ORMs in my projects these days, every time I have to test a complex query, I write it in raw SQL first and check it before trying to make the ORM produce the same query.

Granted, NoSQL has its place and its advantages, but for me, when it comes to "money code", I will stick to SQL.

I feel like this is something I have to keep repeating, "the data always outlives the app code." If app code is required to make sense of the data, you are going to have problems. Of course NoSQL has a place, but only in very certain cases should it be your primary datastore. SQL (and the RDBMSs that leverage it) are built to store, protect, and provide structure to the data.

The second annoyance I have is the push for schemaless. Schemaless does not exist. There is always a schema, except in a schemaless data store, the schema has been moved to app code and then my earlier comment applies.

One thing I do agree with is that most ORMs are not very good. The best ones I have used are very thin layers over the SQL, like jOOQ. I think the lack of understanding of SQL and bad ORM experiences (Hibernate WTF) is what led people to think SQL/RDBMSs were the problem when in fact they were not.

>The second annoyance I have is the push for schemaless. Schemaless does not exist. There is always a schema, except in a schemaless data store, the schema has been moved to app code and then my earlier comment applies.

This is a great point. It's like types for programming languages. They always exist; it's just a matter of whether you have the compiler managing it or whether you need to keep it in your head when you're hacking.

> I think the lack of understanding of SQL and bad ORM experiences (Hibernate WTF) is what led people to think SQL/RDMBs were the problem when in fact they were not.

Again, like types in programming languages, I think the ability to iterate quickly without having to predict how the data will be used in the future leads to 'schemaless' approaches, which can be kicked out the door more quickly (maybe) than ponderous schema-laden tables.

> It's like types for programming languages. They always exist; it's just a matter of whether you have the compiler managing it or whether you need to keep it in your head when you're hacking.

While true to a degree, it's worth noting that some languages also reduce the number of types considerably at the same time, and maybe even provide some other conveniences when dealing with types (e.g. automatic interpretation as numeric when used in a numeric operation).

You still have to keep types in your head, but the reduction of types generally makes this non-problematic. Although, as your code grows and you define your own types (either formally through an included system or as ad-hoc data structures), you start needing to keep more types in your head, and this starts eating away at any benefit you had from not having to define them in the first place.

You and the parent are not using quite the same definition of "types", I think.

I think we are, if you look at it from a high level. C has some well known types most people are familiar with. Perl has a few types too: scalar, array and hash (and a few more lesser known or used ones). This makes Perl very easy to use initially, because you can just treat strings as strings and numbers as numbers, and Perl will mostly do what you want based on the operators involved. You create complex data structures by just creating them (ad-hoc) or by actually defining a class, which gives you a new type.

As you create any non-trivial program, you'll likely create some types, whether in C or Perl (as classes). What was initially a benefit in Perl, since you didn't have many types to keep track of, becomes a liability, as there are more types but still no formal way to ensure they are being used in the correct locations and as expected arguments without runtime checking (and manual checking at that, unless you use a good module to handle a lot of that for you).

That some languages keep types fairly low-level (C, with types relating fairly closely to machine representation) and some high-level (Perl, with each type being a higher-level container) is really irrelevant. In the end, they are just labels that codify behavior (size, acceptable methods of use).

In SQL and NoSQL systems, you have the same trade-offs. Do I define everything up-front so I know (presumably) when there's a problem, or do I create the structure ad-hoc as needed and enforce the structure through the function of the application that creates the data?

Yes, we're talking about the same thing. But in Perl, do we make classes or hashes with keys as the field names? :)

Such is the eternal question in Perl. :)

Also, sometimes we make classes out of scalars, arrays and globs as well. It was interesting the first time I looked into the source of IO::Handle (and IO).

Hmm, I bet there's a module to create classes that uses pack and unpack to store byte level representations in a scalar as the blessed type of a class.

Heh, then I think I read you as saying something more interesting than you'd intended :-P

What did you think I was talking about?

I guess there could be a few senses of types for a programming practitioner:

1. integral types (int, float, etc.)

2. product types (structs, objects)

3. sum types (enums)

4. containers (vec, dict, list, etc.)

Typeless types would be dicts of lists of dicts, i.e. shitty product types. E.g. `x = dict(y=1, w=dict(z=[1, 2, dict(u='good data')]))`. Functions that want to get 'good data' out of this structure basically end up treating it as a type, if that makes sense.

If I understand kbenson's point, it's that sometimes this is better than the Tower of Babel of type hierarchies you may have confronted in Java or C++, where you end up needing to ponder how to convert your IWiggleWoggleSupplierProvider into an IWoggleWiggleProviderSupplier even though they do almost the same thing and hold the same data, but somehow have different inheritance hierarchies.

There's a sense of types of "anything (computable?) you can statically say about your terms". Representation is just a part of this, and often not a very important one. "How can I use this?" is usually much more interesting.

From that perspective, "treating the shape of your dicts as its type" certainly makes sense - it determines how you can use it correctly. It's incidental that you're not encoding this information in the "type system" your language provides (and that may or may not be a good idea, depending - there are always tradeoffs and we've left an awful lot of context unspecified).

From the same perspective, if you can use your IWWSProvider and your IWWSSupplier identically, we can say that they're arguably not different types and only being treated so because someone has given the Java or C++ "type system" an incorrect specification.

> I think the lack of understanding of SQL and bad ORM experiences (Hibernate WTF)

I don't understand this comment. I've never had a bad experience with Hibernate. If anything, Hibernate has saved me tons of time over the years.

I, on the other hand, have spent an inordinate amount of time, including waking up in the middle of the night for incident response, due to Hibernate misuse, which is so easy that I'd not touch Hibernate with a ten foot pole.

There is one case where Hibernate can save you a significant amount of time vs plain SQL, and that's arbitrary update statements. But the moment there are two joins in there, predicting Hibernate performance is not necessarily easy, and differentiating a query that will run well from one that will kill your system at scale is far harder than in plain SQL. In practice, the only way to use Hibernate responsibly is to examine the likely queries that it will generate.

My least favorite example was a situation where manual SQL gave me a very fast query that took a couple of milliseconds to run on a large database, but Hibernate's answer was not even just one query: it was an accidentally quadratic monstrosity, when all the original programmer wanted was one row. Hibernate just wasn't smart enough to figure out which row it wanted in the DB, so it grabbed half the database. After I saw the beautiful queries that Hibernate had created, I had to give the finger to the existing data layers and do my own thing, which I am told remains to this day.

The way Hibernate is built only makes sense if it can hide the database behind it. In reality, it doesn't, so I not only have to make sure everyone knows SQL well, but they also have to learn HQL and Hibernate's most common pitfalls. All to save a little bit of time in the rare cases where I might have very dynamic queries that Hibernate somehow won't screw up.

Java EE was a pretty unfortunate thing, but under duress, I'd much rather use EJB3 and straight SQL over the Spring + Hibernate stack that many of us had to deal with between 2000 and 2010.

I'd be curious about the size of the systems you have used with Hibernate. Do you know off the top of your head what will cause Hibernate to issue a flush? How about the order of operations in a batch? I've read the entire Hibernate book multiple times while trying to squeeze out speed, and finally decided enough.

The problem with Hibernate is I have to learn all of these Hibernate quirks to maybe get decent speed out of a system, when I could have just written SQL and been done. This isn't even including debugging when Hibernate decides to issue N+1s.
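For anyone who hasn't hit it, the N+1 pattern looks roughly like this, with hypothetical customers/orders tables:

    -- what the ORM ends up issuing: one query for the parents...
    SELECT id FROM customers;
    -- ...then one more query per customer returned (N extra round trips)
    SELECT * FROM orders WHERE customer_id = 1;
    SELECT * FROM orders WHERE customer_id = 2;
    -- and so on, versus the single join you'd have written by hand:
    SELECT c.id, o.*
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id;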

> I'd be curious about the size of the systems you have used with Hibernate.

I wouldn't even know what metric to use to quantify that. Number of tables? On the order of hundreds, probably low hundreds... definitely not thousands.

It may just be that I've been lucky in that the data models I've worked with have been things that map well to using Hibernate. I will say that I haven't worked with a lot of data models where you need more than 3 or 4 joins in any given query. I know some people write far more complex queries than that, and maybe those are the kinds of scenarios that Hibernate doesn't handle well?

> The problem with Hibernate is I have to learn all of these Hibernate quirks to maybe get decent speed out of a system, when I could have just written SQL and been done

Do you hand-roll your own caching mechanism when just writing plain SQL?

> Do you hand-roll your own caching mechanism when just writing plain SQL?

Using something like jOOQ has meant I rarely need any caching, unlike Hibernate where caching almost seems required.

Interesting. I've never used jOOQ, but I'll give it a look. Hibernate has worked well for me, but I'm not dogmatic about it. Always interested to try new tools.

I've been writing SQL for almost 20 years, so the pain for me is the grunt mapping of data to and from. jOOQ does much of the grunt work of mapping, and lets me write typed SQL that it smartly translates to queries. More advanced features like various types of fetch make my job even easier.


I had to use Hibernate in a project. We were loading an entity from the database, someone invoked a setter and the set value was updated in the database automatically. Totally blew my mind.

Never touched it again since then (not doing much Java anyway).

This mirrors my experience as well. I (currently) don't know what will cause Hibernate to flush, but I certainly do know that I had to know this exactly, and had to do some really painful debugging at points. The worst one was early prototyping of an app that used Hibernate: we had so many problems with performance and Hibernate(isms) that the whole thing seemed destined to fail, even though our use case was exactly what ORM/Hibernate are for... yet we still had to control when changes flushed, because otherwise things would get wonky really quickly.

Yeah, profiling helps a little. It taught me that queries flush (and you can easily go n^2 from that if you're not careful).

More recently, we've had n+1 issues (putting a printing breakpoint in EntityType.loadByUniqueKey helped me track down where). My hack to get around that was to do a bulk Criteria query up front to pre-load the session cache with the entities that were being fetched one at a time.

There were a lot of quirks that I've had to learn over the last decade, and I'm still discovering more.

Speed? I found Hibernate is great for CRUD.

Like where you have dozens and dozens of domain objects that you want to persist, and you don't want to spend all your time writing DAO code with SQL in it that is all the same but different.

Enterprise applications are rife with those.

For speed, write some hand-written SQL to handle those few cases (like batch imports maybe).

Also I found a handy rule:

Only create an association between entities when the entity makes no sense without the other.

For all other cases split them up and perhaps create aggregates by hand.

That jibes with my experience. We don't have tons of complicated associations, and do a lot of fairly straightforward CRUD operations. Maybe that's just the sweet spot for using Hibernate.

I have used NHibernate in a project at a company. But beyond simple queries we hit a performance issue. For a complex query, it took about 10 seconds to load the data into the frontend. So we had to rewrite the "code logic" (the NHibernate part) into a "custom SQL query" (NHibernate allows you to write custom queries) and then write custom mappers to map the results back to our application. The reason is that NHibernate does its processing object-wise and in memory.

This should be emphasised more. Schemaless is good for productivity during prototyping (like dynamically typed languages), but long term a well-defined schema will help maintenance (like statically typed languages).

Sometimes it really is schemaless, although usually only schemaless for certain fields. Imagine a case where users can create custom fields or store arbitrary data in some fields.

>Schemaless does not exist. There is always a schema, except in a schemaless data store, the schema has been moved to app code

I get what you're saying, but keep in mind that "schemaless" isn't a philosophical statement about denying ontology[1]. Instead, it's an industry label for avoiding database schema operations. E.g. a rapidly-evolving app doesn't know ahead of time all the rigid columns it needs, so it uses a "schemaless" db strategy to avoid "ALTER TABLE ADD COLUMN X" or to avoid an export/import to new v2, v3, etc. tables. From the relative point of view of the db engine, it doesn't see a "schema" when the so-called "columns" are embedded as strings inside of generic fields.

[1] https://en.wikipedia.org/wiki/Ontology

Sure, but the cost of schema maintenance is grossly overstated by those who simply aren't familiar with the tooling to do so. Furthermore, the cost of dealing with the corrupt data down the line will absorb all the gains you had in a matter of months, unless of course you've reinvented strong schema control at the application layer.

A lot of inexperienced devs look at NoSQL as being simpler than SQL because the dev documentation is shorter and has a more stylish theme. They then assume that all the touted benefits of SQL are YAGNI and they'll cross that bridge when they come to it.

The mistake they make is assuming that the benefits of SQL databases are something that only is needed for a mature application. To the contrary, SQL gives an incredible hedge against the uncertainty of change in the early days of an app when churn is high. A normalized data structure allows much more query flexibility; you should only denormalize for performance once you realize what the workload and logic will look like for the long haul.

Of course if you have a known workload that you want to scale up huge then that's a legitimate use case for NoSQL, but the purpose is not flexibility.

"Sure, but the cost of schema maintenance is grossly overstated by those who simply aren't familiar with the tooling to do so."

Scale ruins everything. Most SQL databases are pretty terrible at schema maintenance at scale (i.e. hundreds of MySQL shards). This (among other ways that scale ruins the RDBMS) is what led to NoSQL in the first place at the largest of web properties.

Sure, joe average startup probably doesn't have the same problems. But that's marketing.... The relational model and SQL have been beaten up for decades by those who prefer to remain closer to the metal.

I highly doubt your "close to the metal" solutions outperform state-of-the-art relational databases like EmptyHeaded. The truth is that people prefer to reinvent their own inferior solutions rather than looking for best-in-class stuff.

Our current manager wants to transition over to NoSQL, as he doesn't see any value in having to constantly update JPA Repositories and dependencies when a column is added to or removed from a database table.

I recommended that JPA Repositories shouldn't be stored in the lowest-level Maven jar packages. E.g., we have very deeply nested Maven pom dependencies (about 6 layers deep), in which the JPA repositories are defined at the lowest of the low levels. Granted, I gave references to research papers from Google saying that JPA Repositories and Hibernate ORM should live at the project layer (due to their nature of changing rapidly).

Turns out, we're still going with the NoSQL solution. Although it seems like everyone has convinced themselves that NoSQL makes for a faster development process than a simple table with constraints.

In the projects I've worked on that require frequent table modification, I came to the opinion that the correct approach would be to store the data in a table or set of tables, split by type, where the columns/values are stored as rows linked back to the original table as the parent. The big downside I could see is performance, but maybe I'm falsely assuming that this would still be faster than NoSQL? (I haven't actually used NoSQL, however.) But for my projects, the performance hit should/would not be a problem.

Since my experience is probably just limited, I'm curious if there is an easy to explain example where this is incorrect? (Or I guess maybe most people disagree with me and see my above opinion as always/most of the time incorrect)

I will agree with your approach. With any large system or government database schema, adding another table and then adding a relation to the necessary reports is 100x better than changing the parent table.

I've too often seen jobs quoted for clients where simply adding a column to a table blew out into a 4-5 month development task. Turns out the table was being used by an old C/C++ program with Oracle C bindings. That had to be updated, and that in turn caused another application to stop working.

This is correct if I understand you. Basically, when people add something new to a schema, they're generally tacitly also admitting it's optional. Optional fields should not exist on the same table as non-optional fields, because then you have to allow everything to be null or a specially designated value.

Leaf node dependencies should be very stable. Otherwise you spend all your time rebuilding or at least rerunning integration tests. I work at a company that has achieved the opposite of that with not one but two libraries, and deployment is a miserable experience.

Change management is not something that should be feared if given careful consideration!

I don't always know the "perfect" schema design up front but there are great tools in the modern SQL tool belt for managing abstractions and evolving schemas.

In PostgreSQL:

The "crm" schema can store all of my customer relationship management related tables. I may not know all of the data I need for them right now and will need to evolve them over time. However if I start with a solid design for my public schema I can use views in the public schema:

    CREATE VIEW public.customers AS
        SELECT c.id, c.contact_name, c.email
        FROM crm.customers AS c;
This is a simple view. Your applications can SELECT/INSERT/UPDATE against the view in the public schema with no changes to your code. However, if you want to evolve the schema of customers, you can do so in the `crm` schema, adding tables, columns, etc.

    CREATE VIEW public.customers AS
        SELECT c.id, c.contact_name, c.email
        FROM crm.customers AS c
        WHERE c.deleted = false;
You can then evolve your public schema to include more information as needed by updating the view. The caveat is that if your view starts materializing data from more than one table you'll have to intercept INSERT/UPDATE/DELETE and replace them with your own functions to put the data in the right place. This works quite well for implementing nice features like administrative flags without having to update your application code or spread that logic out: it's all kept in the database.
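As a sketch of that interception, assuming the crm.customers table above carries the deleted flag, an INSTEAD OF trigger can turn a DELETE against the view into a soft delete:

    CREATE FUNCTION public.customers_soft_delete() RETURNS trigger AS $$
    BEGIN
        -- mark the row instead of removing it; the view's WHERE clause hides it
        UPDATE crm.customers SET deleted = true WHERE id = OLD.id;
        RETURN OLD;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER customers_soft_delete
        INSTEAD OF DELETE ON public.customers
        FOR EACH ROW EXECUTE PROCEDURE public.customers_soft_delete();

The application keeps issuing plain DELETEs and never needs to know the rows are retained.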

You still get the power of indexes, a strong query language, and structured data. PostgreSQL is great as an application server: you can comment all of your objects, and its deep introspection facilities let you be not only flexible with your data models but also consistent.

Update: Added an example of changing the public schema to add a filter to the query for an administrative flag

Where can I learn more about these kinds of modern SQL tools you mention? (Besides what you've just mentioned, of course :D)

So now every time I add a column I have to update the schema as well. That's making change management harder, not easier.

"PostgreSQL as an Application Server" is a book I'd be likely to buy.

Not sure if I'm more impressed if you'd already started work on it, or sketched up the page/TOC + mailchimp in response. :) But I signed up!

And without the adult supervision that "ALTER TABLE" theoretically implies, you end up with 15 columns named "t1" and the like, no idea what any of it means, and source code to match (assuming it's not gone and you're trying to reverse engineer the schema-less data).

There is a place for unorganized, disorganized, unengineered, unplanned data structures, but it's not in the DB.

I don't really get this. ALTER TABLE ADD COLUMN X is a cheap operation when your database is small, at least on decent hardware. It's when you get up to scale that it starts being a burden. During the rapidly-evolving stage is when it should be the easiest to alter database schemas, because you shouldn't have accumulated all that much data yet.

I work at a company with multi-gigabyte tables and ALTER TABLE is still easy to pull off, we just use tools the community has built for larger tables. It's one of the most unfounded fears I've heard in the industry.

And... if you accumulate a lot of data, you may mostly access it in an aggregated way, such as with GROUP BY, where ALTER TABLE ADD COLUMN X is free.

> Schemaless does not exist. There is always a schema...

I also think so, but the issue has several important aspects which justify the use of the separate term "schemaless". One of them is that schema elements (say, column names) can be stored as normal data. For example, instead of having normal columns like Name, Age, Department, we could introduce a column storing these strings in 3 rows. As a result, the DBMS is simply unaware of the schema - the schema exists only in our head (and in the app). As a consequence, the DBMS cannot help us much in managing the data, and instead our app becomes responsible for these tasks, hence the problems you mentioned. But the major problem is that currently there is no technology that allows us to say that this table column actually stores column names which can be used in queries and have to be treated as normal columns.
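To make the contrast concrete, here's a minimal sketch of that pattern (often called entity-attribute-value; table names are hypothetical):

    -- conventional schema: the DBMS knows the columns
    CREATE TABLE person (
        id         INT PRIMARY KEY,
        name       TEXT,
        age        INT,
        department TEXT
    );

    -- "schemaless" variant: the column names become row data,
    -- and the DBMS can no longer type-check or constrain them
    CREATE TABLE person_attr (
        person_id  INT,
        attr_name  TEXT,  -- 'name', 'age', 'department' stored as strings
        attr_value TEXT
    );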

> currently there is no technology that allow us to say that this table column stores actually column names that can be used in queries and have to be treated as normal columns.

That's because the column name is part of the RDBMS' structure, as is the type, for the most part. If you want to read a column and use the row values as the column names, that can be done; it's just a query. The DBMS is going to have to scan the whole table (or lean on the index) for that column anyway. I can't think of why I'd want the RDBMS to do that for me and imply the column names, rather than implementing it in code. I'm managing the column names as data at that point. That's if you still think it's a good idea.

The other piece that jumps out at me is that when you do something like that, you're really building a variably wide table, but not letting the DBMS manage it as a wide table. Instead, every query is going to require either you or the DBMS to transpose rows to columns at some point in the query - if it's a simple select, only on the return, which won't be that expensive, but some joins are going to require it to happen earlier. I'm not smart enough to know whether it'll be computationally expensive to the point of being a showstopper, but I do know it'll require a lot of memory if you want it done fast, or you have to do it slow in scratch/spool.

Use jsonb inside a Database and use constraints to ensure all requested json attributes (store name,etc..) are present at INSERT/UPDATE time?
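Something like this, say (hypothetical table and key names; `?` is PostgreSQL's jsonb key-existence operator):

    CREATE TABLE store (
        id   SERIAL PRIMARY KEY,
        data JSONB NOT NULL,
        -- reject rows missing the required attributes at INSERT/UPDATE time
        CONSTRAINT store_required_keys
            CHECK (data ? 'name' AND data ? 'address')
    );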

That doesn't seem like rocket science to me... duh...

PS: But of course don't store IDs in an object value or you will lose a major benefit of RDBMS relational constraints.

Well, now you're just as 'schemaless' as any NoSQL, so that's neither an advantage nor a disadvantage over it.

There are other differences, both advantages and disadvantages to doing this with Postgres (which is the only db with a type called 'jsonb' I think?), vs some kind of non-SQL document store. I agree that as Postgres json(b) gets better and better, the disadvantages decrease. And one of the main advantages to me is simply not having to have another system to run and understand, since postgres is probably already there.

But it's always trade-offs and choices, which you make better with more experience and a better understanding of your domain. It's definitely not a "duh, it's always obvious" thing. Unless your domain is so simple that it hardly matters.

> PS: But of course don't store ID in an object value or you will lost major benefit of RDMS relational constraints.

Or anything that you might want to use as a foreign key down the road. It requires you to know that now, or modify later, so we might as well declare a schema and manage it unless we're totally certain we won't need any additional keys. But if we're certain of that, we could just use a schema.

But do schema-less databases implement foreign key constraints at all????

I'm currently implementing a mixed solution where clearly defined properties (and thus IDs) have columns and constraints, but I also include a schema-less "all you can eat" jsonb field so that experimental properties can be stored.

Upon each release I will migrate validated schema-less properties to static columns or even sub-tables. Similarly, if a column is removed in the next iteration, I might transfer its attributes back into the jsonb so I can revert later if needed. This does include a migration step, but that seems like good practice anyway. Of course scaling... but I'm really not limited by this now.

The client will interact with the model through views or functions, so it doesn't have to care whether columns are real or not. But anyway, as soon as PK/FK are needed, I will probably straight-out assume that jsonb will be insufficient and take 5 minutes to think of a relational way to implement it.

I don't see a problem with what you're doing as a means of development, since you're basically stuffing a bunch of data that you may or may not need in the future into a single field with the intent of breaking it out later if you do. I will ask whether this is really saving you any time or simplifying your development at all, given that you essentially impose structure iteratively as the need arises. You're not really schema-less, you're schema-lite, so you're maintaining some structure somewhere. Also, depending on the contents of the jsonb field, you may be defining its structure somewhere, even if it's just naming the elements.

I don't have a lot of experience with it, but Postgres supports indexing on jsonb fields. If you do end up in a situation where your keys are in the right place and you have some data in a jsonb, you may still be able to get at it efficiently.
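For instance (assuming a hypothetical person table with a jsonb column named data):

    -- a GIN index speeds up containment queries against the jsonb field
    CREATE INDEX person_data_gin ON person USING GIN (data);

    -- this containment query can then be served from the index
    SELECT * FROM person WHERE data @> '{"department": "sales"}';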

Part of this is how we approach development as well. For one, I'm now some management type that makes powerpoints, so any code I write is on my own projects. But when I did more of this, my approach was always to start with an RDBMS unless I had a pretty specific use case to not use one. Having done it long enough, I am being honest that I find maintaining a database schema, migrating changes, and all of that to be pretty trivial. That's my approach, but I don't see an issue with yours if you think it's making you more efficient.

I should ask, because I'm assuming it, but make sure whatever you're stuffing into the jsonb field is true row-level data. Try to avoid putting data in there that "belongs" on another row - at the extreme it could provide some unintended exposure. Hypothetically, a row containing Customer A's demographics in the "person" table has a jsonb field - it may be tempting to put some basic demos about other customers B, C and D connected to that customer A in some way, but that's data you'd not want to expose as belonging to Customer A. Yet, if it comes back on the same row, it's as trivial to leak it as it is to retrieve it.

EDIT: To actually answer your question, I don't see how schema-less can have foreign constraints. Foreign constraints are defined in the schema. What I was saying is that if you're going to need joins later, have enough schema to support that.

I like your schema-lite terminology, and I think it describes my approach pretty well.

I've only started to explore all the possibilities, but the main uses so far:

-Store experimental properties before next schema revision

-Particularly good alternative to composite types in early-stage dev if you want to make live modifications without replaying the whole deployment script. (Composite types usually have huge cascading constraints.)

-Storing "metadata row" for later use

PS: Actually it may even be an alternative to composite types in production, because I can totally picture cases where you need row-level "objects" that vary depending on the value of another column. Yes, you can normalize, but it might be more maintainable to keep everything in one table and write an appropriate check constraint. And it will still be easily human-readable, as opposed to a literal representation of a custom type or whatever (think of the PostGIS geometry type, which is very versatile but totally opaque to humans).

> But do schema-less implement foreign constraints at all????

In MySQL 5.7 you can have tables with JSON columns (like PostgreSQL's JSONB), and also "Generated columns" which are expressions over other columns (e.g. extract a value from the JSON). You can then use those generated columns as the source or destination of foreign key constraints.
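Roughly like this in MySQL 5.7 (hypothetical names; the generated column must be STORED, not VIRTUAL, to participate in a foreign key):

    CREATE TABLE orders (
        id  INT PRIMARY KEY,
        doc JSON,
        -- generated column extracted from the JSON document
        customer_id INT GENERATED ALWAYS AS
            (JSON_UNQUOTE(JSON_EXTRACT(doc, '$.customer_id'))) STORED,
        FOREIGN KEY (customer_id) REFERENCES customers (id)
    );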

Neat, I didn't know that. Is that more or less just having FK constraints on a view, though, or am I missing something conceptually here?

Right, "Computed Columns" are "view-like" columns added to normal tables. So you define your table with a few normal columns, and a few generated columns (which are each expressions over the normal columns, or over other generated columns).

On those "generated columns" you can create indexes, reference other tables (foreign keys) or be referenced by other tables (foreign keys).

MySQL has a binary JSON data type (like most databases) and also expressions to extract data from those JSON fields.

MySQL has a lot of disadvantages vs PostgreSQL, but having foreign keys on elements within JSON (via this mechanism) is a thing MySQL can do that PostgreSQL can't :) (Unless I'm mistaken?)

I think you are mistaken; even if the implementation varies, you can get the same behavior. (However, I won't ever try it in production, because a jsonb-based FK is a pretty scary concept, IMHO.)

However, in one solution I've implemented on PostgreSQL, I expand the jsonb into regular fields in a view and make that view editable via a trigger. This way users can update jsonb_field without being aware of it, because they edit it through the view as if the fields were normal columns.
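A minimal sketch of that updatable-view trick, here using sqlite3 as a convenient runnable stand-in for PostgreSQL (SQLite also supports INSTEAD OF triggers on views; table and key names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE person(id INTEGER PRIMARY KEY, jsonb_field TEXT);

    -- Expose a JSON key as an ordinary column.
    CREATE VIEW person_v AS
        SELECT id, json_extract(jsonb_field, '$.nickname') AS nickname
        FROM person;

    -- Route updates on the view back into the JSON blob.
    CREATE TRIGGER person_v_upd INSTEAD OF UPDATE OF nickname ON person_v
    BEGIN
        UPDATE person
           SET jsonb_field = json_set(jsonb_field, '$.nickname', NEW.nickname)
         WHERE id = NEW.id;
    END;
""")
con.execute("""INSERT INTO person VALUES (1, '{"nickname": "old"}')""")
con.execute("UPDATE person_v SET nickname = 'new' WHERE id = 1")  # edits the JSON
stored = con.execute("SELECT jsonb_field FROM person WHERE id = 1").fetchone()[0]
print(stored)
```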

> I also think so but the issue has several important aspects which justify the use of the separate term "schemaless". One of them is that schema elements (say, column names) can be stored as normal data.

It is fundamental to the relational model that metadata like that is stored in and queryable from relations just like normal data.

I saw Mike Stonebraker while he was on his VoltDB "tour" a few years back. He makes the same arguments. In his talk he walked through the history of database technology and discussed a similar "war" between CODASYL and SQL. He framed that debate in terms of the split between separating the concerns of your data's structure (schema) from application logic.

Fairly or not, he suggested the current fight between noSQL/Schemaless vs SQL/RDBMS was being fought in ignorance of all that went on in the 70s.

Yeah, I think the good ORMs make the simple case _really_ simple (like User.getById(1234)) while letting you drop down to raw SQL easily when you need it. Any ORM that doesn't let you do that is not worth using, IMO.

Thank you for recommending jOOQ. I find many ORMs seem unnecessarily complicated compared to raw SQL (with a few exceptions), nice to find another one that offers readable code. Is the performance generally decent too?

The performance with jOOQ is great. I find it on par with raw SQL. It also has quite a few advanced ways to map the results that can take more complex queries and map them into usable data easily. The main jOOQ programmer really knows databases and has a very informative blog.

My only complaint (and it's a minor one that would be hard to get around) is the need to run a code generator for jOOQ after every schema change.

Just a noob question on jOOQ: how do you construct the error message you send back to the user on SQL exceptions? For example, in the case of a duplicate item, you send back "User already exists". How are you handling that? I know jOOQ just throws a DataAccessException; the question is how you convert that exception into a meaningful message to send to, say, the UI.

That's a really interesting question, and the best answer is: You have to write this manually. The SQL/JDBC/jOOQ layer can only generate system exceptions, like a constraint violation exception in this case. But there are many different types of constraints, and many different things that a constraint violation can mean. Only you and your service layer can possibly know how to translate such a violation to a meaningful (localised?) user error message.

Having said so, you could hook in an ExecuteListener into jOOQ to handle errors in a single location: https://www.jooq.org/javadoc/latest/org/jooq/ExecuteListener...

An example that uses the Spring Exception translator: https://github.com/jOOQ/jOOQ/blob/master/jOOQ-examples/jOOQ-...
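The pattern is language-agnostic; a minimal sketch in Python with sqlite3 (rather than Java/jOOQ, and with made-up table names) of the service layer translating a constraint violation into a user-facing message:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users(username TEXT PRIMARY KEY)")

def create_user(username):
    try:
        with con:  # commit on success, roll back on error
            con.execute("INSERT INTO users(username) VALUES (?)", (username,))
        return "ok"
    except sqlite3.IntegrityError:
        # The driver only says "constraint violated"; only the service
        # layer knows what this particular violation means to the user.
        return "User already exists"

first = create_user("alice")
second = create_user("alice")
print(first, second)  # ok User already exists
```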

Agree so much.

The first thing I do for almost any new system is design the data model. Once that satisfies all the requirements, building an application and UI on top of that is usually pretty straightforward.

Exactly what I do. I think in terms of data. UIs seem to change weekly, but the data lives forever.

I find a simple analogy for schemaless in the OO world as well - an object without a class definition (for example a dynamic/anonymous object) and/or an interface has unknown properties and methods and is thus of highly limited use.

Just like a table without a schema... because after all, it's still likely objects all the way down.

There is no such thing as working without a "schema", because software simply can't function without data and corresponding metadata to describe it.

> There is always a schema

It has a name. Schema-on-write (static schema) vs. schema-on-read (dynamic schema).

I ended up using NoSQL for ephemeral data and SQL for persistent data in most of my applications.

I've largely abandoned ORMs that generate SQL queries for me. A mapping layer that takes a SQL result and stuffs it into an object (DTO) is useful, but anything beyond that is more trouble than it's worth.

For complex SQL, it's better to write it by hand. You get to tune it and ensure the indexing is correct (either the query usage, or create/modify your indexes), and you get to see exactly what it's doing.

For even simple conditional joins, there's something nice about knowing exactly what's going on, and knowing it's not going to be doing something that results in a nested join loop (eg: the type of slow query you don't notice until you're using it in production and performance suddenly drops).

I'm a big fan of Dapper for this, so a lot of my data layer code looks like:

    public IEnumerable<Order> LoadOrders(int customerId)
    {
        using (var db = GetConnection())
        {
            return db.Query<Order>("SELECT * FROM orders WHERE customerId = @customerId", new { customerId });
        }
    }
The one exception I make, where I'll let the ORM generate the code, is INSERT/UPDATE queries. In most cases the ORM doesn't have much it can screw up there, and since it's essentially just mapping code ("Name = @Name, Address1 = @Address1", etc.) it's more likely I'll make a typo or copy/paste error than anything else.
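That "it's just mapping code" observation is easy to demonstrate; a toy dict-to-INSERT helper in Python/sqlite3 (names are invented, and identifiers are assumed trusted since only values are parameterized):

```python
import sqlite3

def insert(con, table, row):
    # Builds "INSERT INTO t(a, b) VALUES (?, ?)" from a dict -- the mechanical
    # mapping an ORM genuinely does well, with no typo/copy-paste risk.
    cols = ", ".join(row)
    marks = ", ".join("?" for _ in row)
    con.execute(f"INSERT INTO {table}({cols}) VALUES ({marks})", list(row.values()))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers(name TEXT, address1 TEXT)")
insert(con, "customers", {"name": "Ada", "address1": "1 Main St"})
saved = con.execute("SELECT name, address1 FROM customers").fetchone()
print(saved)  # ('Ada', '1 Main St')
```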

I've spent too many hours debugging crappy ORM queries, and I find it generally takes twice as long (with quadruple the frustration) to get the ORM to generate the complex SQL you want it to vs just writing the SQL.

I fully agree. For simple things (very simple), ORMs can be useful; for anything else, they get in the way.

Firstly, have to write out the SQL, then convert it to the ORM format, only to find the ORM doesn't support something you are using, so you rewrite the query in a more long winded way, then convert it to the ORM.

> Firstly, have to write out the SQL, then convert it to the ORM format

Ha, you know I had actually forgotten about that, but I have done that many times.

You model the data, write some queries to essentially test out the structure/data/concepts, then instead of pasting the queries into your code, you spend another bunch of time trying to tell your ORM how to come up with the query you pretty much already have.

Even if you're using a very well-tested, robust ORM that flawlessly handles your use case and you're an absolute expert at using it, it's still more work than just pasting a query you already have.

> Firstly, have to write out the SQL, then convert it to the ORM format, only to find the ORM doesn't support something you are using, so you rewrite the query in a more long winded way, then convert it to the ORM.

Or you could avoid the conversion step entirely. If the ORM is useful 90% of the time and in the way 10% of the time then use different tools for the 10%.

This is funny to read, I've recently started giving up on ORM's, but I felt really guilty about it. Like I have to really justify myself. Like you, I use it in a hybrid manner, simple queries I still use it... but for complex selects, I have yet to meet an ORM that can dish it out as good as I can.

This is where I really think scripted languages can shine... I find that using SQL in JS you really don't need an ORM, except maybe to shape your response for the purpose of an API interface... even then it's still better than the boilerplate that ORMs bring into the mix more often than not.

That looks very similar to JDBI for Java: http://jdbi.org/

At my workplace, we've started using jOOQ for this sort of thing. https://www.jooq.org/ (albeit we're a Java shop, not a C# shop)

It still sort of generates SQL for you, but really you end up writing the SQL yourself except in a type-safe way.

All I want is a modeling layer, and callbacks, like thinky.io provides for RethinkDB. It's very rare when any query I'm writing can just be generated by one of these ORM packages. It just becomes more confusing, and less powerful.

Even for data modification statements, if you can use TRIGGERs, CHECK and other constraints, and/or stored procedures / functions to implement business logic, then you'll be better off doing that than using an ORM.

Typically there are two arguments for NoSQL: "no SQL!" (i.e., "the SQL language is so ugly, difficult, and painful") and "no ACID" (i.e., eventually consistent, hopefully, maybe).

The first one is always demonstrated to be a terrible argument when someone creates a SQL alike for whatever NoSQL we're talking about.

The second one has been a more durable argument. But transactional semantics clearly are independent of query language! And technologies like Lustre and Spanner show that one can still have some measure of traditional transactional semantics in distributed systems (filesystems and databases, respectively). There are many applications where some degree of "eventually-consistent" is a great tool for making them scale, but this is often a function of what can be done in the event that inconsistencies create problems (e.g., a store selling more items of some product than it has available, a case where the store can refund the customer or delay delivery).

I'm extremely skeptical of NoSQLs, as you can tell. I would say that NoSQLs have NoPLACE. (Certainly as to that first argument.)

I'm also extremely skeptical of ORMs. So far as I can tell ORMs only ever add a layer of headaches. They seem to aim for simplifying the DB experience for developers and users, but the moment you step out of the small world of queries made simpler by the ORM... you're in for a world of hurt.

I'd actually quite like a relational database with a language that isn't SQL. I find SQL's syntax pretty irregular and awkward (e.g. update having a different syntax from insert), find the tools available for abstracting and composing queries (views) underwhelming and hate that schemas aren't declarative.

The first complaint derives largely from the SQL language being intended to read like English text. Languages like this show up regularly, seemingly based on the misconception that syntax is what's hard about programming. Most of them fail, as they should. SQL didn't, because it was tied to the exceptionally useful relational database.

The second is best illustrated by showing an alternative approach as an example: Korma for Clojure (http://sqlkorma.com/). It's easy to compose queries from simpler queries, just as we compose functions from simpler functions. Views provide some of this.

The third, if it's solvable would make using relational databases much nicer. Instead of schemas being manually-created diffs full of imperative statements like create and alter, make them declarative and have software generate the diff. If transformations on the data itself are required between versions, require it to include a pure function describing said transformation. By default, require a function to transform it back as well unless it explicitly says what data is to be discarded.
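To make the declarative-schema idea concrete, here's a deliberately naive sketch (invented data shapes; a real tool would also have to handle renames, drops, type changes, and the data-transformation functions described above):

```python
def diff_schema(current, desired):
    # Each schema is declared as {table: {column: type}}. The software,
    # not the developer, derives the imperative statements.
    stmts = []
    for table, cols in desired.items():
        if table not in current:
            body = ", ".join(f"{c} {t}" for c, t in cols.items())
            stmts.append(f"CREATE TABLE {table}({body})")
            continue
        for col, typ in cols.items():
            if col not in current[table]:
                stmts.append(f"ALTER TABLE {table} ADD COLUMN {col} {typ}")
    return stmts

stmts = diff_schema(
    {"users": {"id": "INTEGER"}},
    {"users": {"id": "INTEGER", "email": "TEXT"}, "logins": {"id": "INTEGER"}},
)
print(stmts)
```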

I don't find most SQL too bad at all... where it gets weird is when you're dealing with more exotic data types and the database vendor's specific details in interacting with them. XML/JSON interaction in SQL databases tends to be particularly mind bending.

I do prefer the more fluent-like interfaces though... MongoDB and RethinkDB in node for example, or using the LINQ extension methods with Entity Framework (not LINQ syntax itself, which I find much more difficult). Though with LINQ, it's often interesting how twisted what gets queried from the data source can be vs. what you actually need... but that's a problem with most ORMs.

I do find that SQL queries with Node/JS adapters is pretty damned natural compared to any ORM for Java or C# imo...

You don't like languages that the average human being can make sense out of but you do like Clojure? Did I read that right?

Then, "make them declarative and have software generate the diff". So you want the benefits of relational data without the pain, is that it? Why not use Hibernate like the oh so wise ones have told us to use only to now realize trying to bend relational tools into what a language designer's ideas are is maybe, probably, not the best idea after all and perhaps the reason that SQL and relational databases have withstood the test of time and Hibernate is on the road to death is that SQL and relational design were extremely, unabashedly well thought out and designed in the first place.

You don't like languages that the average human being can make sense out of but you do like Clojure? Did I read that right?

You read that almost right. See, I don't think the average human who's fluent in English but not SQL will be able to properly understand anything but the most basic queries. Sure,

    SELECT * from users where username = 'remotehack';
is going to mostly make sense to the average person (but what's the * and ; for?), but then so will

    (select users (where {:username "remotehack"}))
The brackets won't make sense, of course, but we're talking about reading it, not writing, and they're not terribly important to that.

Now, give that same average person

    Select Distinct On (u.username) u.username, u.email, l.login_time
    From users u
    Join logins l On l.username = u.username
    Order By u.username, l.login_time Desc
and do you really think they'll have any idea what to expect that to return? This no longer reads like English.

And no, when I ask for declarative schemas, I'm not asking for ORM. I'm asking for a system where I say what the schema is instead of saying how to create it, just as I say what data I want rather than how to retrieve it. This is not incompatible with any of the fundamental concepts of relational databases.

> I'd actually quite like a relational database with a language that isn't SQL.

They exist, e.g., Dataphor, Rel, etc. But it's hard to displace SQL.

>I'd actually quite like a relational database with a language that isn't SQL

Perhaps you want a Prolog. :)

To be clear, though, you're talking more about the Relational Model than SQL per se.

There aren't very many query languages for relational DBs that come close to SQL's power.

I'm currently using Firebase at work (startup, mobile app, web dashboards), and I think it's got some great features.

It's not a replacement for SQL, but an entirely different product (and has no doubt come a long way since V1, especially with Firebase Functions out recently). You have to make tradeoffs, like data duplication, for the advantages. From the sound of it, the advantages probably aren't relevant to you, but I'm confident they exist for us.

- We only have to store a comparatively small amount of user data for the lifetime value of each user, so the actual database cost is marginal (i.e. data duplication is not expensive to us, as long as reads are fast). (Development) time is (very) expensive.

- Rapid development and iteration. I can make additions and make fields redundant with ease in response to customer feedback, my own development choices, and changing product needs. There's no API update needed, as reads are client side. I have data model classes which store the reference in the database, some constructors, and getters and setters (About the same as any other data model + API would have). The release goes live and data just starts being manipulated in a new way. Alterations can have simple defaults with || in JS and ?? in Swift (even if it's just an empty string).

- Minimal backend. Our application doesn't require much backend logic, and that which we do have is mostly event driven with Firebase Functions, upon certain database writes. Works out quite cheap too (for now - most usages don't trigger any functions).

- Caching, and offline first. We take the approach that it doesn't matter if there's a short delay in 95% of our data updating - either due to patchy mobile signal or the data taking time to sync down. Firebase has made on device caching and real time updates to that data a breeze. With GeoFire, we've got instant map search (keys are cached by Firebase and contain sufficient information to filter, etc).

- Declarative security rules are very powerful, and can even reference values in the database itself for security control - effectively. We store a permissions tree in the database for our more complex security logic.

As a bonus, additional features like Authentication, Analytics, Push Notifications, etc are all convenient to have bundled up.

As you can see, most of this isn't about the database tech, but the development speed for us. The perceived costs and inefficiencies may exist in the database itself, but there are massive advantages for the application development as a whole.

I should probably clarify that I really liked Firebase, for all the reasons you mentioned above. It was really the schema restrictions, and (at the time) the lack of relational functionality that made me scrap it for a mobile app project that I was working on at the time.

This was very early days too, and I believe Firebase has improved considerably since then, and indeed I did a little "fun" project in it recently [0] (directly related to HN actually) and enjoyed the process.

I may still revisit it sometime, and see if it will suit another project.

[0] - https://hackernoon.com/tophn-a-fun-side-project-built-with-v...

Most projects I write eventually turn into an RDS and Redis. Storing non-relational data can be super handy, and putting it somewhere where you know it can disappear at any point keeps you disciplined. haha

What does this have to do with the article? SQL - the query language - is completely separate from any database that implements it.

SQL has the great side effect of creating structures that future employees can understand. It's a set of tables with relationships. Given these, you can quickly inspect the structures independent of the programming language du jour. With a few commands, a new employee can understand business logic, HR, billing, reporting, and the other major backbone systems in a matter of hours. On the other hand, I have been at companies that jumped on the NoSQL / ORM / wrap-it-until-you-can't-wrap-it / hide-the-SQL (Rails) bandwagon. When new employees show up... well... it's a bunch of semi- or unstructured stuff spread across thousands of lines of app-specific logic.

"Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious." - Fred Brooks, The Mythical Man Month.

Note that "tables" in this statement had a different meaning than today. That one wasn't about relational databases, it was about tables as in-memory data structures.

In today's words, "flowcharts" means "code" and "tables" means "data structures".

I don't care if it is:

- tables in a relational database

- nested structs (records) and lists

- nested dictionaries (hash tables) and lists



- ... whatever

What I do care about is that I can see the data structures, and not just the code. Static typing often gets that job done fairly well. If you don't use that, please use at least type annotations. This is the most important part of your documentation, and the compiler ensures it remains correct over time. Most of the time, this (+ the function name) is the only documentation I really need.

Show me your well structured code and well named unit tests and I won't need your flow charts.

Show me your well defined normalized tables and I have all I need.

But for the love of $entity please don't show me tons of business logic in stored procedures. I actually won't work for a company that expects me to build or maintain a product based on stored procs.

Show me your well-defined normalized tables, and I will know the rules for a valid state of the system. How, though, will I know the rules for a valid transition from one state to another? Have I seen some nasty stored procedures? Yes, a number of which I have written. But I have seen well thought out and useful ones also.

With logic being in code:

* it is easily unit tested

* all source code is one place that can be branched, versioned, merged, deployed, rolled back, diffed, code reviewed, approved with pull request etc.

Or you can simply implement it so that when your code runs, it first checks that the stored procedures exist and have the same definitions as the ones stored in the source code, and drops/recreates them if that's not the case.

In which case the stored procedures are in your "main" code repo and "deploying" them simply means running the software. Though your point about unit tests stands.

Why are your stored procedures not in the code repo?

Because deployment tooling for stored procedures sucks?

Stored procedures are easier to deploy than table schema. You just need to drop if exists and then create them.

Table schema is the hard part since there is overhead associated with creating an index on an existing table and with renaming a column.

SQL is code. Not sure how that escapes people.

Harder to unit test, harder to merge, harder to roll back...

That was true a decade ago. The tooling has caught up in most languages, so it's trivial to embed your SQL in application code and use it in an environmentally independent manner (e.g. to invoke in unit tests), provided you take the few minutes initially to set things up correctly instead of in the all-too-common "fuck it, we're a startup" manner people seem to love.

And you can unit test it too.


Guess which is faster for continuous testing: being able to mock out your Entity Framework context and use an in-memory representation of your table via Lists, or constantly running an ungodly stored proc against the database?

I can't say that our tests are slow. We can test our database continuously and independently from our various projects that utilize it. Our integration and smoke tests run continuously as you would expect.

You can't tell me that you can run a full suite of database-dependent tests as quickly as you can run a bunch of in-memory tests that mock the database queries.

How do you handle 5 - 10 devs working concurrently if they are all running tests?

You work from local databases. You keep a migration script that sets up the schema and another migration script that sets up the data. If you create a feature, you must add to each script when merging.

You can run unit tests by creating transactions, running the test, and then rolling back at the end.
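That transaction-wrapping pattern looks roughly like this (sketched with sqlite3; real test frameworks usually hide the BEGIN/ROLLBACK in a fixture):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.isolation_level = None  # manage transactions explicitly
con.execute("CREATE TABLE users(username TEXT PRIMARY KEY)")

# Each test runs inside a transaction that is rolled back at the end,
# so no test can pollute the shared schema or data.
con.execute("BEGIN")
con.execute("INSERT INTO users VALUES ('test_user')")
assert con.execute("SELECT COUNT(*) FROM users").fetchone()[0] == 1
con.execute("ROLLBACK")

left = con.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(left)  # 0 -- the test left nothing behind
```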

I do get your point, though. It's more convenient to be able to mock your database from your code.

That's much more complicated than just mocking your context in unit tests....


Go down to "Testing Query Scenerios". I've wrapped the four "mockset.As" lines into one extension method to make it a lot less verbose.

Yes it is, but it's much less complicated than trying to guess what your generated SQL code will look like.

No need to guess. Run all of your unit tests without mocking out your DbContext and add the following ....

context.Database.Log = log.Info

Where log.Info is just a method that accepts a string; then take a look at the log file generated from your unit tests.

Isolation I guess? I don't know what your experience is and why you think it cannot be anything but slow but... it's not?

pgtap lets us unit test our SQL functions, triggers, views, etc. It's integrated with our change management system. It's fast enough and works well.

I like to use table-driven transition constraints. The basic design is, you have a table of valid transitions (begin state, end state) and a function that you can write a constraint with.

It's the same as other code, but with better data locality.
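A sketch of that table-driven design (sqlite3 here; names and states are invented, and a real system might use a constraint function rather than a trigger):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE valid_transition(begin_state TEXT, end_state TEXT);
    INSERT INTO valid_transition VALUES ('draft','review'), ('review','published');

    CREATE TABLE article(id INTEGER PRIMARY KEY, state TEXT DEFAULT 'draft');

    -- Reject any UPDATE whose (old, new) pair isn't in the transition table.
    CREATE TRIGGER article_state_check BEFORE UPDATE OF state ON article
    WHEN NOT EXISTS (SELECT 1 FROM valid_transition
                     WHERE begin_state = OLD.state AND end_state = NEW.state)
    BEGIN
        SELECT RAISE(ABORT, 'invalid state transition');
    END;
""")
con.execute("INSERT INTO article DEFAULT VALUES")
con.execute("UPDATE article SET state = 'review' WHERE id = 1")      # allowed
msg = ""
try:
    con.execute("UPDATE article SET state = 'draft' WHERE id = 1")   # not allowed
except sqlite3.IntegrityError as e:
    msg = str(e)
state = con.execute("SELECT state FROM article WHERE id = 1").fetchone()[0]
print(state, "/", msg)
```

Adding a new valid transition is then a data change (an INSERT into valid_transition), not a code change.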

> Show me your well structured code and well named unit tests and I won't need your flow charts.

Please note that the quote uses old language. By "flowcharts" they actually meant what we now call "structured code".

More precisely, the flowcharts were the hand-written program which was then translated into machine code or some kind of assembly. In other words, back then the flowcharts were the highest level code.

I have seen at least one case where I could ascribe no meaning to several of the tables in a database. To be fair, seeing the code did not help either, until it became apparent that it was all an attempt to work around a fundamental error. The creator of this mess was blaming the users for it not working, and I was happy to refute that claim and exonerate the users.

Rails tried to hide SQL so much that one of the consequences was not supporting proper table level constraints (FK) till AR ~4.2 or so. Just take a look at the DB of a company using Rails in production: hundreds of keys to non-existent relations.

I regularly use raw sql in rails. ORMs are fine for convenience, but for me, they never make or break an app. They can save you some typing on very basic stuff, but if you find you're writing complex ORM code, I think it's a good idea to just use sql directly.

This fact alone, after several years of changing schemas and migrations, led to major rewrites and porting over to a much more hands-off approach. I can't overstate how much confusion AR has caused, especially with new employees!

Seeing this as well in proprietary software sold by one of the biggest erp software vendors... And it's still not fixed and likely won't be fixed anytime soon.

There is a perception amongst lower programmers that FKs are only a thing you enable in production, that they are an impediment to rapid iteration.

I mean, it's completely wrong, I know that. But try being the new guy telling a team of 10 people that.

There is also a perverse corruption of "if it ain't broke, don't fix it" that goes on. If you can spend 100 hours manually validating every relationship in your application code, that's 100 hours you can put into your estimate and you know you can complete. If you only have a cursory understanding of SQL, then "learn more about FKs and implement them across the DB" seems like a big, unknowable blob of time that is impossible to estimate. It doesn't matter that it might only be 5 hours of work, at least we can be certain about 100 hours and bill the client for it.

Finally, there are a small set of things that are fundamentally wrong about all modern RDBMS implementations. For example: it's nonsensical to have a foreign key that isn't indexed. You always want an index on foreign keys; there is never a scenario where you don't want them. But while primary keys are indexed by default, foreign keys are not. And those sorts of things give the anti-RDBMS crowd enough of a foothold to argue for continued ignorance.
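The "FKs aren't indexed by default" claim is easy to see in SQLite, for one (behavior varies by engine, as noted below; table names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    PRAGMA foreign_keys = ON;
    CREATE TABLE parent(id INTEGER PRIMARY KEY);
    CREATE TABLE child(id INTEGER PRIMARY KEY,
                       parent_id INTEGER REFERENCES parent(id));
""")
# The FK constraint exists, but no index was created for child.parent_id.
before = con.execute("PRAGMA index_list(child)").fetchall()
print(before)  # []

con.execute("CREATE INDEX idx_child_parent ON child(parent_id)")  # do it yourself
names = [r[1] for r in con.execute("PRAGMA index_list(child)")]
print(names)
```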

I have found out the hard way that foreign keys are indexed by default in some implementations, including MySQL's InnoDB, but not in Oracle.

My experience has been the opposite of yours -- I've heard of disabling foreign keys in production only for performance reasons, but not the other way around.

I like them turned on all the time as well. Valuable safety rail

Curious, this doesn't fit my mental model: it's nonsensical to have a foreign key that isn't indexed

I guess you normally want indexes on parent-child tree relationships (book -> chapter), but you don't want them all the time. What about when you have an 'article' with a 'status' and a 'category', and you only ever find articles by the combination of category and status? In that case you'd be maintaining three indexes (category, status, and status_category), but the only one you'd need would be status_category.

Indexes are a tool to allow for optimising lookups, while foreign keys are a tool to allow you to keep your data consistent.

I don't see how your example doesn't require an index, specifically because, as you said, indexes are a tool to allow for optimizing lookups.

I imagine that maybe you're suggesting you first query the DB for the ID of the combined status_category, and then query the article by that ID. That's not a good idea, for a couple of reasons. First, you're making two round trips to the DB when you could, with no additional effort (just effort in a different place) be doing one. Second, you've introduced a data race condition. If someone deletes that status_category after you've queried for it but before you've queried the articles, you aren't going to get the results you want.

It would be better to do a join across articles to status_category to status and category, then query based on the status and category values you want. Without an index on the FKs between status_category and status and category, a relatively small table can have a big impact on query performance.

Finally, while I know your example is arbitrary, it's a little hard to argue against a design that is probably wrong. I doubt the suggested schema for articles and categories is a good one. If I argue "you should never have to arbitrarily subtract '1' from a result just to get the results you want", it would not be a good counterargument to say, "yes, but sometimes you want to add 2 and 2 and get 5, so then you need to subtract 1". The problem isn't where you see it.

FKs aren't just a consistency tool. Consistency and referential integrity are features that results from having an FK, but the FK is a signal that data can be searched in a certain way.

> I don't see how your example doesn't require an index, specifically because, as you said, indexes are a tool to allow for optimizing lookups.

If you never lookup by a column alone, you don't need a single-column index on that column. An FK need not ever be a lookup target (the target column[s] it references are necessarily a lookup target, but not necessarily vice versa.)

An FK is probably usually going to want some kind of index, but it's not nonsensical to have a non-indexed FK.

Touché! An index signals the data should be searched in a certain way!

I don't think I explained my arbitrary scenario particularly well :-) I'm definitely not suggesting 2 queries.

If I have articles that could be category: math|science and source: website_a|website_b|... and I only ever query for source and category together then the other indexes aren't used.

It's a contrived example, but in my mind the existence of a foreign key doesn't imply an index is required.

I assumed they meant an index on multiple columns (status, category), not a separate table for status_category with an index on that.
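That reading can be checked against a planner; in sqlite3 (as a stand-in, with invented names), the single composite index serves the combined lookup, and its leftmost prefix also serves a status-only lookup:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE articles(id INTEGER PRIMARY KEY, status TEXT, category TEXT);
    CREATE INDEX idx_status_category ON articles(status, category);
""")

def plan(sql):
    # Last column of each EXPLAIN QUERY PLAN row is the readable detail.
    return " ".join(r[-1] for r in con.execute("EXPLAIN QUERY PLAN " + sql))

both = plan("SELECT * FROM articles WHERE status='live' AND category='math'")
prefix = plan("SELECT * FROM articles WHERE status='live'")
print(both)    # SEARCH ... USING INDEX idx_status_category (status=? AND category=?)
print(prefix)  # SEARCH ... USING INDEX idx_status_category (status=?)
```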

I think in general most rdbms require indexes to properly enforce foreign key constraints in a reasonable way.

If the parent entities are immutable, or are at least immortal and have primary key values that are immutable, then the DB engine will only need the index on the target primary/unique key/index to maintain the constraint.

Though in general this is not the case, as you say, it is the case often enough that enforcing "FK means an index" could be an annoyance.

The number and kind of bad ideas a bad development team can have is virtually limitless.

> There is a perception amongst lower programmers that FKs are only a thing you enable in production, that they are an impediment to rapid iteration.

Really? I've been working as a web programmer for about 5 years now. I use the term Web Programmer because I feel like the term Web Developer carries with it the connotation that you only work with Javascript.

My specialty is definitely with Python and more specifically with Django. Django's ORM is the first ORM I ever used. Maybe it's because I self taught, but I never had the inclination that FKs were an impediment to rapid iteration. In fact, quite the opposite. FKs are a fantastic way to enforce relationship constraints between tables/objects/models/whatever you want to call them.

Whenever I start on a new project, the first thing I do is start defining my data structure. In Django, this mostly involves using the ORM layer to define Model classes. I usually define the core set of models necessary for the application. As an example, if I were building a simple blog, I'd start by overriding Django's built-in user model so I have the flexibility to add columns or place constraints on existing columns. Then I would define the post model, which includes an `author` and a `category` FK column. Then I would define the `Category` model.

I generate a migration script and run it. Every time I need to make a change to the schema, I simply add or change whatever I need to and generate a new migration script alongside the old one. These migrations are dependent upon previous migrations and have version numbers. If I need to, I can roll back to a previous schema.

Django makes this all very simple and actually separates the migrations out by which app the model definitions reside in. This means that I can make changes to multiple model definitions and only apply the changes to the DB for one of those model classes. It's very flexible.

So, after my long-winded explanation above, I don't see how people could find FKs to be restrictive when there are so many tools like Django's ORM and migration system that makes altering your schema so simple.

Re: FKs being indexed. This isn't always the case. Our embedded/workgroup-level database engine, ElevateDB, always uses a system-generated index for all PK, FK, and UNIQUE constraints. They're needed for constraint enforcement, anyways, and offer good "default" performance improvements by ensuring that the query optimizer can flip (most) inner joins in order to improve nested-loop join performance.

> it's nonsensical to have a foreign key that isn't indexed. You always want an index on foreign keys

Incorrect, though cases where you don't want the index are rarer than those where you do, especially when thinking about simple examples.

> there is never a scenario where you don't want them.

If the parent entities never have their primary key values changed and are never deleted, then the database engine itself will never use the index to enforce the key. If you never need to join from the parent entities to the child entities then your queries are unlikely to make use of it either.

The extra index takes space (maybe a fair amount of space if the key is wide and/or you have a [bad] design with a wide clustering key), potentially space in your in-memory page pool, and processing time & IO during inserts, updates and deletes. If you are unlikely to need the index then why take that hit?

> But while primary keys are indexed by default, foreign keys are not.

Primary keys need to be indexed to avoid a full table scan on every insert (or update of a key value) to that table or any table that refers to the key via a foreign key constraint. Foreign keys will only cause a scan, with no index present, if a key value is changed (or a row deleted) in the parent table. As some entities shouldn't be deleted and primary key values should be immutable, it follows that for such tables the scan never happens and the FK index buys nothing for enforcement.
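A rough way to see the scan in question, using SQLite's EXPLAIN QUERY PLAN on the probe a parent-side delete implies (parent/child tables are made up; exact plan wording varies by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE parent(id INTEGER PRIMARY KEY);
    CREATE TABLE child(id INTEGER PRIMARY KEY,
                       parent_id INTEGER REFERENCES parent(id));
""")

# The lookup a parent-side DELETE implies: "does any child still point here?"
probe = "SELECT 1 FROM child WHERE parent_id = ?"

before = con.execute("EXPLAIN QUERY PLAN " + probe, (1,)).fetchall()
print(before[0][3])  # e.g. 'SCAN child' -- a full table scan per deleted parent

con.execute("CREATE INDEX idx_child_parent ON child(parent_id)")
after = con.execute("EXPLAIN QUERY PLAN " + probe, (1,)).fetchall()
print(after[0][3])   # e.g. 'SEARCH child USING COVERING INDEX idx_child_parent ...'
```

If the parent rows are never deleted and their keys never change, that probe never runs, which is exactly the case where the child-side index is dead weight.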

It would be possible to make the index optional by other means than requiring you to declare it if you want it. But SQL's modelling language tends towards declaring what you want, not what you don't want, so it fits the syntax better to have you add the index if you want it, rather than deleting it if you don't, or having something in the syntax like "ADD CONSTRAINT fk_key_name FOREIGN KEY (<field(s)>) REFERENCES <target>(<field(s)>) WITHOUT INDEX".

> And those sorts of things give the anti-RDBMS crowd enough of a foot-hold to argue for continued ignorance.

This isn't purely a relational problem. NoSQL data stores have indexes too, and not having an index on a referencing key that might be needed to check referential integrity (looking for orphan records) can be a problem there too. We shouldn't change the behaviour of relational stores because some people who use NoSQL don't understand good data modelling. Many people using NoSQL do understand good data modelling, of course, but some use NoSQL because they don't want to try to understand SQL rather than because it is the wrong tool for the job at hand, and those people probably don't understand NoSQL either (but get away with not doing so in the short term).

> If the parent entities never have their primary key values changed and are never deleted, then the database engine itself will never use the index to enforce the key.

Enforcing referential integrity is not the only thing we can do with foreign keys. Once you have an FK, you're going to want to query against it. To do that without data races requires a join. The join can be optimized better if the FK is indexed.

Come up with an example of an FK you never want to join on and then maybe we talk about not wanting an FK indexed. I've never seen anyone make a convincing argument that this is even as much as a 1% case. It should be so incredibly, almost inconceivably rare to not want an FK indexed that yes, I'm saying it should go against the traditional grain and just be the default. Traditions are not sacred.

> Once you have an FK, you're going to want to query against it.

Not necessarily. You are definitely (at a minimum, implicitly for referential integrity) going to want to query from the table using the FK to the table it references, but you may or may not want to query by the FK column.

> Come up with an example of an FK you never want to join on and then maybe we talk about not wanting an FK indexed.

Joining on an FK doesn't require the FK to be indexed; it requires the target to be indexed. You need an index on the FK if the FK is used in a simple equality or inequality filter criterion other than a join to its target, or if it's used in a join where the other criteria filter its target table rather than the table with the FK. But if you filter the table with the FK and join to its target, which is a fairly common case, you don't need an FK index.
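A quick sketch of that common case with SQLite's EXPLAIN QUERY PLAN (the author/post schema is invented; plan wording varies slightly by SQLite version):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE author(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post(id INTEGER PRIMARY KEY, title TEXT,
                      author_id INTEGER REFERENCES author(id));
""")

# Filter the FK-bearing table (post), join to its target (author):
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT post.title, author.name
    FROM post JOIN author ON author.id = post.author_id
    WHERE post.id = 42
""").fetchall()
for step in plan:
    print(step[3])
# Both steps are keyed SEARCHes on each table's primary key; the
# (nonexistent) index on post.author_id is never consulted.
```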

> Once you have an FK, you're going to want to query against it.

Not necessarily, and even where you do there are many cases where an index just on the foreign key is not the best choice because the queries are filtering on other properties too so a wider index is worth defining instead.

> Come up with an example of an FK you never want to join on

Recording tables with many fact dimensions, where you need to enforce a limited range of values in each dimension column but only ever use the data afterwards for aggregating after filtering/grouping by other properties. The storage needed for the unnecessary indexes structures could balloon significantly here. This is most commonly seen in warehousing situations, but the pattern is far from unheard of in OLTP workloads.

> it's nonsensical to have a foreign key that isn't indexed

> and "there is never a scenario where you don't want them"

It may be uncommon to prefer not to have an index on a foreign key, but absolute statements like those are simply wrong, because there are cases where you don't want (or at least don't need) them, and there are cases where you want something other than what would be generated automatically.

Useful rules-of-thumb perhaps but only if presented as such ("in general X" or "X is true except when it isn't due to Y or Z") rather than absolutes.

The relational model provides a pretty general and unified way to represent information, together with a cohesive and powerful set of primitives for working with it, yet many systems architects insist on hiding it behind an ad-hoc interface that looks like a throwback to pre-relational days.

That's true, but you could always either create them yourself in SQL in a migration, or use a gem like foreigner (as I did). Always use FKs...

You put your business logic in the rdbms? As constraints and triggers and stored procedures and such?

There are advantages to that, but pretty big disadvantages too. For maintaining and developing a non-trivial app, SQL, or a particular vendor's SQL variant, is definitely not my preferred platform. I definitely don't find it easy to understand an existing project that's been developed like that. I guess it _could_ be a matter of taste and experience, if it is yours!

Most of the time when business logic is put into triggers it turns into a disaster. Even Oracle recommends putting only simple logic in triggers, and that for the most part they should only be used for auditing purposes.

For example, too many times I've updated a record with a new column value, only to find that some trigger then set the value to null behind my back.
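For illustration, a contrived SQLite sketch of exactly that kind of surprise (table and trigger names are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders(id INTEGER PRIMARY KEY, status TEXT, approved_by TEXT);
    -- A well-meaning "business rule" buried where nobody looks:
    CREATE TRIGGER clear_approval AFTER UPDATE OF status ON orders
    BEGIN
        UPDATE orders SET approved_by = NULL WHERE id = NEW.id;
    END;
""")
con.execute("INSERT INTO orders VALUES (1, 'new', 'alice')")

# The caller only touches status...
con.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")

row = con.execute("SELECT status, approved_by FROM orders WHERE id = 1").fetchone()
print(row)  # ('shipped', None) -- approved_by was wiped behind the caller's back
```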

One advantage: you get all the write concurrency your RDBMS can give you for your transactions. Too many ORMs I've seen just have a Big Lock, leaving you with just one writer at a time.

I can't think of disadvantages to putting your business logic in the RDBMS. Can you elaborate?

> Too many ORMs I've seen just have a Big Lock, leaving you with just one writer at a time.

Geez, I think you've seen some pretty terrible ORMs. I don't know of any mature and popular ORMs that do that. If they do, I have no idea why they are popular or what makes anyone consider them mature.

ORMs that don't do that include: ActiveRecord, Sequel, Hibernate, Core Data, SQLAlchemy, etc.

I think he meant putting constraint and index logic in the code.

For example: I've seen people rely on a unique index in the code. But if you set a unique index in the DB you know it will be unique always. Even when the code is replaced some day.
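A minimal sketch of that point in Python's sqlite3 (schema made up): whichever code path writes, even one written years later, the constraint holds.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, email TEXT UNIQUE)")
con.execute("INSERT INTO users(email) VALUES ('a@example.com')")

# Any caller -- today's app or its replacement -- hits the same wall:
err = None
try:
    con.execute("INSERT INTO users(email) VALUES ('a@example.com')")
except sqlite3.IntegrityError as e:
    err = e
print(err)  # UNIQUE constraint failed: users.email
```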

I'm not sure I'd call constraints business logic, and I would definitely want to at least double-check them in the database.

I could be wrong, but I think the main reason business logic is put into the DB (provided that the "users" of the DB are also segregated properly) is to present a cohesive view to all users of the DB.

Too often, when you have a "wide open" DB and multiple "users" of that DB (think groups within an org) decide to query it (whether just for reading, or worse, updating), the queries can often turn out to be radically different for the same business logic.

Maybe one development group thinks that a query should be done one way, while another thinks that to get the same data it should be done another, and now you have two or more groups disseminating data to other groups in totally different ways.

You may ask why the organization has multiple development groups, but it occurs. I worked for one company (which I won't name because it isn't important) as a web developer in their marketing department. Our group was considered separate from the IT group, which handled the main "DB", based around AS/400 systems and Apache SOLR. We worked together as best as possible, but also bumped heads: they wanted us to work with their "DB" (I don't really consider SOLR to be a DB, though it can be and is used like one by a variety of orgs) only through their interface, which at times didn't work as they documented it, and many times changed without us knowing about it until our stuff broke mysteriously, usually around the end-of-year holiday push. We, on the other hand, needed to store a lot of the stuff "nearline" in a MySQL DB (we used MySQL, PHP, and AWS for much of our development) to make things more responsive for our end users (i.e., the people buying the products the company made off the websites we were creating). In essence, we had a bit of a "split personality" going on, but our main boss was only two levels removed from the CEO of the company, so our stuff was generally tolerated. But it is a similar situation.

Where you have multiple dev groups (whether by design or because "ad-hoc" things occur - like Bob in accounting figuring out how to use ODBC with Excel macros to query the database for data for his and other departments), you can have this kind of chaos. When the DB is heavily controlled by IT, with the only "views" of it through tightly controlled business logic which is part of the DB, this kind of difference in data can be controlled.

But it does have many more downsides, which also means it probably shouldn't be used or done in that manner. Instead, the interface should probably be through a single interface (RESTful or similar), with the business logic in the code of that interface, and only the barest needed other logic in the DB to tie it together. Provided that the security on the DB is tightly controlled, and the only way other groups can access the data is via the exposed interface API, you can achieve the same results I think.

In a way, that's how we had the access at that one company - we had a RESTful interface API to the backend SOLR store; we could send a formatted "query object" and get back one or more "records" as a stream of data (we'd then usually take the data, parse through it and store parts of it into our MySQL DB - because the query/response time of that SOLR DB was horrible from a web development perspective; I don't think this was the fault of IT, but rather the fact that the datastore on the backend was vast, holding information about products dating back 60 or more years - I'm sure there was likely some COBOL in the mix somewhere).

That did have the downside that if we wanted a particular means to query for something that didn't exist in the existing API, we either had to make do with what we could do "locally" (thru code and/or MySQL "buffering"), or we had to put in a request for a change to the IT group (which may or may not get accepted, and might take weeks for the turnaround time before we could use it). Furthermore, as I mentioned before, there were more than a few times that we rolled out a particular feature, only to have the website(s) that relied on that feature break because the backend API changed "behind our backs" (and we usually saw this over a holiday period, when our sales would peak of course). In many cases, we couldn't do anything about this (not even storing the SOLR information - it was too vast, plus there was a nagging idea that if we tried that, someone's head would roll for not using the implemented interfaces and data that already existed - we could "buffer" or "cache" things, but we couldn't wholesale transfer the data over).

Exposing data structures in a human readable form was probably a necessity because SQL slotted itself into a space where COBOL lived...and one of the features of COBOL was its intent to create readable data processing programs. Like SQL, COBOL was generally a very reasonable alternative when practical engineering and business considerations were all on the table (e.g. mature well documented tools and the availability of employable candidates).

Before databases you had only flat files or VSAM (and before that, ISAM) for your data. Good luck debugging your data in a VSAM cluster. https://en.wikipedia.org/wiki/VSAM

My apologies for not being clear. By "data structures" I meant data structures and by implication that COBOL provided an abstraction over bytes in memory or on persistent storage media. The premise of my statement is that SQL is also an abstraction over bytes in memory or on persistent storage media in similar use cases.

Thankfully, SAS handles VSAM reasonably well. Don't ask me why I know this, or why I need to continue to know this.

This is not true for any non-trivial database. I had to learn this. Business logic still has to be understood, there is still lots of domain knowledge about how this is intended to be used and how not, and in my case there was still lots of ambiguous, poorly documented VBA code to be understood. I'm not aware of a database without a frontend of some kind.

Personally I find SQL to be one of the easiest languages to read (when reading code from other developers).

I've already posted on HN how I use SQL in my public affairs data journalism class [0]. To me, there is no better gateway language, in terms of accessibility and return on investment, to the power of computation and programming than SQL, with the exception of spreadsheets and formulas. Even if you don't go further into programming, SQL provides the best way of describing what we always need to do with data for journalistic purposes: joining records, filtering, sorting, and aggregating. Ironically, I learned SQL late in my programming career, and initially thought its declarative paradigm mysterious and inferior to procedural languages. In fact, I don't know how to do anything in SQL beyond declarative SELECT queries (and a handful of database create/update admin queries). Turns out this is just powerful enough for me for most app dev work (Rails, Django), and the simplicity is a boon for non-programmers.

ProPublica just published a bunch of data-related jobs and positions. The phrase "Proficiency in SQL is a must" makes an appearance: https://www.propublica.org/atpropublica/item/propublica-is-h...

[0] https://news.ycombinator.com/item?id=8505000 https://news.ycombinator.com/item?id=10585009

I've recently started learning Pandas (a dataframe library for Python) and I find most queries cumbersome compared to the absolute minimal SQL I know.

Pandas is mostly columnar in nature. If you're heavy into Python idioms it's fairly easy to grok, imo, but there is a subtle learning curve about what is in and out of kernel. Since in-kernel operations are about 50-100x faster, it's obvious when you did the wrong thing, but it doesn't show until data sets are huge.

I tried doing something regarding browser hits from Akamai data in Pandas and 3 different SQL databases (MySQL, Postgres, SQLite), and nothing came close to Pandas for holding 150m hits (one day's worth across our properties) in memory. Especially with Dask Dataframes mixed in. No competition for the effort involved.

Using tools such as sqlmagic or pyodbc for ipython (below) you can let the SQL database perform most of the heavy lifting of getting "raw" result sets into a dataframe, and then perform some lighter, possibly more Python-idiomatic tweaks to the data once the focused set is loaded in-memory.

  import pandas
  import pyodbc

  def getdataframe(sql):
      try:
          connection = pyodbc.connect("DSN=myDSN", autocommit=True)
          return pandas.read_sql(sql, connection)
      except Exception as e:
          print(repr(e))

> with the exception of spreadsheets and formulas

You may want to check Oracle's MODEL clause. Not as powerful as a spreadsheet, but you can already do a lot of things in the middle of your SQL.

Thanks for the tip, I hadn't heard about it before, but any SQL feature that inspires articles like "Using the SQL MODEL Clause to Generate Financial Statements" [0] is generally the mix of practicality and code that I'm looking for.

I suppose one major roadblock is the use of Oracle SQL. Not the language but the database software. I used to teach MySQL and SQLite before settling on solely the latter. First, SQLite can easily handle any database that students have to deal with, which at max are in the gigabyte range (and usually much much smaller than that). Second, there's far fewer moving parts. I mean, they still have to deal with the idea that SQLite is a database, and that something like DB Browser for SQLite [1] is a client, but it's much less of an obtrusive detail than installing MySQL, a MySQL GUI, and then checking to see if the MySQL daemon is running on their personal laptops.

[0] http://www.orafaq.com/node/69

[1] http://sqlitebrowser.org/

I support a lot of data scientists using SQL and map-reduce technologies. The ones using SQL are about 10x more productive. The ones on map-reduce are building tools to figure out why their metrics are bad, because it takes too long to re-run them.

Can anyone shed light on why there has been a phenomenon of people finding SQL 'too complex' and moving to noSQL? (Not sure if that's entirely fair but, from the outside, it is what it looks like). Is it hype driven? Are courses at university not tending to cover SQL that much?

The main reason is the object-relational impedance mismatch[1]. Basically, programmers like working with objects that have data fields. This is because most modern, widely-used programming languages treat objects/aggregates with data fields as a first class concept. But SQL isn't designed around objects with fields; it's designed around tables, rows, and result sets from queries. Therefore, working with SQL in most modern programming languages generally requires layers of annoying result-set->object or object->row plumbing/conversion code. (Not to mention the vagaries of type conversions.) Of course, these days this problem can be substantially mitigated by clever ORMs, but an ORM is generally a leaky abstraction at best. Obviously, whether or not any of this bothers you will depend on your use cases and a lot of other factors.

[1] https://en.wikipedia.org/wiki/Object-relational_impedance_mi...
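For a concrete feel of that plumbing, here's a hand-rolled sketch in Python's sqlite3 (the User class and queries are invented for illustration; this is the boilerplate an ORM automates):

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class User:
    id: int
    name: str

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users(id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO users VALUES (1, 'alice')")

# Row -> object and object -> row conversion, written by hand:
def fetch_user(uid):
    row = con.execute("SELECT id, name FROM users WHERE id = ?", (uid,)).fetchone()
    return User(*row) if row else None

def save_user(u):
    con.execute("UPDATE users SET name = ? WHERE id = ?", (u.name, u.id))

u = fetch_user(1)
u.name = "bob"
save_user(u)
print(fetch_user(1))  # User(id=1, name='bob')
```

Multiply this by every table and every query shape, and the appeal of having a tool generate it becomes clear, as does the leakiness once queries get complex.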

> But SQL isn't designed around objects with fields, it's designed around tables and rows.

I think a more correct analogy would be that tables are like classes, columns are the properties, and rows are instances. And so defining foreign keys is like setting a pointer to a parent instance.

There is no direct analogy for methods, but you can use functions/triggers to do the same job.

PostgreSQL is actually an object-oriented RDBMS; the fact that you are meant to manipulate these objects through SQL doesn't make them any less powerful. And SQL is actually Turing complete with PostgreSQL.

It's clearly not convenient for general programming, but as soon as data manipulation is involved you benefit from a lot of built-in optimization.

> defining foreign keys is like setting a pointer to a parent instance.

I can't agree with that statement, though I'm sympathetic to why you would say it. A foreign key is something that essentially doesn't translate from the world of Tables/Rows to the world of Classes/Objects.

In the relational data model the foreign key is a convenient place to include an index for joining on. What does join even mean between two classes? The closest you'll typically get is nested objects, but that's not quite right. A row produced from table joins isn't a nested data structure. It's still flat. The join operation in relational algebra has no direct equivalent in OOP.

This breaks down when you deal with classic object oriented concepts like inheritance, encapsulation, or polymorphism.

This is why object/relational mapping (ORM) has been called the Vietnam of Computer Science[0].

[0]: http://blogs.tedneward.com/post/the-vietnam-of-computer-scie...

Sorry if I was unclear, I'm not advocating for ORM.

And yes, it breaks down because it is similar but different. Row "instances" can perfectly well exist outside a table and even be defined outside any table, via composite types or even on the fly.

Just saying an RDBMS can be a powerful object-oriented environment using solely SQL and no ORM at all.

> tables are like classes, columns are the properties, and rows are instances.

tables are like functions, columns are the arguments, rows are the invocations

functions are actually functions...

functions are relations

I spent years trying to hide the database. Nowadays I just embrace it. The database is there. Tables and rows are as real as invoices or purchase orders. The database will still be there after my programs and I are gone. I use OOP to construct good gateways into my tables.
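In that spirit, a minimal table-gateway sketch in Python's sqlite3 (the invoice schema and class are invented for illustration; one pattern among many, not the pattern):

```python
import sqlite3

class InvoiceGateway:
    """A thin, explicit gateway to one table: no magic, no hidden state."""

    def __init__(self, con):
        self.con = con

    def add(self, number, total):
        self.con.execute(
            "INSERT INTO invoice(number, total) VALUES (?, ?)", (number, total))

    def total_for(self, number):
        row = self.con.execute(
            "SELECT total FROM invoice WHERE number = ?", (number,)).fetchone()
        return row[0] if row else None

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE invoice(number TEXT PRIMARY KEY, total REAL)")
gw = InvoiceGateway(con)
gw.add("INV-1", 99.5)
print(gw.total_for("INV-1"))  # 99.5
```

The table stays the source of truth; the class is just a convenient, testable door into it.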

The struggle is real. Letting go of my local data structures was a surprisingly difficult psychological hurdle. I sometimes wish there'd been someone there to tell me that the easiest paths are basically "all" and "nothing". Unfortunately I was the only programmer in the entire organization so I was making it up as I went along. I struggled for WEEKS playing pull-change-push games with huge duplicate local data structures before it even occurred to me there was a better way.

> I use OOP to construct good gateways into my tables.

That's all an ORM is, a good default gateway to the tables that you don't have to write by hand.

ORMs (i.e. Hibernate) try to hide the database. You change the objects and they are magically saved. See http://stackoverflow.com/questions/20863150/in-the-context-o...

I've written ORMs, I don't need them explained to me, and no, not all ORMs try to hide the database; that's just one pattern. Others, like ActiveRecord, simply try to make life simpler without hiding everything behind so much magic you can't figure out what's going on. ORMs abstract the database; that's not the same as hiding the database.

Except for the times when they load all columns into memory and then you programmatically discard almost everything because you really just needed Person#name.

An ORM is a gateway...not a good gateway.

Most disagree, hence the popularity of ORMs.

I remember a few years back - I don't remember exactly when, but I'd say around '07-ish, definitely a couple of years before MongoDB appeared - some of my peers who were highly proficient in relational databases began denormalizing their data to meet higher demands than we were used to. The basic idea was to turn tables into key-value stores and put Memcached in front of them. The higher the traffic, the less normalized the data became.

Around this time, NoSQL databases started popping up, and a lot of my colleagues moved on to them since it fit so well with the trend toward denormalizing everything. Some loved them and dove in full-force.

Personally, I kept working with normalized data, used a caching layer to handle the denormalized versions of the data and learned more about scaling with Master / Slave configurations, and honestly felt very much like I was being left behind.

In order to see what the fuss was about, I tried a small personal project with MongoDB (this is pre-redis, I think), and honestly enjoyed the simplicity of it. And then my toy project got a sudden popularity bump from 20 users to 40k in a day and my project just died. I spent two weeks trying to keep up and then kinda got it working, but couldn't keep the site alive for more than a day. And since it was just a toy project, I just gave up completely.

I've only used non-relational stores for portions of projects that explicitly warrant them, since.

There is also the eternal rotation of the corporate wheel of centralization and decentralization. When the official corporate HQ Oracle team inevitably becomes too slow, expensive, difficult, and incompetent to deal with, as part of the usual business cycle, non-technical people will route around them: they use Excel spreadsheets as their database, dump JSON/XML files into a filesystem, or use Mongo primarily to keep the corporate Oracle team away, like garlic keeps vampires away. And then, when the decentralized solution falls over from poor design or lack of scalability, reliability, backups, or support personnel, adult supervision returns and there's a push to massively re-centralize into professionally engineered schemas and indexes.

Another aspect of "too complex" is that sometimes the true data structure of the problem is correct and also too complex. Some programmers bite off small chunks and chew on them, then push back that the entire data structure of the problem should become the small, successfully chewed-up chunk. A model that encompasses all of the concept of "number" is very complicated, so a programmer starts off writing a small simple integer library and pushes back that the definition of number should become the simple integer... then reality impacts, and as the system evolves and demands are made, the concept of "number" needs floats, rationals, complex numbers, base conversions, maybe worse, and the original very complicated design is the only successful way to implement the business requirement of "number". Whoops. If you baked into the cake at the start how to handle rationals with zero denominators, life would be a lot easier and safer than bolting it on later or praying the app-level code handles it, for example.

This first paragraph applies exactly to us! We're the offshoot now because of a bloated central IT system and now that we have a good thing going, there's a corporate push to modernize IT and bring back all our offshoots to one place.

Loved the vampire/garlic analogy!

>a phenomenon of people finding SQL 'too complex' and moving to noSQL?

Fyi... the "SQL vs noSQL" means at least 2 different ideas and some of your replies are highlighting one aspect but not the other.

1) "SQL vs noSQL" can mean "SQL syntax (e.g. joins) vs object syntax to save/retrieve documents (e.g. Javascript JSON/BSON)" . This is probably the main driver of MongoDB adoption. (They don't care about Mongodb's scaling aspect; they just like the easier syntax.) The counterpoint to this idea is that "people are using terrible db engines like MongoDB because they are unwilling to learn SQL syntax."

2) "SQL vs noSQL" can mean "OLTP RDBMS engine (e.g. MySQL/Postgres/Oracle) vs distributed db engine (e.g. HBase/Cassandra/etc)". The counterpoint to this idea is that "people are deploying to distributed db engines when their use case actually fits in a 1-node RDBMS engine."

From your wording, it looks like you're talking about #1. To that point, many programmers don't like the complexity of 10-way joins of a dozen normalized tables to reconstruct a customer data entry screen. With a document-oriented noSQL db, you just retrieve the entire denormalized "document" with no joins. However, there are tradeoffs to the "easier" noSQL engine, such as performance.
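A toy side-by-side of the two shapes (schema and document are invented for illustration):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
""")
con.execute("INSERT INTO customer VALUES (1, 'alice')")
con.executemany("INSERT INTO orders VALUES (?, 1, ?)", [(1, "book"), (2, "pen")])

# Normalized: reconstruct the "customer screen" with a join.
rows = con.execute("""
    SELECT c.name, o.item
    FROM customer c JOIN orders o ON o.customer_id = c.id
""").fetchall()
print(rows)

# Document-style: the same data as one denormalized blob. No join needed,
# but every write must now keep the duplicated data consistent itself.
doc = json.loads('{"name": "alice", "orders": ["book", "pen"]}')
print(doc["orders"])  # ['book', 'pen']
```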

Hi - yes you are definitely correct to make that distinction and I was only talking about (1), having forgotten (2). I see your point about the large joins required to get the information you need for a screen.

I'd toss out a third explanation: NoSQL is not really a language thing but a normalized-vs-denormalized thing.

Historically databases got used a lot where the short term cost of normalization was low and the long term cost of denormalized data was extremely high, like accounting or bank records or medical records.

The world has some data store applications in completely different environments, where the cost ratios are wildly different and the bandwidth load is extremely high. Like feeding the whole logging output of a webserver cluster into a DB for "data mining" or whatever. So minimum total cost in those weird new applications involves doing the opposite of what usually is, and remains, correct in the older, still profitable applications.

Inevitably, resume stuffing and neophilia being what they are, you end up with people trying to pound square pegs into round holes to boost their resume, look cool, gain experience, or for the sheer joy of trying something new. There's also the rush of transgressive behavior, short-term thinking, etc. So naturally you get the extreme over-reaction of people arguing that corporate financial records, your bank balance, or your credit record should be based on NoSQL.

The two factors I saw:

1. Bad experiences with cumbersome ORMs and refusal to learn how databases work: basically fighting with performance or normalization problems and concluding the problem was the concept rather than using it poorly (“JOINs/subselects are hard so I'm doing 10k queries. SQL isn't web scale!”). A related component of this was optimizing for only part of the problem – e.g. someone's writing a web form and they found it appealing to slap arbitrary key:value pairs into a NoSQL store because they hadn't gotten around to writing the validation, reporting, etc. code which actually needed structure.

2. Hype, hype, hype: people would look at papers coming out of large places like Google, Yahoo, etc. with impressive numbers and think they needed the same infrastructure to impress the other cool kids, ignoring the fact that those papers mentioned traffic / data volumes many orders of magnitude higher and the huge companies could hire more engineers to make up for the extra time needed to hit that kind of scale. A lot of that thought was influenced by the pre-SSD era so the assumptions about when you exceed the limits of a single server really don't hold up to serious consideration now, or often even back then.
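The "10k queries" anti-pattern in point 1 is the classic N+1 problem. A sketch of both shapes (sqlite3 in memory; author/post tables invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE post (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO author VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO post VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c');
""")

# Anti-pattern: one query per parent row -- N+1 round trips.
slow = []
for author_id, name in db.execute("SELECT id, name FROM author ORDER BY id"):
    for (title,) in db.execute(
            "SELECT title FROM post WHERE author_id = ? ORDER BY id", (author_id,)):
        slow.append((name, title))

# Same result set in a single query with a JOIN.
fast = db.execute("""
    SELECT a.name, p.title
    FROM author a JOIN post p ON p.author_id = a.id
    ORDER BY p.id
""").fetchall()
```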

The promise of SQL is the ability to do arbitrarily complex queries, but the key to making SQL fast is to tailor your indices and queries to the current use. These two are in opposition, so most of what people do with SQL is the same simple key-based scans and grouping of linear data sets that NoSQL excels at, except with a big wrapping layer of serializing in and out of normalized tables. Every time I've seen people try to write complex aggregation and filtering tasks in SQL, the resulting queries are either a nightmare to maintain or performance is abysmal.
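The "tailor your indices to the query" point can be seen directly in a query plan. A sketch (sqlite3; the event table and index name are invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE event (user_id INTEGER, kind TEXT, ts INTEGER)")
db.executemany("INSERT INTO event VALUES (?, ?, ?)",
               [(i % 100, "click", i) for i in range(1000)])

query = "EXPLAIN QUERY PLAN SELECT count(*) FROM event WHERE user_id = 42"

# Without an index the planner can only do a full table scan.
before = db.execute(query).fetchall()[0][-1]

# Tailoring an index to the query turns it into a key-based lookup --
# the access pattern a key-value store gives you by construction.
db.execute("CREATE INDEX idx_event_user ON event (user_id)")
after = db.execute(query).fetchall()[0][-1]
```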

Another problem is trying to store dynamic schemas in SQL, e.g. data for a CMS with user-defined entities. You either have to expose a version of SQL to the client, which is dangerous and/or a huge hassle (see above), or you have to implement SQL-in-SQL with rows masquerading as columns. Neither is ideal.
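The "rows masquerading as columns" approach is the entity-attribute-value (EAV) pattern. A minimal sketch of why it gets painful (sqlite3; table and field names invented):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# EAV: user-defined fields become rows, not columns.
db.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE attr (entity_id INTEGER, name TEXT, value TEXT);
    INSERT INTO entity VALUES (1, 'article');
    INSERT INTO attr VALUES (1, 'title', 'Hello'), (1, 'author', 'Ann');
""")

# Reassembling one "record" means pivoting rows back into columns --
# SQL-in-SQL, and it only gets worse once you add typing and filtering.
row = db.execute("""
    SELECT
      max(CASE WHEN name = 'title'  THEN value END) AS title,
      max(CASE WHEN name = 'author' THEN value END) AS author
    FROM attr WHERE entity_id = 1
""").fetchone()
```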

For me it was a much different problem though that sold me: SQL has no concept of revision control or conflict resolution, and all update tracking must be done in-band, with manually incrementing versions and timestamps. Doing master-master style synchronization between two SQL databases (e.g. server and client) is a pain. Doing it with e.g. CouchDB was a breeze, because of its git-like revision tracking. Being able to shunt JSON in and out was a huge benefit, as simple arrays and hashes do show up everywhere. Being able to suck in data from production into a developer database using its built-in replication features was great too.
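The "manually incrementing versions" bookkeeping looks roughly like optimistic locking with an in-band version column. A sketch of that manual work (sqlite3; the doc table and save helper are invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE doc (id INTEGER PRIMARY KEY, body TEXT, version INTEGER)")
db.execute("INSERT INTO doc VALUES (1, 'draft', 1)")

def save(db, doc_id, body, expected_version):
    """Optimistic concurrency: the update only wins if nobody bumped the
    version since we read it. All the conflict bookkeeping is ours to do."""
    cur = db.execute(
        "UPDATE doc SET body = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (body, doc_id, expected_version))
    return cur.rowcount == 1  # False -> conflict, caller must re-read and merge

ok = save(db, 1, "edit A", expected_version=1)        # wins
conflict = save(db, 1, "edit B", expected_version=1)  # stale read, loses
```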

If SQL serves your purposes well today, that's great, but there are plenty of reasons to want to move beyond it. I'd like whatever Post-SQL is to handle nested data types, particularly if it comes with functional-programming-style algebraic closure of the resulting constructs. SQL-in-SQL should just be SQL.

In the meantime, I will just design data-first and acknowledge that how I _pull out_ the data is the main constraint anyway, SQL or not. For arbitrary queries, there are dedicated indexing and search solutions like ElasticSearch that can actually deliver on doing that efficiently, without lots of careful babysitting.

As a guess, SQL and noSQL are different tools good for different things and before noSQL people were trying to hammer their document-store shaped peg into a relational-database shaped hole.

To speculate, noSQL key-value stores became very popular because they allow you to model imperfectly defined situations, and to update that model quicker than in traditional relational-database type situations.

Consider the difference between writing some classes in Java, or bunging everything into a dict in Python. The Java solution could be more formal and well documented, but a pain in the ass if the model changes significantly. The Python solution is a bit more janky and probably sluggish, but can adapt quicker.

I think there are more people in things like research, self-teaching, iterative game dev, and exploratory startups that value the adaptability than there are in things like old school business consultancy for whom the underlying model might be static and well described.

As the web grew, so did the scale of websites.

Many sites and apps found themselves in a situation where they would gladly trade strong consistency, for more performance and eventual consistency.

Further, schema changes for relational databases with billions of rows have to be very carefully orchestrated to avoid downtime. Databases with more flexible schemas allow schema changes more easily (though obviously the code must accommodate the fluid schemas).
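One middle ground that many RDBMSs now offer for the fluid parts is a JSON column next to the fixed ones: new "fields" appear without an ALTER TABLE, but the code now owns that part of the schema. A sketch, assuming a SQLite build with the JSON1 functions (bundled in recent Python releases); the product table is invented:

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Fixed columns for the stable core; a JSON blob for the fluid parts.
db.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT, extra TEXT)")
db.execute("INSERT INTO product VALUES (1, 'lamp', ?)", ('{"color": "red"}',))

# Queryable without any schema migration -- validation is now the app's job.
color = db.execute(
    "SELECT json_extract(extra, '$.color') FROM product WHERE id = 1"
).fetchone()[0]
```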

Various NoSQL solutions were developed to fill these needs.

Sure, they were overused for a bit (though that fad has basically passed), but they definitely arose to fill a real need.

Schema changes, and how they affect up-time, are an implementation detail in relational databases. It's possible to design a relational database engine that doesn't require huge amounts of downtime to alter the schema of large tables. It's all in how the data is stored.

Of course, the reality of the situation is all that matters, and most database engines at that time did not offer such features. So, I don't necessarily fault people for trying to find solutions that simply worked for their needs.

I think it's like everything else, developers see what the top companies are using and think, well, if it works for them it has to be the best (and you can search for "what's the best..." on any developer forum and see what I mean). So in that sense, it's hype driven.

I don't believe this is the best way to decide on a technology.

This doesn't make much sense to me. I think the answer is just lack of experience.

"noSQL" should be read as "not-only sql" and I think sql in this acronym should be interpreted as "relational databases". I think the relational model is convenient for a lot of purposes but at some point (maybe ~10 years ago) some companies where trying to use it for everything even when those things didn't actually fit in that model. So, "noSQL" for me actually means "alternatives" for different kinds of data time-series, graph, key-value, documents, etc. with redis, neo4j, graphite, mongodb, etc.

What I would like to see in the future is some ecosystem based on small pluggable modules that you can use to build the embedded store for a micro-service. Instead of using a full-featured database system (relational or not) you assemble your store like Legos with only the stuff you need. Something like this https://github.com/level/levelup/wiki/Modules

It is hype driven. Or perhaps driven by a failure to understand the subject, coupled with a natural tendency to look for simple, and ultimately trite explanations for everything.

The pot calling the kettle black.

relational data modeling is something you will learn from experience once you're developing an application where the business domain data is inherently relational.

everything you work on in school is not like this. you spend a lot of time studying in-memory data structures and doing tiny projects (ranging from 100 lines of code homework all the way up to 5000 lines of code midterm project) that implement in-memory data structures and algorithms to work with them.

so you get some inexperienced developer who only has their schooling to inform them and they're all stuck thinking in terms of arrays, lists, hashmaps, and trees. learning a new data model has a lot of cognitive overhead, so the allure of a data persistence tool that gives you almost the same data structures you already know is very strong. "Why would I learn about relational data modeling when everything is just a hashmap anyway?" thus MongoDB is extremely popular with 23 year olds and extremely scorned by 33 year olds. 10 years of job experience drives the point home.

For me personally, the relational data model didn't really make much sense until I studied relational algebra. Almost overnight my supervisor was amazed and wanted to know why my queries went from shit to amazing in their eyes.

"NoSQL" databases existed before SQL. For example, look at multivalue databases like Pick/D3, or the sparse map like persistent data storage structures of MUMP. We had plain text databases, key value storage, hierarchical databases or multivalue databases before we had relational ones.

Actually, pre-RDBMS those were the dominant systems. The RDBMS concepts were the hot, cool, more powerful new kid on the block that supplanted the original "NoSQL" systems, and they remained dominant until we hit "Internet scale" and people 1) wanted to scale to huge data volumes, requiring distributed databases and processing cheaply, and 2) wanted to return to finding alternatives to the dominating SQL platforms.

Actually it was an attempt to get better performance when the sizes of data increased drastically. Then of course it became a trend and developed some ideas by itself that are unrelated to where it came from.

Just from my pov, SQL does not support a document view of data and does not easily support versioning of same and this is a huge problem certainly in complex business systems.

> SQL does not support a document view of data and does not easily support versioning of same

SQL supports linear ordered or even branching versioning fairly easily (though only recently have many SQL-based DBs had decent tools for temporal versioning), and SQL-based object-relational databases (Postgres and Oracle, for example) have supported both document-oriented views over classical relationally-structured data and document-oriented data storage since well before the NoSQL craze.
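A document-oriented view over relationally-structured data can be sketched with the JSON aggregate functions that several SQL databases now ship (here sqlite3's JSON1, standing in for Postgres's equivalents; the customer/orders tables are invented):

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 20.0);
""")

# Storage stays normalized; the query hands back a nested "document".
doc = json.loads(db.execute("""
    SELECT json_object(
      'name', c.name,
      'orders', (SELECT json_group_array(
                          json_object('id', o.id, 'total', o.total))
                 FROM orders o WHERE o.customer_id = c.id))
    FROM customer c WHERE c.id = 1
""").fetchone()[0])
```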

Honestly, I think there's a cargo cult of people who want things to be impressively complicated.

I think you missed the part in the article where it said “SQL - it’s so easy marketers can learn it.”

Because MongoDB is web scale. It doesn't use SQL or joins and that's the secret ingredient in the web scale sauce. That and sharding.

I think the people downvoting this haven't seen [1] :)

[1] https://www.youtube.com/watch?v=b2F-DItXtZs

I think it's just that Hackernews doesn't find it funny.

There just is no substitute for SQL. Some thoughts on what has given it a bad name:

1) The pervasive use of artificial keys. USE NATURAL KEYS. Unfortunately probably 99% of real-world databases were designed with artificial keys. I wish I could point to some literature on this topic. It is very rare and I only came to learn about this from a DBA who is well-versed in designing with natural keys. I'm trying to get him to publish more on this topic.

2) ORMs. This is just a bad practice. Their use in part derives from the awful schemas designed with artificial keys, requiring another layer of complexity to get a more intuitive model of the data. Fortunately for me, over the past 3 years I've been doing almost all my application I/O with F#'s SQL Type Provider, SqlClient, http://fsprojects.github.io/FSharp.Data.SqlClient/ , which strongly types native query results, functions, and SPROCs. It just does not work if you need to construct dynamic SQL. I've been trying to goad the author into also providing metadata retrieval. That would be the icing on the cake.

3) SQL does not seem to be a required topic for undergrads. There are really no unsolved problems (of note), so it's not interesting to academics.

4) Most app programmers don't get much practice writing difficult queries or tuning problem queries, so that one time every 9 months when you do something hard, it is hard. (And again, often compounded by the complexity introduced by artificial keys.)

> The pervasive use of artificial keys. USE NATURAL KEYS. Unfortunately probably 99% of real-world databases were designed with artificial keys. I wish I could point to some literature on this topic. It is very rare and I only came to learn about this from a DBA who is well-versed in designing with natural keys. I'm trying to get him to publish more on this topic.

While I really enjoy the concept of Natural Keys, I just see so few places where they are applicable. If I was designing a banking Db schema, I could see using an account number as a primary key as long as I am not exposing my Db to anything but internal systems. However, if I'm designing a social networking platform, what can I use as a natural primary key for a user object? I can't use a name because not all names are unique and they can change. I can't use an email because they can change and I feel like Email addresses can get pretty large(more space and slower lookups). I could maybe use a username if I enforce that a username can never change and must be unique. But, we then run into the same issues as emails where a username can be fairly long and thus cause slower lookups. I also don't like the idea of using strings as primary keys either because I would need to take into account implementation details like string encoding (utf8, utf16, utf32, ascii, latin-1) and make sure to encode/decode on every lookup/insert.

So, I can see some use cases where natural primary keys make sense. However, I believe that for most use cases, artificial keys are a better choice. Integers don't take up a lot of space, it's easy to enforce uniqueness, they are a natural sequence, and they rarely, if ever need to change[1].

[1] In fact, I would make the argument that if you're altering primary keys at all, you're doing something wrong.

Properly identifying the natural keys requires more up front thinking than slapping on an identity or guid column. Also it ends up with a different set of tables at the end of the day. So looking at your current schema and saying natural keys don't work here is probably true. It's a big topic, and like I said, not enough literature, but if you search sql natural key you can get started.

There are no proper natural keys, natural keys are a bad solution to the real world problem of running and managing an application. The correct solution is to use a constraint to enforce the uniqueness of your supposedly "natural key", and use an artificial key to actually join and code against because quite frankly natural keys suck and are commonly multi-column keys that require updating and joining against all of which make them poor choices for programming against.

Surrogate keys don't break your app and code when business rules change or you finally figure out your natural key isn't so natural and has exceptions that forces you to make it not a key. If you think natural keys are a good solution, you've not lived outside of the database where real $$ is on the line.

Those who live in a SQL ivory tower where breaking changes don't matter like natural keys; those who actually write applications that use databases and have to deal with the constant churn and change of business rules know how foolish natural keys are and being pragmatic and understanding how to correctly use a pointer to avoid the need to cascade updates and make the schema immune to business rule changes have long ago chosen surrogate keys which are the vastly superior engineering solution.
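The surrogate-key-plus-constraint pattern described above can be sketched in a few lines (sqlite3; the user table is invented for illustration):

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Surrogate key for joins; a UNIQUE constraint still enforces the
# "natural key" (here: username), so data quality isn't sacrificed.
db.execute("""
    CREATE TABLE user (
      id       INTEGER PRIMARY KEY,   -- surrogate: opaque, never changes
      username TEXT NOT NULL UNIQUE   -- business identity: may change
    )
""")
db.execute("INSERT INTO user (username) VALUES ('ada')")

# A rename touches one row; every foreign key pointing at user.id
# is untouched -- no cascading updates when business rules change.
db.execute("UPDATE user SET username = 'ada_l' WHERE id = 1")

# The constraint still rejects duplicate business identities.
dup_rejected = False
try:
    db.execute("INSERT INTO user (username) VALUES ('ada_l')")
except sqlite3.IntegrityError:
    dup_rejected = True
```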

Hi Gnaritas,

I would love to have a discussion with you on the topic, but to get started I think we'll need a little more solid ground than the over-generalized, empty claims you made: "There are no proper natural keys" / "Natural keys are a bad solution" / "The correct solution is.." / "Natural keys suck" etc. Can you provide viable/scientific/quantitative/reproducible/logical, either practical or theoretical arguments to support these claims? If you can, I would be happy to discuss those with you in detail.

BTW - If you need to change your database schema when your business rules change, the root problem is your data model. It means your database schema models the business rules, instead of modeling the data universe.

Before we start the discussion, I would like to suggest that you read the following two short articles. I think it will give us a more solid common terminology.





> the correct solution is to use a constraint to enforce the uniqueness

That would of course be correct for a natural key. It seems that you too recommend using natural keys and I agree with you on that point. Natural keys are required so that a database can represent the real world accurately and enforce essential business rules. We all rely on websites to enforce the uniqueness of user names; we all expect our bank to enforce the uniqueness of account numbers.

Those who live in an ivory tower where data quality doesn't matter don't like natural keys; those who write applications that have to work in the real world know how foolish it is to leave out natural keys. Those older and wiser heads will have seen and dealt with the serious practical consequences of shoddy databases where natural keys weren't implemented. I think that is the problem being referred to in this thread.

1) Natural keys suck in the real world, artificial keys have come to dominate for a real reason that you either understand or suffer for ignoring. Natural keys are bad practice.

2) ORM's are good practice, they reduce massive amounts of duplicated code in applications and massively lessen development time which generally dwarfs the cost of execution time. If you don't use an ORM you are pissing money down the drain, most businesses are reluctant to do that. Using an ORM doesn't mean not using SQL when appropriate, it means using the ORM in the 90% of cases where it works best. Complex SQL belongs in views, which are then trivially mapped with the ORM giving you the best of both worlds. If you're doing complex queries with the ORM, the ORM isn't the problem, you are.

3) Should be.

4) Nor should they, programmers are expensive, paying them to do what a library can do better is a waste of money and time. You need a guy or two who knows SQL well enough to tune queries and do indexing, that's your db guy.

> There are really no unsolved problems (of note), so it's not interesting to academics.

For anyone who likes SQL and is looking for a research topic, read about temporal databases and suggest some ways to handle DDL changes over time (e.g. adding a new NOT NULL column, or changing a relationship from one-to-many to many-to-many).

Here is your starting bibliography:

Richard Snodgrass, Developing Time-Oriented Database Applications in SQL.

Hugh Darwen & C.J. Date, "An Overview and Analysis of Proposals Based on the TSQL2 Approach".

Krishna Kulkarni & Jan-Eike Michels, "Temporal features in SQL:2011".

Tom Johnston, Bitemporal Data

Magnus Hagander, "A TARDIS for Your ORM": https://www.youtube.com/watch?v=TRgni5q0YM8

I would love to read what you come up with!

I would argue that [http://www.AnchorModeling.com] is a pretty good take on it. I think it shares some genealogy with Fact-Based Modeling techniques (FCO-IM?). It has a really decent story for evolving the schema over time.

I've read Darwen/Date, Snodgrass, and Johnston's (two books) on the subject. Johnston's seems the best practical choice, but they patented the ideas :/ That said, one could probably implement Johnston's model without violating patents if you ignore explicitly modeling the episode structure and just follow the theory.

Date and Darwen make decent points about wanting to use the relational model rather than some baked-in concept, but ultimately they do almost no legwork on practically putting any of their notions to use. This seems to be par for the course on Date books.

Weis and Johnston handle the problem more directly. They also tackle a harder problem overall (bitemporal, vs just temporal). Also, Johnston is just easier to read than Date. (Side note: I feel like a blog where I just read chapters of Date's writings and condense the content into basic points would end with most of Date's arguments fitting in 50 pages.)

That said, Weis and Johnston still punted on schema evolution. Anchor Modeling sort of starts at supporting an evolving schema and moves outward from there. In Anchor Modeling there isn't the traditional temporal/bitemporal notion, but rather positors that have varying degrees of certainty about posits. There is basically a concurrence of facts that can be retracted or changed over time. Useful for modeling varying degrees of certainty or alternate perspectives on data.

The downside of Anchor Modeling is that the data model is basically 6th normal form with a bunch of table-valued functions, triggers, and views to aid in making it palatable for devs. Johnston sort of acknowledges this style in one of his books but I believe argues against the concept of an entity's information being spread throughout various tables. In his mind, a table is a type that is made of attributes (columns). Anchor Modeling is more along the lines of 'attributes are types', and you can relate them to form larger types.

I wish there was more work on the ORM side supporting some of these concepts (Schema evolution and temporality).

Yes, I would agree, temporal databases is an interesting topic.

Natural keys can be composite, which makes it a pain to manage cross-table relationships. I typically have a unique constraint on the actual composite key and an auto-increment id.

That is the correct solution, a surrogate key.

Good to know, thanks!

Can you give 1 example of a good natural key?

Note I will disqualify anything that has a reasonable chance of changing, like the primary email address of an account, a persons name, a persons day of birth or the 'public id' of a bank account

A composite key made up of foreign keys in a junction table.
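A minimal sketch of that example (sqlite3; student/course/enrollment names are invented): the junction table's primary key is just the pair of foreign keys, and it naturally rejects duplicate relationships.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY);
    CREATE TABLE course  (id INTEGER PRIMARY KEY);
    -- Natural key: the pair of foreign keys identifies the relationship.
    CREATE TABLE enrollment (
      student_id INTEGER REFERENCES student(id),
      course_id  INTEGER REFERENCES course(id),
      PRIMARY KEY (student_id, course_id)
    );
    INSERT INTO student VALUES (1);
    INSERT INTO course VALUES (7);
    INSERT INTO enrollment VALUES (1, 7);
""")

# Enrolling the same student in the same course twice is rejected
# by the key itself -- no surrogate id needed here.
dup_rejected = False
try:
    db.execute("INSERT INTO enrollment VALUES (1, 7)")
except sqlite3.IntegrityError:
    dup_rejected = True
```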

How often does the birthday of people change? (Then again, how likely is it to be unique?)

The closest example I can give is that I work on a system where the birthday can be partially unknown. This mostly affects refugees. So the person either has to guess his own age, or only knows the year, or only the year and month.

Then again, someone might make a mistake. "What do you mean 09-11 is not November 9th, you silly American."

Now that I think of it, I guess the best argument against natural keys is "typos".

I studied SQL as the lab portion of a database design course. The sequel to that course was wholly dedicated to PL/SQL.

And the sequel to that was postgrad?

The sequel to the SQL? ;)

Hey, some of us pronounce it S-Q-L :p

I've always thought the main reason NoSQL solutions became popular is that developers could finally get at and manipulate the data the way they wanted to.

I've known some very, very prickly DBA's in my time who referred to the databases they looked after as "My database". So they would say things like "Don't put junk in my database"

And would give you endless grief over how you wrote your queries or asked you a million questions about why you needed a new table and why your proposed design was shit.

As a result, many of us devs tended to view SQL Databases as some sort of dark art. So in this regard, NoSQL is freeing at first glance.

But if I'm honest, once I got over my fear, the "pros" of NoSQL solutions in comparison to good ol' SQL seem to be relatively feeble.

I think it's easier to get up and running with a NoSQL solution because there is far less friction when it comes to rapid prototyping of ideas, but things get complex pretty quickly.

I'd also say that for the vast majority of applications out there, the difference between the two will mostly be a wash.

By any chance were these DBAs also the Linux server administrators?

My own personal experience is DBAs/Linux server administrators have authority issues. I had one situation where I had an encrypted file on one of the Linux servers, and the server administrator spent months requesting the decryption key so he could inspect the file.

That seems highly unprofessional. I won't even ls into users' directories without their express permission.

.... yup. Him also going in randomly through the day, killing processes and clearing out log files at random, were other issues that were reported. For development, to avoid him and get some level of sanity, we used AWS and installed Ubuntu/Postgres/Java SDK to get work done.

I've been programming for most of my life. SQL has been a big part of my career. And I love it. It's one of my top 5 languages.

It's a nice, functional, declarative language in the vein of prolog and such. You just tell it the shape of the data you want, where to materialize it from, filter, aggregate, calculate the window of, etc... and the system figures out how to execute it as efficiently as it can. It beats out procedurally munging data by a long shot. It's more concise for many operations than ML-like variants.

It's a great tool to have. And understanding the underlying maths, relational algebra, is beautiful. I've found trying to implement your own rough-and-ready relational database is a good way to mechanically understand the theory. Then move on to implementing datalog... etc. The reason why SQL continues to stick around is that the fundamental theories are quite sharp! I'd appreciate a more concise syntax some days but overall I can't say I'm displeased. It's great!
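The "filter, aggregate, calculate the window of" part can be sketched in one declarative query (sqlite3, which needs a build with window function support, SQLite >= 3.25; the sale table is invented): you describe the shape of the answer, the planner decides how to compute it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE sale (region TEXT, amount INTEGER);
    INSERT INTO sale VALUES ('west', 10), ('west', 30), ('east', 25), ('east', 5);
""")

# Per-row share of its region's total, via a window function --
# no procedural grouping or second pass over the data in app code.
rows = db.execute("""
    SELECT region, amount,
           1.0 * amount / sum(amount) OVER (PARTITION BY region) AS share
    FROM sale
    ORDER BY region, amount
""").fetchall()
```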

One of my dreams is building a hybrid of RDBMS and Protocol Buffers. It would look like a bunch of nested structs that can be kept in memory, fetched from disk or over the network. The schema would be kept in .proto files, and you'd be able to reload it at runtime (existing data would be handled according to protobuf schema evolution, because each field in each struct is numbered and old numbers aren't reused). Most joins would be gone because nested structs are allowed, but you'd still have an SQL-like language for more general queries (designing such a language when nesting is allowed is a bit tricky, but not very). Things like indices and transactions would also exist at every level (memory, disk, network) because they are useful at every level.

The end goal is eliminating impedance mismatch, because your current in-memory state, your RPCs and your big database would be described by the same schema language, somewhat strongly typed but still allowing for evolution. I have no idea if something like that already exists, though.

The problem is that what joins to what differs depending on the context for what you are looking for.

Sometimes you join an order to supplier, other times it will be to customer and other times it will be to a list of products. Sometimes, all of them need to be included.

With nested structs the parent/child relationship is fixed. If your query needs to invert this relationship, you essentially have to search through your entire database which will be incredibly slow and resource intensive.

Either that, or you just store your data and their relationships separately and then allow for highly optimised searches to be conducted between these relationships according to the query and statistics you hold about the data ... but then you have just reimplemented an RDBMS..

I guess the idea is to make the schema as readable as possible using nesting, and then use indices for the rest.

Yep, it was one of the things used in quite a few systems before SQL became more widespread:


> One of my dreams is building a hybrid of RDBMS and Protocol Buffers.

That's actually one of my goals with https://github.com/jimktrains/drsdb. All communication would be done via protocol buffers, and schemas could be created or loaded at runtime to be verified and used.

The primary goal being a distributed RDBMS written in Rust.

> SQL and relational database management systems or RDBMS were invented simultaneously by Edgar F. Codd in the early 1970s.

Codd didn't invent SQL. Donald Chamberlin and Raymond Boyce did.

> SQL is originally based on relational algebra and tuple relational calculus

Maybe "originally based on", but not "an implementation of". For example, it is perfectly possible for an SQL query to return duplicate rows, which isn't possible under relational algebra/calculus, a relation being by definition a set of rows (or, more precisely, of tuples).

SQL is the second most underused tool we have for dealing with data, AWK being the first. SQL is great because the logic works with dealing with data and forces you to make good decisions earlier.

Everyone should spend a day learning SQL, if for no other reason than the ability to think logically about data.

awk is really nice. I'm starting to use it more and more in places where I previously used a long pipe of grep, cut and sed.

I use perl when grep/cut/sed show their limits. I never really got into awk.

I suppose we all have our favorite tools.

I'm a heavy Perl user, too, but some stuff that is very succinct in awk is a bit more awkward in Perl. Most prominently when you're using $1, $2, etc. which are just there in awk but need a split/\s+/ in Perl.

-a is your friend.


"Why do we still use SQL" and "Why do we still use relational databases" are two very different questions. They seem like much the same thing, because SQL is pretty much the only query language offered by relational database systems nowadays... so if you use SQL, you use an RDBMS, and vice versa. But other query languages used to exist. There was QUEL [1], for example. It seems to have fallen by the wayside; most people have probably never heard of it. I guess there is very little room for multiple languages in this particular space.

[1] https://en.wikipedia.org/wiki/QUEL_query_languages

Absolutely. Relational databases are so useful that I happily use SQL even though it is not a very good language. I have thought about making a "compile-to-SQL" language in the same vein as Typescript. (Does this already exist? You would think so, but I can't find one.)

HTSQL has some great examples of where SQL could be better: http://htsql.org/doc/overview.html#why-not-sql. I would love to get that goodness in a language not tied to the rest of HTSQL.

There have been a few languages that compile to SQL. CLSQL for Common Lisp and Korma for Clojure come to mind.

There are quite a few things that "compile to SQL". In .NET, the EntityFramework ORM takes the AST from the C# code in the query and generates the SQL from it. I'm not sure how ORMs work in other languages, but I would imagine it's something similar.

SQL is a great language. So great that it's a superset of everything good ever created in this domain¹, but still cognitively simple.

There is space for statistical and search-based (like Prolog) languages, but those are very niche. Proof of that is that they exist, but yet nobody here is talking about them.

1 - Ok, object store query languages are not a strict subset of standard SQL. But it's only a matter of adding one or two commands, like Postgres does.

"It’s like how MailChimp has become synonymous with sending email newsletters. If you want to work with data you use RDBMS and SQL. In fact, there usually needs to be a good reason not to use them. Just like there needs to be a good reason not to use MailChimp for sending emails, or Stripe for taking card payments."

Wow, that's a subtle, almost unnoticeable promotion of MailChimp. /s

I came expecting a treatment on Structured Query Language, was disappointed when it turned out to be on Relational Database Management Systems. It doesn't take a rocket scientist to figure out why RDBMSes won out over non-relational systems.

What I want to know is why nobody ever came up with a better query interface. Every abstraction I've ever seen was built on top of SQL.

Because SQL is so good at representing operations in the relational algebra that rdbms are based on, and such a mature technology, that it's awfully hard to replace it with something better? SQL and rdbms kind of go together, and always have -- it's the query language that parsimoniously represents what you can do to a relational source, whose development has happened in concert with rdbms themselves.

What don't you like about SQL?

Not OP, but I find the syntax to be arcane and bizarrely restrictive. Every query has to be SELECT FROM JOIN WHERE GROUP BY HAVING ORDER BY; for some reason the table comes after its fields, and the ORDER BY can't be expressed using field aliases defined at the start of the query.

Instead, I've been toying with the idea of a language where queries are expressed as a pipeline of operations on result sets, like this:

    FROM employee
      /* result set: all employees */
    LEFT JOIN department ON employee.DepartmentID = department.DepartmentID
      /* result set: all employees + their department */
    ORDER BY employee.last
      /* result set: all employees + their department, sorted */
    WHERE department.name IN ('Sales', 'Engineering')
      /* result set: Sales and Engineering employees + their department, sorted */
    FIELDS employee.first, employee.last, department.name
      /* result set: Sales and Engineering employees' first, last, and department names, sorted */
This is a SELECT, but you can for example turn it into an update just by adding a clause to the end:

    UPDATE employee.salary = employee.salary * 1.05
This is a much more regular language; there's nothing forcing you to write these operations in a certain order, and you can add or remove clauses as necessary. For complex queries, I find that I write in this style anyway, using a chain of WITH clauses as a pipeline with a final SELECT on the end to get the results.
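The WITH-clause pipeline described above can be sketched concretely. This is a hypothetical employee/department schema (the table and column names just mirror the example query), run through Python's built-in sqlite3 so it's self-contained:

```python
import sqlite3

# Hypothetical schema and data, only to make the pipeline runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE department (DepartmentID INTEGER, name TEXT);
    CREATE TABLE employee (first TEXT, last TEXT, DepartmentID INTEGER);
    INSERT INTO department VALUES (1, 'Sales'), (2, 'Engineering'), (3, 'HR');
    INSERT INTO employee VALUES
        ('Ada', 'Lovelace', 2), ('Grace', 'Hopper', 2),
        ('Don', 'Draper', 1), ('Toby', 'Flenderson', 3);
""")

# Each CTE is one pipeline stage; the final SELECT just reads the last stage.
rows = conn.execute("""
    WITH joined AS (
        SELECT e.first, e.last, d.name AS dept
        FROM employee e
        LEFT JOIN department d ON e.DepartmentID = d.DepartmentID
    ),
    filtered AS (
        SELECT * FROM joined WHERE dept IN ('Sales', 'Engineering')
    )
    SELECT first, last, dept FROM filtered ORDER BY last
""").fetchall()
```

Each stage (`joined`, `filtered`) is a named result set that the next clause transforms, which is exactly the pipeline reading the comment proposes, just with SQL's syntax wrapped around it.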

  > for some reason the table comes after its fields
I agree it would have been better to start with the FROM clause and have the fields listed near the end.

  > I write in this style anyway,
  > using a chain of WITH clauses
That sounds like it could make it better. I will try that more often.

  > the ORDER BY can't be expressed using field aliases
  > defined at the start of the query
In Postgres you can!
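For what it's worth, SQLite accepts an alias in ORDER BY as well. A minimal illustration (made-up table and columns), runnable via Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (price REAL, qty INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(2.0, 3), (1.0, 10), (5.0, 1)])

# The alias 'total' defined in the SELECT list is usable in ORDER BY.
rows = conn.execute(
    "SELECT price * qty AS total FROM t ORDER BY total"
).fetchall()
```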

This reads very much like LINQ query syntax:

    from e in employee
    join d in department on e.DepartmentID equals d.DepartmentID
    where new string[] { "Sales", "Engineering" }.Contains(d.name)
    orderby e.last
    select new { e.first, e.last, d.name }
Forgive any errors, it's been a while.

So it's just like SQL but with the order of statements being more flexible? Does not seem entirely unreasonable.

Basically yes, though I'd also make it so if you alias a field it's referred to by the alias from then on (e.g. FIELDS foo AS bar ORDER BY bar instead of SELECT foo AS bar ORDER BY foo). I'd probably be inclined to rename some of the keywords as well (SORT, FILTER, MAP, REDUCE perhaps).

There are some query builder libraries that let you specify clauses in a different order, like "query.from(table).where(col, value).select([...fields])", but I have yet to find one that lets you specify the query as a series of transformations on tables. (I think the word I'm looking for is "isomorphic" but that doesn't seem right somehow.)
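A minimal sketch of what such an order-independent builder could look like. This is an entirely hypothetical API, not any real library; each method returns a new query value, so building one really is a series of transformations:

```python
class Query:
    # Toy order-independent query builder: clauses can be supplied in any
    # order, and the SQL text is only assembled when sql() is called.
    def __init__(self, table=None, wheres=(), fields=("*",)):
        self.table, self.wheres, self.fields = table, tuple(wheres), tuple(fields)

    def from_(self, table):
        return Query(table, self.wheres, self.fields)

    def where(self, cond):
        return Query(self.table, self.wheres + (cond,), self.fields)

    def select(self, *fields):
        return Query(self.table, self.wheres, fields)

    def sql(self):
        text = f"SELECT {', '.join(self.fields)} FROM {self.table}"
        if self.wheres:
            text += " WHERE " + " AND ".join(self.wheres)
        return text

# Clauses given in a "wrong" order; the result is the same either way.
q = Query().where("dept = 'Sales'").select("first", "last").from_("employee")
print(q.sql())  # SELECT first, last FROM employee WHERE dept = 'Sales'
```

Because every method returns a fresh immutable value, a partially built query can be shared and extended in different directions, which is the "series of transformations on tables" idea the comment is reaching for.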

Have you ever seen Arel?


> I've been toying with the idea of a language where queries are expressed as a pipeline of operations on result sets

That is what the relational database engine does in the first place. In fact, many really smart people have figured out how to make the engine really good at it and SQL is the best for working with it.

Thing is, SQL maps fairly directly to the fundamental operations defined on relations in their rigorous sense. Every abstraction of those operations is going to have a similar form to SQL unless it seriously over-complicates things.
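That correspondence can be made concrete: the core relational-algebra operators (selection, projection, natural join) are tiny set-of-rows operations, and SQL clauses map onto them almost one-to-one. A toy illustration, with relations as lists of dicts (assumes non-empty relations):

```python
def select_(rel, pred):          # sigma (selection)   -> SQL WHERE
    return [r for r in rel if pred(r)]

def project(rel, cols):          # pi (projection)     -> SQL SELECT list
    out = [{c: r[c] for c in cols} for r in rel]
    return [r for i, r in enumerate(out) if r not in out[:i]]  # dedupe

def join(r1, r2):                # natural join        -> SQL JOIN ... ON
    common = set(r1[0]) & set(r2[0])  # shared column names (non-empty rels)
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in common)]

emp = [{"name": "Ada", "dept_id": 2}, {"name": "Don", "dept_id": 1}]
dept = [{"dept_id": 1, "dept": "Sales"}, {"dept_id": 2, "dept": "Engineering"}]

# Equivalent of: SELECT name FROM emp NATURAL JOIN dept WHERE dept = 'Sales'
result = project(select_(join(emp, dept), lambda r: r["dept"] == "Sales"),
                 ["name"])
```

Any query interface over relations ends up exposing these same operators in some spelling, which is why alternatives tend to look like SQL rearranged.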
