Schema-less is usually a lie

programminggeek · on July 3, 2013

100% agree with this post and it's important to understand where/when schema in the code makes more sense than schema in the db.

If the DB is determining the schema and your code just mirrors it, you will inevitably do things like triggers, stored procedures, etc. that essentially put application code inside the DB. This makes testing and maintenance of such things all but impossible, even if it does make the DB queries fast. At the same time, your application still needs to mirror the schema of the DB or things will break.

"Schemaless" (query schemas) isn't better per se, but having the mindset of putting the schema in the code means you can write tests to ensure the schema over time. Depending on your perspective, one might be better than the other, but as they say "knowing is half the battle"

joe_the_user · on July 3, 2013

If the DB is determining the schema and your code just mirrors it, you will inevitably do things like triggers, stored procedures, etc. that essentially put application code inside the DB. This makes testing and maintenance of such things all but impossible, even if it does make the DB queries fast.

I don't believe this is necessarily true. I think it depends on the structure of the enterprise.

The traditional relational database seems to center on what could be called the "traditional large enterprise". This large enterprise tends to have many projects sharing the same static data and here a single store for that data makes sense and that store can and should have a standard interface, which can be embodied as stored procedures if necessary and would have to be well-defined enough that individual applications can deal with it.

That's the "traditional large enterprise". Today, however, we have large companies where single application has become the essence of the company (Google, Facebook, etc)and the one-datastore, multiple-applications model no longer makes sense and it does make sense to move more data manipulation the application level.

Still, when making pronouncements about what works best, I think context is important.

dragonwriter · on July 3, 2013

> Today, however, we have large companies where single application has become the essence of the company (Google, Facebook, etc)and the one-datastore, multiple-applications model no longer makes sense

Which "one application" is the essence of Google?

coldtea · on July 4, 2013

All of them. Google is mainly a bunch of "one applications".

Maps is maps, Search is search, YouTube is YouTube, GMail is Gmail etc. Those are not apps providing different views of the same data, in the way the parent describer enterprise apps.

joe_the_user · on July 3, 2013

Whatever, say, Google-a-few-year-back then.

dragonwriter · on July 3, 2013

Which just illustrates that while comparatively young companies may all be all about one application, its not exactly unusual even for a company that starts out that way to rapidly grow into one providing a large array of applications with overlapping use of data.

While there are reasons that running all those applications for, say, Google on a shared RDBMS backend isn't the right answer, the reason isn't that Google has a single application that uses all its data and so doesn't have to worry about coordination between different applications using the same data.

dragonwriter · on July 3, 2013

> If the DB is determining the schema and your code just mirrors it, you will inevitably do things like triggers, stored procedures, etc. that essentially put application code inside the DB. This makes testing and maintenance of such things all but impossible

If testing and maintenance of database objects (tables and other relations, as well as procedural code in triggers, SPs, etc.) is "all but impossible", the problem is with your DB maintenance policies and practices, not with where you are putting code.

It may be the case that lots of places have bad database maintenance practices (just as lots of places have bad maintenance practices for non-DB software), and it may be that in those environments, when you have a limited scope of influence, routing around those bad practices is the best solution. But it is a mistake to present that as a general solution.

camus · on July 3, 2013

> This makes testing and maintenance of such things all but impossible

Developers have been maintaining such apps for decades without any problem. And everything is testable , even a stocked procedure.

If you dont care about data integrity, then make your application responsible for your schema...Some do care. Non-rel databases guarantee 0 data integrity, with very little performance gain over relational-databases.

Finally most frameworks allow developpers to generate the db schema during development without writing a single query. with all the cache layers like redis and other goodies , there is little to no reason to use non-rel databases.

tracker1 · on July 5, 2013

A given RDBMS doesn't guarantee any data integrity. The integrity is a bit of a lie. I've seen plenty of real world databases with software created by idiots where a DateTime field was defined as a VARCHAR. Or fields that should have been a VARCHAR created as CHAR, then bugs happen with the strings aren't auto-trimmed.

IMHO the database these days SHOULD be fairly agnostic. Also normalization, which is the core of your so called integrity actually REDUCES performance. At my last job, our main VIEW into the data took over 28 join operations for some highly normalized data (much of which had to overcome some bad data in the db).

We setup a MongoDB database for searching against, as well as being able to pull up a single record without dozens of join operations, and it was a LOT faster, against real-time data... the search that was replaced was a batch process that recreated a single table every half hour. In this case MongoDB was a much better fit.

Sorry, but SQL databases don't guarantee any data integrity either. It's up to the developers that implement those schemas... and the fact is, for most of them, they are better off doing that in their primary application code.

Also, if you are using an ORM tool to "generate" your schema from code, then what advantage does said schema's "integrity" give you?

lttlrck · on July 3, 2013

"Non-rel databases guarantee 0 data integrity, with very little performance gain over relational-databases."

guarantee ZERO data integrity?

gregwebs · on July 3, 2013

I disagree with the idea that MongoDB has struck the right balance. I have seen production issues from manual editing of data which resulted in a floating point instead of an Integer.

I really wish MongoDB had support for a strong schema in the DB besides indexes (in addition to existing support for schema-less). I only actually want schema-less at most 20% of the time, but I am stuck with it for everything outside of indexes.

spamizbad · on July 3, 2013

Completely agree. I'd love it if you could provide collections a baseline set of document validation rules- rules that only apply to fields you explicitly mention.

I guess what I want is a schema-ish database, where I can say "these 5 fields are required and must conform to these rules. Anything else is fair game."

vyrotek · on July 3, 2013

I've been looking some something similar to what you've described. I recently discovered PostgreSQL has a 'hstore' feature which seems to support this. You seem to be able to make a normal table and then specify one column to act like a key/value dictionary. These values can even be included in queries.

http://www.postgresql.org/docs/9.0/static/hstore.html

antsaasma · on July 4, 2013

PostgreSQL also has JSON support. You might want to check out Christphe Pettus's presentation on this subject:

http://thebuild.com/blog/2013/07/02/postgresql-as-a-nosql-da...

tracker1 · on July 5, 2013

It is pretty trivial to create an API/Repository in front of a non-sql based database that can use JSON-Schema for validation of data structures. In fact, it's probably better if you do this.

For that matter, if you use NodeJS, then said API and the backend structure are fairly trivial. I used NodeJS to create an API for queries against MongoDB, and it allowed for me to normalize and check data in said queries against the data. It also allowed me to do programmatic elimination of sensitive portions of data from the front end without much effort at all.

It seems what you really should be creating is an API that your application uses. I'm a proponent of the Data Storage Layer being as dumb as possible. If it weren't for the built in indexing, and common access structures, I'd be more inclined to roll my own. I do think that MongoDB does a nice job of striking a balance between say Couch and MySQL (I intentionally use MySQL here instead of PostgreSQL or MS-SQL). I also think that RethinkDB within a year or so will likely be a better option for many of those thinking about MongoDB.

jasonmccay · on July 3, 2013

I really like the term "schema-ish" and I think that most people probably want something like this. They will inevitably still have a defined structure for their data, but they need that extra bit of flexibility to keep the overall data structure simplistic and sensible.

tokenizerrr · on July 3, 2013

I recall reading about how MongoDB saves all the key names with each record, since there is no schema. This results in quite a lot of disk space going to waste. It seems like this could be a solution to that problem as well, I like it.

jasonmccay · on July 4, 2013

The guys at Tokutek are solving that problem with MongoDB. Pretty interesting stuff ... you should check it out.

leeoniya · on July 3, 2013

from https://news.ycombinator.com/item?id=3838131

'...but every time I see "Schemaless" in MongoDB, I think "oh, so you're implementing schema in your application?"'

marcosdumay · on July 4, 2013

And missing most of the centralized documentation people use everywhere.

But if you only have one application, you don't need centralized documentation.

tracker1 · on July 5, 2013

You can create an API for data access that validates a given document schema... you don't have to operate in the open.

jbellis · on July 3, 2013

Cassandra is not schema-less. I covered this in my recent article on Cassandra Myths: http://www.infoq.com/articles/cassandra-mythology

kailuowang · on July 3, 2013

To me, schema-less simply means that you don't need to maintain the data structure in a separate schema definition. It doesn't mean that your data doesn't have a structure - the structure is in your data, query and indexes, it's just easier to change without that extra "schema" definition.

wvenable · on July 3, 2013

It's not easier to change. When you have a schema-less database you don't change the schema, you add a new one for new data. Any data with an old schema continues to have it. So at any time you have up to n schemas simultaneously in your database if you changed it n times.

rhizome · on July 3, 2013

Here, step aside while I demonstrate my any-endian arbitrary wordlength CPU.

taeric · on July 3, 2013

Isn't this akin to the idea that many dynamic programs actually have fairly static types?

6ren · on July 3, 2013

Stretching it: natural language was grammatical before grammar was invented. Natural laws existed before they were discovered. But it can be easier to change grammar/law/types/schema when not explicit.

joe_the_user · on July 3, 2013

"...it can be easier to change grammar/law/types/schema when not explicit."

Now that's quite interesting statement. I'd say it is true for humans but false for computers and it's paradoxical to think about why this is.

If computers were an intelligent as humans, you wouldn't have to worry about giving your program any structure because the computer could change that structure later. But sadly, spaghetti code isn't what it's cracked up to be.

Similarly, an application where don't bother thinking about your schema beforehand isn't going to be application which you can change easily later.

taeric · on July 5, 2013

It is not just spaghetti code, though. Right? It is code that does not constantly check that input fits an expected pattern. Which is most code. I'm reminded of this talk by Sussman.[1] Essentially, in software we are (understandably) adamant about not modifying the input of our application for "pure" functions. Nature goes right the opposite, often attempting to make the input work at all costs.

So, whereas a schema based application that models a user with an address is unable to cope with being given two addresses. A schema-less approach where it just stores blobs simply stores what it was given. If code that reads this is unable to make sense of multiple addresses, it will raise an error. Not necessarily unlike a user being told to send a package to an address, but given a list of addresses.

[1] http://www.infoq.com/presentations/We-Really-Dont-Know-How-T...

taeric · on July 3, 2013

I thought that was the point, though. Just because the schema is not explicitly set in the database, does not mean there is not a "de facto" schema. Sure, some changes will be easier this way, but just changing things is almost assuredly going to break something.

tome · on July 3, 2013

That sounds like a good analogy to me.

Jare · on July 3, 2013

> Indexes on collections in MongoDB are very much “schema”

I think that's all you need to read in this post. It's... rather thin IMHO.

Encave · on July 3, 2013

Here are some slides from a good talk by Martin Fowler about schemalessness and how there is always an 'implicit schema' http://martinfowler.com/articles/schemaless/

just2n · on July 3, 2013

Isn't a schemaless database just a filesystem?

tracker1 · on July 5, 2013

Some honestly aren't much more... they all have at least a "key" or id field you can lookup a record by, and most have a map/reduce implementation. Some require you to do your own indexed records and others (like MongoDB and RethinkDB) have the ability to add additional indexes (similar to SQL).

I find that MongoDB tends to perform at least as well as an SQL based solution in general use. Some use cases are better suited to SQL (anything with multiple records in a single transaction as a hard requirement for example). It really depends on your needs.

camus · on July 3, 2013

it'w worse , it is basically a bin where you throw paper. You can still search for a specific one ,but as you throw more paper in the trashcan , it gets more complicated to find the paper you want. but hey ,it is easier and faster than to organize them on a rack they say ...

tracker1 · on July 5, 2013

It really depends on the non-sql solution... Most of which do have indexed keys for retrieving data at the very least (key-value storage), and use Map-Reduce in order to perform basic queries, or you setup your own indexed records.

MongoDB specifically allows for additional indexes to be used as part of general queries, or the aggregation methods available. MongoDB even has some nice built in features for querying against geolocation data. They do have some limitations, but in general it works quite well.

Your analogy comparing non-sql to a trashcan is a bad comparison... a better comparison would be to a set of file cabinets.

cmccabe · on July 4, 2013

So it's a "circular file" system.

thelarry · on July 3, 2013

i feel like a lot of people seem to jump on to "schemaless" dbs because they feel the can just dump whatever they want into the database. they also run into a problem later on when they realize that they need to care about data integrity and types.

tracker1 · on July 5, 2013

Or you have a data structure that would require a lot of tables for what equates to a single record of interest.

In my last job, we were presenting search, and display of classified listings for cars. The normalized data required, iirc, 28 join operations to get most of the data for a single record (for display), about 12 iirc for the search support (not including geo/location based searches), and a second lookup for related data.

This could be replaced by a single query in a non-sql database. There is a real cost to these kinds of structures in SQL... There are a LOT of use cases where a single record structured as a complete object is much better than having to break up said structure into dozens of fields.

gsabo · on July 3, 2013

I believe Kyoto Cabinet is an example of a truly Schema-less DB:

http://fallabs.com/kyotocabinet/