The Future of NoSQL (foundationdb.com)
44 points by gk1 on Nov 13, 2014 | 24 comments

While I understand the initial appeal of schemaless databases, in my experience the schema is the best living documentation of the shape of the data. It becomes really handy to decouple this from the application layer when you start having multiple clients connecting to the database (transactional vs. analytics workloads). I've also had my fair share of seemingly nondeterministic behavior when working, well-tested code hits old data that you forgot was in a slightly different format.

> when you start having multiple clients connecting to the database

If you assume that will happen, then all the things you suggest are true: the schema is indeed the best documentation, and clients will have to pay close attention to versioning since you can really only have one version of a fixed schema at a time. You'll probably also want to move business logic into the database in the form of foreign key constraints, triggers and the like. Getting that right is really important to protect against a broken client corrupting data.

But that isn't the only strategy. You can instead have clients connect to an API, with the API implementation being the only thing that connects to the database. The API becomes the documentation. It can handle versioning. It handles business logic. In this world, the database schema is much less important, and you can safely use schemaless databases.

Both designs have their advantages, and multiple clients connecting directly to the database may well be a better choice in many circumstances, but it's not inevitable.

Exactly. Your database may be schemaless, but your data sure isn't.

> Schema-less design allows data to be modeled more flexibly than in relational databases, which lock the developer into a single schema at any given point in operations.

Implying that schemaless design is a GOOD thing

I've always been interested in NoSQL databases, but I never understood the advantage of a schema-less design. Migrations make it so you can change the structure of your data at any time so you really aren't locked into anything. Even ignoring that, your data has to have some kind of shape to it (a kind of ad-hoc schema) and you still have to deal with data whose shape has changed (which migrations take care of for you).

Migrations get expensive and risky when you have lots of data. Being able to do them incrementally is sometimes worth the complication of maintaining code to read older versions. A traditional database doesn't let you do that; you have to execute an "alter table" statement all at once.
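To make the incremental approach concrete, here is a minimal sketch of upgrading records lazily, at the moment they are read. Everything in it is an assumption for illustration: the `version` field, the change from `name` to `first_name`/`last_name`, and the plain dict standing in for the database.

```python
# Hypothetical record versions; the field names are illustrative only.
CURRENT_VERSION = 2


def upgrade_v1_to_v2(doc):
    """v1 stored a single 'name'; v2 splits it into first and last name."""
    first, _, last = doc["name"].partition(" ")
    upgraded = {k: v for k, v in doc.items() if k != "name"}
    upgraded.update(first_name=first, last_name=last, version=2)
    return upgraded


# Maps a stored version to the function that lifts it one version up.
UPGRADES = {1: upgrade_v1_to_v2}


def read(store, key):
    """Return the record at the current version, rewriting it if it was stale."""
    doc = store[key]
    version = doc.get("version", 1)  # records written before versioning count as v1
    while version < CURRENT_VERSION:
        doc = UPGRADES[version](doc)
        version = doc["version"]
        store[key] = doc  # write back so the upgrade cost is paid only once
    return doc


# A dict stands in for the database.
store = {
    "u1": {"name": "Ada Lovelace"},
    "u2": {"version": 2, "first_name": "Alan", "last_name": "Turing"},
}
```

Records that nobody reads stay in the old format forever, which is the point: the cost is the upgrade code you have to keep around for every old version still in storage.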

The current state of the art with MySQL uses temporary tables and triggers to perform online migrations.
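For anyone who hasn't seen the pattern, here is a rough sketch of the shadow-table-plus-triggers idea. It uses SQLite from Python's standard library purely as a stand-in, so it is not how any particular MySQL tool is implemented; the table names, the `email` column, and the chunk size are all made up for the example, and it assumes primary keys never change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
INSERT INTO users (id, name) VALUES (1, 'ada'), (2, 'alan'), (3, 'grace');

-- 1. Shadow table that already has the new column.
CREATE TABLE users_new (id INTEGER PRIMARY KEY, name TEXT, email TEXT DEFAULT '');

-- 2. Triggers mirror live writes into the shadow table while the copy runs.
CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT OR REPLACE INTO users_new (id, name) VALUES (NEW.id, NEW.name);
END;
CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT OR REPLACE INTO users_new (id, name) VALUES (NEW.id, NEW.name);
END;
CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    DELETE FROM users_new WHERE id = OLD.id;
END;
""")


def backfill(conn, chunk=2):
    """3. Copy existing rows in small chunks so no single statement runs long."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, name FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk),
        ).fetchall()
        if not rows:
            break
        # OR IGNORE: never overwrite a row the triggers already mirrored.
        conn.executemany(
            "INSERT OR IGNORE INTO users_new (id, name) VALUES (?, ?)", rows
        )
        conn.commit()
        last_id = rows[-1][0]


conn.execute("UPDATE users SET name = 'Ada' WHERE id = 1")  # write before the copy
backfill(conn)
conn.execute("INSERT INTO users (id, name) VALUES (4, 'edsger')")  # write after it
conn.commit()

# 4. Swap the tables once the shadow copy has caught up.
conn.executescript("""
DROP TRIGGER users_ins;
DROP TRIGGER users_upd;
DROP TRIGGER users_del;
ALTER TABLE users RENAME TO users_old;
ALTER TABLE users_new RENAME TO users;
DROP TABLE users_old;
""")

rows = conn.execute("SELECT id, name, email FROM users ORDER BY id").fetchall()
```

On a real system the swap step is where the care goes, since writes arriving between the last copied chunk and the rename must not be lost.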

yup. once upon a time we had a MySQL db on commodity hardware doing around 6K QPS with over 100M rows in the biggest table. you can't do much to it without taking it offline, and even then you have no idea if/when the migration will complete.

There is no advantage of a schema-less design, because there is always a schema. It is just a matter of whether the schema is made explicit and enforced, or whether it is scattered all over the application code.

it can make it much easier to do on-demand migrations: in massive online apps, a large percentage of data is essentially 'dead' - it's unlikely anyone will ever access it again. and massive online apps (call it 'web-scale' if you will ;) can accumulate a lot of data. the systems they're discussing make it easier to migrate data as needed instead of at once (which sometimes isn't even possible if the infrastructure is really being pounded).

The advantage of schemaless is that if you are adding extra attributes (the most common database change) then you don't have to do any migration. And migrations have a long history of having side effects, as well as requiring the involvement of Database/Operations teams if you work in the enterprise.

And as I mentioned before the "shape of your data" is quite often defined in your application anyway.

I imagine that side effects exist and are more difficult to spot in a NoSQL migration.

The best way to handle this is through schema evolution rather than migration. No red flag days or downtime. Works with rolling upgrades. See strategies like my own avrobase, facebook/twitter thrift or google's protocol buffers.
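A minimal sketch of what schema evolution means in practice, written in plain Python rather than against the actual Avro, Thrift, or Protocol Buffers APIs. The `READER_SCHEMA` layout and the `resolve` helper are hypothetical; the idea they imitate is that a reader fills in defaults for fields an older writer didn't know about and ignores fields a newer writer added.

```python
# Hypothetical reader schema: a field with a "default" is optional,
# a field without one is required.
READER_SCHEMA = {
    "id": {},
    "name": {},
    "email": {"default": None},  # added in a later release
}


def resolve(record, reader_schema):
    """Project a stored record onto the reader's schema."""
    out = {}
    for field, spec in reader_schema.items():
        if field in record:
            out[field] = record[field]
        elif "default" in spec:
            out[field] = spec["default"]  # written before this field existed
        else:
            raise ValueError(f"missing required field: {field}")
    # Fields the reader does not know about are dropped, so a newer
    # writer cannot break an older reader.
    return out


old_record = {"id": 1, "name": "ada"}  # from an older writer
new_record = {"id": 2, "name": "alan", "email": "a@example.com", "phone": "555"}  # from a newer one
```

Because old and new records both resolve cleanly, old and new application versions can run side by side during a rolling upgrade, as long as every added field carries a default.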

Schemaless design is a tradeoff like many things. There are systems where being able to allow the client to write new data structures can be important. (One example I've seen was a system that tracked arbitrary metric messages from devices that could add fields based on their device type. The system needed to be forward-compatible without speaking to the parties implementing those devices.)

Also, some of those systems have a _dynamic_ schema, e.g. Elasticsearch, where they learn the data type on the fly, but disallow incompatible changes on fields already known, which is a bit of an inbetween.
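A toy sketch of that in-between behavior, assuming nothing about how Elasticsearch actually implements its mappings (real systems also have coercion rules that this ignores). The `DynamicSchema` class is hypothetical: it records the type of each field the first time it sees it and rejects later documents that disagree.

```python
class DynamicSchema:
    """Learns field types on first sight and rejects incompatible later values."""

    def __init__(self):
        self.types = {}  # field name -> Python type seen first

    def index(self, doc):
        # Validate the whole document first, so a rejected document
        # never leaves a partially updated schema behind.
        for field, value in doc.items():
            known = self.types.get(field)
            if known is not None and type(value) is not known:
                raise TypeError(
                    f"field {field!r} is {known.__name__}, got {type(value).__name__}"
                )
        for field, value in doc.items():
            self.types.setdefault(field, type(value))


schema = DynamicSchema()
schema.index({"temp": 21.5})                  # learns temp: float
schema.index({"temp": 19.0, "humidity": 40})  # new field is accepted
```

Indexing `{"temp": "hot"}` after this raises `TypeError`, while devices remain free to add fields nobody has seen before.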

Database schemas are a complete waste of time for many applications.

If you have a well-defined application data model and you use an ORM, then what does a database schema actually get you? You can centralise and better enforce data integrity within your application.

How is it better enforced in your application?

I realize that having worked with dynamic languages for many years, I rely on the database for keeping my data types in check. That way the language can be more relaxed about it. Nulls are still a major pain in the arse though.

> You can centralise and better enforce data integrity within your application.

Right up until someone writes an import script! That's WebScale(tm)

"After extensive experience working with Bigtable and other eventually consistent systems..."

This is not accurate -- Bigtable is not eventually consistent. The scope of transactions supported by a system is a different set of considerations from the level of consistency it provides. Bigtable is consistent but only allows for transactionality at the row level.

Optimistic concurrency control is nothing new and Percolator layered transactions on top of Bigtable years back. Furthermore, TrueTime -- allowing for comparatively low-latency update across a globally distributed set of DCs -- is the real innovation in Spanner, not the use of optimistic concurrency control.
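For readers unfamiliar with the term, here is a bare-bones sketch of optimistic concurrency control on a single key. It is a generic version check with retry, not the protocol Percolator or Spanner actually uses; `VersionedStore`, `Conflict`, and `increment` are names invented for the example.

```python
class Conflict(Exception):
    """Raised when a commit is based on a stale read."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        return self._data.get(key, (0, None))

    def commit(self, key, expected_version, value):
        """Apply the write only if nobody committed since our read."""
        current, _ = self._data.get(key, (0, None))
        if current != expected_version:
            raise Conflict(key)
        self._data[key] = (current + 1, value)


def increment(store, key, retries=5):
    """Read, compute, try to commit; on conflict re-read and try again."""
    for _ in range(retries):
        version, value = store.read(key)
        try:
            store.commit(key, version, (value or 0) + 1)
            return
        except Conflict:
            continue
    raise Conflict(key)


store = VersionedStore()
increment(store, "hits")
increment(store, "hits")
```

No locks are held between the read and the commit, so this works well when conflicts are rare and wastes work through retries when they are common.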

Honestly, I am not sure what this article is trying to claim, except perhaps that per-node performance has been improved. AFAICT, most of this is due to the fact that RAM is cheaper than it was, SSDs have reached commoditization, and networks in the DC are faster than they used to be.

NoSQL always seemed like a misnomer; it should be SomeSQL. PostgreSQL can do the same kind of ops, and usually faster than your average NoSQL db: http://www.enterprisedb.com/nosql-for-enterprise

According to Martin Fowler, #nosql was initially just a hashtag somebody came up with for a meetup on some of the new databases, with nary a thought that it would become the name of a "movement".

It should maybe be renamed noRealUnderstandingOfSQL.

Certainly the majority of proponents I speak to want to use noSQL databases without any real reason, and where a relational database would be a better fit.

Right. What did that Codd guy know, anyway? As if data can be modelled as relations! Information wants to be FREE! Normal forms are a straitjacket that no self-respecting hipster would wear to the bar at lunch-time.

It's supposed to stand for Not Only SQL, I think.

I can't blame these guys for trying to write these articles, because I also write on these subjects. BUT there is always something that bothers me about them (maybe because I'm jealous), because their advertisements follow me around, even though I use ad blockers.

Basically, I feel like it is all promotional material, despite the fact that it is dealing with technical stuff that I research all the time, and well, as a hypocrite also try to promote my work. But I open source my work, while with them I feel like they are just trying to bait/give-an-excuse for me to click the buy button.

But I know, as with all software, it is going to come with the initiation... and in psychology, there is the sunk cost fallacy. Right? Where the more I invest, the more I want to believe and will make excuses to keep going with it. And so the incentives aren't actually aligned.

Compare this with let's say MongoDB - not only is it free AND ridiculously easy to start playing with, they then have to wait long enough to bait you into their consulting. However, as we go up the "cult ladder of software" we don't have the sunk-cost fallacy of money we've invested (just time), so it is easier for us to bail. That sucks for 10gen.

So with Foundation, I'm sure that once you are in the circle, it is wonderful, because that is what you are paying for. But it makes me not even want to enter into the circle in the first place. So what am I asking of Foundation? To open source their product? Make their lives suckier and harder to pay their bills?

Unfortunately yes, not because I'm malicious, but because database technology is much more about academics than anything else. Not that business and money-making can't co-exist with that, but because it is a field where we fundamentally have to have open access and collaborate even if we are competing for funds/grants/money/customers.

Let's ask why again. Why? Because nobody, businesses or not, is going to get reliable progress until then. Yes, we'll get incremental innovations, and Foundation probably has that, but these will be lost and reset time and time again, until everyone in the field is willing to sacrifice their all. Unfortunately, this is a field that requires extreme expertise and costly talent to push forward, and even then, only slowly.

So I'm not going to even bother analyzing the technical claims of the post, because we have so much work to do first in even defining and making the terms more common, understandable, concrete and clear. Despite a century or so worth of work in the field.

Honestly, I feel like http://aphyr.com/ is doing the most important contribution, despite the fact that he isn't even building solutions (compared to my team and Foundation's). Why? Because Aphyr has had great success in popularizing the discussion and providing needed clarity. And yes yes yes, I know that Foundation claims to have run their own Jepsen tests and so on. Good for them, I'm still trying to even get to that point. But... openness.

I probably shouldn't be as critical as I am being; a lot of my complaints in here can easily be refuted with various aspects of Foundation, like their free getting started, like their pay-what-you-need, etc. And obviously they are contributing to the discussion by making posts like this. But I feel like at least some of my comments and sentiments resonate, maybe at least in helping Foundation know they are giving off a weird vibe/signal that they aren't even aware of.
