In many applications, data outlives code. This is certainly the case in enterprise applications, where data can sometimes migrate across several generations of an application. Data may also be more valuable to the organization than the code that processes it.
While I'm no fan of databases, one obvious advantage is that they provide direct access to the data in a standard way that is decoupled from the specific application code. This makes it easy to perform migrations, backups etc. It also increases one's confidence in the data integrity. Any solution that aims to replace databases altogether must address these concerns. I think that intimately coupling data with the application state, as suggested in the article, does not achieve this.
The goal is not to replace databases altogether. The goal is to solve some particular problems very well. Last time I used this approach, for example, we mirrored a bunch of data in a traditional SQL store for reporting and ad-hoc querying, things that databases are great at.
In my view, direct access to data decoupled from application code is a bug, not a feature. With multiple code bases touching the same data, schema improvements become nearly impossible.
I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.
"I also think data integrity is easier to maintain with a system like this."
If you are in the middle of a transaction and you realize that some constraint is being violated, how do you roll it back without interfering with the other transactions?
I can't speak to all systems like this, but the Prevayler approach is pretty straightforward. Most importantly, there are no simultaneous transactions: changes happen one at a time. That seems crazy if you're used to dealing with disk-backed databases, but if everything is hot in RAM, then it's not a problem. In that context, it's pretty easy: when you start executing a change you verify all your constraints before doing anything.
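To make that concrete, here is a minimal sketch of the idea (not Prevayler's actual API; the Bank/Transfer names are made up for illustration): commands run one at a time under a single lock, and each one checks every constraint against the in-memory model before it mutates anything.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical model and command types (not Prevayler's actual API).
    class Bank {
        final Map<String, Long> balances = new HashMap<>(); // account id -> cents
    }

    interface Command {
        void validate(Bank bank); // throws if any constraint would be violated
        void apply(Bank bank);    // only called after validate() has passed
    }

    class Transfer implements Command {
        final String from, to;
        final long cents;

        Transfer(String from, String to, long cents) {
            this.from = from; this.to = to; this.cents = cents;
        }

        public void validate(Bank bank) {
            Long balance = bank.balances.get(from);
            if (balance == null || !bank.balances.containsKey(to))
                throw new IllegalStateException("unknown account");
            if (balance < cents)
                throw new IllegalStateException("insufficient funds");
        }

        public void apply(Bank bank) {
            bank.balances.merge(from, -cents, Long::sum);
            bank.balances.merge(to, cents, Long::sum);
        }
    }

    class CommandExecutor {
        private final Bank bank = new Bank();

        // One command at a time: a single lock serializes all writes.
        public synchronized void execute(Command command) {
            command.validate(bank); // check every constraint before touching anything
            command.apply(bank);    // safe to mutate; nothing else runs concurrently
        }
    }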
"Most importantly, there are no simultaneous transactions... but if everything is hot in RAM, then it's not a problem"
Uh, OK. So, you're happy with single-core boxes then, I take it?
Actually, you're regressing to even before that, when there was no pre-emptive multitasking. When the program is done doing something, it yields control to some other task.
Also, I'd like to point out that just because you aren't explicitly doing I/O doesn't mean that you aren't doing I/O. The OS might have paged out some stale data (quite likely, since you aren't managing I/O yourself), and you might be holding the giant lock while it's paging it in.
"I can't speak to all systems like this, but the Prevayler approach is pretty straightforward."
I just want to clarify: so when you encounter a problem, you do some rollback, which automatically restores the state from the last snapshot and replays forward to the transaction before the failed one, right? No manual steps?
I hope you have a recent snapshot or that will be a long wait (while holding the big lock, I might add).
> Uh, OK. So, you're happy with single-core boxes then, I take it?
Not at all. You use only one core for the core execution of write transactions, but that's a small part of any real system. All cores can read simultaneously. All cores can also do all sorts of other work, including preparing transactions to execute, deserialization of requests, rendering responses, logging, and anything else your app needs to get up to.
The limit is also one core per transactional domain. If you can split your data up into lumps between which you never need transactions, you can happily run one core on each.
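A rough sketch of how that can look in code (the Domain wrapper below is hypothetical, not from any particular library): reads take a shared lock and run in parallel on any core, writes take the exclusive lock, and splitting the data into several Domain instances that never need cross-domain transactions gives each one its own writer core.

    import java.util.concurrent.locks.ReentrantReadWriteLock;
    import java.util.function.Consumer;
    import java.util.function.Function;

    // Hypothetical wrapper: reads share the lock and run in parallel on any core;
    // writes take it exclusively. One instance per independent transactional domain.
    class Domain<M> {
        private final M model;
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        Domain(M model) { this.model = model; }

        <R> R read(Function<M, R> query) {
            lock.readLock().lock();
            try { return query.apply(model); } finally { lock.readLock().unlock(); }
        }

        void write(Consumer<M> command) {
            lock.writeLock().lock();
            try { command.accept(model); } finally { lock.writeLock().unlock(); }
        }
    }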
> Also, I'd like to point out that just because you aren't explicitly doing I/O doesn't mean that you aren't doing I/O.
Actually, it does explicitly do I/O. You do it just before every command executes.
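For example, a sketch of that write-ahead command log (not Prevayler's real journal format): the command is durable on disk before the in-memory model is touched, so a crash can always be recovered by replaying the log.

    import java.io.BufferedOutputStream;
    import java.io.DataOutputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.file.Path;

    // Sketch: append the command to an on-disk log and sync it before applying it.
    class Journal {
        private final FileOutputStream file;
        private final DataOutputStream out;

        Journal(Path path) throws IOException {
            file = new FileOutputStream(path.toFile(), true); // append-only log
            out = new DataOutputStream(new BufferedOutputStream(file));
        }

        synchronized void execute(String serializedCommand, Runnable applyToModel) throws IOException {
            out.writeUTF(serializedCommand); // record the command, not the resulting state
            out.flush();
            file.getFD().sync();             // the explicit I/O, done just before execution
            applyToModel.run();              // now mutate RAM; on restart, replay the log
        }
    }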
> The OS might have paged out some stale data.
I guess that's possible, which would indeed cause a momentary pause, but this approach is typically used with dedicated servers and plenty of RAM, so it's never been a problem in practice for me.
> I just want to clarify: so when you encounter a problem, you do some rollback, which automatically moves the state to the last snapshot and rolls forward to the previous transaction, right?
You mean a bug in our code that causes a problem? Depends on the system, I suppose. Prevayler had an automatic rollback: it just kept two copies of the data model in RAM, and if a transaction blew up it would throw out the possibly tainted one. But there are a number of ways to solve this, so I don't advocate anything in particular. Other than heavy unit testing, so that things don't blow up much.
> So a single writer would block all readers, right?
Correct. For the fraction of a millisecond the transaction is executing, anyhow. Since transactions only deal with data hot in RAM, transactions are very fast.
> No, I mean like "I already wrote some data, but now a constraint has been violated so I need to undo it".
That shouldn't happen, and I've used two approaches to make sure. One is do all your checking before you change anything. The other is to make in-command reversion easy, which is basically the same way you'd make commands undoable.
Basically, instead of solving the problem with very complicated technology (arbitrary rollback), you solve it with some modest changes in coding style. Since you never have to worry about threading issues, I've found it pretty easy.
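Here is a minimal sketch of the reversion style, with made-up Inventory/ReserveStock types: each step knows its exact inverse, so if a later step inside the same command blows up, the earlier ones are undone in reverse order before the error propagates.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    // Made-up inventory model; the point is that undo() is the exact mirror of apply().
    class Inventory {
        final Map<String, Integer> stock = new HashMap<>();
    }

    interface ReversibleStep {
        void apply(Inventory inv);
        void undo(Inventory inv);
    }

    class ReserveStock implements ReversibleStep {
        final String sku;
        final int quantity;

        ReserveStock(String sku, int quantity) { this.sku = sku; this.quantity = quantity; }

        public void apply(Inventory inv) { inv.stock.merge(sku, -quantity, Integer::sum); }
        public void undo(Inventory inv)  { inv.stock.merge(sku,  quantity, Integer::sum); }
    }

    class StepExecutor {
        private final Inventory inv = new Inventory();
        private final Deque<ReversibleStep> applied = new ArrayDeque<>();

        synchronized void execute(ReversibleStep... steps) {
            applied.clear();
            try {
                for (ReversibleStep step : steps) {
                    step.apply(inv);
                    applied.push(step);
                }
            } catch (RuntimeException e) {
                while (!applied.isEmpty()) applied.pop().undo(inv); // revert in reverse order
                throw e;
            }
        }
    }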
> Correct. For the fraction of a millisecond the transaction is executing, anyhow. Since transactions only deal with data hot in RAM, transactions are very fast.
Transactions don't just read and write. They sometimes compute things, like joins, which can take several milliseconds. These computations often must run within the transaction and would thus need to acquire the lock for several milliseconds.
Joins haven't been a problem for me, mainly because this approach doesn't constrain you to a tables-and-joins model of the world. With Prevayler, for example, you treat things as a big object graph, so there are no splits to join back up.
Of course, it could be that some problem is just computationally intense, but I can think of a number of approaches to lessen the impact of that in a NoDB system.
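To illustrate the object-graph point with hypothetical types (nothing Prevayler-specific): what a relational model would split across customer, order, and line-item tables is just references you follow, so the "join" is a plain traversal.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical object graph: what a relational schema would split across
    // customer, order, and line-item tables is just references followed in RAM.
    class LineItem {
        final String sku;
        final long priceCents;
        LineItem(String sku, long priceCents) { this.sku = sku; this.priceCents = priceCents; }
    }

    class Order {
        final List<LineItem> lines = new ArrayList<>();
    }

    class Customer {
        final String name;
        final List<Order> orders = new ArrayList<>();
        Customer(String name) { this.name = name; }
    }

    class Reports {
        // "Total spent by this customer" needs no join: just walk the graph.
        static long totalSpent(Customer c) {
            return c.orders.stream()
                    .flatMap(o -> o.lines.stream())
                    .mapToLong(l -> l.priceCents)
                    .sum();
        }
    }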
> I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.
Surely I could just as easily say 'Sure, I could use a data access layer in my app but if I'm writing a multi-app database, I'd rather the database enforced its own integrity.'?
Nothing's the perfect tool for every job, but I certainly think stored procedures have their place and have the power to handle the bulk of tasks.
Yes, you can definitely do it either way. Years ago as a demo a friend built the heart of a financial exchange in stored procedures. It was very fast, and very reliable. But the same is true about the LMAX system that Fowler describes.
Personally, though, I'd much rather do my important coding in a real programming language. Better tools, more libraries, bigger communities, and no vendor lock-in.
Modern SQL dialects are Turing complete and frankly pretty rich. I know MS SQL Server best, so can't speak in detail for others, but the community around that is certainly very substantial. Library support, well, doesn't work quite the same way (yet!), but there are plenty of libraries of code samples available for adapting. Vendor lock-in is a tricky one; by the time you've got to a certain scale of application, I tend to think you're programming as much to the API (whether it's the provider's standard API or your own specific API layered over the underlying platform) as to the official 'language'; lock-in can creep up surprisingly easily. Facebook avoided vendor lock-in by writing in open PHP and have since had to write their own PHP compiler to get the performance they needed from the solution they were locked into.
A former employer used to bulk-process EDI order lines in very large quantities: deduplicating them, dynamically rebatching them according to what was available and what wasn't, updating orders with newer product where the customer had specified 'this or better', and cross-referencing against multi-million-row datasets of cataloguing and tagging information to identify how to handle each item. It was a monster; I hate to think about the volumes of data that touched each batch, and with order processing it absolutely had to have transactional integrity. And yet, written in SQL and running on a very average commodity server, it was actually very fast. The data never left the server until it was ready to do so, and all the loads stayed happily internal. The implications of trying to implement it on a NoDB solution - the volumes of data being passed around, the amount of data-specific library code the DBMS provides but the underlying language doesn't, which would all need rewriting... It's not pretty.
I don't maintain SQL is the perfect language for everything; that's patently silly. But I do maintain it's a lot more powerful (and with good performance and reliability) than it's given credit for on some very complex operations, and that a lot of the reasons people prefer to work in alternatives boil down to lack of understanding. A little learning of what a modern DBMS is capable of can reap huge rewards in work saved in the 'real programming language', as you put it.
Like you, I believe in the right tool for the job. For many applications, an SQL server is awesome.
I'm sure the developer community for MS SQL Server is reasonably large, but it is much, much smaller than the Ruby community or the Java community. The same is true for library code for each environment.
I think the Facebook example cuts the other way. If they had implemented all their core application logic in MS SQL Server, they would have been well and truly screwed if the performance wasn't enough. With PHP, at least they could write their own compiler; trying to reverse-engineer MS SQL Server is orders of magnitude harder.
Regarding the "volumes of data being passed around" part, that works well with a hot-in-RAM system; no data is passed anywhere. As with doing it all in stored procedures, no data leaves the server.
I do agree that a lot of people don't get full value out of their database. Sometimes that's a very reasonable business choice: vendor lock-in is extremely expensive at scale. But it does often come from ignorance. On the other hand, almost every developer has written a few database-backed applications, but very few have written anything in the NoDB style. Many can't even conceive of coding something without a database. I'd love to see both sorts of ignorance reduced.
What have you needed out of ANSI SQL that is a gap in its Turing Completeness? Totally serious. A great many things can be dismissed as not being Turing Complete, so please provide us with some examples of why this is bad in ANSI SQL.
I did not mean that ANSI SQL was bad. However, by not being Turing complete, it has fundamental limits on the logic it can express (as you might need to do in a stored procedure). This frequently means that you must use proprietary extensions to SQL (such as PL/SQL) to accomplish these tasks.
My interpretation of the parent post was that it was a response to a comment about vendor lock in. I was only trying to point out that it is not always possible to ensure compatibility between databases by writing strict ANSI SQL.
You care about Turing completeness if you are trying to express something that requires it (which is something that frequently needs to be done in stored procedures). Also, SQL is not a data storage engine; it is a query language.
ANSI SQL is not vendor-specific, but it is just a standard, not an implementation. As a result, you have to rely on vendors to implement the language. Many vendors deviate from the standard. This means that you cannot just write ANSI SQL and expect it to work on all databases.
> In my view, direct access to data decoupled from application code is a bug, not a feature. With multiple code bases touching the same data, schema improvements become nearly impossible.
It's unclear that multiple applications with direct access to said data make schema improvements any easier.
The obvious solution, copying the data for applications that are using the new schema, pretty much guarantees that one or more of the copies are wrong.
> I also think data integrity is easier to maintain with a system like this. SQL constraints don't allow me to express nearly as much about data integrity as I can in code. Sure, I could use stored procedures, but if I'm going to write code somewhere, I'd rather it be in my app.
How do you guarantee that all of the apps that touch that data use the current version of said code?
Code normalization is as important as data normalization.
Very well put. It is not the access type (direct vs. indirect) that makes schema improvements hard, but access preservation: availability. And indeed, copying the data often leads to one or more of the copies being wrong, unless special measures are taken in that direction.
> How do you guarantee that all of the apps that touch that data use the current version of said code?
An approach that may be worth considering is to not require all apps to use the current version: allow multiple versions. For some cases this would work, say if the semantics of the newer version are backwards compatible with the semantics of the older version. If the data semantics are preservable, the schema could be transformed while each data access request against it is translated on the fly.
But it clearly wouldn't work in all cases. More work to handle that would be needed.
> Code normalization is as important as data normalization.
True, this is the near show-stopper, really. In that case, the best one can hope for is preparing the state of the new version (new data in the new schema) and carefully coordinating a quick restart from the old version to the new one.
I would love to hear your thoughts on this. We have been working towards that direction with ChronicDB (http://chronicdb.com) and would welcome feedback.
An e-mail address in your profile would have made that possible. (Chronicdb looks interesting and complements something that I've been thinking about. I suspect that you've implemented many of the relevant mechanisms.)
> How do you guarantee that all of the apps that touch that data use the current version of said code?
In a NoDB app? It's very easy: only one code base ever directly touches the data, because the data lives in the RAM allocated to the app. You give external access via an API, so integrity is very easy to enforce.
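As a sketch of what "access via an API" can look like, using the JDK's built-in HttpServer and a made-up InventoryService (query parsing kept deliberately trivial): the data lives in a private field of one process, and other systems only ever see the endpoints it chooses to expose.

    import com.sun.net.httpserver.HttpServer;
    import java.io.OutputStream;
    import java.net.InetSocketAddress;
    import java.nio.charset.StandardCharsets;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch: the model is private to this process; the outside world only
    // sees the endpoints this code chooses to expose.
    public class InventoryService {
        private final Map<String, Integer> stock = new ConcurrentHashMap<>();

        public static void main(String[] args) throws Exception {
            InventoryService service = new InventoryService();
            service.stock.put("ABC", 42); // seed some data for the example

            HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
            server.createContext("/stock", exchange -> {
                String query = exchange.getRequestURI().getQuery(); // e.g. "sku=ABC"
                String sku = query == null ? "" : query.replace("sku=", "");
                byte[] body = String.valueOf(service.stock.getOrDefault(sku, 0))
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, body.length);
                try (OutputStream os = exchange.getResponseBody()) { os.write(body); }
            });
            server.start(); // GET /stock?sku=ABC returns the in-memory count
        }
    }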
Congratulations, you have reinvented the integrity constraint! Except IBM have been working on this for 40 years, making it reliable and performant. I have yet to see anyone roll their own data integrity layer that comes anywhere close to the major vendors.
Good point. With databases distributing data in-memory across machines, shared memory becomes the database. Don't be surprised if Arc runs HackerNews on distributed memory some day...
But one has to wonder, what happens when you need to upgrade the app? Shutting down the process and destroying the memory image doesn't seem like the best option:
- First, it disrupts connected applications since the process is killed, introducing downtime.
- Second, when starting up again as, say, version 2, the data that will be loaded into memory still needs to be transformed into the format expected by version 2. This transformation can take time on large data, introducing further downtime.
The challenge would be to eliminate this downtime by combining a solution for both client disruption and state transfer. A data abstraction like a database using SQL can simplify such a solution.
"With multiple code bases touching the same data, schema improvements become nearly impossible."
Few companies have procedures in place that allow this; but it is possible if you have the right procedures.
Sure, but the problem becomes harder the larger you get. Look at almost any Internet-wide deployment, though, and you see the alternative: isolate database schemas behind APIs, and rev APIs and schemas separately, as the situation demands.
Isolating databases behind APIs and rev-ing API+schema separately is not enough. When the schema changes, data must be transformed to match the new schema version. As you point out this takes too long with a large database, and it doesn't account for data consistency.
We have been working on building what we hope are the procedures for this with ChronicDB (http://chronicdb.com). But it has turned out to be harder than it seems, and we are not sure it will quite work out. We'd welcome feedback.
And, not least, there's security and permissions to access different parts of the data. I see no way to implement that easily in an event-logging system.
Chmod? Seriously, if you need different permissions for different streams of data in an event log, just write multiple event logs - each with only the data it needs - and store them with different permissions. This assumes that they can be properly decoupled, but since you're the one writing the event log, you can set it up however you want.
When people set out to design a SQL database, they usually end up updating and deleting records. This is bad because it destroys history, and nothing that you can add to your SQL architecture will fix it at a fundamental level.
By basing your system on a journaled event stream, you start with a foundation of complete history retention, and you can build exactly the sort of reporting views you need at any time (say, by creating a SQL database for other applications to query).
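A minimal sketch with hypothetical event and store types: the append-only journal is the only thing ever written, and any reporting view - including a SQL mirror for other applications - can be rebuilt from it at any time.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical event type and store: the append-only journal is the source of
    // truth, and reporting views are derived from it whenever they are needed.
    record AccountCredited(String accountId, long cents) {}

    class EventStore {
        private final List<AccountCredited> journal = new ArrayList<>();

        synchronized void append(AccountCredited event) {
            journal.add(event); // history is never updated or deleted
        }

        // Rebuild a "current balances" view from scratch at any time.
        synchronized Map<String, Long> balancesView() {
            Map<String, Long> balances = new HashMap<>();
            for (AccountCredited e : journal) {
                balances.merge(e.accountId(), e.cents(), Long::sum);
            }
            return balances;
        }
    }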
When people set out to design a data driven application, they usually end up updating and deleting records.
FTFY...
It's not hard to build history into a SQL table design. You can even store events in a... wait for it... SQL database. I have built numerous systems backed by SQL databases that have complete history retention. Questions like 'who had id X on this date three years ago?' are easily answered with basic, standard SQL.
I certainly don't believe SQL databases are perfect or the tool for every job, but in many cases they work just fine until you get into very large datasets. Admittedly, I only deal with databases in the 100s of GB range so I have yet to personally run into the scaling problems that a Google or Facebook have and the SQL backed systems I have built work just fine.
It's not hard, no, but it usually doesn't happen in the average application. That's the issue: it's not built in, it's not standardized, and every SQL database is fully mutable by default.
If your system operates in this journaled/event-sourcing way at the most basic level then you have the ultimate future-proof storage layer. You could decide to completely change the way the data is stored and represented (in-memory or otherwise) at any time, as long as you have that raw history.
You lose some features of SQL when you have historical values.
For example, if you want only customers to be able to create orders, you can set a foreign key between customer and order. But if you never delete orders and customers, the database will allow you to insert new orders against old, inactive customers unless you apply a more complex constraint.
Deletes on enterprise SQL systems are usually prevented (the preferred pattern is "mark for delete + purge" similar to a VM's garbage collection). The Application ignores "marked" data as deleted.
That leaves tracking the remaining inserts and updates, which is a well understood problem. It's called Auditing. Here is a simple script that will auto-audit a SQL Server database... variations in other SQL dialects are likely just as straightforward.
You can have a design where your previous version of a record gets automatically copied into another table along with the timestamp of the operation. Then you can slice this history however you want. All with no additional app code.
But I wouldn't write off the NoDB approach for various transitional data, or data that isn't meant to live long anyway, like tweets.
I used to work on an application that did all the typical insert/update/delete operations on the core data tables and retained a journaled event stream in a separate audit database sufficient to regenerate the entire database from scratch (which was done at least a couple of times).
I suppose it would be equally possible to think of the main data tables as a "reporting view" in the sense you use here, except that the application was 1000:1 or more in update frequency to read frequency, and all the reads were performed on the main data tables, so that's kind of a "tail wagging the dog" view of the app.
For an application with different requirements, your view of things might be quite useful, of course.