The reality is that shared, mutable state is the most efficient way of working with memory-sized data. People can rant and rave all they want about the benefits of immutability vs. mutability, but at the end of the day, if performance is important to you, you'd do best to ignore them.
Actually, to be more honest, reality is more complicated still. The MVCC that many databases use to get ACID semantics over a shared mutable dataset is really a combination of mutable and immutable approaches.
This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.
RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.
The system is very fast and rock solid. But...
Getting a storage engine like that to production quality is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure all the work we put into it was ultimately worth it.
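The append-only design described above can be sketched minimally. This is a hypothetical toy, not RethinkDB's actual code: `AppendOnlyStore`, `write`, and `compact` are invented names, and "disk blocks" are just Python dicts.

```python
# Minimal sketch of an append-only block store (hypothetical; not
# RethinkDB's actual code). Updates never modify blocks in place:
# each write appends a new block, and compaction later frees blocks
# whose every entry has been superseded.

class AppendOnlyStore:
    def __init__(self):
        self.blocks = []   # list of dicts; each dict stands in for one immutable block
        self.index = {}    # key -> (block_no, value) for the latest version

    def write(self, updates):
        """Record a batch of updates as a brand-new block."""
        block_no = len(self.blocks)
        self.blocks.append(dict(updates))
        for key, value in updates.items():
            self.index[key] = (block_no, value)

    def read(self, key):
        return self.index[key][1]

    def compact(self):
        """Free blocks in which every entry is outdated."""
        live = {bno for bno, _ in self.index.values()}
        for bno in range(len(self.blocks)):
            if bno not in live and self.blocks[bno] is not None:
                self.blocks[bno] = None   # block "freed" on disk

store = AppendOnlyStore()
store.write({"a": 1, "b": 2})
store.write({"a": 3})   # supersedes "a" without touching block 0
store.write({"b": 4})   # now every entry in block 0 is garbage
store.compact()
print(store.read("a"), store.read("b"))   # -> 3 4
print(store.blocks[0])                    # -> None (block 0 was collected)
```

The hard part the comments describe isn't this happy path -- it's making the compactor concurrent, incremental, and well-tuned under real workloads.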
Specifically, immutability for:
1. In-memory data structures -- this is the contention of the functional programming people.
2. Persistent data stores. This is the LSM style of data structure, which substitutes linear writes and compaction for buffered in-place mutation.
3. Distributed system internals -- this is a log-centric, "state machine replication" style of data flow between nodes. It's a classic approach in distributed databases, present in systems like PNUTS.
4. Company-wide data integration and processing around streams of immutable records between systems. This is what I have argued for (http://engineering.linkedin.com/distributed-systems/log-what...) and what I think Martin is mostly talking about.
There are a lot of analogies between these, but they aren't the same thing. The success of one doesn't really imply the success of any of the others: functional programming could lose and log-structured data stores could win, or vice versa. Pat Helland has made an across-the-board call for immutability (http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains a pretty strong assertion. So it is worth being specific about which level you are thinking about.
For my part I am pretty bullish about stream processing and data flow between systems being built around a log or stream of immutable records as the foundational abstraction. But whether those systems are internally built in functional languages or use LSM-style data layout on disk is kind of an implementation detail. From my point of view, immutability is a lot more helpful in the large than in the small -- I have never found small imperative for loops particularly hard to read, but process-wide mutable state is a big pain, and undisciplined data flow between disparate systems, caches, and applications at the company level can be a real disaster.
On the other hand, making the data structures you query in real time immutable is problematic, because then you'll need a LevelDB-style compaction step. That's not to say it can't be done well, but it's hard to do well.
Completely agree. I was talking about immutability at the storage engine level. Totally different tradeoffs apply at the different levels of the stack that you described.
Yes, we ran into all these issues with the RethinkDB storage engine. Unfortunately I can't summarize the solution, because there are no silver bullets. It took a long time to perfect the engine, and there was an enormous amount of tuning work to get everything right.
For example, we have a "young blocks" subsystem that treats recently updated blocks differently (empirically, recently written blocks are dramatically more likely to be updated again, so we hold off on trying to collect them). How long should you wait? How many young blocks should you track?
Working out solutions to these questions takes a lot of trial and error, and that's where the bulk of the work is (and that's just one subsystem!).
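A young-blocks policy like the one described can be sketched in a few lines. Everything here is hypothetical (the function, the threshold, the garbage-ratio cutoff are invented for illustration), but it shows where the tuning knobs live:

```python
import time

# Hypothetical sketch of a "young blocks" policy: recently written
# blocks are skipped by the compactor, on the bet that they'll be
# rewritten again soon anyway. Both constants below are exactly the
# kind of knob that takes trial and error to tune.

YOUNG_AGE_SECONDS = 30.0   # how long a block stays "young"
GARBAGE_THRESHOLD = 0.5    # how much garbage makes a block worth rewriting

def pick_compaction_candidates(blocks, now=None):
    """blocks: list of (block_id, last_write_time, garbage_ratio) tuples."""
    now = time.time() if now is None else now
    candidates = []
    for block_id, last_write, garbage_ratio in blocks:
        if now - last_write < YOUNG_AGE_SECONDS:
            continue                           # young: likely to change again; skip
        if garbage_ratio > GARBAGE_THRESHOLD:  # mostly garbage: worth rewriting
            candidates.append(block_id)
    return candidates

blocks = [
    (1, 100.0, 0.9),   # old and mostly garbage -> compact
    (2, 195.0, 0.9),   # mostly garbage but still young -> leave alone
    (3, 100.0, 0.1),   # old but mostly live -> leave alone
]
print(pick_compaction_candidates(blocks, now=200.0))   # -> [1]
```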
I'd love to write about it in depth, I'll try to make it a priority.
But overall it's very similar to a programming language GC. The devil, as usual, is in the details.
RethinkDB's storage engine uses a different architecture -- it gets you better insert/update performance on SSDs without stalls (though not as good throughput as LSM-based engines), in exchange for significant engineering effort to make the engine bulletproof. Again, most of the time people can get that by scaling horizontally.
I think that in 99% of cases a traditional storage engine approach works just fine. We all tried to reinvent the wheel, but ultimately it turned out to be a lot of work for fairly little benefit.
Please publish this in a paper or at least a blog article so I can properly quote you the next time a discussion on ACID comes up. :)
- keep most recent version of all keys in B-tree
- store updates in undo log ("rollback segments")
- queries for older versions dynamically undo recent changes
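The three steps above can be sketched as a toy undo-log MVCC (hypothetical code; class and method names are invented, and a dict stands in for the B-tree):

```python
# Minimal sketch of the undo-log MVCC scheme above: the main table
# holds only the newest version of each key, and every update pushes
# the previous value onto an undo log ("rollback segment") tagged with
# the transaction id that overwrote it. A reader with an older snapshot
# walks the undo log backwards to reconstruct the version it should see.

class UndoMVCC:
    def __init__(self):
        self.table = {}   # key -> latest value (stands in for the B-tree)
        self.undo = []    # list of (txn_id, key, old_value) -- the rollback segment
        self.txn_id = 0

    def update(self, key, value):
        self.txn_id += 1
        self.undo.append((self.txn_id, key, self.table.get(key)))
        self.table[key] = value
        return self.txn_id

    def read(self, key, snapshot_txn):
        """Return key's value as of the state right after txn `snapshot_txn`."""
        value = self.table.get(key)
        # Undo changes made by transactions newer than the snapshot,
        # newest first, restoring the old value each time.
        for txn, k, old in reversed(self.undo):
            if txn <= snapshot_txn:
                break
            if k == key:
                value = old
        return value

db = UndoMVCC()
t1 = db.update("x", "v1")
t2 = db.update("x", "v2")
db.update("x", "v3")
print(db.read("x", t2))   # -> v2 (one recent change undone)
print(db.read("x", t1))   # -> v1 (two changes undone)
print(db.read("x", 99))   # -> v3 (snapshot is current; no undo needed)
```

The purge problem discussed elsewhere in this thread is visible even here: `self.undo` grows forever until something trims entries no live reader can need.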
If you overwrite data in place while it's being concurrently read, you get garbled data. So you must guarantee nobody is reading it. One way is to lock the data for both readers and writers using a mutex of some form. Another is Linux RCU style. Both make readers always pay a price for what should be an uncommon case.
It makes more sense to me to put your updates in a new place, and if need be copy them over the old data once nobody can see the old data anymore.
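A minimal sketch of that "write somewhere new, then publish" idea, under the assumption that rebinding a single reference is atomic (true in CPython thanks to the GIL; names here are invented for illustration):

```python
import threading

# Hypothetical sketch of copy-on-write publication: writers build a
# fresh copy of the structure off to the side, then publish it with a
# single reference assignment. Readers grab the current reference once
# and work on that snapshot; they never see a half-applied update and
# never take a lock.

class SnapshotCell:
    def __init__(self, initial):
        self._current = initial              # readers only ever dereference this
        self._write_lock = threading.Lock()  # serializes writers, not readers

    def read(self):
        return self._current                 # one reference grab = one snapshot

    def update(self, mutate):
        with self._write_lock:
            fresh = dict(self._current)      # copy the data to a new place
            mutate(fresh)                    # mutate the private copy freely
            self._current = fresh            # publish: single atomic rebinding

cell = SnapshotCell({"count": 0})
snapshot = cell.read()                       # a reader holds the old version
cell.update(lambda d: d.update(count=1))     # a writer publishes a new one
print(snapshot["count"], cell.read()["count"])   # -> 0 1
```

The old copy is reclaimed by the garbage collector once the last reader drops its reference -- which is exactly the "once nobody can see the old data anymore" condition, handled for free here but requiring RCU-style grace periods or epochs in languages without a GC.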
RethinkDB's storage engine doesn't have a separate journal -- the main data file is essentially a journal itself, which is quite different from the traditional meaning of the word.
For me, the pros of having data as an immutable stream of events (event sourcing) are that you get migrations and data modeling for free -- you don't have to design the "perfect" data model in advance (or worry about schema/data migrations later on), and you can treat caches as first-class data rather than as something derived from another store.
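A minimal event-sourcing sketch of the "data modeling for free" point (event shapes and function names are invented for illustration):

```python
# The append-only event list is the source of truth; any "data model"
# is just a fold over it. Changing the model means writing a new
# projection and replaying history -- no migration of stored state.

events = [
    {"type": "deposited", "amount": 100},
    {"type": "withdrawn", "amount": 30},
    {"type": "deposited", "amount": 5},
]

def balance(events):
    """One projection: the current balance."""
    total = 0
    for e in events:
        total += e["amount"] if e["type"] == "deposited" else -e["amount"]
    return total

def deposit_count(events):
    """A projection added later, over the same history -- no migration needed."""
    return sum(1 for e in events if e["type"] == "deposited")

print(balance(events), deposit_count(events))   # -> 75 2
```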
> Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?
This isn't true. Databases _can_ be those things, but that isn't the definition of a database. Most of the databases I have worked on or created don't use UPDATE or DELETE except to archive old data that is no longer in the working set.
And mutability isn't always faster. Most of the time when people champion mutability, it's because it is the most expedient option (especially given their mental model), not because it is actually faster from a whole-system standpoint. They trot out a microbenchmark that proves their point while ignoring use cases like retrieving old state or auditing the transaction history.
Immutability is easier (safer, more correct) on the whole. I think both Rust and Clojure take a good stance on mutation.
As anecdotal evidence only: my ad-hoc immutable (lots of copying) ETLs performed better than the mutation-happy ones. The GC was able to throw stuff away faster, and the code was cleaner and kept only what it needed. The Gaussian circle of confusion was smaller.
I think it might also have a great deal of merit for prototyping applications quickly. Having the option to do just-in-time/in-memory projections and later switch to "real" storage seems really ideal if I'm just building something to validate it and would like to defer those kinds of tech decisions.
When designing our product, it quickly became apparent that immutability isn't practical, so we opted for MVCC ACID transactions -- but the most difficult part of an MVCC database is getting the purge right.
You can cheat purges with some clever optimizations, but at some point you need to clean up old versions, and when you do, you are using precious I/O.
Getting purge/compaction right is hard. Update-intensive scenarios are always problematic for databases.
I used some tricks to reduce the amount of data that must be cleaned up at any given point in time, but it wasn't possible to evade it completely. I still have to do concurrent compaction, and there are some really nasty corner cases I don't have good solutions for. It's a hard problem.
I'm not convinced that's the case. Almost everyone has merely hidden their mutable globals under layers of abstractions. Things like "singletons", "factories", "controllers", "service objects", "dependency injection" are the vernacular of the masked-globals game.
I've used NSQ to stream user update events (products viewed, orders placed) to servers sitting at the network edge, which cache the info in LevelDB. Our request latency was something like 10 microseconds over Go's JSON-RPC. We weren't able to come close to that with the other NoSQL database servers we tried, even with aggressive caching turned on.
However they offer very different guarantees so it's an apples to oranges comparison. NSQ isn't really designed to provide a replayable history, although you can fake it by registering a consumer which does nothing but log to file (nsq_to_file) and that works pretty well.
(Disclaimer: the NSQ mailing list has lots of chatter these days; NSQ may be growing features I'm not aware of.)
This method of cache invalidation fails in one key scenario, though (just like in the article): what happens if you change a very core thing that invalidates a large percentage of the cache?
For example, you're hoping that when a new matching document comes in and you invalidate the query "SELECT COUNT(*) FROM foo WHERE x = 1", you're simply incrementing the existing cached value rather than rescanning the database index.
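The increment-instead-of-rescan idea can be sketched as follows (hypothetical code: the table is a list of dicts and the cache a plain dict, just to show the cheap path vs. the expensive path):

```python
# Incremental maintenance of a cached COUNT(*) result: instead of
# invalidating the cache when a matching row arrives, adjust the
# cached value, so the write never triggers a full rescan.

cache = {}   # x value -> cached count for "SELECT COUNT(*) FROM foo WHERE x = ?"

def rescan_count(table, x_value):
    """The expensive path we are trying to avoid."""
    return sum(1 for row in table if row["x"] == x_value)

def cached_count(table, x_value):
    if x_value not in cache:
        cache[x_value] = rescan_count(table, x_value)   # pay full cost once
    return cache[x_value]

def insert(table, row):
    table.append(row)
    if row["x"] in cache:
        cache[row["x"]] += 1    # increment instead of invalidating

table = [{"x": 1}, {"x": 2}, {"x": 1}]
print(cached_count(table, 1))   # -> 2 (full scan, result cached)
insert(table, {"x": 1})
print(cached_count(table, 1))   # -> 3 (no rescan: cache was incremented)
```

This only works when the update's effect on the query result is computable from the update alone; for queries where it isn't, you're back to invalidate-and-rescan, which is exactly the failure mode described above.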
1. the transaction log is a central repository for all data
2. much more detailed data is stored -- enough that analytics can run off this same source of data
The amount of data generated grows in proportion to the number of updates on a row/piece of data, whereas with a mutable solution it is constant with respect to the number of updates on the same data. That is a pretty big scaling difference.
However, storing that much data translates to much higher costs for HDDs/servers, or possibly lower write performance if the log is stored on something like HDFS.
There would also be performance costs for building and updating a materialized view. Imagine a scenario like this:
Events -> A B C D E F G H I J K
Materialized view M has been computed up to item J (but not K yet)
Now either writing K incurs the cost of waiting for all dependent views to materialize, or the read on M incurs the cost of updating M.
Some fusion of this would be pretty interesting, though. For example, what if we query M without applying any updates as long as there have been fewer than X pending updates? That gives us guarantees similar to an eventually consistent DB -- the data could be stale -- but at least it gives us more control over the tradeoff.
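That fusion can be sketched as a lazily maintained materialized view (hypothetical code; the view is just a running sum, and the staleness threshold is the "X updates" knob from above):

```python
# A materialized view that serves possibly-stale reads for free and
# only folds in pending events once more than X have accumulated, so
# neither writes nor most reads pay the update cost.

STALENESS_LIMIT = 3   # "X": max pending events before a read must catch up

class LazyView:
    def __init__(self):
        self.total = 0     # materialized state: a running sum over events
        self.applied = 0   # how many log entries are folded into `total`
        self.log = []      # full immutable event log

    def append(self, amount):
        self.log.append(amount)   # writes never wait on view materialization

    def read(self):
        pending = len(self.log) - self.applied
        if pending >= STALENESS_LIMIT:
            # Too stale: this read pays the cost of catching up.
            for amount in self.log[self.applied:]:
                self.total += amount
            self.applied = len(self.log)
        return self.total          # exact, or stale by < STALENESS_LIMIT events

view = LazyView()
view.append(10)
view.append(20)
print(view.read())   # -> 0 (only 2 pending: serve the stale view)
view.append(5)
print(view.read())   # -> 35 (3 pending: catch up, then serve)
```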
This kind of "competition" leads to analysis paralysis, though. It's much better when there is a single winner...
That would already be huge progress over how databases are currently used; if records were in fact immutable, many problems would be instantly solved.
Yes, there really are protocols that handle single request/multiple response interactions, and they've been around for decades. Unlike crap built on HTTP, which was never intended for uses like this, these protocols work well with multiple concurrent requests in flight simultaneously, etc.
I really like the talk from the point of view of simplifying the system-wide problems caused by gigantic mutable state. But I feel that at the boundary between the system and humans, there will be other issues to discuss.