Hacker News new | comments | show | ask | jobs | submit login
Turning the database inside-out with Apache Samza (confluent.io)
232 points by martinkl 686 days ago | hide | past | web | 64 comments | favorite

Immutability is hardly a cure-all, see the discussion here for why RethinkDB moved away from it: http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-g...

The reality is shared, mutable state is the most efficient way of working with memory-sized data. People can rant and rave all they want about the benefits of immutability vs mutability, but at the end of the day, if performance is important to you, you'd be best to ignore them.

Actually, to be more honest, reality is more complicated still. MVCC that many databases use to get ACID semantics over a shared mutable dataset is really a combination of mutable and immutable.

Slava @ rethink here.

This is a really interesting subject -- I should do a talk/blog post about this at some point. Here is a quick summary.

RethinkDB's storage engine heavily relies on the notion of immutability/append-only. We never modify blocks of data in place on disk -- all changes are recorded in new blocks. We have a concurrent, incremental compaction algorithm that goes through the old blocks, frees the ones that are outdated, and moves things around when some blocks have mostly garbage.

The system is very fast and rock solid. But...

Getting a storage engine like that to production state is an enormous amount of work and takes a very long time. Rethink's storage engine is really a work of art -- I consider it a marvel of engineering, and I don't mean that as a compliment. If we were starting from scratch, I don't think we'd use this design again. It's great now, but I'm not sure if all the work we put into it was ultimately worth the effort.

I really think there are a couple of levels of immutability that it is easy to conflate.

Specifically immutability for

1. In memory data structures...this is the contention of the functional programming people.

2. Persistent data stores. This is the lsm style of data structure that substitutes linear writes and compaction for buffered in-place mutation.

3. Distributed system internals--this is a log-centric, "state machine replication" style of data flow between nodes. This is a classic approach in distributed databases, and present in systems like PNUTs.

4. Company-wide data integration and processing around streams of immutable records between systems. This is what I have argued for (http://engineering.linkedin.com/distributed-systems/log-what...) and I think Martin is mostly talking about.

There are a lot of analogies between these but they aren't the same. Success of one of these things doesn't really imply success for any of the others. Functional programming could lose and log-structured data stores could win or vice versa. Pat Helland has made an across the board call for immutability (http://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf), but that remains a pretty strong assertion. So it is worth being specific about which level you are thinking about.

For my part I am pretty bullish about stream processing and data flow between systems being built around a log or stream of immutable records as the foundational abstraction. But whether those systems internally are built in functional languages, use lsm style data layout on disk is kind of an implementation detail. From my point of view immutability is a lot more helpful in the large than in the small--I have never found small imperative for loops particularly hard to read, but process-wide mutable state is a big pain, and undisciplined dataflow between disparate systems, caches, and applications at the company level can be a real disaster.

Excellent points, yes it's important to clarify what we're talking about here. Samza sounds like an event-sourcing style immutable event log. You could think of it like the transaction or replication log of a traditional database. Having that be immutable is very sensible! But you can't always query that in "real-time".

On the other hand, the data structures you query in real-time, making that immutable is problematic, because then you'll need a LevelDB style compaction step. That doesn't mean to say that it can't be done well, but that it's hard to do well.

LMDB does ACID MVCC using copy-on-write with no garbage collection or compaction needed. It delivers consistent, deterministic write performance with no pauses. It is actually now in use in a number of soft realtime systems.

I was specifically thinking of LMDB as a counter-example when I wrote that it's not impossible, just hard to do well. A much more sensible set of tradeoffs than LevelDB.

> I really think there are a couple of levels of immutability that it is easy to conflate.

Complete agree. I was talking about immutability on the storage engine level. Totally different tradeoffs apply at different levels in the stack (that you described).

It's that merge (edit: GC is a better term) step that's difficult to get right. Google screwed this up badly with LevelDB which had(still has?) horrible performance issues caused by compaction. Even with concurrent compaction it can be difficult due to needing additional disk space, adding additional read and write pressure to the storage subsystem and the effects that has on latency. I'm not sure what RethinkDB's approach was there, but I'm very curious to know.

> Even with concurrent compaction it can be difficult due to needing additional disk space, adding additional read and write pressure to the storage subsystem and the effects that has on latency. I'm not sure what RethinkDB's approach was there, but I'm very curious to know.

Yes, we ran into all these issues with the RethinkDB storage engine. Unfortunately I can't summarize the solution, because there are no silver bullets. It took a long time to perfect the engine, and there was enormous amount of tuning work to get everything right.

For example, we have a "young blocks" subsystem that treats recently updated blocks differently (since, empirically, recently written blocks are dramatically more likely to be updated again, so we hold off on trying to collect them). How long should you wait? How many young blocks should you consider?

Working out solutions to these questions takes a lot of trial and error, and that's where the bulk of the work is (and that's just one subsystem!)

I'd love to write about it in depth, I'll try to make it a priority.

How much similarity is there between the JVM's G1GC, CMS or other collectors and Rethink's compaction? It looks like the heuristics and trade-offs are very much the same. (Latency, hard space constraint, usage patterns.) Okay, you don't have to do pointer/object graph chasing, but queries and consistency and whatnot has to do something similar.

Remember -- this is an on-disk compactor, so it's not quite the same as collecting garbage in memory. There are other differences -- database dependencies are typically trees (or in the worst case DAGs, as there aren't any circular references). So our compactor can be much simpler (in fact, it's closer to a purely functional collector like Haskell's).

But overall it's very similar to a programming language GC. The devil, as usual, is in the details.

In another thread, you had mentioned that "I don't think we'd use this design again". Curious to hear how you would design such a system differently, without using a LSM tree approach that's popular these days?

I don't think the LSM tree approach really panned out (in a sense that the benefits don't outweigh the drawbacks). You get much better insert and update throughput at the cost of significant production stalls. Most people don't need that level of insert throughput (and if they do, they can get it by scaling horizontally). Even if you need that throughput on a single box, most people aren't ok with long stalls. Facebook has been doing some work to minimize stalls in an LSM storage engine, but this is a significant engineering effort that only really makes sense for a few companies.

RethinkDB's storage engine uses a different architecture -- it gets you better insert/update performance on SSDs without stalls (but not as good a throughput as LSM-based engines), in exchange for significant engineering effort to make the engine bulletproof. Again, most of the time, people can get that by scaling horizontally.

I think that in 99% of cases a traditional storage engine approach works just fine. We all tried to reinvent the wheel, but ultimately it turned out to be a lot of work for fairly little benefit.

> I think that in 99% of cases a traditional storage engine approach works just fine. We all tried to reinvent the wheel, but ultimately it turned out to be a lot of work for fairly little benefit.

Please publish this in a paper or at least a blog article so I can properly quote you the next time a discussion on ACID comes up. :)

If you do, please ping me at dan.eloff @ populargooglemailservice.com, I don't want to miss reading that one!

Please do -- I've been reading the linkedin/confluent/samza writeups and thinking there's a lot of truth to their ideas. It'd be great to hear more on-the-ground experience from a different perspective.

Is this not how most 'modern' (read 90s) relational db's work?

It's similar, they use MVCC, which means no in-place updates or deletes. Postgres then has a compaction step called vacuum to clean up old tuples. Redis is one of the few "databases" that truly uses in-place updates, but it has that luxury because it's single-threaded.

MVCC doesn't necessarily mean no in-place updates, it just means that you can distinguish between multiple versions. For example, Oracle:

- keep most recent version of all keys in B-tree

- store updates in undo log ("rollback segments")

- queries for older versions dynamically undo recent changes


Good point, but do they seriously do that? It's stupid.

If you overwrite data in place that's being concurrently read, you get garbled data. So you must guarantee nobody is reading it. One way is to lock the data for both readers and writers using a mutex of some form. Another way is Linux-RCU style[1]. Both make readers always pay a price for what should be an uncommon case.

It makes more sense to me to put your updates in a new place, and if need be copy them over the old data once nobody can see the old data anymore.

[1] http://lwn.net/Articles/262464/

Isn't that just called "journaling", as opposed to "immutable"?

Journaling typically implies that there is a separate log of operations/changes, but the main data file (the BTree) is still updated in-place. You can then use the journal to roll back the changes if necessary.

RethinkDB's storage engine doesn't have a journal -- the main data file is essentially journaled, which is quite different from the traditional meaning of the word.

Everything obviously has trade offs, no choice is perfect in everyway.

For me the pros of having data as an immutable stream of events (eventsourcing) is that you get migrations and data modeling for free - You don't have to deal with having to design the "perfect" data model in advance (or worry about schema/data migrations later on) and you can get caching as first level data rather as derived from another store.

Actually @eloff and the OP are arguing past each other. Both are wrong in different ways.

> Databases are global, shared, mutable state. That’s the way it has been since the 1960s, and no amount of NoSQL has changed that. However, most self-respecting developers have got rid of mutable global variables in their code long ago. So why do we tolerate databases as they are?

This isn't true. Databases _can_ be those things, but that isn't the definition of a database. Most of the databases I have worked and created do not use update or delete except to archive old data that is no longer in the working set.

And mutability isn't always faster. Most of the time when people are championing mutability, it is because it is the most expedient (esp with their mental model), not because from a whole system standpoint it is actually faster. They trot out a microbenchmark that proves their point while ignoring use cases like retrieving old state or auditing the transaction history.

You'll have to go into detail. Mutable data structures mop the floor with immutable ones, synchronization issues aside. What's actually better depends on the specific case in question, but mutable data structures have a superset of the techniques available to immutable ones, and that's never a disadvantage. It, might, however lead you to use a poor design that falls down under contention.

Mutability has to put into context of the whole system, not just a narrowly focused benchmark. We can no longer put synchronization issues aside, that just isn't possible anymore. Correctness (coherency and data races) both on a micro level and macro are more important than raw mutation rates. While there are times when mutation is the answer, it has to thought about carefully.

Correctness is very important. But being correct and faster is better. A lot of the techniques I use to accomplish that start to look similar to immutable. Copy on write is the cornerstone of good lock-free algorithms as well as immutable data structures. However, not being religious about mutability has tangible performance benefits - depending on the details, as always.

Yes, but that stuff is hard and error prone.

Immutability is easier (safer, more correct) on the whole. I think both Rust and Clojure take a good stance on mutation.

Only for anecdotal evidence, my ad-hoc immutable (lots of copying) ETLs performed better than the mutation happy ones. The GC was able to throw stuff away faster, the code was cleaner, kept only what it needed. The gaussian circle of confusion was smaller.

It's very hard. Bugs in lock-free algorithms are the only class of bugs that have defeated me. I once worked for three months of evenings trying to fix one, before throwing in the towel. I learned from that experience to use only very simple algorithms. Adhering to the single-writer principle[1] helps a lot. When changes and deletes can't interact because they're all on the same thread, life is a _LOT_ easier.

[1] http://mechanical-sympathy.blogspot.com/2011/09/single-write...

I don't have any evidence, but I have a gut feeling that if we had immutable, write-once persistent data structures, we could relax coherency _a whole lot_ and gain a huge bandwidth and latency advantage. Memory ordering to support mutability is an expensive abstraction.

It has it's advantages to be sure, I really like eventsourcing. For some kinds of projects it's the obvious way to go. Finance seems like a killer application, because of the way the entire audit trail is stored, and the system can often be rebuilt to a valid state upon finding and correcting a production bug.

For anything that critically needs an audit log it's a pretty obvious choice (atleast compared to when people try to implement audit logs in a RDBMS).

I think it might also have a great deal of merit in prototyping applications quickly - Having the option to do just-in-time/memory projections while later switching to "real" storage seems really ideal to me if i'm just building something to validate it would like to defer those type of tech decisions.


When designing our product it became quickly apparent that immutability isn't practical, we opted for MVCC ACID transactions, but the most difficult part of MVCC database is getting the purge right.

You can cheat purges with some clever optimization but at some point, you need to clean up old versions and when you do that, you are using precious I/O.

Getting purge/compaction right is hard. Update intensive scenarii are always problematic for databases.

As a matter of interest, what product is that?

I used some tricks to reduce the amount of data that must be cleaned up at any given point in time, but it was not possible to evade it completely. I still have to do concurrent compaction, and there's some really nasty corner cases I don't have any good solutions for, it's a hard problem.

I agree with this comment, but I think this blog post is really addressing data flow at the company-wide or datacenter scale. Surprisingly there immutable event data hasn't really had much of a home at all beyond data warehousing.

... most self-respecting developers have got rid of mutable global variables in their code long ago.

I'm not convinced that's the case. Almost everyone has merely hidden their mutable globals under layers of abstractions. Things like "singletons", "factories", "controllers", "service objects", "dependency injection" are the vernacular of the masked-globals game.

None of those things you said imply mutability. (Okay, maybe singletons, depending on the implementation.)

True, but in practice they tend to be used as containers or initializers for mutable variables.

As one who works with analytics databases and ETL (extract-transform-load) processes a great deal, immutability of data stores is an incredibly valuable property. Maybe append-only does not make sense in operational databases all the time, but for non-real-time analytics, it makes a huge amount of sense. In my case, operational data is queried, optimized for storage space and quick loading, and cached to disk. Because it is an analytics database used for longer-term analysis and planning, daily queries of operational data are sufficient in many cases. Operational workload is not even a consideration. The ETL process also allows for "updating" records in the "T" (transform) part. Updates to operational data are not even necessary, and often impossible, so correcting and enhancing the data for decision making is a huge win for clients. Issues similar to "compaction time" can still occur, but an ETL approach allows for many clean ways of controlling the process and avoiding those failure scenarios.

Anyhow in the Bay Area interested in learning more about Apache Samza should attend the meetup tonight in Mountain View: http://www.meetup.com/Bay-Area-Samza-Meetup/events/220354853...

I'm not sold on Samza, but I can tell you that creating isolated services that create their datastore from a stream of events is a really useful pattern in some use cases (ad-tech).

I've made use of NSQ to stream user update events (products viewed, orders placed) to servers sitting at the network edge which cache the info in leveldb. Our request latency was something like 10 microseconds over go's json/rpc. We weren't even able to come close to that in the other nosql database servers we tried, even with aggressive caching turned on.

What don't you like about Samza out of interest? Something fundamental with their model or more implementation related?

I've seen organizations have lots of trouble operationally with kafka (which samza uses). I've seen NSQ be extremely reliable operationally.

However they offer very different guarantees so it's an apples to oranges comparison. NSQ isn't really designed to provide a replayable history, although you can fake it by registering a consumer which does nothing but log to file (nsq_to_file) and that works pretty well.

(disclaimer: the nsq mailing list has lots of chatter these days, nsq may be growing features I'm not aware of)

What troubles did you typically see with Kafka? Was it Zookeeper related? (Also, good to talk to you at the Gopher Meetup on Tuesday)

Similar interesting talk by Rich Hickey:


I was wondering how this relates to Datomic... I'm not really familiar enough to say much about similarities and differences, but would be interested if someone who is could comment.

I asked the same question at the end of his talk, see the relevant section in the video:


Conceptually, one of the challenges of streams as first class citizens is that humans don't do well with them. For the purposes of analysis, humans need a "snapshot" or fix on the data. This way they can derive insights from the data and act on human things. The reality is that, for many real-world scenarios, a real-time view of the data is not just a luxury, it's actually a drawback, because data changes are noisy. Many human problems deal with abstract representations of the actual data, and so imprecision is part of the problem.

I really like the talk from the point of view of simplifying the system-wide problems caused by a gigantic mutable state. But I feel that at the border of system to humans there will be other issues to discuss.

You can do similar "magic" cache invalidation with Elasticsearch and the percolate feature. Each time you do a query and cache some transformation of the result, put that query in a percolate index. Then when you change a document, run the document against the percolate index and, voila, you get the queries that would have returned it and can then invalidate your cache.

This method of cache invalidation fails in a very key place though (just like in the article). What happens if you change a very core thing that invalidates a large percentage of the cache?

What you're hoping for is that some cacheable function of many documents is also a monoid.

In an example, you're hoping that when you invalidate the query "SELECT COUNT(*) FROM foo WHERE x = 1" because a new document that matched came in, you're simply incrementing the existing cached value, rather than rescanning the database index.

This is a cool idea - the holy grail scenario I'm envisioning is storing all data in the log i.e

1. the transaction log is a central repository for all data 2. much more detailed data is stored, enough that analytics and can run off this same source of data

The amount of data generated increases proportional to the number of updates on a row/piece of data whereas with a mutable solution, it is constant w.r.t number of updates on the same data. That is a pretty big scaling difference.

However, storing that much data translates to much higher costs for HDDs/servers, or possibly lower write performance if the log is stored on something like HDFS.

There would also be performance costs for building and updating a materialized view. Imagine a scenario like this:

Events -> A B C D E F G H I J K Materialized view M has been computed up to item J (but not K yet) Read/Query M

Now either writing K incurs the cost of waiting for all dependent views to materialize, or the read on M incurs the cost of updating M.

Some fusion of this would be pretty interesting though. For example, what if we just query on M without applying any updates if there have been <X updates? That translates to similar guarantees as an eventually consistent DB - the data could be stale. Atleast it gives us more control over this tradeoff.

I really enjoyed reading about Storm too: http://nathanmarz.com/blog/history-of-apache-storm-and-lesso...

This kind of "competition" leads to analysis paralysis though. Its much better when there is a single winner...

You mean like Hadoop? I disagree, the popular bad solution starves out innovation by sucking all the air out of the room. Easier to decide on the globally bad choice.

I meant Samza and Storm. I thought both of them could run on Hadoop.

I believe they meant that Hadoop is an example of a clear winner.

A more promising model, used in some systems, is to think of a database as an always-growing collection of immutable facts.

That would already be a huge progress over how databases are currently used; if records were in fact immutable many problems would be instantly solved.

You would just be trading them for the intensely ugly problem of garbage collection. Disk space is cheap, but it's not infinitely cheap. There are plenty of append-only data stores out there now, and they all suffer from compaction-related performance issues.

Does anyone knows which app has been used to create the "handwritten" images? I draw very badly so I'm looking for such an app to explain data flows on a corporate blog/wiki.

It looks like the free app "Paper" by FiftyThree, available on iPads: https://itunes.apple.com/us/app/paper-by-fiftythree/id506003...

From the comments in the article, it appears to be the Paper app for iOS/iPad and a stylus.

Windows Journal is also perfectly serviceable - https://drive.google.com/file/d/0Bxjbk6tMrOKQcXhKT1dIVkQ5ZVE...

Also Draw by Adobe has similar results (free!) https://itunes.apple.com/us/app/adobe-draw/id911156590

Streams - another reinvention of LDAP Persistent Search.

Yes, there really are protocols that handle single request/multiple response interactions, and they've been around for decades. Unlike crap built on HTTP, which was never intended for uses like this, these protocols work well with multiple concurrent requests in flight simultaneously, etc.

This is CouchDB, right?

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact