Designing Data-Intensive Applications (dataintensive.net)
598 points by wooola on Oct 8, 2017 | 59 comments



I am a backend engineer who has been working in the mobile gaming space for many years now. Most of the focus today for mobile gaming backends is on scaling to millions of players while offering low latency and real-time interactions. I have been following the development of this book through its beta, and having read it now, I think it is fair to say that it is worth its weight in gold. I can relate directly to all the challenges faced over the years in implementing real-time, distributed, database-_like_ functionality with high read _and_ write throughput, both custom-built and off the shelf. Extremely well written.


In "The Future of Data Systems", the author imagines a system where the application writes events to a Kafka-like distributed log. Consumers of the log do work like de-duping, committing the data to a RDBMS, invalidating caches, updating search indexes, etc. The application might read state directly from log updates, or have a system to sync w/ some sort of atomic state (e.g. RethingDB changefeeds).

The architecture seems to solve two big problems

* Scaling RDBMS (there are solutions like Cloud Spanner but they rely heavily on Google's proprietary network for low latency and are expensive as balls)

* Keeping downstream systems in sync. A lot of companies have gone with a "tail the RDBMS" approach of some kind, e.g. writing the MySQL binlog to a Kafka topic and having downstream consumers read from that, but this seems like a more elegant solution (rough consumer sketch below).
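
To make that concrete, here is a rough sketch of the kind of downstream consumer I mean (the topic name, payload shape, and the two handlers are all hypothetical; uses kafka-python):

    import json
    from kafka import KafkaConsumer  # pip install kafka-python

    def update_search_index(row):    # hypothetical downstream writer
        print("index", row)

    def invalidate_cache(row_id):    # hypothetical cache invalidation
        print("invalidate", row_id)

    consumer = KafkaConsumer(
        "mysql.binlog.users",                      # hypothetical change-log topic
        bootstrap_servers="localhost:9092",
        group_id="search-indexer",
        enable_auto_commit=False,                  # only commit after downstream writes
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )

    for msg in consumer:
        change = msg.value                         # e.g. {"op": "UPDATE", "row": {...}}
        update_search_index(change["row"])
        invalidate_cache(change["row"]["id"])
        consumer.commit()                          # at-least-once: offset commit comes last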

Are there any examples or experiences of people working with systems like this? What are some downsides, challenges, and actual benefits?


From my personal experience (which probably has to do with working in industries where incoming data has a wide array of clients who need it ASAP) - the biggest challenge with any distributed DB system is the problem of "reading your own writes" and how the system approaches it. Not to be confused with tx isolation levels.

It's a balancing act between two extremes - locking everything down and ensuring the tx has been committed on all nodes/propagated to all consumers on one hand and sending an "ack" back to the client with a loose promise of eventual consistency on the other.
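
One common middle ground, as a very rough sketch (all names here are hypothetical stand-ins): pin a user's reads to the leader for a short window after their own writes, and let everything else go to replicas.

    import time

    leader_db, replica_db = {}, {}     # stand-ins for real connections
    REPLICA_LAG_WINDOW = 5.0           # seconds; tune to observed replication lag
    last_write_at = {}                 # user_id -> time of that user's last write

    def write(user_id, key, value):
        leader_db[key] = value         # real system: the write goes to the leader
        last_write_at[user_id] = time.time()

    def read(user_id, key):
        recently_wrote = time.time() - last_write_at.get(user_id, 0.0) < REPLICA_LAG_WINDOW
        db = leader_db if recently_wrote else replica_db   # replica may be stale
        return db.get(key)

It gives you read-your-own-writes for the writer without locking everything down, but other clients still only get eventual consistency.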


We have built something like this on top of a distributed in-memory database. The changelog of the distributed in-memory database is the 'source of truth' that downstream clients would like to consume. The research problem is that the changelog consists of batches of transactions that complete within a given epoch (a time window). Transactions may execute in parallel on different data nodes within an epoch and there is no global ordering over them (no global time, only logical time). Downstream clients, however, may require an ordering over the transactions, so we worked on a solution to this problem for the case where the database stores metadata for a filesystem. Our solution avoids having to serialize the transactions to enforce a global ordering, and provides strong eventual consistency to clients.


Can you elaborate on the solution? Do you use an aggregator node, or have you massaged real-time reqs to allow for an aggregation period on the client side? I'm super curious to hear how you proceeded!


No, we're using a table as a 'queue' in the DB. Clients are decoupled, and if our middleware goes offline, it can restart and catch up by draining the table. Transactions over the table ensure the consistency and integrity of the 'queue'. We provide at-least-once semantics to downstream apps by exploiting transactions in the DB. Actually, we have one sink in the DB itself, and we get exactly-once semantics for that. The work is under submission with anonymous reviewing, so I can't elaborate much on everything else. Performance numbers are good, though: a single server can forward more than 10k ops/sec from the database changelog to a downstream DB used for free-text search.
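
Roughly, the drain step looks something like this (schema, names, and SQL here are simplified placeholders, DB-API style, not our actual code):

    def drain_once(conn, forward):
        """Forward a batch from the queue table and delete it in one transaction."""
        cur = conn.cursor()
        cur.execute(
            "SELECT id, payload FROM changelog_queue ORDER BY id LIMIT 100 FOR UPDATE"
        )
        for row_id, payload in cur.fetchall():
            forward(payload)                                   # push downstream first...
            cur.execute("DELETE FROM changelog_queue WHERE id = %s", (row_id,))
        conn.commit()
        # ...then commit the deletes: if we crash before the commit, the rows are
        # simply forwarded again on restart, i.e. at-least-once delivery.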


Hi, I am working on one called omniql. I am a lone developer, so it will probably take time, but much of what the book describes is planned for omniql. It is based on the observable pattern, so it will have live queries by default. It will use a system like Kafka, NATS, Redis, or any message system to communicate between the partitions of the distributed databases and do computation that can be merged hierarchically; the computations are done by `serverless functions` in any language. It is a model that can be extended to the UI too, to create a framework very similar to MobX, but more lightweight, with UI components that can automatically query their data requirements from the server only when they are observed. I am creating a binary communication protocol for this called delta (https://github.com/nebtex/delta), which is well suited to this kind of system and to incremental computing. It is a really big project, but I hope to have something working in the next 6 months.


I have been involved in using a system built like this. All I can say is... It feels like you're building a database out of an event stream.

A shitty one at that... Basically the write log part, only without a way to apply that state reliably like a real database. So you have to keep the log around basically forever. It's like you're in the middle of a DB recovery all the time.

After insane amounts of research and deep thought, my personal opinion is that this is the wrong way to do scalable systems. Event sourcing and eventual consistency are taking the industry for a ride in the wrong direction.

In my quest to find a better way I found some research/leaks/opinions from Googlers, and I think they're right. Even Netflix admits that using eventual consistency means they have to build systems that go around and "fix up" data that ends up in bad states. Ew. Service RPC loops in any such system are Pandora's box. Are these calls getting the most recently updated version of the data? Nobody knows. Even replaying the event log can't save you; the log may be strongly ordered, but the data state between services that call each other is partly determined by timing. Undefined behavior.

You'll notice that LinkedIn/Netflix/Uber etc. all seem to be building their systems using this pattern. Who is conspicuously absent? Google. The father of containers, VMs, and highly distributed systems is mum.

Researching Google's systems gives some fascinating answers to the problem of distributed consistency, a solution I'm stunned hasn't seen more attention. Google decided as early as 2005 that eventually consistent systems were too hard to use and manage. All of their databases, BigTable, MegaStore, Spanner, F1... They're all strongly consistent in certain ways.

How does Google do it? They make the database the source of truth. Service RPC calls either fail or succeed immediately. Service call loops, while bad for performance, produce consistent results. Failures are easy to find because data updates either succeed or fail immediately, not at some unbounded future time.

The rest of the industry is missing the point of microservices IMO. Google's massively distributed systems are enabled largely by their innovative database designs. The rest of the industry is trying to replicate the topology of Google's internal systems without understanding what makes them work well.

For microservices to be realistically usable for most use cases, we need someone to come up with decent competition to Google's database systems. When you have a transactional distributed database, all the problems with data spread across multiple services go away.

HBase was a good attempt but doesn't get enough love. A point missed in the creation of HBase, which becomes clear when reading the papers about MegaStore and Spanner, is that it wasn't designed to be used as a data store by itself. Instead, it has the minimal features needed to build a MegaStore on top of it. The weirder features of HBase/BigTable (like keeping around 3 copies of changed data, and row-level atomicity without transactions) are clearly designed to make it possible to build a database on top of it.

Unfortunately nobody has taken up that challenge thus far, and outside Google we're all stuck with the shitty databases that Google tossed away a decade ago.


Great, insightful comment. I came to the same conclusion a number of years ago. We did something about it: we built a new Hadoop platform around a not very well known distributed, in-memory, open-source database, MySQL Cluster (NDB). It is not the MySQL Server you think you know. It is an in-memory OLTP engine used by most network operators as a call subscriber DB. It can handle millions of reads or writes/sec on commodity hardware (it has been benchmarked at 200m reads/sec and about 80m writes/sec). It has transactions (read committed isolation level) and row-level locks. It supports efficient cross-partition transactions using one transaction coordinator per database node (up to 48 of them).

You can build scalable apps with strong consistency if you can write them with primary key ops and partition-pruned index scans. We managed to scale out HDFS by 16X with this technique. Since then, we have been doing what you suggested: we built a microservices architecture for Hadoop called Hopsworks around the transactional distributed database. All the evils of eventual consistency go away; systems like Apache Ranger/Sentry become just tables in the DB. More reading is available here: http://www.hops.io/?q=content/news-events
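
To make "partition-pruned" concrete, a hypothetical sketch (table, columns, and connection details here are made up, not our real schema):

    import mysql.connector  # assumes a MySQL Cluster (NDB) SQL node is reachable

    conn = mysql.connector.connect(host="127.0.0.1", user="app",
                                   password="secret", database="meta")
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE IF NOT EXISTS file_meta (
          user_id BIGINT       NOT NULL,
          path    VARCHAR(255) NOT NULL,
          size    BIGINT,
          PRIMARY KEY (user_id, path)
        ) ENGINE=NDBCLUSTER
          PARTITION BY KEY (user_id)
    """)

    # Partition-pruned index scan: fixing user_id pins the query to the single
    # partition that owns it, instead of fanning out to every data node.
    cur.execute(
        "SELECT path, size FROM file_meta WHERE user_id = %s AND path LIKE %s",
        (42, "/home/42/%"),
    )
    print(cur.fetchall())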



Hopsworks looks like it might be exactly what I need. I do typical data science work on small to medium data and wanted to start properly playing with Spark on an HDFS store.

Currently most work is just done in R/Python in VMs on a small Proxmox cluster (where only one node is always on), but I'd like to start gently moving to Spark, run the stack on a single node, and scale on demand.

Is Hopsworks for me, does this approach even make sense for such small data or am I crazy? Thanks for your response!


Yes, Hopsworks can run on anything from 1 server to 1000s. We are finalizing the first proper release now: Jupyter support, TensorFlow, PySpark, SparkR, and a Python kernel for Jupyter too.


Awesome, that sounds perfect, I'll give it a shot. Do you have a mailing list or any way to follow along? Cheers


Very interesting. What do you think about CockroachDB (https://github.com/cockroachdb/cockroach)? Is it a serious enough answer to the challenge from outside Google?


Two problems. The name is a nightmare for any large company. Countless people have brought this up and the dev team is apparently deaf to the issue. Idiots.

The second major issue is technical. Building something like Spanner requires a very accurate time source that absolutely will not skew. Ever. This is how Google avoids partitions and essentially sidesteps the CAP theorem. Perfect time gives you globally accurate timestamps without exchanging data. Distributed transactions without locks or shared state, just usually benign contention.
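
A toy sketch of the idea (this is not Google's API, just the gist of "commit wait" over a bounded-uncertainty clock; the bound and the storage dict are made up):

    import time

    CLOCK_UNCERTAINTY = 0.007      # hypothetical bound on clock error, e.g. 7 ms
    storage = {}                   # stand-in for the real replicated store

    def now_interval():
        """Pretend TrueTime: the true time lies somewhere in [earliest, latest]."""
        t = time.time()
        return (t - CLOCK_UNCERTAINTY, t + CLOCK_UNCERTAINTY)

    def commit(key, value):
        _, latest = now_interval()
        commit_ts = latest                     # guaranteed >= the true commit time
        storage[key] = (commit_ts, value)
        while now_interval()[0] < commit_ts:   # "commit wait": sit out the uncertainty
            time.sleep(0.001)
        return commit_ts                       # once exposed, any later commit anywhere
                                               # will pick a strictly larger timestamp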

They're not that hard to build, there's just no demand. Possibly some issues with ITAR preventing such accurate clocks from becoming commodity hardware? It could easily lead to extremely accurate IMUs, which are definitely limited. Not sure, but that's what I ran into researching fibre optic gyros. Atomic clocks could probably be built at SMT scale for a few cents IMO.

As usual, it seems Google is already doing this and has been for years. We either need to wait for the trickle-down that Google thankfully does after about 5 years... or get some deep-pocketed tech behemoth to foot the bill for everyone else.


Leveraging accurate clocks doesn't let Google ignore partitions. "TrueTime itself could be hindered by a partition"[0]. Spanner also uses two-phase commits and locking, which are unavailable under certain kinds of network partitions.

From their 2017 paper on Spanner and CAP:

> To the extent there is anything special, it is really Google’s wide-area network, plus many years of operational improvements, that greatly limit partitions in practice, and thus enable high availability.

[0] https://static.googleusercontent.com/media/research.google.c...


I like your view. I almost drank the event-sourcing kool-aid, but I decided instead to do the reverse: I write against a "normal" PG database and then trigger the changes to a log (in PG too). Then I read from the log elsewhere. I will probably copy the log to something else (like Kafka) just for speed.

I don't want to lose ACID, mainly because my apps are all about financial/business stuff.

My ideal DB right now would look like this:

    Incoming data ->
        - Write WAL (disk)
        - Convert to logical JSON-like structure (memory). For a consistent API
        - Pre-triggers (BLOCK) <- validations
        - Persist on table(s) (only the data necessary to perform validations later? Cache?)
        - Persist on READ-LOG (all the data!)
        - Post-triggers (NON-BLOCK!) Read by:

    Log listener(s) ->
        - Build caches, (secondary?) indexes, etc. (NON-BLOCK)
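
The trigger-to-log part I have today is small enough to sketch (table and trigger names are made up, and it assumes an existing invoices table):

    import psycopg2  # hypothetical connection details below

    conn = psycopg2.connect("dbname=app user=app")
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS change_log (
              id       BIGSERIAL PRIMARY KEY,
              tbl      TEXT        NOT NULL,
              op       TEXT        NOT NULL,
              row_data JSONB       NOT NULL,
              at       TIMESTAMPTZ NOT NULL DEFAULT now()
            );

            CREATE OR REPLACE FUNCTION log_change() RETURNS trigger AS $$
            BEGIN
              INSERT INTO change_log (tbl, op, row_data)
              VALUES (TG_TABLE_NAME, TG_OP, to_jsonb(NEW));
              RETURN NEW;
            END;
            $$ LANGUAGE plpgsql;

            DROP TRIGGER IF EXISTS invoices_log ON invoices;
            CREATE TRIGGER invoices_log
              AFTER INSERT OR UPDATE ON invoices
              FOR EACH ROW EXECUTE PROCEDURE log_change();
        """)

The writes stay ACID, and the change_log table is what gets tailed and copied into Kafka later.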


> It feels like you're building a database out of an event stream.

Many software applications, especially web application servers, are effectively a set of data structures updated incrementally from an incoming data stream, and then served to users. The analogy of many applications to a database (or an interpreter) is an accurate one and, in my personal experience, useful as well.

> So you have to keep the log around basically forever.

Different logs can have different retention periods, and you linearize across different logs by using a single writer per timeline. Many domains allow for enough splitting of timelines to enable effective parallel processing without compromising consistency (consider the case of distinct customer organizations using a time-tracking product - there's no reason they need to be able to write to the same database, and volume/contention within any individual organization is likely to be low, allowing for maintained performance).
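
A minimal sketch of what "single writer per timeline" can look like with keyed partitioning (kafka-python; the topic and org ids are hypothetical). Events that share a key land on the same partition and keep a total order, while different orgs proceed in parallel:

    import json
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        key_serializer=lambda k: k.encode("utf-8"),
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    def record_time_entry(org_id, entry):
        # All events for org_id go to one partition => a single ordered timeline.
        producer.send("time-entries", key=org_id, value=entry)

    record_time_entry("org-42", {"user": "alice", "minutes": 90, "task": "review"})
    producer.flush()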


We built a kind of hybrid infrastructure around this idea.

Any component sends its data into a Kafka-esque pipeline (we're using a combination of NATS and PubSub) where a series of workers read, process, and write data into our RDBMS, which is the ultimate source of truth.

This allows us to run a system that scales on two axes: all components run at their own pace and the RDBMS runs at its own. There is some inherent delay in the data propagation, but it works for a service like ours (a search engine) that doesn't require real-time exposure of newly acquired data.

This design also allows for frugality, as the RDBMS cluster is scaled based only on long-term trends and not short-term bursts. We are also able to buy committed usage for the cluster, as we have great predictability in its growth.


We're building an open source database like this. It's document-oriented and relies on a transaction log at its core (currently using Postgres, but it's pluggable), which feeds into a query subsystem (currently layered on top of Elasticsearch, but also pluggable).

The transaction log encourages small logical "patches" (set a field, increment a number, replace a substring, move an array element, etc.) that are applied in sequence but can be disentangled by clients to generate a consistent UI, and also used to resolve conflicts between multiple distributed writers. You can also follow the log through the gRPC and HTTP APIs, and you can register "watches" on queries that cause matching changes to trigger an event.
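
To make the patch idea concrete, here is a toy, in-memory version (nothing like our actual wire format, just the shape of it):

    def apply_patch(doc, patch):
        """Apply one small logical operation to a document."""
        op, path, arg = patch["op"], patch["path"], patch.get("value")
        if op == "set":
            doc[path] = arg
        elif op == "inc":
            doc[path] = doc.get(path, 0) + arg
        elif op == "unset":
            doc.pop(path, None)
        return doc

    doc = {}
    log = [
        {"op": "set", "path": "title", "value": "Hello"},
        {"op": "inc", "path": "viewCount", "value": 1},
        {"op": "set", "path": "title", "value": "Hello, world"},
    ]
    for patch in log:
        doc = apply_patch(doc, patch)
    # doc == {"title": "Hello, world", "viewCount": 1}

Because each patch is small and explicit, clients can replay them to keep a UI consistent, and conflicting writers can be reconciled patch by patch.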

While the transaction log is the underlying data model, we also maintain a consistent view of the current version so that you can use it as a document database with classical CRUD operations. So on the surface it's a lot like Firebase or CouchDB, except you get things like joins and schemas.

Drop me an email (see profile) and I can send you some links.


Curious how this compares to Couchbase + N1QL, since that gets you a document db that also supports full SQL.


In our case, we wanted something more compact, so a query looks more like GraphQL (it's technically a superset of JSON, but it usually doesn't look like JSON). Joins, for example, are just nested declarations that list which attributes on the joined collections to fetch. Here's a query that shows many of the features:

    *[_type == "blogpost" && published == true] {
      _id, title,

      "bodyExcerpt": regexReplace(body, "(.+?)(\. |$)", "\\1"),
      
      author -> { name, category },

      "sectionNames": sections -> name
    } | order(createdAt desc)[0..20]

I don't know much about N1QL; I will have to read more about it.


This looks very similar to what I am doing. The difference appears to be that in delta the set of operations is just a subset of what can be done in your system: it only allows the set operation on scalar/string fields and the push operation on vectors. Delete operations are also placed in a way that automatically lets the merge algorithm know whether it needs to check past versions in order to compute the current version of the resource, and it allows compacting the whole history of a resource in a distributed way.


Not sure what you mean by subset. In our case, a single transaction contains one or more document operations (create, delete, etc.), one of which is a "patch" operation that applies a fine-grained transformation. The transformations use an extended version of JSONPath in order to be able to target deep tree nodes as well as apply transformations to multiple fields (e.g. authors[0].publications[*].title). Operations such as set, increment etc. are specific to the data type of the value being transformed and will fail if the data type does not support the transformation function.


Nice! Sorry for my bad English; it's not my native language :). In delta, each message can just be appended to its past versions (binary format, no parsing required, based on flatbuffers) in the same memory or disk region, which I call a superposition (a superposition can represent any resource); there is an image here: https://github.com/nebtex/delta/blob/master/docs/version-lin.... Each time you append a message, the tables in the new message are linked to their immediate past versions if they exist (tables are like nested messages in protobuf), but if a deletion message for that table is found, the link is not created. When the program tries to find a field in a table, it looks it up in the latest message first and then walks back through past versions until it finds something. I believe this should be fast given how caches work in modern computer architectures (still not tested). The superposition can be compacted to a single message in order to free space, and the compaction can run in parallel if you have a lot of messages; for example, it is possible to maintain all the mutations of the DB in a distributed log, and if people need to recreate the latest state or any past state, it should be a fast operation because all the available nodes can be used. I need to work on better docs for sure; if someone has recommendations for me, that would be really nice, hehe.
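
A toy Python version of that lookup (the real thing is flatbuffers with no parsing, but the walk is the same):

    DELETED = object()   # stand-in for an explicit deletion message

    def get_field(versions, field):
        # versions: oldest message first, newest last; each is a dict of
        # field -> value, with DELETED marking a delete in that version.
        for msg in reversed(versions):
            if field in msg:
                value = msg[field]
                return None if value is DELETED else value
        return None

    versions = [
        {"name": "blog", "tags": ["a"]},
        {"tags": DELETED},
        {"name": "blog v2"},
    ]
    get_field(versions, "name")   # -> "blog v2"
    get_field(versions, "tags")   # -> None (deleted by a later message)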


I'm actually midway through this book and I definitely recommend it. The content manages to be both approachable and enlightening. I'm a backend software engineer with the latitude to architect systems at my company, and the content so far has given me a stronger foundation for choosing how and where to manage our data. I really enjoy the mini-dives into the structures and decisions supporting the common databases you see today (B-trees, LSM trees, etc.) and discussing the trade-offs between them. Now I find myself better equipped to evaluate the tools at our disposal for a given job.

I can imagine it may not dive deep enough for people who really understand the internals of a given data store, and the content is probably available elsewhere. However, this book is a thoughtful and engaging curation of a lot of information that I may otherwise have missed.


"..better equipped to evaluate the tools at our disposal for a given job"

That's a great review - I learned about the book recently, and it sounds like exactly what I need right now, to make a more informed decision about database choices.


For the deep dive, there are always the numerous references at the end of each chapter. Some of these link to software project documentation, but most to recent papers or conference presentations.


This book is a modern survey of practical distributed systems. I knew bits and pieces of the material going in, but the way it was brought together was just masterful.

It will not appeal to the absolute novice, to be sure. But for anyone else who has worked on systems for moving data (ETL, streams) and storing data (databases and other data stores), this book will show you how things (probably stuff you've done bits and pieces of) fit together, and expound on the few foundational big ideas that make everything cohere. Once you've understood that, you are on your way to designing data systems that are much cleaner and more scalable.

My experience reading this book is a bit like that of a tradesperson going back to school to learn theory, and after being enlightened, coming away with a new understanding of how to put together theory and practice to better his craft.

I chanced upon this book through an excellent interview Martin Kleppmann did on Software Engineering Daily podcast. If you want the talk-show-host cliff notes version of what the book is about, you should listen to this particular episode:

https://softwareengineeringdaily.com/2017/05/02/data-intensi...


I've read it and highly recommend it.

Does anyone know books that are similar in style? (conceptual, showcasing different solutions to problems and their tradeoffs, high signal-to-noise)


A little more theoretical, but for programming languages there's "Programming Language Pragmatics", which covers imperative, functional, and logic PLs and everything needed to make them run: runtimes, linking, virtual machines, etc. It's not as demanding or in-depth as e.g. the "Dragon book".

A few shorter books have come out that try to touch on different approaches and that I've liked: "Seven Languages in Seven Weeks", and the series has also gained seven databases, concurrency models, and web frameworks.

Finally, there's this anthology where OSS authors describe what they did in their applications, so there's a ton of practical information: http://aosabook.org/en/index.html


Same question here. I am still reading this book, but the way the author combines concepts with practice, and the way the contents are structured, really inspires me to keep reading. (Usually I give up easily.)


I have the same question. I thoroughly enjoy the book and would love to see similar recommendations. Maybe it could be a good Ask HN post.


Picked this book up a few weeks ago and started diving in yesterday! As somebody who spends more time on the data generation and analysis side, but is looking to move more towards the data engineering side, it's been great. It helps build out that "tree trunk" of knowledge to really grok what's going on.

Edit: I picked it up based off the recommendation of another HN commenter who said they picked it up after listening to Martin on the SWE Daily Podcast[0]

[0] https://softwareengineeringdaily.com/2017/05/02/data-intensi...


I was very impressed with the quality and depth of the book. I also appreciated how unbiased it felt. I feel that so many books tout approach/technology X as the best approach, but Martin really did a great job explaining the trade-offs of various approaches and possible solutions to the problems they introduce. Highly recommended, especially considering the Amazon price point (~$25).


If you're trying to decide if this book is worth picking up, a number of HN commenters recommended it the last time it was brought up - https://news.ycombinator.com/item?id=15185663


I've been massively recommending this book. I think it's very unlikely a backend engineer can avoid having their applications on multiple servers. This book gives a wonderfully clear and practical description of the issues with distributed systems, and of the tools and techniques to address those issues. It's written in a way that is very accessible to someone without a traditional comp sci background.

I was excited about this book because there is a gap in distributed systems books. I feel like there are a large number of blog posts, but most of the books available on Amazon are textbooks and/or include heavy math.


It seems most of my comments these days are singing the praises of this book. It should be practically mandatory reading for anyone in the field. It ties concepts together and builds understanding in a way that doesn't rely on specific technologies. I wish it had been written and that I could have done a whole course on this in grad school.


I really enjoyed this book. I was unfamiliar with quite a lot of the material, so I can't really evaluate it in terms of other references. As an introduction, it's fantastic. Nice coverage of practical, modern tools. Theory stuff is covered at a high level, but with a ton of references for further study. The author is pretty opinionated about a lot of the more hype-driven "web-scale" marketing claims. I really appreciated that, as there's typically some kernel of truth behind the marketing fluff, and Kleppmann brings these out with moderation and context.

Great book if you'd like to finally understand Jepsen.io articles.


Came here to say the same thing. If you want to develop an intuition about when and why to use different data technologies this is the book for you. Each chapter is relatively self contained, so you don't have to read the whole thing to benefit from it.


I got my work to buy me this book. While I appreciate them doing so, I wish O'Reilly put a code in or something to let you redeem a PDF copy too. I've seen other publishers do this (e.g. Manning).

I tend to read on the tube a lot and lugging a hefty O'Reilly tome around on my commute isn't ideal.

Currently the book is just sitting on a shelf, I'll get round to it one day!


I know a lot of people here already praise the book... but I have to do that as well. It's a great overview and it explains all the relevant concepts really nicely. Thanks, Martin!


I have this book as well. As a more seasoned systems engineer, I think it covers a lot of the groundwork needed for people newer to the distributed disciplines.


This book has a very high ROI and I recommend it whole-heartedly. I can't honestly name any computer science book where I've gained so much in such a small amount of time.

I wrote a more detailed review at http://horia141.com/designing-data-intensive-applications-re... for the interested.


I'm halfway through this book and I highly recommend it. I'd say it's required reading for anyone who wants to be an architect.


Really loving the book. Definitely a must-read for data engineers. I would add that redbook.io is another great read.


I'm about 40% through this book, and it's been a stellar read so far. Using clear, well thought-out language it gets straight to business – chapter after chapter. Highly recommended!


Hands down the best book in this area. It took me some time to read it cover to cover, but it is definitely worth it.


Good book! It's only at the introductory level, but I liked that it lacks bias and does a survey of the field.


Do you have a reference at a more advanced level? It would be nice if you shared it. Thanks.


A book that tries to talk broadly necessarily has less room to go into detail. For advanced references, one has to drill down to sources that are more specific.

Thankfully, this book cites its sources and extensively documents references, and the author even maintains the reference links[0].

[0]: https://github.com/ept/ddia-references


I highly recommend this book. It's a must-read IMHO, and especially useful when trying to select a database.


I was a bit dismayed that this was about the technical design of data-intensive applications, not about their UX design. There still seems to be a huge gap in the latter.


Hah, I just thought the same! I've ended up in situations where I needed some UX patterns for data that updates often, with many updates per second, and wondered how best to show that. I guess one example is log viewers with debug output from some program.

Would very much appreciate if someone has some resources to link or case studies.


Probably you interpreted the word "design" with its narrower meaning of visual(ly-oriented) design, rather than, say, architecting data-intensive applications.

As with another poster, Edward Tufte's books came to mind - though it's about visual presentation of information, not user interface/experience design.

I've also felt that there's an unmet demand for books that provide a thorough overview of UI/UX design patterns, especially the way this book (Designing Data-Intensive Applications) does for its domain.


Design isn't just about visuals, but also interaction.

It is an overloaded term for sure, but the title of the book caught my attention, it is only when I read the article that I realized it was talking about the design of application implementations, not the UX.


Agreed. I was expecting some interesting data visualizations and other frontend topics.


Take a look at Tufte's book The Visual Display of Quantitative Information


Interaction works both ways (input and output). Tufte's work is generally in just one direction (visualization as output). It isn't as applicable to the application itself as it is to presentation (think: how did we figure out what to present in that PowerPoint in the first place?).



