
Designing Data-Intensive Applications - wooola
https://dataintensive.net/
======
agadabanka
I am a backend engineer who has been working in the mobile gaming space for
many years now. Most of the focus today for mobile gaming backends is to scale
to millions of players while offering low latency and real-time interactions.
Have been following the development of this book through its beta and having
read it now, I think it is fair to say that this book is worth its weight in
gold. I can relate directly to all the challenges faced over the years in
implementing a real time distributed database _like_ functionality with a high
read _and_ write throughput, both custom and off the shelf. Extremely well
written.

------
mshenfield
In "The Future of Data Systems", the author imagines a system where the
application writes events to a Kafka-like distributed log. Consumers of the
log do work like de-duping, committing the data to an RDBMS, invalidating
caches, updating search indexes, etc. The application might read state
directly from log updates, or have a system to sync w/ some sort of atomic
state (e.g. RethinkDB changefeeds).
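
A rough sketch of the shape described above, for illustration only. The in-memory `EventLog` stands in for a Kafka-like log, and the class names, event fields (`id`, `key`, `value`) and dict-backed "table" and "cache" are all assumptions of this sketch, not anything taken from the book.

```python
from dataclasses import dataclass, field


@dataclass
class EventLog:
    """Append-only in-memory log standing in for a Kafka-like topic (hypothetical)."""
    events: list = field(default_factory=list)

    def append(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # offset of the event just written

    def read_from(self, offset: int):
        # Return (offset, event) pairs from the given offset to the end of the log.
        return [(i, self.events[i]) for i in range(offset, len(self.events))]


class Consumer:
    """Each consumer tracks its own offset, so it can lag or replay independently."""

    def __init__(self, log: EventLog):
        self.log = log
        self.offset = 0

    def poll(self):
        for offset, event in self.log.read_from(self.offset):
            self.handle(event)
            self.offset = offset + 1

    def handle(self, event: dict):
        raise NotImplementedError


class DedupingTableWriter(Consumer):
    """Skips event ids it has already seen, then upserts into a dict standing in for an RDBMS table."""

    def __init__(self, log: EventLog):
        super().__init__(log)
        self.seen_ids = set()
        self.table = {}

    def handle(self, event: dict):
        if event["id"] in self.seen_ids:
            return  # duplicate delivery, ignore
        self.seen_ids.add(event["id"])
        self.table[event["key"]] = event["value"]


class CacheInvalidator(Consumer):
    """Evicts the cache entry for every key that appears in the log."""

    def __init__(self, log: EventLog, cache: dict):
        super().__init__(log)
        self.cache = cache

    def handle(self, event: dict):
        self.cache.pop(event["key"], None)


log = EventLog()
cache = {"user:1": "stale"}
writer = DedupingTableWriter(log)
invalidator = CacheInvalidator(log, cache)

# The application only ever appends to the log.
log.append({"id": "e1", "key": "user:1", "value": "alice"})
log.append({"id": "e1", "key": "user:1", "value": "alice"})  # duplicate delivery
log.append({"id": "e2", "key": "user:2", "value": "bob"})

writer.poll()
invalidator.poll()
print(writer.table)
print(cache)
```

Each downstream system is just another consumer with its own offset, which is what would let a new one (say, a search indexer) be added later and catch up by reading the log from the start.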

The architecture seems to solve two big problems

* Scaling RDBMS (there are solutions like Cloud Spanner but they rely heavily on Google's proprietary network for low latency and are expensive as balls)

* Keeping downstream systems in sync. A lot of companies have gone with a "Tail the RDBMS" of some kind, e.g. writing the MySQL binlog to a Kafka queue and having downstream consumers read from that, but this seems like a more elegant solution.

Are there any examples or experiences of people working with systems like
this? What are some downsides, challenges, and actual benefits?

~~~
slackingoff2017
I have been involved in using a system built like this. All I can say is... It
feels like you're building a database out of an event stream.

A shitty one at that... Basically the write log part, only without a way to
apply that state reliably like a real database. So you have to keep the log
around basically forever. It's like you're in the middle of a DB recovery all
the time.

After insane amounts of research and deep thought my personal opinion is that
this is the wrong way to do scalable systems. Event sourcing and eventual
consistency are taking industry for a ride in the wrong direction.

In my quest to find a better way I found some research/leaks/opinions of
Googlers, and I think they're right. Even Netflix admits that using eventual
consistency means they have to build systems that go around and "fixup" data
that ends up in bad states. Ew. Service RPC loops in any such systems are
Pandora's box. Are these calls getting the most recently updated version of
the data? Nobody knows. Even replaying the event log can't save you, the log
may be strongly ordered but the data state between services that call each
other is partly determined by timing. Undefined behavior.

You'll notice that LinkedIn/Netflix/Uber etc all seem to be building their
systems using this pattern. Who is conspicuously absent? Google. The father of
containers, VM's, and highly distributed systems is mum.

Researching Google's systems gives some fascinating answers to the problem of
distributed consistency, a solution I'm stunned hasn't seen more attention.
Google decided as early as 2005 that eventually consistent systems were too
hard to use and manage. All of their databases, BigTable, MegaStore, Spanner,
F1... They're all strongly consistent in certain ways.

How does Google do it? They make the database the source of truth. Service RPC
calls either fail or succeed immediately. Service call loops, while bad for
performance, produce consistent results. Failures are easy to find because
data updates either succeed or fail immediately, not in some unbounded future
time.
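
To make "succeed or fail immediately" concrete, here is a small sketch using a single-node SQLite database from the Python standard library. It is only a stand-in: nothing here is distributed, and the `accounts` table and `transfer` helper are made up for the example rather than being anything Google's systems expose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst. Returns True if committed, False if rolled back."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # If this would make the balance negative, the CHECK constraint fails
            # and the credit above is rolled back together with it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = transfer(conn, "alice", "bob", 60)   # enough funds
second = transfer(conn, "alice", "bob", 60)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(first, second, balances)
```

The caller knows the outcome as soon as `transfer` returns, and a failed call leaves no half-applied state behind to be fixed up later.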

The rest of the industry is missing the point of microservices IMO. Google's
massively distributed systems are enabled largely by their innovative database
designs. The rest of the industry is trying to replicate the topography of
Google's internal systems without understanding what makes them work well.

For microservices to be realistically usable for most use cases we need
someone to come up with decent competition to Google's database systems. When
you have a transactional distributed database all the problems with data
spread across multiple services go away.

HBase was a good attempt but doesn't get enough love. A point missed in the
creation of HBase, that becomes clear when reading the papers about MegaStore
and Spanner, is that it wasn't designed to be used as a data store by itself.
Instead, it has the minimal features needed to build a MegaStore on top of it.
The weirder features of HBase/BigTable (like keeping around 3 copies of
changed data, and row level atomicity without transactions) are clearly
designed to make it possible to build a database on top of it.

Unfortunately nobody thus far has taken up that challenge, and outside Google
we're all stuck with shitty databases that Google tossed away a decade ago.

~~~
jamesblonde
Great insightful comment. I came to the same conclusion a number of years ago.
We did something about it - we built a new Hadoop platform around a not very
well known distributed, in-memory, open-source database - MySQL Cluster (NDB).
It is not the MySQL Server you think you know. It is an in-memory OLTP engine
used by most network operators as a call subscriber DB. It can handle
millions of reads or writes/sec on commodity hardware (it has been benched at
200m reads/sec, about 80m writes/sec). It has transactions (read committed
isolation level) and row-level locks. It supports efficient cross-partition
transactions using one transaction coordinator per database node (up to 48 of
them). You can build scalable apps with strong consistency if you can write
apps with primary key ops and partition-pruned index scans. We managed to
scale out HDFS by 16X with this technique. Since then, we have been doing like
you suggested - we built a microservices architecture for Hadoop called
Hopsworks around the transactional distributed database. All the evils of
eventual consistency go away - systems like Apache Ranger/Sentry become just
tables in the DB. More reading is available here:
[http://www.hops.io/?q=content/news-events](http://www.hops.io/?q=content/news-events)
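
For readers unfamiliar with the terms, here is a toy sketch of what "primary key ops and partition-pruned index scans" buys you. It is not NDB's API. The `PartitionedTable` class, the choice of the parent directory as partition key, and the counter of partitions touched are all assumptions made for illustration.

```python
import zlib

NUM_PARTITIONS = 4  # arbitrary for the sketch


def partition_for(partition_key: str) -> int:
    # Stable hash so the same key always maps to the same partition.
    return zlib.crc32(partition_key.encode("utf-8")) % NUM_PARTITIONS


class PartitionedTable:
    """Rows keyed by (parent_dir, name), hash-partitioned on parent_dir only."""

    def __init__(self):
        self.partitions = [{} for _ in range(NUM_PARTITIONS)]
        self.partitions_touched = 0  # counts partitions visited by reads

    def put(self, parent_dir, name, size):
        self.partitions[partition_for(parent_dir)][(parent_dir, name)] = size

    def get(self, parent_dir, name):
        # Primary key lookup: the key tells us the single partition to visit.
        self.partitions_touched += 1
        return self.partitions[partition_for(parent_dir)].get((parent_dir, name))

    def list_dir(self, parent_dir):
        # Partition-pruned scan: the predicate includes the partition key,
        # so only one partition has to be scanned.
        self.partitions_touched += 1
        part = self.partitions[partition_for(parent_dir)]
        return sorted(name for (pdir, name) in part if pdir == parent_dir)

    def find_larger_than(self, min_size):
        # No partition key in the predicate: every partition must be scanned.
        self.partitions_touched += NUM_PARTITIONS
        return sorted(
            key
            for part in self.partitions
            for key, size in part.items()
            if size > min_size
        )


table = PartitionedTable()
table.put("/logs", "a.log", 10)
table.put("/logs", "b.log", 500)
table.put("/data", "x.parquet", 900)

print(table.get("/logs", "a.log"))
print(table.list_dir("/logs"))
print(table.find_larger_than(100))
print(table.partitions_touched)
```

The first two reads each visit one partition no matter how many partitions exist, while the last one visits all of them, which is why restricting an app to the first two kinds of access is what lets it scale out.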

~~~
psandersen
Hopsworks looks like it might be exactly what I need, I do typical data
science work for small to small-medium data and wanted to start properly
playing with spark on a HDFS store.

Currently most work is just done in R/Python in VM's on a small proxmox
cluster (where only 1 node is always on) but I'd like to start gently moving to
spark, run the stack on a single node and scale on demand.

Is Hopsworks for me, does this approach even make sense for such small data or
am I crazy? Thanks for your response!

~~~
jamesblonde
Yes, Hopsworks can run on anything from 1 server to 1000s. We are finalizing
the first proper release now - Jupyter support, tensorflow, pyspark, sparkr,
python-kernel for jupyter too.

~~~
psandersen
Awesome, that sounds perfect, I'll give it a shot. Do you have a mailing list or
any way to follow? Cheers

------
topstriker515
I'm actually midway through this book and I definitely recommend it. The
content manages to be both approachable and enlightening. I'm a backend
software engineer with the latitude to architect systems at my company and the
content so far has given me a stronger foundation for choosing how and where
to manage our data. I really enjoy the mini-dives into the structures and
decisions supporting the common databases you see today (B-trees, LSM trees,
etc.) and discussing the trade offs between them. Now I find myself better
equipped to evaluate the tools at our disposal for a given job.

I can imagine it may not dive deep enough for people who really understand the
internals of a given data store and the content is probably available
elsewhere. However this book is a thoughtful and engaging curation of a lot of
information that I may have missed otherwise.

~~~
lioeters
"..better equipped to evaluate the tools at our disposal for a given job"

That's a great review - I learned about the book recently, and it sounds like
exactly what I need right now, to make a more informed decision about database
choices.

------
wenc
This book is a modern survey on practical distributed systems. I knew bits and
pieces of the material going in, but the way it was brought together was just
masterful.

It will not appeal to the absolute novice to be sure. But for anyone else who
has worked on systems for moving data (ETL, streams) and storing data
(databases and other data stores), this book will show you how things
(probably stuff you've done bits and pieces of) fit together and expound on the
few foundational big ideas that make everything cohere. Once you've
understood that, you are on your way to designing data systems that are much
cleaner and more scalable.

My experience reading this book is a bit like that of a tradesperson going
back to school to learn theory, and after being enlightened, coming away with
a new understanding of how to put together theory and practice to better his
craft.

I chanced upon this book through an excellent interview Martin Kleppmann did
on Software Engineering Daily podcast. If you want the talk-show-host cliff
notes version of what the book is about, you should listen to this particular
episode:

[https://softwareengineeringdaily.com/2017/05/02/data-intensi...](https://softwareengineeringdaily.com/2017/05/02/data-intensive-applications-with-martin-kleppmann/)

------
g___
I've read it and highly recommend it.

Does anyone know books that are similar in style? (conceptual, showcasing
different solutions to problems and their tradeoffs, high signal-to-noise)

~~~
Erwin
A little more theoretical, but for programming languages, there's "Programming
Language Pragmatics" which covers imperative, functional, logical PLs and
everything needed to make them run from runtimes, linking, virtual machines
etc. It's not as demanding or in-depth as e.g. the "Dragon book".

A few shorter books have come out that try to touch different approaches that
I've liked: "Seven Languages in Seven Weeks" -- and the series has also
gained 7 databases, concurrency models and web frameworks.

Finally, there's this anthology where OSS authors described what they did in
their applications, so there's a ton of practical information
[http://aosabook.org/en/index.html](http://aosabook.org/en/index.html)

------
veritas3241
Picked this book up a few weeks ago and started diving into it yesterday! As
somebody who spends more time on the data generation and analysis side, but is
looking to move more towards the data engineering side, it's been great. Helps
build out that "tree trunk" of knowledge to really grok what's going on.

Edit: I picked it up based off the recommendation of another HN commenter who
said they picked it up after listening to Martin on the SWE Daily Podcast[0]

[0] [https://softwareengineeringdaily.com/2017/05/02/data-intensi...](https://softwareengineeringdaily.com/2017/05/02/data-intensive-applications-with-martin-kleppmann/)

------
sbpayne
I was very impressed with the quality and depth of the book. I also
appreciated how unbiased it felt. I feel that so many books tout
approach/technology X as the best approach, but Martin really did a great job
at explaining trade offs of various approaches and possible solutions to the
problems they introduce. Highly recommended--especially considering the
Amazon price point (~ $25).

------
teej
If you're trying to decide if this book is worth picking up, a number of HN
commenters recommended it the last time it was brought up -
[https://news.ycombinator.com/item?id=15185663](https://news.ycombinator.com/item?id=15185663)

------
dm03514
I've been massively recommending this book. I think it's very unlikely a
backend engineer can avoid having their applications on multiple servers. This
book does a wonderfully clear and practical description of the issues with
distributed systems, and tools and techniques to address those issues. It's
written in a way that is very accessible to someone without a traditional comp
sci background.

I was excited about this book because there is a gap in distributed systems
books. I feel like there are a large number of blogs but most of the books
available on amazon are text books and/or include heavy math.

------
dswalter
It seems most of my comments these days are singing the praises of this book.
It should be practically mandatory reading for anyone in the field. It ties
concepts together and builds understanding in a way that doesn't rely on
specific technologies. I wish it had been written earlier so that I could have
done a whole course on this in grad school.

------
bglazer
I really enjoyed this book. I was unfamiliar with quite a lot of the material,
so I can't really evaluate it in terms of other references. As an
introduction, it's fantastic. Nice coverage of practical, modern tools. Theory
stuff is covered at a high level, but with a ton of references for further
study. The author is pretty opinionated about a lot of the more hype driven
"web-scale" marketing claims. I really appreciated that, as there's typically
some kernel of truth behind the marketing fluff, and Kleppmann brings these out
with moderation and context.

Great book if you'd like to finally understand Jepsen.io articles.

~~~
mshenfield
Came here to say the same thing. If you want to develop an intuition about
when and why to use different data technologies this is the book for you. Each
chapter is relatively self contained, so you don't have to read the whole
thing to benefit from it.

------
djhworld
I got my work to get me this book, while I appreciate them doing so I wish
O'Reilly put a code in or something to let you redeem a PDF copy too. I've
seen other publishers do this (e.g. Manning)

I tend to read on the tube a lot and lugging a hefty O'Reilly tome around on
my commute isn't ideal.

Currently the book is just sitting on a shelf, I'll get round to it one day!

------
drej
I know a lot of people here already praise the book... but I have to do that
as well. It's a great overview and it explains all the relevant concepts
really nicely. Thanks, Martin!

------
ninjakeyboard
I have this book as well. As a more seasoned systems engineer, I think it
covers a lot of the groundwork needed for people newer to the distributed
disciplines.

------
horia141
This book has a very high ROI and I recommend it whole-heartedly. I can't
honestly name any computer science book where I've gained so much in such a
small amount of time.

I wrote a more detailed review at [http://horia141.com/designing-data-intensive-applications-re...](http://horia141.com/designing-data-intensive-applications-review.html) for the interested.

------
nindalf
I'm halfway through this book and I highly recommend it. I'd say it's required
reading for anyone who wants to be an architect.

------
muramira
Really loving the book. Definitely a must read for the data engineers. I would
add that redbook.io is another great read.

------
yoshuaw
I'm about 40% through this book, and it's been a stellar read so far. Using
clear, well thought-out language it gets straight to business – chapter after
chapter. Highly recommended!

------
chajath
Hands down the best book in this area. Took me some time to read the book
cover to cover but it is definitely worth it

------
blaisio
Good book! It's only at the introductory level, but I liked that it lacks bias
and does a survey of the field.

~~~
criloz2
Do you have a reference at a more advanced level? It would be nice if you
shared it. Thanks

~~~
pritambaral
A book that tries to talk broadly necessarily has less room to go into detail.
For advanced level references, one will have to drill down to sources that are
more specific.

Thankfully, this book cites its sources and extensively documents references,
and the author even maintains the reference links[0].

[0]: [https://github.com/ept/ddia-references](https://github.com/ept/ddia-references)

------
mehra
I highly recommend this book. It's a must-read book IMHO. Especially useful
when trying to select a database.

------
seanmcdirmid
I was a bit dismayed that this was about the technical design of data-
intensive applications, not about their UX design. There still seems to be a
huge gap in the latter.

~~~
lioeters
Probably you interpreted the word "design" with its narrower meaning of
visual(ly-oriented) design, rather than, say _architecting_ data-intensive
applications.

As with another poster, Edward Tufte's books came to mind - though it's about
visual presentation of information, not user interface/experience design.

I've also felt that there's an unmet demand for books that provide a thorough
overview of UI/UX design patterns, especially the way this book (Designing
Data-Intensive Applications) does for its domain.

~~~
seanmcdirmid
Design isn't just about visuals, but also interaction.

It is an overloaded term for sure, but the title of the book caught my
attention, it is only when I read the article that I realized it was talking
about the design of application implementations, not the UX.

