Ask HN: What are good resources to learn system design?
246 points by techsin101 9 months ago | 65 comments
What I mean by system design is to understand seemingly endless options when it comes to data handling on backend side.

For example...

- Kafka

- Rabbitmq

- Kinesis

- Spark

- Elastic search

- Map reduce

- Bigquery

- InfluxDB

- Hadoop

- Teradata

- Snowflake

- Databricks


I understand Postgres the best, and would love to know why these and others exist, where they fit in, why they are better than PSQL and for what, and, if they are cloud-only, what their alternatives are... It seems all of them just store data, which PSQL does too, so what's the difference?

Designing Data-Intensive Applications

Recommended by the CTO of Azure, creator of Kafka, and many HN users on other threads including me :)


So many recommendations for this book but I really didn't like it; very theoretical. It lists a lot of the main data-related software systems and how they work, but that seemed about it. I really expected comparisons and tradeoffs, e.g. when should you switch from using a database to a message queue to Kafka? No real-world examples or design experience.

Everything is so case-specific. And at the same time most of the tools work pretty well at the scale of most companies, to the point that it doesn't matter what you choose when...

Because of those two, I don't know exactly what people expect apart from either specific use cases (conferences, blogs, etc.) or generic fundamentals (the book).

Groupthink. Just like that book about sleep and whatever other pop pseudo-intellectual books are repeated on here.

Sounds like you didn't read the book. There's actually a section with the title "Message brokers compared to databases" with a detailed comparison.

I would second the book and also suggest having a read through https://aws.amazon.com/builders-library/?cards-body.sort-by=....

There are some interesting problems and concepts at scale explained really well.

The book is fantastic! If you want to get a feel for the book before diving in (it's quite long), I've written a summary/review:


Seconded. Also, each chapter has a comprehensive reference section providing pointers to the various technical topics mentioned in the chapter. The references serve as a good starting point if one wants to further explore any area that interests them. They are also maintained at: https://github.com/ept/ddia-references

This book is a gem; concepts are greatly explained and it's full of references to papers and other articles. I've heard of people clearing system design interviews just by using this book as a reference. I'm not sure if that's an exaggeration, but from what I can see it does a great job covering the fundamentals of scalable systems.

> CTO of Azure

Not by the CTO of Azure but by the CTO of Microsoft, Kevin Scott.

Agreed. Captures a lot of the issues you get at scale.

Very solid book. Wish it had been out when I was still working on systems at cloud scale.

For me this book is a little dry, and I couldn't get past the first few chapters.

I wish there were video versions of this.

I just bought this book. Thanks

Please check out Azure Cloud Design Patterns [0]. It lists a large number of common patterns applied in the field of distributed application design.

[0]: https://docs.microsoft.com/en-us/azure/architecture/patterns...

It is very helpful for me as well. Thanks!

Looks like a great addition.

All the new shiny. From experience I can tell you

a) most companies deal with small amounts of data. Small can mean dozens of megabytes to dozens or hundreds of gigabytes. A single well provisioned server will typically be able to handle that very well. Also an SQL database can do a great deal if you know what you're doing.

b) inappropriately used big data frameworks are expensive performance killers. https://adamdrake.com/command-line-tools-can-be-235x-faster-... for example.

c) Good quality programming, as in understanding the machine, memory layout and why it matters, and a good understanding of algorithms (and a hefty dose of common sense), will often yield you more speedup than buying almost any number of new machines.

d) Hiring is often driven by fads, and companies often don't like being told 'you don't need this roomful of servers'; they like to waste money, so maybe do learn them (the profligacy with money is likely to come to an end with the economic damage of covid).

Takeaway: brainpower will get you much further than horsepower
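As an aside on point (b): the linked article's point is that a streaming, single-machine approach often beats a cluster. A hypothetical sketch of that idea in Python (the field layout and sample data here are made up for illustration):

```python
from collections import Counter

def count_results(lines):
    """Tally the third whitespace-separated field of each line.

    Streams line by line, so memory use stays constant no matter
    how large the input is -- no cluster required.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts

# Hypothetical records where the third field is a game result.
sample = [
    "2024.01.01 e4 1-0",
    "2024.01.02 d4 0-1",
    "2024.01.03 c4 1-0",
]
print(count_results(sample))  # Counter({'1-0': 2, '0-1': 1})
```

On a laptop this kind of loop chews through gigabytes in minutes; the cluster only starts paying for itself when the data genuinely doesn't fit on one machine.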

Yes. Also, if your foundation is already flaky in design and not well understood, turning it into a distributed system likely makes everything more error-prone and more likely to fail.

> The first rule of distributed systems is don't distribute your system until you have an observable reason to.

Hah. Well said. Along the same lines:

“You can have a second computer once you’ve shown you know how to use the first one.”

–Paul Barham

So much this. The original question has the cart before the horse. First understand the problem you're trying to solve, then solve THAT problem. Not the one you wish you had. Or, in the words of John Gall:

"A complex system that works is invariably found to have evolved from a simple system that worked. A complex system designed from scratch never works and cannot be patched up to make it work. You have to start over with a working simple system."

I'm the OP, and if I could I'd use Postgres everywhere. I want to know exactly when other stuff is useful and necessary, and the trade-offs, so I can make the judgement call for myself and my team: that yes, we don't need this, a regular SQL server would do with some pub/sub.

I can't tell you when to use it, but I can clearly tell you when not to (as has the other guy here).

If a one-machine Postgres suffices, stick with it. If it does not, would buying more processors/memory/disk fix that? If so, stick with it.

The big data stuff is when you really have too much data to deal with in one box and literally cannot get round that.

The tradeoffs are (in my limited experience): cost in terms of buying a stack of boxes; cost in terms of learning a massive new software stack; cost in terms of "it scales almost infinitely, but each node processes so much less".

I gave a link before, I hope that showed something useful. Here's another, see article and paper https://html.duckduckgo.com/html?q=scalability%20%22at%20wha...

Does that provide (at least the beginnings of) an answer?

(that last line could come across as rather snotty; was not intended that way. Meant to imply I'm happy to try to answer any questions to the very limited extent that I can).

A classic LAMP stack with server-side rendering will be fine for 90+% of companies out there, and will save a load of effort in development and maintenance.

A few months ago I wrote a post called "Systems design for advanced beginners" that a lot of people seem to have found helpful.

Link: https://robertheaton.com/2020/04/06/systems-design-for-advan... HN comments: https://news.ycombinator.com/item?id=23904000

Thanks for this! Got a link to this last week from a coworker and was super helpful.

This is great! The only thing I would say is missing is some discussion of API gateways, which are something you want to set up at the beginning of a project and not after.

Designing Data-Intensive Applications is excellent but also check these resources:

- https://github.com/donnemartin/system-design-primer

- http://aosabook.org/en/index.html

I want to commend you for asking this question. Once you open your mind to systems thinking (in a general, broad or abstract sense) it will make you a far better engineer.

When it comes to data, you are ultimately worried about 1. storing it and making sure it stays there and 2. retrieving it or asking questions about it with certain guarantees. Speed? Consistency? Local access? Grabbing a ton of rows at once? Grabbing really old data quickly? The old adage is true here: nothing in life is free. If you want fast writes you might sacrifice read performance, or vice versa. If you dial one knob up, another knob needs to get dialed down (usually). All of the tools you listed have various trade-offs and were designed or optimized for specific workloads. Some are more general (PSQL is a great example), but looking at them all spread out on a table, the differences become clearer.

Choosing your tool will depend on how well it meets your requirements and how it is going to play nice with all your other systems. Systems thinking is a lot bigger than choosing a performant tool that has the right libraries. You gotta think about long-term support: how do I do backups of my data? How do I restore data? How do I perform upgrades down the road? How do I deal with downtime; can I throw more resources at it?

Long story short: I am very glad to hear more people thinking about systems engineering but make sure you don't get too caught up in the specific tooling and libraries. Learning and practicing the concepts and fundamentals and making sure to pause to think in the abstract boxes-and-lines sense is very important, too.

Learning about 'Clean architecture' and 'hexagonal architecture' will help to reinforce good systems design patterns.

My thought process is that I don't necessarily want to learn all the tools, but rather understand them, so I'd know which one to pick and when. Right now everything seems redundant from my perspective, i.e. why have a time series database? You can save that in Postgres, just have two columns: time, data...?

Storing it is only half the problem: what happens when you have 1 billion rows? How quickly will you be able to query the exact data you want?

I would encourage you to stand up a PSQL instance, pack it with a hundred million rows of simulated data, and experiment :)
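For a quick feel of why this matters, here's a scaled-down version of that experiment, sketched with the stdlib sqlite3 module so it runs anywhere; a real PSQL instance with 100M+ rows would show the effect far more dramatically:

```python
import sqlite3
import time

# Scaled-down version of the suggested experiment: fill a table
# with simulated time series data, then compare a range query
# before and after adding an index on the timestamp column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, data TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    ((i, f"event-{i}") for i in range(200_000)),
)

def timed_count(lo, hi):
    start = time.perf_counter()
    (n,) = conn.execute(
        "SELECT COUNT(*) FROM events WHERE ts BETWEEN ? AND ?", (lo, hi)
    ).fetchone()
    return n, time.perf_counter() - start

n, t_scan = timed_count(50_000, 50_100)    # full table scan
conn.execute("CREATE INDEX idx_ts ON events (ts)")
n2, t_index = timed_count(50_000, 50_100)  # index range scan
print(f"scan: {t_scan:.4f}s  indexed: {t_index:.4f}s  rows: {n}")
```

The gap between the scan and the indexed query grows with the table; dedicated time series databases bake this kind of time-ordered layout (plus compression and retention policies) into their storage engine rather than bolting it on.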

Assuming you have some experience building simple single-node systems:

- Read Designing Data Intensive Applications. As others have said, it's a gem of a book, very readable, and it covers a lot of ground. It should answer both of your questions. Take the time to read it, take notes, and you should be well set. If you need to dive deeper into specific topics, each chapter links to several resources.

- Read some classic papers (Dynamo, Spanner, GFS). Some of these are readable while some are not-so-readable, but it'll be useful to get a sense of what problems they solve and where they fit in. You may not understand all of the terminology but that's fine.

That should give you a strong foundation that you can build upon. Beyond that, just build some systems, experiment with the ideas that you're learning. You cannot replace that experience with any amount of reading, so build something, make mistakes, struggle with implementation, and you'll reinforce what you've learned.

Backend is vast, and this helps you build a general sense of the topic. When you find a topic that you're really interested in (say stream processing, storage systems, or anything else), you can dive into that specific topic with some extra resources.

> I understand Postgres the best, and would love to know why these and others exist, where do they fit in, why are they better over PSQL and what for, and if they are cloud only what's their alternatives....It seems all of them just store data, which PSQL does too, so what's the difference?

A lot of that depends on the way you're building a system, the amount of data you're going to store, query patterns, etc. In most cases, there are tradeoffs that you'll have to understand and account for.

For example, a lot of column-oriented databases are better suited for analytics workloads. One of the reasons for that is their storage format (as the name says, columns rather than rows). Some of the systems you mentioned are built for search; some are built from the ground up to allow easier horizontal scaling, etc.
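To make the column-oriented point concrete, here's a toy illustration (not a real database engine) of the same records stored row-wise versus column-wise:

```python
# Row store: a list of complete records.
rows = [
    {"id": 1, "price": 10.0, "country": "DE"},
    {"id": 2, "price": 20.0, "country": "US"},
    {"id": 3, "price": 30.0, "country": "US"},
]

# Column store: one contiguous list per field.
columns = {
    "id": [1, 2, 3],
    "price": [10.0, 20.0, 30.0],
    "country": ["DE", "US", "US"],
}

# An analytics query like "average price" only needs one field.
# The row store must touch every field of every record...
avg_row = sum(r["price"] for r in rows) / len(rows)

# ...while the column store scans a single dense array, which is
# cache-friendly, easy to compress, and lets unrelated columns be
# skipped entirely.
avg_col = sum(columns["price"]) / len(columns["price"])

print(avg_row, avg_col)  # 20.0 20.0
```

Same data, same answer; the difference is how much of the stored bytes each layout forces the query to read, which is why the analytics-oriented systems in the OP's list (BigQuery, Snowflake, etc.) tend to be columnar.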

> In most cases, there are tradeoffs that you'll have to understand and account for.

Exactly what I want to learn. It looks like everyone is recommending the book, so I'll finish it first thing.

Uhh... at the risk of being too literal:

  - Kafka: a service for defining and managing message streams; used in service architectures that communicate by message-passing and in high-throughput data processing applications and pipelines.

  - RabbitMQ: another message queue service; less complex than Kafka.

  - Kinesis: a message queue service provided by AWS.

  - Spark: an in-memory distributed computation engine; a central "driver" consumes job definitions, written in code, and farms them out to "workers"; horizontally scalable; a variety of options exist for managed/hosted Spark.

  - ElasticSearch: a service for indexing data; consumes data blobs and search terms to associate them with; used to build search engines; many convenient utilities for managing search terms and queries.

  - MapReduce: a paradigm for defining distributed data operations; partitions of a "job" are sent to "mappers" that compute partial results, and those results then flow to "reducers," that combine the partial results into the finished output; Hadoop is the best-known implementation of this paradigm.

  - BigQuery: a scalable database offered by Google as a service.

  - InfluxDB: a time series database; used for storing and analyzing data that has a time component.

  - Hadoop: an implementation of the MapReduce paradigm; many hosted options, or you can run it on your own hardware.

  - Teradata: a company that sells various data analysis tools that run on its custom data warehouse.

  - Snowflake: hosted SQL database.

  - Databricks: hosted Spark.
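The MapReduce entry above can be sketched as a toy word count in plain Python; the real systems run the map and reduce phases on many machines, with the framework handling the shuffle step in between:

```python
from itertools import groupby

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in a partition.
    for word in document.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # Reducer: combine all partial counts for one key.
    return (word, sum(counts))

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "Shuffle": gather all mapper output and group it by key
# (in Hadoop, the framework does this between the two phases).
pairs = sorted(kv for doc in documents for kv in map_phase(doc))
result = dict(
    reduce_phase(word, (c for _, c in group))
    for word, group in groupby(pairs, key=lambda kv: kv[0])
)
print(result["the"], result["fox"])  # 3 2
```

The value of the paradigm isn't the word count itself; it's that once a job is expressed as independent map and reduce steps, the framework can spread it over hundreds of machines and retry failed partitions without the programmer managing any of that.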

This is what I was looking for, just with a little more depth and/or an infographic.

Only things I'd add:

- what is it? (done)

- when to use it over postgres?

- how it compares to alternatives?

Also, what else is there?

System design is not just about the game (individual tools), but about the meta game (flows of data, interconnected abstractions, navigating problem space). There is no rote checklist process that will reliably pick the right tool.

You do need to learn about tools at least superficially, but when you learn how to build the right mental models for your problems, that's when the whole picture starts to become clear and you will just "see" how the right tools will slot into your problem. Then you can deep dive into those tools.

I'd highly recommend starting with Bret Victor's demo, Up And Down The Ladder Of Abstraction: http://worrydream.com/LadderOfAbstraction/ (view on desktop) to start building the "abstraction muscle".

Then it will become more apparent what constraints might lead you to choose a message bus with a RabbitMQ broker instead of making internal HTTPS calls, for example.

[But really, as to your final paragraph, just use Postgres until you can't anymore]

I have this link saved in my bookmarks. Never read it though.


This is one of the best and most accessible pieces I've read about the underlying principles of how such systems work.


Rare find. Funnily enough, I got here because I wanted to build my own logging system and read that I should use a lot of things. So the question became: why can't I just store events in Postgres, it's a database, right? This article begins with logs, so I feel it understands my problem.

This is a kind of help vampire question. If you don't have problems which made you look at big data solutions, don't look for them. Those solutions will create problems for you if you don't already have them.

Using those things (Kafka, Hadoop etc.) when you don't have sufficient data to justify it is like using a supertanker to do your grocery shopping.

"If you don't have problems which made you look at big data solutions, don't look for them."

Why do you assume they don't have or aren't dealing with such problems?

Maybe they want to work for a company that has these problems.

Perhaps the question they should be asking, then, is how do they join such a company.

Experience with their tech stack is a good start :)

I used Kubernetes for my home server in college - it's a terrible idea and added unnecessary complexity for my use case, but my experience there got me a job where it solves more problems than it causes.

Could be, though it makes sense to first learn a bit about the topic so you know what you're getting yourself into.

The context of this kind of question is usually "I want to pass a systems design interview"

Does it talk about the services I listed, or is it just a general overview, i.e. use load balancers and caching?

I've taken it. It does talk about trade-offs, like when you want to use a relational or non-relational DB, and various strategies when it comes to caching, scaling, etc. Sometimes it will talk about specific tech like memcache, but I don't recall too many mentions of the exact technologies listed in the OP.

It's very useful for preparing for interviews that ask, "How would you recreate dropbox?" or "how would you recreate instagram." You know... system design interviews.

I think it's well worth a purchase, but only if you can afford it. I feel like the other resources people posted in the thread are better, but if you need structured content it's very well done.

You should look at the link.

It walks you through a) systems design of various companies you've heard of; b) how to design common things like a URL shortener or a rate limiter; c) more general topics like you mentioned.

There are also several example chapters available as a free preview, including the module on Instagram.

What you've listed here are tools. You need to learn system design concepts and not just tools.

Have a look here - https://github.com/donnemartin/system-design-primer

It might be helpful to also study how databases are built and function. They contain applications of theoretical concepts from Algorithms, Data Structures, OS and Distributed systems.

There are many textbooks on this subject but if you are feeling lost then I'd suggest starting with https://www.databass.dev/ which gives a decent birds eye view of many concepts.

You have to start learning about different architectures and patterns first. Their popular use cases and pros and cons. That will help you to understand where different features and behaviors of a certain product/service really fits. Details are mostly about trade-offs between different resources and design choices. Read case studies from companies that tried those architectures to learn more about challenges and benefits.

It isn't only about features: cost and security are big factors, as are risk, disaster recovery, data management, SLAs, available APIs, and interfaces.

You have to calculate how different resources and architectures will scale with your use-case and how much they will cost to develop and maintain.

There are also other variables that are related to your organization. Internal parameters like available skills, organization structure, project life cycle, available documentation, and long-term support are big factors when making a decision.

It's a good thing I favorited this. System design for beginners. https://news.ycombinator.com/item?id=23904000

Also, check the benchmarks, the scalability, the architecture, etc. Sometimes DBs with similar frontends (APIs) are very different on the backend (architecture, implementation), for example CockroachDB vs PostgreSQL; hence different usage. One is OLTP, the other is OLAP, etc.

Read "Designing Data-intensive Applications"-- it's a great combo of theory + real-application (including some of the technologies you listed).

From my understanding (I don't have much backend experience), you need those only for specific workloads. First learn the difference between OLTP and OLAP. Traditional DBs are usually designed for OLTP, and newer DBs are designed for OLAP, some for mega scale (petabytes).

I recommend you learn:

- ES - for text search

- ClickHouse - simplest OLAP

- Cassandra (petabytes of data, columnar store)

- Some time series DBs (analytics)

- Graph DBs

RabbitMQ, Kafka, or Pulsar are used for message bus/queue implementations. Simple case: producing a message takes 1 time unit but processing takes 5 units, so you want a kind of threading without coupling to specific hosts; so you use a queue and subscribe readers to it. Read the ZeroMQ docs on all the communication patterns to learn the typical cases.
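That "produce fast, consume slowly" case can be sketched in-process with the stdlib queue module (a stand-in for what a real broker like RabbitMQ provides across separate hosts):

```python
import queue
import threading

# One fast producer, several slower consumers, decoupled by a
# queue. With a real broker the queue lives on its own host and
# the workers are separate processes or machines.
q = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        msg = q.get()
        if msg is None:          # sentinel: shut this worker down
            q.task_done()
            return
        processed = msg.upper()  # pretend this takes 5x longer
        with lock:
            results.append(processed)
        q.task_done()

workers = [threading.Thread(target=worker) for _ in range(5)]
for w in workers:
    w.start()

for i in range(20):              # producing is cheap and never blocks
    q.put(f"msg-{i}")
for _ in workers:                # one sentinel per worker
    q.put(None)

q.join()                         # wait until every message is handled
for w in workers:
    w.join()
print(len(results))  # 20
```

The producer never waits on a consumer; you absorb bursts in the queue and scale throughput by adding workers, which is exactly the property a broker gives you between services.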

Your question seems to imply generality ("systems design"), but the description of your question seems to imply specific tooling (e.g., Kafka).

Many people have mentioned really good books (e.g., DDIA). Such books are good for gathering a general knowledge about "systems design", but you will be still clueless about the differences between Kafka and Rabbitmq until you actually read their documentation manuals.

There is no shortcut I'm afraid. If you want to "understand seemingly endless options when it comes to data handling on backend side" you will have to read the corresponding seemingly endless documentation manuals. How else would you know about the advantages or disadvantages of, let's say, InfluxDB over Postgres if you don't read their manuals?

Being able to make these decisions means understanding a wide variety of potential moving pieces. Getting broad exposure is a key theme.

I recommend reading this specifically; this is basically an education in production systems, and covers a lot of ground. :)


This playlist helped get me familiar with a wide breadth of topics - https://www.youtube.com/watch?v=vge7qwCR1dA&list=PLt4nG7RVVk...

I recently came across this YouTube channel https://www.youtube.com/playlist?list=PLMCXHnjXnTnvo6alSjVkg... which provides some examples of system design on a whiteboard, very useful to put all pieces together.


Start to build.

Yeah... no. You can't learn by just "building" (or you can, but it's simply too slow); you need to do a lot of "reading" too. At some point you do need to "write"/"do" stuff (because otherwise you'll forget, and you won't get an in-depth appreciation and understanding of what you read), but the reading part is critical for learning.

