
For anyone eager to read something now, Designing Data-Intensive Applications [1] is an excellent and completed book that covers nearly all of the same material with significant depth.

[1] https://www.amazon.com/Designing-Data-Intensive-Applications...

I recently took over a large (new) data engineering project. After being given almost no direction, I sat down and read this book and let it guide my design.

When we reviewed the design I mentioned a few points that were like: "Yeah, I know the few requirements you gave run counter to this design, but if we do it this way it'll help us out (source in book)."

This book is really well written, and I've learned so much from it and I keep opening it up every day for further guidance.

Why is this book considered to be so good? I started it because it's been recommended on HN so much and I gave it up rather quickly because it was really dry and not all that focused on practical applications. Should I give it another go?

I highly recommend it. It does a very good job of explaining the "magic" behind all the data storage techniques, giving you solid fundamentals and an intuition for why each of them is good for certain kinds of problems.

After reading the book, googling for something like "mongodb vs. cassandra" starts to feel as silly as googling for something like "javascript vs. css", because you understand the fundamental differences between them.

No more need to hope the vague Medium post you found while trying to decide which DB to use would match your use case closely enough.

I’ve read the book too and didn’t feel like it covered much that isn’t covered by an undergraduate CS curriculum in databases and distributed systems. Perhaps the book appeals to developers without a formal education in computer science?

Some senior engineers have been in the game long enough that they could have a reputable CS degree without classes in DB or distributed systems. Now it seems less likely, but back then, once Java, C++, and C were taught, the other topics were electives.

This. Data engineering has ramped up significantly and if you want senior people you'll quickly run out of people who've been exclusively doing "big data" for 5+ years.

So your options are either senior software engineers who have done some data work (that's how I got to be a Data Engineer) or people who've been doing analytical data work (either in the traditional warehousing space or via science/insurance/finance type spaces) that are semi-technical but have no formal engineering background.

The former are people who went to college in the late 90s/early 2000s (like myself) when things were different. The latter need to hyperfocus on coming up to speed in engineering.

I reviewed this guide a couple months ago for my employer to consider as the basis of an internal bootcamp, and I'd note that it's perfect for the audiences I mentioned. Also, even for people with more up to date academic experience, note that the transactional database schemas that software normally deals with often look wildly different than analytical structures.

Indeed. I have been out of school almost 20 years now; back then, distributed systems were an advanced research topic and no class on them existed. My database course was an elective and focused entirely on RDBMSes and SQL.

I have kept up to date on these technologies: I participated in undergrad research on distributed systems and my career has revolved around them. Many devs never really get a say in where their data goes. They might read a blog post or two about new systems, but it leaves a very light imprint. It's been rather spotty as to whether I had any say in where my data is stored throughout my career.

This book enabled me to think better from first principles.

e.g. How might I go about optimizing a Redshift query? Well, now I have an idea of how data is laid out on disk. Because Redshift is a columnar store, if I try to optimize query X, I can picture how the index would need to look for sequential reads to be faster.

I could find a reference on how to optimize Redshift queries, but this book answers the WHY and not just the immediate how.
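
To make the "laid out on disk" point concrete, here is a rough Python sketch of why a columnar layout helps a query that only needs one column. It is a toy model, not how Redshift actually stores data: the table, the column names, and the helper functions are all made up for illustration, and "values touched" is only a stand-in for disk reads.

```python
# Hypothetical toy table: (order_id, customer, amount).
rows = [
    (1, "alice", 30),
    (2, "bob", 20),
    (3, "alice", 50),
]


def to_columns(rows, names):
    """Pivot row tuples into one list per column (a column-oriented layout)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


def sum_row_store(rows, col_index):
    """Row layout: every full row is visited even though one field is needed."""
    values_touched = 0
    total = 0
    for row in rows:
        values_touched += len(row)  # the whole row is read to reach one field
        total += row[col_index]
    return total, values_touched


def sum_column_store(columns, name):
    """Column layout: only the requested column is scanned, sequentially."""
    col = columns[name]
    return sum(col), len(col)


columns = to_columns(rows, ["order_id", "customer", "amount"])
row_total, row_touched = sum_row_store(rows, 2)
col_total, col_touched = sum_column_store(columns, "amount")
print(row_total, row_touched)
print(col_total, col_touched)
```

Both versions compute the same sum, but the column version only walks the one list it needs, which is the intuition behind aggregate queries being cheaper on a columnar store.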

I've read so many books that were practical, yet became so much less useful over time. (e.g. reading a book about the specifics of the Angular API, whereas now I write mostly React.)

I keep returning to this book for a top-level view of the fundamentals of distributed systems, specifically data stores.

I hope you give the book a second look at some point.

I really like it because it covers just enough on a number of topics and ties them together. There are many books which can allow you to delve further into specific subjects.

The book may seem rather shallow if you are an experienced developer, but I feel it is extremely good at covering the breadth of data-intensive applications. For practical applications I have found following open source frameworks like Kafka, Spark or Presto more helpful. You can also go through the references cited in the book to look at other applications.

I might even have stronger feelings than Vicky in terms of how 'useful' it is. If you want to build another piece of tooling that we already have too much of, then maybe. http://veekaybee.github.io/2019/04/11/attic-compsci/

Yep! It is really great and covers theory, technical implementations, and practical implementations while not locking into any vendor or specific tech stack. If anything, its technical information is too dense.
