
Data Engineering Cookbook - charlysl
https://github.com/andkret/Cookbook
======
meritt
For anyone eager to read something now, Designing Data-Intensive Applications
[1] is an excellent and completed book that covers nearly all of the same
material with significant depth.

[1] [https://www.amazon.com/Designing-Data-Intensive-Applications...](https://www.amazon.com/Designing-Data-Intensive-Applications-Reliable-Maintainable/dp/1449373321)

~~~
ps101
Why is this book considered to be so good? I started it because it's been
recommended on HN so much and I gave it up rather quickly because it was
really dry and not all that focused on practical applications. Should I give
it another go?

~~~
hdra
I highly recommend it. It does a very good job of explaining the "magic"
behind all the data storage techniques, giving you a solid grounding in the
fundamentals and an intuition for why each of them is good for certain kinds
of problems.

After reading the book, googling for something like "mongodb vs. cassandra"
would start to feel as silly as googling for something like "javascript vs.
css" as you start to understand the fundamental differences between them.

No more need to hope the vague Medium post you found while trying to decide
which DB to use would match your use case closely enough.

~~~
laichzeit0
I’ve read the book too and didn’t feel like it covered much that isn’t covered
by an undergraduate CS curriculum of databases and distributed systems.
Perhaps the book appeals to developers without a formal education in computer
science?

~~~
alttab
Some senior engineers have been in the game long enough that they could have a
reputable CS degree without classes in DB or distributed systems. That seems
less likely now, but back then, after teaching Java, C++, and C, programs
treated the other topics as electives.

~~~
triplee
This. Data engineering has ramped up significantly and if you want senior
people you'll quickly run out of people who've been exclusively doing "big
data" for 5+ years.

So your options are either senior software engineers who have done some data
work (that's how I got to be a Data Engineer) or people who've been doing
analytical data work (either in the traditional warehousing space or via
science/insurance/finance type spaces) that are semi-technical but have no
formal engineering background.

The former are people who went to college in the late 90s/early 2000s (like
myself) when things were different. The latter need to hyperfocus on coming up
to speed in engineering.

I reviewed this guide a couple months ago for my employer to consider as the
basis of an internal bootcamp, and I'd note that it's perfect for the
audiences I mentioned. Also, even for people with more up to date academic
experience, note that the transactional database schemas that software
normally deals with often look wildly different than analytical structures.

------
CapmCrackaWaka
Excessive self-promotion aside, I think high level documents like this are
more important than people give them credit for.

Computer science is: 1) Daunting to beginners 2) Changing constantly

It is absolutely a field where people suffer from "I don't know what I don't
know". 100-ft overviews of best practices and new buzz-words are refreshing,
in my opinion.

The criticisms I have of this document are all preference. I do not want to
watch the podcasts or youtube videos, especially if I just want a high level
overview of something. A ctrl-f of 'andreaskayy' returns 12 results. This
guy's self-promotion is everywhere, which is fine for some people, but it
makes me think I'm getting one man's opinion on everything, and not an unbiased
explanation of different technologies/methods.

~~~
hdra
>Daunting to beginners

I agree it is. But I just took a quick look at the book, and while I understand
the book is not done yet, looking at the table of contents, I doubt this would
help. Like, do you have to cover kerberos, IP Subnetting, OSI IP model, Agile,
git, REST API, docker, and many more in a _single book_ about "data
engineering"? If anything, it would confuse beginners even more. It's like the
author tried to cram as many buzzwords as possible into a single book.

~~~
CapmCrackaWaka
I agree this author seems more interested in self-promotion than anything
else.

However, I don't think the purpose of this book is to cover these subjects in
their entirety. Most of the time, with books like these, I would just like to
see definitions, use cases, why it's used, maybe even what it replaced in the
space it exists in. Materials like this aren't meant to make you an expert,
they are just meant to show you what's out there, why it's out there, and
(maybe) what was out there before. They give you context with which to google
things.

Either way, everyone expects something different from educational materials. I
personally would not use this book, but I get the idea he was going for.

------
madis
How about calling it "Book of mostly empty chapters on various topics about
software engineering with slight emphasis on data engineering"?

~~~
kylek
How about constructive criticism...

~~~
squeaky-clean
Have you looked through the PDF? Many sections are entirely blank except a
section header. Many are just links to a blog post or podcast.

The Data Warehouse vs Data Lake chapter is a single podcast link. The Hadoop
chapter is 5 pages, mostly taken up by large diagrams, and the docker chapter is
less than 4 pages with half of the sections empty except a heading. The REST
API chapter is less than 2 pages with a blank section headed OAuth Security.

Data Visualization is entirely blank. The database chapter is mostly empty
except for text about HDFS, and just links on MongoDB, ElasticSearch and
InfluxDB. Apache Kafka gets its own mostly blank chapter.

Most of the beginning chapters seem unrelated to data engineering. 3) Learn to
Code. 4) Getting Started with Git. 5) Agile Development. 6) Learn how a
computer works (section 1 is subtitled "CPU,RAM,GPU,HDD" but the chapter is
empty). 7) Computer Networking - Data Transmission.

~~~
herewego
Where is your finished data engineering book? I would like to read it.

How do you think a book gets written? Obviously you don't think that someone
sits down, puts finger to keyboard, and then a book bursts into fruition. This
is a work-in-progress kindly made freely available. Is it really fair to
criticize the author for not having finished it yet?

~~~
squeaky-clean
> Where is your finished data engineering book? I would like to read it.

So I need to have written a book to be able to download a PDF and see 85/100
pages are blank? I work as a data engineer and can tell you 50% of these
chapter topics are not directly related to data engineering.

There are no chapters in this book even close to 10% finished. If you want a
book recommendation I'm seconding the suggestion in this thread of Designing
Data-Intensive Applications. I have a copy 3 feet from me at the moment.

> This is a work-in-progress kindly made freely available. Is it really fair
> to criticize the author for not having finished it yet?

Please look through the PDF. This isn't just not done. This is not ready to
share with anyone publicly. There is no useful information in this. There are
probably under 20 paragraphs of original text.

> Is it really fair to criticize the author for not having finished it yet?

No, but I'm criticizing the fact that it's posted[0]. Not that they're working
on something.

I don't see the author here in this thread so my warning is to other readers.
Just move on unless you're a book publisher looking for an author to pick up.

The only real criticism anyone could offer about this would be about the
chapter structure, because that's all that exists. I would recommend they drop
all the chapters that are a CS101 equivalent. There's no need to explain git
or the OSI model or grep.

[0] edit, I want to clarify I mean just posted and dumped. If the author were
here for questions or feedback I would feel differently. But with just this
link as-is, there is no point in sharing.

~~~
scruple
> So I need to have written a book to be able to download a PDF and see 85/100
> pages are blank? I work as a data engineer and can tell you 50% of these
> chapter topics are not directly related to data engineering.

Also data engineering focused, agreed completely.

------
wiremine
Looks like a great outline for an important topic! A few thoughts:

1. Given the title, I was expecting a more traditional Cookbook style book.
I.e., "How do I..." followed by one or more recipes to answer the question.

2. The blueprint on page 39 is a good start: it includes the 4 main processes
for data engineering. However, Display is only one use case. Training models,
for example, is another. It also ignores real-time vs. batch processing. This
comes up later in the book, but could be diagrammed more clearly. There are a
lot of recipes for the overall architecture, and for each subprocess.

~~~
herewego
This is a good example of respectful criticism. I wish others would follow
your example.

------
spsphulse
This is good.

Although when I think of a cookbook, I'm mostly interested in some re-usable
snippets of code that can be used again and again. A good example would be
Chris Albon's site [https://chrisalbon.com](https://chrisalbon.com)

For example: a recipe for splitting comma-separated values in a column into
multiple rows, or a data cleaning recipe that removes all unwanted
spaces, trailing newlines, and punctuation.

I've searched but never come across a compilation of such re-usable code
snippets anywhere. Would be glad if anyone has any resources like this.
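
As a rough illustration of the two recipes mentioned above, here is a minimal sketch in plain Python. The function names `split_column_to_rows` and `clean_text` and the sample rows are hypothetical, and treating all ASCII punctuation as "unwanted" is an assumption; a real recipe would likely be pickier.

```python
import string


def split_column_to_rows(rows, column, sep=","):
    """Yield one output row per separated value found in `column`."""
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)  # copy so the input row is left untouched
            new_row[column] = value.strip()
            yield new_row


def clean_text(value):
    """Drop ASCII punctuation, then collapse runs of whitespace and newlines."""
    without_punctuation = value.translate(str.maketrans("", "", string.punctuation))
    return " ".join(without_punctuation.split())


# Hypothetical sample data, for illustration only.
rows = [{"id": 1, "tags": "red, green,blue"}, {"id": 2, "tags": "yellow"}]
exploded = list(split_column_to_rows(rows, "tags"))
cleaned = clean_text("  Hello,   world!\n")
```

With these inputs, `exploded` holds four rows (three for id 1, one for id 2) and `cleaned` is `"Hello world"`.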

------
alienreborn
This looks like blogspam or collecting a bunch of blogspam articles and
turning it into a book.

------
nautilus12
As a data engineer, the level of self-promotion this guy is going through is
insane. Good for him. If this catches on, though, I hope I don't have to make
myself into a data engineering celebrity/thought leader in order to get jobs
in the future.

------
chrstphrhrt
Anyone using non-JVM stuff in general? Nothing against the giant ecosystem
around Hadoop/Spark etc.

I am currently using one Python process with Prefect for DAGs
([https://docs.prefect.io/guide/](https://docs.prefect.io/guide/)), custom API
queries and Elasticsearch indexing code in a batch processing style and it
seems to be going fine to ingest a 2TB dataset with the ability to generate
rich errors, retry/resume etc.
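
The retry/resume pattern described above can be sketched without any framework. This is not Prefect's API; `run_batches`, `flaky_index`, and the in-memory `done` set are hypothetical names, and a real job would presumably persist the checkpoint somewhere durable.

```python
import time


def run_batches(batches, process, done=None, max_retries=3, delay_seconds=0.0):
    """Process batches in order, skipping finished ones and retrying failures."""
    done = set() if done is None else done
    errors = {}
    for batch_id, batch in batches:
        if batch_id in done:
            continue  # resume: this batch finished in an earlier run
        for attempt in range(1, max_retries + 1):
            try:
                process(batch)
                done.add(batch_id)
                errors.pop(batch_id, None)  # clear errors from earlier attempts
                break
            except Exception as exc:
                # Keep a descriptive error so a failed batch can be inspected.
                errors[batch_id] = f"attempt {attempt}: {exc!r}"
                if attempt < max_retries:
                    time.sleep(delay_seconds)
    return done, errors


# Hypothetical stand-in for an indexing call that fails once, then recovers.
calls = {"count": 0}


def flaky_index(batch):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("temporary failure")


# Batch 0 is marked as already done, so only batch 1 is processed.
done, errors = run_batches([(0, ["a"]), (1, ["b"])], flaky_index, done={0})
```

Here batch 0 is skipped, batch 1 fails once and succeeds on the second attempt, so `done` ends up as `{0, 1}` and `errors` is empty.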

Besides maybe job listings, why would a real "plumber" who came from python
and js NEED to dive into the whole Apache ecosystem?

------
hobbescotch
Is there an equivalent job title for Data Engineer without the word
“engineer”? As a data engineer without an engineering degree I don’t like
having that word in my job title.

~~~
Avalaxy
ETL developer? Business Intelligence Consultant?

~~~
privateprofile
"ETL developer" was (and still is) the industry term for more than 10 years
before this new wave of buzzword spam came along (i.e. when "data science" was
"data mining"). Most of these roles focused on using low-code tools like SSIS
to build business-focused data transformation workflows faster than with any
code intensive approach (e.g. Hadoop ecosystem and derivatives).

Business intelligence consultant/developer was a blanket term used either for
1) people that can model and translate business requirements into data
platforms/components; 2) poorly targeted recruiting; 3) more rarely, to
describe people that work both in 'frontend' (reporting and dashboarding using
tools like PowerBI, who are "data analysts" in newspeak) and 'backend' (ETL
developer).

"Data Engineer" today is, sadly, too often synonymous with "solving solved
problems using an unnecessarily complex approach and toolkit"; and then there
are the cases where the volume or complexity of the data actually justifies
the cost of using "big data" platforms. Not to sound harsh; the article
discussed here illustrates my point using "simple" Unix tools:
[https://news.ycombinator.com/item?id=14401399](https://news.ycombinator.com/item?id=14401399)

------
ablekh
While respecting the author's good intentions to share knowledge, I think that
calling this set of (shallow) notes a "book" is an insult to all decent
comprehensive books and their authors.

------
ryantuck
What is the benefit of making something like this into a PDF instead of a
website?

I don't think I ever see a PDF link online and think, "yes, this is the ideal
format for ingesting this information."

------
lelima
Good summary of the current technologies on the market right now. Maybe adding
a few ETL tools would be great; they're still important :)

I liked the compilation of case studies, thumbs up.

------
closeparen
Why is it useful to separate "data" from software engineering in general? What
are some examples of programs that do not operate on data?

~~~
squeaky-clean
The difference is where the difficulty of your problem lies. I work with
serving and reporting ads, so most of our actual logic is simple. Glorified
ETL work and a fancy caching layer, really.

The problem is that we can have peaks of 20k reads and 20k writes per
second with strict guarantees on response time and data consistency, all
while needing to be kept consistent across multiple datacenters in
several regions.

Your typical application won't hold up in that environment or can become
extremely difficult to maintain. And I've met people at conferences who would
say my use case still isn't "big data" and is pennies compared to their data
streams. It really does become a class of algorithms and solutions on its own,
just dealing with your real-world ability to manage that much data.

~~~
closeparen
This doesn't ring true with how I've seen the terms used.

Highly scalable distributed systems are the bread and butter of backend
engineering in modern tech companies. Even specialists working on storage
layer tech in particular are not called "data engineers," just backend
engineers, although with more systems/infrastructure focus than others.

People with the title "data engineer" at my company pretty much do ETL
pipelines on Hadoop. I've done some ETL pipelines on Hadoop; seems like a tool
that should be in the portfolio of any SWE. But when you give a data engineer
a problem, they are _sure_ to suggest an ETL pipeline in Hadoop; I have even
seen them build RPC interfaces out of INSERT statements. Whereas a SWE is
looking at a broader suite of options, and probably defaulting to backend
services / OLTP databases. Hence my confusion. Maybe our use of the term is
anachronistic? Or maybe they just know a bunch of HiveQL or pipeline design
patterns that are too complex to fit into the heads of people who can also
write services?

~~~
squeaky-clean
I would say these terms are not exclusive of each other. Like Parallelogram,
square, rectangle. All 3 of your descriptions would be different types of
Backend Engineers to me, with focuses on systems or data (or both) or
something else. Sometimes the storage team is focused on generically keeping
storage working while another team actually works on the "big data"
applications, but often these teams design and plan very hand-in-hand.

> But when you give a data engineer a problem, they are sure to suggest an ETL
> pipeline in Hadoop

Maybe someone who sees Hadoop as their only hammer, but IMO a "good" data
engineer would go

"How much data? 300MB? And you need to run this how often? Once? You're sure
just once? Like really sure? Okay here is a bash script."

------
just_myles
I see these kinds of books as good points for reference. These days I like
video formats that reference the theory and show real world examples.

------
soobrosa
Tempted to say how harmful some of the things and advice in here are.

------
HNisCurated
What a weird title for a book about data science.

I'm not a fan of the marketing gravy people dump on everything. The future
will devolve to unhelpful titles like "How to do Data F__king science"

~~~
wickerman
Data engineering is not data science. Data engineers deliver the data for data
scientists, data scientists use the data in models.

