
We need new data books, so we started one: Cloud Data Management - thingsilearned
https://chartio.com/blog/cloud-data-management-book-launch/
======
jameslk
One of the experiences that stuck with me most going from software
engineering to data engineering is the sparse, and sometimes completely
absent, tooling for testing and debugging SQL and pipelines. As software
engineers, we have debuggers, unit testing, mocking frameworks, e2e tools,
BDD languages like Cucumber, code quality tools, etc. But when you're
working on a pipeline and wondering what will happen if you run it,
sometimes your best bet is to just run it and wait 20 minutes for it to
complete. Or run some portion of it ad hoc. Or, if you want to know what a
SQL query will do inside the black box of your query engine, you might try
to parse the esoteric language of a query explainer. The best tools seem to
be available only after deployment, such as data quality tests, dashboards,
and alerts. I think there's a lot of opportunity to improve the ecosystem.
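
To make the gap concrete, here's a toy sketch of the kind of test I'd like
to write more often: a SQL transformation checked against an in-memory
SQLite database instead of the real pipeline (the table and query here are
made up for illustration):

    # A toy unit test for a SQL transformation, run against an in-memory
    # SQLite database; the table and query are illustrative only.
    import sqlite3
    import unittest

    DAILY_REVENUE_SQL = """
        SELECT order_date, SUM(amount) AS revenue
        FROM orders
        GROUP BY order_date
        ORDER BY order_date
    """

    class TestDailyRevenue(unittest.TestCase):
        def test_sums_amounts_per_day(self):
            conn = sqlite3.connect(":memory:")
            conn.execute("CREATE TABLE orders (order_date TEXT, amount REAL)")
            conn.executemany(
                "INSERT INTO orders VALUES (?, ?)",
                [("2019-10-31", 10.0), ("2019-10-31", 5.0), ("2019-11-01", 7.0)],
            )
            rows = conn.execute(DAILY_REVENUE_SQL).fetchall()
            self.assertEqual(rows, [("2019-10-31", 15.0), ("2019-11-01", 7.0)])

    if __name__ == "__main__":
        unittest.main()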

~~~
faizshah
I have started using prefect.io as a workflow runner. It lets me define
tasks and specify in detail what should happen when a task fails, including
cleaning up a cluster.

Because tasks are just functions, you can test your tasks with regular unit
testing and mocking methods.

But I agree that a lot of data people seem to think interactive is good enough.
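
For example, a minimal sketch of a flow, assuming the 2019-era Prefect Core
API (the task names and bodies are made up):

    # A toy Prefect Core flow with retry behavior on task failure;
    # the tasks themselves are placeholders for illustration.
    from datetime import timedelta

    from prefect import Flow, task

    @task(max_retries=2, retry_delay=timedelta(seconds=10))
    def extract():
        # Pretend this pulls rows from a flaky source system.
        return [1, 2, 3]

    @task
    def load(rows):
        print(f"loading {len(rows)} rows")

    with Flow("example-etl") as flow:
        load(extract())

    if __name__ == "__main__":
        flow.run()  # executes the DAG locally

Because extract and load are plain functions, you can import and unit test
them directly, without running the flow.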

~~~
edraferi
How do you think Prefect compares to Airflow / Dagster?

~~~
faizshah
What makes Prefect stand out is the highly Pythonic API and the powerful
primitives for defining failure scenarios. Additionally, for me, integration
with dask is key, as dask has become one of my main tools.

Here's a rewrite of the Airflow tutorial in Prefect:
[https://docs.prefect.io/core/examples/airflow_tutorial_dag.h...](https://docs.prefect.io/core/examples/airflow_tutorial_dag.html)

My advice is to try it out and see if you like the API. I think the Airflow
UI is still a killer feature in 2019, though.

There's also this post from the creator of Prefect on why not Airflow:
[https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4](https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4)

I haven't tried Dagster yet; how does it compare to Airflow?

------
DataDaoDe
I completely agree with the sentiment here. As an engineer tasked with
building data-driven systems and architectures, I'm well aware of the amount
of buzzword / enterprise nonsense floating around in this space - and it is
enormous. Couple that with the fact that it's really difficult to find any
practical books and resources you can actually apply to solve real-world
problems when working in small teams and on quick deadlines. Getting the
business to act in a data-driven, analytical way, and building the
architecture for it, is a non-trivial task. This looks like it could be one
of those rare good resources in the space.

Great work by the guys from chartio!

~~~
thingsilearned
Thanks! We're truly looking for this to be community driven as well. So if you
see places to contribute, or where you might disagree, or where you could
share a story - do let us know or make a pull request on GitHub!

Besides the need for a new data book, we realized it needed to be a
different format, since the space is moving so fast and the expertise is
very distributed.

------
scruple
I'm in the data space today. It is indeed a very confusing space to be in.
This is a fantastic set of resources you've created here. Well done!

------
timwis
Has anyone found any similar guidance around master data management?
Matching/deduplicating, and feeding back into source systems.

~~~
edraferi
Senzing and Tamr are two options for this

~~~
timwis
Are those products? I was hoping for books/guidance

------
wenc
I worry that this promotes the data architecture philosophy that is
currently in vogue at tech companies but is actually a bad fit for many
traditional enterprises. Most of the time, the data architecture you need
depends on the use case.

Tech/web companies deal with massive amounts of unstructured/semi-structured
data ingested at a fast clip, so the architectural thinking here works.

However, I would argue that for many traditional enterprises whose major
data sources are already highly structured (SQL databases), a lake is
actually not needed.

A young data engineer working in a traditional enterprise, enamored with the
idea of data lakes, say, might try to ETL SQL databases into an object store
(adding RBAC, etc.) only to rebuild it all back out into a data warehouse.
This will almost always turn out to be the wrong approach.

The simpler and more manageable approach is actually to federate the
existing databases, add cataloging, etc., and not use a data lake at all.

~~~
kfk
I think by lake they mean S3 and similar alternatives (like HDFS for
Hadoop). This isn't crazy, since processing S3 files is quite easy. What you
say is true, but the problem is that no analyst deals only with one db, they
have to deal with an increasing number of data sources, and that's what the
lake is for. I also don't believe you need a warehouse after the lake, and
'one source of truth' is a pipe dream, so I'm with you that this
architecture is not 100% what companies probably need.

~~~
wenc
> no analyst deals only with one db, they have to deal with an increasing
> number of data sources

Which is why I mentioned federation. Most data sources in an enterprise are
SQL-native or SQL-accessible, and it does not make sense to dump a SQL
database into S3 just to be able to combine it with other forms of data.

Federation means you can operate across multiple databases (e.g. do
cross-database joins, etc.).
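
As a toy illustration of the idea (using SQLite ATTACH as a stand-in for a
real federation layer like Postgres foreign data wrappers or Presto/Trino;
the tables are made up), you can join across two databases in place instead
of copying either into a lake:

    # Toy federation sketch: two separate "source" databases, joined
    # in place rather than dumped into an object store first.
    import sqlite3

    # First "source system": orders.
    orders_db = sqlite3.connect("orders.db")
    orders_db.execute("CREATE TABLE IF NOT EXISTS orders (customer_id INT, amount REAL)")
    orders_db.execute("DELETE FROM orders")
    orders_db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 5.0)])
    orders_db.commit()
    orders_db.close()

    # Second "source system": a CRM with customer master data.
    crm_db = sqlite3.connect("crm.db")
    crm_db.execute("CREATE TABLE IF NOT EXISTS customers (id INT, name TEXT)")
    crm_db.execute("DELETE FROM customers")
    crm_db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
    crm_db.commit()
    crm_db.close()

    # "Federate" the two and do a cross-database join.
    conn = sqlite3.connect("orders.db")
    conn.execute("ATTACH DATABASE 'crm.db' AS crm")
    rows = conn.execute("""
        SELECT c.name, SUM(o.amount)
        FROM orders o JOIN crm.customers c ON c.id = o.customer_id
        GROUP BY c.name ORDER BY c.name
    """).fetchall()
    print(rows)  # [('Acme', 10.0), ('Globex', 5.0)]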

------
neilobremski
This is just lovely. I'm the lead of a company's data warehouse project,
having inherited it from someone who was a DBA who read a book and then
built the whole thing ...

I tend to joke that I was a digital janitor and now I'm a big data janitor.
Well, things are going well enough, but cleanup on a live system with users
and zillions of reports is incredibly difficult. Part of the issue is
isolation and size: too many people have too much access to too much data.
It's overwhelming, and it also leads to a lot of high cost because queries
aren't understood and they're run against massive datasets.

I've been educating myself as well as I can between fixing errors, and this
book is just the thing I need to calm my nerves. It totally makes sense how
the stages are laid out, and even just this blog post overview has given me
ideas for how to carve some things up.

Kudos, cheers, and all that. Happy Halloween!

------
brucej
Great to see more information about data out there for people to learn from.
BTW, my colleagues and I created a declarative (SQL) open source (MIT)
framework for Apache Spark to make ETL and ML super easy; if you want to
check it out you can read more here: [https://arc.tripl.ai](https://arc.tripl.ai).
We've recently started combining this with Argo and Delta Lake, which is
working well for us in the source, lake to warehouse stages.

------
ryantuck
I've recently dug into the Agile DWH Design and The DWH Toolkit books and the
design tips in them all seem really compelling, for good reason. Though as
I've actually started modeling, I've found that the creation of "proper"
fact/dimension tables has felt at times like overkill, given the technologies
we're using (BQ / Postgres / Looker).

So, perfect timing! Really looking forward to checking this out.

~~~
thingsilearned
Yeah, exactly. Much of the "overkill" was done for performance and cost
reasons that frankly just don't apply anymore. Now the largest expense by
far is time.

A number of people are starting to talk about star schemas having little
gain on modern stacks. The performance and cost gains are handled
automatically now by C-Store (columnar) warehouses. Fivetran has a great
post on this:
[https://fivetran.com/blog/obt-star-schema](https://fivetran.com/blog/obt-star-schema)

~~~
supercanuck
The reason a star schema exists is that it reduces the amount of data stored
and it reduces load times (faster updates), because only the facts and
measures get updated. There is also little redundancy in master data
(dimensions).

In the world of fixed assets (on-premise data warehouses) this was a real
concern. I'm curious to see how this plays out with the cloud providers,
because to me it seems like they will be more than happy to rent you as much
space as you want, and everyone is happy until the bill comes due.

Doing materialized, flat tables everywhere is great for reporting
performance, but the tables will not be updated as quickly, there will be
redundancy in storage, and it will be difficult to keep time-dependent
dimensions in sync.
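
As a toy sketch of that trade-off (the tables and columns are made up),
compare a fact table plus a product dimension with the same data flattened
into one big table:

    # Star schema vs. one-big-table, as a toy pandas example.
    import pandas as pd

    # Star schema: the fact table carries only keys and measures...
    sales_fact = pd.DataFrame({
        "product_id": [1, 1, 2],
        "units": [3, 5, 2],
    })
    # ...and descriptive attributes live once per product in a dimension.
    product_dim = pd.DataFrame({
        "product_id": [1, 2],
        "product_name": ["Widget", "Gadget"],
        "category": ["Hardware", "Hardware"],
    })

    # Flat / one-big-table form: dimension attributes repeated on every
    # fact row (simpler to query, more redundant to store and update).
    flat = sales_fact.merge(product_dim, on="product_id")
    print(flat)

Renaming "Widget" touches one dimension row in the star schema but every
matching row in the flat table, which is the update and redundancy cost
being described.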

~~~
thingsilearned
This is what I mean about the changes brought by C-Store warehouse engines
(Redshift, BigQuery, Snowflake, etc.). It's not just that the cloud
providers are happy to rent you more space; it's also that they're column
stores, and because of that, redundancy in columns is compressed well
automatically.

We try to explain some of that here:
[https://dataschool.com/data-modeling-101/row-vs-column-oriented-databases/](https://dataschool.com/data-modeling-101/row-vs-column-oriented-databases/)

And Fivetran did a great benchmarking of it here:
[https://fivetran.com/blog/obt-star-schema](https://fivetran.com/blog/obt-star-schema)

The architecture of C-Store warehouses often removes the benefits of
materialized views. This is why, for a very long time, Redshift didn't even
support them: they insisted they weren't needed, since they didn't
significantly improve performance over regular Redshift.

------
rilt
this is awesome.

would have loved to have this book when we were building this at my previous
co.

very approachable in how it explains each segment on the whole and zooms in on
them individually.

would have saved our team a lot of time, as we came to very similar
conclusions, but over the course of a few months.

------
the_watcher
Cool! The framework makes a lot of sense, and articulates what I've observed
very well.

------
iblaine
I see Panoply is part of this effort. Panoply creates a lot of spam on
Hacker News, Reddit, Twitter, and elsewhere.

As an example, doing ETL with drag-and-drop tools, and not in code, is a
dying skill in the industry.

[edit] Basically I'm wary of companies commenting on data standards when
those same companies also sell a product in the data industry. You're
probably going to find more honest guidelines from open source contributors
like LinkedIn, Netflix, Stitchfix, etc.

~~~
camel_Snake
I'm surprised to see dbt make that list of yours - I don't think I've ever
seen them spam the online communities I visit.

------
acak
On an unrelated note, would anyone be able to tell me which blog engine (and
theme) their website is running on?

~~~
thingsilearned
We use Jekyll. It's a custom design from our own awesome Steven Lewis.

~~~
acak
Rather well designed. Thanks!

------
sarcasmatwork
Yay, free ebook. Thanks!

------
evandev
Is this book available in print at all?

~~~
thingsilearned
Not yet, just PDF. But we'll be working on getting it into print eventually.

