
Maintainable ETL Practices at Stitchfix - rahimiali
https://multithreaded.stitchfix.com/blog/2019/05/21/maintainable-etls/
======
jdreaver
Huge +1 to leveraging SQL as much as possible during ETL.

We recently moved our data warehouse to Snowflake. Along the way, we adopted
an ELT paradigm instead of ETL, where you load your raw data into the
warehouse and then perform any transformations in the warehouse. A tool called
dbt ([https://www.getdbt.com/](https://www.getdbt.com/)) has made this a joy.
You write SQL select queries in SQL files, and dbt creates a DAG based on this
SQL so it can efficiently create tables/views in parallel. dbt also comes with
some nifty testing facilities to validate your data once it is transformed.
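
For the curious, a dbt "model" is literally just a SELECT in a .sql file; the model and column names below are made up, but the ref() calls are how dbt discovers the dependency graph:

    -- models/orders_enriched.sql  (hypothetical model/column names)
    -- dbt reads the ref() calls to build its DAG, then creates the
    -- tables/views in dependency order
    select
        o.order_id,
        cast(o.ordered_at as date) as order_date,
        c.region
    from {{ ref('stg_orders') }}    as o
    join {{ ref('stg_customers') }} as c
      on o.customer_id = c.customer_id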

This is a great article, thank you for writing it!

~~~
systems
I am not a big fan of the term ELT.

Just because you are using a database as your workspace doesn't mean you load
your data before transformation.

Most complex DWs have multiple layers of transformation, and several
workspaces along the way.

There is always a final step where you load the clean data into your clean
schema, which is used for querying.

ETL

    
    
        E = Fetch the data you want
        T = Make all necessary changes
        L = Create a clean copy

~~~
tomnipotent
ETL can include the destruction of raw data between the T & L. ELT assumes it
is retained. This is a legacy from days when storage was expensive and it
often wasn't tenable to retain everything.

~~~
unityByFreedom
We held onto intermediate data when I worked in ETL 15 years ago. It's normal
to save intermediate results for unit testing and lookup tables. I agree with
GP that this doesn't need a new term.

------
atombender
In discussions like this, I often wonder how there can be enough data to
require a "big data" pipeline with things like Spark and Presto.

Stitch Fix seems to be one of those online services that send you sets of
clothes that they think fit your style. That seems like a really narrow, low-
data kind of industry. How much data can they possibly have? And why so big on
the backend? In January 2018 they had 100 engineers. Presumably they're even
larger now. Just for a service that sends out clothing.

Maybe I'm lacking in imagination or insight into what it takes to run a company
like this. On the other hand, a single PostgreSQL instance can run complex ad-
hoc queries, CTEs and all, over tables with millions, even billions, of rows.

~~~
threeseed
Spark is not a competitor to PostgreSQL.

It is a distributed compute engine that has a lot of capabilities, is rock
solid, allows you to blend SQL with Python/R/Scala and can support ML use
cases as your needs grow. You can easily store all of your data in PostgreSQL
and run Spark on top.
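
As a rough sketch of "Spark on top of PostgreSQL", using Spark SQL's built-in JDBC source (connection details and table names below are placeholders):

    -- register a PostgreSQL table as a Spark SQL view (placeholder details)
    CREATE TEMPORARY VIEW orders
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url "jdbc:postgresql://dbhost:5432/shop",
      dbtable "public.orders",
      user 'etl',
      password 'secret'
    );

    -- from here the same data is queryable from SQL, Python, R or Scala
    SELECT customer_id, count(*) AS n_orders
    FROM orders
    GROUP BY customer_id;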

------
tempguy9999
Too large a chunk of my life was burnt on the ETL from hell. Other than SQL, I
can't see any other tool that would have helped.

I found the easiest way was just to dump tables of data into long varchars -
even the supposedly numeric stuff, because often enough it wasn't - then scrub
& filter it from there. (NB: if you take anything from this post, make it that.)
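
For concreteness, a PostgreSQL-flavoured sketch of that approach (table and column names invented):

    -- stage everything as text, warts and all
    CREATE TABLE stg_orders_raw (
        order_id  varchar(4000),
        amount    varchar(4000)    -- "numeric", except when it isn't
    );

    -- scrub and filter afterwards: only rows that actually look numeric
    -- get promoted into the clean table (assumed to exist already)
    INSERT INTO orders_clean (order_id, amount)
    SELECT order_id,
           CAST(amount AS numeric(12,2))
    FROM   stg_orders_raw
    WHERE  amount ~ '^[0-9]+(\.[0-9]+)?$';   -- regex syntax varies by RDBMS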

There were multiple challenges, but the most painful was the sheer filth mixed
into the data. Too often even a human couldn't work out what the heck was
supposed to be in some field. Often we could, but it ate a lot of time.

It doesn't sound like your ETLs are anything like mine; you start with pretty
clean data. You are lucky.

~~~
achompas
Yikes, I’m sorry. What were your application developers doing to persist this
messy data?

~~~
tempguy9999
It was an upgrade from a legacy system to its successor. I never used the
legacy system so I don't know what checks they did. I suspect little to none;
I guess they pretty much exposed the raw 'tables' to the user to insert
whatever they wanted.

I wrote 'tables' because it wasn't a relational DB although the successor
system was.

The amount of data was pretty minimal too, not much more than a dozen GB, but
it ate up at least 4 man-years to convert. I still cringe thinking about it.
Some parts were an interesting challenge, but mostly it was just a
soul-destroying grind.

------
ak39
“The primary benefit of SQL is that it’s understood by all data professionals:
data scientists, data engineers, analytics engineers, data analysts, DB
admins, and many business analysts.”

No, the primary benefit of using SQL is not SQL itself but the “free features”
you get with the RDBMS you’re using: enforcement of primary keys, automatic
data type conversions, free transactions and rollbacks, and the sheer gift
that is set-based operations on large chunks of data. It is just plain stupid
to code the transformation or insertion of millions of rows iteratively rather
than rely on the implicit rollback power of a single INSERT statement.

Rule #1 in my projects: avoid getting sexy in code; use the RDBMS for all
transformations post staging. Always stage to the RDBMS “as is” (data warts and
all). Clean up with SQL and stored procedures. This way you can query the
whole original data set with SQL and figure out the best cleanup and
transformations needed.
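
A toy example of what "stage as is, clean up in the RDBMS" looks like (schema and column names invented):

    -- one set-based, atomic statement instead of row-by-row code:
    -- if anything fails, the whole INSERT rolls back and nothing is half-loaded
    INSERT INTO dw.customers (customer_id, email, signup_date)
    SELECT CAST(customer_id AS integer),
           LOWER(TRIM(email)),
           CAST(signup_date AS date)
    FROM   staging.customers_raw
    WHERE  customer_id IS NOT NULL;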

------
iblaine
One comment: in some databases CTEs are materialized as an optimization fence,
so queries against the CTE's result can't benefit from indexes on the
underlying tables (PostgreSQL behaved this way before version 12). That's
something to consider if you may need to port code between databases whose
optimizers treat CTEs differently.
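
One common workaround, sketched for PostgreSQL with made-up table names: materialize the intermediate result explicitly so it can be indexed.

    -- instead of wrapping this in a CTE, materialize it and index it
    -- (PostgreSQL 12+ also lets you write WITH ... AS NOT MATERIALIZED)
    CREATE TEMPORARY TABLE recent_events AS
    SELECT * FROM events WHERE created_at > now() - interval '7 days';

    CREATE INDEX ON recent_events (user_id);

    -- lookups against the materialized result can now use the index
    SELECT count(*) FROM recent_events WHERE user_id = 42;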

