We recently moved our data warehouse to Snowflake. Along the way, we adopted an ELT paradigm instead of ETL: you load your raw data into the warehouse and then perform any transformations inside the warehouse itself. A tool called dbt (https://www.getdbt.com/) has made this a joy. You write SELECT queries in SQL files, and dbt builds a DAG from the references between them so it can create the tables/views in the right order, in parallel. dbt also comes with some nifty testing facilities to validate your data once it is transformed.
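For a rough idea of what that looks like (table and model names here are invented, not from our project), each dbt model is just a SELECT in a .sql file, and the ref()/source() calls are what dbt uses to build the DAG:

    -- models/staging/stg_orders.sql (hypothetical model)
    -- dbt materializes this SELECT as a view or table in the warehouse.
    select
        order_id,
        customer_id,
        cast(ordered_at as timestamp) as ordered_at,
        amount_cents / 100.0          as amount_usd
    from {{ source('shop', 'raw_orders') }}

    -- models/marts/orders_by_customer.sql
    -- The ref() call tells dbt this model depends on stg_orders,
    -- so it is built afterwards in the DAG.
    select
        customer_id,
        count(*)        as order_count,
        sum(amount_usd) as lifetime_value
    from {{ ref('stg_orders') }}
    group by customer_id

The tests mentioned above are declared alongside the models (e.g. not_null and unique checks in a YAML file), and dbt runs them against the built tables.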
This is a great article, thank you for writing it!
Just because you are using a database as your workspace doesn't mean you load your data before transformation.
Most complex data warehouses have multiple layers of transformation, and several workspaces along the way.
There is always a final step where you load the clean data into your clean schema, which is the one used for querying the data.
Extract = fetch the data you want
Transform = make all the necessary changes
Load = create a clean copy
A typical layering (sketched below):
1. The raw data layer, where data comes directly from business logs
2. The common data layer, where one copy of the data serves multiple purposes
3. The application layer, where the final result is presented
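A rough sketch of those layers in SQL (schema and table names are illustrative only):

    -- 1. Raw layer: land the data exactly as it arrives, no cleanup yet.
    --    ('external_stage' is a placeholder for whatever feeds the warehouse.)
    create table raw.orders_land as
    select * from external_stage.orders;

    -- 2. Common layer: one cleaned, conformed copy that serves multiple consumers.
    create table common.orders as
    select
        order_id,
        customer_id,
        cast(order_ts as timestamp) as order_ts,
        cast(amount   as numeric)   as amount,
        upper(trim(status))         as status
    from raw.orders_land
    where order_id is not null;

    -- 3. Application layer: the final, query-friendly shape for one use case.
    create view app.daily_revenue as
    select date(order_ts) as order_date, sum(amount) as revenue
    from common.orders
    group by date(order_ts);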
Stitch Fix seems to be one of those online services that send you sets of clothes that they think fit your style. That seems like a really narrow, low-data kind of industry. How much data can they possibly have? And why so big on the backend? In January 2018 they had 100 engineers. Presumably they're even larger now. Just for a service that sends out clothing.
Maybe I'm lacking in imagination or insight into what it takes to run a company like this. On the other hand, a single PostgreSQL instance can run complex ad-hoc queries, with CTEs and everything, on a single node involving millions, even billions, of rows.
They also have mobile apps, they run product experiments, they source and sell clothing and manage inventory, they build and iterate on algorithmic approaches to recommend and design clothing (many of which help stylists and never reach the screen of an external client).
You can skim through their Algorithms blog for some more detail. I find them impressive in how they scale the impact of relatively few stylists to about 3M users.
Spark is a distributed compute engine that has a lot of capabilities, is rock solid, allows you to blend SQL with Python/R/Scala, and can support ML use cases as your needs grow. You can easily store all of your data in PostgreSQL and run Spark on top.
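To make that concrete (connection details and table names below are placeholders), Spark's JDBC data source can expose a PostgreSQL table as a view, and from there everything is plain Spark SQL running on the distributed engine:

    -- Register a PostgreSQL table through Spark's JDBC data source
    -- (connection string, table, and credentials are placeholders).
    CREATE TEMPORARY VIEW orders
    USING org.apache.spark.sql.jdbc
    OPTIONS (
      url 'jdbc:postgresql://dbhost:5432/shop',
      dbtable 'public.orders',
      user 'report_user',
      password '...'
    );

    -- From here it is ordinary SQL, executed by Spark's distributed engine.
    SELECT customer_id, count(*) AS order_count, sum(amount) AS total_spend
    FROM orders
    GROUP BY customer_id;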
I found the easiest way was just to dump tables of data into long varchars - even the supposedly numeric stuff, because often enough it wasn't - then scrub & filter it from there. (NB, if you take anything from this post, make it that.)
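In SQL terms the pattern looks roughly like this (a Postgres-flavoured sketch with invented names, not the original system):

    -- Stage everything as text first; the 'numeric' source columns often aren't.
    create table staging.customers_raw (
        customer_id   varchar(4000),
        signup_date   varchar(4000),
        credit_limit  varchar(4000)
    );

    -- Scrub & filter from there: keep only rows that actually parse as what
    -- they claim to be, and leave the rest behind for a human to inspect.
    create table clean.customers as
    select
        cast(customer_id  as bigint)  as customer_id,
        cast(signup_date  as date)    as signup_date,
        cast(credit_limit as numeric) as credit_limit
    from staging.customers_raw
    where customer_id  ~ '^[0-9]+$'
      and signup_date  ~ '^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
      and credit_limit ~ '^-?[0-9]+(\.[0-9]+)?$';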
There were multiple challenges, but the most painful was the sheer filth mixed into the data. Too often even a human couldn't work out what the heck was supposed to be in some field. Often we could, but it ate a lot of time.
It doesn't sound like your ETLs are anything like mine; you start with pretty clean data. You are lucky.
I wrote 'tables' because it wasn't a relational DB although the successor system was.
The amount of data was pretty minimal too, not much more than a dozen GB, but it ate up at least 4 man-years to convert. I still cringe thinking about it. Some parts were an interesting challenge, but mostly it was just a soul-destroying grind.
Of course, the value to the business of much of this data is highly questionable, but that isn't always an argument you win.
No, the primary benefit of using SQL is not SQL itself but the “free features” you get with the RDBMS you’re using: enforcement of primary keys, automatic data type conversions, free transactions and rollbacks... and the sheer gift that is set-based operations on large chunks of data. It is just plain stupid to want to code the transformations or insertion of millions of rows iteratively rather than to rely on the implicit rollback power of an INSERT statement.
Rule #1 in my projects: avoid getting sexy in code; use the RDBMS for all transformations post staging. Always stage to the RDBMS “as is” (data warts and all), then clean up with SQL and stored procedures. This way you can SQL-query the whole original format and figure out the best cleanup and transformations needed.
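Concretely, the pattern is something like this (Postgres-style SQL, illustrative names only): stage the data warts and all, then do the cleanup as one set-based, transactional statement rather than a row-by-row loop:

    -- Stage "as is": every column is text, nothing is rejected yet.
    create table staging.orders_raw (
        order_id    text,
        customer_id text,
        amount      text,
        loaded_at   timestamp default now()
    );

    -- One set-based transformation. If any row violates the clean table's
    -- primary key or column types, the whole statement rolls back and
    -- dw.orders is left untouched; no hand-written rollback logic needed.
    begin;

    insert into dw.orders (order_id, customer_id, amount)
    select
        cast(order_id    as bigint),
        cast(customer_id as bigint),
        cast(amount      as numeric(12,2))
    from staging.orders_raw
    where order_id is not null;

    commit;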