
Huge +1 to leveraging SQL as much as possible during ETL.

We recently moved our data warehouse to Snowflake. Along the way, we adopted an ELT paradigm instead of ETL: you load your raw data into the warehouse and then perform all transformations there. A tool called dbt (https://www.getdbt.com/) has made this a joy. You write your transformations as SQL select statements, dbt infers a DAG from them, and it can then build the resulting tables/views efficiently in parallel. dbt also comes with some nifty testing facilities to validate your data once it is transformed.
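
To give a concrete flavor, a dbt model is just a select statement in a file. The model and column names below are made up, but ref() is the actual mechanism dbt uses to infer the DAG:

    -- models/orders_enriched.sql (hypothetical model name)
    -- dbt materializes this select as a table/view; each ref() declares a
    -- dependency edge that dbt uses to order and parallelize builds
    select
        o.order_id,
        o.ordered_at,
        c.customer_name
    from {{ ref('stg_orders') }} as o
    join {{ ref('stg_customers') }} as c
        on o.customer_id = c.customer_id

Tests (e.g. unique/not_null on order_id) are declared in a YAML file alongside the model and run after it builds.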

This is a great article, thank you for writing it!




I am not a big fan of the term ELT.

Just because you are using a database as your workspace doesn't mean you load your data before transforming it.

Most complex data warehouses have multiple layers of transformation and several workspaces along the way.

There is always a final step where you load the clean data into your clean schema, the one used for querying.

ETL (a rough SQL sketch follows):

    E = Extract: fetch the data you want
    T = Transform: make all the necessary changes
    L = Load: create a clean copy
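
To make those steps concrete, here is a sketch in plain SQL. All schema and table names are invented, and in practice the E step usually happens outside the database via some loader:

    -- E: land the raw data as-is (assumes a loader has already fetched
    -- it into a staging area; names are hypothetical)
    create table staging.orders_raw as
        select * from external_stage.orders;

    -- T: make all the necessary changes in a workspace schema
    create table work.orders_typed as
        select
            cast(order_id as integer) as order_id,
            cast(ordered_at as date)  as ordered_at,
            upper(trim(status))       as status
        from staging.orders_raw
        where order_id is not null;

    -- L: load the clean copy into the clean schema used for querying
    create table clean.orders as
        select * from work.orders_typed;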


ETL can include the destruction of raw data between the T & L. ELT assumes it is retained. This is a legacy from days when storage was expensive and it often wasn't tenable to retain everything.


We held onto intermediate data when I worked in ETL 15 years ago. It's normal to save intermediate results for unit testing and lookup tables. I agree with GP that this doesn't need a new term.
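
With intermediate tables retained, a unit-test-style check is just a query. For example (hypothetical names), validating a transformed table against a lookup table before the final load:

    -- flag rows whose currency code is missing from the lookup table
    select count(*) as unknown_codes
    from work.transactions_std t
    left join lookup.currency_codes c
        on t.currency = c.code
    where c.code is null;
    -- expect 0 before promoting to the clean schema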


Multiple layers. You have (sketched in SQL below):

1. The raw data layer, where data comes directly from business logs

2. The common data layer, where one copy of the data serves multiple purposes

3. The application layer, where the final result is presented
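
A rough SQL sketch of those three layers (all names illustrative):

    -- 1. Raw layer: data lands as-is from the business logs
    create table raw.events as
        select * from source.app_event_log;

    -- 2. Common layer: one cleaned, shared copy serving multiple consumers
    create table common.events as
        select
            cast(event_id as bigint)      as event_id,
            cast(event_time as timestamp) as event_time,
            lower(event_type)             as event_type
        from raw.events
        where event_id is not null;

    -- 3. Application layer: final shape for one specific use case
    create view app.daily_event_counts as
        select
            cast(event_time as date) as event_date,
            event_type,
            count(*) as events
        from common.events
        group by 1, 2;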


dbt is fantastic; I very strongly recommend it.


dbt has been a game changer for us. Such an awesome tool.



