Indeed. However, Dataform only does the T (transformation) in ELT. Most businesses are centralising an exploding diversity of raw data in the data warehouse, and Dataform helps them manage that data once it's there.
One other main difference from a product like Composable is that Dataform is built with software engineering best practices in mind. Transformations are written in code (SQL) rather than in a GUI. This enables version control and reusable components, for example, and makes it easier to manage a large number of interdependent transformations.
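To make that concrete, here's roughly what a single transformation looks like in Dataform's SQLX format (the table and column names are invented for illustration):

```sql
-- definitions/customer_orders.sqlx
config {
  type: "table",
  description: "One row per customer with lifetime order stats"
}

select
  customer_id,
  count(*) as order_count,
  sum(amount) as total_spent
from ${ref("orders")}  -- ref() declares the dependency on another dataset
group by customer_id
```

ref() is what lets Dataform build the dependency graph between your transformations automatically.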
We see that PDTs don't scale well with the amount and complexity of data you usually have to deal with in the warehouse.
On top of that, the results of PDTs are only accessible to Looker. As your data team matures, many other users and applications will need to take advantage of your transformed data in the warehouse (e.g. modelling in Python).
Dataform also brings other features not available in PDTs. For example:
- Data validation tests (assertions) to ensure data quality in all your data applications (including Looker)
- Incremental tables: you can define tables that update incrementally instead of being rebuilt from scratch, which improves performance and saves costs
- Reusable code: you can define code snippets once and reuse them across your entire project (a short sketch of these features follows below)
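A rough sketch of how those three features combine in one SQLX file (dataset and function names are made up; common.cleanUrl stands in for a snippet you'd define under includes/):

```sql
-- definitions/daily_events.sqlx
config {
  type: "incremental",       -- new rows are appended, the table isn't rebuilt
  assertions: {
    nonNull: ["event_id"],   -- assertion: fail the run if event_id is null
    uniqueKey: ["event_id"]  -- assertion: fail the run on duplicate ids
  }
}

select
  event_id,
  event_timestamp,
  ${common.cleanUrl("page_url")} as page_url  -- reusable snippet from includes/common.js
from ${ref("raw_events")}
-- on incremental runs, only pick up rows newer than what's already in the table
${when(incremental(), `where event_timestamp > (select max(event_timestamp) from ${self()})`)}
```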
I very much agree with this response. The term I've heard to summarize all of the above is "semantic layer". That layer runs in the warehouse, vs. some external dashboard tool.
1. I think NoETL products are complementary solutions. NoETL can be a great option for ad hoc queries and one-off analyses.
But for your company's analytics, you want to build a suite of well-defined, reliable, and documented datasets that power your dashboards and your regular analytics.
Another way to say it is that with NoETL, you write one query to answer one complex question. With ETL (actually ELT), you define core datasets with which you can answer many questions.
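To illustrate with a hypothetical core dataset (say, orders_enriched, built once in the warehouse): many different questions become trivial queries against it, rather than each one being its own complex one-off:

```sql
-- Two different questions, both one-liners against the same core dataset.

-- Revenue by country:
select country, sum(amount) as revenue
from analytics.orders_enriched
group by country;

-- Orders per customer:
select customer_id, count(*) as orders
from analytics.orders_enriched
group by customer_id;
```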
2. In terms of adoption, products like Dataform are helpful from the day you start having data in your data warehouse and a full-time analyst. It's basically giving superpowers to your analysts.
NoETL targets the semi-structured and schema-less data models prevalent in NoSQL/HTAP data stores, so if "well defined" refers to well-defined schemas, it maybe doesn't apply here.
As far as reliability (aka reproducibility) and documentation are concerned, I guess it's more about building tools that promote sound engineering processes like version control, modules (aka reusable snippets), and documentation (could be Jupyter notebooks, etc.), and these tools could be built for NoETL systems too.
Great question. Beyond being on the cloud and everything that goes with that, I'd say there are two fundamental differences from a product like SSIS.
1. Dataform focuses on transformations happening _inside_ the cloud data warehouse. We don't move data between systems. More and more businesses are moving from traditional ETL processes to ELT and centralising their raw data in the warehouse. Dataform helps businesses manage the T in ELT.
2. Dataform is built with software engineering best practices in mind. In Dataform, all your transformations are written in code (mostly SQL) instead of in a GUI like SSIS. The code can be version controlled and reviewed, and it scales better to the large number of transformations teams have to deal with.
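As a small illustration of point 2 (names invented): each transformation is just a text file, so the whole pipeline can be diffed, reviewed, and version controlled like any other code.

```sql
-- definitions/staging/clean_sales.sqlx
config { type: "view" }

select
  cast(sale_ts as timestamp) as sale_ts,
  upper(region) as region,
  amount
from ${ref("raw_sales")}
-- In SSIS this step would be boxes and arrows in a designer; here a
-- one-line change shows up as a one-line diff in a pull request.
```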
On-prem: Right now our IDE is only available as SaaS, although we will be looking at this in the near future. You can develop and test projects with the CLI and deploy them yourself, but there are no tutorials for setting this up beyond the basics yet: https://docs.dataform.co/guides/command-line-interface/
Warehouse support: Athena/Presto and Azure are top of mind. I've not come across ClickHouse before but I'll definitely add it to our tracker!
Our framework (and the CLI) is pretty similar, to be honest, except it's in JS. We also spent quite a bit of time focusing on performance.
The main difference is that we provide an all-in-one web platform for teams to develop, test, and schedule their SQL pipelines. From a single web interface, in five minutes you can write a query, test it, add it to a schedule, push the changes to GitHub (or submit a PR), and monitor your jobs.
We make it easy to enforce software engineering best practices like version control and isolated deployments. For example, our version control works similarly to Looker's: analysts can work simultaneously on different branches without having to switch tools or use the command line.
1. Speed. We wanted compilation to be really, really fast.
2. In all honesty, we just weren't big fans of Jinja, having used it for quite a while. JS templating is OK out of the box, and we are considering React-like syntax in the future.
3. Love it or hate it, NPM packages are pretty easy to work with and we are working on a number of packages at the moment.
4. When you start to look at things like UDFs and cloud functions, which enable some really cool use cases, JS seems to be prevailing (in Snowflake and BigQuery at least).
I will admit though that we do usually get a bit of a shocked reply when people hear it's not Python!
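To illustrate point 4, here's what a JS UDF looks like in BigQuery (the function and table names are hypothetical): plain JavaScript running inside the warehouse.

```sql
-- Hypothetical example: extract a hostname from a URL with a JS UDF.
CREATE TEMP FUNCTION extract_domain(url STRING)
RETURNS STRING
LANGUAGE js AS """
  // Plain JavaScript, no Jinja or external runtime needed.
  if (!url) return null;
  var rest = url.replace('https://', '').replace('http://', '');
  return rest.split('/')[0];
""";

SELECT extract_domain(page_url) AS domain, COUNT(*) AS hits
FROM my_dataset.events
GROUP BY domain
ORDER BY hits DESC;
```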
I see your point, but I think data analysts are usually familiar with Python, not JS. The UI is cool and intuitive; it looks like you copied most of the concepts, except Jinja, from dbt. That's perfectly OK though, let's see where it goes!