Launch HN: Dataform (YC W18) – Build Reliable SQL Data Pipelines as a Team
110 points by G2H on July 15, 2019 | hide | past | favorite | 36 comments
Hi HN!

We’re Guillaume and Lewis, founders of Dataform, and we're excited (and nervous) to be posting this on HN.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery, Amazon Redshift or Snowflake. With our open source framework and our web app, analysts can develop and schedule reliable pipelines that turn raw data into the datasets they need for analytics.

Before starting Dataform, we managed engineering teams in AdSense and led product analytics for publisher ads. We heavily relied on data (and data pipelines!) to generate insights, drive better decisions and build better products. Companies like Google invest a lot to build internal data tools for analysts to manage data and build data pipelines. In 5 minutes I could define a new dataset in SQL that would be updated every day and then use it in my reports.

Most businesses today are centralising their raw data into cloud data warehouses but lack the tools to manage it efficiently. Pipelines run manually or via custom scripts that break often. Or the company decides to invest engineering resources to set up, maintain and debug a framework like Airflow. But that’s just for scheduling and the technical bar is often too high for analysts to contribute.

We saw a need for a self-service solution for data teams to manage data efficiently, so that analysts can own the entire workflow from raw data to analytics. We built Dataform with two core principles in mind:

1. Bring engineering best practices to data management. In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts. You can write tests against your raw and transformed data to ensure data quality across your analytics. Lastly, our development environment also facilitates the adoption of best practices: analysts can develop with version control, code review and sandboxed environments.

2. Let data teams focus on data, not infrastructure. We want to bring a better, faster and cheaper alternative to what businesses have to build and maintain in-house today. Our web app comes with a collaborative SQL editor, where teams develop and push their changes to GitHub. You can then orchestrate your data pipelines without having to maintain any infrastructure.
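As a rough sketch of what principle 1 means in practice, here is a toy JavaScript dependency resolver in the spirit of a ref()-style mechanism. This is illustrative only, not our actual API; the dataset names and SQL are made up:

```javascript
// Toy sketch (illustrative, not Dataform's actual API): ordering SQL
// actions by their declared dependencies, the way referencing one
// dataset from another lets a framework build a dependency graph.
const actions = {
  raw_events: { deps: [], sql: "select * from source.events" },
  daily_sessions: {
    deps: ["raw_events"],
    sql: "select user_id, date from raw_events group by 1, 2",
  },
  user_stats: {
    deps: ["daily_sessions"],
    sql: "select user_id, count(*) from daily_sessions group by 1",
  },
};

// Depth-first walk: every dataset is emitted after all of its dependencies.
function executionOrder(actions) {
  const order = [];
  const seen = new Set();
  function visit(name) {
    if (seen.has(name)) return;
    seen.add(name);
    for (const dep of actions[name].deps) visit(dep);
    order.push(name);
  }
  Object.keys(actions).forEach(visit);
  return order;
}

console.log(executionOrder(actions));
```

Once the graph is explicit like this, scheduling, incremental rebuilds and failure isolation all fall out of the same structure.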

Here's a short video demo where we develop two new datasets, push the code to GitHub and schedule their execution, all in under 5 minutes.


You can sign up at https://dataform.co. If you're curious how it works, here are the docs: https://docs.dataform.co and our open source framework: https://github.com/dataform-co/dataform

We would love to hear your feedback and answer any questions you might have!

Lewis and Guillaume

I take it everything is based on schedules, there’s no runtime magic to handle incremental updates in a timely or efficient way? I’ve been watching @frankmcsherry’s Materialize.io with interest, for example.

Possibly controversial, but I don't believe there are that many situations in data analytics where you really need realtime/streaming analytics. It usually comes at the cost of significant additional complexity and maintenance that ultimately hurt reproducibility and agility.

Having said that, we are working on triggering table builds from watching changes to consumed data sets, which paired with an incremental table build would allow you to reliably achieve latency of under a few minutes.

I disagree.

Near realtime (say sub 1-minute) feeds can be leveraged to extract a lot of value out of the data you're collecting. Perhaps not for your typical SaaS startup, but for free-to-play video games you end up leaving so much value on the table if you can't quickly (automatically) respond to spend or churn indicators.

I think a good rule of thumb for this is looking at how fast your decisions are made. If you're making decisions daily then refreshing your data every 24 hours is probably good enough. If you're making decisions every hour, then every 60 minutes is probably good enough.

Another factor is scale. When you're dealing with 6-figure CCUs and are trying to optimize conversion or retention through split-testing, figuring out which variants are anomalously poorly performing quickly can save you a whole lot of money.

I reckon there are at least 50 titles that can benefit from streaming analytics, which immediately affects anywhere from 100-500 employees (analysts, engineers) and likely influences over 1000 (broader company). That's a significant portion of the field.

That’s why I don’t like the term real time. Decision time should be the criterion.

Also, a true data warehouse has its data frozen on a set recurring schedule, say every hour or every 24 hours. This is so that users running related jobs at different times throughout the day see the same results and are working with the same data.

1. What do you think of NoETL and SQL++? Ref: https://www.couchbase.com/products/analytics

2. Do you think adopting a tool like Dataform makes more sense at mid-stage companies? Ref: https://thinkgrowth.org/the-startup-founders-guide-to-analyt...

1. I think NoETL products are complementary solutions. NoETL can be a great option for ad hoc queries and when performing a one-off analysis.

But for your company's analytics, you want to build a suite of well defined, reliable and documented datasets that power your dashboards and your regular analytics.

Another way to say it is that with NoETL, you write one query to answer one complex question. With ETL (actually ELT), you define core datasets with which you can answer many questions.

2. In terms of adoption, products like Dataform are helpful from the day you start having data in your data warehouse and have a full time analyst. It's basically giving superpowers to your analysts.

> well defined, reliable and documented datasets

Can you elaborate this?

NoETL targets semi-structured and schema-less data models prevalent in NoSQL/HTAP data stores, so if well defined relates to well defined schemas, it maybe doesn't apply here.

As far as reliability (aka reproducibility) and documentation is concerned, I guess it's more about building tools that promote sound engineering processes like version control, modules (aka reusable snippets), documentation (could be Jupyter notebooks, etc.) and these tools could be built for NoETL systems too.

Congrats on your launch! How is this different from DBT?

Our framework (and the CLI interface) is pretty similar to be honest, except it’s in JS. We also spent quite a bit of time focusing on performance.

The main difference is that we provide an all-in-one web platform for teams to develop, test and schedule their SQL pipelines. With one web interface and in 5 min, you can write a query, test it, add it to a schedule, push the changes to GitHub (or submit a PR) and monitor your jobs.

We make it easy to enforce software engineering best practices like version control and isolated deployments. For example, our version control works similarly to Looker's: analysts can work simultaneously from different branches without having to switch tools or use the command line.

What’s the rationale behind using JS instead of Python/Jinja? Any insights on that?

Great question :)

1. Speed. We wanted compilation to be really, really fast.

2. In all honesty we just weren't big fans of Jinja, having used it for quite a while. JS templating is OK out of the box, and we are considering React-like syntax in the future.

3. Love it or hate it, NPM packages are pretty easy to work with, and we are working on a number of packages at the moment.

4. When you start to look at things like UDFs and cloud functions, which enable some really cool use cases, JS seems to be prevailing (in Snowflake and BigQuery at least).
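To make points 1 and 2 concrete, plain JS template literals already go a long way for SQL generation without a separate template language. A hedged sketch (the table and column names are made up, and this is not our actual API):

```javascript
// Generating SQL with plain JS template literals instead of Jinja.
// Table and column names are made up for illustration.
const countries = ["US", "GB", "FR"];

const dailyRevenue = (table, startDate) => `
select country, date, sum(revenue) as revenue
from ${table}
where date >= '${startDate}'
  and country in (${countries.map((c) => `'${c}'`).join(", ")})
group by 1, 2`;

console.log(dailyRevenue("analytics.orders", "2019-07-01"));
```

Because the template is just a function, you get code reuse, editor tooling and unit testing for free, with no templating DSL to learn.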

I will admit though that we do usually get a bit of a shocked reply when people hear it's not Python!

I see your point, but I think that data analysts are usually familiar with Python, not JS. The UI is cool and intuitive; it looks like you copied most of the concepts except Jinja from DBT. That's perfectly OK though, let's see where it goes!

>In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts.

Do you have any plan to add a visual interface? Unfortunately many analysts I work with are not really capable in SQL, or don't fully grok SQL features.

Also, how do you plan on creating a moat against cloud companies putting the functionality of your project in their own proprietary solutions? For example, in Azure you can already use the Azure Data Lake stack of ADL Storage + Analytics + Data Factory to do what you are doing. These have IDE integrations, CI/CD integrations, git integrations, are first-class citizens in their cloud interfaces, etc.

A UI for building data transformations (SQL queries) isn't on our immediate roadmap, but we have thought about it.

We are taking a bit of a bet here that the industry is moving towards SQL, and we see a growing interest among analysts in learning it.

Moat: always a concern! Short term, we think most of our unique value comes from the collaborative web development experience, which is not OSS currently. Staying platform agnostic is also important to us. Longer term, we have some plans that mostly focus on building value around a community and our SQL ecosystem, but we're not ready to get into details just yet.

Check out the demo. There’s a significant UI available.

Apart from being cloud-focused, what's the difference between dataform and a classic ETL engine like SQL Server Integration Services?

Great question. Beyond being on the cloud and everything that goes with that, I'd say that there are two fundamental differences with a product like SSIS.

1. Dataform focuses on transformations happening _inside_ the cloud data warehouse. We don't move data between systems. More and more businesses are moving from traditional ETL processes to ELT and centralising their raw data in their warehouse. Dataform helps businesses manage the T in ELT.

2. Dataform is built with software engineering best practices in mind. In Dataform, all your transformations are written in code (mostly SQL) instead of a GUI like SSIS. The code can be version controlled and edited, and scales better to the large number of transformations teams have to deal with.

Is there a tutorial to run it on-prem? For example for development or testing purposes?

What data warehouses are currently supported, and which are on the roadmap? Is ClickHouse somewhere in your plans?

On-prem: Right now our IDE is only available as SaaS, although we will be looking at this in the near future. You can develop and test projects with the CLI and deploy them yourself but no tutorials for setting this up beyond the basics yet: https://docs.dataform.co/guides/command-line-interface/

Warehouse support: Athena/Presto and Azure are top of mind. I've not come across ClickHouse before but I'll definitely add it to our tracker!

What's the core transformation technology running the SQL transforms? Are you relying on the cloud data warehouse itself (i.e. BigQuery, Amazon Redshift or Snowflake) to run the SQL transforms? Asking because the transform workload usually dominates the query workload if they're not properly separated.

Just from the docs, it appears to rely on the warehouse itself. It's basically SQL + JS templates to make data engineers' and analysts' jobs easier.

Yup, the warehouse or query engine does the heavy lifting; we do the orchestration and make it easier to develop these pipelines.

Any success getting through the VRM process at a financial services firm? Your product looks very exciting.

Thank you! We haven't been through VRM yet but that's about to come up. If you have any tips, would love to hear your thoughts. I am at ghh@[domain]

Does it work cross databases? Ie, if data is stored in multiple databases and needs to be combined for analytics? Currently, we are getting all data to S3 via DMS and running a script on top of this to combine data

Dataform still requires you to move your data to a single place where it can be queried together, whether that's S3 with Presto, Redshift spectrum, or something else.

There are lots of great products for moving data like this. Stitch and Fivetran are both great, and we've used them ourselves, but they don't help you manage complex data transformation processes.

Looks like yet another ETL/ELT tool.

We have been using composable with quite a bit of success: https://dev.composable.ai

I'm routinely confused about the role these sort of tools play. I conceptually understand what they're doing but weren't these problems solved a long time ago with materialized views and foreign data wrappers?

The hard part of ETL for me has always been gracefully handling the outliers: the data that you need to look at within the context of multiple rows (e.g. duplicate rows that aren't exactly the same, so it's not a simple SELECT DISTINCT) in order to make the correct decision, or the entity matching/mapping/categorization that is often necessary. Or maintaining the lookup tables that often need manual oversight when new entries show up that haven't yet been properly tagged. Or if you're ingesting address data and you want to normalize it through a geocoder, where exactly does that occur in these SQL-templating "pipelines"?

I feel like these are basically focused on moving data between different DBs and generating group-by queries to populate some rollup tables?

Indeed. Dataform however only does the T (transformation) in ELT. Most businesses are centralising an exploding diversity of raw data in the data warehouse and Dataform helps them manage the data there.

One other main difference from a product like Composable is that Dataform is built with software engineering best practices in mind. The transformations are written in code (SQL) rather than in a GUI. That enables version control and reusable components, for example, and makes it easier to manage a large number of interdependent transformations.

To be fair, these other tools do this as well.

Does this add any value beyond what PDTs in Looker provide?


We see that PDTs don't scale well with the amount of data complexity you usually have to deal with in the warehouse.

On top of that, the results of PDTs are only accessible to Looker. As your data team matures, there will be many other users and applications that need to take advantage of your transformed data in the warehouse. (e.g. modelling in Python)

Dataform also brings other features not available in PDTs. For example:

- Data validation tests (assertions) to ensure data quality in all your data applications (including Looker)

- Incremental tables: You can define tables that update incrementally instead of being rebuilt from scratch, hence delivering better performance and saving costs

- Reusable code. You can define reusable code snippets across your entire project
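The incremental-table point can be sketched in a few lines: on the first run you build the whole table, and on later runs you only append rows newer than what's already there. This is illustrative JavaScript, not Dataform's actual API, and the table and column names are made up:

```javascript
// Sketch of an incremental build: full rebuild on the first run,
// append-only inserts afterwards. Illustrative only, not Dataform's API.
function buildStatement(table, source, incremental) {
  if (!incremental) {
    return `create or replace table ${table} as select * from ${source}`;
  }
  // Only process rows newer than what the table already contains.
  return `insert into ${table}
select * from ${source}
where ts > (select max(ts) from ${table})`;
}

console.log(buildStatement("analytics.events", "raw.events", true));
```

On a large event table this is the difference between scanning the whole history every day and scanning only the latest partition, which is where the performance and cost savings come from.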

I very much agree with this response. The term that I've heard to summarize all of the above is "semantic layer". That layer runs in the warehouse, vs. some external dashboard tool.

Is there a roadmap for adding other data sources?

CTO here, What are you after? :)

Athena/Presto, Azure are next in our sights.

