
Launch HN: Dataform (YC W18) – Build Reliable SQL Data Pipelines as a Team - G2H
Hi HN!

We're Guillaume and Lewis, founders of Dataform, and we're excited (and nervous) to be posting this on HN.

Dataform is a platform for data analysts to manage data workflows in cloud data warehouses such as Google BigQuery, Amazon Redshift or Snowflake. With our open source framework and our web app, analysts can develop and schedule reliable pipelines that turn raw data into the datasets they need for analytics.

Before starting Dataform, we managed engineering teams in AdSense and led product analytics for publisher ads. We relied heavily on data (and data pipelines!) to generate insights, drive better decisions and build better products. Companies like Google invest a lot in internal data tools that let analysts manage data and build data pipelines: in five minutes I could define a new dataset in SQL that would be updated every day, then use it in my reports.

Most businesses today are centralising their raw data into cloud data warehouses but lack the tools to manage it efficiently. Pipelines run manually or via custom scripts that break often. Or the company invests engineering resources to set up, maintain and debug a framework like Airflow, but that only solves scheduling, and the technical bar is often too high for analysts to contribute.

We saw a need for a self-service solution that lets data teams manage data efficiently, so that analysts can own the entire workflow from raw data to analytics. We built Dataform with two core principles in mind:

1. Bring engineering best practices to data management. In Dataform, you build data pipelines in SQL, and our open source framework lets you seamlessly define dependencies, build incremental tables and reuse code across scripts. You can write tests against your raw and transformed data to ensure data quality across your analytics. Our development environment also encourages best practices: analysts work with version control, code review and sandboxed environments.

2. Let data teams focus on data, not infrastructure. We want to offer a better, faster and cheaper alternative to what businesses have to build and maintain in-house today. Our web app comes with a collaborative SQL editor, where teams develop and push their changes to GitHub. You can then orchestrate your data pipelines without having to maintain any infrastructure.

Here's a short video demo where we develop two new datasets, push the code to GitHub and schedule their execution, all in under five minutes: https://www.youtube.com/watch?v=axDKf0_FhYU

You can sign up at https://dataform.co. If you're curious how it works, here are the docs: https://docs.dataform.co, and here's our open source framework: https://github.com/dataform-co/dataform

We would love to hear your feedback and answer any questions you might have!

Lewis and Guillaume
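For a concrete taste, a dataset definition in the open source framework looks roughly like this (a sketch based on the public docs; the file, table and column names are made up):

```sql
-- definitions/daily_sessions.sqlx (hypothetical)
config {
  type: "table",
  description: "Daily session counts per user"
}

select
  user_id,
  date(timestamp) as session_date,
  count(*) as sessions
from ${ref("source_events")}  -- ref() declares the dependency on another dataset
group by 1, 2
```

The `ref()` call is what lets the framework build the dependency graph and run transformations in the right order.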
======
thom
I take it everything is based on schedules, there’s no runtime magic to handle
incremental updates in a timely or efficient way? I’ve been watching
@frankmcsherry’s Materialize.io with interest, for example.

~~~
1ewish
Possibly controversial, but I don't believe there are that many situations in
data analytics where you really need realtime/streaming analytics. It usually
comes at the cost of significant additional complexity and maintenance, which
ultimately hurts reproducibility and agility.

Having said that, we are working on triggering table builds by watching for
changes to consumed datasets, which, paired with incremental table builds,
would let you reliably achieve latencies of under a few minutes.

~~~
devbug
I disagree.

Near realtime (say sub-1-minute) feeds can be leveraged to extract a lot of
value from the data you're collecting. Perhaps not for your typical
SaaS startup, but for free-to-play video games you end up leaving so much
value on the table if you can't quickly (and automatically) respond to spend
or churn indicators.

I think a good rule of thumb for this is looking at how fast your decisions
are made. If you're making decisions daily then refreshing your data every 24
hours is probably good enough. If you're making decisions every hour, then
every 60 minutes is probably good enough.

Another factor is scale. When you're dealing with 6-figure CCUs and are trying
to optimize conversion or retention through split-testing, quickly figuring
out which variants are performing anomalously poorly can save you a whole lot
of money.

I reckon there are at least 50 titles that could benefit from streaming
analytics, which immediately affects anywhere from 100-500 employees
(analysts, engineers) and likely influences over 1,000 (the broader company).
That's a significant portion of the field.

~~~
cgio
That’s why I don’t like the term real time. Decision time should be the
criterion.

------
thekhatribharat
1\. What do you think of NoETL and SQL++? Ref:
[https://www.couchbase.com/products/analytics](https://www.couchbase.com/products/analytics)

2\. Do you think adopting a tool like Dataform makes more sense at mid-stage
companies? Ref: [https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1](https://thinkgrowth.org/the-startup-founders-guide-to-analytics-1d2176f20ac1)

~~~
G2H
1\. I think NoETL products are complementary solutions. NoETL can be a great
option for ad hoc queries and one-off analyses.

But for your company's analytics, you want to build a suite of well defined,
reliable and documented datasets that power your dashboards and your regular
analytics.

Another way to say it is that with NoETL, you write one query to answer one
complex question. With ETL (actually ELT), you define core datasets with which
you can answer many questions.
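To illustrate (a sketch with hypothetical table and column names, not from the Dataform docs): a single well-defined dataset built once in the warehouse can back many downstream questions.

```sql
-- One core dataset, defined once...
create table analytics.orders_enriched as
select
  o.order_id,
  o.user_id,
  o.amount,
  u.signup_date,
  u.acquisition_channel
from raw.orders o
join raw.users u on u.user_id = o.user_id;

-- ...then revenue by channel, repeat-purchase rate, cohort analyses, etc.
-- are each a simple query against it, instead of one bespoke NoETL query
-- per question.
```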

2\. In terms of adoption, products like Dataform are helpful from the day you
start having data in your data warehouse and have a full time analyst. It's
basically giving superpowers to your analysts.

~~~
thekhatribharat
_> well defined, reliable and documented datasets_

Can you elaborate this?

NoETL targets the semi-structured and schema-less data models prevalent in
NoSQL/HTAP data stores, so if "well defined" refers to well-defined schemas,
it may not apply here.

As far as reliability (aka reproducibility) and documentation are concerned, I
guess it's more about building tools that promote sound engineering processes,
like version control, modules (aka reusable snippets) and documentation (could
be Jupyter notebooks, etc.), and these tools could be built for NoETL systems
too.

------
buremba
Congrats on your launch! How is this different from DBT?

~~~
G2H
Our framework (and the CLI interface) is pretty similar to be honest, except
it’s in JS. We also spent quite a bit of time focusing on performance.

The main difference is that we provide an all-in-one web platform for teams to
develop, test and schedule their SQL pipelines. With one web interface and in
5 min, you can write a query, test it, add it to a schedule, push the changes
to GitHub (or submit a PR) and monitor your jobs.

We make it easy to enforce software engineering best practices like version
control and isolated deployments. For example, our version control works
similarly to Looker's: analysts can work simultaneously from different
branches without having to switch tools or use the command line.

~~~
buremba
What’s the rationale behind using JS instead of Python/Jinja? Any insights on
that?

~~~
1ewish
Great question :)

1\. Speed. We wanted compilation to be really, really fast.

2\. In all honesty, we just weren't big fans of Jinja, having used it for
quite a while. JS templating is OK out of the box, and we are considering
React-like syntax in the future.

3\. Love it or hate it, NPM packages are pretty easy to work with, and we are
working on a number of packages at the moment.

4\. When you start to look at things like UDFs and cloud functions, which
enable some really cool use cases, JS seems to be prevailing (in Snowflake and
BigQuery at least).

I will admit though that we do usually get a bit of a shocked reply when
people hear it's not Python!
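As a rough illustration of the templating point (plain JS template literals, not Dataform's actual API; the event types and query shape are made up):

```javascript
// Generating a pivoted SQL query with plain JS template literals,
// the kind of templating JS gives you out of the box instead of Jinja.
const eventTypes = ["click", "view", "purchase"];

// One aggregate column per event type.
const caseClauses = eventTypes
  .map((t) => `countif(event_type = '${t}') as ${t}_count`)
  .join(",\n  ");

const query = `select
  user_id,
  ${caseClauses}
from events
group by user_id`;

console.log(query);
```

Adding a new event type then means editing one array rather than hand-writing another aggregate column.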

~~~
buremba
I see your point but I think that data analysts are usually familiar with
Python, not JS. The UI is cool and intuitive, it looks like you copied most of
the concepts except Jinja from DBT. That's perfectly OK though, let's see
where it goes!

------
opportune
>In Dataform, you build data pipelines in SQL, and our open source framework
lets you seamlessly define dependencies, build incremental tables and reuse
code across scripts.

Do you have any plans to add a visual interface? Unfortunately many analysts I
work with are not really capable in SQL, or don't fully grok SQL features.

Also, how do you plan on creating a moat against cloud companies putting the
functionality of your project in their own proprietary solutions? For example,
in Azure you can already use the Azure Data Lake stack of ADL Storage +
Analytics + Data Factory to do what you are doing. These have IDE
integrations, CI/CD integrations, git integrations, are first-class citizens
in their cloud interfaces, etc.

~~~
1ewish
A UI for building data transformations (SQL queries) isn't on our immediate
roadmap, but we have thought about it.

We are taking a bit of a bet here that the industry is moving towards SQL, and
we see growing interest among analysts in learning it.

Moat: Always a concern! Short term, we think most of our unique value comes
from the collaborative web development experience, which is not OSS currently.
Staying platform agnostic is also important to us. Longer term, we have plans
that mostly focus on building value around a community and our SQL ecosystem,
but we're not ready to get into details just yet.

------
polskibus
Apart from being cloud-focused, what's the difference between dataform and a
classic ETL engine like SQL Server Integration Services?

~~~
G2H
Great question. Beyond being on the cloud and everything that goes with that,
I'd say that there are two fundamental differences with a product like SSIS.

1\. Dataform focuses on transformations happening _inside_ the cloud data
warehouse. We don't move data between systems. More and more businesses are
moving from traditional ETL processes to ELT, centralising their raw data
in their warehouse. Dataform helps businesses manage the T in ELT.

2\. Dataform is built with software engineering best practices in mind. In
Dataform, all your transformations are written in code (mostly SQL) instead of
a GUI like SSIS. The code can be version controlled and reviewed, and it
scales better to the large number of transformations teams have to deal with.

~~~
polskibus
Is there a tutorial to run it on-prem? For example for development or testing
purposes?

What data warehouses are currently supported, and which are on the roadmap? Is
ClickHouse somewhere in your plans?

~~~
1ewish
On-prem: Right now our IDE is only available as SaaS, although we will be
looking at this in the near future. You can develop and test projects with the
CLI and deploy them yourself, but there are no tutorials for setting this up
beyond the basics yet:
[https://docs.dataform.co/guides/command-line-interface/](https://docs.dataform.co/guides/command-line-interface/)

Warehouse support: Athena/Presto and Azure are top of mind. I've not come
across ClickHouse before but I'll definitely add it to our tracker!

------
khaledh
What's the core transformation technology running the SQL transforms? Are you
relying on the cloud data warehouse itself (i.e. BigQuery, Amazon Redshift or
Snowflake) to run the SQL transforms? Asking because the transform workload
usually dominates the query workload if they're not properly separated.

~~~
justjico
Just from the docs, it appears to rely on the warehouse itself. It's basically
SQL + JS templates to make data engineers' and analysts' jobs easier.

~~~
1ewish
Yup, the warehouse or query engine does the heavy lifting; we do the
orchestration and make it easier to develop these pipelines.

------
dataminded
Any success getting through the VRM process at a financial services firm? Your
product looks very exciting.

~~~
G2H
Thank you! We haven't been through VRM yet but that's about to come up. If you
have any tips, would love to hear your thoughts. I am at ghh@[domain]

------
dmundhra
Does it work across databases? I.e., if data is stored in multiple databases
and needs to be combined for analytics? Currently, we are getting all data to
S3 via DMS and running a script on top of this to combine it.

~~~
1ewish
Dataform still requires you to move your data to a single place where it can
be queried together, whether that's S3 with Presto, Redshift spectrum, or
something else.

There are lots of great products for moving data like this: Stitch and
Fivetran are both great and we've used them ourselves, but they don't help you
manage complex data transformation processes.

------
dttos
Looks like yet another ETL/ELT tool.

We have been using composable with quite a bit of success:
[https://dev.composable.ai](https://dev.composable.ai)

~~~
G2H
Indeed, though Dataform only does the T (transformation) in ELT. Most
businesses are centralising an exploding diversity of raw data in the data
warehouse, and Dataform helps them manage the data there.

One other main difference with a product like Composable is that Dataform is
built with software engineering best practices in mind. The transformations
are written in code (SQL) rather than built in a GUI, which enables version
control and reusable components, and makes it easier to manage a large number
of interdependent transformations.

~~~
dttos2
To be fair, these other tools do this as well.

------
mgberlin
Does this add any value beyond what PDTs in Looker provide?

~~~
G2H
Definitely!

We see that PDTs don't scale well with the data complexity you usually have to
deal with in the warehouse.

On top of that, the results of PDTs are only accessible to Looker. As your
data team matures, there will be many other users and applications that need
to take advantage of your transformed data in the warehouse. (e.g. modelling
in Python)

Dataform also brings other features not available in PDTs. For example:

\- Data validation tests (assertions) to ensure data quality in all your data
applications (including Looker)

\- Incremental tables: You can define tables that update incrementally instead
of being rebuilt from scratch, hence delivering better performance and saving
costs

\- Reusable code. You can define reusable code snippets across your entire
project
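For concreteness, the incremental-table pattern looks roughly like this in the framework (a sketch based on the public docs; the `events` table and `timestamp` column are hypothetical names):

```sql
config { type: "incremental" }

select user_id, event_type, timestamp
from ${ref("events")}

-- on incremental runs, only process rows newer than what the table
-- already contains, instead of rebuilding from scratch
${when(incremental(),
  `where timestamp > (select max(timestamp) from ${self()})`)}
```

Assertions follow the same file-based style: a query that should return zero rows (e.g. selecting null or duplicate keys) fails the run if it doesn't.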

~~~
scapecast
I very much agree with this response. The term I've heard to summarize all of
the above is "semantic layer". That layer runs in the warehouse, vs. some
external dashboard tool.

------
starik36
Is there a roadmap for adding other data sources?

~~~
1ewish
CTO here. What are you after? :)

Athena/Presto, Azure are next in our sights.

