
Launch HN: Datafold (YC S20) – Diff Tool for SQL Databases - hichkaker
Hi HN! My name is Gleb. I'm here with my co-founder Alex to tell you
about our company Datafold (https://datafold.com). Datafold lets you diff
large datasets for fast and powerful regression testing. We support
databases such as PostgreSQL, Snowflake, BigQuery, and Redshift.

One of the biggest pain points in developing ETL pipelines –
chains of jobs that move, clean, merge and aggregate analytical data –
has been regression testing: verifying how a change in source code (mostly
SQL) affects the produced data.

Early in my career, as an on-call data engineer at Lyft, I
accidentally introduced a breaking code change while attempting to
ship a hotfix at 4AM to a SQL job that computed tables for core
business analytics. A seemingly small change in filtering logic ended
up corrupting data for all downstream pipelines and breaking
dashboards for the entire company. Apart from being a silly mistake,
it highlighted the lack of proper tooling for testing changes. If
there had been a way to quickly compare the data computed by the
production code vs. the hotfix branch, I would have immediately
spotted the alarming divergence and avoided merging the breaking
change.

Without a diffing tool, the typical options for regression testing
are: (1) Data "unit tests" (e.g. check primary key uniqueness, ensure
values are within an interval, etc.) – these are helpful, but a costly
investment. Frameworks such as dbt make it easier, but it's
often still prohibitively hard to verify all assumptions about a large
table. (2) Write custom SQL queries to compare the data produced by the
prod and dev versions of the source code (e.g. compare counts, match
primary keys) – see the sketch below. This can easily take 100+ lines of
SQL and hours of unsatisfying work, which no one really wants to do.
(3) "Fuck It, Ship It" is always an option, but too risky nowadays, as
analytical data not only powers dashboards but also production ML models.
As this problem is common in data engineering, some large
organizations have built and open-sourced their own solutions – for
example, BigDiffy by Spotify. However, most of these tools are
CLI-based and produce results in a plain-text format which is hard to
comprehend when you are dealing with complex data.

To fit our users' existing workflows, we've built a web interface
with interactive charts showing both diff summary statistics (e.g. %
of differing values by column) and a value-level side-by-side comparison
(git diff style). But since the mission of the tool is to save
engineers as much time as possible, we also opened an API for
automation through Airflow or other orchestrators, and built a GitHub
workflow that runs a diff on every pull request that changes ETL
code. Since billion-row-scale datasets are not uncommon nowadays,
there is an optional sampling feature that keeps compute costs
low and returns results within a few minutes no matter how large the
dataset is.
We've found Datafold to be a good fit for the following workflows: (1)
Developing data transformations – before an ETL job is shipped to
production, it undergoes multiple iterations. It's often important to
see how the data changes between iterations, which is particularly useful
when you have 1M+ rows and 100+ columns and "SELECT *" becomes
useless. (2) Code review & testing: large organizations have hundreds
of people committing to ETL codebases. Understanding the impact of
even a modest SQL diff is daunting. Datafold can produce a data diff
for every commit in minutes so changes are well understood. (3) Data
transfer validation: moving large volumes of data between databases is
error-prone, especially if done via change data capture (CDC): a
single lost event can affect the resulting dataset in a way that is
tricky to debug. We allow comparing datasets across different
databases, e.g. PostgreSQL & Snowflake.

We've set up a sandbox at https://app.datafold.com/hackernews so you can see
how diffing works. Shoot us an email (hn@datafold.com) to set up a trial and use it with
your own data.

We are passionate about improving tooling for data engineers and would
love to hear about your experience with developing data pipelines and
ensuring data quality. Also, if you think that dataset diffing could be
helpful in other domains, we are very curious to learn from you!
======
GordonS
I was curious about pricing, but I see it's "call me pricing" with buttons to
schedule a demo, so at least I can see this is squarely aimed at the
enterprise. If I'm being honest, I don't like seeing "call me pricing" on HN;
there are no rules against it, but it just doesn't feel right here.

Are you able to say anything about pricing here?

~~~
hichkaker
Definitely a fair point!

The primary reason we haven't provided pricing is that we have just launched
and wanted to collect more data points before making the pricing
public.

Our current offering is:

1) Free for diffing datasets < 1M rows

2) $90 / mo / user for diffing datasets of unlimited size

3) Cross-database diff, on-prem (AWS/GCP/data center) deployment, and
single sign-on are in a custom-priced enterprise bucket.

We would love to hear your thoughts on this.

~~~
simonebrunozzi
I have a quick suggestion for you: two options to mitigate this "issue".

Option 1: make it free, up to a certain dataset size. You can harvest
interested leads like the gentleman above.

Option 2: (if you don't want to deal with a huge volume) offer it for $50 one-
time fee, up to X size, for Y months (e.g. $50, up to 1 GB, valid for 3
months). Nice way to filter qualified leads.

There are variations on the two options above, but I think you can easily
get the general idea.

Thoughts?

~~~
hichkaker
Thank you for the suggestion!

We're leaning towards Option 1: free diffing for datasets < 1M rows. Option 2
seems a bit trickier, since we are in a way creating a new tool category and it
can be harder to convince someone to pay before they try and understand the
value (unlike, say, a BI tool – everyone knows they need one of some kind).

------
throwaway_pdp09
I see a lot of these things and I don't understand them. I've done too much
ETL, so I'm not naive. Now either 1) people are making a mountain out of a
molehill (not saying that's happening here, but in other cases I think so),
2) there's something my experience of ETL hasn't taught me, or 3) these tools
are specialised for niches. This one talks about 'large datasets' but I don't
know how large that is.

Some questions then

> Often it’s important to see how data changes between every iteration, and
> particularly useful if you have 1M+ rows and 100+ columns where “SELECT *”
> becomes useless.

select is fine for diffing. You just do an either-way EXCEPT, something like

    (
    select f1, f2, f3 ... f100
    from t1
    except
    select f1, f2, f3 ... f100
    from t2
    )
    union
    (
    select f1, f2, f3 ... f100
    from t2
    except
    select f1, f2, f3 ... f100
    from t1
    )

I used this and it's fine on many rows (millions are fine, but I do recommend
an index and a DB with a halfway decent optimiser).

> (2)

Interesting. OK.

> (3) Data transfer validation: moving large volumes of data between databases
> is error-prone

Really? I never had a problem. What is 'large'? What problems have you seen?
There are easy solutions with checksums, error correction (comes free with
networks) or round-tripping; is that a problem?

Edit: just did that with MSSQL tables, 8 cols, 38 bytes per row, ~776,000
rows (identical but for one row); the diff as above takes 2 seconds without an
index (with a PK it takes 5 seconds. Sigh. Well done MS). The single-row
discrepancy shows up fine. Totally trivial to extend it to 100 columns (did
that too in a previous job).

~~~
forrestb
Data diffs are easy when the tables are already exactly the same. The
union/except query you wrote returns no results.

It's much harder when many things are wrong at once, and you have to debug:

- Do schemas match? (same column names, column order, data types)

- Are pk's unique?

- Do pk's intersect? (pk's exclusive to one table?)

- For intersecting pk's, how is data different? (string truncation, timestamp
truncation, rounding, upper vs lower case, empty string vs null, etc.)
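For reference, the first two checks are straightforward in SQL (t1/t2 and the
key column id are placeholders):

    -- Are pk's unique?
    SELECT id, COUNT(*) AS dupes
    FROM t1
    GROUP BY id
    HAVING COUNT(*) > 1;

    -- Keys exclusive to t1 (swap t1/t2 for the other direction)
    SELECT id FROM t1
    EXCEPT
    SELECT id FROM t2;

The last check – how the values differ for intersecting keys – is where the
hand-written SQL starts to balloon.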

~~~
throwaway_pdp09
> Data diffs are easy when the tables are already exactly the same. The
> union/except query you wrote returns no results.

It won't if the tables are identical. And if the tables differ, it a) tells
you and b) gives you a very good idea where to look.

In all honesty, none of these things are a problem.

> It's much harder when many things are wrong at once

Then you find and fix each in turn. BTDT.

> Do schemas match? (same column names, column order, data types)

Well, if they don't and you're responsible, you've really messed up, but that
does happen (cough). Just do a similar diff via information_schema to show the
schema discrepancies (sketched below).
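A minimal sketch of that information_schema diff (standard catalog views;
t1/t2 are placeholder table names, and you may also want to filter on
table_schema):

    (
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_name = 't1'
    EXCEPT
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_name = 't2'
    )
    UNION
    (
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_name = 't2'
    EXCEPT
    SELECT column_name, data_type, ordinal_position
    FROM information_schema.columns
    WHERE table_name = 't1'
    )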

> Do pk's intersect?

this is just a subset of full-table diffing. Use my except/union but just on
the PK fields (consult the infoschema for the columns which are the PKs). If
it has to be done for many tables, build the union/except dynamically, then
execute the string to allow for different tables with different column names -
I've done this too. Annoying, but once done you can just reuse it.

> For intersecting pk's, how is data different? (string truncation, timestamp
> truncation, rounding, upper vs lower case, empty string vs null, etc)

The aforementioned comparison using the infoschema views will tell you where
the schemas differ. Comparing the contents shows you where the contents differ.

All this is annoying but is _very_ little work compared to a whole ETL
process, and - emphatically - it is reusable on the next job.

~~~
forrestb
> Well, if they don't and you're responsible you've really messed up, but that
> does happen (cough). Just do a similar diff via information_schema to show
> you schema discrepancies.

Analysts with 1-2 years of experience in SQL are often writing ETL. Responsible
Data Engineers can't be there every time they deploy to production.

> All this is annoying but of very little work compared to a whole ETL
> process, and - emphatically - is reusable on the next job.

Yes, you can write your own tools for SQL QA. Making a standard tool that the
entire company can and _does_ use is another story.

~~~
mcrad
> Analysts with 1-2 years experience in SQL are often writing ETL

This is the essence of Big Data.

------
ishcheklein
Hey! Looks great! Is there an example of the GitHub integration - what does it
look like?

I'm one of the developers and maintainers of the DVC project, and we recently
released CML.dev, which integrates with GitHub and can be used to run some
checks on data as well, though in our case it's more or less about analyzing
files. I'm curious what that integration looks like in your case.

~~~
hichkaker
Thanks! We are going to add an example to our website and perhaps publish it
as a GitHub App.

The flow is: for every whitelisted (say, SQL) file changed by a pull request,
our bot compares the production data (produced by the master branch) vs. the
data output of the new branch and posts the diff as a comment on the PR. Some
YAML configuration is needed, for example, to adjust it to your repo's file
structure.

~~~
ishcheklein
Thanks! And how and where do you configure which database to use to run the
query for a specific SQL file? Also, if it's part of some pipeline, will it
have to run the whole pipeline from the very beginning?

~~~
hichkaker
So there are two main workflows:

1) Run the diff as part of CI (on code change); that can be done via a GitHub
workflow (or other CI). The GitHub workflow takes in a YAML config file and is
customizable, so you can fit it to your needs. For example, one of
our customers names SQL files after the tables they populate, so when the bot
sees a changed SQL file, it knows which table to compare it to.

2) If you are looking to run a diff within the data pipeline itself (e.g. in
Airflow), you can wrap the Datafold API in an Airflow operator to integrate it
as a task in your DAG.

I hope that I covered your questions but please let me know if you are
interested in specific use cases!

------
hodgesrm
It is always good to see new approaches to testing but I don't see how this
one is going to work. I've worked at multiple database companies. Diff'ing
data is one of the weakest and most cumbersome ways to verify correctness.

Diffs are relatively slow, when they fail you get a blizzard of errors, and
the oracles (i.e. the "good" output) have to be updated constantly as the
product changes. Plus I don't see how this helps with schema migration or
performance issues, which are major problems in data management. And don't get
me started on handling things like dates, which change constantly, hence break
diffs.

If you really care about correctness it's better to use approaches like having
focused test cases that check specific predicates on data. They can run
blindingly fast and give you actionable data about regressions. They're also a
pain to code but are most productive in the long run.

~~~
hichkaker
Thank you for sharing!

I assume we are talking about analytical, not transactional data:

> Diff'ing data is one of the weakest and most cumbersome ways to verify
> correctness.

It depends on the use case: if the goal is to assess the impact of a change in
source code on the resulting dataset (extremely common in the ETL dev
workflow, in my experience), then isn't diff the natural solution? Of course,
it depends on how the results are presented. A row-by-row output for a
billion-row dataset is useless. That's why we provide diff stats across
columns/rows and data distribution comparisons, while allowing the user to see
a value-level diff if needed.

> Diffs are relatively slow

In general – yes; that's why we've implemented configurable sampling. In the
majority of cases, the developer is looking to assess the magnitude of the
difference and certain patterns, for which you don't need a large sample size.
Our customers typically use ~1/10000 of the target table row count as a sample
size.
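As an illustration of why naive sampling doesn't work for diffing – both sides
must select the same keys – one common trick is to hash the primary key. A
sketch in PostgreSQL (hashtext is a PostgreSQL-specific function; table and
column names are hypothetical):

    -- Deterministic ~1/10000 sample; both sides pick the same ids
    SELECT * FROM orders_prod WHERE abs(hashtext(id::text)) % 10000 = 0;
    SELECT * FROM orders_dev  WHERE abs(hashtext(id::text)) % 10000 = 0;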

> when they fail you get a blizzard of errors

We try to fail gracefully :)

> I don't see how this helps with schema migration or performance issues.

For schema migration, you can verify whether anything has changed in your
dataset _besides_ the intended schema changes (which certainly happened on my
watch).

> or performance issues

We certainly don't claim to solve all DBA issues with diff, but here's a real
example from one of our customers: they are optimizing their ETL jobs in
BigQuery to lower their GCP bill by reducing query runtime. After refactoring
the code, they diff production vs. new code output to ensure that the data
produced hasn't been affected.

> If you really care about correctness it's better to use approaches like
> having focused test cases that check specific predicates on data.

Possibly, but

> They're also a pain to code

...which is often a prohibitive pain point if you have 300+ analytical
datasets with 50+ columns each (a common layout for companies of 250+ people).

And another problem: the more test cases, the more failures on every run, and
unlike unit testing of app code, you can't expect the cases to stay relevant,
since the data is changing constantly. Those "unit testing" test suites
require constant maintenance, and as soon as you stop updating them, their
value drops to 0.

I think that diffing and "unit testing" are complementary approaches and
neither one is a panacea. So my recommendation has been to use both: 1)
specific test cases to validate the most important assumptions about the data,
and 2) a diff tool for regression testing.

~~~
hodgesrm
I'm unconvinced your approach works beyond a narrow range of use cases. The
weakness is the "is this a problem" issue. You have a diff. Is it really
significant? If it's significant, how did it arise? You can spend an
inordinate amount of time answering those two questions, and you may have to
do it again with every run. Diffs are cheap to implement but costly to use
over time. That inversion of costs means users may end up bogged down
maintaining the existing mechanism and unable to invest in other approaches.

If I were going after the same problem I would try to do a couple of things.

1. Reframe the QA problem to make it smaller. Reducing the number and size of
pipelines is a good start. That has a bunch of knock-on benefits beyond
correctness.

2. Look at data cleaning technologies. QA on datasets is a variation on this
problem. For example, if you can develop predicates that check for common
safety conditions on data, like detecting bad addresses or SSANs, you give
users immediately usable quality information. There's a lot more you can do
here.

Assuming you are working on this project, I wish you good luck. You can
contact me at rhodges at altinity dot com if you want to discuss further. I've
been dealing with QA problems on data for a long time.

~~~
hodgesrm
p.s., to expand on #2: if you can "discover" useful safety conditions on data,
you change the economics of testing, much as #1 does.

------
stephane-klein
At the moment I use
[https://github.com/djrobstep/migra](https://github.com/djrobstep/migra) to
do PostgreSQL diffs.

~~~
random_savv
I also use migra - and love it. Migra only checks your schemas though, not
your data.

~~~
simonebrunozzi
From what I recall you are correct. But I also vaguely remember that there was
another little open source tool that integrated with Migra to check data as
well? Or was it a custom-made script?

Sorry, can't find it, I wish my comment could be more useful.

------
itpragmatik
1) Will it allow me to diff AWS RDS Aurora/MySQL serverA database schemaA
against AWS RDS Aurora/MySQL serverB database schemaB?

2) Are there APIs to initiate and view/parse the diffs that you generate, or
is it all through the UI?

~~~
hichkaker
Yes, you can diff across two physically distinct databases. MySQL support is
in progress.

There is a REST API that accepts the same parameters as the UI and returns
the diff results in JSON; optionally, it materializes tables in your DB with
row-level differences.

------
igorludi
To whom it may concern, we have written a paper on the subject (I'm not
affiliated with Datafold):
[https://www.researchgate.net/publication/323563341_A_Time-constrained_Algorithm_for_Integration_Testing_in_a_Data_Warehouse_Environment](https://www.researchgate.net/publication/323563341_A_Time-constrained_Algorithm_for_Integration_Testing_in_a_Data_Warehouse_Environment)

The paper describes the original algorithm with examples.

------
MaxwellM
Much needed! Analysts struggle with this all the time, trying to explain why
an analysis is off and having to manually debug every column in a new
database.

------
kveykva
In case you're unaware, your logo looks unfortunately a bit uncannily similar
to that of [https://www.sigmacomputing.com/](https://www.sigmacomputing.com/),
given you are both in a similar broader product category! I actually thought
the logo throughout your site was a reference to an integration with Sigma at
first.

~~~
hichkaker
Thank you for pointing it out. I'd never noticed, but now that you've brought
it up, we'll do a few more iterations on our design to avoid the collision.

------
greg
I got the chance to play with Datafold and I would have _loved_ to have had
it when I was working for Facebook on data pipelines.

------
nicoburns
I've recently discovered, and highly recommend, Daff [0]. It's an open source
tool that can diff CSVs and SQLite database tables. The tabular diff format
is fantastic.

[0]: [https://github.com/paulfitz/daff](https://github.com/paulfitz/daff)

------
brainless
Hey Gleb, congrats on the launch. This is an interesting tool.

As a founder of a product in the space of tools for data-driven companies, I
wanted to ask:

Is your product aimed entirely at data engineers? The description seemed very
technical, and the problem seems like one that mostly very large companies
would have. Did I understand correctly?

~~~
hichkaker
Thank you and that's a great question.

It largely depends on how critical the data pipelines are to the business.
We've seen big companies suffer the most from issues with their data quality.
At the same time, we've had early-stage teams use our diff tool because
their core product depends on the data they ingest from third-party vendors,
and regression testing of that data and the transformation code had been
taking them a lot of time.

The product in its current form could be useful to anyone developing data
transformations (which is what data engineers typically do full time), and we
are working on expanding it to help data consumers (analysts, PMs, etc.) have
higher confidence in the quality of datasets and metrics they rely on.

------
chaps
Can this be used locally for datasets with non-transferable PII? Thinking
about this for non-profit work.

~~~
hichkaker
Yes, we have on-prem AWS/GCP and K8S-based deployments available. We'd be
very happy to support your work; just shoot me an email at hn@datafold.com.

------
gregwebs
Great tool! I am only interested in running such a tool locally (on-prem).
This avoids security/privacy issues and data transfer time/cost issues.

A good model for me would be a 30-day free license to get it integrated into
our workflows.

~~~
hichkaker
Totally fair! We offer free trials for on-prem. Please ping us at
hn@datafold.com to set it up.

------
jjirsa
For Cassandra:
[https://github.com/apache/cassandra-diff](https://github.com/apache/cassandra-diff)

(Designed for correctness testing at petabyte scale)

------
sbr464
Can you import/work with .bcp files (Microsoft SQL Server bulk export files)?
For example, diffing 2 .bcp files, even if you need to import and set them up
as databases again.

~~~
hichkaker
If you can materialize those .bcp files as tables, then yes, absolutely.

------
lihan
How does it work behind the scenes? Does it simply sample a portion of the
data and then do the diff? What if I need 100% accuracy?

~~~
hichkaker
If diffing datasets within the same physical database, we generate SQL,
execute it in the database, then analyze and render the results.

If diffing datasets across physically different databases, e.g. PostgreSQL <>
Snowflake or two distinct MySQL servers, we pull data into our engine from
both sources, diff it, and show the results.

Sampling is optional but helpful to keep compute costs low for large
million/billion/trillion-row datasets.
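To give a flavor of the same-database case (a hypothetical illustration in
PostgreSQL syntax, not Datafold's actual generated SQL), the diff can be
compiled into a single aggregate query over a join on the primary key:

    -- Per-column mismatch counts for rows present on both sides
    SELECT COUNT(*) AS joined_rows,
           COUNT(*) FILTER (WHERE p.amount IS DISTINCT FROM d.amount) AS amount_diffs,
           COUNT(*) FILTER (WHERE p.status IS DISTINCT FROM d.status) AS status_diffs
    FROM orders_prod p
    JOIN orders_dev d ON p.id = d.id;

IS DISTINCT FROM is used so that NULLs compare cleanly.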

------
samblr
Congrats on the launch.

How does it compare to enterprise ETL tools like Informatica and Talend – is
it not possible to do this within them?

~~~
forrestb
At best, Informatica could output the diff of two tables to another table. But
the paradigm is very limiting. I'm not sure if you can dynamically change the
mappings every time you feed it a new table (different primary keys, columns,
data types), and you'd still have to browse the results in SQL.

~~~
samblr
Thank you, and not to undermine the work done in Datafold, but a visual
schema diff is a small improvement over giants like Informatica, which can
pretty much do anything in the database space.

It can be well argued that YC is taking in companies without realising what
already exists out there in the enterprise world. There are quite a few
no-code/API-stitching tools which launched recently as well, and those at
best resemble poor imitations of powerful BI tools that have existed for
almost a decade.

~~~
hichkaker
Those tools are definitely vastly powerful. Have you used either of them?

TBH, I haven't, but judging from our current post-Informatica users and from
reading questions on the official Informatica/Talend user forums, I concluded
that the diffing problem (to be specific – not only schema but data diffing)
is not directly addressed by them (the answers are in the realm of "there is
no diff feature but you can write SQL...").

In general, we see data stacks becoming increasingly modularized and tools
more specialized. For example, there are at least 20x more teams using OSS
like Airflow/Luigi/Dagster for managing their data pipelines (and 2-5 other
tools for the rest of the workflow) than using the end-to-end platforms you
mentioned. We see Datafold as the regression testing tool in a modular stack.

~~~
samblr
Thank you for your reply.

I used Talend in great depth 3 years ago, but I didn't have the schema diff
use case at the time. For data diff, you can easily define a workflow, and I
have to admit these workflows are crazy powerful and can even help fix the
data with any transformation required (no-code or code).

However, I'm seeing the use case for a lightweight tool with a visual aspect.
I like this. But whether this problem is big enough for VC investment is the
question. I can see schema diff being just a plugin in one of the existing
database tools, and if you are getting into data diff, you've got to see what
those tools do too.

~~~
hichkaker
> But will this problem be big enough for VC investment is the question?

That's a great question. Thinking about where problems arise in data
pipelines, there are fundamentally two moving pieces: 1) your data – you're
continuously getting new data without a real ability to enforce your
assumptions about its schema or shape; 2) your code for ingestion and
transformation, which needs to evolve with the business and adapt to changes
in other parts of the infra.

Datafold's diff tool currently mostly addresses #2. It can add value to any
company that runs ETL pipelines, but it is most impactful for large data
engineering teams (a similar story to CI or automated testing tools).

Regarding #1, wouldn't it be useful if we tracked ALL your datasets across
time and alerted you to anomalies in those datasets? And I am not talking
about rigid "unit" tests, e.g. X <= value < Y, but actual stats-based anomaly
detection, akin to what Uber does:
[https://eng.uber.com/monitoring-data-quality-at-scale/](https://eng.uber.com/monitoring-data-quality-at-scale/)

So, with diff, we already compute and store detailed statistical profiles on
every column in the table. Next, we are going to track those profiles across
time.

Diff is the first tool we've built to get a wedge into the workflows of
high-velocity data teams and start adding value, but it's just the beginning
of the more comprehensive and, hopefully, valuable product we aspire to
deliver.

~~~
samblr
Much appreciate your response

------
blunt69
Will there be integration with Oracle DBs?

~~~
_ZeD_
This

~~~
hichkaker
Yes, we can add support for Oracle quite soon if you are interested.

------
chwolfe
Very cool, nice work! SQL Server / Azure SQL support available or on the
roadmap?

~~~
hichkaker
Thank you!

We are [almost] DB-agnostic (as long as there is a SQL interface and JOINs
are supported) and have been adding support for particular DBs based on
interest from our users. SQL Server & Azure SQL are on the roadmap; please
let us know if you have a use case in mind to pilot Datafold on them.

------
FlashBlaze
It is interesting to see Ant Design used outside of my personal projects.

------
husseiny
Very cool! Does it work with MS SQL in our own DC?

~~~
hichkaker
Support for MS SQL is in progress. We can deploy in your DC using Docker or
K8S.

------
greenie_beans
love it: fucked up and then made a startup with a solution so you'll never
make that mistake again.

