Great Expectations: data testing, documentation, and profiling (greatexpectations.io)
35 points by polm23 3 days ago | 11 comments

We were experimenting with this a couple of years ago and were very impressed, but it didn't quite fit into our workflow at the time (mostly our fault). Really cool to see how far this has progressed; it's a really helpful tool that should standardize all sorts of necessary data checks, especially in gnarly, human-entered datasets.

I can't tell why I would want this based on a quick skim:

- Why check `expect_column_values_to_not_be_null` rather than setting the column `not null`?

- Why check `expect_column_values_to_match_strftime_format` rather than making the column a datetime?

- `expect_column_values_to_be_unique`? Really?

As far as I can tell all of the expectations can be implemented as constraints or (at worst) super-performant triggers. Sounds like a band-aid on terrible database design.
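To make the "just use constraints" position concrete, here is a minimal stdlib sketch of how two of those expectations become ordinary declarative constraints in a database you control. The table and column names (`claims`, `member_id`) are invented for illustration:

```python
import sqlite3

# In a database you control, NOT NULL and UNIQUE are constraints the engine
# enforces at insert time, not checks you run after the fact.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE claims (member_id TEXT NOT NULL UNIQUE)")
con.execute("INSERT INTO claims VALUES ('A1')")

try:
    con.execute("INSERT INTO claims VALUES (NULL)")  # violates NOT NULL
except sqlite3.IntegrityError as e:
    print("rejected:", e)

try:
    con.execute("INSERT INTO claims VALUES ('A1')")  # violates UNIQUE
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Both bad inserts are rejected by the engine, so only the first row survives.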

You’re missing the need this product addresses. It’s not about your database, it’s about the data someone sent you that you’re about to ETL into your database.

I used to work at a medical ML company, and the datasets we got from insurance companies and medical providers were generally awful. Occasionally they didn't even match the data dictionary the providers themselves had supplied. If you need to connect to or ingest outside data sources, or check the output of an ETL pipeline, this tool is extremely useful.

Likewise data quality from trustees in the financial sector can be pretty variable.

I suspect that areas where data crosses from one company to another (or even department to department) without some strongly enforced standard will have these sorts of issues fairly often and benefit from tools like this.

So, basically a test step in your ingest pipeline? I was under the impression that that's the bare minimum if we're talking about production. Why is a separate tool necessary? To standardize the pipeline tests? If that's the case, maybe there's another issue with engineering principles and basics that isn't being addressed, but I don't know.

I don't want to just be sceptical and dismiss anyone's effort on a free and open source tool, but since I can express what I think: I believe it will yield more overhead than anything, at least for my use cases so far, since testing the data requires far more steps than analyzing a single dump for coherence. While I can see the advantage of such a tool when exchanging interface definitions with data providers or other people, I consider it yet another thing to add to the infrastructure, which obviously comes at a non-zero cost, for a problem that can ultimately be solved with documentation enforced through sane devops methods.

It is the bare minimum to do data tests throughout your pipeline, but a good, open source, structured framework that I don’t have to write myself is kinda nice ;)

I can then separate testing from the transformation code (which yes, should be tested too in unit tests and the like), and use this tool as an audit to say, “did my code do the right thing on the dataset in this latest ETL?”, and “where did it screw up?”

You can also use it to write a specific data spec and rely on the tool to guarantee that spec. It’s a lot easier to write a spec than the tool that does the checking. That spec also documents your data standard, is machine readable, and versionable.

Ideally the cost is low: you spend 15 minutes with the client's data dictionary, write the spec, and then go have a coffee while the tool finds and highlights where the dataset is out of spec.
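The spec-then-check workflow described above can be sketched in a few lines of stdlib Python. This is a toy, not Great Expectations' actual spec format; the columns (`member_id`, `dob`) and the check names are made up for illustration. The point is that the spec is plain data: machine-readable, versionable, and separate from the code that enforces it:

```python
from datetime import datetime

# A hypothetical, versionable mini-spec: plain data describing the checks.
spec = [
    {"column": "member_id", "check": "not_null"},
    {"column": "member_id", "check": "unique"},
    {"column": "dob", "check": "strftime", "format": "%Y-%m-%d"},
]

def validate(records, spec):
    """Run every rule in the spec; return the rules that failed and where."""
    report = []
    for rule in spec:
        col = [r.get(rule["column"]) for r in records]
        bad = []
        if rule["check"] == "not_null":
            bad = [i for i, v in enumerate(col) if v is None]
        elif rule["check"] == "unique":
            seen = set()
            for i, v in enumerate(col):
                if v in seen:
                    bad.append(i)
                seen.add(v)
        elif rule["check"] == "strftime":
            for i, v in enumerate(col):
                try:
                    datetime.strptime(v, rule["format"])
                except (TypeError, ValueError):
                    bad.append(i)
        if bad:
            report.append({"rule": rule, "failing_rows": bad})
    return report

records = [
    {"member_id": "A1", "dob": "1980-05-17"},
    {"member_id": "A1", "dob": "1900-01-32"},  # duplicate id, impossible date
]
print(validate(records, spec))
```

Here the second record trips both the uniqueness rule and the date-format rule, and the report says exactly which rule failed on which row.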

That does sound useful, but at the same time I can't help thinking that it would be better if the validation was pushed further up the stack, because trying to fix someone else's bad data is just going to end in worse data.

I'd have loved that at the time! But clients don't like it when we ask them to fix these things.

Less flippantly, with large organizations, the people giving you the data aren't necessarily the people who entered the data or have the ability or skills to fix it. They may also be legally prohibited from altering the data, so correcting errors can get tricky.

Obviously, for downstream analytics you need good data, so we have to fix it as well as possible. Usually that involved a lot of back and forth with the client. We'd tell them things like, "you have a lot of dates set to January 1, 1900; we'll treat those as undefined, does that make sense?" That's why a tool like Great Expectations, especially as it adds data introspection features, can be super helpful. I can write a quick spec, compile a report, and send the client that report in record time. Even without major errors, the client has all sorts of tribal knowledge about how that database is used, and often you have to deduce those rules or do good data introspection to ask the client the right questions.
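The January 1, 1900 example is the classic sentinel-value problem, and the introspection step is easy to sketch: count how often each date appears and flag anything suspiciously over-represented. A minimal stdlib sketch, assuming a hypothetical date column and an arbitrary 25% threshold:

```python
from collections import Counter

# Flag dates that are suspiciously over-represented, e.g. 1900-01-01
# being used as a stand-in for "unknown". Sample data is invented.
dates = ["1900-01-01", "1985-06-02", "1900-01-01", "1900-01-01", "1991-11-23"]

counts = Counter(dates)
n = len(dates)
suspects = {d: c for d, c in counts.items() if c / n > 0.25}
print(suspects)  # -> {'1900-01-01': 3}
```

A finding like this doesn't get "fixed" automatically; it becomes a question to send back to the client, exactly as described above.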

Some of those "terrible databases" are Redshift, SQLite, and Elasticsearch. Redshift does not enforce unique constraints, and SQLite does not have a native datetime type.
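The SQLite point is easy to demonstrate with the stdlib: a column declared `DATETIME` only gets NUMERIC type affinity, so the engine will happily store arbitrary text in it. The table name and value here are made up:

```python
import sqlite3

# SQLite has no real datetime type: "DATETIME" is just a column declaration
# with NUMERIC affinity, so any text is accepted and stored as-is.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (created DATETIME)")
con.execute("INSERT INTO t VALUES ('not a date at all')")
stored = con.execute("SELECT created FROM t").fetchall()
print(stored)  # -> [('not a date at all',)]
```

So a format check like `expect_column_values_to_match_strftime_format` is doing work the engine genuinely cannot do for you here.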

Sure, a trigger can be used. But triggers can also be disabled, so you end up verifying that all the data actually went through the trigger anyway.

Because they can’t afford a real database.

So it's a bit like a JSON schema, but with more fine grained checks? Very interesting.
