- Why check `expect_column_values_to_not_be_null` rather than setting the column `not null`?
- Why check `expect_column_values_to_match_strftime_format` rather than making the column a datetime?
- `expect_column_values_to_be_unique`? Really?
As far as I can tell, all of these expectations can be implemented as constraints or, at worst, as highly performant triggers. This sounds like a band-aid for terrible database design.
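For concreteness, here's a rough sketch of that point using SQLite: the three expectations above expressed as schema constraints, so bad rows are rejected at write time instead of being flagged after the fact. The table and column names are made up for illustration, and the `GLOB` pattern is only a crude stand-in for a real strftime-format check.

```python
import sqlite3

# The same guarantees as the expectations above, expressed as constraints.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        id      INTEGER PRIMARY KEY,
        user_id TEXT NOT NULL,   -- expect_column_values_to_not_be_null
        email   TEXT UNIQUE,     -- expect_column_values_to_be_unique
        -- rough stand-in for expect_column_values_to_match_strftime_format
        ts      TEXT CHECK (ts GLOB '[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]')
    )
""")
conn.execute("INSERT INTO events (user_id, email, ts) VALUES ('u1', 'a@x.com', '2024-01-01')")

# Each bad row is rejected on insert by the corresponding constraint.
rejected = []
for stmt in (
    "INSERT INTO events (user_id, email, ts) VALUES (NULL, 'b@x.com', '2024-01-02')",
    "INSERT INTO events (user_id, email, ts) VALUES ('u2', 'a@x.com', '2024-01-03')",
    "INSERT INTO events (user_id, email, ts) VALUES ('u3', 'c@x.com', 'not-a-date')",
):
    try:
        conn.execute(stmt)
    except sqlite3.IntegrityError as exc:
        rejected.append(str(exc))

print(rejected)  # one rejection per bad row
```

Of course, this only works when you control the schema the data lands in, which is exactly the situation the replies below argue is rare with outside data.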
I used to work at a medical ML company; the datasets we got from insurance companies and medical providers were generally awful. Occasionally the data didn't even match the data dictionary they themselves provided. If you need to connect to or ingest outside data sources, or check the output of an ETL pipeline, this tool is extremely useful.
I suspect that areas where data crosses from one company to another (or even department to department) without some strongly enforced standard will have these sorts of issues fairly often and benefit from tools like this.
I don't want to just be sceptical and dismiss anyone's effort on a free and open-source tool, but since I can speak my mind: I believe it will yield more overhead than anything else, at least for my use cases so far, since testing the data requires far more steps than analyzing a single dump for coherence.
While I can see the advantage of such a tool when exchanging interface definitions with data providers or other people, I consider it yet another thing to add to the whole infrastructure, which obviously comes at a non-zero cost, and a problem that can ultimately be solved with documentation enforced through sane DevOps methods.
I can then separate testing from the transformation code (which yes, should be tested too in unit tests and the like), and use this tool as an audit to say, “did my code do the right thing on the dataset in this latest ETL?”, and “where did it screw up?”
You can also use it to write a specific data spec and rely on the tool to guarantee that spec. It's a lot easier to write a spec than the tool that does the checking. That spec also documents your data standard, and it's machine-readable and versionable.
Ideally the cost is low: you spend 15 minutes with the client's data dictionary, write the spec, and then go have a coffee while the tool finds and highlights where the dataset is out of spec.
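The spec-then-check workflow is easy to sketch. Here's a hedged, pure-Python illustration (not the real Great Expectations API; the column names and rules are invented for the example): a small declarative spec, and a checker that reports exactly where each row falls out of spec.

```python
from datetime import datetime

# Hypothetical spec: which columns must be non-null, unique, or date-formatted.
spec = {
    "member_id": {"not_null": True, "unique": True},
    "dob":       {"strftime": "%Y-%m-%d"},
}

def validate(rows, spec):
    """Return (row_index, column, problem) tuples for every out-of-spec value."""
    problems = []
    seen = {}
    for i, row in enumerate(rows):
        for col, rules in spec.items():
            val = row.get(col)
            if rules.get("not_null") and val is None:
                problems.append((i, col, "null"))
                continue  # skip further checks on a missing value
            if rules.get("unique"):
                if val in seen.setdefault(col, set()):
                    problems.append((i, col, "duplicate"))
                seen[col].add(val)
            fmt = rules.get("strftime")
            if fmt and val is not None:
                try:
                    datetime.strptime(val, fmt)
                except ValueError:
                    problems.append((i, col, "bad date format"))
    return problems

rows = [
    {"member_id": "A1", "dob": "1979-03-02"},
    {"member_id": "A1", "dob": "1900-01-01"},   # duplicate id
    {"member_id": None, "dob": "03/02/1979"},   # null id, wrong date format
]
problems = validate(rows, spec)
for issue in problems:
    print(issue)
```

The point of the real tool is that the checker is already written and battle-tested; you only maintain the spec.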
Less flippantly, with large organizations, the people giving you the data aren't necessarily the people who entered the data or have the ability or skills to fix it. They may also be legally prohibited from altering the data, so correcting errors can get tricky.
Obviously, for downstream analytics you need good data, so we have to fix it as well as possible. Usually that involved a lot of back and forth with the client. We'd tell them things like, "you have a lot of dates set to January 1, 1900; we'll treat those as undefined, does that make sense?" That's why a tool like Great Expectations, especially as it adds data-introspection features, can be super helpful. I can write a quick spec, compile a report, and send the client that report in record time. Even without major errors, the client has all sorts of tribal knowledge about how that database is used, and often you have to deduce those rules or do good data introspection to ask the client the right questions.
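That "January 1, 1900" check is a nice small example of the kind of thing you'd put in the report. A rough pure-Python sketch (the sentinel values and column names here are assumptions for illustration, not from any real dataset): count how often each date column holds a known placeholder, so you can send the client the tally and ask whether to treat those as undefined.

```python
from collections import Counter

# Assumed placeholder dates that commonly mean "no real value".
SENTINELS = {"1900-01-01", "0001-01-01", "9999-12-31"}

def sentinel_report(rows, date_cols):
    """Count sentinel-date occurrences per column, for the client report."""
    counts = Counter()
    for row in rows:
        for col in date_cols:
            if row.get(col) in SENTINELS:
                counts[col] += 1
    return counts

rows = [
    {"admit_date": "1900-01-01", "discharge_date": "2020-05-01"},
    {"admit_date": "2020-04-28", "discharge_date": "1900-01-01"},
    {"admit_date": "1900-01-01", "discharge_date": "1900-01-01"},
]
counts = sentinel_report(rows, ["admit_date", "discharge_date"])
print(counts)
```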
Sure, a trigger can be used. But triggers can also be disabled, so you end up having to verify that all the data actually passed through the trigger anyway.