Show HN: Data Caterer – Data testing tool for any data source (github.com/data-catering)
2 points by pitah1 on Dec 4, 2023
Hi everyone, I'm Peter, a software engineer who has worked in the data space for the majority of my career. Data Caterer is a tool I've been itching to build, as I've spent a lot of time debugging data issues, manually generating data to test jobs/services, and replicating production-like data flows.

The usual way to solve this problem is to stream/load a subset of production data, mask it, and store it in a pre-prod environment. This works, but it can increase the risk of personal data leaking, since pre-prod environments may not have the same security measures in place (Optus data leak as an example: https://en.wikipedia.org/wiki/2022_Optus_data_breach), it requires opening network connections between prod and non-prod, and it may not cover all data scenarios (different permutations/combinations of values, unknown unknowns).

Another approach is to generate data yourself. A number of existing tools can help with data generation, but they lack the ability to clean up the generated data, generate both batch and event data, define relationships between datasets (for example, an account can have multiple transactions associated with it, or for each account-create event, the same account should exist in a CSV file), or validate data.
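To make the account/transaction relationship mentioned above concrete, here is a minimal sketch in plain Python. It is not Data Caterer's API; the function names, field names, and ID formats are all made up for illustration. The point is only that child records are generated from the parent records, so every foreign key is valid by construction.

```python
import random


def generate_accounts(n, seed=0):
    """Generate n hypothetical account records with unique IDs."""
    rng = random.Random(seed)
    return [
        {"account_id": f"ACC{i:06d}", "balance": round(rng.uniform(0, 10_000), 2)}
        for i in range(n)
    ]


def generate_transactions(accounts, max_per_account=3, seed=0):
    """Generate 1..max_per_account transactions for each account.

    Each transaction copies its account_id from an existing account,
    which is what keeps the two datasets referentially consistent.
    """
    rng = random.Random(seed)
    transactions = []
    for account in accounts:
        for _ in range(rng.randint(1, max_per_account)):
            transactions.append(
                {
                    "txn_id": f"TXN{len(transactions):08d}",
                    "account_id": account["account_id"],
                    "amount": round(rng.uniform(-500, 500), 2),
                }
            )
    return transactions


accounts = generate_accounts(5)
transactions = generate_transactions(accounts)

# Every transaction must point at an account that was actually generated.
account_ids = {a["account_id"] for a in accounts}
assert all(t["account_id"] in account_ids for t in transactions)
```

The same idea carries over to the event case: an account-create event and a CSV row would both be derived from one generated account record rather than generated independently.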

With this in mind, Data Caterer was built around a few ideas: it can be run anywhere, it is metadata driven and can discover metadata automatically, it both generates and validates data, and it lets users customise how it is run.

Main features include:

- Metadata discovery

- Batch or event data generation

- Maintain referential integrity across any dataset

- Create custom data generation scenarios

- Clean up generated data

- Validate data

- Suggest data validations
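As a rough illustration of how generation, validation, and clean-up fit together, here is a small plain-Python sketch. It does not use Data Caterer and does not reflect its API; the helper names, the rules, and the sample rows are assumptions chosen for the example. It writes test rows to a temporary CSV file, checks each row against named rules, and removes the generated data afterwards.

```python
import csv
import os
import shutil
import tempfile


def write_rows(path, rows):
    """Write a list of dicts to a CSV file with a header row."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


def is_number(value):
    """Return True if the string can be parsed as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def validate(path, rules):
    """Return (line_number, rule_name) for every rule a row fails."""
    failures = []
    with open(path, newline="") as f:
        # Data rows start on line 2 because line 1 is the header.
        for line_no, row in enumerate(csv.DictReader(f), start=2):
            for name, rule in rules.items():
                if not rule(row):
                    failures.append((line_no, name))
    return failures


# Hypothetical validation rules for a hypothetical transactions file.
rules = {
    "account_id_format": lambda r: r["account_id"].startswith("ACC"),
    "amount_is_numeric": lambda r: is_number(r["amount"]),
}

rows = [
    {"account_id": "ACC000001", "amount": "12.50"},
    {"account_id": "ACC000002", "amount": "not-a-number"},
]

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "transactions.csv")
try:
    write_rows(path, rows)
    failures = validate(path, rules)
finally:
    # Clean up the generated data whether or not validation succeeded.
    shutil.rmtree(workdir)

print(failures)
```

Putting the clean-up in a `finally` block means the generated file is removed even if a validation step raises, which is the property you want when tests run repeatedly against a shared environment.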

In terms of technical details, it is a Spark-based application that users interface with through the Scala/Java API or YAML files. It ships as a single Docker image that can be run by following the quick start (https://data.catering/get-started/docker/). For more details, check out the main website (https://data.catering/).

I believe there is room for improvement in data testing: we should be more proactive in detecting and resolving bugs related to data quality or system integration.

Any feedback is appreciated.


