Ask HN: How to build data pipelines to continuously ingest customer files?
6 points by SpeakerFrThDead 11 months ago | 3 comments
I'm working on onboarding a customer who wants to deliver their data as a weekly dump of CSVs. We’re supposed to ingest that data and get it into our system to provide analytics for their team.

I was initially thinking of just building a one-off ETL script, but I was warned the files may randomly break spec (new/renamed fields, etc) due to errors in the process that generates them. Is there a standard way to handle this type of thing?




Tools-wise, there are many ways to do it, from simple custom scripts to more advanced solutions like a data lakehouse. It depends on the budget and the needs, I suppose.

> but I was warned the files may randomly break spec (new/renamed fields, etc) due to errors in the process that generates them. Is there a standard way to handle this type of thing?

This is a political decision. I think you have two options:

- Set a clearly defined schema, and validate against that. Whenever the input doesn't match the schema, put it in a separate 'error folder' and notify someone to edit the data so it can be retried.

- Accept schema changes, either by having some sort of automatic migrations or by simply not defining a schema at all (like a document store).

IMHO, the last option is not a good one if you want to build analytics on top of it, because you need a cleaned-up, structured version of the data that you can rely on.
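
For what it's worth, the first option (fixed schema plus an error folder) could look roughly like the sketch below. The column names, the `EXPECTED_SCHEMA` dict, and the `ingest` and `load` functions are all made up for illustration, and it assumes plain UTF-8 CSVs with a header row. Notifying someone is left to the caller, which gets the list of problems back.

```python
import csv
import shutil
from pathlib import Path

# Hypothetical schema: column name -> converter that raises on bad input.
EXPECTED_SCHEMA = {
    "customer_id": int,
    "order_date": str,
    "amount": float,
}


def validate_csv(path, schema=EXPECTED_SCHEMA):
    """Return a list of problems found in the file; an empty list means it is valid."""
    problems = []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames or []
        if set(header) != set(schema):
            missing = sorted(set(schema) - set(header))
            extra = sorted(set(header) - set(schema))
            problems.append(f"header mismatch: missing={missing} unexpected={extra}")
            return problems
        # Row numbers assume one record per physical line (header is line 1).
        for line_no, row in enumerate(reader, start=2):
            for column, convert in schema.items():
                try:
                    convert(row[column])
                except (TypeError, ValueError):
                    problems.append(f"line {line_no}: bad {column}={row[column]!r}")
    return problems


def ingest(path, error_dir, load):
    """Load a valid file via `load`; otherwise move it to `error_dir` and return the problems."""
    problems = validate_csv(path)
    if problems:
        error_dir = Path(error_dir)
        error_dir.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(error_dir / Path(path).name))
        return problems
    load(path)
    return []
```

Once someone has fixed the quarantined file, a retry is just a matter of dropping it back into the inbox and calling `ingest` again.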


- Automate schema detection and reporting of discrepancies (to both the client and you), and make that 100% robust.

- Get a good SLA in place (for example, you can’t promise to load their data within an hour if they can essentially send random crap).

- Add a zero to what you charge them.

Alternatively, walk away from that customer.
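
To make the schema detection and discrepancy reporting point a bit more concrete, here is a minimal sketch that compares an incoming header against the expected one and builds a report you could send to both sides. The function names and the example columns are hypothetical, and the rename detection is only a string-similarity guess via `difflib`, so a human should still confirm it.

```python
import difflib


def diff_header(expected, actual, cutoff=0.6):
    """Compare two column lists and classify the differences."""
    expected_set, actual_set = set(expected), set(actual)
    missing = [c for c in expected if c not in actual_set]
    added = [c for c in actual if c not in expected_set]
    renamed = {}
    # Pair each missing column with the most similar new column, if any.
    for col in list(missing):
        match = difflib.get_close_matches(col, added, n=1, cutoff=cutoff)
        if match:
            renamed[col] = match[0]
            missing.remove(col)
            added.remove(match[0])
    return {"missing": missing, "added": added, "possibly_renamed": renamed}


def format_report(filename, diff):
    """Render the diff as plain text suitable for an email or ticket."""
    if not any(diff.values()):
        return f"{filename}: header matches the agreed schema"
    lines = [f"{filename}: header differs from the agreed schema"]
    for col in diff["missing"]:
        lines.append(f"  missing column: {col}")
    for col in diff["added"]:
        lines.append(f"  unexpected column: {col}")
    for old, new in diff["possibly_renamed"].items():
        lines.append(f"  possibly renamed: {old} -> {new}")
    return "\n".join(lines)


if __name__ == "__main__":
    expected = ["customer_id", "order_date", "amount"]
    actual = ["customer_id", "order_dt", "amount", "region"]
    print(format_report("weekly_dump.csv", diff_header(expected, actual)))
```

Keeping the diff as structured data and formatting it separately means the same result can drive both the alert and any decision about whether to load the file.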


The standard way is to do it manually until there's enough experience to automate some of it away. Understanding the actual problem first is a direct step toward a robust solution to that problem. Building a non-trivial process is itself a process. Good luck.



