Hacker News new | past | comments | ask | show | jobs | submit login

Too large a chunk of my life was burnt the ETL from hell. Other than SQL, I can't see any other tool that would have helped.

I found the easiest way was just dump tables of data into long varchars - even the supposedly numeric stuff, because often enough it wasn't - then scrub & filter it from there. (NB, if you take anything from this post, make it that).

There were multiple challenges, but one that was most painful was the sheer filth mixed into the data. Too often even a human couldn't work out what the heck was supposed to be in some field. Often we could, but it ate a lot of time.

It doesn't sound like you ETLs are anything like mine; you start with pretty clean data. You are lucky.

Yikes, Iā€™m sorry. What were your application developers doing to persist this messy data?

It was an upgrade from a legacy system to its successor. I never used the legacy system so I don't know what checks they did. I suspect little to none; I guess they pretty much exposed the raw 'tables' to the user to insert whatever they wanted.

I wrote 'tables' because it wasn't a relational DB although the successor system was.

The amount of data was pretty minimal too, not much more than a dozen GB, but it ate up at least 4 man-years to convert. I still cringe thinking about it. Some parts were an interesting challenge but mostly just soul destroying grind.

well in the case of a system migration for consumer data; in many cases the developers may have moved on or even retired. In some cases operational data can have lived through pre RDBMS systems, migrations to RDBMS, migrations to different system software from different vendors, adhoc imports of data from other systems, bulk zero day currency conversions (to the Euro), different approaches for configuration of obscure product codes, bugs, systems that didn't understand leap years, developers and systems who didn't understand normalisation or timezones or decimal data types or rounding errors or character sets, crashes, 'manual' rollbacks and hacks to reset large batch jobs.

of course the value to the business of much of this data is highly questionable, but this isn't always an argument you win..

Applications are open for YC Winter 2022

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact