
They touch on something relevant here, and it's a great point to emphasise:

> The emphasis on preserving raw HTML proved vital when Tagesschau repeatedly altered their newsticker DOM structure throughout Q2 2020. This experience underscored a fundamental data engineering principle: raw data is king. While parsers can be rewritten, lost data is irretrievable.

I've done this before, keeping full, timestamped, versioned raw HTML. That still breaks if the site shifts to JavaScript-rendered content, but keeping your collection and processing as distinct as you can, so you can rerun things later, is incredibly helpful.

Usually, processing raw data is cheap. Recovering raw data is expensive or impossible.

As a bonus, collecting raw data is usually easier than collecting and processing it, so you might as well start there. Maybe you'll find out you were missing something, but it's no worse than if you'd tied things together.
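
A rough sketch of what keeping collection and processing distinct can look like, in Python with only the standard library. The function names (`save_snapshot`, `reprocess`) and the file layout are made up for illustration; this is not what the article's authors used, and fetching the page is left out on purpose.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(archive_dir: Path, url: str, body: bytes, fetched_at: datetime) -> Path:
    """Collection step: store the raw bytes untouched, plus a small metadata sidecar."""
    digest = hashlib.sha256(body).hexdigest()
    # A UTC timestamp in the filename keeps snapshots sortable by fetch time.
    stamp = fetched_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    archive_dir.mkdir(parents=True, exist_ok=True)
    raw_path = archive_dir / f"{stamp}_{digest[:12]}.html"
    raw_path.write_bytes(body)
    meta = {"url": url, "fetched_at": stamp, "sha256": digest}
    raw_path.with_suffix(".json").write_text(json.dumps(meta))
    return raw_path


def reprocess(archive_dir: Path, parse):
    """Processing step: rerun any parser over every stored snapshot, oldest first."""
    for raw_path in sorted(archive_dir.glob("*.html")):
        meta = json.loads(raw_path.with_suffix(".json").read_text())
        yield meta, parse(raw_path.read_bytes())
```

If the site changes its DOM, you write a new `parse` function and call `reprocess` again; nothing has to be refetched.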


> Huh? To find the specific date's news item corresponding to a given topic? Why not just predict the date range, e.g. "Apr-Aug 2022"?

They say they had to manually find the links to the right liveblog subpage. So they had to go to the main page, find the link and then store it.



