Hacker News

Wow, that's one of the most orange tag-rich posts I've ever seen.

We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse the news content. Our rule-based model for extracting data from any article works pretty well, and we have never found a way to improve it with GPT.

"Crawling" is much more interesting. We need to know all the places on a site where news articles can be published: sometimes that is 50+ sub-sections.

Interesting hack: I think many projects (including ours) can get away with generating the extraction code once per website, since the per-website structure rarely changes.

So, we're looking for an LLM to generate code that parses the HTML.

Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com



I’d love to look into this for a hobbyist project I’m working on. Wish you had self signup!



