Wow, that's one of the most orange tag-rich posts I've ever seen.

We're doing a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse news content. Our rule-based model for extracting data from any article works pretty well, and we never could find a way to improve it with GPT.

"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.

Interesting hack: I think many projects (including ours) can get away with generating the extraction code once per website, since the per-website structure rarely changes.

So, we're looking for an LLM to generate code that parses the HTML.
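To make the idea concrete, here is a rough sketch of "generate once, reuse per site". It is not our production setup. Instead of real generated code it assumes the LLM emits a small per-domain rule table (tag + class per field), and the domain, field names, and class names are all made up for illustration. It also assumes reasonably well-formed HTML and uses only the Python standard library.

```python
from html.parser import HTMLParser

# Hypothetical per-site rules, imagined as produced once by an LLM and cached.
# Each field maps to a (tag, class) pair; every name here is invented.
SITE_RULES = {
    "example-news.com": {
        "title": ("h1", "headline"),
        "body": ("div", "article-body"),
    },
}

# Elements without a closing tag; they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class FieldExtractor(HTMLParser):
    """Collects the text inside the first element matching each rule."""

    def __init__(self, rules):
        super().__init__()
        self.rules = rules
        self.result = {field: "" for field in rules}
        self.active = None  # field currently being captured
        self.depth = 0      # nesting depth inside the captured element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.active is not None:
            self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for field, (rule_tag, rule_class) in self.rules.items():
            if tag == rule_tag and rule_class in classes and not self.result[field]:
                self.active, self.depth = field, 1
                return

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.active is None:
            return
        self.result[self.active] += " "  # keep adjacent elements separated
        self.depth -= 1
        if self.depth == 0:
            self.active = None

    def handle_data(self, data):
        if self.active is not None:
            self.result[self.active] += data


def extract(domain, html):
    """Apply the cached rules for `domain` and return whitespace-normalized fields."""
    parser = FieldExtractor(SITE_RULES[domain])
    parser.feed(html)
    parser.close()
    return {field: " ".join(text.split()) for field, text in parser.result.items()}
```

The point is that the LLM call happens once per site (or again when extraction starts failing), while the per-article work stays cheap and deterministic.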

Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com