Wow, that's one of the most orange tag-rich posts I've ever seen.
We're running a lot of tests with GPT-4o at NewsCatcher. We have to crawl 100k+ news websites and then parse the news content. Our rule-based model for extracting data from any article works pretty well, and we could never find a way to improve it with GPT.
"Crawling" is much more interesting. We need to know all the places where news articles can be published: sometimes 50+ sub-sections.
Interesting hack: I think many projects (including ours) can get away with generating the extraction code once per website, since the per-website structure rarely changes.
So we're looking at having an LLM generate the code that parses the HTML.
Happy to chat/share our findings if anyone is interested: artem [at] newscatcherapi.com