Great work! One of the things that would be incredibly useful/interesting would be generating a reusable script with an LLM, instead of just grabbing the data. In theory, this should result in a massive cost reduction (no need to call the LLM every time) as long as the source code doesn't change, which would make it sustainable for constant and frequent monitoring.
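Something like this, maybe (a hedged sketch; the cache file name, the prompt, the gpt-4o model choice and the `extract(html)` signature are all made up for illustration):

```ts
import { readFile, writeFile } from "node:fs/promises";
import OpenAI from "openai";

const openai = new OpenAI();

// Generate the extraction script once, cache its source, and reuse it on
// every later run: no further LLM calls while the page layout holds.
async function getExtractor(sampleHtml: string): Promise<(html: string) => unknown> {
  let source: string;
  try {
    source = await readFile("extractor.js", "utf8"); // cached from a previous run
  } catch {
    const res = await openai.chat.completions.create({
      model: "gpt-4o", // placeholder model choice
      messages: [{
        role: "user",
        content:
          "Write a plain JavaScript function `extract(html)` returning " +
          "{ title, price } for pages shaped like this sample. Code only.\n\n" +
          sampleHtml,
      }],
    });
    source = res.choices[0].message.content ?? "";
    await writeFile("extractor.js", source); // no LLM call needed next time
  }
  // Evaluating model output like this is for illustration only; sandbox it
  // (or at least review the generated file) before doing this for real.
  return new Function(`${source}; return extract;`)() as (html: string) => unknown;
}
```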
This approach was studied in the Evaporate+ paper (https://www.vldb.org/pvldb/vol17/p92-arora.pdf). They used active learning to pick the best function among candidate functions generated by the LLM on a sampled set of data.
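If I read that right, the selection step is roughly this (a simplified sketch of my understanding, not the paper's code; the sample labels would come from running the LLM directly on a few pages):

```ts
// Given k candidate extractors and a small LLM-labelled sample, keep the
// candidate that agrees with the labels most often.
type Extractor = (html: string) => string | null;

function pickBest(
  candidates: Extractor[],
  sample: { html: string; label: string }[],
): Extractor {
  let best = candidates[0];
  let bestScore = -1;
  for (const fn of candidates) {
    let score = 0;
    for (const { html, label } of sample) {
      try {
        if (fn(html) === label) score++; // agreement with the LLM's answer
      } catch {
        // candidates that crash on some pages simply score lower
      }
    }
    if (score > bestScore) {
      bestScore = score;
      best = fn;
    }
  }
  return best;
}
```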
I've worked on this exact problem when extracting feeds from news websites. Yes, calling the LLM each time is costly, so I use the LLM only on the first run to extract robust CSS selectors, then rely on those on subsequent runs instead of incurring further LLM cost.
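For anyone curious, the shape of this is roughly the following (a sketch; the prompt, the field names and the gpt-4o model choice are my own placeholders):

```ts
import * as cheerio from "cheerio";
import OpenAI from "openai";

const openai = new OpenAI();

// One-time LLM call: produce a field -> CSS selector map for this site.
async function getSelectors(sampleHtml: string): Promise<Record<string, string>> {
  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      content:
        "Return a JSON object mapping the fields title, author and date to " +
        "robust CSS selectors for pages like this one:\n\n" + sampleHtml,
    }],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}");
}

// Every subsequent run: apply the cached selectors, no LLM involved at all.
function applySelectors(html: string, selectors: Record<string, string>) {
  const $ = cheerio.load(html);
  return Object.fromEntries(
    Object.entries(selectors).map(([field, sel]) => [field, $(sel).first().text().trim()])
  );
}
```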
I'm working on this problem now. It's possible for some sources, whenever the HTML structure is regular enough to map it to the feature of interest, but the information can also be buried within free text, which makes it virtually impossible.
In my experience, Anthropic models are more steerable (require less prompting) than OpenAI's. For example, in code generation I'd tell GPT-4 not to include any comments, yet it would sometimes just ignore this. I haven't experienced this with Opus yet.
Interesting, so it takes a screenshot of the page in Playwright and then asks the LLM to parse the image and find the values corresponding to the keys in the schema? How expensive is it to run per webpage? Does it sometimes hallucinate, and if so, have you tested how often?
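(For context, my mental model of that flow is something like the sketch below; the model name, prompt and schema handling are placeholders, not the project's actual code.)

```ts
import { chromium } from "playwright";
import OpenAI from "openai";

const openai = new OpenAI();

async function scrapeViaScreenshot(url: string, schema: object) {
  // Render the page and capture it as a PNG buffer.
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);
  const shot = await page.screenshot({ fullPage: true });
  await browser.close();

  // Send the screenshot to a vision-capable model alongside the schema.
  const res = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{
      role: "user",
      content: [
        {
          type: "text",
          text: `Fill this JSON schema from the screenshot: ${JSON.stringify(schema)}`,
        },
        {
          type: "image_url",
          image_url: { url: `data:image/png;base64,${shot.toString("base64")}` },
        },
      ],
    }],
  });
  return res.choices[0].message.content;
}
```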
The LLM Scraper seems like a highly useful tool that can transform any webpage into structured data. This could be a significant advancement for data analytics and automation processes. I'm looking forward to seeing its practical applications and effectiveness.
Great work! I've worked on the same problem and used an LLM to extract feeds into structured data (in my case I have to use a more affordable model like GPT-3.5 for a SaaS app; looking at Llama 3 now).
Have you thought about automatically extracting the schema?
Awesome! The problem with extracting the schema automatically is that you won't know upfront what comes out of it, and it could change on every run. What I'm trying to do is enable scraping webpages in a structured (and type-safe!) manner.
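To illustrate what I mean by type-safe (a minimal sketch, not the exact API): a Zod schema pins down the output shape upfront, so whatever the model returns is validated against it on every run.

```ts
import { z } from "zod";

const Article = z.object({
  title: z.string(),
  author: z.string(),
  published: z.string(),
});
type Article = z.infer<typeof Article>; // static type derived from the schema

function parseModelOutput(raw: string): Article {
  // Throws if the LLM's JSON drifts from the schema on any run.
  return Article.parse(JSON.parse(raw));
}
```

If the site changes and the model starts returning a different shape, the parse fails loudly instead of silently feeding bad data downstream.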
The problem is not the scraping with an LLM. You need to solve the underlying problems like anti-bot systems and CAPTCHAs; these are the issues that prevent scraping at scale.