Show HN: LLM Scraper – turn any webpage into structured data (github.com/mishushakov)
88 points by ushakov on April 20, 2024 | 24 comments


Great work! One thing that would be incredibly useful/interesting would be generating a reusable script with an LLM, instead of just grabbing the data. In theory, this should result in a massive cost reduction (no need to call the LLM every time) as long as the page's source doesn't change, which would make it sustainable for constant, frequent monitoring.


This approach was studied in a paper on Evaporate+ (https://www.vldb.org/pvldb/vol17/p92-arora.pdf). They used active learning to pick the best function among candidate functions generated by the LLM on a sampled set of data.
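A minimal sketch of that selection step. Everything here is hypothetical: the candidate functions stand in for extractors an LLM might generate after seeing sampled pages, and the scoring uses a tiny labeled sample rather than the paper's actual method.

```python
import re

# Two hypothetical candidate extractors, standing in for functions an
# LLM might generate after seeing sampled documents.
def candidate_a(html):
    m = re.search(r'<span class="price">\$?([\d.]+)</span>', html)
    return m.group(1) if m else None

def candidate_b(html):
    m = re.search(r'(\d+\.\d{2})', html)
    return m.group(1) if m else None

def pick_best(candidates, labeled_sample):
    """Score each candidate on a small labeled sample; keep the winner."""
    def accuracy(fn):
        return sum(fn(html) == want for html, want in labeled_sample) / len(labeled_sample)
    return max(candidates, key=accuracy)

# Tiny labeled sample; candidate_a only handles the first markup style,
# candidate_b handles both, so candidate_b wins.
sample = [
    ('<span class="price">$9.99</span>', "9.99"),
    ('<div>Price: 12.50</div>', "12.50"),
]
best = pick_best([candidate_a, candidate_b], sample)
```

Once selected, the winning function runs on the full corpus for free, with no further LLM calls.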


I’ve worked on this exact problem when extracting feeds from news websites. Yes, calling the LLM each time is costly, so I use the LLM only on the first run to extract robust CSS selectors, then rely on those selectors on subsequent runs instead of incurring further LLM cost.
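The caching pattern described above can be sketched like this. `llm_extract_selectors` is a hypothetical stand-in for the expensive LLM call; here it just returns canned selectors so the cache behavior is visible.

```python
# Per-domain cache of LLM-proposed CSS selectors.
selector_cache: dict = {}
llm_calls = 0

def llm_extract_selectors(domain):
    """Hypothetical expensive LLM call that proposes CSS selectors."""
    global llm_calls
    llm_calls += 1
    return {"title": "h1.article-title", "body": "div.article-body"}

def selectors_for(domain):
    # Only hit the LLM on a cache miss; afterwards the cached
    # selectors are reused until the site's markup changes.
    if domain not in selector_cache:
        selector_cache[domain] = llm_extract_selectors(domain)
    return selector_cache[domain]

first = selectors_for("news.example.com")
second = selectors_for("news.example.com")  # served from cache, no LLM call
```

In practice you'd also invalidate the cache when the cached selectors stop matching, falling back to the LLM to re-derive them.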


Thank you! I’m currently working on supporting local LLMs via llama.cpp, so cost won’t be an issue anymore.


Given that the Ollama API is OpenAI-compatible, that should be a drop-in, no?


Not really, I believe it’s missing function calling

Edit: and grammar as well


Ahh yeah gotcha


I'm working on this problem now. It's possible for some sources — whenever the HTML structure is regular enough to map it to the feature of interest — but it can also happen that the information is buried in free text, which makes a reusable extractor virtually impossible.


This is a really nice idea. Wonder what the prompt would look like for that.


The biggest trick here is going to be costs. I have gotten scary OpenAI bills feeding websites into GPT-4, because cost scales with content size.


Definitely. Smaller models like Haiku are already pretty capable (and cheap!)


How does Haiku do with instruction following?


In my experience Anthropic models are more steerable (they require less prompting) than OpenAI's. For example, in code generation I'd tell GPT-4 not to include any comments, yet sometimes it would just ignore this. I have not experienced this with Opus yet.


Interesting, so it takes a screenshot of the page in playwright and then asks the LLM to parse the image and find the values corresponding to the keys in the schema? How expensive is it to run it per webpage? Does it sometimes hallucinate, and if so, have you tested how often?


The LLM Scraper seems like a highly useful tool that can transform any webpage into structured data. This could be a significant advancement for data analytics and automation processes. I'm looking forward to seeing its practical applications and effectiveness.


Operating modes are input, yes? Handling JS sites would be a huge improvement. Part of your plans?


Correct. JS sites are supported out of the box since we're using Playwright!


Nice! Markdown output would be an awesome addition


Great work! I’ve worked on the same problem and used an LLM to extract feeds into structured data (in my case I have to use a more affordable model like GPT-3.5 for a SaaS app; looking at Llama 3 now).

Have you thought about automatically extracting the schema?


Awesome! The problem with extracting the schema automatically is that you won't know upfront what comes out of it, and it could change on every run. What I'm trying to do is enable scraping webpages in a structured (and type-safe!) manner.
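llm-scraper itself expresses that fixed schema as a Zod schema in TypeScript; a minimal Python analogue of the "type-safe output" idea is to validate the LLM's JSON reply against a hand-declared schema (field names and types here are hypothetical):

```python
import json

# Hypothetical fixed schema: field name -> expected Python type.
SCHEMA = {"title": str, "price": float, "in_stock": bool}

def validate(raw):
    """Parse an LLM's JSON reply and enforce the schema."""
    data = json.loads(raw)
    out = {}
    for field, typ in SCHEMA.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], typ):
            raise TypeError(f"{field}: expected {typ.__name__}")
        out[field] = data[field]
    return out

reply = '{"title": "Widget", "price": 9.99, "in_stock": true}'
record = validate(reply)
```

With a fixed schema, every run yields the same shape; with an auto-extracted schema, downstream code can't rely on any particular fields being present.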


Oh neat, I'm working on the same thing with python and playwright.

I'm finding that the latency with web LLMs is a pain in the ass, and I'm hoping to switch to Llama 3 once I get it set up with function calling.


Awesome! Keep in mind there's already scrapeghost and entities-extraction-web-scraper in Python.

I've tried using it with Groq's Llama 3 70B and it worked well :)


The problem is not the scraping with an LLM. You need to solve the underlying problems like anti-bot measures and CAPTCHAs; these are the issues that prevent scraping at scale.


This is so sick haha and I'm loving this new norm of providing local options by default :)



