Hi! I'm Marcell, and I'm working on FetchFox (
https://fetchfoxai.com). It's a Chrome extension that lets you use AI to scrape any website for any data. I'd love to get your feedback.
Here's a quick demo showing how you can use it to scrape leads from an auto dealer directory. What's cool is that it scrapes non-uniform pages, which is quite hard to do with "traditional" scrapers: https://youtu.be/wPbyPSFsqzA
A little background: I've written lots and lots of scrapers over the last 10+ years. They're fun to write when they work, but the internet has changed in ways that make them harder to write. One change has been the increasing complexity of web pages due to SPAs and obfuscated CSS/HTML.
I started experimenting with using ChatGPT to parse pages, and it's surprisingly effective. It can take the raw text and/or HTML of a page, and answer most scraping requests. And in addition to traditional scraping thigns like pulling out prices, it can extract subjective data, like summarizing the tone of an article.
As an example, I used FetchFox to scrape Hacker News comment threads. I asked it for the number of comments, and also for a summary of the topic and tone of the articles. Here are the results: https://fetchfoxai.com/s/cSXpBs3qBG . You can see the prompt I used for this scrape here: https://imgur.com/uBQRIYv
Right now, the tool does a "two step" scrape. It starts with an initial page, (like LinkedIn) and looks for specific types of links on that page, (like links to software engineer profiles). It does this using an LLM, which receives a list of links from the page, and looks for the relevant ones.
Then, it queues up each link for an individual scrape. It directs Chrome to visit the pages, get the text/HTML, and then analyze it using an LLM.
There are options for how fast/slow to do the scrape. Some sites (like HN) are friendly, and you can scrape them very fast. For example here's me scraping Amazon with 50 tabs: https://x.com/ortutay/status/1824344168350822434 . Other sites (like LinkedIn) have strong anti-scraping measures, so it's better to use the "1 foreground tab" option. This is slower, but it gives better results on those sites.
The extension is 100% free forever if you use your OpenAI API key. It's also free "for now" with our backend server, but if that gets overloaded or too expensive we'll have to introduce a paid plan.
Last thing, you can check out the code at https://github.com/fetchfox/fetchfox . Contributions welcome :)
I also assume you don't check the robots.txt of websites?
I'm all for automating tedious work, but with all this (mostly AI-related) scraping, things are getting out of hand and creating a lot of headaches for developers maintaining heavily scraped sites.
related:
- "Dear AI Companies, instead of scraping OpenStreetMap, how about a $10k donation?" - https://news.ycombinator.com/item?id=41109926
- "Multiple AI companies bypassing web standard to scrape publisher sites" https://news.ycombinator.com/item?id=40750182