We got frustrated with the time and effort required to code and maintain custom web scrapers, so we built an LLM-based solution that can extract data from any website in the format you want. AI should automate tedious and uncreative work, and web scraping definitely fits this description.
We're using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using GPT for every data extraction, as most comparable tools do, would be way too expensive and very slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
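The "generate once, run cheaply" idea can be sketched like this. Everything here is illustrative: the function names, the hardcoded selector table standing in for an LLM call, and the XPath-style selectors are our assumptions, not Kadoa's actual code. The expensive model call happens once per site to produce selectors; every subsequent page is processed with plain parsing, and the model is only consulted again if validation fails after a site change.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the one-time LLM call. In a real system a model
# would inspect the page structure and return selectors; here we hardcode
# the kind of answer it might give (purely illustrative).
def generate_selector_with_llm(sample_html: str, field: str) -> str:
    return {"title": ".//h2", "price": ".//span[@class='price']"}[field]

def extract(html: str, selectors: dict) -> dict:
    # Cheap, model-free extraction using the cached selectors.
    root = ET.fromstring(html)
    return {field: root.find(sel).text for field, sel in selectors.items()}

sample = """<html><body>
  <h2>Red Bicycle</h2><span class="price">199.99</span>
</body></html>"""

# Expensive step: run once per site (and re-run only when validation fails).
selectors = {f: generate_selector_with_llm(sample, f) for f in ("title", "price")}

# Cheap step: run on every page fetch with no model in the loop.
record = extract(sample, selectors)
print(record)  # {'title': 'Red Bicycle', 'price': '199.99'}
```

The same cached selectors are reused across thousands of pages; the per-extraction cost is a parse, not a model call.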
Try it out for free on our playground https://kadoa.com/playground and let us know what you think! And please don't bankrupt us :)
## How it works (the playground uses a simplified version of this):
- Loading the website: automatically decide what kind of proxy and browser we need
- Analysing network calls: try to find the desired data in the network calls
- Preprocessing the DOM: remove all unnecessary elements and compress it into a structure that GPT can understand
- Slicing: slice the DOM into multiple chunks while still keeping the overall context
- Selector extraction: use GPT (or Flan-T5) to find the selectors that locate the desired information
- Data extraction: extract the data in the desired format
- Validation: hallucination checks and verification that the data is actually on the website and in the right format
- Data transformation: clean and map the data (e.g. if we need to aggregate data from multiple sources into the same format); LLMs are great at this task too
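Three of the steps above (DOM preprocessing, slicing, and the hallucination check) can be sketched as follows. This is a toy stand-in, not Kadoa's implementation: the regexes, chunk sizes, and function names are all our assumptions. The ideas are what matter: strip what the model never needs but keep `id`/`class` (selectors depend on them), chunk with overlap so each slice retains context, and verify that an extracted value literally occurs on the page.

```python
import re

def preprocess(html: str) -> str:
    # Drop script/style blocks and comments the model never needs.
    html = re.sub(r"<(script|style)\b.*?</\1>", "", html, flags=re.S | re.I)
    html = re.sub(r"<!--.*?-->", "", html, flags=re.S)
    # Strip most attributes, but keep id/class, which selectors rely on.
    html = re.sub(r"\s+(?!(id|class)=)[\w-]+=\"[^\"]*\"", "", html)
    # Collapse whitespace to shrink the token count further.
    return re.sub(r"\s+", " ", html).strip()

def slice_dom(compressed: str, size: int = 200, overlap: int = 40):
    # Overlapping chunks so each slice keeps some surrounding context.
    step = size - overlap
    return [compressed[i:i + size]
            for i in range(0, max(len(compressed) - overlap, 1), step)]

def validate(value: str, raw_html: str) -> bool:
    # Hallucination check: the extracted value must occur on the real page.
    return value in raw_html

page = ('<html><head><style>p{color:red}</style></head>'
        '<body data-track="1"><p class="name">Ada Lovelace</p></body></html>')
compact = preprocess(page)     # style block and data-track gone, class kept
chunks = slice_dom(compact, size=30, overlap=10)
print(validate("Ada Lovelace", page))   # True
print(validate("Grace Hopper", page))   # False
```

A real pipeline would validate types and formats too (is the price numeric, is the date parseable), but substring presence already catches the most common failure mode of an LLM inventing plausible-looking values.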
The vision is a universal API for web data :)
EDIT: The heavy traffic is leading to rate limiting issues, sorry about that! You can still check the example extractions though.
There was a nice open-source browser-automation project called TaxyAI, built on GPT-4, posted on HN a few weeks back, but the author pulled the project and did a bait and switch after getting all the hype.