Web Scraping on Autopilot with GPT-4

hubraumhugo · on May 9, 2023

Co-founder here, let me add some context :)

We got frustrated with the time and effort required to code and maintain custom web scrapers, so we built an LLM-based solution that can extract data from any website in the format you want. AI should automate tedious and uncreative work, and web scraping definitely fits this description.

We're using LLMs to generate web scrapers and data processing steps on the fly that adapt to website changes. Using GPT for every data extraction, as most comparable tools do, would be way too expensive and very slow, but using LLMs to generate the scraper code and subsequently adapt it to website modifications is highly efficient.
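To make the cost argument concrete, here is a minimal sketch of the "generate once, reuse, regenerate on breakage" idea. It is not Kadoa's implementation: `AdaptiveScraper`, `fake_generate`, and the regex patterns standing in for real CSS/XPath selectors are all hypothetical, and the LLM call is replaced by a stub so the example runs offline.

```python
import re
from typing import Callable, Dict, Optional

Selectors = Dict[str, str]  # field name -> pattern; a stand-in for CSS/XPath selectors


class AdaptiveScraper:
    """Reuses generated selectors and regenerates them only when they stop matching."""

    def __init__(self, generate: Callable[[str], Selectors]):
        self.generate = generate  # the expensive step (an LLM call in a real system)
        self.cache: Dict[str, Selectors] = {}
        self.generate_calls = 0

    @staticmethod
    def apply(html: str, selectors: Selectors) -> Optional[Dict[str, str]]:
        record = {}
        for field, pattern in selectors.items():
            match = re.search(pattern, html)
            if match is None:
                return None  # a selector broke, e.g. after a site redesign
            record[field] = match.group(1)
        return record

    def scrape(self, site: str, html: str) -> Optional[Dict[str, str]]:
        selectors = self.cache.get(site)
        if selectors is not None:
            record = self.apply(html, selectors)
            if record is not None:
                return record  # cheap path: no model call needed
        # Cache miss or stale selectors: pay for generation once, then cache.
        self.generate_calls += 1
        selectors = self.generate(html)
        self.cache[site] = selectors
        return self.apply(html, selectors)


def fake_generate(html: str) -> Selectors:
    """Hypothetical stub standing in for the LLM: picks a pattern from the markup."""
    cls = "price" if 'class="price"' in html else "cost"
    return {"price": rf'<span class="{cls}">([^<]+)</span>'}


scraper = AdaptiveScraper(fake_generate)
old_page = '<div><span class="price">19.99</span></div>'
new_page = '<div><span class="cost">21.50</span></div>'  # layout changed

for _ in range(3):
    print(scraper.scrape("shop.example", old_page))
print(scraper.scrape("shop.example", new_page))
print("generate calls:", scraper.generate_calls)
```

Four scrapes trigger only two generation calls here: one for the initial layout and one after the markup changes. Calling the model on every extraction would have cost four.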

Try it out for free on our playground https://kadoa.com/playground and let us know what you think! And please don't bankrupt us :)

## How it works (the playground uses a simplified version of this):

- Loading the website: automatically decide what kind of proxy and browser we need

- Analysing network calls: Try to find the desired data in the network calls

- Preprocessing the DOM: remove all unnecessary elements, compress it into a structure that GPT can understand

- Slicing: Slice the DOM into multiple chunks while still keeping the overall context

- Selector extraction: Use GPT (or Flan-T5) to find the desired information with the corresponding selectors

- Data extraction in the desired format

- Validation: Hallucination checks and verification that the data is actually on the website and in the right format

- Data transformation: Clean and map the data (e.g. if we need to aggregate data from multiple sources into the same format). LLMs are great at this task too
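As a rough illustration of the preprocessing, slicing, and validation steps listed above, the sketch below uses only the Python standard library. Everything in it is an assumption for illustration: the tag and attribute lists, the `DomCompressor`, `slice_lines`, and `validate` names, and the character budget are hypothetical and not taken from the actual product.

```python
from html.parser import HTMLParser
from typing import Dict, List

SKIP_TAGS = {"script", "style", "noscript", "svg"}  # assumed "unnecessary" elements
KEEP_ATTRS = {"id", "class", "href"}                # assumed selector-relevant attributes
VOID_TAGS = {"br", "img", "input", "meta", "link", "hr"}


class DomCompressor(HTMLParser):
    """Drops noisy elements and attributes, keeping an indented outline of the DOM."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines: List[str] = []  # compressed structure (tags and text)
        self.text: List[str] = []   # visible text only, used for validation
        self.skip_depth = 0
        self.depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        kept = " ".join(f'{k}="{v}"' for k, v in attrs if k in KEEP_ATTRS and v)
        self.lines.append("  " * self.depth + (f"<{tag} {kept}>" if kept else f"<{tag}>"))
        self.depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID_TAGS:
            self.depth = max(0, self.depth - 1)

    def handle_data(self, data):
        text = " ".join(data.split())
        if text and not self.skip_depth:
            self.lines.append("  " * self.depth + text)
            self.text.append(text)


def slice_lines(lines: List[str], max_chars: int, context: str) -> List[str]:
    """Splits the outline into chunks under a size budget, each prefixed with shared context."""
    chunks, current, size = [], [], 0
    for line in lines:
        if current and size + len(line) > max_chars:
            chunks.append(context + "\n" + "\n".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append(context + "\n" + "\n".join(current))
    return chunks


def validate(record: Dict[str, str], page_text: List[str]) -> Dict[str, bool]:
    """Basic hallucination check: every extracted value must appear in the page text."""
    joined = " ".join(page_text)
    return {field: value in joined for field, value in record.items()}


html = """
<html><head><title>Bikes</title><style>.x{color:red}</style></head>
<body>
  <script>track()</script>
  <div class="product" data-id="7"><h2>Road Bike</h2><span class="price">$1,200</span></div>
  <div class="product" data-id="8"><h2>Gravel Bike</h2><span class="price">$1,500</span></div>
</body></html>
"""
parser = DomCompressor()
parser.feed(html)
parser.close()

chunks = slice_lines(parser.lines, max_chars=80, context="page: Bikes")
checks = validate({"name": "Road Bike", "price": "$9,999"}, parser.text)
print(chunks[0])
print(checks)
```

In this toy run the made-up price `$9,999` fails validation because it never appears on the page, while `Road Bike` passes. A real pipeline would also need format checks and a smarter way of carrying context between chunks than a single prefix line.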

The vision is a universal API for web data :)

EDIT: The heavy traffic is leading to rate limiting issues, sorry about that! You can still check the example extractions though.

gremlinsinc · on May 9, 2023

email me when it works on crunchbase data or similar.

KRAKRISMOTT · on May 9, 2023

There was a nice open source browser automation project called TaxyAI, using GPT-4, posted on HN a few weeks back, but the author pulled the project and did a bait and switch after getting all the hype.

https://news.ycombinator.com/item?id=35344354

tyingq · on May 9, 2023

I understand it would be a point-in-time snapshot of the code with the open source license, and not maintained, but...

Wayback has it: https://web.archive.org/web/20230404124635/https://github.co...

And browser extensions are fairly easy to download and open anyway.

syntaxing · on May 9, 2023

Did they remove it to capitalize on it?

chevet · on May 9, 2023

I was curious and asked the author, but was unable to get an answer.

elwebmaster · on May 9, 2023

It doesn’t work for what I tried. More like a toy project?

wand3r · on May 9, 2023

I got rate limited using one of the examples (Yahoo Finance).

t_a_v_i_s · on May 9, 2023

What did you try?

theturtletalks · on May 9, 2023

I tried the Specialized bikes example and it said I need to use the API since it requires a proxy.