Show HN: Get structured website data with just a prompt
2 points by ericciarla 11 days ago
Hey everyone! Eric, Caleb, and Nick here from Firecrawl (YC S22).

We’re excited to announce the release of /extract - an endpoint that turns entire websites into structured data with just a prompt. With /extract, you can retrieve any information from anywhere on a website without being limited by crawling/scraping roadblocks or the typical context constraints of LLMs.

Here’s how our new /extract endpoint works: users provide a prompt and/or a desired output schema along with URLs. We leverage our existing index and our /map and /crawl endpoints to gather relevant context. For pre-indexed sites, we use a mixture of vector search, keyword search, and a custom classifier to identify the most relevant pages. Prompting and re-ranking steps then analyze user intent and score pages accordingly. Once identified, the relevant pages are batch scraped to retrieve fresh data.
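To make that concrete, here's a minimal sketch of what a call can look like in TypeScript using fetch. The request shape (urls, prompt, schema) follows the description above; check the Extract docs for the exact parameters and response format, since the endpoint may return results directly or via a job you poll.

    // Minimal sketch of an /extract request (TypeScript, Node 18+ fetch).
    // Field names follow the description above; see the docs for exact parameters.
    const res = await fetch("https://api.firecrawl.dev/v1/extract", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.FIRECRAWL_API_KEY}`,
      },
      body: JSON.stringify({
        urls: ["https://example.com/*"], // pages (or wildcard patterns) to pull from
        prompt: "List every product with its name, price, and availability.",
        schema: { // optional JSON Schema describing the desired output
          type: "object",
          properties: {
            products: {
              type: "array",
              items: {
                type: "object",
                properties: {
                  name: { type: "string" },
                  price: { type: "string" },
                  available: { type: "boolean" },
                },
              },
            },
          },
        },
      }),
    });
    console.log(await res.json()); // structured data (or a job id to poll, per the docs)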

For complex tasks, an AI agent determines the type of extraction needed and routes it to the appropriate pipeline. For example, extracting thousands of products dynamically creates a custom multi-entity schema, breaking the user’s schema into smaller parts that can be processed independently. This avoids relying on an LLM's small context window and lets us run the independent extractions in parallel and merge them at the end (a rough sketch follows).
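As a rough illustration of the split-and-merge idea (not our exact internal code; extractChunk below is a hypothetical stand-in for a single bounded LLM extraction call):

    // Conceptual sketch: break a large multi-entity extraction into independent
    // pieces, run them in parallel, and merge at the end. Names are illustrative.
    type Product = { name: string; price?: string };

    // Hypothetical helper: extract products from one page's scraped content
    // with a per-entity schema, so no single call needs the whole site in context.
    async function extractChunk(pageMarkdown: string): Promise<Product[]> {
      return []; // placeholder for one LLM structured-output call
    }

    async function extractAll(pages: string[]): Promise<Product[]> {
      const partials = await Promise.all(pages.map(extractChunk)); // independent, parallel
      return partials.flat(); // merge the partial extractions
    }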

We integrate structured outputs from OpenAI, small LLMs, and task-specific models for classification and prompting. The entire process is parallelized using BullMQ and Kubernetes on GCP GKE, which keeps it scalable and fast and lets it build on our existing Firecrawl scraping infrastructure. The result is a structured, intent-aligned response tailored to the user’s needs.
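For those curious about the pattern, here's a stripped-down BullMQ sketch (illustrative queue, job names, and config - not our production setup):

    // Sketch of the queue fan-out pattern with BullMQ (illustrative config).
    import { Queue, Worker } from "bullmq";

    const connection = { host: "localhost", port: 6379 }; // Redis connection

    // Producer: enqueue one job per page (or batch) to extract.
    const extractQueue = new Queue("extract", { connection });
    await extractQueue.add("page", { url: "https://example.com/pricing", prompt: "..." });

    // Consumer: workers like this run as pods on GKE, pulling jobs concurrently.
    new Worker(
      "extract",
      async (job) => {
        const { url, prompt } = job.data;
        // scrape the page, run the extraction model, return structured output
        return { url, extracted: {} };
      },
      { connection, concurrency: 10 },
    );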

When we started Firecrawl, we knew it wouldn't just be about web scraping. We realized that AI tools could process vastly more data than humans, and that traditional web search methods weren't designed for consumption at that scale. This opened up a new paradigm for information retrieval - one that requires quickly querying structured and unstructured datasets from across the web. We set out to make building web datasets at scale easy, and /extract is a major step toward that future.

If you want to try it out:

- Visit our landing page: https://www.firecrawl.dev/extract

- See the Extract documentation: https://docs.firecrawl.dev/features/extract

- Most of our work, including /extract, is open source: https://github.com/mendableai/firecrawl

That's all for now! Let us know any feedback on /extract.
