Hey HN,
This is Jan, founder of Apify, a web scraping and automation platform. Drawing on our team's years of experience, today we're launching Crawlee [1], the web scraping and browser automation library for Node.js that's designed for the fastest development and maximum reliability in production.
For details, see the short video [2] or read the announcement blog post [3].
Main features:
- Supports headless browsers with Playwright or Puppeteer
- Supports raw HTTP crawling with Cheerio or JSDOM
- Automated parallelization and scaling of crawlers for best performance
- Avoids blocking using smart sessions, proxies, and browser fingerprints
- Simple management and persistence of queues of URLs to crawl
- Written completely in TypeScript for type safety and code autocompletion
- Comprehensive documentation, code examples, and tutorials
- Actively maintained and developed by Apify—we use it ourselves!
- Lively community on Discord
To get started, visit https://crawlee.dev or run the following command: npx crawlee create my-crawler
If you have any questions or comments, our team will be happy to answer them here.
[1] https://crawlee.dev/
[2] https://www.youtube.com/watch?v=g1Ll9OlFwEQ
[3] https://blog.apify.com/announcing-crawlee-the-web-scraping-a...
I'm especially excited about the unified API for browser and HTML scraping, which is something I've had to hack on top of Scrapy in the past, and it really wasn't a good experience. That, along with puppeteer-heap-snapshot, will make the common case of "we need this to run NOW, you can rewrite it later" so much easier to handle.
While I'm not particularly happy to see JavaScript beginning to take over another field, as it truly is an awful language, more choice is always better, and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.