Show HN: Crawlee – Web scraping and browser automation library for Node.js (crawlee.dev)
282 points by jancurn on Aug 23, 2022 | 80 comments
Hey HN,

This is Jan, founder of Apify, a web scraping and automation platform. Drawing on our team's years of experience, today we're launching Crawlee [1], the web scraping and browser automation library for Node.js that's designed for the fastest development and maximum reliability in production.

For details, see the short video [2] or read the announcement blog post [3].

Main features:

- Supports headless browsers with Playwright or Puppeteer

- Supports raw HTTP crawling with Cheerio or JSDOM

- Automated parallelization and scaling of crawlers for best performance

- Avoids blocking using smart sessions, proxies, and browser fingerprints

- Simple management and persistence of queues of URLs to crawl

- Written completely in TypeScript for type safety and code autocompletion

- Comprehensive documentation, code examples, and tutorials

- Actively maintained and developed by Apify—we use it ourselves!

- Lively community on Discord

To get started, visit https://crawlee.dev or run the following command: npx crawlee create my-crawler
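
To give a feel for the API, here's a minimal sketch of a crawler (a CheerioCrawler that logs page titles and follows links; see the docs for full, up-to-date examples):

    import { CheerioCrawler } from 'crawlee';

    const crawler = new CheerioCrawler({
        // Called for every page; $ is a Cheerio handle over the downloaded HTML.
        async requestHandler({ request, $, enqueueLinks, log }) {
            log.info(`${request.url}: ${$('title').text()}`);
            // Add links found on the page to the request queue.
            await enqueueLinks();
        },
    });

    await crawler.run(['https://crawlee.dev']);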

If you have any questions or comments, our team will be happy to answer them here.

[1] https://crawlee.dev/

[2] https://www.youtube.com/watch?v=g1Ll9OlFwEQ

[3] https://blog.apify.com/announcing-crawlee-the-web-scraping-a...




Looks like you took the good ideas from Scrapy's crawling engine and combined them with a great scraping API, which is all I ever wanted in a bot framework!

I'm especially excited about the unified API for browser and HTML scraping, which is something I've had to hack on top of Scrapy in the past and it really wasn't a good experience. That, along with puppeteer-heap-snapshot, will make the common case of "we need this to run NOW, you can rewrite it later" so much easier to handle.

While I'm not particularly happy to see JavaScript begin taking over another field as it truly is an awful language, more choice is always better and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.


> While I'm not particularly happy to see JavaScript begin taking over another field as it truly is an awful language, more choice is always better and this project looks valuable enough to make dealing with JS a worthwhile tradeoff.

Love that comment :D

Yeah, the ability to switch between headless and http is very important to us in production. We often hack something up quickly with headless and then later optimize it to use HTTP when we find the time.
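
Roughly, the switch looks like this (a sketch, not production code): the crawler class and the page-access API change, but the handler structure stays the same, which is what makes the later rewrite cheap.

    import { PlaywrightCrawler, CheerioCrawler } from 'crawlee';

    // First pass: headless browser, quick to get working.
    const headlessCrawler = new PlaywrightCrawler({
        async requestHandler({ request, page }) {
            console.log(request.url, await page.title());
        },
    });

    // Later optimization: same shape, but plain HTTP + Cheerio parsing.
    const httpCrawler = new CheerioCrawler({
        async requestHandler({ request, $ }) {
            console.log(request.url, $('title').text());
        },
    });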


People trash JavaScript while using Python? They should look in the mirror!

Anyone can trash JS, except users of all these C-like languages with boringly similar designs, and doubly so Python. Python itself took over the world because amateurs (scientists and other professions, as opposed to programmers) can easily play with it.


Python is anything but a C-like language; I really don't know what you're talking about.


Because of indentation? Gimme a break, most of the semantics are the same. Different would be something like Lisp or SML.


> I'm not particularly happy to see JavaScript begin taking over another field as it truly is an awful language

Sorry, but this irked me. What exactly are your hangups with JS? It's just a JIT-compiled, dynamically typed language. By design.

It has its quirks for sure, but again, it's just a language. Truly awful it isn't.


Node.js dependency hell that often makes Java's transitive dependencies on bigger projects look like hello world? Exception handling? Performance? No real multithreading, AFAIK?

I have no skin in the game (anymore), but boy, the sentiment repeated a gazillion times over the past decade about JavaScript creeping into places it shouldn't be strongly resonated with me back then (not about this project in particular, just a rant in general).


> node.js dependency hell

You have one file that lists your dependencies, and it comes as standard with the platform, compared to Java & co. where you have lots of flavours. How many dependencies your dependencies pull in is up to your own choices.

> Exception handling?

Synchronous code uses the standard try / catch method. Asynchronous code has been using async / await combined with try / catch, a pattern that Node.js popularized and that languages like Rust and Python later adopted.
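
For what it's worth, a trivial sketch of that pattern (assuming Node 18+ for the global fetch):

    async function fetchJson(url: string): Promise<unknown> {
        try {
            const res = await fetch(url);   // async operation
            return await res.json();
        } catch (err) {
            // Rejections surface here just like synchronous exceptions.
            console.error(`Request to ${url} failed:`, err);
            throw err;
        }
    }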

> Performance?

V8 and Node.js are pretty much the fastest dynamic-language platform. Years ago, when companies started switching to Node.js, many of them actually reduced the number of servers they needed after moving from Java to Node.js.

> No real multithreading afaik?

I/O is multithreaded through libuv and c-ares. The rest can simply run as a multi-process app. Worker threads have also been introduced more recently. In any case, that wouldn't be an issue for crawling, which doesn't require multithreading.

JavaScript is the language used on the front end, so it seems like the most fitting language for crawling and scraping, especially since the introduction of headless browsers.


Nothing against dynamically typed languages; Python is one as well. Maybe "awful" was a bit of a strong word, but to quickly summarise my issues with JS:

- all the syntax and duck-typing footguns (see all those "JS is weird" memes)

- the complexity of the build chain, especially if you want typing support

- the unmanageable dependency mess that seems to be the norm in the ecosystem


Hi! It looks really REALLY cool!

Is there any kind of detection/stealthiness benchmark compared to libraries such as puppeteer-stealth or fakebrowser?

Honestly, no matter how feature-complete and powerful a scraping tool is, the main "selling point" for me will always be stealthiness/human-like behavior, even if the dev experience is crappy. (And IMHO that's the same for most serious scrapers/bot makers.)

Will it always be free, or could it turn into a product/paid SaaS (kind of like Browserless)? I'm kind of wondering if it's worth learning if the next cool features are going to be for paying users only.

Is this something that you use internally or is it just a way to promote your paid products?

Thanks :)


> for me will always be stealthiness/human like behavior no matter how crappy the dev experience is

Can't say I agree. The biggest value for me is being able to respond to site changes quickly. Having a key bot offline for an extended period of time can be costly, so being able to update, test and deploy it quickly is a big selling point. The vast majority of sites, including major companies, have very rudimentary bot detection, and a high-quality proxy provider is often all you need to bypass it.

As for the advanced methods like reCAPTCHA v3 and Cloudflare, I don't know of any framework that passes those out of the box anyway, so you might as well use something that's easy to hack on and implement your own bypasses as necessary.


We do a lot of web scraping (hundreds of millions of requests, multiple terabytes of data per month) and have been using Crawlee - previously known as the Apify SDK - since its v0.20 days. We adopted it for exactly this reason. It's extremely versatile and very pleasant to build on. The combination of Node.js and Crawlee's modular SDK offers a sweet spot for scraping that IMHO is light years ahead of anything else.

It helps too that the Apify devs themselves are nice and super responsive (we've had quite a few PRs merged over the last couple of years). The SDK code (and supporting libs like browser-pool and got-scraping) is clean and very easy to read/follow/extend (happy to hear, too, that the license is going to remain unchanged).


I'm not aware of a benchmark, but puppeteer-extra-plugin-stealth can be detected: https://datadome.co/bot-management-protection/detecting-head...

Crawlee does appear to do the basic checks though, like checking navigator.webdriver: https://github.com/apify/crawlee/blob/master/test/browser-po...

Last time I checked (over a year ago) I couldn't find any public code to make Chrome/Firefox properly undetectable.

That said, going to extreme lengths to be undetectable is rarely necessary, because some sites will serve up CAPTCHAs to real people on clean, uncompromised residential connections anyway.


Hey! Crawlee uses the libraries from our fingerprint suite internally. https://github.com/apify/fingerprint-suite#performance

It has an A rating in BotD (FingerprintJS) detection. Now we're working on improving the CreepJS detection. That one is really tough, though. Not even sure if anybody would use it in production environments, as it must throw a lot of false positives.

It will always be free and maintained, because we're using it internally in all of our projects. We thought about adding a commercial license like Docker's: open source, but paid if you have more than $10M in revenue or more than 250 employees. But in the end we decided not to do even that, so it's free and will always be free.


Hi! Very cool project. Just out of curiosity, what trips up Crawlee on CreepJS? I haven't heard of anyone actually using it in production (actually don't think it's meant for production use). It's certainly overzealous in its aggregate "trust score", but (a) it seems like a good benchmark to aim for; (b) some of its sub-scores, like "stealth" and "like headless", might be helpful for Crawlee to evaluate, given the signals included in those analyses are fairly simple for people to throw together in their own custom (production) bot detection scripts and are somewhat ubiquitous.


With fingerprints it's a tradeoff between having enough of them for large-scale scraping and staying consistent with your environment. E.g. you can get exponentially more combinations if you also use Firefox, WebKit, macOS and Windows user-agents (and prints) when you're actually running Chrome on Linux, but you also expose yourself to the better detection algorithms. If you stick to Linux Chrome-only prints (which is what you usually run in VMs), you'll be less detectable, but might get rate limited.


Hi there!

We don't have any benchmarks for Crawlee just yet, but we are working on those as we speak. We care deeply about bot detection; one of the features of Crawlee is generated fingerprints based on real browser data we gather - you can read more about it in the https://github.com/apify/fingerprint-suite repository, which is used under the hood in Crawlee. For scraping via HTTP requests (e.g. Cheerio/JSDOM), we developed a library called got-scraping (https://github.com/apify/got-scraping) that tries to mimic real browsers while making fast HTTP requests.
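
For the HTTP side, got-scraping usage looks roughly like this (a sketch; see the repo README for the full set of options):

    import { gotScraping } from 'got-scraping';

    // Sends browser-like headers and TLS settings so plain HTTP requests
    // blend in with regular browser traffic.
    const { statusCode, body } = await gotScraping({ url: 'https://example.com' });
    console.log(statusCode, body.length);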

Crawlee is and always will be open source. It originated from the Apify SDK (http://sdk.apify.com), which is a library to support development of so-called Actors on the Apify platform (http://apify.com) - so you can see it as a way for us to improve the experience of our customers. But you can use it anywhere you want; we provide ready-to-use Dockerfiles for each template.


This looks cool at first glance. I'll dig into it more.

One note that may be helpful: if all you care about is the HTML, it's better to take a "snapshot" of the page by streaming the response directly to blob storage like S3. That way, if something fails and you need to retry, you can reference the saved raw data from storage instead of making another request and potentially getting blocked. Node pipelines make it really easy to chain this stuff together with other logic.
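
Roughly, with Node's stream pipeline (a sketch; the URL and destination are placeholders, and an S3 upload stream can stand in for the file stream):

    import { pipeline } from 'node:stream/promises';
    import { createWriteStream } from 'node:fs';
    import got from 'got';

    // Stream the raw response straight to storage so a failed parse can be
    // retried from the snapshot instead of re-requesting (and risking a block).
    await pipeline(
        got.stream('https://example.com/some-page'),      // placeholder URL
        createWriteStream('snapshots/some-page.html'),    // swap in an S3 upload stream here
    );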

For reference, I run a company that does large scale scraping / data aggregation.


Yeah I agree, keeping the source HTML is great for debugging or retro-fixing issues. We also like to take screenshots on important errors, when running headless.


I see you basically recommend bypassing rate limits by using proxies etc.? Why not just respect rate limits if they're set properly? A little bit of consideration for whatever/whoever is on the other end ;)


Because everyone "being nice" is how Google keeps its monopoly on search. Googlebot can do anything and everything and no one complains. Or how about sites like Twitter and Instagram that live off selling their users' data having extreme limitations on their public APIs and aggressively blocking alternative frontends like Nitter or Bibliogram, because OF COURSE god forbid someone could want to look up something on their platform and not have an account.

The typical response to people raising these issues is "buuuut XY is a private platform that can do what it wants". Yes, but why are you defending technocrats with bigger profits than many nation states' GDPs? (Reasonable) crawling should be allowed and promoted; in fact, it should be codified in law as a necessary element for the future of an open and free internet. Anyone trying to prevent it, or even worse, make it illegal, is a bad actor.


I get your point and I don't have an objective answer to it. We believe that the internet is an open medium and there's immense value for humankind waiting to be discovered and unlocked in all its data. After all, many of the big tech companies in the world utilize web scraping heavily.

Rate limits can be applied for different reasons. If they protect the website from being overloaded, they are good in our opinion. If they protect it from competition, research or building new non-competitive, but valuable products that are not harmful to the original website, they are not ideal.

We leave that to the user to decide the ethics of their project and just provide the tools.


This looks really neat, I love the idea of a single api for both traditional and headless scraping.

From my experience, headless scraping is on the order of 10-100x slower and significantly more resource intensive, even if you carefully block requests for images/ads/etc.

You should always start with traditional scraping, try as hard as you can to stick with it, and only move to headless if absolutely necessary. Sometimes, even if it will take 10x more “requests” to scrape traditionally, it’s still faster than headless.


Thanks, that's our experience exactly and that's why we built the library this way. It's not uncommon to switch from HTTP to headless back to HTTP in the lifecycle of a project as the website evolves or as you find better ways to scrape.


I'm very new to web scraping, can you explain what the use cases are for each and how you can switch between them? As far as I understood, you can use HTTP scraping for static websites and need some kind of browser/headless browser to scrape dynamically rendered websites. How would you do that with plain HTTP? By figuring out the AJAX network requests and then sending those directly?


Exactly. The dynamic websites need to pull the data from somewhere as well. There's no magic behind it. Either all the data is in the initial payload in some form (not necessarily HTML), or it's downloaded later, again, over HTTP.

Headless browsers are useful when the servers are protected by anti-scraping software and you can't reverse engineer it, when the data you need is generated dynamically (not downloaded, but computed), or simply when you don't have the time to bother with understanding the website on a deeper level.

Usually it's a tradeoff between development costs and runtime costs. In our case, we always try plain HTTP first. If we can't find an obvious way to do it, we go with browsers and then get back to optimizing the scraper later, using plain HTTP or a combination of plain HTTP and browsers for some requests like logins, tokens or cookies.
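
As a contrived sketch: the headless version that waits for a rendered list often collapses into a single request against the JSON endpoint the page itself calls (found via the browser dev tools' network tab; the endpoint and fields below are made up):

    // Hypothetical endpoint discovered in the network tab.
    const res = await fetch('https://example.com/api/products?page=1', {
        headers: { accept: 'application/json' },
    });
    const { items } = await res.json();
    for (const item of items) {
        console.log(item.name, item.price);   // field names depend on the real API
    }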


Jan, thanks for the open approach to running the tech behind apify!

The libraries look useful. One question that wasn't obvious in the docs: how do you manage / suggest approaching rate limiting by domain? Ideally respecting crawl-delay in robots.txt, or just defaulting to some sane value. Most naive queue implementations make it challenging, and queue-per-domain feels annoying to manage.


The ideal approach would depend on your architecture. It's really easy and cheap to create new queues on the Apify platform (we create ~500k every day) so we usually run a crawler per domain. It performs the best and it's the easiest to set up.

At the Crawlee level, you can open new queues with one line of code and name them by hostname, so the most straightforward solution would be to run multiple crawler instances with multiple queues, rate limit them using the options explained here: https://crawlee.dev/docs/guides/scaling-crawlers, and push new URLs to the respective queues based on each URL's hostname.
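
A rough sketch of that setup (option names taken from the scaling guide; treat it as illustrative):

    import { CheerioCrawler, RequestQueue } from 'crawlee';

    // One named queue (and one crawler) per hostname keeps rate limiting simple.
    const hostname = 'example.com';
    const queue = await RequestQueue.open(hostname);

    const crawler = new CheerioCrawler({
        requestQueue: queue,
        maxConcurrency: 5,          // cap on parallel requests to this domain
        maxRequestsPerMinute: 60,   // overall per-domain rate limit
        async requestHandler({ request, $ }) {
            console.log(request.url, $('title').text());
        },
    });

    await crawler.run();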

If you'd like to discuss this a bit more in depth, you can join our Discord or ask in GitHub discussions. Both are linked from Crawlee homepage.


Most mentions of crawl-delay in robots.txt set a limit so slow that the website can't be fully crawled before the heat death of the Universe. That's why Google, Bing, etc. ignore crawl-delay.


Sweet that you went down the free route and made it an npm package, doing it the right way by providing an optional upgrade to SaaS. Cool stuff. I could have used this dearly the last time I scraped. Like others, I used mixed methods (headless browser for renders and direct calls) and wrote a lot of error handling boilerplate.


Thanks! We really love open source and wanted to give back to the community. Crawlee is built on top of other great open-source libraries and projects. It's the best thing about building software.


This looks very cool, but am I the only one who has an aversion to any product/library calling itself the X instead of an X?


> Crawlee is a web scraping and browser automation library

Above is the headline from the crawlee.dev website.


Yes, but on GitHub, the place where people sensitive to the definite article will see it the most, it says "The web scraping and browser automation library."

I opened a PR to change it: https://github.com/apify/crawlee/pull/1480


Probably not :)


Is there a way to use Apify's paid proxies without paying for the hosting? That doesn't look like an option on the website.


Yeah, sure. We don't advertise it, but if you get in touch with us on support@apify.com or through the chat widget, we can create a proxy-only plan for you.


Looks pretty cool. I'm working on a project that relies on regularly scraping large amounts of data. My codebase uses nodejs, and I'd love to try out a few of the features listed under "Helpful utils and configurability" as they might be able to solve a few pain points I have.


Nice! Good luck with your project. The parsers are available under the utils.social namespace: https://crawlee.dev/api/utils/namespace/social The headless browser utils are under puppeteer and playwright utils https://crawlee.dev/api/puppeteer-crawler/namespace/puppetee...


It would be very useful if this or some other library came with Captcha solvers or a way to add Captcha solvers to the scrapers. Even regular users get Captchas sometimes.


You can use any captcha solving service with Crawlee, but we plan to add a plugin to make its use much easier. It's on our roadmap.


It looks like you can specify HTTP headers, so you should be able to use a captcha solving service with Crawlee.


word!


Can I use this to log into LinkedIn, run a query on posts and then send me an email of the results? (In theory of course as I am sure this will violate some policy)


LinkedIn is one of the most protected websites out there so you always risk getting the account banned. But at small scale, it should be fine. Crawlee has support for Playwright + Firefox with statistically generated fingerprints (you can also pass your own fingerprint) which looks pretty human-like. Put some random sleeps in between actions so it looks like you are actually using your mouse.
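
Something along these lines (a sketch only; the login flow and selectors are placeholders you'd have to fill in yourself):

    import { PlaywrightCrawler } from 'crawlee';
    import { firefox } from 'playwright';

    const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

    const crawler = new PlaywrightCrawler({
        launchContext: { launcher: firefox },   // Firefox + generated fingerprints
        async requestHandler({ page }) {
            // Random pauses between actions to look less robotic.
            await sleep(2000 + Math.random() * 3000);
            // ...log in, run the query, collect post data, etc...
        },
    });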

To send emails, you can use any 3rd-party tool; check out Apify, as Crawlee is well integrated there and they have an easy-to-use email sender.


Great job! Probably the best toolkit for DIY data extraction at the moment. Shines most "against" super-sophisticated sites. Well done, guys!


Thank you ;)


Looks great! Just wondering why it has a few scrapers built in, like Puppeteer and Cheerio. Is it because you might want headless only sometimes?


Yeah, exactly. Using pure HTTP needs far fewer resources than running headless browsers, so unless you really need a browser, you can save a lot of compute power (and money) by using plain HTTP.


This seems great.

I've been using the unmaintained node-osmosis lib for years, maybe it'll motivate me to finally move from it.


Funny, I never knew about node-osmosis. I like the API in the example. It's a shame that it's no longer maintained.


Nice! Finally a web scraping library for the programming language most websites use. It was about time lol


Thank you! And exactly as you say, using the same language as the websites gives us some advantages - we have HTTP-only (no client-side JS) crawlers based on the Cheerio library, which mimics the jQuery API, and if you later find out that you need to use a full headless browser with Puppeteer, you can just call the utility function injectJQuery, and there's very little you have to modify to keep your script working.
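
For example (a rough sketch; assuming the puppeteerUtils helpers from the docs linked above are re-exported from the main package):

    import { PuppeteerCrawler, puppeteerUtils } from 'crawlee';

    const crawler = new PuppeteerCrawler({
        async requestHandler({ page, request }) {
            await puppeteerUtils.injectJQuery(page);
            // The same jQuery-style selector code as in the Cheerio version,
            // now running inside the real page.
            const title = await page.evaluate(() => (window as any).$('title').text());
            console.log(request.url, title);
        },
    });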


Wow, this looks awesome. Looking for a scraping tool right now. Will give this a shot.


Oh this looks lovely, congratulations!

I would really like this but running in Python.


You want something better than Scrapy? Maybe I have Stockholm syndrome, but I find it to be very well structured and testable, and it has solved every problem I've had with running scrapers.


Do you feel this and Scrapy are similar? In my reading this has a different feature set.

It allows headed crawling + avoiding blockers etc.


In that they're both trying to be crawling frameworks. And for sure Scrapy allows headed crawling via Splash, it's just not something I've needed or advocate for.

Scrapy also has a long lineage of extensions, which maybe Crawlee will gain as it increases in popularity, but I didn't see any obvious way of decoupling things if one wanted (for example) to plug a new storage engine into Crawlee: https://crawlee.dev/docs/guides/result-storage whereas that delineation is very strong in Scrapy for all its moving parts.

Also, Parsel (the selector library powering Scrapy) is A++, in that it allows expressing one's intent via XPath, CSS selector, and regex matches in a fluent API; I'm sure this Node.js framework allows doing something similar because it seems to be all-in on the DOM, but it for sure will not be `response.xpath("//whatever").css("#some-id").re("firstName: (.+)").extract()`

Further, as I mentioned -- and as someone pointed out elsewhere in this submission -- Scrapy is prepared to store requests to disk and makes testing spider methods super easy since they're very well-defined callback methods. If you have the HTML from a prior run and need to reproduce a bad outcome, testing just the "def parse_details_page" is painless. It certainly may be possible to test Crawlee code, too, but I didn't see anything mentioned about it.


So what would be the approach with this library? I used Scrapy and like it, but I'm more in the JS ecosystem now, so I'd like this to be similar.


I don't know anything about this other than the announcement here and on reddit, so you'll likely want to post your question as a top comment so Jan can see it, or open a GH issue so they can help you evaluate


Good job to the Apify team for this, it looks very interesting!


If B4nan is around here from Apify, amazing work. On Crawlee, and on MikroORM. I use MikroORM extensively in production. One of the best ORMs for Node.js, if not the best.


Thanks!


Thank you! :)


Demo video is awesome. Loved it.


Can Crawlee run in AWS Lambdas?


Yes


Awesome! Thanks.


If it doesn't, please open an issue. We know from the community that it works, but we don't have tests specifically for Lambda. It should work, and we'll help if it doesn't.


this needs some sort of stealth plug-in called Creepee


Stealth is already included by default, but I love the name :D


This looks really great. However, I can't find examples of how to handle scraping behind a login or a paywall, without having to 'type' credentials every time.


Just found this on the Apify documentation: https://docs.apify.com/tutorials/log-into-a-website-using-pu...

Is there a similar guide for Crawlee?


The example uses Crawlee already; you can just remove the

import { Actor } from 'apify';

line, find all references to Actor, and either remove them or replace them with Crawlee functions.

E.g. await Actor.openKeyValueStore() should be replaced with await KeyValueStore.open().
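
Concretely, on the Crawlee side that becomes (a minimal sketch):

    import { KeyValueStore } from 'crawlee';

    // Default unnamed store; pass a name to open a specific one.
    const store = await KeyValueStore.open();
    await store.setValue('OUTPUT', { hello: 'world' });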

It makes sense to add a separate example for Crawlee though. But it's true that it does not exist yet.


You can use a headless browser (would recommend PlaywrightCrawler) to log in once and then use the session cookie until it expires in any crawler. When it expires, you can re-login and repeat the process.



Cool

One issue I have with webdriving headless browsers in general is host RAM usage per browser/Chromium/Puppeteer instance (e.g. ~600-900 MB) for a single browser/context/page.

Could Crawlee make it easier to run more browser contexts with less RAM usage?

E.g. concurrently running multiple of these (pages requiring JS execution): https://crawlee.dev/docs/examples/forms


In Crawlee, you can use the useIncognitoPages option to create a separate context for each page: https://crawlee.dev/api/browser-pool/class/LaunchContext#use... Not sure if it will be enough to offset your RAM requirements.
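
i.e., roughly (a sketch):

    import { PlaywrightCrawler } from 'crawlee';

    const crawler = new PlaywrightCrawler({
        launchContext: {
            useIncognitoPages: true,   // one browser context per page instead of per browser
        },
        async requestHandler({ page, request }) {
            console.log(request.url, await page.title());
        },
    });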

From our experience, RAM is not the limiting factor; it's the CPU. You need at least 1 CPU core for modern browsers to work reliably at scale, so if you're using a container that has 1 GB of RAM and 0.25 of a core, it's just not worth it. If you have access to containers that have strong CPUs and not a lot of RAM, then it's a different story.


That being said, for scraping purposes, you can almost always build the scraper with HTTP requests only. Sometimes it might be hard, but theoretically it is always possible (it's what the browser itself does, right?).


In a way, I hate you, but at the same time I love you. It's because I'm working on something similar to get data for my product. Seems like I'm going to use Apify instead to save my life.

Just some feedback from the developer point of view, though. I think the documentation (both Crawlee & Apify) needs some work. It took me a while to get the difference between Crawlee & other headless tools like Playwright, etc.


Yeah, I agree, the functionality of the library is very rich so the focus now for new docs is to continue making it more streamlined and easier to navigate.

Crawlee is basically a big wrapper around open-source tools like Puppeteer, Playwright and Cheerio (I wouldn't call those crawlers, though, as they don't have any logic for enqueueing requests etc.).



