We have to crawl about 60-80k news websites per day [0].
I spent about a month testing whether Scrapy could be a fit for our purposes. And, quite surprisingly, it was hard to design a distributed web crawler with it. Scrapy is great for those in-the-middle tasks where you need to crawl a bit and process the data on the go.
We ended up just using requests to crawl the web, then post-processing the pages in a separate step.
Many thanks to Zyte [1] (ex-ScrapingHub) for open-sourcing so many wonderful tools for us. I've spoken to Zyte's CEO, and was really fascinated by how he has stayed a dev person while running such a big company.
> I spent about a month testing whether Scrapy could be a fit for our purposes. And, quite surprisingly, it was hard to design a distributed web crawler with it. Scrapy is great for those in-the-middle tasks where you need to crawl a bit and process the data on the go.
I agree (but in my case I needed 3 months to understand that :) ).
I did start with plain Scrapy and it worked fine (though some parameters could be explained a bit better, especially the ones involving concurrency), and in general it was very helpful for initially understanding some concepts related to web scraping.
But now, after rewriting multiple parts about 50 times to handle most of what can happen, splitting the workload into a multi-stage pipeline of different specialized programs, implementing custom code to distribute the domains to be scraped across dedicated subprocesses (which greatly improved performance), and much more, I am using Scrapy basically only to download the raw bytes and to take care of what's mentioned in "robots.txt". My next step will most probably be to replace it with "requests" plus something that can interpret "robots.txt", because I no longer like having such a complicated library as a dependency for my program; it has started to feel like a risk.
In my case (a "wide" Internet scan involving so far ~8M webpages distributed over ~500K domains), I think it might have been better to start straight away with just "requests".
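For what it's worth, the "requests plus something for robots.txt" combination can be quite small; here is a minimal sketch assuming the standard library's urllib.robotparser is acceptable (the user agent and URL are placeholders):

```python
import requests
from urllib import robotparser
from urllib.parse import urljoin, urlparse

USER_AGENT = "example-crawler/0.1"  # placeholder user agent
_robots_cache = {}                  # one parsed robots.txt per site

def fetch_if_allowed(session, url):
    """Fetch a URL with requests, but only if the site's robots.txt allows it."""
    root = "{0.scheme}://{0.netloc}".format(urlparse(url))
    rp = _robots_cache.get(root)
    if rp is None:
        rp = robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
        try:
            rp.read()            # downloads and parses robots.txt
        except OSError:
            rp.allow_all = True  # unreachable robots.txt: assume no restrictions
        _robots_cache[root] = rp
    if not rp.can_fetch(USER_AGENT, url):
        return None              # disallowed by robots.txt
    return session.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)

session = requests.Session()
response = fetch_if_allowed(session, "https://example.com/some/page")
```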
Even with Scrapy, it's usually best to crawl and store the full HTML so it can be scraped offline. It's too easy to miss something when scraping on the fly, or to drop data that might be useful in the future.
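A minimal sketch of that store-first approach (my own illustration, not the commenter's setup), assuming gzipped files keyed by a hash of the URL are an acceptable storage backend:

```python
import gzip
import hashlib
from pathlib import Path

import requests

STORE = Path("raw_html")
STORE.mkdir(exist_ok=True)

def store_page(url: str) -> Path:
    """Download a page and store the raw bytes, keyed by a hash of the URL."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    path = STORE / (hashlib.sha256(url.encode()).hexdigest() + ".html.gz")
    with gzip.open(path, "wb") as f:
        f.write(resp.content)
    return path

# Parsing then happens offline, as many times as needed:
# html = gzip.open(path, "rb").read()
```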
SPAs tend to be some of the easiest sites to scrape (at least if you’re building targeted scrapers), because they come with a production-ready API out of the box.
You can just watch the network tab of dev tools to know what endpoints to hit. Same applies to mobile apps – before you reach for BeautifulSoup, maybe check to see if the website has a mobile app too. It’s usually worth setting up a mitmproxy to see what API you might be able to scrape.
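For example, once the network tab reveals a JSON endpoint, plain requests is often enough; the endpoint, parameters, and response shape below are all invented for illustration:

```python
import requests

# Hypothetical endpoint spotted in the dev-tools network tab while browsing the SPA.
API_URL = "https://example.com/api/v2/articles"

resp = requests.get(
    API_URL,
    params={"page": 1, "page_size": 50},
    headers={
        "User-Agent": "Mozilla/5.0",           # some APIs reject the default requests UA
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",  # sometimes expected by SPA backends
    },
    timeout=30,
)
resp.raise_for_status()
for item in resp.json().get("results", []):    # response shape is an assumption
    print(item.get("title"), item.get("url"))
```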
Does newscatcherapi provide a list of the 60-80K news sites, so the customer knows which sources are actually being searched? Or even just a way to determine whether site X is among the 60-80K being searched?
Thank you. The other site is also very interesting. I am working on an MVP, a news-aggregator-type site for a NICHE product, so I need to aggregate news for a brand from maybe 10-20 blogs and list the URLs. Thank you for sharing both. I'll reach out to them.
While I agree that Scrapy is a great tool for beginner tutorials and an easy entry into scraping, it's becoming difficult to use in real-world scenarios because almost all the large players now employ some anti-bot or anti-scraping protection.
A great example above all is Cloudflare. You simply can't convince Cloudflare you're a human with Scrapy alone. Scrapy has only experimental support for HTTP/2 and does not support proxies over HTTP/2 (https://github.com/scrapy/scrapy/issues/5213). Yet all browsers use HTTP/2 now, which means all normal users use HTTP/2... You get the point.
What we use now is Got Scraping (https://github.com/apify/got-scraping). It's a special-purpose extension of Got (an HTTP client with 18 million weekly downloads) that masks its HTTP communication as if it were coming from a real browser. Of course, this will not get you as far as Puppeteer or Playwright (headless browsers), but it improved our scraping tremendously. If you need a full crawling library, see the Apify SDK (https://sdk.apify.com), which uses Got Scraping under the hood.
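Got Scraping is a Node.js library, so it doesn't slot into the Python stacks discussed elsewhere in the thread; as a rough Python-side illustration of the HTTP/2 point only, httpx (with its optional http2 extra) can at least negotiate HTTP/2, though it does nothing about TLS fingerprinting and is not a Cloudflare bypass:

```python
# Requires: pip install "httpx[http2]"
import httpx

# Browser-like headers help a little, but header order and the TLS fingerprint
# still differ from a real browser, so this alone will not pass strict anti-bot checks.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

with httpx.Client(http2=True, headers=headers, timeout=30) as client:
    resp = client.get("https://example.com/")
    print(resp.http_version)  # "HTTP/2" when the server negotiated it
```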
I've used Scrapy extensively for writing crawlers.
There are a lot of good things, like not having to worry about storage backends, request throttling (random delays between requests), and the ability to run parallel parsers easily. There is also a lot of open-source middleware to help with things like retrying requests through proxies and rotating user agents.
However, like any batteries-included framework, it has downsides in terms of flexibility.
In most cases requests and lxml should be enough to crawl the web.
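As a rough sketch of what that can look like, here is a minimal breadth-first crawl using only requests and lxml; the seed URL, same-domain restriction, and page limit are arbitrary choices for illustration:

```python
from collections import deque
from urllib.parse import urlparse

import requests
from lxml import html

SEED = "https://example.com/"   # placeholder seed URL
MAX_PAGES = 100

seen, queue = {SEED}, deque([SEED])
session = requests.Session()

while queue and len(seen) <= MAX_PAGES:
    url = queue.popleft()
    try:
        resp = session.get(url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        continue
    doc = html.fromstring(resp.content)
    doc.make_links_absolute(url)
    # Extract whatever data you need here, e.g. doc.xpath("//title/text()")
    for link in doc.xpath("//a/@href"):
        # Stay on the seed domain and avoid revisiting pages.
        if urlparse(link).netloc == urlparse(SEED).netloc and link not in seen:
            seen.add(link)
            queue.append(link)
```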
If you are just doing one or two pages, say you want to get the weather for your location, then requests is sufficient. But if you want to do many pages, where you might want to scan and follow links, requests gets tedious very quickly.
If you're a web developer, not really: rather than worrying about storage backends, spiders, yielding, and managing loops and items, you could just host a DRF or Flask API with your scrapers (written in requests + lxml) triggered by an API request.
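A minimal sketch of that pattern, assuming Flask and a tiny requests + lxml scraper behind one endpoint (all names and selectors below are made up):

```python
from flask import Flask, jsonify, request
import requests
from lxml import html

app = Flask(__name__)

def scrape_titles(url: str) -> list[str]:
    """Tiny requests + lxml scraper; swap in whatever extraction you actually need."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    doc = html.fromstring(resp.content)
    return doc.xpath("//h1/text() | //h2/text()")

@app.route("/scrape")
def scrape():
    url = request.args.get("url")
    if not url:
        return jsonify(error="missing ?url= parameter"), 400
    return jsonify(url=url, titles=scrape_titles(url))

if __name__ == "__main__":
    app.run(port=8000)
```

Hitting GET /scrape?url=https://example.com on the running app then returns the extracted headings as JSON.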
Yes, Scrapy is quite a good scraping technology for some features, especially caching, but for some websites it's like doing things the hard way...
The easiest scraper with a proxy rotator I've found is in my current fave web-automator, scraper scripter and scheduler:
Rtila [1]
Created by an indie/solo developer-on-fire cranking out user-requested features quite quickly... check the releases page [2]
I have used (or at least trialled) the vast majority of scraper tech and written hundreds of scrapers since my first VB5 script controlling IE and dumping to SQL Server in the '90s, then moving through various PHP and Python libs/frameworks and a handful of Windows apps like uBot and iMacros (both of which were useful to me at some point, but I never use them nowadays).
A recent release of Rtila allows creating standalone bots you can run using its built-in local Node.js server (which also has its own locally hosted server API that you can program against from any language you like).
I'm sure Rtila is fantastic at what it does, but I gotta say it's hilarious to see a landing page done in the Corporate Memphis art style but worded in euphemism: https://www.rtila.net/#h.d30as4n2092u
"‘Cause if the web server said no, then the answer obviously is no. The thing is that it’s not gonna say no—it’d never say no, because of the innovation."
Yeah. I hate Cloudflare and captchas. Why can't these companies accept that our scrapers are valid user agents? Only Google is allowed to do it, nobody else.
They use residential proxies with altered clients and/or headless browsers. Cloudflare's bot protection mostly relies on TLS fingerprinting, and is thus pretty easy to bypass.
Scraping is a cat-and-mouse game that varies a lot by site. I'm far from an expert and welcome correction here, but the two big tricks that'll go a long way, AFAIK, are using a residential proxy service (never tried one; they tend to be quite shady) and using a webdriver-type setup like Selenium or Puppeteer to mock realistic behavior (though IIRC you have to obfuscate both of those systems, since they're detectable via JS).
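For reference, a bare-bones headless Selenium setup looks like the sketch below; on its own it is easily detected (for example via navigator.webdriver), so the obfuscation mentioned above still has to be layered on top:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
options.add_argument("--window-size=1920,1080")
# A real-browser UA string; still detectable via JS without further obfuscation.
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/")   # placeholder URL
    rendered_html = driver.page_source   # HTML after JavaScript has run
finally:
    driver.quit()
```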
One of the most underrated features is the request caching. It really helps when you find out your spider crashed, or that you didn't parse all the data you wanted, and you have to rerun the job. Rather than making hundreds or thousands of requests again, you can get them from the cache.
One nitpick is that the documentation could be a bit better about integrating Scrapy with other Python projects/code rather than running it directly from the command line.
Also, some of their internal names are a bit vague. There’s a Spider and a Crawler. What’s the difference? To most people these would be the same thing. This makes reading the source code a little tricky.
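For what it's worth, both of those points come down to a few lines: the cache is a pair of settings, and CrawlerProcess is Scrapy's way of running a spider from ordinary Python code instead of the CLI. A minimal sketch against the quotes.toscrape.com sandbox:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

process = CrawlerProcess(settings={
    "HTTPCACHE_ENABLED": True,        # cache every response on disk
    "HTTPCACHE_DIR": "httpcache",     # stored under the project's .scrapy dir
    "HTTPCACHE_EXPIRATION_SECS": 0,   # 0 = cached responses never expire
})
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes
```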
1. Mixing the terms "scrape" and "crawl". Crawling means following URLs found in a page, i.e., automated browsing. Scraping means extracting data from a page. Pages can be "scraped" without doing any crawling. The difference between writing a "scraper" and a "spider" is significant. The latter is far more complex.
2. The OP gives the reader a peek at what some Python code looks like, but it does not show us the extracted data or its preferred output format. If a reader wanted to quickly compare this solution to some other solution, e.g., "can our solution do what this one does?", they would need to see the output data and its format. To test, we need to know (a) the input, i.e., the example webpage, and (b) the output, i.e., the extracted data including any desired formatting. Here we are provided with (a) only.
Google search isn't a static site; the results are dynamically generated based on what it knows about you (location, browser language, recent searches from your IP, recent searches from your account, and so on, with all the things they learn from trying to sell ad slots to that device).
That being said, there isn't anything wrong with using Scrapy for this. If you're more familiar with web browsers than Python, something like https://github.com/puppeteer/puppeteer can also be turned into a quick way to scrape a site, since it gives you a headless browser controlled by whatever you script in Node.js.
I see. I am familiar with Python, but I don't need something as heavy as Scrapy. Ideally I am looking for something very lightweight and fast that can just parse the DOM using CSS selectors.
As others have said, Google isn't a static site, and on top of that, they create a nightmare of tags and whatnot that makes it utterly horrific to scrape.
After scraping tens of millions of pages, possibly hundreds of millions, I've fallen back to lxml with Python. It's not for all use cases, but it works for me.
One thing I'll do before scraping a page is check whether it is rendered server side or client side. If it's client side, I'll see if I can just get the raw data; when that's possible, it makes things much, much easier.
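A minimal sketch of that check-then-parse workflow with requests + lxml (CSS selectors need the small cssselect package alongside lxml; the URL and selectors are placeholders):

```python
import requests
from lxml import html

resp = requests.get("https://example.com/article", timeout=30)
doc = html.fromstring(resp.content)

# Quick check for client-side rendering: if the interesting content is missing
# from the raw HTML, look for a JSON payload or API call to use instead.
if not doc.cssselect("div.article-body"):        # requires `pip install cssselect`
    print("Content is probably rendered client side; check the network tab for an API.")
else:
    paragraphs = [p.text_content() for p in doc.cssselect("div.article-body p")]
    print("\n\n".join(paragraphs))
```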
Is there a guide on how to do this with sites that are paywalled? For example, how do I use this with the NYT or WSJ? And are there any tricks for identifying the main content, or do you just have to deal with that being different for each site? How do the reader modes in browsers do it?
[0] https://newscatcherapi.com/news-api
[1] https://www.zyte.com/