How to Crawl the Web with Scrapy (babbling.fish)
189 points by babblingfish on Sept 13, 2021 | 55 comments



We have to crawl about 60-80k news websites per day [0].

I spent about a month testing whether Scrapy could be a fit for our purposes. And, quite surprisingly, it was hard to design a distributed web crawler with it. Scrapy is great for those in-the-middle tasks where you need to crawl a bit and process data on the go.

We ended up just using requests to crawl the web, then post-processing the pages in a separate step.

Many thanks to Zyte [1] (ex-ScrapingHub) for open-sourcing so many wonderful tools for us. I've spoken to Zyte's CEO and was really fascinated by how he's still a dev person while running such a big company.

[0] https://newscatcherapi.com/news-api [1] https://www.zyte.com/


> I spent about a month testing whether Scrapy could be a fit for our purposes. And, quite surprisingly, it was hard to design a distributed web crawler with it. Scrapy is great for those in-the-middle tasks where you need to crawl a bit and process data on the go.

I agree (but in my case I needed 3 months to understand that :) ).

I did start with plain Scrapy and it did work fine (though some parameters could be explained a bit better, especially the ones involving concurrency), and in general it has been very helpful for initially understanding some concepts related to web scraping.

But now, after having rewritten multiple parts like 50 times to handle most of what can happen, having split the workload into a multi-stage pipeline of specialized programs, having implemented custom code to distribute the domains to be scraped across dedicated subprocesses (which greatly improved performance), and much more... I am using Scrapy basically only to download the raw bytes and to take care of what's mentioned in robots.txt. My next step will most probably be to replace it with requests plus something that can interpret robots.txt; I no longer like having such a complicated lib as a dependency, and I've started to see it as a bit of a risk.

In my case (a "wide" Internet scan involving ~8M webpages so far, distributed over ~500K domains) I think it might have been better to start straight away with just requests.
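A minimal sketch of that last step (plain requests plus the standard library's robots.txt parser) could look something like this; the user agent and URL are placeholders:

    # Rough sketch of the "requests + robots.txt" direction described above.
    import requests
    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "my-crawler/0.1"  # hypothetical UA string

    def fetch_if_allowed(url):
        # Fetch and parse the site's robots.txt first
        rp = RobotFileParser()
        rp.set_url(urljoin(url, "/robots.txt"))
        rp.read()
        if not rp.can_fetch(USER_AGENT, url):
            return None  # disallowed by robots.txt
        resp = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=30)
        resp.raise_for_status()
        return resp.content  # raw bytes, to be post-processed in a later stage

    html = fetch_if_allowed("https://example.com/some-page")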


Even with Scrapy, it's usually best to crawl and store the full HTML so it can be scraped offline. It's too easy to miss something when scraping on the fly, or to drop data that might be useful in the future.
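For what it's worth, a tiny sketch of that crawl-now, scrape-later pattern with plain requests (the directory and URL are made up):

    # Store the raw HTML at crawl time; parse it offline later.
    import hashlib
    import pathlib
    import requests

    STORE = pathlib.Path("raw_html")  # hypothetical local directory
    STORE.mkdir(exist_ok=True)

    def crawl_and_store(url):
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        # Hash of the URL as a stable filename
        name = hashlib.sha256(url.encode()).hexdigest() + ".html"
        (STORE / name).write_bytes(resp.content)

    crawl_and_store("https://example.com/article")

(Within Scrapy itself, the HTTP cache mentioned further down the thread gets you much of the same benefit.)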


How do you deal with these god-awful SPAs, though, when you just use requests?


SPAs tend to be some of the easiest sites to scrape (at least if you’re building targeted scrapers), because they come with a production-ready API out of the box.

You can just watch the network tab of dev tools to know what endpoints to hit. Same applies to mobile apps – before you reach for BeautifulSoup, maybe check to see if the website has a mobile app too. It’s usually worth setting up a mitmproxy to see what API you might be able to scrape.
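In other words, once the network tab shows you the JSON endpoint the SPA is calling, something like this is often all you need (the endpoint, params, and headers here are made up):

    import requests

    # Hypothetical JSON endpoint discovered via the browser's network tab
    API_URL = "https://example.com/api/v2/search"

    resp = requests.get(
        API_URL,
        params={"query": "news", "page": 1},
        headers={
            "User-Agent": "Mozilla/5.0",   # mimic the browser that made the call
            "Accept": "application/json",
            "Referer": "https://example.com/search",
        },
        timeout=30,
    )
    resp.raise_for_status()
    data = resp.json()  # structured data, no HTML parsing needed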


We do it as part of post-processing; we'd have to make one more call.

Quite annoying though


Does newscatcherapi provide a list of the 60-80K news sites, so the customer knows what sources are actually being searched? Or even just a way to determine whether site X is among the 60-80K being searched?


Second one — yes. All sources, per request.

But as a general rule, we crawl everything we can find that we know is a news website.

Can add any niche/industry-specific websites on request.


> We have to crawl about 60-80k news websites per day [0]

Can't even imagine that number... different languages or something?


Yeah, plus many of the websites are quite niche news sources (construction news, for example).


OK, so I can just hire Zyte to build me a custom scraper?


Well, I think it's the cheapest and fastest way, tbh.


Thank you. The other site is also very interesting. I'm working on an MVP; it's a news-aggregator-type site for a NICHE product. So I need to aggregate news for a brand from maybe 10-20 blogs and list the URLs. Thank you for sharing both. I'll reach out to them.


I'm the co-founder of the other one; we could help you with your task.

Feel free to contact me. artem [at] newscatcherapi.com


Oh awesome! emailing you now. Thank you.


While I agree that Scrapy is a great tool for beginner tutorials and an easy entry into scraping, it's becoming difficult to use in real-world scenarios because almost all the large players now employ some anti-bot or anti-scraping protection.

A prime example is Cloudflare. You simply can't convince Cloudflare you're a human with Scrapy alone. Scrapy has only experimental support for HTTP/2 and does not support proxies over HTTP/2 (https://github.com/scrapy/scrapy/issues/5213). Yet all browsers use HTTP/2 now, which means all normal users use HTTP/2... You get the point.

What we use now is Got Scraping (https://github.com/apify/got-scraping). It's a special-purpose extension of Got (an HTTP client with 18 million weekly downloads) that masks its HTTP communication as if it were coming from a real browser. Of course, this will not get you as far as Puppeteer or Playwright (headless browsers), but it improved our scraping tremendously. If you need a full crawling library, see the Apify SDK (https://sdk.apify.com), which uses Got Scraping under the hood.
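Got Scraping is a Node library, but for anyone staying in Python, the basic idea (HTTP/2 plus browser-like headers) can be sketched with httpx; note this only covers the surface-level bits, not Got Scraping's full header and TLS masking:

    # Sketch only: HTTP/2 with browser-like headers via httpx (pip install "httpx[http2]").
    import httpx

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example UA
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

    with httpx.Client(http2=True, headers=headers, follow_redirects=True) as client:
        resp = client.get("https://example.com/")  # placeholder URL
        print(resp.http_version, resp.status_code)  # e.g. "HTTP/2" 200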


Very cool libraries, thank you for sharing!


I've used Scrapy extensively for writing crawlers.

There are a lot of good things, like not having to worry about storage backends, request throttling (random delays between requests), and the ability to run parallel parsers easily. There is also a lot of open-source middleware to help with things like retrying requests through proxies and rotating user agents.

However, like any batteries-included framework, it has downsides in terms of flexibility.

In most cases requests and lxml should be enough to crawl the web.
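For the simple cases that really is just a few lines (placeholder URL; lxml's HTML parser copes with messy markup):

    import requests
    from lxml import html

    resp = requests.get("https://example.com/", timeout=30)  # placeholder URL
    tree = html.fromstring(resp.content)
    tree.make_links_absolute(resp.url)

    title = tree.xpath("//title/text()")
    links = tree.xpath("//a/@href")  # follow these to keep crawling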


> In most cases requests and lxml should be enough to crawl the web.

Don't mind my `curl | pup xmlstarlet grep(!!)`s... Nothing to see here...


My brother-in-law had just finished his pilot training and was trying to apply for a job as a teacher to continue his training.

However, the jobs were first come, first served, so he was waking up at 4 am and constantly refreshing for hours, trying to be the first one.

When I heard about it, I quickly whipped up a `curl | grep && send_notif` (used pushback.io for notifs) and it helped him not have to worry so much.

When a new job posting finally came along he was the first in line and got the job :)


Was rate limiting not a concern? How often did you run it? I'm sure you had to keep some state too?


Their website was pretty trivial, but I beefed up the script to check the status code and notify me if it wasn't 200.

I ran it every minute.

I had to include a unique UUID to bust the cache, but other than that there wasn't any state.

I wrote about it here https://plainice.com/webscrape-notifications
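A rough Python equivalent of that curl-and-grep loop, just to show how little is involved (URL, keyword, and the notification step are all placeholders):

    # Poll a listings page once a minute; notify when the keyword shows up.
    import time
    import uuid
    import requests

    URL = "https://example.com/jobs"   # placeholder
    KEYWORD = "instructor"             # placeholder

    while True:
        # Unique query param to bust any caches in front of the page
        resp = requests.get(URL, params={"cb": uuid.uuid4().hex}, timeout=30)
        if resp.status_code != 200:
            print("non-200 response:", resp.status_code)  # also worth a notification
        elif KEYWORD in resp.text.lower():
            print("new posting found!")  # swap in your push-notification call here
            break
        time.sleep(60)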


If you are just doing one or two pages, say you want to get the weather for your location, then requests is sufficient. But if you want to do many pages, where you might want to scan and follow links, requests gets tedious very quickly.


If you're a web developer, not really: rather than worrying about storage backends, spiders, yielding, and managing loops and items, you could just host a DRF or Flask API with your scrapers (written in Requests + lxml) triggered by an API request.

I guess it’s a matter of preference


While it's a decent post, this is more or less inadequate in 2021. Do a post on bypassing Cloudflare and other anti-bot tech using residential proxy swarms.


Yes, Scrapy is quite a good scraper technology for some features, especially caching, but for some websites it's like doing things the hard way...

The easiest scraper with a proxy rotator I've found is in my current favorite web automator, scraper scripter, and scheduler: Rtila [1].

Created by an indie/solo developer-on-fire who cranks out user-requested features quite quickly... check the releases page [2].

I have used (or at least trialled) the vast majority of scraper tech and written hundreds of scrapers since my first VB5 script controlling IE and dumping to SQL Server in the '90s, then moving on to various PHP and Python libs/frameworks and a handful of Windows apps like ubot and imacros (both of which were useful to me at some point, but I never use them nowadays).

A recent release of Rtila allows creating standalone bots you can run using its built-in local Node.js server (which also has its own locally hosted server API you can program against using any language you like).

[1] https://www.rtila.net

[2] https://github.com/IKAJIAN/rtila-releases/releases


I'm sure Rtila is fantastic at what it does, but I gotta say it's hilarious to see a landing page done in the Corporate Memphis artstyle but worded in euphemism: https://www.rtila.net/#h.d30as4n2092u

"‘Cause if the web server said no, then the answer obviously is no. The thing is that it’s not gonna say no—it’d never say no, because of the innovation."


That is a modified quote from It's Always Sunny in Philadelphia, BTW, if you didn't recognize it.


Yeah. I hate Cloudflare and captchas. Why can't these companies accept that our scrapers are valid user agents? Only Google is allowed to do it, nobody else.


Because most scrapers aren't providing any value to website owners; in fact, they're costing them, unlike Google.


Exactly!

While there are scraping APIs that unblock requests and charge for them, I'd love to learn more about how they work....


They use residential proxies with altered clients and/or headless browsers. Cloudflare's bot protection mostly relies on TLS fingerprinting and is thus pretty easy to bypass.


Scraping is a cat-and-mouse game that'll vary a lot by site. I'm far from an expert and welcome correction here, but the two big tricks that'll go a long way AFAIK are using a residential proxy service (never tried one; they tend to be quite shady) and using a webdriver-type setup like Selenium or Puppeteer to mock realistic behavior (though IIRC you have to obfuscate both of those, since they're detectable via JS).


I love scrapy! It’s a wonderful tool.

One of the most underrated features is the request caching. It really helps when your spider crashed, or you didn't parse all the data you wanted, and you have to rerun the job: rather than making hundreds or thousands of requests, you can get them from the cache.

One nitpick is that the documentation could be a bit better about integrating scrapy with other Python projects / code rather than running it directly from the command line.

Also, some of its internal names are a bit vague. There's a Spider and a Crawler: what's the difference? To most people these would be the same thing. This makes reading the source code a little tricky.
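For anyone curious, both the caching and the run-from-a-script integration come down to a few lines; the settings below are real Scrapy settings, while the spider import is hypothetical:

    # Run a spider from ordinary Python code with the HTTP cache enabled.
    from scrapy.crawler import CrawlerProcess

    from myproject.spiders.books import BooksSpider  # hypothetical spider

    process = CrawlerProcess(settings={
        "HTTPCACHE_ENABLED": True,       # cache responses on disk
        "HTTPCACHE_EXPIRATION_SECS": 0,  # 0 = never expire
        "USER_AGENT": "my-crawler/0.1",
    })
    process.crawl(BooksSpider)
    process.start()  # blocks until the crawl finishes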


A couple of nitpicks with the OP:

1. Mixing the terms "scrape" and "crawl". Crawling means following URLs found in a page, i.e. automated browsing. Scraping means extracting data from a page. Pages can be scraped without doing any crawling. The difference between writing a "scraper" and a "spider" is significant; the latter is far more complex.

2. The OP gives the reader a peek at what some Python code looks like, but it does not show us what the extracted data and its preferred output format look like. If someone reading wanted to quickly compare this solution to some other solution, e.g. "can our solution do what this one does", they would need to see what the output data and its format look like. To test, we need to know (a) the input, i.e. the example webpage, and (b) the output, i.e. the extracted data including any desired formatting. Here we are provided with (a) only.


I've used Scrapy a lot. Just my opinion:

1. Instead of creating a global urls variable, use the start_requests method.

2. Don't use BeautifulSoup to parse; use CSS or XPath selectors.

3. If you are following links into multiple pages over and over again, use CrawlSpider with Rule.
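A short sketch tying those three points together (domain, URLs, and selectors are made up):

    import scrapy
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class ArticleSpider(CrawlSpider):
        name = "articles"
        allowed_domains = ["example.com"]  # placeholder domain

        # (3) Rules to follow pagination and article links automatically
        rules = (
            Rule(LinkExtractor(allow=r"/page/\d+")),  # just follow
            Rule(LinkExtractor(allow=r"/articles/"), callback="parse_item"),
        )

        # (1) start_requests instead of a global urls variable
        def start_requests(self):
            yield scrapy.Request("https://example.com/archive")

        # (2) CSS/XPath selectors instead of BeautifulSoup
        def parse_item(self, response):
            yield {
                "title": response.css("h1::text").get(),
                "date": response.xpath("//time/@datetime").get(),
            }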


Can you please give some details about your second point? What’s wrong with beautifulsoup?


Using CSS and XPath to select elements is very natural for web pages. BS4 has very limited CSS selector support and zero XPath support.


It is very slow. But personally, I prefer to write my crawlers in Go (custom code, not Colly).


Try Parsel: https://github.com/scrapy/parsel

It's way faster and has better support for CSS selectors.
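Parsel's API is tiny; a quick illustration with made-up HTML:

    from parsel import Selector

    sel = Selector(text="<h1 class='title'>Hello</h1>")
    sel.css("h1.title::text").get()                  # 'Hello' via CSS
    sel.xpath("//h1[@class='title']/text()").get()   # same thing via XPath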


> But personally, I prefer to write my crawlers in Go (custom code, not Colly).

This is my current setup as well, been scraping on and off for 20+ years now.


What's your problem with Colly? [0]

[0] http://go-colly.org/


Mostly that I started my crawler before learning about Colly, and it didn't make sense to rewrite the code.

By "not Colly" I just wanted to point out that in Go it's relatively easy to write a crawler from scratch.


Any suggestions for how to scrape JavaScript-heavy websites? (For example, harness racing entries and results from https://racing.ustrotting.com/default.aspx.)


Generally you use devtools to find out what endpoints the javascript is requesting data from, then you make an identically-styled request.


What's wrong with it? That seems like a server-side rendered page, which is easier to deal with than waiting for JS to load.


Related question: what is a very fast and easy-to-use library for scraping static sites such as Google search results?


Google search isn't a static site; the results are dynamically generated based on what it knows about you (location, browser language, recent searches from your IP, recent searches from your account, and so on, with everything else they know from trying to sell ad slots to that device).

That being said, there isn't anything wrong with using Scrapy for this. If you're more familiar with web browsers than Python, something like https://github.com/puppeteer/puppeteer can also be turned into a quick way to scrape a site by giving you a headless browser controlled by whatever you script in Node.js.


I see. I am familiar with Python, but I don't need something as heavy as Scrapy. Ideally I'm looking for something very lightweight and fast that can just parse the DOM using CSS selectors.


I've had excellent luck with SerpAPI. It's $50 a month for 5,000 searches which has been plenty for my needs at a small SEO/marketing agency.

http://serpapi.com


As others have said, Google isn't a static site, and in addition to that, they create a nightmare of tags and whatnot that makes it utterly horrific to scrape.

After scraping tens of millions of pages, possibly hundreds of millions, I've fallen back to lxml with Python. It's not for all use cases, but it works for me.

One thing I'll do before scraping a page is check whether it's rendered server-side or client-side. If it's client-side, I'll see if I can just get the raw data; when that works, it makes things much, much easier.


Is the complete example (i.e. a git repo or the Python file) linked anywhere in the blog post?


That's a good idea. I added a link to download a Python file with all the code at the end of the article.


How does one use this on a site that requires a login? OAuth?


Is there a guide on how to do this with sites that are paywalled? For example, how do I use this with the NYT or WSJ? And are there any tricks for identifying the main content, or do you just have to deal with that being different for each site? How do the reader modes in browsers do it?



