Introduction to web scraping with Python (datawhatnow.com)
373 points by weenkus on Oct 24, 2017 | 60 comments

It makes one mistake: it parses and scrapes in the same loop. You should pull the data, store it, and have another process read from the data store to do the parsing and interpretation. A "quick" parse can be done to pull the links and build your frontier, but the data should be pulled and stored for the main parsing.

This lets you test your parsing routines independently of the target website, compare against previous versions later, and reparse everything in the future, even after the original website is long gone.

My recommendation is to store the results in the WARC archive format. That keeps you on the safe side (the storage format is standardized), it compresses very well, and WARC files are easy to handle (they are immutable stores, which is nice for backups).
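The "quick" parse that only pulls links to build the frontier could look roughly like the sketch below. It uses only the standard library; `LinkExtractor` and `extend_frontier` are made-up names for illustration, and the stored page body would still go through the real parser later.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags; nothing else is parsed."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extend_frontier(frontier, seen, base_url, html):
    """Queue every link in html that has not been seen yet (hypothetical helper)."""
    parser = LinkExtractor()
    parser.feed(html)
    for href in parser.links:
        # Resolve relative links and drop the #fragment part.
        url, _fragment = urldefrag(urljoin(base_url, href))
        if url not in seen:
            seen.add(url)
            frontier.append(url)
```

A crawl loop would pop URLs from the `deque`, store each raw response, and call `extend_frontier` on it, leaving the detailed parsing to a separate process.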

1000% this. I write about Python web scraping a lot, and the big one is that there are two parts. The first is gathering the pages you need to scrape and saving them locally; the second is scraping the pages you've saved. You need to separate those two to avoid hitting their servers over and over when you're trying to debug the scraping code. My way is to write the first in a file called gather.py and the other in scrape.py. Keep that in mind before doing any heavy scraping.

Since my scrapes aren't always the biggest, feel free to just save the html in a local folder and then scrape from there. This part of the project depends on how many pages you need to scrape, the size of the files, whether you need to store the data, whether it's a one time scrape or croned, etc. Either way, save the files, and scrape from there.
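A minimal sketch of that gather/scrape split, assuming the pages are simply saved as files in a local folder. The `fetch` and `parse` callables are placeholders you would supply yourself (for example a function wrapping requests, and an lxml-based parser); `page_path` is a made-up helper.

```python
import hashlib
from pathlib import Path


def page_path(folder, url):
    """Map a URL to a stable file name inside folder."""
    name = hashlib.sha256(url.encode("utf-8")).hexdigest() + ".html"
    return Path(folder) / name


def gather(urls, folder, fetch):
    """Download each URL once; pages already on disk are skipped."""
    Path(folder).mkdir(parents=True, exist_ok=True)
    for url in urls:
        path = page_path(folder, url)
        if not path.exists():
            path.write_bytes(fetch(url))  # fetch is the only network access


def scrape(folder, parse):
    """Run parse over every saved page; never touches the network."""
    return [parse(p.read_bytes()) for p in sorted(Path(folder).glob("*.html"))]
```

Because `scrape` only reads from disk, you can rerun it as often as you like while debugging the parsing code.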

Here are some of the posts I've done on the subject if people reading the comments want to see more about scraping.

- https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...
- https://bigishdata.com/2017/06/06/web-scraping-with-python-p...
- https://bigishdata.com/2017/05/11/general-tips-for-web-scrap...

For a simple caching solution that works well with requests, you can look at cachecontrol:

    import requests
    from cachecontrol import CacheControl

    # Wrap the session; repeat GETs of a cacheable response come from the cache
    sess = requests.session()
    cached_sess = CacheControl(sess)
    response = cached_sess.get('http://google.com')

Very good for interactive debugging when you have to make multiple GET requests. The first time you'll hit the webserver; after that it's all served from cache.

requests.session() --> requests.Session()

There is no issue with parsing and scraping in the same loop as long as there is caching in there as well. You don't want to be hitting the server repeatedly whilst you're debugging.

A project like Scrapy should have caching on by default, but it seems to be an afterthought. Repeatable and reproducible parsing of cached websites is necessary, e.g. if you find additional data fields that you want to parse without downloading the entire site over again.

I think the bigger point is the benefit of storing pulled data as is for the future, not so much about hitting the server multiple times. If so, I agree with this 100% -- being able to re-run your algorithms later on a local dataset is a powerful capability. Later time, different computer, new software version -- no problem, you have a local copy of the data.

With caching, you are at the mercy of whatever third party caching scheme is used under the hood and raw pulled data can disappear any time without your explicit command (e.g., if some library gets updated and decides that this invalidates the caching scheme).

By caching, I just mean storing data locally so you don't have to request it again within a certain timeframe. I use my own caching scripts written in Python. If you use a 3rd party library, data deletion doesn't matter too much either, as long as you configure it properly and back up the data: html/json data compresses really well using lzma2 in 7-zip.
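This is not the commenter's script, just a rough sketch of what a home-grown cache with a time limit might look like. It uses Python's standard lzma module (which writes LZMA2-based .xz data) rather than 7-zip; `cached_fetch` and its `fetch` argument are hypothetical names.

```python
import hashlib
import lzma
import time
from pathlib import Path


def cached_fetch(url, fetch, cache_dir, max_age=24 * 3600):
    """Return the body for url, calling fetch only if the cached copy
    is missing or older than max_age seconds."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".xz")
    if path.exists() and time.time() - path.stat().st_mtime < max_age:
        return lzma.decompress(path.read_bytes())
    body = fetch(url)  # expected to return bytes
    path.write_bytes(lzma.compress(body))
    return body
```

Since the cache is just a directory of .xz files, backing it up is a plain file copy, and nothing gets deleted unless you delete it.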

Agreed on saving the files first. Here is a code snippet that implements something similar but saves each URL response first, albeit not using WARC:


Yeah, this might be handy for small stuff, but it's way too naive for anything bigger than a couple of pages. I recently had to scrape some pictures and metadata from a website, and while scripts like these seemed cool, they really didn't scale up at all. Consider navigation, following URLs and downloading pictures, all while staying within the limits of what's considered non-intrusive.

My first attempt, similar to this, failed miserably as the site employed some kind of cookie check that immediately blocked my requests by returning 403.

As mentioned in the article, I then moved on to Scrapy https://scrapy.org/. It seems a bit overkill, but once you've created your scraper it's easy to expand and to reuse the same scaffold on other sites. It also gives you a lot more control over how gently you scrape, and it outputs the data you want nicely as json/jl/csv.

Most of the problems I had were with the Scrapy pipelines and getting them to properly output two json files and the images. I could write a very short tutorial on my setup if I wasn't at work and otherwise busy right now.

And yes, it's a bit of a grey area, but for my project (training a simple CNN based on the images) I think it was acceptable, considering that I could have done the same thing manually (and spent less time too).

Python Requests has a notion of "session" which takes care of cookies etc. I use it all the time when I need to automate tasks that require signing in.

I've been through the rigmarole of writing my own crawlers and find Scrapy very powerful. I've run into roadblocks with dynamic/JavaScript-heavy sites; for those parts selenium+chromedriver works really well.

As parent and others have said: this is a grey area so make sure to read the terms of use and/or gain permission before scraping.

Notice how it's not a grey area when Google does it. The usual double standard applies, I guess.

I don't understand what this comment is referring to. Google's spider respects robots.txt, just block all paths and google will not crawl your site. So too for Bing, Yahoo, Baidu (some complications though, I think), Yandex.... Most of the major spiders respect robots.txt.

Is there some major Google web scraping effort I'm not aware of?

I am just getting into web scraping and have also been using Selenium with either Firefox or PhantomJS. Is there a better way to handle the javascript heavy sites? I found one library called dryscraping but haven't had the time to look too deep into it.

Take a look at the Google team's Puppeteer


Splash runs in docker and does a decent job. From the scrapinghub team.

I love requests+lxml, use it fairly regularly, just a few quick notes:

1. lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

2. Don't forget to check the status code of r (r.status_code or less generally r.ok)

3. Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That's obviously a tad slower than find/findall/xpath, but it's oftentimes too convenient to pass upon.

4. Kind of automatic, but I'll say it anyway - scraping is a gray area, always make sure that what you're doing is legitimate.

> 1. lxml is way faster than BeautifulSoup - this may not matter if all you're waiting for is the network. But if you're parsing something on disk, this may be significant.

Caveat: lxml's HTML parser is garbage, so is BS's, they will parse pages in non-obvious ways which do not reflect what you see in your browser, because your browser follows HTML5 tree building.

html5lib fixes that (and can construct both lxml and bs trees, and both libraries have html5lib integration), however it's slow. I don't know that there is a native compatible parser (there are plenty of native HTML5 parsers e.g. gumbo or html5ever but I don't remember them being able to generate lxml or bs trees).

> 2. Don't forget to check the status code of r (r.status_code or less generally r.ok)

Alternatively (depending on use case) `r.raise_for_status()`. I'm still annoyed that there's no way to ask requests to just check it outright.

> Those with a background in coding might prefer the .cssselect method available in whatever object the parsed document results in. That's obviously a tad slower than find/findall/xpath, but it's oftentimes too convenient to pass upon.

FWIW cssselect simply translates CSS selectors to XPath, and while I don't know for sure I'm guessing it has an expression cache, so it should not be noticeably slower than XPath (CSS selectors are not a hugely complex language anyway)

Nice. Seems to only do lxml tree building?


Are you the author? If so, well done, that looks like a great package.

On the contrary, I have found lxml suitable for all of my scraping projects where the objective is to write some XPath to parse or extract some data from some element.

lxml itself is. The problem is that its HTML parser (libxml's, really) is an ad-hoc "HTML4" parser, which means the tree it builds routinely diverges from a proper HTML5 tree as you'd find in e.g. your browser's developer tools, and the way it fixes markup (or whether it fixes it at all) is completely ad-hoc and hard to predict.

Are you talking about etree.HTML() being garbage? And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )?

> Are you talking about etree.HTML() being garbage?


> And what are your thoughts on parsing it as xml (e.g. etree.fromstring(), etree.parse() )

No problem there. XML is much stricter and thus easier to "get right" so to speak. lxml's html parser is built upon libxml's HTML parser[0], which predates HTML5, has not been updated to handle it, and is as its documentation notes

> an HTML 4.0 non-verifying parser

This means it harks back to an era where every parser did its own thing and tried its best on the garbage it was given, without necessarily taking into account what its neighbours did.

[0] http://xmlsoft.org/html/libxml-HTMLparser.html

I had reason to gather news articles and extract keywords and authors. I can't remember why I didn't use BeautifulSoup, because that was my first choice. In the end I used lxml with its html5parser:


Regarding legality: something frowned upon is putting load on servers. You may get blocked, rate limited or worse if you put too much strain on them, especially if, as I did, you experiment with different query options to get the data you need and have to rerun the scraping a number of times.

A neat trick I found was to configure a web proxy locally (squid, in my case) to aggressively cache EVERYTHING. This way, new runs only went out to the news sites for new queries I had never run before. Very helpful; it also sped up development to access files locally (cached in squid) instead of having to go out to the internet all the time.

In that github repo there is an example squid config for caching permanently.

Yep. requests.session fixes the vast majority of cookie / login / session problems. Replaying the same headers is also a powerful technique, so long as they aren't constantly changing.

ASPX is still horrid, but just about doable if you pull all the hidden form variables out of the HTML and put them back into the header verbatim. But you're probably best off going down a headless browser route at that point.
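Pulling the hidden form variables out of the HTML can be sketched with the standard library alone. This assumes the fields go back as ordinary form data on the next POST; `HiddenFieldParser` and `hidden_fields` are made-up names, and the field names in the usage note are only typical ASP.NET examples.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collects name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            # Keep the value exactly as the server sent it.
            self.fields[attrs["name"]] = attrs.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in html."""
    parser = HiddenFieldParser()
    parser.feed(html)
    return parser.fields
```

The resulting dict (typically containing things like `__VIEWSTATE` and `__EVENTVALIDATION`) can be merged with your own form values and passed through `urlencode`, or handed to whatever HTTP client you use as the form data.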

If anyone's wanting someone with scraping experience (UK/remote), I'm currently available... (dave.mckee@gmail.com)

This is perhaps the fastest way to screenscrape a dynamically executed website.

1. First go get and run this code, which allows immediate gathering of all text nodes from the DOM: https://github.com/prettydiff/getNodesByType/blob/master/get...

2. Extract the text content from the text nodes and ignore nodes that contain only white space:

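    // getNodesByType comes from the script in step 1; 3 is Node.TEXT_NODE
    let text = document.getNodesByType(3),
        output = [];
    for (let a = 0; a < text.length; a = a + 1) {
        // skip nodes that contain only white space
        if ((/^(\s+)$/).test(text[a].textContent) === false) {
            output.push(text[a].textContent);
        }
    }
    output;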

That will gather ALL text from the page. Since you are working from the DOM directly you can filter your results by various contextual and stylistic factors. Since this code is small and executes stupid fast it can be executed by bots easily.

I wonder how many folks using this will obey the robots.txt as explained nicely within the article:


"Web scraping is powerful, but with great power comes great responsibility. When you are scraping somebody’s website, you should be mindful of not sending too many requests. Most websites have a “robots.txt” which shows the rules that your web scraper should obey (which URLs are allowed to be scraped, which ones are not, the rate of requests you can send, etc.)."
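For anyone who does want to obey it, Python ships a robots.txt parser in the standard library. A minimal sketch, using a made-up robots.txt and a made-up user agent name ("example-scraper"); a real scraper would load the file from the site root instead.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-scraper"):
    """True if the parsed robots.txt permits user_agent to fetch url."""
    return robots.can_fetch(user_agent, url)
```

`robots.crawl_delay("example-scraper")` returns the requested delay in seconds when the file specifies one, which can feed directly into whatever throttling you use.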

Not many. Browser testing libraries are widely repurposed as automation tools by black hats.

I found that a lot of the use cases for web scraping are kind of ad-hoc and usually occur as part of another task (e.g. a research project or enhancing a record). I ended up releasing a simple hosted API service called Page.REST (https://page.rest) for people who would like to save that extra dev effort and infrastructure cost.

I agree, and I find scripting a web browser via the developer console a really productive approach.

First, it's completely interactive.

Second, it's the browser, so absolutely everything works. It doesn't matter if the data you want is only loaded by an obscure JS function when a hidden form is submitted on a button click. Just find the button, .click() it, and wait for a mutation event.

I have a write up on this[1], but I need to extend it with some more advanced examples.

[1] https://github.com/jawj/web-scraping-for-researchers

You can even control Chrome remotely via Python with a pretty simple web sockets api:


That's really the best of both worlds.

That may be fine for a JavaScript-heavy site with a few pages, but for anything with more than, say, 1,000 pages it is much more efficient to scrape using requests with lxml. The requests can be made concurrently and scale well, and there is no browser overhead from page rendering.
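Making the requests concurrently can be as small as a thread pool around your fetch function. A sketch under the assumption that `fetch` is your own function taking a URL and returning the body (e.g. a thin wrapper around requests); `fetch_all` is a made-up name.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Fetch urls concurrently; returns {url: body or the exception raised}."""
    urls = list(urls)

    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:  # keep one bad page from aborting the batch
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs each url with its result.
        return dict(zip(urls, pool.map(safe_fetch, urls)))
```

Keep `max_workers` modest: concurrency makes it very easy to put more load on a site than is polite.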

I've done a lot of scraping in my day, and I've found that lxml/requests is 2-3 orders of magnitude more resource efficient than a Selenium-based browser. That JS/rendering engine is HEAVY!

With a headless browser the web scraping script can be even simpler. For example, have a look at the same scraper for datawhatnow.com at https://www.apify.com/jancurn/YP4Xg-api-datawhatnow-com

I've found .NET great for scraping. More so than Python as LINQ can be really really useful for weird cases I find.

My usual setup on OSX is .NET Core + HTMLAgilityPack + Selenium.

Did you consider using AngleSharp instead of HTMLAgilityPack?


No, will check it out. Thanks.

Could CSS selectors, with a few minor extensions, be just as good as XPath for this kind of thing?

I guess a lot of the reason I find xpath frustrating is my usage frequency corresponds exactly to the time needed to forget the syntax and have to relearn/refresh it in my head.

If CSS selectors needed only a few enhancements to compete with XPath, it might be worth enhancing a selector library to enable quick ramp up speed for more web people.

In the Chrome dev tools you can right click elements in the Elements tab and select Copy > Copy XPath.

For example, your comment:


> If CSS selectors needed only a few enhancements to compete with XPath

You may want to try ParslePy, it combines CSS/XPath functionality, allowing you to declaratively specify the selector paths in a JSON file. I just made a PR to allow YAML over JSON, but not sure if Pip picked up on it yet.

It appears this is based on that cssselect library mentioned above, compiling CSS selectors to XPath. That is, performance should approximate XPath, while convenience should be higher.

As an alternative to lxml or BeautifulSoup, I've used a library called PyQuery (https://pythonhosted.org/pyquery/) with some success. It has a very similar API to jQuery.

I can't stress enough what a bad idea it usually is to copy XPath expressions generated by dev tools. They tend to be super inefficient for traversing the tree (e.g. beginning with "*" for no reason), and don't make good use of tag attributes.
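To illustrate the difference, here is a small sketch on a made-up page, using the standard library's ElementTree (which supports only a subset of XPath; lxml's xpath() accepts the same expressions and more). The position-based path stands in for the kind of thing dev tools generate.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup for illustration only.
PAGE = """
<html><body>
  <div><span>ad banner</span></div>
  <div><span class="comment">first</span><span class="comment">second</span></div>
</body></html>
"""
root = ET.fromstring(PAGE.strip())

# Position-based path: breaks as soon as the site inserts or removes
# a <div> above the target.
brittle = root.find("./body/div[2]/span[1]")

# Attribute-based path: keyed on what the element is, not where it sits.
robust = root.findall(".//span[@class='comment']")
```

Both find the first comment today, but only the second keeps working if the ad banner disappears.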

How do you guys manage masking the IP address when you want to scrape using your python script?

I find there is really no need to hide or mask the IP address when web scraping. The use of proxies or Tor to do so is completely unnecessary and may even be prohibitive, e.g. try using Google in Tor.

When you are hitting sites thousands of times you have to make yourself as human like and anonymous as possible. Which isn't even as hard as it sounds. Just your address, user agent and random timers are the three most important things in botting.


I wrote a Clojure library that facilitates writing this sort of scripts in a relatively robust way:


Just wanted to say that skyscraper is awesome! Thanks for building it!

lxml is nice. As suggested, I would parse and scrape in different threads so you can speed things up a bit, but it's not required per se. If you can't get the data you see on the website using lxml, there might be ajax or other stuff involved. To capture these streams / data, use a headless browser like PhantomJS or so. The article looks good to me for 'simple' scrapings and is a good base to start playing with the concepts.

The nice thing about making a scraper from scratch like this is that you get to decide its behaviour and fingerprint, and you won't get blocked as some known scraper. That being said, most people would appreciate it if you parse their robots.txt, but depending on your geographical location this might be an 'extra' step which isn't needed... (I'd advise doing it anyway if you are friendly ;) and maybe put something like 'i don't bite' in the user agent for requests to let people know you are benign...) If you get blocked while trying to scrape, you can try to fake the site into thinking you are a browser just by setting the user agent and other headers appropriately. If you don't know which these are, open nc -nlvp 80 on your local machine and wget or firefox into it to see the headers...

Deciding on good xpaths or 'markers' to scrape can be automated, but if you need good, accurate data from a single source, it's often a good idea to manually go through the html and seek some good markers...

an alternate method of scraping is automating wget --recursive + links -dump to render html pages to txt output and grep or w/e these for what data you need... tons of methods can be devised... depending on your needs some will be more practical and stable than others.

Saving files is only useful if you need assurance on data quality and if you want to be able to tweak the results without having to re-request the data from the server (just point to a local data directory instead...). This way you can set up a harvester and parsers for this data.

If you want to scrape or harvest LARGE data sets, consider a proxy network or something like a docker instance juggling tor connections, to ensure rate limiting is not killing your harvesters...

Good luck, have fun, and don't kill people's servers with your traffic spam, that's a dick move.... (throttle/humanise your scrapings...)
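Throttling can be sketched in a few lines. This is just one possible shape, with made-up names and default delays; the `sleep` and `clock` parameters exist only so the class can be exercised without actually waiting.

```python
import random
import time


class Throttle:
    """Enforces a minimum, slightly randomised gap between requests."""

    def __init__(self, min_delay=2.0, jitter=1.0, sleep=time.sleep, clock=time.monotonic):
        self.min_delay = min_delay
        self.jitter = jitter
        self._sleep = sleep
        self._clock = clock
        self._last = None  # time of the previous wait() call, if any

    def wait(self):
        """Block until min_delay plus a random extra (up to jitter)
        has passed since the previous call."""
        if self._last is not None:
            target = self.min_delay + random.uniform(0, self.jitter)
            remaining = target - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Call `throttle.wait()` right before every request; the first call returns immediately and later calls pause only as long as needed.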

Does anyone know a good repo to do the same in Go?

goquery [1] is pretty nice.

[1] - https://github.com/PuerkitoBio/goquery

You probably mean gocrawl: https://github.com/PuerkitoBio/gocrawl

Can confirm. I've used goquery in a few projects and it's invariably worked out well.


I've been working on this: https://github.com/schollz/crawdad.

It's a Redis-backed distributed scraper that's easy to continue after interruptions.


What gives?

"Coming Soon

We're not ready yet..."
