
Introduction to web scraping with Python - weenkus
https://datawhatnow.com/introduction-web-scraping-python/
======
Loic
It is making one mistake: it is parsing and scraping in the same loop. You
should pull the data, store it, and have another process access the data
store and perform the parsing and understanding of the data. A "quick" parse
can be done to pull the links and build your frontier, but the data should be
pulled and stored for the main parsing.

This allows you to test your parsing routines independently of the target
website, lets you compare with previous versions later, and lets you reparse
everything in the future, even after the original website is long gone.

My recommendation is to use the WARC archive format to store the results. This
way you are on the safe side (the storage is standardized), it compresses very
well, and WARC files are easy to handle (they are an immutable store, nice for
backups).
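
For example, a minimal sketch using the warcio package (just one way to write
WARC files; the file name and URL are placeholders):

    from warcio.capture_http import capture_http
    import requests  # import after capture_http so warcio can patch it

    with capture_http('scrape.warc.gz'):
        requests.get('https://example.com/some-page')
    # scrape.warc.gz now holds the raw request/response records for later re-parsing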

~~~
ivan_ah
For a simple caching solution that works well with requests, you can look at
cachecontrol:

    import requests
    from cachecontrol import CacheControl

    sess = requests.session()
    cached_sess = CacheControl(sess)
    response = cached_sess.get('http://google.com')

Very good for interactive debugging when you have to make multiple GET
requests. First time you'll hit the webserver, after that it's all served from
cache.

~~~
staticautomatic
requests.session() --> requests.Session()

------
tekkk
Yeah, this might be handy for small stuff, but it's way too naive for anything
bigger than a couple of pages. I recently had to scrape some pictures and
metadata from a website, and while scripts like these seemed cool, they really
didn't scale up at all. Consider navigation, following URLs and downloading
pictures, all while remaining within the limits of what's considered
non-intrusive.

My first attempt, similar to this, failed miserably as the site employed some
kind of cookie check that immediately blocked my requests by returning 403.

As mentioned in the article, I then moved on to Scrapy
[https://scrapy.org/](https://scrapy.org/). While seemingly a bit overkill,
once you create your scraper it's easy to expand and reuse the same scaffold
on other sites too. It also gives a lot more control over how gently you
scrape and nicely outputs the data you want as JSON/JL/CSV.
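
To give an idea, a minimal spider sketch for current Scrapy versions (the URL,
selectors and settings are placeholders, not my actual project):

    import scrapy

    class ImageSpider(scrapy.Spider):
        name = 'images'
        start_urls = ['https://example.com/gallery']   # placeholder
        custom_settings = {'DOWNLOAD_DELAY': 1.0}      # scrape gently

        def parse(self, response):
            for item in response.css('div.item'):      # placeholder selector
                yield {
                    'title': item.css('h2::text').get(),
                    'image_urls': item.css('img::attr(src)').getall(),
                }
            next_page = response.css('a.next::attr(href)').get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)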

Most problems I had were with the Scrapy pipelines and getting it to properly
output two JSON files plus the images. I could write a very short tutorial on
my setup if I weren't at work and otherwise busy right now.

And yes, it's a bit of a grey area, but for my project (training a simple CNN
on the images) I think it was acceptable, considering that I could have done
the same thing manually (and spent less time too).

~~~
qrybam
I've been through the rigmarole of writing my own crawlers and I find Scrapy
very powerful. I've run into roadblocks with dynamic/JavaScript-heavy sites;
for those parts selenium+chromedriver works really well.

As parent and others have said: this is a grey area so make sure to read the
terms of use and/or gain permission before scraping.
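
Roughly what I mean (a sketch only; assumes chromedriver is installed and on
PATH, and the URL is a placeholder):

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')

    driver = webdriver.Chrome(options=options)
    try:
        driver.get('https://example.com')
        rendered = driver.page_source   # DOM after the JavaScript has run
    finally:
        driver.quit()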

~~~
wootie512
I am just getting into web scraping and have also been using Selenium with
either Firefox or PhantomJS. Is there a better way to handle JavaScript-heavy
sites? I found one library called dryscrape but haven't had the time to look
too deeply into it.

~~~
RhodesianHunter
Take a look at the Google team's Puppeteer

[https://github.com/GoogleChrome/puppeteer](https://github.com/GoogleChrome/puppeteer)

------
drej
I love requests+lxml, use it fairly regularly, just a few quick notes:

1. lxml is _way_ faster than BeautifulSoup - this may not matter if all
you're waiting for is the network. But if you're parsing something on disk,
this may be significant.

2. Don't forget to check the status code of r (r.status_code or less
generally r.ok)

3. Those with a background in coding might prefer the .cssselect method
available in whatever object the parsed document results in. That's obviously
a tad slower than find/findall/xpath, but it's oftentimes too convenient to
pass up. (A combined sketch of 1-3 follows after this list.)

4. Kind of automatic, but I'll say it anyway - scraping is a gray area,
always make sure that what you're doing is legitimate.
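
Something like this (URL and selectors are placeholders; .cssselect needs the
cssselect package installed):

    import requests
    from lxml import html

    r = requests.get('https://example.com', timeout=10)   # placeholder URL
    r.raise_for_status()            # or check r.status_code / r.ok yourself

    tree = html.fromstring(r.content)
    # same document, two ways to select
    headings_xpath = tree.xpath('//h2/text()')
    headings_css = [el.text_content() for el in tree.cssselect('h2')]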

~~~
masklinn
> 1. lxml is way faster than BeautifulSoup - this may not matter if all
> you're waiting for is the network. But if you're parsing something on disk,
> this may be significant.

Caveat: lxml's HTML parser is garbage, and so is BS's; they _will_ parse pages
in non-obvious ways which do not reflect what you see in your browser, because
your browser follows HTML5 tree building.

html5lib fixes that (and can construct both lxml and bs trees, and both
libraries have html5lib integration), however it's slow. I don't know that
there is a native compatible parser (there are plenty of native HTML5 parsers
e.g. gumbo or html5ever but I don't remember them being able to generate lxml
or bs trees).
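
For instance (toy markup, just to show the two integrations):

    import html5lib
    from bs4 import BeautifulSoup

    markup = '<p>unclosed paragraph<div>browser-style recovery</div>'

    lxml_tree = html5lib.parse(markup, treebuilder='lxml')  # lxml document via html5lib
    soup = BeautifulSoup(markup, 'html5lib')                 # bs4 tree via html5lib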

> 2. Don't forget to check the status code of r (r.status_code or less
> generally r.ok)

Alternatively (depending on use case) `r.raise_for_status()`. I'm still
annoyed that there's no way to ask requests to just check it outright.

> Those with a background in coding might prefer the .cssselect method
> available in whatever object the parsed document results in. That's
> obviously a tad slower than find/findall/xpath, but it's oftentimes too
> convenient to pass up.

FWIW cssselect simply translates CSS selectors to XPath, and while I don't
know for sure, I'm guessing it has an expression cache, so it should not be
noticeably slower than XPath (CSS selectors are not a hugely complex language
anyway).
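
You can see the translation directly (selector is just an example):

    from cssselect import GenericTranslator

    # compile a CSS selector to its XPath equivalent
    print(GenericTranslator().css_to_xpath('div.item > a'))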

~~~
aumerle
[https://github.com/kovidgoyal/html5-parser](https://github.com/kovidgoyal/html5-parser)

~~~
masklinn
Nice. Seems to only do lxml tree building?

~~~
aumerle
[http://html5-parser.readthedocs.io/en/latest/#html5_parser.p...](http://html5-parser.readthedocs.io/en/latest/#html5_parser.parse)

~~~
masklinn
Damn.

Are you the author? If so, well done, that looks like a great package.

------
austincheney
This is perhaps the fastest way to screenscrape a dynamically executed
website.

1. First go get and run this code, which allows immediate gathering of all
text nodes from the DOM:
[https://github.com/prettydiff/getNodesByType/blob/master/get...](https://github.com/prettydiff/getNodesByType/blob/master/getNodesByType.js)

2. Extract the text content from the text nodes and ignore nodes that contain
only white space:

    let text = document.getNodesByType(3),  // 3 == Node.TEXT_NODE
        a = 0,
        b = text.length,
        output = [];
    do {
        if ((/^(\s+)$/).test(text[a].textContent) === false) {
            output.push(text[a].textContent);
        }
        a = a + 1;
    } while (a < b);
    output;

That will gather ALL text from the page. Since you are working from the DOM
directly you can filter your results by various contextual and stylistic
factors. Since this code is small and executes stupid fast it can be executed
by bots easily.

------
chinathrow
I wonder how many folks using this will obey the robots.txt as explained
nicely within the article:

"Robots

Web scraping is powerful, but with great power comes great responsibility.
When you are scraping somebody’s website, you should be mindful of not sending
too many requests. Most websites have a “robots.txt” which shows the rules
that your web scraper should obey (which URLs are allowed to be scraped, which
ones are not, the rate of requests you can send, etc.)."

~~~
robattila128
Not many. Browser testing libraries are widely repurposed as automation tools
by black hats.

------
laktek
I've found that a lot of use cases for web scraping are kinda ad hoc and
usually occur as part of another task (e.g. a research project or enhancing a
record). I ended up releasing a simple hosted API service called Page.REST
([https://page.rest](https://page.rest)) for people who would like to save
that extra dev effort and infrastructure cost.

~~~
gmac
I agree, and I find scripting a web browser via the developer console a really
productive approach.

First, it's completely interactive.

Second, it's the browser, so absolutely everything works. It doesn't matter if
the data you want is only loaded by an obscure JS function when a hidden form
is submitted on a button click. Just find the button, .click() it, and wait
for a mutation event.

I have a write-up on this[1], but I need to extend it with some more advanced
examples.

[1] [https://github.com/jawj/web-scraping-for-researchers](https://github.com/jawj/web-scraping-for-researchers)

~~~
dmn001
That may be fine for JavaScript-heavy sites with only a few pages, but for
anything with more than, say, 1,000 pages it is much more efficient to scrape
using requests with lxml. The requests can be made concurrently, that approach
scales, and there is no browser overhead from page rendering.
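
For instance, a rough concurrent sketch with a thread pool (URLs are
placeholders):

    import concurrent.futures
    import requests
    from lxml import html

    urls = ['https://example.com/page/%d' % i for i in range(1, 11)]

    def fetch_title(url):
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        return url, html.fromstring(r.content).findtext('.//title')

    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        for url, title in pool.map(fetch_title, urls):
            print(url, title)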

~~~
newlyretired
I've done a lot of scraping in my day, and I've found that lxml/requests is
2-3 OOM more resource efficient than a Selenium based browser. That
JS/rendering engine is HEAVY!

------
jancurn
With a headless browser the web scraping script can be even simpler. For
example, have a look at the same scraper for datawhatnow.com at
[https://www.apify.com/jancurn/YP4Xg-api-datawhatnow-com](https://www.apify.com/jancurn/YP4Xg-api-datawhatnow-com)

------
martinald
I've found .NET great for scraping. More so than Python as LINQ can be really
really useful for weird cases I find.

My usual setup on OSX is .NET Core + HTMLAgilityPack + Selenium.

~~~
dennisgorelik
Did you consider using AngleSharp instead of HTMLAgilityPack?

[https://github.com/AngleSharp/AngleSharp](https://github.com/AngleSharp/AngleSharp)

~~~
martinald
No, will check it out. Thanks.

------
WhitneyLand
Could CSS selectors, with a few minor extensions, be just as good as XPath for
this kind of thing?

I guess a lot of the reason I find XPath frustrating is that my usage
frequency corresponds exactly to the time needed to forget the syntax and have
to relearn/refresh it in my head.

If CSS selectors needed only a few enhancements to compete with XPath, it
might be worth enhancing a selector library to enable quick ramp up speed for
more web people.

~~~
tycho01
> If CSS selectors needed only a few enhancements to compete with XPath

You may want to try parslepy; it combines CSS/XPath functionality, allowing
you to declaratively specify the selector paths in a JSON file. I just made a
PR to allow YAML as well as JSON, but I'm not sure it has been picked up on
PyPI yet.

~~~
tycho01
It appears this is based on that cssselect library mentioned above, compiling
CSS selectors to XPath. That is, performance should approximate XPath, while
convenience should be higher.

------
donjh
As an alternative to lxml or BeautifulSoup, I've used a library called PyQuery
([https://pythonhosted.org/pyquery/](https://pythonhosted.org/pyquery/)) with
some success. It has a very similar API to jQuery.
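
A quick sketch of the style (URL and selector are placeholders):

    from pyquery import PyQuery as pq

    d = pq(url='https://example.com')   # fetches and parses the page
    for a in d('h2 a').items():         # jQuery-style selector
        print(a.text(), a.attr('href'))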

------
staticautomatic
I can't stress enough what a bad idea it usually is to copy XPath expressions
generated by dev tools. They tend to be super inefficient for traversing the
tree (e.g. beginning with "*" for no reason), and don't make good use of tag
attributes.
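
For illustration (both expressions are made up), compare a typical dev-tools
path like

    //*[@id="content"]/div[3]/div/div[2]/ul/li[5]/a

with a hand-written, attribute-based one such as

    //a[@class="article-title"]

which keeps working even when the page layout shifts a little.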

------
victor106
How do you guys manage masking the IP address when you want to scrape using
your python script?

~~~
dmn001
I find there is really no need to hide or mask the IP address when web
scraping. The use of proxies or Tor to do so is completely unnecessary and may
even be prohibitive (e.g. try using Google over Tor).

~~~
robattila128
When you are hitting sites thousands of times, you have to make yourself as
human-like and anonymous as possible, which isn't even as hard as it sounds.
Your IP address, user agent and random timers are the three most important
things in botting.

[http://www.blackhatunderground.net/forum/the-deep-web/9-blac...](http://www.blackhatunderground.net/forum/the-deep-web/9-black-hat-how-to-build-a-social-bots-army/)

------
nathell
I wrote a Clojure library that facilitates writing this sort of script in a
relatively robust way:

[https://github.com/nathell/skyscraper](https://github.com/nathell/skyscraper)

~~~
rlander
Just wanted to say that skyscraper is awesome! Thanks for building it!

------
vectorEQ
lxml is nice. I would, as suggested, parse and scrape in different threads so
you can speed things up a bit, but it's not required per se. If you can't get
the data you see on the website using lxml, there might be AJAX or other stuff
involved; to capture these streams/data, use a headless browser like PhantomJS
or so. The article looks good to me for 'simple' scraping and is a good base
to start playing with the concepts.

The nice thing about making a scraper from scratch like this is that you get
to decide its behaviour and fingerprint, and you won't get blocked as some
known scraper. That being said, most people would appreciate it if you parse
their robots.txt, but depending on your geographical location this might be an
'extra' step which isn't needed... (I'd advise doing it anyway if you are
friendly ;) and maybe put something like 'i don't bite' in the user agent of
your requests to let people know you are benign...) If you get blocked while
trying to scrape, you can try to fake the site into thinking you are a browser
just by setting the user agent and other headers appropriately, as in the
sketch below. If you don't know which headers these are, open nc -nlvp 80 on
your local machine and point wget or Firefox at it to see the headers...
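
A small sketch of setting headers with requests (header values and URL are
just examples; adjust to whatever your target expects):

    import requests

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) friendly-research-bot',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    r = requests.get('https://example.com', headers=headers, timeout=10)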

Deciding on good XPath expressions or 'markers' to scrape can be automated,
but if you need good, accurate data from a singular source, it's often a good
idea to manually go through the HTML and seek out some good markers...

An alternate method of scraping is automating wget --recursive plus links
-dump to render HTML pages to text output, then grep (or whatever) through
these for the data you need... Tons of methods can be devised... Depending on
your needs, some will be more practical and stable than others.

Saving files is only useful if you need assurance on data quality and if you
want to be able to tweak the results without having to re-request the data
from the server (just point to a local data directory instead...). This way
you can set up a harvester and parsers for this data.

If you want to scrape or harvest LARGE data sets, consider a proxy network or
something like a Tor-connection-juggling Docker instance or so, to ensure rate
limiting is not killing your harvesters...

Good luck, have fun, and don't kill people's servers with your traffic spam,
that's a dick move... (throttle/humanise your scraping...)
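
E.g. the dumbest possible throttle (the URL list is a placeholder, fill in
your own fetching/parsing where the comment is):

    import random
    import time

    urls = ['https://example.com/%d' % i for i in range(5)]   # placeholder frontier

    for url in urls:
        # ... fetch and parse url here ...
        time.sleep(random.uniform(1.0, 4.0))   # randomised delay to humanise the rate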

------
davidpelayo
Does anyone know a good repo for doing the same in Go?

~~~
q3k
goquery [1] is pretty nice.

[1] -
[https://github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery)

~~~
dullgiulio
You probably mean gocrawl:
[https://github.com/PuerkitoBio/gocrawl](https://github.com/PuerkitoBio/gocrawl)

