
Python web scraping resources - dchuk
http://jakeaustwick.me/python-web-scraping-resource/?mc_list=python
======
dmritard96
A nice write-up. At a previous company I built a solution (I was forced into
Java, but it was ultimately the same process...) that used many of these
techniques. Some suggested next steps/additional enhancements if you need to
do this repeatedly and at scale:

Implement global throttling on a per-domain basis.
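A minimal sketch of what that could look like in a single process (the class name and the two-second default are mine, not from the comment):

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same domain."""

    def __init__(self, min_delay=2.0):
        self.min_delay = min_delay
        self.last_hit = {}  # domain -> monotonic timestamp of last request

    def wait(self, url):
        """Block until this URL's domain is allowed another request."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit.get(domain, float("-inf"))
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.last_hit[domain] = time.monotonic()
```

For a truly global limit across many worker processes you'd keep the timestamps in a shared store (e.g. Redis) rather than an in-process dict, but the shape is the same.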

Consider some abstraction. I implemented an abstract fetcher (with a number of
concrete fetchers that were runtime-selectable) and an abstract/concrete
parser, then composed a Scraper from the two. Allow for a runtime switch that
determines which fetcher to use (JavaScript-enabled, straight requests, etc.).
If you want to get really fancy, you can flag URLs in your database that need
to use a heavier, full-fledged browser.
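A rough Python sketch of that shape (the class names are mine; the browser fetcher is a stub standing in for a selenium-backed one):

```python
from abc import ABC, abstractmethod


class Fetcher(ABC):
    """Turns a URL into raw HTML."""

    @abstractmethod
    def fetch(self, url):
        ...


class Parser(ABC):
    """Turns raw HTML into structured data."""

    @abstractmethod
    def parse(self, html):
        ...


class PlainFetcher(Fetcher):
    """Straight HTTP, no JavaScript (stdlib only, for the sketch)."""

    def fetch(self, url):
        from urllib.request import urlopen
        return urlopen(url).read().decode("utf-8", "replace")


class BrowserFetcher(Fetcher):
    """Placeholder for a heavier, full-browser fetcher."""

    def fetch(self, url):
        raise NotImplementedError("drive selenium/Chrome here")


class Scraper:
    """Composes one fetcher with one parser."""

    def __init__(self, fetcher, parser):
        self.fetcher = fetcher
        self.parser = parser

    def scrape(self, url):
        return self.parser.parse(self.fetcher.fetch(url))


def make_scraper(parser, needs_js=False):
    # The runtime switch: a per-URL flag (e.g. stored in your URL database)
    # picks the cheap fetcher or the full browser.
    return Scraper(BrowserFetcher() if needs_js else PlainFetcher(), parser)
```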

For the fetcher, use the Selenium bindings. We tested PhantomJS and Chrome,
and Chrome outperformed Phantom. It might have been the Java bindings
(GhostDriver), but whatever; it's something you just have to test for
yourself. Once we settled on Chrome, I built a Chrome plugin to block ads and
other unrelated calls, which add LOADs of time. It's pretty tricky, but you
can inject a list of well-crafted regexes and it drops initial load times
dramatically.

For the parser, you may want to consider a fallback system. Often the
particular piece of data you want (say, a title) can be found in a handful of
places on the page, and falling back through them will make your parsing much
more reliable.
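One way to sketch such a fallback chain: an ordered list of extraction rules, where the first non-empty hit wins. The regex rules below are purely illustrative (in practice you'd use lxml/pyquery selectors):

```python
import re


def first_match(pattern):
    """Build a rule that returns the first regex capture group, or None."""
    def rule(html):
        m = re.search(pattern, html, re.S | re.I)
        return m.group(1).strip() if m else None
    return rule


# Places a title commonly lives, most specific first.
TITLE_RULES = [
    first_match(r'property="og:title"\s+content="([^"]+)"'),
    first_match(r"<title[^>]*>(.*?)</title>"),
    first_match(r"<h1[^>]*>(.*?)</h1>"),
]


def extract_title(html):
    """Fall back through the rules; return the first non-empty result."""
    for rule in TITLE_RULES:
        value = rule(html)
        if value:
            return value
    return None
```

The same pattern generalizes to any field: keep one rule list per field and the extraction loop stays identical.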

Compose a 'bot runner' from the Scraper. We had JSON documents that described
the fields we were abstracting and all the fallback rules used to locate the
needed data. Lastly, the bot runner can be expanded to include things like
navigation and other fancier tricks.

If you go for broke, build a system for generating bots (think Chrome plugin).

Don't forget, pruning dead URLs is a tricky little problem but an important
one.

To scale this whole operation linearly, we used a queue (Redis at first,
eventually Kafka) and Storm. Storm allowed us to arbitrarily expand and
contract our bot runners.

Scraping is a problem that just about everyone encounters, and a lot of the
most standard solutions seem to really fall short. Your article is an
excellent start.

~~~
meritt
Curious what was the catalyst to switch from Redis to Kafka? Reliability that
a given message was received and processed and replaying that to a different
consumer in the event of failure?

~~~
dmritard96
Mostly just scale. Keeping your queue in memory isn't necessarily a
requirement unless you have a lambda architecture or a realtime requirement
(which we started with). For our larger operations Kafka, which is
disk-backed, distributes with a little more ease, and is cheaper, was a nice
option. For reliability, we were using some of Storm's primitives as well as
the mechanisms inside qless, the queuing library we were using on top of
Redis:
[https://github.com/ChannelIQ/qless-java](https://github.com/ChannelIQ/qless-java)

------
jmduke
I would highly recommend Scrapy if you plan on doing any serious scraping:
[http://scrapy.org/](http://scrapy.org/).

~~~
crdoconnor
I wouldn't. I used this for a project and then quickly regretted it for the
following reasons:

* XPath selectors just plain suck.

* The item pipeline is way too straitjacketed. You have to do all sorts of fucking around with settings in order to make the sequence of events work in the (programmatic) way you want it because the framework developers 'assumed' there's only one way you'd really want to do it.

* Scrapy does not play well with other projects. You _can_ integrate it with django if you want a minimal web UI but it's a pain to do so.

* Tons of useless features. Telnet console? wtf?

* It's assumed that the 'end' of the pipeline will be some sort of serialization - to database, xml, json or something. Actually I usually just want to feed into the end of another project without any kind of serialization using plain old python objects. If I want serialization I probably want to do it myself.

* For some reason DjangoItem didn't really work (although by the time I tried to get it to work I'd kind of given up).

IMO this is a classic case of "framework that should have been a library".

Here's what I used instead after scrapping scrapy:

* mechanize - to mimic a web browser. I used requests sometimes too, but it doesn't really excel at pretending to be a web browser, so for that reason I usually used mechanize as a drop-in replacement.

* celery - to schedule the crawling / spin off multiple crawlers / rate-limiting / etc.

* pyquery - because xpath selectors suck and jquery selectors are better.

* python generators - to do pipelining.
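The generator pipelining is worth a quick sketch: each stage lazily pulls from the previous one, so nothing is fetched before it's needed (the stage functions here are stand-ins, not from the comment):

```python
def crawl(urls, fetch, parse):
    """Chain generator stages: fetch -> parse -> drop failed parses.

    Nothing runs until the returned generator is consumed; back-pressure
    comes for free because each stage pulls one item at a time.
    """
    pages = (fetch(url) for url in urls)                 # stage 1: fetch
    items = (parse(page) for page in pages)              # stage 2: extract
    return (item for item in items if item is not None)  # stage 3: filter
```

Downstream code just iterates the result (or hands it to the next pipeline stage), with no serialization step in between.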

I'm largely happy with the outcome. The code is less straitjacketed, easier to
understand and easier to integrate into other projects if necessary (you don't
have the headache of trying to get _two_ frameworks to play together nicely).

~~~
kmike84
Hey,

Good feedback, thanks!

> XPath selectors just plain suck.

Scrapy supports CSS selectors.

> The item pipeline is way too straitjacketed. You have to do all sorts of
> fucking around with settings in order to make the sequence of events work in
> the (programmatic) way you want it because the framework developers
> 'assumed' there's only one way you'd really want to do it.

Could you please give an example?

> Scrapy does not play well with other projects. You can integrate it with
> django if you want a minimal web UI but it's a pain to do so.

This is true. But it is a pain to integrate any event-loop based app with
another app that is not event-loop based. It is also true that Scrapy is not
easy to plug into an existing event loop (e.g. if you already have a Twisted
or Tornado-based service), but that should be fixed soon.

> Tons of useless features. Telnet console? wtf?

Telnet console is a Twisted feature; it came almost for free, and it is useful
to debug long-running spiders (which can run hours and days).

> It's assumed that the 'end' of the pipeline will be some sort of
> serialization - to database, xml, json or something. Actually I usually just
> want to feed into the end of another project without any kind of
> serialization using plain old python objects. If I want serialization I
> probably want to do it myself.

If you don't want serialization then you want a single process for both
crawling and the other tasks. This rules out synchronous solutions - you can't
e.g. integrate a crawler with django efficiently without serialization. If you
just want to do some post-processing, then I don't see why putting the code in
a Scrapy spider is worse than putting it in another script and calling Scrapy
from that script.

> For some reason DjangoItem didn't really work (although by the time I tried
> to get it to work I'd kind of given up).

This may be true... I don't quite get what it's for :)

> IMO this is a classic case of "framework that should have been a library".

It can't be a library like requests or mechanize, for technical reasons: to
make crawling efficient, Scrapy uses an event loop. It _can_ (and should) be a
library for twisted/tornado/asyncio; it is possible to use Scrapy as such a
library now, but it is not straightforward; this should (and will) be
simplified.

> * mechanize - to mimic a web browser. I used requests sometimes too, but it
> doesn't really excel at pretending to be a web browser, so for that reason I
> usually used mechanize as a drop-in replacement.
>
> * celery - to schedule the crawling / spin off multiple crawlers /
> rate-limiting / etc.
>
> * pyquery - because xpath selectors suck and jquery selectors are better.
>
> * python generators - to do pipelining.

Celery is also not the easiest piece of software. Scrapy is just a single
Python process that doesn't require any databases, etc.; Celery requires you
to deploy a broker and have a place to store task results, and it is also less
efficient for IO-bound tasks.

~~~
crdoconnor
>Scrapy supports CSS selectors.

Still far inferior to jQuery selectors.

>Could you please give an example?

The example I'm thinking of is when I was trying to create a pipeline that
would output a skeleton configuration file when you passed one switch and
would process and serialize the data parsed when you passed another. It was
possible but kludgy.

>But it is a pain to integrate any event-loop based app with another app that
is not event-loop based.

That's not where the pain lies. It's more the fact that it has its own weird
configuration/setup quirks (e.g. its own settings.py, reliance on environment
variables, executables).

>If you don't want serialization then you want a single process both for
crawling and for other tasks. This rules out synchronous solutions - you can't
e.g. integrate a crawler with django efficiently without serialization.

I don't really want scrapy doing process handling at all. It's not
particularly good at it. Celery is much better.

Using other code to do serialization also doesn't necessitate running it in
the same process. You can import the django ORM wherever you want and use it
to save to the DB. I know you _can_ do that - but, again, kludgy.

>It can't be a library like requests or mechanize for technical reasons - to
make crawling efficient Scrapy uses event loop.

I get that. It should have been more like Twisted from the outset, though. The
developers were clearly inspired by django, and that led them down a
treacherous path.

>It can (and should) be a library for twisted/tornado/asyncio; it is possible
to use Scrapy as such a library now, but this is not straightforward; this
should (and will) be simplified.

Well, that's good I suppose. I still think that it focuses on bringing
together a bunch of mediocre modules for which, individually, you can find
much better equivalents. Also, (unlike django) tight, seamless integration
between those modules doesn't really gain you much.

>Celery is also not the easiest piece of software.

The problem it is solving (distributed task processing) is not an easy
problem. Celery is not simple, but it is GOOD.

>Scrapy is just a single Python process that doesn't require any databases,
etc. Celery requires to deploy a broker and have a place to store task
results; it is also less efficient for IO-bound tasks.

A) You can use redis as a broker and that's trivial to set up. I always have a
redis available anyway because I always need a cache of some kind (even when
crawling!).

B) My crawling tasks are never I/O bound or CPU bound. They're bound by the
rate limiting imposed upon me by the websites I'm trying to crawl.

C) I'm usually using celery anyway. I still have to do task processing that
DOESN'T involve crawling. Where do I put that code when I'm using scrapy?

~~~
kmike84
> The example I'm thinking of is when I was trying to create a pipeline that
> would output a skeleton configuration file when you passed one switch and
> would process and serialize the data parsed when you passed another. It was
> possible but kludgy.

I don't get it - how is creating a configuration file related to processing
the items? Why would you do it in an item pipeline?

> It's more the fact that it has its own weird configuration/setup quirks
> (e.g. its own settings.py, reliance on environment variables, executables).

It is possible to create a Crawler from any settings object (not just a
module), and Scrapy does not rely on executables, AFAIK. But all of this is
poorly documented. Also, there is an ongoing GSoC project to make settings
easier and more "official".

> I don't really want scrapy doing process handling at all.

Scrapy doesn't handle processes, it is single-threaded and uses a single
process. This means that you can use e.g. a shared in-process state.

> Using other code to do serialization also doesn't necessitate running it on
> the same process. You can import the django ORM wherever you want and use it
> to save to the DB. I know you can do that - but, again, kludgy.

You can't move Python objects between processes without serialization. Why is
using django ORM kludgy in Scrapy but not in Celery?

> The problem it is solving (distributed task processing) is not an easy
> problem. Celery is not simple, but it is GOOD.

You don't necessarily need distributed task processing to do web crawling.
Celery is a great piece of software, and it is developing nicely, but you
always pay for complexity. For example, I faced the following problems when I
was using Celery:

* When redis was used as a broker its memory usage was growing infinitely. Lots of debugging, found a reason and a hacky way to overcome it ([https://github.com/celery/celery/issues/436](https://github.com/celery/celery/issues/436)). The issue was fixed, but apparently there is still a similar issue when MongoDB is used as a broker.

* Celery stopped processing without _anything_ useful in the logs (and of course Celery's error-sending facilities failed and I didn't have external monitoring) - it turned out a unicode exception was being eaten. A couple of days of nightmarish debugging; see [https://github.com/celery/celery/issues/92](https://github.com/celery/celery/issues/92).

* I implemented an email sender using Celery + RabbitMQ once. I think I was sending email text to tasks as parameters. Never do that (just use an MTA:)! When a large batch of emails was sent at once, RabbitMQ used all its memory, corrupted its database, and dropped the queue; I couldn't find a way to check which emails had been sent and which had not. This was 100% my fault, but it shows that a complex setup is not your friend.

Crawling tasks differ - e.g. if you need to crawl many different websites
(which is not uncommon) you will almost certainly be IO and CPU limited.
Scrapy is not a system for distributed task processing, it is just an event-
loop based crawler. I'm not saying your way to solve the problem is wrong; if
you already use celery it makes a lot of sense to use it for crawling as well.
But I don't agree that going distributed turtles all the way down with
celery+redis+DB for storage+... is easier or more efficient than using plain
Scrapy. A lot of tasks can be solved by writing a spider, getting a json file
with data and doing whatever one wants with it (upload to DB, etc).

------
preinheimer
I wish every post/tool out there on scraping covered obeying robots.txt. It's
a crap standard, but it's what we've got.

"Just ignore it" is a great way to identify yourself as a crappy netizen.
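And obeying it costs almost nothing in Python: the stdlib ships a parser. A sketch (parsed from a literal here so the example is self-contained; normally you'd `set_url(...)` and `read()` the live file):

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
# In real use: rp.set_url("https://example.com/robots.txt"); rp.read()
rp.parse(ROBOTS_TXT.splitlines())

allowed = rp.can_fetch("mybot", "https://example.com/public/page")
blocked = rp.can_fetch("mybot", "https://example.com/private/page")
delay = rp.crawl_delay("mybot")  # honor this between requests
```

Check `can_fetch` before every request and sleep for `crawl_delay` between hits, and you've covered the basics of being a polite crawler.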

~~~
mdaniel
I'm glad that all of the sites you target want your scraper to access them.
The goal in many cases where one would use a scraper is to access information
not provided in an API or otherwise encased in HTML. Most of their robots.txt
are "User-Agent: *\nDisallow: /\n"

~~~
preinheimer
Then we have no right to scrape that content.

Why is there an implied right to scrape?

------
rbucks
As a self-taught Python and Ruby programmer, I really appreciate this.

------
mushfiq
Clean and direct write-up. It reminds me of when I used to work on
crawling/scraping. It covers most of the topics you need to know for web
crawling.

------
victorhooi
Very nice, thanks for posting =).

Can people suggest any additional resources/reading on scraping/crawling as
well?

I was hoping to experiment with it in Go, but there doesn't seem to be much on
crawling/scraping with Go, except for goquery
([https://github.com/PuerkitoBio/goquery](https://github.com/PuerkitoBio/goquery)).

------
Diastro
A little distributed web scraper project I created a while back, if anyone is
interested / needs resources:
[https://github.com/Diastro/Zeek](https://github.com/Diastro/Zeek)

------
chatmasta
Really nice. I was just going to comment on how similar this was to scraping I
did during my days working in SEO, and then I saw your username! Long time no
talk. What's up man? Sounds like we should get in touch.

------
benny
In-depth article! I had to learn most of this stuff the hard way; I could have
used it a couple of weeks ago :) Proxies nowadays are really cheap. Isn't
ignoring robots.txt opening the door to being sued? Scraping copyrighted
material should be avoided too, in my opinion, but I guess that only matters
if you get caught :)

------
halcyondaze
I need to scrape an ecommerce site this weekend, and this will be a great
resource to keep bookmarked. Thanks.

~~~
miket
Check out
[http://diffbot.com/products/automatic/product/](http://diffbot.com/products/automatic/product/)
to do this fully automated.

~~~
logn
I'll self-promote too then.
[https://screenslicer.com](https://screenslicer.com)

~~~
BorisMelnik
Very cool! Probably one of the first web scrapers I've run across with a front
end where you can easily show users/investors etc. what it does. And it seems
to work pretty well. I really like your output display as well.

------
tomsthumb
The Python Selenium bindings are also nice if you absolutely have to deal with
information put together by JavaScript code.

~~~
Jake232
I actually cover this (very briefly) here: [http://jakeaustwick.me/python-web-
scraping-resource/#thesite...](http://jakeaustwick.me/python-web-scraping-
resource/#thesiteisshowingdifferentcontenttomyscraper)

I should dedicate a section to it though, will stick that on my to-do list.

~~~
tomarr
I've always opted for Ghost.py rather than Selenium, as I've found it uses
less memory and is pretty capable. Admittedly my scrapes are normally pretty
targeted and not over 10k+ sites.

One query I had about your piece: "When I've found myself in the unfortunate
place of getting my proxies banned before on certain sites, they have been
more than happy to switch them out for new IPs for me."

If you're getting banned from sites, is it not time to leave them alone? If
they don't want your traffic (which the admin/system has judged as too much),
should you really be circumventing it? With that and the robots.txt bit (which
admittedly you justified), you've got to be careful not to slip into a bit of
a grey area with scraping, which people regard suspiciously in the first
place.

