
Scrapy Tips from the Pros - ddebernardy
http://blog.scrapinghub.com/2016/01/19/scrapy-tips-from-the-pros-part-1/
======
kami8845
I love ScrapingHub (and use them) but these tips go completely against my own
experience.

Whenever I've tried to extract data like that inside spiders, I would
invariably (and 50,000 URLs later) come to the realization that my .parse()
code did not cover some weird edge case on the scraped resource and that all
the data extracted was now basically untrustworthy and worthless. How do you
re-run all that with more robust logic? Restart from URL #1.

The only solution I've found is to completely de-couple scraping from parsing.
parse() captures the url, the response body, request and response headers and
then runs with the loot.
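
In Scrapy terms, a minimal sketch of that decoupling might look like this
(spider name and fields are just illustrative, and it assumes a reasonably
recent Scrapy for the header helpers); the callback extracts nothing, it only
captures the raw material so parsing can be redone offline as often as needed:

    import scrapy

    class RawDumpSpider(scrapy.Spider):
        # hypothetical spider: store raw responses so parsing can be redone later
        name = "rawdump"
        start_urls = ["http://example.com/"]

        def parse(self, response):
            # no extraction here, just the loot
            yield {
                "url": response.url,
                "status": response.status,
                "request_headers": response.request.headers.to_unicode_dict(),
                "response_headers": response.headers.to_unicode_dict(),
                "body": response.text,
            }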

Once you've secured it though, these libraries look great.

PS: If you haven't used ScrapingHub you definitely should give it a try, they
let you use their awesome & finely-tuned infrastructure completely for free.
One of my first spiders ran for 180,000 pages and 50,000 items extracted for
$0.

~~~
dante9999
> code did not cover some weird edge case on the scraped resource and that all
> data extracted was now basically untrustworthy and worthless.

Your data should not be worthless just because you don't catch some edge
cases early. Sure, there are always edge cases, but the best way to handle
them is to have proper validation logic in Scrapy pipelines: if an item is
missing a required field, for example, or has an invalid value (e.g. a price
that's a sequence of characters without digits), you should detect that
immediately, not after 50k URLs. The rule of thumb is "never trust data from
the internet", so always validate it carefully.
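
A minimal sketch of such a pipeline (field names are made up for
illustration):

    from scrapy.exceptions import DropItem

    class ValidationPipeline(object):
        # drop obviously broken items instead of silently storing them
        required_fields = ("title", "price")

        def process_item(self, item, spider):
            for field in self.required_fields:
                if not item.get(field):
                    raise DropItem("missing field %r in %r" % (field, item))
            # a "price" without a single digit is almost certainly a parsing bug
            if not any(ch.isdigit() for ch in item["price"]):
                raise DropItem("invalid price %r in %r" % (item["price"], item))
            return item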

If you have validation and encounter edge cases you will be sure that they are
actual weird outliers that you can either choose to ignore or somehow try to
force into your model of content.

~~~
kami8845
Hmm, I'll have to investigate that. Any tips for libraries to use for
validation that tie in well with Scrapy?

What do you do if you discover that your parsing logic needs to be changed
after you've scraped a few thousand items? Re-run your spiders on the URLs
that raised errors?

~~~
stummjr
Spider Contracts can help you:
[http://doc.scrapy.org/en/latest/topics/contracts.html](http://doc.scrapy.org/en/latest/topics/contracts.html)
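
A contract is just a set of annotations in the callback's docstring, checked
by running `scrapy check <spider>` (the URL, field names and selectors below
are illustrative):

    import scrapy

    class ProductSpider(scrapy.Spider):
        name = "products"

        def parse(self, response):
            """Extract product data from a detail page.

            @url http://example.com/some-product-page
            @returns items 1 1
            @returns requests 0 0
            @scrapes title price description
            """
            yield {
                "title": response.css("h1::text").extract_first(),
                "price": response.css(".price::text").extract_first(),
                "description": response.css(".description::text").extract_first(),
            }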

------
Cyph0n
Extremely well designed framework. It can cover more than 90% of use cases in
my opinion. I'm currently working on a project written in Scala that requires
a lot of scraping, and I feel really guilty that I'm not using Scrapy :(

~~~
horva
I think you can still combine the two. For example, Scrapy can sit behind a
service/server to which you'd send a request (with the same args as if you
were running it as a script, plus a callback URL), and after the items get
collected Scrapy can call your callback URL, sending all items in JSON format
to your Scala app. Or, if you want to be sure to avoid memory issues, you can
send each item to the Scala app as it gets collected. Basically, the idea is
to wrap Scrapy spiders with web service features - then you can use them in
combination with any other technology. Or you can run your spiders on Scrapy
Cloud at [http://scrapinghub.com/](http://scrapinghub.com/).
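
The per-item variant mentioned above can be as simple as an item pipeline
that forwards each item to your app (the endpoint below is made up, and the
blocking `requests.post` call is only for the sake of the sketch):

    import json
    import requests

    class ForwardItemPipeline(object):
        callback_url = "http://my-scala-app.example.com/items"  # assumed endpoint

        def process_item(self, item, spider):
            # push each item to the external app as soon as it's scraped
            requests.post(self.callback_url, data=json.dumps(dict(item)),
                          headers={"Content-Type": "application/json"})
            return item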

~~~
darkrho
There is ScrapyRT: [http://blog.scrapinghub.com/2015/01/22/introducing-
scrapyrt-...](http://blog.scrapinghub.com/2015/01/22/introducing-scrapyrt-an-
api-for-scrapy-spiders/)

In the project I work on we do have the usual periodic crawls and use ScrapyRT
to let the frontend trigger realtime scrapes of specific items, all of this
using the same spider code.

Edit: Worth noting that we trigger the realtime scrapes via AMQP.

------
daturkel
I've played with Scrapy before to make a proof of concept and I was pleased
with how easy it was (haven't had to use it for anything else yet).

That being said, had no idea how sophisticated it could get. This is super
impressive, especially the JavaScript rendering.

------
blisterpeanuts
Yay! Another article on scrapy. I'm just getting started and my first goal is
to scrape a tedious web-based management console that I can't get API access
to, and automate some tasks.

Very glad to learn about this site Scraping Hub. Keep the war stories coming.
It's technologies like these that brighten up our otherwise drab tech careers
and help some of us make it through the day.

------
rahulrrixe
I have been using this framework for more than three years and have seen how
it has evolved and made scraping so easy. The Portia project is also awesome.
I have customised Scrapy for almost all my cases, like having a single spider
for multiple sites with the rules provided as JSON. I think it is highly
scalable with a bit of tweaking, and Scrapy lets you do that very easily.

------
stummjr
Hey, author here! Feel free to ask any questions you have.

~~~
piroux
Here is a first one: what are the best ways to detect changes in HTML sources
with Scrapy, which would otherwise lead to missing data in automated systems
that need to be fed?

~~~
stummjr
Hey, not sure if I understood what you mean. Did you mean:

1) detect pages that had changed since the last crawl, to avoid recrawling
pages that hadn't changed?

2) detect pages that have changed their structure, breaking the spider that
crawls them?

~~~
stummjr
1) detect pages that had changed since the last crawl, to avoid recrawling
pages that hadn't changed?

You could use the deltafetch[1] middleware. It ignores requests to pages with
items extracted in previous crawls.
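
Wiring it up is a couple of settings, roughly like this (middleware path and
priority as used in scrapylib; double-check against the source linked in [1]
below):

    # settings.py
    SPIDER_MIDDLEWARES = {
        'scrapylib.deltafetch.DeltaFetch': 100,
    }
    DELTAFETCH_ENABLED = True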

2) detect pages that have changed their structure, breaking the spider that
crawls them?

This is a tough one, since most of the spiders are heavily based on the HTML
structure. You could use Spidermon [2] to monitor your spiders. It's available
as an addon in the Scrapy Cloud platform [3], and there are plans to open
source it in the near future. Also, dealing automatically with pages that
change their structure is on the roadmap for Portia [4].

[1]
[https://github.com/scrapinghub/scrapylib/blob/master/scrapyl...](https://github.com/scrapinghub/scrapylib/blob/master/scrapylib/deltafetch.py)

[2]
[http://doc.scrapinghub.com/addons.html?highlight=monitoring#...](http://doc.scrapinghub.com/addons.html?highlight=monitoring#monitoring)

[3] [http://scrapinghub.com/scrapy-cloud/](http://scrapinghub.com/scrapy-
cloud/)

[4] [http://scrapinghub.com/portia/](http://scrapinghub.com/portia/)

------
KhalilK
I tried using PHP to scrape 50,000 webpages for a couple of fields and got it
done in 4 hours; with Scrapy it took 12 minutes. Been using it ever since.

------
nathell
Wow, never heard of Scrapy! Looks like I've reinvented it in Clojure:
[https://github.com/nathell/skyscraper/](https://github.com/nathell/skyscraper/)

~~~
Cyph0n
That looks pretty cool. I was planning on writing something similar in Scala,
but I'm not sure if I have enough experience with the language to get it done.

~~~
khgvljhkb
If you're lazy (and if you're into FP you must be hehehe), just use that
Clojure library. Calling Clojure code from Java is easy, and I'm sure it's not
much harder from Scala.

------
contingencies
Man, I feel old. Does anyone remember learning web scraping from one section
of Fravia's site? Ever try to move forward from that to write a fully fledged
search engine? These memories are from 15 years ago... quite amusing how much
hasn't changed. In hindsight it was probably easier back then due to the lack
of JS-reliant pages, less awareness of automation, and fewer scraper-detection
algorithms.

~~~
melling
What is the state of web scraping? I've got a few thousand URLs that I'd like
to build a search engine around:

[https://github.com/melling/SwiftResources/blob/master/swift_...](https://github.com/melling/SwiftResources/blob/master/swift_urls.tsv)

I'm using Swift to preprocess the data and I host my server on AppEngine using
Go:

[http://www.h4labs.com/dev/ios/swift.html](http://www.h4labs.com/dev/ios/swift.html)

Since my engine is stitched together, I could use Python or Perl to scrape the
sites and extract the words, ignoring JavaScript and CSS.

------
hokkos
What I need is an API scraper; Scrapy seems to be mostly for HTML. I know how
to look at network requests in Chrome dev tools and JS functions to understand
the shape of a REST API, so I need something to plan the exploration of the
argument space. For example, if you want to scrape Airbnb, you look at their
API and find there is a REST call taking a lat/long box. I need something to
automatically explore an area: if the API only gives the first 50 results and
a call hits that limit, it should schedule 4 calls with boxes half the size,
and so on. If the request has cursors, you should be able to tell the scraper
how to follow them. I don't know what the best tool for that is.
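
Something along these lines is what I have in mind, as a Scrapy callback over
a JSON API (the endpoint, parameters and 50-result cap below are all
invented):

    import json
    import scrapy

    API = ("https://api.example.com/search"
           "?sw_lat={0}&sw_lng={1}&ne_lat={2}&ne_lng={3}")
    PAGE_LIMIT = 50  # assumed per-request result cap

    class BoxSpider(scrapy.Spider):
        name = "bbox"

        def start_requests(self):
            yield self.box_request((40.0, -74.5, 41.0, -73.5))

        def box_request(self, box):
            url = API.format(*box)
            return scrapy.Request(url, callback=self.parse, meta={"box": box})

        def parse(self, response):
            results = json.loads(response.text).get("results", [])
            if len(results) < PAGE_LIMIT:
                for result in results:
                    yield result
            else:
                # result set is truncated: split the box into 4 quadrants and recurse
                sw_lat, sw_lng, ne_lat, ne_lng = response.meta["box"]
                mid_lat = (sw_lat + ne_lat) / 2.0
                mid_lng = (sw_lng + ne_lng) / 2.0
                for lat0, lat1 in ((sw_lat, mid_lat), (mid_lat, ne_lat)):
                    for lng0, lng1 in ((sw_lng, mid_lng), (mid_lng, ne_lng)):
                        yield self.box_request((lat0, lng0, lat1, lng1))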

~~~
ddebernardy
It's not entirely clear what you're up to, but FYI Scrapy works fine when
scraping JSON data:

[http://stackoverflow.com/a/18172776/417194](http://stackoverflow.com/a/18172776/417194)
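
A parse callback can just decode the body itself; a minimal sketch (the
endpoint and field names are made up):

    import json
    import scrapy

    class ApiSpider(scrapy.Spider):
        name = "api"
        start_urls = ["https://api.example.com/listings?page=1"]

        def parse(self, response):
            # the response body is JSON, not HTML, so skip selectors entirely
            data = json.loads(response.text)
            for listing in data.get("results", []):
                yield listing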

------
staticautomatic
I like this article, if only for its discussion of these libraries. On another
note...

Am I the only one who dislikes Scrapy? I think it's basically the iOS of
scraping tools: it's incredibly easy to set up and use, and then as soon as
you need to do something even minutely non-standard it reveals itself to be
frustratingly inflexible.

~~~
ddebernardy
Scrapy is about as flexible and extensible as you can get... Care to elaborate
on "frustratingly inflexible"?

~~~
staticautomatic
I do a lot of scraping specific pages and often have to auth, form-fill,
refresh, recurse, use a custom SSL/TLS adapter, etc., in order to get what I'm
after. I'm sure Scrapy would be great if I just had a giant queue of GET
requests. Also, don't get me started on the Reactor.

------
inovica
We use Scrapy for a few projects and it is really, really good. They have a
commercial side to them, which is fine, but for anyone doing crawling/scraping
I'd strongly recommend it. Good article, too.

~~~
IanCal
Same, it's really nicely put together, lots of sensible defaults and it's easy
to add your own bit of awkward logic when necessary.

------
indymike
Great to see Scrapy getting some love. It's really well done and it scales
well (used it to scrape ~2m job posts from ATS & government job banks in 2-3
hours).

~~~
mikerice
Using it for the same use case, scraping a whole lot of job posts. Scrapy is
love, scrapy is life.

------
escherize
Here are slides from a talk I gave about an interesting approach to scraping
in Clojure [1]. This framework works really well when you have hierarchical
data that's a few pages deep. Another highlight is the decoupling of parsing
and downloading pages.

[1] - [http://slides.com/escherize/simple-structural-scraping-
with-...](http://slides.com/escherize/simple-structural-scraping-with-
skyscraper)

------
bbayer
I love scraping the web and producing structured data from web pages. The
only downside of using XPath or similar extraction approaches is the need for
constant maintenance. If I had enough knowledge of machine learning, I would
like to write a framework that analyzes similar pages and finds the structure
of the data without being told which parts of the page should be extracted.

~~~
stummjr
Maybe you should give Portia a try
([http://scrapinghub.com/portia/](http://scrapinghub.com/portia/)). It does
exactly what you describe.

You may also be interested in this library:
[https://github.com/scrapy/scrapely](https://github.com/scrapy/scrapely)
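
scrapely's basic train/scrape flow is roughly this (URLs and field values are
illustrative; see its README for the real examples):

    from scrapely import Scraper

    s = Scraper()
    # annotate one example page with the data you want...
    s.train('http://example.com/product/1',
            {'name': 'Some product', 'price': '19.99'})
    # ...then extract the same fields from structurally similar pages
    print(s.scrape('http://example.com/product/2'))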

------
inovica
Quick question. Could you use Scrapy to scrape specific individual pages from
thousands (or millions) of sites, or would you be better off using a search
engine crawler like Nutch for this? I want to crawl the first page of a number
of specific sites and was looking into the technologies for this.

~~~
sheraz
Yes, if you subclass CrawlSpider, you will be able to set rules on your
crawls [1].

[1]
-[http://scrapy.readthedocs.org/en/latest/topics/spiders.html?...](http://scrapy.readthedocs.org/en/latest/topics/spiders.html?highlight=crawlspider#crawlspider-
example)
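
A minimal sketch (the domain, link pattern and fields are invented for
illustration):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor

    class FrontPageSpider(CrawlSpider):
        name = "frontpages"
        start_urls = ["http://example.com/"]
        rules = (
            # only follow links matching this pattern, parsed by parse_item
            Rule(LinkExtractor(allow=(r"/news/",)), callback="parse_item"),
        )

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.xpath("//title/text()").extract_first(),
            }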

------
steinsgate
Confession: I am guilty of using regex superpowers to extract data from URLs.
Will check out w3lib soon!

~~~
hackerboos
I've had to write some gnarly XPath expressions to extract data with Scrapy.

> //b[contains(.,'City')]/following-sibling::a[not(preceding-
> sibling::b[contains(.,'Country')])]/text()

------
novaleaf
Anybody interested in browser automation can try
[http://api.phantomjscloud.com](http://api.phantomjscloud.com)

Disclaimer: I wrote it. No crawler yet, though that's next after a new
website.

------
pjc50
That reminds me, I was going to write a scraper to extract my HN comments.

~~~
detaro
HN API? [https://github.com/HackerNews/API](https://github.com/HackerNews/API)
Haven't used it yet, so no comment on how well it works/how far back it goes)

------
banterfoil
What kinds of careers often deal with web scraping and doing these sorts of
tasks? I'm really interested in this area, and some of you seem to be real
experts in the field.

~~~
stummjr
All sorts of careers, for example:

\- developers who want to build some data-based product (e.g. a travel agency
website that finds the best deals from airline companies);

\- lawyers can use it to structure the data from judgments and laws, so that
they are able to query the data for things like which judges have interpreted
a given law in their judgments (more on this:
[http://blog.scrapinghub.com/2016/01/13/vizlegal-rise-of-
mach...](http://blog.scrapinghub.com/2016/01/13/vizlegal-rise-of-machine-
readable-laws-and-court-judgments/))

\- (data-)journalists who work on investigative data-based articles (they use
it to gather the data to build visualizations, infographics, and also to
support their arguments).

\- real estate agencies can use it to grab the prices of their competitors,
or to get a map of what people are selling and which areas have the most
demand.

\- large companies that want to track their online reputation can scrape
forums, blogs, etc. for further analysis.

\- online retailers that want to keep their prices in line with their
competitors can scrape the competitors' websites, collecting prices from them.

More on Quora: [https://www.quora.com/What-are-examples-of-how-real-
business...](https://www.quora.com/What-are-examples-of-how-real-businesses-
use-web-scraping)

------
elktea
Shame Python 3 support isn't there for Scrapy yet - although checking the
repo, it looks like someone is actively (in the last few hours) porting it.
Good to see.

------
shostack
Rubyist here...how does Scrapy compare to Nokogiri?

~~~
ddebernardy
Much like apples to oranges.

Nokogiri is a tag soup parser. Scrapy is a web scraping framework.

In addition to tag soup parsing, Scrapy handles a slew of things such as text
encoding problems, retrying URLs that fail owing to network problems (if you
wish), dispatching requests across multiple spiders with a shared crawl
frontier (see Frontera), sharing code between similar spiders using
middlewares and pipelines, and what have you.

There's a Lisp joke that goes something like every sufficiently complex piece
of software in C is a slow, buggy, poorly implemented version of Lisp. Very
much the same could be said about Scrapy and web scraping projects. :-)

~~~
shostack
Makes sense, thanks for the clarification. To your knowledge, is there
anything comparable to Scrapy in Ruby land?

~~~
ddebernardy
Not that we're aware of. Most Rubyists use request and a tag soup parser,
without benefiting from the kind of parallelization that you get from Scrapy.

