
Web Scraping 101 with Python - shabdar
http://www.gregreda.com/2013/03/03/web-scraping-101-with-python/
======
samarudge
PyQuery is pretty awesome (<https://pypi.python.org/pypi/pyquery>)

Using Requests to download the document, pump it into PyQuery and you can use
any jQuery style selectors to get text, attributes and all sorts of other
stuff.

Example: here's how to scrape the Hacker News homepage:
<https://gist.github.com/samarudge/035ab8aaca224415cb49> (that code could
probably be improved, but I only spent a couple of minutes on it).
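
For anyone curious what that looks like, here's a minimal sketch of the
Requests + PyQuery combo (not the linked gist; the "td.title a" selector is an
assumption based on the 2013 HN markup):

    import requests
    from pyquery import PyQuery

    # fetch the page with Requests, then query it with jQuery-style selectors
    resp = requests.get("https://news.ycombinator.com/")
    doc = PyQuery(resp.text)

    # "td.title a" matched story links on the 2013 HN layout; adjust the
    # selector for whatever page you're actually scraping
    for link in doc("td.title a").items():
        print("%s -> %s" % (link.text(), link.attr("href")))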

~~~
kimagure
In my experience PyQuery always seems to be faster than BS4 (for extracting
the same information). Anyone else have a similar experience?

~~~
chewxy
Only on well-formed pages. There are many, many malformed pages on the
internet, even ones created in 2013.

~~~
ville
Fortunately HTML5 defined a standard way to parse even broken HTML, and that
parser is implemented in the html5lib package. You can also use it with lxml,
and even use jQuery-like selectors with lxml.cssselect
(<http://lxml.de/cssselect.html>).
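
A rough sketch of that combination, assuming html5lib is installed alongside
lxml (the URL and the "div.content a" selector are placeholders):

    import requests
    import html5lib                        # parses broken HTML the way browsers do
    from lxml.cssselect import CSSSelector

    raw = requests.get("http://example.com/").text           # placeholder URL
    # ask html5lib for an lxml tree, without XHTML namespaces in tag names
    tree = html5lib.parse(raw, treebuilder="lxml", namespaceHTMLElements=False)

    # CSSSelector compiles a CSS selector down to XPath
    select_links = CSSSelector("div.content a")               # placeholder selector
    for a in select_links(tree):
        print("%s -> %s" % (a.text, a.get("href")))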

------
pmorici
BeautifulSoup has received a lot of positive press on HN over the years, so
when I needed to do some heavy scraping I gave it a spin. It was a total
disappointment. It's fine if you are scraping a small set of similar pages
from a single site, but if you are scraping a large number of pages across
many sites, and especially pages with text encodings other than ASCII / UTF-8,
it chokes so frequently as to be useless. BeautifulSoup is fine for small
jobs, but if you were building a web crawler, for example, look elsewhere; it
is totally inadequate.

~~~
wslh
In my experience lxml.html is much better.

~~~
thibauts
I'll second that; lxml.html is in my experience very robust and fast. I've
been writing a lot of scrapers over the years, and the best combination I've
found so far is requests / lxml.html / gevent.

It doesn't get any simpler than this IMO <http://pastebin.com/hacxmAjV>
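
Not the linked paste, but a rough sketch of what the requests / lxml.html /
gevent combination tends to look like (the URLs are placeholders):

    from gevent import monkey
    monkey.patch_all()                      # make requests' sockets cooperative

    import gevent
    import requests
    from lxml import html

    def fetch_title(url):
        # download with requests, parse with lxml.html
        doc = html.fromstring(requests.get(url).content)
        return url, doc.findtext(".//title")

    urls = ["http://example.com/page/%d" % i for i in range(10)]   # placeholders
    jobs = [gevent.spawn(fetch_title, u) for u in urls]            # concurrent greenlets
    gevent.joinall(jobs)
    for job in jobs:
        print("%s: %s" % job.value)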

~~~
danneu
For fun here's the Ruby+Nokogiri version and my attempt at the Clojure+Enlive
version (3rd day learning Clojure).

<https://gist.github.com/danneu/5131596>

~~~
thibauts
Pretty nice. I'd like to see how it goes in a real world scenario with
concurrency, manipulation of extracted nodes and general HTTP post/auth/etc.
I'm not sure about Ruby+Nokogiri but Clojure+Enlive may well be a great
choice.

------
zachwill
Using Requests and lxml is a better solution — except when you need many
concurrent spiders, in which case you should be using Scrapy (you'd probably
be looking for the Middleware discussed here, too:
[https://groups.google.com/d/msg/scrapy-users/WqMLnKbA43I/B3N...](https://groups.google.com/d/msg/scrapy-users/WqMLnKbA43I/B3N1ysvoy-4J)).

------
niels
This will only be a good approach if you are going to scrape a small number of
pages. The problem is using synchronous requests, as this blocks the crawler
until a request has finished. Using asynchronous requests, such as those
supported by Twisted (and Scrapy), will let you crawl a lot faster with the
same resources.

~~~
boyter
This can actually sometimes be a feature. It makes it far less likely that
your IP gets banned. It's also a far more polite way to crawl someone's site.

~~~
niels
I agree, and for a 101 web scraping tutorial keeping it simple is nice.

------
nsp
I'm not generally a huge fan of JavaScript, but phantomjs/casperjs are far and
away the best tools I've used for scraping. Two features that stood out:

1\. It's a headless WebKit browser, so it plays well with JavaScript and
(sorta) Flash.

2\. It's easy to capture screenshots of the pages you're scraping, which is
great for sanity checks later on.

<http://phantomjs.org> - the main engine

<http://casperjs.org> - syntactic sugar for phantomjs

~~~
ankimal
They make scraping as easy as finding the right jQuery selectors (once you
inject jQuery into the page), but they can be very slow compared to a vanilla
HTML-only scraper.

In my experience, a phantom/casper implementation can take upwards of 5-10
seconds to process a single page (almost 5-10x slower), even if you disable
loading of remote images and plugins.

~~~
zenocon
There is a startup penalty to getting phantomjs executable up (including all
of its WebKit internals), but once you're there, I've never had any
performance issues. Roll a script using casper.each() and feed it an array of
urls. It is typically very fast for me. You can trap on the page loaded event
and do some benchmarking, but I would disagree with your premise that using
PhantomJS/CasperJS is slow.

------
sbrother
I'm surprised this article didn't mention scrapy. I had to do a lot of web
scraping for a healthcare-related project last month and found scrapy
incredibly fast and easy to use.

~~~
pisarzp
I have to admit that Scrapy is very fast, powerful, and easy to use and scale.
However, it's probably easier to start with BS, as Scrapy requires you to
learn the "Scrapy way of doing stuff". Furthermore, I find the documentation
to be a bit unpolished sometimes.

Still, Scrapy is amazing and we use it a lot.
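
For context, the "Scrapy way" mostly means subclassing a Spider and yielding
items from parse callbacks; a minimal sketch in the style of recent Scrapy
versions (the spider name, start URL and XPath are placeholders):

    import scrapy

    class LinkSpider(scrapy.Spider):
        name = "links"                              # placeholder spider name
        start_urls = ["http://example.com/"]        # placeholder start page

        def parse(self, response):
            # yield one item per link on the page; Scrapy handles scheduling,
            # retries and concurrency around this callback
            for href in response.xpath("//a/@href").extract():
                yield {"url": response.urljoin(href)}

You'd run it with something like: scrapy runspider spider.py -o links.json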

~~~
pkmishra
Scrapy is awesome and we have been using it without any problem so far.

------
sergiotapia
Here are some awesome libraries I've used for HTML scraping:

1\. Python - BeautifulSoup

2\. Ruby - Nokogiri (use in conjunction with Watir if you're scraping a very
client-heavy website).

3\. C# - HtmlAgilityPack in conjunction with ScrapySharp (there's a NuGet
package for both) - I highly recommend ScrapySharp because it lets you query
elements using familiar CSS selectors, similar to how you query DOM elements
in jQuery. :)

Scraping online content is so simple these days, if a website doesn't offer an
API you still have alternatives ;)

~~~
ankimal
Also recommend Mechanize for Ruby (uses nokogiri under the covers).

<http://mechanize.rubyforge.org/>

<http://phantomjs.org/> or its easier-to-deal-with cousin
<http://casperjs.org/> for very client-heavy sites.

FWIW, you can get away with HTML-only scrapers most of the time; you just need
to look harder to find all the data. I totally recommend using "View page
source", as that will always give you the original HTML vs. the possibly
altered DOM (after JS has run on the page) that you might see with Dev
Tools/Firebug.

~~~
johnnyg
Mechanize is far and away the best and easiest way to scrape with Ruby _until_
anything is rendered in JavaScript, which is explicitly not supported.

I tend to use Mechanize until I can't, then switch to Watir. Over time, I've
found myself just straight up picking Watir, as it runs your browser directly
and supports JavaScript rendering as a result.

~~~
ankimal
How is performance with Watir? With casperjs, a page takes me on average 5-10
seconds to process.

~~~
johnnyg
Not great. About the same...

------
niggler
For the pythonistas: what is the relationship between BeautifulSoup, lxml,
urllib*, scrapy and mechanize?

~~~
bjourne
Here are my highly opinionated opinions about their respective use cases:

* BeautifulSoup: It was the best scraping library ever until python-lxml came around and stole the show. Despite the manual's claim that _BeautifulSoup gives you unicode, dammit!_, it had some long-standing bugs where it gave you byte strings or incorrectly decoded web pages. I wouldn't use it anymore because lxml is strictly superior.

* lxml: The king of scraping libraries. It is a big library, so it can be hard to approach. It's actually a collection of markup parsers: lxml.html, lxml.etree and some more I've forgotten about. I almost exclusively use lxml.html since it "just works" and can handle invalid markup without complaining. Use it like this: <https://gist.github.com/mattoufoutu/823821>. lxml can parse using both jQuery-style selectors, with cssselect(), and XPath 1.0, with the xpath() method (see the sketch after this list). XPath is hard to learn, but once you get it you have a really powerful parsing tool that makes your life simpler.

* pyQuery: Easy to get started with and use. But not as powerful as lxml+XPath.

* urllib: Many of the modules in Python's standard library are there because they have been there for a long time. :) python-requests or httplib2 are the best libraries for HTTP.

* scrapy: A framework for scheduling and supervising scraper spiders. It takes care of everything from downloading pages, following URLs, making concurrent requests and handling network errors to storing data in a database or generating CSV files. It doesn't parse HTML itself, but delegates that task to lxml. Personally, I've found scrapy to be very good when your problem _fits_ with how scrapy thinks scraping should be done. If you try to depart from the scrapy way, then scrapy suddenly feels very "frameworkish" and limiting. For example, I spent a lot of time trying to get it to support delta-scraping -- periodically scraping the same site, but only downloading new or changed data -- but it felt impossible to get scrapy to work the way I wanted.

* mechanize: Python port of the Perl module WWW::Mechanize. It's good for tasks like scripting logins. If you want to automate login to a site with a username and password, without having to care about session cookies, then mechanize is the ideal choice. Look elsewhere for an html parsing library.

* Scrapemark: It has a fun and clever approach to scraping. But once again, not as powerful as lxml.
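
As a follow-up to the lxml point above, a tiny sketch of the two selector
styles side by side (URL and selectors are placeholders; the two queries are
only roughly equivalent):

    import requests
    from lxml import html

    doc = html.fromstring(requests.get("http://example.com/").content)

    # jQuery-style CSS selectors via cssselect()
    css_titles = [a.text_content() for a in doc.cssselect("h2.title a")]

    # roughly the same query expressed as XPath via xpath()
    xpath_titles = doc.xpath('//h2[contains(@class, "title")]/a/text()')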

~~~
SeppoErviala
There's even lxml.cssselect if you prefer CSS selectors over XPath.

------
zenocon
I have done quite a fair bit of scraping over the last year, and I have to say
that the combo of PhantomJS / CasperJS is really unbeatable. I have had to
navigate some fairly awful DOM structures replete with errors, confirm
dialogs, IE-only features, horrendous endless iframe trees, and more fun
stuff. There's nothing I haven't been able to plow through yet using
Phantom/Casper.

------
hmottestad
I used jsoup[1] for Java when scraping orbitz.com in search of a cheap
flight[2]. It uses jQuery-style selectors, so I could play with them in the
JavaScript console before writing the code.

1: <http://jsoup.org/> 2: <http://fluffyelephant.com/2012/09/crawling-orbitz-com/>

------
fwiw
FWIW, if you want something more advanced, check out
<http://github.com/mbr/ragstoriches> (disclaimer: coincidentally, I wrote it
today).

It does async requests using gevent and requests, and you can get a simple
scraper in about 20 lines. Comes with a craigslist example =)

------
kysol
Not to rain on the parade of this post (I'm in support of more people learning
to scrape, and of more services out there giving us easier-to-access data).
I'm someone who loves web scraping, but I'm also someone who believes that if
you don't know what the library is doing, you shouldn't be using it.

You can give a brief overview of how to use it and what to look for in the
page you're extracting from, but you're handing a very simple cheat sheet to
people who may not understand HTML (trust me, they exist... unfortunately). As
soon as your example breaks, or they reach a limitation of the library, they
are going to throw their arms in the air and deem the library broken, or the
task impossible, because the example said it would work. The only reason I'm
writing this is that I know these sorts of people, I deal with them on a
regular basis, and I have to explain to them every time to look at what they
are doing at a lower level to get a better understanding of their problem and
find the solution.

These sorts of people will stumble across this article after their bosses tell
them "We need to pull Company X's product information into our sales screens
so that we can compare the competition's prices while making our price
adjustments". Knowing that they don't have a clue how to do that, they will
Google for it and find this article. With no experience, and a boss behind
them, they will just blindly use it and pray that it works, but due to their
inexperience with the subject at hand they will fail.

Sorry to be so negative, I just had to say that. It's the same as any other
tutorial out there; scraping is just something I feel you need to understand
before you do it.

Personally, I've been writing my own scrapers from scratch (or using libraries
I have written over time to make certain aspects less painful) for years. I
know, I know, there is a myriad of ready-to-go libraries out there that will
do the same thing, and probably better, but where's the challenge? Sure, if
you're time-constrained, then go forth and grab a library and start scraping,
but please at least try to understand what you are doing at a lower level.

~~~
arthur_debert
I'm definitely not advocating that people shouldn't understand the problem
they want solved.

That said, your post sounds empty. Can you elaborate on why the scrapers you
write from scratch make it all better? How do your scrapers deal with encoding
detection, broken HTML, content prioritization and so forth?

I don't like the current options we've got in Python-land, but just writing
"this sucks, so I write my own" sounds like an ego trip. Can you describe in
detail what BeautifulSoup (or lxml, which is usually a better option) is doing
wrong at the lower level and how _your_ scripts do it better?

~~~
kysol
Sorry if it sounded empty; there's a reason I didn't include examples. I'm not
really saying "don't use libraries", more that you should understand the
problem first before looking for an easy solution. To be honest, I've done all
my scraping in PHP/Perl over the years. Only recently have I started to look
into other options such as Python and NodeJS (hence looking at this thread).

I don't claim that my scrapers are better off because they are written from
scratch, but they do the job that I want them to do. If I find a target that
has a "quirk", I write that into my classes to be used then and in later
instances. The real point of doing it this way is knowing what the scraper is
doing, rather than what it might do. When you're scraping, you're walking a
fine line. Targets may be fine with you doing it to them, but as soon as your
scraper freaks out and starts hammering the site, you're in trouble (even
worse if you end up doing damage to the target).

I'm not saying that 3rd party libraries are prone to doing this, more that if
you forget to set an option or handle an exception, you might screw yourself.
If you wrote the scraper, it's your own fault for not handling the issue
properly. If you used a 3rd party library and the library bugged out and
caused the issue, you can't really go after the writers, right?

This all comes back to understanding your target, and to understand them, you
need some form of knowledge on how it all works.

In response to your questions - I do a lot of things manually when setting up
the scrapers. I don't load the data into any sort of DOM (to keep memory in
check), and because of that I'm not really concerned about encoding (for the
record I'm generally dealing with UTF-8 and Shift_JIS only) or broken HTML (I
do a general check over the source to see if the layout has changed. If it
has, the scraper exits gracefully, sends me notifications about what changed,
then puts itself out of action until I reset it. If it's a mission-critical
scraper, let's just say that I have a myriad of alerts that are sent to me).
It's probably not the best way of doing things, but it works for me.

Sorry if I was vague; I probably should have put some sort of rant detection
on my mouth. If I didn't answer something specifically, it's not that I was
ignoring it; it probably just fell into the "I don't trust it so I don't use
it" category. Again, I'm not advocating that people shouldn't use 3rd party
libraries, just that you should at least know what you are doing before you
do.

------
nsomaru
I've used Beautiful Soup (BS) and Scrapy. BS is a _lot_ slower than Scrapy,
although you can probably get up and running with BS first (it's also more
noob-friendly imho since you don't have to learn yet another framework).

Learning Scrapy was made easier after some experience with BS.

------
jliechti1
For those who have been trying to scrape pages that make AJAX requests -
something you cannot do with BeautifulSoup alone - I would recommend using
Selenium Server (<http://docs.seleniumhq.org/>). This lets you automate a real
browser (so your scraping requests look like real page requests, not robots).

In addition, tools exist for Selenium that let you scale up easily. You can
use Selenium Grid 2 (<https://code.google.com/p/selenium/wiki/Grid2>) to run
multiple browser instances in parallel. This is very beneficial for web
scraping or automated UI testing.
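
A minimal sketch of the Python bindings, driving a local browser (swap in
webdriver.Remote pointed at a Selenium Server / Grid hub to scale out; the URL
is a placeholder):

    from selenium import webdriver

    # drive a real browser so JavaScript and AJAX run before we read the page
    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/")        # placeholder URL
        print(driver.title)
        rendered = driver.page_source            # the rendered DOM, AJAX included
        # feed rendered into lxml/BeautifulSoup for the actual extraction
    finally:
        driver.quit()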

------
crisnoble
If you are interested in learning to build scrapers, I highly recommend this
book: <http://www.webbotsspidersscreenscrapers.com/> , the code is all PHP so
it is very approachable.

------
danielsiders
What's also great for this is ScraperWiki (<http://scraperwiki.com/>) which
supports python, ruby, and php.

------
calufa
For those who want a Java-based solution, I invite you to check out my
open-source, block-tolerant (IP blocking) web scraper that runs on top of AWS
and Rackspace, called Tales. Tales is designed to be easy to deploy,
configure, and manage. With Tales you can scrape 10s or even 100s of domains
concurrently.

<https://github.com/calufa/tales-core>

------
jonaldomo
I'm working on a project right now with PHP's built-in functions and the help
of Google Chrome developer tools' Copy XPath functionality:

    $html = file_get_contents($url);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $elements = $xpath->query('//*[@id="resultCount"]/span');

------
sanspace
I use <http://scraperwiki.com>. It's pretty neat and handy! For beginners,
here's a tutorial I wrote some time back: <http://blog.sanspace.in/scraperwiki/>

------
josephagoss
What would you recommend for automating website interaction (using a bot to
get betting numbers, then log in and place a bet without any human
interaction)? Some sites offer an API (Betfair) but some don't.

~~~
josephagoss
And those numbers are being posted using AJAX or something else (updating in
real time).

------
canadev
For the Rubyists out there, I was just writing a crawler this morning, and I
really like Nokogiri ( <http://nokogiri.org/> ).

------
hisw
As an amateur at scraping and at programming as well, I'd like to ask what the
benefits are of using Python to scrape over building a PHP scraper.

------
tyilo
Why is he using lxml as the parser and not just the one built into BS?

~~~
ville
BeautifulSoup uses regular expressions!
[http://stackoverflow.com/questions/1732348/regex-match-open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

Like many here point out, lxml is a fast and versatile library that could be
used for this on its own, without BS. lxml.html can parse HTML, and lxml also
supports the HTML5 parser from html5lib, which deals with broken HTML in the
standardized way.

~~~
kanzure
> BeautifulSoup uses regular expressions!

Holy hell, you're right.

[http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view...](http://bazaar.launchpad.net/~leonardr/beautifulsoup/bs4/view/head:/bs4/element.py#L476)

------
sandGorgon
Would anyone recommend Nutch as a scraping solution? I would think there would
be some way of integrating it with WebDriver (within the Java ecosystem) or
with CasperJS.

Isn't Nutch state of the art right now?

------
hoju
lxml.html is great for most cases.

I developed this one for my own web scraping: <http://docs.webscraping.com/>

------
moha24
You can use OpenRefine. I am not saying this is worse or better, just another
very powerful option.

------
thrownaway2424
But why Python? Stop now and use Perl. Perl's HTTP libraries are actually
sane, unlike urllib[2], WWW::Mechanize is brilliant, and it's easier to throw
in disgusting hacks in Perl when you need them, which is constantly when
you're in the business of scraping the web.

~~~
cwgem
Because the writer is accustomed to Python as their language of choice? Maybe
it's a Python shop? What if they want to integrate it into their Django site?

If you're coming from a primarily Python background, you won't just be
waltzing right in and using CPAN modules right away. You have to understand
the underlying language as well. Python's and Perl's object-oriented systems,
for example, are quite different (using Moose helps somewhat, granted the
person even knows it exists). Then there's the issue of understanding Perl's
various contexts. These differences may take time to get used to, depending on
the level of the developer.

Python supports regex as well, which can be useful for weird situations,
though granted it's not going to be as tightly integrated as in Perl. There's
even the Natural Language Toolkit if you want to get really crazy with things.

TL;DR: The right tool for the job should take environmental circumstances into
account as well.

