Web Scraping 101 with Python (gregreda.com)
240 points by shabdar on Mar 10, 2013 | 78 comments

PyQuery is pretty awesome (https://pypi.python.org/pypi/pyquery)

Use Requests to download the document, pump it into PyQuery, and you can use any jQuery-style selectors to get text, attributes, and all sorts of other stuff.

Example: here's how to scrape the Hacker News homepage: https://gist.github.com/samarudge/035ab8aaca224415cb49 (that code could probably be improved, but I only spent a couple of minutes on it).

Watch out for Unicode when using PyQuery and Requests. I recently provided a fix for that, which is now merged into the PyQuery repo. I use it (among other things) to scrape upcoming comic book releases =) http://cuppster.com/2013/01/30/decorators-scrapers-and-gener...

I also prefer PyQuery over Beautiful Soup.

Especially since you can use a Chrome or Firefox plugin to inject jQuery into any web page and then refine your selector in the JavaScript console.

All you need to do then is copy the selector into your Python script and you are done.

I definitely recommend this for people used to the jquery syntax. Requests + PyQuery took no time at all to learn and did everything I needed for some basic page crawling.

PyQuery seems to always be faster in my experience than BS4 (for ripping the same information). Anyone else have a similar experience?

Only on well-formed pages. There are many, many, many malformed pages on the internet, even ones created in 2013.

Fortunately, HTML5 defined a standard way to parse even broken HTML, and that parser is implemented in the html5lib package. You can also use it with lxml, and even use jQuery-like selectors with lxml.cssselect (http://lxml.de/cssselect.html).

BeautifulSoup has received a lot of positive press on HN over the years, so when I needed to do some heavy scraping I gave it a spin. It was a total disappointment. It's fine if you are scraping a small set of similar pages from a single site, but if you are scraping a large number of pages across many sites, and especially pages with text encodings other than ASCII / UTF-8, it chokes so frequently as to be useless. BeautifulSoup is fine for small jobs, but if you are building a web crawler, for example, look elsewhere; it is totally inadequate.

lxml.html (+ html5lib if needed) is a FAR superior choice indeed. It's an order of magnitude faster, and you can use not only XPath selectors but also CSS selectors on an lxml DOM. Or you can use Scrapy, which fixes all this encoding BS, handles per-domain crawl rates and concurrent requests, etc.

> with text encoding other than ascii / utf-8 it choaks so frequently as to be useless

I had the opposite experience though I used it only "for small jobs".

If there is no Content-Type (either in the HTTP headers or in a meta http-equiv) that specifies the character encoding, no meta charset, and no XML declaration for XHTML, that is, if the only way to find out the character encoding is to guess, then even in this case BeautifulSoup includes UnicodeDammit, which uses chardet to guess the encoding.
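The lookup order described above (HTTP header, then XML declaration or meta tag, then guessing) can be sketched with the standard library alone. To be clear, this is not how UnicodeDammit is implemented: the helper name `sniff_encoding`, the 2048-byte prescan, and the `utf-8` fallback are all assumptions for illustration, and the real guessing step (chardet) is left out.

```python
import re
from email.message import Message

# Charset inside <meta charset=...> or a meta http-equiv content attribute.
META_CHARSET = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9_\-]+)""", re.I)
# Encoding attribute of an XML declaration at the very start of the document.
XML_DECL = re.compile(rb"""^\s*<\?xml[^>]+encoding\s*=\s*["']([A-Za-z0-9_\-]+)""", re.I)


def sniff_encoding(content_type, body, fallback="utf-8"):
    """Return (encoding, where_it_came_from) for a response. Hypothetical helper."""
    # 1. charset parameter of the HTTP Content-Type header
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_content_charset()  # lower-cased, or None if absent
        if charset:
            return charset, "http-header"

    head = body[:2048]  # only look near the top of the document (assumed limit)

    # 2. XML declaration, as used by XHTML
    match = XML_DECL.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "xml-declaration"

    # 3. meta charset / meta http-equiv
    match = META_CHARSET.search(head)
    if match:
        return match.group(1).decode("ascii").lower(), "meta"

    # 4. nothing declared: this is where a guesser such as chardet would run
    return fallback, "fallback"


print(sniff_encoding("text/html; charset=Shift_JIS", b"<html></html>"))
```

Returning the source alongside the encoding makes it easy to log how often a crawl ends up in the guessing branch.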

In my experience lxml.html is much better.

The article shows that BeautifulSoup can use lxml internally. It can also use html5lib.

I'll second that: lxml.html is, in my experience, very robust and fast. I've written a lot of scrapers over the years, and the best combination I've found so far is requests / lxml.html / gevent.

It doesn't get any simpler than this IMO http://pastebin.com/hacxmAjV

For fun here's the Ruby+Nokogiri version and my attempt at the Clojure+Enlive version (3rd day learning Clojure).


Pretty nice. I'd like to see how it goes in a real world scenario with concurrency, manipulation of extracted nodes and general HTTP post/auth/etc. I'm not sure about Ruby+Nokogiri but Clojure+Enlive may well be a great choice.

I don't see a use for BeautifulSoup nowadays. I use lxml.etree for everything (with the HTML parser when needed), and do 99% of queries using XPath. It's the best way to do scraping with Python, in my experience.
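For anyone who hasn't tried the lxml.etree + HTML parser + XPath combination mentioned above, here is a minimal sketch. The markup and the `stories` / `title` names are made up for illustration, and it assumes lxml is installed.

```python
from lxml import etree

# Deliberately sloppy markup: the <li> and <p> tags are never closed.
BROKEN_HTML = """
<html><body>
<ul id="stories">
  <li><a class="title" href="/item?id=1">First story</a>
  <li><a class="title" href="/item?id=2">Second story</a>
</ul>
<p>footer
</body></html>
"""

# etree.HTMLParser is lxml's lenient HTML parser; it repairs the tree as it goes.
root = etree.fromstring(BROKEN_HTML, parser=etree.HTMLParser())

# XPath 1.0 queries: attribute values and text nodes come back as strings.
links = root.xpath('//ul[@id="stories"]//a[@class="title"]/@href')
titles = root.xpath('//ul[@id="stories"]//a[@class="title"]/text()')

for href, title in zip(links, titles):
    print(href, title)
```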

what would you suggest?

Why not both?


waiter there's some lxml in my soup!

I personally use a mix of BS4, lxml.html and pyquery.

Using Requests and lxml is a better solution — except when you need many concurrent spiders, in which case you should be using Scrapy (you'd probably be looking for the Middleware discussed here, too: https://groups.google.com/d/msg/scrapy-users/WqMLnKbA43I/B3N...).

This will only be a good approach if you are going to scrape a small number of pages. The problem is the use of synchronous requests, which block the crawler until each request has finished. Asynchronous requests, such as those supported by Twisted (and Scrapy), let you crawl a lot faster using the same resources.

This can actually sometimes be a feature. It makes it far less likely that your IP will get banned. It's also a far more polite way to crawl someone's site.
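To make the blocking problem concrete, here is a rough sketch that compares fetching pages one at a time with fetching them concurrently. Note the substitutions: it uses a standard-library thread pool rather than Twisted's event loop, and `fake_fetch` simulates network latency with a sleep instead of making real requests. The URLs are placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url, latency=0.1):
    """Stand-in for a blocking HTTP request; sleeps instead of using the network."""
    time.sleep(latency)
    return url, 200


def crawl_sequential(urls):
    # Each request must finish before the next one starts.
    return [fake_fetch(u) for u in urls]


def crawl_concurrent(urls, workers=5):
    # While one worker waits on I/O, the others keep going.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fake_fetch, urls))


urls = ["http://example.com/page/%d" % i for i in range(5)]

start = time.perf_counter()
sequential = crawl_sequential(urls)
sequential_time = time.perf_counter() - start

start = time.perf_counter()
parallel = crawl_concurrent(urls)
parallel_time = time.perf_counter() - start

print("sequential: %.2fs, concurrent: %.2fs" % (sequential_time, parallel_time))
```

The results are identical either way; only the time spent waiting changes.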

I agree, and for a 101 web scraping tutorial keeping it simple is nice.

I would argue that the proper implementation provides real rate limiting, both in terms of max requests per second and also max concurrent requests. Limiting to one concurrent request is likely to be extremely slow for any significant amount of data, and a couple concurrent requests is not impolite. Obviously I'm not saying you should effectively DoS the site you're scraping, but there's a balance and 1 concurrent request is almost definitely the wrong place to set it.
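A limiter with both knobs (requests per second and concurrent requests) might look roughly like this. `PoliteLimiter` and `run_demo` are hypothetical names, the requests are simulated with a sleep, and a production crawler would probably want one limiter per domain.

```python
import threading
import time


class PoliteLimiter:
    """Caps both the request rate and the number of in-flight requests."""

    def __init__(self, max_per_second, max_concurrent):
        self._min_interval = 1.0 / max_per_second
        self._slots = threading.BoundedSemaphore(max_concurrent)
        self._lock = threading.Lock()
        self._next_start = time.monotonic()

    def __enter__(self):
        self._slots.acquire()  # wait for a free concurrency slot
        with self._lock:       # reserve the next allowed start time
            now = time.monotonic()
            start_at = max(now, self._next_start)
            self._next_start = start_at + self._min_interval
        delay = start_at - now
        if delay > 0:
            time.sleep(delay)  # sleep outside the lock so others can reserve
        return self

    def __exit__(self, exc_type, exc, tb):
        self._slots.release()
        return False


def run_demo(jobs=6, max_per_second=20, max_concurrent=2):
    """Push fake requests through the limiter; return (peak in-flight, seconds)."""
    limiter = PoliteLimiter(max_per_second, max_concurrent)
    state = {"in_flight": 0, "peak": 0}
    state_lock = threading.Lock()

    def fake_request():
        with limiter:
            with state_lock:
                state["in_flight"] += 1
                state["peak"] = max(state["peak"], state["in_flight"])
            time.sleep(0.05)  # pretend to wait on the network
            with state_lock:
                state["in_flight"] -= 1

    began = time.monotonic()
    threads = [threading.Thread(target=fake_request) for _ in range(jobs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"], time.monotonic() - began


peak, elapsed = run_demo()
print("peak concurrency: %d, elapsed: %.2fs" % (peak, elapsed))
```

With these settings, no more than two fake requests are ever in flight and starts are spaced at least 50 ms apart.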

You could crawl a lot of different sites one page at a time. When I wrote a large distributed download system, I would use pycurl's bandwidth throttle and also store a 5 minute average of bandwidth per domain that would prevent other downloaders from saturating a domain.
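The per-domain rolling average could be sketched like this. It is only a guess at the general shape, not the system described above: `DomainBandwidth` is a hypothetical name, the samples live in one process's memory rather than a store shared between downloaders, and pycurl's throttle is not involved.

```python
import time
from collections import defaultdict, deque


class DomainBandwidth:
    """Sliding-window average of bytes per second, tracked per domain."""

    def __init__(self, window_seconds=300, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock  # injectable so the window can be tested without waiting
        self._samples = defaultdict(deque)  # domain -> deque of (timestamp, bytes)

    def record(self, domain, num_bytes):
        self._samples[domain].append((self.clock(), num_bytes))

    def average(self, domain):
        """Average bytes per second over the window for one domain."""
        samples = self._samples[domain]
        cutoff = self.clock() - self.window
        while samples and samples[0][0] < cutoff:
            samples.popleft()  # drop samples older than the window
        return sum(n for _, n in samples) / self.window

    def may_download(self, domain, limit_bytes_per_second):
        return self.average(domain) < limit_bytes_per_second


# Demo with a fake clock so five minutes pass instantly.
fake_now = [0.0]
bandwidth = DomainBandwidth(window_seconds=300, clock=lambda: fake_now[0])
bandwidth.record("example.com", 3000000)
print(bandwidth.may_download("example.com", 5000))  # over the limit right now
fake_now[0] = 301.0
print(bandwidth.may_download("example.com", 5000))  # old sample has expired
```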

I've heard this from countless people who have read the post. It's definitely made me want to look into Scrapy.

I'm not generally a huge fan of javascript, but phantomjs/casperjs are far and away the best tools I've used for scraping. Two features that stood out:

1. It's a headless WebKit browser, so it plays well with javascript and (sorta) flash.
2. It's easy to capture screenshots of the pages you're scraping, which is great for sanity checks later on.

http://phantomjs.org - the main engine
http://casperjs.org - syntactic sugar for phantomjs

They make scraping as easy as finding the right jquery selectors (once you inject jQuery onto the page) but can be very slow as compared to a vanilla HTML only scraper.

In my experience, a phantom/casper implementation could take upwards of 5-10 secs. to process a single page (almost 5-10x slower). This, even if you disable load of remote images and plugins.

There is a startup penalty to getting phantomjs executable up (including all of its WebKit internals), but once you're there, I've never had any performance issues. Roll a script using casper.each() and feed it an array of urls. It is typically very fast for me. You can trap on the page loaded event and do some benchmarking, but I would disagree with your premise that using PhantomJS/CasperJS is slow.

I'm surprised this article didn't mention scrapy. I had to do a lot of web scraping for a healthcare-related project last month and found scrapy incredibly fast and easy to use.

I have to admit that Scrapy is very fast, powerful, and easy to use and scale. However, it's probably easier to start with BS, as Scrapy requires you to learn the "Scrapy way of doing stuff". Furthermore, I find the documentation a bit unpolished at times.

Still, Scrapy is amazing and we use it a lot.

Scrapy is awesome and we have been using it without any problem so far.

I've been using BeautifulSoup for a couple of years, so it's what I'm most comfortable with. I'd heard of Scrapy before, but had never given it a serious look. That'll change based on all the positive things I've read about it in this thread.

Here are some awesome libraries I've used for HTML scraping:

1. Python - BeautifulSoup

2. Ruby - Nokogiri (use in conjunction with Watir if you're scraping a very client-heavy website).

3. C# - HtmlAgilityPack in conjunction with ScrapySharp (there's a nuget package for both) - I highly recommend ScrapySharp because it allows you to query elements using a very familiar Css selector type similar to how you query dom elements in jQuery. :)

Scraping online content is so simple these days, if a website doesn't offer an API you still have alternatives ;)

> Scraping online content is so simple these days, if a website doesn't offer an API you still have alternatives ;)

Having written code to both leverage a site's (private) API and to scrape that same site (when the private API stopped existing), I would much, much rather use an API. Yes, the scraper works, but the scraper's code is much messier, and JSON keys, for example, provide some documentation in their own right. Looking at the scraper months later, there's a lot more headscratching. Scraping will work, but it also will leave you searching for many little pieces of information that would be exposed by an API but aren't by a static site.

Still, I will agree that scraping is much easier; I've used jsdom (Node.js) extensively, and, for my use cases, it feels like working in a browser (full DOM, scripts, etc.).

Another point in favor of APIs is that scrapers are very brittle, and likely to break with changes to content presentation. Also, scrapers have a lot more overhead as you need to both receive and parse the markup data.

Also recommend Mechanize for Ruby (uses nokogiri under the covers).


http://phantomjs.org/ or its "easier to deal with" cousin http://casperjs.org/ for very client-heavy sites.

FWIW, you can get away with HTML only scrapers most of the time, you just need to look harder to find all the data. Totally recommend using "View page source" as that would always give you the original HTML vs the possibly altered DOM (after JS has run on the page) that you might see with Dev Tools/Firebug.

Mechanize is far and away the best and easiest way to scrape with Ruby until anything is rendered in javascript, which is explicitly not supported.

I tend to use Mechanize until I can't, then switch to Watir. Over time, I've found myself just straight up picking Watir, as it runs your browser directly and supports javascript rendering as a result.

How is performance with Watir? With casperjs a page takes me on an avg. 5-10 secs. to process.

Not great. About the same...

I'd recommend Selenium before PhantomJS in situations where Mechanize/Nokogiri don't cut the mustard.

I've found Selenium scripts much easier to comprehend, modify, and maintain over time than the PhantomJS scripts.

Check out casperjs, it should make life easier. Phantomjs by itself is extremely cumbersome in my experience.

CsQuery is another interesting library for C# along those same lines: https://github.com/jamietre/CsQuery

For the pythonistas: what is the relationship between BeautifulSoup, lxml, urllib*, scrapy and mechanize?

Here are my highly opinionated opinions about their respective use cases:

* BeautifulSoup: It was the best scraping library ever until python-lxml came around and stole the show. Although the manual said BeautifulSoup gives you Unicode, dammit, it had some long-standing bugs that caused it to give you strings or incorrectly decoded web pages. I wouldn't use it anymore because lxml is strictly superior.

* lxml: The king of scraping libraries. It is a big library so it can be hard to approach. It's actually a collection of markup parsers: lxml.html, lxml.etree and some more I've forgotten about. I almost exclusively use lxml.html since it "works" and can handle invalid markup without complaining. Use it like this: https://gist.github.com/mattoufoutu/823821 lxml can query using both jQuery-style selectors, with cssselect(), and XPath 1.0, with the xpath() method. XPath is hard to learn, but once you get it you have a really powerful parsing tool which makes your life simpler.

* pyQuery: Easy to get started with and use. But not as powerful as lxml+XPath.

* urllib: Many of the modules in Python's standard library are there because they have been there for a long time. :) python-requests or httplib2 are the best libraries for http.

* scrapy: A framework for scheduling and supervising scraper spiders. It takes care of everything from downloading pages, following urls, concurrent requests, and handling network errors to storing data in a database or generating csv files. It doesn't parse html itself, but delegates that task to lxml. Personally, I've found scrapy to be very good when your problem fits with how scrapy thinks scraping should be done. If you try to depart from the scrapy way, then scrapy suddenly feels very "frameworkish" and limiting. For example, I spent a lot of time trying to get it to support delta-scraping -- periodically scraping the same site, but only downloading new or changed data -- but it felt impossible to get scrapy to work the way I wanted.

* mechanize: Python port of the Perl module WWW::Mechanize. It's good for tasks like scripting logins. If you want to automate login to a site with a username and password, without having to care about session cookies, then mechanize is the ideal choice. Look elsewhere for an html parsing library.

* Scrapemark: It has a fun and clever approach to scraping. But once again, not as powerful as lxml.

There's even lxml.cssselect if you prefer CSS selectors over XPath.

I would look at "delta-scraping" as one of the things that is easier with scrapy. There is an existing extension for it: https://github.com/scrapinghub/scrapylib/blob/master/scrapyl... and lots of support and help from the community.

Admittedly writing this from scratch requires learning the framework, but being able to share and reuse code is a huge win (disclaimer: I'm a Scrapy contributor).
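For readers wondering what delta-scraping boils down to, one common approach is to keep a fingerprint of each page and skip pages whose fingerprint hasn't changed. The sketch below is not the scrapylib extension linked above and isn't tied to Scrapy at all; `DeltaStore` is a hypothetical name, and the in-memory dict stands in for whatever persistent storage a real crawler would use.

```python
import hashlib


class DeltaStore:
    """Remembers a fingerprint per URL so unchanged pages can be skipped."""

    def __init__(self):
        self._seen = {}  # url -> hex digest of the last body seen

    def is_new_or_changed(self, url, body):
        """Return True the first time a URL is seen or when its bytes differ."""
        digest = hashlib.sha256(body).hexdigest()
        if self._seen.get(url) == digest:
            return False
        self._seen[url] = digest
        return True


store = DeltaStore()
pages = [
    ("http://example.com/a", b"<html>v1</html>"),
    ("http://example.com/a", b"<html>v1</html>"),  # unchanged on the second visit
    ("http://example.com/a", b"<html>v2</html>"),  # content changed
]
for url, body in pages:
    print(url, "process" if store.is_new_or_changed(url, body) else "skip")
```

This version still downloads every page and only skips the processing step. Avoiding the download itself usually means conditional HTTP requests (ETag / If-None-Match or Last-Modified / If-Modified-Since), where the server supports them.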

Thank you. This is why I come back to HN: for pretty much every technical question I have, there's someone with experience with many alternatives and can give their opinions :)

I have done quite a fair bit of scraping over the last year, and I have to say that the combo of PhantomJS / CasperJS is really unbeatable. I have had to navigate some fairly awful DOM structures replete with errors, confirm dialogs, IE-only features, horrendous endless iframe trees, and more fun stuff. There's nothing I haven't been able to plow through yet using Phantom/Casper.

I used jsoup[1] for java when scraping orbitz.com in search of a cheap flight[2]. It uses jQuery style selectors so I could play with them in the javascript console before writing the code.

1: http://jsoup.org/
2: http://fluffyelephant.com/2012/09/crawling-orbitz-com/

fwiw, if you want something more advanced, check http://github.com/mbr/ragstoriches (disclaimer: coincidentally, i wrote it today)

it does async requests using gevent and requests and you can get a simple scraper in like 20 lines. comes with a craigslist example =)

Not to rain on the parade of this post (I'm in support of more people learning to scrape, and more services out there giving us easier to access data). I'm someone who loves web scraping, but I'm also someone who believes that if you don't know what the library is doing, you shouldn't be using it.

You can give a brief overview of how to use it and what to look for in the page to extract from, but you're giving a very simple cheat sheet to people that may not understand HTML (trust me they exist... unfortunately). As soon as your example breaks, or they reach a limitation with the library, they are going to throw their arms in the air and deem the library broken, or the task impossible to do because the example said it would work. The only reason I'm writing this is that I know of these sorts of people, I deal with them on a regular basis, and I have to explain to them every time to look at what they are doing on a lower level to get a better understanding of their problem to find the solution.

These sorts of people will stumble across this article after their bosses tell them "We need to pull Company X's product information into our sales screens so that we can compare the competition's prices while making our price adjustments". Not having a clue how to do that, they will Google for it and find this article. With no experience, and a boss behind them, they will just blindly use it and pray that it works, but due to their inexperience with the subject at hand they will fail.

Sorry to be so negative, I just had to say that. It's the same as any other tutorial out there, just Scraping is something that I feel you need to know what you're doing before you do it.

Personally, I have written my own scrapers from scratch (or using libraries I have written over time to make certain aspects less painful) for years. I know, I know, there are a myriad of ready-to-go libraries out there that will do the same thing, and probably better, but where's the challenge? Sure, if you're time-restricted, then go forth and grab a library and start scraping, but please at least try to understand what you are doing at a lower level.

I'm definitely not advocating for people not understanding the problem they want solved.

That said, your post sounds empty. Can you elaborate on why your own scrapers that you write from scratch make it all better? How do your scrapers deal with encoding detection, broken HTML, content prioritization, and so forth?

I don't like the current options we've got in pythonland, but just writing: "this sucks, so I write my own" sounds like an ego trip. Can you describe in detail what BeautifulSoup (or lxml which is usually a better option) is doing wrong at the lower level and how your scripts are making it better?

Sorry if it sounded empty, there is a reason why I didn't include examples. I'm not really saying "don't use libraries", more just that you should understand the problem first before looking for an easy solution. To be honest, I've done all my scraping in PHP/Perl over the years. Only recently have I started to look into other options such as Python and NodeJS (hence looking at this thread).

I don't claim that my scrapers are better off because they are written from scratch, but they do the job that I want them to do. If I find a target that has a "quirk", I write that into my classes to be used then and in later instances. The real point of doing it this way is more about knowing what the scraper is doing, rather than what it might do. When you're scraping, you're walking a fine line. Targets may be fine with you doing it to them, but as soon as your scraper freaks out and starts hammering the site, you're in trouble (even worse if you end up doing damage to the target).

I'm not saying that 3rd party libraries are prone to doing this, more so if you forget to set an option or handle an exception, you might screw yourself. If you wrote the scraper it's your own fault for not handling the issue properly. If you used a 3rd party library and the library bugged out causing the issue, you can't really go after the writers, right?

This all comes back to understanding your target, and to understand them, you need some form of knowledge on how it all works.

In response to your questions - I do a lot of things manually when setting up the scrapers. I don't import the data into any sort of DOM (to keep an eye on memory), and because of that I'm not really concerned about encoding (for the record, I'm generally dealing with UTF-8 and Shift_JIS only) or broken HTML (I do a general check over the source to see if the layout has changed. If it has, the scraper exits gracefully, sends me update notifications on what changed, and then puts itself out of action until I reset it. If it's a mission-critical scraper, let's just say that I have a myriad of alerts that are sent to me). It's probably not the best way of doing things, but it works for me.
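The "check the layout, bail out gracefully, stay disabled until reset" behaviour described above could be sketched like this. The commenter works in PHP/Perl and hasn't shared code, so everything here is an assumption: the `Scraper` class, the marker strings, and the `notify` hook are illustrative only.

```python
class Scraper:
    """Refuses to run once the page layout stops matching expectations."""

    def __init__(self, required_markers, notify=print):
        self.required_markers = list(required_markers)
        self.notify = notify  # hook for alerts; defaults to printing
        self.disabled = False

    def process(self, source):
        """Return True if the page was handled, False if the scraper is out of action."""
        if self.disabled:
            return False
        # Plain substring checks on the raw source: no DOM is ever built.
        missing = [m for m in self.required_markers if m not in source]
        if missing:
            self.disabled = True  # stay out of action until reset() is called
            self.notify("layout changed, missing: %s" % ", ".join(missing))
            return False
        # ... extraction of the actual data would go here ...
        return True

    def reset(self):
        self.disabled = False


scraper = Scraper(['<table id="prices">', '<td class="price">'])
print(scraper.process('<table id="prices"><tr><td class="price">9.99</td></tr></table>'))
print(scraper.process('<div id="prices"><span class="price">9.99</span></div>'))
```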

Sorry if I was vague, I probably should have put some sort of rant-detection on my mouth. If I didn't answer something specifically, it's not that I was ignoring it, it probably just fell into the "I don't trust it so I don't use it" category. Again, not advocating that people shouldn't use 3rd party libraries, just that you should at least know what you are doing before you do.

I've used Beautiful Soup (BS) and Scrapy. BS is a _lot_ slower than Scrapy, although you can probably get up and running with BS first (it's also more noob-friendly IMHO, since you don't have to learn yet another framework).

Learning Scrapy was made easier after some experience with BS.

For those who have been trying to scrape pages that make AJAX requests, something you cannot do with BeautifulSoup alone - I would recommend using Selenium Server (http://docs.seleniumhq.org/). This allows you to automate a real browser (so your scraping requests look like real page requests, not robots).

In addition, tools exist for Selenium that let you scale up easily. You can use Selenium Grid 2 (https://code.google.com/p/selenium/wiki/Grid2) to run multiple browser instances in parallel. This is very beneficial for web scraping or automated UI testing.

If you are interested in learning to build scrapers, I highly recommend this book: http://www.webbotsspidersscreenscrapers.com/ , the code is all PHP so it is very approachable.

What's also great for this is ScraperWiki (http://scraperwiki.com/) which supports python, ruby, and php.

For those who want to use a java based solution, I invite you to check out my open source block tolerant (IP Blocking) web scraper that runs on top of aws and rackspace, called Tales. Tales is designed to be easy to deploy, configure, and manage. With Tales you can scrape 10s or even 100s of domains concurrently.


I'm working on a project right now with PHP's built-in functions and the help of the Copy XPath feature in Google Chrome's developer tools:

$html = file_get_contents($url);
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//*[@id="resultCount"]/span');

I use http://scraperwiki.com It's pretty neat and handy! For beginners here's a tutorial I wrote sometime back http://blog.sanspace.in/scraperwiki/

What would you recommend for automating website interaction (using a bot to get betting numbers, then log in and place a bet without any human interaction)? Some sites offer an API (Betfair), but some don't.

And those numbers are being posted using ajax or something else (updating in real time)

For the Rubyists out there, I was just writing a crawler this morning, and I really like Nokogiri ( http://nokogiri.org/ ).

As an amateur at scraping and programming as well, I'd like to ask what the benefits are of using python to scrape over building a php scraper.

Why is he using lxml as the parser and not just the one built into BS?

lxml is superior to BS. Most of the elementtree API is implemented by lxml too so it's compatible with BS - not sure why he's using BS when everything is built into lxml though and things like PyQuery and/or XPath parsing are available.

BeautifulSoup uses regular expressions! http://stackoverflow.com/questions/1732348/regex-match-open-...

Like many here point out, lxml is a fast and versatile library that could be used for this alone without BS. lxml.html can parse HTML and lxml also has support for using HTML5 parser from html5lib that deals with broken HTML in the standardized way.

> BeautifulSoup uses regular expressions!

Holy hell, you're right.


Would anyone recommend Nutch as a scraping solution? I would think there would be some way of integrating it with WebDriver (within the Java ecosystem) or with CasperJS.

Isn't Nutch the state of the art right now?

lxml.html is great for most cases.

I developed this one for my own web scraping: http://docs.webscraping.com/

You can use OpenRefine. I am not saying it is worse or better, just another very powerful option.

But why python? Stop now and use perl. Perl's HTTP libraries are actually sane, unlike urllib[2], WWW::Mechanize is brilliant, and it's easier to throw in disgusting hacks in perl when you need them, which is constantly in the business of scraping the web.

Because the writer is accustomed to Python as their language of choice? Maybe it's a Python shop? What if they want to integrate it into their Django site?

If you're coming from a primarily Python background you won't just be waltzing right in and using CPAN modules right away. You have to understand the underlying language as well. Python and perl's object oriented systems for example are quite different (using Moose helps somewhat granted the person knows it even exists). Then there's the issue of understanding various contexts. These differences may take time getting used to depending on the level of the developer.

Python supports regex as well, which can be useful for weird situations. Granted, it's not going to be as tightly integrated as in Perl. There's even a Natural Language Toolkit if you want to get really crazy with things.
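As a small illustration of the regex point, here is the kind of "weird situation" where Python's standard `re` module is handy: pulling prices out of text that isn't worth parsing properly. The snippet and the price format are made up for the example.

```python
import re

# A dollar sign, optional whitespace, then digits with optional thousands
# separators and optional cents. The capture group holds the number only.
PRICE = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")

snippet = "<td>Now only $ 1,299.00</td><td>was $1,499</td><!-- $n/a -->"

# findall returns the captured group for every match, in document order.
prices = [float(match.replace(",", "")) for match in PRICE.findall(snippet)]
print(prices)
```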

TLDR: Right tool for the job should take environmental circumstances into account as well

Perl is valid if you know it and want to use it. Otherwise, not really.

There is a port of mechanize to Python as well, if that is your main reason to use Perl.

Whatever you can do in "disgusting hacks" can be done just as quickly in a way which won't make you want to vomit when you look over the code again later.

I bet Perl-based scrapers run faster ;) http://news.ycombinator.com/item?id=5252581

There are other python libraries than urllib.

requests is an excellent library to use in place of urllib
