Use Requests to download the document, pump it into PyQuery, and you can use any jQuery-style selector to get text, attributes, and all sorts of other stuff.
Example: here's how to scrape the Hacker News homepage: https://gist.github.com/samarudge/035ab8aaca224415cb49 (that code could probably be improved, but I only spent a couple of minutes on it).
All you need to do then is copy the selector into your Python script and you're done.
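For reference, a minimal sketch of that Requests + PyQuery combination; the selector here is purely illustrative, since Hacker News's real markup may use different tags and class names:

import requests
from pyquery import PyQuery

resp = requests.get("https://news.ycombinator.com/")
doc = PyQuery(resp.text)

# jQuery-style selection: walk the anchors and pull text plus href
for link in doc("a").items():
    print(link.text(), link.attr("href"))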
I had the opposite experience, though I used it only "for small jobs".
If there is no Content-Type (either in the HTTP headers or a meta http-equiv) that specifies the character encoding, no meta charset, and no XML declaration for XHTML, etc., that is, if the only way to find out the character encoding is to guess, then even in this case BeautifulSoup has you covered: it includes UnicodeDammit, which uses chardet to guess the encoding.
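A small sketch of that fallback, assuming the bs4 package (BeautifulSoup 4), where UnicodeDammit lives:

from bs4 import UnicodeDammit

raw = open("page_with_no_declared_encoding.html", "rb").read()
dammit = UnicodeDammit(raw)
print(dammit.original_encoding)   # chardet's best guess, e.g. 'shift_jis'
text = dammit.unicode_markup      # the markup decoded to unicode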
It doesn't get any simpler than this IMO http://pastebin.com/hacxmAjV
Waiter, there's some lxml in my soup!
2. It's easy to capture screenshots of the pages you're scraping, which is great for sanity checks later on.
http://phantomjs.org - the main engine
http://casperjs.org - syntactic sugar for PhantomJS
In my experience, a PhantomJS/CasperJS implementation could take upwards of 5-10 seconds to process a single page (almost 5-10x slower), even if you disable loading of remote images and plugins.
Still, Scrapy is amazing and we use it a lot.
1. Python - BeautifulSoup (see the short sketch after this list)
2. Ruby - Nokogiri (use in conjunction with Watir if you're scraping a very client-heavy website).
3. C# - HtmlAgilityPack in conjunction with ScrapySharp (there's a NuGet package for both) - I highly recommend ScrapySharp because it lets you query elements using familiar CSS selectors, similar to how you query DOM elements in jQuery. :)
Scraping online content is so simple these days; if a website doesn't offer an API, you still have alternatives ;)
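Here's a rough sketch of option 1 (Python + BeautifulSoup), assuming bs4 and requests are installed; the URL and selector are placeholders:

import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/products").text
soup = BeautifulSoup(html, "html.parser")

# select() takes CSS selectors, much like the jQuery-style querying
# mentioned for ScrapySharp above
for cell in soup.select("td.price"):
    print(cell.get_text(strip=True))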
Having written code to both leverage a site's (private) API and to scrape that same site (when the private API stopped existing), I would much, much rather use an API. Yes, the scraper works, but the scraper's code is much messier, and JSON keys, for example, provide some documentation in their own right. Looking at the scraper months later, there's a lot more headscratching. Scraping will work, but it also will leave you searching for many little pieces of information that would be exposed by an API but aren't by a static site.
Still, I will agree that scraping is much easier; I've used jsdom (Node.js) extensively, and, for my use cases, it feels like working in a browser (full DOM, scripts, etc.).
http://phantomjs.org/, or its "easier to deal with" cousin http://casperjs.org/, for very client-heavy sites.
FWIW, you can get away with HTML-only scrapers most of the time; you just need to look harder to find all the data. I totally recommend using "View page source", as that always gives you the original HTML rather than the possibly altered DOM (after JS has run on the page) that you might see in Dev Tools/Firebug.
I've found Selenium scripts much easier to comprehend, modify, and maintain over time than the PhantomJS scripts.
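For what it's worth, a hedged sketch of what such a Selenium script can look like in Python (URL and selector are placeholders); it also grabs a screenshot, which covers the sanity-check point made above:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()   # or webdriver.Chrome()
try:
    driver.get("https://example.com/listings")
    driver.save_screenshot("listings.png")   # sanity check for later
    for el in driver.find_elements(By.CSS_SELECTOR, "div.listing h2"):
        print(el.text)
finally:
    driver.quit()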
* BeautifulSoup: It was the best scraping library ever until python-lxml came around and stole the show. Despite the manual's promise that BeautifulSoup "gives you Unicode, dammit", it had some long-standing bugs where it gave you byte strings or incorrectly decoded web pages. I wouldn't use it anymore because lxml is strictly superior.
* lxml: The king of scraping libraries. It is a big library, so it can be hard to approach. It's actually a collection of markup parsers: lxml.html, lxml.etree, and some more I've forgotten about. I almost exclusively use lxml.html since it "works" and can handle invalid markup without complaining. Use it like this: https://gist.github.com/mattoufoutu/823821 lxml can query using both jQuery-style selectors with cssselect() and XPath 1.0 with the xpath() method (see the sketch after this list). XPath is hard to learn, but once you get it you have a really powerful parsing tool that makes your life simpler.
* pyQuery: Easy to get started with and use. But not as powerful as lxml+XPath.
* urllib: Many of the modules in Python's standard library are there because they have been there for a long time. :) python-requests or httplib2 are the best libraries for http.
* scrapy: A framework for scheduling and supervising scraper spiders. It takes care of everything from downloading pages, following URLs, concurrent requests, and handling network errors to storing data in a database or generating CSV files. It doesn't parse HTML itself, but delegates that task to lxml. Personally, I've found Scrapy to be very good when your problem fits how Scrapy thinks scraping should be done. If you try to depart from the Scrapy way, then Scrapy suddenly feels very "frameworkish" and limiting. For example, I spent a lot of time trying to get it to support delta-scraping, i.e. periodically scraping the same site but only downloading new or changed data, but it felt impossible getting Scrapy to work the way I wanted.
* mechanize: Python port of the Perl module WWW::Mechanize. It's good for tasks like scripting logins. If you want to automate login to a site with a username and password, without having to care about session cookies, then mechanize is the ideal choice. Look elsewhere for an html parsing library.
* Scrapemark: It has a fun and clever approach to scraping. But once again, not as powerful as lxml.
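To make the lxml point above concrete, a minimal sketch of the lxml.html workflow; the URL and selectors are illustrative only, and cssselect() needs the cssselect package installed:

import requests
from lxml import html

doc = html.fromstring(requests.get("https://example.com/").content)

# jQuery-style CSS selectors...
for a in doc.cssselect("div.story a.title"):
    print(a.text_content(), a.get("href"))

# ...or the same query with XPath
titles = doc.xpath('//div[@class="story"]//a[@class="title"]/text()')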
Admittedly writing this from scratch requires learning the framework, but being able to share and reuse code is a huge win (disclaimer: I'm a Scrapy contributor).
It does async requests using gevent and requests, and you can get a simple scraper in like 20 lines. It comes with a Craigslist example =)
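The gevent + requests pattern described there looks roughly like this; this is just a sketch of the pattern, not that particular library, and the URLs are placeholders:

from gevent import monkey; monkey.patch_all()

import gevent
import requests

urls = ["https://example.com/page/%d" % i for i in range(1, 4)]

def fetch(url):
    # each greenlet downloads one page concurrently
    return requests.get(url).text

jobs = [gevent.spawn(fetch, u) for u in urls]
gevent.joinall(jobs)
pages = [job.value for job in jobs]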
You can give a brief overview of how to use it and what to look for in the page to extract from, but you're giving a very simple cheat sheet to people that may not understand HTML (trust me they exist... unfortunately). As soon as your example breaks, or they reach a limitation with the library, they are going to throw their arms in the air and deem the library broken, or the task impossible to do because the example said it would work. The only reason I'm writing this is that I know of these sorts of people, I deal with them on a regular basis, and I have to explain to them every time to look at what they are doing on a lower level to get a better understanding of their problem to find the solution.
These sorts of people will stumble across this article after their bosses tell them "We need to pull Company X's product information into our sales screens so that we can compare the competition's prices while making our price adjustments". Not having a clue how to do that, they will Google for it and find this article. With no experience, and a boss behind them, they will just blindly use it and pray that it works, but due to their inexperience with the subject at hand they will fail.
Sorry to be so negative, I just had to say that. It's the same as any other tutorial out there; it's just that scraping is something where I feel you need to know what you're doing before you do it.
Personally, I have written my own scrapers from scratch (or using libraries I've built over time to make certain aspects less painful) for years. I know, I know, there is a myriad of ready-to-go libraries out there that will do the same thing, and probably better, but where's the challenge? Sure, if you're time-restricted, then go forth, grab a library, and start scraping, but please at least try to understand what you are doing at a lower level.
That said, your post sounds empty. Can you elaborate on why the scrapers you write from scratch make it all better? How do your scrapers deal with encoding detection, broken HTML, content prioritization, and so forth?
I don't like the current options we've got in pythonland, but just writing: "this sucks, so I write my own" sounds like an ego trip. Can you describe in detail what BeautifulSoup (or lxml which is usually a better option) is doing wrong at the lower level and how your scripts are making it better?
I don't claim that my scrapers are better off because they are written from scratch, but they do the job that I want them to do. If I find a target that has a "quirk" I write that into my classes to be used then and in later instances. The real point of doing it this way is more about knowing what the scraper is doing, rather than what it might do. When you're scraping, you're walking a fine line. Targets may be fine with you doing it to them, but as soon as your scraper freaks out and starts hammering the site, you're in trouble (even worse if you end up doing damage to the target).
I'm not saying that 3rd party libraries are prone to doing this; more that if you forget to set an option or handle an exception, you might screw yourself. If you wrote the scraper, it's your own fault for not handling the issue properly. If you used a 3rd party library and the library bugged out causing the issue, you can't really go after the writers, right?
This all comes back to understanding your target, and to understand them, you need some form of knowledge on how it all works.
In response to your questions - I do a lot of things manually when setting up the scrapers. I don't import the data into any sort of DOM (to keep an eye on memory), and in doing that I'm not really concerned about encoding (for the record I'm generally dealing with UTF-8 and Shift_JIS only) or broken HTML. I do a general check over the source to see if the layout has changed. If it has, the scraper exits gracefully, sends me notifications about what changed, then puts itself out of action until I reset it. (If it's a mission-critical scraper, let's just say that I have a myriad of alerts that are sent to me.) It's probably not the best way of doing things, but it works for me.
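A purely hypothetical sketch of such a layout check; the sentinel fragment, URL, and the notify_maintainer() helper are made up for illustration:

import sys
import requests

SENTINEL = '<div id="product-table">'   # fragment the layout is expected to contain

def layout_ok(html):
    return SENTINEL in html

html = requests.get("https://example.com/products").text
if not layout_ok(html):
    notify_maintainer("layout changed, scraper disabled")   # hypothetical alert helper
    sys.exit(1)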
Sorry if I was vague, I probably should have put some sort of rant-detection on my mouth. If I didn't answer something specifically, it's not that I was ignoring it, it probably just fell into the "I don't trust it so I don't use it" category. Again, not advocating that people shouldn't use 3rd party libraries, just that you should at least know what you are doing before you do.
Learning Scrapy was made easier after some experience with BS.
In addition, tools exist for Selenium that let you scale up easily. You can use Selenium Grid 2 (https://code.google.com/p/selenium/wiki/Grid2) to run multiple browser instances in parallel. This is very beneficial for web scraping or automated UI testing.
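Pointing a Python Selenium script at a Grid hub is mostly a matter of swapping in the Remote driver; a hedged sketch, where the hub URL is the Grid default and newer Selenium releases use options objects instead of desired_capabilities:

from selenium import webdriver

driver = webdriver.Remote(
    command_executor="http://localhost:4444/wd/hub",   # Grid hub
    desired_capabilities=webdriver.DesiredCapabilities.FIREFOX,
)
driver.get("https://example.com/")
print(driver.title)
driver.quit()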
$html = file_get_contents($url);
$doc = new DOMDocument();
@$doc->loadHTML($html); // load the markup (@ silences warnings from sloppy HTML)
$xpath = new DOMXPath($doc);
$elements = $xpath->query('//*[@id="resultCount"]/span');
Like many here point out, lxml is a fast and versatile library that could be used for this alone, without BS. lxml.html can parse HTML, and lxml also supports using the HTML5 parser from html5lib, which deals with broken HTML in the standardized way.
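A small sketch of that html5lib-backed parsing, assuming both lxml and html5lib are installed:

from lxml.html import html5parser

broken = "<p>unclosed <b>tags <i>everywhere"
doc = html5parser.document_fromstring(broken)   # parsed with the HTML5 algorithm
# note: html5parser puts elements in the XHTML namespace, which affects XPath queries
print(doc.tag)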
Holy hell, you're right.
Isn't Nutch state of the art right now?
I developed this one for my own web scraping:
If you're coming from a primarily Python background, you won't just be waltzing right in and using CPAN modules right away. You have to understand the underlying language as well. Python's and Perl's object-oriented systems, for example, are quite different (using Moose helps somewhat, granted the person even knows it exists). Then there's the issue of understanding Perl's various contexts. These differences may take time to get used to, depending on the level of the developer.
Python supports regex as well, which can be useful for weird situations; granted, it's not going to be as tightly integrated as in Perl. There's even a Natural Language Toolkit if you want to get really crazy with things.
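For the record, a tiny example of leaning on re for a one-off extraction (a real parser is usually the safer route; the snippet of HTML here is made up):

import re

html = '<span id="resultCount">1-24 of 1,211 results</span>'
m = re.search(r'id="resultCount">([^<]+)<', html)
if m:
    print(m.group(1))   # 1-24 of 1,211 results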
TL;DR: "Right tool for the job" should take environmental circumstances into account as well.
There is a port of mechanize to Python as well, if that is your main reason to use Perl.
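A brief sketch of that Python mechanize port handling a login; the URL and form field names are placeholders for whatever the target form actually uses:

import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)     # don't fetch/obey robots.txt
br.open("https://example.com/login")
br.select_form(nr=0)            # first form on the page
br["username"] = "me"
br["password"] = "secret"
br.submit()                     # session cookies are carried along automatically
print(br.response().read()[:200])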
Whatever you can do in "disgusting hacks" can be done just as quickly in a way which won't make you want to vomit when you look over the code again later.