Wow. I had been using node+jsdom+jQuery for scraping, which was a big step up from anything I'd done previously. But after playing around with PhantomJS, I feel like I've finally found the hammer in the toolbox after years of trying to pound in nails with duct-taped-together screwdrivers.
Scratch that: I wanted to use JS to submit the results to my own web service, but forgot about cross-domain restrictions, so I just used console.log to output the data. I suppose I could have used the iframe query-string parameter trick.
My favorite language for scraping the web has always been Perl. With tools like LWP, Mechanize, and Win32::Mechanize (OLE), scraping any site is a breeze. Unfortunately, I haven't seen many good modules on CPAN for DOM processing. There are, of course, HTML::TokeParser and XPath, but those generally don't cope well with street HTML (which is what most sites serve) and are nowhere near as fast or friendly as jQuery selectors. By the way, there is one module called pQuery, a Perl port of jQuery, but it only supports a handful of selectors and doesn't work with Mechanize.
If only there were a Perl module that married Mechanize and jQuery (without using IE or OLE), it would make the best scraper in the world!
Actually, libxml (lxml for Python) is very, very good at handling real-life HTML content.
I'm currently on a contract where I do a significant amount of ETL and web scraping as part of the project, and I almost exclusively use lxml and XPath for parsing real-world HTML.
XPath is, without a doubt, the best tool for DOM querying that I can think of. And that's just XPath 1.0; 2.0 is reputedly even better, but as far as I can tell no 2.0 support is forthcoming for lxml.
XPath is good when the content is well formed, but it usually doesn't cope with even slightly broken tags. Mojolicious handles broken content much better, and it has full CSS3 selector support, which we all know is hands down the best way to access DOM elements.
If you're in the .NET world, HtmlAgilityPack does a great job of producing proper XML from broken HTML. CSS selectors don't always suffice, particularly with sites that don't use CSS :) Sometimes you have sites where the best you can do is e.g. get the text of the 2nd h2 header following the span with text 'X'. With some utility functions I can just write:
From which my code then selects the HtmlAgilityPack InnerText (i.e. less formatting, etc.). (Practically speaking, my code also does some case-insensitive translation in there, which is an area where XPath is a bit annoying, plus string trimming, checking and propagating nulls, etc.)
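The original utility-function snippet didn't survive in this comment, but the kind of query described (the 2nd h2 following a span with given text) maps naturally onto XPath's following axis. Here's a sketch in Python with lxml; the XPath expression and the HTML fragment are my own illustration, not the commenter's actual code:

```python
# Illustration (not the commenter's original snippet): selecting the
# 2nd <h2> that follows a <span> with a given text, via XPath's
# "following" axis. Requires lxml.
from lxml import html

doc = html.fromstring("""
<div>
  <span>X</span>
  <p>intro</p>
  <h2>First heading</h2>
  <h2>Second heading</h2>
</div>
""")

# //span[text()='X']/following::h2[2] = the second h2 after the span
nodes = doc.xpath("//span[text()='X']/following::h2[2]")
print(nodes[0].text_content())  # Second heading
```

The same expression works in HtmlAgilityPack's SelectNodes, since both implement XPath 1.0.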
In my experience the greater challenge with scraping lots of data is dealing with stuff like:
- cache disabled in the response headers, but you've scraped 10K pages and just discovered a page with a deformed href in an anchor (e.g. "<a href+'....'>"); after giving up trying to understand how on earth they managed that, it's not long before you're writing a crawl repository so you can selectively ignore the caching rules your proxy cache happily abides by, just so you can quickly restart your debug session for the next weird thing you discover (unfortunately, the nature of the site has forced you to do in-memory preprocessing of 50K pages before you can do the real processing for the rest of the site, because they have done some OTHER weird stuff)
- sites that treat EVERYTHING as dynamic content even though it could easily be cached... now you get to do the webmaster's job for them, because you're feeding from many data sources and don't want to hammer their servers
- sites with bad links that never return a 404, just redirects; easy to detect, but still a nuisance
- proper request throttling (i.e. throttling on the basis of requests serviced, not merely requests issued)
- dynamically adjusting the above throttling, because sites can be weird :)
- efficiently issuing millions of requests/week to a bunch of sites and scraping data from the responses in custom formats for each site
- site layouts changing and breaking your scraping logic. I'm not sure how common this is today, but I was scraping hundreds of commerce sites in 2001, each with several (often 5, sometimes 50) different product-page layouts for different sections, each with its own field names, fields, and craziness, for a total of a few thousand different "scraping logics" (each just 5-10 lines long, but each individually maintained). On any given day only about two out of those few thousand broke, but to keep everything robust, you had to (a) be able to tell which one broke, and (b) fix it within a reasonable time frame. Neither is simple.
- sites whose traffic management system you trip while scraping. Many will block you, some actively (with an error message, so you know what is happening); others will just leave you hanging or suddenly throttle you down to a few hundred bytes/second, with no explanation and no one to contact. Amazon contacted us when they figured out we were scraping (we weren't hiding anything and were doing it with a logged-in user that had contact details), and were cool about it.
- sites that randomly break and stop in the middle of a page. This happens much more often than you'd think; when using the site interactively, you just reload or work with the half-loaded page. You could, of course, still scrape a half-loaded page, but what if only 20 of the 23 items you need are there? And what if the site is stateful, and reloading the page would cause a state change you don't want?
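The two throttling points above (throttle on requests actually serviced, and adjust dynamically) can be sketched as a tiny adaptive delay that reacts to each request's outcome. This is a minimal illustration in Python; the class and parameter names are my own, not from any library mentioned in the thread:

```python
# A minimal sketch of adaptive request throttling (stdlib only; all
# names are invented). The delay is adjusted from observed outcomes:
# back off multiplicatively on errors/timeouts, recover slowly on
# success, so the throttle tracks requests *serviced*, not merely issued.
import time

class AdaptiveThrottle:
    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.delay = base_delay
        self._last = 0.0

    def wait(self):
        """Block until it's polite to issue the next request."""
        now = time.monotonic()
        sleep_for = self._last + self.delay - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self._last = time.monotonic()

    def record(self, ok):
        """Feed back whether the last request was actually serviced."""
        if ok:
            # recover gently toward the base rate
            self.delay = max(self.base_delay, self.delay * 0.9)
        else:
            # server struggling or blocking us: back off hard
            self.delay = min(self.max_delay, self.delay * 2)

t = AdaptiveThrottle(base_delay=0.5)
t.record(False)   # a timeout doubles the delay
print(t.delay)    # 1.0
t.record(True)    # a served response decays it again
print(t.delay)    # 0.9
```

In a real crawler you'd call wait() before each request and record() after inspecting the response, per host rather than globally.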
Again, this is not true of lxml. XPath was made for structured data, but in lxml you can pass in the HTMLParser factory and it will handle rubbish HTML just fine. I use lxml/XPath professionally to scrape economic data for an investment bank, and you'd be surprised at the Microsoft FrontPage-style spaghetti HTML I parse with it; it all works great.
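For what it's worth, the permissive parse described above looks like this in practice; a small sketch (the tag soup is invented for illustration):

```python
# lxml repairing invented tag soup: nothing here is closed, yet
# libxml2's HTML parser still produces a well-formed tree that
# XPath can query normally. Requires lxml.
from lxml import etree

soup = "<p>GDP <b>grew"
root = etree.fromstring(soup, parser=etree.HTMLParser())

# The parser has wrapped and closed everything:
# <html><body><p>GDP <b>grew</b></p></body></html>
print(root.xpath("//b/text()"))  # ['grew']
```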
BeautifulSoup for Python is miraculous. It takes a different approach from lxml (http://lxml.de/elementsoup.html), so sometimes one or the other works better depending on the input, but I don't know what I'd do without it.
Mojolicious is exactly what you are looking for: a lazy, forgiving DOM parser with CSS3 selectors built in. I've used it to scrape a couple of sites, and after trying so many different libraries, from curl to ASIHTTPRequest for Objective-C to Mechanize for Ruby, I don't see myself trying anything else now.
Mojolicious looks good. Though I've heard of it before, I always thought of it as an MVC framework, never as a web scraper. Anyway, looking through the docs, Mojo::DOM seems to be the most useful part. But does it integrate with Mechanize? Would I be able to fill in and submit a form, or is Mojolicious for parsing only?
Let's not make excuses for them. I wouldn't say that it says "nothing" about the practices and priorities oriented around (and consequently through) the project, especially when the website apparently consists of a single small static HTML page.
It's a project created and maintained entirely in my free time. I provide free support when someone needs help, and I fix bugs when someone files an issue. I've written documentation and a guide in the project's GitHub wiki. All of this is done outside of full-time work. Forgive me if I don't spend more time improving the website and its uptime when it's an endeavour that brings me no income whatsoever. You're the type of person who makes me hate open source sometimes.
> I wouldn't say that it says "nothing" about the practices and priorities oriented around (and consequently through) the project, especially when the website apparently consists of a single small static HTML page.
It says nothing. All servers go down from time to time. This one was up when I went to it before, and it was up when I tried it again just now. You say it was down for you when you visited it. Fine. That happens sometimes.
We at Rewritely (http://www.rewritely.com) have quite a bit of experience with large-scale content migration for clients (not just a single HTML page, but whole sites), mainly using scraping techniques.
Disclaimer: we have no relationship with PhantomJS :)
https://scraperwiki.com/ is a great source of thousands of pre-written scrapers to use, copy, extend, etc. It's sort of like GitHub, except you can actually schedule the scrapers to run at regular intervals and then access the scraped data over a standard API.
While governments and their private contractors don't allow just any citizen to gather that data (though there are examples of cities that do allow it), I consider myself within my rights to do so, and also to redistribute that data freely so other developers can play with it, investigate it, and learn from it.
Why? Well, for starters:
1) Their own apps suck (or don't exist)
2) They don't want to help their own users.
3) It's fun
4) Their services are paid for with public money.
5) It raises awareness of the need for public-data legislation.