How does that work? You pass whatever data you want your web service to process as URL parameters on the iframe's src URL, and that circumvents cross-site-scripting protections?
If only there were a module in Perl that could marry Mechanize and jQuery (without using IE or OLE), it would make the best scraper in the world!
I'm currently on a contract where I do a significant amount of ETL and web scraping as part of the project, and I almost exclusively use lxml and XPath for parsing real-world HTML.
XPath is, without a doubt, the best tool for DOM manipulation that I can think of. And that's just XPath 1.0 -- 2.0 is reputedly even better, but no 2.0 support is forthcoming for lxml as near as I can tell.
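To make that concrete, here's a minimal sketch of the kind of extraction I mean (the URL and class names are invented for illustration):

    import lxml.html

    # Fetch and parse a (hypothetical) product page.
    doc = lxml.html.parse("http://example.com/product/123").getroot()

    # XPath 1.0 covers most of what you need on real-world pages.
    title = doc.xpath("string(//h1[@class='product-title'])").strip()
    prices = [el.text_content().strip()
              for el in doc.xpath("//span[contains(@class, 'price')]")]
    links = doc.xpath("//a[@href]/@href")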
result.ContentX = doc.Element("span").WithText("X:").FollowingText("h2", 2);
In my experience the greater challenge with scraping lots of data is dealing with stuff like:
- cache disabled in the response headers, but you've scraped 10K pages and just discovered a page with a deformed href in an anchor (e.g. "<a href+'....'>"); after giving up trying to understand how on earth they managed that, it's not long before you're writing a crawl repository so you can selectively ignore the caching rules your proxy cache dutifully obeys, just so you can quickly restart your debug session for the next weird thing you discover (unfortunately, the nature of the site has forced you to do in-memory preprocessing of 50K pages before you can do the real processing for the rest of the site, because they have done some OTHER weird stuff)
- sites that treat EVERYTHING as dynamic content even though you could easily cache it... now you get to do the job of the webmaster because you have many data sources you're feeding from and don't want to hammer servers
- sites with bad links but no 404 responses (just redirects); easy to detect, but still a nuisance
- proper request throttling (i.e. throttling on the basis of requests serviced, not merely requested)
- dynamically adjusting the above throttling, because sites can be weird :) (there's a rough sketch of both after this list)
- efficiently issuing millions of requests/week to a bunch of sites and scraping data from the responses in custom formats for each site
- site layout changing and breaking your scraping logic. I'm not sure how common this is today, but I was scraping hundreds of commerce sites in 2001, each having several (often 5 but sometimes 50) different product page layouts for different sections, each with its own field names, fields, and craziness, for a total of a few thousand different "scraping logics" (each just 5-10 lines long, but each had to be individually maintained). Now, every day just two (out of a few thousand) broke, but to keep everything robust, you had to (a) be able to tell which one broke (see the second sketch below), and (b) fix it within a reasonable time frame. Neither of these is simple.
- sites whose traffic management system you trip while scraping. Many will block you, some actively (with an error message, so you know what is happening), some will just keep you hanging or throttle you down to a few hundred bytes/second all of a sudden, with no explanation and no one to contact. Amazon contacted us when they figured we were scraping (we weren't hiding anything and doing it with a logged in user that had contact details), and were cool about it.
- sites that randomly break and stop in the middle of a page. Happens much more than you'd think; when using the site, you just reload or interact with a half-loaded page. You could, of course, still scrape a half-loaded page - but what if only 20 of the 23 items you need are there? What if the site is stateful, and reloading that page would cause a state change you do not want?
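On the throttling points above, this is roughly the shape of what I mean by throttling on requests serviced and adjusting dynamically; the intervals, timeouts, and status codes are just illustrative defaults, not something from an actual setup:

    import time
    import urllib.request
    from urllib.error import HTTPError
    from collections import defaultdict

    # Per-host delay between *serviced* requests, in seconds (illustrative numbers).
    min_interval = defaultdict(lambda: 2.0)
    last_served = defaultdict(float)

    def fetch(url, host):
        # Wait relative to when the host last finished serving us, not when we last asked.
        wait = last_served[host] + min_interval[host] - time.time()
        if wait > 0:
            time.sleep(wait)
        start = time.time()
        body = None
        try:
            body = urllib.request.urlopen(url, timeout=30).read()
            # Speed back up gently while the site keeps up.
            min_interval[host] = max(min_interval[host] * 0.9, 1.0)
        except HTTPError as err:
            if err.code in (429, 503):
                # The site is pushing back: double the delay, up to two minutes.
                min_interval[host] = min(min_interval[host] * 2, 120)
        finally:
            last_served[host] = time.time()
        if time.time() - start > 10:
            # Responses are getting slow: back off even without an error.
            min_interval[host] = min(min_interval[host] * 2, 120)
        return body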
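And on telling which per-layout scraper broke: the simplest thing that works is having every layout declare the fields it must produce and flagging the ones that come back empty. A toy sketch, with invented layout names and fields:

    # Each per-layout scraper returns a dict of extracted fields.
    REQUIRED = {
        "acme_electronics": ["title", "price", "sku"],
        "acme_books": ["title", "price", "isbn"],
    }

    def check(layout, record):
        # Flag the layout as broken if any required field is missing or empty.
        missing = [f for f in REQUIRED[layout] if not record.get(f)]
        if missing:
            print(f"scraper for layout {layout!r} looks broken: missing {missing}")
        return not missing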
FYI, you can run a Squid cache configured to ignore all that stuff - see http://www.squid-cache.org/Doc/config/refresh_pattern/
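For example, something along these lines in squid.conf; treat it as a sketch and check the docs above, since the exact options vary by Squid version:

    # Cache everything for up to a day, even when the origin headers say not to.
    refresh_pattern . 1440 20% 1440 override-expire override-lastmod ignore-reload ignore-no-store ignore-private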
Being able to pop up a jsdom and extract data from the page using jQuery is a lot of fun.
I've had pretty good luck with phantom.js, but it is somewhat difficult to debug.
What does open source have to do with the reason your site was down?
I think that impugning the professionalism of someone's work because their project's home page was down once is pretty darned personal.
It says nothing. All servers go down from time to time. This one was up when I went to it before, and it was up when I tried it again just now. You say it was down for you when you visited it. Fine. That happens sometimes.
Disclaimer: we have no relationship with PhantomJS :)
There is also Capybara, usually used as a testing framework, but you can easily navigate pages with it and choose a backend (Selenium/WebKit for compatibility, mechanize for speed): https://github.com/jnicklas/capybara
And - I know it's old by now - memcached is a good place to store things.
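For instance, a tiny sketch of caching fetched pages so re-runs don't hammer the site again; it assumes a local memcached and the pymemcache client, both of which are my choices here, not the parent's:

    import hashlib
    import urllib.request
    from pymemcache.client.base import Client

    mc = Client(("127.0.0.1", 11211))

    def fetch_cached(url, ttl=24 * 3600):
        # Hash the URL so the key stays within memcached's key restrictions.
        key = "page:" + hashlib.sha1(url.encode("utf-8")).hexdigest()
        page = mc.get(key)
        if page is None:
            page = urllib.request.urlopen(url, timeout=30).read()
            mc.set(key, page, expire=ttl)
        return page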
While the government and its private contractors don't allow just any citizen to gather that data (there are examples of cities that do allow it), I consider myself within my rights to do so, and also within my rights to redistribute that data freely so other developers can play with it, investigate it, and learn from it.
Why? Well, for starters:
1) Their own apps suck (or are nonexistent)
2) They don't want to help their own users.
3) It's fun
4) Their services are paid for with public money.
5) It raises awareness of the need for public data legislation.
There's a lot more to say on the subject.
If you want to check it out: http://citybik.es
I am helping projects and visualizations like: http://bikes.oobrien.com/
or my own http://citybik.es/realtime/