If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.
Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.
In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.
My first attempt used a short list of heuristics that proved to be eerily successful for what I was after. To the point that I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected about 70% of the time in my tests.
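To give a flavor of what I mean, here's a toy sketch of that kind of heuristic (all names and thresholds here are invented for illustration, not from any real project): treat long runs of text outside obvious chrome tags as candidate content blocks.

```python
from html.parser import HTMLParser

class ContentBlockParser(HTMLParser):
    """Collect text runs, skipping anything inside obvious
    boilerplate containers (nav, header, script, etc.)."""
    SKIP = {"script", "style", "nav", "header", "footer"}

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.blocks = []   # (text, length) candidates

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        if any(t in self.SKIP for t in self.stack):
            return
        text = data.strip()
        if text:
            self.blocks.append((text, len(text)))

def meaningful_blocks(html, min_len=80):
    """Heuristic: long text runs are probably real content,
    short ones are probably navigation or labels."""
    p = ContentBlockParser()
    p.feed(html)
    return [text for text, n in p.blocks if n >= min_len]
```

A real version would weight things like link density and punctuation, but even a length cutoff like this gets surprisingly far on discussion sites.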
After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel, with some tweaking, it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.
With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.
I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started experimenting with "generic" scrapers that try to extract the data without depending on markup. I think it will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.
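One crude version of "not depending on markup" is to flatten the page to plain text and then match on the shape of the data itself. A minimal sketch, assuming price-shaped tokens are what you're after (the regex and helper names are mine, purely illustrative):

```python
import re
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    """Throw away the markup entirely; keep only the text."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def extract_prices(html):
    """Pull price-shaped tokens out of the flattened text,
    ignoring whatever tags the site wrapped them in."""
    p = TextOnly()
    p.feed(html)
    text = ' '.join(p.parts)
    return re.findall(r'\$\d+(?:\.\d{2})?', text)
```

The nice property is that class-name shuffling and DIV/SPAN swaps are invisible to it; the cost is that you lose the structural context telling you *which* price is which, which is exactly where the error-rate tweaking comes in.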
There are a few papers on this, but not much else out there. That's why I was interested in someone else working on the same problem in a different space.
These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.
Other sites I have come across will use large images and CSS sprites to mask price data.
I write a lot of scrapers for fun, rarely profit, just for the buzz.
At least it's easier to code these tricks than to patch a scraper to get around them.
But who cares, no one can beat XPath :)
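For what it's worth, part of why XPath holds up is that a descendant query doesn't care how the wrapping markup gets shuffled. A quick illustration using Python's `xml.etree.ElementTree`, which supports a limited XPath subset (lxml gives you the full language); the sample markup is made up:

```python
import xml.etree.ElementTree as ET

# Two "versions" of the same page: the second swapped the
# wrapper DIV for a SPAN and nested the price one level deeper.
v1 = '<div><p class="price">19.99</p></div>'
v2 = '<span><em><p class="price">19.99</p></em></span>'

# The same descendant query survives both layouts.
for doc in (v1, v2):
    root = ET.fromstring(doc)
    price = root.find('.//p[@class="price"]').text
    print(price)  # 19.99 both times
```

Of course, if the site randomizes the class name itself, the expression breaks like anything else that keys on attributes.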
The OP addresses that point. His contention is that there's a lot more pressure on the typical enterprise to keep its public-facing website in tip-top shape than there is to make sure whatever API it has defined keeps delivering results properly.
Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.
I once had to maintain a (legal) scraper, and I can tell you there is no fun in making your scraper robust when the website maintainers are doing their best to keep you from scraping their site.
I've seen random class names and identifiers, switching of DIVs and SPANs (with block display), and SPANs being added and removed to nest/un-nest elements. And so on.
Of course the site wants to keep its SEO, but most of the time it's easy to keep parts of the page out of context for a scraper.