
The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.

If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.




This should be stressed - sites like Facebook do exactly this. Constant changes mean constantly updating your scraper. And when it comes to A/B testing, your scraper needs to intelligently find the data, which might not always be in the same place.

Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.


> I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping.

In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.

My first attempt used a short list of heuristics that proved to be eerily successful for what I was after. To the point that I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected about 70% of the time in my tests.
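The gist was along these lines (a toy sketch of the general idea, not the actual code; the tag list, weights and thresholds are made up):

    # Toy sketch: score blocks by text and link density instead of ids/classes.
    # Weights and thresholds here are invented for illustration.
    from bs4 import BeautifulSoup

    def candidate_blocks(html):
        soup = BeautifulSoup(html, "html.parser")
        scored = []
        for node in soup.find_all(["div", "td", "article", "li"]):
            text = node.get_text(" ", strip=True)
            if len(text) < 80:                      # skip tiny fragments
                continue
            link_text = sum(len(a.get_text(strip=True)) for a in node.find_all("a"))
            link_density = link_text / len(text)    # nav bars and menus are mostly links
            punctuation = sum(text.count(c) for c in ".,!?")
            score = len(text) * (1 - link_density) + 2 * punctuation
            scored.append((score, text))
        scored.sort(reverse=True)
        return [text for score, text in scored]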

After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel with some tweaking it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.

With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.


I would be really interested in knowing which heuristics or machine learning techniques produced decent results. That's if I can't convince you to open source the code. I'm working on the same problem at the moment.


What about something like http://tubes.io


We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.

I'm more interested in what I can do to write fewer scrapers, since the content is, at a high level, relatively similar. I've just started experimenting with "generic" scrapers that try to extract the data without depending on markup. It will eventually work well enough, but getting the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.

There are a few papers on this, but not much else out there. That's why I was interested in someone else working on the same problem in a different space.


> Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.

These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.

Other sites I have come across will use large images and CSS sprites to mask price data.

I write a lot of scrapers for fun, rarely profit, just for the buzz.


I bet you would only need to randomly shuffle between a few alternatives for all of them. You'd need a dedicated effort to work that one out, and the cache implications could be managed. There's no getting around the trade-off between the number of possible page alternatives and cache nightmare-ness, though, and doing that to JSON APIs would get ugly fast.

At least it's easier to code these tricks than to patch a scraper to get around them.
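E.g. something as simple as picking one of a handful of class-name mappings per deploy (a made-up sketch, not any particular site's scheme; the field names are invented):

    # Made-up sketch: pick one of a few class-name alternatives per deploy,
    # so caches stay warm within a deploy but scrapers break across deploys.
    import random

    ALTERNATIVES = {
        "price": ["price", "p-val", "amt"],
        "title": ["title", "hdr", "t-main"],
    }

    def class_map(deploy_seed):
        rng = random.Random(deploy_seed)
        return {field: rng.choice(names) for field, names in ALTERNATIVES.items()}

    classes = class_map(deploy_seed="2013-03-07")
    html = '<span class="{0}">19.99</span>'.format(classes["price"])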


Yes, Facebook used to do that. I had to scrape it once and was surprised by the randomly changing classes around input fields.

But who cares, no one can beat XPath :)
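E.g. with lxml you can anchor on structure and visible text instead of the shuffled class names (the markup here is invented):

    # Invented example: select by label text and position, ignoring the random class names.
    from lxml import html

    page = '<form><label>Email</label><input name="x7f3" class="aZ9q"></form>'
    doc = html.fromstring(page)
    email_input = doc.xpath('//label[contains(., "Email")]/following::input[1]')[0]
    print(email_input.get("name"))   # -> x7f3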


>The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.

The OP addresses that point. His contention is that there's a lot more pressure on the typical enterprise to keep their public-facing website in tip-top shape than there is to make sure whatever API they've defined keeps delivering results properly.

Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.


> Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.

I once had to maintain a (legal) scraper and I can tell you there is no fun in making your scraper robust when the website maintainers are doing their best to keep you from scraping their site. I've seen random class names and identifiers, switching of DIVs and SPANs (block display), adding and removing SPANs for nesting/un-nesting elements, and so on. Of course the site wants to keep its SEO, but most of the time it's easy to keep parts out of context for a scraper.


In most cases, the site doesn't have an API... so we scrape and take the risk that the structure will change. One thing that helps is using tools that give you jQuery-like selectors, because they give a lot of freedom and are very easy to write/update.


I agree, CSS selectors in BeautifulSoup and pyquery make it less messy.
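For example, the same selector works in both (a trivial sketch with invented markup):

    # Trivial sketch: one CSS selector, two libraries (markup invented).
    from bs4 import BeautifulSoup
    from pyquery import PyQuery

    page = '<ul id="posts"><li class="post">Hello</li><li class="post">World</li></ul>'

    via_bs4 = [li.get_text() for li in BeautifulSoup(page, "html.parser").select("#posts li.post")]
    via_pq  = [PyQuery(li).text() for li in PyQuery(page)("#posts li.post")]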


This is indeed painful. I was scraping the Pirate Bay last year for a school project, their HTML would occasionally change in subtle ways that would break my scraper for hours or days until I noticed it.


Yeah, the author of the post seemed to imply that web APIs are more likely to change than a website. At least, that's how I took it. Blew my mind.





