The uncanny valley of web scraping (zemanta.com)
172 points by Swizec on Feb 28, 2012 | 49 comments



For the last 7 years I have worked at a company that does specialized job listing web scraping. And on nearly a weekly basis I encounter other programmers who say, "pssssh, I could do that in a weekend."

It does seem disgustingly easy, but once you move from "getting" the data to "understanding" the data it becomes a beastly nightmare. So thank you to the OP for helping raise "give scrapers some credit" awareness.

Except scrapers that don't respect robots.txt or meta noindex - a pox on their houses....


But there will always be hordes of new, excitable hackers who fail to appreciate the complexity of finishing real-world tasks (and I'm not talking only about intelligent scraping). The "awareness" is ephemeral.

Of course, by the time they come around, they will have driven down market prices, and hackers' reputation along with them, with their "Easy! Two days max, here's the expected cost", invariably followed by "Give me an extra week or two, I'm almost there".

Eventually they'll learn, then come and vent on HN. The circle of life.


It isn't just upstart programmers. How often have we all had discussions with a manager or client who wants something like this done quickly and doesn't believe it could have many complexities? "There you go, you got the data, now all you need to do is make the program understand it. Easy!"


Damn. I'm just really glad to know I'm not the only one.

One of my company's biggest sellers is a tool that does web scraping. And since we charge a monthly fee for it (I don't think most people realize that our server fees alone run nearly $2,000 a month for this, and that's with amazing deals on web hosting), we get a lot of "Pfft--how hard could that really be to build?"

The answer is: it's easy--if you're only scraping one site, in small quantities. The minute you try to handle Unicode properly (we do), or bundle it all into a report, or keep an ongoing cache of what your scraper found, or heck, scale it to millions of scrapes per month, all cached and indexed--now you're looking at something difficult.

We have a talented team, we're 16 months into this, and we've since seen competitors pop up all over the place. They all get a few customers and then run into scaling issues. It's one of those markets that seems seductively easy, but really isn't. And since there are many competitors lowballing, we've had to focus even more development time on awesome features to stay ahead of the curve.

I wish articles like this woke people up. Unfortunately, once you're committed and have a few customers, it's easy to rationalize going deeper. But web scraping can be a tricky industry to turn a profit in, especially if you are relying on a small and fickle customer base.


> And since there are many competitors lowballing, we've had to focus even more development time on awesome features to stay ahead of the curve.

That is always the way, in any industry (those producing real, holdable products as well as those of us producing code and other less tangible output). If you make something good and charge fairly for it, someone else will make a lower-quality version and sell it cheaper, and people will start expecting you to match that price without dropping any quality. The problem with software is that you often can't see the corners that have been cut until much later than with physical products (where you at least have a chance of spotting shoddy workmanship before you hand over any money). That makes competing on quality difficult (the competition can make the same quality claims as you, whether or not they're true), so you end up having to compete on features.


People should realize that search engines are essentially just huge, generic scrapers.

I've written my share of scrapers and they are almost always a major PITA unless you're just doing a drive-by for one or two DOM elements. Scraping often ends up being extremely time-consuming, even though there's rarely any significant challenge to it. Just a lot of monotonous, mind-numbing, tedious work.

My advice to most freelancers is: avoid scrapers until you get some teammates to share in the agony.


> And on nearly a weekly basis I encounter other programmers who say, "pssssh, I could do that in a weekend."

Hehe, I guess some of those also ask: "Why can't you just parse the HTML with a regex?" ;-)

http://stackoverflow.com/questions/1732348/regex-match-open-...
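For anyone who hasn't been bitten by that yet, here's a tiny made-up illustration of why the regex approach falls over the moment tags nest, compared to a real parser:

    import re
    from lxml import html

    snippet = "<div>outer <div>inner</div> tail</div>"

    # The non-greedy regex stops at the first closing tag it finds,
    # so it hands back a mangled fragment of the outer div.
    print(re.search(r"<div>(.*?)</div>", snippet).group(1))
    # -> outer <div>inner

    # An HTML parser tracks nesting and gives you the whole element.
    print(html.fromstring(snippet).text_content())
    # -> outer inner tail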


I spent some time about a year ago fussing with getting a web spider/scraper going; it was simple enough to download data, but actually putting it into a database that was domain-, content-, and time-aware was impressively complex and I put it on the backburner.


This is only tangentially related to the article, but on scraping HTML in general: If you're a Python user, use lxml for it. I know most content on the web will tell you to use BeautifulSoup, and lxml is something you've only heard of in connection with reading and writing XML, but lxml actually has a lovely HTML sub-package, it's faster than BeautifulSoup, and it's compatible with Python 3. I've gotten lots of good mileage out of it (and no, I'm not a developer on it :)).
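A minimal sketch of what that looks like in practice (the URL and selectors here are just placeholders):

    from lxml import html

    # lxml.html can fetch and parse a URL directly.
    doc = html.parse("http://example.com/jobs").getroot()

    # XPath and CSS selectors both work out of the box.
    for title in doc.xpath('//div[@class="listing"]/h2/a/text()'):
        print(title)
    for link in doc.cssselect("div.listing h2 a"):
        print(link.get("href"))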


FYI, BeautifulSoup is actually dropping its own parser entirely for the next version, in favour of being a wrapper around lxml/html5lib/html.parser http://www.crummy.com/2012/02/02/0
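Going by that announcement, the only visible change is that you name the backend when you build the soup, roughly:

    from bs4 import BeautifulSoup

    markup = "<p>Some <b>bad<p>HTML"

    # Same BeautifulSoup API on top, different parser underneath.
    soup = BeautifulSoup(markup, "lxml")         # fast, needs lxml installed
    soup = BeautifulSoup(markup, "html5lib")     # closest to what browsers do
    soup = BeautifulSoup(markup, "html.parser")  # pure stdlib, no extra deps
    print(soup.p)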


Ah, interesting, thanks. I will confess that I haven't followed the latest round of developments in BeautifulSoup - I abandoned it last year when I made the switch to Python 3, and the previous effort to port BS to Python 3 had stalled/failed at that point (looks like they're back on track on that one now, too). Then I found I didn't really miss BS while using lxml.html - not sure I see a point in putting it on top of it.


I think the idea behind BS is that it saves some typing and thinking for common scraping tasks. In that light, it fits perfectly on top of some other (faster) parsing library.


Yup, but all of the stuff in the Quick Start section of http://www.crummy.com/software/BeautifulSoup/bs4/doc/ has more or less close equivalents in lxml.html.

That being said, I think it's great that he's still maintaining BS, and porting it to Python 3 in particular - it keeps existing code working and will allow more people to make the switch. And a responsible and committed maintainer is good advertising for a package in itself, of course.


I admit to using BeautifulSoup, lxml, scrapy, nokogiri and mechanize. But these days I just stick with PhantomJS. Just load the DOM with WebKit. http://phantomjs.org/


I'm even lazier. :) I used the pjscrape wrapper around phantomjs.

https://github.com/nrabinowitz/pjscrape


Another tangentially related recommendation: use Scrapy if you need something more than a one-shot script using BeautifulSoup or lxml.

http://scrapy.org/
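If it helps anyone sizing it up, a bare-bones spider is only a handful of lines. The name, URL and XPath below are placeholders, and the API is from memory, so check the tutorial:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class JobsSpider(BaseSpider):
        name = "jobs"
        start_urls = ["http://example.com/jobs"]

        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            for title in hxs.select('//div[@class="listing"]/h2/a/text()').extract():
                self.log(title)

Run it with "scrapy crawl jobs" inside a project and the framework throws in retries, throttling and item pipelines, which is where it starts to beat a one-shot script.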


I use HTMLUnit for scraping JavaScript-based sites. Since Python is a lovely language and HTMLUnit is written in Java, I use Jython. You can take a look at one of my articles on the subject: http://blog.databigbang.com/web-scraping-ajax-and-javascript...
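For anyone curious, the gist is only a few lines of Jython. The class and constant names below are from memory, so double-check them against the HTMLUnit docs, and the URL is a placeholder:

    # Run under Jython with the HTMLUnit jars on the CLASSPATH.
    from com.gargoylesoftware.htmlunit import WebClient, BrowserVersion

    client = WebClient(BrowserVersion.FIREFOX_3_6)   # emulate a real browser
    page = client.getPage("http://example.com/ajax-page")

    # Let the page's JavaScript finish before reading the DOM.
    client.waitForBackgroundJavaScript(5000)

    print page.asXml()       # the DOM after scripts have run
    client.closeAllWindows()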


I need to build or find a tool for parsing EDGAR filings for their financial statement tables (not so bad) and parsing those financial tables into usable information (pretty bad; the tables have messy and often inconsistent HTML layouts).

Any suggestions to where I should be looking? Python? XSLT?


Contact me, email is in my HN profile.


Use HTML Tidy + XSLT.
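Roughly, the pipeline I have in mind (a sketch, not battle-tested on EDGAR, and the stylesheet is yours to write): tidy beats the filing into well-formed XHTML, then XSLT pulls out the tables.

    import subprocess
    from lxml import etree

    raw = open("filing.html", "rb").read()        # an EDGAR filing you fetched

    # -asxml forces well-formed XHTML; -q and -numeric keep the output clean.
    tidy = subprocess.Popen(["tidy", "-q", "-asxml", "-numeric"],
                            stdin=subprocess.PIPE, stdout=subprocess.PIPE)
    xhtml, _ = tidy.communicate(raw)

    # extract-tables.xsl is a stylesheet you'd write to pick out the
    # financial-statement tables and flatten them into rows.
    transform = etree.XSLT(etree.parse("extract-tables.xsl"))
    print(str(transform(etree.fromstring(xhtml))))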


Anyone scraping in Perl? I've found pQuery very useful. (I tried it in node.js to stay cool, but async scraping is an anti-pattern.) You can use jQuery selectors, etc. Just posted something related to it on my blog: scrape with pQuery, dump into Redis, reformat into CSV, then into MySQL...

http://cuppster.com/2012/02/28/utf-8-round-trip-for-perl-mys...


I also use Perl for web scraping, though I'd never heard of pQuery; I use HTML::TokeParser or HTML::TreeBuilder::XPath.


Relying on RSS feeds is tricky, as many of them are partial extracts, summaries, or just plain wrong (eg archival standalone pages linking to the current front page, stale feeds, links to now-defunct feed services).

If you want to help people writing these things, using hAtom in your HTML is a really good idea.

http://microformats.org/wiki/hatom
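The payoff for scraper authors is that the class names are standardized, so extraction becomes boring in the best way. A rough lxml sketch (the URL is a placeholder):

    from lxml import html

    doc = html.parse("http://example.com/blog").getroot()

    # hAtom marks each post with class="hentry" and standard child classes.
    for entry in doc.find_class("hentry"):
        title = entry.find_class("entry-title")
        body = entry.find_class("entry-content")
        if title and body:
            print(title[0].text_content().strip())
            print(body[0].text_content().strip()[:80] + "...")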


Also HTML5 incorporates the article element/tag to help extract article contents: http://dev.w3.org/html5/spec/Overview.html#the-article-eleme...


The thing here is that when properly used, a page can contain several pieces of text tagged as articles, especially blogs with comments (think of article as "an article of clothing", not as "a magazine article"). You'd have to rely on other heuristics to find the "correct" article, which probably is not that much easier than finding the correct div element.


Take a look, for example, at Fred Wilson's blog (http://www.avc.com); he uses the article element. You can use it multiple times for different blog posts on the same web page, and I don't feel this is bad.


I'm currently doing this with http://rtcool.com/

Edit: It's basically a service that abstracts out scraping for those who want to create a Readability-type project.

I've got one last thing to add and then it will be ready for mass consumption.


Very cool idea, I've thought about it myself! Make sure you get all the tech sites in. I have AdBlock turned on all the time, but I have no idea how someone can "read" anything without an ad blocker.


The title should have been "The uncanny valley of recognizing content on a website".

Scraping is a technicality and, as such, trivial. As the article points out, processing the scraped content and getting useful results is the hard part.

I'm running some scrapers for customers, but the information they want exists in structured form on the various websites. Thank goodness.


I have another alternative for retrieving historic RSS entries beyond the current ones: using the Google Reader "NoAPI": http://blog.databigbang.com/extraction-of-main-text-content/

And there are additional resources at the end.


Yeah, I get this with ReadItLater, which works 95% of the time, and produces very odd results the other 5%.


We have a home-brewed scraper and parser (written in C#) at Feedity - http://feedity.com - and let me tell you: it's one thing to scrape data, but deriving information out of it is not as easy.


Does anyone have experience using diffbot for web scraping? I'm looking for some data points.


Readability has never felt anything but right to me.



Funny, I see only 55, everything else is missing: http://www.readability.com/articles/bylykqti


What am I missing? It appears to work fine.


Come on, really?

You're missing that there are 81 stanzas and Readability keeps exactly one.


Readable keeps all 81 stanzas:

http://readable.tastefulwords.com/


(http://imgur.com/DKwy4)

Works fine for me.


mirror?



ironic


I liked you in Dogma.


I don't know why, but I love web scraping. Somehow it's fun to me to be able to grab data and organize it in a DB in a meaningful way. Even better is using the data to make money.

I wrote my own scraper framework for various page types (I don't want to go into too many details) and my latest uses approx 500GB of bandwidth/month. I run it on one VPS for around $50/month.


In that case, sign up at https://ScraperWiki.com/ and put your passion to good use!


That's awesome. I also love scraping; there's just something about it that I find really satisfying.


Where do you store the data? Most hosts don't seem to offer a lot of storage on their servers.


My DBs aren't that huge because I don't store all of the data I scrape, just the important stuff. I used GoDaddy for a while, but I now use 1and1. Out of all the hosts I've used over the years, 1and1 has been the best in terms of support and uptime.

They have a VPS plan with 2TB of bandwidth, and you can upgrade the storage. I think my MySQL DB is currently around 20GB.



