
The uncanny valley of web scraping - Swizec
http://www.zemanta.com/fruitblog/the-uncanny-valley-of-web-scraping/
======
ecaron
For the last 7 years I have worked at a company that does specialized job
listing web scraping. And on nearly a weekly basis I encounter other
programmers who say, "pssssh, I could do that in a weekend."

It does seem disgustingly easy, but once you move from "getting" the data to
"understanding" the data it becomes a beastly nightmare. So thank you to the
OP for helping raise "give scrapers some credit" awareness.

 _Except scrapers that don't respect robots.txt or meta noindex - a pox on
their houses...._

~~~
Radim
But there will always be hordes of new, excitable hackers who fail to
appreciate the complexity of finishing real-world tasks (and I'm not talking
only about intelligent scraping). The "awareness" is ephemeral.

Of course, by the time they come around, they will have driven down market
prices, as well as hackers' image as a whole, with their "Easy! 2 days max,
here's the expected cost", invariably followed by "Give me an extra week or
two, I'm almost there".

Eventually they'll learn, then come and vent on HN. The circle of life.

~~~
dspillett
It isn't just upstart programmers. How often have we all had discussions with
a manager or client who wants something like this done quickly and doesn't
believe it could have many complexities? "There you go, you got the data, now
all you need to do is make the program understand it. Easy!"

------
sho_hn
This is only tangentially related to the article, but on scraping HTML in
general: If you're a Python user, use lxml for it. I know most content on the
web will tell you to use BeautifulSoup, and lxml is something you've only
heard of in connection with reading and writing XML, but lxml actually has a
lovely HTML sub-package, it's faster than BeautifulSoup, and it's compatible
with Python 3. I've gotten _lots_ of good mileage out of it (and no, I'm not a
developer on it :)).
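
A minimal sketch of what lxml.html usage looks like (the URL and selectors
are made-up placeholders, not anything from the article):

    import lxml.html

    # lxml can fetch and parse a page directly
    doc = lxml.html.parse('http://example.com/post').getroot()
    doc.make_links_absolute('http://example.com/post')

    title = doc.findtext('.//title')  # the plain ElementTree API works too
    paragraphs = [p.text_content()
                  for p in doc.xpath('//div[@id="content"]//p')]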

~~~
chernevik
I need to build / find a tool for parsing EDGAR filings for their financial
statements tables (not so bad) and parsing those financial tables into usable
information (pretty bad, the tables have very bad and often inconsistent HTML
layouts).

Any suggestions as to where I should be looking? Python? XSLT?
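
To show the shape of the problem, a hypothetical sketch of the extraction
half with Python's lxml (filename and structure made up; the hard part starts
after this):

    import lxml.html

    doc = lxml.html.parse('filing.htm').getroot()
    for table in doc.xpath('//table'):
        rows = [[cell.text_content().strip()
                 for cell in row.xpath('./td|./th')]
                for row in table.xpath('.//tr')]
        # ...then the hard part: deciding which table is the income
        # statement and what each inconsistently-labelled column means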

~~~
jessedhillon
Contact me, email is in my HN profile.

------
cuppster
Anyone scraping in Perl? I've found pQuery very useful. (I tried it in node.js
to stay cool, but async scraping is an anti-pattern.) You can use jQuery
selectors, etc. Just posted something related to it on my blog: scrape with
pQuery, dump into Redis, reformat into CSV, then into MySQL...

<http://cuppster.com/2012/02/28/utf-8-round-trip-for-perl-mysql-and-node-js/>

~~~
dmn001
I also use Perl for web scraping. I'd never heard of pQuery, though; I use
HTML::TokeParser or HTML::TreeBuilder::XPath.

------
Kevin_Marks
Relying on RSS feeds is tricky, as many of them are partial extracts,
summaries, or just plain wrong (e.g. archival standalone pages linking to the
current front page, stale feeds, links to now-defunct feed services).

If you want to help people writing these things, using hAtom in your HTML is a
really good idea.

<http://microformats.org/wiki/hatom>
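
On the scraper's side, hAtom reduces to a couple of class selectors. A rough
sketch with Python's lxml (needs the cssselect package; the URL is a
placeholder, the class names are from the hAtom spec):

    import lxml.html

    doc = lxml.html.parse('http://example.com/blog').getroot()
    for entry in doc.cssselect('.hentry'):
        title = entry.cssselect('.entry-title')[0].text_content()
        body = entry.cssselect('.entry-content')[0].text_content()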

~~~
wslh
Also, HTML5 incorporates the article element to help with extracting article
contents: <http://dev.w3.org/html5/spec/Overview.html#the-article-element>

~~~
kartoffelmos
The thing here is that when properly used, a page can contain several pieces
of text tagged as articles, especially blogs with comments (think of article
as "an article of clothing", not as "a magazine article"). You'd have to rely
on other heuristics to find the "correct" article, which probably is not that
much easier than finding the correct div element.
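
A crude sketch of that heuristic problem, assuming lxml keeps the article
elements in the tree (picking the one with the most text is just a stand-in
for real heuristics):

    import lxml.html

    doc = lxml.html.parse('http://example.com/post').getroot()
    articles = doc.xpath('//article')
    main = max(articles, key=lambda a: len(a.text_content()), default=None)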

~~~
wslh
Take a look, for example, at Fred Wilson's blog (<http://www.avc.com>); he is
using the article element. You can use it multiple times for different blog
posts on the same web page, and I don't feel this is bad.

------
lowglow
I'm currently doing this with <http://rtcool.com/>

Edit: It's basically a service that abstracts out scraping for those that want
to create a Readability-type project.

I've got one last thing to add and then it will be ready for mass consumption.

~~~
joering2
Very cool idea, I've thought about it myself! Make sure you get all the tech
sites in. I have AdBlock turned on all the time, and I have no idea how
someone can "read" anything without ad blocking.

------
NameNickHN
The title should have been "The uncanny valley of recognizing content on a
website".

Scraping is a technicality and, as such, trivial. As the article points out,
processing the scraped content and getting useful results is the hard part.

I'm running some scrapers for customers, but the information they want exists
in structured form on the various websites. Thank goodness.

------
wslh
I have another alternative for retrieving historical RSS entries beyond the
current feed: using the Google Reader "NoAPI":
<http://blog.databigbang.com/extraction-of-main-text-content/>

And there are additional resources at the end.

------
AndrewDucker
Yeah, I get this with ReadItLater, which works 95% of the time, and produces
very odd results the other 5%.

------
nreece
We have a home-brewed scraper and parser (written in C#) at Feedity
(<http://feedity.com>), and let me tell you: it's one thing to scrape data,
but deriving information from it is not as easy.

------
jacoblyles
Does anyone have experience using diffbot for web scraping? I'm looking for
some data points.

------
latch
Readability has never felt anything but right to me.

~~~
funkah
Try running this through it. <http://www.cheraglibrary.org/taoist/hua-hu-ching.htm>

~~~
DanBC
What am I missing? It appears to work fine.

~~~
funkah
Come on, really?

You're missing that there are 81 stanzas and Readability keeps exactly one.

~~~
lesterbuck
Readable keeps all 81 stanzas:

<http://readable.tastefulwords.com/>

------
kristopolous
mirror?

~~~
ja27
ironic

~~~
itmag
I liked you in Dogma.

------
paulhauggis
I don't know why, but I love web scraping. Somehow it's fun to me to be able
to grab data and organize it in a DB in a meaningful way. Even better is using
the data to make money.

I wrote my own scraper framework for various page types (I don't want to go
into too many details) and my latest uses approx 500GB of bandwidth/month. I
run it on one VPS for around $50/month.

~~~
Tichy
Where do you store the data? Most hosts don't seem to offer a lot of storage
on their servers.

~~~
paulhauggis
My DBs aren't that huge because I don't store all of the data I scrape, just
the important stuff. I used GoDaddy for a while, but I now use 1and1. Out of
all the hosts I've used over the years, 1and1 has been the best in terms of
support and uptime.

They have a VPS plan with 2TB of bandwidth, and you can upgrade the storage. I
think my MySQL DB is currently around 20GB.

