
I Don’t Need No Stinking API: Web Scraping For Fun and Profit - hartleybrody
http://blog.hartleybrody.com/web-scraping/
======
bdcravens
I've done a ton of scraping (mostly legal: on behalf of end users of an app on
sites they have legit access to). This article misses something that affects
several sites: JavaScript driven content. Faking headers and even setting
cookies doesn't get around this. This is of course easy to get around,
using something like phantom.js or Selenium. Selenium is great because unlike
all the whiz bang scraping techniques, you're driving a real browser and your
requests look real (if you make 10000 requests to index.php and never pull
down a single image, you might look a bit suspicious). There's a bit more
overhead, but micro instances on EC2 can easily run 2 or 3 Selenium sessions
at the same time, and at 0.3 cents per hour for spot instances, you can have
200-300 browsers going for 30-50 cents/hour.
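
A minimal sketch of what that looks like with Python's Selenium bindings (the
URL and selector are placeholders, and a Firefox driver is assumed to be
installed on the instance):

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    # Driving a real browser means images, CSS and JS get fetched just like a
    # normal visitor's would, so the traffic pattern looks legitimate.
    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/index.php")
        # JavaScript-rendered content is already in the DOM at this point.
        for row in driver.find_elements(By.CSS_SELECTOR, "div.listing"):
            print(row.text)
    finally:
        driver.quit()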

~~~
goostavos
Regarding the JavaScript problem: I'd suggest checking out the mobile
versions of the sites before you hop to a weighty solution like Selenium.
It could be a _very_ simple solution to the problem :)

I recently built a twitter bot that did some scraping and posting. I beat my
head against a wall for a couple of hours trying to find a good tool to deal
with all the javascript driven stuff. I happened to get an update on my phone
when it dawned on me that the mobile.twitter site is, for the most part,
simple HTML stuff. Once I realized that, I was able to programmatically log
into my account with no problems, and the rest of Twitter was unlocked for me.
I could scrape and post to (almost) my heart's content.
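
For illustration, a rough sketch of that kind of mobile-site login with
mechanize (the URL and form field names are made-up placeholders, not the
actual mobile.twitter.com form):

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open("https://mobile.example.com/login")   # hypothetical mobile login page
    br.select_form(nr=0)                          # first form on the page
    br["username"] = "my_account"                 # assumed field names
    br["password"] = "my_password"
    br.submit()
    html = br.response().read()                   # plain HTML, no JavaScript needed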

However, there were a few very big problems, which make me feel that scraping
is not the way to go about things. I certainly wouldn't build a service based
around scraping a particular site's data.

When I had my Twitter bot operational, I would get blocked from Twitter for
hours at a time. It seemed that anytime I hit their servers too hard, or
crossed some threshold, I would be locked out. I'm assuming it was some kind
of IP-level ban, because I wasn't even able to access the site from an actual
browser.

I was able to deal with the setback by setting up a script that repeatedly
checked whether it could reach the site, and then relaunched the scraper once
access was restored, but the solution was just a band-aid. That would
translate to significant downtime if I were running a service that counted on
access to their data. The ban-hammer is too easily laid down.
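
The band-aid amounted to something like this (run_scraper() here is a
hypothetical entry point for the bot, and the poll interval is arbitrary):

    import time
    import requests

    def site_is_reachable(url):
        try:
            return requests.get(url, timeout=10).status_code == 200
        except requests.RequestException:
            return False

    while True:
        if site_is_reachable("https://mobile.twitter.com/"):
            run_scraper()       # assumed to return (or raise) once blocked again
        time.sleep(15 * 60)     # wait before checking whether the ban has lifted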

Finally, just as a word of caution, I'd warn prospective scrapers to be
careful of just who you scrape. I've inadvertently "DDoS'd" a site when a
multiprocessed script got away from me. It spawned 1000+ instances of this
particular request, all of which were doing their best to beat the bejesus out
of this small website's servers. The site ended up going down for a couple of
hours, I assume because of a bandwidth cap or something.

So, my point being, scraping is cool, but (1) I'm unsure if I agree with
relying on it over a proper API, and (2) with great power comes great
responsibility! Be nice to the smaller guys, and don't punish their servers
too badly.

~~~
kybernetyk
For my Twitter bot I extracted the xAuth keys from Twitter's official Mac
client(s) (Tweetie 1 and 2 have different keys) and used those to access the
API. To Twitter, the bot looked like the official client, and they couldn't
ban it without banning their official clients.

And XAuth made account creation and log in a breeze as there was no need for
OAuth tokens - username/password was enough.

But you're not always that lucky, and many websites are heavily JS driven.
For Reddit I had to resort to Selenium.

~~~
traxtech
Reddit's JSON API (just add .json to any URL) is not good enough for you?
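
For example (a minimal sketch with requests; the subreddit is arbitrary, and
Reddit wants a descriptive User-Agent):

    import requests

    resp = requests.get(
        "https://www.reddit.com/r/programming/.json",
        headers={"User-Agent": "my-little-bot/0.1"},
    )
    for child in resp.json()["data"]["children"]:
        print(child["data"]["title"])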

~~~
notimetorelax
Another proof that spending 15 minutes on research can save you days in
development and production.

~~~
kybernetyk
And your comment is another proof that people tend to assume everyone else is
an idiot.

~~~
chill1
I think his comment was quite appropriate, and did not feel there was an
implication that he thought anyone was dumb for not having known about the
alternative approach to getting Reddit data.

Oftentimes programmers, and the managers that drive them, are way too quick
to charge ahead building or brute-forcing a solution. If they would just be
patient and stop for a moment: spending even a mere 30 minutes extra doing
your homework on a problem can save hours or days in dev time.

~~~
joesb
Saying that this was proof that 15 minutes of research into the Reddit API
would have saved his time implies that kybernetyk didn't do that research.

But kybernetyk already said he did the research and that Reddit's API is not
good enough for his requirements.

So this is not a case where 15 minutes of research would have saved time, and
the comment implied he assumed kybernetyk didn't do his research, i.e. was
dumb for not searching.

~~~
notimetorelax
Let me tell you what I thought when I wrote it...

I did not assume that kybernetyk was dumb or anything; I simply chuckled and
thought to myself: ouch, haven't I made a similar mistake before?! Please
don't assume the worst when reading someone's comment.

------
derrida
(shameless plug) I can scrape asynchronously, anonymously, with JS wizardry,
and feed it into your defined models in your MVC (e.g. Django). But! I need to
get to a hacker conference on the other side of the world (29c3). Any other
time of year, I'd just drop a tutorial. See profile if you'd like to help me
with a consulting gig.

EDIT: Knowledge isn't zero-sum. Here's an overview of a kick-ass way to
spider/scrape:

I use Scrapy to spider asynchronously. When I define the crawler bot as an
object, if the site contains complicated stuff (stateful forms or javascript)
I usually create methods that involve importing either Mechanize or QtWebKit.
XPath selectors are also useful because you don't have to specify the
entire XML tree from trunk to leaf. I then import pre-existing Django models
from a site I want the data to go into and write to the DB. At this point you
usually have to convert some types.
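
A compressed sketch of that flow (assuming Django settings are already
configured; the Article model and XPaths here are placeholders, and the
pipeline still needs registering in ITEM_PIPELINES):

    import scrapy
    from myapp.models import Article          # assumed pre-existing Django model

    class ArticleSpider(scrapy.Spider):
        name = "articles"
        start_urls = ["http://example.com/articles/"]

        def parse(self, response):
            # // matches nodes anywhere, so you don't spell out the whole tree
            for node in response.xpath("//div[@class='article']"):
                yield {
                    "title": node.xpath(".//h2/text()").get(),
                    "body": node.xpath(".//p/text()").get(),
                }

    class DjangoWriterPipeline(object):
        def process_item(self, item, spider):
            # convert types here if needed before writing to the DB
            Article.objects.create(title=item["title"], body=item["body"])
            return item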

I find Scrapy cleaner and more like a pipeline so it seems to produce less
'side effect kludge' than other scraping methods (if anybody has seen a
complex Beautiful Soup + Mechanize scraper you know what I mean by 'side
effect kludge'). It can also act as a server to return JSON.

Being asynchronous, you can do crazy req/s.

I will leave out how to do all this through Tor because I don't want the Tor
network being abused but am happy to talk about it one on one if your interest
is beyond spamming the web.

Through this + a couple of unmentioned tricks, it's possible to get _insane_
data, so much so it crosses over into security research & could be used for
pen-testing.

------
toyg
And this is why we can't have nice things.

Web scraping, as fun as it is (and btw, this title _again_ abuses "Fun and
Profit"), is not a practice we should encourage. Yes, it's the sort of dirty
practice many people do, at one point or another, but it shouldn't be
glorified.

~~~
freshhawk
So you're not so hot on the whole search engine thing?

The article does slide into the sketchy side (I've always wanted an excuse to
do that client-side JavaScript trick too), but I found it more interesting
because of that; these aren't secrets. Maybe if I put my "won't somebody
please think of the children" hat on, I agree that glorifying the use of
trojan code to potentially DDoS someone's server, just to get around rate
limits the server's owners want, is bad. But adults, and especially adult
self-described hackers, should be able to read this without mock outrage;
it's interesting and it's happening all the time.

You can't condemn web scraping though, that's the backbone of the services we
all depend on for most internet related things. That's the whole point of
structured markup and the world wide web itself.

~~~
cloverich
> So you're not so hot on the whole search engine thing?

They scrape to generate links for _users_ to _go to the site_. That's quite
different than scraping for...any other purpose? So it seems. Would you
(anyone) argue otherwise? (genuine curiosity).

~~~
robryan
They are also using title, description, some snippets from the page and taking
a cached version of the site and images you can view without having to visit
the site itself. They are also using this data as a product to sell
advertising against.

If there wasn't so much benefit for most sites in being in search engine
indexes, you would think at least some would object to this scraping.

There would be lots of other scraping that websites want to prevent that takes
even less data than this. It just doesn't provide much in return for the
website.

~~~
polyfractal
Google is even moving into the territory of scraping content to display.
Relevant Wikipedia snippets are now being displayed on the search page as a
sidebar. While Wikipedia probably doesn't care, there are plenty of other
sites that would not like Google to scrape their content and display it on
the search page.

~~~
robryan
Yeah, Wikipedia is Creative Commons, so that should be okay? You are right
though; I wonder if they have the rights to the sports results and weather
data that they are pulling.

They have even convinced us all to go mark up our page to help them pull stuff
like ratings and reviews out.

~~~
kragen
Sports results are facts and are statutorily not subject to copyright in the
US.

------
rsingel
There are some recent federal cases (Weev:
<http://www.wired.com/opinion/2012/11/att-ipad-hacker-when-embarassment-becomes-a-crime/>,
Aaron Swartz:
<http://www.wired.com/threatlevel/2012/09/aaron-swartz-felony/>, and a
prosecution of scalpers:
<http://www.wired.com/threatlevel/2010/07/ticketmaster/>) that view scraping
as a felony hacking offense. The feds think that an attempt to evade CAPTCHAs,
IP and MAC blocks is a felony worthy of years in prison.

In fact, the feds might think that clearing your cookies or switching browsers
to get another 10 free articles from the NYTimes is also felony hacking.

Which is to say, be careful what you admit to in this forum AND how you
characterize what you are doing in your private conversations and e-mails.

Weev now faces a decade or more in prison because he drummed up publicity by
sending emails to journalists that used the verb "stole".

~~~
hartleybrody
Very good point, I've added the following disclaimer:

    
    
      While scraping can sometimes be used as a legitimate way to
      access all kinds of data on the internet, it’s also important
      to consider the legal implications. As was pointed out in the
      comments on HN[1], there are many cases where scraping data 
      may be considered illegal, or open you to the possibility of
      being sued. Similar to a firearm, web scraping techniques
      can be used for utility or sport, while other uses can land
      you in jail. I am not a lawyer, but you
      should be smart about how you use it.
    

[1]: Linking to this (parent) comment

~~~
rsingel
Thanks for adding that and linking. The feds are nutter butter these days.

------
kaffeinecoma
From the article:

    
    
       Since the third party service conducted rate-limiting based on IP
       address (stated in their docs), my solution was to put the code that
       hit their service into some client-side Javascript, and then send
       the results back to my server from each of the clients.
    
       This way, the requests would appear to come from thousands of
       different places, since each client would presumably have their own
       unique IP address, and none of them would individually be going over
       the rate limit.
    

Pretty sure the browser Same-Origin Policy forbids this. Think about it: if
this worked, you'd be able to scrape inside corporate firewalls simply by
having users visit your website from behind the firewall.

~~~
kanzure
> Since the third party service conducted rate-limiting based on IP

By the way, that's one of my projects. You can use a basic Fibonacci-related
algorithm to figure out (in a minimal number of requests) exactly what the
rate limit is. This way, you can scrape at just under the maximum limit. I
am still working on this core library though. :|
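
One way such a probe might work (a hedged sketch, assuming the service answers
429 once you cross the limit; the numbers are requests per minute):

    import time
    import requests

    def probe_rate_limit(url, max_minutes=30):
        a, b = 1, 1
        last_good = 0
        deadline = time.time() + max_minutes * 60
        while time.time() < deadline:
            blocked = False
            for _ in range(b):                 # attempt b requests over one minute
                if requests.get(url).status_code == 429:
                    blocked = True
                    break
                time.sleep(60.0 / b)
            if blocked:
                return last_good               # highest rate that never tripped the limit
            last_good = b
            a, b = b, a + b                    # 1, 2, 3, 5, 8, 13, ... requests/minute
        return last_good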

~~~
hartleybrody
Sounds pretty interesting! Be sure to share it when it's ready.

------
kevinpfab
The issue with web scraping is that it relies on the scraper to keep up with
changes made to the site.

If a site owner changes the layout or implements a new feature, the programs
depending on the scraper immediately fail. This is much less likely to happen
when working with official APIs.

~~~
prezjordan
This should be stressed: sites like Facebook do exactly this. Constant
changes mean constantly updating your scraper. And when it comes to A/B
testing, your scraper needs to intelligently find the data, which might not
always be in the same place.

Sidenote: I wonder if any webapps use randomly generated IDs and class names
(linked in the CSS) to prevent scraping. I guess this would be a caching
nightmare, though.

~~~
randomdata
_I wonder if any webapps use randomly generated IDs and class names (linked in
the CSS) to prevent scraping._

In my spare time, I've been playing around with "scrapers" (I like to call
them web browsers, personally) that don't even look at markup.

My first attempt used a short list of heuristics that proved to be eerily
successful for what I was after, to the point that I could throw random
websites with similar content (discussion sites, like HN) but vastly
dissimilar structures at it, and it would return what I expected about, I'd
say, 70% of the time in my tests.
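
To give a flavour of the kind of heuristic that can work here (a hypothetical
sketch, not necessarily one of mine): score blocks by how much plain text
they hold versus how link-heavy they are, and keep the best candidates. The
thresholds are arbitrary.

    from bs4 import BeautifulSoup

    def candidate_blocks(html, min_text=200, max_link_ratio=0.3):
        soup = BeautifulSoup(html, "html.parser")
        results = []
        for el in soup.find_all(["div", "td", "article", "section"]):
            text = el.get_text(" ", strip=True)
            links = " ".join(a.get_text(" ", strip=True) for a in el.find_all("a"))
            # long, mostly non-link text is a decent proxy for "content"
            if len(text) >= min_text and len(links) <= max_link_ratio * len(text):
                results.append(text)
        return results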

After that, I started introducing some machine learning in an attempt to
replicate how I determine what blocks are meaningful. My quick prototype
showed mixed results, but worked well enough that I feel that with some tweaking it
could be quite powerful. Sadly, I've become busy with other things and haven't
had time to revisit it.

With that, swapping variables and similar techniques to thwart crawlers seems
like it would be easily circumvented.

~~~
freshhawk
I would be really interested in knowing which heuristics or machine learning
techniques produced decent results. That's if I can't convince you to open
source the code. I'm working on the same problem at the moment.

~~~
rohamg
What about something like <http://tubes.io>?

~~~
freshhawk
We're fine with scrapers and scraping infrastructure, although tubes.io is a
very interesting idea.

I'm more interested in what I can do to write fewer scrapers since the content
is, at a high level, relatively similar. I've just started with experiments
writing "generic" scrapers that try and extract the data without depending on
markup. It's going to eventually work well enough but to get the error rate
down to an acceptable level is going to take a lot of tweaking and trial and
error.

There are a few papers on this, but not much out there. That's why I was
interested in someone else working on the same problem in a different space.

------
cynwoody
Great read!

In the past, I have successfully used HtmlUnit to fulfill my admittedly
limited scraping needs.

It runs headless, but it has a virtual head designed to pretend it's a user
visiting a web application to be tested for QA purposes. You just program it
to go through the motions of a human visiting a site to be tested (or
scraped). E.g., click here, get some response; for each whatever in the
response, click and aggregate the results in your output (to whatever
granularity).

Alas, it's in Java. But, if you use JRuby, you can avoid most of the nastiness
that implies. (You do need to _know_ Java, but at least you don't have to
_write_ Java.)

Hartley, what is your recommended toolkit?

I note you mentioned the problem of dynamically generated content. You develop
your plan of attack using the browser plus Chrome Inspector or Firebug. So
far, so good. But what if you want to be headless? Then you need something
that will generate a DOM as if presenting a real user interface but instead
simply returns a reference to the DOM tree that you are free to scan and react
to.

~~~
bdcravens
Headless: Xvfb on Linux (virtual framebuffer; lets you run apps that require
a GUI). You can use one of the many options that include WebKit (like
phantom.js or the capybara-webkit gem), or Selenium if you want a real browser
like Firefox to do the work.

~~~
iamjustlooking
PhantomJS doesn't need Xvfb anymore; it can run headless without this
dependency.

~~~
catch23
Well, it wasn't due to anything in PhantomJS actually; it was because Qt
introduced Project Lighthouse and their Qt Platform Abstraction. Project
Lighthouse was a fork that got integrated into the Qt 4.8 that PhantomJS
includes. (You can see the entire Qt source tree if you git-clone PhantomJS.)

------
RaSoJo
I love HTML scraping. But JavaScript??? The juiciest data sets these days
are increasingly behind JS. For the life of me I can't get around scraping
JS :(

I do know that Selenium can be used for this, but I have yet to see a decent
example. Does anyone have any good resources/examples on JS scraping that
they could share? I would be eternally grateful.

~~~
alexmic
If you are using Python, you can also use pyv8 to evaluate Javascript code.

~~~
kanzure
Yes, but if you want the DOM you would have to use something like WebKit. So
something like pyphantomjs might hit the right spot. It's a Python
re-implementation of PhantomJS.

<https://github.com/kanzure/pyphantomjs>

------
bdcravens
Another issue not covered: file downloads. Let's say you have a process that
creates a dynamic image, or logs in and downloads dynamic PDFs. Even Selenium
can't handle this (the download dialog is an OS-level feature). At one point I
was able to get Chrome to auto-download in Selenium, but had zero control over
filename and where it was saved. I ended up using iMacros (the paid version)
to drive this, using Windows instances; their Linux version is comparatively
immature.

~~~
dphase
I've done this successfully with Ruby Mechanize.

~~~
bdcravens
Awesome. I'd love some hints or links, as I'm always looking to refactor.

~~~
goostavos
In general, if you're going the mechanize route, _.retrieve()_ is the function
you're looking for.

e.g.

    
    
      import mechanize

      br = mechanize.Browser()
      br.retrieve("https://www.google.com/images/srpr/logo3w.png", "google_logo.png")[0]
    

Mechanize doesn't really have proper docs, but just about everything you'd
need can be figured out from the very lengthy examples page on their site.

~~~
bdcravens
Playing with it now, and while it seems to hit my download need, I can't seem
to get it to play nice with sites that are JavaScript dependent. Am I missing
something, or is there a way to plugin an underlying WebKit engine?

~~~
bryogenic
PhantomJS is capable of downloading binary content from js dependent sites but
it is a journey to get it working as it is not an out-of-the-box feature.
Instead use CasperJS to drive Phantom and get a ton of snazzy features
including simple binary downloads. Happy scraping!

------
mmastrac
I'm surprised that no one has attempted to write a Twitter client based solely
on scraping to get around the token limits.

~~~
splatzone
Or an alternative API that uses the scraped data from Twitter to make
requests... but that might be getting a bit ambitious (and legally dodgy)

~~~
jQueryIsAwesome
Create a script that scrapes proxies and then use those to scrape Twitter,
host it on a Russian domain, claim 140 characters can't be copyrighted, claim
that the tweets are being extracted from third-party sites that use the
Twitter API but lack any kind of TOS or disclaimer; sell API access, profit!

------
lazyjones
Scraping could be made a lot harder by website publishers, but they all depend
on the biggest scraper accessing their content so it can bring traffic: Google
...

The biggest downside of scraping is that it often takes a long time to get
very little content (e.g. scraping online stores with extremely bloated HTML
and only 10-25 products per page).

~~~
JoeAltmaier
As a pioneer of scraping (NetProphet, the first interactive stock charting
app with push data), we initially scraped every quote we had in our database
from other sites.

The fundamental problem is, web pages can change a lot. We constantly had
scraper scripts fail either because the web pages changed for some innocuous
reason, or they noticed the scraping and blocked us.

We resorted to a list of scrape targets and constantly-updating scrape-scripts
to adapt continuously to the 'market'. We also pinged each target to find the
least congested.

Eventually we got our own stock feed (guy that did that is a research
scientist at Adobe now) and stopped scraping altogether. But it was a wild
ride.

~~~
lazyjones
We still need to scrape many (several hundred) clients' websites because they
are unable to give us product feeds (adequate ones, or any at all) for their
stores. But hey, it gives us a small edge because we try harder than the
competition.

------
joe_the_user
An important topic.

The main caveat is that this may violate a site's terms of use and thus
website owners may feel called upon to sue you. Depending on circumstances,
the legal situation here can be a long story.

~~~
frabcus
Yes, it is complicated. That said, this is partly just because there aren't
enough cases, and partly because the law hasn't stabilised (it took a century
to stabilise after the invention of the printing press). It isn't yet clear
what rights society should grant in order to maximise business.

My take on it, from ScraperWiki's point of view:
<http://blog.scraperwiki.com/2012/04/02/is-scraping-legal/>

------
zarino
Related: If you fancy writing scrapers for fun _and_ profit, ScraperWiki (a
Liverpool, UK-based data startup) is currently hiring full-time data
scientists. Check us out!

<http://scraperwiki.com/jobs/#swjob5>

~~~
hartleybrody
very well played :)

------
jbranchaud
The title makes it sound as if there is going to be some discussion of how the
OP has made web scraping profitable, but this seems to have been left to the
reader's imagination.

Otherwise, great article! I agree that BeautifulSoup is a great tool for this.

------
mcgwiz
It's pointless to think of it as "wrong" for third-parties to web-scrape.
Entities will do as they must to survive. The onus of mitigating web scraping,
if in the interests of the publisher, is on the publisher.

As a startup developer, third-party scraping is something I need to be aware
of, that I need to defend against if doing so suits my interests. A little bit
of research shows that this is not impractical. Dynamic IP restrictions (or
slowbanning), rudimentary data watermarking, caching of anonymous request
output all mitigate this. Spot-checking popular content by running it through
Google Search requires all of five minutes per week. At that point, the
specific situation can be addressed holistically (a simple attribution license
might make everyone happy). With enough research, one might consider
hellbanning the offender (serving bogus content to requests satisfying some
certain heuristic) as a deterrent. A legal pursuit with its cost would likely
be a last resort.

Accept the possibility of being scraped and prepare accordingly.
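
For instance, slowbanning can be as simple as adding latency once an IP gets
too chatty, rather than returning a hard error that tips the scraper off (a
rough sketch; the window and threshold are arbitrary):

    import time
    from collections import defaultdict, deque

    WINDOW = 60        # seconds
    THRESHOLD = 30     # requests per window before slowing the client down
    _hits = defaultdict(deque)

    def slowban_delay(ip):
        now = time.time()
        hits = _hits[ip]
        hits.append(now)
        while hits and hits[0] < now - WINDOW:   # drop hits outside the window
            hits.popleft()
        excess = len(hits) - THRESHOLD
        return 0.0 if excess <= 0 else min(10.0, 0.5 * excess)

    # in a request handler: time.sleep(slowban_delay(client_ip))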

------
im3w1l
People seem to wonder how to handle ajax.

The answer is HttpFox. It records all http-requests.

1. Start recording

2. Do some action that causes data to be fetched

3. Stop recording.

You will find the URL, the returned data, and a nice table of GET and POST
variables.

<https://addons.mozilla.org/en-us/firefox/addon/httpfox/>

~~~
mylittlepony
Isn't this the same as what the Net tab from Firebug does?

~~~
hartleybrody
Yah, I don't understand why people make things so complicated once Javascript
gets involved. Just inspect the XHR traffic to your browser ("Network" tab in
Web Inspector, Firebug, etc) as you update the information on the page. You'll
quickly discover what are essentially undocumented APIs returning the data
used to generate the page. You don't need to use or even read through the
Javascript that's calling them, you just need to figure out what parameters
and cookies are being sent, and tweak those as you wish.

You might have to spoof the Referer header so that it thinks the request is
still coming from their website.
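
In practice that boils down to replaying the XHR you saw in the Network tab;
the endpoint, parameters and cookie below are placeholders for whatever the
inspector shows you:

    import requests

    resp = requests.get(
        "http://example.com/ajax/listings",
        params={"page": 2, "sort": "newest"},
        cookies={"sessionid": "value-from-your-browser"},
        headers={
            "Referer": "http://example.com/listings",   # some endpoints check this
            "X-Requested-With": "XMLHttpRequest",
        },
    )
    data = resp.json()   # these endpoints usually return JSON directly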

------
metalruler
From a site owner's perspective: if you have a LOT of data then scraping can
be very disruptive. I've had someone scraping my site for literally months,
using hundreds of different open proxies, plus multiple faked user-agents, in
order to defeat scraping detection. At one point they were accessing my site
over 300,000 times per day (3.5/sec), which exceeded the level of the next
busiest (and welcome) agent... Googlebot. In total I estimate this person has
made more than 30 million fetch attempts over the past few months. I
eventually figured out a unique signature for their bot and blocked 95%+ of
their attempts, but they still kept trying. I managed to find a contact for
their network administrator and the constant door-knocking finally stopped
today.

------
mbustamante
When I need to scrape a webpage, I use phpQuery
(<http://code.google.com/p/phpquery/>). It's dead simple if you have
experience with jQuery, and I get all the benefits of a server-side
programming language.

~~~
zevyoura
A similar module for node.js: <https://github.com/mape/node-scraper>

~~~
latchkey
Better than that is <http://node.io/>. Also, don't use jsdom (it is slow and
strict); <https://github.com/MatthewMueller/cheerio> is much better.

------
SiVal
What I wish I could do is capture Flash audio (or any audio) streams with my
Mac. All I want is to listen to the audio-only content with an audio player
when I'm out driving or jogging, etc. Audio-only content that has to be played
off a web page usually runs into the contradiction that if I'm in a position
to click buttons on my web browser (not driving, for example), I'm in a
position to do my REAL work and have no time to listen to the audio. I'll go
to the web page, see whatever ads they may have, but then I'd like to be able
to "scrape" the audio stream into a file so I don't have to sit there staring
at a static web page the whole time I'm listening.

~~~
codewright
I used to work at a company where capturing flash video and audio streams was
a regular part of our work. You're not going to like the answer.

You basically have to proxy everything through a proxy that can be given a
command or otherwise instructed to capture the top 3 or 4 streams from the
website. From there you can either dumbly accept the largest one or start
checking byte headers.

------
SG-
When scraping HTML where data gets populated with JS/Ajax, you can use a web
inspector to see where that data is coming from and GET it manually; it will
likely be some nice JSON.

Scraping used to be the way to get data back in the day, but websites didn't
change their layout/structure on a weekly basis back then, and were much more
static when it came to structure.

Having recently written a small app that was forced to scrape HTML and having
to update it every month to make it keep working, I can't imagine doing this
for a larger project and maintaining it.

------
alhenaadams
To all of HN: all this being said, how do we prevent our sites from being
scraped in this way? What can you not get around, and what, to your mind, are
the potential uses for an 'unscrapeable' site?

~~~
zalew
If you don't want something to be scraped, don't publish it on the internet.
Scraping prevention reminds me of blocking right-click and other ridiculous
solutions back in the day. Hey, if I can view it, it means the data reached my
endpoint.

------
thomasrambaud
I think the author just completely missed the point with API vs. screen
scraping. The API allows for accessing structured data. Even if the website
changes, the data would still be accessible the same way through the API,
whereas the author would have to rewrite his code each time an update is made
to the front-end code of the website.

A simple API providing a simple JSON response with HTTP basic auth is far more
efficient than a screen scraping program where you have to parse the response
using HTML/XML parsers.

~~~
frabcus
This isn't always the case - APIs often change. Facebook, for example, is (at
least was, a few years ago) notoriously bad at changing in an unpredictable
and buggy way, and I stopped using it for that reason. Some HTML scrapers are
more reliable than that.

As for efficiency, again, it's not such an issue. HTML is much better these
days compared to 10 years ago; a simple CSS selector often does the job.

~~~
thomasrambaud
This is true, but APIs are often versioned.

Concerning efficiency, that is true: CSS and XPath processors, at least, both
offer very nice performance.

But downloading 70KB of HTML each time you only need a single piece of data,
where the API request costs only a couple of KB (avg < 2KB), can be such a
pain if you need to do this frequently. This can be handled with a scalable
configuration, but I find that a bit overkill.

------
6ren
This illustrates the significant difference between the use-cases of "web
APIs" and conventional APIs, that the former are more like a database CRUD
(including REST), rather than a request for computation. They (usually) are an
alternative interface to a website (a GUI), and that's how most websites are
used. e.g. an API for HN would allow story/comment retrieval, voting,
submission, commenting.

They _could_ be used for computation, but (mostly) aren't.

------
treelovinhippie
Not every site. There is data I would really love to access on Facebook
without having to gain specific authorization from the user. It's odd that for
most user profiles the most you can extract via the Graph API (with no access
token) is their name and sex, whereas I can visit their profile page in the
browser and see all sorts of info and the latest updates (without even being
friends with them).

Tried scraping Facebook. They have IP blocks and the like.

~~~
RBerenguel
Do it in JS, client-side, with 3-second delays. I have used this to get
available data (location, name, status, etc.) from the latest 500 available
likes of a page I manage.

------
kuhn
This is a shameless plug but I've created a service that aims to help with a
lot of the issues that OP describes such as rate limiting, JS and scaling.
It's a bit like Heroku for web scraping and automation. It's still in beta but
if anyone is interested then check out <http://tubes.io>.

------
senthilnayagam
I have done a bit of scraping with Ruby Mechanize; when we hit limits, we
have circumvented them with proxies and Tor.

Google as a search engine crawls almost all sites, but offers very little
usable stuff to other bots:

<http://www.google.com/robots.txt>

Disallow: 247, Allow: 41
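
Those counts are easy to reproduce with a few lines of Python:

    import requests

    lines = requests.get("http://www.google.com/robots.txt").text.splitlines()
    disallow = sum(1 for l in lines if l.lower().startswith("disallow:"))
    allow = sum(1 for l in lines if l.lower().startswith("allow:"))
    print("Disallow: %d, Allow: %d" % (disallow, allow))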

------
kragen
Be careful. I got banned from Google for scraping. I did a few hundred
thousand searches one day, and that night, they banned my office IP address
for a week. This was in 2001, so I estimate I cost them a few hundred dollars,
which is now impossible to repay. :(

------
clark-kent
The problem with scraping instead of using the API is that when the website
makes even a slight change to their markup it breaks your code. I have had
that experience and it's a living hell. I can say it's not worth it to scrape
when there is an API available.

------
aleprok
There is just one major problem with not needing a stinking API: you cannot
POST on behalf of a user without requiring them to give their password to
you, which would actually give you full access to their account instead of
the limited access an API grants.

~~~
Dylan16807
You seem to be talking about a specific site? Which one?

~~~
aleprok
Any social network where you can post messages for the user in the user's
message stream.

------
thenomad
I had to do some scraping of a rather JavaScript-heavy site last year, and I
found the entire process was made almost trivial using Ruby and Nokogiri.
Particularly relevant for a non-uber-programmer like me, it's simple to use,
as well as powerful.

------
jmgunn87
So bloody true. A web page is a resource just like an XML doc; there's no
reason public-facing URLs and web content can't be treated as such, and I
regularly take advantage of that fact as well. Great post.

------
pknerd
If it's not automated and only runs a few times, I prefer iMacros to perform
tasks on my behalf. The best part of it is that you can integrate a DB to
record your desired data.

------
reledi
Automated web testing tools, such as Watir and Selenium, are also pretty good
options. I'm especially surprised Watir hasn't been mentioned yet in the
comments.

~~~
stackthatcode
Indeed - or WatiN, the .NET port of Watir. I've done some pretty heavy-duty
scraping and automation with WatiN, which included building an OO framework
that trivialized writing scripts. Good stuff.

------
tectonic
Checkout <http://selectorgadget.com> as a useful tool for coming up with CSS
selectors.

------
opminion
How about publicly available web scraping tools as a way to encourage sites to
provide good APIs? Everybody wants efficiency, after all.

------
bconway
_No Rate-Limiting_

Clearly someone's never spent time diagnosing the fun that is scraping HN
(yes, an unofficial API is available).

------
shocks
Node.js is excellent for web scraping, especially if you're scraping large
amounts very often.

~~~
chadscira
I made this module for this exact reason:
<https://github.com/icodeforlove/node-requester>. Supports horrible things
like proxy rotation.

~~~
kanzure
> Supports horrible things like proxy rotation.

Do you have any plans to track which proxies are actually working, or how
quickly each one is blocked? I want a reverse proxy on my outgoing requests
that knows how to shift my traffic around properly so that I don't get banned.
I don't want to be rate limited and I don't want to sit here for weeks trying
to figure out wtf the rate limit is in the first place.

~~~
chadscira
Interesting; I'm sure you could build this into it. You could hook onto the
didRequestFail method and flag IPs (log this.proxy to see what the proxy
was). All I would need to do is add a method that makes it easier to
add/remove proxies.

------
ComputerGuru
What is it with all the headlines this week abusing the classic "for fun and
profit" title?

------
eranation
Relevant:
<http://www.codinghorror.com/blog/2009/02/rate-limiting-and-velocity-checking.html>

------
yayitswei
I've found diffbot to be quite useful for scraping.

------
buster
I do not agree with that article; it makes me sick. And this guy is basically
some "marketer", so no wonder he gets quite a bit of stuff wrong, imo. :p

~~~
namidark
What did he get wrong?

------
thisisnotatest
Craigslist, anyone?

~~~
marcamillion
I actually wrote a CL crawler in Ruby -
<https://github.com/marcamillion/craigslist-ruby-crawler>

I used it to crawl for freelance web dev gigs, but it can be re-purposed to do
anything.

