I Don’t Need No Stinking API: Web Scraping For Fun and Profit (hartleybrody.com)
279 points by hartleybrody 1864 days ago | 172 comments



I've done a ton of scraping (mostly legal: on behalf of end users of an app, on sites they have legit access to). This article misses something that affects several sites: JavaScript-driven content. Faking headers and even setting cookies doesn't get around this. It is, of course, easy to get around using something like PhantomJS or Selenium. Selenium is great because, unlike all the whiz-bang scraping techniques, you're driving a real browser and your requests look real (if you make 10,000 requests to index.php and never pull down a single image, you might look a bit suspicious). There's a bit more overhead, but micro instances on EC2 can easily run 2 or 3 Selenium sessions at the same time, and at 0.3 cents per hour for spot instances, you can have 200-300 browsers going for 30-50 cents/hour.
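For illustration, here's a minimal sketch of that Selenium approach with the Python bindings (older-style API; the URL and selector are placeholders, not from any real site):

  # Minimal sketch: drive a real Firefox instance with the Selenium Python bindings.
  # The URL and CSS selector below are placeholders.
  from selenium import webdriver

  driver = webdriver.Firefox()  # a real browser, so requests look like real traffic
  try:
      driver.get("http://example.com/listings")
      # JavaScript-rendered content is present in the live DOM, so ordinary
      # element lookups work even when the raw HTML response is nearly empty.
      for row in driver.find_elements_by_css_selector("div.listing"):
          print(row.text)
  finally:
      driver.quit()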


In regards to the JavaScript problem: I'd suggest checking out the mobile versions of the sites first, before you hop to a weighty solution like Selenium. It could be a very simple solution to the problem :)

I recently built a Twitter bot that did some scraping and posting. I beat my head against a wall for a couple of hours trying to find a good tool to deal with all the JavaScript-driven stuff. I happened to get an update on my phone when it dawned on me that the mobile.twitter.com site is, for the most part, simple HTML stuff. Once I realized that, I was able to programmatically log into my account with no problems, and the rest of Twitter was unlocked for me. I could scrape and post to (almost) my heart's content.
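As a rough sketch of that general pattern with a session-based HTTP client (the URLs and form field names here are hypothetical, not Twitter's actual ones):

  # Hedged sketch of logging into a simple mobile site over plain HTTP.
  # URLs and form field names are hypothetical.
  import requests
  from bs4 import BeautifulSoup

  session = requests.Session()
  login_page = session.get("https://mobile.example.com/login")
  soup = BeautifulSoup(login_page.text)
  # Many login forms embed a hidden CSRF/authenticity token that must be echoed back.
  token = soup.find("input", {"name": "authenticity_token"})["value"]

  session.post("https://mobile.example.com/login", data={
      "username": "my_user",
      "password": "my_pass",
      "authenticity_token": token,
  })
  # The session now carries the login cookies, so subsequent GETs return the
  # plain-HTML pages that are easy to parse.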

However, there were a few very big problems, which make me feel that scraping is not the way to go about things. I certainly wouldn't build a service around scraping a particular site's data.

When I had my Twitter bot operational, I would get blocked from Twitter for hours at a time. It seemed that any time I hit their servers too hard, or crossed some threshold, I would be locked out. I'm assuming it was some kind of IP-level ban, because I wasn't even able to access the site from an actual browser.

I was able to deal with the setback by setting up a script to repeatedly check its access to the site, and then relaunch the scraper upon regaining access, but the solution was just a band-aid. That would translate to significant downtime if I were running a service that counted on access to their data. The ban-hammer is too easily laid down.

Finally, just as a word of caution, I'd warn prospective scrapers to be careful of just who you scrape. I've inadvertently "DDoS'd" a site when a multiprocessed script got away from me. It spawned 1000+ instances of this particular request, all of which were doing their best to beat the bejesus out of this small website's servers. The site ended up going down for a couple of hours; I assume because of a bandwidth cap or something.

So, my point being, scraping is cool, but (1) I'm unsure if I agree with relying on it over a proper API, and (2) with great power comes great responsibility! Be nice to the smaller guys, and don't punish their servers too badly.


For my twitter bot I extracted the xauth keys from twitter's official Mac client(s) (Tweetie 1 and 2 have different keys) and used those to access the API. To twitter the bot looked like the official client and they couldn't ban it without banning their official clients.

And XAuth made account creation and log in a breeze as there was no need for OAuth tokens - username/password was enough.

But you're not always as lucky as that and many websites are heavily JS driven. For Reddit I had to resort to selenium.


Is the Reddit JSON API (just add .json to any URL) not good enough for you?
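For example, a minimal sketch (the subreddit is arbitrary):

  # Minimal sketch: Reddit exposes JSON by appending .json to most URLs.
  import requests

  resp = requests.get("http://www.reddit.com/r/programming.json",
                      headers={"User-Agent": "my-little-bot/0.1"})
  for child in resp.json()["data"]["children"]:
      print(child["data"]["title"])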


Another proof that spending 15 minutes on research can save you days in development and production.


And your comment is another proof that people tend to assume everyone else is an idiot.


I think his comment was quite appropriate, and did not feel there was an implication that he thought anyone was dumb for not having known about the alternative approach to getting Reddit data.

Oftentimes programmers, and the managers that drive them, are way too quick to get going building or solving something with brute force. If they would just be patient and stop for a moment: spending even a mere 30 minutes extra doing your homework on a problem can save hours or days in dev time.


Saying that this was proof that spending 15 minutes researching the Reddit API would have saved his time implied that kybernetyk didn't do that research.

But kybernetyk already said he did the research before, and that Reddit's API is not good enough for his requirements.

So this is not a case where 15 minutes of research would have saved time. And his comment implied he assumed kybernetyk didn't do the research, i.e. was dumb for not searching.


Let me tell you what I thought when I wrote it...

I did not assume that kybernetyk was dumb or anything, I simply chuckled and thought to myself: ouch, haven't I made a similar mistake before?! Please don't assume the worst when reading someone's comment.


Seems like a situation where the language in one cultural context would be insulting, but in another, is merely a literal statement.


No, I didn't mean it that way. It's too common a mistake to make.


Obviously not - since I would have used it if it was?


Well, what exactly was crucially missing from the JSON one?


Reddit's API doesn't give you access to child comments past a certain number, so that could have been it.


My scraping is part of a transactional B2B service, not a high traffic social or B2C thing, so it's a different set of problems than those who want their hands on Twitter data. These are Fortune 100's, so if I can bring down their site, they have bigger problems. :-)


I wouldn't make that assumption. Do check return codes and load times, and back off if you see issues. If these sites are business partners/suppliers, you have a lot to lose if things go wrong. It's worth it to develop your relationship with the business owners of the services you're touching in the corresponding organizations. And do set a User-Agent string that declares who you are and provides a link for information; if you are doing business with them, it should be on a basis of honesty.
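A minimal sketch of that kind of politeness (the User-Agent string and URL are placeholders):

  # Hedged sketch: identify yourself and back off when the server looks unhappy.
  import time
  import requests

  HEADERS = {"User-Agent": "AcmeB2BSync/1.0 (+http://example.com/about-our-bot)"}

  def polite_get(url, max_retries=5):
      delay = 1.0
      for _ in range(max_retries):
          resp = requests.get(url, headers=HEADERS, timeout=30)
          # Back off on rate limiting or server errors instead of hammering the site.
          if resp.status_code in (429, 500, 502, 503):
              time.sleep(delay)
              delay *= 2
              continue
          resp.raise_for_status()
          return resp
      raise RuntimeError("giving up on %s" % url)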


If you're using ruby, I've found watir (http://watir.com/) to be very nice to use. There might be better alternatives now but it made my life easier when I had to scrape a bunch of our supplier's crappy B2B sites that required JavaScript.


+1. Thanks for the hints about Selenium.

My 2c about scraping - when you try to obtain data from large websites, always go for the JavaScript content. Sites like Newegg or Amazon* may change their HTML outline very often, even without a single visible alteration for the end user, and even your smartest regex can have a brain fart. In contrast, even when a site gets a major overhaul, the old JavaScript will most likely be left in place with all the up-to-date variables, because engineers will be wary of removing that code and breaking some functionality.

* given you have the rights to scrape.

Not that there are no tools to debug the site, but I found that websites like the ones mentioned, plus YouTube and a bunch of others, just don't fiddle too much with their JS.
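To make the "go for the JavaScript" idea concrete, a hedged sketch: many pages assign the interesting data to a variable as a JSON literal, which you can pull out without depending on the HTML layout at all (the variable name and URL are hypothetical):

  # Hedged sketch: pull a JSON blob out of an inline <script> assignment.
  # The variable name "window.productData" and the URL are hypothetical.
  import json
  import re
  import requests

  html = requests.get("http://example.com/product/123").text
  match = re.search(r"window\.productData\s*=\s*(\{.*?\});", html, re.DOTALL)
  if match:
      product = json.loads(match.group(1))
      print(product.get("title"), product.get("price"))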


> even your smartest regex can have a brain fart

If you're using regex to solve this sort of problem, your code deserves to break, I'm sorry.


I've found that regex is very brittle when you don't control what comes across. DOM traversal is far more reliable.


Agree... basically search methods that specify a branch or leaf locally rather than the entire tree structure can more often resist layout changes.

Regex for HTML is a bad idea ... http://stackoverflow.com/questions/590747/using-regular-expr...


Parsing arbitrary HTML is not the same as scraping a page for data -- that link isn't really that relevant.


Good point. I simply avoided regex for HTML for this reason without really justifying it (although it was a good choice).


This. You usually traverse the DOM. Either you use some XQuery/XPath magic or a library like Beautiful Soup.
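For instance, a minimal sketch of locating data by local structure rather than by matching the surrounding markup (the markup and selectors are made up):

  # Hedged sketch: anchor on the node you want instead of regexing the page.
  import lxml.html
  from bs4 import BeautifulSoup

  html = '<div class="item"><span class="price">$9.99</span></div>'

  # BeautifulSoup: navigate to the element by tag and class
  soup = BeautifulSoup(html)
  print(soup.find("span", class_="price").get_text())

  # lxml: the equivalent XPath, anchored on the leaf rather than the whole tree
  tree = lxml.html.fromstring(html)
  print(tree.xpath('//span[@class="price"]/text()')[0])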


Sizzle for life.


Not quite ready for prime time, but I am working on a project that makes it really easy to grab content from any site using a point-and-click interface: no XPaths, selectors, or regex.

You enter the URL you want to capture data from, it gets loaded in an iframe, you click on the text you need, and you set a schedule for receiving updates and how (email/Twitter DM). That's it.

It supports javascript driven content and can handle practically any website.

http://www.followwww.com


In my experience, you seldom need a full browser to extract data from JavaScript-heavy sites. You can often make your way with a little bit of reverse engineering, starting from a traffic capture and looking for parameters you don't understand in the HTML/JS code. Usually, there is nothing hidden. When they are effectively trying to make your life harder with JS, though, it is easily solved by feeding the offending algorithm to a JS interpreter (like python-spidermonkey).

Depending on your use case, headless may be simpler, but it has also many drawbacks that don't show at first, the main being that they're not simple to drive from remote processes as queue-consuming devices.

The article suggests BeautifulSoup as a parsing library for Python. If I'm not mistaken, BeautifulSoup is not actively maintained anymore, and other cleaner and faster solutions exist, like lxml.html. Ian Bicking wrote a good article on that topic: http://blog.ianbicking.org/2008/03/30/python-html-parser-per...


BeautifulSoup is in fact still actively maintained. “The current release is Beautiful Soup 4.1.3 (August 20, 2012).”

http://www.crummy.com/software/BeautifulSoup/

I hear it recommended the most among Pythonistas, and it's plenty clean and fast for my use. But if you're skeptical, I'd still look for a more up to date benchmark (or run your own) rather than rely on results from >4 years ago.


Looks like things have changed since the last time I checked. Thank you for pointing this out. Next time I'll check my facts twice before posting.

Still, lxml being basically a binding to libxml2, the performance comparison of the two libs should still hold. I heard it recommended too, in a Python talk about scraping, like 1 or 2 (at most) years ago.

BeautifulSoup may still be better for parsing broken documents, though I never had problems with lxml while using it on a very large variety of sites.


You can use BeautifulSoup with lxml if you like, although I just use the HTMLParser in lxml these days and don't use BeautifulSoup any more. It seems to work a little better, at least for my uses.

http://lxml.de/elementsoup.html
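A short, hedged sketch of both options:

  # lxml's own HTML parser vs. routing through BeautifulSoup (lxml.html.soupparser).
  import lxml.html
  from lxml.html import soupparser

  broken = "<p>Unclosed <b>tags<p>everywhere"

  # Plain lxml HTML parser -- fast and tolerant of most real-world markup.
  doc = lxml.html.fromstring(broken)

  # BeautifulSoup-backed parsing, for documents lxml alone handles badly.
  soup_doc = soupparser.fromstring(broken)

  print(doc.text_content(), soup_doc.text_content())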


(shameless plug) I can scrape asynchronously, anonymously, with JS wizardry, and feed it into your defined models in your MVC (e.g. Django). But! I need to get to a hacker conference on the other side of the world (29c3). Any other time of year, I'd just drop a tutorial. See profile if you'd like to help me with a consulting gig.

EDIT: Knowledge isn't zero-sum. Here's an overview of a kick-ass way to spider/scrape:

I use Scrapy to spider asynchronously. When I define the crawler bot as an object, if the site contains complicated stuff (stateful forms or JavaScript) I usually create methods that involve importing either Mechanize or QtWebKit. XPath selectors are also useful for not having to specify the entire XML tree from trunk to leaf. I then import pre-existing Django models from a site I want the data to go into and write to the DB. At this point you usually have to convert some types.
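A heavily simplified sketch of that kind of pipeline, using a newer Scrapy API than the 2012-era one (the spider, fields, XPaths, and Django model are all hypothetical):

  # Hedged sketch: a Scrapy spider feeding scraped items into a Django model.
  import scrapy

  class ListingSpider(scrapy.Spider):
      name = "listings"
      start_urls = ["http://example.com/listings"]

      def parse(self, response):
          # XPath lets you anchor on a local branch instead of the whole tree.
          for row in response.xpath('//div[@class="listing"]'):
              yield {
                  "title": row.xpath('.//h2/text()').extract_first(),
                  "price": row.xpath('.//span[@class="price"]/text()').extract_first(),
              }

  # Registered in ITEM_PIPELINES; converts types and writes to the Django DB.
  class DjangoWriterPipeline(object):
      def process_item(self, item, spider):
          from myapp.models import Listing  # hypothetical Django model
          price = item["price"]
          Listing.objects.create(
              title=item["title"],
              price=float(price.lstrip("$")) if price else None,
          )
          return item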

I find Scrapy cleaner and more like a pipeline so it seems to produce less 'side effect kludge' than other scraping methods (if anybody has seen a complex Beautiful Soup + Mechanize scraper you know what I mean by 'side effect kludge'). It can also act as a server to return json.

Being asynchronous, you can do crazy req/s.

I will leave out how to do all this through Tor because I don't want the Tor network being abused but am happy to talk about it one on one if your interest is beyond spamming the web.

Through this + a couple of unmentioned tricks, it's possible to get insane data, so much so it crosses over into security research & could be used for pen-testing.


And this is why we can't have nice things.

Web scraping, as fun as it is (and btw, this title again abuses "Fun and Profit"), is not a practice we should encourage. Yes, it's the sort of dirty practice many people do, at one point or another, but it shouldn't be glorified.


So you're not so hot on the whole search engine thing?

The article does slide into the sketchy side (I've always wanted an excuse to do that client-side JavaScript trick too), but I found it more interesting because of that; these aren't secrets. Maybe if I put my "won't somebody please think of the children" hat on, I agree that glorifying the use of trojan code to potentially DDoS someone's server to get around rate limits desired by the owners of the server is bad. Adults, and especially adult self-described hackers, should be able to read this without mock outrage; it's interesting and it's happening all the time.

You can't condemn web scraping though, that's the backbone of the services we all depend on for most internet related things. That's the whole point of structured markup and the world wide web itself.


> So you're not so hot on the whole search engine thing?

They scrape to generate links for users to go to the site. That's quite different than scraping for...any other purpose? So it seems. Would you (anyone) argue otherwise? (genuine curiosity).


They are also using title, description, some snippets from the page and taking a cached version of the site and images you can view without having to visit the site itself. They are also using this data as a product to sell advertising against.

If there weren't so much benefit for most sites in being in search engine indexes, you would think at least some would object to this scraping.

There would be lots of other scraping that websites want to prevent that takes even less data than this. It just doesn't provide much in return for the website.


Google is even moving into the territory of scraping content to display. Relevant wikipedia snippets are now being displayed on the search page as a side bar. While Wiki probably doesn't care...there are plenty of other sites that would not like Google to scrape the content and display it on the search page.


Well, it probably sucks for Wikipedia because users aren't seeing the Jimmy Wales messages everywhere if they find the content through Google.


Yeah, Wikipedia is Creative Commons, so that should be okay? You are right, though; I wonder if they have the rights to the sports results and weather that they are pulling.

They have even convinced us all to go mark up our pages to help them pull stuff like ratings and reviews out.


Sports results are facts and are statutorily not subject to copyright in the US.


Wikipedia explicitly allows that kind of thing with CC-BY-SA licenses, and indeed gets substantial funding from companies like answers.com that do it. (Incidentally, answers.com was the only way to see TeX equations on Wikipedia on my Android phone last time I checked, so it's not like they're adding no value.)


From what I understand, Google uses crawled data as a learning set for their translation service. There is no "this phrase was learned from: www.nytimes.com" when I do a translation, so I guess Google is still guilty?


Does it have to be a search based interface to the indexed data?

Does finding a link to the scrapee have to be the primary purpose of the site (and therefore google would be constantly getting "worse" by this scale)?

So how prominent does the link back have to be for it to be ok?

What about the summarized data from there that search engines are adding these days, so you don't need to leave the google results page to get your answer but the data still comes from some site that you rarely notice the name of?

edit: as to your curiosity, I honestly do not see the line that you see. Unless it's that the link back to the source is required. I don't know that I agree with that but I would understand it, although that gets harder and harder the more you massage your dataset to be useful to users.


Hmm,

I'm not actually in favor of scraping. But I think it is a possibility that needs to be considered on both ends. If a site has valuable info and doesn't provide a decent API, it naturally is going to encourage scraping.

And isn't a search engine a kind of scraper?


Why? I would gladly encourage web scraping.


Totally agree, scraping is great.

I can see site operators being against the practice though, as it (usually):

- generates no ad revenue

- often enables someone else to use data that you struggled to put together, allowing others to profit with no gain for you

- hits edge cases that were never optimized for (as it does not follow real user access patterns)


Right, and this is why sites like Craigslist explicitly forbid scraping. If the site operators wanted, explicitly, to share their data with you, they would provide an API or give you permission to scrape.

The reality of scraping was really known many years ago. If you're doing it for above-board reasons like research etc., you'll probably get a pass - if you're doing it in order to profit from someone else's work because you are too lazy to do it yourself, it's probably unethical and you won't get a pass --- these concepts have been around for at least a thousand years or more.

Full Disclosure: I have also scraped data - but only from government websites where the scraped data is explicitly public domain to begin with and APIs were not available.


1. That doesn't address search engines, which are doing it to profit from someone else's work. If you open the door for search engines then how many search engine like things do you give passes to?

2. What if I'm scraping it just for me, because I want a different interface? How many friends can I share that with? Can I open source the program?

3. What if I read a bunch of these sites to do research and write up a story on something about it? Not plagiarizing, just summarizing and providing analysis on craigslist rental prices? What if I do this every day? What if I automate that process? The data is transformed just as much as if I had read it myself and crunched the numbers myself, I made just as many requests to the site as my browser would have.

Concepts that have been around a thousand years or more are not fully applicable. Like the printing press, some things alter the scarcity equation for ideas and data distribution and ownership. Considering how little we've agreed on about print after 500 years I have some doubts that this is as closed an issue as you say.


Search engines:

- Respect robots.txt (as mentioned elsewhere) which will often provide a limited subset of all data available

- Give something in return (potential traffic) for the data they reap.

I fully agree that scraping is great, and do it myself frequently. Site operators do have legitimate concerns in some situations though, and it probably comes from feeling as if they are being 'ripped off' somehow.

No one in their right mind is going to object to incidental scraping for personal use.

However, scraping is often scripted into cron or the like and that data is then used to profit someone else. I'm usually cool with that, but if someone is running a web site and they are dependent upon ad revenue to keep the servers running, I understand objecting to it.


Good rules of thumb.

> No one in their right mind is going to object to incidental scraping for personal use.

It would almost certainly involve stripping ads when re-purposing the content.


Good points, but...

1. I do think it addresses search engines, because site operators do explicitly give search engines permission to scrape their sites via something called "robots.txt" files, otherwise known as the "robots exclusion standard".

2. Like all other scenarios, this one is also likely between you and the site operator. Are you breaking the site's TOU? The answer to that question might help. If you are asking me for the answer to a moral dilemma, I might suggest that you try Shakespeare for some relevant insight to your question(s).

3. See (2). I believe you are incorrect in your last sentence, and in a number of ways, but feel free to disagree.


On #1, you mentioned before that Craigslist disallows scraping, yet unless you are OmniExplorer, it seems scraping is mostly fair game if its robots.txt [1] means anything. The robots.txt standard mentions nothing about it being for search engines [2], so there are no special exclusions for search engines specifically.

Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated [3].

From there, the waters just become really murky. Is lynx a scraper because it doesn't render the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards? Maybe it is the storage of scraped data that is the issue? What about caches? I could go on, but I'm sure you see what I'm getting at. It's a very complex issue that is not at all understood.

[1] http://www.craigslist.org/robots.txt

[2] http://www.robotstxt.org/robotstxt.html

[3] http://support.google.com/webmasters/bin/answer.py?hl=en&...


Good point. I meant to include a mention of "robots.txt" but I forgot or deleted it while editing. That's the motivation for number 3. A "robots.txt is the law" philosophy makes some sense to me, but number 3 is an example of a time when I think it falls down. I don't see the distinction between scripting my daily bookmark visits and doing them manually as a meaningful one. What about extensive browser plugins?

This isn't settled legally certainly and it certainly doesn't seem like this is settled ethically either considering the various insane statements that occur when politicians comment on the subject.

Some examples of the specific concepts from a thousand years ago that apply and answer these questions would help me see what you see. I know the basic rules for music sampling and referencing other works when writing and where the line for plagiarism is drawn and the rights for using photography. Don't know the rules for accessing network resources that are open or for using their data.


If you don't want your data used by others, don't send it to them.

You explicitly give them permission to have it by going out of your way to install a program on a common port, with a common API, giving it a directory full of documents to distribute, and not using any form of authentication. The way the web works is that answering a request is tantamount to granting permission to ask, and sending a file is tantamount to granting permission to have it. When you receive a file you don't first receive a permissions document, you receive the file - authentication and contractual obligations come first, because there is no later. (This is like the tide: you may not like it, but that doesn't mean you can change it, especially not with laws.)

You have many ways to check authentication and legally they can be VERY weak, 1-bit passwords are sufficient, but if you don't restrict access it is open - not just because it's the default, but because it's the technical reality: they didn't hack into your computer to get that file, they asked your document server and it gave it to them!

Robots.txt is a suggestion, for the scraper's benefit! It suggests better links. You're allowed to see the rest (the server sends them to you without a password) but you're unlikely to find good content.

If you're afraid of someone examining data you send them, don't send them the data if they ask. Expecting them not to ask, or, once they've received it, not to manipulate it in certain ways because you can't then extract a fee for their doing so, is controlling and, moreover, doomed to fail.


There is tons of data on govt websites in India and the only way to get to it is by scraping the websites. Knowing that you can scrape it and be on your way feels very liberating at times. Examples: rates for Indian postal department services; minutes of parliament houses (great for building machine translation systems - a lot of research in MT has benefited from the availability of parallel corpora consisting of parliament proceedings in 2 or more languages, e.g. the Hansard corpus from Canada and the European Parliament corpus; no such luck in India).


There are some recent federal cases (Weev http://www.wired.com/opinion/2012/11/att-ipad-hacker-when-em..., Aaron Swartz http://www.wired.com/threatlevel/2012/09/aaron-swartz-felony..., and a prosecution of scalpers http://www.wired.com/threatlevel/2010/07/ticketmaster/) that view scraping as a felony hacking offense. The feds think that an attempt to evade CAPTCHAs, IP and MAC blocks is a felony worthy of years in prison.

In fact, the feds might think that clearing your cookies or switching browsers to get another 10 free articles from the NYTimes is also felony hacking.

Which is to say, be careful what you admit to in this forum AND how you characterize what you are doing in your private conversations and e-mails.

Weev now faces a decade or more in prison because he drummed up publicity by sending emails to journalists that used the verb "stole".


Very good point, I've added the following disclaimer:

  While scraping can sometimes be used as a legitimate way to
  access all kinds of data on the internet, it’s also important
  to consider the legal implications. As was pointed out in the
  comments on HN[1], there are many cases where scraping data 
  may be considered illegal, or open you to the possibility of
  being sued. Similar to using a firearm, some uses of web
  scraping techniques can be used for utility or sport, while
  others can land you in jail. I am not a lawyer, but you
  should be smart about how you use it.
[1]: Linking to this (parent) comment


Thanks for adding that and linking. The feds are nutter butter these days.


From the article:

   Since the third party service conducted rate-limiting based on IP
   address (stated in their docs), my solution was to put the code that
   hit their service into some client-side Javascript, and then send
   the results back to my server from each of the clients.

   This way, the requests would appear to come from thousands of
   different places, since each client would presumably have their own
   unique IP address, and none of them would individually be going over
   the rate limit.
Pretty sure the browser Same Origin Policy forbids this. Think about it - if this worked, you'd be able to scrape inside corporate firewalls simply by having users visit your website from behind the firewall.


> Since the third party service conducted rate-limiting based on IP

By the way, that's one of my projects. You can use a basic fibonacci-related algorithm to figure out (in the most minimal number of requests) what exactly the rate limit is. This way, you can scrape at just under the maximum limit. I am still working on this core library though. :|
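The general idea can be sketched as growing the per-window request count until you get blocked and then narrowing down - here with doubling plus a binary search as a stand-in for the Fibonacci-style growth (probe(n) is a hypothetical callable that issues n requests in one window and reports whether none were rejected):

  # Hedged sketch: find a rate limit by growing the per-window request count
  # until blocked, then binary-searching between the last good and first bad values.
  def find_rate_limit(probe, start=1):
      good, bad = 0, None
      n = start
      while bad is None:
          if probe(n):
              good, n = n, n * 2   # still fine, grow the probe size
          else:
              bad = n              # first count that got throttled
      while bad - good > 1:        # narrow down to the exact threshold
          mid = (good + bad) // 2
          if probe(mid):
              good = mid
          else:
              bad = mid
      return good                  # largest request count that went through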


Sounds pretty interesting! Be sure to share it when it's ready.


That's a great point; for most web services, this request would be blocked at the browser level by the Same Origin Policy. Fortunately for me, this site allowed client-side calls by returning an Access-Control-Allow-Origin: * header [1], specifically designed to allow this type of cross-domain access.

[1]: http://en.wikipedia.org/wiki/Same_origin_policy#Cross-Origin...
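A quick way to check whether a given endpoint opts into that (the URL is a placeholder):

  # Hedged sketch: see whether a service sends the CORS header that permits
  # cross-origin, client-side calls.
  import requests

  resp = requests.get("http://api.example.com/v1/lookup?q=test")
  print(resp.headers.get("Access-Control-Allow-Origin"))  # "*" means any origin may call it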


The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.

If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.


This should be stressed - sites like Facebook do exactly this. Constant changes mean constantly updating your scraper. When it comes to A/B testing? Your scraper needs to intelligently find the data, which might not always be in the same place.

Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.


> I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping.

In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.

My first attempt used a short list of heuristics that proved to be eerily successful for what I was after, to the point that I could throw random websites with similar content (discussion sites, like HN) but vastly dissimilar structures at it, and it would return what I expected about, I'd say, 70% of the time in my tests.

After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel with some tweaking it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.

With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.


I would be really interested in knowing which heuristics or machine learning techniques produced decent results. That's if I can't convince you to open source the code. I'm working on the same problem at the moment.


What about something like http://tubes.io ?


We're fine with scrapers and scraping infrastructure, although tubes.io is a very interesting idea.

I'm more interested in what I can do to write fewer scrapers since the content is, at a high level, relatively similar. I've just started with experiments writing "generic" scrapers that try and extract the data without depending on markup. It's going to eventually work well enough but to get the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.

There's a few papers on this, but not much out there. That's why I was interested in someone else working on the same problem in a different space.


> Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.

These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.

Other sites I have come across will use large images and CSS sprites to mask price data.

I write a lot of scrapers for fun, rarely profit, just for the buzz


I bet you would only need to randomly shuffle between a few alternatives for all of them. You'd need a dedicated effort to work that one out and the cache implications could be managed. No getting around the trade-off of possible page alternatives vs cache nightmare-ness though, and doing that to json apis would get ugly fast.

At least it's easier to code these tricks than to patch a scraper to get around them.


Yes, Facebook used to do that. I had to scrape it once and was surprised by the randomly changing classes around input fields.

but who cares, no one can beat Xpath :)


>The issue with web scraping is that it relies on the scraper to keep up with changes made to the site.

The OP addresses that point. His contention is, there's a lot more pressure on the typical enterprise to keep their public-facing website in tip-top shape than there is to make sure whatever API they've defined is continuing to deliver results properly.

Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.


> Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.

I once had to maintain a (legal) scraper and I can tell you there is no fun in making your scraper robust when the website maintainers are doing their best to keep you from scraping their site. I've seen random class names and identifiers, switching of DIVs and SPANs (block display), adding and removing SPANs for nesting/un-nesting elements, and so on. Of course the site likes to keep the SEO, but most of the time it's easy to keep parts out of context for a scraper.


In most cases, the site doesn't have an API... so we scrape and take the risk that the structure will change. One thing that helps is using tools which give you jQuery-like selectors, because they give a lot of freedom and are very easy to write/update.


I agree, CSS selectors in BeautifulSoup and pyquery make it less messy.


This is indeed painful. I was scraping the Pirate Bay last year for a school project, their HTML would occasionally change in subtle ways that would break my scraper for hours or days until I noticed it.


Yeah, the author of the post seemed to imply that web APIs are more likely to change than a website. At least, that's how I took it. Blew my mind.


Great read!

In the past, I have successfully used HtmlUnit to fulfill my admittedly limited scraping needs.

It runs headless, but it has a virtual head designed to pretend it's a user visiting a web application to be tested for QA purposes. You just program it to go through the motions of a human visiting a site to be tested (or scraped). E.g., click here, get some response. For each whatever in the response, click and aggregate the results in your output (to whatever granularity).

Alas, it's in Java. But, if you use JRuby, you can avoid most of the nastiness that implies. (You do need to know Java, but at least you don't have to write Java.)

Hartley, what is your recommended toolkit?

I note you mentioned the problem of dynamically generated content. You develop your plan of attack using the browser plus Chrome Inspector or Firebug. So far, so good. But what if you want to be headless? Then you need something that will generate a DOM as if presenting a real user interface but instead simply returns a reference to the DOM tree that you are free to scan and react to.


Headless: Xvfb on Linux. (Virtual framebuffer; lets you run apps that require a GUI.) You can use one of the many options that include WebKit (like PhantomJS, the capybara-webkit gem, or Selenium if you want a real browser like Firefox to do the work).


PhantomJS doesn't need Xvfb anymore; it can run headless without this dependency.


Well, it wasn't due to anything in PhantomJS actually -- it was because Qt introduced Project Lighthouse and their Qt Platform Abstraction. Project Lighthouse was a fork that got integrated into Qt 4.8, which PhantomJS includes. (You can see the entire Qt source tree if you git-clone PhantomJS.)


I love HTML scraping. But JavaScript???... The juiciest data sets these days are increasingly in JS. For the life of me, I can't get around scraping JS :(

I do know that Selenium can be used for this...but am yet to see a decent example for the same. Does anyone have any good resources/examples on JS scraping that they could share?? I would be eternally grateful.


Phantom.js and casper.js

If you can't get the data from the endpoints the JavaScript hits, then write your scraper in JavaScript and have it run in a headless browser; it's the WebKit engine, so most sites test their site against it heavily.

Either pull the data out of the javascript objects or trigger your extraction from the html by attaching to the events in the javascript.


I've started playing with zombie.js recently as well - much lighter and faster than the ones that instrument a completely full browser. But has a full Javascript engine.


zombie.js is not a full browser. It's a poor emulation using jsdom as its backing. http://zombie.labnotes.org/guts Beware, for some applications, jsdom is super buggy.


That's really interesting, thanks.

I worry that it's not going to replicate a real browser accurately enough, but I'm excited to try it out a bit.


Your worry is correct. http://news.ycombinator.com/item?id=4896054 I've tried scraping with it, and it failed miserably on some sites.


Yeah, it's not mature enough yet.

We're also trying it for integration tests, as it is much quicker than Phantom or Selenium. Even there, where we control the standards-compliant site, it isn't quite good enough yet.

Would love to see more people helping make it so, though!


Upvote for CasperJS - it's definitely the best system I've come across for scraping JavaScript/AJAX content.


That is actually very simple, and you can even use a headless browser to execute the JavaScript:

First install Xvfb and pyvirtualdisplay, then try this snippet: https://gist.github.com/4243582

Selenium is great; it can even wait for AJAX requests to finish (see WebDriverWait).
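Not the linked gist, but the general shape of it looks roughly like this (the URL and element id are placeholders):

  # Hedged sketch of the Xvfb + Selenium combination described above.
  from pyvirtualdisplay import Display
  from selenium import webdriver
  from selenium.webdriver.support.ui import WebDriverWait

  display = Display(visible=0, size=(1024, 768))  # virtual X server via Xvfb
  display.start()
  try:
      driver = webdriver.Firefox()
      driver.get("http://example.com/ajax-heavy-page")
      # Wait until the JavaScript has injected the element we care about.
      WebDriverWait(driver, 10).until(lambda d: d.find_element_by_id("results"))
      print(driver.find_element_by_id("results").text)
      driver.quit()
  finally:
      display.stop()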


> first install Xvfb and pyvirtualdisplay

You really don't need xvfb anymore. Use xserver-xorg-video-dummy.


Yippie...that worked. Thanks a ton for sharing that code snippet. Am unfortunately more of an analyst. Have only just started picking up coding for data analysis. JS scraping was one area that I always had difficulty with. Not anymore :D


See my other comment :-)

I've been doing this on a site that is 100% Javascript-driven for over a year, very successfully.

It's really no different than hitting a static site with Selenium. Figuring out the proper XPath to use is often the biggest challenge: Chrome Developer tools help immensely. Also, you need to watch for delays in JS rendering, so put a lot of pauses in your scripts.

It's of course slow, so if you want to distribute it across several machines, use Selenium Grid or a queue system (SQS, Resque, etc). Set up Xvfb to run on headless Linux instances.


In general, any Web functional testing tool can be used as a scraper. Scraping and testing are extremely similar. In both cases, one uses XPath or (hopefully) CSS to locate an element and examine certain aspects of that element's state. A scraper is only different from a functional test in that a scraper is focused only on the state of nodes (potentially) containing human-readable content. That, and a scraper saves the data it collects rather than discarding all data at the end of a test run.

Here's a very old Selenium 1.0 example that scrapes the full, rendered HTML of a page. After performing a scrape like this, I would then feed the HTML into a parser such as Nokogiri http://snipplr.com/view/7906/rendered-wget-with-selenium/


Here's a little experiment with Reddit-automation using Selenium: https://github.com/jsz/reddit_voting


awesome...thanks :)


If you are using Python, you can also use pyv8 to evaluate Javascript code.


There is also Ghost.py[0]

[0]http://jeanphix.me/Ghost.py/

If you are planning to use phantomjs, import sh and it's commandline all the way to payday :D


Yes, but if you want the DOM you would have to use something like webkit. So something like pyphantomjs might hit the right spot. It's a python re-implementation of phantomjs.

https://github.com/kanzure/pyphantomjs


indeed i do use Python. thanks for sharing...it appears most interesting. Have started playing with it...


Another issue not covered: file downloads. Let's say you have a process that creates a dynamic image, or logs in and downloads dynamic PDFs. Even Selenium can't handle this (the download dialog is an OS-level feature). At one point I was able to get Chrome to auto-download in Selenium, but had zero control over filename and where it was saving. I ended up using iMacros (the pay version) to drive this (using Windows instances: their Linux version is very immature comparably).


I've done this successfully with Ruby Mechanize.


Awesome. I'd love some hints or links, as I'm always looking to refactor.


In general, if you're going the mechanize route, .retrieve() is the function you're looking for.

e.g.

  br = mechanize.Browser()
  br.retrieve("https://www.google.com/images/srpr/logo3w.png", "google_logo.png")[0]
Mechanize doesn't really have proper docs, but just about everything you'd need can be figured out from the very lengthy examples page on their site.


Playing with it now, and while it seems to hit my download need, I can't seem to get it to play nice with sites that are JavaScript dependent. Am I missing something, or is there a way to plugin an underlying WebKit engine?


PhantomJS is capable of downloading binary content from JS-dependent sites, but it is a journey to get it working as it is not an out-of-the-box feature. Instead, use CasperJS to drive PhantomJS and get a ton of snazzy features, including simple binary downloads. Happy scraping!


I'm surprised that no one has attempted to write a Twitter client based solely on scraping to get around the token limits.


Perhaps they were afraid of legal issues?

Scraping for fun is okay, but if you would like to build a business, most people would prefer to abide by the terms of use and still use the API. Remember the story of PadMapper?


Or an alternative API that uses the scraped data from Twitter to make requests... but that might be getting a bit ambitious (and legally dodgy)


Create a script that scrapes proxies and then use those to scrape Twitter, host it on a Russian domain, claim 140 characters can't be copyrighted, claim that the tweets are being extracted from third-party sites that use the Twitter API but lack any kind of TOS and disclaimer; sell API access, profit!


I've written some in-browser JS to download all my tweets without needing to resort to the API and oAuth nightmare. It is indeed possible to write a client, but not recommended at all...


how did you send cross domain request?


If they really needed that much data, they probably just requested access to the full firehose.


Scraping could be made a lot harder by website publishers, but they all depend on the biggest scraper accessing their content so it can bring traffic: Google ...

The biggest downside of scraping is that it often takes a long time for very little content (e.g. scraping online stores with extremely bloated HTML and 10-25 products per page).


As a pioneer of scraping (NetProphet, the first interactive stock charting app with push-data) we initially scraped every quote we had in our database from other sites.

The fundamental problem is, web pages can change a lot. We constantly had scraper scripts fail either because the web pages changed for some innocuous reason, or they noticed the scraping and blocked us.

We resorted to a list of scrape targets and constantly-updating scrape-scripts to adapt continuously to the 'market'. We also pinged each target to find the least congested.

Eventually we got our own stock feed (guy that did that is a research scientist at Adobe now) and stopped scraping altogether. But it was a wild ride.


We still need to scrape many (several hundred) clients' websites because they are unable to give us product feeds (adequate ones, or any at all) for their stores. But hey, it gives us a small edge because we try harder than the competition.


An important topic.

The main caveat is that this may violate a site's terms of use and thus website owners may feel called upon to sue you. Depending on circumstances, the legal situation here can be a long story.


Yes, it is complicated. That said, this is partly just because there aren't enough cases - and partly because the law hasn't stabilised (took a century to stabilise after invention of printing press). It isn't clear what rights society should grant yet, for maximising business.

My take on it, from ScraperWiki's point of view: http://blog.scraperwiki.com/2012/04/02/is-scraping-legal/


Related: If you fancy writing scrapers for fun and profit, ScraperWiki (a Liverpool, UK-based data startup) is currently hiring full-time data scientists. Check us out!

http://scraperwiki.com/jobs/#swjob5


very well played :)


The title makes it sound as if there is going to be some discussion of how the OP has made web scraping profitable, but this seems to have been left to the reader's imagination.

Otherwise, great article! I agree that BeautifulSoup is a great tool for this.


It's pointless to think of it as "wrong" for third-parties to web-scrape. Entities will do as they must to survive. The onus of mitigating web scraping, if in the interests of the publisher, is on the publisher.

As a startup developer, third-party scraping is something I need to be aware of, that I need to defend against if doing so suits my interests. A little bit of research shows that this is not impractical. Dynamic IP restrictions (or slowbanning), rudimentary data watermarking, caching of anonymous request output all mitigate this. Spot-checking popular content by running it through Google Search requires all of five minutes per week. At that point, the specific situation can be addressed holistically (a simple attribution license might make everyone happy). With enough research, one might consider hellbanning the offender (serving bogus content to requests satisfying some certain heuristic) as a deterrent. A legal pursuit with its cost would likely be a last resort.

Accept the possibility of being scraped and prepare accordingly.


People seem to wonder how to handle ajax.

The answer is HttpFox. It records all http-requests.

1. Start recording

2. Do some action that causes data to be fetched

3. Stop recording.

You will find the url, the returned data, and a nice table of get and post-variables.

https://addons.mozilla.org/en-us/firefox/addon/httpfox/


> The answer is HttpFox. It records all http-requests.

http://mitmproxy.org/


Firebug is a lot better.


Isn't this the same as what the Net tab from Firebug does?


Yah, I don't understand why people make things so complicated once Javascript gets involved. Just inspect the XHR traffic to your browser ("Network" tab in Web Inspector, Firebug, etc) as you update the information on the page. You'll quickly discover what are essentially undocumented APIs returning the data used to generate the page. You don't need to use or even read through the Javascript that's calling them, you just need to figure out what parameters and cookies are being sent, and tweak those as you wish.

You might have to spoof the Referer header so that it thinks the request is still coming from their website.
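In code, that usually ends up looking something like this (the URL, parameters, and headers are placeholders for whatever the Network tab shows):

  # Hedged sketch: call the undocumented endpoint an AJAX-driven page uses,
  # replaying the parameters, cookies, and Referer seen in the browser's Network tab.
  import requests

  session = requests.Session()
  session.get("http://example.com/search")  # pick up any session cookies first

  resp = session.get(
      "http://example.com/ajax/search",
      params={"q": "widgets", "page": 1},
      headers={
          "Referer": "http://example.com/search",
          "X-Requested-With": "XMLHttpRequest",  # many endpoints check for this
      },
  )
  data = resp.json()  # typically already structured, no HTML parsing needed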


From a site owner's perspective: if you have a LOT of data then scraping can be very disruptive. I've had someone scraping my site for literally months, using hundreds of different open proxies, plus multiple faked user-agents, in order to defeat scraping detection. At one point they were accessing my site over 300,000 times per day (3.5/sec), which exceeded the level of the next busiest (and welcome) agent... Googlebot. In total I estimate this person has made more than 30 million fetch attempts over the past few months. I eventually figured out a unique signature for their bot and blocked 95%+ of their attempts, but they still kept trying. I managed to find a contact for their network administrator and the constant door-knocking finally stopped today.


When I need to scrape a webpage, I use phpQuery (http://code.google.com/p/phpquery/). It's dead simple if you have experience with jQuery, and I get all the benefits of a server-side programming language.


For Perl you can use Web::Query (but don't use the default HTML::TreeBuilder::XPath with it, it's extremely slow - use http://search.cpan.org/dist/HTML-TreeBuilder-LibXML/), or Mojo::DOM (part of Mojolicious: http://mojolicio.us/). Both use the nowadays-standard CSS selectors for comfortable handling.

JavaScript can be scraped using WWW::Mechanize::Plugin::JavaScript (or even WWW::Mechanize::Firefox).


A similar module for node.js: https://github.com/mape/node-scraper


Better than that is http://node.io/ Also, don't use jsdom (it is slow and strict), https://github.com/MatthewMueller/cheerio is much better.


What I wish I could do is capture Flash audio (or any audio) streams with my Mac. All I want is to listen to the audio-only content with an audio player when I'm out driving or jogging, etc. Audio-only content that has to be played off a web page usually runs into the contradiction that if I'm in a position to click buttons on my web browser (not driving, for example), I'm in a position to do my REAL work and have no time to listen to the audio. I'll go to the web page, see whatever ads they may have, but then I'd like to be able to "scrape" the audio stream into a file so I don't have to sit there staring at a static web page the whole time I'm listening.


I used to work at a company where capturing flash video and audio streams was a regular part of our work. You're not going to like the answer.

You basically have to proxy everything through a proxy that can be given a command or otherwise instructed to capture the top 3 or 4 streams from the website. From there you can either dumbly accept the largest one or start checking byte headers.


My company has an API to do that but it works on Windows: http://www.nektra.com/products/deviare-api-hook-windows/


Wireshark & RTMPDump may get you most of the way there.


When scraping HTML where data gets populated with JS/AJAX, you can use a web inspector to see where that data is coming from and manually GET it; it will likely be some nice JSON.

Scraping used to be the way to get data back in the day, but websites also didn't change their layout/structure on a weekly basis back then, and were much more static when it came to structure.

Having recently written a small app that was forced to scrape HTML and having to update it every month to make it keep working, I can't imagine doing this for a larger project and maintaining it.


To all of HN: all this being said, how do we prevent our sites from being scraped in this way? What can you not get around, and what, to your mind, are the potential uses for an 'unscrapeable' site?


If you don't want something to be scraped, don't publish it on the internet. Scraping prevention reminds me of blocking right-click and other ridiculous solutions back in the day. Hey, if I can view it, it means the data reached my endpoint.


I think the author just completely missed the point with API vs. screen scraping. An API allows for accessing structured data: even if the website changes, the data would still be accessible the same way through the API, whereas the author would have to rewrite his code each time an update is made to the front-end code of the website.

A simple API providing a simple JSON response with HTTP basic auth is far more efficient than a screen scraping program where you have to parse the response using HTML/XML parsers.


This isn't always the case - APIs often change. Facebook, for example, is (or at least was, a few years ago) notoriously bad at changing in an unpredictable and buggy way, and I stopped using it for that reason. Some HTML scrapers are more reliable than that.

As for efficiency, again not such an issue. HTML is very good these days, compared to 10 years ago, a simple CSS selector often does the job.


This is true, but APIs are often versioned.

Concerning efficiency, this is true; CSS/XPath processors, at least, both offer very nice performance.

But downloading 70KB of HTML each time you only need a single piece of data, where the API request costs only a little (avg < 2KB), can be such a pain if you need to do this frequently. This can be handled with a scalable configuration, but I find it a bit overkill.


This illustrates the significant difference between the use-cases of "web APIs" and conventional APIs, that the former are more like a database CRUD (including REST), rather than a request for computation. They (usually) are an alternative interface to a website (a GUI), and that's how most websites are used. e.g. an API for HN would allow story/comment retrieval, voting, submission, commenting.

They could be used for computation, but (mostly) aren't.


Not every site. There is data I would really love to access on Facebook without having to gain specific authorization from the user. It's odd that for most user profiles the most you can extract via the Graph API (with no access token) is their name and sex, whereas I can visit their profile page in the browser and see all sorts of info and latest updates (and not even be friends with them).

Tried scraping Facebook. They have IP blocks and the like.


Do it in JS, client-side with 3 second delays. I have used this to get available data (location, name, status, etc) from the latest 500 available likes of a page I manage


This is a shameless plug but I've created a service that aims to help with a lot of the issues that OP describes such as rate limiting, JS and scaling. It's a bit like Heroku for web scraping and automation. It's still in beta but if anyone is interested then check out http://tubes.io.


I have done a bit of scraping with Ruby Mechanize; when we hit limits, we have circumvented them via proxies and Tor.

Google as a search engine crawls almost all sites, but offers very little usable stuff to other bots.

http://www.google.com/robots.txt

Disallow 247 Allow 41


Be careful. I got banned from Google for scraping. I did a few hundred thousand searches one day, and that night, they banned my office IP address for a week. This was in 2001, so I estimate I cost them a few hundred dollars, which is now impossible to repay. :(


The problem with scraping instead of using the API is that when the website makes even a slight change to their markup, it breaks your code. I have had that experience and it's a living hell. I can say it's not worth it to scrape when there is an API available.


There is just one major problem with not needing a stinking API: you cannot POST as a prospective client without requiring them to give their password to you, which would actually give you full access to their account instead of the limited access you get with an API.


You seem to be talking about a specific site? Which one?


Any social network where you can post messages for the user in the user's message stream.


I had to do some scraping of a rather JavaScript-heavy site last year - I found the entire process was made almost trivial using Ruby and Nokogiri. Particularly relevant for a non-uber-programmer like me: it's simple to use, as well as powerful.


So bloody true. A web page is a resource just like an XML doc; there's no reason public-facing URLs and web content can't be treated as such, and I regularly take advantage of that fact as well. Great post.


If it's not automated and only run a few times, I prefer iMacros to perform tasks on my behalf. The best part of it is that you can integrate a DB to record your desired data.


Automated web testing tools, such as Watir and Selenium, are also pretty good options. I'm especially surprised Watir hasn't been mentioned yet in the comments.


Indeed - or WatiN, the .NET port of Watir. I've done some pretty heavy-duty scraping and automation with WatiN, which included building an OO framework that trivialized writing scripts. Good stuff.


Checkout http://selectorgadget.com as a useful tool for coming up with CSS selectors.


How about publicly available web scraping tools as a way to encourage sites to provide good APIs? Everybody wants efficiency, after all.


No Rate-Limiting

Clearly someone's never spent time diagnosing the fun that is scraping HN (yes, an unofficial API is available).


Node.js is excellent for web scraping, especially if you're scraping large amounts very often.


I made this module for this exact reason: https://github.com/icodeforlove/node-requester. Supports horrible things like proxy rotation.


> Supports horrible things like proxy rotation.

Do you have any plans to track which proxies are actually working, or how quickly each one is blocked? I want a reverse proxy on my outgoing requests that knows how to shift my traffic around properly so that I don't get banned. I don't want to be rate limited and I don't want to sit here for weeks trying to figure out wtf the rate limit is in the first place.


Interesting; I'm sure you could build this into it. You could hook onto the didRequestFail method and flag IPs (log this.proxy to see what the proxy was). All I would need to do is add a method that makes it easier to add/remove proxies.


This looks great. I think I will incorporate this into some of my projects.


What is it with all the headlines this week abusing the classic "for fun and profit" title?



I've found diffbot to be quite useful for scraping.


I do not agree with that article at all; it makes me sick. And this guy basically is some "marketer", so no wonder he gets quite a bit of stuff wrong, imo. :p


What did he get wrong?


Craigslist, anyone?


I actually wrote a CL crawler in Ruby - https://github.com/marcamillion/craigslist-ruby-crawler

I used it to crawl for freelance web dev gigs, but it can be re-purposed to do anything.





