However, there were a few very big problems, which make me feel that scraping is not the way to go about things. I certainly wouldn't build a service based around scraping a particular site's data.
When I had my twitter bot operational, I would get blocked from twitter for hours at a time. It seems anytime I hit their servers too hard, or crossed some threshold, I would be locked out. I'm assuming it was some kind of IP level ban, because I wasn't even able to access the site from an actual browser.
I was able to deal with the setback by setting up a script to repeatedly check for access to the site and relaunch the scraper once access was restored, but that solution was just a band-aid. It would translate to significant downtime if I were running a service that counted on access to their data. The ban-hammer is too easily laid down.
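As a rough sketch of that kind of band-aid (not the commenter's actual script; the check URL and launch command are placeholders), a watchdog loop might look like:

    import subprocess
    import time

    import requests

    CHECK_URL = "https://twitter.com/"       # placeholder: any page the ban affects
    SCRAPER_CMD = ["python", "scraper.py"]   # placeholder: however you launch the scraper

    def site_reachable():
        """Return True if the site answers normally (i.e. the ban has lifted)."""
        try:
            return requests.get(CHECK_URL, timeout=10).status_code == 200
        except requests.RequestException:
            return False

    while True:
        if site_reachable():
            # Run the scraper until it dies (e.g. gets banned again), then loop.
            subprocess.call(SCRAPER_CMD)
        # Banned or unreachable: wait a while before probing again.
        time.sleep(300)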
Finally, just as a word of caution, I'd warn prospective scrapers to be careful of just who you scrape. I've inadvertently "DDoS'd" a site when a multiprocessed script got away from me. It spawned 1000+ instances of this particular request, all of which were doing their best to beat the bejesus out of this small website's servers. The site ended up going down for a couple of hours; I assume because of a bandwidth cap or something.
So, my point being, scraping is cool, but (1) I'm unsure if I agree with relying on it over a proper API, and (2) with great power comes great responsibility! Be nice to the smaller guys, and don't punish their servers too badly.
And XAuth made account creation and login a breeze, as there was no need for OAuth tokens - username/password was enough.
But you're not always that lucky, and many websites are heavily JS-driven. For Reddit I had to resort to Selenium.
Oftentimes programmers, and the managers that drive them, are way too quick to get going building or solving something with brute force. If they would just be patient and stop for a moment: spending even a mere 30 minutes extra doing your homework on a problem can save hours or days of dev time.
But kybernetyk already said he did the research before and that Reddit's API is not good enough for his requirement.
So this is not a case where 15 minutes of research would have saved the time. And his comment implied that kybernetyk hadn't done the research, i.e. that he was dumb for not searching first.
I did not assume that kybernetyk was dumb or anything; I simply chuckled and thought to myself, ouch, haven't I made a similar mistake before?! Please don't assume the worst when reading someone's comment.
* given you have the rights to scrape.
Not that there are no tools to debug the site; but with websites like the ones mentioned, plus YouTube and a bunch of others, I found myself just not fiddling too much with the JS.
If you're using regex to solve this sort of problem, your code deserves to break, I'm sorry.
Regex for HTML is a bad idea ... http://stackoverflow.com/questions/590747/using-regular-expr...
You enter the URL you want to capture data from, it gets loaded in an iframe, you click on the text you need, and you set a schedule for receiving updates and how (email/Twitter DM). That's it.
Depending on your use case, headless may be simpler, but it also has many drawbacks that don't show at first, the main one being that headless browsers are not simple to drive from remote processes as queue-consuming devices.
The article suggests BeautifulSoup as a parsing library for Python. If I'm not mistaken, BeautifulSoup is not actively maintained anymore, and other cleaner and faster solutions exist, like lxml.html. Ian Bicking wrote a good article on that topic: http://blog.ianbicking.org/2008/03/30/python-html-parser-per...
I hear it recommended the most among Pythonistas, and it's plenty clean and fast for my use. But if you're skeptical, I'd still look for a more up to date benchmark (or run your own) rather than rely on results from >4 years ago.
Still, lxml being basically a binding to libxml2, the performance comparison of the two libs should still hold. I heard it recommended too, in a Python talk about scraping, 1 or 2 (at most) years ago.
BeautifulSoup may still be better for parsing broken documents, though I never had problems with lxml while using it on a very large variety of sites.
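For comparison, a minimal parse with either library might look like this; the HTML snippet and the selectors are made up for illustration:

    import lxml.html
    from bs4 import BeautifulSoup  # BeautifulSoup 4

    html = "<div class='post'><a href='/item?id=1'>Hello</a></div>"  # stand-in document

    # lxml.html: fast, XPath-based
    doc = lxml.html.fromstring(html)
    links_lxml = [a.get("href") for a in doc.xpath("//div[@class='post']/a")]

    # BeautifulSoup: slower, but very forgiving of broken markup
    soup = BeautifulSoup(html, "html.parser")
    links_bs = [a["href"] for a in soup.select("div.post a")]

    print(links_lxml, links_bs)  # ['/item?id=1'] ['/item?id=1']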
EDIT: Knowledge isn't zero-sum. Here's an overview of a kick-ass way to spider/scrape:
I find Scrapy cleaner and more like a pipeline so it seems to produce less 'side effect kludge' than other scraping methods (if anybody has seen a complex Beautiful Soup + Mechanize scraper you know what I mean by 'side effect kludge'). It can also act as a server to return json.
Being asynchronous, you can do crazy req/s.
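A minimal Scrapy spider, to give a feel for the pipeline style; the site, selectors, and field names here are hypothetical:

    import scrapy

    class PostSpider(scrapy.Spider):
        name = "posts"
        start_urls = ["https://example.com/posts"]  # hypothetical starting page

        # Be polite: throttle and obey robots.txt
        custom_settings = {"DOWNLOAD_DELAY": 1.0, "ROBOTSTXT_OBEY": True}

        def parse(self, response):
            # Each matched block is yielded as a structured item
            for post in response.css("div.post"):
                yield {
                    "title": post.css("h2::text").get(),
                    "url": response.urljoin(post.css("a::attr(href)").get()),
                }
            # Follow pagination asynchronously
            next_page = response.css("a.next::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Run it with "scrapy runspider posts_spider.py -o posts.json" and each yielded dict ends up as a JSON record.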
I will leave out how to do all this through Tor because I don't want the Tor network being abused but am happy to talk about it one on one if your interest is beyond spamming the web.
Through this + a couple of unmentioned tricks, it's possible to get insane data, so much so it crosses over into security research & could be used for pen-testing.
Web scraping, as fun as it is (and btw, this title again abuses "Fun and Profit"), is not a practice we should encourage. Yes, it's the sort of dirty practice many people do, at one point or another, but it shouldn't be glorified.
You can't condemn web scraping though; it's the backbone of the services we all depend on for most internet-related things. That's the whole point of structured markup and the world wide web itself.
They scrape to generate links for users to go to the site. That's quite different than scraping for...any other purpose? So it seems. Would you (anyone) argue otherwise? (genuine curiosity).
If there wasn't so much benefit for most sites in being in search engine indexes, you would think at least some would object to this scraping.
There would be lots of other scraping that websites want to prevent that takes even less data than this. It just doesn't provide much in return for the website.
They have even convinced us all to go mark up our page to help them pull stuff like ratings and reviews out.
Does finding a link to the scrapee have to be the primary purpose of the site (and therefore google would be constantly getting "worse" by this scale)?
So how prominent does the link back have to be for it to be ok?
What about the summarized data from there that search engines are adding these days, so you don't need to leave the Google results page to get your answer, even though the data still comes from some site whose name you rarely notice?
edit: as to your curiosity, I honestly do not see the line that you see. Unless it's that the link back to the source is required. I don't know that I agree with that but I would understand it, although that gets harder and harder the more you massage your dataset to be useful to users.
I'm not actually in favor of scraping. But I think it is a possibility that needs to be considered on both ends. If a site has valuable info and doesn't provide a decent API, it naturally is going to encourage scraping.
And isn't a search engine a kind of scraper?
I can see site operators being against the practice though, as it (usually):
- generates no ad revenue
- often enables someone else to use data that you struggled to put together, allowing others to profit with no gain for you
- hits edge cases that were never optimized for (as it does not follow real user access)
The reality of scraping was really known many years ago. If you're doing it for above-board reasons, like research, you'll probably get a pass; if you're doing it in order to profit from someone else's work because you are too lazy to do it yourself, it's probably unethical and you won't get a pass. These concepts have been around for at least a thousand years or more.
Full Disclosure: I have also scraped data - but only from government websites where the scraped data is explicitly public domain to begin with and APIs were not available.
2. What if I'm scraping it just for me, because I want a different interface? How many friends can I share that with? Can I open source the program?
3. What if I read a bunch of these sites to do research and write up a story about it? Not plagiarizing, just summarizing and providing analysis of craigslist rental prices? What if I do this every day? What if I automate that process? The data is transformed just as much as if I had read it myself and crunched the numbers myself, and I made just as many requests to the site as my browser would have.
Concepts that have been around a thousand years or more are not fully applicable. Like the printing press, some things alter the scarcity equation for ideas and data distribution and ownership. Considering how little we've agreed on about print after 500 years I have some doubts that this is as closed an issue as you say.
- Respect robots.txt (as mentioned elsewhere), which will often provide only a limited subset of all data available (see the sketch after this list)
- Give something in return (potential traffic) for the data they reap.
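Checking robots.txt before fetching takes only a few lines with the Python standard library; this is a minimal sketch, with a made-up user agent and URLs:

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # hypothetical target site
    rp.read()

    # Only fetch the page if the site's robots.txt allows our user agent to
    if rp.can_fetch("MyScraper/1.0", "https://example.com/listings?page=2"):
        pass  # go ahead and request the page
    else:
        pass  # skip it, or fall back to whatever the site does allow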
I fully agree that scraping is great, and do it myself frequently. Site operators do have legitimate concerns in some situations though, and it probably comes from feeling as if they are being 'ripped off' somehow.
No one in their right mind is going to object to incidental scraping for personal use.
However, scraping is often scripted into cron or the like and that data is then used to profit someone else. I'm usually cool with that, but if someone is running a web site and they are dependent upon ad revenue to keep the servers running, I understand objecting to it.
> No one in their right mind is going to object to incidental scraping for personal use.
It would almost certainly involve stripping ads when re-purposing the content.
Additionally, robots.txt is really for automated link traversal, not scrapers in general. If your scraper is initiated by a user, there is no need to follow robots.txt. Not even Google does when the request is user-initiated.
From there, the waters just become really murky. Is lynx a scraper because it doesn't render the way most web browsers do? Does it get a pass because it still adheres to web standards? What if a real scraper adheres to web standards? Maybe it is the storage of scraped data that is the issue? What about caches? I could go on, but I'm sure you see what I'm getting at. It's a very complex issue that is not at all understood.
This certainly isn't settled legally, and it doesn't seem settled ethically either, considering the various insane statements that occur when politicians comment on the subject.
Some examples of the specific concepts from a thousand years ago that apply and answer these questions would help me see what you see. I know the basic rules for music sampling and referencing other works when writing and where the line for plagiarism is drawn and the rights for using photography. Don't know the rules for accessing network resources that are open or for using their data.
You explicitly give them permission to have it by going out of your way to install a program on a common port, with a common API, and giving it a directory full of documents to distribute, and not using any form of authentication. The way the web works is that answering is equivalent to granting permission to ask and sending a file is tantamount to granting permission. When you receive a file you don't first receive a permissions document, you receive the file - authentication and contractual obligations come first because there is no later. (This is like the tide, you may not like it but that doesn't mean you can change it, especially not with laws.)
You have many ways to check authentication and legally they can be VERY weak, 1-bit passwords are sufficient, but if you don't restrict access it is open - not just because it's the default, but because it's the technical reality: they didn't hack into your computer to get that file, they asked your document server and it gave it to them!
Robots.txt is a suggestion, for the scraper's benefit! It suggests better links. You're allowed to see the rest (the server sends them to you without a password) but you're unlikely to find good content.
If you're afraid of someone examining data you send them, don't send them the data if they ask. Expecting them to not ask, or once they've received it, to not manipulate it in certain ways because you can't then extract a fee for them doing so is controlling and more-over, doomed to fail.
In fact, the feds might think that clearing your cookies or switching browsers to get another 10 free articles from the NYTimes is also felony hacking.
Which is to say, be careful what you admit to in this forum AND how you characterize what you are doing in your private conversations and e-mails.
Weev now faces a decade or more in prison because he drummed up publicity by sending emails to journalists that used the verb "stole".
While scraping can sometimes be used as a legitimate way to access all kinds of data on the internet, it's also important to consider the legal implications. As was pointed out in the comments on HN, there are many cases where scraping data may be considered illegal, or open you to the possibility of being sued. Similar to a firearm, some web scraping techniques can serve utility or sport, while others can land you in jail. I am not a lawyer, but you should be smart about how you use it.
Since the third-party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that made the requests on the clients, and have each client send the results back to my server.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.
By the way, that's one of my projects. You can use a basic fibonacci-related algorithm to figure out (in the most minimal number of requests) what exactly the rate limit is. This way, you can scrape at just under the maximum limit. I am still working on this core library though. :|
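The comment doesn't spell out the algorithm, but a rough sketch of the idea (grow the probing rate Fibonacci-style until you get rejected, then settle at the last rate that worked) might look like this; the URL and the assumption that a 429 status signals the limit are mine, and a real version would refine the answer by bisecting between the last two probes:

    import time

    import requests

    URL = "https://api.example.com/resource"  # hypothetical rate-limited endpoint

    def allowed(requests_per_minute):
        """Fire requests at the given rate for one minute; False if throttled."""
        delay = 60.0 / requests_per_minute
        for _ in range(requests_per_minute):
            if requests.get(URL).status_code == 429:  # assume 429 means "slow down"
                return False
            time.sleep(delay)
        return True

    def find_rate_limit():
        # Fibonacci-style growth keeps the number of probe rounds small
        prev, rate = 1, 1
        while allowed(rate):
            prev, rate = rate, prev + rate
        return prev  # last rate that did not trip the limit

    # Scrape at just under the discovered maximum
    safe_rate = find_rate_limit()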
If a site owner changes the layout or implements a new feature, the programs depending on the scraper immediately fail. This is much less likely to happen when working with official APIs.
Sidenote: I wonder if any webapps use randomly generated IDs and class names (linked in the CSS) to prevent scraping. I guess this would be a caching nightmare, though.
In my spare time, I've been playing around with "scrapers" (I like to call them web browsers, personally) that don't even look at markup.
My first attempt used a short list of heuristics that proved to be eerily successful for what I was after. To the point I could throw random websites with similar content (discussion sites, like HN), but vastly dissimilar structures, at it and it would return what I expected about, I'd say, 70% of the time in my tests.
After that, I started introducing some machine learning in an attempt to replicate how I determine which blocks are meaningful. My quick prototype showed mixed results, but worked well enough that I feel with some tweaking it could be quite powerful. Sadly, I've become busy with other things and haven't had time to revisit it.
With that, swapping variables and similar techniques to thwart crawlers seems like it would be easily circumvented.
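For a flavor of what markup-agnostic extraction can look like, here is a minimal text-density heuristic; it's my own illustration of the general idea, not the commenter's actual approach, and the thresholds are arbitrary:

    import re

    from bs4 import BeautifulSoup

    def main_text_blocks(html, min_density=0.5, min_length=80):
        """Crude heuristic: keep blocks whose text clearly outweighs their
        markup, regardless of what the tags or classes are called."""
        soup = BeautifulSoup(html, "html.parser")
        blocks = []
        for node in soup.find_all(["div", "p", "td", "li", "article"]):
            text = node.get_text(" ", strip=True)
            markup = len(str(node))
            if markup == 0 or len(text) < min_length:
                continue
            density = len(text) / markup  # text-to-markup ratio
            link_text = sum(len(a.get_text()) for a in node.find_all("a"))
            link_ratio = link_text / max(len(text), 1)
            # High density + few links usually means content, not navigation
            if density > min_density and link_ratio < 0.3:
                blocks.append(re.sub(r"\s+", " ", text))
        return blocks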
I'm more interested in what I can do to write fewer scrapers since the content is, at a high level, relatively similar. I've just started with experiments writing "generic" scrapers that try and extract the data without depending on markup. It's going to eventually work well enough but to get the error rate down to an acceptable level is going to take a lot of tweaking and trial and error.
There's a few papers on this, but not much out there. That's why I was interested in someone else working on the same problem in a different space.
These guys do a stellar job on the IP addresses: http://www.hidemyass.com/proxy-list -- the good thing is the data is available for an amazing price.
Other sites I have come across will use large images and CSS sprites to mask price data.
I write a lot of scrapers for fun, rarely profit, just for the buzz.
At least it's easier to code these tricks than to patch a scraper to get around them.
but who cares, no one can beat XPath :)
The OP addresses that point. His contention is, there's a lot more pressure on the typical enterprise to keep their public-facing website in tip-top shape than there is to make sure whatever API they've defined is continuing to deliver results properly.
Of course, part of the art of (and fun of) scraping is to see if you can make your scraper robust against mere cosmetic changes to the site.
I once had to maintain a (legal) scraper, and I can tell you there is no fun in making your scraper robust when the website maintainers are doing their best to keep you from scraping their site.
I've seen random class-names and identifiers, switching of DIVs and SPANs (block display). Adding and removing SPANs for nesting/un-nesting elements. And so on.
Of course the site likes to keep its SEO, but most of the time it's easy to keep parts out of context for a scraper.
In the past, I have successfully used HtmlUnit to fulfill my admittedly limited scraping needs.
It runs headless, but it has a virtual head designed to pretend it's a user visiting a web application to be tested for QA purposes. You just program it to go through the motions of a human visiting a site to be tested (or scraped). E.g., click here, get some response. For each whatever in the response, click and aggregate the results in your output (to whatever granularity).
Alas, it's in Java. But, if you use JRuby, you can avoid most of the nastiness that implies. (You do need to know Java, but at least you don't have to write Java.)
Hartley, what is your recommended toolkit?
I note you mentioned the problem of dynamically generated content. You develop your plan of attack using the browser plus Chrome Inspector or Firebug. So far, so good. But what if you want to be headless? Then you need something that will generate a DOM as if presenting a real user interface but instead simply returns a reference to the DOM tree that you are free to scan and react to.
I do know that Selenium can be used for this... but I have yet to see a decent example of the same. Does anyone have any good resources/examples on JS scraping that they could share?
I would be eternally grateful.
I worry that it's not going to replicate a real browser accurately enough, but I'm excited to try it out a bit.
We're also trying it for integration tests, as it is much quicker than Phantom or Selenium. Even there, where we control the standards-compliant site, it isn't quite good enough yet.
Would love to see more people helping make it so, though!
First install Xvfb and pyvirtualdisplay, then try this snippet: https://gist.github.com/4243582
Selenium is great; it can even wait for AJAX requests to finish (see WebDriverWait).
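The gist isn't reproduced here, but a minimal version of the same idea (virtual display plus Selenium, waiting for AJAX-loaded content) might look like this; the URL and element ID are placeholders:

    from pyvirtualdisplay import Display
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # Start a virtual X display so the browser can run on a headless server
    display = Display(visible=0, size=(1024, 768))
    display.start()

    driver = webdriver.Firefox()
    try:
        driver.get("https://example.com/js-heavy-page")  # placeholder URL
        # Block until the AJAX-rendered element shows up (or 30s passes)
        element = WebDriverWait(driver, 30).until(
            EC.presence_of_element_located((By.ID, "results"))  # placeholder ID
        )
        print(element.text)
    finally:
        driver.quit()
        display.stop()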
You really don't need xvfb anymore. Use xserver-xorg-video-dummy.
It's really no different than hitting a static site with Selenium. Figuring out the proper XPath to use is often the biggest challenge: Chrome Developer tools help immensely. Also, you need to watch for delays in JS rendering, so put a lot of pauses in your scripts.
It's of course slow, so if you want to distribute it across several machines, use Selenium Grid or a queue system (SQS, Resque, etc). Set up Xvfb to run on headless Linux instances.
Here's a very old Selenium 1.0 example that scrapes the full, rendered HTML of a page. After performing a scrape like this, I would then feed the HTML into a parser such as Nokogiri http://snipplr.com/view/7906/rendered-wget-with-selenium/
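That snippet is Selenium 1.0-era; a present-day Python equivalent of the same idea (grab the rendered source, hand it to a parser) would be roughly the following, with a placeholder URL and selector:

    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Firefox()
    driver.get("https://example.com/rendered-page")  # placeholder URL

    # page_source is the DOM after JS has run, not the raw HTTP response
    soup = BeautifulSoup(driver.page_source, "html.parser")
    titles = [h.get_text(strip=True) for h in soup.find_all("h2")]

    driver.quit()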
If you are planning to use phantomjs, import sh and it's commandline all the way to payday :D
br = mechanize.Browser()
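Expanding that line into a minimal mechanize session (the URLs and form field names are hypothetical), it might go something like:

    import mechanize

    br = mechanize.Browser()
    br.set_handle_robots(False)                       # mechanize skips robots.txt-blocked pages by default
    br.addheaders = [("User-agent", "Mozilla/5.0")]   # some sites reject the default UA

    br.open("https://example.com/login")              # hypothetical login page
    br.select_form(nr=0)                              # first form on the page
    br["username"] = "me"                             # hypothetical field names
    br["password"] = "secret"
    br.submit()

    html = br.open("https://example.com/data").read() # fetch a page behind the login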
The biggest downside of scraping is that it often takes a long time for very little content (e.g. scraping online stores with extremely bloated HTML and 10-25 products per page).
The fundamental problem is, web pages can change a lot. We constantly had scraper scripts fail either because the web pages changed for some innocuous reason, or they noticed the scraping and blocked us.
We resorted to a list of scrape targets and constantly-updating scrape-scripts to adapt continuously to the 'market'. We also pinged each target to find the least congested.
Eventually we got our own stock feed (guy that did that is a research scientist at Adobe now) and stopped scraping altogether. But it was a wild ride.
My take on it, from ScraperWiki's point of view:
Otherwise, great article! I agree that BeautifulSoup is a great tool for this.
As a startup developer, third-party scraping is something I need to be aware of, that I need to defend against if doing so suits my interests. A little bit of research shows that this is not impractical. Dynamic IP restrictions (or slowbanning), rudimentary data watermarking, caching of anonymous request output all mitigate this. Spot-checking popular content by running it through Google Search requires all of five minutes per week. At that point, the specific situation can be addressed holistically (a simple attribution license might make everyone happy). With enough research, one might consider hellbanning the offender (serving bogus content to requests satisfying some certain heuristic) as a deterrent. A legal pursuit with its cost would likely be a last resort.
Accept the possibility of being scraped and prepare accordingly.
The answer is HttpFox. It records all HTTP requests.
1. Start recording
2. Do some action that causes data to be fetched
3. Stop recording.
You will find the URL, the returned data, and a nice table of GET and POST variables.
You might have to spoof the Referer header so that it thinks the request is still coming from their website.
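With the requests library, spoofing the Referer header is a one-liner; the URLs below are placeholders for whatever endpoint HttpFox uncovered:

    import requests

    resp = requests.get(
        "https://example.com/ajax/data.json",            # the endpoint HttpFox revealed (placeholder)
        headers={"Referer": "https://example.com/page"}  # pretend we came from the site's own page
    )
    print(resp.json())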
You basically have to proxy everything through a proxy that can be given a command or otherwise instructed to capture the top 3 or 4 streams from the website. From there you can either dumbly accept the largest one or start checking byte headers.
Scraping used to be the way to get data back in the day, but websites also didn't change their layout/structure on a weekly basis back then, and were much more static when it came to structure.
Having recently written a small app that was forced to scrape HTML and having to update it every month to make it keep working, I can't imagine doing this for a larger project and maintaining it.
A simple API providing a simple JSON response with HTTP basic auth is far more efficient than a screen-scraping program where you have to parse the response using HTML/XML parsers.
As for efficiency, again not such an issue. HTML is very good these days, compared to 10 years ago, a simple CSS selector often does the job.
Concerning efficiency, this is true; CSS and XPath processors, at least, both offer very nice performance.
But downloading 70KB of HTML each time you only need a single piece of data, where the API request costs only a few kilobytes (avg < 2KB), can be a pain if you need to do this frequently. This can be handled with a scalable configuration, but I find that a bit of an overkill.
They could be used for computation, but (mostly) aren't.
Tried scraping Facebook. They have IP blocks and the like.
Google, as a search engine, crawls almost all sites, but offers very little usable data to other bots.
Clearly someone's never spent time diagnosing the fun that is scraping HN (yes, an unofficial API is available).
Do you have any plans to track which proxies are actually working, or how quickly each one is blocked? I want a reverse proxy on my outgoing requests that knows how to shift my traffic around properly so that I don't get banned. I don't want to be rate limited and I don't want to sit here for weeks trying to figure out wtf the rate limit is in the first place.
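Not an answer to the tracking-and-banning part, but the basic rotation itself is simple with requests; the proxy addresses below are placeholders, and a real version would record failures per proxy rather than just skipping them:

    import itertools

    import requests

    # Placeholder proxy list, e.g. pulled from a source like the one above
    PROXIES = ["http://1.2.3.4:8080", "http://5.6.7.8:3128", "http://9.10.11.12:80"]
    rotation = itertools.cycle(PROXIES)

    def fetch(url):
        """Try successive proxies until one works."""
        for _ in range(len(PROXIES)):
            proxy = next(rotation)
            try:
                return requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=10)
            except requests.RequestException:
                continue  # dead or banned proxy, move on to the next one
        raise RuntimeError("no working proxies left")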
I used it to crawl for freelance web dev gigs, but it can be re-purposed to do anything.