Web Scraping: Bypassing “403 Forbidden,” captchas, and more (sangaline.com)
564 points by foob 99 days ago | 225 comments

Note that 99% of the time, if a web page is worth scraping, it probably has an accompanying mobile app. It's worth downloading the app and running mitmproxy/burp/charles on the traffic to see if it uses a private API. In my experience, it's much easier to scrape the private mobile API than a public website. This way you get nicely formatted JSON and often bypass rate limits.
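
As a rough sketch of what replaying a captured call can look like once the proxy has shown you the request: the base URL, path, headers and the `{"items": [...]}` response shape below are all made-up placeholders, not any real app's API. It only uses the Python standard library.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: copy the real ones from what the proxy shows you.
BASE_URL = "https://api.example.com"
CAPTURED_HEADERS = {
    "User-Agent": "ExampleApp/1.0 (Android)",
    "Accept": "application/json",
}


def build_request(path, params):
    """Rebuild a GET request the way the app was seen sending it."""
    query = urlencode(sorted(params.items()))
    return Request(f"{BASE_URL}{path}?{query}", headers=CAPTURED_HEADERS)


def parse_items(body):
    """Pull the records out of a JSON body shaped like {"items": [...]}."""
    return json.loads(body).get("items", [])


req = build_request("/v1/listings", {"page": 1, "city": "berlin"})
print(req.full_url)
# Sending it is one more call: urllib.request.urlopen(req, timeout=10)
```

The request is built but not sent here, so you can compare it against the captured one before hitting the server.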

Definitely agree. I'd also recommend using mobile sites for scraping as they tend to be simpler and load faster (assuming they contain all of the data you are interested in).

How do you deal with the issue that most mobile apps have a baked in security key for their private API? Or am I being naive to think that most apps have that?

You reverse engineer the application, or you run it in a debugger.

If the app features certificate pinning to block MITM eavesdropping through your own proxy, you either use one of the XPosed Framework libraries that removes it on the fly in a process hook, or you decompile the app, return-void the GetTrustedClient, GetTrustedServer, AcceptedIssuers, etc. functions.

If it features HMAC signing, you decompile the app, find the key, reverse engineer the algorithm that chooses and sorts the parameters for the HMAC function, and rewrite it outside the app. If the key is generated dynamically you reverse engineer that too, and if it's retrieved from native .so files you're going to have a fun time, but it's still quite doable.
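
For anyone who hasn't seen request signing before, here is a minimal sketch of the general shape, using Python's standard `hmac` module. The key, the alphabetical sort, the SHA-256 digest and the `sig` field name are assumptions for illustration; a real app may pick and order its parameters quite differently.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Placeholder key, purely for illustration.
SECRET_KEY = b"foobar"


def canonical_string(params):
    """Sort parameters by name and join them as a query string."""
    return urlencode(sorted(params.items()))


def sign(params, key=SECRET_KEY):
    """Return the hex HMAC-SHA256 of the canonical parameter string."""
    message = canonical_string(params).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def signed_params(params, key=SECRET_KEY):
    """Copy params and attach the signature under a made-up 'sig' field."""
    out = dict(params)
    out["sig"] = sign(params, key)
    return out


print(signed_params({"user": "alice", "page": "2"}))
```

Sorting first is what makes the signature independent of the order in which the caller happened to build the parameters.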

All they can do is pile on layers and layers of abstraction to make it painful. They can't make the private API truly private if it requires something shipped with the client.

The initial idea was to make your life simpler by parsing JSON instead of HTML. Now we are decompiling binaries. Somewhere on the way, we got lost.

Once you do the one-time work of pulling out the key, you can just add something like, "secret_key=foobar" to your requests, and you're back to happily parsing JSON.

If they keep changing it up, I'm sure you could automate the decompiling process. The reality is that this technique is security by obscurity at its core, and is therefore never going to succeed.

Skype is probably one example where it took developers 10+ years to figure out how the app worked.

Did it take that long to do it or did it take that long for someone to care to do it? I mean, it's Skype.

The story of every IT project ever...

In many cases, if all you're after is the data, you don't even need to reverse-engineer much; I'm not familiar with the mobile world but in Windows you can just send input events to the process, essentially simulating a user, and read its output in a similar fashion. You can still treat the app like a black box and regardless of how much obfuscation they put in the code, you still get the output.

(This technique is useful not just for scraping data, but for UI testing of your own apps.)

> All they can do is pile on layers and layers of abstraction to make it painful. They can't make the private API truly private if it requires something shipped with the client.

This is totally true, but the original premise was to do it just with a MITM. I was being generous and assuming most apps do dynamic generation of their keys. I'm probably wrong now that I think about it.

I feel like at a certain point this crosses the line from unintended use of a private API to unethical hacking.

If the data owner went through the trouble of encrypting the traffic between it and its app they have a certain expectation of private communications that you'd better have a damn good reason for violating.

When an application you have legally installed on your own computer is communicating with the outside world, it seems a fundamental right to inspect the exchanged data to check that it is not leaking personal information. If the data is encrypted or obfuscated, this could make us suspicious (why hide if there is nothing to hide?) and gives additional motivation to audit the security.

Once the api is reverse engineered, we might be tempted to improve the usability of the application by adding some features (scraping all data). If this hurts the server (huge resource consumption), this becomes unethical and may become illegal.

And I suppose you personally test the physical security measures of every retail store you shop at?

No, but I do personally test the physical security measures of every car or computer I purchase and bring into my home.

It's certainly unintended use of a mobile API, but it's not hacking; it's reverse engineering. HMAC is used for client integrity verification as a signing algorithm; it's not used for generating hashes or ciphertexts of confidential user data. Furthermore, even if it were, it's operating on data that we are sending to the server in the first place. We aren't actually breaking encryption or cracking hashes for confidential user data, we are choosing to manually sign messages to the server using the same methodology as the application itself. Cryptographically speaking there is a very large difference in utility here. The only actual encryption present is the TLS, but both you and the server ultimately see the data.

Reverse engineering occupies a much more ethically and legally grey area than outright hacking because you are fundamentally taking software in your possession and modifying it. There are strong arguments that people should have the right to do this. It can lead to hacking, and it's useful for security research, but it is not in and of itself an attack on the application's security (you could make a case that it is an attack on the application's trust model, however).

Now, if the developers relied on the privacy of the API as a form of implicit authorization (i.e. by forging requests from the client I can retrieve another user's data using an insecure direct object reference on a username parameter), and I proceed to do that - yes, that's hacking. You're accessing confidential data in an unauthorized manner, just as you would be if an insecure direct object reference were present on the website. The developers made a mistake in conflating client validation with user authorization, but you've still passed a boundary there.

It is arguable that this is unethical or at least amoral, but if all you're doing is scraping publicly available data using the public mobile API, it is at least legally defensible until the other party sends you a C&D for using their API in an unauthorized manner (so long as you haven't agreed to a TOS by using the mobile API, which really depends on whether and how prominently they have a browserwrap in place). I think the spirit of your point is that someone probably just shouldn't be using an API if they're not authorized to do so, but it's a very important legal and technical distinction to make here that you aren't hacking by reversing the embedded HMAC process.

Know of any good guides on doing this? I want to do this for an app right now actually.

Not off the top of my head. The knowledge isn't tribal, but it's certainly scattered (few blog posts will take you the whole way) and the tools are...spartan.

I recommend you read CTF writeups (there was one hosted on GitHub where a team retrieved the request signing key for Instagram IIRC). Those are usually very tutorial-like, though they tend to take some level of knowledge for granted even if they don't intend to.

The other thing to do is pick up apktool, JD-GUI, dex2jar and maybe IDA Pro, Hopper or JEB and learn them as you go.

Android tends to be easier to decompile if you want to discover stored keys, etc. I've done it a few times by using Charles on desktop + setting up a proxy to run my iOS connection through my laptop, and then running the app on mobile.

>All they can do is pile on layers and layers of abstraction to make it painful. They can't make the private API truly private if it requires something shipped with the client.

Do note that if you become annoying and/or conspicuous enough, they can use legal force to stop you, and if the case actually goes through the process, you'll almost surely lose. This is true at least in the United States and Europe.


If they're smart enough to collect data in the app to detect when a request to their api is actually coming from their app (versus one coming from curl or from a script), then your reverse engineering problem just went from painful to unprofitable (i.e. not worth your time).

I'm not sure I understand this. The point of the reverse engineering is to simulate (exactly) requests from the app, using your script, thus making it impossible for them to detect. And the data collected in the app has no bearing on your scripted calls to the API outside the app.

I suppose I can envisage a scenario where the app sends a counter as a request id, and my script will then send the next value of the counter as its id, causing the next request from the app to re-use a counter value and thus fail, but the server API and the app have no way of knowing this is due to my script and not, say, network issues, and therefore it should still not affect my reverse engineering abilities...

Maybe, taking this further, the app could have baked in a certain number of unique, opaque 'request tokens' that are valid for one API call only, and when my script has used all of them it will cease to work, and in doing so cause my copy of the app to become useless, but again, not an insurmountable barrier.

Never assume something is impossible. Tokens/counters that are generated at runtime inside the app are a start at countering bots. But there are MUCH more advanced techniques and big businesses that are built upon helping mobile and webapps detect bots and other scripted requests.

Reminds me of the battle between hackers and Niantic to create scanners for Pokemon Go.

Start with the mitm. You'd be surprised how many apps have no certificate pinning. If there's pinning, you just need a jailbreak so you can override the https method to ignore certificate pinning. See: iseclabs ssl killswitch

Some apps use private keys baked into the app, but you can usually recover those from memory. Do this using lldb and a remote debug server on a jailbroken iPhone over USB.

Or you can treat the app as a black box and use a phone (or multiple) as a "signing server" to sign API requests before sending them via, e.g. curl or python. To do this you install the app on a jailbroken phone, tweak it to expose an http server on the phone that listens for API requests, feeds them to the "sign API client request" method, and then returns the signed request. This method has the benefit of being resilient to frequent app updates.

You capture the token along with the request on mitmproxy.

I guess I was being generous in my assumption that the apps will generate keys dynamically, making that not useful for a repeat attack as it were. I'm probably wrong though, most apps probably use a single baked in key.

Where would the key be generated?

The app and api can share a secret generation method. I once wanted to use the api of an iOS app that sent a timestamp + dynamic token based on that timestamp in the request headers. The timestamp/token combination was validated by the server, which had a tolerance of five minutes (so replaying the same timestamp/key combination that you observed via MITM would stop working within five minutes). Rather than try to work out the algorithm, the approach was to dump the app headers and work out where the key was being generated, then used cycript to attach to the running app and invoke the `tokenForTimestamp:` method to generate valid timestamp/token pairs at one second intervals out for the next several years. Still working a year on :)
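
To make the timestamp/token scheme concrete, here is a sketch of both sides. The actual token algorithm in that app is unknown; `token_for_timestamp` below is a stand-in that simply HMACs the timestamp with a made-up shared secret. Only the five-minute tolerance and the one-second spacing come from the description above.

```python
import hashlib
import hmac

SHARED_SECRET = b"not-the-real-secret"  # hypothetical
TOLERANCE_SECONDS = 5 * 60  # the five-minute window described above


def token_for_timestamp(timestamp):
    """Stand-in for the app's token method: HMAC of the timestamp."""
    msg = str(int(timestamp)).encode("ascii")
    return hmac.new(SHARED_SECRET, msg, hashlib.sha256).hexdigest()


def server_accepts(timestamp, token, now):
    """Server-side check: token must match and timestamp must be recent."""
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False
    return hmac.compare_digest(token, token_for_timestamp(timestamp))


def pregenerate(start, count):
    """Yield (timestamp, token) pairs at one-second intervals."""
    for ts in range(start, start + count):
        yield ts, token_for_timestamp(ts)


pairs = dict(pregenerate(1_700_000_000, 3))
print(server_accepts(1_700_000_001, pairs[1_700_000_001], now=1_700_000_002))
```

This also shows why a pair replayed from a MITM capture goes stale: the token is still valid for its timestamp, but the timestamp falls out of the window.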

If you haven't heard of Frida yet, check it out!


Most of the "worthwhile" apps have certificate pinning and HMAC-signing in place these days, so it's a bit more of an onerous process than simply booting up Charles and playing with the API. But yeah, it's definitely better to deal with JSON outputs than have to parse HTML.

you realize if you piss off anyone doing this, they can have you sent to jail right? Automated scraping is unlikely to qualify as fair use in court.

As far as I know, only Japan has laws that allow scraping without explicit permission.

> you realize if you piss off anyone doing this, they can have you sent to jail right?

Are you talking about Aaron Swartz? There were complicating factors. Typically, no, scraping is not a criminal matter. If you need a lawyer, however, you should talk to one.

> Automated scraping is unlikely to qualify as fair use in court.

Scraping is not [usually] the problem. It is what you do with the scraped content that is the problem. Additionally, the nature of the scraped content itself is at issue: facts are not copyrightable, for instance, just particular expressions.

Fair use is a hugely, hugely complex topic. If you ever ask someone if something is fair use and they give you a straight "yes" or "no" answer - don't trust it unless they are also showing you a court decision that covers your exact scenario and is valid in your jurisdiction. Any good answer will include a healthy dose of "it depends."

The only way to conclusively determine if something is fair use is going through the fair use "run-time" - which is a court decision on the topic. There is no other way - none - to determine if something is fair use and claims to the contrary are false. To be clear: fair use is a judicial determination and it is highly fact specific.

Disclaimer: I am a lawyer, but I am not your lawyer. If you need one, you should get one.

If there's no robots.txt, then you can scrape all you want.

That's how Google got big.

Is robots.txt compliance written into law anywhere?

How does Google function in Japan then?

Western world allows for scraping as long as you follow robots.txt. If you don't, it's still not a jailable offence unless you DOS the system.
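
Leaving the legal question aside, the mechanical part of honoring robots.txt is small. A minimal sketch with Python's standard-library `urllib.robotparser`; the robots.txt content, the URLs and the `my-scraper` user-agent name are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real one lives at https://<site>/robots.txt
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def make_parser(robots_text):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
print(parser.can_fetch("my-scraper", "https://example.com/products/1"))
print(parser.can_fetch("my-scraper", "https://example.com/private/prices"))
print(parser.crawl_delay("my-scraper"))
```

Against a live site you would normally call `set_url()` and `read()` on the parser instead of feeding it text.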

You can usually also do this with whatever JavaScript is running on the page, if the data gets loaded in via AJAX.

I once wrote a whole automated test suite for a web app based on this, but nowadays I think it's much more common.

Perfectly feasible technically. However aren't you stepping into a legal minefield? After all, it's a "private" API. Furthermore, since you're explicitly going around the hurdles they've laid out for the public (captchas, T&C), you're surely painting a target on your head?

Note, I'm not debating whether this is a good or bad thing, just that in the current environment this is surely a legally dodgy manoeuvre.

Has anybody ever been sued for scraping a public website?

Weev spent about a year in jail for doing it

And the conviction was vacated on a technicality, so standing precedent on the issue is a little thin.

Wasn't even scraping, just incrementing a query param.


This is absolutely genius. I'd love to see a blog post with more information.

Reminds me of when app developers found the private API of the Pokemon GO app and used it to create their own Pokemon locator apps. Many of them made it to the top free charts and would have raked in hundreds of thousands of downloads, if not millions.

Another fun aspect of that - most banks don't release public APIs, giving safety as a reason. On the other hand, each one of them has a mobile app with an API inside.

Also get the Android APK and decompile it.

Better solution: pay target-site.com to start building an API for you.

Pros:

* You'll be working with them rather than against them.

* Your solution will be far more robust.

* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.

* You're eliminating the possibility that you'll have to deal with legal antagonism.

* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!

Cons:

* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.

* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.

Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed their process if the data set is a bit too big for that.

Having been the victim of a VERY badly behaved scraper, I'm willing to listen to this. When that "attack" was going on, we talked about that very thing, if the scraper would only identify himself. (we were able to identify the actual culprit, and circumstantial evidence suggested they were going after our complete price list for a client)

The cost of the bad scraper was pretty significant. They were hitting us as hard as they could, through TOR nodes and various cloud providers. But the bot was badly written, so it never completed its scan. It got into infinite loops, and triggered a whole lot of exceptions. It caused enough of a performance drain that it affected usability for all our customers.

We couldn't block them by IP address because (a) it was just whack-a-mole, and (b) once they started coming in through the cloud, the requests could have been from legit customers. We eventually found some patterns in the bot's behavior that allowed us to identify its requests and block it. But I'd have been willing to set up a feed for them to get the data without the collateral damage.

Story doesn't add up.

First, it's very hard to pull off a DDOS attack using Tor. The most you could get would be less than someone repeatedly pressing refresh every second. This is because if you hit the same domain repeatedly the network will flag and throttle you.

How bad was your server configuration that it would choke if somebody tried to scrape it? Was this running on a dreamhost $10/year server or something? That's the only way to explain its poor performance. Either that or your SQL queries are unoptimized anyways.

I'm just trying to understand. Unless this was like 10,000 scraper instances trying to scrape your website, I find it hard to believe this story.

Instead of downvoting, why don't you offer rebuttal to what I wrote and post more evidence to support your original story?

> This is weird. First, it's very hard to pull off a DDOS attack using Tor. The most you could get would be less than someone repeatedly pressing refresh every second.

Please explain. Why do you think Tor can't provide a user with many RPS?

The network as I understand will automatically throttle and flag you if you are firing too many RPS. If you are hitting a particular domain over and over especially. So it's not possible to take down websites with TOR unless it's running on Dreamhost's shared hosting plan with a PHP solution.

This is why I find OP's story hard to believe, it doesn't add up.

Tor does nothing of the sort. In order to throttle a client, there would need to be a central authority that could identify connections by client, which would very much defeat the purpose of Tor. And besides, how would it deal with multiple Tor clients for the same user?

That said, it's not particularly effective as a brute-force DoS machine due to the limited bandwidth capacity and high pre-existing utilisation. Higher level DoS by calling heavy dynamic pages is still possible.

The parent didn't specify that the outages were during the period that the scraping was coming from Tor. It's equally possible that it only started affecting availability after they blocked Tor and switched to cloud machines.

All that said, screw people who use Tor for this kind of thing. They're ruining a critical internet service for real users.

CWuestefeld wrote "They were hitting us as hard as they could, through TOR nodes and various cloud providers."

I think you missed the second part of that sentence. I must admit that I worked for a company that did that to scrape a well known business networking site....

> Better solution: pay target-site.com to start building an API for you.

I'd add to this:

Do you really need continuing access? Or just their data occasionally?

Pay them to just get a db dump in some format. For large amounts of data, creating an API then having people scan and run through it is just a massive pain. Having someone paginate through 200M records regularly is a recipe for pain, on both sides.
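
To put a rough number on that pain: only the 200M figure comes from above; the page size and request rate below are assumptions you should swap for your own.

```python
TOTAL_RECORDS = 200_000_000  # figure from the comment above
PAGE_SIZE = 100              # assumption
REQUESTS_PER_SECOND = 10     # assumption


def crawl_estimate(total, page_size, rps):
    """Return (number of requests, wall-clock days) for one full pass."""
    requests_needed = -(-total // page_size)  # ceiling division
    seconds = requests_needed / rps
    return requests_needed, seconds / 86_400


pages, days = crawl_estimate(TOTAL_RECORDS, PAGE_SIZE, REQUESTS_PER_SECOND)
print(f"{pages:,} requests, about {days:.1f} days")
```

Under those assumptions a single full pass is two million requests and a bit over two days of sustained traffic, before any retries.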

A supported API might take a significant amount of time to develop, and has on-going support requirements, extra machines, etc. Then you have to have all your infrastructure or long running processes to hammer it and get the data as fast as you can, with network errors and all other kinds of intermittent problems to handle.

A pg_dump > s3 dump might take an engineer an afternoon and will take minutes to download, requiring getting approval from a significantly lower level and having a much easier to estimate cost of providing.

> pay target-site.com to start building an API for you.

When has that ever worked?

I can't cite specific examples because the ones I know about formed confidential business relationships, but I can say with confidence that this works All. The. Time.

That said, if you're some small-time researcher who can't offer a compelling business case to make this happen, then it won't be worth their time and they're likely to show you the door. [Note: my implication here is that it's not because you're small time, but it's because by the nature of your work you're not focusing on business drivers which are meaningful to the company/org you're propositioning].

Edit: Also be warned that if you're building a successful business on scraped personal info, you're begging to be served w/ a class action lawsuit (though take that well-salted, because IANAL and all that jazz).

Most of the time, in my (admittedly limited) experience. The two exceptions have been:

A giant site, which already had an API & had deliberately decided to not implement the API calls we wanted. I should add that another giant site has happily added API options for us. (For my client, really; not for me.)

An archaic site. The dev team was gone & the owners were just letting it trickle them revenue until it died -- they didn't even want to think about it anymore.

When the money is good enough :) . That is, usually not for startups, yes for established companies with money.

Never to my personal attempts

Many instances of us building scrapers are cases where a partner has data or has tools which are only built into the UI or the UI ones are much more capable.

Rather than waiting potentially years for their IT team to make the required changes we can build a scraper in a matter of days.

I can attest to this. From personal experience, I found websites that ignore scrapers and just let me access the data on their public website easier to deal with, both code-wise and time-wise. I make the request, you give me the data I need and then I can piss off.

Websites that make it a cluster &&*& to get access to the data do two things. They set up a challenge to break their idiotic `are you a bot?` check, and secondly it is trivial in most situations just to spin up a VM and run Chrome with Selenium and a Python script.

Granted, I don't use AJAX APIs or anything like that. Instead, I've found that sites where the developers natively embed a JSON string alongside the data within the HTML are the easiest to parse.
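
A sketch of what that looks like with nothing but the Python standard library. It assumes the JSON sits in a `<script type="application/json">` tag, which is one common pattern but by no means the only one; the sample page is made up.

```python
import json
from html.parser import HTMLParser


class EmbeddedJSONParser(HTMLParser):
    """Collect the parsed contents of <script> tags of a given type."""

    def __init__(self, script_type="application/json"):
        super().__init__()
        self.script_type = script_type
        self.blobs = []
        self._buffer = None  # list of text chunks while inside a matching tag

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == self.script_type:
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blobs.append(json.loads("".join(self._buffer)))
            self._buffer = None


def extract_json(html, script_type="application/json"):
    """Return a list with one parsed object per matching script tag."""
    parser = EmbeddedJSONParser(script_type)
    parser.feed(html)
    parser.close()
    return parser.blobs


PAGE = """
<html><body>
<h1>Flat for rent</h1>
<script type="application/json">{"price": 1200, "rooms": 2}</script>
</body></html>
"""
print(extract_json(PAGE))
```

Passing `script_type="application/ld+json"` makes the same function pick up JSON-LD blocks instead.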

Reasons why I've set up bots/scrapers: 1) My local city rental market is very competitive and I hate wasting time emailing landlords who have already signed a lease. 2) House prices 3) Car prices 4) Stock prices 5) Banking 6) Water rates 7) Health insurance comparison 8) Automated billing and payment systems.

This can't be a solution for people using web scraping:

The goal for web scrapers is to pay as little as possible for as much data as possible.

Depends on the scraper. I buy data dumps when I can. Plus, it can actually be cheaper to enter into a business relationship with the target site than it is to play whack-a-mole with their anti-scraping development efforts over time.

>Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed their process if the data set is a bit too big for that.

This is a good idea that has some interesting legal implications (e.g., the target site's network is never accessed by the software, so CFAA claims are likely irrelevant), but probably isn't enough to cover all the bases. I wanted to try something like this before I got C&D'd, but my lawyer informed me that doing it after the fact could potentially constitute conspiracy and cause a lot of problems.

I'm not a lawyer.

My son has done a few of these for me. :) Smaller sites, one time grabs. But yes, persisting AFTER a direct request that you stop is usually both uncool and legally risky.

Also known as the tcgplayer.com strategy. Very disappointing to find out about, especially when the margins on hobby-level Magic: the Gathering card selling are already so low.

Scrapy is indeed excellent. One feature that I really like is Scrapy Shell [1].

It lets you run and debug the scraping code without running the spider, right from the CLI.

I use it extensively to test that my selectors (both CSS and XPATH) are returning the proper data on a test URL.

[1] https://doc.scrapy.org/en/latest/topics/shell.html

One thing that turns me off about Scrapy is that it feels over-engineered for what it does. Why do I need an entire framework?

I'm taking on technical debt to access data I don't have programmatic access to.

CSS/Xpath are very fragile. You most likely will be changing them in the future.

I've used scrapy a few times and it's never felt like a big over-arching framework.

I've been able to change what I need, and what it does for me are all the things I'd have to have done myself (caching, parallel requests but throttling per domain, dropping into debug mode, scheduled runs, etc).
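
To give a feel for the "fiddly to get right" part, here is a bare-bones sketch of per-domain throttling in plain Python. This is not how Scrapy implements it; the class name, the fixed delay and the injectable clock are all choices made for this illustration.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Enforce a minimum delay between requests to the same host."""

    def __init__(self, min_delay=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_delay = min_delay
        self.clock = clock
        self.sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until url's host may be hit again; return seconds slept."""
        host = urlparse(url).netloc
        now = self.clock()
        delay = max(0.0, self._next_allowed.get(host, now) - now)
        if delay:
            self.sleep(delay)
        self._next_allowed[host] = now + delay + self.min_delay
        return delay


throttle = DomainThrottle(min_delay=0.2)
for url in ["https://a.example/1", "https://b.example/1", "https://a.example/2"]:
    waited = throttle.wait(url)
    print(f"{url}: waited {waited:.2f}s")
```

Only the second request to `a.example` has to wait; the request to `b.example` goes straight through. A real crawler would also need this to work across concurrent workers, which is where the fiddliness starts.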

Really, it feels more like a skeleton + lots of sensible defaults. The meat of the code will be in the parsing, and so if for some reason I really needed to move away from it then I wouldn't feel particularly tied to it.

Interesting. So the web crawling/page fetching component is a major value add to you as a developer?

Whereas the parsing is less of a value add because you prefer to code the parser yourself so that you have more control?

What about changing the parser and crawler as the websites changes?

What other pain points about scrapy do you have?

> Interesting. So the web crawling/page fetching component is a major value add to you as a developer?

When I have a scraping task, yes. It's a set of things that are required each time but also a bit fiddly to get right, and scrapy has solved them well.

> Whereas the parsing is less of a value add because you prefer to code the parser yourself so that you have more control?

These I see as fundamentally having to be things I code as they're the parts that are different each time.

> What about changing the parser and crawler as the websites changes?

Really just a cost of doing scraping. Anything fancier has generally taken more time and created more problems than just assuming an ongoing maintenance cost. The debugging shell in scrapy is very useful for this.

Last big job I did I also built a cache that you could query by time, so all versions of the page seen were stored which was very useful for debugging intermittent problems, and finding page changes. I don't know if scrapy has this in its cache, I don't think so but wouldn't conflict with it.
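
The time-queryable cache idea can be sketched in a few lines. This is an in-memory toy under my own assumptions (numeric timestamps, whole bodies kept as strings), not the cache from that job and not anything Scrapy ships.

```python
import bisect


class VersionedPageCache:
    """Keep every fetched version of a page, queryable by time."""

    def __init__(self):
        self._times = {}   # url -> sorted list of fetch timestamps
        self._bodies = {}  # url -> bodies, parallel to self._times[url]

    def store(self, url, fetched_at, body):
        """Record a fetched body, keeping versions ordered by time."""
        times = self._times.setdefault(url, [])
        bodies = self._bodies.setdefault(url, [])
        index = bisect.bisect_right(times, fetched_at)
        times.insert(index, fetched_at)
        bodies.insert(index, body)

    def as_of(self, url, when):
        """Return the newest body fetched at or before `when`, or None."""
        times = self._times.get(url, [])
        index = bisect.bisect_right(times, when)
        return self._bodies[url][index - 1] if index else None

    def changes(self, url):
        """Return timestamps where the body differs from the previous one."""
        times = self._times.get(url, [])
        bodies = self._bodies.get(url, [])
        return [t for i, t in enumerate(times)
                if i == 0 or bodies[i] != bodies[i - 1]]


cache = VersionedPageCache()
cache.store("https://example.com/p/1", 100, "<p>price: 10</p>")
cache.store("https://example.com/p/1", 200, "<p>price: 10</p>")
cache.store("https://example.com/p/1", 300, "<p>price: 12</p>")
print(cache.as_of("https://example.com/p/1", 250))
print(cache.changes("https://example.com/p/1"))
```

`as_of` is what helps with intermittent problems (what did the page look like when the parser broke?), and `changes` is the cheap way to spot when a page changed.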

> CSS/Xpath are very fragile. You most likely will be changing them in the future.

Genuinely curious what the alternative is

I've been doing research on this but it's not clear whether this problem is a pain for a large enough number of businesses to justify further investment.

I often feel like web scraping is treated as a commodity, without any understanding of the inherent technological complexities and challenges.

Very discouraging field to be in, especially when people claim to have pain but are unwilling to pay very much for it or show appreciation for the effort that goes into it.

edit: thanks for the downvotes. perfect illustration of how innovation is punished and unrewarded in this field.

How many downvotes did you get? A few may have just been because your response is vague and doesn't say much for the question. But that doesn't mean a downvote should occur, of course.

Otherwise complaining about downvotes is no good either. Some will downvote because of that.

> especially when people claim to have pain but are unwilling to pay very much for it or show appreciation for the effort that goes into it.

Welcome to pretty much every profession in the world.

FYI, I only downvoted you after you complained about downvotes.

I only downvoted because there was no alternative offered, just complaining about how underappreciated scraper creators are.

> perfect illustration of how innovation is punished

I see no innovation in your post.

The complaining about downvotes was just the icing on the cake, cementing the downvote.

for unstructured data applying NLP

the other alternative is parsing schema.org schemas or other markup

Here's an idea (although probably an unpopular one around here): if a site is responding to your scraping attempts with 403s -- a.k.a. "Forbidden" -- stop what you're doing and go away.

This is a very obvious thing to say. Perhaps it needed to be said, I don't know -- it's just a very obvious counter.

Yeah, it seemed obvious... but judging by all of the comments here on how to "bypass" 403s, it actually wasn't obvious at all.

I meant it is obvious in that everyone knows that. But they'll still want to bypass it. So everyone is completely aware of it.

The web scraping tool of my choice still has to be WWW::Mechanize for Perl.

P.S. I wrote a WWW::Mechanize::Query ext for it so that it supports css selectors etc if anyone is interested. It's on cpan.

Same in Ruby land. That with inspector gadget and you're golden.

I have done a lot of scraping in Python with requests and lxml and never really understood what scrapy offers beyond that. What are the main features that can't be easily implemented manually?

Pluggable parsers, automatically good error handling and spidering functionality (finding and queueing new links to scrape), great logging, progress stats, exports, pause/resume functionality, and a million other goodies that are seemingly "trivial" but really you don't want to rewrite them every time you write a scraper.

edit: Especially if your scraping jobs take a LONG time - days and weeks, this stuff is extra handy. Might I add a great debugging environment (scrapy shell), error handling, rate limiting, respecting robots.txt, so much more.

How much benefit does the spidering/progress/pause/resume functionality give if you're not just spidering every link on the site, but have complex logic to determine exactly which links to crawl and in what order? Does Scrapy provide convenient extension hooks to change the crawl algorithm?

I haven't had the need to use pause/resume but I do incorporate logic (not necessarily all that complex) in determining which links to crawl. It is very easy to do within each spider especially with how the framework uses generators. It is also easy to extend the pipelines for pre and post processing.

As others have said, managing a project with Scrapy is super easy and highly configurable with sane out of the box settings.
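
For anyone wondering what "logic in generators" means in practice, here is the idea in plain Python, without Scrapy's actual API. The toy site graph and the `/products` rule are made up; the point is that link selection is just a generator you can swap out, and the crawl loop handles the duplicate filtering.

```python
from collections import deque

# Toy in-memory "site": page path -> links found on that page (made up).
SITE = {
    "/": ["/products", "/about", "/products"],
    "/products": ["/products/1", "/products/2", "/"],
    "/products/1": ["/products/2"],
    "/products/2": [],
    "/about": ["/careers"],
}


def wanted_links(page, links):
    """Yield only the links worth following; custom logic goes here."""
    for link in links:
        if link.startswith("/products"):
            yield link


def crawl(start, fetch_links, select=wanted_links):
    """Breadth-first crawl that yields each selected page exactly once."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        yield page
        for link in select(page, fetch_links(page)):
            if link not in seen:  # duplicate filter
                seen.add(link)
                queue.append(link)


print(list(crawl("/", lambda page: SITE.get(page, []))))
```

Changing the crawl order would mean swapping the deque for a priority queue; the selection generator stays the same.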

I tried scrapy many years ago when it was pretty new. I felt like it mostly just got in the way back then and resorted to normal HTTP lib scraping methods like you've described until recently, when I decided to give scrapy another try.

I really enjoyed it this time. It has mostly figured out how to be unobtrusive and it now provides a lot of handy stuff out of the box, like the scrapy shell, the ability to easily retry certain error codes, a built-in caching mechanism so that multiple iterative runs are semi-reasonable, and an intelligent crawler that automatically ignores duplicate links from the same session. Compatibility with things like Scrapy Cloud is a nice bonus.

It's worth a try if you haven't looked at it in a while.

The Scrapy cloud alone is worth it. And it gives guidance, making the start very easy.

I'm curious what others use to scrape modern (javascript based) web applications.

The old web (HTML and links) works fine with tools like Scrapy, but this no longer works for modern applications that rely on JavaScript.

For my last project I used a Chrome plugin which controlled the browser's URL locations and clicks. Results were transmitted to a backend server. New jobs (clicks, URL changes) were retrieved from the server.

This worked fine but required some effort to implement. Is there an open source solution which is as helpful as Scrapy but solves the issues provided by modern javascript websites/applications?

With tools like Chrome headless this should now be possible, right?

I have used Selenium for this with quite a bit of success, or as others have mentioned, just figure out where the API endpoints are with fiddler and pull the data directly from the source.

Sometimes this can be a PITA though, for example Tableau obfuscates the JSON they send back, so it's easier to use Selenium to wait ten seconds and then scrape the resulting HTML.

Disclaimer: I'm a co-founder of Apifier [1].

It's not open source, but it's free up to 10k pages per month. It can handle modern JS web applications (your code runs in the context of the crawled page). You can, for example, scrape an API key first and then use internal AJAX calls.

There's also a community page [2] where you can find and use crawlers made by other users.

[1] https://www.apifier.com [2] https://www.apifier.com/community/crawlers

interesting. are you seeing any product/market fit for this?

We see a lot of users who need data from the web, or APIs for sites which don't have one. It's just that not all of them can code, and we have to scale custom development.

Are these developers? Business people? I'm curious because we've been searching for a tool like this for a while but ultimately management thought it was a bad idea to rely on scraping, there's simply no replacement for a REST api.

Both: developers on a free plan building their own RSS feeds for sites without one, and business people (mainly startups) building their products on top of Apifier.

Typical use is an aggregator that needs a common API for all partners who are not able to provide one. So they have an API running on Apifier in an hour. It might break once in a while, and then you have to update your crawler (not that often if you use internal AJAX calls).

I see, so there's not much value beyond startups and bootstrappers.

I feel like it's a hard sell to enterprises. Scraping is viewed as inferior to an API, so it makes sense for enterprises to just pay the target website for access to the data.

It's also hard to get direct access to the data.

But you're right it's a hard sell to enterprises although we have some (e.g. real estate developer creating pricing maps)

Yes, Chrome is the way to go in my opinion (or in general any browser with a proper DevTools API). Zero setup (start the browser, use the API), zero feature-lag, zero deviation from regular user behaviour, all the security features of the regular browser. The only downside is that it is not as easy to get started as some of the tooling aimed at CI and web-page testing, but once you've built a few tools you'll quickly get the hang of what needs to happen in which order.

I use Google Chrome on https://urlscan.io to get the most accurate representation of what a website "does" (HTTP requests, cookies, console messages, DOM tree, etc). For Chrome, this is probably the best library available: https://github.com/cyrus-and/chrome-remote-interface. Headless is working as well, but still has some issues.

I use Elixir and Hound because it has a nice clean API that's not difficult to mess around with. It's really straightforward.


I used http://phantomjs.org/ as a headless browser for scraping a JS-based site. It was a couple years ago, though, maybe now there's something better.

Not open-source but free: Kantu (https://kantu.io) uses OCR to support web scraping. You mark an anchor image/text with a green frame and mark the area of data that needs to be extracted with pink frames. The image inside the pink frames is then sent to https://ocr.space for processing and Kantu api returns the extracted text. This works very well as long as you do not need a lot of data. It is certainly not a "high-speed" solution for scraping terabytes of data.

Tried the OCR for scraping and gave up because it was too slow and inaccurate.

OCR works well for certain scenarios where the UI is fixed, like desktop applications, but it's still fragile, much like CSS and XPath selectors.

In fact, OCR is often far slower and less accurate than CSS/XPath selectors.

It has its niches but I think it's suboptimal for web automation/scraping.

Splash https://github.com/scrapy-plugins/scrapy-splash

Runs a little headless browser.

Interesting, is there a variant which uses chrome? This would also eliminate most scraping protections.

Splash is not chromium I believe. Therefore it's buggy as hell and doesn't render websites that Chrome can as smoothly and easily.

Many times it's actually much easier to scrape a JS-based app. You just find the right API calls and you get nicely formatted data (JSON mostly).

Kimono was good for this, but it was acquired and shut down last year (IIRC). Not sure why their exit didn't lead to someone else moving into the space?

We use HTMLUnit. Works pretty well. Not super fast, but you want to scrape individual sites at a moderate rate anyway

Have you run into issues? I'd think HTMLUnit isn't robust enough and its "browser" is limiting?

It's got a couple of idiosyncrasies but works well in general. Barfs out too much log info in general. XPath 1 is limiting, but can use Saxon if need to

Holy crap! Xpath 1 is still being used for it? I actually have no clue what the differences are between xpath versions but I just assumed everyone is on xpath 2.

I guess my other question is - have you run into any situations where the JavaScript parsing or browser rendering wasn't good enough?

Casperjs with Slimerjs and/or Phantomjs work well

I use Greasemonkey on Firefox. Recently, I wrote a crawler for a major accommodation listing website in Copenhagen. Guess what? I got a place to live right in the center in 2 weeks. I love SCRAPERS I love CRAWLERS.

I did almost the same thing.

I spent 1 week selectively going through accommodations manually, then proceeded to complain to a friend of mine.

She's barely human, and she found a literally one-of-a-kind apartment dead center at a good price. The apartment was mine the next day.

Human scrapers man.

Similarly I wrote a scraper for a local used item marketplace and whenever I need to purchase something that isn't urgent and I'm OK with it being second hand, I plug in the relevant stats and load it in to cron. Near instant notification with contact information and details in my email when a match is found.

Well the problem is when someone scrapes ALL the good listings then pre-purchases them for resale at double the cost.

How is it different than paying 50+ low-wage remote workers to "scrape" the phonebook for you and then using the information acquired for profit?

One difference is that Feist v. Rural Telephone says that the data in a phonebook can't be copyrighted.


What about using those employees to "crawl" the web for you then?

I suspect it's roughly the same as a crawler -- same issues of fair use, TOS/CFAA, etc -- but likely there's no expectation that humans will read and follow robots.txt.

I made a web app that does the same thing for campgrounds: http://reserve.wanderinglabs.com/

Fairly simple, and people seem to find it really useful.

so can you use greasemonkey to follow links, load new page, parse new page, just like a headless crawler?

I think you can use a 3rd party backend service to store state of the crawler. So, when page reloads, you know which state you are in.

right. so basically a greasemonkey script is scoped to the current page? Is there any scripting solution that is not scoped to current page? In chrome maybe?

Browser automation via WebDriver (realistically Selenium) or a proxy that inserts scripts (like TestCafe).

selenium with webdriver(io) or phantomjs. I like selenium more, though.

I use Java with a simple task queue and multiple worker threads (scrapy is only single-threaded, although it uses async I/O). Failed tasks are collected into a second queue and restarted when needed. I used Jsoup [1] for parsing, and proxychains and HAproxy + tor [2] for distributing across multiple IPs.

[1] https://jsoup.org/ [2] https://github.com/mattes/rotating-proxy

Hardest part was synchronization

- to end the main thread only if all tasks are done

- when every running task can produce multiple new tasks

- with limiting the maximum number of running threads

- always running the maximum number of threads if possible

semaphores to the rescue

Doesn't ThreadPoolExecutor take care of all of that if you store the returned Future from the submit method? Then you just have the main thread wait for those.
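
For illustration, here is roughly what that pattern looks like with Python's `concurrent.futures.ThreadPoolExecutor` rather than the Java class, so treat it as an analogy and not a port of the setup described above. The `crawl` function, the `fetch_links` callback, and the in-memory `SITE` dictionary are all hypothetical stand-ins. The main thread keeps the set of outstanding futures, submits new tasks as results come in, and only returns once that set is empty, while the pool caps the number of running threads.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def crawl(seed, fetch_links, max_workers=4):
    """Visit every URL reachable from `seed`; return the set of URLs seen.

    `fetch_links(url)` runs on a worker thread and returns the URLs found
    on that page. Only the main thread touches `seen` and `pending`, so no
    extra locking is needed in this sketch.
    """
    seen = {seed}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        pending = {pool.submit(fetch_links, seed)}
        while pending:
            # Block until at least one task finishes, then schedule its children.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                try:
                    links = future.result()
                except Exception:
                    # A real crawler would push the task onto a retry queue here.
                    continue
                for url in links:
                    if url not in seen:
                        seen.add(url)
                        pending.add(pool.submit(fetch_links, url))
    return seen


# Made-up link graph standing in for real pages.
SITE = {"a": ["b", "c"], "b": ["c", "d"], "c": [], "d": ["a"]}


def fake_fetch(url):
    """Hypothetical fetcher: looks up links in SITE instead of doing HTTP."""
    return SITE.get(url, [])


visited = crawl("a", fake_fetch)
```

Whether this covers all four points depends on the workload: the pool keeps up to `max_workers` threads busy only while there are enough pending tasks to hand out.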

Note that in some places this constitutes breaking the law.

I think such laws are wrong. Scraping is what Google and Web Archive does and it serves good purposes. For example, one can make an application that compares prices for the same item at different internet shops and helps to find the cheapest offer.

I don't understand what's wrong with downloading the information that is published on a public web server. That is what that server was made for in the first place.

Of course there are people who scrape the website, add advertisement and optimize it to rank better than the original website. But this problem can be solved with other measures.

And of course those who do the scraping must limit their request rate so the server doesn't get overloaded.
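
As a rough sketch of what limiting the request rate can look like on the scraper's side, here is a small throttle that enforces a minimum gap between requests. The `Throttle` class and its parameters are hypothetical names for illustration, and a fixed minimum interval is only one of several possible policies. The clock and sleep functions are injectable purely so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between successive requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed, then reserve the slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Intended usage (not executed here, since it would really sleep):
#
#     throttle = Throttle(min_interval=2.0)
#     for url in urls:
#         throttle.wait()
#         fetch(url)  # hypothetical download function
```

The first call returns immediately; every later call waits only as long as needed to keep requests at least `min_interval` seconds apart.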

I disagree. The person who owns the server should get to decide who has access and under what circumstances. Joe Scraper, having invested nothing, has no claim or rights to it whatsoever.

Furthermore, the fact that the server gives 200 responses is not sufficient implied permission IF a no-scraping policy has been communicated in some other way such as robots.txt or (clearly communicated) TOS.

The techno-nihilist argument of "i'm just throwing bits on the wire, they're choosing to respond" is pretty unconvincing in a world where intent and context matters.

The person who owns a server can restrict access by adding a login form and creating user accounts. Or taking other measures (for example, banning IP networks or countries or implementing a captcha).

I think that the only thing that has to be regulated is the load (number of requests per unit of time) on the server, so that it doesn't prevent the server from serving pages to other visitors.

Regarding robots.txt, I am not sure if ignoring it should have legal consequences. It is just a hint to robots, don't visit these pages or we will ban you.

If there is no harm done to the server I don't see any problem with scraping.

> The person who owns a server can restrict access by adding a login form and creating user accounts.

Or, say, returning "403 Forbidden" or using a CAPTCHA?

This feels a lot like the "downloading mp3 is stealing".

If you don't want people stealing your music, don't put it in other people's hands. The minute you release it to the world, it can't be reversed. See the Streisand effect.

Likewise, you cannot place the burden on your visitors to read and analyze the ToS with their lawyers and submit an official request via fax.

If you don't want people to access your server, put it behind a paywall or a login screen at least so that you can easily ban people who don't play by your rules.

Otherwise, you have no excuse. If it's publicly accessible, then you cannot enforce any legally binding agreement as you've left the front door wide open and expect people to read tiny letters you put in front of your mailbox.

ToS is a one way street.

Seems like your argument is based on the idea that it's unreasonable to expect the end user to understand the terms under which a service is offered. And that's fair in some cases, but it's also kind of a cop out.

"How was I supposed to know they didn't want me to scrape it? I'm just an innocent passerby dropping bits on a wire" is bullshit in many, many cases. You do know, or at least could easily find out if you wanted to, but choose to maintain a fiction of ignorance to avoid responsibility.

Sure, if someone makes it difficult or arcane to read and understand their TOS, you're probably not (morally) bound by it. But if you close your eyes and plug your ears, you don't have much of a leg to stand on.

Scraping a website can be compared to only downloading mp3 file, not redistributing it to other people.

And I don't see how electronic ToS have any legal power. It is not a contract one has signed before visiting a site. I think ToS should only bind the website owner, not its visitors.

It doesn't. It would be like stepping in to a store because people were welcoming you and realizing that by sitting down you've automatically agreed and consented to tiny ToS written behind every napkin.

It's unenforceable and has never stood up in court.

If someone breaks the TOS, you can block them or send them C&D letters.

You can take someone to court if you have the cash, but you are going to be on the losing side.

ToS is not a mutually binding agreement. Even if you have a checkbox that says "I have read the terms", it's often thrown out because nobody expects you to read EULA down to the letter.

Is there a law requiring me to read and accept TOS or robots.txt before opening a website? What is the legal status of those? What if TOS says "you owe us $100 for every byte included in this TOS"? What if I have my own TOS about opening their links where they owe me $100 per byte sent?

No. TOS is not a mutually binding agreement. You can't invite people to your garage sale and tell them that by looking at the items they've automatically made a final sale without their consent.

Not true (in the U.S.). Just ask Aaron Swartz.

The fact that it does (or, at least, has been interpreted to) criminalize TOS violations is one of the most important flaws in the Computer Fraud and Abuse Act.


Aaron Swartz was a weird one. Some zealous IT guy exaggerated it as if national security data were being scraped when in fact most were just shoddy research papers written at 1 AM before the due date.

All the people who had a hand in this man's death should be charged with manslaughter.

Truly a shame and how draconian laws end up killing the innocent in the United States.

That's fine as a discussion starter, but thinking a law is wrong isn't a reason to break it – if that's what you're suggesting.

That argument often turns into "But I probably won't get caught" which is a similarly weak defence.

I am not suggesting to break a law (I don't live in US so that law doesn't apply to me anyway). I just try to point out that it is wrong and serves only publishers' interests.

How is this any different from Google doing it? It is okay for Google to crawl the Internet, but not okay to crawl Google Play? Google raising such an objection would be an ultimate irony.

Edit: On second thought, I guess you are referring to overcoming 403s and Captchas?

The difference is simple - most websites want to be crawled by Google. If you don't, you can 1. put up a robots.txt or 2. block their crawler on your server. AFAIK, Google (as well as other reputable search engines) tends to follow robots.txt.
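
For anyone who wants their own crawler to behave the same way, here is a minimal sketch of checking robots.txt rules with Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the `allowed` helper are made up for illustration; a real crawler would download the file from the target site (for example with `set_url` and `read`) instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, inlined so the example needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url, user_agent="example-bot"):
    """Return True if the parsed robots.txt lets `user_agent` fetch `url`."""
    return parser.can_fetch(user_agent, url)


can_fetch_public = allowed("https://example.com/public/page")
can_fetch_private = allowed("https://example.com/private/page")
delay = parser.crawl_delay("example-bot")  # seconds requested between hits, if any
```

Checking `allowed(url)` before each request, and honouring the crawl delay when one is given, is the kind of behaviour the well-behaved search engine crawlers are credited with above.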

Unauthorized access, if you access the service in violation of their TOS then potentially they have a case against you. I'm not aware of it ever going to court in a case where they didn't also send a cease and desist.

I'm afraid that as long as explicit agreement is not required to make a TOS binding we'll be dealing with this crap.

Yes, but see QVC v. Resultly, where the robots.txt was considered binding, not the human-readable TOS.

We are getting small wins, but it's going to be slow going until we can get Congress to adjust both the CFAA and the Copyright Act, or until we can get SCOTUS to seriously alter the way these acts have been interpreted with reference to internet access.

Not quite. https://www.law360.com/articles/757906/qvc-website-crash-sui...

Basically, all damage claims are null because QVC & Resultly never entered into a mutual agreement. You can write whatever the fuck you want in your ToS but it's not law binding.

> Judge Beetlestone also rejected QVC’s claims that Resultly violated the Computer Fraud and Abuse Act by knowingly and intentionally harming the retailer when Resultly caused the shopping network's website to crash, reasoning that the tech company and QVC both could only earn money if the site was operational.

I see you are back on the FUD train surrounding web scraping but there's only one very specific case where your fears materialize: "When you receive C&D from said website, do not continue scraping". Such was the case for Craigslist vs 3Taps.

Please do not cite legal resources and grossly twist realities to spread FUD. If you don't want to be web scraped, simply do not put it online.

>Basically, all damage claims are null because QVC & Resultly never entered into a mutual agreement. You can write whatever the fuck you want in your ToS but it's not law binding.

I'm pretty sure that's what I said re: the ToS? That's only one element of the case (breach of contract). You are correct that in this case, browsewrap was not considered applicable. There have been a few other cases where it wasn't too, as in Nguyen v. Barnes & Noble, Inc., but there have been cases where it was, as in Hubbert v. Dell Corp. Also note that most cases re: browsewrap do not challenge the viability of automatically entering an agreement by clicking around the site, but rather argue that the notification was simply not prominent enough. It could be worked around by moving the notice into a more prominent location on the page.

The other element is CFAA, and referring to the wide-open robots.txt helped Resultly establish that they were attempting to act in good faith and were not maliciously damaging QVC's systems and not exceeding authorized access to the computer system.

>Please do not cite legal resources and grossly twist realities to spread FUD. If you don't want to be web scraped, simply do not put it online.

I'm not trying to spread FUD, I'm just trying to make it clear that the legal situation is precarious. Google has clearly shown that if you are able to build your coffers and reputation faster than you can incur lawsuits, you can win on this. In fact, lots of big companies begin that way, and become big companies merely because they were lucky enough to get big enough to stand up for themselves before the legal threats started coming in the door.

It's understood that you'll be scraped if you put it online. That doesn't mean scraping is legal.

You may be confused here -- I'm not a publisher trying to stop people from doing this. I'm an entrepreneur whose business depended on scraping data from a specific source. That business got destroyed when they chose to dispatch their law firm against us.

The point of repeatedly discussing this on HN is to make the legal situation clear so that people work to change it, and to make sure people who are going into similar ventures are informed about the legal risks associated with them.

As I said on another post, I am not a lawyer, and this is according to my layman's understanding. No one should misinterpret my posts as legal advice. I'm not going to copy and paste this disclaimer into every post I make because it should be implicit, and a few IANAL disclaimers is plenty.

Well the difference is pretty clear between your case and the rest. Developers scraping a website isn't going anywhere. A business reliant on scraped data is making money off of it. That will lend you in precarious situation more.

My criticism was that you mixed in service providers and tool providers that enable businesses to make money off scraped data-the vendor cannot be held responsible for misbehaving clients, the best it can do is cut them out when requested by external parties. Toyota doesn't appear as a witness in vehicular manslaughter cases. A car is meant to take you from A to B, and it's not Toyota's fault if the customer runs over something between those two points, which is not its intended design (QVC vs Resultly).

It also doesn't help that there are pathological web scrapers who simply do not have the money to do anything fruitful, so they bootstrap using any means necessary and play the victim card when they are denied. This particular group is responsible for the majority of the litigation: people who otherwise have no business, piggybacking off somebody else and using brute force, bringing heat to everyone involved.

>Well the difference is pretty clear between your case and the rest. Developers scraping a website isn't going anywhere. A business reliant on scraped data is making money off of it. That will lend you in precarious situation more.

Developers presumably scrape websites because the data is of some value to them, frequently commercial value. Google's entire value proposition is based on scraped data, and it's one of the most valuable companies on the planet. The way the data is used is not necessarily relevant to whether the act of scraping a web page violates the law or not -- several more basic hurdles involving access, like the CFAA and potential breach of contract depending on whether the facts of the case are such that the court holds the ToS enforceable, have to be overcome before the matter of whether one is entitled to utilize the data obtained becomes the hinge.

>My criticism was that you mixed in service providers and tool providers that enable businesses to make money off scraped data-the vendor cannot be held responsible for misbehaving clients, the best it can do is cut them out when requested by external parties.

3Taps is one of the most prominent such cases and it was just the type of tool that you're claiming wouldn't be held accountable. 3Taps's actual client was PadMapper, but since 3Taps was the entity actually performing the scrape, they were the party that was liable for these activities.

The lesson we've learned from 3Taps is that scraping tools might be OK if they strictly observe any hint that a target site doesn't want the attention and cease immediately, but there's really no guarantee either way.

Most people won't sue if you adhere to a C&D, not because they couldn't do so and win, but because it's much cheaper to send a C&D and leave it at that, as long as that settles the issue moving forward. Litigation is very slow and expensive.

3Taps became liable because they put their neck out for PadMapper even after they received written letters.

It was a poorly executed business strategy because they were up against powerful legal team.

If you receive a C&D or a request to stop scraping, it's best not to continue and to just let that customer go.

You can be sued (and lose) for damages incurred by illegal activity whether the aggrieved party sends a notice or not. It's not the plaintiff's job to let you know you're breaking the law, and they're entitled to damages whether you know you're breaking the law or not.

In fact, it's assumed that defendants weren't intentionally breaking the law, which is why when it's clear that they were, courts triple the actual damages for willful violations. [0]

If a reasonable person wouldn't realize that they were "exceeding authorized access", that probably limits a potential CFAA claim, but that's it, and that's not the only potentially perilous statute when you're a scraper. In the QVC case, Resultly got lucky that QVC did not have an up-to-date robots.txt; otherwise, they very well may have been on the hook for multiple days of lost online revenue, despite their immediate cessation upon receipt of a C&D.

Again, you are more than welcome to take your perspective and run with it, and it's plausible that no one will get mad enough at you to sue over it. That doesn't change the law.

I would assume that 3Taps pursued this litigation not because they had special love for PadMapper, but because they felt it was important for their business to be allowed to scrape major data sources and thought they'd be able to win. Pretty sure Skadden was their law firm so they gave it an earnest try, but ultimately lost.

[0] https://en.wikipedia.org/wiki/Treble_damages

You can be sued for crossing the street. You can be sued for flipping the bird if someone happens to get an aneurysm from it. You can be sued for writing what you just wrote!

Yes, but if you fight it adequately, you won't lose. If you get sued for scraping, it's quite likely you'll lose, as the law has numerous pitfalls for scrapers, including things as basic as regarding RAM copies as infringing.

If you don't mind de-cloaking, do you mind sending me an email?

The difference is that judges said it was OK for Google because Google is super cool. See Perfect 10 v. Amazon.

Caching and indexing is allowed by specific laws under specific circumstances.

This is the first I've heard of this. Where/what laws prohibit web scraping?

It's almost always illegal in the United States. It's prohibited by a combination of the CFAA, copyright law, and contractual obligations imposed by Terms of Use, which are usually considered applicable if you load more than one page ("browsewrap").

The CFAA makes it a crime to access any computer network without authorization or in excess of granted authorization. The Terms of Use will usually prohibit "any automated or mechanical access" or use similar boilerplate that can be construed as a restriction on automated access. The implied license to make a copy of the page in RAM is no longer applicable and the scraper is thus infringing copyright.

Relevant cases are Craigslist v 3Taps, Facebook Inc. v Power Ventures Inc., and several others. This is at the point where it's basically well-established. The exception is Perfect 10 v. Amazon, where judges ruled that since it was Google and they don't want to break Google, it's OK. Copyright law allows such evaluations because each judge must decide whether a use was "fair" or not.

Doesn't publishing information on a public web server equals to granting authorization to download it? Why publish it otherwise?

And copyright laws are supposed to protect only creative works, not every page on the web.

>Doesn't publishing information on a public web server equals to granting authorization to download it? Why publish it otherwise?

This is the argument that there is an implied license. The counterargument is that the user agreed to the Terms of Use which explicitly defined automated access as violative. In addition, these cases generally begin with a cease and desist demand letter, which explicitly informs the allegedly-infringing party that the publisher believes their rights are being violated and that they must cease and desist immediately. If the TOS argument doesn't hold, the C&D will surely qualify as a revocation of any implied license to access the content for copyright purposes. It generally also serves as explicit notice that the publisher considers the accessor to be "exceeding authorized use" of their computer systems, which is a crime under the CFAA. In Craigslist v 3Taps, the judge also commented that needing to circumvent IP bans should've made it obvious that 3Taps was "exceeding authorized access" under the CFAA.

>And copyright laws are supposed to protect only creative works, not every page on the web.

Copyright law protects all works of sufficient originality. Pretty much the only thing it doesn't protect is a plain list of facts (and in the European Union, it even protects that, known as "database rights"). The minimum standard of originality for copyright protection applies to practically every page on the web, yes.

In effect, this means that you can copy a list of names and addresses from a phone book, but you can't copy the layout. Since you can't access a web page without making a copy in RAM, if the publisher has revoked your license to access the content, even accessing and extracting the raw factual information within the body of the page is an infringement (because your RAM copy is an infringing copy).

IANAL and this is based on my layman's understanding.

ToS is not a legally binding mutual agreement between the website and the scraper. It only matters when it's materialized in the form of a C&D and the scraping caused damage to the website. In the QVC vs Resultly, they caused the website to crash. In the Craigslist vs 3Taps, the damages were non-existant but they still won because of the C&D letters.

So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.

STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.

Disclaimer: I'm not a lawyer, this is not legal advice, consult a real lawyer and not random HN comments.

>In the Craigslist vs 3Taps, the damages were non-existant but they still won because of the C&D letters.

(Technically they settled.)

Read up on that case and you'll see that even absent a C&D, the judge reasoned that Craigslist's IP ban against 3Taps was a separate incident of affirmatively communicating its intention that 3Taps refrain from accessing the site:

    > The calculus is different where a user is altogether banned from accessing a website.
    > The banned user has to follow only one, clear rule: do not access the website. The notice
    > issue becomes limited to how clearly the website owner communicates the banning. Here,
    > Craigslist affirmatively communicated its decision to revoke 3Taps’ access through its cease-and-desist
    > letter and IP blocking efforts. 3Taps never suggests that those measures did not
    > put 3Taps on notice that Craigslist had banned 3Taps; indeed, 3Taps had to circumvent
    > Craigslist’s IP blocking measures to continue scraping, so it indisputably knew that Craigslist
    > did not want it accessing the website at all.
The judge continues to suggest that using proxies at all is atypical and may demonstrate an intention to violate the CFAA. Ruling at http://www.volokh.com/wp-content/uploads/2013/08/Order-Denyi... .

>So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.

No, that's incorrect. You can believe this all you want, and I truly hope that it all goes well for you. It's totally possible that you will never piss off someone who has the resources to file a lawsuit over it. But you should know that you can be bound by a browsewrap agreement (and that it's quite easy to be bound by a clickwrap agreement, which, in practice, may not be much different and which scrapers may automatically follow (e.g., a link that says "click here to enter and agree")).

It's really going to come down to the judge's belief that the notification is adequately prominent and that a "reasonably prudent user" would be aware of the stipulations.

Quoth from Nguyen v. Barnes and Noble (citations removed):

    > where, as here, there is no evidence that the website
    > user had actual knowledge of the agreement, the validity of
    > the browsewrap agreement turns on whether the website puts
    > a reasonably prudent user on inquiry notice of the terms of
    > the contract. [...] Whether a user has inquiry notice of a 
    > browsewrap agreement, in turn, depends
    > on the design and content of the website and the agreement’s
    > webpage. Where the link
    > to a website’s terms of use is buried at the bottom of the page
    > or tucked away in obscure corners of the website where users
    > are unlikely to see it, courts have refused to enforce the
    > browsewrap agreement. [...] On the other hand, where the website contains an
    > explicit textual notice that continued use will act as a
    > manifestation of the user’s intent to be bound, courts have
    > been more amenable to enforcing browsewrap agreements.
    > [...] In short, the conspicuousness and
    > placement of the “Terms of Use” hyperlink, other notices
    > given to users of the terms of use, and the website’s general
    > design all contribute to whether a reasonably prudent user
    > would have inquiry notice of a browsewrap agreement.
Full decision at https://d3bsvxk93brmko.cloudfront.net/datastore/opinions/201... .

>STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.

I'm not giving any legal advice as I'm not a lawyer. For the third or fourth time here, this is all according to my layman's understanding. It's based on things I learned that time I had to close my business or face a lawsuit from a massive company over just such issues.

It's crucial for companies that provide scraping services to be aware of these issues and I know of at least one such company who is aware of them and who takes several precautions to provide some distance from potential legal liability, though they are still not 100% out of the woods. As is usually required in entrepreneurship, they're taking a calculated risk. Should someone wage a legal challenge against their activity, they have millions of dollars in the bank from investors who presumably have researched this and are willing to accept the cost of the potential legal liability.

I'm not saying that people shouldn't make businesses that depend on scraping data. I just think they should know what they're getting into before they do so.

You are correct that some businesses have been able to engage in such activities without being sued out of existence up to this point. Unfortunately, that doesn't mean that others will be as lucky.

I fully agree that anyone who is seriously interested/concerned about this should ask a lawyer. I certainly did. Their answers were not good news for me. Maybe they will be for you.

So the answer is don't get IP banned. That's easy to solve.

That's potentially an answer if the judge decides that the browsewrap notice was not sufficiently conspicuous to constitute a binding agreement, etc.

I would guess that most judges would not be charitable to someone pretending that they've circumvented this by rotating through proxies pre-emptively. In fact, this would likely work against the defendant as it'd be evidence of willful infringement, which is typically 3x damages. If you can convince the judge that you were just incidentally rotating IPs to protect privacy or something, you might get away with it, but it's definitely not a simple answer, and that still only gets you up to the point of receiving a C&D.

I'm not sure why you're trying so aggressively to mislead people about the legal precariousness of data scraping, but at this point it should be clear that this is not a simple matter and it's not something to approach lightly or dismissively.

Well, looks like all the HN users of Scrapy better lawyer up because Scrapy Cloud offers exactly that, as do the rest of the web scraping vendors like Mozenda out on the market. They've all been around for 10+ years, doesn't seem like this is an issue for them.

Scrapinghub has several proactive/preventative restrictions on the sites they'll allow users to access because they're trying to avoid such liability. They've been successful up to this point and that's great. That doesn't mean that what they're doing is not a legal grey area.

For scraping-related activities, Scrapinghub would probably be the party sued, as was the case in 3Taps, though the clients could probably also be legitimately sued for various things, most obviously copyright infringement.

Again, I'm really not sure what you're getting at here. Yes, it's a great idea to check with a lawyer and assess your potential legal exposure. That's why lawyers exist! You can then ask them questions, as Scrapinghub surely has, about how to minimize that potential legal exposure. You definitely SHOULD do that, especially since scraping is more or less illegal in the United States.

Courts frequently use an analogy to private physical property to address the matter of accessing a web site. Running a business based on scraping until someone sends you a C&D is roughly the same as running a business based on trespassing on private property until someone serves you with a no-trespass order.

Maybe it will work out fine, and most of the time, as long as you leave the property promptly upon request, you probably won't have an issue just because there's no benefit in dragging the matter out further. But that doesn't mean there isn't legal risk involved in running such a business, nor does it mean that you won't be liable for damages incurred whilst trespassing.

In such a case, questions about whether the borders of the property were clearly delineated, whether "No Trespassing" signs were posted, whether a reasonable person would've understood they weren't allowed to be there or not, etc., would be asked to determine the existence and/or extent of the trespasser's liability.

In the same manner, there is substantial risk involved in running a business whose primary function is to scrape websites, and the same types of questions would be (are) asked in a court case related to network access. People deserve to be informed of that.

That's not FUD, it's just the law. If you don't like it, well, most people who know what they're talking about don't either, but that doesn't change the law. Saying "$Party_X hasn't been sued over it!" also doesn't change the law or make the process any less legally risky.

If you find this arrangement unsettling or absurd, as you obviously do, I would suggest that you direct your energies/attention to your local representatives, the EFF, and other types of political activism that may help rectify the situation rather than accusing HN commenters of spreading FUD.

When you do something illegal, you probably won't get sued for it, because it costs a ton of money to sue someone and it's not likely that you're annoying anyone enough to justify that. This is especially the case if you back off at the first sign of annoyance. That's as much as we can say for your angle.

If you're comfortable basing a business on that, be my guest.

The more I read your comment, the less I'm worried. It's clear that you are not a lawyer but someone just overly reacting to perceived legal liabilities by simply generalizing court cases and attempting to reach a conclusion that tries to fit everyone.

Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition. Businesses with a lot of cash use litigation to snuff out competition because their businesses are largely indefensible without such forceful litigation; e.g., Craigslist would not exist if they let anyone scrape them.

Businesses that build and sell web scraping services and tools are less likely to be impacted for the same reasons if they comply with formal requests to stop scraping. 3Taps received notices beyond just IP ban (this alone does not set enough of a context) but they chose to ignore it and continue on. 3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.

It's far more likely that the law exists to serve those who exploit it to protect their business interests. Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.

I just want to warn people reading your comments not to take it word for word as the reality is far, far less legally hostile: you are too small for people to go after and not an existential threat to the target website.

The argument that web scraping puts strain on web servers is a pretty laughable defense. Craigslist alone gets millions of hits every day but can't serve pages requested by a python script? 3taps fucked themselves because they took money AND they put their neck out for their customer.

That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.

>It's clear that you are not a lawyer

It should be, because I've stated it probably 6 times in this thread.

>someone just overly reacting to perceived legal liabilities by simply generalizing court cases and attempting to reach a conclusion that tries to fit everyone.

So for the seventh time, I'm not a lawyer, but isn't this how it works when questions about legality are posed? It's always based on the relevant statutes and the case law interpreting and applying those statutes. I mean, correct me if I'm wrong.

I'm glad you're not worried about someone looking at the case law and making a generalization about how it applies to the field.

If you want specific (i.e., non-generalized) legal information, you always need to discuss your individual affairs with a licensed attorney who is knowledgeable in the field and jurisdictions in which you'll be operating.

In practical terms, web scraping is usually illegal in the United States. In this case, that doesn't mean there's a law that says "web scraping is illegal", it means that there is a small group of laws, which, taken together, make it virtually impossible to scrape web pages with confidence that you're not getting exposed to potentially serious legal liability. Note that "illegal" is not the same as "criminal", but that the CFAA does provide for criminal penalties (and Aaron Swartz was being prosecuted under them for scraping research papers out of an academic database).

>Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition.

You're talking about the likelihood that a business will get sued by someone. That's great, but it doesn't change the legal status of the activity that someone is unlikely to sue you for.

My business did not directly compete with anyone. Everyone thought it primarily helped the data sources we used. People always told me that they were shocked that the company that was making the threat was upset about it. Even my lawyer said it seemed unusual and couldn't figure out what their underlying motive was.

The stakes are an important consideration, but yes, it is important to consider the impact if you do get sued/threatened by an unlikely plaintiff.

>3Taps received notices beyond just IP ban (this alone does not set enough of a context)

The 3Taps ruling casts doubt on the suggestion that an IP ban is itself insufficient notice. That issue hasn't been decided directly afaik, but the reasonable conclusion, if you are getting a 403 or a page that explicitly informs you your IP has been banned when you access a site, is that they are trying to keep you out and that further access likely violates the CFAA.
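A minimal Python sketch of acting on that reading: treat a 403 (or similar) as notice and stop, rather than retrying from another address. The `fetch` callable, the `BLOCK_STATUSES` set, and `fake_fetch` are assumptions made up for illustration; nothing here comes from the ruling, and it is not legal advice.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical policy: statuses we read as "the operator wants us out".
BLOCK_STATUSES = frozenset({401, 403, 451})

# fetch(url) -> (status_code, body); injected so the sketch runs without a network.
Fetch = Callable[[str], Tuple[int, str]]


def crawl_until_blocked(
    urls: Iterable[str], fetch: Fetch
) -> Tuple[List[str], Optional[str]]:
    """Fetch URLs in order and stop for good at the first block-style response.

    Returns (bodies collected so far, URL that triggered the stop or None).
    """
    pages: List[str] = []
    for url in urls:
        status, body = fetch(url)
        if status in BLOCK_STATUSES:
            # Do not retry, rotate IPs, or continue: record where we stopped.
            return pages, url
        if status == 200:
            pages.append(body)
    return pages, None


def fake_fetch(url: str) -> Tuple[int, str]:
    """Stand-in server for the demo: bans us on the third page."""
    return (403, "banned") if url.endswith("/3") else (200, "page " + url[-1])


if __name__ == "__main__":
    print(crawl_until_blocked(["/1", "/2", "/3", "/4"], fake_fetch))
```

The point of returning the blocking URL is that the caller has a record of when access was refused, instead of silently working around it.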

>3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.

That's definitely the risky side of the equation. The alternative side was that they'd win and be allowed to retain access to one of the largest data sources on the internet, and preferably set a precedent that allowed them to continue to scrape big data sources without concern moving forward. That gamble clearly did not pay off for them, but that doesn't mean it wasn't a reasonable gamble to take.

>It's far more likely that the law exists to serve those who exploit it to protect their business interests.

I agree, but I don't see how it's relevant. Lots of people believe that it's beneficial to their business interests to use the legal system to bully people who can't afford to stand up for themselves. Uh, congrats to them I guess? Why are you saying this like it's a normal thing? We should take steps to minimize the surface area that can be used for that.

If you're suggesting there is a small handful of bad guys to whom these laws need to apply, that's fine and I actually agree with you, but that means we need to fine-tune the law so that it only covers the bad guys, not virtually everyone if someone you're scraping is having a bad day.

You keep fighting this fight pretending like I'm saying something that's incorrect, and then you just come back and say that it doesn't matter because a) some people who scrape have not been sued; and b) people who start scraping business may not get sued if they adhere to the requests of those who politely ask them to stop. That's great, but it's neither here nor there. This is about what the law is, not whether you're going to be sued personally.

>Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.

It's all anyone can do when you're dealing with an emerging area of law, afaik.

>I just want to warn people reading your comments not to take it word for word as the reality is far, far less legally hostile: you are too small for people to go after and not an existential threat to the target website.

Yes, this is another thing I've stated multiple times. You probably won't get anyone mad enough at you to sue you. But you should know where you stand if you do. And you should try to fix the law in the meantime.

>The argument that web scraping puts strain on web servers is a pretty laughable defense.

Plaintiffs use this argument all the time and get injunctions filed on that basis regularly. Even if the defendant is not disruptive, judges say they need to issue the injunction or it will invite a pile-on effect that will be disruptive. Thus, they grant an injunction under a trespass to chattels doctrine, generally putting legal force behind a C&D.

>3taps fucked themselves because they took money AND they put their neck out for their customer.

3taps fucked themselves only because they tried to stand up and win the case. Perhaps it would've been better for them to try to lobby Congress instead and get the law transformed into something semi-reasonable, though it's likely they recognized the futility in that.

>That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.

It seems like the lesson is that web scraping is legally precarious, and that if you're not careful about it, you can end up in a lot of hot water.

You keep acting like that's an absurd conclusion, but not really showing anything to discount the onerous outcomes that entrepreneurs in this space have faced. 3Taps is not the only case where this has been addressed.

In Facebook v. Power Ventures, the corporate veil was pierced and the entrepreneur was left with $3 million in personal liability, all for trying to create software that made it easy for users to export only their own data out of Facebook. Facebook acknowledged that it did not have any copyright interest allowing it to forbid Power from accessing that data specifically, but they continued to pursue copyright claims based on the RAM copy of the Facebook site from which the content was extracted.

The point is that the current law makes scraping a perilous exercise. Perhaps you won't have problems, but that's probably only the case if a) you stay so small no one will ever target you or b) you know the law and you take extra precautions to protect your business so that any accusations of wrongdoing are clearly invalid against current law. Scrapinghub is trying to do this, but IMO it's insufficient if they get an aggressive/hostile litigant.

The truth is that Scrapinghub et al are on the precipice and they're going to stay there until precedent changes (likely through a SCOTUS override, particularly one overturning the RAM copy doctrine, which is probably plausible, and one putting constraints on the ability to revoke access to public web sites under the CFAA, which is probably not) or until the law changes. They only need to get hit with one well-placed lawsuit and they'll be goners.

You can argue til the cows come home about how they won't get sued because they stop once they get a C&D, but that's not necessarily true, and that doesn't fix the laws around scraping.

I think that is wrong too. I think that copyright laws should make a distinction between making and distributing a copy by a person (for example uploading a copyrighted file to a website) and technical processes that happen inside a computer. Copying something from NIC buffers to memory should not be "copying" under copyright law.

The law tries to make this distinction by specifying that copies have to be fixed into a tangible medium to be infringing. The problem is that the "RAM copy doctrine", as it's known, states that RAM is a sufficiently fixed copy into a sufficiently tangible medium to qualify. This doctrine has been used against scrapers repeatedly, as in Ticketmaster LLC v. RMG Technologies, Inc. (https://casetext.com/case/ticketmaster-llc-v-rmg-technologie...) :

    > The copies of webpages stored automatically in a computer's cache or 
    > random access memory ("RAM") upon a viewing of the webpage fall within 
    > the Copyright Act's definition of "copy." See, e.g., MAI Systems Corp. 
    > v. Peak Computer, Inc., 991 F.2d 511, 519 (9th Cir. 1993) ("We recognize 
    > that these authorities are somewhat troubling since they do not specify 
    > that a copy is created regardless of whether the software is loaded into 
    > the RAM, the hard disk or the read only memory (`ROM'). However, since 
    > we find that the copy created in the RAM can be `perceived, reproduced, 
    > or otherwise communicated,' we hold that the loading of software into 
    > the RAM creates a copy under the Copyright Act.") See also Twentieth 
    > Century Fox Film Corp. v. Cablevision Systems Corp., 478 F.Supp. 2d 607, 
    > 621 (S.D.N.Y. 2007) (agreeing with the "numerous courts [that] have held 
    > that the transmission of information through a computer's random access 
    > memory or RAM . . . creates a `copy' for purposes of the Copyright Act," 
    > and citing cases.) Thus, copies of ticketmaster.com webpages 
    > automatically stored on a viewer's computer are "copies" within the 
    > meaning of the Copyright Act.

That's a very, very sad turn of events, and I have to wonder, how did we get there?

I'm increasingly feeling that the law is giving way too much control over content published on the Internet to the publishers.

I agree. There is a lot more fairness in physical space that doesn't translate to cyberspace primarily due to the implementation details of computers and networks. Whereas products and machines built in the real world are primarily protected by things like patents and trade secrets, practically everything in the digital world falls under uber-restrictive copyright protections, since the "creative" work of code and its compiled/interpreted derivatives is the language by which everything is implemented.

Similarly, concepts like the "first sale doctrine" are becoming less applicable with digital delivery, as it's impossible to identify a "hard copy" of something that may be eligible for resell. That completely obliterates the secondary market for many products that are accessed through computers, including software, games, movies, and books.

The CFAA essentially allows network operators to arbitrarily make someone a felon overnight. Reddit co-founder Aaron Swartz is the most prominent example of this; his criminal prosecution under the CFAA (for scraping publicly-funded research papers out of a database) was pending when he committed suicide.

We badly need digital rights reforms, but since major companies have been allowed to profit handsomely off these shifts and since they find it rather convenient to bully small innovators with serious legal threats, which are easy to craft in this climate, it doesn't seem that anyone is making this a priority.

There's even a very highly specific (online ticket sales) bill that passed congress: https://www.congress.gov/bill/114th-congress/senate-bill/318...

We all do this, but how legal is it? If people end up in prison for pen testing without permission, how safe is it to intentionally alter the user-agent, circumvent captchas, javascript and other protections? Can that be considered as hacking a site and stealing the data?

Have you seen the sentry antirobot system? I can't remember the name exactly, but it's a hosted solution that randomly displays captchas when it senses suspicious (robot) crawling. It's a nightmare, because after you solve 1 captcha it can display 4 more, one after the other. They also ban your IP, so you need IP rotators. Any workarounds? ic3cold

Proposition: 99% of scraping use cases are eliminated if the scraper agrees to subsequently abide by the target's terms of service.

Googlebot doesn't abide by 99% of websites' terms of use.

But that's "different" because they've built a $600bn company off it.

More that the websites actually want to be found by someone.

The over-reliance on google search is both a blessing and a curse for the web. Today, google IS basically the web, a centralized version of it.

I agree with the implication here, but (1) it doesn't actually rebut the proposition and (2) some significant fraction of websites crawled by Googlebot have signed up on Google's webmaster tools, which includes its own ToS that likely governs.

I've used antigate for captchas and either Tor or proxies for 403s before. Usually the browser header alone does not help for long.

Anticaptcha and deathbycaptcha are some others. But it makes me feel sad to use them, as it exploits cheap labor overseas.

While I agree, it may also be an option for some unskilled people to earn money they really need. IMO it's better than buying cheap clothes, as solving captchas at least does not kill those people.

To clarify: I lived in some cheap third world cities, and I can certainly see that solving captchas could finance a rather nice life.

Most of the time they use OCR, humans are unreliable and rarely used.

Antigate can solve things that would give OCR fits, like animated letters and the like.

No, at least antigate doesn't. When you hit recaptcha with known proxy URLs (or generally hit it a few times per hour), the captchas get so bad that no OCR would be able to solve them; even humans struggle.

Yup, exactly. I tried Tesseract before (nothing too fancy); it didn't have problems solving it at first, but at some point it became really hard.

I think part of it is how you crawl (phantomjs, for example, seems to hit a captcha almost every time), but things like IP and proxy usage could make this trigger more often.

That doesn't seem to match the sales pitch at antigate.com, deathbycaptcha, etc.

It's not exploitation.

Good article! I've been scraping for the last 10 years and I've seen lots of different things sites try in order to keep us out. I'm also on the other side, protecting websites by banning scrapers. So funny!

I'm in the same position for the first time (protecting against scraping) and honestly I'm kind of blind right now. Which is weird because of how much scraping I've done (okay not that much). Any tips or tricks or blogs you know of off the top of your head for protecting your site?

Virtually everything can be easily defeated. The only outfit I've consistently seen put up a good fight is Distil. They do it by acting a little like Cloudflare. They put their servers in front of your www facing endpoints and use ML to mine their global client traffic to identify bot signals (aided by some aggressive in-browser javascript fingerprinting).

Yeah, Distil is the first outfit I've encountered where they've got the model to make it really hard to reliably bypass. It comes down to "I can spend a significant amount of time trying to bypass this, and I would, but they would likely identify and block me again within a few weeks at most.", and it's not worth it when it's only part of what I need to do to scrape some data, and it's their entire job, and they can afford to hire multiple people.

The economics are in their favor, and I make it a point not to fight economics when I recognize them, it's rarely sustainable.

Distil is really interesting.

Interesting thanks.

Over the years, I've arrived at the conclusion that everything can be scraped. All you can do is put up as many walls as you can. But if someone really wants to crawl your site, with the right knowledge they will be able to do it despite all your walls.

Yes that's what I've assumed as well

What if the target page is blocking by IP address and if even with 20 different IP addresses you wouldn't be able to fetch all the data you need in a month?

Professional proxy services. Price, IP pool size and quality vary hugely but if you're not trying to scrape an aggressively defended target and don't need to make more than a handful of requests per second, 100K IPs will usually be more than enough to circumvent most rate limits and a pool that size can be rented for under $100/month.
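To put rough numbers on that: a back-of-the-envelope sketch in Python, assuming requests are spread evenly (round-robin) across the pool and taking "a handful of requests per second" to mean 5. Both assumptions, and the function name, are illustrative only.

```python
def seconds_between_requests_per_ip(pool_size: int, total_requests_per_second: float) -> float:
    """Average gap between two requests from the same IP, assuming the total
    request rate is spread evenly (round-robin) across the whole pool."""
    if pool_size <= 0 or total_requests_per_second <= 0:
        raise ValueError("pool_size and total_requests_per_second must be positive")
    return pool_size / total_requests_per_second


if __name__ == "__main__":
    # Assumed rate of 5 requests/second overall.
    for pool in (20, 100_000):
        gap = seconds_between_requests_per_ip(pool, 5)
        print(f"{pool} IPs: one request per IP every {gap:.0f} s ({gap / 3600:.2f} h)")
```

With 20 IPs each address is hit every 4 seconds; with 100K IPs each address shows up roughly once every five and a half hours, which is why a per-IP rate limit rarely triggers.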

Interested to know where to get 100k proxies for $100/mo. Can you give some options?

What if they use that before:after thing where the content takes say a couple of seconds to appear so when you try to scrape the site it appears that nothing is there. I have only used HTMLSimpleDom scraper with PHP at this point.
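If the content is filled in by JavaScript after the page loads, re-fetching the raw HTML will not help; you need something that actually runs the page's scripts (a headless browser, for instance). Whatever does the rendering, the general pattern is to poll until a readiness check passes. A hedged Python sketch, where `render` and `is_ready` are hypothetical callables you would supply yourself:

```python
import time
from typing import Callable, Optional


def wait_for_content(
    render: Callable[[], str],
    is_ready: Callable[[str], bool],
    attempts: int = 5,
    delay_seconds: float = 1.0,
) -> Optional[str]:
    """Call render() up to `attempts` times, sleeping between tries,
    and return the first HTML for which is_ready(html) is true.
    Returns None if the content never shows up."""
    for attempt in range(attempts):
        html = render()
        if is_ready(html):
            return html
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return None


def make_fake_render(empty_tries: int) -> Callable[[], str]:
    """Demo stand-in for a real renderer: the page is empty for the
    first `empty_tries` calls, then the content appears."""
    calls = {"n": 0}

    def render() -> str:
        calls["n"] += 1
        if calls["n"] <= empty_tries:
            return "<div id='data'></div>"
        return "<div id='data'>42 listings</div>"

    return render


if __name__ == "__main__":
    html = wait_for_content(make_fake_render(2), lambda h: "listings" in h, delay_seconds=0.1)
    print(html)
```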

Sometimes it's also necessary to spread requests over numerous IP addresses.

Enjoyed learning this and playing with it. What would you recommend storing this sort of data in? Not too keen on going with the traditional MySQL.
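One low-ceremony option, offered only as a sketch: SQLite from Python's standard library, with each scraped record stored as a JSON blob keyed by URL. The table name, column names, and helper functions below are made up for illustration.

```python
import json
import sqlite3
from typing import Any, Dict, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store; ':memory:' keeps the demo off disk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    return conn


def save_record(conn: sqlite3.Connection, url: str, record: Dict[str, Any]) -> None:
    """Insert the record, replacing any earlier scrape of the same URL."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, data) VALUES (?, ?)",
            (url, json.dumps(record)),
        )


def load_record(conn: sqlite3.Connection, url: str) -> Optional[Dict[str, Any]]:
    """Return the stored record for a URL, or None if it was never scraped."""
    row = conn.execute("SELECT data FROM pages WHERE url = ?", (url,)).fetchone()
    return json.loads(row[0]) if row else None


if __name__ == "__main__":
    store = open_store()
    save_record(store, "https://example.com/a", {"title": "A", "price": 10})
    print(load_record(store, "https://example.com/a"))
```

Storing JSON text keeps the schema flexible while the shape of the scraped data is still changing; if you later need to query individual fields, you can promote them to real columns.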

The first part seems like a very long-winded way to say "don't use the default user agent".
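For reference, the short version in Python's standard library looks roughly like this. Left alone, `urllib` announces itself as `Python-urllib/<version>`; the user-agent string and URL below are placeholders, not values from the article.

```python
import urllib.request

# Placeholder identity: a descriptive agent string with a contact URL.
USER_AGENT = "example-scraper/0.1 (+https://example.com/bot-info)"


def build_request(url: str) -> urllib.request.Request:
    """Build a request that sends our own User-Agent instead of urllib's default."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    req = build_request("https://example.com/")
    # urllib stores header names capitalized as "User-agent".
    print(req.get_header("User-agent"))
    # To actually send it: urllib.request.urlopen(req, timeout=10)
```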

The captcha was unusually simple to solve, in most cases the best strategy is to avoid seeing it in the first place.

Nice overview! The "unfortunately-spelled threat_defence.php" just uses British spelling though.

What's wrong with British spelling? It's also the English spelling used in India, Australia, New Zealand etc. By pure numbers, more people may spell it defence than defense. Americocentrism is quite annoying from the other side :)

I'm not saying anything's wrong with it. The "unfortunately named" bit is from the article, and I'm just pointing out that the author's snark is ill-placed.

Nothing? They were pointing out that the "unfortunately spelled" file is merely using the British way of spelling.

I'm wondering how they were misunderstood, given they even used quotations to show they were quoting the article which is the one that made the claim that it was "unfortunately spelled".

you are shooting the messenger.

Sort of a tangent, but I doubt more people speak and write RE over CE.

Well, I did not know that defense was the American spelling. I spent ages looking at that trying to find out which one was the British and which was the American spelling. You learn something every day.

too bad it's named for ovine prions.

Try lynx

Great article!
