How Google’s Web Crawler Bypasses Paywalls (elaineou.com)
640 points by elaineo on Feb 19, 2016 | 232 comments



"Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody."

See also: http://www.apple.com/customer-letter/

:)


I came back here to post this line! It's so perfect.


I was going to post here that there were ways around that; proper security, cryptographic access control... And then I saw the light. ;)


Haha! Thanks for catching that :)


Google also specifies the IP ranges of their bots; just UA checking is sloppy


Hmmm, the following seems to contradict that; instead, Google recommends verification by DNS lookup: https://support.google.com/webmasters/answer/80553

'Google doesn't post a public list of IP addresses for webmasters to whitelist. This is because these IP address ranges can change, causing problems for any webmasters who have hard-coded them, so you must run a DNS lookup as described next.'
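
For anyone who wants to implement that: a rough sketch of the documented verify-by-DNS procedure (reverse lookup on the requesting IP, check the domain, then a forward lookup to confirm), using only the Python standard library; the sample IP is purely illustrative:

  import socket

  def is_googlebot(ip):
      # Reverse DNS: a real crawler's IP resolves to googlebot.com or google.com
      try:
          host, _, _ = socket.gethostbyaddr(ip)
      except OSError:
          return False
      if not host.endswith(('.googlebot.com', '.google.com')):
          return False
      # Forward-confirm: the hostname must resolve back to the original IP
      try:
          _, _, addrs = socket.gethostbyname_ex(host)
      except OSError:
          return False
      return ip in addrs

  print(is_googlebot('66.249.66.1'))  # example address, purely illustrative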


If anyone can tell me how to embed this particular DNS query into a firewall appliance ACL, please let me know. I have customers with Watchguard and Cisco gear who have intermittent trouble with Google's mail relay (formerly Postini). The solution is always to perform that record lookup and add the new IP address blocks.
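
Not a real answer for doing it inside the appliance, but one workaround is to script the lookup externally and push the results into the ACL. Google publishes its netblocks via the SPF records under _spf.google.com (whether that list covers the mail-relay case is an assumption on my part); a rough sketch using dnspython:

  import dns.resolver  # third-party package: dnspython (>= 2.0)

  def google_netblocks():
      # Walk the include: chain starting at _spf.google.com and collect ip4:/ip6: ranges.
      blocks, pending = [], ['_spf.google.com']
      while pending:
          name = pending.pop()
          for rdata in dns.resolver.resolve(name, 'TXT'):
              txt = b''.join(rdata.strings).decode()
              for token in txt.split():
                  if token.startswith('include:'):
                      pending.append(token.split(':', 1)[1])
                  elif token.startswith(('ip4:', 'ip6:')):
                      blocks.append(token.split(':', 1)[1])
      return blocks

  # Feed the returned CIDR blocks into the firewall ACL however the appliance allows.
  print(google_netblocks())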


Ah, they must have changed that; the last time I checked (a few years ago) they listed IPv4 addresses.


Except of course what FBI is proposing wouldn't give access to "anybody". Just unfettered access for FBI, CIA and NSA by way of gag orders and national security letters, forcing Apple to break security of their hardware and not speak about it publicly. Of note is the fact that it wouldn't even give direct access to FBI, since they don't have the firmware keys (yet).


In this example, the FBI is the "trusted third party", but by giving them access, we inevitably open access for everyone, as the system is no longer strongly secure. The trusted third party in the quote isn't asking for access for everybody either, but in the end that's what happens.


Apple isn't giving access. Apple would be required (by court, unless they manage to fight this off) to install a signed custom build of the OS in order to give access to that particular device. FBI would not have this build, nor a key to create their own signed custom build.


Yup, that's the FBI's pitch. The issue is that it sets a legal precedent as well as potentially leaking a backdoored iOS to the world. Yes I know "signed for a specific device", best of luck with that.


Yeah, but if you don't trust those companies to be secure and to maintain that security across all their employees and contractors (see: Snowden), then you must assume giving someone access is the same as giving everyone access.


If they're now blocking clicks from Google, doesn't that mean that they're cloaking and violating Google's Webmaster Guidelines [1]?

[1]: https://support.google.com/webmasters/answer/66355?hl=en


Google is not okay with cloaking, but they will whitelist publishers if the publisher specifically includes a parameter that declares whether the site requires registration or subscription. This is done in the sitemap.

https://support.google.com/news/publisher/answer/74288?hl=en
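
As a rough illustration, the relevant piece is the <news:access> tag in the News sitemap entry (values "Subscription" or "Registration", per that page); the article URL and publication name here are made up:

  # Minimal sketch of a Google News sitemap <url> entry flagging paywalled content.
  # Everything except the <news:access> tag and its allowed values is placeholder data.
  entry = """
  <url>
    <loc>https://www.example.com/articles/some-paywalled-story</loc>
    <news:news>
      <news:publication>
        <news:name>Example Journal</news:name>
        <news:language>en</news:language>
      </news:publication>
      <news:access>Subscription</news:access>
      <news:publication_date>2016-02-19</news:publication_date>
      <news:title>Some Paywalled Story</news:title>
    </news:news>
  </url>
  """
  print(entry)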


https://support.google.com/news/publisher/answer/40543?hl=en seems to specifically ban this. WSJ is in violation, not fitting any of the categories there.


I didn't know that, thanks. But reading it, that option seems to be about Google News, not the main Google Search.

> News-specific tag definitions

> Yes, if access is not open, else should be omitted

> Possible values include "Subscription" or "Registration", describing the accessibility of the article. If the article is accessible to Google News readers without a registration or subscription, this tag should be omitted.


WSJ is big enough to negotiate their own terms with Google Search.


They're really not. They need Google a lot more than Google needs them.

One of these things is going to happen:

(1) They end this "experiment."

(2) They stop serving Google the full content. (And see their rankings drop accordingly.)

(3) They get delisted for cloaking.


Yes, and I suspect that's what's going to get them to change it back again. After a ban from Google and the traffic drop it'll likely bring, that paywall is likely coming right back down. An awful lot of media companies have made similar mistakes before, and it's always ended with them quickly removing their 'workarounds'.


Yea, except they're not going to get that ban.


If this is true, what WSJ is doing is called "cloaking" and should cause it to get de-indexed: https://support.google.com/webmasters/answer/66355?hl=en


Conversely, everyone should actively cloak and use randomly generated numbers to dynamically serve variants of their content, similar to how mapmakers use trap streets. That way, like, another company wouldn't be profiting directly off of their work and threatening to, sort of, destroy their entire business if they disagreed.


What? Any site can very easily not be in Google if they choose to. It's a very dumb decision for a news site, but you're free to do it.


It's a false choice without a compelling alternative. Like saying anyone upset with the status quo should vote. I was joking a bit, but I also wasn't.

Google has end-to-end control over some users' internet experience, and much of it in other cases. They own:

* 100s of thousands of servers

* domain registrar

* ~50% of web browsers in US.

* code CDN, FontService

* define web standards

* hundreds of millions of emails.

* CA implementation

* ISP infrastructure

* Develop software for a large part of the mobile ecosystem.

* decide what you see when you go to search (most search engines copy Google, buy results, or both)

* also many of the web beacons and advert targeting.

* oh, and the largest collection of video and images in the world.

So when you say "just do what they say or get deindexed," and present it as if that were reasonable (not just you, but the collective you), I just think I must be insane.

I mean, assuming Google is good (I do, mostly) doesn't mean I would let them become the entire internet.

Real question: if Google were to disappear, vs. the "too big to fail" banks that would have gone under (where a case could be made for a few certainly failing), what would have a bigger impact today?

Tl;dr: everyone cares about single points of failure except at the macro system level: finance, banking, healthcare, etc.


There are certainly more you omitted (similar to your code CDN for JavaScript, fonts, etc.)... consider something like reCAPTCHA.

In my experience, all those requests to api.recaptcha.net get forwarded to "www.google.com"

My experience has been that if a user for whatever reason cannot access the IP du jour for www.google.com (www.google.[cctld] will not suffice) then that user is prevented from using the myriad websites that rely on recaptcha.net.

Now, I could be wrong and maybe there is something I am missing, but in my experience this is a sad state of centralization and reliance by websites on Google. Quite brittle.


Right, that was my point. I actually did include the code CDN and fonts in the original list; regardless, I think Google is an awesome company, but as a community it is just downright irresponsible to fork over this level of control to one entity.

While I consider Snowden a proper hero, it is almost a certainty that something similar could happen to a "friendly" entity like Google. That is, the NSA likely has some top programmers who could get a job there and compromise something, learn enough info to find a vuln, or pass data out. This is of course making the massive assumption that they aren't already cooperating at a system level, either voluntarily or involuntarily.

As you can see, as search deteriorates Google is motivated to (in my opinion benevolently) use any means necessary to continue to fund their larger goals of a connected and automated techno-utopia. However, they will be tempted to leverage what amounts to almost literally 50% of the world's thoughts to build systems that make short-term profit while pressing forward.

Just a few that immediately come to mind:

* using their network to control an entire alt coin ecosystem

* using data trends to trade on global markets

* start a competing business and deindex or penalize a competitor.

* build skynet (kind of joke)

So basically, those scenarios are fairly suboptimal, and I could certainly imagine that several thousand geniuses with knowledge of Google's systems AND the world's data could likely be profitable quickly.


> as search deteriorates

Can you expand on why search is deteriorating? Honest question. I certainly don't see search's relevance sinking, nor can I see any competitor in the market that could even come close to threatening Google's monopoly on search.


It is my position that search can never be decoupled from the browser, and when I say "search" in the statement you are referring to I mean Google, as it is peerless for English-language search.

Search is in fact massively expanding as tooling and machine-learning capabilities increase due to research and hardware. Similarly, Google, Apache, and Elastic have many open-source libraries for search, indexing, storage, caching, and serving which allow for scalable architectures. Also, outside of the things above like crawlers and Hadoop, Solr, etc., Microsoft and Google have open-sourced JS engines and Node, and the Electron framework, the Brave browser, and node-webkit are built on technology that leverages this.

So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled-down version of Google can be built and trained on a per-user basis and kept completely private.

I have hashed out the solution in more detail elsewhere, but on to the actual question: is search deteriorating?

* My results seem to be worse and I have much less control than before. Anecdotally, it seems as if quotes and boolean ops are respected less.

* Discovery is a huge issue that Google solved well; now we have the opposite conditions but the same problem. Back then there were very few sites and it was hard to know what content was on them; now there is too much content.

* Without fine-grained control over my search I can't make the distinction between information vs. links. This is needing a date or a well-accepted piece of content/documentation vs. finding some new apps or non-facts. DuckDuckGo is quite good for some things and Google is good for others. Sometimes you may want to eliminate all WordPress sites (many content mills are built on it) or remove Alexa-ranked links from your queries if you need to discover something.

* Time is bad. E.g., I have a problem with a JavaScript function and get back results from 3 years ago. This is amazing and difficult to do, so commendable, but I need newer info as the pace changes. E.g., news.

* Need to eliminate sites and content I don't want. NOT something like a content filter for porn or whatever, something like the rules below (a rough sketch of how this could look in code follows the list):

     never return results from %news-websites older than 30 days.

     never return content posted %before nov-2014

     remove links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] for reputation ranking


     decrease links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] by [80%] for reputation rankings
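
A rough sketch of what applying rules like these to a result set could look like client-side (the result format and the domain lists are hypothetical stand-ins for the %-groups above):

  from datetime import date

  NEWS_DOMAINS = {'example-news.com'}          # stand-in for %news-websites
  DOWNWEIGHT_DOMAINS = {'example-mill.co.uk'}  # stand-in for %Alexa-1000 / %Wordpress

  def apply_rules(results, today=None):
      # results: list of dicts with 'domain', 'published' (a date), and 'score'
      today = today or date.today()
      kept = []
      for r in results:
          # never return results from news sites older than 30 days
          if r['domain'] in NEWS_DOMAINS and (today - r['published']).days > 30:
              continue
          # never return content posted before Nov 2014
          if r['published'] < date(2014, 11, 1):
              continue
          # decrease the reputation score of down-weighted domains by 80%
          if r['domain'] in DOWNWEIGHT_DOMAINS:
              r = dict(r, score=r['score'] * 0.2)
          kept.append(r)
      return sorted(kept, key=lambda r: r['score'], reverse=True)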


There are other things but so far my point has been:

* Google provides no versatility in its results.

* Many pieces of well-tested software would make it easy (for the right group of software engineers) to silo crawl data and parse it with a user's own parameters.

There is a way to set up this ecosystem that I have been thinking about, but to conclude:

Google is fucking awesome and really, really good at what they do. Search experience is getting worse in terms of control, but tooling is leagues better. Google sees this and is working on loftier goals internally (I imagine); thus it has split up into a meta-company that will work as an accelerator for growth while capitalizing on some verticals, like the real-estate thing they are doing or the delivery service they just announced, to stay profitable in the short term before they can achieve their end goal. Also, advertising is an unsustainable paradigm for internet growth, for many reasons.

Notes:

The DOM is super fucking horrible.

The Parsing engine is a great fix for a fucking horrid DOM.

DNS security is fucking horrible.

The Next google will be a browser & an optimization marketplace.

I don't think compiling to WebAssembly makes sense, but I could be totally wrong. I think something like Docker would provide a sandbox that would let people get performance and versatility and sidestep the entire DOM / only-run-JS / apps-vs.-content thing. No idea how this works on mobile though.


Wow, awesome response. Will need to let that sink in.

> So, as someone who is not an information architect or data scientist, it seems like we have an ecosystem where a scaled-down version of Google can be built and trained on a per-user basis and kept completely private.

I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.

You implicitly address that a bit later by talking about "silo crawling", but again, I'm skeptical. The only silo structure that I can easily see is large sites with useful content like Wikipedia or StackOverflow/StackExchange, but I'm likely to come across these anyway in any given domain, and I can easily filter for these on Google today, e.g. "site:en.wikipedia.org". The more interesting and hard part is the long tail of small, sparsely interconnected websites which might contain unusual insights but are unlikely to turn up with a silo crawler (or with Google's current UI, for that matter).

> Search experience is getting worse in terms of control

I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users. Maybe Google is learning from Apple, whose UIs have for a long time favored ease of use over giving control to the user.


> I'm skeptical. Search is a huge problem just because of the bizarre amount of resources you need to throw at it. I can't afford to build my own datacenter(s) to host my custom search system. There might be huge advances ahead in terms of storage capacity on commodity systems, I don't know, but in any case, I'm only one person crawling webpages versus millions of people (and bots!) creating them.

I have been thinking about this and have come up with some ideas; other people would obviously provide more ideas, and a solution could be reached. Some of my thinking:

A service that behaves like AWS/Git/DNS/Google combined.

* A user runs the service and indexes the data it receives, and there is a central repository of information that a user can contribute to or not. Initially, a new user would either buy a crawler or a cache of data from the market and store it locally or on a private server bought as part of the service. A blockchain, or some other verification mechanism, would be used to provide access to the initial seed data, and a hash would verify the contents. The user now has a running cache of data s/he can connect to with a private DNS-like verification system. The user runs a search.

* The parameters do not return the results s/he wanted from their private store. Similar to DNS, they move up the food chain to the service provider (whoever creates this system, or one of the companies/orgs providing the service) to get more data. Here there is a centralized repo of information. This can be a market or platform. People can buy and sell data, filtering mechanisms, and crawlers. Also, people can include all of their searches, or some of their search results, into the master crawl. This would be the "datacenter", but it can also be a platform that maps to many people's individual caches:

> So if I indexed and codified everything about the Beatles I could sell this to the market by running my own server.

> I could sell a crawler that is really really good at finding all musicians and music to the market.

> I could sell a filtering/parsing engine plugin for the music guy's crawl results (or all results it is fed) that only delivers high-quality FLAC audio files and converts high-enough-quality MP3s to FLAC, all this but only for tracks with a saxophone.

However, the fucking music guy's crawl stack doesn't have the shit I want in it.

I can buy (or write) a master crawler that goes out onto the internet and finds what I am looking for then delivers it to my private cache, and if I am generous, codifies it in a generally accepted meta language and inserts it into master.

Obviously there is much more here but what I am talking about is distributed and optimized search.
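
A very rough sketch of the DNS-style lookup being described (check the private cache first, then move up the food chain to a shared upstream repo); all the class names and data here are hypothetical:

  # Hypothetical sketch: local cache first, fall back to an upstream shared repository.
  class LocalCache:
      def __init__(self):
          self.index = {}                     # query -> list of results

      def search(self, query):
          return self.index.get(query, [])

  class UpstreamRepo:
      # Stand-in for the central market/platform; a real one would be a network service.
      def __init__(self, data):
          self.data = data

      def search(self, query):
          return self.data.get(query, [])

  def resolve(query, cache, upstream):
      results = cache.search(query)
      if not results:                         # cache miss: move up the food chain
          results = upstream.search(query)
          cache.index[query] = results        # keep a local copy for next time
      return results

  upstream = UpstreamRepo({'beatles flac': ['result-a', 'result-b']})
  cache = LocalCache()
  print(resolve('beatles flac', cache, upstream))  # fetched from upstream
  print(resolve('beatles flac', cache, upstream))  # now answered locally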

Notes: Google has a nearly impossible job:

* It does not allow a user to provide any filtering outside of some boolean operators and human language.

* Therefore it never knows exactly what the user wants.

* Provides a general service so to some extent it is one size fits all.

* Difficult to do machine learning because it can watch you make selections but may not ever be able to tell what the deliverable was or if you were successful.

* You cannot back out or modify the algorithm it uses to find results. Necessarily, even when it knows what you want it is biased, because it shows you the results and defines the algorithm. Also, per a fact I am baselessly making up right now, only 2.1% of users ever go to the 3rd page, which means that if Google is wrong, it can't know, and the problem compounds as users see the same bad pages and keep clicking them.

> I guess that's the classical problem of scaling a product to a large audience of mostly technically illiterate users.

Yes. I am not saying Google is doing a bad job. They have a nearly impossible task if they only use a search bar with natural language and zero filtering to deliver trillions of terabytes of data to millions of people. I am not sure how much easier it would be, but certainly n times easier, if filters worked.

Also, I think the idea of HTML is basically shit EXCEPT for the meta language. If not some simple JSON, the actual results need fucking tags, not just the content; then we could filter down further and better.

Group annotation.

File sharing.

Bitcoin payment for content/filtering/cooperation

Running arbitrary code in a sandboxed environment like docker, not a "DOM"

Also, the silo concept is like DNS, if I didn't explain it super well. You have a cache on your computer, a cache in the cloud, access to a master cache of information (both receivables and lookups of other silos) and an optimization market for searching through data, or finding more of it if necessary.

It's 100% obvious search will end up this way. Brave Software seems to sort of get this. I am hoping they realize a browser can't be decoupled from search though, because you can't just fork Electron and put some plugins in it. They are super talented. I am hopeful. One of the systems similar to what I am suggesting is called memex-explorer. However, I have never used it, as the build is currently failing. It was originally funded by DARPA and NASA JPL, then one day all work stopped on it, and I have emailed and tweeted some of the people and orgs with no response. So while doing research, the description seems somewhat in line with my thinking.

The large problem of scaling is handled by the market. Search is essentially an API to call APIs that call an RSS feed, if you think about what your browser and Google are actually doing. Knowing what those APIs do is pretty fucking important.

I try to share this info with people but they all think I am fucking insane. Does this sound that far-fetched? Honest question.


Ha yea, why let Google have the power to destroy your business when you can burn it to the ground yourself!


Correct me if I'm wrong, but wasn't there a long-standing Google policy that the version of the page served to their crawler must also be publicly accessible? That would then be the reason why WSJ articles were accessible through the paste-into-google trick, rather than because WSJ was incompetent and failed to "fix" the bypass.

So does it mean that Google will no longer index full WSJ articles, or does it mean a change in Google's policy?


You are correct, Google requires that you let users see the first click for free if you want to index content behind a paywall. [1]

Since this is billed as an "experiment" I'm guessing that WSJ is just testing the waters. If they roll it out to everyone, they will have to serve only snippets to Google or risk getting delisted.

[1] https://support.google.com/news/publisher/answer/40543?hl=en


And congratulations, you have likely just "exceeded authorized access" and committed a felony violation of the CFAA punishable by a fine or imprisonment for not more than 5 years under 18 U.S.C. § 1030(c)(2)(B)(i).

From the ABA: "Exceeds authorized access is defined in the Computer Fraud and Abuse Act (CFAA) to mean "to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter."

To prove you have committed this terrible felony, the FBI will now demand that Apple assist in disabling the secure enclave of your device in order to access your browser history. But remember, they only need to do this because they aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP request your machine ever makes. </s>


User agent strings have a long history of being intentionally misleading. IE 11 claims to be "Mozilla/5.0". Chrome claims to be "Safari/537.36". The User-Agent string is all lies, and has been ever since the first site started doing UA sniffing.
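
To underline how little the header proves: it is entirely client-chosen, so there is nothing for a server to verify it against. A minimal sketch with the Python standard library (the URL is just a placeholder):

  import urllib.request

  # The User-Agent header is just another request header the client picks;
  # nothing stops it from being any string at all.
  req = urllib.request.Request(
      'https://example.com/',
      headers={'User-Agent': 'Mozilla/5.0 (compatible; DefinitelyNotARobot/1.0)'},
  )
  with urllib.request.urlopen(req) as resp:
      print(resp.status, len(resp.read()))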


It's intent that matters. Setting user-agent in order to properly render a page is legal. Setting a user-agent string to gain access to otherwise unauthorized content is probably not.


In that case, I'm just going to set my User-Agent permanently to Google's crawler for fun and see how the web renders, in general. Just for fun, because I want to see what the world looks like from a different perspective. That's my intent. I've stated it. Now that my user agent has been changed, I'm going to go grab lunch, carry on with life, maybe check out some cat pictures and then maybe some news sites over tea and snacks, and then go for a run.


In my opinion the WSJ has a major rendering bug that affects everyone except for the google search crawler, and I selectively set my User Agent to get around said bug.


In any judge's opinion, you'd be full of shit. Judges aren't stupid.

Which doesn't matter, because the WSJ is never going to sue you. But make sure you only consider your justification a personal one, not one that would provide any legal protection.


Most judges and legislators are fucking morons. Legal systems aren't a science. I wrote a thing on it a few years ago:

http://khanism.org/security/legality/


I thought the problem was that it doesn't matter if WSJ is malicious in this case. Because the CFAA is a criminal matter, zealous police departments and district attorneys' offices can pursue heavy-handed cases even without much cooperation from "the victim", which in this case is a corporation.

All I want to say is scrape cfaa now!


> scrape cfaa now!

pun?


Intent can be many things.

I disable JavaScript, so I can't even see their page (and hence I don't know my access is being denied, since I got a 200 HTTP response, which means "OK"), so I try different user agents with the intent of reading the content they are providing. Just like the Microsoft case.


Unauthorized? If you put up a sign and tell people they have to pay to look at it, is it illegal to look at it and not pay? This should be a rhetorical question.


[I am not a lawyer and this is not legal advice.]

It is not specifically defined in the law, so it reverts to the traditional meaning: anything the owner of the system says you aren't authorized to access.

It's lunacy, I know. That's what HTTP headers and WAFs and such are for. But that's the stupid law, and it sent someone who used to be my friend to federal prison for changing a user agent and referrer and accessing unprotected data on the web.

Tread carefully.


Whoa. Care to tell that story or link to it?


I believe this refers to the Weev case.


It may, but weev is probably not the only person who's been put away for that. This type of activity is the basis for Google and many other tech startups. I hope they catch Larry and Sergey soon; they've been on the lam for almost two decades!


But that one is for "greater public good" you see .... ;)


18 USC § 1030 (a) (4) https://www.law.cornell.edu/uscode/text/18/1030

  Knowingly and with intent to defraud, accesses a 
  protected computer without authorization, or exceeds 
  authorized access, and by means of such conduct 
  furthers the intended fraud and obtains anything of value,
The courts have interpreted "protected computer" as any computer connected to the internet.


> The courts have interpreted "protected computer" as any computer connected to the internet.

The mismatch between the world views of jurisprudence and engineers is a neverending source of joy. (If working in tech has made you cynical like me, that is.)


User-agent string is not an authorization mechanism.


Interestingly, the CFAA does not define the term "without authorization"; however, it does define "exceeds authorized access" exactly as I quoted above:

  - to access a computer with authorization and to use such access
    to obtain or alter information in the computer that the accesser
    is not entitled so to obtain or alter.
So arguing User-agent is not an authorization mechanism probably won't help you, because exceeding authorization means, first, that you were authorized to access the computer (HTTP GET returns 200) but then that you used that access to obtain information in the computer that you were "not entitled so to obtain."


If the computer on the public Internet responds to a standard HTTP request, it has explicitly authorized my access to whatever information it sent me.


Unfortunately this will not help your defense. Andrew "Weev" Auernheimer was convicted of violating CFAA for exactly this (although the conviction was later overturned on a technicality).

Again, exceeding authorized access means using your authorized access to obtain information you were not "entitled" to. So the question is not 'were you authorized' but rather it is 'were you entitled' to that information? WTF 'entitled' means is another question entirely, but likely it is in the eye of the beholder. A jury decided Weev was not 'entitled' to the email addresses he downloaded from AT&T, and it's safe to assume we are not 'entitled' to free access to WSJ's content. So I would not rest your hopes on the "200 OK".


> Andrew "Weev" Auernheimer was convicted of violating CFAA for exactly this (although the conviction was later overturned on a technicality).

So, your example is...not an example?

>WTF 'entitled' means is another question entirely

No, in this case it is very clear: a request containing a particular user agent string is entitled. I have not tried this myself, but presumably you could verify that is the case by sending a request with the appropriate user agent.


It's the best example we've got. The case was overturned (after he spent quite some time in federal prison) not because it was found that he didn't violate the CFAA but because the charges were brought in the wrong jurisdiction.

Again I think you're confusing the fact someone could trick the server into delivering the content for free with WSJ intending to deliver their content to you for free. Since WSJ clearly intends their content to be delivered to only Googlebot for free and to users only if they pay, it is likely a jury would consider this a violation of CFAA.

A web server returning 200 OK is not ipso facto a guarantee the person making the request is not committing a crime. To give a more obvious example, if the request header contains a stolen authorization token. The law does not require the access control be non-trivial to defeat.

I don't like it, and I think the CFAA is seriously problematic, but it is the law and the Feds have been known to enforce it.


As far as the law is concerned, he did not violate anything. You do not have to prove yourself innocent, the burden is on the prosecutor to prove a violation. In the example you cite, no violation has been shown.

It is not at all clear that WSJ intends Googlebot to get their content for free while others must pay. This is actually against Google's policies, which would actually call into question whether WSJ's behavior is felonious. WSJ may not be entitled to be incorporated into Google's index, yet they are manipulating the Googlebot to the contrary.


If you will down mod, at least show how I am wrong!


> So, your example is...not an example?

At least in the US, the law doesn't work that way. Decisions will quite often cite some other similar case which reached the opposite conclusion, but under different circumstances, because that other case's decision says something like "X, if it weren't for Y" or "Fortunately for the defendant, they didn't Z, so not X", or something. That isn't binding precedent for the judge to apply X, but it's a very strong sign that X would be reasonable.

A court case that says "Yes, this violates CFAA but we have to throw out the case because A, B, and C" is very strong reason to believe that, if the next prosecutors avoid A, B, and C, the next judge will say "Yes, this still violates CFAA."

(IANAL but I read court cases because I find it useful to understand my jurisdiction's legal system.)

> a request containing a particular user agent string is entitled.

The phrasing of the law is very clear that the word "entitled" applies to a person, not to a request. Stealing someone's password and using their account is definitely a violation of CFAA (see e.g. http://www.wiggin.com/16332). In such a case, the account used to log in is quite plainly "entitled" / "authorized;" that's how you get the data. But the person logging in is not "entitled".


We aren't talking about user names and passwords, but user agent strings.

The other decision was vacated. The jury was not appropriate and their decision is irrelevant.


You are confusing an engineering process with a legal one.

Unfortunately, when the prosecutors come, they will only care about the legal process.


It's not engineering, it's common sense.


Common sense is not a set of legal procedures and rules either.

The legal world cares about how the law applies to the facts of the case, not about how common sense applies.

Not saying I like it.


The facts are dictated by the engineering. Is a lawyer a computer networks expert? Not by default. They will need to defer to the engineers.


Certainly, some of the facts are dictated by the engineering.

However, the set of people "authorized" is not, at least not from a legal perspective. This is what the case law says. The set of people who technically _can_ access the data is different from the set of people legally authorized to access the data.

That might not be what the engineers who designed the system, run it, and produce the content intended, but that is what the law says.

It's a bummer the two disagree. But only one of the two systems put you in jail if you cross them.

You and I may wish it were otherwise, but wishing isn't going to make it so.


Your request wouldn't be 'standard', it would be deliberately malformed to bypass a paywall. You guys are performing linguistic gymnastics to get around that fact.


As you've indicated, a request can only be considered malformed if it doesn't conform to the standards. Here is what the relevant RFC[0] has to say about the User-Agent header:

> Likewise, implementations are encouraged not to use the product tokens of other implementations in order to declare compatibility with them, as this circumvents the purpose of the field. If a user agent masquerades as a different user agent, recipients can assume that the user intentionally desires to see responses tailored for that identified user agent, even if they might not work as well for the actual user agent being used.

That sure sounds like impersonating other user agents is allowed, but not encouraged. That is a clear distinction from being malformed.

[0] https://tools.ietf.org/html/rfc7231#section-5.5.3


Unless you were being tongue-in-cheek, you've completely sidestepped the intent of my comment and continued with the linguistic silliness.

Obtaining paywall-protected content by faking your user agent to purport yourself to be a Google Crawler is quite clearly fraudulent. This isn't a point for debate.

PS. To play along with the linguistic theme, can you provide a source for the definition of a malformed request? My original intent when using the word malformed was not to invoke its technical definition but rather its dictionary definition. But, having said that, I just had a 30-second Google hunt and couldn't find anything to corroborate your position.


So...if I only entitle Mozilla based agents to view my content and Chrome, IE, and Safari users still get access by claiming to be a Mozilla agent, those users have committed a felony?

This line of argumentation is beyond ridiculous.


Are your Mozilla users paying for access? Are the Chrome users with the fake agent strings obtaining paid-for content for free? The use of forged agent strings isn't the offence.


I wasn't playing a linguistic game. You mentioned standards and malformed requests, and I pointed out that you misused those terms. I am not bothering to track down definitions for you to play, as you say, linguistic games. You are free to find something that proves my position wrong, as I used the RFC to cite my position as correct.

If you meant to put forth that faking a user agent is a technique to exceed authorization, that's fine, and I'm glad to have helped you clarify it. Just be clear, it's not what you said with your detour into malformed requests.


> a request can only be considered malformed if it doesn't conform to the standards

That is demonstrably false. I, personally, can consider any request malformed unless it starts with the letter W. Regardless of what the standards say, I can think whatever I want to.

Similarly, the law can make up whatever rules IT wants to about the definition of "malformed". In that case, it pays some scant attention to things like standards, but mostly cares about "to the random guy-on-the-street (jury member or judge) did this seem like stealing". And there, I am afraid you lose.


That argument doesn't matter though. He didn't say "a request can only be considered malformed by mcherm's definition of the term if...".

I could say "the sky is blue" and you could say "demonstrably false; the sky is personally red because I'm wearing colour-altering sunglasses".

What he meant is obviously "a request can only be considered malformed (as defined by the commonly accepted definition of malformed) .."

Your argument is fallacious. The courts will use common definitions of terms and have regard for context.

However, what you're arguing about is even more pointless. The CFAA doesn't depend on the term "malformed" in any way, but on the term "authorized access".

Your first sentence is not similar at all to your second; they're demonstrating first a blatant disregard for common definitions of terms, followed by praise of such common definitions.


It isn't. But I have seen top internet companies (Netflix and the like) use it to authorise requests. It boggles my mind.


Curious where Netflix is using this to authorize requests. Any more info on that?


Their Android app wouldn't let you through if the device is unrecognised, that is, if your user agent has strings which Netflix hasn't whitelisted. They also check model ID, and build fingerprint too, I believe. A standard practice in the Android world, for some unknown reason. I don't know if they have stopped doing that.


Good luck arguing that in court. Esp against the multiple lawyers a corporation will be able to afford.

If you don't get it: the court doesn't care about reasonableness, technical correctness, etc., only whether your lawyers can convince the jury/judge. The Twinkie made me crazy, if the gloves don't fit... and so forth.


If you don't set it for a particular website but generally browse the web with it, based on another legitimate purpose (I'm a developer, I had to test a website, I forgot the setting), could you counter the "knowingly and with intent to defraud" element in court? If you didn't see the paywall, how can it be "knowingly"?


Trot out expert witnesses (on web dev). "Sir, have you ever 'forgotten' the setting?" No, and it's ridiculous to think anyone would. "Is it generally known by web devs, the dangers and circumvention ability of this setting?" Absolutely. Etc.

"Here is a transcript of an electronic forum detailing how to circumvent access controls and defraud the victim using the exact methods the defendant used to access the victim's website. A forum the defendant heavily traffics. Often multiple times per day."

Ladies and gentlemen, I ask you: is it more likely that the defendant, a self-professed developer and expert in these circumvention methods, who regularly participates on forums discussing hacking and defrauding companies such as the victim... I ask you, is it reasonable to believe he "just forgot"?

Lawyers, man, lawyers! Can you not understand that rationality and technicality don't matter? Lawyering is like statistics/graphs. You can get the data to say whatever you want.


User agent string is not an access control mechanism.


A decent Friday afternoon read on that topic:

http://webaim.org/blog/user-agent-string-history/


I couldn't believe my eyes when I saw that Netflix started sniffing the UA. You can't even play a video in SeaMonkey 2.39 without the incredibly stupid

general.useragent.override.netflix.com;Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:42.0) Gecko/20100101 Firefox/42.0

in about:config... what were they thinking?


Pretty obvious weak move restricting by UA string.

Allow by IP range? You can probably find a somewhat accurate range for Google's and anyone else's crawlers.


Google's _entire business_ is based not only on accessing servers without authorization, frequently in explicit violation of that site's terms of use because ToU boilerplate includes language excluding all "crawlers, bots, or spiders", but also on flagrant violation of copyright law. They save the entire text of the web page on their servers when they crawl it (unlicensed copying), rehost it in Google Cache (unlicensed redistribution), save all the images and rehost them on Google Images (both), and so forth. All of this is absolutely illegal under current copyright law.

Google was fortunate that no one sued them for these things before they got big enough to defend themselves. Many tech entrepreneurs haven't been so lucky.

You may ask why a big company like Google isn't doing more to change the CFAA or copyright law. The reason is now that they're big enough, legal grey areas like those in the CFAA (particularly, "what is unauthorized access?", because it's not defined by the statute) can be fully exploited, and Google can sit secure in the knowledge that they'll never be realistically challenged on it; meanwhile, they can then threaten potential competitors for doing the same thing, since a lawsuit against a public corporation takes 10 years and $5MM-$20MM. Anyone who could mount that kind of offense against Google won't, because they benefit from the grey area too; they'll just make some backroom deal with Google and not lose their lucrative, competition-destroying ability to do things that companies with sub-$100M revenues aren't able to do.


If those sites don't want spiders, they can just specify that in robots.txt, which Google honors, right?

From the point of view of the law, that might not matter, but when there's a standardized way to make clear to bots that they're not welcome and you didn't bother to implement it, you'll look pretty silly if you complain.


Okay, there are other aspects to Google's business besides information retrieval and search, though the point that the access is mostly unauthorized is valid. Although the damage done by this is to some extent offset by Google's status as essentially a public utility: it universally provides a social good, "search", and the tax we pay is advertising.

Your frustration on behalf of the "little guys" is mostly on point, since the damage done is then rendered mostly to competitors rather than to customers, and the argument, true or not, that Google's search is "lightyears" ahead of other possible offerings to some extent offsets the damage to consumers due to loss of competition through the possibly anti-competitive practices you highlight. So it seems the balance of good is in favor of consumers of "search". I think this, rather than any "structural obstacles" to competition, is the main force behind the persistence of the status quo in this market.

The thing about this which no one seems to see is that, since everyone is taking information dishonestly, there's a huge opportunity to actually "bring it into the light", do it honestly, and strike some kind of deal with content creators.


Doesn't this just prove that the CFAA (and similar laws in other jurisdictions) is massively over-broad and needs to be narrowed? I'm surprised there is not more campaigning for CFAA repeal and replacement with far narrower legislation.

If someone hasn't exploited a security bug (I mean a real security bug, like a buffer overflow; I don't consider behaviour such as serving up content only to certain User-Agents a genuine "security bug"), and they haven't bruteforced/cracked/acquired a password or private key, and they aren't sending unreasonable amounts of traffic ((D)DoS), it should not be a crime, and the law should be changed to reflect that principle. The law should reflect the common sense of the technically literate, but it doesn't, because it was written by the technically illiterate.


I wonder if a user-agent that was something like "Not a Googlebot" would a) allow access (the check is probably regex-based) and b) be truthful, giving plausible deniability.
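
For what it's worth, the naive substring/regex check being guessed at here would indeed match it; a tiny sketch (the pattern is an assumption about how such sniffing is typically written):

  import re

  # A sloppy server-side check that only looks for the substring "Googlebot".
  def looks_like_googlebot(user_agent):
      return re.search(r'Googlebot', user_agent) is not None

  print(looks_like_googlebot('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'))  # True
  print(looks_like_googlebot('Not a Googlebot'))  # also True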


What is your purpose in setting your user-agent string to that value, other than cleverly bypassing the paywall?


As this entire thread has pointed out, the law makes no sense at all here. So you might as well say "Congratulations! You're going to jail! And I don't have to tell you why!" because none of the details matter anyway.

This is the spot we've reached through legislative meddling, and your best bet is either to be a good little consumer and don't make waves or do what you want and don't get caught. Neither of those seem to make a lot of sense in the long run.

So yes, congrats. It's jail for you -- but not for Google. Because they're Google, and you, well, you're not. You'd just better be happy we don't find out about you rooting your cellphone last year. Good grief.

ADD: I know you meant well, and I appreciated the </s> tag, but there was something I didn't like about your comment. Now I know what it is. By making this a big deal, you're increasing the likelihood that this poor schmuck becomes the next "example" some federal prosecutor decides to make. It's not your fault, but it still sucks. Let's hope that doesn't happen.


How does that law apply to foreigners?


Who has? Surely not the person who wrote the tutorial.


Even worse, the poor author has created a hacking tool capable of enabling said felony, which I believe could get them 10 or 20 years... I'm looking for the statute now.

Edit: I was mis-remembering, the current law is against possession or manufacture of eavesdropping or wiretapping devices, not hacking tools. The EU has been playing with laws against hacking tools, but apparently nothing in the US yet against it.

The law makes it illegal to distribute devices (incl. software) whose design renders them primarily useful for the purpose of the surreptitious interception of wire, oral, or electronic communications. Punishable by not more than 5 years and/or not more than $250,000. 18 U.S.C. 2512.

I don't think this blog post qualifies as an "interception" device,... however unauthorized retrieval and recording of another's voice mail messages constitutes an "interception" so who the hell knows. I'm sure you could find a US DA who would argue the falsified User-Agent meant the software is designed to "intercept" communication meant only for Google.


The German Criminal Code has §202c StGB [1], commonly known as the "hacker section" (or "Hackerparagraph" in German).

While §202a and §202b punish unauthorized access to or interception of data, §202c extends this threat to obtaining passwords or creating tools in preparation of such an act.

Note that this section has only existed for a few years. I am not aware if white-hat hackers have actually been prosecuted for creating hacker tools.

[1] http://www.gesetze-im-internet.de/stgb/__202c.html


Wouldn't it run afoul of the DMCA for making available a tool for circumvention?


Under CFAA, I don't know. The DMCA may have some problems with that blog post, though.

And by "may", I do mean "may". I don't know. But it's at least possible.


That's a great point, it's likely an illegal DMCA circumvention device too!

  No person shall manufacture, import, offer to the public, provide, or otherwise traffic
  in any technology, product, service, device, component, or part thereof, that—

  (A) is primarily designed or produced for the purpose of circumventing a technological
      measure that effectively controls access to a work protected under this title;

  (B) has only limited commercially significant purpose or use other than to circumvent a 
      technological measure that effectively controls access to a work protected under
      this title; or

  (C) is marketed by that person or another acting in concert with that person with that
      person’s knowledge for use in circumventing a technological measure that effectively
      controls access to a work protected under this title.
I mean, obviously it's all quite ridiculous, but also deadly serious at the same time :-(


> is primarily designed or produced for the purpose of circumventing a technological measure that effectively controls access to a work protected under this title

The wording reminds me of a similar section in the German Copyright Law which outlaws the circumvention of "effective" copy-protection schemes.

How an access control scheme can be effective and circumventable at the same time is completely beyond me. :)


I know you were kidding, but I still don't understand this "exceeds access" bullshit.

Here's how HTTP(S) works: I issue a REQUEST to the web server; the web server RESPONDS to it, or denies it. It is up to the web server to respond or deny or do whatever it wants. If the web server is badly implemented or doesn't know what it's doing, it is the web server's fault.

Remember: it's just a request. I can request 100 dollars from you; the fact that you give them to me does not make me a mugger.


What if you tell me my car's broken when it's not and request 100 dollars from me to fix it? That would be fraud.

So you go up to a server and lie to it, and it gives you something; is that not acquiring things through deception?

The structure of your argument suggests that e.g. breaking into an ssh server by issuing a login request with a known password which it responds to, isn't illegal. And further, that if data is acquired from the server, there is still no crime - the ssh protocol too is just requests to the server, it's all bits down the line. It's clearly nonsense.


If you request the 100 dollars with a specially crafted piece of paper that I glance at and believe is legitimate, you are committing a crime (check fraud). Which whatever, let's not focus on the analogy. Whether you like it or not, the law doesn't necessarily view the valid server response as authorization to access the requested url, it examines what you were thinking when you created the url.


And one more car analogy. If someone comes up to your car, pulls the door (issues a request) and based on the fact that it opens (responds 200), drives away in it, that would be seen as grand theft auto.

Whether the lock was implemented poorly or you just didn't lock it — doesn't matter.


If I run a car wash with a sign, "red cars washed free", and you paint your car red, have you defrauded me? You're not really a red car, you're just pretending!


Oh pish posh. If you forge a key that fits the lock on my front door, yeah, you're breaking the law. The lock itself is not an advertisement to come in if you can unlock it.


Stop it with the tired lock metaphors, they are wildly inapplicable. If the WSJ wanted to require access to some secret knowledge or possession of some token for access, they could. Instead, they are dispatching on something the protocol they're using agrees you are free to set to whatever you want for the "best experience". In fact the specification specifically recognizes that you will change the agent. Much like the color of your car.

"If a user agent masquerades as a different user agent, recipients can assume that the user intentionally desires to see responses tailored for that identified user agent, even if they might not work as well for the actual user agent being used"

It's a param exactly equivalent to asking "who would you like to be treated as?" If I say "the pope!" and you respond by kissing my ring, I have not defrauded you.


Terrible analogy.

You're talking about a sign with practically no legal meaning.

This discussion is about a law which does have legal meaning.

To fix your analogy, it would be "If there's a law that says only explicitly authorized cars are allowed in a car wash; else 5 years in prison, and I have a car wash and say 'only red cars' ...". Of course, that still doesn't properly capture this since the intent of the law also matters and that law would be senseless, and so interpretation would be less obvious.

The better analogy is the one below about breaking into homes being illegal, but what if you happen to have a key that fits the lock? (though that also is a bad analogy in its own way).

Basically, making physical analogies for technical matters is rarely correct. It's often the best way to convince non-technical people of a matter without them needing to actually understand it.


Fraud statutes do exist and the CFAA is in direct correspondence with them.

If there's anything the tech sector should be able to come to agreement on, it's that lock metaphors in a situation with absolutely no hidden knowledge or private tokens make us all dumber.


The law seems to use trespass as the analogue for unauthorised access. We can argue it's a bad analogy, but it's the one we have.

The door is locked, the key is right there and all you have to do is pick it up and you can gain access, but that doesn't make you authorised to gain access.

The CFAA and UK CMA may well be overzealous in restricting what should (to technologists) be allowed, but it's what we have as statute in the respective jurisdictions.


Is that significantly different from viewing google's cached result of their legitimate access? The result is the same.


The difference is intent!

The law says "authorized access" and the WSJ authorized Google to access their content in order to index it. The WSJ did not authorize that content to be presented, for free, to you, an end user, necessarily. I don't know which way a court would rule on it, but it's definitely not black-and-white.

Sure, it's technically similar, but the court doesn't care. The court doesn't care if the law makes no technical sense, because it's a law, not a program.


I agree with you that this is what the law seems to think (being overly broad), but not if you are arguing this is sensical.

In other words, yes, the WSJ does intend to only give Google access to their content, and not the general public.

But no, the WSJ has not "authorized" Google by anything more official than a bank telling their security guards to let anyone into the vault who is wearing a blue t-shirt.

So yeah, I agree with you, there is a lot of conflation of technical means and the law, but we also shouldn't be granting to the WSJ that they are doing any real "authorizing" here, beyond wishing it and hoping it stays true.


Am I alone in feeling like this is akin to a tutorial on how you can shoplift without getting caught? WSJ, for better or worse, does not want to give you content without your paying for it. If you take that content without paying, you are stealing. Just because you have figured out how to get past their security does not mean it's not stealing.

(See the second precept here: https://en.wikipedia.org/wiki/Five_Precepts)


Legally it might, I don't know.

Morally, I'm not sure.

1. If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot". Couldn't I write a personal bot that fetches the story for me? As a browser addon, even?

2. It's even more complex since allowing bots means they allow tools that provide the information to third parties, as the bots are not intended for private use by the bot maker. So the door is already open.

3. But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot? Try to use the "web link" trick from HN on any other search engine, and it doesn't work in my experience. That seems anti-competitive and discriminatory in favor of the existing dominant entity in this space, Google.


> If they allowed all bots but disallowed all non-bots, that would raise questions of what defines a "bot".

Maybe, but I think it's a pretty easy distinction. They aren't even allowing all bots - they're allowing a whitelist of them. You're not just writing your own bot to get around it, you're pretending to be someone else's bot.

> But in practice, it seems they favor certain bots. Is it ok that the WSJ lets Google do things Google's competitors cannot?

That's the really important question. I personally have no context for answering except to say that I can see both sides argued. If you view their website as a physical store / private establishment, then I assume that they have every right to establish who has access to what and under what conditions.

Of course, that hampers a lot of legitimate use cases along the way.


> If you view their website as a physical store / private establishment, then I assume that they have every right to establish who has access to what and under what conditions.

Not true. There are very specific laws about not being able to discriminate against protected classes of people.

Bars have to serve minorities, bakeries have to cater to same sex marriages, etc.

Where you draw the line of legislated equality within private property rights is pretty intriguing. I don't have any answers, but lean heavily towards the libertarian bent.


That's fair, though private establishments are still allowed to distinguish based on other criteria such as membership, partnership, etc. I didn't mean to be super specific. I'm merely pointing out that they are allowed to establish barriers to entry.


No content is being given or taken. This is restriction on distribution / copying. Massively different ethics than stealing despite what the Copyright Cabal wants you to think.

The only loss is the energy/bandwidth/cycles WSJ servers spent answering your request. Which, I believe, has been the basis of computer "fraud" cases.


> Which, I believe, has been basis of computer "fraud" cases.

This can't be true. Surely the argument for why, say, a WSJ-paywall-bypassing-tool causes damage (in the legal sense) to WSJ is that it allows people who would otherwise pay for content to get it for free, thus depriving WSJ of income.

Moreover, I don't think prosecutors need to prove that you caused harm in order to charge you with computer fraud, since, for example, CFAA falls under criminal law.


> it allows people who would otherwise pay for content to get it for free

But the very existence of this trick implies that the person who would use it is exactly someone who would not pay, so your argument does not hold. And stealing it is not; it is more like listening to an outdoor rock concert from outside the fence because you don't want to pay. Inconvenient, sure, so plenty of people would still pay.

Cloaking is against Google's rules, so it is WSJ that's dishonest.


I'm not arguing that bypassing the WSJ paywall causes damage to WSJ; I'm personally not sure that it does. What I'm arguing is that if for some reason WSJ needed to prove in court that a paywall-bypassing tool causes them damage, they would use the it-deprived-us-of-potential-revenue argument that I outlined, rather than the it-forced-us-to-waste-electricity argument. You're right that the it-deprived-us-of-potential-revenue argument might not work in this case, in the sense that it is very possible there exists no person who would have paid for a WSJ subscription but, because of the paywall-bypassing tool, did not.


Yeah, I spent 10 minutes looking and couldn't find anything. Maybe I'm remembering a "what if" from early-days cypherpunk stuff.

Prosecutors never (want to) charge just one thing. They want a laundry list of a dozen or more crimes so they can coerce the suspect into pleading guilty. The "theft of resources" charge would just add to the pile.


WSJ is violating Google's rules referenced here: https://support.google.com/news/publisher/answer/40543?hl=en


So? Is this a case of "two wrongs make a right"?

More importantly, is violating "Google's rules" suddenly a violation of law?


This would never go to court, but if it did, I could see a jury being sympathetic. You could sort of argue that WSJ had an obligation to Google to allow referred users to view the pages, and that Google passed that right onto the user, so you were contractually authorized to access it.

IANAL and this is most likely wrong, but kind of plausible to my NAL mind.


I don't think it's wrong to browse a site's content that they have advertised as publicly available.

If they are advertising incorrectly, they should fix that.


False advertising is also illegal


> abstain from taking what is not given

I'm not taking, I am simply absorbing information. That information will still be there when I am done reading it. Have I really stolen, or did I just refuse to give someone money on demand?


I don't disagree with you. But, given the nature of this forum, I think that the information content has merit.

What people choose to do with the information is another story...


Stealing would imply that they are no longer in possession or control of the content, which of course is false. Copying illegally, however, yes.


This is an odd debate. Let's say a restaurant declares "veterans eat free." This blog post is like a friend telling you "Hey if you tell this restaurant you're a vet they'll give you a free meal." No one said it's legal or ethical. It's lying to trick someone into giving you something at their expense.

I think the relevant point, underscored by the author's last sentence, is it doesn't matter who you open a back door for - it opens the possibility for anyone to barge through.


that's a good analogy.


This is not meant to be purely controversial, but I thought long and hard about WSJ a few months back when an HN mod (I always forget his name) said to stop complaining about HN links being posted because paywalls were OK. I agree paywalls are OK. But some things are not OK.

Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles). They want me to pay, and they want me to see ads, and they want to track my behavior? Should I send them my DNA also?

Organizations like WSJ are exactly the disease that causes ad blockers to proliferate and ruin the web for all the decent publishers. They're at war with my privacy (by breaking their site intentionally when I visit with a blocker on). They want it all, ads, tracking, your private data, and subscription revenue, not to mention...

# Agenda-Driven Content

I mean, we're basically talking about NBC or Fox here, just on the web. Imagine every morning when you woke up you turned on the television and tuned to some "news" show. After talking about the weather, they start talking about a lost pickle that is thought to be potentially alive and moving about with free will. Over the next two years, talk about the same pickle extends to every other TV show. Before you know it, everybody in the nation is talking about the same pickle. Years go by, and that pickle has become a part of our society, and that's not because people are born with an innate care for the well-being of pickles, but because "news" shows taught them to.

That's not a good position to be in. I have to believe I'm not the only one in here that doesn't watch any TV. So, why do we all treat the same media giants differently on the web? We crave their content so much that we build browser add-ons to get to their content, etc.


If they make sending your DNA a requirement of consuming their content, then yes, you send it to them if you want their content. That's their right, as owners of something, to dictate its use.

You aren't entitled to WSJ.com, NBC, or Fox.


I'm not a collectivist, nor do I believe I'm "entitled" to anything, including human rights. I'm simply stating that "I don't support organizations like WSJ"... as you can clearly see if you read my comment. I don't propose anyone ban them. I simply "don't know why we [as a society] support them", also clearly in my comment. My point, which your political diatribe kept you from noticing, was that we're fighting awfully hard to consume their garbage. Meanwhile, there is plenty of content out there that is free, not because of socialism or entitlement, but for the same reason blockers became popular: because information is so readily available now that nobody is ready to send their DNA in.


> Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles).

I don't actually see what you're referring to; maybe it's because I get redirected to http://www.wsj.com/europe. Maybe I have a different ad-blocker. Either way, it reminds me somewhat of NME's [1] homepage (New Musical Express, a popular music publication; not sure if it's really known outside of the UK). They deliver their images in such a way that they fall foul of my ad-blocker, although I haven't looked in enough depth to be certain whether this is a way of preventing ad-blockers or purely unintentional.

[1] http://www.nme.com/


>Take a look, for instance, at the WSJ.com home page with an ad blocker turned on (note all the missing letters and scrambled up titles)

Uh, what? Using uBlock origin, when I visit wsj.com I get what looks like a perfectly normal page. Nothing is scrambled at all.


I'm pretty sure Google will soon stop indexing WSJ. Why index something if the vast majority of users cannot access the pages behind the links?

EDIT: The "paste a headline into Google" trick still works for me, though. If this continues to be the case, they will keep indexing, of course.


>Why index something if the vast majority of users cannot access the pages behind the links?

So people can find it? I'd be pissed if Google de-indexed something like IEEE because it has a paywall.

Assuming the internet has to be freely available is a mistake. Especially with the continued growth of an adblocked internet. We could be facing an internet with significant paywalls in the future.

I'd support a "free" search term to weed out paywalled results.

Furthermore, Google shouldn't be making normative judgements about what people should see. It's an abuse of their monopoly.


Google doesn't forbid or de-index paywalls. What they forbid is cloaking (showing Google different content than what users will see). This is, of course, quite critical to maintaining search quality.

WSJ is free to institute a full paywall and only serve snippets to Google. They might now like what it does to their rankings though.

What they cannot do is continue to sniff the UA before deciding to put up the paywall. (Though I'm still able to use the Google trick, so it seems the experiment might have ended.)


It doesn't work for me. They did say they're "testing" it, so maybe A/B testing conversion rates.

And yes, this violates Google's policies laid out explicitly at https://support.google.com/news/publisher/answer/40543?hl=en


Indexing is fine; a great feature would be if Google could show a paywalled result only to users who can actually access it.


Why should Google manage WSJ's paywall?


It doesn't. It could 1) penalize WSJ, or 2) personalize results for Google users who subscribe to WSJ.


Well, that trick won't last long either. It's trivial to verify that an IP indeed belongs to Google:

https://support.google.com/webmasters/answer/80553?hl=en
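
For anyone curious what that check involves: it's a reverse (PTR) lookup on the client IP, a check that the name ends in googlebot.com (or google.com), and a forward lookup that must resolve back to the same IP. A minimal sketch in Python using only the standard library; the function name is mine and the sample address is just an illustrative pick from the 66.249 range mentioned elsewhere in the thread:

    import socket

    def is_googlebot(ip):
        """Double DNS lookup: reverse-resolve, check the domain, forward-resolve."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)            # PTR lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addresses = socket.gethostbyname_ex(host)  # forward (A) lookup
        except socket.gaierror:
            return False
        return ip in addresses                               # must round-trip

    # Example (IPv4 only; gethostbyname_ex does not return AAAA records):
    print(is_googlebot("66.249.66.1"))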


Seems to work if you deploy a proxy on Google's app engine and use it to access WSJ ;)


An App Engine app wouldn't have an IP with a reverse DNS of *.googlebot.com, would it?


Nope, but it does resolve to something under .google, and if you check netblocks.google.com it will appear there, so they might not be limiting it to Googlebot only at this point.

https://support.google.com/a/answer/60764?hl=en


Yes, but if one is interested in whether an IP is from Googlebot or not, they would check whether it resolves to .googlebot.com, not to .google.


I don't think that any server that random people can start a proxy on is in Google's SPF record.


Not on the SPF list, but on Google's domain/IP block list.


The link you yourself posted says "The most effective means of finding the current range of Google IP addresses is to query Google's SPF record." While this is intended for mail servers only, it has always appeared to me that the SPF list covers all of their servers that third parties can't run proxies on.
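
If you do want to walk that SPF record programmatically, the chain is a TXT query on _spf.google.com followed by the include: records it references (the approach the linked admin help page describes for mail). A rough sketch assuming the third-party dnspython package; note, as pointed out below, that these blocks cover Google's mail infrastructure, not necessarily the crawler:

    import dns.resolver  # pip install dnspython

    def spf_netblocks(domain="_spf.google.com"):
        """Recursively expand an SPF record into its ip4:/ip6: netblocks."""
        blocks = []
        for rdata in dns.resolver.resolve(domain, "TXT"):
            txt = b"".join(rdata.strings).decode()
            for token in txt.split():
                if token.startswith("include:"):
                    blocks.extend(spf_netblocks(token[len("include:"):]))
                elif token.startswith(("ip4:", "ip6:")):
                    blocks.append(token.split(":", 1)[1])
        return blocks

    print("\n".join(spf_netblocks()))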


_netblocks.google.com includes only the servers that handle Gmail and corporate email.


Basically, the article is suggesting that you change the User-Agent to Googlebot, Bing, or whatever other crawler UA you'd prefer. While that's doable, it's easily detected and prevented, as all of the big crawlers can be validated against DNS.

Additionally, I would like to point out that I wrote a Varnish extension for the express purpose of validating User-Agent strings through DNS lookups; it is available here: https://github.com/knq/libvmod-dns

It was built because we specifically had a problem with bad bots crawling a large site (multiply.com), and this was one of the easiest ways to separate the bad bots from the good and to enforce robots.txt policies on a per-bot basis. It works very well, as you can do DNS caching internally and prevent this kind of behavior, if that's your goal.
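
The per-bot robots.txt check can be sketched with nothing but the Python standard library; the DNS validation itself is what the vmod above handles, and the names and URLs here are made up for illustration:

    from urllib.robotparser import RobotFileParser

    def allowed(bot_name, url, robots_url):
        """Would robots.txt permit this (already DNS-verified) bot to fetch the URL?"""
        rp = RobotFileParser()
        rp.set_url(robots_url)
        rp.read()                      # fetch and parse robots.txt
        return rp.can_fetch(bot_name, url)

    # Hypothetical example values:
    print(allowed("Googlebot", "https://example.com/article",
                  "https://example.com/robots.txt"))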


I like the WSJ, but I only read maybe one article every other day. They need a more reasonable price point, especially since the market will bear almost no price at all.

That being said I do enjoy their content, save for maybe the op-eds.


I'm surprised that most online papers won't sell you one day's worth online for a buck or so. Like buying a real newspaper.

They all seem to want to sell subscriptions, which are perpetual and probably difficult to cancel.


Even a dollar a day. I can pay Netflix $10 a month and stream unlimited HD video, but wsj wants $30 to read the first few paragraphs of a few articles a day?

The pricing here is much too aggressive


Videos are more relevant to rewatch than articles are to reread. One needs to output more articles than videos, because articles must be "fresh" or you'll lose an audience. Nobody is printing 40 year old news - people are still watching 40 year old movies.

Playing devil's advocate here. Pricing for many online goods is almost completely arbitrary and varies with little accord to service/product quality or even what that service provides.

Another related example of arbitrary pricing: people will pay $2 for a soda from a vending machine but won't pay $1 for a useful app on their phone.

There's something going on there... The day the $1 apps figure out what makes people buy $2 sodas is the day they become rich. And once content providers figure out why people will pay $20/month to stream media (say, $10 to Netflix, $10 to Spotify or something) and manage to charge $20/month for their articles... things will turn around for them.


> Nobody is printing 40 year old news - people are still watching 40 year old movies.

Because news operates at a different scale than movies. People are still reading 40-year-old books.

A 40-year-old news article is not relevant anymore because it only fits within the momentary context in which it was created, whereas a non-fiction book or even an essay can span a broader context and thus stay as informative for future readers.


You can buy single WSJ print or online articles for 29 cents on Blendle or today's whole paper for 3.20.

I don't know if it's available in the US yet but they are at least planning to launch in the near future.


I thought Google specifically disallowed returning different pages based on a User-Agent targeting Googlebot, and that this included paywalls.

Are they running afoul of Google policies and going to get pinged by Google?

I can't find the text from Google now (when can you ever find any docs at Google?), but I am very certain I remember reading from them that you may not return different content to Googlebot based on User-Agent.


Doesn't this kind of thing also hurt SEO? I would guess Google has some automated system to detect and apply a negative signal to sites that provide different content to a Googlebot user agent than to a non-Googlebot user agent. I guess these sites are counting on the other signals outweighing that negative hit.

Otherwise, why would expertsexchange be obligated to provide the answers at the very bottom? Did something change?


I'm 99% sure I've encountered a Googlebot crawling pages with the UA of a regular browser, presumably for exactly this purpose.


They do this via an iPhone-like UA, though "Googlebot" is still in there too.


I'm pretty sure that's just to see if the site is serving different content to mobile vs desktop. I think they also sometimes hit pages with no mention of googlebot.


If you do an nslookup on the IP, it should come up with crawl-xx-xx-xx-xx.googlebot.com, where xx-xx-xx-xx is the IP.


expert sex change


If you hit a paywall or a "sign up to access this content" message from a google search result, report it. Google will remove them from the search results, they will lose their largest traffic source, and they will address the issue. Or they won't because they have enough paying customers.


I thought of doing that when the "search Google" trick stopped working, but I decided it crossed the point where I would feel like I was unfairly circumventing their clear desire not to serve me the content. I've just added WSJ to my mental ignore list and count it as a few more minutes gained to do something else.


Yeah, same here. Every time I get to a tab where I see a paywall I just close that tab, probably saving 5-10 mins of my life!


If Google (or any other crawler) wanted to play nice with paywalls, they could issue a public key for their bot, and put a signature in their User Agent string that the domain could then verify.

Those signatures could obviously leak, but on a per-domain basis. Perhaps the domains could have a secure way of bumping the valid key generation if they had a leak.
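
A toy version of that scheme, just to make the idea concrete: the crawler signs the request URL plus a timestamp and embeds the signature in its User-Agent, and the site verifies it against the crawler's published key. Entirely hypothetical (no search engine does this); the sketch assumes the third-party cryptography package, an Ed25519 key pair, and a made-up "ExampleBot" UA format:

    import base64
    import time
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Crawler side: sign the request URL plus a timestamp.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()        # published out of band
    message = ("https://example.com/article|%d" % int(time.time())).encode()
    token = base64.urlsafe_b64encode(private_key.sign(message)).decode()
    user_agent = "ExampleBot/1.0 (+sig=%s)" % token

    # Site side: extract the signature and verify it. (In practice the
    # timestamp would also travel in the header so the site can rebuild
    # the message itself and reject stale or replayed signatures.)
    sig = base64.urlsafe_b64decode(user_agent.split("+sig=")[1].rstrip(")"))
    try:
        public_key.verify(sig, message)          # raises if forged
        print("signature valid")
    except InvalidSignature:
        print("spoofed crawler")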


There are two problems with this.

First, they don't want to. In fact, if a search engine can figure out that a link is going to lead to a paywall, they'll probably want to reduce the ranking of the result, because the user is not going to want results they can't actually look at.

Second, it would be a massive antitrust violation because it would prevent access by competing crawlers. The only way around that is to allow access to anyone who claims they're a crawler, which was the original problem.


The current situation with the WSJ could already be considered an antitrust violation. It's whitelisting one crawler and leaving the other ones out.


Google (and every other major search engine) already provides a way, i.e. a reverse DNS lookup, to authenticate bot ownership:

https://support.google.com/webmasters/answer/80553?hl=en

AFAIK no content provider actually does this check though.


Bypassing the paywall is more unethical than blocking ads. It is one thing to have control over your own browser, but another to steal something from another site.

Also, isn't it illegal to bypass computer security?


How is modifying your own request headers any different than choosing to not display content returned in the response body?

Their server can choose to do what it wants with your request and you can choose what to do with the response it sends.

Are User-Agent headers legally protected identities?


Unfortunately, yes. The law protects anything you might use to gain unauthorized access; that includes, e.g., a password field. (That is, it is indeed breaking into a system if you type in the correct password, say by reading it off a post-it on someone's monitor, when you are not supposed to know that password.) This sort of thing makes the entire business of law complicated and impossible to automate.

Then again, lots of sci-fi dystopias are dreams of an automated law that somehow destroys the fabric of society, so...


The difference is substantial; in principle, one can modify his request to intentionally bypass the authorization mechanism occurring on the server. One cannot mislead anyone/anything by displaying his data in a customized way in private on his computer.


Fair point. Intent matters. However, must it be demonstrated that there is some specific malicious intent behind changing a request header, versus simply changing your user agent for the heck of it?

On a related point, some news sites load a modal over an article and prevent scrolling, asking you to sign up. However, if the full article is included in the response and I read it by simply viewing the response body (HTML), is that circumventing security? This is an actual example: by modifying how the response is rendered in my browser, I can bypass their intentions.

Obviously the better way of doing this would be to not send the entire article content until they've determined I should be able to view it.


> must it be demonstrated that there is some specific malicious intent...

I think so. Just the act of changing User-Agent alone does not mean a fraud is happening. User-Agents get changed often for valid reasons - research, detection of cloaking, testing etc.

> However if the full article is included in the response and I read it by simply viewing the response body (HTML) is that circumventing security?

That is a hard issue. I think the answer should be no, it is not circumvention, at least not in the above sense.(*) If you can get the full article on your computer, directly readable without fraudulent behaviour on your part, that means the sender did not place a security measure to guard it, so there is nothing to circumvent. Switching JavaScript off or displaying the source text of the document is fully in the control of the requester, and the sender should know it.

(*) Unfortunately, it seems legal systems allow companies to restrict what you do with things they produce, even if you do it on your computer in private, so legally, this may not be successful in court.


Based on the comments here, am I to understand that by constantly browsing the web with my user agent string set to a Googlebot string, I am committing a felony? How would I even know which sites I'm gaining unauthorized access to?

That is completely idiotic if there is a string you can put in a Mozilla browser config that is literally illegal to browse the web with.


I do not think using any particular User-Agent alone constitutes a crime. There are valid, non-criminal reasons why one would want to use Googlebot's or another User-Agent. I think it is the intent to bypass the paywall plus success in doing so that may be regarded as an offense or even a crime, but I'm not sure.


Good luck trying to argue to a judge that the law is forbidden from being idiotic. ;)


> Remember: Any time you introduce an access point for a trusted third party, you inevitably end up allowing access to anybody.

coughNSAcough


New workaround: paste the article title into archive.is. I don't know what they're doing but they have a workaround of some sort.


This is what I do.

I actually made a bookmarklet with the following pasted into the URL, so you can do it in a single click:

javascript:void(open('https://archive.is/?run=1&url='+encodeURIComponent(document....)


That's actually really interesting! Could anyone chime in and explain how they might work around this issue?


My guess is they have a login, but they could just be using some workaround like described in OP.

I've used them to save Facebook posts before, and the pages were logged in to some "Nathan" IIRC. They probably have a bunch of hacks for specific sites that needed fixing.


I just tried clicking on "Harper Lee, Author of ‘To Kill a Mockingbird,’ Dies at Age 89" from wsj.com's homepage and got the paywall.

I then pasted the headline into google and clicked on it from Google results and did not get hit by the paywall.


they're probably doing some sort of a/b testing by selectively letting some clicks through


This is basically true ^^


Any idea what the SEO impact would be of WSJ blocking everyone who isn't a Google bot?


Mine did, which surprised me, and I did exactly the same thing.


I did the same with the same article and got paywalled from Google results.


I also checked the paywall and the old trick still works for me. Odd.


I was under the impression that the "hack" whereby you searched for the article on Google and clicked through to that article (effectively skipping over the paywall) was a demand of Google's and not an oversight by the paywalled website.

I thought that google deemed providing search results which were behind paywalls as a "bad experience" for their search users, and would penalize websites for doing so.

Is this no longer the case?


Google doesn't demand anything. If your paywalled website is not accessible by Google's crawler, then Google will not index it. Publishers want Google to index their pages and drive potential paying visitors, which is why they open the loophole themselves.

For the second point, Google does require that publishers specify "registration required" in their sitemap.


If you're showing Googlebot one thing, and visitors who visit your website through google another, that's essentially "cloaking" (a blackhat SEO technique). At least it used to be.


Doesn't Google usually try to punish websites that show users something different from what the crawler sees, and don't they even mention that somewhere?

Not an SEO expert here, but I wonder how and whether Google will end up handling this. I mean, making an exception could also be considered an abuse of power in some countries. I don't have a strong opinion on that yet; I just mention it because of how the EU has exercised certain laws in recent years.


Aren't you supposed to verify whether a visitor is Googlebot by a reverse lookup of the IP address? I.e.: https://support.google.com/webmasters/answer/80553?hl=en

User-agents are notoriously unreliable.


I wonder how many Google Cloud customers use the servers to run spoofed Googlebot crawlers from the Google IP range in order to bypass paywalls and scrape large sites (like LinkedIn) without hindrance.


It's broken already. I tried to access an article about new China rules for online news and it paywalled me. They're probably looking for clients coming from googlebot.com now.


So does HN now choose to not post articles from the WSJ? I was comfortable with the "google it" trick, and frankly was a little annoyed with constant "paywall, wah!" comments when what should be by now a well-known workaround was available. But that workaround no longer works.


They've been testing the new wall for a while now. I know I made one of those "paywall wah" comments when the Google workaround didn't work for me. Then the next time I tried it worked fine, so it must have been random selection.


My Windows anti-virus deletes the linked sample code automatically upon download, marking it as "Trojan:Win32/Spursint.A". Did anyone have the same experience? (I was actually more interested in using it as a template for writing a simple Chrome extension.)


Yep. I then pasted it but it didn't work on wsj.com. Oh well.


try deleting cookies, then hit refresh.


Not working here either. I tried wiping out cookies and cache, and Chrome is sending the right user agent (Googlebot).


Try disabling other extensions that might alter the user agent.


Solution:

Content providers register for a (yet-to-be-written) Google News API account and get an API key, which Google uses when indexing the site and which the site recognizes as legit.


I've noticed that this has stopped working on WSJ if you've already hit the paywall and try to google the article to bypass.


I wonder if anybody tried to do as suggested? I copied the files to Chrome as per instructions, and the paywall was still in place.


You can also access WSJ for free at the library.


It reminds me of this: "Trying to save a quarter..." https://www.youtube.com/watch?v=j4nRHHPpnVc


It's not bypassing at all. Google's crawlers are deliberately let in, because a paywall that nobody runs into is useless.


So soon they have to block anyone with a fake Google UA and whitelist the well known 66.249 IP range. Trivial.


Does WSJ check visits from a Googlebot UA against a list of known Google IP addresses?


Fix: replace the user agent string by a cryptographic challenge/response scheme.


They'll start allowing only the IP addresses that search engines have agreed on with them.


Possible in Firefox? Some people won't use Chrome.


Is there a version of this available for Safari?


So their next move is to check whether the IP is from Google.



>Archaic news source does something to hurt their market penetration on the internet

Great idea here guys


Or simply use incognito mode and click on Google search result.


Did you read the article? It talks about how that trick no longer works on a lot of sites because they are now checking User-Agent strings too.


Actually I noticed sites have simply changed policies -- if you're a regular visitor your cookies will identify you and block content. The Incognito mode trick works for WSJ and others that would still check the referrer header. Allowing Googlebot access and checking the referrer header are two different things.

Also, Google has published IP addresses it uses, so this extension might not last long...


> Also, Google has published IP addresses it uses, so this extension might not last long...

They do not [1], but you can find out by doing a reverse DNS query.

> "Google doesn't post a public list of IP addresses for webmasters to whitelist"

[1] https://support.google.com/webmasters/answer/80553?hl=en


Sorry, that was what I meant. :)


1. Google's web crawlers are not "bypassing" the paywall. It's the paywall that lets the crawlers through, i.e. exactly the reverse of what the author implies with the headline.

2. The idea that this is somehow new is wrong. The way for a server to identify crawlers has "always" been to look at the user agent and, when done right, the IP, verified either by netblock owner or by doing a PTR lookup and then checking that the A or AAAA record for the claimed host points back at the same IPv4 or IPv6 address. Meanwhile, I do agree that paywalling is a more recent phenomenon, at least with regard to the extent it is popular among sites today, but the concept of presenting different data to crawlers and visitors arose much earlier, and it is something Google has been aware of and has made sure to delist sites for when found. In fact, Google has since moved a bit in the direction of allowing it, in that they do so for Google News if it is declared, as explained by others ITT.

So in my view, it seems the author is jumping to incorrect conclusions based on an incomplete understanding of what's actually going on here. What about the HN readership, then: how come this article became so highly voted, and why don't I see these issues raised by anyone else? Or maybe I'm just crazy?


> Google's Web Crawlers are not "bypassing" paywall. It's the paywall that let's crawlers through. I.e. exactly the reverse of what the author implies with their headline.

Don't nitpick. It's just a shortened version of "How to 'Be' Google's Web Crawler to Bypass Paywalls." You get it. I get it. Everyone gets it.



