
How Google’s Web Crawler Bypasses Paywalls - elaineo
http://elaineou.com/2016/02/19/how-to-use-chrome-extensions-to-bypass-paywalls/
======
lloyddobbler
"Remember: Any time you introduce an access point for a trusted third party,
you inevitably end up allowing access to anybody."

See also: [http://www.apple.com/customer-
letter/](http://www.apple.com/customer-letter/)

:)

~~~
melted
Except of course what the FBI is proposing wouldn't give access to "anybody".
Just unfettered access for the FBI, CIA and NSA by way of gag orders and
national security letters, forcing Apple to break the security of their
hardware and not speak about it publicly. Of note is the fact that it wouldn't
even give direct access to the FBI, since they don't have the firmware keys
(yet).

~~~
kbenson
In this example, the FBI is the "trusted third party", but by giving them
access, we inevitably open access for everyone, as the system is no longer
strongly secure. The trusted third party in the quote isn't asking for access
for everybody either, but in the end that's what happens.

~~~
melted
Apple isn't giving access. Apple would be required (by court, unless they
manage to fight this off) to install a signed custom build of the OS in order
to give access to that particular device. FBI would not have this build, nor a
key to create their own signed custom build.

~~~
lox
Yup, that's the FBI's pitch. The issue is that it sets a legal precedent as
well as potentially leaking a backdoored iOS to the world. Yes I know "signed
for a specific device", best of luck with that.

------
slig
If they're now blocking clicks _from_ Google, doesn't that mean that they're
cloaking and violating Google's Webmaster Guidelines [1]?

[1]:
[https://support.google.com/webmasters/answer/66355?hl=en](https://support.google.com/webmasters/answer/66355?hl=en)

~~~
elaineo
Google is not okay with cloaking, but they will whitelist publishers if the
publisher specifically includes a parameter declaring that the site requires
registration or subscription. This is done in the sitemap.

[https://support.google.com/news/publisher/answer/74288?hl=en](https://support.google.com/news/publisher/answer/74288?hl=en)
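
For reference, here's a rough sketch (mine, not pulled from the linked docs,
so treat the exact element names as an assumption) of what such a News sitemap
entry might look like, with the subscription requirement declared via
`<news:access>`:

    # Hypothetical illustration of a Google News sitemap entry declaring
    # subscription-only content; element names follow the News sitemap
    # format described in the linked help page.
    ENTRY = """\
    <url>
      <loc>{url}</loc>
      <news:news>
        <news:publication>
          <news:name>The Wall Street Journal</news:name>
          <news:language>en</news:language>
        </news:publication>
        <news:access>Subscription</news:access>
        <news:publication_date>{date}</news:publication_date>
        <news:title>{title}</news:title>
      </news:news>
    </url>"""

    print(ENTRY.format(
        url="http://www.wsj.com/articles/example-article",
        date="2016-02-19",
        title="Example Article",
    ))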

~~~
pacquiao882
WSJ is big enough to negotiate their own terms with Google Search.

~~~
morgante
They're really not. They need Google a _lot_ more than Google needs them.

One of these things is going to happen:

(1) They end this "experiment."

(2) They stop serving Google the full content. (And see their rankings drop
accordingly.)

(3) They get delisted for cloaking.

------
anewhnaccount2
If this is true, what WSJ is doing is called "cloaking" and should cause it to
get de-indexed:
[https://support.google.com/webmasters/answer/66355?hl=en](https://support.google.com/webmasters/answer/66355?hl=en)

~~~
vonklaus
Conversely, everyone should actively cloak and use randomly generated numbers
to dynamically serve variants of their content, similar to how mapmakers use
trap streets. That way, like, another company wouldn't be profiting directly
off of their work and threatening to sort of, destroy their entire business if
they disagreed.

~~~
jonknee
What? Any site can very easily not be in Google if they choose to. It's a very
dumb decision for a news site, but you're free to do it.

~~~
vonklaus
It's a false choice without a compelling alternative. Like saying anyone upset
with the status quo should vote. I was joking a bit, but I also wasn't.

Google has end-to-end control over some users' internet experience, and much
of it in other cases. They own:

* 100s of thousands of servers

* domain registrar

* ~50% of web browsers in US.

* code CDN, FontService

* define web standards

* hundreds of millions of emails.

* CA implementation

* ISP infrastructure

* Develop software for a large part of the mobile ecosystem.

* decide what you see when you go to search (most search engines copy Google, buy results, or both)

* also many of the web beacons and advert targeting.

* oh, and the largest collection of video and images in the world.

So when you say "just do what they say or get deindexed" and present it as if
that is reasonable (not just you, but the collective you), I just think I must
be insane.

I mean, assuming Google is good (I do, mostly) doesn't mean I would let them
become the entire internet.

Real question: if Google were to disappear, vs. the "too big to fail" banks
that would have gone under (where a case could be made for a few certainly
failing), which would have the bigger impact today?

Tl;dr: everyone cares about single points of failure except at the macro
system level: finance, banking, healthcare, etc.

~~~
yyin
There are certainly more you omitted (similar to your code CDN for javascript,
fonts, etc.)... consider something like reCAPTCHA.

In my experience, all those requests to api.recaptcha.net get forwarded to
"www.google.com"

My experience has been that if a user for whatever reason cannot access the IP
du jour for www.google.com (www.google.[cctld] will not suffice) then that
user is prevented from using the myriad websites that rely on recaptcha.net.

Now, I could be wrong and maybe there is something I am missing, but in my
experience this is a sad state of centralization and reliance by websites on
Google. Quite brittle.

~~~
vonklaus
Right, that was my point. I actually did include the code CDN and fonts in the
original list. Regardless, I think Google is an awesome company, but as a
community it is just downright irresponsible to hand this level of control to
a single entity.

While I consider Snowden a proper hero, it is almost a certainty that the same
thing could happen to a "friendly" entity like Google. That is, the NSA likely
has some top programmers who could get a job there and compromise something,
learn enough info to find a vuln, or pass data out. This is of course making
the massive assumption that they aren't already cooperating at a system level,
either voluntarily or involuntarily.

As you can see, as search deteriorates Google is motivated to (in my opinion
benevolently) use any means necessary to continue to fund their larger goals
of a connected and automated techno-utopia. However, they will be tempted to
leverage what amounts to almost literally 50% of the world's thoughts to build
systems that make short-term profit while pressing forward.

Just a few that immediately come to mind:

* using their network to control an entire alt coin ecosystem

* using data trends to trade on global markets

* start a competing business and deindex or penalize a competitor.

* build skynet (kind of joke)

So basically, those scenarios are fairly suboptimal, and I could certainly
imagine that several thousand geniuses with knowledge of Google's systems AND
the world's data could become profitable quickly.

~~~
majewsky
> as search deteriorates

Can you expand on why search is deteriorating? Honest question. I certainly
don't see the relevance of search sinking, nor can I see any competitor in the
market that could even come close to threatening Google's monopoly on search.

~~~
vonklaus
It is my position that search can never be decoupled from the browser, and
when I say "search" in the statement you are referring to I mean Google, as it
is peerless for English-language search.

Search is in fact massively expanding as tooling and machine learning
capabilities increase due to research and hardware. Similarly, Google, Apache
and Elastic have many open source libraries for search, indexing, storage,
caching and serving which allow for scalable architecture. Also, outside of
the things above like crawlers, Hadoop, Solr, etc., Microsoft and Google have
open sourced JS parsing engines and Node, and the Electron browser, Brave
browser and Node Web-Kit are built on technology that leverages this.

So, as someone who is not an information architect or data scientist, it seems
like we have an ecosystem where a scaled-down version of Google can be built
and trained on a per-user basis and kept completely private.

I have hashed out the solution in more detail elsewhere, but on to the actual
question: is search deteriorating?

* My results seem to be worse and I have much less control than before. Anecdotally, it seems as if quotes and boolean ops are respected less.

* Discovery is a huge issue that Google solved well, now we have the opposite conditions but the same problem. There were very few sites and it was hard to know what content was on them. Now there is too much content.

* Without fine-grained control over my search I can't make a distinction between _Information vs. Links_. This is the difference between needing a date or a well-accepted piece of content/documentation vs. finding some new apps or non-facts. DuckDuckGo is quite good for some things and Google is good for others. Sometimes you may want to eliminate all WordPress sites (many content mills are built on this) or remove Alexa links from your queries if you need to discover something.

* Time filtering is bad. E.g. I have a problem with a JavaScript function and get back results from 3 years ago. That this works at all is amazing and difficult to do, so commendable, but I need newer info as the pace changes. E.g. news.

* Need to eliminate sites and content I don't want. NOT something like a content filter for porn or whatever, something like:
    
    
         never return results from %news-websites older than 30 days.

         never return content posted %before nov-2014

         remove links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] for reputation ranking

         decrease links from [%Alexa-1000, %Wordpress, #TLD(.co,.co.uk)] by [80%] for reputation rankings

There are other things but so far my point has been:

* Google provides little versatility in its results.

* Many pieces of well-tested software would make it easy (for the right group of software engineers) to silo crawl data and parse it with a user's own parameters.

There is a way to set up this ecosystem that I have been thinking about, but
to conclude:

Google is fucking awesome and really, really good at what they do. The search
experience is getting worse in terms of control, but tooling is leagues
better. Google sees this and is working on loftier goals internally (I
imagine), thus it has split up into a meta-company that will work as an
accelerator for growth while capitalizing on some verticals, like the real
estate thing they are doing or the delivery service they just announced, to
stay profitable in the short term before they can achieve their end goal.
Also, advertising is an unsustainable paradigm for internet growth for many
reasons.

Notes:

The DOM is super fucking horrible.

The parsing engine is a great fix for a fucking horrid DOM.

DNS security is fucking horrible.

The next Google will be a browser & an optimization marketplace.

I don't think compiling to WebAssembly makes sense, but I could be totally
wrong. I think something like Docker would provide a sandbox that would let
people get performance and versatility and sidestep the entire DOM /
only-run-JS / apps-vs.-content problem. No idea how this works on mobile
though.

~~~
majewsky
Wow, awesome response. Will need to let that sink in.

> So, as someone who is not an information architect or data scientist, it
> seems like we have an ecosystem where a scaled down version of google can be
> built and trained on the per user basis and completely private.

I'm skeptical. Search is a huge problem just because of the bizarre amount of
resources you need to throw at it. I can't afford to build my own
datacenter(s) to host my custom search system. There might be huge advances
ahead in terms of storage capacity on commodity systems, I don't know, but in
any case, I'm only one person crawling webpages versus millions of people (and
bots!) creating them.

You implicitly address that a bit later by talking about "silo crawling", but
again, I'm skeptical. The only silo structure that I can easily see is large
sites with useful content like Wikipedia or StackOverflow/StackExchange, but
I'm likely to come across these anyway in any given domain, and I can easily
filter for these on Google today, e.g. "site:en.wikipedia.org". The more
interesting and hard part is the long tail of small, sparsely interconnected
websites which might contain unusual insights but are unlikely to turn up
with a silo crawler (or with Google's current UI, for that matter).

> Search experience is getting worse in terms of control

I guess that's the classical problem of scaling a product to a large audience
of mostly technically illiterate users. Maybe Google is learning from Apple,
whose UIs have for a long time favored ease of use over giving control to the
user.

~~~
vonklaus
> I'm skeptical. Search is a huge problem just because of the bizarre amount
> of resources you need to throw at it. I can't afford to build my own
> datacenter(s) to host my custom search system. There might be huge advances
> ahead in terms of storage capacity on commodity systems, I don't know, but
> in any case, I'm only one person crawling webpages versus millions of people
> (and bots!) creating them.

I have been thinking about this and have come up with some ideas; other people
would obviously provide more ideas, and a solution could be reached. Some of my
thinking:

A service that behaves like AWS/Git/DNS/Google combined.

* A user runs the service and indexes data it receives, and there is a central repository of information that a user can contribute to or not. Initially, a new user would either buy a crawler or a cache of data from the market and store it locally or on a private server bought as part of the service. The blockchain, or some other verification mechanism, would be used to provide access to the initial seed data, and a hash would verify the contents. The user now has a running cache of data s/he can connect to with a private DNS-like verification system. User runs search.

* The parameters do not return the results s/he wanted from their private store. Similar to DNS, they move up the food chain to the service provider (whoever creates this system, or one of the companies/orgs providing the service) to get more data. Here there is a centralized repo of information. This can be a market or platform. People can buy and sell data, filtering mechanisms and crawlers. Also, people can include all of their searches, or some of their search results, in the master crawl. This would be the "datacenter", but it can also be a platform that maps to many people's individual caches:

> So if I indexed and codified everything about the Beatles I could sell this
> to the market by running my own server.

> I could sell a crawler that is really really good at finding all musicians
> and music to the market.

> I could sell a filtering/parsing engine plugin for music guys crawl results
> (or all results it is fed) that only delivers high quality FLAC audio files
> and converts high-enough quality MP3s to FLAC, all this but only for tracks
> with a Saxophone.

However, fucking music guy's crawl stack doesn't have the shit I want in it.

I can buy (or write) a master crawler that goes out onto the internet and
finds what I am looking for then delivers it to my private cache, and if I am
generous, codifies it in a generally accepted meta language and inserts it
into master.

Obviously there is much more here but what I am talking about is distributed
and optimized search.

Notes: Google has a nearly impossible job:

* It does not allow a user to provide any filtering outside of some boolean operators and human language.

* Therefore it never knows exactly what the user wants.

* Provides a general service so to some extent it is one size fits all.

* Difficult to do machine learning because it can watch you make selections but may not ever be able to tell what the deliverable was or if you were successful.

* You cannot back out of or modify the algorithm it uses to find results. Necessarily, even when it knows what you want it is biased, because it shows you the results and defines the algorithm. Also, per the statistic I am baselessly making up right now, only 2.1% of users ever go to the 3rd page, which means that if Google is wrong, it can't know, and the problem compounds as users see the same bad pages and keep clicking them.

> I guess that's the classical problem of scaling a product to a large
> audience of mostly technically illiterate users.

Yes. I am not saying Google is doing a bad job. They have a nearly impossible
task if they only use a search bar with natural language and zero filtering to
deliver trillions of terabytes of data to millions of people. I am not sure
how much easier it would be, but certainly n times easier, if filters worked.

Also, I think the idea of HTML is basically shit EXCEPT for the meta language.
If not some simple JSON, then the actual results need fucking tags, not just
the content; then we could filter down further and better.

Group annotation.

File sharing.

Bitcoin payment for content/filtering/cooperation

Running arbitrary code in a sandboxed environment like docker, not a "DOM"

Also, the silo concept is like DNS, if I didn't explain it super well. You
have a cache on your computer, a cache in the cloud, access to a master cache
of information (both receivables and lookups of other silos) and an
optimization market for searching through data, or finding more of it if
necessary.

It's 100% obvious search will end up this way. Brave Software seems to sort of
get this. I am hoping they realize a browser can't be decoupled from search
though, because you can't just fork Electron and put some plugins in it. They
are super talented. I am hopeful. One of the systems similar to what I am
suggesting is called memex-explorer. However, I have never used it, as the
build is currently failing. It was originally funded by DARPA and NASA JPL,
then one day all work stopped on it, and I have emailed and tweeted some of
the people and orgs with no response. So while doing research, the description
seems somewhat in line with my thinking.

The large problem of scaling is handled by the market. Search is essentially
an API to call APIs that call an RSS feed if you think about what your browser
and google are actually doing. Knowing what those APIs do is pretty fucking
important.

I try to share this info with people but they all think I am fucking insane.
Does this sound that farfetched? Honest question.

------
eps
Correct me if I'm wrong, but wasn't there a long-standing Google policy that
the version of the page served to their crawler must also be publicly
accessible? That would then be the reason why WSJ articles were accessible
through the paste-into-Google trick, rather than because WSJ was incompetent
and failed to "fix" the bypass.

So does it mean that Google will no longer index full WSJ articles, or does it
mean a change in Google's policy?

~~~
morgante
You are correct: Google requires that you let users see the first click for
free if you want content behind a paywall indexed. [1]

Since this is billed as an "experiment" I'm guessing that WSJ is just testing
the waters. If they roll it out to everyone, they will have to serve only
snippets to Google or risk getting delisted.

[1]
[https://support.google.com/news/publisher/answer/40543?hl=en](https://support.google.com/news/publisher/answer/40543?hl=en)

------
zaroth
And congratulations, you have likely just "exceeded authorized access" and
committed a felony violation of the CFAA punishable by a fine or imprisonment
for not more than 5 years under 18 U.S.C. § 1030(c)(2)(B)(i).

From the ABA: "Exceeds authorized access is defined in the Computer Fraud and
Abuse Act (CFAA) to mean "to access a computer with authorization and to use
such access to obtain or alter information in the computer that the accesser
is not entitled so to obtain or alter."

To prove you have committed this terrible felony, the FBI will now demand that
Apple assist in disabling the secure enclave of your device in order to access
your browser history. But remember, they only need to do this because they
aren't allowed to MITM all TLS and "acquire" -- not "collect" -- every HTTP
request your machine ever makes. </s>

~~~
jrockway
User agent strings have a long history of being intentionally misleading. IE
11 claims to be "Mozilla/5.0". Chrome claims to be "Safari/537.36". The User-
Agent string is all lies, and has been ever since the first site started doing
UA sniffing.

~~~
zaroth
It's _intent_ that matters. Setting user-agent in order to properly render a
page is legal. Setting a user-agent string to gain access to otherwise
unauthorized content is probably not.

~~~
jsprogrammer
User-agent string is not an authorization mechanism.

~~~
zaroth
Interestingly, the CFAA does not define the term "without authorization";
however, it does define "exceeding authorization" exactly as I quoted above:
    
    
      - to access a computer with authorization and to use such access
        to obtain or alter information in the computer that the accesser
        is not entitled so to obtain or alter.
    

So arguing User-agent is not an authorization mechanism probably won't help
you, because exceeding authorization means, first, that you _were_ authorized
to access the computer (HTTP GET returns 200) but then that you used that
access to obtain information in the computer that you were "not entitled so to
obtain."

~~~
jsprogrammer
If the computer on the public Internet responds to a standard HTTP request, it
has explicitly authorized my access to whatever information it sent me.

~~~
zaroth
Unfortunately this will not help your defense. Andrew "Weev" Auernheimer was
convicted of violating CFAA for exactly this (although the conviction was
later overturned on a technicality).

Again, exceeding authorized access means using your _authorized_ access to
obtain information you were not "entitled" to. So the question is not 'were
you authorized' but rather it is 'were you _entitled_ ' to that information?
WTF 'entitled' means is another question entirely, but likely it is in the eye
of the beholder. A jury decided Weev was not 'entitled' to the email addresses
he downloaded from AT&T, and it's safe to assume we are not 'entitled' to free
access to WSJ's content. So I would not rest your hopes on the "200 OK".

~~~
jsprogrammer
> Andrew "Weev" Auernheimer was convicted of violating CFAA for exactly this
> (although the conviction was later overturned on a technicality).

So, your example is...not an example?

>WTF 'entitled' means is another question entirely

No, in this case it is very clear: a request containing a particular user
agent string is entitled. I have not tried this myself, but presumably you
could verify that is the case by sending a request with the appropriate user
agent.

~~~
zaroth
It's the best example we got. The case was overturned (after he spent quite
some time in federal prison) not because it was found that he didn't violate
the CFAA but because the charges were brought in the wrong jurisdiction.

Again I think you're confusing the fact someone could trick the server into
delivering the content for free with WSJ intending to deliver their content to
you for free. Since WSJ clearly intends their content to be delivered to only
Googlebot for free and to users only if they pay, it is likely a jury would
consider this a violation of CFAA.

A web server returning 200 OK is not ipso facto a guarantee the person making
the request is not committing a crime. To give a more obvious example, if the
request header contains a stolen authorization token. The law does not require
the access control be non-trivial to defeat.

I don't like it, and I think the CFAA is seriously problematic, but it is the
law and the Feds have been known to enforce it.

~~~
jsprogrammer
As far as the law is concerned, he did not violate anything. You do not have
to prove yourself innocent, the burden is on the prosecutor to prove a
violation. In the example you cite, no violation has been shown.

It is not at all clear that WSJ intends Googlebot to get their content for
free while others must pay. This is actually against Google's policies, which
would call into question whether WSJ's behavior is felonious. WSJ may not be
entitled to be incorporated into Google's index, yet they are manipulating the
Googlebot to the contrary.

~~~
jsprogrammer
If you will down mod, at least show how I am wrong!

------
mbroshi
Am I alone in feeling like this is akin to a tutorial on how you can shoplift
without getting caught? WSJ, for better or worse, does not want to give you
content without your paying for it. If you take that content without paying,
you are stealing. Just because you have figured out how to get past their
security does not mean it's not stealing.

(See the second precept here:
[https://en.wikipedia.org/wiki/Five_Precepts](https://en.wikipedia.org/wiki/Five_Precepts))

~~~
azakai
Legally it might, I don't know.

Morally, I'm not sure.

1\. If they allowed all bots but disallowed all non-bots, that would raise
questions of what defines a "bot". Couldn't I write a personal bot that
fetches the story for me? As a browser addon, even?

2\. It's even more complex since allowing bots means they allow tools that
provide the information to third parties, as the bots are not intended for
private use by the bot maker. So the door is already open.

3\. But in practice, it seems they favor certain bots. Is it ok that the WSJ
lets Google do things Google's competitors cannot? Try to use the "web link"
trick from HN on any other search engine, and it doesn't work in my
experience. That seems anti-competitive and discriminatory in favor of the
existing dominant entity in this space, Google.

~~~
mankyd
> If they allowed all bots but disallowed all non-bots, that would raise
> questions of what defines a "bot".

Maybe, but I think it's a pretty easy distinction. They aren't even allowing
all bots - they're allowing a whitelist of them. You're not just writing your
own bot to get around it, you're pretending to be someone else's bot.

> But in practice, it seems they favor certain bots. Is it ok that the WSJ
> lets Google do things Google's competitors cannot?

That's the really important question. I personally have no context for
answering except to say that I can see both sides argued. If you view their
website as a physical store / private establishment, then I assume that they
have every right to establish who has access to what and under what
conditions.

Of course, that hampers a lot of legitimate use cases along the way.

~~~
sturgill
> If you view their website as a physical store / private establishment, then
> I assume that they have every right to establish who has access to what and
> under what conditions.

Not true. There are very specific laws about not being able to discriminate
against protected classes of people.

Bars have to serve minorities, bakeries have to cater to same sex marriages,
etc.

Where you draw the line of legislated equality within private property rights
is pretty intriguing. I don't have any answers, but lean heavily towards the
libertarian bent.

~~~
mankyd
That's fair, though private establishments are still allowed to distinguish
based on other criteria such as membership, partnership, etc. I didn't mean to
be super specific. I'm merely pointing out that they are allowed to establish
barriers to entry.

------
mikemikemike
This is an odd debate. Let's say a restaurant declares "veterans eat free."
This blog post is like a friend telling you "Hey if you tell this restaurant
you're a vet they'll give you a free meal." No one said it's legal or ethical.
It's lying to trick someone into giving you something at their expense.

I think the relevant point, underscored by the author's last sentence, is it
doesn't matter who you open a back door for - it opens the possibility for
anyone to barge through.

~~~
elaineo
that's a good analogy.

------
mangeletti
This is not meant to be purely controversial, but I thought long and hard
about WSJ a few months back when an HN mod (I always forget his name) said to
stop complaining about paywalled links being posted because paywalls were OK.
I agree paywalls are OK. But some things are not OK.

Take a look, for instance, at the WSJ.com home page with an ad blocker turned
on (note all the missing letters and scrambled up titles). They want me to
pay, and they want me to see ads, and they want to track my behavior? Should I
send them my DNA also?

Organizations like WSJ are exactly the disease that causes ad blockers to
proliferate and ruin the web for all the decent publishers. They're at war
with my privacy (by breaking their site intentionally when I visit with a
blocker on). They want it all, ads, tracking, your private data, and
subscription revenue, not to mention...

# Agenda-Driven Content

I mean, we're basically talking about NBC or Fox here, just on the web.
Imagine every morning when you woke up you turned on the television and tuned
to some "news" show. After talking about the weather, they start talking about
a lost pickle that is thought to be potentially alive and moving about with
free will. Over the next two years, talk about the same pickle extends to
every other TV show. Before you know it, everybody in the nation is talking
about the same pickle. Years go by, and that pickle has become a part of our
society, and that's not because people are born with an innate care for the
well-being of pickles, but because "news" shows taught them to care.

That's not a good position to be in. I have to believe I'm not the only one in
here that doesn't watch any TV. So, why do we all treat the same media giants
differently on the web? We crave their content so much that we build browser
add-ons to get to their content, etc.

~~~
Laaw
If they make sending your DNA a requirement of consuming their content, then
yes, you send it to them if you want their content. That's their right, as
owners of something, to dictate its use.

You aren't entitled to WSJ.com, NBC, or Fox.

~~~
mangeletti
I'm not a collectivist, nor do I believe I'm "entitled" to anything, including
human rights. I'm simply stating that "I don't support organizations like
WSJ"... as you can clearly see, if you read my comment. I don't propose anyone
ban them. I simply "don't know why we [as a society] support them", also
clearly in my comment. My point, which your political diatribe kept you from
noticing, was that we're fighting awfully hard to consume their garbage.
Meanwhile, there is plenty of content out there that is free, not because of
socialism or entitlement, but for the same reason ad blockers became
popular... because information is so readily available now that nobody _is_
ready to send their DNA in.

------
metafunctor
I'm pretty sure Google will soon stop indexing WSJ. Why index something if the
vast majority of users cannot access the pages behind the links?

EDIT: The "paste a headline into Google" trick still works for me, though. If
this continues to be the case, they will keep indexing, of course.

~~~
rhino369
>Why index something if the vast majority of users cannot access the pages
behind the links?

So people can find it? I'd be pissed if Google de-indexed something like IEEE
because it has a paywall.

Assuming the internet has to be freely available is a mistake. Especially with
the continued growth of an adblocked internet. We could be facing an internet
with significant paywalls in the future.

I'd support a "free" search term to weed out paywalled results.

Furthermore, Google shouldn't be making normative judgements about what people
should see. It's an abuse of their monopoly.

~~~
morgante
Google doesn't forbid or de-index paywalls. What they forbid is cloaking
(showing Google different content than what users will see). This is, of
course, quite critical to maintaining search quality.

WSJ is free to institute a full paywall and only serve snippets to Google.
They might not like what it does to their rankings, though.

What they cannot do is continue to sniff the UA before deciding to put up the
paywall. (Though I'm still able to use the Google trick, so it seems the
experiment might have ended.)

------
sylvinus
Well, that trick won't last long either. It's trivial to verify that an IP
indeed belongs to Google:

[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)
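
The check described on that page is a forward-confirmed reverse DNS lookup. A
minimal sketch in Python (standard library only; the IP below is just an
example from Google's documented crawl range):

    import socket

    def is_googlebot(ip):
        """Forward-confirmed reverse DNS: the PTR name must end in
        googlebot.com or google.com, and resolving that name must give
        back the original IP."""
        try:
            host, _, _ = socket.gethostbyaddr(ip)        # reverse (PTR) lookup
        except socket.herror:
            return False
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        try:
            _, _, addrs = socket.gethostbyname_ex(host)  # forward (A) lookup
        except socket.gaierror:
            return False
        return ip in addrs

    print(is_googlebot("66.249.66.1"))  # resolves to crawl-66-249-66-1.googlebot.com

A spoofed UA coming from a random VPS fails this check, since its PTR record
doesn't point back into googlebot.com.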

~~~
dogma1138
Seems to work if you deploy a proxy on Google's app engine and use it to
access WSJ ;)

~~~
slig
An App Engine app wouldn't have an IP with a reverse DNS of *.googlebot.com, would it?

~~~
dogma1138
Nope, but it does resolve to something under .google, and if you check
netblocks.google.com it will appear there, so they might not be limiting it to
Googlebot only at this point.

[https://support.google.com/a/answer/60764?hl=en](https://support.google.com/a/answer/60764?hl=en)

~~~
greglindahl
I don't think that any server that random people can start a proxy on is in
Google's SPF record.

~~~
dogma1138
Not on the SPF list, but on Google's domain/IP blocks list.

~~~
greglindahl
The link you yourself posted says "The most effective means of finding the
current range of Google IP addresses is to query Google's SPF record." While
this is intended to be mailservers only, it has always appeared to me that the
SPF list is all of their servers that 3rd parties can't run proxies on.
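
For the curious, walking that SPF record is just a couple of DNS TXT queries.
A rough sketch in Python (assumes the third-party dnspython package, version
2.x; older versions use dns.resolver.query instead of resolve):

    import dns.resolver

    def txt(name):
        # Concatenate the TXT strings for a record into one SPF string.
        return " ".join(
            part.decode()
            for rdata in dns.resolver.resolve(name, "TXT")
            for part in rdata.strings
        )

    def google_netblocks(name="_spf.google.com"):
        # Follow include: entries recursively, collect ip4:/ip6: blocks.
        blocks = []
        for token in txt(name).split():
            if token.startswith("include:"):
                blocks.extend(google_netblocks(token[len("include:"):]))
            elif token.startswith(("ip4:", "ip6:")):
                blocks.append(token.split(":", 1)[1])
        return blocks

    print(google_netblocks())

These are the ranges Google publishes primarily for mail, so treating them as
covering exactly "servers third parties can't run proxies on" is an inference
on my part, not something Google guarantees.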

------
kenshaw
Basically, the article is telling you to change the User-Agent to Googlebot or
Bing or whatever other crawler UA you'd prefer. While that's doable, it's
something that is easily detectable and prevented, as all of the big crawlers
can be validated against DNS.
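
(For anyone who hasn't read the post: the whole trick boils down to something
like this throwaway sketch with Python's requests library, not the extension's
actual JavaScript. The URL is a placeholder.)

    import requests

    GOOGLEBOT_UA = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                    "+http://www.google.com/bot.html)")

    # Ask for the article while claiming to be Googlebot.
    resp = requests.get(
        "http://www.wsj.com/articles/some-article",
        headers={"User-Agent": GOOGLEBOT_UA},
    )
    print(resp.status_code, len(resp.text))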

Additionally, I would like to point out that I wrote a Varnish extension for
the express purpose of validating User-Agent strings through DNS lookups; it
is available here: [https://github.com/knq/libvmod-dns](https://github.com/knq/libvmod-dns)

It was built because we specifically had a problem with bad bots crawling a
large site (multiply.com), and this was one of the easiest ways to filter out
the bad bots from the good and to enforce robots.txt policies on a per-bot
basis. It works very well, as you can do any kind of DNS caching internally
and prevent this kind of behavior, if that's your goal.

------
matt_wulfeck
I like WSJ but I only read maybe one article every other day. They need a more
reasonable price point, especially since the market will bear almost no price
at all.

That being said I do enjoy their content, save for maybe the op-eds.

~~~
acomjean
I'm surprised that most online papers won't sell you one day's worth online
for a buck or so. Like buying a real newspaper.

They all seem to want to sell subscriptions, which are perpetual and probably
difficult to cancel.

~~~
matt_wulfeck
Even a dollar a day. I can pay Netflix $10 a month and stream unlimited HD
video, but WSJ wants $30 to read the first few paragraphs of a few articles a
day?

The pricing here is much too aggressive.

~~~
Nadya
Videos are more relevant to rewatch than articles are to reread. One needs to
output more articles than videos, because articles must be "fresh" or you'll
lose an audience. Nobody is printing 40 year old news - people are still
watching 40 year old movies.

Playing devil's advocate here. Pricing for many online goods is almost
completely arbitrary and varies with little accord to service/product quality
or even what that service provides.

Another related example of arbitrary pricing: people will pay $2 for a soda
from a vending machine but won't pay $1 for a useful app on their phone.

There's something going on there... The day the $1 apps figure out what makes
people buy $2 sodas is the day they become rich. And the sooner content
providers figure out why people will pay $20/month to stream media (let's say
$10 to Netflix, $10 to Spotify or something) and how to charge people $20/mo
for their articles... things will turn around for them.

~~~
majewsky
> Nobody is printing 40 year old news - people are still watching 40 year old
> movies.

Because news operates at a different timescale than movies. People are still
reading 40-year-old books.

A 40-year-old news article is not relevant anymore because it only fits within
the momentary context in which it was created, whereas a non-fiction book or
even an essay can span a broader context and thus stay as informative for
future readers.

------
jrochkind1
I thought Google specifically disallowed returning different pages based on
User-Agent sniffing for Googlebot, and that this included paywalls.

Are they running afoul of Google policies and going to get pinged by Google?

I can't find the text from Google now (when can you ever find any docs at
google?), but I am very certain I remember reading from them that you may not
return different content to GoogleBot based on User-Agent.

------
crazysim
Doesn't this kind of also hurt SEO? I'm would guess Google has some automated
system to detect and apply a negative signal to sites that provide different
content to a Googlebot user agent than a non-Googlebot user agent. I guess
these sites are counting that the other signals outweigh that negative hit.

Otherwise, why would expertsexchange be obligated to provide the answers at
the very bottom? Did something change?

~~~
eli
I'm 99% sure I've encountered a Googlebot crawling pages with the UA of a
regular browser, presumably for exactly this purpose.

~~~
chinathrow
They do this via an iPhone-like UA, though googlebot is still in there too.

~~~
eli
I'm pretty sure that's just to see if the site is serving different content to
mobile vs desktop. I _think_ they also sometimes hit pages with no mention of
googlebot.

~~~
x0
If you do an nslookup on the IP, it should come up with crawl-xx-xx-xx-
xx.googlebot.com, where xx-xx-xx-xx is the IP.

------
Gratsby
If you hit a paywall or a "sign up to access this content" message from a
google search result, report it. Google will remove them from the search
results, they will lose their largest traffic source, and they will address
the issue. Or they won't because they have enough paying customers.

------
zem
i thought of doing that when the "search google" trick stopped working, but i
decided it crossed the point where i would feel like i was unfairly
circumventing their clear desire not to serve me the content. i've just added
wsj to my mental ignore list and count it as a few more minutes gained to do
something else.

~~~
ivan_ah
Yeah, same here. Every time I get to a tab where I see a paywall I just close
that tab, probably saving 5-10 mins of my life!

------
jdunck
If Google (or any other crawler) wanted to play nice with paywalls, they could
issue a public key for their bot, and put a signature in their User Agent
string that the domain could then verify.

Those signatures could obviously leak, but on a per-domain basis. Perhaps the
domains could have a secure way of bumping the valid key generation if they
had a leak.
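
A hypothetical sketch of what that might look like (Python with the
third-party cryptography package; all names and the UA format are made up for
illustration):

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey, Ed25519PublicKey)
    from cryptography.hazmat.primitives.serialization import Encoding, PublicFormat

    # --- crawler side: key pair, public half published out of band ---
    private_key = Ed25519PrivateKey.generate()
    public_key_bytes = private_key.public_key().public_bytes(
        Encoding.Raw, PublicFormat.Raw)

    def sign_ua(domain, key_generation=1):
        # Sign a per-domain token and embed it in the UA string.
        token = "{}|gen={}".format(domain, key_generation).encode()
        sig = private_key.sign(token)
        return "Googlebot/2.1 (+http://www.google.com/bot.html) sig=" + sig.hex()

    # --- publisher side: verify the signature against the published key ---
    def verify_ua(ua, domain, key_generation=1):
        sig = bytes.fromhex(ua.rsplit("sig=", 1)[1])
        token = "{}|gen={}".format(domain, key_generation).encode()
        try:
            Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(sig, token)
            return True
        except InvalidSignature:
            return False

    print(verify_ua(sign_ua("wsj.com"), "wsj.com"))  # True

A leaked signature only unlocks the one domain it was minted for, and bumping
key_generation (or rotating the key pair) invalidates the old signatures.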

~~~
AnthonyMouse
There are two problems with this.

First, they don't want to. In fact, if a search engine can figure out that a
link is going to lead to a paywall, they'll probably want to reduce the
ranking of the result, because the user is not going to want results they
can't actually look at.

Second, it would be a massive antitrust violation because it would prevent
access by competing crawlers. The only way around that is to allow access to
anyone who claims they're a crawler, which was the original problem.

~~~
LunaSea
The current situation with the WSJ could already be considered an antitrust
violation. It's whitelisting one crawler and leaving the other ones out.

------
mchahn
Bypassing the paywall is more unethical than blocking ads. It is one thing to
have control over your own browser but another to steal something from another
site.

Also, isn't it illegal to bypass computer security?

~~~
nkrisc
How is modifying your own request headers any different than choosing to not
display content returned in the response body?

Their server can choose to do what it wants with your request and you can
choose what to do with the response it sends.

Are User-Agent headers legally protected identities?

~~~
effie
The difference is substantial; in principle, one can modify his request to
intentionally bypass the authorization mechanism occurring on the server. One
cannot mislead anyone/anything by displaying his data in a customized way in
private on his computer.

~~~
nkrisc
Fair point. Intent matters. However must it be demonstrated that there is some
specific malicious intent behind changing a request header versus simply
changing your user agent for the heck of it?

In a related point, some news sites load a modal and prevent scrolling over an
article asking you to sign up. However if the full article is included in the
response and I read it by simply viewing the response body (HTML) is that
circumventing security? (Actual example.) In this case, by modifying how the
response is rendered in my browser, I can bypass their intentions.

Obviously the better way of doing this would be to not send the entire article
content until they've determined I should be able to view it.

~~~
effie
> _must it be demonstrated that there is some specific malicious intent..._

I think so. Just the act of changing User-Agent alone does not mean a fraud is
happening. User-Agents get changed often for valid reasons - research,
detection of cloaking, testing etc.

> _However if the full article is included in the response and I read it by
> simply viewing the response body (HTML) is that circumventing security?_

That is a hard issue. I think the answer should be no, it is not
circumvention, at least not in the above sense.(*) If you can get the full
article on your computer, directly readable without fraudulent behaviour on
your part, this means that the sender did not place a security measure to
guard it, so there can be no circumventing it. Switching javascript off or
displaying the source text of the document is fully in the control of the
requestor, and the sender should know it.

(*) Unfortunately, it seems legal systems allow companies to restrict what
you do with things they produce, even if you do it on your computer in
private, so legally, this may not be successful in court.

------
hueving
Based on the comments here, am I to understand that by constantly browsing the
web with my user agent string set to a Googlebot string, I am committing a
felony? How would I even know which sites I'm gaining unauthorized access to?

That is completely idiotic if there is a string you can put in a Mozilla
browser config that is literally illegal to browse the web with.

~~~
effie
I do not think using any particular User-Agent alone constitutes a crime.
There are valid non-criminal reasons why one would want to use Googlebot's or
another User-Agent. I think it is the intent to bypass the paywall plus
success in doing so that may be regarded as an offense or even a crime, but
I'm not sure.

------
chrishn
> Remember: Any time you introduce an access point for a trusted third party,
> you inevitably end up allowing access to anybody.

 _cough_ NSA _cough_

------
ikeboy
New workaround: paste the article title into archive.is. I don't know what
they're doing but they have a workaround of some sort.

~~~
LunaSea
That's actually really interesting! Could anyone chime in and explain how they
might work around this issue?

~~~
ikeboy
My guess is they have a login, but they could just be using some workaround
like the one described in the OP.

I've used them to save Facebook posts before, and the pages were logged in to
some "Nathan" IIRC. They probably have a bunch of hacks for specific sites
that needed fixing.

------
jgh
I just tried clicking on "Harper Lee, Author of ‘To Kill a Mockingbird,’ Dies
at Age 89" from wsj.com's homepage and got the paywall.

I then pasted the headline into google and clicked on it from Google results
and did not get hit by the paywall.

~~~
zem
they're probably doing some sort of a/b testing by selectively letting some
clicks through

~~~
adamrights
This is basically true ^^

~~~
simonswords82
Any idea what the SEO impact would be of WSJ blocking everyone who isn't a
Google bot?

------
GigabyteCoin
I was under the impression that the "hack" whereby you searched for the
article on Google and clicked through to that article (effectively skipping
over the paywall) was a demand of Google's and not an oversight by the
paywalled website.

I thought that google deemed providing search results which were behind
paywalls as a "bad experience" for their search users, and would penalize
websites for doing so.

Is this no longer the case?

~~~
elaineo
Google doesn't demand anything. If your paywalled website is not accessible by
Google's crawler, then Google will not index it. Publishers want Google to
index their pages and drive potential paying visitors, which is why they open
the loophole themselves.

For the second point, Google does require that publishers specify
"registration required" in their sitemap.

~~~
GigabyteCoin
If you're showing Googlebot one thing, and visitors who visit your website
through google another, that's essentially "cloaking" (a blackhat SEO
technique). At least it used to be.

------
tete
Doesn't Google usually try to punish websites that show users something
different, and don't they even mention that somewhere?

Not an SEO expert here, but I wonder how and whether Google will end up
handling that. I mean, making an exception could also be considered abuse of
power in some countries of the world. I don't have any strong opinion on that
yet, just saying that because of how the EU has exercised certain laws in
recent years.

------
Illniyar
Aren't you supposed to verify if a visitor is a googlebot by reverse lookup of
the IP address? I.E.:
[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

User-agents are notoriously unreliable.

------
philip1209
I wonder how many Google Cloud customers use the servers to run spoofed
Googlebot crawlers from the Google IP range in order to bypass paywalls and
scrape large sites (like LinkedIn) without hindrance.

------
0xCMP
It's broken already. I tried to access an article about new China rules for
online news and it paywalled me. They're probably checking whether clients
resolve back to googlebot.com now.

------
mikestew
So does HN now choose to not post articles from the WSJ? I was comfortable
with the "google it" trick, and frankly was a little annoyed with constant
"paywall, wah!" comments when what should be by now a well-known workaround
was available. But that workaround no longer works.

~~~
mark-r
They've been testing the new wall for a while now. I know I made one of those
"paywall wah" comments when the Google workaround didn't work for me. Then the
next time I tried it worked fine, so it must have been random selection.

------
coverband
My Windows anti-virus deletes the linked sample code automatically upon
download, marking it as "Trojan:Win32/Spursint.A". Did anyone have the same
experience? (I was actually more interested in using it as a template for
writing a simple Chrome extension.)

~~~
mattmaroon
Yep. I then pasted it but it didn't work on wsj.com. Oh well.

~~~
elaineo
try deleting cookies, then hit refresh.

~~~
mrgrieves
Not working here either. I tried wiping out cookies and cache, and Chrome is
sending the right user agent (Googlebot).

~~~
nzealand
Try disabling other extensions that might alter the user agent.

------
mildweed
Solution:

Content providers register for a (yet-to-be-written) Google News API account
and get an API key, with which Google indexes the site and which the site
recognizes as legit.

------
jasonwilk
I've noticed that this has stopped working on WSJ if you've already hit the
paywall and try to google the article to bypass.

------
f137
I wonder if anybody tried to do as suggested? I copied the files to Chrome as
per instructions, and the paywall was still in place.

------
warrenmar
You can also access WSJ for free at the library.

~~~
creativityhurts
It reminds me of this: "Trying to save a quarter..."
[https://www.youtube.com/watch?v=j4nRHHPpnVc](https://www.youtube.com/watch?v=j4nRHHPpnVc)

------
jupp0r
It's not bypassing at all. Google's crawlers are deliberately let in, because
a paywall that nobody runs into is useless.

------
chinathrow
So soon they have to block anyone with a fake Google UA and whitelist the well
known 66.249 IP range. Trivial.

------
yyin
Does WSJ check visits from a Googlebot UA against a list of known Google IP
addresses?

------
amelius
Fix: replace the user agent string by a cryptographic challenge/response
scheme.

------
pmontra
They'll start allowing only the IP addresses that search engines have agreed on with them.

------
daveheq
Possible in Firefox? Some people won't use Chrome.

------
spitfire
Is there a version of this available for Safari?

------
systemz
So their next move is to check whether the IP is from Google.

~~~
philip1209
[https://cloud.google.com/](https://cloud.google.com/)

------
throwaway21816
>Archaic news source does something to hurt their market penetration on the
internet

Great idea here guys

------
dude_abides
Or simply use incognito mode and click on Google search result.

~~~
mrmcd
Did you read the article? It talks about how that trick no longer works on a
lot of sites because they are now checking User-Agent strings too.

~~~
lstamour
Actually I noticed sites have simply changed policies -- if you're a regular
visitor your cookies will identify you and block content. The Incognito mode
trick works for WSJ and others that would still check the referrer header.
Allowing Googlebot access and checking the referrer header are two different
things.
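
Purely as a guess at the sort of server-side logic this implies (not WSJ's
actual code, and the threshold is invented):

    # Hypothetical paywall decision: crawlers get everything, visitors
    # arriving from Google get a few free clicks, recognized regulars hit
    # the wall. Cookie-based visit counting is what incognito mode resets.
    def show_full_article(user_agent, referer, visit_count):
        if "Googlebot" in user_agent:
            return True
        came_from_google = referer.startswith("https://www.google.")
        return came_from_google and visit_count < 5

    print(show_full_article("Mozilla/5.0", "https://www.google.com/", 0))   # True
    print(show_full_article("Mozilla/5.0", "https://www.google.com/", 20))  # False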

Also, Google has published IP addresses it uses, so this extension might not
last long...

~~~
slig
> Also, Google has published IP addresses it uses, so this extension might not
> last long...

They do not [1], but you can find out by doing a reverse DNS query.

> "Google doesn't post a public list of IP addresses for webmasters to
> whitelist"

[1]
[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

~~~
lstamour
Sorry, that was what I meant. :)

------
obelisk_
1\. Google's web crawlers are not "bypassing" the paywall. It's the paywall
that lets crawlers through. I.e., exactly the reverse of what the author
implies with their headline.

2\. The idea that this is somehow new is wrong. The way for a server to
identify crawlers has "always" been to look at the user-agent and, when done
right, the IP, verified either by net block owner or by doing a PTR lookup and
then checking that the A or AAAA record for the claimed host points back at
the same IPv4 or IPv6 address. Meanwhile, I do agree that paywalling is a more
recent phenomenon, at least with regard to the extent it is popular among
sites today, _but_ the concept of presenting different data to crawlers and
visitors arose much earlier and is something Google has been aware of and has
made sure to delist sites for when found, whereas in fact Google has since
moved a bit in the direction of allowing it, in that they do so for Google
News if declared, as explained by others ITT.

So in my view, it seems that the author is jumping to incorrect conclusions
based on an incomplete understanding of what's actually going on here. What
about the HN readership, then: how come this article became so highly voted
and I don't see these issues raised by anyone else? Or maybe I'm just crazy?

~~~
tomkwok
> Google's Web Crawlers are not "bypassing" paywall. It's the paywall that
> let's crawlers through. I.e. exactly the reverse of what the author implies
> with their headline.

Don't nitpick. It's just a shortened version of _How To "Be" Google's Web
Crawler to Bypass Paywalls_. You get it. I get it. Everyone gets it.

