
Ask HN: Why does Google index pages I cannot access? - FreeHugs
Today I googled "follow famous readers online" and this is the top result:

https://www.nytimes.com/programs/better-reader/day-1

Google shows that the term is on the page. But when I try to access it, it redirects me to a signup page.

What's the deal here?

Does Google not notice that? Does it turn a blind eye? Does the NYT trick Google by displaying something else to Google than to me?

Or are pages that only members can read now part of the Google index?
======
FreeHugs
I now noticed that you can read the page if you hit escape fast enough:

[https://www.nytimes.com/programs/better-reader/day-1](https://www.nytimes.com/programs/better-reader/day-1)

When you visit it, it shows the page for a split second. Then it redirects. By
hitting escape in that split second, the page stays on screen and the redirect
does not take place.

Modern developers :-) They implement everything client-side in JavaScript.

It's kind of a fun game. Especially since the page then has no ads at all. All
newspaper pages should look like this.

~~~
weinzierl
Spiegel Online used (or uses, idk) a blurring overlay to hide their paid
content. My first thought was "That'll be easy!" and I removed the overlay
only to find that the underlying text was obfuscated in a way that made it
look like the original text through the blur. Well played Spiegel!

Still, I suspect the substitution could at least be partially reversed but I
never tried. What makes it more complicated than a usual substitution cipher
is that it can be arbitrarily lossy because it isn't meant to be deciphered.
As an extreme case they could just replace all uppercase letters with X and
all lowercase letters with x - for example - and no information could be gained
while still looking similar when blurred. Luckily it's not what they did, the
substitution looked more complicated. Another reason to make me believe that
it's reversible is that I think I remember David Kriesel (the _"Lies, damned
lies and scans"_ guy) once hinted that he did it. Anyway, deciphering it would
be a nice Sunday afternoon entertainment, I guess...
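
To make the lossy extreme case concrete, here is a minimal sketch. The function
name `lossyMask` is made up for illustration, and this is the hypothetical X/x
replacement described above, not what Spiegel actually did:

```javascript
// Hypothetical illustration: replace every uppercase letter with "X" and every
// lowercase letter with "x". Whitespace and punctuation are left untouched.
// \p{Lu} / \p{Ll} (with the "u" flag) also cover letters such as German umlauts.
function lossyMask(text) {
  return text
    .replace(/\p{Lu}/gu, 'X')
    .replace(/\p{Ll}/gu, 'x');
}

// Two different inputs collapse to the same output, so nothing can be recovered.
console.log(lossyMask('Der Spiegel')); // Xxx Xxxxxxx
console.log(lossyMask('Die Zeitung')); // Xxx Xxxxxxx
```

Because many different texts map to the same masked string, there is no key or
shift that would bring the original back.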

~~~
pintxo
They did in fact use a Caesar cipher [1], a quick calculation on letter
frequencies gave a strong hint. The rest was looking up the shift value and
coding a js bookmarklet [2] to "decrypt" the text. It had some issues with
certain German umlauts, but that was not enough of a problem to be worth fixing.

Looks like they no longer include the full text publicly. So they seem to
have improved.

[1] (German) [https://andreas-zeller.blogspot.com/2016/06/spiegel-
online-n...](https://andreas-zeller.blogspot.com/2016/06/spiegel-online-nutzt-
unsichere-casar.html)

[2] javascript:
document.querySelectorAll('div.obfuscated-content')[0].parentNode.classList = [];
var cc = (s, c) => s.split('').map(s => /[^\s]/.test(s) ? String.fromCharCode(s.charCodeAt(0)+c) : s).join('');
var dn = (n) => { if (n.hasChildNodes()) {Array.from(n.childNodes).map(dn);} else if (n.parentNode.nodeName !== 'A') {n.textContent=cc(n.textContent, -1)}};
document.querySelectorAll('p.obfuscated').forEach(dn);
document.querySelector('.lp_mwi_payment-method-wrapper').parentNode.parentNode.remove();
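
For anyone curious about the letter-frequency step, a rough sketch of the idea
follows. It assumes the most frequent non-whitespace character in the
obfuscated text stands for "e" (usually a safe bet for German or English prose
of some length). The names `shiftText` and `guessShift` and the sample sentence
are made up for illustration and are not taken from the bookmarklet:

```javascript
// Shift every non-whitespace character by `shift` code points,
// in the same spirit as the `cc` helper in the bookmarklet above.
function shiftText(text, shift) {
  return text
    .split('')
    .map(ch => (/\s/.test(ch) ? ch : String.fromCharCode(ch.charCodeAt(0) + shift)))
    .join('');
}

// Guess the shift by assuming the most frequent non-whitespace character
// in the ciphertext corresponds to `expected` (default: "e").
function guessShift(ciphertext, expected = 'e') {
  const counts = new Map();
  for (const ch of ciphertext) {
    if (/\s/.test(ch)) continue;
    counts.set(ch, (counts.get(ch) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [ch, n] of counts) {
    if (n > bestCount) {
      best = ch;
      bestCount = n;
    }
  }
  if (best === null) return 0; // nothing to analyse
  return best.charCodeAt(0) - expected.charCodeAt(0);
}

// Made-up sample: obfuscate with a shift of +1, then recover it.
const secret = shiftText('eine kleine geheime meldung', 1);
const shift = guessShift(secret);
console.log(shift, shiftText(secret, -shift)); // guessed shift, then the recovered text
```

On short or unusual texts the "most frequent letter is e" assumption can fail,
so in practice you would sanity-check the result by eye.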

~~~
pintxo
Others [1] still seem to use this technique.

[1]
[https://www.faz.net/aktuell/feuilleton/debatten/klagenfurter...](https://www.faz.net/aktuell/feuilleton/debatten/klagenfurter-
kunststreit-was-soll-ein-wald-im-stadion-16377833.html)

------
snthd
[https://duckduckgo.com/?q=google+paywall+policy](https://duckduckgo.com/?q=google+paywall+policy)

first result:

[https://support.google.com/news/publisher-
center/answer/4054...](https://support.google.com/news/publisher-
center/answer/40543?hl=en)

> We've removed the First Click Free requirement for publishers on Search and
News. Read more about the new policy on our blog[0]

[0][https://webmasters.googleblog.com/2017/10/enabling-more-
high...](https://webmasters.googleblog.com/2017/10/enabling-more-high-quality-
content.html)

------
digitalengineer
Even worse for Google image searches. Every search is infested with Pinterest
results you can’t see without signing up...

~~~
ZoomStop
Google seems intent on ruining their decent image search engine. The removal
of features like search by exact size is making it worthless for certain
tasks, without a good alternative.

~~~
LocalH
Probably a bullshit concession to the copyright cartel

------
dplgk
I don't know why, but I know it used to be against Google's rules to show one
thing to the crawler and different content to the user. Apparently they don't
care anymore.

~~~
Rerarom
That sounds like something one would have read in a computer magazine from the
early 2000s...

~~~
headalgorithm
See Google's Webmaster Guidelines on cloaking:
[https://support.google.com/webmasters/answer/66355](https://support.google.com/webmasters/answer/66355)

------
moksly
I think it’s really hard to balance without giving us better personal search
settings. This is anecdotal of course, but I rarely find anything on free “as
in beer” media that’s worth my time. If Google excluded paywalled content its
search functionality would be a lot less useful to me as a result.

On the other hand, you don’t subscribe to everything, and it might be useful
to tune some of it out. Like I have a subscription to two Danish newspapers,
Information and Weekendavisen, and I like when google provides me with
articles for them, because those articles are often going to be the height of
what I want from my search results on subjects they cover. I don’t have a
subscription to other Danish newspapers, however, and maybe google would be
better if it let me filter them. Maybe not though, it would certainly increase
my personal bubble, but banning paywalls outright would really break google
for me.

~~~
jpalomaki
Would help if you could easily filter out non-free results. Or highlight the
free ones.

~~~
m-p-3
Or put a little dollar sign beside non-free results, like they do (or did) for
ads.

------
blihp
A number of paywalled sites allow Google through either because it's in their
self-interest to do so or because they have a business arrangement with Google.
This has been going on for a long time. It used to be that Google would
penalize a site in its search results for showing their web crawler something
other than what a user would see but that appears to have been relaxed in
recent years, at least for the larger paywall sites.[1]

[1] I believe that's the case... I don't recall in earlier years paywall pages
ranking so high in the search results but my memory could be faulty on this.

------
mellosouls
Not answering your question, and I'm not sure how general or useful this is,
but if it's in the cache, the text-only ("strip") version should presumably be
free of the redirect/whatever action. It may sometimes only contain initial
content I suppose, but worth a try.

E.g.

[http://webcache.googleusercontent.com/search?q=cache:https:/...](http://webcache.googleusercontent.com/search?q=cache:https://www.nytimes.com/programs/better-
reader/day-1&strip=1)

------
luckylion
> Does it turn a blind eye? Does the NYT trick Google by displaying something
> else to Google than to me?

Pretty much this. Google used to have a policy called "first click free" that
basically said "you can have a paywall, but a user must be able to see the
first content they click on from a search result. for more content, you can
make them sign up". They dropped that policy at some point (and it was never
_really_ enforced, though many/most did follow it), so cloaking for paywall
purposes is okay now.

~~~
ebj73
Do you know anything about how they're doing this? Is it special content in
the HTTP headers of the requests from Google? Or do they explicitly know which
IP addresses Google will be doing its web crawling from?

~~~
luckylion
It's just a different version targeted at Googlebot. Google's IPs aren't
secret, and Google is open about how to identify (and verify) their
webcrawlers:
[https://support.google.com/webmasters/answer/80553](https://support.google.com/webmasters/answer/80553)
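
As far as I understand that page, the verification is a reverse DNS lookup on
the requesting IP, a check that the hostname belongs to Google, and then a
forward lookup to confirm the hostname resolves back to the same IP. A rough
Node.js sketch of that idea is below. The function name `isVerifiedGooglebot`,
the injectable `resolver` parameter and the suffix list are my own assumptions;
check Google's page for the current list of domains:

```javascript
const dns = require('dns').promises;

// Assumed list of domain suffixes for Google crawler hostnames;
// consult Google's documentation for the authoritative, current list.
const GOOGLE_SUFFIXES = ['.googlebot.com', '.google.com'];

// Returns true only if reverse DNS points at a Google hostname AND that
// hostname resolves forward to the same IP (guards against spoofed PTR records).
// `resolver` defaults to Node's dns.promises and can be swapped for a stub.
async function isVerifiedGooglebot(ip, resolver = dns) {
  let hostnames;
  try {
    hostnames = await resolver.reverse(ip); // step 1: IP -> hostname(s)
  } catch {
    return false; // no reverse record at all
  }
  for (const host of hostnames) {
    // step 2: hostname must end in one of the expected Google domains
    if (!GOOGLE_SUFFIXES.some(suffix => host.endsWith(suffix))) continue;
    try {
      // step 3: hostname -> IPs, which must include the original IP
      const addresses = await resolver.lookup(host, { all: true });
      if (addresses.some(a => a.address === ip)) return true;
    } catch {
      // forward lookup failed for this hostname; try the next one
    }
  }
  return false;
}
```

A site could call something like this on requests whose User-Agent claims to be
Googlebot and serve the full article only when it returns true. Note that the
plain string comparison of addresses is a simplification (IPv6 addresses can be
written in several equivalent forms).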

------
ronreiter
Google must give a paywall filter in my opinion. I totally agree that a lot of
times there is no intent for the Google user to see paywalled content when
looking for information.

------
Cyder
Sometimes you can bypass signup/restricted walls by using developer tools in
Firefox. My son does this when doing homework and looking for answers.

------
buboard
Because they removed the requirement for "first click free" for paywalls

[https://support.google.com/news/publisher-
center/answer/4054...](https://support.google.com/news/publisher-
center/answer/40543?hl=en)

So Google is basically now offering you ads, paywall spam and stock photo spam
in their image results. Wonderful.

------
baraln
There was a time when paywall articles from Google search were free. Sigh!
[https://www.theverge.com/2017/10/2/16395604/first-click-
free...](https://www.theverge.com/2017/10/2/16395604/first-click-free-policy-
flexible-sampling-publishers)

------
HocusLocus
Accessing hidden pay content is WRONG.

And I dussent think you wuz brought up to do WRONG.

------
tus88
At some point Google entered into an agreement with corporate news
organizations to drive users towards subscription sites and paying for news.
It is also probably influenced by more recent attempts to steer people towards
establishment media sources rather than "fake news" websites.

~~~
_nalply
Is this speculation or do you have a source?

~~~
tus88
I believe it all centers around this:

[https://newsinitiative.withgoogle.com/](https://newsinitiative.withgoogle.com/)

