
It seems that Google is forgetting the old web - barry-cotter
http://stop.zona-m.net/2018/01/indeed-it-seems-that-google-is-forgetting-the-old-web/
======
kickscondor
While it's become impossible to browse the wider Web with Google, it's getting
a bit easier elsewhere.

A few helpful search engines:

* [https://millionshort.com/](https://millionshort.com/)

* [https://wiby.me/](https://wiby.me/)

* [https://pinboard.in/search/](https://pinboard.in/search/)

A recent movement to build personal Yahoo!-style directories:

* [https://href.cool/](https://href.cool/) (my own project)

* [https://indieseek.xyz/](https://indieseek.xyz/)

* [https://districts.neocities.org/](https://districts.neocities.org/)

* [https://the.dailywebthing.com/](https://the.dailywebthing.com/)

The above resources are focused on general blogging and personal websites -
for software and startups, I would refer to the appropriate 'awesome'
directories.
([https://github.com/sindresorhus/awesome](https://github.com/sindresorhus/awesome)
or [https://awesomelists.top](https://awesomelists.top))

If you know of any more, please list them - a small group of us are collecting
these and trying to encourage new projects.

~~~
marttt
There's also Kenneth Goldsmith's UbuWeb, a curated directory of (hard or
impossible-to-find) avant-garde art, music, writing, video. Launched in 1996.

[http://ubu.com](http://ubu.com)

[https://en.wikipedia.org/wiki/UbuWeb](https://en.wikipedia.org/wiki/UbuWeb)

~~~
Funes-
This one's great. I've lost count of how many films I've watched that I would
have never found otherwise.

Here's another big art repository:

[https://monoskop.org/Monoskop](https://monoskop.org/Monoskop)

And a very well-documented collection (a "wiki") of paintings, also
non-profit:

[https://wikiart.org](https://wikiart.org)

~~~
marttt
+1, I forgot about Monoskop, a truly fascinating resource.

Another interesting one is Aaaaarg (according to Monoskop's wiki, originally
with one less "a", acronym of Artists, Architects, and Activists Reading
Group):

[https://aaaaarg.fail/](https://aaaaarg.fail/)

[https://monoskop.org/Aaaaarg](https://monoskop.org/Aaaaarg)

Basically it's a collaborative environment for reading, annotating and
discussing texts. The content is submitted by users and (thus) of high
quality.

I think you need an invite to access the community. Also, the domain used to
be aaaaarg.org, but I think they faced copyright issues of some kind and had
to find an alternative domain. (Not sure about this; excellent new suffix,
though!)

EDIT: More precise description:

[https://www.memoryoftheworld.org/blog/2014/10/28/aaaaarg-org/](https://www.memoryoftheworld.org/blog/2014/10/28/aaaaarg-org/)

------
degenerate
This author is in his own little bubble and doesn't understand the vast amount
of blog-repost spam that google has to deal with. The way their algorithm most
likely deals with this is a mixture of domain rank + tenure... how long has
_this_ copy of _this_ article existed on _this_ domain, and can we be sure
this is the original copy?

The author says the article was removed in 2006 (" _[...] posts, were not
accessible anymore_ ") and then he re-posted the article at a _new_ domain in
2013. That means any copy/crawl/repost of the article from 2006-2012 is now
the oldest living, and thus "original", version of the article. His 2013
repost was seen as just another blog-spam copy.

Google is not forgetting the old web unless we see evidence of content
disappearing from the index that has been consistently hosted at the same
domain & URL since its original posting. Unless you properly 301 your URLs to
new locations and consistently host your content, it's a guessing game for
the crawler to determine where the original content has moved to.
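The 301 redirect mentioned here takes only a few lines of server config. A
minimal sketch for nginx, with hypothetical domain names:

```nginx
server {
    listen 80;
    server_name old-domain.example;

    # 301 = moved permanently: crawlers treat the new URL as canonical,
    # so ranking signals can transfer instead of the copy looking like
    # yet another blog-spam repost.
    return 301 https://new-domain.example$request_uri;
}
```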

~~~
founderling
Here is an example:

[http://www.gnoosic.com/discussion/metallica__5.html](http://www.gnoosic.com/discussion/metallica__5.html)

No matter how you search for the content on Google, _nothing_ comes up:

[https://www.google.com/search?q="Metallica+only+played+2+son...](https://www.google.com/search?q="Metallica+only+played+2+songs+from+St.+Anger")

DuckDuckGo has it:

[https://duckduckgo.com/?q="Metallica+only+played+2+songs+fro...](https://duckduckgo.com/?q="Metallica+only+played+2+songs+from+St.+Anger")

I checked the Wayback Machine, and the content has been at that URL
constantly for over 10 years.

This is the first example of an old forum page I tried after reading the
article. So I tend to think it's true. Google is discarding the "classic" web.

~~~
Liquix
Anecdotally and perhaps unrelated - has anyone else noticed a decrease in the
accuracy and general quality of Google search over the past 2-4 months?
They must have been using ML to 'improve' searches for some time now, but the
quality of the results has decreased suddenly and inexplicably (for me).

~~~
JohnFen
> Anecdotally and perhaps unrelated - has anyone else noticed a decrease in
> the accuracy and general quality of Google search over the past 2-4 months

Yes. Not just over the past 2-4 months, but over the past five years or so.

It's become so bad that Google is no longer the most useful search engine for
me.

~~~
therein
Honestly, it all started going downhill with Google's "Hummingbird" switch.
While interviewing at Google, I actually brought this up with an engineer on
the search team during lunch.

He said they hadn't noticed any regressions. I said I figured that would be
the case, but that I can definitely feel the difference as a daily user.

~~~
friendzis
This is indicative of a larger issue: comprehensive testing is probably as
hard as the halting problem (i.e. code could be generated from proper tests),
yet teams tend to trust their tests completely. I see high-profile websites
having severe usability issues, or being outright broken in ways that would
be immediately caught by "interns randomly click here and there" usability
tests. But these versions got deployed, probably because testing did not show
any regressions.

I tend to believe that if user complaints about new problems or regressions
rise above the statistical noise, there is a problem.

~~~
JohnFen
> yet teams tend to trust their tests completely.

Well said. This is a big problem. We see a similar problem with the use of
telemetry data as well.

------
inetknght
I have noticed that searching for exact quotes seems to have been broken on
Google for a few years. But only _minimally_ broken. And I've had no idea how
to reason with it. This article completely corresponds with problems I've
encountered with searching for results on StackOverflow or software
documentation sites; it's especially perplexing that "site:..." combined with
exact quotes _does not work_ for many cases.

Google certainly doesn't seem to value feedback at all. It's practically
impossible to get in touch with a human to ask for help and Google's feedback
forms have _always_ felt like a black hole.

~~~
Ygg2
I too noticed that for some queries, Google is becoming really, really
unwieldy.

I can't recall the exact search term, but I kept looking for some site I
visited some time ago, and no combination of words could get it to actually
find the actual site. I finally just gave up and found it in my browser
history.

~~~
asark
I've "lost" a few sites that way. Forgot the address, can craft several
searches that years ago _definitely_ would have brought up the exact site I
want. But... nothing. Or it's buried so far in the search results I'll never
find it. I need a _good_ search on top of Google search, these days.

To be fair DDG rarely works for me in that way, either. I think that kind of
old-school, precise search engine's just dead now. It seems like everyone's
indices are a lot "fuzzier" and full of holes, like they're discarding large
parts of pages from the index if those parts don't look important to the algo.
Not just deprioritizing, but tossing those pieces out entirely. Except the
algo's very wrong.

------
brudgers
"Deprecating" might be a better term than "forgetting." Google's business
isn't driven by long tail content. Probably it never was.

Maybe somewhere there is a Google disk with the hash of an exact phrase
the author typed into the search box. But statistically, that hash won't be
found in hot memory vector space when cosine similarity runs on a nearby
server. Finding the phrase would require a batch job that runs much longer
than the engineered time limit Google imposes on search queries. Without a
"let me know in 24 hours" option, Google's search will partition data into
what should and shouldn't be accessible. That partition will always be
according to Google's business goals. All the information may be indexed, but
only the fraction of the index beneficial to Google will ever be accessible to
ordinary users.

The crux of the story is that there is no business case for Google to return
the author's web pages in search results even if the Wayback Machine implies
that Google could.

------
reaperducer
I maintain two web sites that date back to 2003. They are still very active
(thousands to tens of thousands of uniques per day), but I and my users have
noticed that only the more recent content (2012+) shows up in Google.

In a way I’m glad to hear that Google is delisting the older content, because
I thought I was doing something wrong.

But it’s still frustrating for my visitors. Every few months I get a message
from someone who can’t believe how much information is on the site: things
they’ve searched for for years but never found through a search engine, yet
it’s all right there on this one site. (It’s something of a regional history
site.)

I guess those sites have involuntarily become part of the “dark web.”

~~~
specialist
Is it possible to check your access logs to see how much Google and others
are spidering it?

I know very little about Google's SOP, but I had the impression they
periodically rescan stuff.

~~~
reaperducer
I've built new sitemaps, submitted them to Webmaster Tools (or whatever
they're calling it this week), requested a re-spider, everything.

Duck Duck Go finds it and shows it. According to the logs, Google spiders it,
but chooses not to show it.
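For reference, the sitemaps submitted this way are just small XML files in
the sitemaps.org format. A minimal sketch with a hypothetical URL (note that
Google treats `changefreq` as a hint at best):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <!-- One entry per page; lastmod tells the crawler the content
         hasn't changed since 2003, so there is no excuse for re-crawl cost. -->
    <loc>https://example.com/archive/2003/some-article</loc>
    <lastmod>2003-06-15</lastmod>
    <changefreq>never</changefreq>
  </url>
</urlset>
```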

~~~
dcbadacd
Google's spider is weird anyway. I have a site whose sitemaps were submitted
like three months ago, and it still hasn't taken a look at them.

------
gambler
This has been happening for years, and it's getting increasingly worse. But
it's not just old websites. I think it's certain types of content.

Here is one specific example out of dozens I've seen. There is a short
satirical rant "published" on Pastebin called The Java Way. Posted in 2015.
Unfindable on Google. It was indexed and findable around the time it was
posted.

[https://www.google.com/search?client=firefox-b-1-d&q=%22the+...](https://www.google.com/search?client=firefox-b-1-d&q=%22the+java+way%22+pastebin)

First result on DDG:

[https://duckduckgo.com/?q=%22the+java+way%22+pastebin&t=ffsb...](https://duckduckgo.com/?q=%22the+java+way%22+pastebin&t=ffsb&ia=web)

The worst part is that Pastebin uses Google for its own search.

~~~
Macuyiko
Amazingly, three hours after your post, this hacker news comment is now
showing up in the top 5 for the Google query. Wow.

~~~
gambler
Not on my version of Google. But for me it started showing this as top result:

    
    
      https://duckduckgo.com/?q=%22the+java+way%22+paste...
      No information is available for this page.
      Learn why

I tested this on different browsers and IPs. Seems like it indexed the link
from this thread, but can't display it because of DDG's robots.txt settings or
something like that.

Bizarre.
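For context, that behavior is consistent with robots.txt: Google can index a
disallowed URL discovered through external links, but it can't fetch the page
itself, which produces exactly the "No information is available for this
page" snippet. An illustrative fragment (not DDG's actual file):

```
User-agent: *
Disallow: /?q=
```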

~~~
gambler
Just for historic record: Google finally started showing a link to the
original pastebin page and it is the first result. I suspect it's because
someone submitted the page to HN, which shows as the second result now.
([https://news.ycombinator.com/item?id=19609280](https://news.ycombinator.com/item?id=19609280))

------
mathnmusic
When web directories like Yahoo lost out to web search engines like Google, we
lost something crucial. While search is good for answering questions you know
how to ask, browsing was exploratory and led us to know what we didn't know.
When it comes to learning a complex topic like mathematics, this kind of
serendipity was very useful. There are some amazing resources on the web, but
googling won't let you discover those.

Sometimes, the same idea is available in a book, in a TED talk, and in a
podcast. Some of us are curating such resources categorized by topic / format
/ year / difficulty / estimated time. Our GitHub repo received 100+ stars in
less than a week, so I thought it would be a good time to show it to HN. I'd
love to get some feedback and critique from the HN community where I have
learned and discovered so much.

Here's the Show HN post:
[https://news.ycombinator.com/item?id=19604295](https://news.ycombinator.com/item?id=19604295)

~~~
kickscondor
'Awesome' directories have really given a nice resurgence to the Yahoo! style
of organization. I think the problem with Yahoo! is that it simply got too big
to be used as a directory. Niche directories are where it's at. (Also: Reddit
wikis, which often are used similarly.)

~~~
teleclimber
> Niche directories are where it's at.

Agreed. And if we came up with a standard format for such lists you could make
them searchable and we could end up with distributed searchable curated
indexes that are not centrally controlled. And that is quite compelling
compared to centralized fully algorithmic search systems run by mega-corps.
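As a sketch of what that could look like: a hypothetical minimal JSON format
for a niche directory, plus a few lines of Python that search across several
such files. The format and all field names here are invented for
illustration, not any existing standard:

```python
import json

# A hypothetical interchange format for a niche link directory:
# each entry carries a URL, a title, and free-form topic tags.
DIRECTORY_JSON = """
{
  "title": "Example Niche Directory",
  "entries": [
    {"url": "https://href.cool/", "title": "href.cool", "tags": ["directory", "web"]},
    {"url": "https://wiby.me/", "title": "Wiby", "tags": ["search", "old-web"]}
  ]
}
"""

def search(directories, term):
    """Return URLs of entries whose title or tags match the query term."""
    term = term.lower()
    hits = []
    for directory in directories:
        for entry in directory["entries"]:
            fields = [entry["title"].lower()] + [t.lower() for t in entry["tags"]]
            if any(term in field for field in fields):
                hits.append(entry["url"])
    return hits

# Many independently hosted files could be fetched and pooled like this.
directories = [json.loads(DIRECTORY_JSON)]
print(search(directories, "search"))  # → ['https://wiby.me/']
```

Because each file is plain data, anyone could host one, and a searcher could
pool as many of them as they trust, with no central index in control.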

------
SXX
Personally I have a similar experience, but the other way around. Every time
I try to look for anything on Google, almost all of the results are from 3-6
years ago unless I specifically ask for results from the last month / year /
etc. And I'm not just talking about technical questions, but all kinds of
stuff, including music, travel information and such. I'm not even sure when
Google last gave me a link to a new website with fresh content.

It sometimes feels like the web has completely frozen and all content has
moved into closed gardens. I switched to DDG a while ago for this and a bunch
of other reasons, but I wonder if someone else has noticed this. Anyone?

~~~
Pxtl
I don't think your "3-6 years ago" and "forgetting the old web" are
incompatible. I've noticed the same - Google seems to gravitate to results
from 2014ish, even when newer information is available, or when I'm searching
about an event from far before that time.

~~~
alecdibble
I thought I was going crazy with this 2014ish search result dynamic. I almost
always have to use date filters to find recent information. Especially when
trying to research products, reviews, types of products, etc.

------
ordersofmag
I have a site that includes lots of older content. I checked the first page
that came to mind (published in 2005) and it still shows up on the first page
of Google results for the obvious search terms. So it certainly isn't as
clear-cut as 'everything older than 10 years is not in the index'. _update_
checked another circa 2003. It's also on the first page of search results for
a search on its title.

------
visiblink
I have been noticing (what appear to be) truncated search results in Google
for some time now. At first I thought that was because I was accessing Google
through Startpage. But that's not the case.

Anyways, I find myself using Bing more and more often these days, because the
search results dig more deeply into the 'obscure'.

I'm not at all upset by this. It seems to me that as Google's results are not
completely satisfactory, more people will make use of various alternatives.
Maybe one day, search will become decentralized again, somewhat like it was in
the 1990s, when you regularly made use of many search engines, like Altavista,
Lycos, Excite, and Yahoo.

I would imagine that there must still be metasearch sites out there somewhere
that submit your query to several search engines. I need to find one again and
would appreciate recommendations.

~~~
casefields
The meta search engine www.dogpile.com still exists but I'm not sure how good
it actually is.

~~~
visiblink
Found a very good one. In case anyone's still reading this thread: searx.me

------
Udo
The assertion that this is because " _indexing the whole Web is crushingly
expensive, and getting more so every day_ " is a bit flawed. Since old content
is very unlikely to be updated, it doesn't have to be re-crawled a lot. I'm
certain Google has a score that tells it how often the content of a given site
is likely to change. This argument of expense becomes even less durable when
you consider that DuckDuckGo, a company with an infinitesimal fraction of
Google's resources, is perfectly able to keep that kind of content in its
database.

I agree with the observation that this is about shifting everything to current
data, because people overwhelmingly care about things that happened a few days
ago. There used to be a long tail of users searching for old data and
references, but I suspect they're fading away. Biasing the index towards
recency also has legal advantages for Google, because delisting old content
makes it less likely to receive takedown requests in connection with "right to
be forgotten" legislation.

~~~
jsnell
Crawling isn't the real problem, nor is the bulk storage for the crawled
pages.

What do you do with these pages after you've crawled them? You need to build
an index out of them, and serve that index out of some kind of low latency
storage (DRAM, Flash). That makes increasing the index size very expensive.
The index size has to be limited, and selecting the right pages to include in
the index is thus a core quality feature for a search engine.
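A toy sketch of the data structure in question: an inverted index maps each
term to a postings list of document IDs, and multi-term queries intersect
those lists. At web scale it is this term-to-postings mapping, not the raw
crawled pages, that has to live in low-latency storage:

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted-by-insertion list of docs containing it."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return index

docs = {
    1: "Metallica only played two songs",
    2: "old web pages and old forums",
    3: "Metallica forums discussion",
}
index = build_index(docs)

# A multi-term query intersects the postings lists of its terms.
result = set(index["metallica"]) & set(index["forums"])
print(sorted(result))  # → [3]
```

Every page kept in the index adds an entry to the postings list of every term
it contains, which is why dropping "unimportant" pages (or parts of pages) is
such a tempting cost saving.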

~~~
Udo
I'm having trouble imagining that Google would be more limited by the ratio of
hardware power vs data size today than it was in the early days. If keeping
the whole index in DRAM is now a requirement, then yes, I'd expect a hugely
reduced overall dataset - but wouldn't that affect way more sites/pages than
the comparatively few dropped historical records?

I still suspect that this whole thing is more about bias (and personalization,
be it correct or incorrect) in the results.

~~~
puzzle
Google's index has been in memory for most of its life now:
[http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-
wsd...](http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-
wsdm-2009.html)

It's actually more complicated than just a single static index, which is also
why it's unrealistic to expect a search engine to be deterministic at scale.

------
syphilis2
It's not just the old web. I have a small website that does not show up in
Google's results. When searching for the title of the site it's the 7th result
on DDG yet does not appear on Google at all (Google runs out of results after
19 pages, which is also ridiculous considering the search is very common).
When I search for the domain name (sans tld) on Google the associated YouTube
page is the first result, also an unrelated Twitter account, and even a Reddit
comment that links to my website (Reddit is a more effective search engine!).
But my website is not listed. Finally, if I search for the full domain name
the website shows up as the first result.

This has been a problem for three years, it's incredibly frustrating, and also
demotivating.

~~~
lnanek2
Google does have some webmaster tools where, after you prove you own the site,
you can check if there's something you have to fix about it to get listed, can
submit new URLs for them to scan, etc..

~~~
tannhaeuser
Even if your site is perfect according to Search Console/Webmaster Tools, that
won't get you listed on Google Search.

~~~
josefresco
Google Search Console will tell you what pages were submitted (via sitemap)
and what pages are indexed as well as any errors.

~~~
tannhaeuser
Yes, but they _don't show up in the fscking search results_ even for highly
matching searches.

------
mbrumlow
I would not be surprised if, along the way, Google's search results were
optimized for ad revenue over other metrics, knowingly or not.

I too have complained about Google's search results going downhill. I was
told I was "just too technical".

Whatever the case, the web is not the same as it was in the early 2000s, and
it really sucks if you want to search for something.

------
xorand
Google Scholar started to forget articles
[https://news.ycombinator.com/item?id=19599365](https://news.ycombinator.com/item?id=19599365)

~~~
xorand
Downvoted against the evidence given. Does anybody care to explain why? Thank
you.

~~~
yorwba
I didn't downvote, but you link to a comment by yourself that isn't much
longer and links to a blog post by yourself, and you already made a comment
with basically the same content elsewhere in the thread. That's close enough
to spammy self-promotion to get you downvoted, irrespective of whether people
agree with you.

~~~
xorand
The newer comment is one hour after the older one. I made it after I was
downvoted. As for the spammy self-promotion: thank you, but I linked to a HN
comment (made yesterday, relevant to the matter but with no reactions at that
moment) instead of the blog post directly, exactly because I don't want to
replicate links. Finally, while these comments will fade from attention, if
they haven't already, that blog post will remain. Excuse me for being first
to give some evidence that something is wrong with Google Scholar. Obviously
spam.

------
Funes-
From my experience I can say that the web is filled with garbage that tries to
exploit whichever search engine you may be using, while also trying to exploit
your own attention with flashy headlines and sensationalistic content. These
two go hand in hand.

It wasn't that long ago that any Google search would return a big list of blog
entries; personal, non-commercial blogs, that is. That was the case with
YouTube, as well. I remember people making and uploading videos just for the
sake of it. Even I did that (I had a January 2006 account), and the purpose
was only one: sharing what you liked to engage in. Nothing more. I guess
that's part of the long-gone, old web. When I browse the web nowadays, I feel
like I am constantly being sold something, because I actually am.

~~~
JohnFen
> I guess that's part of the long-gone, old web.

That old web still exists! It tends to get drowned out by all of the
commercial sites, and you won't find more than a hint or two of it through
Google, but it is still there...

------
thsowers
Previous discussion from when Tim Bray's article was posted:
[https://news.ycombinator.com/item?id=16153840](https://news.ycombinator.com/item?id=16153840)

------
coliveira
One of the changes that made Google forget the old web is favoring HTTPS
sites. This is a big benefit to new and commercial websites, because setting
up SSL is still a burden for non-commercial publishers.

~~~
xvector
I feel like LetsEncrypt is a 5-minute burden. Unless I’m being naive - I’ve
only used it in personal projects. Thoughts?
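For reference, the usual flow with certbot on a Debian-style server running
nginx is roughly two commands (assuming shell access and control of the
domain, with example.com standing in for the real site):

```sh
# Install certbot and its nginx plugin.
sudo apt install certbot python3-certbot-nginx

# Request a certificate and let certbot edit the nginx config;
# renewal is then handled automatically by a systemd timer / cron job.
sudo certbot --nginx -d example.com -d www.example.com

# Optional sanity check that automated renewal will work.
sudo certbot renew --dry-run
```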

~~~
efreak
Plenty of websites out there being kept up by volunteers without hardware
access, the original owner having died or otherwise gone MIA. It's not always
possible to add https, particularly when the site owner died 10 years ago and
someone has an 'agreement' with the hosting provider to keep a website up as a
memorial. No, I don't have any specific examples right now, but anecdotally I
occasionally come across a website that's been in such a read-only form for as
long as a decade due to family members being willing to continue paying the
$15/year hosting fee, but not having the technical knowledge, passwords, or
interest to fix problems. Sometimes there's evidence of a partial upgrade (the
search engine stopped working due to a php upgrade), or a forum that has been
converted to a static site entirely (the login buttons don't work either). In
any of these cases, getting LE working is almost certainly more trouble than
it's worth for whoever is currently paying the fees.

------
jpswade
Google has, for quite some time now, preferred magazine-style websites that
grow exponentially.

Their search engine algorithm is designed to favour rich media content and
websites that are growing, because this is how Google grows and learns.

Favouring the "old web" would not help Google's business model which favours
growth.

------
xondono
It's very worrying that people seem to forget how Google plays around with
its platform and tinkers with the results, because for a lot of people Google
Search is HOW they navigate the internet. I'm most disturbed by how articles
critical of Google completely vanish.

Here's an article from 2011; it's not particularly damaging, just the kind of
thing that happens in tech:

[https://www.zdnet.com/article/google-busts-itself-for-distributing-malware/](https://www.zdnet.com/article/google-busts-itself-for-distributing-malware/)

If you search the title of the article:

\- DuckDuckGo: First result

\- Bing: First result

\- Yahoo: First result

\- Dogpile: First result

\- Yippy: First result

\- Google: Does not show (I've gone through the 4 pages of results with no
luck). To find it you need to use the "site:zdnet.com" option, and then it's
the first result.

------
andybak
Holy shit. With all the other reasons to consider abandoning Google the one
thing I've held on to is the depth and reliability of their search results. I
just _assumed_ they would never abandon their core strength in this area.

This might be the straw for my personal camel's back.

------
hartator
Worth noting that there was a de-indexing bug this weekend affecting all
Google servers and a lot of pages.

Might be just that. More information:
[https://www.google.com/amp/s/searchengineland.com/googles-de-indexing-issue-still-not-fully-resolved-but-google-is-working-on-it-315070/amp](https://www.google.com/amp/s/searchengineland.com/googles-de-indexing-issue-still-not-fully-resolved-but-google-is-working-on-it-315070/amp)

~~~
poizan42
Non-AMP link: [https://searchengineland.com/googles-de-indexing-issue-still-not-fully-resolved-but-google-is-working-on-it-315070](https://searchengineland.com/googles-de-indexing-issue-still-not-fully-resolved-but-google-is-working-on-it-315070)

------
lsiebert
Unfortunately, Google is the only way to search old Usenet archives from the
web, since they bought out such archives. That search leaves much to be
desired.

Only the relevance search has results for Usenet posts, you can't order by
date, and other than using date ranges, there's no way to see only Usenet
posts.

For example searching for "gamer" before 1/1/2000:

relevance
[https://groups.google.com/forum/#!search/%22gamer%22$20befor...](https://groups.google.com/forum/#!search/%22gamer%22$20before$3A2000$2F01$2F01%7Csort:relevance)

date
[https://groups.google.com/forum/#!search/%22gamer%22$20befor...](https://groups.google.com/forum/#!search/%22gamer%22$20before$3A2000$2F01$2F01%7Csort:date)

~~~
zandorg
Well, Usenet provider Giganews has 10 years of Usenet, but I don't know if
they kept it from 1990 or... just threw it away.

~~~
lsiebert
Ten years doesn't even go back to when google bought dejanews's archive, in
2001.

------
gypsy_boots
Author says:

> I also find misleading the title of BoingBoing’s report of this story:
> “Google’s forgetting the early web”. The two posts mentioned here are not
> “early web”, nor really “old”.

While the title of this author's post is "Indeed, it seems that Google IS
forgetting the old Web"

------
besulzbach
> But this only makes bigger the problem of what to remember, what to forget
> and above all who and how should remember and forget.

And today, if the big search engines decide something will (no longer) be
indexed, they can make it effectively unreachable.

~~~
JohnFen
> they can make it effectively unreachable.

I think you mean "unfindable", not "unreachable". It may seem pedantic, but I
think there's a critical difference there.

------
ghaff
For better or worse, Google is very explicitly not a card catalog though. One
might disagree about how Google determines relevance--and the criteria are
opaque. But, in any case, it's quite different from a card catalog which, with
the caveat that the work in question must be part of the library's curated
collection, is completely non-judgmental about the relative importance of a
given item.

~~~
influx
It used to be. I remember when one of the selling points of google was how
many pages they indexed and how fast your search came back over that data set.

I guess no one at Google is getting a promotion by making sure old pages can
be returned.

------
juskrey
The reason is simple: apparently they are not making money on old content
searches.

~~~
mojuba
And the reason they are not making money (I'm guessing) is because older web
sites are much less likely to use one of Google's spying plugs, such as
Analytics or Fonts. Or the ads themselves after all.

------
dontbenebby
Interesting that DuckDuckGo has it first.

Increasingly I've found if I know _what_ I'm looking for (ex: a specific
article or web page), DDG is quite good at finding it.

Google still excels where I'm looking for info on a topic but not sure what I
need. (You end up treating search terms like labels on a venn diagram whereas
DDG is more traditional, using search operators to prune the list of possible
sites)

------
mahmoudhossam
Can we add (2018) to the title?

------
eruci
"Reputable SEO company from India" is to blame.

------
mavhc
[https://www.google.co.uk/search?hl=en&q=%22Seven+Things+we%2...](https://www.google.co.uk/search?hl=en&q=%22Seven+Things+we%27re+tired+of+hearing+from+software+hackers%22&meta=)

Works for me

------
IronWolve
It kinda makes sense that Google thinks using a Pareto distribution (the
Matthew Effect) is correct: people search on popular topics, so popular
results are typically the correct ones.

BUT... now we have a big problem with what Google filters and promotes. It
has altered its search engine to ignore search modifiers in some of its
products. For example, Google wants to be in the TV/entertainment business,
so it's promoting Hollywood above content creators. And many results favor
CNN/NBC on its platforms, even with "exact title searches". It's using its
business models to override what consumers want.

It's like Google turned the search engine into a BuzzFeed top-10 generator.

------
barnitm3
When searching Google for simple words that should return a lot of content,
the results are limited to around 20 pages. If I click the link to not omit
similar results, that changes to around 45. The results don't cover every
company in a specialist area or every good article on a topic; what's more,
after you go back a number of pages the results start turning into junk. That
seems unnecessarily harsh on any relevant site that isn't a "top site". The
x-billion results number is still shown at the top, a holdover from the days
when you could search through a seemingly infinite number of pages.

------
avian
Here’s a blog post I wrote recently on a similar topic:

[https://www.tablix.org/~avian/blog/archives/2019/02/google_i...](https://www.tablix.org/~avian/blog/archives/2019/02/google_index_coverage/)

On my website, the Google index coverage is clearly falling with age of the
content. According to the Search Console only about 60% of my posts from 2006
are indexed.

------
CloudNetworking
Just yesterday I was thinking to myself that the results I was getting from
Google were more artificial (paid, SEO, etc.) and less organic than before.

That means I am getting less and less value out of my Google searches these
days, but only for certain topics, e.g. recipes / food searches. Coding and
technical searches tend to return high-quality results, but that may be
because StackOverflow exists...

------
Const-me
I think google applies arbitrary undocumented filters to their output. I have
noticed it more than once, e.g. see this comments thread, couple weeks old:
[https://news.ycombinator.com/item?id=19420383](https://news.ycombinator.com/item?id=19420383)

Personally, a year ago I switched the defaults to duckduckgo.com on my
devices; it has worked OK for me so far.

------
qwerty456127
It would be better to develop an algorithm that forgets (deprioritizes) old
posts on subjects where old posts actually are irrelevant, i.e. tutorials for
old versions of actively evolving technologies, top-10-best-mobile-apps-for-
something posts from past years, etc.

~~~
JohnFen
Tutorials on old versions of software are not universally irrelevant.
Sometimes, for some people, they are invaluable. I know such tutorials have
been godsends to me before.

"Relevancy" is very context-dependent. I don't see how a human can accurately
determine it for others, let alone algorithms.

------
dcbadacd
I think it should be the job of other sites to keep an "index" on certain
subjects and make it searchable via Google. Just like Wikipedia, just like
Reddit, just like a lot of other content aggregators.

------
gfo
I thought part of Google's ranking algorithm includes how often a site is
updated? Perhaps there's some inclusion of how often an article is accessed
which also determines a page's ranking?

------
u801e
I've had the opposite issue when searching for how to do something in Perl.
I'll frequently get 15-to-20-year-old results from the perlmonks.org website
instead of something more recent.

------
lucaspottersky
Google is the new Yahoo.

------
bookofjoe
>Unless we’re all missing something here, it seems more correct to say that
Google forgets stuff that is more than 10 years old.

"The right to be forgotten"

------
will4274
It's not just the old web. Google's results for political content or content
with political implications have become increasingly strange in the past
couple years. I noticed this when trying to recall the story of a false rape
accusation that occurred on my college campus when I was a student. This saga
consists of three articles: a) [initial reporting of the reported
rape](https://news.cornell.edu/stories/2012/09/cornell-police-investigate-
reported-attempted-rape), b) [a police statement that they had irrefutable
video evidence that the initial report was
false](https://cornellsun.com/2012/11/28/cornell-police-report-of-attempted-
rape-on-campus-was-false/), and c) [an article defending the school's policies
on sexual assault in light of a and
b](https://cornellsun.com/2013/02/25/after-false-report-cornell-defends-new-
rules-for-sexual-assaults/). All three of these articles are impossible to
find unless you search for exactly the right thing.

For example, if you search for "Cornell Police: Report of Attempted Rape on
Campus Was False" on Google (the exact title of the second article), you find
it. But, if you search for any variation (e.g. "Cornell Police: False Report
of Attempted Rape on Campus") - the later articles (b and c) are impossible to
find.

I've never been able to find a satisfactory explanation for why the
discoverability of these articles has been turned down to zero. I _think_
there are some odious websites devoted to covering false rape accusations, and
these articles may be inheriting the low reputation of the publications that
link to them. Or perhaps the hosting website (the student newspaper) is doing
something to de-rank the stories. In any case, it seems wrong that in response
to the query "cornell trolley bridge false rape accusation 2012", Google's top
result would be "Why false rape accusations are rarer than you think".

------
rdiddly
Sounds like _somebody_ is ripe for disruption.

------
bunnycorn
I disagree with the author.

Google is the old web; no wonder there are similarities between Hooli and
Google.

(Hooli is a fictional company in the HBO series Silicon Valley.)

------
zallarak
My guess is that Google prioritizes new content over old with the argument
that it is more relevant.

------
fixermark
This article itself could use a 2018 tag (does that make it part of the old
web? ;) ).

------
agumonkey
google is the dual of the old web, the old web was a jungle, that's why we
wanted google, and now google is the urban regulation body

------
PaulHoule
Gotta make way for the new spam.

------
wyck
Matt Cutts fallout.

~~~
WalterGR
What do you mean?

~~~
wyck
Matt did an incredible job on Google search and left in 2016; it's been
downhill since then. Not entirely because of his departure, but the effect is
certainly noticeable, especially if you followed Matt's Google-search-related
Q&As, videos, etc. while he was at Google.

~~~
JohnFen
I don't know. It seems to me that Google started going downhill before 2016.

------
dosy
I think this is not "forgetting the older web" but the algorithm simply
penalizing things that look like they pretend to be from the past.

~~~
sct202
That would be interesting, because it seems like a lot of blogs pretend to be
from the present by updating the post/last modified date frequently but
leaving the content the same as several years ago.

------
nukeop
Google is pretty much useless for anything older than a few years, as well as
for many less popular topics. I can't even remember how many times Google
decided to drop a crucial word from my query to show me something, anything,
instead of admitting it drew a blank. Instead of playing the endless game of
"intitle: filetype: inurl:", just use something else (DDG maybe).

~~~
ishan1121
DDG is much better for getting a wider scope of search results, while Google
is good for personal things. Most of the old forums and posts do come up on
the first page of DDG results, but not on the first page of Google's. I have
started to hate Google's personalised filter bubble.

------
meh206
I think it is more a case of Google suppressing information from the mass
public. Relevant information is becoming harder and harder to find with their
search, and let's not forget Project Dragonfly.

------
tannhaeuser
Yep. And I thought I was the only one thinking Google's brainwashed web isn't
representative of what's out there. It's not only the old stuff, though. I
have a suspicion Google won't show you pages without Google ads on them.

------
stuffedBelly
Google's profit is driven primarily by ads. The old web is no longer the
primary target for Google. Companies like this should probably stop pretending
they are non-profit charities or labeling themselves as enablers of a 'better
world'. At least that way people would have more realistic expectations. But
hey, why should a mega-corp care what people really want...

~~~
stuffedBelly
downvote all you want...

------
0898
The post unflatteringly compares Google with DuckDuckGo. But doesn't
DuckDuckGo use Google's technology?

~~~
bamboozled
No, it uses Bing AFAIK

~~~
Insanity
Not only Bing, from their docs:

"In fact, DuckDuckGo gets its results from over four hundred sources. These
include hundreds of vertical sources delivering niche Instant Answers,
DuckDuckBot (our crawler) and crowd-sourced sites (like Wikipedia, stored in
our answer indexes). We also of course have more traditional links in the
search results, which we also source from a variety of partners, including
Oath (formerly Yahoo) and Bing."

(Regarding search results + instant answers:
[https://help.duckduckgo.com/duckduckgo-help-
pages/results/so...](https://help.duckduckgo.com/duckduckgo-help-
pages/results/sources/))

~~~
puzzle
How current is that page? Yahoo has used Google since 2015.

~~~
mrweasel
That just raises the question: to what extent does DuckDuckGo still depend on
Bing?

I've asked before, and no one seems to be able to provide me with a source
that states that DuckDuckGo is just Bing. Yet it comes up every time
DuckDuckGo is mentioned in a positive light.

~~~
puzzle
Does anyone outside the company know how DDG works? IIRC a job description
hinted at their backend being in Perl, but little else has trickled out. It's
easier to find out how Google works, through patents, presentations, the
Percolator paper, etc.

Saying that Bing is one of hundreds of sources is like saying that PageRank
was just one of Google's 200 signals. Or that Google was just one of hundreds
during Bing's hiybbprqag brouhaha.

------
stcredzero
_Bray writes that: “I think Google has stopped indexing the older parts of the
Web. I think I can prove it. Google’s competition is doing better.”_

Those who forget the past are doomed...to be marketed to again.

As Vernor Vinge (who popularized the notion of "The Singularity") noted, not
indexing something basically amounts to erasing it from the awareness of the
majority. If you can control the past, you can control how people think about
the world. If you can control the past, you can control the present. The
ability to quietly erase something from popular consciousness amounts to a
huge chunk of power.

There should be a concept like Net Neutrality. Perhaps there should be search
neutrality: if you're going to index, you need to index _everything._

~~~
YayamiOmate
It's literally impossible, and it was never true at any point in human history.

~~~
stcredzero
If it's literally impossible, then we're doomed. Being able to choose what to
index can literally be used to censor reality for most of the population.

 _it was never true in humankind history._

Basically everyone using search was never true in humankind's history either.

