
Google Memory Loss - AndrewDucker
https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is-losing-its-memory
======
userbinator
I've noticed this many times too, particularly recently, and I call it "Google
Alzheimer's" --- what was once a very powerful search engine that could give
you thousands of pages containing nothing but the exact words and phrase you
searched for (yes, I've tried exhausting its result pages many times, and
used to have much success finding the perfect site many dozens of pages deep
in the results). It has seemingly degraded into an approximation of a search
engine that has knowledge of only very superficial information, will try to
rewrite your queries and omit words (including _the very word that makes all
the difference_ --- I didn't put it in the search query for nothing!), and in
general is becoming increasingly useless for finding the sort of detailed,
specific information that search engines were once ideal for.

To add insult to injury, if you do try to make complex and slightly varying
queries and exhaust its result pages in an effort to find something _you know
exists_ , very often it will think you're a robot and present you with a
CAPTCHA, or just ban you completely (solving the CAPTCHA just gives you
another, and no matter how many you solve it keeps refusing to search; but
they probably benefit from all the AI help you just gave them, what
bastards...) for a few hours.

Google had the biggest most comprehensive index for many years, which is why
it was my sole search engine. Now I'm often finding better results with Bing,
DuckDuckGo, Yahoo, and even Yandex, but part of me is very worried that large
and extremely valuable parts of the Web are, despite still being accessible,
simply "falling off the radar".

~~~
Declanomous
> has seemingly degraded into an approximation of a search engine that has
> knowledge of only very superficial information, will try to rewrite your
> queries and omit words (including the very word that makes all the
> difference...)

I think the biggest irony is that the web allows for more adoption of long-
tail movements than ever before, and Google has gotten significantly worse at
turning these up. I _assume_ this has something to do with the fact that
information from the long tail is substantially less searched for than stuff
within the normal bounds.

This is a nightmare if you have any hobbies that share a common phrase with a
vastly more popular hobby, and is especially common when it comes to tech-
related activities. I use Linux at home, and I program VBA at work. At home
Linux is crossed out of most of the first few pages, and I just get a ton of
results about Windows, and at work VBA is crossed off and I get results about
VB6 and .NET.

Completely. Useless.

I can only imagine this has something to do with their increasing reliance on
AI, and the fact that the AI is probably incentivized to give a correct
response 'above the fold' to as many people as possible. If 95% of people
are served by dropping the specifically-chosen search term, then the AI
probably thinks it's doing a great job.

It seems like the web is being optimized for casual users, and using the
internet is no longer a skill you can improve to create a path towards a more
meaningful web experience.

~~~
colordrops
This same AI effect can be seen in the Android keyboard, where _properly_
spelled words will be replaced after typing another word or two because it's
been determined to be more likely what you want. It's infuriating.

~~~
Larrikin
It's all the keyboards. Swype used to work great, but now when it actually
gets words right, if I have a sentence with a second word that could possibly
have been two similar words on the path, it will just straight up replace both,
rendering the sentence completely incomprehensible. Who are these people that
don't correct their sentences until the second incorrect word?

~~~
dx034
I second that. 5 years ago I could swipe a whole message blindly with no
errors. Now I have to correct every second word.

I'd love a feature to disable all that deep learning and AI and just use the
algorithm they originally had (proximity of where you typed to words in the
dictionary). That worked so much better.
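
That original approach can be sketched roughly like this --- a toy QWERTY
layout and word list of my own invention, not the keyboard's actual
algorithm, just the "proximity to dictionary words" idea:

```python
# Toy sketch: score each same-length dictionary word by how far its
# letters sit on the keyboard from the keys actually pressed, and
# pick the closest. Layout and word list are invented for the example.

QWERTY = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
POS = {ch: (r, c) for r, row in enumerate(QWERTY) for c, ch in enumerate(row)}

def key_distance(a, b):
    """Euclidean distance between two keys on the toy layout."""
    (r1, c1), (r2, c2) = POS[a], POS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

def best_match(typed, dictionary):
    """Return the dictionary word whose keys are nearest the typed keys."""
    candidates = [w for w in dictionary if len(w) == len(typed)]
    return min(candidates,
               key=lambda w: sum(key_distance(a, b) for a, b in zip(typed, w)))

words = ["hello", "jello", "help", "word"]
print(best_match("hwllo", words))  # 'w' is next to 'e' -> "hello"
```

The appeal of something this simple is that it's deterministic: it never
overrides a correctly typed word based on what came before or after it.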

------
ohazi
I've convinced myself that this happens in gmail / hangouts history search
too. It'll very confidently tell you that there are only six results for
your search term going back to the beginning of time, but if you go and
manually dig up something that you know is there from ten years ago, then all
of a sudden there are seven results the next time you search for the same
term.

I haven't done this methodically, and I can't prove that this is happening,
but it's infuriating nonetheless.

~~~
aglionby
I've had this with Chrome history as well. There have been multiple times where
I'm _sure_ I've browsed a site with some keyword in the title and it just
doesn't show up in search. I don't tend to have a clue about the time window
it would be in either, so I can't go looking for it, and I can't prove it.

~~~
Puer
Chrome history may as well be useless and/or developed by the same people who
created Reddit search.

~~~
jakecopp
I feel that it is intentionally bad so people don't realize how much Google
knows about them.

~~~
Puer
Meanwhile their image recognition gets better and better. For those of you who
use Google Photos backup, try a keyword image search in Google Drive sometime
on your untagged photos ("beach", "face", etc.). You'll be creepily surprised
at what Google is indexing, even against what they claim they don't (try some
sketchier words).

~~~
fyfy18
I can one up that: I was living in Dubai a few years ago and have a number of
photos of fancy cars I could never even dream of affording. If I search for
“Lamborghini” or “Rolls Royce” it gives me the photos of those cars. I’ve
never tagged them and I’m not an Android user, so they aren’t reading my
messages.

[https://imgur.com/caz8D2Y](https://imgur.com/caz8D2Y)

I even have four photos I took at night, in burst mode, as a Bugatti Veyron
zoomed past, and yes, it can recognise those...

~~~
whatusername
iOS photos app does this as well.

~~~
dingo_bat
Samsung gallery app does this as well. You can even search for stuff like
"selfie".

~~~
wangweij
Maybe every photo using the front camera is a selfie?

~~~
dingo_bat
Didn't think of that obvious solution!

------
Houshalter
Google drank too much of their own koolaid. They were always seen as an "AI
company", but they really weren't. As late as 2008, Google's head of AI said
they used very little machine learning in search, and actually tried to avoid
using it as much as possible, because they found it very unreliable and it
gave weird and unpredictable edge cases. Everything was painstakingly hand
engineered.

But now we are at peak AI hype cycle. No wonder it's gone downhill. I'm sure
the AI does better on whatever metrics they tell it to maximize. But AIs game
the hell out of metrics. And we are still nowhere near human level
intelligence. It doesn't understand your query or the content of the websites.
All it has to go by is simple keyword matching and meta indicators like the
size of the website.

So now basically all searches return the same handful of large websites. When
was the last time you got a search result that went to some niche forum? Or some
small little homemade website by someone passionate about that specific
subject? No, it's always a Wikipedia link, followed by a bunch of contentless
news articles. And there's never any point in going past page 1, because every
other page is like that too.

And now the web has become this:
[https://www.ncta.com/sites/default/files/platform-images/wp-content/uploads/2015/01/expanding-consolidation-consumer-internet.jpeg](https://www.ncta.com/sites/default/files/platform-images/wp-content/uploads/2015/01/expanding-consolidation-consumer-internet.jpeg)

~~~
propman
The little passionate website in the 2000s used to be the best source of
information if properly vetted. Now it's only big websites regurgitating the
same superficial content written by paid writers who have no expertise in the
subject matter. They instead just copy other superficial sources.

I think this has dumbed down society because it certainly has dumbed down me.
Also, complicated topics are dumbed down to 1 sentence answers and it's very
difficult to get detailed information about something. Ironically, I've
started to go back to purchasing and reading books if I really want to go in
depth on a subject.

~~~
greenhouse_gas
On the other hand, from my experience:

I wanted to see how Raspberry Pi assembly worked - quite a niche subject (who
programs? In assembly? On a Raspberry Pi?).

Yet I found quite a few blogs going through the subject - a tutorial of sorts.

I was also looking up how to write an OS (just for fun) - another fairly niche
field. Yet I found tons of tutorials in C, C++, and Rust.

I remember trying the same in the 1990s.

Nada.

You couldn't find anything. I mean, maybe you'd find a basic site with the
source-dump of an OS (often without building instructions), but if you wanted to
dig into the meat (here is how you get from bootloader to x86-64 in 21 days,
and _why_ it works)? Foggetaboutit.

I _definitely_ don't want to go back to pre-google web1.0.

~~~
danmaz74
In my experience, it depends a lot on how much competition there is from
content farms. If your niche isn't targeted by those, it's easy to find the
good blogs. Otherwise, it's very difficult.

I suppose that this is a use case where ML could help a lot to recognize
"content farmed + SEO optimized" content. Hopefully Google could improve the
situation in the future - supposing they are trying.

~~~
mikehollinger
One of the most entertaining "bugs" that popped up recently, while I was
looking around for some opinion content that I would _hope_ is recent, was that
one of the content-farm sites had plagiarized a rather good blog post, sentence
for sentence, for pages and pages. Instead of "I found that ..." they'd
thinly edited it to "Now, let us find..."

The annoying thing was that the content farm appeared _above_ the original
content.

~~~
danmaz74
Probably because they employed better SEO than the original author :(

------
habosa
Many people here are pining for the old days of Google search where if you
knew the operators and some tricks you could develop very specific queries and
get a page that matches exactly.

I remember those days too. It did feel like Google search skill was a super
power. I also remember my non-technical friends and family being pretty much
unable to find what they were looking for online. Google did not turn their
rambling, approximate queries into the results they wanted.

Now I find that although the precision of the old style is gone, Google is
incredible at guessing what you want. Back in the day I might find the page I
was looking for on the twelfth 'o' of 'Goooooooogle'. Now I rarely have to
venture outside the top 3 results. I find that my family no longer needs my
help to craft a search query.

Isn't it possible that Google just made the choices that are better for the
average user and left some of us advanced users out? If your response to that
is "they could leave me an advanced mode", consider how much work it would be
to maintain that mode to serve a tiny subset of customers.

I think if you want a search engine for power users, you want DDG.

Disclaimer: I work at Google but not on Search or anything close.

~~~
akerro
I think you forgot who maintains the computers of these non-power users: it's
us, the advanced users. When I decide Bing or DDG is better than Google Search,
then Google can kiss my ass goodbye, along with the 20 computers I maintain
occasionally for family and friends.

~~~
mersault
Why would you take your family to DDG? Not all users need to use the same
toolset. If the parent comment is correct, and Google is targeting the "mass
market" user, then wouldn't leaving your presumably less-tech friends and
family with Google as their default search engine make your life easier? Just
because Google is no longer the best fit for you doesn't mean DDG is the best
fit for them.

------
telltruth
I'm wondering if rackless Ruth, Google's bean-counter-in-chief, is behind all
this. At Bing, bean counters had often calculated that if you cut the index down
to half (after a certain size), you reduce half of the cost but don't lose half
of the revenue. So there is a sweet spot where you can maximize revenue if
you are willing to let go of a few demanding customers. When quarterly results
need a little push, everything is fair game. I bet Sundar Pichai doesn't want
to look bad as the "CEO" who can't meet analyst expectations.

I definitely miss the old days of Google. After Amit Singhal left, things
haven't been the same at all. He had resisted unexplainable AI getting into
search features. But after he left, RankBrain's AI-driven signal became the
third most significant ranking feature. That feature is the reason why you
often see pages even if they don't contain the keywords, or even a phrase, you
had specified. The old guard knew that trading explainability for a little bit
of revenue, at the cost of a slight decrease in customer satisfaction, wasn't
worth it.

~~~
m12k
Did you really intend to call her "rackless Ruth" or was the slur an
accidental misspelling of reckless? Because pointless and crude name-calling
like that really detracts from any valid points you might have brought up.

~~~
shangxiao
According to Merriam-Webster [1] "rackless" is a variant of "reckless".

[1] [https://www.merriam-webster.com/dictionary/rackless](https://www.merriam-webster.com/dictionary/rackless)

~~~
m12k
Do you consider it likely that the poster I replied to routinely uses the
dialectal variant rackless over the common spelling reckless?

~~~
smoyer
You should also consider which areas of the world use "rack" as a slang for
"breasts". (I'm not disagreeing with you - as a native English speaker I
didn't know that rackless and reckless were synonyms).

------
cbr
Hmm. I just tried to reproduce this with old posts of my own and couldn't. I
picked random phrases from five early 2006 blog posts that get basically no
traffic and searched for them:

    
    
        "I had been playing the accordion Davy lent to
        Rosie during winter break"
    
        "The language they're using is not that different
        from the one I wrote PlayGUI to use"
    
        "I've been playing a decent amount of music lately,
         mostly guitar and piano."
    
        "warm dry socks was the most important aspect of the
        festival"
    
        "This wouldn't be that bad, if it was not exactly what
        happened a year and a half ago."
    

Google found all three. For each one there were either 2 or 3 results: first
my old post, then one or two from rssing.com which seems to do something with
my rss feed.

Trying them with Bing, it also found all five of my posts, and ranked them
first in four cases. In the fifth case ("This wouldn't be that bad, if it was
not exactly what happened a year and a half ago.") it ranked a
goodhousekeeping.com post higher, which had all the individual words but none
of the phrases.

(Disclosure: I work for Google, though not in search.)

~~~
Buge
>Google found all three.

All five? Or 3/5?

~~~
cbr
Sorry, all five. I wrote the comment with three, then thought more data would
be better and tried two more.

------
timbray
[Tim here] Folks might notice that the article is once again find-able. Being
2 links from the top of Hacker News will do that…

~~~
fixermark
Or the dataset outage was temporary. Remember, it's a distributed data corpus.

~~~
londons_explore
This needs more attention. The rarest pages are probably not very redundantly
stored.

Sometimes the datacenter your search query lands in might not have a copy of
the necessary page. Now they have to decide if they delay the entire search
query to remotely query another datacenter, or not. I would guess returning
the results early is nearly always more important than returning a result
which is so rare it has never been clicked in the past decade.

------
anfilt
I would not be surprised if Google still has the data. I'm not sure how Google
handles things internally, but it needs to pull up results fast. They might
have 4 billion results with the word "water" in them, and only make a tiny
portion of that available. If I type the words "hot water", Google looks at
the subset of pages with the word "hot" and the subset with the word "water",
and must pull the pages that have both words quickly. So the lists for "water"
and "hot" must be small enough to be quickly merged/intersected. There are
other things that could be done to speed it up, but I think you get the main
idea.
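
The core of that is just intersecting sorted posting lists. A minimal sketch
(the index and document IDs here are made up; a real engine's structures are
far more elaborate):

```python
# A sorted posting list per term; an AND query intersects the lists
# with a two-pointer merge. Shorter lists -> faster merges, which is
# one reason a trimmed index helps latency.

def intersect(a, b):
    """Merge two sorted posting lists, keeping IDs present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

index = {
    "hot":   [2, 5, 8, 13, 21],
    "water": [1, 2, 3, 8, 9, 21, 40],
}

print(intersect(index["hot"], index["water"]))  # -> [2, 8, 21]
```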

However, what I am getting at with that simple example is that for searches to
be quick, Google keeps these lists small. So there is limited space due to
time constraints, and Google must decide what is relevant enough for the
available portion of its index.

However, that does not explain why other search engines don't have trouble
with older sites/links. I suspect it's more of business decision than a
technical one.

~~~
incompatible
Intersections are another thing that Google search doesn't do properly
anymore. If I search for something like lkasdfjer samsung galaxy s8 it just
gives me matches for samsung galaxy s8 and ignores the first word. When I do
searches like this, I do it for a reason and don't want matches that lack some
of the search terms.

~~~
rob-olmos
I've found that if I put the keyword in double quotes, it becomes required in
the search.

~~~
RickHull
Not even this is sufficient any more. They now have a "verbatim" search, but I
think even then some terms can be ignored -- terms which are not conventional
"stopwords" like _the_.

~~~
yesenadam
Yes, verbatim is distinctly broken sometimes.

edit: It's not tedious for me on my browser, just click _Verbatim_ on the LHS
of page. (Can select that or _All results_ )

~~~
incompatible
I don't see that on the LHS. It would be nice if there's a link to it,
something like
[https://www.google.com.au/search?verbatim=true](https://www.google.com.au/search?verbatim=true)
that I could bookmark. Edit: or somehow set as the browser's default search
engine.

~~~
monort
add tbs=li:1 to the url:

[http://google.com/search?q=hacker+news&tbs=li:1](http://google.com/search?q=hacker+news&tbs=li:1)

~~~
emmelaich
Nice. Did you discover that yourself (gleaned from the url), or is it from some
documentation?

~~~
monort
From the url; I don't think they have documentation for it :)

------
snowwrestler
I wouldn't be surprised if this issue is as mysterious to Google staff as it
is to you.

Google Search no longer runs a clearly defined algorithm to find search
results. It is a collection of AI systems that are trained continuously on a
variety of data. There is probably no human alive who fully understands how
Google makes decisions about which results to return and how to rank them.
They just understand how to provide feedback to adjust results they don't
like.

If the system gets rewarded for finding common things quickly, then it will
adjust its internal algorithms to make that happen--perhaps even if that means
dropping unpopular results altogether.

~~~
colordrops
That's unfortunate, because their simpler system worked better.

~~~
dx034
I don't think so. A few years ago, Google worked much better when you knew how
to phrase queries. I often helped family members to find something online,
just by rephrasing their original query.

Today, it doesn't matter how you build your query, Google returns good results
in any case. That also means that you can't search for specific info by
phrasing queries differently, but for the vast majority of people it makes
life much easier.

~~~
brain5ide
You have a good point here. However, it would be great if both modes were
maintained: one with "good enough" results for the common crowd, and another
with a scalpel-sharp ability to dissect the web.

~~~
dx034
If you can do that at Google's scale, I'm sure they'd be happy to hear your
proposal. I really doubt Google's engineers wanted to give up the precision
(they'd use it for their daily jobs as well); I just think they couldn't
justify the additional resources.

~~~
Dylan16807
Don't make it sound like a lack of ability. They could keep giving time slices
to the old code if they wanted to. And it wouldn't take _that many_ resources,
"justified" or not.

~~~
londons_explore
Keeping old code running within Google is surprisingly hard...

Most big services (e.g. Google Docs) have entire teams of people who run them
(SREs). Without those people to keep the system going, it would probably
break within a few weeks.

------
imgabe
> They’ve never claimed to index every word on every page.

Not in those words, but they do claim to aspire to “Organize the world’s
information and make it universally accessible and useful.”[1] which ought to
include old web pages. They've gone to the effort of finding out-of-print
books and digitizing them to make _those_ searchable, so it doesn't seem like
a ten-year-old web page should be such a stretch.

[1] [https://www.google.com/intl/en/about/our-company/](https://www.google.com/intl/en/about/our-company/)

~~~
jxramos
You'd think it would at least come up in the Internet Archive if nowhere
else.

~~~
paulcole
[https://web.archive.org/robots.txt](https://web.archive.org/robots.txt)

~~~
emmelaich
That's unfortunate. But understandable in a way.

    
    
        # robots.txt web.archive.org 2013-10-02
    
        User-agent: *
        Disallow: /
    
        User-agent: ia_archiver
        Allow: /

~~~
jxramos
Touché. I don't suppose the old non-commercial websites mentioned in the
article suffer from the same problem though, right? Or maybe an accidental
robots.txt file was mistakenly left around?

------
tjoff
_So from a business point of view, it’s hard to make a case for Google
indexing everything, no matter how old and how obscure._

I don't get that line of thought; somehow people have started defending lack
of quality as something expected or reasonable.

The whole point of going to Google is to find stuff, and that includes
"boring" old and obscure stuff that won't sell ads. That is part of the deal;
if Google doesn't care about that, why should I care about Google?

While we are on the subject, I still miss being able to add a + in front of a
word to highlight its significance. This was removed in favor of Google+ (you
can still do it with quotes), but now that Google+ has been irrelevant for the
better part of a decade, maybe it is time to let us quickly emphasize words
again? Sigh.

~~~
mort96
Google's job is to make money, and to grow and make ever increasing amounts of
money, not to find stuff. If Google decides that it's worth it to lose some of
its customers in order to reduce costs, then it's fair game for them to do so.

Maybe a competitor will come in and steal marketshare from Google by filling
the hole Google leaves behind when they increasingly make changes which annoy
a subset of users (DuckDuckGo is probably closest, and I use it as my default
search engine, but I don't find it's as good as Google yet in a lot of cases).
Maybe enough people will switch to be an issue for Google, in which case
Google miscalculated the cost of pissing off that subset of its userbase.
Maybe so few people will switch that it's offset by the cost savings of not
keeping everything indexed, in which case it might've been the correct
decision, from a capitalistic point of view.

This focus on constant growth is the issue with relying on companies and
capitalism, but that's a bigger discussion.

~~~
tjoff
What makes you believe this is a rational decision, even if you reduce
everything to "must make money"?

This hasn't got anything to do with capitalism. It is pure greed. Companies
will do anything for a slight increase in revenue, even if they willingly
acknowledge that it will cost them ten times as much in the (not so) long
term.

There is nothing about capitalism that says you must be a colossal idiot;
that's just a consequence of a poisonous culture where employees don't give a
crap about the company but only focus on their own career.

------
smoyer
I've done essentially the opposite ... I use DuckDuckGo as my everyday search
engine and only use Google when there's not an acceptable result there. Over
the last couple of years, the number of times per week that I switch to Google
has probably halved.

~~~
amerine
This is basically my exact experience. A long time ago I made DDG my browser's
default search engine and it stuck. I really like using “!g” to switch quickly
to Google.

If my search bar DDG search doesn’t return satisfactory results, I hit my
address/search bar key (cmd+l for me), press left arrow/start-of-line key,
type “!g “ and hit return. That gives me a google result page quicker than the
time it would take to reach the mouse. It’s become a good workflow.

I’m having to use it less and less. DuckDuckGo is getting really good.

~~~
lloeki
Tip: while in your case you still have to type something to get rid of the
selection, you don't have to put !g at the beginning, anywhere in the query
will do. I mostly put it at the end.

~~~
knight17
Instead of clicking, you can use / or h to focus on the search field, press
the [right] arrow key, and type your preferred bang (!g) to open in Google; no
need to add any space after the last word (as you said).

[https://duck.co/help/features/keyboard-shortcuts](https://duck.co/help/features/keyboard-shortcuts)

------
tomc1985
I would bet this is one of those more subtle long-term effects that nobody
really saw coming... when Google refocused search with an eye towards
commercial results, I imagine it deprioritized a lot of the older, more
innocent informational content lying around

~~~
malchow
This has been my experience. With Google I am constantly asking myself: Wait,
what about all that glorious, smart, noncommercial web content that I _know_
exists? Like the Stanford Philosophy Encyclopedia[2] or that economics
professor's dataset that I remember being referenced in a podcast a year ago?

Google seems to have decided that Wikipedia is the only blessed noncommercial
source of intelligence.

I guess, if I were to put it strongly, I'd say: using Google is not like using
the Internet any longer.

FWIW, HNers may wish to check out Yewno[1], a knowledge search engine based in
Redwood City that I've had the pleasure of being (tangentially) involved in.

[1] [http://yewno.com](http://yewno.com)

[2] Yes, I know this is indexed. It just frequently gets buried in my
searches.

~~~
telchar
I feel this way with regard to the many, many forum posts containing the exact
answers to questions I want to know. So much subject-specific, often hobbyist
information seems to exist primarily on forums yet I almost never see forum
posts turning up in search results these days. More typically I'll get (e.g.)
five results at variations of the hp.com landing page or some other
contentless nonsense.

------
tobinfricke
I just noticed this also, earlier today. I have a blog entry titled "Highest
airports in California" that I attempted to find using Google. Even with the
double quotes, and restricting the search to the correct site, it doesn't seem
to come up in search results.

The site's robots.txt seems permissive enough. What's up?

link:
[https://nibot.livejournal.com/1075122.html](https://nibot.livejournal.com/1075122.html)

~~~
RileyJames
This comment now comes up with the quoted search term, so your blog post will
likely be indexed again in the next few days. If you submit a sitemap to
Google Search Console[1] you'll be able to see which pages have and haven't
been indexed. I'm not sure how else you can easily track indexing.

[https://www.google.com/webmasters/tools/](https://www.google.com/webmasters/tools/)

------
pbw
It seems possible these particular articles fell out of Google's index for
some other reason than they are not indexing "old" stuff. That's kind of a big
leap to make without many more examples.

~~~
dvfjsdhgfv
We may never get to know these reasons. Moreover, from the point of view of
the author they might be irrelevant. What matters is that Google is not
indexing them, whereas the competition does.

(I agree looking for one's own articles is a specific case - in most other
situations you'd want to know Google's reasons very badly.)

~~~
jaimehrubiks
I still think it could be some other reason that led Google to de-index that
website and therefore not index it again.

A good review would show more proof than just websites behind the same
domain.

Interesting though; let's see if there's more news on this.

------
l33tbro
While we're all grumbling about Google, my main gripe is the increasing
indexing of Pinterest in web search. It was bad enough a few years ago when
they started clogging up image searches, but I've noticed a lot of links to
Pinterest now in the first few web results of a search.

~~~
bambax
Concurred. Pinterest is really bad. You need to create an account to do
anything with it. Now I have to add -site:pinterest.com to most of my image
searches.

~~~
dx034
The problem with many search algorithms appears to be that you profit
massively if your content is on a highly ranked domain. Pinterest is very
highly ranked, so all its content is automatically assumed to be of high
quality.

The same content hosted on another domain would appear much lower on Google
than a Pinterest post (or a post on any other large website). Not sure if
that's really the best approach but I guess it's not easy coming up with a
better one.

------
fallous
I'm afraid the author has it wrong when he claims that what Google cares about
is "giving you great answers to the questions that matter to you right now."
It cares about making money via ad revenue. Even assuming that it gives you
"great answers to questions that matter to you right now", it only cares about
doing so if that results in ad revenue. The result set is a means to an end,
not the end itself. If more money can be made by providing some other result
set, that is what will become the preferred method.

Google is not a search business despite opinions to the contrary. A search
business would have as its main source of revenue customers paying for either
search results or the search technology itself. Google makes its money via the
delivery of ads, through its own advertising sales and placement platforms and
through the audience it can provide from its own content.

------
fergie
Yes, I have noticed this recently too, although I have no hard and fast
examples.

Google _seems_ to be returning more popular/mainstream sites at the expense of
less popular sites that may be more relevant.

Also- Google has stopped penalising sites that don't contain all of the search
terms. It has always removed "stopwords" (words that are too common to be
relevant: and, it, or, etc.) which is fine, but now it seems to remove
significant terms as well. This makes a big difference if you are searching
for programming stuff, particularly error messages.
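
For contrast, classic stopword removal looks something like this (the
miniature stopword list below is made up for illustration):

```python
# Toy illustration: classic stopword removal drops only very common
# function words; the significant, rare terms survive.

STOPWORDS = {"a", "an", "and", "in", "it", "of", "or", "the"}

def strip_stopwords(query):
    """Keep only the terms that aren't in the stopword list."""
    return [t for t in query.lower().split() if t not in STOPWORDS]

print(strip_stopwords("segfault in the strtok function"))
# -> ['segfault', 'strtok', 'function']
```

The complaint is that a rare, significant term like "strtok" --- the one that
makes the query specific --- now sometimes gets dropped as if it were a
stopword.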

From a pure search point of view Google is losing ground to its competitors,
and that has _never_ happened before.

------
liveoneggs
I have also noticed that google search results, especially in the last few
months, are incredibly weird and jumbled, as if they so desperately want to
show me the $current_chosen_web_winners of news/ecommerce that they sneak in
results from them no matter what the terms.

------
ternaryoperator
This is definitely a thing. Many sites, when they update, take down old
content that was getting few views. Often that content is irretrievably lost.

This shows the value of actually grabbing content that you plan to use or hope
to refer to in the future, rather than merely bookmarking it. And it also
underscores the value of the Internet Archive.

~~~
DashRattlesnake
> This shows the value of actually grabbing content that you plan to use or
> hope to refer to in the future, rather than merely bookmarking it. And it
> also underscores the value of the Internet Archive.

Yes. Everyone should install the Wayback Machine plugin and click the "save
page now" whenever they find something useful or interesting:

Chrome: [https://chrome.google.com/webstore/detail/wayback-
machine/fp...](https://chrome.google.com/webstore/detail/wayback-
machine/fpnmgdkabkmnadcjpehmlllkndpkmiak?hl=en-US)

Firefox: [https://addons.mozilla.org/en-US/firefox/addon/wayback-
machi...](https://addons.mozilla.org/en-US/firefox/addon/wayback-machine_new/)

I hate hitting unarchived dead ends when doing research, so I'm trying to do
my part to prevent them. Many pages I've archived had never been saved there
before.

------
larkeith
This explains a number of times I've been unable to find old
articles/forums/what have you, even when fairly certain I recall most or all
of their titles. This may finally be enough for me to move to DuckDuckGo, as
the amount of information published long ago keeps growing and the
information I may wish to reference becomes increasingly difficult to locate.

------
saint-loup
Anecdote, not the plural of data, and all that, I know, but Google is able to
find 15-year-old posts from rather obscure French blogs.

[https://encrypted.google.com/search?hl=fr&q=%22Dans%20mon%20...](https://encrypted.google.com/search?hl=fr&q=%22Dans%20mon%20entourage%2C%20et%20probablement%20dans%20le%20monde%20entier%2C%20il%20y%20a%20des%20hypertol%C3%A9rants%22)

[https://encrypted.google.com/search?hl=fr&q=%22Eh%20bien%2C%...](https://encrypted.google.com/search?hl=fr&q=%22Eh%20bien%2C%20d%27abord%2C%20travailler.%20Je%20suis%20au%20regret%20de%20dire%20que%20ma%20th%C3%A8se%20n%27est%20pas%20encore%20%C3%A9crite%22)

BUT, interestingly enough, it can find posts from the very first months of the
blog's existence, yet stays clueless about several posts dated a few years
later. For instance, several exact strings from this page are not found by
Google:

[http://blog.smwhr.net/2003/10/](http://blog.smwhr.net/2003/10/)

------
visarga
Google search team, I wouldn't want to be in your place today. My frustration
with Google Search is its degree of protection against bots. If you, Google,
can spider everything webmasters post, why can't webmasters spider everything
Google returns?

In my case, I would like to use Google search to bootstrap a few minor search
engine indexes and collect data for NLP projects. But the free version is too
limited and the paid version prohibitively expensive, so, no luck.

~~~
fixermark
> why can't webmasters spider everything Google returns?

I'm sure you could start client-side caching every search you ever do to
Google.

But if you're searching enough to eat up Google's bandwidth, they're paying
for that data and they're under no obligation to keep serving you as a client
(much as any server is under no particular obligation to serve a search
spider).

~~~
trendia
> But if you're searching enough to eat up Google's bandwidth, they're paying
> for that data and they're under no obligation to keep serving you as a
> client

Do you not see the irony?

~~~
fixermark
No, I don't. Can you help clarify it for me?

Search engines crawling millions of sites each with---on average---a few MB of
data distributes cost globally.

Extracting terabytes of index data from a single search engine's repository
consolidates the cost on the back of that repository's bandwidth provision.

These are not symmetrical cost structures.

~~~
nbsd4lyfe
Our git repository went down when crawlers decided to index it

~~~
dx034
But probably not Google. The Google crawler is very careful and stops as soon
as it encounters higher error rates. Bing appears to do the same.

------
mc32
I hadn't used Bing because whenever I compared its results against Google's,
it was no match, especially in long-tail results. This article has made me
reconsider my assumption.

Usually GOOG had the best results for technical queries, but fresh results
tend to be the better ones there (things go out of date pretty quickly
because of quicker release cycles).

However, on occasion it has been difficult to find very specific results for
non-technical questions. I never even bothered with DDG or Bing, but now I
will surely give them a try.

If they are data-driven (and they are) for their average users, long-tail, old
results probably don't make sense. How many people really go down to page 20
of the SERP? I'm sure it's a very minuscule number.

~~~
userbinator
_how many people really go down to page 20 of the SERP? I'm sure it's a very
minuscule number._

Unfortunately those people are the ones who are searching the hardest for the
most difficult-to-find things, and thus need the services of a search engine
the most. It's unfortunate because, for every one of those, there are probably
millions of others who just want to search "facebook" and click the first
link; a site that I don't even use, yet whose domain name I can recite off the
top of my head.

~~~
mc32
What's most disappointing is when Google tries to pull fresh content over
"stale" content when you know you want exactly the "stale" content.

This is a contrived example, but say you want to look something up regarding
what someone said about drug overdoses back in the 80s. Google would insist on
bringing up information about the most recent overdose studies, for example,
because people are currently discussing that more; so to Google, obviously I
should also be looking at the fresher content, and the end result is that I
get less relevant, or virtually irrelevant, results.

------
CapitalistCartr
I loved AltaVista. It didn't provide the cleverness of Google; you had to
bring your own. I'd construct searches of the pattern:

(word OR word) AND (word NEAR word)

and get excellent results. Of course, the Web was much smaller then.

------
fixermark
"[Update: Now you can, because this piece went a little viral. But you sure
couldn't earlier in the day]"

And that's kind of the point, right? Beyond a certain threshold of popularity,
some things aren't always available from every search query, because a
distributed system can't have 100% uptime, consistency, and tolerance to
network partitioning.

------
harshreality
Is Google using neural nets as an integral part of search indexing yet?

It's well known that there are a bunch of metrics that go into which results
to return — metrics including things like PageRank, (probably) historical
value (# of clicks when the page appears in results), and social media
popularity.

I wouldn't be surprised if Google has experimented with training models to
predict most of those metrics, given only content from the site itself, and
tried using those models as a filter for what to index in the first place. If
the NN is accurate enough, they can use it as a filter at indexing stage
("should I index this?") rather than at the results ranking stage (where real
data, rather than NN model output, answers the question "should I show this
page close enough to the top of results that someone will see it?").
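If Google did gate at indexing time, the mechanics might look something like this toy sketch. Every signal, weight, and threshold below is invented for illustration; none of it is anything Google has published:

```python
def predicted_value(page: dict) -> float:
    """Toy stand-in for a model that predicts a page's long-run search
    value from content-only signals (all of these signals are made up)."""
    score = 0.0
    score += 0.4 * min(page["inbound_link_estimate"], 10) / 10  # link popularity guess
    score += 0.3 * min(len(page["text"]) / 5000, 1.0)           # substantial content
    score += 0.3 * (1.0 if page["has_original_title"] else 0.0) # not boilerplate
    return score

def should_index(page: dict, threshold: float = 0.35) -> bool:
    # Gate at indexing time instead of demoting at ranking time:
    # pages below the threshold never enter the index at all.
    return predicted_value(page) >= threshold

page = {"inbound_link_estimate": 2, "text": "x" * 3000, "has_original_title": True}
print(should_index(page))  # True
```

The point of the sketch is the asymmetry the parent describes: a ranking-stage demotion can be reversed by real click data, but an indexing-stage gate means the page is simply never seen.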

~~~
taneq
From bits and pieces they've posted, it sounds like they use some all-
encompassing glob of statistical inference that they call RankBrain, which
almost certainly includes some deep-learning components. They've said that the
old PageRank algorithm is now one input into RankBrain.

------
Joe-Z
>My mental model of the Web is as a permanent, long-lived store of
humanity's intellectual heritage.

And that's where I don't agree with the author. I've been thinking about this,
and it seems to me we need to foster a culture of forgetting. Just because we
can store everything forever doesn't mean we should. That kind of thinking is
exactly where the 'track everything everyone does' mentality comes from, which
governments seem so keen to apply in the name of 'terror prevention'. It also
values the regular stuff you do way too highly.

Let's face it: a huge portion of the web is garbage and most information on it
is ephemeral. The really useful stuff, like encyclopedias, will be used
regularly and thus stay indexed anyway. For the rest: just let it go.
Forgetting can also be relieving, you know?

~~~
pmlnr
I wrote about this when I dropped a lot of content from my site:
[https://petermolnar.net/making-things-
private/](https://petermolnar.net/making-things-private/)

------
mherdeg
I have some questions about information retrieval and SLOs:

* Is there a metric of search quality which is appropriate here -- specifically, "when I search for [site:tbray.org rock roll], and receive a set of results, that set includes Tim's article"? What do we call this metric? The metric would be lower when the result set is empty (no relevant results returned) and higher when the result set contains the desired article (a relevant result was returned).

* How would you assess the quality of this particular search against a metric?

* How would you measure the overall quality of "all searches in the past hour, including the [site:tbray.org rock roll] search"? How would this one failure to find a page contribute to an overall success rate?

* Is there any possible automation that would notice whether Tim's article has started to be missing from indexes and say "hey, this represents a loss of a kind of quality"?

* Suppose the index were to (say) discard all pages created before 1999 but simultaneously improve the relevance of all queries that find more recent results. If (say) 99.99% of queries have users happy getting only post-1999 links and (say) only 0.01% are unhappy because they specifically wanted a pre-1999 result, but things get way, way better for the 99.99%, was that a bad change? Would any metrics show a problem?

I don't see super satisfying answers to this at e.g.
[https://www.quora.com/How-does-Google-measure-the-quality-
of...](https://www.quora.com/How-does-Google-measure-the-quality-of-their-
search-results/answer/Nikhil-Dandekar) or [https://www.quora.com/How-can-
search-quality-be-measured](https://www.quora.com/How-can-search-quality-be-
measured) . If I'm reading right, it sounds like part of the state of the art
for search quality recently involved human raters manually running sample
queries… That seems kinda crazy / totally unlikely to catch certain obscure
issues. But then again:

* What is the service level objective for search quality? If search is getting way better for 99.99% of users because of various optimizations, is it a problem if a particular 0.01% of queries such as Tim's old review query, which he expected to find one specific page, instead find no results at all?

And then I guess I wonder:

* According to whatever metric correctly captures Tim's review being missing as a problem, what is the current search quality of Google web searches and how has it been changing over time?

~~~
BjoernKW
This won't answer all of your questions, but the measures you're looking for
are called 'recall' and 'precision':

\- recall: number of relevant documents retrieved / number of relevant
documents

\- precision: number of relevant documents in result set / number of documents
in result set
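In code form, using the thread's running example (a minimal sketch; the document IDs are made up, with one relevant review among five results):

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved: set, relevant: set) -> float:
    """Fraction of all relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant)

relevant = {"rock-n-roll-animal-review"}  # the one page we actually want
retrieved = {"rock-n-roll-animal-review", "p1", "p2", "p3", "p4"}

print(precision(retrieved, relevant))  # 0.2
print(recall(retrieved, relevant))     # 1.0
```

Note that recall is exactly the metric mherdeg asks about above: it is 0 when the desired article is missing from the result set and 1 when it appears.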

~~~
mherdeg
Yeah you know, it's funny, the last time I worked on question-answering code,
we were trying really hard to find algorithms that could improve a particular
metric (F-score, a synthetic agglomeration of precision and recall) ... I
don't remember hearing very many conversations at all about whether we were
measuring the right thing.

Given a query like [site:tbray.org "rock n roll animal"], and knowing that the
1 relevant document we actually want is the review at
[https://www.tbray.org/ongoing/When/200x/2006/03/13/Rock-n-
Ro...](https://www.tbray.org/ongoing/When/200x/2006/03/13/Rock-n-Roll-Animal)
, I think we can say that

* if Google search returns 4 results for the query, not including the review: precision is 0/4, recall is 0/1 (so p=0, r=0)

* if Google search returns 5 results for that query, including the review: precision is 1/5, recall is 1/1 (so p=0.2, r=1)

But while I _kind of_ understand how we can use these measures to assess the
outcome of a single query, I'm really not sure I understand what meaningful
ways are available to aggregate those metrics. Suppose we're going to get 1M
queries in the next hour. Do we prefer the algorithm with the highest mean
F-score per query? The highest median F-score per query? Or the one with the
highest 1st-percentile F-score per query (99% of queries get the best possible
outcome)?
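For what it's worth, the F-score mentioned above is just the harmonic mean of precision and recall, and the aggregation question can be made concrete (a sketch; the 9,999-to-1 split of happy-to-unhappy queries is invented):

```python
from statistics import mean, median

def f_score(p: float, r: float) -> float:
    """F1: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0

# The two single-query scenarios from above:
print(f_score(0.0, 0.0))  # review missing from results -> 0.0
print(f_score(0.2, 1.0))  # review present among 5 results -> ~0.333

# Aggregation: one catastrophic query among 10,000 barely moves the
# mean and doesn't move the median at all.
scores = [f_score(0.2, 1.0)] * 9_999 + [f_score(0.0, 0.0)]
print(mean(scores), median(scores))
```

Which is exactly why a mean or median over all queries can look healthy while an entire class of rare, hard queries silently returns nothing.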

If there is published literature on how search quality is measured I'd love to
see it. Would be especially interesting to see real-time data -- e.g. what is
the impact of 1 data shard outage on overall user-experienced quality
according to some metric?

~~~
BjoernKW
"Modern Information Retrieval" by Baeza-Yates and Ribeiro-Neto was, a few
years ago, a good standard work.

I'm not sure though how well it's kept up in terms of aspects like real-time
search and graph search, both of which are fairly recent developments.

------
damonsauve
In May of 1998, I published a review of Lou Reed's "Perfect Night" by Kevin
McGowin, and when I google (in FF private mode, as if that helps) "lou reed
kevin mcgowin" it comes up #1.

If I google "lou reed perfect night review", I stopped looking after page 17
of the results. There are just too many.

If I google /"lou reed" "perfect night" review/ with quotes as you see them,
the review I published is on page 2, result #3.

I feel your pain, but, as someone who started publishing content in 1995, I
don't see Google as having a memory problem.

My pain points are related to how Google crawls my content, my sitemaps, and
how the two seem completely independent of each other.

------
40four
Great tip about the intext: operator from @wahnfrieden. I'll admit I was
unaware of this. Many of us could probably benefit from learning how to
actually use Google more skillfully, instead of expecting the algorithm to
spit out a 'perfect' answer every time.
[http://www.powersearchingwithgoogle.com](http://www.powersearchingwithgoogle.com)
The intext: operator is covered in the Power Searching course, section 3.5! I
share many of the same frustrations I've seen in other comments, but I'm going
to work through these courses and hopefully become a better googler.

------
skerit
A few years ago they put up an old index people could, well, google in.

I wonder if they have any other index snapshots stashed away somewhere, I
would love to do that again. Even if only to retrieve the urls of old
homestead websites I had back then.

------
StavrosK
IPFS mirror:
[https://www.eternum.io/ipfs/QmTVbQsQrf8AmzUZxm4DvYW2sFEdsaPr...](https://www.eternum.io/ipfs/QmTVbQsQrf8AmzUZxm4DvYW2sFEdsaPrUeNoKUTghSdmXz/)

~~~
ajkjk
Why?

~~~
StavrosK
The site is very spotty to load for me. Is it just me?

EDIT: Yeah, the TLS handshake takes 20 seconds for me. I don't know why.
Everything else works fine.

------
bsaul
This post, and some comments about the lack of predictability of today's AI
behaviors, make me wonder: could this be the start of a new trend for
startups? A "100% predictable, AI-free product"?

------
dlwdlw
I think AI works best when you have the AI on one side, and something "dead"
on the other, like a pattern. Games, images, etc...

Information nowadays isn't dead like it used to be; it's alive. It has
desires and seeks readership; it mutates and wants to spread. This information
has its source in human intelligence.

It seems possible it is outwitting Google's AI, as the smart people have
shifted their focus from being smarter than "bad" information to building
dumber-than-human AIs to improve bean-counting metrics and bask in glory.

The core problem has mutated from "what is relevant" to "what is quality".
Pagerank and the web of dead info could answer both with one number because
the quality signal and the relevance signal were the same and hard to fake.

But if you can hyper optimize for relevance by making content addictive thus
affecting downstream attitudes on "relevance", quality is no longer relevant.
Current AI is smart, but still childlike and easily corrupted/manipulated.
It's a black box that can't be inspected and adapts to change, but because its
tempo of dynamic modeling is slower than a real human's, it can be "trained"
or hypnotized.

Hell, it's probably the core sociopath skill. Being able to manipulate the
value/principles system of someone/something else.

------
stevenicr
Thank goodness people are pointing these issues out. I've been considering
some blog posts to collect more info and hard examples of exactly this and
similar google issues that are so often not talked about, especially by those
who have ties to the big G.

There are indeed many keyword searches for which Google is obviously censoring
the internet in big chunks. Because they are so opaque, no one knows if it's
directed by a government agency, or shareholders, or the vision of some guy at
the top who thinks they need to grow up, or some small team so hellbent on
destroying some things that they don't worry about the collateral damage. Or
maybe it's "machine learning" figuring out that if you censor big chunks here
and there, people will spend more on ads. Who knows? Very few people. And who
is affected? Many people, many who don't even know it. Facebook, of all
systems, is changing its algorithm-driven ways to be better for people and
less for the algorithm. Will things change at big G now that the last
babysitter left?

Google's current path, IMHO, is to become the next Yellow Pages. Sure, they
will do fine being embedded on so many mobile devices and being the go-to
place for crowd-sourced directions and locations. But the Yellow Pages was
king for a few days too, until people realized its value to the consumer and
to the business was no longer what it was; and now who is using YP? More
details on a non-Google-run blog soon. Who can I trust not to give up the IP
info of posters to tech peeps at G? Wordpress.com? Oh wait, they use Google
fonts and all kinds of things, don't they. Hopefully people can help me put
this info together; it is prime time to create some new alternatives that only
do slices of what G used to do well.

------
cbar_tx
Haven't used google's search engine in years. It was obvious to me back then
that google products are designed primarily around googles's incentives, and
the user second. SEO seems like it would be something great...if the
"optimization" was in fact user-centric but it's not. I don't like being
guided by algorithms and I don't really shop around for stuff online. When I
search for something, it's usually for some very specific type of information.
I mean I'm usually pretty sure about what exactly I'm looking for. Granted,
most people probably don't use the web the same way I do so it's not like I
expect them to do what they do differently. I just don't use google, and
whenever I do tell someone to "google it", I'm being sarcastic and probably
not in the mood to answer questions that can be answered with a few minutes at
a keyboard.

Furthermore, when someone links me to an amp.whatever page, I might be in an
even worse mood if I have to talk with them about it afterwards. imo, Google
and facebook algorithms are half of what's causing most of the conflict and
hate on the internet these days. The machines have literally taken over, and
they've started with our minds. YouTube suggestions have devolved into nothing
more than a rabbit hole that gets scarier the further you follow. Either they
are completely out of touch with what people want or the mentality of
society/humanity is way more fucked off than I'd ever imagined it could be.
Until they provide actually useful and configurable settings for people who
can think for themselves, I will not return to using these products.

Besides, how fair would it be for me to freeload off an ad company's services
when my hosts file hasn't allowed an ad to load in my browser as long as I can
remember?

------
GistNoesis
No later than yesterday, hearing about Dolores from the Cranberries reminded
me of Amy Macdonald (whose voice I find similar) and her hit "This is the
Life" from ten years ago. The official lyrics mention "Talking about Robert
Ragger and his 1 leg crew". Being of a curious nature, I searched for the
reference ("Robert Ragger") to find out who this Robert Ragger (and his 1 leg
crew) might be. Bing got it immediately. Google's 1st result is a
self-referencing forum thread from 2009 saying there are plenty of results and
you just have to google it: [https://forum.wordreference.com/threads/talking-about-
robert...](https://forum.wordreference.com/threads/talking-about-robert-
ragger-and-his-one-leg-crew.1425718/)

For those who don't do the search: it's officially misspelled and actually
refers to Robert Riger, an art photographer (hence the one leg crew :)).

------
njarboe
I imagine that people who work at Google would like to have a good search
engine that works well for software engineers while at work and not one
optimized for selling ads. Is there such a thing internally? I remember when
exact quotes of error messages from software would usually return something
helpful. Now, almost never. Can I pay for that, please?

------
hyperpallium
Theory: Because google has local data-centers all over the world (that's why
it's much faster than the competition, and google suggest works so well),
indexes must be maintained at all of them. Because google has so many, this is
a significant expense, and to keep costs down, they reduce the indexes to what
is profitable.

~~~
pavs
Good theory; maybe that's why they push the "personalization" and
localization of the search engine so aggressively.

I personally don't feel that Google search quality has degraded much, if at
all. It's true that I rarely get new sites outside the echo chamber of a few
thousand popular sites, but 99% of the time they give me the relevant and
useful information I am looking for. To be honest, Google search results are
on average still head and shoulders above the competition, in almost all
aspects.

~~~
dx034
I think they push localisation because it makes sense for the user. In Europe,
queries for at least 10 countries will usually be answered by the same data
center and are still localised. The main reason why I still often use Google
instead of DDG is to find localised content where you can't localise by
language (e.g. finding UK specific issues).

------
JepZ
Yeah, while most of the time modern Google has an awesome capability of
understanding what you are searching for, sometimes it feels like a chatbot
that didn't fully get what you said to it. It seems to understand some part of
what you are asking for but miserably fails to answer the complete question
(in some cases) :D

------
flukus
De-indexing old stuff might not be a good idea, but I'm increasingly running
into the problem of Google (and DDG) returning old and outdated results. I
wish they would put more weight on recent articles, or at least add the option
to. The time-filtering options just aren't enough.

~~~
viraptor
Why is the drop down in tools not enough?

~~~
flukus
Because I often don't know the range I'm looking for; it could have been
yesterday or it could have been 4 years ago. If I select the last 12 months I
might miss something from 13 months ago. There's a lot of ambiguity in what
I'm searching for (otherwise it would be called a lookup), and when I'm
looking for current information, weighting by age is a lot more natural.

Another issue is that I don't know what the filter is selecting either, a 6
year old article might be better if it's been updated, but I can't tell from
the interface what property is being filtered.
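One simple way the age weighting wished for above could work is an exponential decay applied to a relevance score. This is purely a sketch with a made-up `half_life_days` parameter, not anything Google or DDG documents:

```python
def age_weighted_score(relevance: float, age_days: float,
                       half_life_days: float = 365.0) -> float:
    """Decay a relevance score so a result loses half its weight
    every `half_life_days` (an invented tunable)."""
    return relevance * 0.5 ** (age_days / half_life_days)

# A fresh, slightly less relevant page can outrank a stale one:
print(age_weighted_score(0.8, age_days=30))    # recent page
print(age_weighted_score(1.0, age_days=1460))  # ~4-year-old page
```

Tuning the half-life per query class would let "current events" queries decay fast while evergreen queries barely decay at all, which avoids the hard cutoff problem of the 12-month filter.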

~~~
mxfh
I often happen to search for things in multiple intervals. Often the top
results are obsolete, since they are more than a year old.

What drove me nuts is that their QA didn't catch the bug with the date format
order for years. It only recently got fixed. The calendar selection was
regional and put in the date in the region's format (often dd/mm/yyyy), while
the query form expects mm/dd/yyyy.

------
ins0
I'm confused. Did I do something wrong, or why can I find the referred
article just fine on Google, just by searching the keywords?

[https://imgur.com/a/szBcB](https://imgur.com/a/szBcB)

~~~
hayksaakian
Because the OP wrote an article that links to these things, and caused them to
be re-indexed.

It's not purely a function of 'date published'; it is also about frequency of
access.

~~~
ins0
Ah true that makes sense, thanks for clarifying my confusion

------
wybiral
Doesn't Google downplay non-HTTPS sites, which older ones probably are? Also,
maybe it's just part of the algorithm that people tend to search for recent
pages more often than old ones.

------
montrose
Maybe this means there is room for someone to start a new search engine for
hackers. That would be very exciting news, because that's what Google was
originally: the search engine hackers used.

------
just2n
This reminds me a lot of another article posted here recently:
[http://www.sicpers.info/2017/12/computings-fundamental-
princ...](http://www.sicpers.info/2017/12/computings-fundamental-principle-of-
no-learning/).

This behavior makes sense for most people (those who don't know what they're
doing), at the expense of rendering the tool useless in cases where advanced
functionality is needed. Perhaps Google should have an advanced, AI-lite
version.

------
ksec
Slightly off topic; not about result accuracy for a minute.

 _Obviously, indexing the whole Web is crushingly expensive, and getting more
so every day_

Why? CPUs, memory, and SSDs have all gotten a lot faster and bigger. Bandwidth
is also a lot cheaper for Google, since they essentially own the fibre. Faster
algorithms may have an even bigger impact.

I would have thought information, in terms of text (not video and pictures,
which are a few orders of magnitude larger), would now be cheaper to index
than in the old days.

~~~
andrewmcwatters
Developers say this type of shit all the time and people wonder why we don't
have fast software.

------
justinmoh
I think this is a way of Google kind of _forcing_ you to participate in
contributing to their AI development.

By _improving_ their system, it creates some difference; if no one cares, or
no one can justify the necessity, then Google doesn't need to care.

The constantly increasing amount of information on the web is certainly a
burden to Google. And if such an AI can handle 99.99% of what matters to
users, it is already a brilliant one.

After all, I don't think Google ever wanted to be an archive searcher.

------
cleeus
> My mental model of the Web is as a permanent, long-lived store of
> humanity's intellectual heritage.

My mental model of the Web is a human brain with all kinds of weird mechanisms
- one that forgets things, makes things up, mixes memories. Google is like an
"association cortex" \- take it away and all the memories are still there but
access to them is much harder and needs to go through more indirections
(links).

------
Arbalest
I imagine this is a symptom of their parallelism. Sort of like the App Engine
datastore being eventually consistent, they operate optimistically. Presumably
this reduces their costs and increases responsiveness. As for Gmail, many
years ago (at least 3, I'd guess), when trying to archive a bunch of stuff via
search and select-all, I noticed it not operating on the theoretical full
search set.

------
Jeff_Brown
We don't need to rely on Google. We can already build our own knowledge
graphs using open source software.

[https://github.com/synchrony/smsn/wiki](https://github.com/synchrony/smsn/wiki)

We can even interconnect them -- selecting what's private, what's public,
what's shared with some people but not others.

------
vog
_> I think Google has stopped indexing the older parts of the Web. I think I
can prove it. Google's competition is doing better._

If this about finding older parts of the web, don't forget to use the _Wayback
Machine_ of archive.org:

[https://archive.org/web/](https://archive.org/web/)

------
irishcoffee
n+1. I've 100 percent noticed this too; it's infuriating when it comes to
looking for emails as they pertain to legal matters.

------
fouc
I think the solution would be some sort of personal proxy that captures all
the searches and webpages we visit. Then when we make new searches, we can
first search locally (to re-find things we're thinking of) and then externally
(to find new things). Not sure if something like this exists yet.
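A bare-bones version of that local-first layer is just an inverted index over captured pages. A sketch under obvious assumptions (toy whitespace tokenization, made-up URLs and page texts), with the external engine only consulted when the local search comes up empty:

```python
from collections import defaultdict

class LocalIndex:
    """Tiny inverted index over pages captured while browsing."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of URLs
        self.pages = {}                   # URL -> raw text

    def capture(self, url: str, text: str) -> None:
        """Record a visited page and index its words."""
        self.pages[url] = text
        for word in text.lower().split():
            self.postings[word].add(url)

    def search(self, query: str) -> set:
        """Return URLs containing every query word (AND semantics)."""
        words = query.lower().split()
        if not words:
            return set()
        results = self.postings[words[0]].copy()
        for word in words[1:]:
            results &= self.postings[word]
        return results

idx = LocalIndex()
idx.capture("example.org/review", "lou reed rock n roll animal review")
idx.capture("example.org/other", "unrelated page about gardening")
print(idx.search("rock roll animal"))  # {'example.org/review'}
```

The proxy behavior would then be: try `idx.search(query)` first, and fall back to an external engine only when the local result set is empty.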

------
kens
If there seems to be a problem with Google not indexing your site, Google's
webmaster tools can help debug what's going on:
[https://www.google.com/webmasters/tools](https://www.google.com/webmasters/tools)

~~~
kuschku
That's if your own site is affected. But what if I want to find content from
another site, and Google doesn't properly index it? I can't exactly use
Google's webmaster tools for that, and the owner may not care, or can't be
contacted.

~~~
Jaruzel
Actually you can:

[https://www.google.com/webmasters/tools/submit-
url](https://www.google.com/webmasters/tools/submit-url)

But you DO need a Google Account.

------
StavrosK
DuckDuckGo seems to have no problem returning the result, even with general
terms:
[https://duckduckgo.com/?q=tim+bray+rock+roll+animal](https://duckduckgo.com/?q=tim+bray+rock+roll+animal)

~~~
on_and_off
Google also returns the article with this query

~~~
bamboozled
That depends, considering search results are usually tailored to suit
individuals (if logged in).

~~~
on_and_off
True, but it could not return this article for anybody if it was not indexed.

Although it is fair to assume that Tim's blog post and all these new links
pointing to the old article have triggered its reindexing.

It feels a bit like quantum physics: now that the article is out, the state of
the web has been observed and has changed.

------
wildpeaks
I wish you could use _both_ Verbatim and Sort by date at the same time.
Instead, you end up having to choose between recent but irrelevant results, or
relevant but outdated results.

------
debt
Wait, so Google isn't surfacing your or your friend's shitty article on some
band you guys like, and that makes it broken? Sounds like Google is working
better than it used to.

------
technofiend
Honestly, it feels like a good time to go back to MH handling my inbox, but
with a nice indexer like Elasticsearch dropped on the front.

~~~
FreedomWarrior
I started using notmuch[1] a couple of years ago and cannot imagine living
without it. It can do free-text search on almost a million emails (and
probably many more) in a fraction of a second. I subscribe to a lot of mailing
lists and add various tags to each, making for some very powerful search
queries.

E.g "from:torvalds and to:linux-ext4" to bring up all emails ever with those
properties. Add some free text and/or "tag:foo" to narrow it down.

[1] [https://notmuchmail.org](https://notmuchmail.org)

------
hguhghuff
I like the idea that the Internet forgets things. Seems healthy. Eternal
memory for some things is disturbing.

------
bombita
[https://www.google.com/search?q="we+were+watching+the+Democr...](https://www.google.com/search?q="we+were+watching+the+Democratic+National+Convention+on+TV")

I can find the article this person is referring to just by searching for a
string from it. So the article _is_ indexed. Not sure what he is referring to.

~~~
Veedrac
That only gives me
[http://inessential.com/2008/11/](http://inessential.com/2008/11/), not
[http://inessential.com/2008/11/04/that_new_sound](http://inessential.com/2008/11/04/that_new_sound).

~~~
mortehu
The second link is a partial duplicate of the first link; it contains nothing
not contained in the first link. Maybe that's why?

------
known
I've built my own search engine that caters to 90% of my needs. It's not that
difficult.

------
lyager
But, by posting this article you will probably have ruined your own evidence.
:-D

"Nice try!" :D

------
butler14
Both sites could use a little SEO. Google being imperfect isn't a new thing.

------
sunstone
This really sucks. Time for an open source search engine a la wikipedia.

------
known
I think Google should have a "Search from Archive" feature.

------
zappo2938
I often limit the search with the time / date filter.

------
faragon
Is Google doing something to fix that?

------
staunch
I recently switched off Google Search and it's totally fine. If it disappeared
I wouldn't be too upset.

------
billysielu
Memory loss is good for those struggling with the right to be forgotten.

------
dingo_bat
I've switched completely to bing for a few reasons:

1) Bing is much faster to load

2) Bing doesn't muck up the links. I can right click->copy and get the actual
link. With google you get a long incantation that doesn't tell you what site
it goes to.

3) Bing handles a few things better, e.g., "$500 in ₹" works in Bing but not
in Google.

However, when I need local results, or results about something ongoing, Google
is the undisputed king. Nobody else even comes close.

~~~
r3bl
When trying to convert currencies, try using three-letter codes for the
currency. To take something a bit more obscure than dollars and rupees as an
example, "500 PLN to RSD" (Polish zloty to Serbian dinar) will definitely work
in every search engine.

As far as I understand it (I never actually looked it up to confirm), currency
codes = two-letter country code + first letter of the currency name, so
they're relatively easy to guess.
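
That rule of thumb can be sketched in a few lines (this is the commenter's heuristic, not the official ISO 4217 assignment, and it has exceptions):

```python
def guess_currency_code(country_code, currency_name):
    """Guess an ISO 4217 code as: ISO 3166 country code + first letter
    of the currency name (the heuristic described above)."""
    return country_code.upper() + currency_name[0].upper()

# Works for many currencies:
guess_currency_code("US", "dollar")  # "USD"
guess_currency_code("IN", "rupee")   # "INR"
guess_currency_code("RS", "dinar")   # "RSD"
# ...but not all: the euro (EUR) and Swiss franc (CHF) don't follow it.
```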

~~~
dingo_bat
I know, INR works instead of ₹. But goddammit I don't wanna type those extra 2
letters!

------
cup-of-tea
Remember when searching used to be a taught skill? People would learn to use
boolean operators to find exactly what they want. Now I'm not sure if any of
that even works with Google. We went from a simple machine with many (mostly
optional) inputs, to a complex but stupid one with one input that actively
fights against your own intelligence.

------
tzakrajs
The style used for the links borders on vulgarity.

~~~
astura
... Why? Looks fine to me (other than that I prefer underlined links)

~~~
tzakrajs
Dark red links do not contrast well against black text. I can still barely
tell what is a link and what is plain text. Maybe I am experiencing some
vision loss or colour blindness.

------
danifeld
Interesting perspective

------
staunch
I've been using Google since 1998. I recently switched off Google Search and
don't miss it at all.

Google is in big trouble.

~~~
pgrote
What did you move to?

------
peterwwillis
It's going to be a little sad when Google Maps goes away. It will be much more
sad when Google Flights goes away. But I can't think of anything else Google
provides that I couldn't get from a bunch of competitors. Well... maybe Google
Translate.

~~~
kuschku
In many places I've found both OSM and Here maps to provide higher-quality
data than Google Maps, actually. So it won't be that bad when (not if,
considering it's Google) they shut down Google Maps. And DeepL is starting to
beat Google Translate on some language pairs, too.

The hardest part to replace today is search.

------
matznerd
Try pinging your site: [https://pingomatic.com/](https://pingomatic.com/)

------
TekMol
Nobody can index the whole web. Even a single site in the form of

    
    
        Homepage of Joe Infinity
        You are on page <?= $pageNr ?>
        <a href="?page=<?= $pageNr + 1 ?>">Next Page</a>
    

cannot be completely indexed. A search engine will crawl it to some depth
based on many factors; age might be one of them. There is no way to index
'everything' on the web.
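
That depth cutoff can be sketched with the infinite pagination above (the names here are illustrative, not any real crawler's API):

```python
def crawl_paginated(start_page, max_depth):
    """Follow "Next Page" links from start_page, but give up after
    max_depth pages -- as any real crawler must on an infinite site."""
    indexed = []
    page = start_page
    while len(indexed) < max_depth:
        indexed.append(page)
        # The "Next Page" link always exists, so the crawler has to
        # impose its own stopping condition.
        page += 1
    return indexed

crawl_paginated(1, 5)  # [1, 2, 3, 4, 5]; page 6 onward is never indexed
```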

~~~
swvjeff
That's really not relevant to this article. The author is not talking about
crawling and indexing the entire web (although he mentions the "whole web"
once, that's clearly not what he means). He is wondering why old pages --
pages that used to be in Google's index -- are no longer showing up in SERPs
even when using appropriately-targeted long-tail queries.

~~~
TekMol
For the same reason 'Joe Infinity page 1234567' would not be found anymore.
Google thinks it's not relevant enough to keep it indexed. Yes, it is
debatable what is relevant enough and what isn't. But everyone who indexes
'the web' has to decide what to keep and what not. Nobody can store
'everything'.

Also, it's not as easy as just keeping everything that was ever in the index.
Then search engines would link to nonexistent URLs _most_ of the time. Most
URLs have a short lifespan. Links rot pretty fast.

~~~
swvjeff
I completely agree with you, but your initial argument was that "Joe Infinity
page ∞" wouldn't be indexed because Google cannot index every viable page on
the internet. That is true, and Google will certainly set limits on what pages
it crawls and what pages it indexes. However, in this instance the articles
_were_ crawled and they _were_ indexed and they _were_ relevant at one point
in time. But Google decided to remove them from SERPs for some reason or
another (age, lack of traffic, etc).

------
jsnell
> I think Google has stopped in­dex­ing the old­er parts of the We­b. I think
> I can prove it. Google’s com­pe­ti­tion is do­ing bet­ter.

The first sentence is just common sense, and no particular proof is needed.
The last sentence might or might not be true, but the anecdotes in this
article say nothing about whether or not it's true. The problem is that we
don't know how Tim selected these two particular pages as examples.

If he randomly selected two 10-year-old pages from the universe of all such
pages, it'd at least be a valid methodology, just with far too small a sample
size. But obviously he didn't do that. If the methodology instead was to
search for pages on Google first, then on Bing iff there was no Google match,
this tells us nothing at all. You need to run all queries on both engines, not
just the ones that fail on one search engine.
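
A sketch of the fair version of that experiment: sample old pages at random and run every query on every engine (the `found_in` predicate is hypothetical; a real measurement would query each engine and check for the page in the results):

```python
import random

def compare_coverage(old_pages, engines, found_in, sample_size=100):
    """For a random sample of old pages, record for *each* engine whether
    it returns the page -- rather than only checking engine B when engine
    A fails. Returns the hit rate per engine."""
    sample = random.sample(old_pages, min(sample_size, len(old_pages)))
    return {engine: sum(found_in(engine, page) for page in sample) / len(sample)
            for engine in engines}

# Toy usage with a fake predicate: engine "A" only indexes even-numbered pages.
rates = compare_coverage(
    old_pages=list(range(100)),
    engines=["A", "B"],
    found_in=lambda e, p: (p % 2 == 0) if e == "A" else True,
)
```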

Another reasonable method would be to look at aggregate referer trends; is
traffic from Google to old pages decreasing faster than traffic from Bing to
those pages.

~~~
lolc
> The first sentence is just common sense, and no particular proof is needed.

How is this common sense?

> The problem is that we don't know how Tim selected these two particular
> pages as examples.

Yes, we do know. He was using Google to find his own old stuff over the years.
Some content he was referring to regularly disappeared from Google's results.
These pages had previously been included in the results.

> Another reasonable method would be to look at aggregate referer trends; is
> traffic from Google to old pages decreasing faster than traffic from Bing to
> those pages.

Yes that would be interesting.

I've been wondering whether Google would actually purge the URL too. The
Googlebot used to be very persistent in retrying "404 not found" results.

~~~
jsnell
> How is this common sense?

Because it's in practice impossible to index every page. Index selection has
always been a core quality feature in search engines (both in terms of which
pages get included at all, and which layer of the index they land in, in
multi-layered index schemes).

> Yes we do know. He was using Google to find his own old stuff over the
> years. Some content he was referring regularly disappeared from Google's
> results. These pages had previously been included in the results.

That's just a guess; it's not actually stated anywhere in Tim's article. But
yes, given he did not say otherwise, what you propose is probably what
happened. He had a couple of pages which he knew were not found on Google, and
checked whether they could be found on Bing.

But my whole point is that this kind of methodology is total garbage. And then
he's making pretty absolute statements, like his tweet about the post "TIL
that both Bing and DuckDuckGo apparently index a lot more of the Web than
Google does".

~~~
epistasis
People used to say that Google would always be the best search index because
it had the biggest index, and nobody could match Google there. Being more
selective about what you include seems like a big change from past practice,
or at least past narrative.

~~~
cromwellian
Yes, but what jsnell is saying is that, if you performed the same experiment
with Bing, perhaps you'd find pages it didn't index but that were present in
other search engines.

You can't say A is better than B with a few data points. You can say you think
B's behavior has changed compared to the past. But that's also erroneous.

It's possible the behavior was always there and you just never tripped over
it, and it's rare enough that most people don't, either because the web was
smaller before, or your own content was smaller, or its recent link and access
patterns changed enough to trigger the behavior.

