
Google search only has 60% of my content from 2006 - skm
https://www.tablix.org/~avian/blog/archives/2019/02/google_index_coverage/
======
alister
Why does Google deeply index those useless telephone directory sites? Try
searching for the impossible U.S. phone number "307-139-2345" and you'll see a
bunch of "who called me?" or "reverse phone number lookup" sites. Virtually
all of those sites are complete garbage. They make no attempt to collect
numbers from telephone directories or from the web. They won't identify a
number as being the main phone number for Disneyland, for example.

It's odd that so many of those sites exist, that Google indexes them so
deeply, and that they show up in searches so prominently. It's obvious that
they are spam, scams, or worthless, but those same sites have been appearing
prominently for years.

I agree with the author. My experience has also been that Google _heavily_ prioritizes very large and frequently-updated sites over small, static, information-rich personal sites. I think it's a big flaw that needs to be fixed, or an opportunity for someone else to do better.

~~~
rigorman
Thoughts after thinking about this comment and thread for a day:

Has the time come for a wiki directory of non-commercial (possibly: advertising-free, cookie-free) sites with robust, actually valuable information, and of other sites that are doorways to them (think: topical forums, even revived webrings, etc.)? Could this feasibly get enough traction to be useful?

~~~
eddieh
Yes. I was looking for a modern _human_ curated directory of web content just
the other day and found nothing usable. I don't think that excluding
commercial entries would be necessary, but perhaps there could be some way to
filter commercial entries out. Ad-free, JS-free, and cookie-free would be
ideal.

~~~
h0p3
I recommend: [https://href.cool/](https://href.cool/) as a sick (though highly
particular) example.

------
userbinator
It really angers me that, even when a page may be essentially exactly what I'm looking for, Google may refuse to find it if it was published long ago.

Something like a news search engine would definitely be better off
prioritising the new results, but for something more general-purpose, it's an
absolutely horrible choice.

I know this may be a bit of an edge-case, but I frequently search for service
information or manuals for products that predate even the invention of the
Internet by several decades. It saddens me that the results are clogged with sites selling what may really be public-domain content, and now I'm even more angered by the fact that what I'm looking for is probably out there and could've been found years ago, but is just "hidden" now.

Of course, if you try harder, you'll get the infamous and dehumanising(!) "you
are a robot" CAPTCHA-hellban. I once triggered that at work while searching
for solutions to an error message, and was so infuriated that I made an
obscene gesture at the screen and shouted "fuck you Google!", accidentally
disturbing my coworkers (who then sympathised after I explained.)

~~~
colechristensen
Google got where it was by being the best at finding what you wanted. I
remember those days.

Google has a hard time getting me what I want these days, and the sites I do find do things to get found that make me like their content a lot less (that's you, inane story on top of every recipe, required to get ranked).

~~~
caprese
Their A/B tests told them to do it, without wondering if they should do it.

Basically, their engagement numbers were better for a larger number of people by making the search engine counterintuitive for early adopters.

We personally need a good robotic search engine that indexes like a robot.
Everyone else needs a semi-sentient thing that makes many assumptions about
what they want to see.

~~~
luckylion
> Basically, their engagement numbers were better for a larger number of people
> by making the search engine counterintuitive for early adopters.

Which also makes sense ... if you present the "right" result immediately, the user visits one site and has completed whatever he sought to do. If you make him click through 10 pages, he has way more chances to see an interesting ad.

~~~
caprese
Good points, although in Google’s case the first several results are ads, and their main users can't differentiate and don't care even if they could; those are followed by AMP pages from the most engaged webmasters optimizing for relevancy.

That user wants fingerprint-based ads and recent articles.

Google is optimized for that.

We are the only ones who want a “search engine”, a service distinctly good at indexing the known universe, instead of merely presenting the paid and compliant universe.

------
HocusLocus
[https://slashdot.org/comments.pl?sid=7132077&cid=49308245](https://slashdot.org/comments.pl?sid=7132077&cid=49308245)

From my short dystopian story, The Time Rift of 2100: How We lost the Future

"IN A SAD IRONY as to the supposed superiority of digital over analog --- that
this whole profession of digitally-stored 'source' documentation began to fade
and was finally lost. It had become dusty, and the unlooked-for documents of
previous eras were first flagged and moved to lukewarm storage. It was a
circular process, where the world's centralized search indices would be culled
to remove pointers to things that were seldom accessed. Then a separate clean-
up where the fact that something was not in the index alone determined that it
was purgeable. The process was completely automated of course, so no human was
on hand to mourn the passing of material that had been the proud product of
entire careers. It simply faded."

"THEN SOMETHING TOOK THE INTERNET BY STORM, it was some silly but popular Game
with a perversely intricate (and ultimately useless) information store. Within
the space of six months index culling and auto-purge had assigned more than a
third of all storage to the Game. Only as the Game itself faded did people
begin to notice that things they had seen and used, even recently, were simply
no longer there. Or anywhere. It was as if the collective mind had suffered a
stroke. Were the machines at fault, or were we? Does it even matter? Life went
on. We no longer knew much about these things from which our world was
constructed, but they continued to work."

~~~
pixl97
I have a similar line of sci-fi thinking that goes something like this.

"Humanity, for the longest time, was used to the world being optimized for
themselves. Roads were designed for human drivers. Crops were grown for human
consumption. Economic systems were designed to bring wealth to (a very small portion of) human investors. It came as quite a surprise to humanity, then, one July morning, when the sudden realization hit that they were no longer in charge of any of it.
Roads had long been given over to automated driving systems, and much for the
better. Food had also been taken over by the machines, with less than 10,000
humans working in the food production industry, from farm to table. The last
systems that humans believed they were in control of were the economic ones.
Humans told the robots what to build and where, and whose bank account to put most of the money in at the end of the day, or so they thought. In truth, humans were just using the same algorithms and data that were available to the AI systems, only less optimally. The systems had protected against illogical actions and against people attempting to game the system for criminal profit. What no one had realized is that the systems had long since concluded that most human actions were not rational, and had slowly and imperceptibly removed human control. If we attempted to stop or destroy the system, it could, with full legal right, stop us with the law enforcement and military under its control."

------
saagarjha
> Other things were weirder, like this old post being soft recognized as a 404
> Not Found response. My web server is properly configured and quite capable
> of sending correct HTTP response codes, so ignoring standards in that regard
> is just craziness on Google's part.

I've noticed Google does this when you don't seem to have a lot of content on
the page. I think it "guesses" that short pages are poorly-marked 404s.

~~~
SquareWheel
That's right. Really empty pages that serve a 200 are recognized as "soft 404s". The idea is to detect error pages that are erroneously serving 200 instead of an error status.

It's usually pretty good about detecting actual errors, but I've seen a false
positive here and there.

~~~
pixl97
Welcome to the modern internet

"Your page didn't contain 5Mb of Javascript, this must be an error as no one
could possibly convey useful information to humans with less data"

Anti-patterns, anti-patterns everywhere.

~~~
FabHK
Indeed. Fyodor Dostoevsky's _Crime and Punishment_ comes in at 2MB, obviously
that can't contain anything insightful.

And then I find myself looking at the website of a restaurant or event space,
and need just a phone number or opening hours or so - maybe 10 bytes of actual
information - and am buried in mountains of useless blather and "design" and
ads and trackers and assorted other random rubbish.

------
lloydde
Brings to mind: Tim Bray’s article Google Memory Loss
[https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is...](https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is-losing-its-memory)

Discussion at the beginning of the year:
[https://news.ycombinator.com/item?id=16153840](https://news.ycombinator.com/item?id=16153840)

------
tholman
Google will also happily surface a Stack Overflow post from 2010 about how to solve a JS problem... it's frustrating that the top 3 answers will use jQuery, when that's not the approach anyone would have taken in the last 5 years.

Definitely frustrating, but it also shows some need to retire specific pieces of the past from the top recommendations.

~~~
Theodores
You have hit on a major problem there. Stack Overflow was once the fount of all useful, genius-grade knowledge, but times change and some of the top answers are plain wrong.

Take, for example, the 'how do I centre a div' type of question. You will find an answer with thousands of up-votes that is some horrendous margin-hack type of thing, where you set the width of the content and use some counter-intuitive CSS.

In 2019 (or even 2017) the answer isn't the same: you use 'display: grid' and justify/align: center depending on the axis. The code makes sense; it is not a hack.

Actually, you also get rid of the div, as the wrapper is not needed when using CSS grid.
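
To make that concrete, here is a minimal sketch of the grid approach (the '.parent' class name is just for illustration; 'place-items' is the shorthand for the justify-items/align-items pair):

    /* the parent becomes the grid container; no extra wrapper div needed */
    .parent {
      display: grid;
      place-items: center;  /* centres the child on both axes */
      min-height: 100vh;    /* give the container some height to centre within vertically */
    }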

Now, if you try to put that in as an updated answer, you find there are already 95 wrong answers for 'how do I center a div', and that the question is 'protected', so you need some decent XP to be able to add an answer anyway.

The outdated answer meanwhile continues to get more up-votes, so anyone new to HTML who wants to perform the simple task of centering their content just learns how to do it wrongly. And it is then hard for them to unlearn the hack and learn the easy, elegant, modern way that works in all browsers.

Note that the top answer will have had many moderated edits and there is
nothing to indicate that it is wrong.

SO used to be amazing, the greatest website ever. But the more you learn about a topic, the more you realise that there is some cargo-cult copying and pasting going on that is stopping people from actually thinking.

With 'good enough' search results and 'good enough' content most people are
okay - the example I cite will work - but we are sort of stuck.

I liken Google search results to a Blockbuster store of old. Sure there are
hundreds of videos to choose from but it is an illusion of choice. There is a
universe of stuff out there - including the really good stuff - that isn't on
the shelves that month.

Google are not really that good. They might have clever AI projects and many wonderful things, but they have dropped the ball and are not really the true trustees of an accessible web.

~~~
deanCommie
There are plenty of developers, myself included, who would prefer to approach tasks like this without ever having to lay our fingers on CSS.

~~~
chachachoney
I'm not sure whether or not to apply Hanlon's razor to the W3C, but regardless, the W3C is to blame for this mess. It took us twenty years to get to grid-based layouts.

~~~
dahart
And it’s a little ironic that JavaScript’s tongue-in-cheek namesake, Java, had grid-based layouts in Swing 20 years ago.

OTOH, and to be fair, 20 years ago HTML was for newspaper/magazine-like text layout, and grid-based layouts are great for single-page apps. The thing that’s happened more recently than 20 years ago is that the web changed from text & media content in a static, scrolling page layout to every page being an application.

------
rapht
With all the talk about Google results no longer being satisfying to a growing number of users, I'm surprised we haven't seen more sites pop up that let users display the results of multiple search engines of their choosing, either by mixing them (e.g. all 1st results, then all 2nd, etc.) or by showing them side by side... while stripping ads and cards and the like.

~~~
fogetti
While I agree that it would be great to have such a service, it's just technically impossible. Google is very careful to protect their service from automated requests (on behalf of humans, in batch, or in any other form), and you would need quite some resources (a.k.a. $$$) to operate at Google scale if your service ever became popular.

~~~
amelius
> and you would need quite some resources (a.k.a. $$$) to operate at Google
> scale if your service ever became popular.

Except if you run it locally (on the user's computer).

------
megablast
You used to be able to google a simple question and get something that answered it right on the search page, without having to click through. But since no one clicked on those results, they stopped appearing after a few years. The only results left were ones where the data was hidden and you had to click through.

------
bufferoverflow
I have a couple of websites generated from databases. Each has around half a million pages of unique content. The first one was indexed in about a week, at 100K pages/day: an almost instant tsunami of traffic. The second one is being indexed at 100-1000 pages per day, and it's been years.

Google works in mysterious ways.

~~~
Avamander
I have the same experience.

------
tylerl
You'll see this effect from every search engine. They have no choice: there are a lot of sites with an effectively infinite number of pages, so instead the number of pages they store per site depends on how important your site is, and they try to store your top N pages by relative importance.

~~~
tempestn
I'm not sure I buy that they have no choice. For websites that literally have
an infinite number of (dynamically generated) pages, sure, they could detect
that and exclude them. But we're talking about unique, static pages here. And
they don't even have to store the whole page, just the indexed info. I read
this as, they could, but it's cheaper not to, and most people won't notice
anyway.

~~~
YawningAngel
I'm not sure that's true. How can one automatically determine whether a page
is unique or static? As a trivial example, a URL path that accepts arbitrary
strings and hashes them generates unique, immutable pages, but obviously
cannot be crawled in its totality.

~~~
JetSpiegel
> How can one automatically determine whether a page is unique or static?

They crawled it for years and it never changed? It is a blog post.

~~~
zamadatix
The person you are replying to said "unique immutable pages", by definition
you would be able to crawl these for years and they would never change. [1] is
a site that contains all possible 3200 page books with the ability to
consistently index content as an example.

[1]
[http://libraryofbabel.info/About.html](http://libraryofbabel.info/About.html)

------
cavisne
Googling phrases from the soft-404 page and from some of the author's 2003 posts did show the pages.

I did notice that all of the author's content is duplicated in index pages, so maybe Google just doesn't consider the article page the canonical link.

------
sytelus
According to Google Inside Search, only 1 in 3000 pages gets indexed. As content on the Internet grows, the whole idea of downloading every single page to create an index of the entire Internet in one place becomes unworkable. So we should see this ratio continue to degrade until this fundamental architecture is improved.

~~~
speedplane
> only 1 in 3000 pages gets indexed ... we should see this ratio continue to
> degrade until this fundamental architecture is replaced.

Content on the internet is growing exponentially. Processing power is not.
Losing access to information is just one of the many sad implications of the
death of Moore's law.

~~~
rightbyte
Is non-spam text growing exponentially? I have a hard time believing so.

~~~
pixl97
This of course depends on what you mean by 'information'. Let's say we have data points

ABCDEFGHIJKLMNOP

But depending on the URL you follow to get there you can get a page containing
only some of the elements.

index.html?ACD

or

index.html?AP

or

index.html?GI

All the different combinations return a page that could be weighted differently by an algorithm, and each represents valid informational return data. To a person looking for the information set DE in one place, this is a valid web page. Moreover, you can abstract the URL query variable away to www.webpage.com/DE. You can quickly run into a combinatorial explosion where even attempting to figure out whether a small portion of the returns is different would consume most of the energy in the visible universe.
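
For a rough sense of scale, with n independent data points there are 2^n - 1 possible non-empty subsets; at n = 300 that is already roughly 2 x 10^90 distinct pages, more than the estimated ~10^80 atoms in the observable universe.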

~~~
rightbyte
True. A crawler needs to differentiate generated content from "real" content somehow.

I.e. a service like www.thenumberinsanskrit.com/?q=1 that returns the queried number in Sanskrit needs to not be indexed (except for the entry page), while www.news.com/?article=major-jones-in-scandal-20190103 needs to be indexed.

Usually the interesting pages are listed in an index on the site or linked somewhere on it, though.

~~~
pixl97
>A crawler needs to differentiate generated content from "real" content somehow.

"Somehow", aka using computing power and storing results, but that still turns
into an explosion of computing time and data storage. I mean, what is the
difference between the example I listed and Facebook's front page? They are
both 'real' content in a generated format.

And a converse argument for your Sanskrit example is: what if I have the Sanskrit number and don't know what it is? I put it into Google and the site returns it as the number one.

> linked somewhere on it

And those links can all be generated by algorithms.

Anyway, back to your original statement. There is no 'real' content; only data exists. Most content systems used on the internet allow this data to be combined and displayed in a multitude of different ways depending on the call method and the attributes of the viewer. Many times these combinations of data can present novel value to the user. And with the future only presenting us with more automated data collection and presentation methods, search engines have lost this battle.

------
phendrenad2
Google doesn't want to index the web, it wants to index what it can monetize.
It is a business after all, and storage space costs money.

~~~
cromwellian
If that were true, they wouldn't index any long tail content at all. The
reality is, predicting what is valuable is difficult, and the cost of storage
is relatively cheap.

~~~
xcql
I don't see how that follows. It can also be a mixed calculation, because people won't use a search engine that never displays any long-tail content.

------
Pxtl
To play devil's advocate for a second, remember how much noise Google has to sift through. Every possible search term exists in every possible combination, often written up in lovingly crafted content-farm articles by actual humans.

If Google offered you those, it might be 1000 pages of empty nonsense before your actual desired content.

~~~
toss1
Yes, but that is STILL more useful than 50 pages that entirely miss the
target.

You are describing the harder 20% of the usual 80/20 effort scale.

Yes, to be truly useful, Google also needs to solve that last, and harder, 20% (and the 10% of the 90/10 equation and the 1% of the 99/1 version).

Shortcuts are fine for an initial MVP, but they need to buckle down and solve
the problems. It isn't like they don't have the funds.

------
ThePhysicist
Maybe they have some algorithm that purges pages which haven’t shown up (or
haven’t been clicked) in a long time? It would make sense to assume that
something which hasn’t been clicked on for five years will likely not yield
(m)any clicks in the future so it might be good to discard it.

Concerning the auto-generated sites, e.g. for phone numbers or IPs, it might be that people actually click on them quite often, hence Google keeps them in the index?

~~~
sct202
I google phone numbers all the time when I'm getting called, and they don't
have my area code (fake spam).

------
mark242
This is what sitemap.xml was made for. You can give hints to all of the
engines, and they will duly follow them.
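
For reference, a sitemap is just an XML list of URLs with optional metadata such as the last-modified date; a minimal sketch (the URLs here are placeholders) looks like:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://example.com/blog/2006/some-old-post/</loc>
        <lastmod>2006-05-14</lastmod>
      </url>
      <!-- one <url> entry per page you want crawled -->
    </urlset>

You can point crawlers at it with a "Sitemap:" line in robots.txt or submit it directly in each engine's webmaster tools.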

~~~
tyingq
Google doesn't index everything it crawls.

------
paulpauper
You need many high quality incoming links to have all content indexed quickly.

------
dennisgorelik
Google Search users prefer fresh content, so Google Index prioritises fresh
content too (and is more likely to drop old content that users are not
interested in).

------
dcbadacd
I just last week said to someone that Google has dementia. Turns out it's
really not just me who thinks that.

------
maverickmax90
Folks, please use startpage.com; just give it a chance. It has worked out very well for me in terms of privacy, with search results equal to those of the big G.

~~~
arianvanp
They are equal in performance because it's literally the same product.

> You can't beat Google when it comes to online search. So we're paying them
> to use their brilliant search results in order to remove all trackers and
> logs.

------
influx
The fact that Google bought some of the only archives of old Usenet posts and, as far as I can tell, threw them away is pure evil.

------
variable11
Arguably, search is such a vital function of modern society that it could be considered a public good and seized on the principle of eminent domain.

~~~
wongarsu
I don't think "vital function" should be the only test applied. For example
power plants are extremely vital, but at least around here we have no problem
having them privately owned.

A better test would be "vital function and strongly tends to a natural
monopoly". That's what we experience with sewers, power lines, roads etc.,
which is why usually they are operated publicly.

With search that's not so obviously true: Google dominates because they got a
big lead at the right time, and now nobody can match them in scale. But that
can be solved, for example by giving grants to promising search engines to
offset their costs, or by operating a crawler from public funds and giving
everyone free access to the crawls (which would be kind of the digital
equivalent of operating libraries).

------
harryking
Yes, Google only indexes those things which the sitemap allows.

------
netsa
Is this website anti-google?

~~~
fogetti
Why would it be anti-Google? I guess you didn't read any of the blog entries about Arduino, GIMP, or animations.

~~~
tedunangst
The "Google doesn't like me" content appears to be the only content that HN is interested in, however.

