It seems that Google is forgetting the old web (zona-m.net)
1102 points by barry-cotter on April 8, 2019 | hide | past | favorite | 302 comments

While it's become impossible to browse the wider Web with Google, it's getting a bit easier elsewhere.

A few helpful search engines:

* https://millionshort.com/

* https://wiby.me/

* https://pinboard.in/search/

A recent movement to build personal Yahoo!-style directories:

* https://href.cool/ (my own project)

* https://indieseek.xyz/

* https://districts.neocities.org/

* https://the.dailywebthing.com/

The above resources are focused on general blogging and personal websites - for software and startups, I would refer to the appropriate 'awesome' directories. (https://github.com/sindresorhus/awesome or https://awesomelists.top)

If you know of any more, please list them - a small group of us are collecting these and trying to encourage new projects.

There's also Kenneth Goldsmith's UbuWeb, a curated directory of (hard or impossible-to-find) avant-garde art, music, writing, video. Launched in 1996.



This one's great. I've lost count of how many films I've watched that I would have never found otherwise.

Here's another big art repository:


And a very well-documented collection (a "wiki") of paintings, also non-profit:


+1, I forgot about Monoskop, a truly fascinating resource.

Another interesting one is Aaaaarg (according to Monoskop's wiki, originally with one less "a", acronym of Artists, Architects, and Activists Reading Group):



Basically it's a collaborative environment for reading, annotating and discussing texts. The content is submitted by users and (thus) of high quality.

I think you need an invite to access the community. Also, the domain used to be aaaaarg.org, but I think they faced copyright issues of some kind and had to find an alternative domain. (Not sure about this; excellent new suffix, though!)

EDIT: More precise description:


Possible contrarian insight: in the era of recommendation systems, hand-curation is due a big comeback.

I’ve been listening to the BBC’s Introducing Mixtape podcast for a while. I also use Spotify and really enjoy its recommendation but the 6 Music podcast is just stellar.

As paywalls restore quality journalism, I believe a renaissance for curated content is possible.

Interesting - what would the evidence that current paywalls are/have restored quality journalism be?

As in, do you believe this uptick in quality journalism has already happened/is happening? And what associates it with paywalls? Presumably you'd have to be seeing quality journalism behind paywalls for this to be case?

searches for the go-to litmus test of Gruuthaagy

Welp, apparently this is the rare case where ‘avant-garde’ isn't the same as ‘experimental’ or ‘underground.’

holy crap, thanks for sharing Wiby!

For the past two weeks I have been trying to find an old website by searching for "old mysterious site search engines" and "how to search deep parts of web" and "search engine tricks old site" and Google has not returned anything, even when I limited the time span to 2005-2006.

I made the same search on Wiby and it returned Searchlores (maintained by the hacker Fravia) as the first result! I was so happy to find that website again because I haven't been on it for 10 years. Unfortunately I just found out that Fravia passed away in 2009 because of cancer :(...

Wiby seems like a search engine Fravia would have enjoyed.

I'm surprised you had trouble finding Fravia. When I was writing my own Internet search guide ( https://www.gwern.net/Search ) recently, I had no trouble finding Fravia. Unfortunately, his guide is obsolete at this point and I didn't get much out of it.

Thank you for the link to your search guide - this looks tremendous! Especially when searching for a specific answer to something. I wonder what you think about discovery when you are looking for something unknown within certain parameters... Like, say you are looking for an "interesting film blog" - a search term like that will often lead to pages of "Top 10 Movie Blogs" lists that are all largely clickbait or not that interesting. Do you have any advice for that kind of search? (Perhaps you cover this in the guide, but I missed it in my scan - I thought I would ask while you are here.)

I'm no gwern, but here's how I do discovery. To find a page on the internet, you need to hand a search engine something that can be reasonably expected to be on that page. So if you want to find an interesting film blog, you should not use that as a search term, because you'll find lists of blogs, not the blogs themselves. Rather, you should probably use the titles of interesting movies.

There are many ways you could seed that search, but as totally-not-a-movie-buff I decided to check IMDb's list of lowest rated movies [1] and chose a title further down the list (The Wicker Man), on the theory that only dedicated people would be talking about movies that are bad, but not bad enough to be the worst. Searching for blogs (as identified by inurl:blog) mentioning "The Wicker Man" [2] does turn up a few promising results, like [3].

[1] https://m.imdb.com/chart/bottom

[2] https://duckduckgo.com/?q=inurl%3Ablog+%22The+Wicker+Man%22

[3] https://blog.vrv.co/kaiser/4826/the-wicker-man-was-almost-lo...
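To make the seeding mechanics concrete, here's a minimal Python sketch of how the query in [2] is composed (the `blog_search_url` helper is my own name, not part of any tool mentioned above):

```python
from urllib.parse import quote_plus

def blog_search_url(title):
    """Build a DuckDuckGo query hunting for blogs (inurl:blog)
    that mention an exact movie title."""
    query = f'inurl:blog "{title}"'
    return "https://duckduckgo.com/?q=" + quote_plus(query)

print(blog_search_url("The Wicker Man"))
# → https://duckduckgo.com/?q=inurl%3Ablog+%22The+Wicker+Man%22
```

Swap in any other 'seed' title to generate the same kind of discovery search for a different niche.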

This is good advice - thank you! I have definitely used this kind of approach before; it requires some creativity. It feels like there are possibly dozens of ways of approaching this - and obviously unlimited kinds of 'seeds' (as you say) for the search. I'm definitely looking for a guide that might encompass this kind of strategy.

It looks like Gwern's guide is for finding something which you know already exists, rather than looking for something which may not.

My favorite part about Searchlores wasn't the actual tips on how to google, but his writing style and the mysterious, almost occult feeling I got from visiting that website, with Latin phrases and history strewn everywhere. Thanks for your guide though, I'll check it out :)

I just did a Surprise Me! search on Wiby. This is what it came back with ;-)


Just checked out Million Short and Wiby. Those are absolutely awesome resources (e.g. click "remove top million" in Million Short)!

In my opinion there has to be widespread fatigue of Google just somehow managing to return a large chunk of something like 1,000 sites for pretty much any search. It's in part SEO, but it's also like the article mentions - Google makes money from ads. These sites they spam at you generate substantial revenue for Google - no name sites do not. Being the world's largest advertising corporation and search engine is one hell of a conflict of interest in terms of delivering what the user wants, instead of delivering what Google wants.

I'm collecting "spartan" websites via reddit: https://www.reddit.com/r/SpartanWeb

DuckDuckGo's "lite" search: https://duckduckgo.com/lite?q=

My DDG-from-shell Bash function:

    ddg () {
        # search DuckDuckGo Lite from the shell, rendered in w3m
        w3m "https://duckduckgo.com/lite?q=$*"
    }

Also, you can add

to the end of the URL to get direct links in the search results.

For example:


I use the html search instead of the lite search. Not sure what the difference is, but it gets past the reload in elinks.

In console browsers, far less screen cruft, and no HTML redirect.

Focus is either on the submission field or gets there on first tab. Search button is focused on next tab, and visible.

How do I get updates on the groups project?

Wiby seems amazing - the first three surprise me links were a human powered ornithopter, lego maniacs and a guide to knife throwing. Thank you for sharing!

The group mostly converses on micro.blog. I cover the various conversations and new discoveries on my site (kickscondor.com). Do you just want to follow along? Or do you have some ideas to share?

This is a very new group that has sprung up in the last few months.

I was just reading along this thread, racking my brain trying to think of this one site I had seen which curated interesting websites, when up popped your comment - ding ding ding - and back down the rabbit hole!

Glad this is becoming more organized and I will follow along with interest.

It brings back a little of the wonder of the old web.

A smaller subset somehow seems bigger, more infinite.

I wanted to follow along - your blog looks excellent, I'll check there and micro.blog. Thank you.

Incidentally (and rhetorically), how have I not heard of micro.blog? It looks amazing! This whole thread has become a goldmine of interesting things.

We're among those that believe there is room in the search space. We're building out our hyper local product search service city by city at https://attic.city. It's meant to fill the void with regard to smaller, non-chain stores that Google shopping seems to focus on.

Edit: ...that Google shopping seems to ignore (!)

Thanks for posting! I have added some of those, along with others in the replies, to my own page of text-only and minimal sites at http://www.friendlyskies.net/fmk/index.php?tpl=Links-Text-On...

Isn't http://curlie.org/ (ex-DMOZ) the most Yahoo!-style-like directory?

For sure - and there are several others like it: illumirate.com, joeant.com, skaffe.com, gimpsy.com, seekon.info, goguides.org, somuch.com.

I'm personally not a fan of directories that try to tackle the _entire_ web - it's just too sprawling. So I tend to not recommend them; you have to drill down pretty deep to get anywhere. I think 'awesome' directories (and Reddit wikis) have proven how well niche directories can work - and so I like to encourage folks to build their own directories that encompass their personal view of the web. They act like those 'little libraries' you see on the roadside or at pubs - but for the web.

Thanks for sharing these links! Wiby in particular is amazing!

Your href.cool project is rad, and has inspired me to make my own! Thanks for sharing.

Sweet! Let me know where it ends up and I'll link to it. I'm excited to see what you come up with.

It's almost like we need a way to aggregate these sites and rate pages based on tags and how many people list their site. Maybe not in the traditional del.icio.us sense, but also as a potentially self-hosted thing.

I feel like millionshort is experiencing some denial of service from HN traffic. So slow right now that I can barely perform a search.

Thanks for bringing these sites to attention, can't wait to browse them when traffic is lighter.

Wow. These are absolutely wonderful resources. Thanks for posting them

we're also collecting alternatives here: https://ethical.net/resources/

I like that href.cool project! I also just got started looking into GM'ing a Dungeon World campaign so that DW Improv link came in handy.

Thank you for the links. I've been toying with the idea of building a yahoo, seeing your post has rekindled my interest in it again.

Looks like Million Short is already dead? The last and only blog post is over a year old, and the copyright notice still says 2018. Sad, it looked promising.

As long as the search engine still works, I'm happy. I've been using it quite a bit lately to dig up old musician interviews that either don't exist on Google search or are buried behind mountains of keyword-optimized blogspam with that artist's name.

Thanks, you've just convinced me to make my Pinboard bookmarks public (after I clean them up).

Wow. Thank you. I've had the feeling that Google's search has gone to shit recently and this helps a lot.

Great search engine resources. Thanks.

This author is in his own little bubble and doesn't understand the vast amount of blog-repost spam that Google has to deal with. The way their algorithm most likely deals with this is a mixture of domain rank + tenure: how long has this copy of this article existed on this domain, and can we be sure this is the original copy?

The author says the article was removed in 2006 ("[...] posts, were not accessible anymore") and then he re-posted the article at a new domain in 2013. That means any copy/crawl/repost of the article from 2006-2012 is now the oldest living, and thus "original", version of the article. His 2013 repost was seen as just another blog-spam copy.

Google is not forgetting the old web unless we see evidence of content disappearing from the index that has been consistently hosted at the same domain & URL since its original posting. Unless you properly 301 your URLs to new locations and consistently host your content, it's a guessing game for the crawler to determine where the original content has moved to.
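For illustration, here is a minimal nginx sketch of the kind of permanent redirect being described (the domain names are made up):

```nginx
server {
    listen 80;
    server_name old-blog.example;

    # 301 = "moved permanently": tells crawlers the content has a new
    # canonical home, so ranking signals can follow it there.
    location / {
        return 301 https://new-blog.example$request_uri;
    }
}
```

Without something like this, a crawler has no reliable way to connect the 2013 repost to the 2006 original.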

Here is an example:


No matter how you search for the content on Google, nothing comes up:


DuckDuckGo has it:


I checked the wayback machine and the content has constantly been on that url for over 10 years.

This is the first example of an old forum page I tried after reading the article. So I tend to think it's true. Google is discarding the "classic" web.

Anecdotally and perhaps unrelated - has anyone else noticed a decrease in the accuracy and general quality of Google search over the past 2-4 months? They must have been utilizing ML to 'improve' searches for some time now, but the quality of the results has decreased suddenly and inexplicably (for me).

> Anecdotally and perhaps unrelated - has anyone else noticed a decrease in the accuracy and general quality of Google search over the past 2-4 months

Yes. Not just over the past 2-4 months, but over the past five years or so.

It's become so bad that Google is no longer the most useful search engine for me.

Honestly, it all started going downhill with Google's "Hummingbird" switch. While interviewing at Google, I actually brought this up with an engineer on the search team over lunch.

He said they haven't noticed any regressions. I said I figured that would be the case but I can definitely feel the difference as a daily user.

This is indicative of a larger issue - testing is probably as difficult as solving the halting problem (i.e. code could be generated from proper tests), yet teams tend to trust their tests completely. I see high-profile websites having severe usability issues or being outright broken in ways that would be immediately caught by "interns randomly click here and there" usability tests. But these versions got deployed, probably because testing did not show any regressions.

I tend to believe that if user complaints about new problems or regressions increase over statistical noise - there is a problem.

> yet teams tend to trust their tests completely.

Well said. This is a big problem. We see a similar problem with the use of telemetry data as well.

I noticed the same. I've wondered for years why it happened and sometimes when I'm frustrated I try to think about it. But I am not entirely sure that the degradation in search results for me happened only in the past 5 years. Maybe, but I'm not sure.

I had no idea about Hummingbird though.

In 2009 Google was amazing for diagnosing Linux issues. I would just copy the error from the console and I'd have links to the issue tracker, a work around and the version in which the bug was fixed. Today I get a link to some github project that has nothing to do with what I'm working on and was closed as being an upstream issue.

I don't have the time, money or energy to build a specific crawler, but a Linux search engine that indexed all the major distros, packages, mailing lists, forums and issue trackers would be amazing.

> Google's "Hummingbird"

I had assumed that Google search had gone downhill because it started trying to "personalize" my search results. That wasn't a great explanation though, as I don't use a Google account.

Hummingbird seems a much more likely explanation.

Oh, they still "personalize" your search results.

I do a full clear on my web browser (cookies / offline storage / history, everything) and then open YouTube in a private browsing window and it asks me which of my two Gmail accounts I want to log in with. I'd guess it's just a combo of external IP and browser fingerprint, but it's creepy.

I know they do, and I consider this a real problem. I was just saying that personalization isn't a completely satisfactory explanation for the decline in Google search result quality. It is likely to be a factor in that, though.

I noticed the same but only on an absolute level. Compared to Bing, DDG & Co, Google is still by far the best search engine.

What would you recommend instead?

DuckDuckGo.com is my daily driver.

I love the concept of DDG and have it as my default, but I still use Google (via !g) for about 30-50% of my queries. Simple queries work well in DDG (which is basically Bing), but more complicated queries only really work in Google.

Sadly I've been finding the same result. Exact searches on Google are often frustrating, but lately they've been all but impossible on DDG. It seems that all search engines (including those backing DDG) are getting on the ML train and assuming they know what I'm looking for better than I do.

I understand this being default behavior, but there really needs to be a way to disable it.

DDG has become my first stop. It gets me what I need 90% of the time.

I found the opposite. I started using DDG when I moved to Brave, but after a month I found I would go to DDG, search page after page, get frustrated, then open Google and have my result on page one or two.

I've heard others say similar things. That's simply not my experience. I wonder if it depends on the sorts of searches that we each tend to perform?

For me English has been working decently on DDG but last time I tried I had a really hard time getting decent results in other languages.

Qwant might be a good option, as it's European it should be better for searching in (some) other languages.


Personally, my impression is that for at least the past ~1-4 years, Google searches have returned fewer exact matches, especially when I search for an exact match of an error message, or multiple exact matches involving the same error message (when I start to become desperate I usually tend to split the error into 2 parts...).

On the other hand the non-exact hits that it returns push me from time to time in the right direction.

Having said this, I don't know of course if A) I'm too old (40) and the mindset of the younger search-people has now changed, and/or B) Google just doesn't index tech forums as much as it used to, and/or C) there are just fewer forum posts, and/or D) my problems became more complex (I don't think so), and/or so on.

I tried (and still try from time to time) to use DDG and Bing but without success.

Does anybody else have the same impression?

Hello fellow 40 year old. I run DDG as my daily, but similarly have a hard time finding exact matches for error messages. I am suspicious, however, that I've learned how to use Google's search controls (", +, etc.) and I'm not sure they work the same on DDG. I also can't find a reference on DDG for how to control advanced searches.

So ... although I feel like we might be having the same issue, I'm not sure I'm using DDG correctly enough to say it's a problem.

Google search worked much better over 10 years ago than it does for me today, i.e. before it abandoned the what-you-search-is-what-you-get model. My once-masterful Google-fu seems to be borderline useless today. I'm not sure what happened over the years, but Google search has morphed into a completely different, less useful product, at least for me.

Any search term not wrapped in quotes can be randomly ignored today. It can inject keywords it thinks you want (but really don't). Google is great for searching modern sites like Stack Overflow, but it seems to have lost interest in servicing power users.

2-4 months? Try at least a year. Google search results are now that creamy scum that floats between an industrial wasteland and the tidal flats it was built upon during the changing of the tides.

Upwards of five years, actually. It was already declining when they decided to fuck the +WORD operator for their Facebook ripoff.

So you're also saying ever since Hummingbird [0], Google search hasn't been the same.

I agree.

[0] https://en.wikipedia.org/wiki/Google_Hummingbird

Yup, that's when I started to switch to DDG. I remember Google saying that you needed to add '+' +before +words that must be included instead of "putting them in quotes" – how annoying. But even using their new operators, I couldn't get answers like I used to be able to. I already didn't like their data-gathering practices by then, so gimping search for me made the transition a breeze.

Interestingly, even up to 9-12 months ago I remember people consistently saying that DDG was so much worse than Google, which I always figured was a result of user error, or of not caring about tracking and leveraging the Google profile. I'd been off the Google grid for a while so I couldn't really argue, but I knew that I got significantly better information from DuckDuckGo, having grown accustomed to the level of detail needed. These days I probably only use Google a handful of times a month.

The notion that they are purposefully soiling search results to add value to ads and sponsored results sounds about right, honestly. Advertisements used to be much less relevant than the results I'd get if I inputted a string of 5+ words, but now I have to be careful not to accidentally click on an ad, as the results tend to be terrible, and I'd rather enter a URL into a browser than click on an ad I'm actually interested in.

I asked Jeeves about this; he picked up a Magic 8 Ball and it said, "not so good".

Yes, I'm having to go several pages deep and even then not finding anything relevant. I've started to use other search engines and Reddit to actually find useful info.

Google poured billions into their search engine for two decades to make it better. Now that they have a ridiculous amount of money and power, the search results get... objectively worse. Which brings us to the elephant in the room: what are Google's motives behind this (clearly intentional) change?

It could be something as innocent as training a new neural net or testing a buggy version of the algorithm on subsets of users. But it could also be as sinister as driving traffic to those in bed with Google, silencing opposition, or effectively whitewashing the entire internet...

> Which brings us to the elephant in the room: what are Google's motives behind this (clearly intentional) change?

They are maximising ad revenue, not the search relevance/usefulness.

I was looking forward to seeing someone else share this opinion. So Google's behavior is driving some users away; I'm wondering why others are sticking with it. I propose that in the course of professional contact we should strive to avoid using "google" as a verb. Yes, I know it's not slick to say "perform a search - using the search engine" instead of "Google it", but it starves a mentality; I think it would disconnect the G-word from the perceived face of the internet. The whole point is that a monopoly eventually gets out of hand and starts screwing its users, to its own benefit, due to the largesse of the users. If Google is to improve itself, we the users have to force it to by ignoring it and going elsewhere. That, I think, starts by re-realizing, as a herd, that there is a choice other than the Alaughabet search engine [aka Google].

> I propose that in the course of professional contact we should strive to avoid use of google as a verb.

Stopped using "google" as a verb a long, long time ago, in favor of just saying "search". I don't think that's ever confused anyone.

One possibility is that Google hasn't gotten worse, but the spammers have come up with new techniques that Google hasn't adapted to.

Maybe bad organic results lead to more ad clicks.

Most definitely. But the folk at Google are very smart, very rich, and already run the most lucrative ad platform in the world. Wouldn't hamstringing their flagship product for the sake of a few extra $B/yr harm them in the long run as more and more users switch to other search engines? They had to have considered that and made the change anyway. What's the endgame? I don't feel it's more ad clicks.

What makes you think they wouldn't? Everything else Google seems to do is in the interest of short-term profits. Look at all the great products they've shut down simply because they weren't all that profitable.

It's my opinion that a large portion of the websites on the front page of any search (Quora and Pinboard, anyone?) are completely bought and paid for.

I think the endgame ends up very close to the same every time this sort of thing happens. A corp gets good, people like it, then they get rich and take on investors. When stocks and investors get involved, there is an expectation of an ever-increasing >RATE< of profit. If that rate decreases, stocks are dropped, and if this goes on long enough, the corp is so interested in maximum profits over a shrinking timeslice that it basically takes all and gives nothing in return. That is the point when it is no longer a service, and the exodus begins. [MySpace]

Maybe you had this problem before, but your expectations grew faster than the technology? Can you think of something from your search history and find anything that other search engines found but Google failed to?

Their keyboard predictions have gone from "OK" to "Amazing, we live in the future", and over the past couple years to "of course I didn't mean 'aaAAaAAnd', wtf were you thinking".

I frequently suspect they're starting to optimize more for $ than they were before, and ML just gives them more ways to make that number go up another % or so... but it often comes with impossible-to-predict and wildly inhuman edge cases. It's a pretty common trend when companies start focusing on small number increases - each A/B test shows improvement, but the product as a whole worsens and it drives people away in time.

About 2-3 months ago they basically nuked Youtube's search and recommendation. This was associated with some bad press about those features coming up with "harmful content" like unapproved radical politics & conspiracy theories. Now you basically see mostly curated front-page stuff plus some user stuff that had probably never come up in search before (e.g. a fairly common search term will come up with videos that are a decade old and only have 5k views). Maybe changes in Google search are related?

IMO, YouTube changed for the better. It used to focus on the controversial and current; now it focuses on curated and evergreen content. Exactly the kind of thing people in this thread are missing from Google Search.

Maybe some similar change is coming to Search.

Yep, I've noticed a lot more commercial results than before. To find something relevant I often have to dig deep, especially if what I'm looking for is a little bit obscure. I'm glad you mentioned it.

Yes! This morning I was not finding exactly what I was looking for in DDG and fell back to Google, and the results were quite noticeably worse.

To me they started spiraling down when they started to give too much power to designers. Form over content is a terrible idea for a search engine ...

Google Images is a partial example, but this happened a while ago. It appears what they do is use ML to classify what is in the image, and then show images that fit those categories. It is now useless for checking things like whether the logo designer you hired off Upwork/Fiverr/etc. just stole someone else's design.

Aspiring science fiction authors, or Neal Stephenson, should write a novel about a world where ML tuned models optimize everything to be just good enough not to churn customers while maximizing margins. (Also applicable to non-profit items like politicians and universities)

Google Images still checks for exact matches. The ML stuff is an extra.

Can you give an example from your search history? How can you quantify that results got worse?

I've had this problem recently. I can craft a search for something just slightly obscure and specific that should, nonetheless, have had plenty of hits on the "old web", let alone now on the many-times-larger web. But "no pages found". Loosen up the search and it's nothing but Google-friendly blogspam that isn't remotely related to what I'm trying to find. I call bullshit.

Loosen it up? You mean google didn't automatically remove your keywords for you?

Heh, oh yeah, tons of that, usually the ones most relevant to narrowing the search beyond "everything on the Web". Thanks, Google.

So then I do the quotes thing, especially quoting phrases that 100% for sure must exist on some web pages, along with all my other keywords and pretty soon I'm at "no pages found". Pull back just a little, and it's page after page of entirely unrelated-to-what-I-want blogspam.


Looks like only page 6 is indexed for some reason. The site owner would be able to check the webmaster tools on Google to see why.

Search console isn't really helpful in many cases. Unless there's an error, it'll probably say "crawled but not indexed", which gives you no idea why they didn't include it.

there are 3 parties:

the end user searching for the content

the webmaster or author of the content

the search provider

If I'm searching for something that I know exists and I can't find it, there is no excuse. The search provider failed to do its job.

There is no "but the webmaster should have done this and that". He was hit by a bus 10 years ago and we should be happy the content is still available.

A good search provider would link a vanished website to archive.org if the content is exactly what the customer wanted.

Long, long ago, when posting interesting links in comments didn't trigger commercial hysteria, people would cite bits of text and link to the full text. Later this became simply citing a chunk of text. I used to drop a few lines from the citation into the search engine and find the original work.

Just look!


As I'm writing this there are exactly 45 search results above the one that should have been displayed.

There is no excuse like HN not ranking high enough; they did index the page, and the other results didn't match the query better.

If we do this with 4 exact lines from a less popular site it will end up some place on page 20 of the search results.

Another example - I really don't care about indexing, but here is an article that I always (jokingly) refer to as my greatest work.

The exact title:


A really weird result. Safe to say nothing matching is there.

The first many words from the text:


It doesn't find it.

Then we check if it is even indexed...


And there it is! Why does it even crawl the page?

It also lists websites that have the number 8616 on them and ones with both the word "blog" and "here" in the text.

I'm not supposed to laugh?

Probably because the site is not https, and Google rank includes https: https://www.sangfroidwebdesign.com/search-engine-optimizatio...

Page 5 seems not to be indexed. Everything on other pages can be found with Google.

you can force the site with "site:... ": https://www.google.com/search?q="metallica+only+played+2+son...

It doesn't find page 5 with these terms but finds page 6.

There is probably an issue within the page 5 itself.

From what I can tell, there's no links anywhere on the site to that particular page, you have to know the exact term and search it: http://www.gnoosic.com/discussion/

How is Google supposed to find that out?!

That's a good point. Googlebot probably wouldn't try out combinations in the search box, so unless the site owner provides a sitemap, Google wouldn't know about the entries.
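For reference, the kind of sitemap being suggested is just an XML file listing the otherwise-unlinked URLs - these entries are hypothetical, guessed from the site's URL pattern:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- one <url> entry per band-discussion page that nothing links to -->
  <url>
    <loc>http://www.gnoosic.com/discussion/metallica</loc>
  </url>
  <url>
    <loc>http://www.gnoosic.com/discussion/iron-maiden</loc>
  </url>
</urlset>
```

The file gets referenced from robots.txt (or submitted in Search Console) so the crawler discovers pages it could never reach by following links.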

How are you determining nothing links this? Google?

No, I'm browsing the site and I'm unable to find any links to those band-specific pages.

Seeing Metallica on HN makes me feel much more welcome :)

Well, search may have bugs (or undocumented features) too, I have googled content from other pages on this site (related to Metallica). Page 6 for example: https://www.google.com/search?q=%22Listen+up+you+fags+metall...

I don't think the article's premise is that Google axed all content older than 5 years or so, but that it gradually discards old unique content.

Which goes against the original mission of Google to "organize the world's information and make it universally accessible".

A "bug" could be an option, but I don't expect that to be the reason. It's too easy to find examples of forgotten content. And I don't think a bug of that magnitude in Googles core business would go unnoticed.

>> Google's core business

Which core business are you referring to?

Search. It still forms a major part of their business, through direct ad revenue but also by redirecting traffic to other Google products (e.g. Maps, YouTube).

Interestingly, even though DuckDuckGo finds the post, Bing doesn't seem to.

This is reminding me of the meta search engines that consolidated results from multiple sources. I haven't used one of those in probably 15 years.

Not showing up here. Not even if I add quotes.

I tried with quotes and it doesn't show up, ironically. It must be without quotes.

It forced me to solve a bunch of CAPTCHAs too.

I do see it on Bing.

Also on the "Million Short" search engine mentioned by kickscondor:


I've never seen that one before. Do they have their own crawler?

Thanks a lot for the example!

Part of the problem is that their algorithm has become weighted against blogs and personal websites.

> Rumors spread that large link pages (for surfing) might be considered “link farms” (and yes on SEO sites they were but these things eventually trickle down to little personal site webmasters too) so these started to be phased out. Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s (they were not removed completely just sent back to page 10 or so of the Google SERP’s) and traffic to Tripod and Geocities plummeted. Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.


SEO seems to have become a huge obstacle course that smaller websites can't play.

You're jumping from describing observable results to a state of mind or motive which you can't observe.

> Then the worry was Blogrolls might be considered link farms so they slowly started to be phased out. Then the biggie: when Google deliberately filtered out all the free hosted sites from the SERP’s...

That's all observable fact.

Why? Because they were taking up space in the first 20 organic returns knocking out corporate and commercial sites and the sites likely to become paying customers were complaining.

I think the more reasonable, less diabolical motive was that the blogs and free hosted sites were largely link farms that no one wanted to visit.

It sucks for the few legitimate pages on those platforms, but the legitimate page is the rare gem in a minefield of automated copies of other blogs, just with SEO links and ads inserted.

It's like a comments section: without moderation or captchas or both, a "thriving local community" on, say, a small town news site can be overwhelmed by automated pharmaceuticals spam. Then the newspaper kills the comment section, not out of any malice towards the original community but because they don't want to deal with the spam.

And yeah, dealing with spam and black hat SEO does take resources. If you (or worse, your chosen blog host) don't keep the weeds down, soon your pasture will be overrun and burned off.

I absolutely agree with you that whether Google is intentionally diabolical or not is up in the air. My reason for quoting Brad there is to succinctly recount a history where Google has been a menace (deliberate or not) to individual blogs and websites. Blog rolls were absolutely a great way to discover new blogs and were hardly “link farms” but were an incredibly valuable resource. (An equivalent to modern friend lists.)

Where I don’t agree with you is in the portrayal of the Web as largely comprised of link farms and “few legitimate pages”. I spend a lot of my time cataloging the hidden corners of the Web and it is mostly individuals working on their personal Web projects. Spam is simple to identify (much more so than ‘clickbait’) and the reason people don’t read personal websites any more isn’t that interesting and mind-blowing projects on the Web are too rare. (I don’t have statistics to back this up, but I feel like they are more common on the Web than on social media.)

> I spend a lot of my time cataloging the hidden corners of the Web and it is mostly individuals working on their personal Web projects.

That sounds interesting. Do you have a list of some interesting projects that you're willing to share?

I catalog my findings on my blog: https://www.kickscondor.com/ and I have a directory of my favorites: https://href.cool/.

Thank you for asking. If you know of any sweet links, pass them along!

Awesome, thanks for sharing!

I wish they would filter out Pinterest by default (instead of adding -pinterest), they're worse than old Blogrolls.

It's not Google's fault this time.

The problem is that Blogspam is now a (legitimate) industry much bigger than Google can manage.

Google Search became a playground for marketing firms to dump content made by low-paid freelancers with algorithmically chosen keywords, links and headers. It's SEO on large scale. Everything is monitored via analytics and automatically posted to Wordpress. Every time Google tweaks its algorithm to catch it, they're able to A-B test and then change thousands of texts all at once.

Personal blogs can't even dream about competing with that.

In fact, those companies are actively competing with personal blogs by themselves: via tools like SEMRush and social media monitoring, they know which blogs are trending and use their tools to produce copycat content re-written by freelancers and powered by their SEO machine.

I know a startup that is churning out ten thousand blog posts per day on clients' blogs, each costing from 2 to 5 dollars for a freelancer to write according to algorithmically defined parameters.

Just wait until they get posts written via OpenAI-style machine learning: the quality will be even lower.

Not only that: there's no need for black hat SEO anymore. Blogposts from random clients have links to other clients' blogs, and it is all algorithmically generated to maximize views and satisfy Google's algorithm. They have a gigantic pool of seemingly unconnected blogs to link to, so why not use it.

The irony is that companies buy this kind of blogspam to skip paying for AdSense. Why pay when you can get organic search results? So not only are they damaging the usefulness of the SERP, they're directly eating into Google's bottom line. These blogs also have ZERO paid advertising inside them, since they're advertising themselves.

That's the reason Bing, DuckDuckGo and Yandex still have "old web" results.

That puts Google in a very difficult position and IMO they're not wrong to fight it.

Well, I disagree. (Though I think your record of things is correct!) Certainly if you look at this as a bot war then Google's actions make sense: we need our bots to outsmart the 'bots' (human bots even!) that are writing blogs.

But look at it another way: you have lots of humans writing - and it's all of varying quality. Why not let the humans decide what's good? The early Web was curated by humans, who kept directories, Smart.com 'expert' pages, websites and blogrolls that tried to show where quality could be found. Google's bot war (and the idea that Google is the sole authority on quality) eliminated these valuable resources as collateral damage.

I agree with you.

Maybe the problem is that PageRank (or whatever they call it these days) has run its course. I mean, it's supposed to gauge "what humans think is good", but it's failing miserably. It's indeed time for a more curated, artisanal web.

PageRank is predicated on an assumption that most pages (and thus, most links) are created/curated by humans. This was true when it was invented, but appears to be less likely now.
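To make the human-created-links assumption concrete, here is a toy sketch of the PageRank power iteration (my own illustration, not Google's production algorithm; the link graph and damping factor are invented):

```python
# Toy PageRank power iteration. links maps each page to the pages it links to.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Everyone gets the "teleport" share, then link endorsements are added.
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new[target] += share
        rank = new
    return rank

# Invented four-page graph; "spam" only links to itself.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "spam": ["spam"]}
```

The algorithm's whole premise is that each link is a human endorsement; a farm of machine-generated links breaks exactly that premise.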

What gives me pause here is all the anecdotes in this thread about other engines getting results right. If the real answer is "PageRank has been successfully flooded by bots", then everyone would have bad results.

What I suspect, off nearly no evidence, is that Google is using ad tracking to inform a notion of search relevancy. My nearly unjustified belief is that that system is the one being flooded by bots.

You can see some evidence that suggests it when you search for a specific software or ebook to download.

Piracy is gone, but you will find hundreds of automatically generated credit card phishing sites full of Google Ads, sometimes promising pirated versions but serving a trojan, sometimes showing a credit card form. Some of them are on the first page, sometimes before legitimate websites.

> IMO they're not wrong to fight it.

But if their efforts in fighting it are a large part of the reason that Google search results are getting downright bad, then they're wrong in how they're fighting it.

I agree with you.

What I mean is: I don't think their fight is misguided or evil this time; they're trying to keep the result pages usable for end users. They're just doing a terrible job of it. (Or: they're doing a worse job than the spammers.)

>It's not Google's fault this time.

Isn't Google responsible for making Internet advertising accessible and widespread? They developed and launched AdWords (2000) and AdSense (2003).

> SEO seems to have become a huge obstacle course that smaller websites can't play.

Absolutely right. I recently started a blog, and was disheartened to learn that I have to sign up for accounts with several search engines, conform to their standards and rules, give them a bunch of data... and still sometimes have mysterious indexing issues with no real recourse. How much time and effort do I really want to spend to play the SEO game? I have a job, projects, and hobbies; I don't have the time or patience to play their game of "let's fuck with things randomly until you get indexed and ranked higher". That was fun for a few hours, but I'm done with it.

If you decide to start a blog again, please contact me - I will list you in my monthly "href hunt" - a raw dump of newly discovered sites. And I can point you to directories like personalsit.es that list blogs.

And, of course, consider having a blogroll of the sites you follow, which is a little way for all of us to contribute to the effort of finding each other. :)

I plan on posting some more soon; work just got crazy for a bit. And that's a good idea, I should provide links to blogs I follow!

That's exactly what it is, and Google's also incentivizing many low-quality sites to engage disproportionately in SEO to boost their Google Adsense earnings too.

A friend spoke to an SEO analyst just yesterday and it seems the counterplay is to add "recency" to your posts.

If you have an older post that's great but unchanged, it'll become less prominent. So go in, edit in some changes, and now it's fresh and ready to be indexed prominently again.

If this is how it goes, I guess it helps in a way. The articles we care about get attention and don't drop off. But there's so much of the old web we might lose in the haystack.

The first paragraph of the article mentions the story of Tim Bray [0], which is exactly about this: Google forgetting an article which did not change location.

[0]: https://www.tbray.org/ongoing/When/201x/2018/01/15/Google-is...

Yesterday I noticed that Google Scholar forgot one of my articles from 2018, on arXiv. See: https://scholar.google.com/scholar?q=arXiv%3A1811.04960 Google Scholar is not the same as Google Search, which can still find it https://www.google.com/search?q=arXiv%3A1811.04960 For how long, I have no idea. The article was at the same link all the time and arXiv is very reputable.

I also noticed that all our scholarly articles are gone from Google Scholar. The only thing there is our two highly cited books. https://scholar.google.ca/scholar?hl=en&as_sdt=0%2C5&q=site%...

We've come to rely on Google too much, so much that if you are not on Google you don't exist. That's a problem for researchers who are looking for articles to cite.

Somebody should start a site for collecting "scholar dropouts". An article qualifies as a scholar dropout if:

- it was previously available on Google Scholar

- it cannot be retrieved, or searching for it on Google Scholar gives a misleading result (for example, it gives another article, as explained in [1])

Please help to make a list of scholar dropouts! Thank you.

[1] https://news.ycombinator.com/item?id=19604722 HN comment with evidence

[2] https://news.ycombinator.com/item?id=19604955 HN reply with more evidence

Is this recent? In my case I noticed it yesterday.

The articles are still on Google Search though: https://www.google.com/search?q=site%3Arepo.risat.org

Fingers crossed they don't get dropped from the main index too.

I also just noticed it. No idea when the rest of the papers were dropped.

I sent today a message to Google Scholar with this https://support.google.com/scholar/contact/general

>The way their algorithm most likely deals with this is a mixture of domain rank + tenure... how long has this copy of this article existed on this domain, and can we be sure this is the original copy?

This rationalization doesn't change the fact that it's increasingly hard or impossible to find certain things on Google, that it is effectively biased against certain types of websites and certain types of pages (even when the content is perfectly good), and that other search engines seem to deal with these issues much better.

"Google is not forgetting the old web unless we see evidence of content disappearing from the index that have been consistently hosted at the same domain & URL since their original posts."

I can, very loosely and anecdotally, confirm.

My personal website has been online for about 20 years and I just picked some deep strings of text and searched for them and google has the whole thing indexed just fine ...

Google has weird rules and inconsistent indexing. I recently published a two-part article on using GraphQL/Apollo with React and Rails; it never indexed the first part (the GraphQL / Rails bit) but did index the second part (React and Apollo). And in fact, searching with "graphql rails react apollo" still doesn't show any results for this page on Google despite it ostensibly being indexed, but it shows up on DuckDuckGo. And looking over the results on Google, only ~4 are actually relevant to the topic, so it's not like good content is being shown instead.

I tested it with an article on my own website from 2003.

I first posted it on


then I had a 301 redirect there for a couple of years to


until I stopped paying for the .de domain. About 5 years ago I made another 301 redirect to


which is still in place. DDG finds it but not Google and actually neither does Bing.
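For what it's worth, a redirect history like the one above is simple data for a crawler to follow; here is a hedged sketch that walks a chain of 301s to the final URL (the URLs are invented, and the 301 responses are simulated with a dict rather than real HTTP):

```python
# Resolve a chain of 301 redirects, as a crawler would need to do to keep
# an article's history attached to its current URL.
def resolve(url, redirects, max_hops=10):
    """redirects: dict mapping old URL -> new URL (simulated 301 responses)."""
    seen = set()
    for _ in range(max_hops):
        if url not in redirects:
            return url  # no more redirects: this is the final destination
        if url in seen:
            raise ValueError("redirect loop at " + url)
        seen.add(url)
        url = redirects[url]
    raise ValueError("too many redirects")

# Simulated history of an article that moved twice (made-up URLs):
chain = {
    "http://old-site.example/article": "http://site.example.de/article",
    "http://site.example.de/article": "https://current.example/article",
}
```

The interesting failure mode in the parent's case is the middle hop: once the .de domain expired, the chain broke for any crawler that hadn't already recorded both redirects.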

So as with all other open systems, spam is destroying the web.

It seems like the opposite actually -- spam is destroying Google.

They're so big that it's worth blackhats spending significant resources to game their algorithm. That induced them to implement a spam filter which is now discarding the ham along with the spam.

Which means that smaller search engines that aren't being targeted by spammers are now giving better results. That is a major long-term problem for Google if they can't avoid throwing the baby out with the bathwater like this.

People only use Google because it has historically had the best results. They'll get some way on inertia now, but that doesn't last forever. They need to fix this or they're ultimately in trouble, and we could be heading for a landscape where being a search engine above a threshold size is a liability.

IME, it's not blackhats anymore causing the problem. It's (legitimate, but shady) marketing agencies and startups handling thousands of customers and with deep pockets to do SEO research.

I count those agencies as a variety of blackhat.

Blackhat is as blackhat does; it makes no difference how or why you screw users by failing to be forthright and candid. If you do it consistently, that's blackhat.

The web is fine, and search is fine. It's specifically Google search that's being destroyed by spam.

It's odd to put forward the hypothesis that DuckDuckGo is now better at search (aggregation) than Google is at search. But that seems to be where we have landed.

I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.

I think Google has been explicit about this (I may be wrong, but I seem to remember thinking about this because Google themselves said it). Essentially, I believe, they are no longer concerned about being a way to navigate all the material found on the internet. Instead, they are concerned with answering the question posed by each search attempt.

That's exactly it.

A few years ago they made a push to answer questions to the point it was in their product description on their "how Google search works" page. To quote it exactly, it used to say their objective is to "return timely, high-quality, on-topic, answers to people's questions."

And that's kind of the whole problem and why there is space for a search that actually returns results from the web in a clear and logical way.

In that case, they have a branding problem, and should rename Google Search to Google Answers.

> I think it may be a simple consequence of the fact that Google Search is increasingly less of a searching engine and more of an answering engine.

I've been thinking about this, and it seems very plausible to me. Which means that Google Search isn't really "search" anymore -- which explains why it's become so bad at that!

Too bad. I remember when Google had the best search engine going. It was a real game-changer. Those days are long gone.

Does Google work if my question is "what's a good article about X"? I'm willing to modify my search terms to speak Google's language.

In my experience, Google works better if you ask it questions like that. Not good enough, especially if you're looking for something specific and technical, but better.

Yeah, but how was DDG able to show "the new original"?

DDG's primary search is actually Bing IIRC.

DDG isn't only using Google, it uses other search engines too.

Mostly Bing.

They even have a tool just for this: canonical URLs. It lets websites specify which version is the source/canonical version and avoids the old copy indexing.
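As an illustration of how that signal can be read, here is a minimal stdlib-only sketch that extracts the rel="canonical" link a page declares (the page markup here is invented for the example):

```python
# Find the rel="canonical" <link> in a page using only the standard library.
from html.parser import HTMLParser

class CanonicalFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)  # attribute names arrive lowercased
        if tag == "link" and a.get("rel") == "canonical":
            self.canonical = a.get("href")

page = '<html><head><link rel="canonical" href="https://example.com/original"></head></html>'
finder = CanonicalFinder()
finder.feed(page)
```

Of course, this only helps when the site owner knows to declare it, which is exactly the kind of SEO chore smaller sites tend to skip.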

If the originally indexed copy no longer exists, Google shouldn't down-rank a reposted version!

I have noticed that searching for exact quotes seems to have been broken on Google for a few years. But only minimally broken. And I've had no idea how to reason with it. This article completely corresponds with problems I've encountered with searching for results on StackOverflow or software documentation sites; it's especially perplexing that "site:..." combined with exact quotes does not work for many cases.

Google certainly doesn't seem to value feedback at all. It's practically impossible to get in touch with a human to ask for help and Google's feedback forms have always felt like a black hole.

I too noticed that for some queries, Google is becoming really, really unwieldy.

I can't recall the exact search term, but I kept looking for some site I visited some time ago, and no combination of words could get it to find the actual site. I finally just gave up and found it in my browser history.

I've "lost" a few sites that way. Forgot the address, can craft several searches that years ago definitely would have brought up the exact site I want. But... nothing. Or it's buried so far in the search results I'll never find it. I need a good search on top of Google search, these days.

To be fair DDG rarely works for me in that way, either. I think that kind of old-school, precise search engine's just dead now. It seems like everyone's indices are a lot "fuzzier" and full of holes, like they're discarding large parts of pages from the index if those parts don't look important to the algo. Not just deprioritizing, but tossing those pieces out entirely. Except the algo's very wrong.

Yeah, I've had similar experiences. Also, for diving into some random topic, Google is not really helpful and just suggests the really obvious things first, like Wikipedia.

On the other hand, for day to day work at least for me it is still indispensable. Googling a random exception out of an unexpected stack trace works far better than with DDG for instance.

> Google certainly doesn't seem to value feedback at all.

As it is with most big companies that profit from ad revenue. They seem to consider performance indicators to be sufficient to know if a new feature is good or bad, instead of worrying about written customer feedback.

I've noticed lately some of the search operators (intitle, inurl) have more limited results. Feels like a step backwards in functionality, but I suppose it's a step forwards in getting the user onto Amazon or whatever.

Amazon is even worse.

What is your estimate of how much it would cost to respond to every piece of feedback from a >1Billion-user user-base?

What's your point? That google is a for-profit company? Why should the people who utilize the services take a back seat to profits?

I didn't make an estimate because I literally do not care about the monetary cost. Why should Google be exempted from basic standards of customer service just because it's profitable to do so? That's the exact opposite of being a productive member of society.

Fact is, users aren't perfect and will always need help to use services provided by others. A company which does not help its own users because it's less profitable to do so is an unethical company and should not have any business whatsoever in a civil society.

If that makes the company not profitable then perhaps the company should actually sell their services for a cost instead of "free", or perhaps become a non-profit (and reduce taxes), or become absorbed by the government (and citizens' taxes provided the "clearly beneficial but not profitable" services), or... you know... stop being profitable and stop being an unethical business.

you have to go to the options at the bottom

more >>

settings >>

or whatever. Fiddle around until you find "verbatim" and choose it.

Verbatim certainly fixes certain kinds of searches, but it is insufficient.

"Deprecating" might be a better term than "forgetting." Google's business isn't driven by long tail content. Probably it never was.

Maybe somewhere there is a Google disk with the hash of an exact phrase the author typed into the search box. But statistically, that hash won't be found in hot memory vector space when cosine similarity runs on a nearby server. Finding the phrase would require a batch job that runs much longer than the engineered time limit Google imposes on search queries. Without a "let me know in 24 hours" option, Google's search will partition data into what should and shouldn't be accessible. That partition will always be according to Google's business goals. All the information may be indexed, but only the fraction of the index beneficial to Google will ever be accessible to ordinary users.
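To make the cosine-similarity step concrete, here is a tiny sketch of nearest-vector matching (the document vectors are invented for illustration; real systems embed text and use approximate nearest-neighbour search over billions of vectors):

```python
# Rank candidate documents by cosine similarity to a query vector.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up 3-dimensional "embeddings" for two documents and a query.
docs = {"fresh": [0.9, 0.1, 0.0], "old": [0.1, 0.2, 0.9]}
query = [0.8, 0.2, 0.1]
best = max(docs, key=lambda name: cosine(query, docs[name]))
```

The point of the comment above is that an exact phrase can miss entirely under this kind of matching: similarity retrieval returns the *nearest* vectors in hot storage, not every page containing the literal string.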

The crux of the story is that there is no business case for Google to return the author's web pages in search results even if the Wayback Machine implies that Google could.

I maintain two web sites that date back to 2003. They are still very active (thousands to tens of thousands of uniques per day), but I and my users have noticed that only the more recent content (2012+) shows up in Google.

In a way I’m glad to hear that Google is delisting the older content, because I thought I was doing something wrong.

But it’s still frustrating for my visitors because every few months I get a message about how they can’t believe all the information there is on the site that they’ve searched for for years but never found through search engines but it’s all right there on this one site. (It’s something of a regional history site.)

I guess those sites have involuntarily become part of the “dark web.”

Is it possible to check your access logs, to see how much Google and others are spidering?

I know very little about google's SOP, but had the impression they periodically rescan stuff.

I've built new sitemaps, submitted them to Webmaster Tools (or whatever they're calling it this week), requested a re-spider, everything.

Duck Duck Go finds it and shows it. According to the logs, Google spiders it, but chooses not to show it.

Google's spider is weird anyways. I have a site that has sitemaps submitted like three months ago and it still hasn't taken a look at them.

This has been happening for years and it's getting increasingly worse. But it's not just old websites; I think it's certain types of content.

Here is one specific example out of dozens I've seen. There is a short satirical rant "published" on Pastebin called The Java Way. Posted in 2015. Unfindable on Google. It was indexed and findable around the time it was posted.


First result on DDG:


The worst part is that Pastebin uses Google for its own search.

Amazingly, three hours after your post, this hacker news comment is now showing up in the top 5 for the Google query. Wow.

Not on my version of Google. But for me it started showing this as top result:

  No information is available for this page.
  Learn why
I tested this on different browsers and IPs. Seems like it indexed the link from this thread, but can't display it because of DDG's robots.txt settings or something like that.


Just for historic record: Google finally started showing a link to the original pastebin page and it is the first result. I suspect it's because someone submitted the page to HN, which shows as the second result now. (https://news.ycombinator.com/item?id=19609280)

Your DDG link is now the second result in the Google results for me. (One spot higher up than your reply.)

When web directories like Yahoo lost out to web search engines like Google, we lost something crucial. While search is good for answering questions you know how to ask, browsing was exploratory and led us to know what we didn't know. When it comes to learning a complex topic like mathematics, this kind of serendipity was very useful. There are some amazing resources on the web, but googling won't let you discover those.

Sometimes, the same idea is available in a book, in a TED talk, and in a podcast. Some of us are curating such resources categorized by topic / format / year / difficulty / estimated time. Our GitHub repo received 100+ stars in less than a week, so I thought it would be a good time to show it to HN. I'd love to get some feedback and critique from the HN community where I have learned and discovered so much.

Here's the Show HN post: https://news.ycombinator.com/item?id=19604295

'Awesome' directories have really given a nice resurgence to the Yahoo! style of organization. I think the problem with Yahoo! is that it simply got too big to be used as a directory. Niche directories are where it's at. (Also: Reddit wikis, which often are used similarly.)

> Niche directories are where it's at.

Agreed. And if we came up with a standard format for such lists, you could make them searchable, and we could end up with distributed, searchable, curated indexes that are not centrally controlled. And that is quite compelling compared to centralized, fully algorithmic search systems run by mega-corps.
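As a sketch of what that could look like (the JSON fields and entries here are invented for illustration, not a real standard):

```python
# Two independently published directories in a shared, hypothetical JSON
# format, merged and searched client-side with no central index.
import json

directory_a = json.loads("""
[{"url": "https://href.cool/", "title": "href.cool", "tags": ["directory", "web"]},
 {"url": "https://wiby.me/", "title": "Wiby", "tags": ["search", "old-web"]}]
""")
directory_b = json.loads("""
[{"url": "https://indieseek.xyz/", "title": "Indieseek", "tags": ["directory"]}]
""")

def search(term, *directories):
    """Return URLs whose title or tags match the term, across all directories."""
    term = term.lower()
    return [entry["url"]
            for directory in directories
            for entry in directory
            if term in entry["title"].lower() or term in entry["tags"]]
```

Anyone could fetch a handful of directory files they trust and search across them locally; curation stays with the individual list maintainers.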

What's the point of saying "We have collected links for [...], relationships, [...]" when there are 0 links collected under relationships? Seems weird to me.

Adding to that, I sometimes fear that my Google search results are biased towards my own search history.

Personally I have a similar experience, but the other way around. Every time I look for anything in Google, almost all of the results are from 3-6 years ago unless I specifically say I want results from the last month / year / etc. And I'm not just talking about technical questions, but all kinds of stuff, including music, travel information and such. I'm not even sure when Google last gave me a link to some new website with fresh content.

It sometimes feels like the web has completely frozen and all content has moved into walled gardens. I switched to DDG a while ago for this and a bunch of other reasons, but I wonder if anyone else has noticed this. Anyone?

I don't think your "3-6 years ago" and "forgetting the old web" are incompatible. I've noticed the same - Google seems to gravitate to results from 2014ish, even when newer information is available, or when I'm searching about an event from far before that time.

I thought I was going crazy with this 2014ish search result dynamic. I almost always have to use date filters to find recent information. Especially when trying to research products, reviews, types of products, etc.

I have a site that includes lots of older content. I checked the first page that came to mind (published in 2005) and it still shows up on the first page of Google results for the obvious search terms. So it certainly isn't as clear-cut as 'everything older than 10 years is not in the index'. Update: I checked another from circa 2003; it's also on the first page of search results for a search on its title.

I have been noticing (what appear to be) truncated search results in Google for some time now. At first I thought that was because I was accessing Google through Startpage. But that's not the case.

Anyways, I find myself using Bing more and more often these days, because the search results dig more deeply into the 'obscure'.

I'm not at all upset by this. It seems to me that as Google's results are not completely satisfactory, more people will make use of various alternatives. Maybe one day, search will become decentralized again, somewhat like it was in the 1990s, when you regularly made use of many search engines, like Altavista, Lycos, Excite, and Yahoo.

I would imagine that there must still be metasearch sites out there somewhere that submit your query to several search engines. I need to find one again and would appreciate recommendations.

I miss those days. There seemed to be so much choice back then. I do remember throwing the same search into 3-4 search engines, not knowing if the reason I couldn't find anything was because of the engine itself, or if there literally was no information about it on the Net... something that was a possibility back in 1999.

The meta search engine www.dogpile.com still exists but I'm not sure how good it actually is.

Found a very good one. In case anyone's still reading this thread: searx.me

I used to use dogpile back in the day. I'll give it a try again. Thanks!

The assertion that this is because "indexing the whole Web is crushingly expensive, and getting more so every day" is a bit flawed. Since old content is very unlikely to be updated, it doesn't have to be re-crawled a lot. I'm certain Google has a score that tells it how often the content of a given site is likely to change. This argument of expense becomes even less durable when you consider that DuckDuckGo, a company with an infinitesimal fraction of Google's resources, is perfectly able to keep that kind of content in its database.
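A crude guess at what such a change-rate score might look like: widen the recrawl interval when a page hasn't changed, tighten it when it has. This multiplicative scheme is purely illustrative, not Google's actual policy:

```python
# Adaptive recrawl scheduling: double the interval for static pages,
# halve it for pages that changed since the last crawl.
def next_interval(current_days, changed, lo=1, hi=365):
    if changed:
        interval = current_days / 2  # page is active: come back sooner
    else:
        interval = current_days * 2  # page is static: back off
    return max(lo, min(hi, interval))
```

Under any scheme like this, a stable 15-year-old article quickly settles at the maximum interval and costs the crawler almost nothing, which is the point being made above.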

I agree with the observation that this is about shifting everything to current data, because people overwhelmingly care about things that happened a few days ago. There used to be a long tail of users searching for old data and references, but I suspect they're fading away. Biasing the index towards recency also has legal advantages for Google, because delisting old content makes it less likely to receive takedown requests in connection with "right to be forgotten" legislation.

Crawling isn't the real problem, nor is the bulk storage for the crawled pages.

What do you do with these pages after you've crawled them? You need to build an index out of them, and serve that index out of some kind of low latency storage (DRAM, Flash). That makes increasing the index size very expensive. The index size has to be limited, and selecting the right pages to include in the index is thus a core quality feature for a search engine.
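To make the point concrete, here is a toy sketch of the index-building step described above (illustrative only; real postings lists are compressed, sharded across machines, and far more elaborate):

```python
from collections import defaultdict

# Toy crawled corpus: doc id -> page text.
docs = {
    "a": "old web pages rarely change",
    "b": "the web is full of new pages",
}

# Build an inverted index: term -> list of doc ids (a "postings list").
index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):
        index[term].append(doc_id)

# Serving a query is then a cheap lookup -- but only if `index` fits in
# fast storage, which is why index size, not crawl volume, is the limit.
print(sorted(index["pages"]))  # -> ['a', 'b']
print(sorted(index["old"]))    # -> ['a']
```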

I'm having trouble imagining that Google would be more limited by the ratio of hardware power vs data size today than it was in the early days. If keeping the whole index in DRAM is now a requirement, then yes, I'd expect a hugely reduced overall dataset - but wouldn't that affect way more sites/pages than the comparatively few dropped historical records?

I still suspect that this whole thing is more about bias (and personalization, be it correct or incorrect) in the results.

Google's index has been in memory for most of its life now: http://glinden.blogspot.com/2009/02/jeff-dean-keynote-at-wsd...

It's actually more complicated than just a single static index, which is also why it's unrealistic to expect a search engine to be deterministic at scale.

It was a limit back in the day as well. Remember Google's "supplemental results"[0]? This has always happened. The only thing that's different is that a blogger was personally insulted by his output not being fully indexed, and decided to pitch it as history being erased.

[0] https://searchengineland.com/google-dumps-the-supplemental-r...

The index only has to be on "low latency storage" if low-latency results to any query are required. While that's definitely true of the modal "Google Search", most of these queries for "long-tail, old content" as discussed in the OP don't really need that sort of quick response.

Interview questions:

How would a search engine distinguish between the two kinds of queries, tens of thousands of times a second?

And how would one architect such a two-tiered system, particularly with an eye toward cascading failures?

Make it opt-in. Instead of requiring every search to be finished in less than 0.5 seconds, allow users to tick a box that says "Take your time" and pull indices from slow storage in that case. If I know I want something niche, I am willing to wait the extra few seconds or even a minute.

Hell, even waiting an entire day (e.g with results sent via email) might be reasonable for some searches.

Behind the scenes, a search on Google involves at least a thousand machines.

How much extra state (internal connections, memory for partial results, etc.) would such a new search type create?

How do you deal with the new kinds of hot spots this creates?

What if millions of people suddenly activate such an option?

What if a botnet does it?

All good points :)

I don't work at Google so I'm probably way off base, but if I was designing it I wouldn't bother telling the difference between the two types of queries.

I'd break up the indices into digestible chunks, perhaps chronologically by year/month crawled, and then run all queries simultaneously (in parallel) against all those index chunks and combine the results at the end. Infinitely scalable and can be tweaked to ensure specific response times.

And there'd definitely be no need to set some arbitrary date cut-off; just add a few more virtual machines. I'd bet that's what Google was doing, and then scaled back those machines to save money and boost profits.
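The fan-out/merge idea in the comment above can be sketched roughly like this (a toy illustration, not Google's actual architecture; the in-memory chunks and scores are made up for the example):

```python
from concurrent.futures import ThreadPoolExecutor

# Toy index "chunks": each maps a term to (doc_id, score) pairs.
# In a real system each chunk would live on its own set of machines,
# e.g. partitioned by crawl year/month as suggested above.
chunks = [
    {"gamer": [("doc1", 0.9), ("doc2", 0.4)]},             # recent crawl
    {"gamer": [("doc3", 0.7)], "retro": [("doc4", 0.8)]},  # older crawl
]

def search_chunk(chunk, term):
    """Query a single index chunk; returns scored hits (possibly empty)."""
    return chunk.get(term, [])

def search(term, top_k=3):
    """Fan the query out to every chunk in parallel, then merge by score."""
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        results = pool.map(search_chunk, chunks, [term] * len(chunks))
    merged = [hit for hits in results for hit in hits]
    return sorted(merged, key=lambda h: h[1], reverse=True)[:top_k]

print(search("gamer"))  # -> [('doc1', 0.9), ('doc3', 0.7), ('doc2', 0.4)]
```

Adding more chunks scales the corpus horizontally; the merge step is what keeps tail latency bounded (or not, if a chunk is slow), which is where the cascading-failure question from earlier in the thread comes in.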

That's kind of how Google works, with multiple index tiers. Look up patents by Anna Paterson to get a few clues, assuming your lawyers won't bark at you.

Still, you can't keep partial results around forever, unless you want to make searches a lot more expensive, having to add a lot of capacity just to deal with the buffer bloat. Each query touches at least a thousand machines. Adding "a few more virtual machines" isn't going to cut it, especially if you have to handle tens of thousands of requests per second.

Progressive loading is a thing.

Yeah, it seems a natural consequence of the combination of vast amounts of recent content with the fact that people mostly want recent content. To pick one trivial example from yesterday, if I'm looking for help with an interface issue with some current version of a program, forum posts from 10 years ago are probably not useful.

Information that people regularly access for whatever reason will tend to remain relatively visible. But, yeah, relatively obscure older content is just going to get drowned out unless you know exactly where and how to look. One might argue with Google's criteria around relevance. However, that older information is going to get harder and harder to find just in the natural course of things.

It's not just the old web. I have a small website that does not show up in Google's results. When searching for the title of the site, it's the 7th result on DDG yet does not appear on Google at all (Google runs out of results after 19 pages, which is also ridiculous considering the search term is very common). When I search for the domain name (sans TLD) on Google, the first result is the associated YouTube page, along with an unrelated Twitter account and even a Reddit comment that links to my website (Reddit is a more effective search engine!). But my website is not listed. Finally, if I search for the full domain name, the website shows up as the first result.

This has been a problem for three years, it's incredibly frustrating, and also demotivating.

Google does have some webmaster tools where, after you prove you own the site, you can check if there's something you have to fix to get listed, submit new URLs for them to scan, etc.

Even if your site is perfect according to Search Console/Webmaster Tools, that won't get you listed on Google Search.

Google Search Console will tell you what pages were submitted (via sitemap) and what pages are indexed as well as any errors.

Yes but they don't show up in the fscking search results even for highly matching searches.

I would not be surprised if, along the way, Google's search results were optimized for ad revenue over other metrics - knowingly or not.

I too have complained about Google's search results going downhill. I was told I was "just too technical".

Whatever the case, the web is not the same as it was in the early 2000s, and it really sucks if you're trying to search for something.

Google Scholar started to forget articles https://news.ycombinator.com/item?id=19599365

Downvoted against the evidence given. Anybody cares to explain why? Thank you.

I didn't downvote, but you link to a comment by yourself that isn't much longer and links to a blog post by yourself, and you already made a comment with basically the same content elsewhere in the thread. That's close enough to spammy self-promotion to get you downvoted irrespective of whether people agree with you

The newer comment came one hour after the older one; I made it after I was downvoted. As for the spammy self-promotion: thank you, but I linked to a HN comment (made yesterday, relevant to the matter, with no reactions at that moment) instead of the blog post directly precisely because I didn't want to replicate links. Finally, while these comments will fade from attention, if they haven't already, that blog post will remain. Excuse me for first giving some evidence that something's wrong with Google Scholar. Obviously spam.

Even in this august forum, you will find that there are people who downvote data that they might find unpleasant.

It happens far less frequently than in other places of the web, but I've seen it happen often enough with some of my comments.

From my experience I can say that the web is filled with garbage that tries to exploit whichever search engine you may be using, while also trying to exploit your own attention with flashy headlines and sensationalistic content. These two go hand in hand.

It wasn't that long ago that any Google search would return a big list of blog entries; personal, non-commercial blogs, that is. That was the case with YouTube, as well. I remember people making and uploading videos just for the sake of it. Even I did that (I had a January 2006 account), and the purpose was only one: sharing what you liked to engage in. Nothing more. I guess that's part of the long-gone, old web. When I browse the web nowadays, I feel like I am constantly being sold something, because I actually am.

> I guess that's part of the long-gone, old web.

That old web still exists! It tends to get drowned out by all of the commercial sites, and you won't find more than a hint or two of it through Google, but it is still there...

Previous discussion from when Tim Bray's article was posted: https://news.ycombinator.com/item?id=16153840

One of the changes that made Google forget the old web is favoring https sites. This is a big benefit to new and commercial websites, because setting up SSL is still a burden for non-commercial publishers.

I feel like LetsEncrypt is a 5-minute burden. Unless I’m being naive - I’ve only used it in personal projects. Thoughts?

Plenty of websites out there being kept up by volunteers without hardware access, the original owner having died or otherwise gone MIA. It's not always possible to add https, particularly when the site owner died 10 years ago and someone has an 'agreement' with the hosting provider to keep a website up as a memorial. No, I don't have any specific examples right now, but anecdotally I occasionally come across a website that's been in such a read-only form for as long as a decade due to family members being willing to continue paying the $15/year hosting fee, but not having the technical knowledge, passwords, or interest to fix problems. Sometimes there's evidence of a partial upgrade (the search engine stopped working due to a php upgrade), or a forum that has been converted to a static site entirely (the login buttons don't work either). In any of these cases, getting LE working is almost certainly more trouble than it's worth for whoever is currently paying the fees.

I have an old Linode with 4-5 personal sites running off a single Apache server. Until recently, it was running Ubuntu 10.04 LTS.

It took me a few days to safely upgrade to a new Ubuntu version with a new enough Python to successfully run letsencrypt, without also breaking the weird custom apache configuration rules that had accreted over the years.

I'd imagine lots of hosts don't allow users to setup Let's Encrypt so then the obstacle is first migrating to a host that does allow it (or includes direct support for it).

As a user, I don't care. If Google is preventing me from finding a site I want just because it's not using HTTPS, I count that as a fault with Google, not the site.

Google has, for quite some time now, preferred magazine-style websites that grow exponentially.

Their search engine algorithm is designed to favour rich media content and websites that are growing, because this is how Google grows and learns.

Favouring the "old web" would not help Google's business model which favours growth.

It's very worrying that people seem to forget how Google plays around with their platform and tinkers with the results, because for a lot of people Google Search is HOW they navigate the internet. I'm most disturbed by how articles critical of Google completely vanish.

An article from 2011, it's not particularly damaging anyway, just the kind of things that happen in tech:


If you search the title of the article:

- DuckDuckGo: First result

- Bing: First result

- Yahoo: First result

- Dogpile: First result

- Yippy: First result

- Google: Does not show (I've gone through the 4 pages of results with no luck). To find it you need to use the "site:zdnet.com" option, and then it's the first result.

Holy shit. With all the other reasons to consider abandoning Google the one thing I've held on to is the depth and reliability of their search results. I just assumed they would never abandon their core strength in this area.

This might be the straw for my personal camel's back.

Worth noting that there was a de-indexing bug this weekend affecting all Google servers and a lot of pages.

Might be just that. More information: https://www.google.com/amp/s/searchengineland.com/googles-de...

Unfortunately Google is the only way to search old usenet archives from the web, since they bought up such archives. That search leaves much to be desired.

Only the relevance search has results for usenet posts; you can't order by date, and other than using date ranges, there's no way to see only usenet posts.

For example searching for "gamer" before 1/1/2000:

relevance https://groups.google.com/forum/#!search/%22gamer%22$20befor...

date https://groups.google.com/forum/#!search/%22gamer%22$20befor...

I wonder how much it would cost to buy those Usenet archives back from Google. I can't imagine they're very profitable for them anymore (if they ever were).

Maybe archive.org could take this on.

Well Usenet provider Giganews has 10 years of Usenet, but I don't know if they kept it from 1990 or... just threw it away.

Ten years doesn't even go back to when Google bought DejaNews's archive, in 2001.

Author says:

> I also find misleading the title of BoingBoing’s report of this story: “Google’s forgetting the early web”. The two posts mentioned here are not “early web”, nor really “old”.

While the title of this author's post is "Indeed, it seems that Google IS forgetting the old Web"

> But this only makes bigger the problem of what to remember, what to forget and above all who and how should remember and forget.

And today, if the big search engines decide something will (no longer) be indexed, they can make it effectively unreachable.

> they can make it effectively unreachable.

I think you mean "unfindable", not "unreachable". It may seem pedantic, but I think there's a critical difference there.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact