So how does Google do a query? Well, fundamentally it has lists of sites that contain a given keyword, so it gets the list for each of your keywords, finds the intersection of those lists, then sorts the result by the PageRank of the matching URLs. That's how it used to work, anyway.
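In rough Python, that classic model looks something like the sketch below; the index contents and PageRank scores are invented for illustration:

    # A rough sketch of the classic model: one posting list per keyword,
    # intersect them, then sort by a precomputed PageRank score.
    # The example index and scores below are made up.
    index = {                     # keyword -> set of URLs containing it
        "from": {"a.com", "b.com", "c.com"},
        "what": {"a.com", "c.com", "d.com"},
        "it":   {"a.com", "b.com", "c.com", "d.com"},
    }
    pagerank = {"a.com": 0.9, "b.com": 0.5, "c.com": 0.7, "d.com": 0.2}

    def query(*keywords):
        lists = [index.get(k, set()) for k in keywords]
        hits = set.intersection(*lists) if lists else set()
        return sorted(hits, key=lambda url: pagerank.get(url, 0.0), reverse=True)

    print(query("from", "what", "it"))   # -> ['a.com', 'c.com']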
So back in the day I decided to run a test. I got the Project Gutenberg text of all of Shakespeare's works, and I wrote a program to go through it and find the longest string of consecutive "stop" words (which you could still force Google to include; normally it ignores them, since there are just too many sites that include words like "the" and so on). Stop words are super common words like "the" that would be on millions or billions of pages, i.e. a huge list of URLs.
So I coded up my little algorithm that goes word by word through the text, keeping track of the longest string it has found so far that consists entirely of stop words.
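A minimal sketch of that scan, assuming a tiny stop-word list and naive tokenization (the real list is much longer):

    # Find the longest run of consecutive stop words in a text.
    # STOP_WORDS here is a tiny made-up subset; the real list is longer.
    import re

    STOP_WORDS = {"the", "a", "an", "of", "to", "in", "is", "it", "from",
                  "what", "and", "that", "on", "for", "as", "with", "at", "by"}

    def longest_stopword_run(text):
        best, current = [], []
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in STOP_WORDS:
                current.append(word)
                if len(current) > len(best):
                    best = current[:]
            else:
                current = []
        return " ".join(best)

    # e.g. longest_stopword_run(open("shakespeare.txt").read())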
The longest string it found was: "from what it is to a".
Just now I did that query again, and on the whole Internet it found just 4 matches (soon it will find this comment too, plus any archives of this comment), all being the exact phrase from Shakespeare:
Impressively, that query now takes 0.84 seconds (I haven't done that query in several years, so it's possibly cached, but I doubt it).
However, when I first performed it, it took 30+ seconds. I didn't take a screenshot but I was super impressed. I brought Google to a crawl for 30 seconds, in the exact way I was intending. Moohahahahaha.
"Holy cow. I just made Google's databases join six lists each with millions of pages on them, find the intersection, and then go through all of them for which ones had my phrase in literal order. And then it found it."
Pretty mind-blowing that today it can do that in < 1 second.
For example I just looked at the top post on Reddit right now, it's about 20 page-downs of comments, and has the words this many times:
a - more than 1000
If you have a positional index  you don't have to go through the full cached content, you only have to check the index to see whether there's an occurrence of the words in the query with the correct distance. E.g. for that Reddit post, you'd check the 17 occurrences of "from" and notice quickly that your query string doesn't match at that position.
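A rough sketch of that positional check; the postings here are invented:

    # Phrase matching with a positional index: a document matches only if the
    # query terms occur at consecutive positions. Postings are invented.
    positions = {   # term -> {doc: sorted list of word positions}
        "from": {"doc1": [4, 90], "doc2": [17]},
        "what": {"doc1": [5, 312], "doc2": [3]},
        "it":   {"doc1": [6, 50], "doc2": [8]},
    }

    def phrase_match(terms):
        docs = set.intersection(*(set(positions[t]) for t in terms))
        hits = []
        for doc in docs:
            for p in positions[terms[0]][doc]:
                # the i-th term of the phrase must sit at position p + i
                if all(p + i in positions[t][doc] for i, t in enumerate(terms)):
                    hits.append((doc, p))
        return hits

    print(phrase_match(["from", "what", "it"]))   # -> [('doc1', 4)]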
… because everyone speaks English?!?
England had a large empire back in the day, but it wasn't that big, nor did it span the whole world.
But why just 4 occurrences? Every utterance of Shakespeare has been printed on thousands of web sites and Google has indexed thousands of those sites. Another line from the same play, "Nay, answer me: stand, and unfold yourself", gets 10,300 hits. I tried another line from deep within the play: "I know him well: he is the brooch indeed." Google found 3,890 results. Even though the search is slow, why aren't we getting thousands of hits for "from what it is to a"?
Because Google isn't an exact search engine. Each of the terms in the phrase above appears on billions of pages. My guess is that for common terms Google doesn't store all postings in its inverted index but truncates the posting lists after a couple of million entries.
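If that guess is right, the effect would look something like this toy sketch (the cap and the postings are made up):

    # Toy illustration of the truncation guess: if the posting list for a very
    # common term is capped, a matching document past the cap never even
    # enters the intersection. Cap and postings are made up.
    CAP = 3   # the real cap, if it exists, would be millions

    full = {
        "the": ["d1", "d2", "d3", "d9"],   # d9 falls past the cap
        "a":   ["d1", "d3", "d9"],
    }
    truncated = {term: docs[:CAP] for term, docs in full.items()}

    def candidates(index, terms):
        return set.intersection(*(set(index[t]) for t in terms))

    print(candidates(full, ["the", "a"]))        # {'d1', 'd3', 'd9'}
    print(candidates(truncated, ["the", "a"]))   # {'d1', 'd3'} -- d9 is lost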
the OR google OR a OR "supercalifragilisticexpialidocious" -the -google -a
Searches are read-only, so I suppose locking doesn't need to happen anyway.
Another thing is that your query is essentially a literal "substring" search, which Google probably handles differently. See e.g. the Burrows Wheeler transform for an idea how it could be implemented.
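A toy sketch of BWT-based substring counting (backward search), just to give an idea of how it could work; the text and pattern are made-up examples, and this is not how Google necessarily does it:

    # Toy Burrows-Wheeler substring search. O(n^2) construction, not practical,
    # but it shows the idea: build the BWT once, then count any substring fast.
    def bwt(text):
        text += "\0"   # unique end-of-text sentinel, sorts before everything
        rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
        return "".join(rot[-1] for rot in rotations)

    def count_occurrences(bwt_text, pattern):
        first = sorted(bwt_text)          # first column of the sorted rotations
        def occ(ch, i):                   # occurrences of ch in bwt_text[:i]
            return bwt_text[:i].count(ch)
        def c(ch):                        # characters sorting before ch
            return sum(1 for x in first if x < ch)
        lo, hi = 0, len(bwt_text)
        for ch in reversed(pattern):      # classic backward search
            lo = c(ch) + occ(ch, lo)
            hi = c(ch) + occ(ch, hi)
            if lo >= hi:
                return 0
        return hi - lo

    text = "from what it is to a thing from what it is to a"
    print(count_occurrences(bwt(text), "what it is"))   # -> 2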
Today they probably use common pairs of words as well for their lists, so they have only 3-4 intersections of smaller lists in this case.
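A biword (word-pair) index, if that's what they do, would turn the six-word phrase into overlapping pairs, roughly like this:

    # Sketch of a biword index: index consecutive word pairs, so a phrase
    # query intersects far fewer (and much more selective) posting lists.
    def biwords(text):
        words = text.lower().split()
        return [" ".join(pair) for pair in zip(words, words[1:])]

    print(biwords("from what it is to a"))
    # -> ['from what', 'what it', 'it is', 'is to', 'to a']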
> Pretty mind-blowing that today it can do that in < 1 second
You are reading a lot into what sounds like it was a single measurement. Was this experiment repeatable?
Example: someone compiled a list of things they learned from indiehackers.com interviews.
There are a lot of quotes on that page, and unfortunately none of them are linked back to the original interview. Some quotes are very interesting, so I wanted to find the original interview.
I took some of those quotes, put them in double quotation marks, and searched on Google, DuckDuckGo and Bing. By the way, you can only replicate these results by adding the double quotation marks.
Google always shows toomas.net in the top results, and almost always finds the relevant interview article on the indiehackers website.
DDG (usually) finds the article written on toomas.net, but not the indiehackers interview.
Bing often fails to list toomas as the top result, and doesn't find the indiehackers interview at all.
I have a similar issue with a side project of mine. JS-only websites solve this with server-side rendering or static generation.
1. A web crawler
2. A search index
Or, as a good boss used to say: garbage in, garbage out. The quality of the engine is as much a function of the index's ability to rank as of the quality of the input from the crawler. The fact that none of the other engines crawl with JS enabled is a huge competitive advantage for Google.
What value a service provides is what matters in the end. Implementation details are secondary.
I’m honestly having a hard time seeing how this isn’t a bug in Google search.
(Disclosure: I work for Google, not on search)
Non-JS web indexing as a default would vastly diminish the utility of JS-only pages.
I separately think sites that only work with JS enabled are not great, but that's a problem with the system and not with individual actors.
So Google is enabling the excess of JS we see today? Sites wouldn't do that if it couldn't be indexed by Google.
Nothing prevents a site from doing that and just loading JS afterwards. Or from still being basically read-only without JS. Which might be an improvement... except that within epsilon of zero people care in the first place, and Google is building software for a rather larger proportion of the market than that, so it's an improvement for what's barely an audience.
Nobody is really "enabling" it except for a browser. And that's not changing. The shirt-rending is becoming tiresome.
It's a shame you find it "tiresome" when people insist on retaining their own opinions, but your exhaustion is not particularly relevant.
And you can have whatever opinion you want; nobody’s saying otherwise. But the constant whining about nothing — and it is nothing — is exhausting.
I think these are both true. Google quality has certainly fallen in my opinion (perhaps as a result of having to constantly counter SEO tricks). Other search engines still have quite a bit of catching up to do as well.
This is one example of many, but it was the last straw for me. I guess techies aren't Google's core audience anymore, I get that, but it feels like we've lost a valuable tool.
People that can actually notice a major difference in result quality should probably go work on search optimization.
Are you able to explain this? Bing and DuckDuckGo are the same thing. DDG uses Bing's API (for non-bang searches).
This is not, and has never been, true. Why do people like you keep saying it?
Google's been able to use ephemeral experiences (like autocomplete and answer boxes) to influence users, especially undecided voters. The research, which was reproduced by a German team, suggests this influenced 2+ million people in the US election alone --
take a listen to the congressional testimony at http://naplay.it/1157/37:25 (recommended listening speed is 1.5X)
He claims they moved 2.5 million votes, not voters. This relies on an assumption of straight ticket voters and about 20 votes per voter. He doesn't correct people when they make the votes/voters mistake.
And even that new number, 100k voters swayed, is not really well founded. He makes a couple of overlapping claims; one of them I looked into deeply and found that, using his own papers, a reasonable way of expressing it was that 8-10k additional votes for Clinton was a reasonable upper bound on what might have been affected.
And the number was likely lower (and was only that high because more people identify as Democrat than Republican in the US).
(I work at Google, but take interest in this mostly because it's just terrible abuse of mathematics)
Maybe, but it's pretty clear to me that Google no longer indexes entire websites. As far as I can tell, this has not been a concern in the past.
The reduction in Google Search quality comes from their need to be "helpful": their desire to use my past search results, and the other "big data" they gobble up about me, to "improve" my specific results when in fact it makes them worse. That is just the baseline; then you have the hundreds of other things they have done to search over the years for political, regulatory, and other reasons.
Google search gets objectively worse every year than the previous year.
That’s like Yahoo designing a fancier logo while the company falls apart.
Honestly, the mentality in this Twitter thread is a joke. Google handles insane amounts of data, beyond what 99.99% of engineers have ever dealt with. The fact that their search works at all is a miracle. But I guess people like to ignore that and would much rather complain on Twitter that it takes a few seconds to query a database of trillions of records.
If, for example, you wanted to harvest domain names because you had a hot 0day on wordpress, this would be a convenient way to do it.
Yes, there are probably other signatures, too.
About 6,060,000,000 results (6.82 seconds)
What do you mean? People are seeing slow queries for "powered by". Unless you think they are lying, I don't see how that isn't factual?
"Come on Google I don't have all day!"
"This is actually one of those crazy "facts from the future" to go back in time and tell people 10 years ago: Google has the dominant browser and mobile operating system, but their search sometimes takes 7 seconds."
"@Google have you tried implementing these recommendations? https://developers.google.com/speed/"
Still, you do you. I'm sure Google appreciates the peanut gallery coming to their defense for even the most trivial of issues...
> Powered by is a footprint that black hats/hackers and spammers use to find targets.
So maybe Google isn't being slow, it's just sleeping intentionally to avoid abuse?
"x powered by header" takes >5 seconds
"x-powered by header" takes >5 seconds
"x-powered-by header" takes ~0.5/0.3 seconds
Is that a thing?
The number of results may represent number of hits in a database, but it's not actually accessible to the end user. After you get into the last few pages, it'll also just cap the result count to the number of pages they've decided to show you.
I've had that a few times lately. I thought it was my internet being shoddy, but the header and footer load; there are just no results in the middle, nor any error about nothing being found... just blank space.
The reason doesn't seem clear, but one comment claims that "powered by" is a common query used by black hats and spammers to find targets. Not sure why any "antispam" measures behind the scenes would delay the query, though. Hardcoded delays?
"Powered by X" is on millions of web pages because it gets autogen'd by popular web-facing CMS (Joomla, Wordpress etc). So when a search query includes "Powered by" google must determine which among the billion pages with this
phrase is most relevant"
My experience with working at a FAANG: I remember a service that was easily dealing with >10M QPS (albeit much simpler queries than a Google search), or some other services continuously processing (reading/writing to disks) > 100 GB/s.
There's A LOT you don't see in FAANG frontends. In particular, the Twitter thread owner suggesting https://developers.google.com/speed/ has no idea where the slowness is coming from.
That being said I'd love to be a Google employee to investigate those queries. Perfect bug reports.
And yes this is an oversimplified model. No idea how much CPU is actually used.
Especially on mobile, where it's challenging to edit URLs in their latest Chrome browser. UX consistency has gone out the window for weird things like swapping around the tab ordering for searches (check out chicago, chicagos and chicago's here: https://pbs.twimg.com/media/Cm0C1o8VYAgLv2O?format=png)
Other things, like not providing custom date ranges on mobile search, are just utterly baffling. Booleans are gone, ranges are gone, and it ignores most of the words I put in. I honestly think AltaVista results circa 1997 were better than what I'm getting these days. In fact, they most certainly were.
They've also abandoned their namesake "googol" ... they don't seem to care about http, newsgroups or many other things. They should probably rename themselves to "Around300orso"
Google ceased providing functional tools for technical people a while ago. There's some decent open space for another firm to come along (like DDG or perhaps the Microsoft Renaissance) and just snatch the technical user, create compelling products for just that, and own it.
I'd even pay probably $250/yr or so for such a no-bullshit resource stack - a high quality searchable answers-oriented quick-to-use, quick-to-read technical reference without a bunch of wrong information or concrete answers buried in pages of theory. Heck, I'd probably consider dropping $2,000, that sounds amazing.
The thing is: Most searches aren't even that complicated. Whatever "magic" (read: extra processing power) Google uses is mostly helping for extreme niche cases. The rest they seem to do, nowadays, is "editorializing" results, pushing popular websites before more relevant ones, displaying results from some internal database, etc.
As far as I can tell, 6 years ago the reason was that AWS didn't support IPv6. It has for 2.5 years by now though.
I doubt they have any IP-dependent code, since their main selling point is not tracking anything.
I've actually run into Gabriel a few times online and had a bit of conversation with him. Nice guy. Can't really say I've chatted it up with Eric Schmidt...
Sometimes Google seems to be better at guessing the context especially if you search for C and some other term that has a bunch of different meanings.
I find it pretty funny that I automatically add !g when the Google results suck as well.
This makes me sad, because I will often have very exact queries like a part number and there is content out there on the internet that I cannot locate with search.
Entirely fictional example: "hacker news comment bold"
And seemingly every time, the specific thing is crossed out. At that point Google search is worse than useless.
I have a few hypotheses:
1) They are trying to turn it into a general directory in their continued war on DNS.
2) The web has got so big that their technology just isn't good enough anymore.
3) The users they care about are no longer the people (such as technical or, per the sibling comment, academic users) who are trying to find specific information. They prefer to focus on the mass market.
Our strongest defence as consumers is to resist the further slide into monopoly.
This is the only explanation that makes sense. The vast majority of people probably enter questions and queries in natural language, so a purely keyword-based search is out the window. Why a list of stop words wouldn't cover that I don't know, but let's just assume it doesn't. The second point is that most people also probably hardly ever look for precise information; they ask ambiguous stuff that's just hard to answer properly with traditional approaches.
So my guess is that at some point they realized that most queries are of that kind, briefly discussed keeping two search engines in parallel, and finally went with "screw those stupid devs, we're freaking Google!"
And gods help you if you are looking for something both old and by keyword, because Google sure won't.
Unfortunately, DDG has a lot of the same issues - which is a shame, as if they just used Google's algorithm of a decade ago I'd switch in a heartbeat.
Google say they removed it because most searches that included a plus operator were using it wrong.
> In most cases, Google’s algorithms make things better for our users - but in some rare cases, we don’t find what you were looking for. In the past, we provided users with the “+” operator to help you search for specific terms. However, we found that users typed the “+” operator in less than half a percent of all searches, and two thirds of the time, it was used incorrectly. A couple of weeks ago we removed the “+” operator, encouraging the use of the double quotes, which are more likely to be used correctly.
I'd be interested to know how often the "quotes" are used correctly.
I end up having to put everything in double quotes just to get results that I'd expect.
> Your "Red Room" query is hard in a couple ways. First, it looks like that root page used to have the words on the page: "The Red Room Doors open 6pm $18 Pre-Booked" And it's also tough because it looks like the name changed to the "2nd Degree Bar & Grill" at some point. The fact that you can type [red room] and get a suggestion for [red room st. lucia] is actually pretty helpful in my book because it leads you to the answer that the name changed.
At first, I thought my search skills had evaporated, but I realized Google has become little more than a digital flea market. Even Yandex is better in many regards than Google.
Why not just search for what I typed? It has gone backwards.
Still has its problems, but better than the default search mode.
As for the sentence search, it appears to me that they are using stemmed shingles. For that, you reduce words to their base form (houses -> house), then chop the page into every run of 3 consecutive words and hash those; the hashed number for 3 words is called one shingle. Again, that's a common technique for content-similarity detection, so Google kind of has to do that anyway to spot verbatim copycats. One can then also use this for sentence search by searching for all the shingles in the search query, but that introduces ambiguity. For example, "cats singing in front of the house" would likely have the same shingles as "cat sang about front before houses", as only "cat sing front house" is being indexed. The issue here is that "front" can be used both in a positional sense and for a military formation.
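A rough sketch of that stemmed-shingle idea; the lemma table and stop-word list here are tiny hand-made stand-ins for real components (e.g. a Porter-style stemmer):

    # Rough sketch of stemmed shingles: map words to a base form, drop stop
    # words, then hash every run of 3 consecutive remaining words.
    # LEMMAS and STOP are tiny hand-made stand-ins for real components.
    LEMMAS = {"cats": "cat", "singing": "sing", "sang": "sing",
              "houses": "house", "housing": "house"}
    STOP = {"in", "of", "the", "a", "an", "to", "about", "before"}

    def shingles(text, k=3):
        words = [LEMMAS.get(w, w) for w in text.lower().split() if w not in STOP]
        return {hash(" ".join(words[i:i + k])) for i in range(len(words) - k + 1)}

    a = shingles("cats singing in front of the house")
    b = shingles("cat sang about front before houses")
    print(a == b)   # True: both reduce to shingles over "cat sing front house"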
To me, both of these look like side effects of them reducing operational costs. I mean, who would bid marketing dollars on "ISpatialAudioClient::GetMaxDynamicObjectCount"? Given that you will likely tolerate 2-3 complete search failures before trying Bing, the business value of completing such a query correctly is very close to 0.
Microsoft, on the other hand, has a stronger interest in people finding MS APIs, which is why you will find this result on Bing but not on Google.
Whenever I see someone using DDG as their default, they always turn to Google for more complex queries.
It might be that Google has lost focus on a few search niches where results used to be better.
It collates hundreds of sources, many with obvious or subtle bias. If an aggregator puts bias in, it can't provide unbiased results out...