An AI system has become the third-most important signal to Google search results (bloomberg.com)
110 points by adventured on Oct 26, 2015 | 71 comments



So PageRank was an algorithm, and RankBrain is an AI? I'd love to understand a bit more about what makes them different from each other. I don't feel as though the search results have become any better. In fact, I've been frustrated by how many words it leaves out without telling me, or how it says "here are the results for your search" when in all honesty it had zero results for my search.


PageRank and RankBrain both seem to be (complicated) features to the actual core search algorithm, which is some unspecified machine learned model [1].

PageRank, for example, can't be the whole search algorithm since it's not even query-dependent. It would just put the same most authoritative document at the top for every query.
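The query-independence point can be sketched directly. Below is a minimal PageRank power iteration over a hypothetical four-page link graph; note that no query appears anywhere, so each page's score is fixed before anyone searches:

```python
# Minimal PageRank power iteration over a hypothetical four-page link graph.
# The query never appears: the authority score per page is precomputed.

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]} -> {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank over all pages
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outgoing:
                    new[q] += damping * rank[p] / len(outgoing)
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
# "c" ends up most authoritative -- for every possible query.
```

Whatever you search for, "c" tops the list, which is why PageRank can only ever be one signal among many.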

Similarly, RankBrain doesn't sound like it could be the whole algorithm. It sounds like it is just a text understanding model (which wouldn't know anything about, e.g., the global reputation of the document, the popularity of the document, etc.). In fact, the article explicitly confirms this:

>RankBrain has become the third-most important signal contributing to the result of a search query, he said.

I'd guess some kind of composite content quality signal and some kind of composite popularity signal sit at #1 and #2.

[1] - I don't have any insider knowledge about how Google works, but this article suggests that they were getting ready to switch from a hand-tuned model to a machine learned one in 2008: http://anand.typepad.com/datawocky/2008/05/are-human-experts...


Hiya, author here - yes, you are correct. RankBrain is one of hundreds of distinct signals that go into the results page. It just happens to be one with a great deal of influence, which I think demonstrates the surprising generality/adaptability of this type of approach for natural language processing and interpretation.


Well, except that Google's results have gotten worse and worse over the past few years (I'd pinpoint the start around 2013), to the point that by now often only one or two entries on the results page are relevant.

Even if I search for a very specific query, Google will mostly give me results that ignore several parts of it, and at best one or two that match what I actually asked for.

Even DuckDuckGo often manages to give better results at this point.

EDIT: Just found verbatim search; at least that gives results almost equal to DuckDuckGo's. Still not as good, but it’s okay.


> Just found verbatim search

Even Verbatim search will happily lop off search words.

Tonight I was searching for "raf ballykelly vulcan" and after a few puzzling results realised that Ballykelly had been discarded. Since it was the airfield in question, the results were useless.


It is exactly this kind of issue that I am talking about. Even DuckDuckGo manages this better than Google. It’s seriously an issue.

And with Google Maps already being worse than Bing, OSM, or even fucking Apple Maps, all the Google products I actively used are now useless to me.


Either they've fixed it or I'm not seeing the problem - I get a bunch of pages that include ballykelly and vulcan. Forcing "ballykelly" didn't seem to make any difference.


If you'd like to share the query I'd be really interested to debug.


I don’t have one specific query – it is EVERY query: Google will ignore 90%+ of the words I entered, show me around five related results and otherwise completely bullshit ones, with no way to see more.


Use [Search Tools] => [All Results] => [Verbatim]. This is a known issue for people who have been using Google Search for years; there are many discussions about it, like: https://www.webmasterworld.com/google/4744658.htm


Even that doesn’t solve all issues – it still shows results that are irrelevant to my query and do not even match the query.

It should not be hard to convert a query into a regex over content, title, and URL and apply it to the index.
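What the parent describes could be sketched like this (a naive version, using the ".gov.sa" query from later in the thread; a real engine obviously can't afford to regex-scan every document, but the strictness is the point):

```python
import re

def query_to_pattern(query):
    """Build a regex requiring every query term to appear, in any order."""
    terms = query.split()
    # One lookahead per term; re.escape keeps tokens like ".gov.sa" literal.
    lookaheads = "".join(r"(?={}{})".format(".*", re.escape(t)) for t in terms)
    return re.compile(lookaheads, re.IGNORECASE | re.DOTALL)

pattern = query_to_pattern("TLD .gov.sa")
print(bool(pattern.search("The .gov.sa TLD is run by Saudi NIC")))  # True
print(bool(pattern.search("Download Python books here")))           # False
```

A document either contains every term or it is excluded, which is exactly the behaviour the commenters in this thread are asking for.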


That sounds really frustrating. Any examples you can give would really help to debug.


Often the issue occurred when searching for technical things. For example, I copy-pasted a Python exception, and Google would tell me where to download Python, which Python books to get, and that a python is a type of snake.

It’s incredibly frustrating, but if I, by accident, use Google again and see the issue, I’ll tell you.


Here, a recent case, where Google tries to completely ignore half of my query: https://www.google.de/search?q=TLD+.gov.sa


Verbatim search is what you want: On the search results page, choose "Search tools -> All results -> Verbatim"


Thank you for this. I felt like my search terms were being "conveniently" omitted for the longest time just to show me more results instead of more refined results.


Is there a way to set this permanently? I couldn't find anything.


Perform a search, then change the setting to Verbatim. Right-click the search bar and choose "Add as Keyword" or "Add as Search Engine" (depending on whether you are using Firefox or Chrome).

Then give it a keyword, I use "vg" for "Verbatim Google".

Then in the Navbar I can type "vg foo bar", which will search "foo bar" verbatim. Closest thing to permanent once you get used to using keywords ( which are awesome by the way :D )


I didn't see the Add as Search Engine option in Chrome Dev.

Specifically, the URL you want to use is: https://www.google.com/search?q=%s&tbs=li:1

Good tip. I use a similar one with site:en.wikipedia.org and I'm Feeling Lucky to quickly jump to Wikipedia articles.


Of course, thanks :)


RankBrain is at the front, query-interpretation end. PageRank is at the back end, for picking pages which reasonably match the query.

Does RankBrain have an intermediate form which shows its interpretation of the query? Wolfram Alpha does, and will show an explanation of how it interpreted the query. (It has to, because it may give you a numeric answer.) It would be useful for Google to tell you what question they think you are asking.


Marketing. AI is the new word for algorithm. I think most people probably feel that search quality is going down; unsurprising given Google's monopoly position.


That's going too far. AI always uses algorithms. What constituted AI has varied quite a bit, but we usually allowed the term if it involved machine learning or decision-making based on heuristics, especially if it was adaptive over time. AIs were also usually more resource-intensive (slower) than regular algorithms, which kept them out of use in many places until the AI field caught up with the requirements.

PageRank was a simple, stupid algorithm that produced incredibly smart results, the exact kind of thing that sees widespread deployment at a startup. The description of this AI sounds more like an AI tool in general. It would've been much harder for Google to have started with this; the computers alone would've been prohibitive. So, we can call it an AI.


Are you saying that because Google is the leader, people perceive a drop in quality that may not be measurable? I suspect not but hope yes.


My guess is RankBrain is the personalization piece that operates on user data (location, history, etc.) while PageRank is the search index piece that operates on web data (web-pages, trends, etc.).


Hiya. For those interested, the RankBrain approach of converting words and phrases into vectors ties directly to Geoff Hinton's more ambitious ideas about AI. He speaks about it a bit from 32 mins in, in this video from the Royal Society in London earlier this year.

Geoff Hinton: "If we can convert a sentence into a vector that captures the meaning of the sentence, then Google can do much better searches; they can search based on what is being said in a document. Also, if you can convert each sentence in a document into a vector, you can then take that sequence of vectors and try and model why you get this vector after you get these vectors. That's called reasoning, natural reasoning, and that was kind of the core of good old-fashioned AI, and something they could never do, because natural reasoning is a complicated business and logic isn't a very good model of it. Here we can say: well, look, if we can read every English document on the web and turn each sentence into a thought vector, we've got plenty of data for training a system that can reason like people do. Now, you might not want to reason like people do on the web, but at least we can see what they would think."

https://www.youtube.com/watch?v=IcOMKXAw5VA


"In the few months it has been deployed, RankBrain has become the third-most important signal contributing to the result of a search query, he said."

Do we know what is the most and the second-most important signal?


Hiya. Author here. They wouldn't tell me. Asked a lot.


Presumably it's: 1. personal data, or some amalgamation of metadata thereof; 2. PageRank; 3. RankBrain.


Personal data and such would almost certainly be below PageRank.


Not if that includes the geographical location of the query...


Personal data is almost certainly in the top two. Stating that, however, would raise privacy concerns from users.


> Personal data is almost certainly in the top two

Do you experience a drastic drop in result quality when you use a public computer?


I would guess relevance (how closely text on the page matches the query) and backlinks (how many popular websites link to the page).

Or, if relevance doesn't count as a signal, maybe it's social sharing.


Google users feed Google with training data every day, just by clicking on links and refining search queries. They basically tell Google what they were looking for. I have no idea how Google works, but I'd guess the user data plays a big role in ranking the search results.


The article uses this query as a motivating example:

"What’s the title of the consumer at the highest level of a food chain?"

But the results page (for me) does not contain the words 'apex predator'. The top result is the wikipedia page for "Consumer (food chain)", which does contain that term.

It would have been very cool if the AI could have identified the concept described by the query. But it didn't. It just found a very relevant page for three strings in the query.

The journalist doesn't report on the results of this example. Who came up with it and why?


Google will get to apex predator if I search

"what predator is at the top of the food chain?" or "what type of animal is at the top of the food chain?"

but it fails to do so if I ask "what consumer is at the top of the food chain?"

It seems like "consumer" is too ambiguous to work in this example.


This is presented as if AI is a new thing to Google. The truth is that PageRank is based on a classic neural network: the pages are the nodes, the links are the weights, and we are the feedback. It has been in training since at least 1996 ;)


I don't think that anybody considers PageRank to be a classic neural network. It's a recursively defined centrality algorithm. It has a graph structure; beyond that it's not really a neural network.


I know neural networks are fashionable these days, but come on.


It's not that it's a secret. Consider this quote from early 2000: Reporter: "Why would we need another search engine? Alta Vista is quite good enough." - Larry Page: "We're not building a search engine. We're building an A.I."


And neural networks have been in fashion for a very long time ;)


By inference, this looks to be "just" integrating a deep semantic embedding (presumably neural network based) of the individual webpages as a signal into their existing ranking framework.

AI is a stretch, but it is cool.


Tried to get at this in the article (author here), but it's using vectors (think word2vec and seq2seq) to distill meaning and embed words and phrases into a single space that the computer can then reason over. From my understanding this is all done on the query end of things, so it's basically letting them do better natural language processing. It also ties into Hinton's work on "Thought Vectors".
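The core idea can be sketched in a few lines: words become dense vectors, and nearby vectors mean related words, so a query term can match a page it shares no exact string with. The tiny 3-d vectors below are invented purely for illustration; real word2vec embeddings have hundreds of dimensions and are learned from large corpora:

```python
import math

# Hypothetical 3-d embeddings, hand-written for illustration only.
# Real word2vec vectors are learned, not assigned.
vectors = {
    "predator": [0.9, 0.8, 0.1],
    "consumer": [0.7, 0.9, 0.2],
    "download": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, ~0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "predator" and "consumer" land near each other; "download" does not.
# This is how a vector model can relate query words to page words
# without an exact string match.
print(cosine(vectors["predator"], vectors["consumer"]))
print(cosine(vectors["predator"], vectors["download"]))
```

That proximity is what lets a query about "consumers at the top of the food chain" land on a page about predators, per the article's motivating example.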


As some feedback to the author, the following sentence doesn't make sense: "Artificial intelligence sits at the extreme end of machine learning..." Machine Learning is a subfield of AI.


Thanks for the feedback. As a general news organization we struggle with definitions/scoping for stuff within AI as it's such a new area and we try to write for a broad, albeit informed, readership. I'll keep this in mind for future articles where we classify the two.


Natural language processing, inference, and machine learning are all AI-related techniques. Calling a system that uses all of them effectively an AI is a fair label.


> deep semantic embedding (presumably neural network based)

Probably not. My bet is that it is word2vec based

https://code.google.com/p/word2vec/


Hiya. They wouldn't explicitly confirm that it is word2vec, but everything we discussed indicated it's likely doing something roughly equivalent to word2vec, and is also doing similar conversions for sequences which is likely connected to Sequence to Sequence learning (PDF: http://papers.nips.cc/paper/5346-sequence-to-sequence-learni...). It also links to Geoff Hinton's stuff on Thought Vectors which implicitly involves word2vec.


word2vec is, broadly, a neural-network-based embedding.


So any ideas about how RankBrain works? I suppose it is a neural network (or a bunch of them). But what are the input and output quantities?


Google seems to be getting worse with technical queries. I waste many queries just trying to craft one that gets the results I am looking for. This is especially true for keywords where case is important; Google seems to just neglect case as a signal.


Yeah, I also feel that using "" to force Google to find an exact match has simply been ignored in most requests for the last few months.


Do you remember any examples I could debug? You can check your search history here: https://history.google.com/history/ if you have it turned on.


One of the most annoying for code searches is that the engine seems to ignore punctuation, even in quotations. For example, searching for

"the quick brown fox; jumped over the lazy dog"

returns hits for the version without the semicolon.


I think that "A*" (the algorithm) has only recently started to work, even without quotation marks.


I don't appear to have search history enabled, but I can shoot you an email as soon as I stumble upon another one.


"" always works for me, but the annoying part is how often I have to use it these days, because Google so often second-guesses my queries.

I suppose for the average person this "fuzzy searching" is an improvement, but I wish I had the ability to flip a switch somewhere that says: "Please only use exactly the words I gave you, always."


I've noticed this as well, and I think it's because 1) technical queries have a different optimal algorithm than non-technical queries and 2) as Google's audience grows, the proportion of technical queries shrinks.

For a technical query, you essentially want something like PageRank-weighted grep, which is, of course, what you used to get. All of the fancy NLP/fuzzy-matching stuff that Google has been adding recently, while helpful for all sorts of other things, is going to be a detriment for technical queries.
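A "PageRank-weighted grep" as described above might look like the sketch below. The documents, URLs, and authority scores are all made up for illustration; the point is that matching is strict (every term must literally appear) and only the ordering uses a precomputed authority signal:

```python
# Hypothetical "PageRank-weighted grep": keep only documents containing
# every query term verbatim, then order by a precomputed authority score.

docs = {
    "so.example/q1": "TypeError: 'NoneType' object is not iterable",
    "python.example/tutorial": "Download Python and learn iterables",
    "blog.example/none": "Fixing TypeError: 'NoneType' object is not iterable fast",
}
authority = {
    "so.example/q1": 0.9,
    "python.example/tutorial": 0.8,
    "blog.example/none": 0.3,
}

def weighted_grep(query, docs, authority):
    terms = query.lower().split()
    # Strict match: no synonyms, no fuzzing, no dropped terms.
    hits = [url for url, text in docs.items()
            if all(t in text.lower() for t in terms)]
    return sorted(hits, key=lambda url: authority[url], reverse=True)

print(weighted_grep("nonetype iterable", docs, authority))
# The tutorial page is excluded despite its high authority,
# because it doesn't contain "nonetype".
```

Exactly the behaviour you want for an error message, and exactly what fuzzy NLP matching breaks.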

When you're doing something like googling an error message or a code snippet, you're basically querying machine-generated speech, and much of Google's recent work has been on improving queries of human-generated speech.

It seems like it should be simple to implement a little "technical query" checkbox...


It really isn't as simple as adding a checkbox. Just think about the design implications...


For error messages I think it helps if you put them in quotes, to mean "these words are needed in that sequence".


Have you tried "Verbatim" mode? It's under the Search Tools drop-down along the top menu bar (where News, Images etc. are), then under "All results". It basically seems to minimise the amount of clever business Google does with your search (synonyms, "did you mean?", etc.), and so can be quite useful for technical searches.


If I am logged in, I get so-so results for the first few queries on a topic, and then really good after that once Google realizes what I'm actually looking for (e.g. it will learn that when I am looking for "Unity" it knows I mean "Unity3D", not "Unity Ubuntu").

Of course, being logged in all the time makes me uncomfortable...


Sometimes this feature works against you. E.g., I need to do X in Java, and maybe someone already has an implementation I can look at, so I search for "how to do X in java". Turns out there aren't really any good solutions, so I broaden the search, figuring I could learn from an implementation in any language, but now "how to do X" is just filled with the same useless Java-focused results as my first search.


Totally agree with this, I've come up against this problem in the past week. I've found that I get the same list of results regardless of how I phrase a search term involving two JS frameworks. It's like it only matches those keywords and ignores the rest of the words.


I don't think it's just technical queries. I have to look up a bunch of tech and nsfw stuff, and when google tries to think for me, it ends up going to shit. I usually end up just "searching" "like" "this". Even then, it tends to ignore words in quotes |:(


If you can remember any examples, I'd love to debug.


Another one from today:

"does ctime always change"

Google returns:

Showing results for "does time always change"

Search instead for "does ctime always change"

None of the results are relevant because Google thinks I made a typo.

If I then search instead for "does ctime always change" the top result is:

"Does directory mtime always change when a new file ..."

Google has fuzzed ctime to match mtime, which is not the same thing.

My intention was to see whether ctime would always be updated if there was any data or metadata change to a file or folder, or if there would be some edge cases where it would not change.
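The distinction the query was after can be checked directly on a POSIX system, where a metadata-only change such as chmod should bump ctime but leave mtime alone (a quick sketch; Windows gives st_ctime a different meaning, and coarse filesystem timestamps are why the sleep is there):

```python
import os
import tempfile
import time

# On POSIX filesystems, chmod (a metadata change) updates ctime
# but not mtime; writing data would update both.
fd, path = tempfile.mkstemp()
os.close(fd)
before = os.stat(path)

time.sleep(1.1)  # in case the filesystem has 1-second timestamp granularity
os.chmod(path, 0o644)  # metadata-only change
after = os.stat(path)

print(after.st_ctime > before.st_ctime)   # ctime moved
print(after.st_mtime == before.st_mtime)  # mtime did not

os.remove(path)
```

Whether every metadata change behaves this way across filesystems is exactly the edge-case question the original search was trying to answer.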


Thanks. Those are great examples!


Thanks, here is one I tried today:

"ALPN websocket handshake chrome"

The first 3 results are all acknowledged by Google to be "Missing: ALPN". I get this often with queries, where Google returns results which are missing the keywords I am most interested in.

My intention was to see if there was any progress with WebSocket handshakes over ALPN (to save a roundtrip).

From the descriptions of the results it is also not clear if any of the top 10 matches are relevant.


Case and (as someone mentions below) punctuation have never been part of Google search. You could search with those in Code Search, but sadly that's long gone :(



