This is a really good overview. If there's one part I want to highlight, it's that you should expect to spend a lot of time fine-tuning your ranking function for your particular product & corpus. The default ElasticSearch ranking function kinda sucks. It was changed in ES 5.0 to Okapi BM25, which is the current academic state-of-the-art in non-machine-learned ranking functions. However, search is one field where the current academic state-of-the-art is at least a couple decades behind where things are in industry. When you use a service with good search that just works, chances are that there's been a lot of engineer hours devoted to identifying exactly which signals are most useful in your corpus, how they relate to subjective evaluations of relevance, and how to clean them up so that noise doesn't dominate the signal.
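For anyone who hasn't looked at what BM25 actually computes, here is a minimal sketch in Python. The `k1=1.2` and `b=0.75` values are commonly used defaults, the IDF is a common non-negative variant, and the function name and toy documents are made up for illustration. It is a baseline to tune from, not a tuned ranking function.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # IDF variant that stays non-negative even for very common terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation (k1) and document-length normalization (b).
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, invented for illustration.
docs = [
    "factorio belt balancer design".split(),
    "factorio mod list".split(),
    "train signals guide".split(),
]
scores = bm25_scores(["factorio", "belt"], docs)
```

The first document matches both query terms and scores highest, and the third matches neither and scores zero. Everything a real product needs (field weights, freshness, popularity, click data) sits on top of something like this.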
These signals are tightly tied to the corpus you're working with. One slightly unpleasant surprise striking out on my own was just how little the stuff I learned about websearch is relevant to working with different corpora (though I guess I should've predicted this, having seen how many different teams across Google Search operated and how all their signals differed from core websearch). I'll reiterate the article's point about process and method being more important than any specific function or algorithm; the most useful stuff I learned at Google was actually a process for taking a vague domain where I'm not sure what's out there, and then learning how to turn that into a system for getting useful information out of it.
And then once you've got users, you probably want to feed usage data back into it through some sort of machine learning. In that order, though; with all the hotness around AI, it's really tempting to think "Oh, I'll just read the latest LtR papers, Mechanical Turk a bunch of query evaluations, and feed the data into the algorithm", but without a deeper understanding of the particular corpus you're working with and how users want to access it, it's unlikely an initial LtR system will perform well enough to attract enough users to bootstrap the system.
I imagine at a place like Google, your "corpus" is just about everything under the sun, as are your queries. (i.e. less chance for a subject-specific tuning)
What happens then?
You accept the fact that a good search engine is inevitably going to be a multi-layered beast that requires constant effort to improve, with heuristics, machine learning, magic constants, special lists of words, and regular expressions.
After that, it's just a matter of picking a commercially useful objective function ("instant search", "query autocomplete", "maximize user satisfaction") and pointing large teams of well-resourced PhDs at it. Probably a whole weekend at least.
As for what happens when these become inadequate for the queries users ask...well, buy Metaweb and rebrand it as the Knowledge Graph. ;-)
Still under development, but drop me an e-mail if you (or anyone else reading this) are interested in beta-testing. I'm starting out e-mail first, so the initial UI is just that you get a daily digest of links & snippets to threads related to your interests.
Without trying to get too much off track, I gotta say that it would be so nice to be able to use the original theme data used in Wave (very obviously sans branding). Wave In A Box is... blech, in terms of design, I have to admit.
For example, my current video game obsession is Factorio, which has about 200,000 users and will likely never appear in a major news story, because it's un-economical for a news outlet to write a story that has an audience of at most 200,000. Despite this, there are 4 active forums dedicated to it, which generate a few hours worth of interesting reading each day. This content is a lot more interesting to me than anything that appears on Google News, but it's a lot less interesting to the millions of other readers of Google News. But the beauty of computers is that we can match content up to precisely the users that care about it.
In theory. I often wonder about pathologically impossible-to-query statistics like "who is snoring the loudest right now?" "show me a global map of everyone waking up right now (and a graph tracking how many people are waking up per second); and provide a second-to-second pinpoint of the person who feels the most refreshed." "what is the single most relevant set of webpages for this highly obscure, domain-specific query?" etc.
Heh, it sometimes takes me actual effort to calm myself down about the fact that, beyond a certain threshold, we literally cannot collect enough entropy (data) to direct a database to the most relevant results - and that we similarly won't connect users to the data they're most interested in, beyond that point.
I do totally get what you mean though.
If you ask a "search expert" today what he is trying to fix, he will say something related to scaling.
If you asked an expert in the '80s or '90s, they would talk about query complexity and NLP, e.g. "Who were the four semi-finalists of last year's Wimbledon?", and you would get back 4 names.
Today you will get back a result saying 50 million pages were found in 1 second. The page may or may not contain the 4 names if you wade through 17 popups. Nobody questions how brainless this is.
Edit: But, I agree, we don't often see any good HN posts or papers about search-quality. Perhaps a case of trade secrets?
Surprisingly enough, with those concepts, Tf-Idf was quite good at extracting keywords from documents, which allowed us to build document descriptor tables that can eventually be used for document search. We also built a small prototype that allowed us to retrieve the specific areas of documents where the searched concepts were more "valuable" than in other parts of the document.
It is a very interesting subarea of AI/NLP which unfortunately doesn't seem to attract much interest.
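To make the Tf-Idf keyword extraction mentioned above concrete, here is a minimal sketch in Python. It works on single tokens rather than on the extracted concepts/multiwords from the research, and the function name, the tie-breaking rule and the toy documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_n=2):
    """Return the `top_n` highest Tf-Idf terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    keywords = []
    for d in docs:
        tf = Counter(d)
        # Relative term frequency times log inverse document frequency.
        weights = {t: (c / len(d)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        # Sort by descending weight; break ties alphabetically for stable output.
        ranked = sorted(weights, key=lambda t: (-weights[t], t))
        keywords.append(ranked[:top_n])
    return keywords

# Toy corpus, invented for illustration.
docs = [
    "belt belt balancer the game".split(),
    "the train game signals train".split(),
    "the mod list".split(),
]
keywords = tfidf_keywords(docs)
```

A term that appears in every document (here "the") gets a weight of zero, so it never outranks a term that is frequent in only one document.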
Since the article also talks a bit about Wikipedia dumps datasets, here's a tool that I've created to build textual corpora: https://github.com/joaoventura/WikiCorpusExtractor
 - http://www.sciencedirect.com/science/article/pii/S1877050912...
 - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
 - https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...
 - https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...
This seems quite similar to LSA (Latent Semantic Analysis). In LSA, a word is represented by a vector (list / array) of values which represent the number of occurrences of the word in each document. For instance, the list [0, 5, 9, 10] for the word "aeroplane" means that "aeroplane" occurs zero times in D1 (document #1), 5 times in D2, etc. Unlike LSA, word2vec seems to have the vector values implicit in the neural network weights (?).
The interesting thing about these kinds of approaches is that words that share similar contexts tend to occur in the same documents. In other words, the vectors tend to have similar values. For instance, it is natural that "aeroplane" tends to occur in the same set of documents as "airport". Tf-Idf does something similar, so these vectors are mostly a data representation of single words.
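As a rough illustration of the count vectors described above, here is a small Python sketch that compares words by the cosine similarity of their per-document counts. The counts and the extra words are made up. It only shows the raw term-document vectors; full LSA additionally applies a truncated SVD to the matrix, which is left out here.

```python
import math

# Rows are words, columns are documents D1..D4 (counts invented for illustration).
term_doc = {
    "aeroplane": [0, 5, 9, 10],
    "airport":   [1, 4, 8, 9],
    "recipe":    [7, 0, 1, 0],
}

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

sim_plane_airport = cosine(term_doc["aeroplane"], term_doc["airport"])
sim_plane_recipe = cosine(term_doc["aeroplane"], term_doc["recipe"])
```

Words that occur in the same documents end up with a cosine close to 1, while words that rarely share a document end up close to 0.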
As for my research, the basis of it was to be able to quantify how "specific" a word or multiword is. The idea is that the more a word/multiword occurs in different contexts (in my case, near different words) the less specific it is. Being able to identify the specificity of words allowed me to know which words/multiwords were probably concepts and which ones weren't. Then, with these concepts I was able to use Tf-Idf to extract document keywords and later I was able to infer broad relationships between concepts (like in if two words tend to co-occur in the same documents and same parts of documents, they are somewhat related).
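A very rough sketch of the "different contexts" idea, in Python: count how many distinct words appear immediately next to each word. This is not the actual metric from the research, just a hypothetical proxy with made-up names and a toy sentence. A real measure would at least normalize for how often each word occurs.

```python
from collections import defaultdict

def neighbour_diversity(tokens):
    """Map each word to the number of distinct words seen directly beside it."""
    neighbours = defaultdict(set)
    for i, word in enumerate(tokens):
        if i > 0:
            neighbours[word].add(tokens[i - 1])   # left neighbour
        if i + 1 < len(tokens):
            neighbours[word].add(tokens[i + 1])   # right neighbour
    return {word: len(seen) for word, seen in neighbours.items()}

# Toy text, invented for illustration.
tokens = "the neural network and the search engine and the neural network".split()
diversity = neighbour_diversity(tokens)
```

Under this proxy, a higher count means the word shows up near more different words, so it is less specific. In the toy text, "the" has more distinct neighbours than "neural".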
So, how does word2vec compare to my research?
- I prefer simple statistical approaches like the one I used (which is just counting frequencies and dividing numbers, etc.) to the black-box approach of neural networks. I can tell you why my research provides good results (explained above), but if you read a follow-up paper on word2vec, this is what you can find in section 4: "Why does this produce good word representations? Good question. We don’t really know".
- Also, bag-of-words approaches like word2vec and LSA don't tend to represent multiwords (such as "President of the United States of America"), which leaves out lots of information in texts.
- Finally, I think there's more possible applications after you know which words/multiwords are informative. With word2vec and LSA you are only representing something similar to frequencies of words in documents, which mostly allows for Information Retrieval-like applications. On the other hand, one of the applications I did with the extracted concepts was to implement a prototype which allowed us to find the definition of a concept on the corpus. The idea was that the definition of a concept was the paragraph where that concept occurred many times but also used other concepts to help the first one to be defined. I don't know how you could do something like that using something like word2vec.
But as I said above, I am not that knowledgeable about word2vec, so I may be missing something.
 - https://en.wikipedia.org/wiki/Word2vec
 - https://arxiv.org/abs/1301.3781
 - https://en.wikipedia.org/wiki/Tf%E2%80%93idf
 - https://arxiv.org/pdf/1402.3722.pdf
- It’s all about the relevance models. Specifically, BM25 and TF-IDF, which are widely used by e.g. Lucene and variants, just won’t do. In particular, they only work for large enough documents anyway.
- Indexing and search algorithms and practices haven’t changed much in decades (though some novel ideas have been introduced not long ago). Lucene’s index encoding is compact and facilitates fast access, but compared to the one made available by Google, which is arguably simpler in design, it doesn’t yield more than around a 5% reduction in index size and postings list access time (according to my measurements, that is). Implementations of postings list intersections, unions and other such operations are pretty much common across IR systems as well, with little room for improvement.
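For readers who haven't seen them, the postings list operations mentioned above look roughly like this in Python. This is a minimal sketch that assumes each postings list is a sorted list of document IDs; the function names and the sample lists are made up, and real engines add compression and skip pointers on top.

```python
def intersect(a, b):
    """AND: document IDs present in both sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def union(a, b):
    """OR: document IDs present in either sorted postings list, without duplicates."""
    i = j = 0
    out = []
    while i < len(a) or j < len(b):
        if j == len(b) or (i < len(a) and a[i] < b[j]):
            out.append(a[i])
            i += 1
        elif i == len(a) or b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:  # same document ID in both lists
            out.append(a[i])
            i += 1
            j += 1
    return out

# Hypothetical postings lists for two terms.
postings = {"search": [1, 3, 5, 8], "ranking": [3, 4, 8, 9]}
both = intersect(postings["search"], postings["ranking"])
either = union(postings["search"], postings["ranking"])
```

Both operations are a single linear merge over the two lists, which is a large part of why implementations look so alike across systems.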
To get great results, you need really great relevance models(1), a great query rewrite system(2), and, because query rewrites usually expand a query to include multiple disjunctions (OR terms and phrases), your search engine needs to be particularly efficient at handling those(3).
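To illustrate why rewrites put pressure on disjunction handling, here is a toy Python sketch. The synonym table, the index contents and the function names are all hypothetical. A real rewrite system is far more involved than a lookup table.

```python
# Hypothetical synonym table used by the rewriter.
SYNONYMS = {"laptop": ["notebook"], "cheap": ["budget", "affordable"]}

def rewrite(query):
    """Expand each query term into an OR-group of the term plus its synonyms."""
    return [[term] + SYNONYMS.get(term, []) for term in query.split()]

def evaluate(groups, index):
    """AND together the groups; within a group, OR together the terms."""
    result = None
    for group in groups:
        matches = set()
        for term in group:
            matches |= index.get(term, set())
        result = matches if result is None else result & matches
    return sorted(result or [])

# Hypothetical inverted index: term -> set of document IDs.
index = {"cheap": {1}, "budget": {2, 3}, "laptop": {1, 2}, "notebook": {3, 4}}

plain = evaluate([[t] for t in "cheap laptop".split()], index)
expanded = evaluate(rewrite("cheap laptop"), index)
```

The two-term query becomes five term lookups after expansion and matches three documents instead of one, so the OR work grows with every synonym the rewriter adds.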
You also need to care for spelling suggestions and personalisation/content biases and factors, but those are secondary concerns.
It's a nice idea, flagging part of the text with coarse semantic meaning. However, I find I don't trust the writer's judgement to decide what is important for me; I must read the whole thing anyway.
Another recommended book that was not mentioned in the post: Search Engines: Information Retrieval in Practice https://www.amazon.com/Search-Engines-Information-Retrieval-...
One thing I'd add is a book called "Relevant Search" - I have a hobby project to search talks, and the book helped me out a ton (https://www.findlectures.com).
The former is basically a solved problem. Lucene/ElasticSearch and Google are using basically the same techniques, and you can read about them in Managing Gigabytes, which was first published over 2 decades ago. Google may be a generation or so ahead - they were working on a new system to take full advantage of SSDs (which turn out to be very good for search, because it's a very read-heavy workload) when I left, and I don't really know the details of it. But ElasticSearch is a perfectly adequate retrieval system, and it does basically the same stuff that Google's systems did circa 2013, and even does some stuff better than Google.
The real interesting work in search is in ranking functions, and this is where nobody comes close to Google. Some of this, as other commenters note, is because Google has more data than anyone else. Some of it is just because there've been more man-hours poured into it. IMHO, it's pretty doubtful that an open-source project could attract that sort of focused knowledge-work (trust me; it's pretty laborious) when Google will pay half a mil per year for skilled information-retrieval Ph.Ds.
That's a bit of a stretch :) The high-level architecture is quite mature and stable, but there's still a lot of research, both in academia and industry, on the data structures to represent indexes, on query execution (see all the work on top-k retrieval), and distributed search systems (for example query-dependent load balancing, novel sharding methods).
Google sits on more interaction data than anyone and a 100bn gold mine and reinvests a significant amount of money back into improving Search, which is not a solved problem, and your question is why a few hobbyists haven't recreated it?
Personally, for most needs, after having learned the ins and outs, I still have a soft spot for Sphinx, so I was happy to see an honorable mention in there. It can scale really cheaply and is a tank that never has downtime. It is closer to the metal, but if you look at how Craigslist does it, you can still do the fancy scaling things.