Hacker News new | comments | show | ask | jobs | submit login
What every software engineer should know about search (medium.com)
368 points by forwidur 13 days ago | hide | past | web | 55 comments | favorite

Ex-Google search engineer here, now using hosted ElasticSearch extensively in my startup.

This is a really good overview. If there's one part I want to highlight, it's that you should expect to spend a lot of time fine-tuning your ranking function for your particular product & corpus. The default ElasticSearch ranking function kinda sucks. It was changed in ES 5.0 to Okapi BM25, which is the current academic state-of-the-art in non-machine-learned ranking functions. However, search is one field where the current academic state-of-the-art is at least a couple decades behind where things are in industry. When you use a service with good search that just works, chances are that there's been a lot of engineer hours devoted to identifying exactly which signals are most useful in your corpus, how they relate to subjective evaluations of relevance, and how to clean them up so that noise doesn't dominate the signal.

These signals are tightly tied to the corpus you're working with. One slightly unpleasant surprise striking out on my own was just how little the stuff I learned about websearch is relevant to working with different corpora (though I guess I should've predicted this, having seen how many different teams across Google Search operated and how all their signals differed from core websearch). I'll reiterate the article's point about process and method being more important than any specific function or algorithm; the most useful stuff I learned at Google was actually a process for taking a vague domain where I'm not sure what's out there, and then learning how to turn that into a system for getting useful information out of it.

And then once you've got users, you probably want to feed usage data back into it through some sort of machine learning. In that order, though; with all the hotness around AI, it's really tempting to think "Oh, I'll just read the latest LtR papers, Mechanical Turk a bunch of query evaluations, and feed the data into the algorithm", but without a deeper understanding of the particular corpus you're working with and how users want to access it, it's unlikely an initial LtR system will perform well enough to attract enough users to bootstrap the system.

> When you use a service with good search that just works, chances are that there's been a lot of engineer hours devoted to identifying exactly which signals are most useful in your corpus

I imagine at a place like Google, your "corpus" is just about everything under the sun, as are your queries. (i.e. less chance for a subject-specific tuning)

What happens then?

> I imagine at a place like Google, your "corpus" is just about everything under the sun, as are your queries. > What happens then?

You accept the fact that a good search engine is inevitably going to be a multi-layered beast that requires constant effort to improve, with heuristics, machine learning, magic constants, special lists of words, and regular expressions.

After that, it's just a matter of picking a commercially useful objective function ("instant search", "query autocomplete", "maximize user satisfaction") and pointing large teams of well-resourced PhDs at it. Probably a whole weekend at least.

Google's search departments were for years organized by the type of content being indexed, i.e. Web Search, Image Search, Video Search, Product Search, Book Search, Blog Search, Discussion Search (RIP), Google News, Scholar, etc. For a long time these distinctions were apparent in the UI (they were de-emphasized after I left, which IMHO was a very good thing), but the differences go way down through the search stack, with different indexes, different ranking functions, and in some cases different ingestion mechanisms.

As for what happens when these become inadequate for the queries users ask...well, buy Metaweb and rebrand it as the Knowledge Graph. ;-)

If you can identify categories of queries or documents that interest people, you can treat tuning that part of the search engine as if it was a new product.

what does your startup do?

Cross between an RSS-reader and a search engine - it lets you subscribe to a topic instead of a site, so that you don't have to manage hundreds of subscriptions, and if a new discussion pops up somewhere on the web that's relevant to your interests, it'll find it and let you know. As forums get boring or off-topic and new ones spring up, it adjusts automatically, so you don't need to do the "Does anyone know of other places on the web like X?" dance.

Still under development, but drop me an e-mail if you (or anyone else reading this) is interested in beta-testing. I'm starting out e-mail first, so the initial UI is just that you get a daily digest of links & snippets to threads related to your interests.

Are you considering adding sentiment analysis/clustering as one feature? You know, to let people read about different opinions, most often, they only need to read one post in each cluster.

Ah, so something like ElasticSearch percolater queries but applied to the web.

Yep, pretty much. The percolater queries are fairly handy in this instance, and one of the things we didn't really have at Google.

I just read https://stackoverflow.com/questions/21536599/what-does-perco..., and was strongly reminded of Google Wave's "search for documents that haven't been created yet, and they show up when they appear" functionality.

Without trying to get too much off track, I gotta say that it would be so nice to be able to use the original theme data used in Wave (very obviously sans branding). Wave In A Box is... blech, in terms of design, I have to admit.

Google News topic search has an RSS. How is your product better ?

Google News concentrates mostly on news sites, which have an economic incentive to generate content that emotionally grabs you and spreads virally (i.e. clickbait). I get the sense that people are getting fairly outraged at all the outrage in the MSM & popular blog-based sites - I know I am - and are fatiguing of that. I'm concentrating on forums, blogs, and other user-generated content, which has no economic incentive to reach millions of people and so can be both more authentic and more informative.

For example, my current video game obsession is Factorio, which has about 200,000 users and will likely never appear in a major news story, because it's un-economical for a news outlet to write a story that has an audience of at most 200,000. Despite this, there are 4 active forums dedicated to it, which generate a few hours worth of interesting reading each day. This content is a lot more interesting to me than anything that appears on Google News, but it's a lot less interesting to the millions of other readers of Google News. But the beauty of computers is that we can match content up to precisely the users that care about it.

> But the beauty of computers is that we can match content up to precisely the users that care about it.

In theory. I often wonder about pathologically impossible-to-query statistics like "who is snoring the loudest right now?" "show me a global map of everyone waking up right now (and a graph tracking how many people are waking up per second); and provide a second-to-second pinpoint of the person who feels the most refreshed." "what is the single most relevant set of webpages for this highly obscure, domain-specific query?" etc.

Heh, it sometimes takes me actual effort to calm myself down about the fact that, beyond a certain threshold, we literally cannot collect enough entropy (data) to direct a database to the most relevant results - and that we similarly won't connect users to the data they're most interested in, beyond that point.

I do totally get what you mean though.

Very interesting. Indeed, Google News is silent on english-content on Factorio. But then how is your product better at finding specific forums ? Furthermore, why wouldn't I just add this particular forum's RSS to my reader?

I think somewhere along the way work on search just devolved into handling millions of queries at a time. It could have been so much better.

If you ask a "search expert" today what's he is trying to fix, he will say something related to scaling.

If you asked an expert in the 80's or 90's they would talk about query complexity and NLP i.e. Who were the four semi-finalists of last years Wimbledon? And you would get back 4 names.

Today you will get back a result saying 50 million pages were found in 1 second. The page may or may not contain the 4 names if you wade through 17 popups. Nobody questions how brainless this is.

Well, I hear a lot of people complaining that the results on DuckDuckGo are still worse than on Google, even though both search-engines produce results within a second. And these are people that really want to quit using Google for privacy reasons. I never hear people complaining that a search is slow. So I do think that search-quality is where the competition is happening.

Edit: But, I agree, we don't often see any good HN posts or papers about search-quality. Perhaps a case of trade secrets?

The quality/ranking code and algorithms are definitely one of the "holy grails" at Google. That code and documentation is locked down tight, compared to the 99.9% of the rest of the codebase that's open to any engineer.

You don't hear people complaining about speed, because there's constant optimization around that. If searches start taking 2-3 seconds with better search quality. The service might be much more useful, but it will feel very much worse. And how people feel is how they judge you.

Actually quite the opposite. While we do make sure performance is good, the bulk of our time is spent dealing with relevance improvements, content wrangling, language pipelines, and lots of other cool stuff.

FWIW I do see organizations building relevance/discovery teams staffed differently than their team for "scaling out a search engine". Usually the relevance team is tasked with anything in the product that matches users to content, whether it be search, recommendations, or other 'discovery' themed functionality.

I used to work on textual Information Retrieval (which is quite related to search). One of the key contributions of my PhD thesis was a statistical method to extract concepts from texts [0].

Surprisingly enough, with those concepts, Tf-Idf [1] was quite good to extract keywords from documents which allowed us to build document descriptor tables which can eventually be used for document search [2]. We also built a small prototype that allowed us to retrieve results of specific areas of documents where the searched concepts were more "valuable" than on other parts of the document [3].

It is a very interesting subarea of AI/NLP which unfortunately doesn't seem to attract much interest.

Since the article also talks a bit about Wikipedia dumps datasets, here's a tool that I've created to build textual corpora: https://github.com/joaoventura/WikiCorpusExtractor

[0] - http://www.sciencedirect.com/science/article/pii/S1877050912...

[1] - https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[2] - https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...

[3] - https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...

Hey Joao, I've been using your WikiCorpusExtractor to work on the wiki dumps for years now for my NLP research. It has provided tremendous value to me. Thank you very much for your work and dedication !

How does your research compare to word2vec? That seems to be getting some attention

Neural networks are all the rage these days. As my knowledge on NNs is quite limited, what I could grasp from the wikipedia article [0] and the original paper [1] is that word2vec uses a vector to represent words in a n-dimensional space.

This seems quite similar to LSA (Latent Semantic Analysis). In LSA, a word is represented by a vector (list / array) of values which represent the number of occurrences of the word in a document. For instance, the list [0, 5, 9, 10] for the word "aeroplane" means than "aeroplane" occurs zero times in D1 (document #1), 5 times in D2, etc. Unlike LSA, word2vec seems to have the vectors values implicit in the neural network weights (?).

The interesting thing on these kind of approaches is that words that share similar contexts, tend to occur in the same documents. In other words, the vectors tend to have approximate values. For instance, it is natural that "aeroplane" tends to occur in the same set of documents as "aeroport".. Tf-Idf [2] does something similar, so these vectors are mostly a data representation of single words.

As for my research, the basis of it was to be able to quantify how "specific" a word or multiword is. The idea is that the more a word/multiword occurs in different contexts (in my case, near different words) the less specific it is. Being able to identify the specificity of words allowed me to know which words/multiwords were probably concepts and which ones weren't. Then, with these concepts I was able to use Tf-Idf to extract document keywords and later I was able to infer broad relationships between concepts (like in if two words tend to co-occur in the same documents and same parts of documents, they are somewhat related).

So, how does word2vec compare to my research?

- I prefer simple statistical approaches like the one I did (which is just counting frequencies and dividing numbers, etc.) than the black-box approach of neural networks. I can tell you why my research provides good results (explained above) but if you read a follow-up paper on word2vec [3] this is what you can find in section 4: "Why does this produce good word representations? Good question. We don’t really know".

- Also, bag-of-box approaches like word2vec and LSA don't tend to represent multiwords (such as "President of the United States of America", which leaves out lots of information on texts.

- Finally, I think there's more possible applications after you know which words/multiwords are informative. With word2vec and LSA you are only representing something similar to frequencies of words in documents, which mostly allows for Information Retrieval-like applications. On the other hand, one of the applications I did with the extracted concepts was to implement a prototype which allowed us to find the definition of a concept on the corpus. The idea was that the definition of a concept was the paragraph where that concept occurred many times but also used other concepts to help the first one to be defined. I don't know how you could do something like that using something like word2vec.

But as I said above, I am not that knowledgeable about word2vec, so I may miss something..

[0] - https://en.wikipedia.org/wiki/Word2vec

[1] - https://arxiv.org/abs/1301.3781

[2] - https://en.wikipedia.org/wiki/Tf%E2%80%93idf

[3] - https://arxiv.org/pdf/1402.3722.pdf

Doug Turnbull here, author of Relevant Search. I'm in love with this post. Thought I'd point out I added a few pointers that stream-of-conscioused into my head as I read this!


Very interesting article. However I really hate the floating menu and footer on the site. I browse on a small laptop and I really don't like the obscured viewing

Add this as a bookmarklet to kill the sticky menus and enjoy a full screen experience ;)

javascript:(function()%7B(function () %7Bvar i%2C elements %3D document.querySelectorAll('body *')%3Bfor (i %3D 0%3B i < elements.length%3B i%2B%2B) %7Bif (getComputedStyle(elements%5Bi%5D).position %3D%3D%3D 'fixed') %7Belements%5Bi%5D.parentNode.removeChild(elements%5Bi%5D)%3B%7D%7D%7D)()%7D)()

Seemed OK with Reader View in Firefox.

It is OK with Reader View , it's not really the point though is it?

"Seems GREAT on my 60 inch. Hope that helps!"

They are exceptionally huge

I have some experience in developing IR/search software ( https://github.com/phaistos-networks/Trinity ) and, the way I see it, it all comes down to accepting the following premises:

- It’s all about the relevance models. Specifically, BM25 and TF-IDF which are widely used by e.g Lucene and variants just won’t do. Specifically, they only word for large enough documents anyway.

- Indexing and Search algorithms and practices haven’t changed much in decades (though some novel ideas have been introduced not long ago). Lucene’s index encoding is compact and facilitates fast access, but even compared to the one made available by Google which is arguable simpler in design, doesn’t result in more than around 5% reduction in index size and postings list access time (according to my measurements that is). Posting lists intersections, unions and other such operations implementations are pretty much common across IR systems as well, with little room for improvement.

To get great results, you need really great relevance models(1), a great query rewrite system(2), and, because query rewrites usually expand a query to include multiple disjunctions (OR terms and phrases), your search engine needs to be particularly efficient at handling those(3).

You also need to care for spelling suggestions and personalisation/content biases and factors, but those are secondary concerns.

What's with the random unicode characters spread throughout this article? Several headings and sentences are prefixed oddly with an exclamation mark ("\u2757\ufe0f"), or diamond ("\U0001f537").

There's an “Emoji Legend“ near the top of the article.

It's a nice idea, flagging part of the text with coarse semantic meaning. However, I find I don't trust the writer's judgement to decide what is important for me; I must read the whole thing anyway.

I've never encountered an "Emoji Legend" before. I instinctively skip over the word "emoji" so I missed that section.

It seems that the author has chosen to annotate the page with emojis. There is a legend near the top of the page

This was a great read and incredibly helpful. As someone that has been working on a search system for three years, it is incredibly difficult. There is a ton of variables and the cost of being wrong early costs you dearly later. There is a lot in this article that resonates with me. Some stuff I know, some stuff I wish I had known, and some stuff that was new. Great read!

Great post. I'm working on another open source search engine on top of Redis (http://redisearch.io), mostly focused on index building and serving, and real-time updates of the data. The part about queries being highly varying is extremely challenging. You have to deal with simple "foo bar" queries and complex queries with intricate filtering and crazy stuff (I have a user doing an AND intersection of 17 OR unions, each of 32 terms, while checking the ordering of term offsets!). It's super fun to work on this stuff.

Another recommended book that was not mentioned in the post: Search Engines: Information Retrieval in Practice https://www.amazon.com/Search-Engines-Information-Retrieval-...

If you allow users to execute arbitrary queries, how do you protect your server resources from being totally consumed by a small subset of your users?

In terms of redis being single threaded, we put a lot of work into making the search concurrent, see here for details: https://redislabs.com/blog/making-redis-concurrent-with-modu.... Basically we rotate between long queries and short queries, making long queries run slower but not block short ones. Right now rate limiting is left to the user and not solved at the engine level.

This is about the complexity of queries (which, without knowing your system, may be unbound) - rate limiting will not help.

Yeah, I was also thinking about adding execution cap to a single query, that's not hard to add as there is a simple scheduler that checks their progress constantly. So just terminating a long running query is simple.

I think search problems are a ton of fun, because there are a lot of opportunities for creative solutions to problems.

One thing I'd add is a book called "Relevant Search" - I have a hobby project to search talks, and the book helped me out a ton (https://www.findlectures.com).

Thanks for sharing my book ;) I actually just replied to this (fantastic) post with some extra pointers I've discovered on my search projects! https://medium.com/@softwaredoug/this-is-a-fantastic-post-e9...

I was also going through the similar problem of search. How the open source has multiple products for various stuff, but for search and index creation, we just have lucene and tools on top of that like solr or elasticsearch. Not sure why we are not innovating in the search space with new algorithms, systems, open source systems/tools and so on.. Almost all search engines are on top of lucene and inverted index. Why is no one able to reverse engineer google/bing capabilities ?

Separate out the concepts of "search infrastructure" (how documents and posting lists are stored in terms of bits on disk & RAM) and "ranking functions" (how queries are matched to documents).

The former is basically a solved problem. Lucene/ElasticSearch and Google are using basically the same techniques, and you can read about them in Managing Gigabytes [1], which was first published over 2 decades ago. Google may be a generation or so ahead - they were working on a new system to take full advantage of SSDs (which turn out to be very good for search, because it's a very read-heavy workload) when I left, and I don't really know the details of it. But ElasticSearch is a perfectly adequate retrieval system, and it does basically the same stuff that Google's systems did circa 2013, and even does some stuff better than Google.

The real interesting work in search is in ranking functions, and this is where nobody comes close to Google. Some of this, as other commenters note, is because Google has more data than anyone else. Some of it is just because there've been more man-hours poured into it. IMHO, it's pretty doubtful that an open-source project could attract that sort of focused knowledge-work (trust me; it's pretty laborious) when Google will pay half a mil per year for skilled information-retrieval Ph.Ds.

[1] https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...

> The former is basically a solved problem.

That's a bit of a stretch :) The high-level architecture is quite mature and stable, but there's still a lot of research, both in academia and industry, on the data structures to represent indexes, on query execution (see all the work on top-k retrieval), and distributed search systems (for example query-dependent load balancing, novel sharding methods).

It's true that different index codecs are being designed with different tradeoffs (index size vs postings lists access time and cost), but the work is very incremental IMHO, no huge advances to speak of. Also, can you talk about the top-k challenges you mentioned? Priority queues are not optimal enough ?

Ahaha you're kidding, right?

Google sits on more interaction data than anyone and a 100bn gold mine and reinvests a significant amount of money back into improving Search, which is not a solved problem, and your question is why a few hobbyists haven't recreated it?

TFA mentions case when documents are coming from potentially adversarial sources. Any tips for handling relevance when users are trying to game the system?

There is a 100% chance I'll end up referencing this article in the future.

For most needs personally after having learned the ins and outs I still have a soft spot for sphinx so was happy to see honorable mention in there. It can scale really cheap and is a tank that never has downtime. It is closer to metal but if you look at how Craigslist does it you can do the fancy scaling things still

This is indeed a great article about building a practical search system. However the title is a bit aggrandizing. Not every software engineer actually needs to know this stuff, just people who are building products with search capabilities.

Just. Great.

Applications are open for YC Winter 2018

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact