
What every software engineer should know about search - forwidur
https://medium.com/startup-grind/what-every-software-engineer-should-know-about-search-27d1df99f80d
======
nostrademons
Ex-Google search engineer here, now using hosted ElasticSearch extensively in
my startup.

This is a really good overview. If there's one part I want to highlight, it's
that you should _expect to spend a lot of time fine-tuning your ranking
function for your particular product & corpus_. The default ElasticSearch
ranking function kinda sucks. It was changed in ES 5.0 to Okapi BM25, which is
the current academic state-of-the-art in non-machine-learned ranking
functions. However, search is one field where the current academic state-of-
the-art is at least a couple decades behind where things are in industry. When
you use a service with good search that just works, chances are that there have
been a lot of engineer hours devoted to identifying exactly which signals are
most useful in your corpus, how they relate to subjective evaluations of
relevance, and how to clean them up so that noise doesn't dominate the signal.
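
To make that concrete, here is a rough sketch of the Okapi BM25 formula mentioned above, in the Lucene-style variant with the usual defaults k1 = 1.2 and b = 0.75. This is a toy illustration, not Elasticsearch's exact implementation:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Score one document against a query with Okapi BM25.

    query_terms: list of query tokens
    doc_terms:   list of document tokens
    doc_freq:    dict term -> number of documents containing the term
    """
    tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        df = doc_freq.get(term, 0)
        # Lucene-style IDF; the +1 keeps it non-negative for very common terms
        idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
        # Term-frequency saturation plus document-length normalization
        norm_tf = (tf[term] * (k1 + 1)) / (
            tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
        score += idf * norm_tf
    return score
```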

These signals are tightly tied to the corpus you're working with. One slightly
unpleasant surprise when striking out on my own was just how little of the stuff
I learned about websearch is relevant to working with different corpora (though
I guess I should've predicted this, having seen how many different teams
across Google Search operated and how all their signals differed from core
websearch). I'll reiterate the article's point about _process_ and _method_
being more important than any specific function or algorithm; the most useful
stuff I learned at Google was actually a process for taking a vague domain
where I'm not sure what's out there, and then learning how to turn that into a
system for getting useful information out of it.

And then once you've got users, you probably want to feed usage data back into
it through some sort of machine learning. In that order, though; with all the
hotness around AI, it's really tempting to think "Oh, I'll just read the
latest LtR papers, Mechanical Turk a bunch of query evaluations, and feed the
data into the algorithm", but without a deeper understanding of the particular
corpus you're working with and how users want to access it, it's unlikely an
initial LtR system will perform well enough to attract enough users to
bootstrap the system.
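
To make "feed usage data back in" concrete, the very simplest version is something like a pointwise model trained on click/skip labels over a few hand-made features. Everything below (feature names, numbers) is made up purely for illustration:

```python
# A toy pointwise learning-to-rank sketch: train a classifier on click/skip
# labels, then use its predicted click probability to re-rank candidates
# coming out of the base retrieval engine.
import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, document) pair: [bm25_score, doc_age_days, title_match]
X_train = np.array([[12.3,   2.0, 1.0],
                    [ 4.1,  90.0, 0.0],
                    [ 9.8,  10.0, 1.0],
                    [ 1.2, 400.0, 0.0]])
y_train = np.array([1, 0, 1, 0])  # 1 = clicked, 0 = shown but skipped

model = LogisticRegression().fit(X_train, y_train)

def rerank(candidates):
    """candidates: list of (doc_id, feature_vector). Returns ids, best first."""
    features = np.array([f for _, f in candidates])
    scores = model.predict_proba(features)[:, 1]
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [doc_id for (doc_id, _), _ in ranked]
```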

~~~
makmanalp
> When you use a service with good search that just works, chances are that
> there have been a lot of engineer hours devoted to identifying exactly which
> signals are most useful in your corpus

I imagine at a place like Google, your "corpus" is just about everything under
the sun, as are your queries (i.e., less chance for subject-specific tuning).

What happens then?

~~~
mayank
> I imagine at a place like Google, your "corpus" is just about everything
> under the sun, as are your queries.

> What happens then?

You accept the fact that a good search engine is inevitably going to be a
multi-layered beast that requires constant effort to improve, with heuristics,
machine learning, magic constants, special lists of words, and regular
expressions.

After that, it's just a matter of picking a commercially useful objective
function ("instant search", "query autocomplete", "maximize user
satisfaction") and pointing large teams of well-resourced PhDs at it. Probably
a whole weekend at least.

------
mss2
I think somewhere along the way work on search just devolved into handling
millions of queries at a time. It could have been so much better.

If you ask a "search expert" today what he is trying to fix, he will say
something related to scaling.

If you asked an expert in the 80's or 90's, they would talk about query
complexity and NLP, e.g. "Who were the four semi-finalists of last year's
Wimbledon?" And you would get back 4 names.

Today you will get back a result saying 50 million pages were found in 1
second. The page may or may not contain the 4 names if you wade through 17
popups. Nobody questions how brainless this is.

~~~
amelius
Well, I hear a lot of people complaining that the results on DuckDuckGo are
still worse than on Google, even though both search-engines produce results
within a second. And these are people that really want to quit using Google
for privacy reasons. I never hear people complaining that a search is slow. So
I do think that _search-quality_ is where the competition is happening.

Edit: But, I agree, we don't often see any good HN posts or papers about
search-quality. Perhaps a case of trade secrets?

~~~
packetslave
The quality/ranking code and algorithms are definitely one of the "holy
grails" at Google. That code and documentation is locked down tight, compared
to the 99.9% of the rest of the codebase that's open to any engineer.

------
jventura
I used to work on textual Information Retrieval (which is quite related to
search). One of the key contributions of my PhD thesis was a statistical
method to extract concepts from texts [0].

Surprisingly enough, with those concepts, Tf-Idf [1] was quite good at
extracting keywords from documents, which allowed us to build document
descriptor tables that can eventually be used for document search [2]. We also
built a small prototype that allowed us to retrieve results from specific areas
of documents where the searched concepts were more "valuable" than in other
parts of the document [3].
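
For anyone unfamiliar with the technique, a generic Tf-Idf keyword-extraction sketch looks roughly like this. It is just the textbook version, not the concept-filtered method from the papers above (which first narrows terms down to concepts):

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=5):
    """docs: list of token lists. Returns the top-k Tf-Idf terms per document."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)
        # Terms frequent in this document but rare across the corpus score high
        scores = {t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf}
        keywords.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return keywords
```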

It is a very interesting subarea of AI/NLP which unfortunately doesn't seem to
attract much interest.

Since the article also talks a bit about Wikipedia dumps datasets, here's a
tool that I've created to build textual corpora:
[https://github.com/joaoventura/WikiCorpusExtractor](https://github.com/joaoventura/WikiCorpusExtractor)

[0] -
[http://www.sciencedirect.com/science/article/pii/S1877050912...](http://www.sciencedirect.com/science/article/pii/S1877050912001251)

[1] -
[https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[2] -
[https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...](https://link.springer.com/chapter/10.1007/978-3-642-40669-0_42)

[3] -
[https://link.springer.com/chapter/10.1007/978-3-642-40669-0_...](https://link.springer.com/chapter/10.1007/978-3-642-40669-0_44)

~~~
garysieling
How does your research compare to word2vec? That seems to be getting some
attention.

~~~
jventura
Neural networks are all the rage these days. As my knowledge of NNs is quite
limited, what I could grasp from the wikipedia article [0] and the original
paper [1] is that word2vec uses a vector to represent words in an n-dimensional
space.

This seems quite similar to LSA (Latent Semantic Analysis). In LSA, a word is
represented by a vector (list / array) of values which represent the number of
occurrences of the word in each document. For instance, the list [0, 5, 9, 10]
for the word "aeroplane" means that "aeroplane" occurs zero times in D1
(document #1), 5 times in D2, etc. Unlike LSA, word2vec seems to have the
vector values implicit in the neural network weights (?).

The interesting thing about these kinds of approaches is that words that share
similar contexts tend to occur in the same documents. In other words, the
vectors tend to have similar values. For instance, it is natural that
"aeroplane" tends to occur in the same set of documents as "airport". Tf-Idf
[2] does something similar, so these vectors are mostly a data representation
of single words.
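
To make the vector picture concrete, here is roughly what those count vectors and their similarity look like with toy numbers (real LSA would additionally apply an SVD step, which is skipped here):

```python
import math

# Rows: how often each word occurs in documents D1..D4 (toy counts)
vectors = {
    "aeroplane": [0, 5, 9, 10],
    "airport":   [1, 4, 7, 8],
    "banana":    [6, 0, 0, 1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Words that share contexts get vectors pointing in similar directions
print(cosine(vectors["aeroplane"], vectors["airport"]))  # close to 1
print(cosine(vectors["aeroplane"], vectors["banana"]))   # much smaller
```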

As for my research, the basis of it was to be able to quantify how "specific"
a word or multiword is. The idea is that the more a word/multiword occurs in
different contexts (in my case, near different words) the less specific it is.
Being able to identify the specificity of words allowed me to know which
words/multiwords were probably concepts and which ones weren't. Then, with
these concepts I was able to use Tf-Idf to extract document keywords, and later
I was able to infer broad relationships between concepts (e.g., if two words
tend to co-occur in the same documents and the same parts of documents, they
are somewhat related).
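
A much-simplified caricature of that specificity idea (not the actual statistical method from the thesis) would be to score each word by how many distinct neighbouring words it appears next to: the more varied the neighbourhood, the less "specific" the word.

```python
from collections import defaultdict

def specificity(tokens, window=1):
    """Rough proxy: words seen next to many *distinct* neighbours are generic;
    words with few distinct neighbours are more concept-like. A caricature of
    the approach described above, not the thesis method itself."""
    neighbours = defaultdict(set)
    for i, tok in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                neighbours[tok].add(tokens[j])
    # Lower score = occurs near more distinct words = less specific
    return {tok: 1.0 / len(ctx) for tok, ctx in neighbours.items()}
```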

So, how does word2vec compare to my research?

\- I prefer simple statistical approaches like the one I did (which is just
counting frequencies and dividing numbers, etc.) to the black-box approach
of neural networks. I can tell you why my research provides good results
(explained above) but if you read a follow-up paper on word2vec [3] this is
what you can find in section 4: "Why does this produce good word
representations? Good question. We don’t really know".

\- Also, bag-of-words approaches like word2vec and LSA don't tend to represent
multiwords (such as "President of the United States of America"), which leaves
out lots of information in texts.

\- Finally, I think there are more possible applications after you know which
words/multiwords are informative. With word2vec and LSA you are only
representing something similar to frequencies of words in documents, which
mostly allows for Information Retrieval-like applications. On the other hand,
one of the applications I built with the extracted concepts was a
prototype which allowed us to find the definition of a concept in the corpus.
The idea was that the definition of a concept was the paragraph where that
concept occurred many times but which also used other concepts to help define
the first one. I don't know how you could do something like that using
something like word2vec.

But as I said above, I am not that knowledgeable about word2vec, so I may be
missing something.

[0] -
[https://en.wikipedia.org/wiki/Word2vec](https://en.wikipedia.org/wiki/Word2vec)

[1] - [https://arxiv.org/abs/1301.3781](https://arxiv.org/abs/1301.3781)

[2] -
[https://en.wikipedia.org/wiki/Tf%E2%80%93idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf)

[3] -
[https://arxiv.org/pdf/1402.3722.pdf](https://arxiv.org/pdf/1402.3722.pdf)

------
softwaredoug
Doug Turnbull here, author of Relevant Search. I'm in love with this post.
Thought I'd point out I added a few pointers that stream-of-conscioused into
my head as I read this!

[https://medium.com/@softwaredoug/this-is-a-fantastic-post-e9...](https://medium.com/@softwaredoug/this-is-a-fantastic-post-e9caae910334)

------
markpapadakis
I have some experience in developing IR/search software
([https://github.com/phaistos-networks/Trinity](https://github.com/phaistos-networks/Trinity))
and, the way I see it, it all comes down to accepting the
following premises:

\- It’s all about the relevance models. Specifically, BM25 and TF-IDF, which
are widely used by e.g. Lucene and its variants, just won’t do; they
only work for large enough documents anyway.

\- Indexing and search algorithms and practices haven’t changed much in
decades (though some novel ideas have been introduced not long ago). Lucene’s
index encoding is compact and facilitates fast access, but even the one made
available by Google, which is arguably simpler in design, doesn’t result in
more than around a 5% reduction in index size and postings list access time
(according to my measurements, that is). Implementations of posting list
intersections, unions and other such operations are pretty much common across
IR systems as well, with little room for improvement.

To get great results, you need really great relevance models(1), a great query
rewrite system(2), and, because query rewrites usually expand a query to
include multiple disjunctions (OR terms and phrases), your search engine needs
to be particularly efficient at handling those(3).
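
As a made-up example of the kind of rewrite meant in (2): expanding each term with synonyms turns a short query into a conjunction of disjunctions, which is exactly the shape the engine then has to evaluate efficiently (3). The synonym table below is purely hypothetical; real systems derive expansions from logs, taxonomies, embeddings, and so on.

```python
# Hypothetical synonym table for illustration only
SYNONYMS = {
    "car": ["car", "auto", "automobile"],
    "cheap": ["cheap", "inexpensive", "affordable"],
}

def rewrite(query):
    """'cheap car' -> '(cheap OR inexpensive OR affordable) AND (car OR auto OR automobile)'"""
    groups = []
    for term in query.split():
        variants = SYNONYMS.get(term, [term])
        groups.append("(" + " OR ".join(variants) + ")")
    return " AND ".join(groups)

print(rewrite("cheap car"))
```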

You also need to care for spelling suggestions and personalisation/content
biases and factors, but those are secondary concerns.

------
jimnotgym
Very interesting article. However, I really hate the floating menu and footer
on the site. I browse on a small laptop and I really don't like having the view
obscured.

~~~
genpfault
Seemed OK with Reader View in Firefox.

~~~
jimnotgym
It is OK with Reader View, but that's not really the point though, is it?

------
cpburns2009
What's with the random unicode characters spread throughout this article?
Several headings and sentences are prefixed oddly with an exclamation mark
("\u2757\ufe0f"), or diamond ("\U0001f537").

~~~
metafunctor
There's an “Emoji Legend” near the top of the article.

It's a nice idea, flagging part of the text with coarse semantic meaning.
However, I find I don't trust the writer's judgement to decide what is
important for me; I must read the whole thing anyway.

~~~
cpburns2009
I've never encountered an "Emoji Legend" before. I instinctively skip over the
word "emoji" so I missed that section.

------
kylegalbraith
This was a great read and incredibly helpful. As someone who has been working
on a search system for three years, I can say it is incredibly difficult. There
are a ton of variables, and being wrong early costs you dearly later. There
is a lot in this article that resonates with me. Some stuff I know, some stuff
I wish I had known, and some stuff that was new. Great read!

------
garysieling
I think search problems are a ton of fun, because there are a lot of
opportunities for creative solutions to problems.

One thing I'd add is a book called "Relevant Search" \- I have a hobby project
to search talks, and the book helped me out a ton
([https://www.findlectures.com](https://www.findlectures.com)).

~~~
softwaredoug
Thanks for sharing my book ;) I actually just replied to this (fantastic) post
with some extra pointers I've discovered on my search projects!
[https://medium.com/@softwaredoug/this-is-a-fantastic-post-e9...](https://medium.com/@softwaredoug/this-is-a-fantastic-post-e9caae910334)

------
dvirsky
Great post. I'm working on another open source search engine on top of Redis
([http://redisearch.io](http://redisearch.io)), mostly focused on index
building and serving, and real-time updates of the data. The part about
queries being highly varying is extremely challenging. You have to deal with
simple "foo bar" queries and complex queries with intricate filtering and
crazy stuff (I have a user doing an AND intersection of 17 OR unions, each of
32 terms, while checking the ordering of term offsets!). It's super fun to
work on this stuff.

Another recommended book that was not mentioned in the post: Search Engines:
Information Retrieval in Practice
[https://www.amazon.com/Search-Engines-Information-Retrieval-...](https://www.amazon.com/Search-Engines-Information-Retrieval-Practice/dp/0136072240)

~~~
sorokod
If you allow users to execute arbitrary queries, how do you protect your
server resources from being totally consumed by a small subset of your users?

~~~
dvirsky
In terms of Redis being single-threaded, we put a lot of work into making the
search concurrent; see here for details:
[https://redislabs.com/blog/making-redis-concurrent-with-modu...](https://redislabs.com/blog/making-redis-concurrent-with-modules/).
Basically we rotate between long queries and short queries, making long queries
run slower but not block short ones. Right now rate limiting is left to the
user and not solved at the engine level.

~~~
sorokod
This is about the complexity of queries (which, without knowing your system,
may be unbounded) - rate limiting will not help.

~~~
dvirsky
Yeah, I was also thinking about adding an execution cap to a single query;
that's not hard to add, as there is a simple scheduler that checks each query's
progress constantly. So just terminating a long-running query is simple.

------
bootcat
I was also going through a similar problem with search. Open source has
multiple products for all sorts of things, but for search and index creation we
just have Lucene and tools on top of it like Solr or Elasticsearch. Not sure
why we are not innovating in the search space with new algorithms, systems,
open source systems/tools and so on. Almost all search engines are built on top
of Lucene and the inverted index. Why is no one able to reverse engineer
Google/Bing capabilities?

~~~
nostrademons
Separate out the concepts of "search infrastructure" (how documents and
posting lists are stored in terms of bits on disk & RAM) and "ranking
functions" (how queries are matched to documents).

The former is basically a solved problem. Lucene/ElasticSearch and Google are
using basically the same techniques, and you can read about them in _Managing
Gigabytes_ [1], which was first published over 2 decades ago. Google may be a
generation or so ahead - they were working on a new system to take full
advantage of SSDs (which turn out to be _very_ good for search, because it's a
very read-heavy workload) when I left, and I don't really know the details of
it. But ElasticSearch is a perfectly adequate retrieval system, and it does
basically the same stuff that Google's systems did circa 2013, and even does
some stuff better than Google.
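
To make the infrastructure/ranking split concrete, the "solved" half boils down to something like this toy inverted index with sorted-posting-list intersection; real systems just compress, shard, and cache it very aggressively:

```python
from collections import defaultdict

def build_index(docs):
    """docs: dict doc_id -> list of tokens. Returns term -> sorted posting list."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for tok in tokens:
            index[tok].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def intersect(p1, p2):
    """Merge two sorted posting lists (the classic AND operation)."""
    out, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

index = build_index({1: ["search", "engine"], 2: ["search", "index"],
                     3: ["engine", "index", "search"]})
print(intersect(index["search"], index["engine"]))  # -> [1, 3]
```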

The really interesting work in search is in ranking functions, and this is where
nobody comes close to Google. Some of this, as other commenters note, is
because Google has more data than anyone else. Some of it is just because
there've been more man-hours poured into it. IMHO, it's pretty doubtful that
an open-source project could attract that sort of focused knowledge-work
(trust me; it's pretty laborious) when Google will pay half a mil per year for
skilled information-retrieval Ph.Ds.

[1] [https://www.amazon.com/Managing-Gigabytes-Compressing-Multim...](https://www.amazon.com/Managing-Gigabytes-Compressing-Multimedia-Information/dp/1558605703)

~~~
ot
> The former is basically a solved problem.

That's a bit of a stretch :) The high-level architecture is quite mature and
stable, but there's still a lot of research, both in academia and industry, on
the data structures to represent indexes, on query execution (see all the work
on top-k retrieval), and distributed search systems (for example query-
dependent load balancing, novel sharding methods).

~~~
markpapadakis
It's true that different index codecs are being designed with different
tradeoffs (index size vs postings list access time and cost), but the work is
very incremental IMHO, with no huge advances to speak of. Also, can you talk
about the top-k challenges you mentioned? Are priority queues not optimal
enough?

------
PhilMcCracken
TFA mentions the case where documents come from potentially adversarial
sources. Any tips for handling relevance when users are trying to game the
system?

------
Subetai
There is a 100% chance I'll end up referencing this article in the future.

For most needs, personally, after having learned the ins and outs, I still have
a soft spot for Sphinx, so I was happy to see it get an honorable mention in
there. It can scale really cheaply and is a tank that never has downtime. It is
closer to the metal, but if you look at how Craigslist does it, you can still
do the fancy scaling things.

------
nickm12
This is indeed a great article about building a practical search system.
However, the title is a bit aggrandizing. Not every software engineer actually
needs to know this stuff, just people who are building products with search
capabilities.

------
dkorolev
Just. Great.

