
Ask HN: Is 'search' a solved problem? - search_
I remember the time (not so long ago) when 'search' seemed to be the hottest topic in the industry. We had the rise of Google and competitors. There were search startups. Open source projects like Lucene and Solr were in the news. There were books published, blogs, conferences...

And now it seems that the industry has moved on. There are a million papers/books/blogs/online courses/video lectures/meetups about ML/AI, but I can't seem to find anything current on search.

What are the good resources to learn the fundamentals of search and keep up with the current happenings in that space? Not SEO, but more from the computer science/engineering point of view?
======
Radim
"Search" is too broad to ever be solved. That's like "solving entropy".

Google focused on a specific subset — you enter a few keywords or a phrase,
and the machine returns the top ~10 links to pre-existing (indexed) web pages.
But that's not all there is to search!

Challenges:

1\. Intranets: internal documents, typically in different modalities (FAQs,
support cases, wikis, public pages) and across diverse storages that evolved
throughout the years via acquisitions and osmosis.

2\. Clustering: you don't have any keywords, but rather want to find how a
particular document (legal template, its clause section) evolved over time.
You want to _avoid_ using keywords. Search for similar documents or document
sections. Find similarity between two documents that is based on semantics
rather than query keywords. Applications: eDiscovery, contract management…

3\. SME & Intent: "relevant result" means different things in different
domains, or even different aspects of a single domain. Google is doing an
amazing job with their "single search box", but there are industries (for
example, HR) where search _precision_ matters much more than _recall_. More
elaborate, focused, domain-specific facets or even dialogue systems make sense
there.
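
To make the keyword-free "similar documents" idea in point 2 concrete, here is a minimal sketch. It assumes plain term-count vectors as a stand-in for learned semantic embeddings (which is what a real semantic search would use); the function names and the sample clauses are invented for illustration.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Lower-case the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical clause versions; no query keywords are involved.
clause_v1 = term_vector("The tenant shall pay rent on the first day of each month")
clause_v2 = term_vector("The tenant shall pay rent on the last day of each month")
unrelated = term_vector("Search engines rank indexed web pages")

# Two versions of the same clause should score closer than an unrelated text.
print(cosine_similarity(clause_v1, clause_v2) > cosine_similarity(clause_v1, unrelated))
```

Swapping `term_vector` for an embedding model would keep the comparison step the same while capturing similarity beyond shared words.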

Commercial plug: we built a search solution focused around semantic search (in
the "machine learning and vectors" sense, not "semantic web and RDFs" sense),
[https://scaletext.ai](https://scaletext.ai). It's still early days in that
our clients are all over the place, but to say Google/Lucene solved search is
patently false.

~~~
halflings
Definitely valid point (that there are very different types of "search", not
just one universal way to do things), but:

2) Google does not do simple keyword matching, and certainly has a strong
sense of "semantics".

3) Try searching "Cheap hotels in San Francisco" or "Plumber jobs in Chicago"
in Google. Just because it's a single search box does not mean all results are
generated/displayed in the same way.

~~~
Radim
Absolutely. Determining query intent is an ancient, well researched and still
active domain. And Google publishes their results regularly (kudos to them).

My point was more that encoding the relevant signals into a single general-
purpose search box (a sort-of natural language) is an inherently noisy,
ambiguous process. When you know what kind of search you want, it's better to
factor out the relevant parameters and feedback loops explicitly, give them a
clear UI and search flow. Rather than have users fumble with double quotes and
double-guessing the query parser.

~~~
bobosha
@Radim - is this something you are doing with your current startup? p.s. kudos
from another fan of gensim.

------
nikanj
If anything, it's an abandoned problem. A lot of companies bought really
expensive enterprise search systems, which are sitting dormant because the
results are so bad.

With advances in spamming, internet/email search is getting to be a harder
problem every year.

I remember when Google was quite effective in finding what I need, but
nowadays it's dismal. As an example, I googled for "storename return policy",
and got page after page of results. All of them were tagged "missing
storename", so just randomly picked return policies from other stores.

Search is their bread and butter, and they're probably keenly aware of the
diminishing quality of results. I'd love to hear what's caused the recent
trend for results that are missing a few of the most crucial keywords.
Probably over-enthusiastically trying to filter out keyword farms.

~~~
slyfocks
Lately, it has felt like Google has become more optimized for less
intelligent/tech-savvy users (i.e. people who search using natural language
e.g. “tell me the store policy for target please”). In these cases many of the
words used are either irrelevant or counterproductive.

I’ve noticed Facebook’s search has started catering to the lowest common
denominator as well. Searching for (made-up example) “Louis Potter” used to
give all exact matches priority. Current results look something like 1) Louis
Potter 2) Luis Porter 3) Louis Potters 4) Louis Potter (#2). I don’t like when
search engines assume that I’m misspelling words/names, and it would be nice
if they adapted for individual behavior in this regard.
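
The "exact matches first, near-misses after" ordering described above can be sketched roughly as follows. This is only an illustration using Python's standard `difflib`; the function name and the sample names are made up, and it is not a claim about how Facebook ranks results.

```python
from difflib import SequenceMatcher


def rank_names(query: str, candidates: list[str]) -> list[str]:
    """Put exact (case-insensitive) matches first, then order the rest by fuzzy similarity."""
    q = query.casefold()

    def sort_key(name: str):
        n = name.casefold()
        similarity = SequenceMatcher(None, q, n).ratio()
        # False sorts before True, so exact matches lead; higher similarity comes next.
        return (n != q, -similarity)

    return sorted(candidates, key=sort_key)


results = rank_names("Louis Potter", ["Luis Porter", "Louis Potters", "Louis Potter"])
print(results)
```

Adapting to individual behavior could then be as simple as a per-user switch that drops the fuzzy candidates entirely.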

~~~
basch
facebook used to let you do direct knowledge graph searches

"restaurants liked by people who like Joe's Pancakes and Green Dragon Mexican
Pizza"

"movies liked by people who are friends with people who like Back to the
Future and Arrested Development"

"people who live nearby and like Green Dragon Mexican Pizza and Arrested
Development"

"friends of friends of friends who like Arrested Development and Joe's
Pancakes"

"Friends of Megan Albert friends who are women named "Erin" Chipotle employ"

I think you can see why they removed it.

this URL might lead to something like "restaurants liked by people who like
Snow White" [https://www.facebook.com/search/130104022710/likers/pages-
li...](https://www.facebook.com/search/130104022710/likers/pages-
liked/273819889375819/places/intersect/)

~~~
basch
[https://code.facebook.com/posts/153625638171563/under-the-
ho...](https://code.facebook.com/posts/153625638171563/under-the-hood-
indexing-and-ranking-in-graph-search/)

------
arthev
Given that google is the de facto standard ATM and often returns irrelevant
results, I'd wager that search _isn't_ solved.

"That does not by itself mean there's room for a new search engine, but lately
when using Google search I've found myself nostalgic for the old days, when
Google was true to its own slightly aspy self. Google used to give me a page
of the right answers, fast, with no clutter. Now the results seem inspired by
the Scientologist principle that what's true is what's true for you. And the
pages don't have the clean, sparse feel they used to. Google search results
used to look like the output of a Unix utility. Now if I accidentally put the
cursor in the wrong place, anything might happen."
[http://www.paulgraham.com/ambitious.html](http://www.paulgraham.com/ambitious.html)

~~~
asperous
I would suppose that this change is less google changing and more the web
changing as people pump out content marketing and SEO. Also might be "olden
days were greener", I sure remember a lot of keyword stuffing back in the day.

~~~
telchar
You're probably right about that. Although I can point to one very specific
way the olden days were better: a Google search used to turn up lots of forum
posts, which are often the most useful results on a given topic. These days I
almost never see forum posts of any kind turned up as search results (except
stack overflow, thankfully). I'm not sure what caused the change but I find
useful information is much more likely to be buried by low-information
wikihow-type results and corporate landing pages.

~~~
a008t
Possibly because most forums seem to have died? Or maybe they died because
they got excluded from search results?

------
sgift
Not really, no. Some parts of search (Internet) have been cornered/solved by
big players (e.g. Google), but many other parts (e.g. search in Intranets) are
still an open problem. No one has found a solution that's as simple as
PageRank for Intranets. No one has found a solution for the "the author of
half of these documents is the intern who wrote the template"-problem and many
other things. There are good products out there, but a "Google for Non-
Internet" is still far away.

p.s.: All the bad searches on the various products I use or the websites of
some companies make me almost want to get back into search. Because even for
things that ARE solved in the technical sense there seems to be no "out of the
box" solution or people would use it.

edit: To expand a bit - all the search solutions I've seen which weren't for
Internet search were more or less bespoke, so you needed a project to get
something decent. Sure, you can install a plain Lucene/Solr, but Lucene/Solr
cannot understand the way your data works, which parts are important or if you
want to show results which are older further down (or not!). They have decent
defaults for the "common case", but you have to tune them for every
customer/installation for good results and that makes it non-scalable. And
being scalable without effort is usually one of the requirements people have
for something to be "solved".
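
As a rough illustration of the per-installation tuning described above, here is a minimal sketch. The config fields and the scoring formula are hypothetical (this is not a Lucene/Solr API); the point is only that field weights and "older results further down (or not!)" are knobs someone has to set for each data set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RankingConfig:
    """Per-installation knobs (hypothetical names, not a Lucene/Solr API)."""
    title_weight: float = 2.0
    body_weight: float = 1.0
    recency_half_life_days: Optional[float] = 365.0  # None disables age decay


def score(title_hits: int, body_hits: int, age_days: float, cfg: RankingConfig) -> float:
    """Weighted field hits, halved for every half-life of document age if decay is on."""
    base = cfg.title_weight * title_hits + cfg.body_weight * body_hits
    if cfg.recency_half_life_days is None:
        return base
    return base * 0.5 ** (age_days / cfg.recency_half_life_days)


news_site = RankingConfig(recency_half_life_days=30.0)    # freshness matters a lot
legal_archive = RankingConfig(recency_half_life_days=None)  # age is irrelevant
print(score(1, 2, 60, news_site), score(1, 2, 60, legal_archive))
```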

~~~
DEADBEEFC0FFEE
So true, I can ask Google a range of verbal questions and get great answers,
mostly what-is-the-fact questions.

At work we are nowhere near asking an intranet "show me the last pentesting
report for Tony's new website" or "show me the change requests for the failed
change this quarter". I know exactly what I want, and how to ask, but I will
not get the results I want.

------
obelix_
We are still in the stone age when it comes to search.

Ask Google who played the men's semifinals of Wimbledon three years ago and
Google will tell you it indexed 6 million pages to provide a link that may or
may not have the 4 names I am looking for. Why is it doing all this pointless
work? And why is it that dumb in 2018?

We have got so used to what it does that a lot of people have stopped asking
questions about how it does things and whether all the stuff it does is
required.

Wolframalpha, Freebase/SemanticWeb/Wikidata/dbpedia approaches, NLP/NLU are
still very underdeveloped and untapped.

Having open and distributed indexes like we see in nature with DNA is also
totally unexplored because of Google type centralised index monopolies in
various domains. It just takes a Gig or so to store a local offline index of
all Wikipedia or Stackoverflow pages. And given the massive RAM and hard disks
everyone has these days why aren't we seeing sophisticated local offline
search apps?

The internet is getting exponentially more noisy day by day and in many ways
it's easier to find quality info going through a top notch library's index than
wading through Google's. So there are lots of blindspots and areas to explore
in search right now imho.

~~~
diziet
I think these sort of queries are solved.. if you know the categories of sites
that index information and are able to scroll and process text, images and
information quickly.

My first query string idea was 'men semifinals Wimbeldon 2015 wiki' and the
resulting page contains the list in a nice format.

This is because I have the context that wiki pages would contain this sort of
information. Google and others are getting better at processing more vague
queries (like 'three years ago'), but I do agree we are nowhere close to being
able to ask general questions. Knowing how to use the tools like google search
(and other searches) and really advanced queries syntax is a force
multiplier/enabler.

~~~
nailer
> if you know the categories of sites that index information and are able to
> scroll and process text, images and information quickly.

That's a job for computers to do.

------
snaily
If you just want a primer on how to think about adding search to a product,
this piece by Max Grigorev is a great starting point:
[https://medium.com/startup-grind/what-every-software-
enginee...](https://medium.com/startup-grind/what-every-software-engineer-
should-know-about-search-27d1df99f80d) It's ostensibly for the engineer, but
actually feels more like it's written from the POV of a product manager.

------
Nuzzerino
It's not that people have moved on. It's that the entire culture of the
ecosystem is built upon a narrative that Google is an all-powerful machine
that cannot be stopped or contested. So people don't try to compete, and if
they do, they will be ridiculed for it.

And for what good reason? Certainly not because of past attempts. In fact,
Google has bought some startups that were involved with search.

I'd be interested in knowing what you find. There are bound to be many
relevant texts not labeled as search, if you know what you are looking for.
Perhaps I would start with web crawlers and go from there.

~~~
CookieMon
As a side topic, is there a useful web search engine that uses a fundamentally
different approach to Google, e.g. aren't using backlinks as a ranking signal?

When Google's approach isn't giving me an answer, it'd be nice to try a search
that wasn't based on a discoverability feedback loop.

------
Jaruzel
My personal view is that there should be dedicated search providers for
specific areas, such as academic, or technical, or news.

That way each provider can focus on building a good platform with clever
machine learning tailored to _that_ dataset.

We also need the return of proper Boolean operators and complex nested
queries. Yes, Joe Public will never use them, but a _lot_ of people who search
the internet as part of their jobs, or just have deep interests, would love
to have all those advanced search features back, and be able to override
the 'fuzzy logic' that generic search engines such as Google enforce on us.
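
For what it's worth, strict Boolean retrieval is cheap to express once you have an inverted index. A minimal sketch, assuming a toy in-memory index and made-up documents (real engines add tokenization, ranking and query parsing on top):

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lower-cased word to the set of document ids containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


docs = {
    "d1": "python search engine",
    "d2": "python snake care",
    "d3": "rust search library",
}
index = build_index(docs)

# (python OR rust) AND search AND NOT snake, with no fuzzy rewriting of terms:
# OR is set union, AND is intersection, NOT is difference.
hits = ((index["python"] | index["rust"]) & index["search"]) - index["snake"]
print(sorted(hits))
```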

I also disagree that all sites need to be mobile first. If I have a site that
provides software for the Enterprise Sector, Google will still penalise me if
my site isn't responsive, even though my target market is IT professionals
sitting in front of powerful laptops/desktops.

As many would agree, Google have too much power in the Search space, and they
have basically dictated to the world how Search should be done, whether they
are actually right or not.

------
Grue3
No, it's not solved. Google utterly fails at it.

I'm a person with simple needs. I mostly search in English, sometimes in
Russian (which is my native language) and sometimes in Japanese (which I'm
learning). These languages use completely disjoint sets of characters so it
should be obvious to Google what I'm trying to do (since I'm logged in and
they know all about me). There's also a setting that allows me to pick
languages that Google should prefer when searching (English is always
selected). Now there's a little problem:

1) If I choose any language other than English, Google prefers websites in
that language over English _even if my query is in English_. If I'm searching
for programming-related stuff, I don't need it poorly translated into my
native language, or any other language! I want the original, on the first page
of search results. So any language other than English gets switched off.

2) If Japanese is not switched on, Google thinks that any query that consists
only of kanji (a subset of Japanese characters that are also used in
Chinese) is in Chinese, so I get pages and pages of Chinese websites
before I see any Japanese ones. Since I can't read Chinese at all, it's
completely useless. Now, you might say: of course Google has no way to know if
I wanted Japanese, not Chinese. Oh, it knows. I can search for a name of a
Japanese person, and Google will display a sidebar with their English
Wikipedia article, date of birth and nationality and everything, and yet all
the search results will still be in Chinese.
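
To show that the signals are there, here is a minimal sketch of script-based language guessing that lets the user's known languages break the kanji tie. The Unicode ranges are a small hand-picked subset, and the function names, the fallback values and the Cyrillic/Latin shortcuts are simplifying assumptions for illustration only.

```python
def scripts_in(text: str) -> set[str]:
    """Classify characters by Unicode block (a small, hand-picked subset)."""
    found = set()
    for ch in text:
        cp = ord(ch)
        if 0x3040 <= cp <= 0x30FF:      # Hiragana and Katakana
            found.add("kana")
        elif 0x4E00 <= cp <= 0x9FFF:    # CJK Unified Ideographs (kanji/hanzi)
            found.add("han")
        elif 0x0400 <= cp <= 0x04FF:    # Cyrillic
            found.add("cyrillic")
        elif ch.isascii() and ch.isalpha():
            found.add("latin")
    return found


def guess_language(query: str, user_languages: list[str]) -> str:
    """Guess a query language; ambiguous Han-only text defers to the user's settings."""
    scripts = scripts_in(query)
    if "kana" in scripts:
        return "ja"                      # kana strongly suggests Japanese
    if "han" in scripts:
        for lang in user_languages:      # first matching preference wins
            if lang in ("ja", "zh"):
                return lang
        return "zh"                      # assumed default when nothing is known
    if "cyrillic" in scripts:
        return "ru" if "ru" in user_languages else "unknown"
    return "en" if "latin" in scripts else "unknown"


# A kanji-only query from a user known to read Japanese but not Chinese.
print(guess_language("東京", ["en", "ru", "ja"]))
```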

~~~
soared
> utterly fails

> $74b revenue

> 80% market share

~~~
andrewmcwatters
Social networking's industry is also a great success, look at Facebook.

------
andy_ppp
I can give a list of things I'd like in a search engine:

1) Sticky topics - if I'm at work and I type Flow I am way more likely to mean
Facebook's js library than I am Flow energy.

2) Different views on the information (Grid view/masonry that makes sense)

3) Ability to search within a set of search results

4) Ability to customise the algorithm with other programs I can write/plugins

5) Many more things for programmers and advanced "omni bar" style features
that allow me to type shortcuts and autocomplete things.

6) More automation of programmer stuff - if I type a number and a unix
timestamp it should show me a date etc. for possibly millions of things. Same
for unicode char of a hammer, url decode, etc. etc.

7) Clean integration with your OS search that makes sense.

I'm aware Duck Duck Go does some of these things (relatively badly). I think
I'll give it another go then and see if it's better.
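
A rough sketch of what item 6 could look like: a tiny dispatcher that recognizes a couple of query shapes and answers them directly. The function name and the "9 to 10 digits means Unix timestamp" heuristic are assumptions made for this example, not how any existing search engine does it.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import unquote


def instant_answer(query: str) -> Optional[str]:
    """Return a direct answer for a few recognizable query shapes, else None."""
    q = query.strip()
    # 9-10 ASCII digits: plausibly a Unix timestamp in seconds (arbitrary heuristic).
    if q.isascii() and q.isdigit() and 9 <= len(q) <= 10:
        return datetime.fromtimestamp(int(q), tz=timezone.utc).isoformat()
    # Percent-escapes present: show the URL-decoded form.
    if "%" in q and unquote(q) != q:
        return unquote(q)
    return None  # fall through to ordinary search


for example in ("1000000000", "hello%20world", "flow"):
    print(example, "->", instant_answer(example))
```

Each new "programmer thing" would be another recognizer in the chain, which is why the list could grow to millions of cases.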

~~~
severine
Great list, +1 to everything, +10 to #1, #3.

------
fiedzia
Far from it. I frequently have issues even with search on Google or Amazon,
who invested millions into that. Dealing with people being imprecise in naming
and spelling, different contexts and personalized results is still barely
touched or non-existent, and definitely not solved.

Lucene provides building blocks, not a solution, and not even all of them.

Search on a small scale may seem to work if you have phrases that are mostly
unique, like recent movie titles or celebrity names. Go beyond that and it's a
mess.

~~~
Finnucane
The 'solution' you're looking for also isn't necessarily the 'solution' Google
and Amazon are looking for. You're looking for an answer to a question, or to
find a thing, they're looking to maximize ad revenue and sales. So the answer
to your search is tempered by what gets them the most money. _That_ is what
they've spent millions trying to figure out.

------
quanto
I don't see 'search,' aka information retrieval (IR), as a solved problem. I
went to the last SIGIR conference in Tokyo, and yes, the heavy hitters in the
field were there to promote and present their latest research. It is no doubt
a very active research field using machine learning techniques. Reading the
papers published could give a view of the (academic) state of the art.

Whether it is a good business strategy to challenge Google head-on is another
question. Whether there is a sure way to learn and engineer said systems on
extremely large scales is another question.

------
blihp
In 1994 there were people who believed search was 'solved' thanks to Yahoo!
providing a comprehensive index of the Web (which actually was doable for a
couple of years). Then the Internet exploded in size, so in 1996 there were
people who believed search was 'solved' thanks to AltaVista. The Internet
continued to grow and thanks to early SEO techniques like keyword stuffing, an
opening existed for Google to fill when they 'solved' search with 'The
Algorithm.' Now we're in the midst of search being 'solved' again via
ML/DNN's. And I'm sure it will be 'solved' again and again in years to come.
So to answer your question about if it's solved, you'd first have to specify
which time? :-) Search is a very, very large space and will likely not be
solved anytime soon.

Also, don't confuse the business/tech press definition of solved (i.e. a
dominant player raking in large piles of money and shutting out competition)
with said problem actually being solved as in there does not exist a better
way to attack the problem. Granted, when the business/tech world considers a
problem solved, this often just means that the lion's share of (known)
financial incentives are no longer 'low-hanging fruit'... until a challenger
figures out a way to blow up the incumbent's business model.

------
taurine
DARPA/IARPA now invests in search 2.0: Instead of asking it to retrieve stored
information, you can instruct an agent/search bot to perform tasks for you.

For instance, one should be able to search for: "who is the leader of this IRC
hacker group?" "where can heroin be bought on the deep web?" "Is there women
trafficking going on behind that log-in wall?" and then an intelligent agent
is dispatched, avoiding/crossing roadblocks, like log-in forms, and will
eventually bring you the answer.

Coursera has more on the basics, like: [https://www.coursera.org/learn/text-
retrieval](https://www.coursera.org/learn/text-retrieval)

OpenAI has more on the current happenings of creating more intelligent search
bots: [https://github.com/openai/universe](https://github.com/openai/universe)

Other possible future research areas in information retrieval include being
able to search for services ("Where is the cheapest taxi service for my current
location?") and an integration with IoT.

------
euske
I tend to think the search is often _the last resort_ and indicative of the
other navigation system being broken. When people can reach the info they need
in a more organized way quickly, they'd probably do so. Therefore full text
search has to cover every residual task; it's bound to be messy.

------
sebringj
Elastic and Solr (Lucene), with the help of various other dbs or graph dbs etc.,
get you farther along, but you really have to bring in machine learning to get
things in context. That requires more data in certain domains, so someone like
me who goes at things alone has to dive into many disciplines. It is
not straightforward, and results are not yes or no, more like 70% good and the
rest is subjective.

The most challenging parts are if you have many dimensions at the same time
such as location / full text / user preferences / social filters / permissions
etc. These are the problems that make my life suck right now as you cannot
simply not have joins for some things or not have graph relations etc etc.
Case in point: following feeds with many dimensions, where you need a pipeline
approach in stages.

------
xnxn
I've learned that search is very domain-specific and it's only "solved" if you
can pin down what "relevant" means for your corpus of documents.

In terms of fundamentals, I'd suggest reading about tf-idf, which is the basis
of Lucene (which powers Solr and Elasticsearch).
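
For anyone who wants to see the idea before reading up on it, here is a bare-bones sketch of tf-idf ranking. It uses the textbook formulation (term frequency times the log of inverse document frequency) on made-up documents; Lucene's actual scoring is more elaborate, so treat this as an illustration of the principle only.

```python
import math
from collections import Counter


def tf_idf_rank(query: str, docs: dict[str, str]) -> list[tuple[str, float]]:
    """Score each document as the sum of tf * idf over the query terms."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                tf = counts[term] / len(tokens)     # how prominent in this doc
                idf = math.log(n_docs / df[term])   # how rare across the corpus
                score += tf * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


corpus = {
    "a": "lucene search index",
    "b": "search engine ranking",
    "c": "cooking pasta recipe",
}
for doc_id, score in tf_idf_rank("lucene search", corpus):
    print(doc_id, round(score, 3))
```

The rare term ("lucene") contributes more than the common one ("search"), which is the whole trick: what counts as "relevant" falls out of the corpus statistics.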

------
stef25
Google's good for 1 - 3 key words / phrases but anything longer can still be
difficult.

Regardless of query length it's strange that I often find the second result
better than the first one.

Finding something in Gmail can be a real challenge, I wonder if would be
possible to unleash Algolia on your inbox.

------
sdruskat
I've confronted my PhD supervisor (Professor of Library and Information
Science) with this statement once, and she almost went berserk. Her take is
that free text search is approaching the solved problem stage, but almost all
other search isn't.

------
ferdous
You might find this interesting
[https://dynamicguy.com/post/160792434677/1-search-engine-
not...](https://dynamicguy.com/post/160792434677/1-search-engine-not-good-
anymore)

------
visarga
The search problem is connected to the spam filtering problem which is an ever
advancing arms race - it's never solved and depends on the new schemes spammers
come up with. So search itself is never a solved problem.

------
adamnemecek
I wish there were a search engine that could figure out the groups the search
results belong to. Sometimes when searching for swift programming language, a
certain singer also makes an appearance. Or when searching for physics related
things, I get that Olivia Newton John song. I'd like a search engine that
displays some sort of Venn diagram (not quite but it's close) that lets me
home in on my results.

To be honest, I think that I would actually want an engine that scrapes fewer
sites but good sites and tries to understand them better. Also regular
expressions.

------
bo1024
Great topic. Part of the issue is probably a transition from algorithmic
approaches to data-driven approaches. What did previous users search for and
click on? Existing companies have a huge advantage from years of data, and not
the kind of advantage that others can learn from (compare to publishing a
better algorithm). Another factor may be that parts of the problem can be
separated out and are studied on their own, such as natural language
processing.

~~~
_trampeltier
>What did previous users search for and click on?

This is why google often is bad for searching tech things. You often get very
old, useless links for a topic.

------
chriswarbo
One aspect that I don't see mentioned is that APIs for search have basically
disappeared from the major players, likely due to them being expensive
(programs can hammer an API faster than a human with a text box) and lacking a
revenue model (no human may be involved, making advertising useless).

This has caused "search" to devolve into "human using Web browser types
natural language in box, human is presented with results (some semantic, most
just links to things other people wrote, some advertisements around the side),
human reads through results to see if any are useful to them".

This is certainly useful, and I rely on it all the time, but it's not the
pinnacle of what search could be. It's like if `grep` didn't pipe to stdout,
but instead popped up an alert box for each line like "Your query '.*' matched
the line 'foo'. Try the new McDonalds saver menu today! [Next/Cancel]", it
would still be useful, but nowhere near as useful as piping to stdout.

Many years ago Google provided an API to their search engines, which
applications could build on to be more "smart". This could have paved the way
to much better software: for example, imagine a prolog system where all of
Google's knowledge could be used by the calculations.

That path was mostly abandoned since there's no scalable incentive to make it
operate, much like the semantic Web. Rather than opening up databases to
empower others, it's much more profitable to keep them walled off behind a few
limited, pay-per-use interfaces (e.g. "paying" by showing a human some
adverts). Attempts to bypass or abstract over these interfaces are hit with
rate limiters, Recaptcha checks, etc.

------
maephisto
Lucene/Solr, Elastic Search and Algolia did a great job creating search tools
and services and this extinguished the thirst of the masses. I don't think
it's a solved problem, it's just a problem that has commercially viable
solutions. When it comes to resources, I've found valuable knowledge in
Lucene/Solr forums and mailing lists, back in the day. It's worth a read.

------
fifnir
I think search is not only not solved, but a failed concept overall.

It's too hard to find relevant information from the gazillions of pages based
only on a few words.

I think search needs to be replaced by some kind of
indexing/ontology/knowledge organization system, and then maybe only be
applied in the "last mile" of a person's 'search' for relevant information

------
dredmorbius
I'm increasingly thinking that a problem with search is inconsistent and
nonstandard (or nonexistent) conventions, protocols, and APIs.

It seems to me that a fair bit of the search problem could be addressed by
sites themselves serving wordlists, tuples, statistically improbable terms
(there's another term for this that's escaping me), etc., rather than
consenting to being heavily crawled by numerous spiders.

Vastly improved content metadata (particularly for largely fixed "article"
content), including author, date, and (a reliable) topical categorisation
would help. For realtime information, APIs are probably more reasonable than
text search, though those would likely be front-ended by specific
applications.

This still leaves the very nontrivial problems of reputation, relevance,
black-hat SEO, and information manipulation (propaganda, misinformation,
misdirection, disinformation), and just plain street-grade idiocy.

But several elements of this strike me as amenable to either localised or
distributed solutions.

------
taxonomyman
We're trying. Kindly see [https://millionshort.com](https://millionshort.com)

~~~
fiveFeet
Gives good results. Thanks for providing the link.

------
shpx
I'd like a search for things that I've read or seen in the last few days. Or
years.

[https://twitter.com/patrickc/status/953011978217205760](https://twitter.com/patrickc/status/953011978217205760)

~~~
_trampeltier
[https://yacy.net/en/](https://yacy.net/en/)

Yacy can do it. You can use it as a proxy and it does index all visited pages.
(I tried Yacy, but never this feature).

------
randyrand
It depends what you are looking for. I have the hardest time searching for
laptops that meet all of the specs I want, for instance.

Takes days.

Minimum screen brightness. 1440p touch. Needs USB-A. 8th gen quad core. 20+
watt TDP. etc. Not too thin, not too thick. decent graphics.

~~~
treffer
In the EU you can use
[https://geizhals.eu/?cat=nb](https://geizhals.eu/?cat=nb) which is pretty
decent. I'm using that regularly to check options.

Is something like this not available in the US/other territories? If not: here
is your opportunity :-D

------
marban
Self-plug: I recently launched a dedicated news search engine at
[https://yetigogo.com](https://yetigogo.com) — Based on my personal needs, but
it works pretty well for tracking any current event.

~~~
bitL
I hope you are ready for the "link tax"...

------
lbriner
Searching is easy enough but scoring the value of what it finds is not! Is an
answer on Stackoverflow valuable because the poster has 100K reputation? (no!)
Was the information valid 10 years ago but completely irrelevant now? Is this
programming blog post very specific to Drupal and not relevant to other PHP
frameworks? Was it information taken from somewhere else? (I would rather send
traffic to the original).

Another problem is how to search for something when you don't understand it
enough to search for it or you can't think of distinct enough words or phrases
to search for it.

------
hedora
Internet search has been solved for about 10 years.

Evidence: Before it sold off to Bing, Yahoo search was quantifiably better
than google for a few years (in blind tests where you rip off the branding).

No one cared, because google was good enough.

Having said that, I use duck duck go these days, and occasionally spot check
using google. I think the google results have become unusable because they too
aggressively map to related concepts, and otherwise second guess what I’ve
typed, but there have been endless debates about that on HN, and it’s
essentially decided by the user’s taste.

~~~
JohnStrangeII
Couldn't you use quotation marks to search for phrases? I do that all the
time. The search is not verbatim, Google certainly stems the content of
quotation mark phrases, but I was under the impression that related concepts
are not included.

I'm not claiming to be sure about that, it's a subjective impression, so if
someone can explain better, I'd be very interested. Since I use Google for
work (more often than Google scholar), I'd hate not to find what I'm looking
for, but generally Google results seem to be much better than others. It's the
main reason why I don't use DuckDuckGo, ixQuick, Yandex, Bing.

~~~
hedora
Aggressively “”’ing can help, but I think that also turns off useful stuff,
like word stemming, when what I want it to do is not map to some other more
popular term.

I’m sure people have been trained to use google effectively, but I switched
years ago, and find it harder than the alternatives.

To be clear, this is all nitpicking. The last time this came up on HN the only
query anyone found that actually showed practical differences between the two
was the acronym FOSS.

Google rewrote to “free open source software”, with no quotes, giving poor
quality results (eg bsd software).

Ddg gave the definition of foss, and relatively few results pointing at actual
software, because that acronym was (is?) obscure.

------
elskerpudding
It's solved, but as many have noted, not to a satisfactory degree. Modern search
engines have extremely short query deadlines (users won't wait more than a
couple of seconds when searching), which gives low precision results.

This book is very informative: [https://nlp.stanford.edu/IR-
book/](https://nlp.stanford.edu/IR-book/)

edit: I should say it's slightly outdated because it lacks coverage of "big
data" and how search companies currently deal with huge amounts of data.

------
keybits
I discovered this recently: [https://typesense.org/](https://typesense.org/)
Very nicely done and might be useful for learning from.

------
gillesjacobs
The field of Information Retrieval has largely moved to multi-modal retrieval
(search across video, audio, text), linked data, question-answering and so on.

But document retrieval (classic search as you describe it) is not a hot topic
anymore. That does not mean there are not people working on constantly
improving document retrieval: Google Scholar returns 17,800 results for
"document retrieval" in 2018 and 34 results with it in title. So it is in
widespread use but not the focus of the field, I would say.

------
hellbanner
Absolutely not. Frequently I try searching for things from years ago by
describing them into a search engine, and SEO spam floods my results. I
usually end up asking a human.

------
bovermyer
I would love to see a search engine that only returns websites that have no
JavaScript present (or some other artificial way of excluding major, "modern"
websites).
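
A rough sketch of how such a filter could look at indexing time, assuming the crawler already has the page's HTML as a string. The names `ScriptDetector` and `uses_javascript` are made up for illustration, and the check is only a heuristic: it looks for `<script>` tags and inline `on...` event attributes, and would miss things like `javascript:` URLs.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Flags a page as soon as a <script> tag or an inline event handler is seen."""

    def __init__(self):
        super().__init__()
        self.has_script = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "script" or any(name.startswith("on") for name, _ in attrs):
            self.has_script = True


def uses_javascript(html):
    """Heuristic: True if the markup contains script tags or on* attributes."""
    detector = ScriptDetector()
    detector.feed(html)
    detector.close()
    return detector.has_script


pages = {
    "plain.example": "<html><body><p>Just text.</p></body></html>",
    "modern.example": '<html><body><script src="app.js"></script></body></html>',
}
# Only pages without detected JavaScript would be kept in the index.
indexable = [url for url, html in pages.items() if not uses_javascript(html)]
```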

------
mbeex
In a sense, it deteriorated for some basic requirements. Remember the time
when you were able to use boolean operators and _exact_ sequences on Google
reliably.
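
For reference, literal boolean and exact-sequence matching is straightforward on a small scale. Below is a minimal sketch of a positional inverted index, assuming a toy in-memory corpus; the function names (`build_index`, `docs_with`, `phrase`) are hypothetical, and nothing here claims to reflect how Google works internally.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumerics; no stemming, so matching stays literal."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def docs_with(index, term):
    """Set of doc ids containing the exact term."""
    return set(index.get(term, {}))


def phrase(index, text):
    """Doc ids where the terms of `text` occur consecutively and in order."""
    terms = tokenize(text)
    if not terms:
        return set()
    candidates = set.intersection(*(docs_with(index, t) for t in terms))
    hits = set()
    for doc_id in candidates:
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms)) for s in starts):
            hits.add(doc_id)
    return hits


docs = {
    1: "free open source software",
    2: "open software is not always free",
    3: "source of free software",
}
index = build_index(docs)

both = docs_with(index, "free") & docs_with(index, "open")     # AND -> {1, 2}
without = docs_with(index, "free") - docs_with(index, "open")  # NOT -> {3}
exact = phrase(index, "open source")                           # phrase -> {1}
```

Boolean operators map directly onto set operations, and the exact-sequence check only needs the stored positions. The hard part at web scale is everything around this, not the matching itself.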

------
ankurdhama
If you squint a bit, you will find every computation is a kind of search. You
are searching for the "output" based on the given "input". Ex: Deep learning
is just searching for the weights. What people usually call search is the case
where the search space is explicitly defined (a set of records in a db, etc.),
but in most cases the search space is still there; it is just implicitly
defined by the description of the computation problem.
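
A toy illustration of the two cases, with made-up data and hypothetical helper names. Real deep learning searches the weight space with gradient descent rather than by enumerating a grid; the grid here is only an assumption to keep the sketch small.

```python
def search_explicit(records, predicate):
    """Explicit search space: the candidates are literally the stored records."""
    return [r for r in records if predicate(r)]


def search_implicit(candidates, loss):
    """Implicit search space: candidates are generated, the 'hit' is the lowest loss."""
    return min(candidates, key=loss)


# Classic search: filter a set of records.
records = [
    {"id": 1, "title": "intro to search"},
    {"id": 2, "title": "deep learning"},
]
found = search_explicit(records, lambda r: "search" in r["title"])

# "Training" a one-weight model y = w * x by searching candidate weights.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
grid = [i / 10 for i in range(0, 51)]  # 0.0, 0.1, ..., 5.0
best_w = search_implicit(grid, lambda w: sum((w * x - y) ** 2 for x, y in data))
```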

------
dfdffsdfff
I don't think it is a solved problem. It just seems daunting or difficult to
take on billion-dollar companies in the space.

------
thom
The default search engine in people's web browsers is a solved problem. Search
itself, not so much.

------
gumby
Search is very much not "solved", as others have pointed out. I'll add that
when Google started, search was considered pretty much "solved", and Yahoo
turned down the option of buying Google as they considered their search good
enough.

------
AppleseedJenny
To break into this space, I would recommend starting with a subset that people
want to search. Like the facebook model of only being available for some
colleges.

Some ideas:

\- Only academic papers

\- Only news sources

\- Only hacker topics

\- Only financial topics

\- Only small bloggers

\- Only literal keyword search which Google discontinued

Get traction in that domain, then build out from there.

------
stevenicr
I see opportunity for niche search engines, there are several areas that
google does not do well in on purpose it seems.

I think the truly hard part is that so many people accept whatever default is
already there. If they get an android phone - they use the search box there.
If they are using chrome browser, whatever input box is there on the first
screen is obviously the url bar and they use that (you and I may know the
difference, the average user doesn't care; it's one less click to just type
'google' into the url box in the center of the page, or fbook or whatever,
then google brings up the url you were going to, not searching).

This is why I think there is much less hype about competing in this space.
Unless there is a thing forcing companies to put other browsers and search
boxes on phones, tablets and chromebooks like the microsoft IE debacle so long
ago.. then trying to be the next google is impossible, even if you had better
results, better tech, etc.

Regardless of that, I think it's quite possible to make much better niche
search engines and get them used. If ten micro engines could make 1% of
googles revenues each, that would be a decent amount of money in my neck of
the woods.

I'd like to see other people post more sources about search tech in general,
several searches last year only brought a few info bits on what it may cost to
create an index of the net - someone posted some numbers using servers bought
off ebay and a rack at hurricane I think - had some numbers for the cost of
servers to pull a new index every month or so?

Certainly the tech and costs have changed since that was published, but not
much I've seen.

I'm pretty excited at this project posted recently:
[https://news.ycombinator.com/item?id=16976941](https://news.ycombinator.com/item?id=16976941)
( Show HN: A search engine that doesn't track you, where users vote for
results (github.com) )

I am hoping to get some people together to make a less persnickety and
fussbudgety search option for people who don't want to be babysat with
censoring kid gloves when looking for fun things.

If anyone wants to make a couple adults only engines, or ones that are more
fun, let me know.

Average people talk in slang and cut up about less high brow things, the big G
gives rank to the college papers and deranks for so many things, it's on the
road to being the next yellow pages and sciences journal, but not the place to
go when you want fun things anymore.

------
PaulHoule
There are big opportunities precisely because the field seems dead.

(1) The first big story is the dominance of Google. With an advertising-
centered model, Google has a reason to degrade result quality. If you get
trained to scroll down to find the real results and you found them good, you
might avoid touching any of the ads (hard to do because they cover so much of
the screen.)

(2) The web is 95% Javascript and 95% Spam -- getting useful results at all
requires fairly strict 'censorship' and vast resources if you want to compete
on Google's ground. No serious competitor will come in with a different model,
nothing will change unless you have a search engine that _YOU_ pay for and not
the advertisers.

(3) "Desktop search" is discredited in most people's minds. Your OS might have
added it as a feature back in 1995, but you've kept it turned off because it
slows down your computer and never finds what you are looking for. Result
quality is an issue, but the #1 perception here is that the indexing process
harms the user experience. In the era of multicore, NVMe, etc. can this be
changed?

(4) "Website search" is also discredited. Product search commonly works, but
search on most web sites is so bad that people are trained to just search on
Google. Thus you have very few chances to change people's minds.

(5) There is a big literature (the TREC conference) but there is something
profoundly depressing about it. It was one of the first big competitions, but
unlike the SAT Solver competition or Imagenet it was not associated with a
rapid improvement of technology but rather a painful slog through the mud. If
you start reading it at the beginning or in the middle somewhere you will find
that 20 or so things that you thought were sure bets to improve relevance
don't work. If you read the cliff notes to the first 10 years written by the
organizer, you find out that there was an interesting discovery made 5 years
in...

(6) The BM25 ranking function, which has two tunable parameters. BM25 was a
huge advance because it can be tuned to rank documents that are highly
variable in size comparably. BM25 is built into Elasticsearch, but nobody will
give you any advice on how to tune those parameters...

(7) Because they don't follow the relevance evaluation protocol in TREC; this
is badly flawed, but the data exists, and going from naive tfidf to tuned up
BM25 or information theoretic approach (also implemented in Elasticsearch)
will put up better numbers AND seem more relevant to end users.

(8) An open-source project to do that evaluation on Lucene got started but
never made a project; I have talked with Enterprise Search vendors who were
very aware of points 5-7 but did not tune up their search because it was
easier to sell customers on having hundreds of "connectors".

(9) The mainstream of TREC (it has broken into many flavors) and IR research
has been getting high recall at low precision. Maybe that's because when
Gerard Salton was messing around with punched cards at Cornell, 70 abstracts
was a lot of documents. Patent searchers and paralegals are interested in deep
recall, other people aren't.

(10) A major flaw in the mainstream TREC approach is that they are trying to
tune up the wrong function: the ideal relevance score is a probability
estimator of how likely the document is to be relevant.

(11) Google and Bing have made noises about personalized search but they don't
really do it. They are both stuck at 70% relevance for the first result
because of their limits in inferring user intent. The real relevance function
has the user's context as an input variable, but sampling by that thins the
data points to where it can't be approached as a "big data" problem.
"Personalization" works for advertisers who don't know your real intent but
are willing to pay for a 5% chance you may click, but not for you where you
will feel misunderstood (primed to get irrationally angry) 95% of the time.
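
For readers who haven't seen it, the two BM25 parameters from point (6) are usually called k1 (term-frequency saturation) and b (document-length normalization). What follows is a minimal sketch of the textbook Okapi BM25 formula with a non-negative IDF variant, not Elasticsearch's actual code; the function name and toy documents are made up, and the defaults k1=1.2, b=0.75 are simply the commonly used starting values.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25.

    k1 controls how quickly repeated occurrences of a term stop adding score.
    b controls how strongly long documents are penalized (0 = not at all).
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        length_norm = 1 - b + b * len(d) / avg_len
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    ["bm25", "ranking"],           # short document
    ["bm25"] + ["filler"] * 20,    # long document, same term frequency
    ["other", "words"],            # no match
]
with_norm = bm25_scores(["bm25"], docs)         # short doc outranks long doc
no_norm = bm25_scores(["bm25"], docs, b=0.0)    # length ignored: both docs tie
```

Tuning then amounts to sweeping k1 and b against a set of relevance judgments and keeping whichever pair gives the best evaluation numbers for your collection.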

~~~
thanatropism
Re: desktop search. What. I see "lay users" in meetings with projectors/TVs
using the search function in the windows menu to find documents all the time!
My org is named something like XQWK, so the files they want to show have the
letters XQWK. It's pretty natural to them.

------
galuggus
'Search' might be solved, but 'find' isn't.

------
z3t4
Why don't browsers have a search for bookmarked pages? It's because they
get paid by Google to have their users use Google search instead.

------
gwbas1c
I remember when I used to hear all the time about the semantic web. It sounded
more and more like mind reading.

Computers can not read minds.

------
batteryhorse
I remember reading about this exact topic somewhere but I can't seem to find
the link.

~~~
Nuzzerino
Here are my comments on the subject from December which might be of interest.
[https://www.quora.com/Is-it-possible-to-beat-
Google](https://www.quora.com/Is-it-possible-to-beat-Google)

------
Tycho
The solution we need is search that doesn’t rely on an information monopoly
like Google.

------
arafalov
(Disclaimer: I am an Apache Solr committer and popularizer)

Search is interesting! And it is important to differentiate the web search
(Google) and domain-specific search (Solr, Elasticsearch, recent release of
[http://vespa.ai/](http://vespa.ai/)). You cannot tune Google to your domain
needs and understanding.

For domain-specific search, the basics are there. Even the fancy "basics". It
is now very easy to add search to one's stack. In fact, Solr is in so many
stacks, it is not even mentioned much anymore. But we still get the
contributions back from Cloudera, Bloomberg, Alfresco, etc.

So, the cutting edge in Search is now on personalization, relevancy-tuning,
indexing non-text content (music, images, etc), multi-word semantic search,
graph traversal and, yes, Machine-Learning. See, for example,
[https://lucene.apache.org/solr/guide/7_3/learning-to-
rank.ht...](https://lucene.apache.org/solr/guide/7_3/learning-to-rank.html)

In fact, the Solr conference that used to be called Lucene/Solr Revolution is
now Activate and has focus on ML/AI because the topics are really starting to
overlap ([https://activate-conf.com/](https://activate-conf.com/)). You can
see the interesting topics from last conference:
[https://www.youtube.com/playlist?list=PLU6n9Voqu_1FMt0C-tVNF...](https://www.youtube.com/playlist?list=PLU6n9Voqu_1FMt0C-tVNFK0PBqWhTb2Nv)

Learning (Solr at least) is a different issue. There are so many features now
that the Reference Guide is absolutely enormous. And the demo schemas are
still a bit of a kitchen sink, making it look more complicated than it needs
to be. And, the last comprehensive book was several versions back. Again,
that's because Solr is big and is growing really fast still...

Actually that's why I chose to be a popularizer within the Solr community and
focus on making it easier for beginners to start.

See, for example, my latest presentation slides at:
[https://www.slideshare.net/arafalov/rapid-solr-schema-
develo...](https://www.slideshare.net/arafalov/rapid-solr-schema-development-
phone-directory) and the backing configuration repo:
[https://github.com/arafalov/solr-
presentation-2018-may](https://github.com/arafalov/solr-presentation-2018-may)
(includes smallest viable useful schema)

(tl;dr) Search is still exciting, lots of cutting edge cool stuff, and there
are people trying to make it easy for beginners to start.

------
gremlinsinc
Search can cover other domains too... what about an AI that can search
books/research articles/lectures/videos to diagnose a medical disease - some
of which aren't actually published live to the internet (perhaps behind
paywalls) -- then it takes a person's current symptoms and comes up with the
best diagnoses from its search across multiple media types.

How about search in the context of AR... if people overlay data on top of the
world we live in, in AR apps, will there be searchable things there? There's
room for search related projects in the future, but it just matters what the
data is, and why it's being searched.

Normal search 'engines' for web documents --- that itself seems pretty much
'won' by google, until something better comes along (an implant that has
better search than google, and I only need to think about what I want to
search for then I automatically download the data to my brain for the top 10
results)

------
markpapadakis
Search means a lot of things, but even if we limit to mean web-search, as most
people understand it, there is a lot more to it than the actual technology
that matches queries to documents.

IR is for all intents and purposes a solved problem -- in fact it was solved a
long time ago, and I highly recommend the seminal book “Managing Gigabytes”. I
also recommend [https://github.com/phaistos-networks/Trinity/wiki/IR-
Search-...](https://github.com/phaistos-networks/Trinity/wiki/IR-Search-Links)
this page (disclaimer: I am maintaining it) for some interesting/important
links to IR technologies, developments, etc. While some novel ideas come out
from time to time, the fundamentals haven’t changed -- progress there is
incremental and mostly specific to different encoding schemes or ways to
execute queries faster by using JIT or more cache-aware datastructures, etc.

Managing and querying documents based on keywords and boolean operators is one
thing, and Lucene/Solr, and Trinity ([https://github.com/phaistos-
networks/Trinity](https://github.com/phaistos-networks/Trinity)) among other
technologies can be used to take care of those challenges. But that’s the easy
part (assuming you can do this fast enough, because you almost always can’t
afford long-running queries):

\- User Interfaces: Not just how results are presented, but also how users can
construct or input queries. What options can be made available for filtering
matches?

\- Ranking: Precision is key, and rather simple formulas (tf/idf, BM25, etc)
generally don’t work well for many/most domains. Furthermore, ranking is
almost always not just about relevancy. It factors in static context scores
(e.g. document “popularity”), personalisation biases (how likely is it for the
user to mean Soccer or American Football for [football]), and other signals,
fused together somehow to determine the final ranking of matched documents.

\- Scale: Getting everything right is one thing, getting everything right at
massive scale is a whole different game. What may work on a small scale
(algorithms, technologies, services) may not work at all when you scale out.

\- Everything else not directly related to search but either important or
fundamental to a good experience/business: from matching queries to ads, to
analytics, to autosuggestions, to training ML models to power all that, etc.
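
To make the ranking point concrete, here is a minimal sketch of fusing signals with a weighted sum, using the [football] example. Everything in it is an assumption for illustration: the `Candidate` fields, the weights, and the idea that all signals are pre-scaled to [0, 1]. Production systems typically learn the fusion function rather than hand-picking weights.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    doc_id: str
    relevance: float   # query-dependent score (e.g. normalized BM25), in [0, 1]
    popularity: float  # static, query-independent score, in [0, 1]
    topic: str         # coarse topic label used for personalisation


def fused_score(c, user_topic_prefs, weights=(0.6, 0.25, 0.15)):
    """Weighted sum of relevance, popularity, and the user's affinity for the topic."""
    w_rel, w_pop, w_pers = weights
    personal = user_topic_prefs.get(c.topic, 0.0)
    return w_rel * c.relevance + w_pop * c.popularity + w_pers * personal


def rank(candidates, user_topic_prefs, weights=(0.6, 0.25, 0.15)):
    """Order candidates by fused score, best first."""
    return sorted(
        candidates,
        key=lambda c: fused_score(c, user_topic_prefs, weights),
        reverse=True,
    )


# Two pages that match the query [football] equally well on text alone.
results = [
    Candidate("nfl-scores", relevance=0.8, popularity=0.6, topic="american_football"),
    Candidate("premier-league", relevance=0.8, popularity=0.5, topic="soccer"),
]
for_soccer_fan = rank(results, {"soccer": 1.0})  # personalisation flips the order
for_anonymous = rank(results, {})                # popularity breaks the tie
```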

Web search is not a zero sum game. Bing makes over 3bn / year and while it may
not have a chance to catch up with Google anytime soon, that’s a great
business right there. Ditto for DDG. There are also companies that offer a
different or better experience and access to datasets google doesn’t have yet.

So, all told, search may be solved only in terms of the basic IR technology
that makes it all work, and arguably a lot better than it used to be in terms
of user interfaces, ranking, etc, but it will take a lot longer until those
other aspects of web search may be considered ‘solved’.

------
kapauldo
It's definitely not solved from a technical perspective but it's really hard
to compete from a business perspective. The market wants good fast search. If
you come up with great fast search, it's still hard. The only opportunity I
see from the business perspective is to challenge the visual paradigm. Having
said that, there are tons of opportunities from the academic perspective such
as inferring context, letting users control context, etc.

------
garyfirestorm
I think Google has pretty much monopolized search. You could say it's solved.
I doubt if there are any complaints that sound like 'i couldn't find something
using Google'

~~~
Nuzzerino
Google ultimately serves the advertisers, not the end users. This incentivizes
manipulation of search results in order to maximize ad growth. By that token
alone, they have not solved search. Any newcomer developing a search service
would do well to make it radically different from Google rather than a clone
with added privacy or a Microsoft logo.

Regarding always being able to find what you are looking for on Google: I
often struggle to find niche information using Google, and most content on the
Web is not indexed by Google.

I'm more of the opinion that YouTube is less possible to compete with in any
serious way.

