Ask HN: Is 'search' a solved problem?
141 points by search_ 8 days ago | 121 comments
I remember the time (not so long ago) when 'search' seemed to be the hottest topic in the industry. We had the rise of Google and competitors. There were search startups. Open source projects like Lucene and Solr were in the news. There were books published, blogs, conferences..

And now it seems that the industry has moved on. There are a million papers/books/blogs/online courses/video lectures/meetups about ML/AI, but I can't seem to find anything current on search.

What are the good resources to learn the fundamentals of search and keep up with the current happenings in that space? Not SEO, but more from the computer science/engineering point of view?

"Search" is too broad to ever be solved. That's like "solving entropy".

Google focused on a specific subset — you enter a few keywords or a phrase, and the machine returns the top ~10 links to pre-existing (indexed) web pages. But that's not all there is to search!


1. Intranets: internal documents, typically in different modalities (FAQs, support cases, wikis, public pages) and across diverse storages that evolved throughout the years via acquisitions and osmosis.

2. Clustering: you don't have any keywords, but rather want to find how a particular document (legal template, its clause section) evolved over time. You want to avoid using keywords. Search for similar documents or document sections. Find similarity between two documents that is based on semantics rather than query keywords. Applications: eDiscovery, contract management…

3. SME & Intent: "relevant result" means different things in different domains, or even different aspects of a single domain. Google is doing an amazing job with their "single search box", but there are industries (for example, HR) where search precision matters much more than recall. More elaborate, focused, domain-specific facets or even dialogue systems make sense there.
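To make item 2 concrete, here is a minimal sketch of document-to-document similarity search, where the "query" is a whole document rather than a few keywords. It is only an illustration: plain term-count vectors stand in for the learned semantic vectors the comment has in mind, and the function names (`vectorize`, `cosine`, `most_similar`) are made up for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Map a text to a sparse term-count vector (stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_doc, corpus, top_n=3):
    """Rank corpus documents by similarity to a whole document, not to keywords."""
    query_vec = vectorize(query_doc)
    scored = [(cosine(query_vec, vectorize(doc)), doc) for doc in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_n]
```

Swapping `vectorize` for a model that produces dense vectors would leave the ranking code unchanged, which is roughly why the "vectors" framing is attractive for contract clauses and similar material.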

Commercial plug: we built a search solution focused around semantic search (in the "machine learning and vectors" sense, not "semantic web and RDFs" sense), https://scaletext.ai. It's still early days in that our clients are all over the place, but to say Google/Lucene solved search is patently false.

Just want to say thank you for gensim [1].

[1]: https://radimrehurek.com/gensim/

Definitely valid point (that there are very different types of "search", not just one universal way to do things), but:

2) Google does not do simple keyword matching, and certainly has a strong sense of "semantics".

3) Try searching "Cheap hotels in San Francisco" or "Plumber jobs in Chicago" in Google. Just because it's a single search box does not mean all results are generated/displayed in the same way.

Absolutely. Determining query intent is an ancient, well researched and still active domain. And Google publishes their results regularly (kudos to them).

My point was more that encoding the relevant signals into a single general-purpose search box (a sort-of natural language) is an inherently noisy, ambiguous process. When you know what kind of search you want, it's better to factor out the relevant parameters and feedback loops explicitly, give them a clear UI and search flow. Rather than have users fumble with double quotes and double-guessing the query parser.

@Radim - is this something you are doing with your current startup? p.s. kudos from another fan of gensim.

Another fun one that I encountered in an NLP class too many years ago now is multilingual search.

Search in one language, get results in multiple languages. The context in which this came up was in organisations such as the EU or the Nordic Council.

Another one that I would like to see improved is better search for scientific articles.

If anything, it's an abandoned problem. A lot of companies bought really expensive enterprise search systems, which are sitting dormant because the results are so bad.

With advances in spamming, internet/email search is getting to be a harder problem every year.

I remember when Google was quite effective in finding what I need, but nowadays it's dismal. As an example, I googled for "storename return policy", and got page after page of results. All of them were tagged "missing storename", so just randomly picked return policies from other stores.

Search is their bread and butter, and they're probably keenly aware of the diminishing quality of results. I'd love to hear what's caused the recent trend for results that are missing a few of the most crucial keywords. Probably over-enthusiastically trying to filter out keyword farms.

Lately, it has felt like Google has become more optimized for less intelligent/tech-savvy users (i.e. people who search using natural language e.g. “tell me the store policy for target please”). In these cases many of the words used are either irrelevant or counterproductive.

I’ve noticed Facebook’s search has started catering to the lowest common denominator as well. Searching for (made-up example) “Louis Potter” used to give all exact matches priority. Current results look something like 1) Louis Potter 2) Luis Porter 3) Louis Potters 4) Louis Potter (#2). I don’t like when search engines assume that I’m misspelling words/names, and it would be nice if they adapted for individual behavior in this regard.

facebook used to let you do direct knowledge graph searches

"restaurants liked by people who like Joe's Pancakes and Green Dragon Mexican Pizza"

"movies liked by people who are friends with people who like Back to the Future and Arrested Development"

"people who live nearby and like Green Dragon Mexican Pizza and Arrested Development"

"friends of friends of friends who like Arrested Development and Joe's Pancakes"

"Friends of Megan Albert friends who are women named "Erin" Chipotle employ"

I think you can see why they removed it.

this URL might lead to something like "restaurants liked by people who like Snow White" https://www.facebook.com/search/130104022710/likers/pages-li...

There are many cases where you don't know you misspelled the name.

Ah! This drives me insane and it's only started recently. I can enter as little as two words in format 'uncommonProperNoun commonNoun', and the entire first page will be results of 'missing uncommonProperNoun'. It's maddening. It's like searching for 'the moon' and getting results back containing 'the'. Does anyone know how/why this started? Is there a way to enforce a query?

I think enclosing the word you really want in "quotes" forces Google to not discard that term.

It does! And another trick that can be useful in these situations is gaps in quotes: "the * fox" will return (for me): "the main character, Simon Fox", "The Fabulous Fox", etc.

I thought you were supposed to do "+ImportantWord LessImportantWord". Neither seems entirely reliable.

Google has a habit of silently breaking things that used to work, and as it's silent you don't know what the workaround is, even if you are so lucky as to have one.

The +ImportantWord syntax was recycled for Google Plus searches at some point and replaced with "ImportantWord". And to get an "Important Set of Words" query to work, you now have to enter some special "Verbatim" mode by going into Search Tools and clicking on a special checkbox.

The problem with enterprise search systems is that companies often buy a product expecting it to be as good as a public search engine out of the box, but it doesn't work that way.

Companies can have a great intranet search experience, but it requires time to properly configure the schema, train taxonomies, tune relevance, and understand the resources required to make it work quickly and reliably.

Document processing pipelines often require customization to meet the above requirements, and multi-lingual search may never be perfect within an organization due to differences in the way wordbreaking is done.

Some companies do put a lot of effort into doing this right and getting the best experience possible today, but most don't.

Google used to be a tool, now it's a product.

Not sure why this is getting downvotes, it carries a lot of truth. Internet-based tech companies like Google have matured to the point where many innovators have left and MBAs have invaded, favoring short-term profit over user experience.

> As an example, I googled for "storename return policy", and got page after page of results. All of them were tagged "missing storename", so just randomly picked return policies from other stores.

I think it helps to put the critical search term in quotes. At least in DuckDuckGo, `"storename" return policy` would exclude results that are missing "storename".

DDG seems to have recently switched to the same atrocious imprecise search that Google has been running for a few years already...

I don't want to fight with the search engine: "you meant this!" -- "nno, find me that" -- "you surely meant this so here it is!" ...

I hate DDG auto DYM (did-you-mean) so much.

So would you have rather got results with only 'storename' and not 'return policy', or no results at all, or what's the problem here?

If there was a result containing the whole search string it would (I believe) have been the first one.

If you don't have any results for the query, then return a page saying so. Otherwise you're annoying the user by making them scan through a page of irrelevant text to confirm that it really is all irrelevant. When this happens to me my reaction is "don't just ignore what I said to you", which is as infuriating when a computer does it as when a human does.

Google returns results specifically showing which terms were missing if this were the case for me. This is useful for finding articles with intersections of concepts (assuming I got the terminology/jargon right) and knowing when those probably don't exist, so I like the feature.

This being a relatively large chain store, I'm quite certain there should be forum posts or reddit discussions about their return policy. I have no way of proving they exist, seeing as how Google is shit at finding stuff.

"So what's the problem here"? The problem is, the search engine threw away one third of my query terms, and then gave me all chaff. Apparently I should've thrown in some random punctuation to convince it to actually use all of them.

Can you just tell the name of the store? I want to see for example what happens if I select verbatim mode.

Example with Target, in DDG, verbatim: https://duckduckgo.com/?q=%22target%22+%22return+policy%22&t...

Seems to work.

Given that google is the de facto standard ATM and often returns irrelevant results, I'd wager that search isn't solved.

"That does not by itself mean there's room for a new search engine, but lately when using Google search I've found myself nostalgic for the old days, when Google was true to its own slightly aspy self. Google used to give me a page of the right answers, fast, with no clutter. Now the results seem inspired by the Scientologist principle that what's true is what's true for you. And the pages don't have the clean, sparse feel they used to. Google search results used to look like the output of a Unix utility. Now if I accidentally put the cursor in the wrong place, anything might happen." http://www.paulgraham.com/ambitious.html

>> Given that google is the de facto standard ATM and often returns irrelevant results, I'd wager that search isn't solved.

Search has become a cat and mouse game where every time search is "solved" with an improved algorithm, it is "unsolved" quickly by people looking to game the algorithms.

Also, large publishers can start a new site in any niche and quickly outrank competitors by linking from their old domains with high authority. Quality of content doesn't matter, it's all about "who you know" on the Internet.

> Now if I accidentally put the cursor in the wrong place, anything might happen.

Google used to be one of the few websites I whitelisted on my Javascript blocker. I had to remove it from the whitelist, because otherwise its search became unusable: things moving when you scroll over them, URLs magically changing when clicked or even right-clicked, the search page URL actually pointing to a previous search because it used Javascript to submit the new search, etc.

At least Google search still works fine with Javascript disabled. Unfortunately, they also moved Google Maps to the main domain (it used to be in a subdomain), so I have to temporarily whitelist it every time I want to use Google Maps, then remember to remove it again from the whitelist before doing a normal web search.

For me, Google has become the entry point whenever I search for simple stuff, like song lyrics or an address. Whenever I need real insight, I turn to other sources like HN search or even Twitter.

Can you define "real insight"?

>Now if I accidentally put the cursor in the wrong place, anything might happen.

That literally happens to me sometimes when using Gmail (in regular, i.e. JS-enabled mode, not in HTML UI mode).

Sometimes when I move my cursor around with the arrow keys (e.g. up or down), the cursor suddenly vanishes when it reaches some part of the email I am writing or replying to.

Sometimes it happens near the quoted part of the email (the part I'm replying to), or something else. Irritating. Then I have to scrap the email and rewrite it, and maybe if I am lucky I can copy (some of) the text I've written and paste it into the new version.

I think it is some issues with their JS or CSS, also maybe related to monitor size and resolution.

I would suppose that this change is less google changing and more the web changing as people pump out content marketing and SEO. Also might be "olden days were greener", I sure remember a lot of keyword stuffing back in the day.

You're probably right about that. Although I can point to one very specific way the olden days were better: a Google search used to turn up lots of forum posts, which are often the most useful results on a given topic. These days I almost never see forum posts of any kind turned up as search results (except Stack Overflow, thankfully). I'm not sure what caused the change, but I find useful information is much more likely to be buried by low-information wikihow-type results and corporate landing pages.

Possibly because most forums seem to have died? Or maybe they died because they got excluded from search results?

There was, if anything, more search engine spam then. It's just that you got the tools to remove it yourself and then after 12-15 tries you got beautiful search results. Now they've gotten rid of that refinement process in favor of initial results that aren't spam but also aren't that exciting.

Is it the MBAs that do this to good products? Google maps was better 7 years ago too.

The best way I have heard this process described is 'the elves leaving middle earth' : https://steveblank.com/2009/12/21/the-elves-leave-middle-ear...

Anyone else know why Google is going full Kodak?

EDIT: It's the SEO/spammers/viruses type folks, isn't it? Seems like a good candidate.

Actually it is in Google's favor to "lose" the SEO wars.

So long as the paid ads are more relevant than the search results, the less likely you are to scroll past the ads and find the "real" search results, which are either repeats of the paid results or spam.

Google can always blame the spammers and they can control them entirely by turning them off when they feel like it so there is no motivation for the spammers to invest in improving their sites, or not for anyone else for that matter. (Google can always turn you off)

After all if you want more visibility in Google, Google wants you to buy ads.

> the Scientologist principle that what's true is what's true for you.

Care to elaborate more on that ?

That was a quote. There was a link behind the quote.

Not really, no. Some parts of search (Internet) have been cornered/solved by big players (e.g. Google), but many other parts (e.g. search in Intranets) are still an open problem. No one has found a solution that's as simple as PageRank for Intranets. No one has found a solution for the "the author of half of these documents is the intern who wrote the template"-problem and many other things. There are good products out there, but a "Google for Non-Internet" is still far away.

p.s.: All the bad searches on the various products I use or the websites of some companies make me almost want to get back into search. Because even for things that ARE solved in the technical sense there seems to be no "out of the box" solution or people would use it.

edit: To expand a bit - all the search solutions I've seen which weren't for Internet search were more or less bespoke, so you needed a project to get something decent. Sure, you can install a plain Lucene/Solr, but Lucene/Solr cannot understand the way your data works, which parts are important or if you want to show results which are older further down (or not!). They have decent defaults for the "common case", but you have to tune them for every customer/installation for good results and that makes it non-scalable. And being scalable without effort is usually one of the requirements people have for something to be "solved".
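As a rough illustration of the per-installation tuning described above, the sketch below combines per-field match scores with weights and an optional age penalty. Everything in it is hypothetical (the function `tuned_score`, the field names, the half-life idea); it is not how Lucene or Solr expose these settings, just the kind of decision each deployment has to make.

```python
def tuned_score(field_scores, field_weights, age_days, half_life_days=None):
    """Combine per-field match scores, optionally demoting older documents.

    field_scores:   e.g. {"title": 2.0, "body": 0.5} from some text-matching engine
    field_weights:  per-installation choice of which fields matter most
                    (fields without an entry get weight 1.0)
    half_life_days: None means age is ignored; otherwise the score halves
                    for every `half_life_days` days of document age
    """
    base = sum(field_weights.get(field, 1.0) * score
               for field, score in field_scores.items())
    if half_life_days is None:
        return base
    decay = 0.5 ** (age_days / half_life_days)
    return base * decay
```

Whether `half_life_days` should be set at all is exactly the "older results further down (or not!)" question: a news archive and a legal archive would likely answer it differently.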

So true, I can ask Google a range of verbal questions and get great answers, mostly what-is-the-fact questions.

At work we are nowhere near asking an intranet "show me the last pentesting report for Tony's new website" or "show me the change requests for the failed change this quarter". I know exactly what I want, and how to ask, but I will not get the results I want.

We are still in the stone age when it comes to search.

Ask Google who played the men's semifinals of Wimbledon three years ago and Google will tell you it indexed 6 million pages to provide a link that may or may not have the 4 names I am looking for. Why is it doing all this pointless work? And why is it that dumb in 2018?

We have got so used to what it does that a lot of people have stopped asking questions about how it does things and whether all the stuff it does is required.

Wolframalpha, Freebase/SemanticWeb/Wikidata/dbpedia approaches, NLP/NLU are still very underdeveloped and untapped.

Having open and distributed indexes like we see in nature with DNA is also totally unexplored because of Google-type centralised index monopolies in various domains. It just takes a gig or so to store a local offline index of all Wikipedia or Stack Overflow pages. And given the massive RAM and hard disks everyone has these days, why aren't we seeing sophisticated local offline search apps?

The internet is getting exponentially more noisy day by day and in many ways it's easier to find quality info going through a top notch library's index than wading through Google's. So there are lots of blindspots and areas to explore in search right now imho.

I think these sort of queries are solved.. if you know the categories of sites that index information and are able to scroll and process text, images and information quickly.

My first query string idea was 'men semifinals Wimbeldon 2015 wiki' and the resulting page contains the list in a nice format.

This is because I have the context that wiki pages would contain this sort of information. Google and others are getting better at processing more vague queries (like 'three years ago'), but I do agree we are nowhere close to being able to ask general questions. Knowing how to use tools like Google search (and other searches) and their advanced query syntax is a force multiplier/enabler.

> if you know the categories of sites that index information and are able to scroll and process text, images and information quickly.

That's a job for computers to do.

“ask google who played in the semfinals in wimbledon”

You sure about that? https://www.google.com/search?client=safari&hl=en-us&q=who+p...

If you just want a primer on how to think about adding search to a product, this piece by Max Grigorev is a great starting point: https://medium.com/startup-grind/what-every-software-enginee... It's ostensibly for the engineer, but actually feels more like it's written from the POV of a product manager.

It's not that people have moved on. It's that the entire culture of the ecosystem is built upon a narrative that Google is an all-powerful machine that cannot be stopped or contested. So people don't try to compete, and if they do, they will be ridiculed for it.

And for what good reason? Certainly not because of past attempts. In fact, Google has bought some startups that were involved with search.

I'd be interested in knowing what you find. There are bound to be many relevant texts not labeled as search, if you know what you are looking for. Perhaps I would start with web crawlers and go from there.

As a side topic, is there a useful web search engine that uses a fundamentally different approach to Google, e.g. aren't using backlinks as a ranking signal?

When Google's approach isn't giving me an answer, it'd be nice to try a search that wasn't based on a discoverability feedback loop.

My personal view is that there should be dedicated search providers for specific areas, such as academic, or technical, or news.

That way each provider can focus on building a good platform with clever machine learning tailored to that dataset.

We also need the return of proper Boolean operators and complex nested queries. Yes, Joe Public will never use them, but a lot of people search the internet as part of their jobs, or just have deep interests, and would love to have all those advanced search features back and be able to override the 'fuzzy logic' that generic search engines such as Google enforce on us.

I also disagree that all sites need to be mobile first. If I have a site that provides software for the Enterprise Sector, Google will still penalise me if my site isn't responsive, even though my target market is IT professional sitting in front of powerful laptops/desktops.

As many would agree, Google have too much power in the Search space, and they have basically dictated to the world how Search should be done, whether they are actually right or not.

No, it's not solved. Google utterly fails at it.

I'm a person with simple needs. I mostly search in English, sometimes in Russian (which is my native language) and sometimes in Japanese (which I'm learning). These languages use completely disjoint sets of characters so it should be obvious to Google what I'm trying to do (since I'm logged in and they know all about me). There's also a setting that allows me to pick languages that Google should prefer when searching (English is always selected). Now there's a little problem:

1) If I choose any language other than English, Google prefers websites in that language over English even if my query is in English. If I'm searching for programming-related stuff, I don't need it poorly translated into my native language, or any other language! I want the original, on the first page of search results. So any language other than English gets switched off.

2) If Japanese is not switched on, Google thinks that any query that consists only of kanji (a subset of Japanese characters that are also used in Chinese) is in Chinese, so I get pages and pages of Chinese websites before I see any Japanese ones. Since I can't read Chinese at all, it's completely useless. Now, you might say: of course Google has no way to know if I wanted Japanese, not Chinese. Oh, it knows. I can search for a name of a Japanese person, and Google will display a sidebar with their English Wikipedia article, date of birth and nationality and everything, and yet all the search results will still be in Chinese.

Seconded. Bilingual support is lacking. For all the profiling they do, they should know my language preferences better than myself.

> utterly fails

> $74b revenue

> 80% market share

Social networking's industry is also a great success, look at Facebook.

Utterly fails is a little bit _too_ hyperbolic no? There are still immature frontiers, sure, but for most day to day search it's pretty magic to me still.

I can give a list of things I'd like in a search engine:

1) Sticky topics - if I'm at work and I type Flow I am way more likely to mean Facebook's js library than I am Flow energy.

2) Different views on the information (Grid view/masonry that makes sense)

3) Ability to search within a set of search results

4) Ability to customise the algorithm with other programs I can write/plugins

5) Many more things for programmers and advanced "omni bar" style features that allow me to type shortcuts and autocomplete things.

6) More automation of programmer stuff - if I type a number and a unix timestamp it should show me a date etc. for possibly millions of things. Same for unicode char of a hammer, url decode, etc. etc.

7) Clean integration with your OS search that makes sense.
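Item 6 is the easiest of these to prototype. Below is a minimal sketch of such "instant answers" using only the Python standard library. The query conventions (a bare 10-digit number means a Unix timestamp, the `url decode ` and `unicode ` prefixes) and the function name `instant_answer` are assumptions made up for this example, not any search engine's actual syntax.

```python
import re
import unicodedata
from datetime import datetime, timezone
from urllib.parse import unquote


def instant_answer(query):
    """Return a direct answer for a few programmer-style queries, else None."""
    query = query.strip()

    # A bare 10-digit number is treated as a Unix timestamp in seconds.
    if re.fullmatch(r"\d{10}", query):
        moment = datetime.fromtimestamp(int(query), tz=timezone.utc)
        return moment.isoformat()

    # "url decode <text>" percent-decodes the rest of the query.
    if query.lower().startswith("url decode "):
        return unquote(query[len("url decode "):])

    # "unicode <name>" looks up a character by its official Unicode name.
    if query.lower().startswith("unicode "):
        try:
            return unicodedata.lookup(query[len("unicode "):])
        except KeyError:
            return None

    # Anything else falls through to ordinary search.
    return None
```

For instance, `instant_answer("1000000000")` gives the UTC time for that timestamp, and `instant_answer("unicode SNOWMAN")` returns the snowman character. The "possibly millions of things" part is the hard bit: each handler is trivial, but deciding which one a query means is an intent problem again.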

I'm aware Duck Duck Go does some of these things (relatively badly). I think I'll give it another go then and see if it's better.

Great list, +1 to everything, +10 to #1, #3.

Far from it. I frequently have issues even with search on Google or Amazon, who invested millions into that. Dealing with people being imprecise in naming and spelling, different contexts and personalized results is still barely touched or non-existent, and definitely not solved.

Lucene provides building blocks, not a solution, and not even all of them.

Search on a small scale may seem to work if you have phrases that are mostly unique, like recent movie titles or celebrity names. Go beyond that and it's a mess.

The 'solution' you're looking for also isn't necessarily the 'solution' Google and Amazon are looking for. You're looking for an answer to a question, or to find a thing, they're looking to maximize ad revenue and sales. So the answer to your search is tempered by what gets them the most money. _That_ is what they've spent millions trying to figure out.

I don't see 'search,' aka information retrieval (IR), as a solved problem. I went to the last SIGIR conference in Tokyo, and yes, the heavy hitters in the field were there to promote and present their latest research. It is no doubt a very active research field using machine learning techniques. Reading the papers published could give a view of the (academic) state of the art.

Whether it is a good business strategy to challenge Google head-on is another question. Whether there is a sure way to learn and engineer said systems on extremely large scales is another question.

In 1994 there were people who believed search was 'solved' thanks to Yahoo! providing a comprehensive index of the Web (which actually was doable for a couple of years). Then the Internet exploded in size, so in 1996 there were people who believed search was 'solved' thanks to AltaVista. The Internet continued to grow and thanks to early SEO techniques like keyword stuffing, an opening existed for Google to fill when they 'solved' search with 'The Algorithm.' Now we're in the midst of search being 'solved' again via ML/DNNs. And I'm sure it will be 'solved' again and again in years to come. So to answer your question about if it's solved, you'd first have to specify which time? :-) Search is a very, very large space and will likely not be solved anytime soon.

Also, don't confuse the business/tech press definition of solved (i.e. a dominant player raking in large piles of money and shutting out competition) with said problem actually being solved as in there does not exist a better way to attack the problem. Granted, when the business/tech world considers a problem solved, this often just means that the lion's share of (known) financial incentives are no longer 'low-hanging fruit'... until a challenger figures out a way to blow up the incumbent's business model.

DARPA/IARPA now invests in search 2.0: Instead of asking it to retrieve stored information, you can instruct an agent/search bot to perform tasks for you.

For instance, one should be able to search for: "who is the leader of this IRC hacker group?" "where can heroin be bought on the deep web?" "Is there women trafficking going on behind that log-in wall?" and then an intelligent agent is dispatched, avoiding/crossing roadblocks, like log-in forms, and will eventually bring you the answer.

Coursera has more on the basics, like: https://www.coursera.org/learn/text-retrieval

OpenAI has more on the current happenings of creating more intelligent search bots: https://github.com/openai/universe

Other possible future research areas in information retrieval include being able to search for services ("Where is cheapest taxi service for current location?") and an integration with IOT.

Elastic and Solr (Lucene), with the help of various other DBs or graph DBs, get you farther along, but you really have to add machine learning to get things in context. That requires more data in certain domains, so someone like me who goes it alone has to dive into many disciplines. It is not straightforward, and results are not yes or no; more like 70% good, and the rest is subjective.

The most challenging parts are if you have many dimensions at the same time such as location / full text / user preferences / social filters / permissions etc. These are the problems that make my life suck right now as you cannot simply not have joins for some things or not have graph relations etc etc. Case in point following feeds with many dimensions so you need a pipeline approach in stages.

I tend to think the search is often the last resort and indicative of the other navigation system being broken. When people can reach the info they need in a more organized way quickly, they'd probably do so. Therefore full text search has to cover every residual task; it's bound to be messy.

I've learned that search is very domain-specific and it's only "solved" if you can pin down what "relevant" means for your corpus of documents.

In terms of fundamentals, I'd suggest reading about tf-idf, which is the basis of Lucene (which powers Solr and Elasticsearch).
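For anyone who wants to see the idea before reading further, here is a minimal sketch of classic tf-idf scoring in plain Python. It is a textbook variant written for this example (whitespace tokenization, tf as a fraction of document length, idf as log(N/df)), not Lucene's actual scoring formula, and the function name `tf_idf_scores` is made up.

```python
import math
from collections import Counter


def tf_idf_scores(query, documents):
    """Score each document against the query with a basic tf-idf sum."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue  # term absent from this document contributes nothing
            tf = counts[term] / len(tokens)            # how prominent in this doc
            idf = math.log(n_docs / doc_freq[term])    # how rare across the corpus
            score += tf * idf
        scores.append(score)
    return scores
```

With the documents "the moon is bright", "the sun is bright" and "the the the", the query "the moon" scores only the first document above zero: "the" appears everywhere, so its idf is log(1) = 0, while the rare term "moon" carries all the weight.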

Google's good for 1 - 3 key words / phrases but anything longer can still be difficult.

Regardless of query length it's strange that I often find the second result better than the first one.

Finding something in Gmail can be a real challenge. I wonder if it would be possible to unleash Algolia on your inbox.

I've confronted my PhD supervisor (Professor of Library and Information Science) with this statement once, and she almost went berserk. Her take is that free text search is approaching the solved problem stage, but almost all other search isn't.

The search problem is connected to the spam filtering problem, which is an ever-advancing arms race: it's never solved and depends on the new schemes spammers come up with. So search itself is never a solved problem.

I wish there were a search engine that could figure out the groups the search results belong to. Sometimes when searching for swift programming language, a certain singer also makes an appearance. Or when searching for physics related things, I get that Olivia Newton John song. I'd like a search engine that displays some sort of Venn diagram (not quite, but it's close) that lets me home in on my results.

To be honest, I think that I would actually want an engine that scrapes fewer sites but good sites and tries to understand them better. Also regular expressions.

One aspect that I don't see mentioned is that APIs for search have basically disappeared from the major players, likely due to them being expensive (programs can hammer an API faster than a human with a text box) and lacking a revenue model (no human may be involved, making advertising useless).

This has caused "search" to devolve into "human using Web browser types natural language in box, human is presented with results (some semantic, most just links to things other people wrote, some advertisments around the side), human reads through results to see if any are useful to them".

This is certainly useful, and I rely on it all the time, but it's not the pinnacle of what search could be. It's like if `grep` didn't pipe to stdout, but instead popped up an alert box for each line like "Your query '.*' matched the line 'foo'. Try the new McDonalds saver menu today! [Next/Cancel]". It would still be useful, but nowhere near as useful as piping to stdout.

Many years ago Google provided an API to their search engines, which applications could build on to be more "smart". This could have paved the way to much better software: for example, imagine a prolog system where all of Google's knowledge could be used by the calculations.

That path was mostly abandoned since there's no scalable incentive to make it operate, much like the semantic Web. Rather than opening up databases to empower others, it's much more profitable to keep them walled off behind a few limited, pay-per-use interfaces (e.g. "paying" by showing a human some adverts). Attempts to bypass or abstract over these interfaces are hit with rate limiters, Recaptcha checks, etc.

Great topic. Part of the issue is probably a transition from algorithmic approaches to data-driven approaches. What did previous users search for and click on? Existing companies have a huge advantage from years of data, and not the kind of advantage that others can learn from (compare to publishing a better algorithm). Another factor may be that parts of the problem can be separated out and studied on their own, such as natural language processing.

>What did previous users search for and click on?

This is why Google is often bad for searching tech things. You often get very old, useless links for a topic.

Lucene/Solr, Elasticsearch and Algolia did a great job creating search tools and services, and this quenched the thirst of the masses. I don't think it's a solved problem, it's just a problem that has commercially viable solutions. When it comes to resources, I've found valuable knowledge in Lucene/Solr forums and mailing lists, back in the day. It's worth a read.

I'm increasingly thinking that a problem with search is inconsistent and nonstandard (or nonexistent) conventions, protocols, and APIs.

It seems to me that a fair bit of the search problem could be addressed by sites themselves serving wordlists, tuples, statistically improbable terms (there's another term for this that's escaping me), etc., rather than consenting to being heavily crawled by numerous spiders.

Vastly improved content metadata (particularly for largely fixed "article" content), including author, date, and (a reliable) topical categorisation would help. For realtime information, APIs are probably more reasonable than text search, though those would likely be front-ended by specific applications.

This still leaves the very nontrivial problems of reputation, relevance, black-hat SEO, and information manipulation (propaganda, misinformation, misdirection, disinformation), and just plain street-grade idiocy.

But several elements of this strike me as amenable to either localised or distributed solutions.

I think search is not only not solved, but a failed concept overall.

It's too hard to find relevant information from the gazillions of pages based only on a few words.

I think search needs to be replaced by some kind of indexing/ontology/knowledge organization system, and then maybe only be applied in the "last mile" of a person's 'search' for relevant information

We're trying. Kindly see https://millionshort.com

Gives good results. Thanks for providing the link.

I'd like a search for things that I've read or seen in the last few days. Or years.



Yacy can do it. You can use it as a proxy and it does index all visited pages. (I tried Yacy, but never this feature.)

Self-plug: I recently launched a dedicated news search engine at https://yetigogo.com — Based on my personal needs, but it works pretty well for tracking any current event.

I hope you are ready for the "link tax"...

It depends what you are looking for. I have the hardest time searching for laptops that meet all of the specs I want, for instance.

Takes days.

Minimum screen brightness. 1440p touch. Needs USB-A. 8th gen quad core. 20+ watt TDP. etc. Not too thin, not too thick. decent graphics.

In the EU you can use https://geizhals.eu/?cat=nb which is pretty decent. I'm using that regularly to check options.

Is something like this not available in the US/other territories? If not: here is your opportunity :-D

I think this issue lies with the OEMs not providing information. Search or indexing cannot solve this problem. I think you'll have to watch plenty of YouTube influencers in the laptop domain to find what you're looking for.

Can't that be solved with a bunch of checkbox filters, generating the proper WHERE statements? Several websites offer this, be it for laptops or other products.
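
A rough sketch of what "checkbox filters generating WHERE statements" could look like, using an in-memory SQLite table. The `laptops` schema, the filter names, and `build_query` are all made up for illustration; the point is only that each ticked filter maps to a whitelisted, parametrized condition.

```python
import sqlite3

# Hypothetical mapping from a UI filter name to a parametrized SQL condition.
FILTERS = {
    "min_cores": "cores >= ?",
    "min_tdp_watts": "tdp_watts >= ?",
    "touch": "touch = ?",
    "has_usb_a": "usb_a = ?",
}

def build_query(selected):
    """Turn {filter_name: value} into (sql, params) using only whitelisted conditions."""
    clauses, params = [], []
    for name, value in selected.items():
        clauses.append(FILTERS[name])  # raises KeyError for unknown filter names
        params.append(value)
    where = " AND ".join(clauses) if clauses else "1=1"
    return f"SELECT model FROM laptops WHERE {where} ORDER BY model", params

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE laptops (model TEXT, cores INTEGER, tdp_watts INTEGER, "
    "touch INTEGER, usb_a INTEGER)"
)
conn.executemany(
    "INSERT INTO laptops VALUES (?, ?, ?, ?, ?)",
    [("A", 4, 25, 1, 1), ("B", 2, 15, 1, 1), ("C", 4, 28, 0, 1)],
)

sql, params = build_query({"min_cores": 4, "min_tdp_watts": 20, "touch": 1})
matches = [row[0] for row in conn.execute(sql, params)]
```

This only works when the specs are actually in the table, which is the objection raised above about OEMs not publishing things like minimum screen brightness.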

Searching is easy enough but scoring the value of what it finds is not! Is an answer on Stackoverflow valuable because the poster has 100K reputation? (no!) Was the information valid 10 years ago but completely irrelevant now? Is this programming blog post very specific to Drupal and not relevant to other PHP frameworks? Was it information taken from somewhere else? (I would rather send traffic to the original).

Another problem is how to search for something when you don't understand it enough to search for it or you can't think of distinct enough words or phrases to search for it.

Internet search has been solved for about 10 years.

Evidence: Before it sold off to Bing, Yahoo search was quantifiably better than google for a few years (in blind tests where you rip off the branding).

No one cared, because google was good enough.

Having said that, I use duck duck go these days, and occasionally spot check using google. I think the google results have become unusable because they too aggressively map to related concepts, and otherwise second guess what I’ve typed, but there have been endless debates about that on HN, and it’s essentially decided by the user’s taste.

Couldn't you use quotation marks to search for phrases? I do that all the time. The search is not verbatim, Google certainly stems the content of quotation mark phrases, but I was under the impression that related concepts are not included.

I'm not claiming to be sure about that, it's a subjective impression, so if someone can explain better, I'd be very interested. Since I use Google for work (more often than Google scholar), I'd hate not to find what I'm looking for, but generally Google results seem to be much better than others. It's the main reason why I don't use DuckDuckGo, ixQuick, Yandex, Bing.

Aggressively “”’ing can help, but I think that also turns off useful stuff, like word stemming, when what I want it to do is not map to some other more popular term.

I’m sure people have been trained to use google effectively, but I switched years ago, and find it harder than the alternatives.

To be clear, this is all nitpicking. The last time this came up on HN the only query anyone found that actually showed practical differences between the two was the acronym FOSS.

Google rewrote to “free open source software”, with no quotes, giving poor quality results (eg bsd software).

Ddg gave the definition of foss, and relatively few results pointing at actual software, because that acronym was (is?) obscure.

It's solved, but as many have noted, not to a satisfactory degree. Modern search engines have extremely short query deadlines (users won't wait more than a couple of seconds when searching), which gives low precision results.

This book is very informative: https://nlp.stanford.edu/IR-book/

edit: I should say it's slightly outdated, because it lacks coverage of "big data" and of how search companies currently deal with huge amounts of data.

I discovered this recently: https://typesense.org/ Very nicely done and might be useful for learning from.

The field of Information Retrieval has largely moved to multi-modal retrieval (search across video, audio, text), linked data, question-answering and so on.

But document retrieval (classic search as you describe it) is not a hot topic anymore. That does not mean there are not people working on constantly improving document retrieval: Google Scholar returns 17,800 results for "document retrieval" in 2018 and 34 results with it in title. So it is in widespread use but not the focus of the field, I would say.

If you squint a bit, you will find every computation is a kind of search. You are searching for the "output" based on the given "input". Ex: Deep learning is just searching for the weights. What people usually call search is the case where the search space is explicitly defined (a set of records in a db etc.), but in most cases the search space is still there; it is just implicitly defined in the description of the computation problem.

Absolutely not. Frequently I try searching for things from years ago by describing them into a search engine, and SEO spam floods my results. I usually end up asking a human.

I would love to see a search engine that only returns websites that have no JavaScript present (or some other artificial way of excluding major, "modern" websites).

In a sense, it deteriorated for some basic requirements. Remember the time when you were able to use boolean operators and _exact_ sequences on Google reliably?

I don't think it is a solved problem. It just seems daunting or difficult to take on billion-dollar companies in the space.

I see opportunity for niche search engines, there are several areas that google does not do well in on purpose it seems.

I think the truly hard part is that so many people accept whatever default is already there. If they get an Android phone, they use the search box there. If they are using the Chrome browser, whatever input box is on the first screen is obviously the URL bar, and they use that. You and I may know the difference, but the average user doesn't care: it's one less click to just type 'google' into the box in the center of the page, or 'fbook' or whatever, and then Google brings up the URL they were going to (navigating, not searching).

This is why I think there is much less hype about competing in this space. Unless something forces companies to put other browsers and search boxes on phones, tablets and Chromebooks, like the Microsoft IE debacle so long ago, trying to be the next Google is impossible, even if you had better results, better tech, etc.

Regardless of that, I think it's quite possible to make much better niche search engines and get them used. If ten micro engines could make 1% of googles revenues each, that would be a decent amount of money in my neck of the woods.

I'd like to see other people post more sources about search tech in general, several searches last year only brought a few info bits on what it may cost to create an index of the net - someone posted some numbers using servers bought off ebay and a rack at hurricane I think - had some numbers for the cost of servers to pull a new index every month or so?

Certainly the tech and costs have changed since that was published, but not much I've seen.

I'm pretty excited at this project posted recently: https://news.ycombinator.com/item?id=16976941 ( Show HN: A search engine that doesn't track you, where users vote for results (github.com) )

I am hoping to get some people together to make a less persnickety and fussbudgety search option for people who don't want to be babysat with censoring kid gloves when looking for fun things.

If anyone wants to make a couple adults only engines, or ones that are more fun, let me know.

Average people talk in slang and cut up about less high brow things, the big G gives rank to the college papers and deranks for so many things, it's on the road to being the next yellow pages and sciences journal, but not the place to go when you want fun things anymore.

The default search engine in people's web browsers is a solved problem. Search itself, not so much.

Search is very much not "solved", as others have pointed out. I'll add that when Google started, search was considered pretty much "solved", and Yahoo turned down the option of buying Google as they considered their search good enough.

To break into this space, I would recommend starting with a subset that people want to search. Like the facebook model of only being available for some colleges.

Some ideas:

- Only academic papers

- Only news sources

- Only hacker topics

- Only financial topics

- Only small bloggers

- Only literal keyword search which Google discontinued

Get traction in that domain, then build out from there.

'Search' might be solved, but 'find' isn't.

Why don't browsers have a search for bookmarked pages? It's because they get paid by Google to have their users use Google search instead.

There are big opportunities precisely because the field seems dead.

(1) The first big story is the dominance of Google. With an advertising-centered model, Google has a reason to degrade result quality. If you get trained to scroll down to find the real results and you found them good, you might avoid touching any of the ads (hard to do because they cover so much of the screen.)

(2) The web is 95% Javascript and 95% Spam -- getting useful results at all requires fairly strict 'censorship' and vast resources if you want to compete on Google's ground. No serious competitor will come in with a different model, nothing will change unless you have a search engine that YOU pay for and not the advertisers.

(3) "Desktop search" is discredited in most people's minds. Your OS might have added it as a feature back in 1995, but you've kept it turned off because it slows down your computer and never finds what you are looking for. Result quality is an issue, but the #1 perception here is that the indexing process harms the user experience. In the era of multicore, NVMe, etc. can this be changed?

(4) "Website search" is also discredited. Product search commonly works, but search on most web sites is so bad that people are trained to just search on Google. Thus you have very few chances to change people's minds.

(5) There is a big literature (the TREC conference) but there is something profoundly depressing about it. It was one of the first big competitions, but unlike the SAT Solver competition or ImageNet it was not associated with a rapid improvement of technology but rather a painful slog through the mud. If you start reading it at the beginning or in the middle somewhere you will find that 20 or so things that you thought were sure bets to improve relevance don't work. If you read the Cliff's Notes to the first 10 years written by the organizer, you find out that there was an interesting discovery made 5 years in...

(6) The BM25 ranking function, which has two tunable parameters. BM25 was a huge advance because it can be tuned to rank documents of highly variable size comparably. BM25 is built into Elasticsearch, but nobody will give you any advice on how to tune those parameters...

(7) Because they don't follow the relevance evaluation protocol in TREC; this is badly flawed, but the data exists, and going from naive tf-idf to tuned-up BM25 or an information-theoretic approach (also implemented in Elasticsearch) will put up better numbers AND seem more relevant to end users.

(8) An open-source project to do that evaluation on Lucene got started but never made a project; I have talked with Enterprise Search vendors who were very aware of points 5-7 but did not tune up their search because it was easier to sell customers on having hundreds of "connectors".

(9) The mainstream of TREC (it has broken into many flavors) and IR research has been getting high recall at low precision. Maybe that's because when Gerard Salton was messing around with punched cards at Cornell, 70 abstracts was a lot of documents. Patent searchers and paralegals are interested in deep recall, other people aren't.

(10) A major flaw in the mainstream TREC approach is that they are trying to tune up the wrong function: the ideal relevance score is a probability estimator of how likely the document is to be relevant.

(11) Google and Bing have made noises about personalized search but they don't really do it. They are both stuck at 70% relevance for the first result because of their limits in inferring user intent. The real relevance function has the user's context as an input variable, but sampling by that thins the data points to where it can't be approached as a "big data" problem. "Personalization" works for advertisers who don't know your real intent but are willing to pay for a 5% chance you may click, but not for you where you will feel misunderstood (primed to get irrationally angry) 95% of the time.
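
For readers who have not met the two tunable parameters from point (6): a minimal sketch of the commonly published BM25 formula, where `k1` controls term-frequency saturation and `b` controls document-length normalization. The whitespace tokenizer and function name are assumptions, the idf uses the non-negative `log(1 + ...)` variant, and `k1=1.2`, `b=0.75` are just the commonly cited defaults, not tuned values.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with BM25."""
    tokenized = [d.lower().split() for d in docs]  # naive tokenizer (assumption)
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        # Length normalization: b=0 ignores document length, b=1 normalizes fully.
        norm = k1 * (1 - b + b * len(toks) / avgdl)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = ["solr search", "solr " + "filler " * 20, "something else"]
default = bm25_scores("solr", docs)       # the short document outranks the long one
flat = bm25_scores("solr", docs, b=0.0)   # length ignored: both "solr" docs tie
```

Tuning means picking `k1` and `b` against relevance judgments for your own collection, which is the evaluation step points (7) and (8) say people skip.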

Re: desktop search. What. I see "lay users" in meetings with projectors/TVs using the search function in the windows menu to find documents all the time! My org is named something like XQWK, so the files they want to show have the letters XQWK. It's pretty natural to them.

I remember when I used to hear all the time about the semantic web. It sounded more and more like mind reading.

Computers can not read minds.

I remember reading about this exact topic somewhere but I can't seem to find the link.

Here are my comments on the subject from December which might be of interest. https://www.quora.com/Is-it-possible-to-beat-Google

The solution we need is search that doesn’t rely on an information monopoly like Google.

Search can cover other domains too... what about an AI that can search books/research articles/lectures/videos to diagnose a medical disease - some of which aren't actually published live to the internet (perhaps behind paywalls) -- then it takes a person's current symptoms and comes up with the best diagnoses from its search across multiple media types.

How about search in the context of AR... if people overlay data on top of the world we live in, in AR apps, will there be searchable things there? There's room for search related projects in the future, but it just matters what the data is, and why it's being searched.

Normal search 'engines' for web documents --- that itself seems pretty much 'won' by google, until something better comes along (an implant that has better search than google, and I only need to think about what I want to search for then I automatically download the data to my brain for the top 10 results)

(Disclaimer: I am an Apache Solr committer and popularizer)

Search is interesting! And it is important to differentiate the web search (Google) and domain-specific search (Solr, Elasticsearch, recent release of http://vespa.ai/). You cannot tune Google to your domain needs and understanding.

For domain-specific search, the basics are there. Even the fancy "basics". It is now very easy to add search to one's stack. In fact, Solr is in so many stacks, it is not even mentioned much anymore. But we still get the contributions back from Cloudera, Bloomberg, Alfresco, etc.

So, the cutting edge in Search is now on personalization, relevancy-tuning, indexing non-text content (music, images, etc), multi-word semantic search, graph traversal and, yes, Machine-Learning. See, for example, https://lucene.apache.org/solr/guide/7_3/learning-to-rank.ht...

In fact, the Solr conference that used to be called Lucene/Solr Revolution is now Activate and has focus on ML/AI because the topics are really starting to overlap (https://activate-conf.com/). You can see the interesting topics from last conference: https://www.youtube.com/playlist?list=PLU6n9Voqu_1FMt0C-tVNF...

Learning (Solr at least) is a different issue. There are so many features now that the Reference Guide is absolutely enormous. And the demo schemas are still a bit of a kitchen sink, making it look more complicated than it needs to be. And, the last comprehensive book was several versions back. Again, that's because Solr is big and is growing really fast still...

Actually that's why I chose to be a popularizer within the Solr community and focus on making it easier for beginners to start.

See, for example, my latest presentation slides at: https://www.slideshare.net/arafalov/rapid-solr-schema-develo... and the backing configuration repo: https://github.com/arafalov/solr-presentation-2018-may (includes smallest viable useful schema)

(tl;dr) Search is still exciting, lots of cutting edge cool stuff, and there are people trying to make it easy for beginners to start.

Search means a lot of things, but even if we limit to mean web-search, as most people understand it, there is a lot more to it than the actual technology that matches queries to documents.

IR is for all intents and purposes a solved problem -- in fact it was solved a long time ago, and I highly recommend the seminal book “Managing Gigabytes”. I also recommend https://github.com/phaistos-networks/Trinity/wiki/IR-Search-... (disclaimer: I maintain this page) for some interesting/important links to IR technologies, developments, etc. While some novel ideas come out from time to time, the fundamentals haven’t changed -- progress there is incremental and mostly specific to different encoding schemes or ways to execute queries faster by using JIT or more cache-aware data structures, etc.

Managing and querying documents based on keywords and boolean operators is one thing, and Lucene/Solr and Trinity (https://github.com/phaistos-networks/Trinity), among other technologies, can be used to take care of those challenges. But that’s the easy part (assuming you can do this fast enough, because you almost always can’t afford long-running queries):

- User Interfaces: Not just how results are presented, but also how users can construct or input queries. What options can become available for filtering matches?

- Ranking: Precision is key, and rather simple formulas (tf/idf, BM25, etc) generally don’t work well for many/most domains. Furthermore, ranking is almost always not just about relevancy. It factors in static context scores (e.g. document “popularity”), personalisation biases (how likely is it for the user to mean Soccer or American Football for [football]), and other signals, fused together somehow to determine the final ranking of matched documents.

- Scale: Getting everything right is one thing, getting everything right at massive scale is a whole different game. What may work on a small scale (algorithms, technologies, services) may not work at all when you scale out.

- Everything else not directly related to search but either important or fundamental to a good experience/business: from matching queries to ads, to analytics, to autosuggestions, to training ML models to power all that, etc.
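
As one illustration of the "fused together somehow" part of the ranking point, here is a sketch of the simplest possible fusion: a weighted sum of a text relevance score, a static popularity score, and a per-user affinity. The `Candidate` fields, the weights, and the numbers are all hypothetical; real systems typically learn this combination rather than hand-pick it.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    doc_id: str
    relevance: float          # text-match score (e.g. from BM25), scaled to [0, 1]
    popularity: float         # static, query-independent score in [0, 1]
    personal_affinity: float  # fit with this user's history, in [0, 1]

# Hypothetical hand-picked weights.
WEIGHTS = {"relevance": 0.6, "popularity": 0.2, "personal_affinity": 0.2}

def fused_score(c, weights):
    """Linear combination of the three signals."""
    return (
        weights["relevance"] * c.relevance
        + weights["popularity"] * c.popularity
        + weights["personal_affinity"] * c.personal_affinity
    )

def rank(candidates, weights=WEIGHTS):
    return sorted(candidates, key=lambda c: fused_score(c, weights), reverse=True)

# Two results for [football] with equal text relevance, for a user who mostly reads soccer news.
results = [
    Candidate("american-football", relevance=0.8, popularity=0.7, personal_affinity=0.1),
    Candidate("soccer", relevance=0.8, popularity=0.5, personal_affinity=0.9),
]
personalised = [c.doc_id for c in rank(results)]
no_personalisation = [
    c.doc_id
    for c in rank(results, {"relevance": 0.6, "popularity": 0.4, "personal_affinity": 0.0})
]
```

With the affinity signal the soccer page ranks first; without it, the more popular page wins.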

Web search is not a zero-sum game. Bing makes over 3bn / year and while it may not have a chance to catch up with Google anytime soon, that’s a great business right there. Ditto for DDG. There are also companies that offer a different or better experience and access to datasets that Google doesn’t offer yet.

So, all told, search may be solved only in terms of the basic IR technology that makes it all work, and arguably a lot better than it used to be in terms of user interfaces, ranking, etc, but it will take a lot longer until those other aspects of web search may be considered ‘solved’.

It's definitely not solved from a technical perspective but it's really hard to compete in a business perspective. The market wants good fast search. If you come up with great fast search, it's still hard. The only opportunity I see from the business perspective is to challenge the visual paradigm. Having said that, there are tons of opportunities from the academic perspective such as inferring context, letting users control context, etc.

I think Google has pretty much monopolized search. You could say it's solved. I doubt there are any complaints that sound like 'I couldn't find something using Google'.

Google ultimately serves the advertisers, not the end users. This incentivizes manipulation of search results in order to maximize ad growth. By that token alone, they have not solved search. Any newcomer developing a search service would do well to make it radically different from Google rather than a clone with added privacy or a Microsoft logo.

Regarding always being able to find what you are looking for on Google: I often struggle to find niche information using Google, and most content on the Web is not indexed by Google.

I'm more of the opinion that YouTube is less possible to compete with in any serious way.
