Hacker News
Ask HN: Are there any purely keyword-based search engines with no NLP?
121 points by John_KZ 11 months ago | 80 comments
I want to query the web based on specific criteria (i.e. keywords), not what Google or Bing thinks I really meant.

Ideally I'd love an interface to a large database including the text in each webpage that has been crawled. But something like Google the way it was 10 years ago will still work. Are there any such search engines left? Even ddg is switching to semantic/NLP results.




I was wishing for old-school keyword search just this morning! Nice synchronicity.

I must completely disagree with the other posters claiming that keyword searches are not useful. For niche research, they are extremely helpful or even necessary. Google and Bing have reached the point where it is impossible to do real, niche academic research on them. For instance, I had a very specific thing I was trying to look up involving medicine, religion, and Marco Polo.

Try searching for "marco polo doctors" on Google and witness it giving very counterintuitive, one-sided results that may align with the current zeitgeist of interest from people searching Google, but diverge completely from the aim of literal, precise keyword search needed by academics. I tried to hone the search: looking up Kublai Khan, doctors, atheism brings up blogspam articles on doctors and atheism, but scant results on the 13th century Mongol emperor's religious medical interest. Trying to narrow the search further by including variations on Cambulac, Cambaliech, trying to find any info beyond surface-level on John of Montecorvino and his retinue... all is impossible with search engines in 2018.


Don't forget(!) Google's horrible forgetfulness, and its way of banning you if you try hard enough to extract anything useful from it:

https://news.ycombinator.com/item?id=16153840

For me, the niche is repair information and in particular, identifying IC part numbers and finding datasheets. Searching "service manual" now invariably brings up useless user's manuals, and searching too many times for IC part numbers gets you CAPTCHA-banned.

(Somewhat understandably, part numbers tend to look like semirandom bot-queries, but it's still a horrible experience to be called a bot just because you're actually after more information than the average user.)

Keyword-based would be a great step forward(!), but something like "grep for the Web" would be ideal. I remember many decades ago learning how to use boolean operators and such, since nearly all search engines of the time provided such functionality. Now the mainstream ones which have a big enough index to be effective also have removed much of that functionality and try very hard to limit you from using it. For another example, try using "site:" searches multiple times with Google --- another way to get rapidly banned.


When you find domains that contain useful information, crawl and index them manually.


Indeed, the best solution.

Interestingly enough, I find separate web crawling as a service and search engine as a service, but not both?


You just described the Bing/Yahoo BOSS API.


All right, I forgot those.

However, they are quite pricey. Maybe a solution that one can host oneself is a nicer alternative.


> something like "grep for the Web" would be ideal

A couple of these (e.g., Blekko) popped up 5-10 years ago. I don't think any made it far.


Some of them got bought like Blekko.


I find that more and more often a search like "keyword1 keyword2 keyword3" will give results that only match 2/3 keywords in the first N results. I feel that I'm frequently having to think "How can I phrase this search to get Google to do what I want?", which seems like a problem they solved (mostly) fairly early on.

It's especially annoying when you search "keyword1 keyword2" then "keyword1 keyword2 keyword3" and get the same results, just with a "Missing terms: keyword3" note below each (and more often than not, an alternative search will find what I'm looking for, so it's not just a case of there being nothing to match all three).

Edit: missed "note".


I also noticed recently that if you search for problems using google cloud stuff (app engine in my case) on google, the full first page of results are the documentation for the product. What I wanted was stack overflow posts, or angry forum posts where other users had the same questions. Or somebody’s personal blog or GitHub gist where they talk about what to do. If I want all results from the documentation I can go to the documentation and search from there! If I used google to search for information on C# programming they wouldn’t return a page of 100% MSDN results, so I don’t see why they do for app engine.

Not strictly related to your comment, but similarly frustrated.


Interesting; I usually prefer to see documentation search results instead of "me too" Q&A posts where nobody's solved the problem in 8 years. Maybe a good mix of first-party and third-party sources would be ideal for the first results page; definitely not an entire page of just original documentation.


This one particular thing is easy enough to work around, if the results you don't want are all from the same domain/path:

    -site:cloud.google.com/storage/docs/
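For anyone who ends up scripting this kind of workaround, here is a minimal sketch of assembling such a query string. The helper name `build_query` and the example terms are made up for illustration; only the quoted-phrase and `-site:` syntax comes from this thread, and whether the engine honors them is a separate question.

```python
from urllib.parse import quote_plus

def build_query(terms, exclude_sites=(), exclude_terms=()):
    """Hypothetical helper: quote each term and append exclusion operators."""
    parts = [f'"{t}"' for t in terms]
    parts += [f"-site:{s}" for s in exclude_sites]
    parts += [f"-{t}" for t in exclude_terms]
    return " ".join(parts)

query = build_query(
    ["app engine", "timeout"],
    exclude_sites=["cloud.google.com/storage/docs/"],
)
# URL-encode the whole query so quotes and slashes survive in the URL.
url = "https://www.google.com/search?q=" + quote_plus(query)
print(query)
print(url)
```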


Try using G search "Tools/All results/Verbatim" option for your 'marco polo doctors' query. Maybe throw in a "-who" as well. These tricks help a little. Of course, one may also ask whether the page that you envision exists on the searchable web. It should, but maybe not.


Kind of off topic, but did you ever find out Kublai Khan's religious medicinal interests? I'd be curious to know.


Not yet! I'm still on the hunt, but now I am searching for books. I suspect it will require checking out a few tomes from library to find what I'm looking for.


It might be easier to email a professor who's an expert in Khan to point you in the right direction. If you're lucky, they might know exactly the information you want.


No :-). For what it's worth, Google never worked that way either; it used PageRank initially to sort results by social importance. I would guess that the last keyword-based engine of any merit was AltaVista. AV had the issue of returning all pages with your words but it had no notion of sorting them (at least initially; you can read their patent as well).

These days it's fairly easy to code up an n-gram index and host it on a moderate server. If you have a corpus of documents you can see how well it works. The simplest corpus is to just download all of Wikipedia. Last time I checked it fit on a 2 TB disk drive.

You could also use the Common Crawl database and index it as you would like, or talk to the Internet Archive project for some sort of collaborative project. However I will warn you that of the people who have done this experiment (and it was a new hire task at both Blekko and Google where I worked) folks quickly discovered it wasn't very useful.
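To give a rough idea of what "code up an n-gram index" can mean, here is a toy sketch over word n-grams. It is not how Blekko or Google did it; the function names, the choice of 1- and 2-grams, and the sample documents are all assumptions for illustration. Results are unranked, which is exactly the limitation described above.

```python
import re
from collections import defaultdict

def ngrams(text, n):
    """Return the word n-grams in text (lowercased, naive tokenization)."""
    words = re.findall(r"\w+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def build_index(docs, max_n=2):
    """Map every 1..max_n-gram to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for n in range(1, max_n + 1):
            for gram in ngrams(text, n):
                index[gram].add(doc_id)
    return index

def search(index, grams):
    """Return ids of documents containing ALL given n-grams, unranked."""
    sets = [index.get(g.lower(), set()) for g in grams]
    return set.intersection(*sets) if sets else set()

docs = {
    1: "Marco Polo met doctors at the court of Kublai Khan",
    2: "Doctor Who is a television series",
    3: "Kublai Khan was a Mongol emperor",
}
index = build_index(docs)
hits = search(index, ["kublai khan", "doctors"])
print(hits)
```

Note that "doctors" does not match "Doctor" here: there is no stemming or synonym expansion, which is the point of a literal index.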


To be fair, Wikipedia compressed without images is pretty small: a couple tens of gigs, no more than 80. You can download it to your phone and there are apps to view it all. Not sure how large it is decompressed, but 2 TB sounds like overkill. But yeah, Wikipedia would be a good way to test it.

Also did you mean every single wikipedia site or just the English one? I'm thinking just the English one's sufficient.


I know what you mean by "sufficient", but English pages are often lacking information on a lot of subjects related to countries outside the anglosphere.

I still don't understand why Wikipedia articles aren't translated according to a common script, at least where there are featured translations (marked with a star).


Feel free to make that happen.


Also, did he mean the entire revision history of Wikipedia, or just the most recent versions of pages?


That would definitely explain 2TB haha but yeah, if you pull the recent English version without images it's like ... I wanna say 30GB off the top of my head.

As for anyone curious:

https://dumps.wikimedia.org/backup-index.html


The most recent was 13.9GB compressed and ~65GB uncompressed. The version without images, talk pages, and revision history is 'enwiki-20180320-pages-articles.xml.bz2' here:

https://dumps.wikimedia.org/enwiki/20180320/

The article text contains Wikipedia markup which is a bit difficult to remove but not impossible, there are some existing projects for doing that. DBPedia would have the raw text, but it's not nearly as current.
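For reference, a dump like this can be streamed without unpacking it to disk. The following is only a sketch: it assumes the usual MediaWiki export layout (`page` elements holding a `title` and a revision `text`), the function names are made up, and it yields raw wikitext, so the markup removal mentioned above is still left to do.

```python
import bz2
import xml.etree.ElementTree as ET

def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, None
    for _, elem in ET.iterparse(stream, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text or ""
        elif tag == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the children of pages already processed

def iter_dump(path):
    """Stream pages from a .xml.bz2 dump, decompressing on the fly."""
    with bz2.open(path, "rb") as stream:
        yield from iter_pages(stream)
```

Usage would be along the lines of `for title, text in iter_dump("enwiki-20180320-pages-articles.xml.bz2"): ...`, feeding each page into whatever index you are building.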


You can get the raw text with wikitext and whatnot removed from the search dumps made every week (also has various other metadata)

http://dumps.wikimedia.your.org/other/cirrussearch/

These are structured with a JSON string for each doc roughly like https://en.m.wikipedia.org/wiki/Foobar?action=cirrusdump


Google did work that way. Page rank was used for ranking, which is fine. Page rank did not override the search query, which is what it now does.

Our complaint is that the results we get from our queries simply do not match the queries we are making.
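To make the distinction concrete, here is a small sketch of "the score only orders the matches". The documents and scores are invented, and `strict_search` is a hypothetical name; this is not a claim about how Google implemented anything.

```python
import re

def strict_search(docs, scores, keywords):
    """Keep only documents containing EVERY keyword, then sort by score.

    The score only orders the matches; it never adds or drops a result.
    """
    wanted = {k.lower() for k in keywords}
    matches = []
    for doc_id, text in docs.items():
        words = set(re.findall(r"\w+", text.lower()))
        if wanted <= words:
            matches.append(doc_id)
    return sorted(matches, key=lambda d: scores.get(d, 0.0), reverse=True)

docs = {
    "a": "OCaml pattern matching tutorial",
    "b": "Erlang pattern matching tutorial",
    "c": "Advanced OCaml pattern matching",
}
scores = {"a": 0.2, "b": 0.9, "c": 0.5}  # made-up importance scores
results = strict_search(docs, scores, ["ocaml", "matching"])
print(results)
```

Document "b" has the highest score but never appears, because it does not contain every keyword.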


Since you appear to have a background that includes search and Google I thought I would ask you or anyone else a tangential question - with more than 30 Trillion sites[1] on the internet how is the crawling ever completed before it needs to be re-indexed again?

I'm imagining that "the frontier" a crawler needs to crawl is actually a distributed queue and that crawlers are massively parallelized. I'm also imagining the frontier is bucketed by its indexing frequency, i.e. daily, weekly, monthly, etc. Is that close? Might you have any resources on how this is architected at large search providers?

[1] https://venturebeat.com/2013/03/01/how-google-searches-30-tr...


The crawl is never done. It's constantly crawling. And you're right, it's just a queue of websites that have been found, and when a crawling machine has an available slot, it crawls the next site (obviously that's simplified).

Certain pages get crawled more often than others. As a website owner, you can tell Google how often your content changes, which they will use as a clue for how often to recrawl your site.

If you're really big (like for example Reddit), you actually lose that control in Google dashboard -- they fully control the crawl rate. From experience, I can tell you that they are crawling large sites, like Reddit, continuously. Even 7 years ago, when I last worked there, they were crawling Reddit so much that we had to set up a separate server infrastructure just to respond to Google's requests, because their access patterns were so different from every other user.
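A very simplified picture of the queue described above, with pages re-queued according to how often they change, might look like this. It is a single-process toy with invented names and intervals, not a description of Google's crawler.

```python
import heapq

class Frontier:
    """Toy crawl frontier: URLs are popped in order of their next-due time."""

    def __init__(self):
        self._heap = []

    def schedule(self, url, due_at):
        heapq.heappush(self._heap, (due_at, url))

    def pop_due(self, now):
        """Return the next URL whose due time has passed, or None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

    def crawled(self, url, now, interval):
        """Re-queue a URL after crawling; interval is its recrawl period."""
        self.schedule(url, now + interval)

frontier = Frontier()
frontier.schedule("https://example.com/news", due_at=0)
frontier.schedule("https://example.com/about", due_at=0)

first = frontier.pop_due(now=0)
# Hypothetical policy: news pages are recrawled daily, everything else monthly.
frontier.crawled(first, now=0, interval=1 if "news" in first else 30)
```

Because pages are always re-queued after being fetched, the heap never empties, which matches "the crawl is never done".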


That's fascinating. Is there any way to contact Google to get them to change their crawl patterns? Or a more efficient way to directly send them the site updates as they happen? I imagine a "direct line" to Google would save resources both on Google's side and yours, but I've never run a site big enough for that to be a problem so I wouldn't know.


I think that would lend itself to abuse.


   > "[H]ow is the crawling ever completed before
   > it needs to be re-indexed again?"
Crawls essentially never finish and indexes are never 'done'. If you're building a modern search engine you crawl and index continuously and you use the ranking signals you have developed to steer the crawler to promising new pages.

The reality is that there are roughly between 5 and 15 billion pages that are nominally "not spam" and not duplicates. Literally 99.9% of the internet is crap. So finding web pages has long since switched from 'crawling every page that is accessible from the web' to 'only surfacing those pages that have something of value on them.' That was the fundamental thesis for the founding of Blekko and it is still true today.

That said, a cluster of ~100,000 threads and a couple of petabytes of storage attached with sufficient bandwidth to keep the threads busy can deal with what is out there. If you can create a hash space for strings that is sufficiently uniform you evenly spread the load of crawling every URI you discover across the cluster.
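As a sketch of the "uniform hash space" idea: hash each URI and take the result modulo the number of workers. SHA-256 and the shard count of 8 are arbitrary choices for illustration, and `shard_for` is a made-up name.

```python
import hashlib

def shard_for(uri, num_shards):
    """Map a URI to a shard number using a uniform hash of its bytes."""
    digest = hashlib.sha256(uri.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Count how 10,000 synthetic URIs spread across 8 shards.
uris = [f"https://example.com/page/{i}" for i in range(10_000)]
counts = [0] * 8
for uri in uris:
    counts[shard_for(uri, 8)] += 1
print(counts)
```

The same URI always lands on the same shard, so a rediscovered link is routed to the machine that already knows about it.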

As you crawl you take new pages you discover and apply your ranking algorithm to them where they will score a value between 0 (never index) and 1 (always index).

You can then dial the 'rankable' threshold from 0 (index everything) to 1 (index only the must-rank pages) to set the size of the index you can tolerate.
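In code, the dial amounts to a threshold on the score. The scores and URLs below are invented, and the ranking function itself is assumed to exist elsewhere.

```python
def select_for_index(scored_pages, threshold):
    """Keep pages whose score meets the threshold.

    threshold=0.0 keeps everything; raising it shrinks the index.
    """
    return [url for url, score in scored_pages if score >= threshold]

scored = [
    ("https://example.com/a", 0.95),
    ("https://example.com/b", 0.40),
    ("https://example.com/c", 0.05),
]
for threshold in (0.0, 0.5, 1.0):
    print(threshold, len(select_for_index(scored, threshold)))
```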


Can you elaborate why setting up a simple search engine based on common crawl was not useful?


Basic keyword search is great at recall but precision (top 10) gets worse as the number of documents increases. Given the size of the web, basic keyword search tends to perform poorly in terms of relevance. Common Crawl is large enough to see this problem.

I think what OP and several people in this thread actually want is Google search minus synonyms and the ability to specify advanced syntax like AND and NEAR queries. I believe that would go a long way to satisfying someone who says they just want "keyword search".
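The recall versus precision point can be shown with the standard definitions. The numbers below are invented: 100 pages match the keywords in no particular order, and only 3 of them are what the searcher wanted.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d in relevant) / len(top)

def recall(ranked, relevant):
    """Fraction of all relevant documents that appear in the results."""
    if not relevant:
        return 0.0
    returned = set(ranked)
    return sum(1 for d in relevant if d in returned) / len(relevant)

ranked = list(range(100))   # every page matching the keywords, unranked
relevant = {3, 57, 98}      # hypothetical pages the user actually wanted

print(recall(ranked, relevant))
print(precision_at_k(ranked, relevant, k=10))
```

Everything relevant is in the result set, but only one of the three shows up on the first page, and the ratio gets worse as the number of matching pages grows.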


You are exactly right, and specific keyword search allows you to weaponize a search engine, so Google doesn't let you do it. (They have been victimized in the past for letting people specify things that could pull out social security numbers, for example.)


Not sure if that's what you're looking for, but SymbolHound[1] takes input queries literally, which is useful for e.g. code search.

[1]: http://symbolhound.com/


That looks useful. Thanks!

I can use it via DuckDuckGo too: https://duckduckgo.com/bang?q=symbolhound


Nothing has changed with Google (Bing and DuckDuckGo are the same too, more or less). 10 years ago it still did essentially the same thing; it's up to you to use the more advanced features to filter the results.

For example, you can type "marco polo doctor's -doctor -who" or "marco polo doctors group:science".

Google operators for example: https://en.wikipedia.org/wiki/Google_Search#Search_syntax

Cheat Sheet: https://www.searchlaboratory.com/wp-content/uploads/2012/11/...


Google has absolutely changed in the past 10 years. Have you tried looking for something obscure recently? Google increasingly over time takes more and more liberties with your search terms, regardless of operator use.

I notice that one of the pages you link to is from 2012. Again, the old ways just don't cut it any more. Since it doesn't even mention Google's "Verbatim" search option, it suggests to me that while its content might be technically correct, it's useless to hand-wave towards it as the cure for contemporary complaints about Google's results.


Tell me about it. I am an Erlang programmer and I am currently learning OCaml... It's extremely hard to find OCaml topics, as Google bombards me with Erlang results.

This certainly was not like this 10 years ago.


Sorry, to be clear: I meant that to find information you always had to do a bit of digging. It's true the algorithm has changed a lot.


Not only the algorithm. Recently, using double quotes or a plus sign before terms became useless, as it just lists results with the specific word you marked crossed out when there are not many exact hits.


> Recently, using [..] plus sign before terms became useless

Unless it came back at some point, it wasn't very recent - it was intentionally removed in 2011 due to their using "+" for their social network.


This isn't true - that syntax is no longer honoured by default. It may give the search engine clues about what you want, but it sometimes chooses to ignore keywords and double quotes.

Verbatim mode is a little better, but still not as direct as it was 10 years ago.


Google has removed literals. You can no longer do 'how to setup obscure program "LINUX" "NGINX"'

What you get in return is a bunch of slashed out "Linux"s and "Nginx"s and a bunch of "How to setup obscure program... On windows" and "How to setup obscure program... on Apache". It's downright infuriating having to learn some of the tools I cannot do without. Ones where documentation is spotty, but user forums/mailing lists/etc are top notch, even for Linux and Nginx. But, you won't know that even if you specifically type: 'setup obscure program "Linux" "Nginx" -apache -windows'.

It has changed. You don't have the right to find what you're actually looking for. You have the privilege to only look in places Google approves.


Oh man...

This is literally my whole business, and I wrote about this here: https://austingwalters.com/is-search-solved/

Hinting at why search was broken.

Essentially, search providers are for the general, meaning if you type "whales" it'll bring you to the wikipedia. This is because probably 80% of the time you're looking for wikipedia. It uses NLP to determine when you say "I want to know about whales" because it works in the general case. If you want an exact match do "I want to know about whales" and it'll look for that exact phrase.

Now my business, is actually the reverse - not looking for the general case, but identifying the niche - i.e. "what an expert would want":

https://projectpiglet.com/

This lets me build a financial advisor which is averaging over 100% YoY returns because it identifies and tracks specific topics (as opposed to just wikipedia changes). So if you go on there and type "Iran", you'll get a lot of search results about Iran, but also about Israel, Jordan and the like because it identifies Iran being associated in the graph. This works great for investing because you want to know about the related topics (ones you may not even realize you wanted to know about).

Now, that's NLP. But it works for my customers, because exact matches typically are not what people want. They want the Niche, the general, or occasionally as in your case an exact match (if they can think of the right words). Luckily you have quotes "search phrase", in my system I always assume you mean to type "search phrase", so I always look for an exact match. But I still apply NLP to the results, because that's the value.


Kozmos' (https://getkozmos.com) public search engine is keyword based. You can search about 1 million pages, including their content, sorted by how much they're liked (a.k.a. bookmarked) by the users.

You can watch my 1-minute pitch in which I mention why Kozmos' search functionality is more interesting: https://youtube.com/watch?v=ETjeEz5Dk_M

We have academic and researcher users who love Kozmos. My goal is to improve search with deep learning, while keeping the sorting algorithm (like count) the same. I'm walking towards this goal slowly but surely!

P.S If you're into this topic and live in Europe (Berlin), let's have coffee!


Have you tried Google’s “Verbatim” mode? It’s listed under search tools and claims to do exactly this, iirc.


intext:MYSEARCHTERM is what Google taught a few years ago when they had a "google poweruser web course" thing. My understanding is that these days it either doesn't work consistently or makes Google think you're a spambot and will shut you out after just a couple searches.


It used to work better.


CommonSearch https://github.com/commonsearch/ was a project to try and build a search engine where you could "Explain" (in the SQL sense) the result, based on Common Crawl. Open Source and transparent. But it did not seem to have gathered much enthusiasm. Which I find sad.

If you have some loose change on you.. a bit of processing on 71TB of data.. and you got yourself an index precisely like you want it.

Anyway, without "some" NLP no search engine is going to be very useful.

You need to know how to tokenize.. at a minimum. For many languages, this is not as trivial as it is for English.
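A quick illustration of why tokenizing is not trivial outside English: the naive "split on non-word characters" approach below is an assumption about what a minimal keyword engine might do, and the sample strings are made up.

```python
import re

def naive_tokenize(text):
    """Lowercase and split on runs of non-word characters."""
    return re.findall(r"\w+", text.lower())

english = naive_tokenize("Keyword search, no NLP!")
japanese = naive_tokenize("東京都に住んでいます")

print(english)   # several separate tokens
print(japanese)  # the whole sentence comes back as one token
```

With no spaces between words, the whole Japanese sentence becomes a single "keyword", so a search for any word inside it would find nothing. Languages like this need a real segmenter, which is already a form of NLP.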


Look into commoncrawl.org, which provides a free web index which you can query against. Now that cloud is available, you could in theory download the index and load it into Google's BigQuery or AWS and run your experiments.


If you can deal with the German UI you could check out labarama.com. I wouldn't know it if I hadn't met the developers on a night out. It's an altruistic project from Austria that keeps to the basics.


So you know more about these guys? I have heard a rumour that the people behind this search engine are living in constant fear of being raided by Google henchmen.


This can't work for a simple reason. Spam. What's stopping anyone from flooding a page with specific keywords which might be completely unrelated to the content of the page, or the page itself providing too little real info on the searched keyword. It's exactly for reasons like these that Google considers a variety of factors for ranking pages.


OK, so some sites are “garbage in”. I'd like an option to see the “garbage out”, unfiltered. Then maybe I'll appreciate the clever filtering.


Kind of off topic but I recently realized that after searching for the exact same thing multiple times in a short period google was giving me different results (presumably assuming I had not found what I was looking for). Which was kind of annoying since I was just using google as the entry point to those top results.


This only finds things that are in some kind of RSS or Atom feed, but inoreader (with a Professional account) can keyword-search every public article that they have in their index. It doesn't return results quickly, and there is often blogspam, but it does find interesting things as well. A screenshot of the search options: https://i.imgur.com/fIzl2cp.png


My first startup needed that, specifically querying for exact hits for names, companies, etc. The inference that Google did at the time was too inaccurate for this use case (not to mention my use case being against their ToS). I ended up building a custom engine with Lucene/Solr and creating a custom Tokenizer to be more strict.


Here's one for code that I always hoped would grow, but I'm not sure if it's been updated for a few years.

http://symbolhound.com/


If Google is an option, put your keywords in quotes.


Google sometimes ignores this, for reasons that nobody seems to actually understand.


I used to be an engineer on Google's search indexing team. I don't have any specific inside information in this case, but knowing the Google mindset, I'm sure they would have pulled up a large unbiased frequency-weighted sample of search queries with quotes in them. They would have then spun up a custom web search front-end that used an algorithm that sometimes ignored quotes, and ran all of the sampled queries through both the regular front-end and the custom front-end. They then would have manually looked through the affected queries. They would have used this to determine whether the change seemed to better serve the user's intent on average. Of course, this involves a bit of subjective inferring of the user's intent.

I don't like it, but I'm sure that for the average query using quotes, it does better find what a user is looking for. That's why Google does this.

On the other hand, there's a lot of observation bias: I'm much less likely to notice Google is ignoring quotes when it works out well. It's mostly in the frustrating cases that I really notice and remember noticing that Google sometimes ignores quotes.

As for what black magic allows them to sometimes determine that you don't really want quotes, your guesses would probably be as good as mine. I'm just pretty sure that on average, this black magic has a positive impact on search quality.


Really? I haven't seen that before; do you have an example?


Exact occurrences seem to vary over time, per user/account, the presence/absence of other features being used (e.g. time range or site: operator), and possibly other factors. It's periodically reported on the Google Search and Assistant Help Forum, and there's a 100+-post thread [1] about it. The behavior is like something in Google's stack is incorrectly matching a cached set of results that had the same set of terms without quotation marks.

[1] https://productforums.google.com/forum/#!topic/websearch/6gH...


Have you tried Qwant? (Best to use lite.qwant.com; the main site is a bit overloaded with info.) They have their own crawler, apparently.


Ask.com used to work that way. That was more or less why Google replaced Ask Jeeves as the top engine: Google would throw everything at you, whereas Ask would give you only exact matches. It seems to have changed now, though. Too little, too late.


How's DuckDuckGo in this realm?


It honours a limited but useful search syntax.

https://duck.co/help/results/syntax


I use DuckDuckGo. I have been frustrated several times by it trying to be too clever at second-guessing me. In those instances my plan B is Google, and I find that sometimes Google has more useful results.

(Of course, since I only use Google when DDG's results are poor, I would expect to see Google's results be superior a lot of the time, irrespective of whether their results are generally better or worse.)


Yacy?


I still add meta keywords in the hope that some day they'll be used by search engines ...


duckduckgo.com ?


No, I use DDG every day and it's worse! It very often explicitly ignores keywords (it even tells you) unless you put every word in double quotes.


" unless you put every word in double quotes." Then at least you can make it use the keywords.


wiby.me seems to do just that.


You really don't want this at all. If you think keyword optimized spam is bad today wait until it's not filtered out.


A purely keyword search would not be very useful. BTW, semantic search engines can do exact keyword matches if you enclose your query in double quotes, e.g. https://www.google.com/search?q="keyword+search+engine".


Sadly, results #1 and #2 link to a page where the phrase "keyword search engine" isn't even present. So much for exact keyword matches.


But keywords with ranges would be useful in many cases.

keyword1 w5 keyword2 [...]

[find pages where keyword1 is within 5 words' distance of keyword2]
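For what it's worth, a "within N words" check is only a few lines once you have the text of a page. This sketch re-tokenizes the page on every call, whereas a real engine would store word positions in the index; the function name `within` and the sample sentence are made up.

```python
import re

def within(text, word_a, word_b, max_distance):
    """True if word_a and word_b occur within max_distance words of each other."""
    words = re.findall(r"\w+", text.lower())
    pos_a = [i for i, w in enumerate(words) if w == word_a.lower()]
    pos_b = [i for i, w in enumerate(words) if w == word_b.lower()]
    return any(abs(a - b) <= max_distance for a in pos_a for b in pos_b)

page = "Kublai Khan summoned several doctors to court"
print(within(page, "khan", "doctors", 5))
print(within(page, "kublai", "court", 5))
```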


Google can do this. keyword1 AROUND(5) keyword2.


This gives me pages with the word "around" instead.



