
Ask HN: Are there any purely keyword-based search engines with no NLP? - John_KZ
I want to query the web based on specific criteria (i.e. keywords), not what Google or Bing _thinks_ I _really_ meant.

Ideally I'd love an interface to a large database including the text of each webpage that has been crawled. But something like Google the way it was 10 years ago will still work. Are there any such search engines left? Even DDG is switching to semantic/NLP results.
======
neuralk
I was wishing for old-school keyword search just this morning! Nice
synchronicity.

I must completely disagree with the other posters claiming that keyword
searches are not useful. For niche research, they are extremely helpful or
even necessary. Google and Bing have reached the point where it is impossible
to do real, niche academic research on them. For instance, I had a very
specific thing I was trying to look up involving medicine, religion, and Marco
Polo.

Try searching for "marco polo doctors" on Google and witness it giving very
counterintuitive, one-sided results that may align with the current zeitgeist
of interest from people searching Google, but diverge completely from the aim
of literal, precise keyword search needed by academics. I did try to refine
the search: looking up Kublai Khan, doctors, atheism brings up
blogspam articles on doctors and atheism, but scant results on the 13th
century Mongol emperor's religious medical interest. Trying to narrow the
search further by including variations on Cambulac, Cambaliech, trying to find
any info beyond surface-level on John of Montecorvino and his retinue... all
is impossible with search engines in 2018.

~~~
userbinator
Don't forget(!) Google's horrible forgetfulness, and its way of banning you if
you try hard enough to extract anything useful from it:

[https://news.ycombinator.com/item?id=16153840](https://news.ycombinator.com/item?id=16153840)

For me, the niche is repair information and in particular, identifying IC part
numbers and finding datasheets. Searching "service manual" now invariably
brings up useless user's manuals, and searching too many times for IC part
numbers gets you CAPTCHA-banned.

(Somewhat understandably, part numbers tend to look like semirandom
bot-queries, but it's still a horrible experience to be called a bot just because
you're actually after _more_ information than the average user.)

Keyword-based would be a great step forward(!), but something like "grep for
the Web" would be ideal. I remember many decades ago learning how to use
boolean operators and such, since nearly all search engines of the time
provided such functionality. Now the mainstream ones which have a big enough
index to be effective also have removed much of that functionality and try
very hard to limit you from using it. For another example, try using "site:"
searches multiple times with Google --- another way to get rapidly banned.
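
As a rough illustration of the boolean operators mentioned above, here is a minimal Python sketch of AND/OR/NOT over an inverted index. The function names (`tokenize`, `build_index`, `search`) and the three sample documents are made up for this example; it is a toy under those assumptions, not how any real search engine is implemented.

```python
import re
from collections import defaultdict

def tokenize(text):
    # Lowercase and split on anything that is not a letter or digit.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, all_ids, must=(), any_of=(), must_not=()):
    # Query terms are expected to be lowercase already.
    # AND: every term in `must` has to be present.
    hits = set(all_ids)
    for term in must:
        hits &= index.get(term, set())
    # OR: at least one term in `any_of` has to be present (if any are given).
    if any_of:
        hits &= set().union(*(index.get(t, set()) for t in any_of))
    # NOT: drop documents containing any term in `must_not`.
    for term in must_not:
        hits -= index.get(term, set())
    return sorted(hits)

# Hypothetical sample corpus.
docs = {
    1: "LM317 voltage regulator datasheet",
    2: "LM317 user manual for beginners",
    3: "Service manual for a CRT monitor",
}
index = build_index(docs)
result = search(index, docs.keys(), must=["lm317"], must_not=["user"])
print(result)
```

Because the part number is kept as a single literal token, a query for it either matches exactly or not at all, which is the behaviour being asked for here.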

~~~
sitkack
When you find domains that contain useful information, crawl and index them
manually.

~~~
ccozan
Indeed, the best solution.

Interestingly enough, I find separate web crawling as a service and search
engine as a service, but not both?

~~~
AznHisoka
You just described the Bing/Yahoo BOSS API

~~~
ccozan
All right, I forgot those.

However they are quite pricey. Maybe some solution that one can host himself
is a nicer alternative.

------
ChuckMcM
No :-). For what it's worth, Google never worked that way either; it used
PageRank initially to sort results by social importance. I would guess that the
last keyword-based engine of any merit was AltaVista. AV had the issue of
returning all pages with your words but it had no notion of sorting them (at
least initially; you can read their patent as well).

These days it's fairly easy to code up an n-gram index and host it on a
moderate server. If you have a corpus of documents you can see how well it
works. The simplest corpus is to just download all of Wikipedia. Last time I
checked it fit on a 2 TB disk drive.
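
To give a feel for how small such an index can be, here is a minimal sketch in Python. It assumes "n-gram" means word-level n-grams (character n-grams are another common reading), and the names `ngrams`, `build_ngram_index`, `phrase_search` and the two sample documents are invented for illustration.

```python
from collections import defaultdict

def ngrams(tokens, n):
    # All runs of n consecutive tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_ngram_index(docs, max_n=3):
    # Map every 1..max_n word n-gram to the ids of documents containing it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                index[gram].add(doc_id)
    return index

def phrase_search(index, phrase):
    # Exact phrase lookup; only phrases of up to max_n words are indexed.
    return sorted(index.get(tuple(phrase.lower().split()), set()))

# Hypothetical sample corpus.
docs = {
    "a": "Marco Polo met doctors at the court of Kublai Khan",
    "b": "Doctors recommend playing Marco Polo in the pool",
}
index = build_ngram_index(docs)
print(phrase_search(index, "marco polo"))
print(phrase_search(index, "kublai khan"))
```

Note that there is no ranking at all: every document containing the phrase comes back, in id order, which is the AltaVista-style behaviour described above.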

You could also use the Common Crawl database and index it as you would like,
or talk to the Internet Archive project for some sort of collaborative
project. However I will warn you that of the people who have done this
experiment (and it was a new hire task at both Blekko and Google where I
worked) folks quickly discovered it wasn't very useful.

~~~
giancarlostoro
To be fair, Wikipedia compressed without images is pretty small: a couple tens
of gigs, no more than 80. You can download it to your phone and there are apps
to view it all. Not sure how much larger it is decompressed, but 2 TB sounds like
overkill. But yeah, Wikipedia would be a good way to test it.

Also did you mean every single wikipedia site or just the English one? I'm
thinking just the English one's sufficient.

~~~
tomcooks
I know what you mean by "sufficient", but English pages are often lacking
information on a lot of subjects related to countries outside the
anglosphere.

I still don't understand why Wikipedia articles aren't translated according to
a common script, at least where there are featured translations (marked with a
star).

~~~
zeckalpha
Feel free to make that happen.

------
blauditore
Not sure if that's what you're looking for, but SymbolHound[1] takes input
queries literally, which is useful for e.g. code search.

[1]: [http://symbolhound.com/](http://symbolhound.com/)

~~~
gregknicholson
That looks useful. Thanks!

I can use it via DuckDuckGo too:
[https://duckduckgo.com/bang?q=symbolhound](https://duckduckgo.com/bang?q=symbolhound)

------
wyck
Nothing has changed with Google (Bing and DuckDuckGo are the same too, more or
less). 10 years ago it still did essentially the same thing; it's up to you to
use the more advanced features to filter the results.

For example if you type "marco polo doctor's -doctor -who" or "marco polo
doctors group:science".

Google operators for example:
[https://en.wikipedia.org/wiki/Google_Search#Search_syntax](https://en.wikipedia.org/wiki/Google_Search#Search_syntax)

Cheat Sheet:
[https://www.searchlaboratory.com/wp-content/uploads/2012/11/...](https://www.searchlaboratory.com/wp-content/uploads/2012/11/searchoperators.jpg)

~~~
WalterGR
Google has _absolutely_ changed in the past 10 years. Have you tried looking
for something obscure recently? Google increasingly over time takes more and
more liberties with your search terms, regardless of operator use.

I notice that one of the pages you link to is from 2012. Again, the old ways
just don't cut it any more. Since it doesn't even mention Google's "Verbatim"
search option, it suggests to me that while its content might be _technically_
correct, it's useless to hand-wave towards it as the cure for contemporary
complaints about Google's results.

~~~
wyck
Sorry, to be clear: I meant that to find information you always had to do a bit
of digging. It's true the algorithm has changed a lot.

~~~
gkya
Not only the algorithm. Recently, using double quotes or a plus sign before
terms became useless, as it just lists results with the specific word you
marked crossed out when there are not many exact hits.

~~~
Izkata
> Recently, using [..] plus sign before terms became useless

Unless it came back at some point, it wasn't very recent - it was
intentionally removed in 2011 due to their using "+" for their social network.

------
lettergram
Oh man...

This is literally my whole business, and I wrote about this here:
[https://austingwalters.com/is-search-solved/](https://austingwalters.com/is-search-solved/)

Hinting at why search was broken.

Essentially, search providers cater to the general case, meaning if you type
"whales" it'll bring you to Wikipedia. This is because probably 80% of the
time you're looking for Wikipedia. It uses NLP to work out what you mean when
you say "I want to know about whales" because it works _in the general case_.
If you want an exact match, put "I want to know about whales" in quotes and
it'll look for that exact phrase.

Now my business is actually the reverse - not looking for the general case,
but identifying the niche - i.e. "what an expert would want":

[https://projectpiglet.com/](https://projectpiglet.com/)

This lets me build a financial advisor which is averaging over 100% YoY
returns because it identifies and tracks specific topics (as opposed to just
wikipedia changes). So if you go on there and type "Iran", you'll get a lot of
search results about Iran, but also about Israel, Jordan and the like because
it identifies them as associated with Iran in the graph. This works great for
investing because you want to know about the related topics (ones you may not
even have realized you wanted to know about).

Now, that's NLP. But it works for my customers, because exact matches
typically are not what people want. They want the Niche, the general, or
occasionally as in your case an exact match (if they can think of the right
words). Luckily you have quotes "search phrase", in my system I always assume
you mean to type "search phrase", so I always look for an exact match. But I
still apply NLP to the results, because that's the value.

------
roadbeats
Kozmos' ([https://getkozmos.com](https://getkozmos.com)) public search engine
is keyword-based. You can search about 1 million pages, including their
content, sorted by how much they're liked (a.k.a. bookmarked) by the users.

You can watch my 1-minute pitch in which I mention why Kozmos' search
functionality is more interesting:
[https://youtube.com/watch?v=ETjeEz5Dk_M](https://youtube.com/watch?v=ETjeEz5Dk_M)

We have academic/researcher users who love Kozmos. My goal is to improve
search with deep learning, while keeping the sorting algorithm (like count)
the same. I'm walking towards this goal slowly but surely!

P.S. If you're into this topic and live in Europe (Berlin), let's have coffee!

------
metastable
Have you tried Google’s “Verbatim” mode? It’s listed under search tools and
claims to do exactly this, iirc.

~~~
tgb
intext:MYSEARCHTERM is what Google taught a few years ago when they had a
"google poweruser web course" thing. My understanding is that these days it
either doesn't work consistently or makes Google think you're a spambot and
will shut you out after just a couple searches.

------
OriPekelman
CommonSearch
[https://github.com/commonsearch/](https://github.com/commonsearch/) was a
project to try and build a search engine where you could "Explain" (in the SQL
sense) the result, based on Common Crawl. Open Source and transparent. But it
did not seem to have gathered much enthusiasm. Which I find sad.

If you have some loose change on you.. a bit of processing on 71TB of data..
and you got yourself an index precisely like you want it.

Anyway, without "some" NLP no search engine is going to be very useful.

You need to know how to tokenize.. at a minimum. For many languages, this is
not as trivial as it is for English.
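
A small Python sketch of why tokenizing is harder outside English. The two helper functions and the sample strings are assumptions made up for this example; character bigrams are shown only as one common fallback for scripts without word separators, not as what any particular engine does.

```python
import re

def whitespace_tokens(text):
    # Works for languages that separate words with spaces or punctuation.
    return re.findall(r"\w+", text.lower())

def char_bigrams(text):
    # Fallback for unsegmented scripts: overlapping two-character windows.
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

# English splits into usable words (the apostrophe also splits off an "s").
english = whitespace_tokens("Marco Polo's travels")

# Chinese text has no spaces, so the same tokenizer returns one giant token
# that would only ever match a query for the entire string.
chinese = whitespace_tokens("马可波罗游记")

# Bigrams at least let a query for part of the string match.
bigrams = char_bigrams("马可波罗游记")

print(english)
print(chinese)
print(bigrams)
```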

------
bluecat22
Look into commoncrawl.org, which provides a free web index that you can query
against. Now that cloud computing is available, you could in theory download
the index and load it into Google's BigQuery or AWS and run your experiments.

------
saintPirelli
If you can deal with the German UI you could check out labarama.com. I
wouldn't know it if I hadn't met the developers on a night out. It's an
altruistic project from Austria that keeps to the basics.

~~~
jhoechtl
So you know more about these guys? I have heard a rumour that the people
behind this search engine are living in constant fear of being raided by Google
henchmen.

------
elorant
This can't work for a simple reason: spam. What's stopping anyone from
flooding a page with specific keywords which might be completely unrelated to
the content of the page, or the page itself providing too little real info on
the searched keyword? It's exactly for reasons like these that Google
considers a variety of factors for ranking pages.
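
The keyword-flooding problem is easy to reproduce with a toy ranker. The sketch below assumes the most naive scoring possible (count how often the query words appear); `term_count_score` and both sample pages are hypothetical, and real engines do not rank this way.

```python
def term_count_score(text, query):
    # Naive relevance: how many times the query words occur in the page.
    words = text.lower().split()
    return sum(words.count(q) for q in query.lower().split())

# Hypothetical pages: one genuine, one stuffed with repeated keywords.
pages = {
    "honest": "a datasheet for the lm317 regulator with pinout and ratings",
    "stuffed": "lm317 datasheet lm317 datasheet lm317 datasheet buy pills",
}
query = "lm317 datasheet"

# Sort page names by score, highest first.
ranked = sorted(pages, key=lambda p: term_count_score(pages[p], query), reverse=True)
print(ranked)
```

With nothing but keyword counts to go on, the stuffed page outranks the genuine one.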

~~~
gregknicholson
OK, so some sites are “garbage in”. I'd like an option to see the “garbage
out”, unfiltered. Then maybe I'll appreciate the clever filtering.

------
pooya13
Kind of off topic but I recently realized that after searching for the exact
same thing multiple times in a short period google was giving me different
results (presumably assuming I had not found what I was looking for). Which
was kind of annoying since I was just using google as the entry point to those
top results.

------
ivank
This only finds things that are in some kind of RSS or Atom feed, but
inoreader (with a Professional account) can keyword-search every public
article that they have in their index. It doesn't return results quickly, and
there is often blogspam, but it does find interesting things as well. A
screenshot of the search options:
[https://i.imgur.com/fIzl2cp.png](https://i.imgur.com/fIzl2cp.png)

------
yesimahuman
My first startup needed that, specifically querying for exact hits for names,
companies, etc. The inference that Google did at the time was too inaccurate
for this use case (not to mention my use case being against their ToS). I
ended up building a custom engine with Lucene/Solr and creating a custom
Tokenizer to be more strict.

------
3131s
Here's one for code that I always hoped would grow, but I'm not sure if it's
been updated for a few years.

[http://symbolhound.com/](http://symbolhound.com/)

------
ILikeConemowk
If Google is an option, put your keywords in quotes.

~~~
0xcde4c3db
Google sometimes ignores this, for reasons that nobody seems to actually
understand.

~~~
Falling3
Really? I haven't seen that before; do you have an example?

~~~
0xcde4c3db
Exact occurrences seem to vary over time, per user/account, the
presence/absence of other features being used (e.g. time range or site:
operator), and possibly other factors. It's periodically reported on the
Google Search and Assistant Help Forum, and there's a 100+-post thread [1]
about it. The behavior is like something in Google's stack is incorrectly
matching a cached set of results that had the same set of terms without
quotation marks.

[1]
[https://productforums.google.com/forum/#!topic/websearch/6gH...](https://productforums.google.com/forum/#!topic/websearch/6gHVUEl8y1k)

------
iagovar
Have you tried Qwant? (Best to use lite.qwant.com; the main site is a bit
overloaded with info.) They have their own crawler, apparently.

------
nercht12
ask.com used to work that way. That was more or less why Google replaced
Askjeeves as the top engine: Google would throw at you everything whereas Ask
would give you only exact matches. It seems to have changed now, though. Too
little too late.

------
drharby
How's DuckDuckGo in this realm?

~~~
dleslie
It honours a limited but useful search syntax.

[https://duck.co/help/results/syntax](https://duck.co/help/results/syntax)

------
dest
Yacy?

------
z3t4
I still add meta keywords in the hope that some day they'll be used by search
engines ...

------
EamonnMR
wiby.me seems to do just that.

------
joeseeder
duckduckgo.com ?

~~~
akvadrako
No, I use DDG every day and it's worse! It very often explicitly (it even
tells you) ignores keywords unless you put every word in double quotes.

~~~
lancewiggs
" unless you put every word in double quotes." Then at least you can make it
use the keywords.

------
paulcole
You really don't want this at all. If you think keyword-optimized spam is bad
today, wait until it's not filtered out.

------
bobosha
A purely keyword search would not be very useful. BTW, semantic search engines can
do exact keyword matches if you enclose your query in double quotes, e.g.
[https://www.google.com/search?q="keyword+search+engine"](https://www.google.com/search?q="keyword+search+engine").

~~~
mc32
But keywords with ranges would be useful in many cases.

keyword1 w5 keyword2 [...]

[find pages where keyword1 is within 5 words' distance of keyword2]
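
A proximity operator like this is usually built on a positional index. The following Python sketch shows the idea under simple assumptions (whitespace tokenization, distance measured in word positions); `build_positional_index`, `within` and the sample documents are made-up names for illustration only.

```python
from collections import defaultdict

def build_positional_index(docs):
    # token -> doc id -> list of word positions where the token occurs.
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

def within(index, term1, term2, distance):
    # Documents where term1 and term2 occur at most `distance` words apart.
    hits = []
    postings1 = index.get(term1, {})
    postings2 = index.get(term2, {})
    for doc_id in postings1.keys() & postings2.keys():
        if any(abs(p1 - p2) <= distance
               for p1 in postings1[doc_id]
               for p2 in postings2[doc_id]):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample corpus.
docs = {
    1: "kublai khan asked for learned doctors",
    2: "kublai khan ruled china long before modern hospitals employed many doctors",
}
index = build_positional_index(docs)
print(within(index, "khan", "doctors", 5))
```

Both documents contain both words, but only the first has them within five words of each other, so a plain AND query and the `w5` query give different answers.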

~~~
earenndil
Google can do this. keyword1 AROUND(5) keyword2.

~~~
exebook
This gives me pages with the word "around" instead

