Are there any search engines which exclude or at least penalize results from, say, top 500 websites?
There are so many cool things I remember reading on the web like 10-20 years ago that still exist that are so buried now on Google they might as well not exist. Nowadays searching any topic seems to always lead you to CNN and Microsoft and Facebook and other huge corporations. Search results are just becoming more sanitized and beige and meaningless every day.
My trick now is to use Twitter to discover interesting people, and follow them there. Granted, it's not a search engine, but it's at least given me the ability to discover weird things again.
One of the things I enjoy doing on Twitter is posting up something I'm working on, and then clicking through to all the profiles of the people who like, comment, or retweet my work. I stumble across an incredibly diverse range of people by doing this, many with conflicting opinions to my own, and many who belong to strange subcultures that I don't understand, but who were all drawn to my work for one reason or another.
I think there's definitely a danger of crafting a bubble for yourself if you choose to use it that way, but as a tool for discovering people making cool stuff who otherwise wouldn't cut through the noise on something like Google search, I haven't found anything better.
Over time you would get a 'pagerank for people' and could do awesome stuff with that, like 'You don't know XYZ, but 3 people you trust trust her, and this is what they say about her:' ...
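To make the 'people you trust trust her' idea concrete, here is a minimal sketch. The trust graph, the names in it, and the `vouchers` helper are all made up for illustration; a real 'pagerank for people' would also weight and propagate trust rather than look one hop out.

```python
# Hypothetical trust graph: each person maps to the set of people they trust.
trust = {
    "me": {"alice", "bob", "carol"},
    "alice": {"xyz", "bob"},
    "bob": {"xyz"},
    "carol": {"xyz", "dave"},
    "dave": set(),
}


def vouchers(graph, me, stranger):
    """Return the people `me` trusts directly who in turn trust `stranger`."""
    return sorted(
        person
        for person in graph.get(me, set())
        if stranger in graph.get(person, set())
    )


who = vouchers(trust, "me", "xyz")
print(f"You don't know xyz, but {len(who)} people you trust trust her: {who}")
```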
In the context of "trying to do research on coronaviruses" your comment appears to be not only correct but an important distinction, rather than the pedantry it appears to be.
From Wikipedia: "...more lethal varieties [of coronaviruses] can cause SARS, MERS, and COVID-19."
"Severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2] is the strain of coronavirus..."
I learned something today!
To be honest, I was a bit disappointed when I found out, though I admit now it's a little refreshing to have it be so simply named.
If these were names for services and classes that came in a code review, how many would really approve?
They changed the name of this coronavirus to COVID-19 to reflect the disease more accurately.
The CDC has a list of other coronaviruses that have existed.
Edit: Since there seems to be a misunderstanding on everybody’s part about this, as it’s referred to as both, often interchangeably, in mainstream settings, take a look at the Johns Hopkins guide: https://www.hopkinsguides.com/hopkins/view/Johns_Hopkins_ABX...
From the link:
"SARS-CoV-2 (the novel coronavirus that causes coronavirus disease 2019, or COVID-19)"
The previous comment was just making the point that the (new) virus is called SARS-CoV-2 and the associated disease is called COVID-19.
COVID-19: disease caused by SARS-CoV-2
SARS-CoV-2: strain of SARS-CoV
SARS-CoV: severe acute respiratory syndrome coronavirus
Coronavirus: virus that causes respiratory diseases in mammals, such as SARS (SARS-CoV) MERS (MERS-CoV), and COVID-19 (SARS-CoV-2)
Excuse the incivility, but no. SARS-CoV-2 is not a strain or type of SARS-CoV. The viruses share ancestors, but SARS-CoV-2 did not come directly from SARS-CoV. SARS-CoV and SARS-CoV-2 are in the category of beta coronaviruses.
"The whole genome-based phylogenetic analysis presented that two Bat SARS-like CoVs (ZXC21 and ZC45) were the closest relatives of SARS-CoV-2."
While we're on the topic of linguistic pedantry, strain isn't exclusive to direct mutations from a parent genome. Strains, like much of biological taxonomy, are a human abstraction that makes it easier to communicate an idea -- in this case, "a virus sharing similar properties to coronaviruses that cause severe acute respiratory syndrome" -- albeit this is a very simplified definition for the sake of brevity.
SARS is caused by SARS-CoV-1 and COVID-19 is caused by SARS-CoV-2.
Rather, if we would like to be absolutely correct about these classifications, we would say SARS-CoV-1 and SARS-CoV-2 are both strains of SARSr-CoV (Severe acute respiratory syndrome related coronavirus), which in itself is a species, an abstract concept used to group related organisms into a convenient umbrella term.
There is no "eukaryote" organism the same way there is no "SARSr-CoV" organism. The added "r" was a recent addition when COVID-19 was discovered.
I will cede that I didn't specify this last point, and you were correct to point it out.
Thank you for making my point, again.
Genera -- as in SARS-CoV-2's genus is Betacoronavirus -- don't have "strains."
Only families -- such as the SARSr-CoV family -- have strains.
> SARS-CoV-2: strain of SARS-CoV
GP was pointing out that this was incorrect, and you just made that point by stating it yourself.
Assuming you are intending to engage in the conversation and not be a pedant, I might let you know that your replies are coming across quite coarsely. More specifically, as to prefaces on earlier comments, there is no need to excuse incivility, because there is no need for incivility here.
At least this did not fall into the category of "Cold regurgitation of data" (quite popular it seems) and had a level of warmth that was an indication of passion, more than anger (from all parties).
If they added a temperature social cue to HN comments..... That would be funny.
There was a "rebranded" web search that someone created a number of years ago and posted on HN that aimed to exclude the top websites from results. I cannot remember the name he gave to the project.
One way to exclude the world's biggest websites when using Google is to restrict the search to TLDs other than .com, .net and .org. The root zone is full of silly new TLDs that no one uses for large websites. There are hundreds to choose from.
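The same idea can also be applied after the fact, to a list of result URLs you already have. A rough sketch follows; the `MAINSTREAM_TLDS` set and the sample URLs are assumptions for illustration, and nothing here talks to Google.

```python
from urllib.parse import urlsplit

# Assumption: these three suffixes stand in for "where the big sites live".
MAINSTREAM_TLDS = {"com", "net", "org"}


def has_mainstream_tld(url: str) -> bool:
    """True if the URL's hostname ends in one of MAINSTREAM_TLDS."""
    host = urlsplit(url).hostname or ""
    return host.rsplit(".", 1)[-1] in MAINSTREAM_TLDS


def keep_offbeat(urls):
    """Return only the URLs hosted under less common TLDs."""
    return [u for u in urls if not has_mainstream_tld(u)]


# Made-up sample results.
results = [
    "https://www.example.com/page",
    "http://wiby.me/",
    "https://example.org/about",
    "https://blog.example.horse/post",
]
print(keep_offbeat(results))
```

This is as blunt as the query-side trick: plenty of small sites live on .com, and they get dropped too.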
Looks like Google Scholar is including a number of "coronavirus links" on the main page but thankfully not in the results.
Why not skip Google and "web search" and use a database that does not include all the crap one finds on the www?
I have a theory that web crawling alone is not the best way forward to find the most relevant results because of the volume of content continually being created, much of which is niche and sometimes dynamic.
Instead I believe linking together vertical search sources that have targeted information based on search intent will provide better results.
I created Runnaroo for that purpose. If you search a programming question, it will pull traditional organic results from Google, but it will also directly query Stack Overflow for a deeper search.
versus google's completely useless:
Everyone else, get in here: this is top notch stuff.
Sure, people hosted on geocities and tripod, and they were the biggest and easiest to remember. But the quality of a geocities page compared to an MIT student page was much lower.
Especially shopping. The endless stores are the worst part of search results. If I search for anything that remotely looks like a product, the results are just choked with store after store trying to sell me the thing. Awful.
I haven't had that great results with it myself though.
Garbage in, garbage out, I guess. Still, I like the idea of something to side-step the SEO. Perhaps with more effort they can make it work, but relying on Google or any major search engine for the base results is the wrong way to go.
I suppose it depends on the category.
For other engines you can use https://addons.mozilla.org/en-US/firefox/addon/greasemonkey/ with this script https://greasyfork.org/en/scripts/1682-google-hit-hider-by-d...
Not just ads, but also ranked by the number of third-party cookies/tracking scripts a site has.
Very surprised where I see those these days, and they always make me run away.
That's a nice negative feedback loop or catch-22
I do a random city + documentary as the search term, it's taken me all over the world and seen some very strange things.
One of my favourites was Aarhus, which had a Danish language rapper proclaiming he was putting Aarhus on the global map (I have never heard of the city of Aarhus). https://youtu.be/WSZxuzgImLo They dis Copenhagen a lot too, lol. You get a more intimate YouTube experience with the low view videos
But I've also seen amazing religious rituals. An excellent documentary on Karachi.
Because it's observable hq you can fork it and figure out your own algorithm for biasing the random.
Specifically this quote: "The way to win here is to build the search engine all the hackers use. A search engine whose users consisted of the top 10,000 hackers and no one else would be in a very powerful position despite its small size, just as Google was when it was that search engine."
There has been a lot of grumbling about the state of search these days. Maybe the time is nigh for a new search engine?
It will be limited, but still quite powerful, similar to the way that we can pick and choose different host file sources from the web.
Before I knew about DEVONagent I would often just search multiple engines and sources trying to find something particular (e.g. a particular PDF) or unique results.
Does anybody know of something similar for Windows or Linux?
All it does is add -site:pinterest.com to the search bar for image results (can be configured to also do it for Web results), but it gets the job done.
> In the early days of the web, pages were made primarily by hobbyists, academics, and computer savvy people about subjects they were interested in. Later on, the web became saturated with commercial pages that overcrowded everything else. All the personalized websites are hidden among a pile of commercial pages. [...] The Wiby search engine is building a web of pages as it was in the earlier days of the internet.
For example, I submitted Pizza Hut's archived original web page, but it wasn't added.
Even for a search engine exposing niches, updating a directory manually will likely be too slow, unless the directory is maintaining a single niche (e.g., unladen airspeed of every species of swallow), but then we end up with some insane number of search engines and how to select which one?
The problem I'm running into is that I still have to use major search engines to find new content, way more than I'd like. I hope to make my local service available open source once I have 'federated' history search working, so that we can have a primitive search engine and share with people we trust. Also need to work out some security issues - it's scary having all the content you read and see on your home network, protected only by your hackily-patched-together security.
EDIT: Actually I'd like to elaborate a bit more in case anybody actually reads this and has any ideas. On the desktop side, it's pretty easy. I initially started out MITMing my own traffic with a self-signed cert added as a root cert to all my machines. This only works on my home network, so I did a VPN thing. This was way too clunky and the security concerns are innumerable. I ended up biting the bullet and writing a chrome extension which works wonderfully, except for some slight performance issues.
However, I wish to also archive my phone content - I read just as much on my phone as my computer. I can do it on Android with the MITM process, but the same issues as above still apply, and it doesn't work with iOS (at least I can't find a way).
I'm thinking of taking an open source project, like Firefox/Fennec and building it in to the app itself. In that case it may make sense to forgo the browser extension and just roll my own forked browser on every platform, even iOS. I don't know much about iOS dev though.
Wiby is based around two main things:
http://wiby.me/submit contains the submission criteria.
For example, I enjoy weightlifting and strength sports. I did a search for "muscle", and every result but one was using the word "muscle" as a figurative metaphor. Barely anything about actual muscles. Searching "funk" was just as bad. One page about Motown and a LOT of midis.
Ex: The network map for “weightlifting” would include many clusters, but 2 big ones would be the hypertrophic cluster (surrounded by a bunch of related terms) and toning cluster (calisthenics would be under this cluster for example). Click on either and the results will change accordingly.
This would actually work even better for subjects you don’t know much about, because Google will teach you about the salient clusters in that field. The clusters could be enhanced with popular images associated with each term. Popular clusters would display as larger than others.
I’d have to submit every blog post?
Basically the idea is to have people band together and "recommend" links. You then do your normal spidering of the websites to create a search engine (or even just call through to a number of existing search engines). However, the ranking of the results is based on the weighting of the recommendations.
It's essentially a white list based on your own personal bubble. Of course this won't work in general because you will always get SEO creeps spamming recommendations. However, it gives you tools for working around those creeps. The average person probably won't be able to manage it, but power users probably will.
By not trying to solve the problem for everybody, it makes it easier to solve the problem for some people. Or at least that's my thesis :-) I might be wrong.
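A toy version of the recommendation-weighted ranking described above might look like this. The trust weights, the recommender names, and the example domains are all invented; the upstream result list stands in for whatever your own spider or an existing search engine returns.

```python
from urllib.parse import urlsplit

# Hypothetical data: how much I trust each recommender (0.0 to 1.0) ...
trust = {"alice": 1.0, "bob": 0.5, "seo_creep": 0.0}
# ... and which domains each of them has recommended.
recommendations = {
    "alice": {"smallblog.example"},
    "bob": {"smallblog.example", "forum.example"},
    "seo_creep": {"spam.example"},
}


def domain_score(domain):
    """Sum the trust weights of everyone who recommended this domain."""
    return sum(
        weight
        for person, weight in trust.items()
        if domain in recommendations.get(person, set())
    )


def rerank(urls):
    """Order results by recommendation score; ties keep the upstream order."""
    return sorted(
        urls,
        key=lambda u: domain_score(urlsplit(u).hostname or ""),
        reverse=True,
    )


upstream = [
    "https://spam.example/buy",
    "https://forum.example/t/1",
    "https://smallblog.example/post",
    "https://unknown.example/",
]
print(rerank(upstream))
```

Setting a spammer's weight to zero is the "tool for working around those creeps": their recommendations stay in the data but stop moving anything up.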
If you're generous, you can make your index available to other P2P instances.
I wanted to run an API search the other week and was blown away with how quickly I could prop up my own custom search portal (I didn't want to pay for API access to other search engines, and YaCy comes with JSON and Solr endpoints).
I ran it locally to test my crawl filters, then pushed a private instance out to Digital Ocean to turn up the heat with the crawling. The only issue I had was the crawler would hit the max memory threshold on long crawls and the container would restart, but that was fixed by scaling up the box.
While I typically still use RSS for reading music blogs, I find having the search engine is a great way to go back and find something or discover something new! Every time I find a new blog, I just add it as an index to yacy to crawl.
I think it'd be great to see people spinning up larger instances that are highly specialized. For example, maybe a search engine that is dedicated solely to sci-fi and only crawls high quality boards, personal sites and blogs, and skips all the spammy, seo-optimized sites.
I share that same desire to visit the web less travelled. I want to discover interesting sites that deserve to be bookmarked because they will never show up in a search engine.
True "Interdimensional Cable" vibes.
This was fire. If a topic were being discussed on the web, you could find it with this tool. Unfortunately, it did not fit the vision of the parasitic overlords who bred us to produce and consume for their benefit.
You could add a bunch of heuristics such as size, number of links etc.
Maybe even train a classifier to select the “smaller” part of the web.
When I type “shoes”, it would give me: links for the functional and creative history of footwear, the taxonomy of shoes, methods of construction, current and historical footwear industry data, synonyms and antonyms, related terms and professions, the dictionary definition, and similar links related to secondary meanings (such as any protective covering at the base of an object, horseshoes etc). I’d also hope for a comedy link to a biography of Cordwainer Smith.
What I actually get, which I don’t want at all: pages and pages of shoe shopping.
The various means to exclude “top X sites” are the roughest possible heuristic in that direction, and throw out the baby with the bathwater (for example, a long-established manufacturer may well have an informational online exhibit)
Google has essentially failed me in its primary mission. Bing at least has the grace to admit they are here to “connect you to brands”. And sadly, right now, every other option is an also-ran.
In practice I use DDG, directed by !bangs towards known encyclopaedic or domain-specific sources. I am certain that I’m missing out.
* when you make a query to this knowledge base, it has a history of your prev searches / preferences (not google)
* it can propose several guesses at what your intent is in this particular query - and make much more detailed queries (auto include/exclude keywords, websites, etc.) to multiple sources (not only google, maybe anonymously)
* it can parse results from these sources and re-arrange them (using its own rank system) according to your preferences. In this system, you can explicitly say - I hate that, and I like that, and this will affect the behavior. Yes, this is an 'information bubble', but it is controlled by you and not by google!
* finally, this system may work in the background and handle 'research' search queries. What I mean here: currently, Google is about instant search - it gives you results in milliseconds, and that's all. It cannot spend much computation on a more precise, more intelligent check of the content behind the links in the search results - it cannot do reasoning - and you have to do that yourself: open links from the first page, close most of them immediately because they are not relevant to you, go to the second page and so on. It would be cool if most of this could be automated - with modern natural language processing approaches and old-school prolog-like reasoning this is realistic, not science fiction.
My vision is that this kind of search assistant cannot be SaaS / closed source. It is about freedom - and thus this should be an open source / self-hosted app that can be deployed on a PC or on a cloud VM - but hosting should be controlled by end-users, not companies.
I don't know if something like this already exists. If not, maybe it's time to create it.
Discovering unknown parts and blogs on the internet is one of the enduring goals of a newsletter that I run, which provides a single link to an interesting article every day, usually by lesser-known authors and blogs across the internet.
On a daily basis your brain uses shortcuts to get to the point.
Open Firefox (of course) ALT+B. Then add a new bookmark, for instance:
Name : Stack Overflow
Location : https://stackoverflow.com/search?q=%s
Keyword : st
Add "%s" to the search URL of each of your favorite websites.
Example : https://en.wikipedia.org/wiki/%s
To discover some new website content, apply the same trick to Hacker news, Reddit or any RSS River.
Voila, bye bye GG.
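For anyone wondering what the browser does with those keyword bookmarks, here is a small sketch of the substitution. The `st` entry is the one from the example above; the `w` keyword for Wikipedia is my own assumption, and the escaping is only an approximation of what Firefox does.

```python
from urllib.parse import quote

# Keyword -> URL template, with "%s" marking where the query goes.
bookmarks = {
    "st": "https://stackoverflow.com/search?q=%s",
    "w": "https://en.wikipedia.org/wiki/%s",  # "w" is a made-up keyword
}


def expand(typed: str) -> str:
    """Turn 'keyword some query' into the URL the bookmark points at."""
    keyword, _, query = typed.partition(" ")
    template = bookmarks[keyword]  # KeyError if the keyword is unknown
    return template.replace("%s", quote(query, safe=""))


print(expand("st python asyncio"))
print(expand("w Aarhus"))
```

So typing `st python asyncio` in the address bar goes straight to the Stack Overflow search, with no general search engine in between.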
See this example of filtering Stack Overflow out of search results:
Popularity, Relevance, Age, Type, etc. Type could be blog, forum, site, or video. Or like it used to be.
Control is being forfeited to steer users back to more profitable content in order to capitalize on a captive market.
I wonder if being open about it would be so bad for business, instead of the attempt to manipulate users into enjoying the ratcheting-up of their impotence.
Now, Youtube truncates search results and loads the recommendation stream instead, long before hits are exhausted.
At least it's been a while since Silicon Valley was keeping the mythical personalized advertising spiel in active circulation.
Then I use Violentmonkey, an open source js/css injector, to inject this user script: https://greasyfork.org/nl/scripts/1682-google-hit-hider-by-d... This will block specific domains for you in google, yahoo, duckduckgo etc. I use this to block domains like Quora, sourceforge, cnet and softonic.
The nice thing about this script is that you can permaban domains you know are junk and they will be removed completely, or you can ban a domain, such as a commercial website. When you ban something, it is not removed from google or duckduckgo; it only shows the title in light gray. I'm currently experimenting with this on some major webstores, so I can't really say whether this will help you, but it can be a good start.
(edit) I saw some people say why this was not possible before. Google allowed you to block domains and websites a few years ago, but they removed this feature. Duckduckgo never allowed you to do that because that would mean that you would have a cookie that remembers your preferences, and that is against their principles.
I knew about !bangs, but I didn't know you could put them anywhere in the query (e.g. "hello !g world" searches Google for "hello world"). This is going to save me a lot of time on mobile. Thanks!
Implementing this properly involves having your own search index. And that's pretty expensive.
Edit: Maybe it’s the first million results? I use it to find obscure things sometimes.
A search engine that returns results whose pages weigh in under a certain size.
From the comments it seems most of the "cruft" filling up Google results are newer web apps, generally JS-heavy and advertising-heavy, etc.
If you had a filter for pages with (e.g.) < ABC kb of JS, < XYZ external links (excluding img tags), I feel like there'd be a good chance that the "old" web and the "unknown" web would bubble to the top.
There are plenty of false positives (particularly for "small" forums built with modern JS apps, etc), but it could be one of many filtering tools to achieve better search results.
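Here is a rough sketch of what such a filter could look like on a single page, using only the standard library. The thresholds are placeholders for the "ABC" and "XYZ" above, the hostnames and sample pages are invented, and it only counts inline JavaScript: measuring external script files would mean fetching them, which this does not do.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PageWeight(HTMLParser):
    """Count inline JS bytes and off-site href/src references (img excluded)."""

    def __init__(self, own_host):
        super().__init__()
        self.own_host = own_host
        self.inline_js_bytes = 0
        self.external_links = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        if tag == "img":
            return  # images are deliberately not counted
        for name, value in attrs:
            if name in ("href", "src") and value:
                host = urlsplit(value).hostname
                if host and host != self.own_host:
                    self.external_links += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.inline_js_bytes += len(data.encode("utf-8"))


def looks_lightweight(html, own_host, max_js_bytes=2048, max_external_links=20):
    """Placeholder thresholds; tune them against pages you actually like."""
    parser = PageWeight(own_host)
    parser.feed(html)
    parser.close()
    return (parser.inline_js_bytes <= max_js_bytes
            and parser.external_links <= max_external_links)


# Made-up sample pages.
old_page = ('<html><body><h1>Swallows</h1><a href="/speeds.html">speeds</a>'
            '<a href="http://other.example/">a friend</a></body></html>')
app_page = ('<html><head><script src="https://cdn.example/app.js"></script>'
            '<script>' + "x=1;" * 1000 + '</script></head><body></body></html>')

print(looks_lightweight(old_page, "birds.example"))
print(looks_lightweight(app_page, "shop.example"))
```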
Now there are a few extensions that do that, but obviously they only hide the results from each page, so sometimes you will see pages with 2 results, if any at all.
But I find the search is at a much lower quality than Google.
[search term] -google -youtube -facebook ... -top100website and it should work.
I found a list of the top 1m alexa websites here:
An add-on with that list should do the work.
- there's probably a pretty low limit for size of Google queries, you'll likely hit it quickly
- you won't be able to search for e.g a story about YouTube censoring some content
facebook censorship -site:facebook.com
-site:google.com -site:youtube.com -site:facebook.com -site:jd.com -site:yahoo.com -site:wikipedia.org -site:amazon.com -site:netflix.com -site:reddit.com -site:live.com -site:zoom.us -site:okezone.com -site:alipay.com -site:instagram.com -site:twitch.tv -site:csdn.net -site:blogspot.com -site:microsoft.com -site:bing.com -site:github.com -site:tribunnews.com -site:myshopify.com -site:office.com -site:panda.tv -site:stackoverflow.com -site:ebay.com -site:bongacams.com -site:livejasmin.com -site:babytree.com -site:naver.com -site:apple.com <search query>
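Building that kind of query by hand gets old fast, so here is a small sketch that generates it from a blocklist. The `max_words` cap is an assumption standing in for whatever limit the search engine really enforces, and the function only builds the string; it does not send anything.

```python
def build_query(terms, blocked_domains, max_words=32):
    """Append -site: exclusions to a query without exceeding max_words."""
    words = terms.split()
    budget = max(max_words - len(words), 0)  # room left for exclusions
    exclusions = [f"-site:{domain}" for domain in blocked_domains[:budget]]
    return " ".join(words + exclusions)


# A few domains taken from the list above.
top_sites = ["google.com", "youtube.com", "facebook.com", "yahoo.com"]

print(build_query("facebook censorship", ["facebook.com"]))
print(build_query("bread baking", top_sites, max_words=4))
```

With a cap like this, only the first few entries of a long top-sites list ever make it into the query, which is why the list should be sorted with the worst offenders first.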
It’s custom google search results, but since it’s excluding .com, .net, .org etc then you probably won’t see any of the large sites there.
It’s also interesting to see which sites have been built in the last few years, as the new gTLDS haven’t been around that long.
I was intrigued by how dorkweed’s approach has changed over time, as described in a reply to a sibling comment.
As general search results get watered down and rotten tomato inflation maybe trends towards reflecting company interests rather than my interest-level, maybe it’s worth re-evaluating the vetting avenues we take as users.
Here’s mine: for games and shows I’ve recently found myself using quantity of fan-videos on YouTube as a proxy for quality. So far it’s been a decent means to find cult followings for something I otherwise wouldn’t necessarily hear about.
Obviously this approach has its flaws - and is subject to financial perversions to an extent - but I figure if enough people genuinely want to pay tribute to a work, it might be worth checking out.
Personal trick: I follow reaction video blogs, and if they are reacting to something then it is usually worth watching. But reaction blogs are only for short videos and other short form content.
I find that the YouTube sidebar is useful for me to find interesting music. I have eclectic tastes, and Google seems to have figured that out. I don't mind.
I suspect that it would be possible to create a custom API query to Google that would have a "blacklist."
Seeing folks mention the NOT operator (-). It's quite powerful! For example, you can do:
intext:"Powered by intercom" -site:intercom.com
will find all the sites that use the Intercom widget
~blog bread baking -inurl:checkout -intext:checkout
will find bread blogs (or similar) without commercial intent
I put together a list of the two dozen or so most useful templates of this, for folks who are interested:
I think they try to do exactly what you ask, but I haven't used them extensively so I don't know how good they are.
Each session would have an updatable list of sites that are favored, whitelisted or blacklisted for a particular class of search.
Anyone reading this, please post if you find any
1. Looking for niche domain or institutional/social knowledge produced by experts or insiders for an informed audience that isn't necessarily available in a scientific journal.
Especially with respect to the social sciences and literary analysis, there's a wealth of intelligent commentators that don't surface well on Google without very specific search terms, and the willful subtraction of domains like quora, medium, and tumblr.
They're usually contained on poorly maintained WordPress sites that the author has long-since forgotten about, or as invalid, handcoded html docs hidden in the personal subdomains of university professors and students.
2. Finding online communities that aren't a part of Reddit or a similarly prominent platform
Currently for three product tiers (furniture, home decor, and fashion/clothing) in 14 major US markets, where stores within ~100 miles or a ~2 hour drive are considered as part of the market.
Disclaimer: I'm one of the founders.
Google says they need our information to "improve our experience", but we can't tell them what to omit ...
It's kinda new so it excludes kinda everything :-) But you can make it work better :-)
If anyone noticed, during the first couple of days of covid, google search was free from large media results; the algorithm reverted to how it was years ago, and it was such a breath of fresh air. Of course they fixed the algo immediately, and it went back to only showing curated media results. There was an anon google employee who posted why this occurred.
When SEC laws, shareholder interest, quarterly performance and stock volatility come into play, corporations become this mindless soulless monster that will devour everything in its way and fuck consumers in every which way.
Democratization of funds from central authority to public creates disincentives and the shareholders don’t give a shit about many auxiliary things such as environmental concerns. Bottom line always matters.
It’s not just google but any public corporation. Can you imagine SpaceX being able to operate with the same passion with shareholder interests?
It is, potentially, the compensation plans. If you go to the proxy document and look at how comp plans are set, they usually hire a consultant, and "best practices" drivers are cash + big bonus based on typically some TSR (total shareholder return metric).
So for google, "don't be evil" is what's written down, but for the top execs "sell ads" is what gets them paid out before they retire. And those senior level "lifers" are what, 40 now?
Don't really have proof to support these claims though.
The problem I see on DDG & Google is having to scroll 5-10 pages of utter SEO nonsense.
"Do you have a question about ____? Many ask about _______. ____ is a common question, here the are we some answer. [sic]".
Just utter garbage pages.
It used to be just with recipes or medical questions, but now it feels like most everything that is a general query.
Especially removing Quora, Pinterest, and aggregation/reposting/SEO/affiliate blogs.
And all "product" images with a white background. Only show real photographs.
Just a thought experiment, curious what others think.
Can google allow us to exclude certain sites? I was surprised to see w3school showing up above the official documentation for pandas and numpy. This is simply ridiculous!!
A search engine that shows only urls that are not indexed by google / another one that gives you the websites with lower pagerank
"If you don’t read the newspaper you are uninformed; if you do read the newspaper you are misinformed."
There's so many Chinese forums for hardware/firmware hacking/mods, a shame translators are still very bad...
> Ask HN: Is there a search engine which
excludes the world's biggest websites?
> Discovering unknown paths of the web
seems almost impossible with google et
> Are there any search engines which
exclude or at least penalize results from,
say, top 500 websites?
Let's back up a little and then try for an
(1) For some qualitative exclamation,
there is a LOT of content on the
(2) There are in principle and no doubt so
far significantly in practice a LOT of
searches people want to do. The search in
the OP is an example.
(3) Much like in an old library card
catalog subject index, the most popular
search engines are based heavily on key
words and then whatever else, e.g., page
rank, date, etc.
So: (1) -- (3) represent some challenges
so far not very well met: In particular,
we can't expect that the key words, etc.
of (3) will do very well on all or nearly
all the searches in (2) for much of the
content in (1).
And the search in the OP is an example of
a challenge so far not well met.
Moreover, the search in the OP is no doubt
just one of many searches with challenges
so far not well met.
Long ago, Dad had a friend who worked at
Battelle, and IIRC they did a review of
information retrieval that concluded
that keyword search covers only a
fraction, maybe ballpark only 1/3rd, of
the need for effective searching. And the
search in the OP is an example of what is
not covered because the library card
catalog did not index size of the book or
Web site! :-)!
Seeing this situation, my rough, ballpark
estimate has been that the currently
popular Internet search engines do well on
only about 1/3rd of the content on the
Internet, searches people want to do, and
results they want to find.
So, I decided to see what could be done
for the other 2/3rds.
I started with some not very well known or
appreciated advanced pure math; it looks
like useless, generalized abstract
nonsense, but if calm down, stare at it,
think about it, ..., can see a path for a
solution. Although I never thought about
the search in the OP until now, in
principle the solution should work also
for that search. Or, the math is a bit
abstract and general which can
translate in practice to doing well on
something as varied as the 2/3rds.
Then for the computing, I did some
original applied math research.
Using TeX, I wrote it all up with theorems
So, the project is to be a Web site.
While in my career
I've been programming for decades,
this was my first Web site. I selected
Windows and .NET, and typed in 100,000
lines of text with 24,000 statements in
Visual Basic .NET (apparently equivalent
in semantics to C# but with
syntactic sugar I prefer).
The software appears to run as intended
and well enough for significant
I was slowed down by one interruption
after another, none related to the work.
But, roughly, ballpark, the Web site
should be good, or by a lot the best so
far, for the 2/3rds and in particular for
the search in the OP.
there's one coded and running and on the
way to going live!
I intend to announce an alpha test here at
Before you can even do a keyword search, you obviously need an intent to do so. But that means keyword search is pretty useless when you don't know what you don't know.
Encoding that intent...maybe doesn't matter for common searches, but everyone has heard of the concept of "Google-Fu". English text is a pretty lossy medium compared to the thoughts in people's heads...Shannon calculated 2.62 bits per English letter, so the space of possibly-relevant sites for almost any keyword is absolutely enormous (e.g. there are about 330,000 7-letter English keyword searches...distributed across how many trillions of pages, not even counting "deep web" dynamically generated ones?). So we punt on that and use the concept of relevance for sorting results, and in practice no one looks beyond the first 10. I don't know what an alternate encoding might look like though.
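For anyone checking the arithmetic behind the "about 330,000" figure: it is just 2 raised to the number of bits in 7 letters at 2.62 bits each. A quick sketch, where the variable names are mine and the 2.62 figure is the one quoted above:

```python
bits_per_letter = 2.62   # figure quoted above
letters = 7

total_bits = bits_per_letter * letters   # about 18.3 bits
distinct_queries = 2 ** total_bits       # roughly 3.3e5

print(f"{total_bits:.2f} bits -> about {distinct_queries:,.0f} distinct queries")
```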
> Before you can even do a keyword search, you obviously need an intent to do so. But that means keyword search is pretty useless when you don't know what you don't know.
Right: The way at times in the past I have put something like that is to say that, ballpark, to oversimplify some, keyword search requires the user to know what content they want, know that it exists, and have keywords/phrases that accurately characterize that content. For some searches, e.g., the famous movie line
"I don't have to show you no stinking badges",
that is fine; otherwise it asks too much of the user.
For "encoding", my work does not use keywords or any natural language for anything.
The role of the advanced pure math is to say that the data I get and the processing I do with that data and what is in the database should yield good results for the 2/3rds. The role of my original applied math is to make the computations many times faster -- they would be too slow otherwise.
When keywords work well, and they work well enough to be revolutionary for the world, my work is, except for some small fraction of cases, not better. So, there is ballpark the 1/3rd where keywords work well. Then there is the ballpark, guesstimate, 2/3rds I'm going for.
My work is not as easy for the users as picking a great, very accurate, result from the top dozen presented by a keyword search engine, e.g., the movie line example, but is much easier to use than flipping through 50 pages of search results and is intended usually to give good results unreasonable to get from a keyword search, without "characterizing" keywords, that yields, say, millions of search results and would require a user to flip through dozens of pages of search results.
Ads are off on the right side and not embedded in the search results. The SEO (search engine optimization) people will have a tough time influencing the search results!
We will see how well users like it. If people like it, then it will be good to make progress on the huge, usually neglected, content of the 2/3rds.
But thanks for your interest.
- health search that excludes sellers, wellness and snake-oil websites
- news search that excludes conspiracy theories, magical thinking, political operatives, and paid bloggers
- image search by similarity, similarity to an uploaded picture/s, words, or description
- media and warez search engine that excludes link-spam and malware sites
- complex queries search because none of them do it well
- shopping search that kicks out disreputable sellers and phony store-fronts
- mapping like OSM but fast, practical with an app, and detail-accessible
- monetize using affiliate links that don't affect ranking
- semi-curated results (domain reputation-ranked voting)
- related pages
- inbound/outbound links search
- archive.org integration &| history page caching
- documented query syntax
- query within results
- quick query history results navigation
- keyword alerts
- keyboard shortcuts that always work