One company's plan to build a search engine Google can't beat (protocol.com)
264 points by prostoalex on July 26, 2020 | 321 comments

Lately I find myself inserting "site:reddit.com" at the end of my Google searches. Most of the stuff that shows up by default is pure garbage that is riddled with ads and doesn't answer my question quickly.

Anecdote 1: I was Googling a lot of information on Angular, back when I was first introduced to the framework a year or two ago. I'd say that well over half of the content surfaced by Google was SEO spam: marketing masquerading as tutorials, with the sole intent of upselling me on some shitty Angular plug-in. A lot of them are clever about it too; they don't upsell you until you've invested a decent amount of time reading their "tutorials". The SEO spam only went away as I began entering more granular search queries, as my familiarity with the framework improved.

Anecdote 2: I have a close friend that is a high school teacher. She's not all that tech savvy, so I help her out sometimes. Google is damn near useless when it comes to helping her develop her courses. Not exaggerating at all... maybe 60, 70... 80 percent of the results for educational queries are SEO spam. Every result is essentially, "want to learn how to write a monologue? Click here to pay $20/month for the privilege".

It's gotten so bad that I jokingly tell her that it would be more efficient to just walk down to the library and take out a book on whatever topic she's querying. But, you know, they do say that every joke has an element of truth to it...

Searching for teaching materials or activities for children is so bad that developing the materials yourself is usually far less work!

I doubt the declining quality of search results is a product of Google's own advertising business, as the article would have us believe; I think it is mostly a product of third-party SEO spam. This outcome was also wholly predictable. Part of the appeal of Google to early adopters was the lack of the SEO spam that destroyed the utility of earlier search engines. I recall conversations that foresaw this outcome, back when Google was so new that they eschewed graphical ads, though our predictions may have been a bit on the pessimistic side (i.e. Google held up against SEO longer than we thought it would).

I have a more pessimistic theory. Google's search based on PageRank used to work well at the start because the "SNR" in the underlying graph was really high. Lots of people wrote personal web pages, linked to other interesting content, etc. Contrast that with today: the majority of web pages are automated social media fronts with content optimized for engagement and SEO. The thing that made PageRank work is mostly gone.

Arguably PageRank cannibalized itself: it created incentives to make the web itself worse.
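To make the PageRank point concrete, here is a minimal sketch of power-iteration PageRank on a toy graph. The page names, the link structure and the 0.85 damping factor are illustrative assumptions, and this says nothing about how Google ranks pages today. It only shows how a few pages that exist solely to link at one target can lift that target's score.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {page: [pages it links to]}."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no outlinks): spread its rank over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical toy web: two personal pages link to each other and to a tutorial.
organic = {
    "blog_a": ["blog_b", "tutorial"],
    "blog_b": ["blog_a", "tutorial"],
    "tutorial": ["blog_a"],
    "spam": [],
}

# Same web plus a link farm of five pages that all point at "spam".
farmed = dict(organic)
for i in range(5):
    farmed[f"farm_{i}"] = ["spam"]

before = pagerank(organic)
after = pagerank(farmed)
print(round(before["spam"], 3), round(after["spam"], 3))
```

The spam page's score rises once the farm exists, even though no organic page links to it, which is the incentive problem described above.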

Given that we have AGI around the corner (cough), building a model which discriminates SEO spam from original content might be viable, don't you think? Or for every new domain in the index you check whether it's legit or not by human intervention, then repeat each year and build a blacklist, similar to how blacklisting IPs works for email (how do these sites even get listed? There should be networks apparent!). One could also add a "report abuse" button and set up a process to unblacklist legit sites (the same process for email works very well with most providers... except Google.)

My theory, why this is not happening: I guess that most of these SEO-spam sites are actually including Google-funneled ads, so this means there's something wrong there too...

AGI may initially reduce the incidence of SEO spam, but I very much doubt that it can eliminate it. The same can be said for human vetting. The thing to keep in mind is that SEO is performed for different reasons and takes many forms, so developing a universal model will likely be impossible. When the current forms of SEO spam become less effective, more effective forms will be adopted. Any form of filtering also presents the problem of false positives. More aggressive filters would likely produce more false positives.

Blacklists are also problematic. I have an email account with a smaller provider that I never bother to use since there is a good chance that their servers are blacklisted at any given point in time. Getting their server removed from these lists is non-trivial in most cases. Once they are removed, they usually get relisted within a few months. The problem also runs in the opposite direction: I have corporate email accounts where every external email (including those from certain departments of their own organization) is labelled as such and as potentially suspicious simply because blacklists are not sufficient. The only reliable outcome of blacklisting is the reduced reliability of communications channels.

User reported abuse is even more problematic since it opens up avenues for abuse. While it may be relatively easy to filter out bad reports in situations where there is a minimal vested interest (e.g. finding something disagreeable), that won't be the case when there is a considerable vested interest (e.g. attacking the competition).

Can GPT-3 be flipped on its head, to remove such content, rather than create it?

Wait, isn't the point of AGI that it reasons like a human? E.g. for a simple (SEO-plagued) query like "disassembly of consumer electronics X", I get actual, worthwhile content and not a 5min video-ad? Basically every mildly intelligent human can tell this content apart from fake content; AGI should be able to do the same imo...

Then you have hundreds of SEO companies pitting their own AGI against Google's.

I mean, can we really call it an intelligence if it wants to work in marketing?

Obviously, because it has figured out the highest return for the least effort.... Get that "agi" a contract, please.

What if the SEO companies keep improving their AGIs with the aim of beating Google's AGI and in the process inadvertently start producing high quality content? Will SEO companies morph into knowledge mining and organizing companies (the business Google is supposed to be in)?

Obligatory xkcd:


Sounds like a GAN with more steps

The browser extension uBlacklist* "blocks specific sites from appearing in Google search results". It also supports DuckDuckGo and Startpage. You can add subscriptions, like those in ad-blocking extensions. Maybe this is what you want.

*: https://github.com/iorate/uBlacklist
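The idea behind such a blocklist can be sketched in a few lines. This is not uBlacklist's actual code or rule syntax; the function name, the example domains and the simple "domain plus its subdomains" matching rule are assumptions made for illustration.

```python
from urllib.parse import urlparse

def filter_results(urls, blocked_domains):
    """Drop result URLs whose host is a blocked domain or a subdomain of one."""
    blocked = {d.lower().lstrip(".") for d in blocked_domains}
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        parts = host.split(".")
        # The host plus every parent domain:
        # a.b.example.com -> b.example.com -> example.com -> com
        candidates = {".".join(parts[i:]) for i in range(len(parts))}
        if not candidates & blocked:
            kept.append(url)
    return kept

# Hypothetical search results and a one-entry personal blocklist.
results = [
    "https://seo-spam.example/angular-tutorial",
    "https://docs.example.org/guide",
    "https://blog.seo-spam.example/top-10-plugins",
]
print(filter_results(results, ["seo-spam.example"]))
```

Matching on whole domain labels rather than substrings avoids blocking an unrelated site whose name merely contains a blocked name.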

Google and publishers are connected in a feedback loop. Google subsidizes content -> content makes Google useful. Google has always had the upper hand in this relationship and set the price tag for content makers. Ad prices have gone down the drain the past 10 years, so naturally you'll only get SEO spam because there is no motive for quality content. If Google wants that to change, they have to start subsidizing publishers again. They don't seem interested in that, so perhaps they're betting on AI that will de-SEO the content.

In today's world of powerful neural networks, one would think training a network to identify sites which try to get the user to pay for something wouldn't be that hard.

I think this is instead an active choice on the part of Google. When the options are "show someone's personal blog as the top result" or "show some company as the top result", they always choose the latter because a company is somehow believed to be more trustworthy.

Google will return different types of results depending on your query. For example, shopping-style queries produce Pinterest-like sites. Questions produce Yahoo Answers, Quora, and other FAQ-style sites. It's much more nuanced than just "show a company".

Also, the AI you speak of can equally be used to produce content that looks real but is not. It's even possible to produce articles for which it is impossible to determine whether a human wrote them.

This is a cat and mouse game that won't end. There's too much money on the line.

I wish I could see that powerful AI in action. Think of something easy like, for instance, an AI-powered point of sale in the supermarket. Right now many shops are introducing "automatic" POS, and the automation means that instead of a shop assistant, I need to scan every product myself.

Those badass AI should be able to use their image recognition power and be able to recognize products I put on the counter and calculate the price. It seems this is still beyond capabilities AI has (apparently playing chess or go is way easier than working in the grocery store).

This is more or less how Amazon Go stores work.

> It seems this is still beyond capabilities AI has

Really? Classification of images on a known static background should work very well. Or at least well enough that you can request a manual scan of every 10th item and still get a large speed increase. The bag weight works as a double-check anyway.

See Amazon Go. No need to even put things on a counter.

A friend of mine who is not too tech savvy always clicks on these sponsored Google links, which take up a page and a half on some screens, when looking something up. I always tell him "you clicked on an ad!" and still, he always clicks on the 2-3 ads showing up before the results, even years later. Mind you, he's a UI/UX designer and his job is attention to detail, yet he still cannot discriminate between results and sponsored links on the Google search page. I wonder how many people do that, but it must make Google a lot of cash!

Only a small % do that. However that small percentage is enough to generate billions in revenue.

Just install an adblocker for him.

It seems personalization/bubbling is contributing to these garbage searches. From my limited observations of close friends and family, different people tend to get different numbers of useless results for the same query string. I personally disabled all possible personalisations in all Google products, including history of searches, YouTube views, etc., and it seems to actually help; at least I really don't find myself complaining about Google's quality at all recently.

Similarly: try comparing big data warehouses/tools by use case.

I know some people who sell their teaching material on https://www.teacherspayteachers.com/ -- that might be worth a shot. Sure, you're still paying, but it's honest & upfront, the prices seem very reasonable to me and on the average the material is better than random searches would find.

Are you defining all paid content as spam? Or all sites that sell products?

It's one thing to be selling Angular templates, when the title on your website, as seen thru Google, is "Buy Angular Templates"

It's a totally different thing to be selling Angular templates, when your website is titled "Tutorial: Learn how to write Angular services"

One of the two is clearly misleading.

I find google's search filters are 100% broken. Just tested this myself again.

Try searching for

site:reddit.com best headphones

And click "Tools" and set the date filter to be "Past week" for example.

Click the first result. You will notice it's from a year ago. Click the second result, it's from 2 years ago and so on. Not a single result matches the date filter. Similar issues with the other filters.

It's 100% broken and been this way for at least a year now as far as I remember.

The Google filter means modified in the last 2 weeks, and dynamic pages have related posts and other parts of the page that change, so those filters are not going to work well on something like Reddit.

I find this sort of issue very encouraging. You sometimes think, how am I ever going to compete with someone like Google? They can hire more PhDs per day than I can write lines of code.

And then this. An unsolved problem that they could have solved 20 years ago - recognising that pages are not necessarily one atomic unit and that different parts can be updated at different times. Or more generally that different types of websites require slightly different approaches to search.

It's not a trivial problem if you think about it. But for a search company to get this completely wrong even for a global top 100 discussion forum requires a severe lack of incentives.

And that's where I get the sinking feeling that I shouldn't be encouraged by Google's failure at all, because those incentives are difficult to fix for anyone. I'm pessimistic about Ramaswamy's approach. Putting a paid ad blocker on Bing isn't going to fix this.

I am fairly certain this feature used to work well earlier last year. It broke sometime in summer last year as I even made a post on Reddit asking if this was a new bug.

Unless the redesign of Reddit broke this somehow? But as far as I know, this has broken for other sites too. And Reddit redesign is over 2 years old now.

Google should rely on a last modified date supplied by an author, and penalise sites that change it without actual changes to the content.

Or alternatively, improve their algorithm for detecting page changes.

Somehow it should be improved, as it’s a useful feature.
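One way to picture the "penalise date changes without content changes" idea is to fingerprint only the main content of a page and compare it across crawls. The sketch below assumes the main text (for example the original post, without sidebars or related links) has already been extracted, which is the hard part in practice. `Snapshot`, `fingerprint` and `classify_update` are hypothetical names, not part of any real crawler.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Snapshot:
    claimed_modified: str  # date the site reports, e.g. "2020-07-26"
    content_hash: str      # hash of the main content only

def fingerprint(main_text):
    """Hash the main content after collapsing whitespace and case."""
    normalized = " ".join(main_text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def classify_update(old, claimed_modified, main_text):
    """Compare a new crawl against the previous snapshot of the same page."""
    new_hash = fingerprint(main_text)
    date_changed = claimed_modified != old.claimed_modified
    content_changed = new_hash != old.content_hash
    if content_changed:
        verdict = "content changed"
    elif date_changed:
        verdict = "date bumped without content change"
    else:
        verdict = "unchanged"
    return verdict, Snapshot(claimed_modified, new_hash)

# Hypothetical thread whose reported date moved but whose text did not.
old = Snapshot("2018-03-01", fingerprint("Best headphones thread: original post text"))
verdict, _ = classify_update(old, "2020-07-20", "Best  headphones thread: original post text")
print(verdict)
```

A date filter built on such a signal would key on when the main content last changed rather than on whatever else the page re-rendered.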

This. It used to go by the date the Reddit thread was created, but now they’re going by modifications, which makes the timeframe useless. Unfortunately, Reddit’s built-in search is also lacking.

I mean, that's why it's broken - but it's still broken. It doesn't filter pages based on when the relevant content was added.

Reddit is rank 20 on Alexa. They could make it a special case.

I also noticed the same when it comes to Reddit. I usually search for a specific problem, for example "sony TV sound issue" and want only recent results (TVs that use the latest firmware). The Reddit results appear as "1 day ago" even if they are 4 years old.

I tried it just yesterday when I wanted some real reviews for a product and it worked well.

Your Google is not everyone elses Google.

While smaller companies seem to have a hard time implementing A/B testing, Google seems to be running tens or hundreds of tests continuously.

Here's a nice trick: If you report any issue on a search results page it seems to opt you out of the experimental group and you get normal results for some hours/days/weeks. Still not great like in 2009 but not as crazy as whatever bugs you now. At least it has worked for me on a couple of occasions.

Googlers: If you are relying on people giving feedback to know if a change is annoying users, be aware that your feedback process actively discourages people from submitting feedback. Even I have to be reaaalllly motivated to send feedback.

> running tens or hundreds of tests continuously.

How quaint.

A common human fallacy is to pick a medium size number and think it is big. People have trouble grasping large scale (perhaps due to how our senses instinctively use a logarithmic scale)

A common human fallacy is to pick a single misstep and over-generalize it into a larger pattern of... oh God, I’m doing it now, aren’t I?

I bet they're doing a lot more A/B tests than that. I work with a system that has thousands going on at any one time, I wouldn't be surprised that that is typical for bigger companies that get into the testing culture like that.
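For readers wondering how thousands of tests can run at once, one common approach is to assign users to each experiment by hashing the user id together with the experiment name. This is a generic sketch, not a description of Google's system or of the commenter's; the `bucket` function and the experiment names are hypothetical.

```python
import hashlib

def bucket(user_id, experiment, treatment_share=0.5):
    """Deterministically assign a user to 'treatment' or 'control' for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if fraction < treatment_share else "control"

# The same user gets an independent assignment per experiment name.
assignments = {exp: bucket("user-42", exp) for exp in ("new_ranker", "blue_links")}
print(assignments)
```

Because the assignment depends only on the id and the experiment name, no per-user state has to be stored, and adding another experiment does not disturb existing ones.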

Very likely.

I was just thinking a few days ago: Google is starting to resemble the last days of Altavista, where the results were full of junk.

If you use Firefox, some results are full of a Firefox-specific scam where "you are the billionth search result" (maybe it also happens for other browsers). This has been going on for years now. I noticed it because scammers were scraping my blog for content and republishing it, but making search terms redirect to scam sites.

My main gripe with Google is that it frequently omits keywords, contrary to DDG. As to trying to make people confuse organic results and ads, Bing is by far the worst. It often proposes malware sites when queried with open-source projects (try VLC, Audacity).

Sure having a search engine do exactly what the user requests would be great for us, IT people. Not sure it is possible to build much traffic with that alone, though.

DDG also omits search terms, sadly. It just does so a bit less blatantly than Google.

I think DDG started doing it within the last 6 months. When I first switched to it, this was not a problem. Now I am almost back to using Google, because when Google omits terms, results are a bit more on topic (when forced not to, DDG and G work similarly).

DDG always omitted terms as far as I can remember (definitely 1+y).

I can't say for sure if this got more frequent in recent times or not.

I made a web search tool [0] that pulls Reddit results (and other sources) directly into the SERP.

I'm working off the thesis that combining highly relevant vertical results is the best way to combat SEO.

As a fun aside, when I did a Show HN for this tool, the title was "Runnaroo, a new search engine that didn't raise 37.5M to launch" as a friendly joke toward Neeva. Dang changed it later to be just "Runnaroo, a new search engine."

[0] Example search for "bose headphones reddit": https://www.runnaroo.com/search?term=bose%20headphone%20redd...

This is an interesting idea! It's something I've been thinking a lot about recently, so it's nice to see someone actually do it.

I saw from your previous post that you're using Google for the web results, but the only option listed in their docs has a 10k queries/day limit. Have you been able to get them to agree to a higher limit, or are you planning to move off of Google once your traffic grows?

Also, your example search has a character encoding issue - "Stolen iPad Pro & Bose headphones".

i've found runnaroo to be super helpful with programming related results where i can actually find useful blog posts instead of those spam SEO low quality ones. awesome site. thanks

i love this. in the article, it's mentioned how one idea is stackoverflow results, and then i see that that's nicely done in runnaroo. as well as others depending on the search. excellent search tool. cheers!

Sort by date doesn't work. Neither does it on Google, but if you're trying to get me to bookmark, what's your distinct advantage over the easy to remember site:reddit.com?

Your feedback is helpful. I just added the sort by date feature, and it is a work in progress.

Regarding, "what's your distinct advantage over the easy to remember site:reddit.com"

It's arguable that just typing 'reddit' with the query is easier to remember than typing "site:reddit.com" for most people, but you can have the best of both worlds and still use the site operator and get direct Reddit results [0].

[0] https://www.runnaroo.com/search?term=site%3Areddit.com+best+...

I've been doing this too sometimes. i swear google has gotten worse. sometimes it just ignores half the search terms, even if you put them in quotes. and yeah, I also get a lot of SEO spam results

I switched to DDG mostly, and nowadays its results are great for most cases. Imo Google is a habit to break.

I switched to DDG a while back, when I jumped off Chrome as it was threatening adblockers, but I'm not always satisfied with the search results:

it's great if I know what I want to search (i.e. I know the field and the keywords and I need the specifics) but when I don't know what I want to search I can't use it for discovery (i.e. when learning new things and I lack the terminology)

The problem with DDG is that it doesn't do really well for non-English results. I use DDG now, but sometimes I do have to use !g to redirect to Google.

The bangs are one of the main draws of DDG for me. Besides not being google.

DDG. Come looking for anonymity, stay for the bangs.

I used https://lite.duckduckgo.com for a long time because of its near instant redirects, but it recently broke Firefox's omnisearch, and so I've all but stopped using it for even that.

The bangs are painfully slow over the main duckduckgo site.

Miss the bangs.

Being made by the French, Qwant.com seems to give better non-English results than DuckDuckGo and gives similar priority to privacy.

Qwant has a reputation on r/france for being completely awful. Not sure how justified that reputation is.

Also, for non-trivial results, Qwant is mostly a fancy Bing front-end, so you might as well use Bing.

I have been using DDG a little more frequently recently; however, its image search was useless whenever I tried it.

Funnily enough, my experience is that the best image search engine is Yandex's. E.g. searching for a page from a comic with Google just gives me random comics, but Yandex's will more often find that exact comic.

Also, Yandex has a reverse image search [0] that works for faces. It's a little creepy, but an interesting tool to find similar looking pictures of people.

[0] https://yandex.com/images/

Used DDG for two years. Now switched to Qwant. Much better and not US based.

Not a fan of Google, but Qwant is simply unusable.

Google used to have a forum tab for search results with only user generated content among discussions and forums! Damn, I still remember how sad and frustrated I was on the day they removed it with no explanation.

Google search started to decline in quality since the release of GooglePlus (since they removed the + operator and replaced it with double quotes).

Double quotes existed before as well. + and " had different meanings but were combined into one.

And yes, it was around that time that search quality took a nose dive for me.

AFAIK "+" is just space replacement, and quotes don't work anymore, verbatim flag is now used instead.

Note that I use past tense in my post above.

Also unless you work at Google or have done extensive research: I think you are still wrong.

you must be a teenager for not remembering the + operator... it was really useful for finding what you want.

Same here. From basic questions to daily struggles, I go to reddit. Maybe I can relate more to the opinions of common people than the so called expert advice in the websites.

It’s all blog spam these days. Feels like reading AI generated junk that tells me nothing of value

Funny enough, people on specialised reddits can be much better experts than random blog spammers.

Yes, same here. I noticed reddit is much better at providing (useful and direct) answers.

I think part of the reason for that is that off-reddit there's so much emphasis on SEO and analytics (trying to hit all the right keywords, linking to other pages/sites, tricking people to stay on a page/site longer etc.) that everything becomes very cluttered very quickly.

Maybe if enough people do the same, reddit will start attracting the same kind of SEO spam. It's almost like Google is the victim of its own success. If everyone tries to game your system, some will succeed.

This. Google is in a bad state after their switch to the NLP engine. Now when you search for very specific things you end up with a list of all phone number combinations or some other spam.

!r on DuckDuckGo is slightly less typing and easier on the muscle memory.

I use !g for Google pretty frequently, but there are many searches I just send directly to the site I think has the best answer.

And if your queries are too specific while not being logged in, you get the "unusual traffic" captcha and can volunteer to improve Google's image recognition programs.

which is weird, because I have never found a useful response on reddit. I find people asking the exact question I have, but then there are either no responses, joke responses, or people also asking for help.

I'm still tricked every time though, I see the result matching exactly what i'm looking for, I disable my reddit block from browser, click the link, and am 99.9% of time disappointed in the result.

I’ve been doing the same for a while, but Google also seems to be actively limiting and censoring reddit results as well.

I'm not sure I'd search on reddit if I wanted quality results. ;)

Actually, Google has gotten so bad that stuff on reddit is more informative. That's got to tell you where Google is at this point.

How to google: make duckduckgo your default search engine, enter "[term] !g site:reddit.com".

DDG has a Reddit bang which does the same: !greddit

> Rather than try to build a search infrastructure from scratch, Neeva instead opted to use Bing's search API for its basic results.

So it's another UI over Bing like DuckDuckGo? I'm not too optimistic; at the moment there are fundamental issues with how search engines interpret text and rank results.

Is DuckDuckGo just a Bing UI? I thought they had their own index.

DDG USED to be Yahoo (which... is just Bing results too. Yup, I’m serious). There’s a bunch of alternative search engines that are UIs on top of Yahoo/Bing. Ecosia is one of them too.

I’m pretty confused then. What’s the benefit of DDG over Bing? Is it just marketing?

The two that come to mind offhand:

Privacy. DDG at least claims to make it a first-class feature, and one would imagine that means that they're not selling you out when they pass the search along to Bing. Going directly to Microsoft may be OK. I haven't really bothered to look into it; I just went with my warm fuzzies. This is a spot where Microsoft has a checkered past, and it's going to take a bit more than the ill-fated "Scroogled" ad campaign to change minds there.

Cleaner UI. Bing's interface is relatively cluttered compared to DDG's. It loads all sorts of images, sticks a chumbox on the bottom of the home page, nags you to download Edge, etc. If I run a search, I often have to scroll through two entire screen heights of I-don't-know-what before I get to actual webpages. Lately, DDG has been adding clutter to their site as well, but there's still quite a lot less of it, and what there is tends to be less visually noisy.

It still baffles me that Microsoft still doesn’t get the value of clean UI. What are the designers there thinking?

Same with Windows Menu? WTF is cotton candy doing there on a fresh install?

Why does opening Edge have so much msn news spam?

Like is Microsoft just oblivious to what the user really cares about?

Most likely some exec's compensation is tied to how many ads they shove in, so they prioritize that over the user’s experience.

Microsoft's behavior starts to make a lot more sense if you think of it as a large conglomerate of smaller organizations, each with its own agenda, and its own ways of throwing its weight around.

For a while, you could get away with interpreting Apple's behavior as if it were a single person with a coherent mind. Since 2011, though, that model's been getting less and less workable for Apple as well.

Or more likely MS figured out they can make more money licensing Bing search to other search companies instead of trying to appeal directly to consumers.

- bangs (put g! or !g anywhere in your query to search on Google, !gem searches on rubygems etc)

- searching via POST requests (so that your searches are not saved in browser history)

- I heard many people like DDG's browser for iOS, as a dedicated incognito browser
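The bang feature in the first bullet is easy to picture as a small lookup-and-redirect step. The sketch below is not DuckDuckGo's implementation; the two-entry table, the target URL templates and the `resolve` function are assumptions for illustration (the real service maps thousands of bangs).

```python
from urllib.parse import quote_plus

# Illustrative table only, keyed by the bang name without the "!".
BANGS = {
    "g": "https://www.google.com/search?q={}",
    "r": "https://www.reddit.com/search?q={}",
}
DEFAULT = "https://duckduckgo.com/?q={}"

def resolve(query):
    """Return the URL a query should go to, honouring the first known bang."""
    words = query.split()
    for i, word in enumerate(words):
        # Accept both "!g" and "g!", anywhere in the query.
        if word.startswith("!"):
            key = word[1:]
        elif word.endswith("!"):
            key = word[:-1]
        else:
            continue
        if key in BANGS:
            rest = " ".join(words[:i] + words[i + 1:])
            return BANGS[key].format(quote_plus(rest))
    # No known bang: fall back to the default search.
    return DEFAULT.format(quote_plus(query))

print(resolve("best headphones !g site:reddit.com"))
```

Unknown bangs and ordinary words that happen to end in "!" fall through to the default search, so a typo never loses the query.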

> bangs (put g! or !g anywhere in your query to search on Google, !gem searches on rubygems etc)

Which you can do inside your browser too.

> searching via POST requests (so that your searches are not saved in browser history)

That's not a benefit to me.

If I want less browser history I'll handle that locally, thank you.

> I heard many people like DDG's browser for iOS, as a dedicated incognito browser

That's not a benefit of the site.

Better privacy and better infoboxes sometimes are what I would immediately list, but purely on features that's not very compelling.

However a lot of us find the privacy nice and it cuts down on google searches which is what a g! is. If you use that you just gave up your privacy, but it's nice to have alternatives. Not worrying about search history is also a great benefit for most of us as well. Good luck with your preferences though.

They have some extras built on top of it, but imo yeah mostly marketing. In general no one knows how their arrangement with Microsoft works, would be nice to hear the details.

While I don't know how their arrangement with Microsoft works, anyone can pay for API based access to Bing.

Prices are here:


They have an ad revenue sharing agreement with Yahoo (essentially reselling Bing Ads) as well.

Little known fact: if you buy anything on Amazon, DuckDuckGo gets to know exactly what you bought (the precise item) as part of the Amazon affiliate program.

> Little known fact: if you buy anything on Amazon, DuckDuckGo gets to know exactly what you bought (the precise item) as part of the Amazon affiliate program.

This isn't specific to DuckDuckGo in any way whatsoever. This happens for ANY affiliate. I've used the Amazon affiliate program, and every month I would get reports of the exact items people were purchasing. I couldn't link those purchases to any particular individual, mind you, but I could see exact items.

But it violates DDG's primary business proposition of not tracking you.

Do they track people based on that?

They probably do have access if they wanted it but then they would be violating their TOS and spirit of their entire company by doing so. Whether you trust them or not is one thing but saying they do it is unfair and not called for.

they don't track you

What about Bing?

Doesn't see who makes the request.

DDG proxies it so bing doesn't know the IP or person originating it. So bing is simply paid for it by DDG.

Only for their infoboxes. The organic results are just Bing.

That’s not true.

DDG has their own crawler and their results are a composite of many different sources of which Bing is one.

It’s easy enough to check. I searched my name on both DDG and Bing, the results are completely different.

While they do have a crawler (DuckDuckBot) they primarily rely on other search engines the main ones being Bing and Yandex.

The fact that they return different results doesn’t mean much. Firstly, the Bing API and web search return slightly different results, especially if you use some of the extended parameters. Secondly, different results don't rule out Bing as the source, since they can apply their own weighting/page-ranking algorithms, filters, etc. on top of Bing.

DDG isn’t just a UI for Bing but their results rely on Bing.

> I searched my name on both DDG and Bing, the results are completely different.

Search for "what is my ip" and you will see Bing bot IP in the DDG snippets.


> To do that, DuckDuckGo gets its results from over four hundred sources. These include hundreds of vertical sources delivering niche Instant Answers, DuckDuckBot (our crawler) and crowd-sourced sites (like Wikipedia, stored in our answer indexes). We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google).

In other words, their own crawler and the other 400 sources are used for their Instant Answers and widgets while all "traditional links" (i.e. the search results) come from Bing.

I do believe your main point is correct, however your easy check doesn't prove it.

Even back when DDG did only use Bing/Yahoo data, you'd likely have seen different results for your name depending on what personalised results Bing/Yahoo might show you, or other aspects (such as weighting applied to your location).

What’s the point of using Neeva if it’s just going to serve me Bing results with fewer ads?

Presumably it's a matter of how they combine aspects of the results, focusing on being more useful, as the article explains.

A distinct focus isn't just "Bing without ads", in a similar way that Bing isn't just Google with different ads. They can have different focuses, qualities and utilities.

The value of a search engine is in the size of its index and the ranking it does.

The UI is less important. Those 3-4 ads appearing at the top of search results can be filtered out by browser extensions.

Bing isn't just Google, because it uses a different index and a different ranking algorithm. The results are vastly different.

If a search engine is just a shell over Bing, then there isn't much point in using that over Bing.

Note DDG provides some niceties over Bing, plus the pledge that they won't share your searches with them, but you basically have to take their word for it.

Well, you get to pay for the privilege. That seems to be the difference.

Unless they are maintaining their whole global index, it IS a UI over Bing. Don't be fooled by their 400+ sources; that stuff just affects things like the answer box.

Comparing the results, in a search for "Neeva", the only results they have in common on the first page are exact name matches and one article from the NY Times:


neeva.co, neeva.co/blog, moneycontrol.com, indiatimes.com, nytimes.com, neeva.tech, neevagroup.com, babycenter.com



neeva.co, nytimes.com, androidauthority.com, medium.com, oflox.com, gomoguides.com, neeva.tech, neevagroup.com


(I don't use Bing, and DDG isn't supposed to track me, so neither should be personalized results.)
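To put a number on that overlap, here is a minimal sketch that compares the two first-page lists quoted above. The lists are not labeled by engine here, so `list_a` and `list_b` are placeholder names, and Jaccard similarity is just one convenient measure, not something either engine reports.

```python
# First-page results copied from the two lists above, in order.
list_a = ["neeva.co", "neeva.co/blog", "moneycontrol.com", "indiatimes.com",
          "nytimes.com", "neeva.tech", "neevagroup.com", "babycenter.com"]
list_b = ["neeva.co", "nytimes.com", "androidauthority.com", "medium.com",
          "oflox.com", "gomoguides.com", "neeva.tech", "neevagroup.com"]


def overlap(a, b):
    """Return the shared entries (in a's order) and the Jaccard similarity."""
    set_b = set(b)
    shared = [x for x in a if x in set_b]
    union = set(a) | set_b
    return shared, len(shared) / len(union)


shared, jaccard = overlap(list_a, list_b)
print(shared)             # entries present on both first pages
print(round(jaccard, 2))  # share of all distinct results that appear in both
```

Four of the eight results on each page are shared, which matches the "exact name matches plus one NY Times article" description.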

Not sure why this is getting downvoted. Even if DDG is built mostly on top of Bing, the discrepancy would be interesting to explain.

In my opinion, it is due to location. Bing returns 8 results on the first page, and DDG returns 10 results. At the top of the Bing page, I can see that Bing already takes my location into account (and it is precise, it is my city), so there is no box to tick. Once I tick the "location" box on DDG (it is only the country), I get 7 identical results out of 8 on the first page! The order is the same for results 1-4, the 5th result is different, the 6th result is identical, and results 7-8 are identical but their positions are swapped.

That's interesting. By location, do you mean the box that says "All Regions"? I get the same results on DDG (different from Bing) whether I set that to "All Regions" or "US (English)".

For me, it is a setting to tick, which mentions the country:

https://i.imgur.com/fML3x0l.png (in French)

Once it is set to my country (France), then I get almost identical results.

I don't know how to deactivate this feature on Bing, so I could not compare Bing and DDG in the case where the "location" setting would be ticked off on both sites.

Different index algos and filter bubble. People don’t get the same search engine results.

I use Bing and Google normally. I never use DDG. My results are different than yours. Including things that you don’t have.

I mean if they took results from Bing and Yandex and applied their own ranking algorithm to form the composite are they still a UI over Bing?

It mostly is, though.

> We also of course have more traditional links in the search results, which we also source from multiple partners, though most commonly from Bing (and none from Google).

Sounds like a UI over Bing to me.

It is mostly.

If every other "search engine" is using Bing, is it profitable?

How to build a good search engine:

1. Actually return the results matching the words that people typed in your search box. The more they match, the more they go up.

More and more, what you get back from Google (and others as well) has become extremely disappointing. Verbatim search (where you surround exact terms in quotes) seems to have vanished in the last year or so. More often than not I have clicked on a result, only to find out that my particular query is nowhere to be found on the site, but some "related stuff" is.

>1. Actually return the results matching the words that people typed in your search box. The more they match, the more they go up.

As virtually all early search engine developers have found out the hard way, it's extremely easy for website owners to cheat such a basic algorithm by spamming keywords.

But the answer to that shouldn't be to return results that don't even contain the words searched for.

> the answer to that shouldn't be to return results that don't even contain the words searched for.

This is everything that's wrong with search in a single sentence.

First, verbatim mode should actually work when you're sure this is what you want. But as a default I think this would produce bad results.

"What's the name of the plant that that eats flies?"

This should, and does, produce results for Venus Fly Traps but I doubt any useful page contains all of those words.

But why should that return results for Venus Fly Traps? You are searching verbatim for the sentence "What's the name of the plant that that eats flies?". I'd say it's a bug to return any result that doesn't have that specific sentence.

If you were searching that sentence without the quotations I could get behind it returning Venus Fly Traps results, but if you are using the double quotes you are trying to search for that EXACT string.

Exactly. Results should first satisfy at least the "contains all words in query" property. Then filter and rank that set of pages.
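The "filter first, then rank" idea can be sketched roughly like this. It is a toy illustration, not how any real engine works: the `pages` dict, the tokenizer, and ranking by raw term counts are all assumptions, and as others in this thread point out, raw counts are trivially gamed by keyword stuffing.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, pages):
    """pages maps url -> page text.

    Step 1: keep only pages that contain every query word.
    Step 2: rank the survivors by how often the query words occur.
    """
    terms = set(tokenize(query))
    results = []
    for url, text in pages.items():
        counts = Counter(tokenize(text))
        if all(counts[t] > 0 for t in terms):
            score = sum(counts[t] for t in terms)
            results.append((score, url))
    # Highest score first; ties broken alphabetically for stable output.
    results.sort(key=lambda r: (-r[0], r[1]))
    return [url for _, url in results]


# Hypothetical pages, for illustration only.
pages = {
    "a.example": "nftables replaces iptables; nftables rules explained",
    "b.example": "iptables tutorial",
    "c.example": "nftables rules",
}
print(search("nftables rules", pages))
```

Note that the iptables-only page is dropped in step 1 rather than merely ranked lower, which is the behaviour being asked for here.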

How do you handle a query like

"What's the movie where the guy turns into a pidgin?"

This should return results for Spies in Disguise, and indeed, every search engine I tried does. I doubt any relevant page actually contains all the words in my query.

Typically, turning on verbatim helps with that.

One of the reasons I finally gave up Google a couple of years ago was that verbatim hasn't worked reliably since maybe 2015 or something.

Of course, around that time Duckduckgo started going downhill as well in this regard.

Where does one find that in DDG and Google?

Not sure about DDG. In google, it's under tools. Click the "all results" dropdown.

Just recently I was googling nftables (a replacement for iptables in Linux). Google decided iptables was a synonym and all the results were for iptables (as it's older and more common).

Try “nftables” in quotes. It’ll help prevent query expansion.

FYI there’s a search mode called verbatim, and it’s not the same as putting the individual words in quotes. It’s hidden under the search tools menu. In my experience the results it provides always contain the search terms.

Why on earth is that not the default?

Because it's wrong for basically everyone. Spelling errors, stemming, etc. Most people want pages that match what they mean not what they typed and for natural language (vs programming topics) that's right.

I share your frustration, but the obvious rejoinder is that most people don't agree with you. You can certainly target a relatively niche group of power-searchers, but most Google users probably think that the search engine guessing their intent is part of what makes it "good".

Ha. My parents actually do google.com just to type in something like facebook.com which searches and they find facebook.

I doubt getting them to switch would be very easy unless you did some deal with browsers. But they, like most others, use Chrome, and I doubt Google would do that.

It's not just your parents.

I was sitting down with a very experienced C++ engineer a few months ago to work on a problem. There was something we needed to do a web search on.

He opened Chrome, clicked in the address bar (which already had the keyboard focus, but never mind that), and typed "google". This did a Google search for Google, and the first search result was www.google.com. Then he clicked www.google.com which took him to the Google home page, and there he typed in the search terms.

Yes, he Googled Google to do a Google search.

I'd say it's muscle memory. Back 20 years or so before the address bar became a search bar it was common for browsers to just successively add '.com', '.org', '.net' to an invalid address and take you to the first one that came up.

It's a coincidence that modern behaviour is the same for certain domain names.

If he's an experienced C++ engineer he's probably of that vintage, and has probably been in demand enough in the meantime that he hasn't had to address this "niggling" behaviour.

Perhaps it was muscle memory? I could see that happening to me if I had recently installed the browser and it won't auto-complete the searches based on past history. Also the fact that my IQ drops by 50 points when someone is watching my screen and I look like an idiot.

> Also the fact that my IQ drops by 50 points when someone is watching my screen and I look like an idiot.

I've heard this a lot, and have basically trained myself by rote not to hover over people's computers when we're looking at something together, but I still can't say I understand it. I don't know if it's an inarticulable phenomenon, but do you have some sense of what drives this and/or what it feels like?

Surely we're all seeing this on Zoom calls these days?

People sharing their desktop/app/window, then breaking out of that sharing selection to bring something else up, while they're talking.. and searching, and things aren't working exactly as expected, so they go into rabbit-hole mode, etc...

Everyone's got a million things going on in their brains, and an audience changes things, and "presenting to an audience" is different than sharing with an audience.

I switch between my Mac and a Windows machine, between Safari and Chrome, between ctrl- and Command-, etc etc. Half the time things I'm connected to are broken (VPNs, endpoints, services...) so I find myself half-stabbing my way through little time windows during the day. And if I'm talking to someone, while trying to do something I've done 1000x before, I'm probably covering my bases by stabbing at the keyboard even more.

I do presentations a lot too, it's a totally different mode.

I am not sure what causes it. Maybe it's the anxiety? I've dealt with it most of my life. Even basic things like going outside, talking to a clerk at a store, talking on a phone would make me a nervous wreck, where I could barely function.

Perhaps it makes me really uncomfortable when what I am doing is the center of attention. Like the nervousness normal people feel while performing something on a stage in front of hundreds of people, but on a micro scale, even doing something in front of a single person elicits the same response from me.

Perhaps Google should serve its home page, instead of returning a SERP for "google", if the search URL indicates a search from a browser address bar. If a user really wants to google for "Google", they can then search on the google.com home page.

Most of the time when I Google the term Google, I want search results like Wikipedia, or pages discussing Google's business or history.

>he Googled Google to do a Google search.

The HN circle is complete

"Google’s Top Search Result? Surprise, It’s Google" https://news.ycombinator.com/item?id=23975001 https://themarkup.org/google-the-giant/2020/07/28/google-sea...

Well it's possible to be so focused on something that you know nothing about related subjects. Possible though not likely.

Try doing the following queries from your address bar:






I trained myself to use ctrl+k for searches vs ctrl+l for the url bar, from back in the day when I used Firefox and they had separate search/URL bars. Chrome respects these shortcuts, but it distinguishes searches from urls by pre-pending a question mark.

That’s… That’s concerning

Serious question: if he had done that during his job interview, all other things being equal, would that have been a deal breaker?

I don't hire or interview people (and also don't google google to open google), but just want to comment that we are all humans and not robots. We all possess certain quirks and behaviors that are suboptimal. Seriously considering such an episode in hiring is like measuring during an interview how far from the paper he put down a pen, because putting it too far is suboptimal and he wasted several microseconds reaching for it to pick it up. It is a ridiculous metric, and HR people need to be self-aware if they are using such metrics in real life, as opposed to real metrics like the ability to solve work problems.

Of course we all agree that technical interviews are often broken and interviewers use ridiculous metrics.

The distance from their pen to their paper is an example of a ridiculous metric. And expecting humans to be robots is a straw man to what I'm asking.

The candidate's ability (or effort) to understand and use the tools they use 100 times a day to do their work is not an irrelevant metric.

Doctors need to have a basic understanding of how stethoscopes work. If they start by listening to your elbow, and only then proceed to listen to your chest and back, something is amiss.

It's not a matter of suboptimal behavior or poor efficiency.

I'd agree to some degree if you could be confident that this was his actual routine, instead of a brainfart caused by nerves or distraction or any of a thousand other things that makes an interview environment different from a day-to-day work environment. I remember my dad getting annoyed at me when I was in high school because he asked me what time my soccer game was and I looked at my watch: he thought I was about to make a smart-ass joke[1] in the middle of us trying to figure out a schedule.

That doesn't suggest that I don't know how calendars or timekeeping work: I was just distracted and glanced at my watch on autopilot.

[1] I did make a lot of smart-ass jokes...

It is debatable whether a web browser is a work tool, in the context of specialized tools. I've been using DDG for 3 years already but I still don't use bangs, even for often-used websites like Wikipedia; instead I always do a normal search and then click on a Wikipedia result. It is obviously suboptimal and I'm doing it in the web browser. Am I unfit for my job because of this? The question to the hypothetical person who will select candidates based on googling google: did you yourself optimize ALL that can be optimized in your workflow? (This is a rhetorical question, the answer to which is 100% "no", regardless of who that person is.)

I would attribute this kind of thing more to acting on autopilot than a lack of understanding.

Yahoo Search used to (not any more) return a second search box as the first result if you searched for Google, because a lot of people with their homepage set to Yahoo would enter google in the Yahoo search box to get a link to Google to go there to search.

I don't remember the stats, but when I worked for Yahoo (2003-2005) they got a fairly substantial number of daily users to stay on Yahoo with that trick.

> most people don't agree with you. You can certainly target a relatively niche group of power-searchers, but most Google users probably think that the search engine guessing their intent is part of what makes it "good".

Translate: Feedback from catering to the least common denominator boosts techie self-esteem.

PSA: Similar reasoning is responsible for political ads.

I don't rely on Google anymore when I'm trying to find good information. I actually have much better results searching Reddit and HN for specific information.

I was not aware that HN had a search feature. Link?

It's at the bottom of the page dude.

After using HN for 6 years I just realized that it is there. Instead I had a separate bookmark to HN search for all these years :) . Apparently I just ignore footer completely as useless part of website.

There's room for slight improvements to the hn layout I guess ;) While we're at it, we could maybe place the logout link a bit further away from the profile link to avoid frequent inadvertent logoffs on mobile. Also, it's probably just me, but I haven't yet figured out how to post an Ask HN/Show HN.

> Also, it's probably just me, but I haven't yet figured out how to post an Ask HN/Show HN.

Isn't this just manually entering the prefix?

I don't know. Is it?

I've used search several times, but each time I want to use it again I still waste time looking for it in the header instead of scrolling to the footer.

Wow. I cannot believe I’ve been reading HN for so long and never noticed that. Thank you.

The search box is at the bottom of this very page.

See the footer.

Type "[search term] news.ycombinator.com" or "[search term] reddit.com" into Google. If that doesn't work, you can reverse it "news.ycombinator.com [search term]" if you really want to limit to just hackernews. The former seems to work better for me most of the time.

No need to downvote. I'm familiar with the keywords. My point was that omitting them sometimes produces better results. I suspect that putting hackernews in the query tricks it into realizing I'm looking for technical topics.

Your search engine is a spammer's dream. There's a million other things to consider. No thanks.

Pagerank solved that, right?

For a while, yes. That's part of what made it so novel. The fact that results are still full of SEO spam isn't really an indication that the right direction is to throw up your hands and make it easier to manipulate results.

> Actually return the results matching the words that people typed in your search box

That wouldn't take care of SEO spam, which is very good at stuffing its pages with whatever words people search for.

Also, as much as I hate to admit it, Google is pretty good at guessing wrong spelling or synonyms and getting good results in a majority of cases.

How hard would it be to assign websites an actual reputation again?

Include ads on the destination website as part of the reputation score. If a website is loaded with ads, sink it to the bottom. Google doesn't do this because they likely make money from the ads.

Heuristically determine spammy content. It's pretty easy to tell at first glance which content is bullshit, so it's probably not hard to create an ML model to do the same classification.

Manually assign positive weights to websites used by engineers and domain experts. You could even curate this list in the open and solicit help in maintaining it.
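A rough sketch of how those three signals could be combined into one reputation score. Everything here is made up for illustration: the field names, the weights, the curated list, and the idea that an ad count and a spam probability are already available from some upstream detector or classifier.

```python
from dataclasses import dataclass


@dataclass
class Site:
    domain: str
    ad_count: int            # ads detected on the page (assumed input)
    spam_probability: float  # 0..1, from some hypothetical classifier


# Hypothetical openly curated list of domains trusted by domain experts.
CURATED_BOOST = {"docs.python.org": 2.0}


def reputation(site, ad_penalty=0.1, spam_weight=1.0):
    """Start from a neutral score, subtract for ads and spam, add curated boost."""
    score = 1.0
    score -= ad_penalty * site.ad_count
    score -= spam_weight * site.spam_probability
    score += CURATED_BOOST.get(site.domain, 0.0)
    return score


sites = [
    Site("ad-farm.example", ad_count=12, spam_probability=0.9),
    Site("docs.python.org", ad_count=0, spam_probability=0.05),
    Site("small-blog.example", ad_count=1, spam_probability=0.1),
]
for site in sorted(sites, key=reputation, reverse=True):
    print(site.domain, round(reputation(site), 2))
```

The hard parts (detecting ads reliably, training the classifier, keeping the curated list honest) are exactly the bits this sketch assumes away.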

it seems that the web has for at least a decade already been at the point where a search engine that is built to be useful by humans should (ideally, but apparently impossible with the current skewed incentives due to ad business considerations, overheated stock market, over-enthusiastic ML expert workforce, etc.) index only sites that opt _in_, vetted by humans.

there is a clear upper bound on the amount of total legitimate web content, and that upper bound is not prohibitively high -- linear on the total number of coherent-content-producing humans with only so many hours in a day and only so many years of adequate brain functioning (and not on the amount of whatever computing resources that are thrown in to support whatever algorithmic content fire hoses that the currently-dominant search engines contend with).

The main problem: SEO.

I want to share a recent search I did through Google -- at work I wanted to look up how to implement conditionals in Microsoft Forms (i.e. the next question in the form is based on the previous answer). I searched for "microsoft forms conditional" (without quotes) and the first search result on Google was this page on how to "use branching in Microsoft Forms" - https://support.microsoft.com/en-us/office/use-branching-in-...

That page doesn't contain the word "conditional" at all - the word that Microsoft uses is "branching" but Google deduced that it was the best result, which was perfect. The same search in either Bing or DDG produces results that all have the words "microsoft", "forms", and "conditional" in them and none of them link to the page I mentioned above, which I consider to be the best result.

Moral of the story? Search is hard, but Google does it better than everyone else.

Also, I learned via this thread that DDG is just a wrapper around Bing, which explains why the search results between DDG and Bing were near identical - they even have the matching video suggestions.

Just a comment:

Sounds like the results of a neural network: roughly approximating your intent and searching around that intent, in continuous space, to find other viable search terms and phrases. (This is one possible approach, given in broad strokes.)

That's a massive barrier to entry. You need enough data and compute to train a massive language model, more compute to run the model against all incoming queries, and then even more compute to handle the extra search load precipitated by use of the language model.

Not to mention the years of R&D that go into these models and their associated tooling.

> That's a massive barrier to entry. You need enough data and compute to train a massive language model, more compute to run the model against all incoming queries, and then even more compute to handle the extra search load precipitated by use of the language model.

Luckily, most of the time you could improve my user experience by removing that cr*p and give me my 2007-2009 Google back.

From there you would only need to allow users to make personal blacklists, share personal blacklists (this was about the time when auto-generated content started to become popular) and maybe also aggregate some popular blacklists for a default blacklist and it would be better than anything we have seen since.

(I remember having a txt-file with -spammydomain.com -anotherspammer.com etc etc that I pasted in at the end of certain searches to take care of sites that either had

- auto-generated content

- or stuffed their pages with black/black or white on white keywords )
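The text-file trick above, plus the "share and aggregate blacklists" idea, could look something like this minimal sketch. The function names are invented for illustration, and it only relies on the `-domain` exclusion syntax already shown in the comment.

```python
def merge_blocklists(*lists):
    """Combine several personal blocklists, dropping duplicates, keeping order."""
    seen = {}
    for blocklist in lists:
        for domain in blocklist:
            seen.setdefault(domain.lower(), None)
    return list(seen)


def with_exclusions(query, blocked_domains):
    """Append a -domain term for each blocked domain to the query string."""
    suffix = " ".join(f"-{d}" for d in blocked_domains)
    return f"{query} {suffix}".strip()


# Placeholder domains taken from the comment above.
mine = ["spammydomain.com", "anotherspammer.com"]
shared_by_a_friend = ["AnotherSpammer.com", "contentfarm.example"]

blocklist = merge_blocklists(mine, shared_by_a_friend)
print(with_exclusions("angular mat-table", blocklist))
```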

Giving you 2007 Google might not work because people are using 2020 strategies to game it. But I'm definitely skeptical that that's all there is to it.

I hear this a lot.

But in all honesty it is not the SEO scammers' fault that Google serves me pages that don't contain the words I searched for after I have chosen the verbatim option.

It also isn't SEO scammers' fault that when I search for Angular mat-table[0] I get a number of pictures of tables with mats on. That is probably the result of someone playing with some cool AI tools while others are busy trying to make more efficient ways to ignore customer feedback ;-)

We must manage to keep those two thoughts in our head simultaneously:

- Black hat SEO has changed

- Google has adapted to another audience and has ditched us power users hoping we wouldn't notice.

[0]: screenshots of that and some other clear examples of Google and Amazon testing out AI in production here: https://erik.itland.no/tag:aifails

Do you know many black hat techniques? Around 2011-2013 was when Google shifted from being extremely easy to game to very difficult. 2014 was really the end of it. Have a look at some niche site blogs from the time - revenue from new niche sites tanked from like $1.5k/month each to barely close to $100 (with a lot more work up front).

Anyway my point is if you rewound the clock to 2008, you'd have a way bigger problem than you might think.

Fine. But we must still be able to separate between backend and frontend: it should be possible to upgrade the anti-spam machinery without breaking

- doublequotes

- + (ok, they broke that deliberately around Google+)

- the verbatim operator

All those should be able to work even if the crawler and processing techniques are updated, right?

Also a heads up: I added some more details to my post above. I didn't think you would answer so fast :-)

Edit: I only know the black hat methods that were well known 10 years ago, like:

- backlink farming from comment fields (we protected against it by applying nofollow to all links in comments)

- Google bombing (coordinated efforts to link to certain pages with particular words in the link, trying to get Google to return a specific result for an unrelated query. I think the canonical example was something like a bunch of people making links with the text "a massive failure" that all pointed to the White House website.)

- Link-for-link schemes

- etc

No, it's been deliberately made worse over the last couple of years. They have put way too much confidence in their semantics AI and it seems way overfit.

I used to be able to learn how to effectively search but lately it's just so terrible. It's made for 99% of people's dumb searches, but try to get specific and it fails hard.

>Moral of the story? Search is hard, but Google does it better than everyone else.

True, but they do it worse and worse.

Google still has some margin before their usability declines to the level of their competitors, but they're headed there quite fast.

I was moaning about this elsewhere on this site, how searching for exact phrases in quotes seems to now be a dead thing. Google and Amazon both just pick a couple of the words and show you that. Dammit, if I put multiple words in quotes, search for that phrase. If you don't find much, then sure, show the other thing.

I never understand why people seem to like the DDG results so much. I often have to go back to Google.

So, finally a paid search engine that does not have to rely on ads for revenue.

So far it's based on Bing, which does. This makes it a bit of a hard sell, compared to an intelligent ad blocker.

The most important problem of search engines is SEO spam. Google themselves sort of have a moral hazard to not be too stringent on SEO spam, because it shows ads by Google, increasing Google's revenue.

OTOH I wonder if the subscription revenue is going to be sufficient to have access to a reasonably good search index and enough processing power to efficiently combat SEO spam while returning relevant results. This takes your own data centers run frugally, because fees of something like AWS or Azure will just be exorbitant for a global search engine and a global search index.

I wonder if companies aiming to provide alternative search engines will cooperate on maintaining a common index, to distribute the massive costs of doing that. They could even publicly sell access to it, at a point where running a competing search engine won't be practical; e.g. researchers would buy it.

I am actually wondering if search engines really need such a large index. The vast majority of sites in Google's index are crap. If you were able to better select which sites to index it might help search quality and improve efficiency.

This is just a hunch but I imagine the internet also follows the Pareto principle and at least 80% of everything in Google's index is basically worthless.

We may end up with search engines that use multiple indices where each is curated to a certain domain of information, rather than a one-size-fits-all index.

I'm working on a search engine like this. So far just government websites but I think you're right about the need to focus on selected sites for some kinds of searches. For example, if you're looking for securities laws, you probably want to search government law sites and securities regulator sites. Maybe you want law blogs and law journal articles too, but you probably don't want Reddit or Hacker News.

How can you tell what is crap while crawling, is there enough time? How can you tell what is crap for the next person, and what is not crap?

It's not an easy question.

People! Let the users help. All the best content sources on the internet are either experts or communities with voting / submissions.

> People! Let the users help. All the best content sources on the internet are either experts or communities with voting / submissions.

There are over 1.7 billion websites [0], so the task of ranking content, the way algorithmic search engines do it in a matter of milliseconds, is not as easy as it sounds when you add humans into the mix. It would only end up the way Mahalo did [1].

[0] https://www.statista.com/chart/19058/how-many-websites-are-t...

[1] https://en.wikipedia.org/wiki/Mahalo.com

False dichotomy, you can use algorithms in tandem. And cherry picking... I remember Mahalo well, and one example of one version of it doesn’t prove anything. Mahalo is far from how I’d structure it.

You can still automatically index but have users vote on the results. There are 1.7 billion websites and 3 billion+ users, and you don’t need that many to be active voters to help assist algorithms. Plus how many are at the top anyway? I’d love to downvote a ton of google results even if it only used it as a trainer for my own.

Plus, there are so many “super curation” sites like here and Reddit that provide a big dataset curated by people automatically. Lean on them more. Everyone knows “site: reddit.com” or “site:stackoverflow.com” already give you better results.

A simple upvote downvote on their results would let me downvote all the spam SEO sites. It wouldn’t take many votes for them to start tuning it.
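As a rough sketch of "algorithm plus votes", the snippet below nudges an algorithmic score up or down by the net vote share. The formula, the `weight` and `prior` parameters, and the sample data are all assumptions for illustration; a real system would also need to defend against vote manipulation.

```python
def rerank(results, votes, weight=0.5, prior=5):
    """Re-order results using user votes.

    results: list of (url, base_score) from the algorithmic ranker.
    votes:   dict of url -> (upvotes, downvotes).
    prior:   damping term so a handful of votes cannot swing the ranking much.
    """
    def adjusted(item):
        url, base = item
        ups, downs = votes.get(url, (0, 0))
        net = (ups - downs) / (ups + downs + prior)
        return base + weight * net

    return sorted(results, key=adjusted, reverse=True)


# Hypothetical results and vote counts.
results = [("seo-spam.example", 0.9), ("forum.example", 0.8), ("new.example", 0.7)]
votes = {"seo-spam.example": (1, 30), "forum.example": (20, 2)}

for url, base in rerank(results, votes):
    print(url, base)
```

With these numbers the heavily downvoted page drops from first to last, while the unvoted page keeps its algorithmic score.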

Stats are a good way to blind yourself. That algorithms scale doesn’t mean people don’t improve them. Google’s problem is they are too cocky about algorithms, but their algorithms fail compared to the curated communities all over the web already.

the other day I was searching "<sitename> feature" and from the results (which to be fair aren't many, since it was a fairly obscure site), once you get through the legit results on the site i was searching for, there were a bunch of markov generated spam things that had clearly scraped posts from various sites and tangled them together

I think Google bases a lot of its ranking on bounce rate, i.e. "how often people who searched this were happy with this result." I don't think you need that many hits to establish bounce rate for quite specific searches. Like, is 40 enough? I can imagine so.
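One way to sanity-check the "is 40 enough?" question is to look at how wide the uncertainty on a bounce rate is at that sample size. The sketch below uses the standard Wilson score interval; the bounce counts are invented, and whether Google actually ranks on bounce rate is the guess made above, not an established fact.

```python
import math


def wilson_interval(bounces, visits, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if visits <= 0:
        raise ValueError("visits must be positive")
    p = bounces / visits
    denom = 1 + z * z / visits
    center = (p + z * z / (2 * visits)) / denom
    half = z * math.sqrt(p * (1 - p) / visits + z * z / (4 * visits * visits)) / denom
    return center - half, center + half


# Hypothetical: 10 of 40 searchers bounced back from a result.
low, high = wilson_interval(10, 40)
print(f"bounce rate between {low:.0%} and {high:.0%}")
```

For 10 bounces out of 40 the interval runs from roughly 14% to 40%, so 40 hits is enough to separate a clearly good result from a clearly bad one, but not to tell apart two results with similar bounce rates.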

But if you filter for paid members, you've filtered for much smarter people (on average) picking the best results - and you've also filtered out a lot of the SEO people who are going to be trying to manipulate the bounce rate with proxies.

Using Bing's API isn't really going to help with some of the problems.

One thing that would be really helpful, stop counting the words appearing on links on the sidebar or other non-content part of the page as important for my search. It's amazing how many searches go astray because someone has some words on a sidebar that don't have anything to do with the content of the page. You would think with all this ML someone would teach a search algorithm to ignore it.
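Even without ML, a crude version of "ignore the sidebar" is possible when a page uses semantic HTML. The sketch below uses Python's standard `html.parser` and simply drops text inside `nav`, `aside`, `header` and `footer` elements before indexing. That a page marks up its sidebar with these tags is an assumption; plenty of sites use bare `div`s, which this would not catch.

```python
from html.parser import HTMLParser

# Elements whose text is treated as non-content (an assumption, see above).
SKIP_TAGS = {"nav", "aside", "footer", "header", "script", "style"}


class MainTextExtractor(HTMLParser):
    """Collect text that is not nested inside any of the SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def main_text(html):
    """Return only the text a search index should count as page content."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = (
    "<html><body>"
    "<aside><a href='/other'>iptables guide</a></aside>"
    "<main><p>How nftables works</p></main>"
    "<footer>Related: printer drivers</footer>"
    "</body></html>"
)
print(main_text(page))
```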

Was filled with hope at the start of the article and it faded away pretty quickly while reaching the part about Bing. Thereafter hope suddenly fell from a cliff...

>Neeva's most unusual feature is its ability to also search users' personal files. In a demo, Ramaswamy searched for tax documents and photos, all surfaced within his search results or available in a Personal tab in the Neeva interface.

Privacy focused my a$$!

That, and forum post signatures.

Search results for hardware problems are total garbage, landing you in forum topics about totally different hardware, just because some show-off poster listed every single piece of electronic equipment he's ever owned in his forum signature.

> it felt like a product being made worse by its business model.

Something even worse than this has forced me to DDG out of necessity: fucking CAPTCHA.

Which is essentially discrimination against anyone with a mobile internet connection who blocks google's tracking.

I just can't use google search anymore, it's a special kind of torture, after years of getting used to using google for everything and now getting this thrown in my face every single fucking time - way to permanently train users away from your search engine. It's time we had more competitors, google search is a monopoly and it's only going to abuse its users more.

The more I think about it, the more I think the search engine problem has no (currently known) solution. The options are:

1) Word-based crawling/indexing: quickly abused by spammers.

2) AI/ML-based: I think this is the current model (?), but after a while the machine got "clever" and it makes Google think it knows better than me what I am looking for, and returns results for "most people's" tastes. The problem is when you are not "most people", and/or you are looking for some niche topic/work-related/tech stuff/etc. Simply trying to discover new things like an interesting blog or a small shop is impossible.

3) Paid-based: as in the article, and might be a good idea. But I think it has to run on a custom indexer. Why would I pay for Bing results?

4) Aggregators: a search engine that returns results from a bunch of other search engines, like DDG and others.

5) A mix of the above?

So unless new ideas come to the rescue, I think it's always going to get worse.
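For what it's worth, option 4 is easy to sketch. One common way to merge ranked lists from several engines is reciprocal rank fusion: each URL scores 1/(k + rank) in every list it appears in, and the scores are summed. This is a swapped-in technique for illustration, not something DDG or anyone else is known to use; the two engine result lists are hard-coded and hypothetical, and k=60 is just a commonly used constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked URL lists into one, best first.

    Each URL earns 1 / (k + rank) per list it appears in (rank starts at 1).
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Highest score first; break ties alphabetically so output is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical results from two upstream engines for the same query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]

merged = reciprocal_rank_fusion([engine_a, engine_b])
```

URLs that several engines agree on float to the top, but the aggregator inherits whatever spam its upstream engines let through, so it doesn't escape problems 1 and 2.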

> 1) Word-based crawling/indexing: quickly abused by spammers.

I guess that ends quickly once spam means you get blacklisted no matter how many Google ads you serve ;-)

A large piece of this article hints at how they have some interesting options that Google doesn't have.

Maybe the pendulum could swing back the other way slightly and we could have room for an older Yahoo-style indexed search engine. This wouldn't be something that would be browsed instead of searched, but if you had a general list of indexed categories, say 'finance' or 'graphic design', users could enter those sections and search sites categorised accordingly. Perhaps some ranking could be done based on search terms used within a category and what site a user ends up visiting (e.g. many users end up visiting a certain stackexchange link when searching for 'parsing json in python' within the 'computer programming' category, and so its rank increases). Heck, instead of ads, maybe each category page could have a 'popular / trending sites on this topic' section at the top that pages could pay to be placed in.
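A rough sketch of the ranking part, assuming clicks are simply counted per (category, query, URL) and used to reorder candidate results. The class name, URLs, and counts are all made up, and a real system would need decay over time and protection against click fraud.

```python
from collections import defaultdict


class CategoryClickRanker:
    """Ranks URLs by how often users picked them for a query within a category."""

    def __init__(self):
        # clicks[category][query][url] -> number of times users picked that URL
        self.clicks = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def record_click(self, category, query, url):
        self.clicks[category][query.lower()][url] += 1

    def rank(self, category, query, candidates):
        counts = self.clicks[category][query.lower()]
        # sorted() is stable, so URLs with equal counts keep their original order.
        return sorted(candidates, key=lambda url: -counts.get(url, 0))


ranker = CategoryClickRanker()
for _ in range(3):
    ranker.record_click("computer programming", "parsing json in python",
                        "stackexchange.example/q/1")
ranker.record_click("computer programming", "parsing json in python",
                    "blog.example/json")

candidates = ["blog.example/json", "spam.example", "stackexchange.example/q/1"]
in_category = ranker.rank("computer programming", "parsing json in python", candidates)
other_category = ranker.rank("finance", "parsing json in python", candidates)
```

Clicks recorded in 'computer programming' reorder results only inside that category; the same query under 'finance' comes back in its original order.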

Not sure if this solves any user problem to be honest, and the idea is only appealing to me because I think some amount of domain expertise and human curation could go into categorising pages. While this sounds (and no doubt would be) labour intensive, if we consider the number of domains that users actually visit when conducting a search (i.e. ignoring anything past page 1 of google) then perhaps it's not so extreme.

Back in the good old days of Dmoz, this was the way, but then it became extremely limiting, and I remember wanting to rank in the directory for something and I "knew" one of the people in charge of a section. There were even some accusations of "pay for play" too. Eventually Dmoz died when Google came on the scene. But I get what you mean.

I hope this works out. I'm a big believer that companies eventually take the shape of their business models, and if you're free but serve ads, you're an ads company, not a search company. So it'll be interesting to see how a company without the ads tension ends up evolving.

That said, search is a really hard problem to solve (even if you can take shortcuts like using Bing's API).

> putting search results back at the top of search results

Watching Google slowly fill more and more of the search results with ads, this is an obvious and very welcome idea.

Why watch? uBlock Origin [1], the most recommended browser extension in the history of the internet, blocks the ads on Google. You can also right click and block any element you don't want in your search results, like "Top Stories" and "Videos". You're still getting the search results; if there's content in your browser you don't want to see, you have more control over that than Google does.

I haven't seen an advertisement in 10+ years. I don't really understand why anyone chooses to see them when they don't have to.

And sorry if this comes off as confrontational, I just see so many people talking about advertisements and it's difficult to have to tell each individual that adblocking extensions have existed for close to 15 years [2]. I wish there was some better way to spread this information so no one would have to see ads or comment about their existence ever again. The internet is so much better without them.

[1] https://en.wikipedia.org/wiki/UBlock_Origin

[2] https://en.wikipedia.org/wiki/Adblock_Plus

uBlock Origin is great but doesn't work on Safari, which a lot of people use since Chrome is an incredible battery hog and Firefox is significantly slower on Mac.

None of the choices of content blockers in Safari successfully block the majority of ads on the internet.

Furthermore, it's not just about the surfaced ads. Even if you use uBlock Origin, search engines like Google optimize for ad clicking, which will affect the search result ranking even if you have ads blocked. As a result, search quality has been steadily decreasing over the past decade (there have been hundreds of highly ranked HN discussions on this in the past).

Finally, uBlock Origin is an amazing tool developed by 1 person. There is always the chance that, in the future, there are developments in browsers or ad-serving technologies that render it obsolete (e.g. if Google decides to make a breaking change to the Chrome Extension API, like Safari did). In that case, it would be worthwhile to have alternatives.

This is changing in the next release of Safari. They will support the standard WebExtensions API so Firefox/Chrome extensions will be easily portable.


According to uBlock Origin's developer, it is not enough: https://www.reddit.com/r/uBlockOrigin/comments/hdz0bo/will_u...

1Blocker on Safari has been comparable to uBlock Origin for some time now (and really excels on iOS - how I found them). I say "has" because recently Google changed something with YouTube and it isn't 100% effective against YouTube preroll ads - but other than that it's been just as good as uBlock Origin.

I totally agree, and there are already multiple options on both desktop and mobile for different adblockers.

As to the search quality decrease, that's definitely more of a reason to desire an alternative than seeing ads is.

Yeah, I think the search quality problem is often brought up here on HN because we're mostly engineers here, but article writers seem to always talk about the visible ads problem since it's easier to explain to non-technical users.

We are building a Mac WebKit-based browser with web extensions support (including uBlock). If you want to try our alpha release, feel free to get in touch.

I see two fundamental issues with this approach.

The first is punitive. While it does, kind of, punish bad behavior, it doesn't go far enough. You may deprive Google of some small revenue, but you're still giving them a lot that contributes to their bottom line. They can still claim a googol of clicks and eyeballs, etc. People will still see them as the end-all-be-all, and voluntarily submit their sites to Google's attention, ignoring any other options.

The second is the failure to reward. It is not enough to kill off bad actors if you haven't nurtured good actors to take their place.

In this case, the good actor could even be Google. If there are efforts to do things differently, we should find the ones we like and reward them. They will benefit and their competition will observe and imitate.

> sorry if this comes off as confrontational

It didn't to me. I use ublock and take it a step further to use other search engines. Personally, I only notice how ad filled Google search is when I use someone else's computer.

Google can beat them by also offering a paid-for search engine. People trust Google with their mail and phones; they will also trust Google with privacy-respecting search.

On the other hand, how come Cliqz has been shut down instead of being sold?[1] Are there no companies with deep pockets who are interested in containing Google's revenue besides MS? E.g. since Cliqz was so privacy focussed, wouldn't that have been a great start for Apple to have a privacy respecting search engine?

[1] https://news.ycombinator.com/item?id=23909484

I think Apple Maps probably left a sour taste in their mouth about running such a service. Search engines are never done; it's a wide-open problem domain with diminishing returns and investment in many directions. I think Apple also realizes they're not a services company: they sell hardware and support an OS and app store; everything else is a value add to bring people into the fold.

Apple is absolutely building core business around services, and has been for several years.

Siri has search and its top hits are surprisingly helpful. Maps is slowly getting better, and has both native and web clients. Apple TV+ and iCloud Drive have paid tiers. Shortcuts and Messages look like apps, but they’re really UI around services, as is Siri.

Apple does have a search engine. Lots of the search suggestions that appear when you type in the Safari omnibar are Apple-sourced.

Not mentioned in the article or in the comments so far, but Neeva has raised $37.5m in funding[0]. I'm curious how that money will be spent, if they're not actually spending it on building a new search engine. Is it going to be mostly spent on buying the results in from Bing, and/or on advertising their new ad-free search, and/or something else?

[0] https://tech.economictimes.indiatimes.com/news/startups/7649...

Asking users to pay for something they've been getting free for 22 years is a poor business model. Look at news websites.

But people are still willing to pay for news when it's done right - see NYTimes as an example

This website (protocol.com) hounded me with a newsletter subscribe popup as soon as I landed on the page.

I hadn't even had a chance to read the first line of the article and the site's already asking me to sign up for their marketing trash.

Why in the world would anyone think that's acceptable? I have never seen the site before and have no idea of the quality of its content so what on earth makes them think they're going to get a subscriber out of me?

I agree, I usually insta-close a website when I get an unexpected pop-up. Pop-ups should only be used as a result of a direct user action (e.g. click on delete account, receive a confirmation pop-up).

Search engines are such a big thing they should be open sourced and distributed over the community. It's like the most basic infrastructure the internet needs to work and we are outsourcing it.

Please describe in detail how you will distribute crawl, index, and ranking as basic infrastructure.


Obviously, not at the same level as google and there are other parts. But I believe we can do this together if we try to. People were talking about building their own search engine on elixir forums a while ago and many seemed interested.

The same way you decentralize anything else.

You can do crawling by using an extension that allows you to create a new tab, crawl data on your current url and send it up to the mothership.

You can actually do even better because you don't get SEO-hacks like disabling certain javascript when Google is on the page to improve speed.

I was thinking exactly this when I stumbled upon your comment, except I figured it should work for any private tab and it'd also need a browser that makes tabs private (and contained) by default.

It's a problem more easily solved by VC-backed companies or government laws, because we're not seeing Google doing that in this lifetime, while FOSS solutions simply won't get the needed traction.

What happens when these self-hosted crawlers access illegal content in one's country?

The same thing that happens when a peer accesses an illegal torrent in his country? How is this relevant? It is a decentralized system, it shouldn't make a difference.

"Honestly, officer. I didn't click on that link to CA imagery. It was my webcrawler."

> do even better

You just traded SEO as we know it for a scheme in which any rando can just upload the supposed contents of any URL.

So, couldn't you keep a database? Many people would upload the same URL, and whichever copies are bogus would get a low score, like shadowbanning. Say, use a DHT with proof-of-work and things should work? Obviously I'm oversimplifying, but I see it as a solved problem by using a blockchain.

Also, ethically speaking, aren't we at the point of considering the idea of trusting random people smarter than trusting huge corporations whose only goals are to make more and more money?

How do you know which copies are bogus? It can't be just by saying that the one you have the most copies of is the right one. The problem is that most legit copies will be subtly different, while an attacker trying to forge page contents can make their copies identical. You can't do fuzzy matching when deciding what to store, since that would require all the nodes to agree on the fuzzy matching algorithm. That's going to mean hard-coding a complex algorithm that requires constant updates into your blockchain infra.
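A toy illustration of why "most copies wins" fails under exact matching. It assumes submissions are compared by SHA-256 of the page body; the page contents and peer counts are invented.

```python
import hashlib
from collections import Counter


def majority_copy(submissions):
    """Return the submitted page body whose exact hash was reported most often."""
    digests = [hashlib.sha256(body.encode("utf-8")).hexdigest() for body in submissions]
    winner, _ = Counter(digests).most_common(1)[0]
    return submissions[digests.index(winner)]


# Five honest peers fetch the same page, but each copy carries a different
# timestamp (a stand-in for ads, session tokens, A/B tests, and so on).
honest = [f"<p>Real article</p><!-- served at 12:00:0{i} -->" for i in range(5)]

# Two colluding peers submit byte-identical forged copies.
forged = ["<p>Buy cheap pills</p>"] * 2

chosen = majority_copy(honest + forged)
```

Five honest peers lose to two colluding ones, because every honest copy hashes differently and gets a "vote" of one.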

A proof of work does not seem viable either. You're asking for the submitters to pass it for no reward, so the difficulty factor can't be particularly high. But then it becomes useless at blocking somebody who is actually deriving a benefit from submitting (fake) results.
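To put numbers on that, here is a hashcash-style sketch, assuming the puzzle is "find a nonce so that SHA-256(payload + nonce) starts with `difficulty_bits` zero bits". The payload format and the difficulty of 12 are made up for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Number of leading zero bits in a bytes digest."""
    as_int = int.from_bytes(digest, "big")
    return len(digest) * 8 - as_int.bit_length()


def solve(payload, difficulty_bits):
    """Try nonces 0, 1, 2, ... until the hash has enough leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(payload + str(nonce).encode("ascii")).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce


def verify(payload, nonce, difficulty_bits):
    """Checking a solution costs a single hash."""
    digest = hashlib.sha256(payload + str(nonce).encode("ascii")).digest()
    return leading_zero_bits(digest) >= difficulty_bits


# Hypothetical submission: a URL plus a placeholder for the crawled body's hash.
payload = b"https://example.com/page|<hash of crawled body>"
nonce = solve(payload, difficulty_bits=12)  # about 2**12 = 4096 hashes on average
```

Each extra bit doubles the expected work for everyone. A difficulty low enough for unpaid volunteers to tolerate costs a spammer exactly the same per submission, and the spammer is the one with something to gain.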

The giant company will in this case build an index that's far superior. The crowd-sourced version will have huge amounts of duplication of popular pages, and massive underrepresentation of the long tail. And can you imagine how inefficient the distributed version will be both on storage and bandwidth. There can't be any facility for scheduling pages to be crawled at sensible intervals given the push model. The indexing nodes will just be flooded with pages they didn't actually want.

The crowd-sourced version will also not be "random people" like you suggested. A lot of them will have an agenda, and will be trying to manipulate the index to meet that agenda. And manipulate it in a way that's not useful to the people making searches. At least the company's goal of making money is furthered by building as useful an index as they can given the resource constraints.

Here's the way you do it...

The search engine page can be used for validation: just let people's use of the back button tell you whether the results were useful or not.

What was being proposed was a way of decentralising the crawling. I tried to demonstrate with some examples why that could not work: you'd end up with an extremely inefficient index. What you're proposing does not solve any of those problems. Sure, you'll get a weak signal about page quality, but far too late in the pipeline to inform the decentralized crawling and indexing.

But further, you are not really thinking through how one would abuse this kind of feature. If doing SEO, I wouldn't forge a page to have content that makes it be returned for irrelevant searches. Instead I would forge some high-quality pages to show up as having backlinks to my page, and boost its PageRank. Or, to demote the pages of people I dislike, I'd forge them to have content that makes them not show up in any searches. Your heuristic would not work there: if there are no clicks in the first place, there can't be any bounces.

Sounds like a lovely PhD dissertation!

That sounds a bit like yacy or searx:

https://yacy.net/ https://searx.me/
