How is search so bad? A case study (svilentodorov.xyz)
319 points by Tenoke on Jan 19, 2020 | hide | past | favorite | 396 comments



I have been thinking about the same problem for a few weeks now. The real problem with search engines is that so many websites have gamed SEO that there is no meritocracy left. Results are sorted not by relevance or quality but by SEO experts' efforts at making the results favor their own sites. I can hardly find anything in depth on any topic by searching Google anymore; it's just surface-level knowledge from competing websites that only want to make money off pageviews.

It kills my curiosity and intent with fake knowledge and bad experience. I need something better.

However, it will be interesting to figure out the heuristics that could deliver better-quality search results today. When Google started, it had a breakthrough algorithm: rank a page based on the number of pages linking to it. That is completely meritocratic as long as people don't game it for higher rankings.

A new breakthrough heuristic today will look totally different, while being just as meritocratic and, ideally, resistant to gaming.
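Google's original heuristic is straightforward to sketch. Below is a minimal PageRank power iteration; the toy link graph is invented for illustration, and 0.85 is the damping factor from the original PageRank paper:

```python
# Minimal PageRank power iteration over a toy link graph.
# The graph is invented; 0.85 is the damping factor from the
# original paper.
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping page -> list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:  # dangling page: spread its rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
# "c" has the most (and best-ranked) inbound links, so it comes out on top
best = max(scores, key=scores.get)
```

The meritocratic property the comment describes is visible here: "c" wins purely because other pages link to it, and the gaming problem is equally visible, since anyone who can mint pages that link to "c" can inflate its score.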


The real reason why search is so bad is that Google is downranking the internet.

I should know - I blew the whistle on the whole censorship regime and walked 950 pages to the DOJ and media outlets.

--> zachvorhies.com <--

What did I disclose? That Google was using a project called "Machine Learning Fairness" to rerank the entire internet.

Part of this beast has to do with a secret Page Rank score that Google's army of workers assign to many of the web pages on the internet.

If wikipedia contains cherry picked slander against a person, topic or website then the raters are instructed to provide a low page rank score. This isn't some conspiracy but something openly admitted by Google itself:

https://static.googleusercontent.com/media/guidelines.raterh...

See section 3.2 for the "Expertise, Authoritativeness and Trustworthiness" score.

Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query! Yet bing and duckduckgo return my URL just fine.

Don't listen to the people who say that it's some emergent behavior from bad SEO. This is deliberate sabotage of Google's own search engine in order to achieve the political agenda of its controllers. The shareholders of Google should band together in a class action lawsuit and sue the C-level executives for negligence.

If you want your internet search to be better then stop using Google search. Other search engines don't have this problem: I'm looking at qwant, swisscows, duckduckgo, bing and others.

~Z~


Google's search rankings are based on opinions held by other credible sources. This isn't really blowing the whistle when, as you admitted, Google admits this openly.

And maybe your site doesn't get ranked well because it's directly tied to Project Veritas. I don't like being too political, especially on HN and on an account tied to my real identity, but Project Veritas and its associates exhibit appalling duplicity and misdirection. I would hope that trash like this does get pushed to the bottom.


In a political context, "credible" is often a synonym for "agrees with me". Anyone ranking "page quality" should be conscious of and try to avoid that, and yet the word "bias" doesn't even appear in the linked guidelines for Search Quality Raters.

Of course Google's own bias (and involvement in particular political campaigns) is well known, and opposed to Project Veritas, so it's quite possible that you are right and Google is downranking PV.

Would that be good? Well, that's an opinion that depends mostly on the bias of the commentator.


https://en.wikipedia.org/wiki/Project_Veritas

I doubt this affected search rankings but Project Veritas does have a ton of credibility issues.


And so does wikipedia.


Project Veritas does not have credibility issues. They have never issued a single retraction.

If this surprises you, then welcome to the systematic bias of wikipedia.


> In a political context, "credible" is often a synonym for "agrees with me".

Not among credible sources.


Meta-comment, but this is partly why search and the internet is so bad now; there are a large number of political disinformation campaigns which are getting increasingly blatant, and getting better at finding believers on the internet.

People have a vested interest in destroying the idea that anything can be a non-partisan "fact". Anything can become a smear. Only the most absolutely egregious ones can be reined in by legal action (e.g. Alex Jones libelling the Sandy Hook parents).

(This is not just internet, of course; the British press mendacity towards the Royal family is playing out at the moment.)


His website contains gems like "Things got political in June 2017 when Google deleted "covfefe" out of it's arabic translation dictionary in order to make a Trump tweet become nonsense." (No, covfefe doesn't mean anything in Arabic.)

Here is someone who believes that a private company's open attempts to rank websites by quality amounts to "seditious behaviour" deserving of criminal prosecution, and the only people willing to pay attention were Project Veritas. Google has plenty of ethics issues, but this guy's claims are absurd.


Not only is it a word, but Google had to delete the word twice.

https://www.zachvorhies.com/covfefe.html


> Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query!

I just tried it, it's just showing results for "zach vorhies" instead, which it thinks you meant. I just tried a few other random "people's names as URL" websites I could find, sometimes it does this, sometimes it doesn't.

Furthermore, the results that do appear are hardly unsympathetic to you. If google is censoring you/your opinions, they're doing a very poor job of it.


(I work at Google, but don't work on search)

> If wikipedia contains cherry picked slander against a person, topic or website then the raters are instructed to provide a low page rank score

This sounds like a good thing to me. Sites that contain lies, fabrications, and falsehoods should not be as highly ranked as those which do not.

Why should shareholders sue Google for, as far as I can tell from your argument, trying to provide users with a more useful product?


Why do your colleagues get to decide what is fact and what is fiction? It's our right, as humans, to be able to make that decision on our own after we encounter information. If Wikipedia gains a reputation for libel, then the onus should be on the public to stop trusting them.

Google does not have the moral authority to censor the internet, and it's absolutely wrong for them to attempt this. Information should be free, and you don't have the right to get in the way of that.


They don't get to decide any such thing, and in fact, can't. Google (fortunately for all of us) doesn't run the Internet.

They do run a popular search page, and have to decide what to do with a search like "Is the Earth flat?".

Personally, I would prefer they prominently display a "no". Others would disagree, but a search engine is curation by definition, that's what makes it useful.


> Personally, I would prefer they prominently display a "no".

Oh hey! I just tried this, and it does! image: https://i.imgur.com/OqqxSq3.png


You would, and fortunately Google agrees with you, but imagine for a moment that they didn't. 90% of the internet would suddenly see a 'yes' to that question, even if 99% of websites disagree.


They get to decide because it is their algorithm, and the whole point of a search function is to discriminate among inputs for relevance. They aren't "getting in the way" - they are using it as they please.

What you ask for isn't freedom but control over everyone else - there is nothing stopping you from running your own spiders, search engines, and rankings.


Who said anything about censorship? The topic of discussion is what order results are in. Are you saying Google would be more useful if it returned the 100 million results randomly and left you to sort them out?


They get to decide what they display as results. What do you suggest they do instead? Display all of the internet and let the user filter things for themselves?


> decide what is fact

The fun thing about facts is that nobody needs to decide whether or not they are true. Perhaps the fact that you can honestly claim to think otherwise means you need to take a step back and examine your reasoning.


> Perhaps the fact that you can honestly claim to think otherwise

This is an example of a "fact" that I'm talking about. It's not a fact, it's an opinion being presented as fact. I guess if you present yourself this way online you have no problem with Google controlling what "facts" are found when you use their search engine.

I guess I'll just have to wait until they start peddling a perspective you disagree with.

> If wikipedia contains cherry picked slander against a person, topic or website

Just remember that this is the comment we're discussing... how does one determine if a statement is slander? Are you telling me Google has teams of investigative journalists following up on each of their search results? Or did someone at Google form an opinion, then decide their opinion is the one that should be presented as "fact" on the most popular search engine in the world?


> Despite the fact that I've had around 50 interviews and countless articles written about my disclosure, my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query!

I am not sure what is happening but I directly searched for your website : zachvorhies.com on Google (in Australia, if that matters), which returned the website as the first result:

https://i.imgur.com/Z7RTsuE.png


I'm in the US and Google does not display zachvorhies.com when searching for "zachvorhies.com".


Does for me. USA.

[0] https://i.imgur.com/S2rPywz.png

Also, is this guy a Q follower or something? The favicon is a Q.


> Also, is this guy a Q follower or something? The favicon is a Q.

Actually the favicon is a headshot.

https://www.zachvorhies.com/favicon.ico

And that's also the icon in the page source:

<link rel=”shortcut icon” href="favicon.ico" type=”image/x-icon” />

I wonder why Google shows a Q.


> I wonder why Google shows a Q

A placeholder used if the algorithm thinks the favicon is not appropriate for some reason.


I would think the quotes would make a difference. I was not using quotes, does using quotes in the US still return the website? I would use a VPN and test it out but I am at work right now.


Not using quotes, the first result I get is his twitter account, and the second is a link to a Project Veritas piece about this document leak he describes. Hardly seems like he's getting buried.


Additionally, if you search "site:zachvorhies.com" you get the site. So it isn't de-indexed. It just isn't ranking well.


[removed, for some reason I thought australia was part of the EU. I'm in the US, also did not see the site link through google]


I think the only time Australia has been part of some sort of European organization or group is when Australia competed in the Eurovision.

Anyhow, I would be interested to know what results you get in the EU.


I'm in the US and I get the same search results. Although if I put in your name, I get your Twitter instead. Not sure why anyone would be searching for a dot com instead of going to it.


Same for me in Australia: searching for the name results in the Twitter handle showing up, and then a WikiSpooks website and so forth. The website isn't even on the first page. I think it's rather concerning that searching for a person's name won't return their website but rather a Twitter feed and other websites that have possibly optimized for SEO.


Australia isn't in the EU.

For what it's worth, I'm in Australia and have the same search results as erklik.


Ahh, Zachary Vorhies. I remember your bizarre internal posts literally claiming that Obama's birth certificate was fake. I wasn't surprised at all when you leaked a pile of confidential information to a far right conspiracy theorist group (Project Veritas), and were subsequently fired.

I wouldn't trust a single word that comes out of this guy's mouth.


FWIW I see your website as the first result when I google it:

https://imgur.com/a/jhx7N9D


For what it's worth, when I search for your site it's the first result. You have to click past the "did you mean", which searches on your name originally, but then it's there.


Highly polarized views, like the one you hold, are a result of a multiplicity of communications between entities on the Internet. Those humans who are more prone to spontaneous visualizations or audio tend to follow patterns which use biased arguments over reasoned arguments. That nets you comments like:

> Don't listen to the people who say

> this beast has to do with a secret Page Rank score that Google's army of workers

Anyone who tells you not to listen to others intends to feed you gossip about why you shouldn't gather data that conflicts with their own views. They'll SAY all sorts of things to try to make you "see" it their way. Rational people DO things to prove or disprove a given belief. (Just to note: SAYING a bunch of stuff does not equal DOING a bunch of stuff.)

For anyone rational and interested, Google "this video will make you angry" and bump the speed to 75%. The idea is that memes compete for "compute" space, both in people's minds and in the space they occupy on the Internet. Those who get "infected" with a given meme will go to all sorts of lengths to rationalize why it is true, even though the meme they are arguing against is just as irrational as their own.


>my website zachvorhies.com doesn't show up on Google's search index, even when using the exact url as a query!

Just searched it and it's literally the first result.


> I can possibly not find anything deep enough about any topic by searching on Google anymore.

> It kills my curiosity and intent with fake knowledge and bad experience. I need something better.

It's hard for me to take this seriously when Wikipedia exists and almost always ranks very highly in searches for "knowledge topics". Between Wikipedia and the sources cited on Wikipedia, I find the depth of almost everything worth learning about to be far greater than I can remember in, say, the early 2000s, which seems like the "peak" of Google before SEO became so influential.

In general, I think there are a lot of people wearing rose tinted glasses looking at the "old internet" in this thread. The only thing that has maybe gotten worse is "commercial" queries like those for products and services. Everything else is leaps and bounds better.


There is a lot of stuff you won't find on wikipedia that is now buried, one example being old forum threads containing sage wisdom from fellow enthusiasts on any given topic. You search for an interest and a half dozen relevant forums used to come up on page 1.

These days I rarely see a forum result appear unless I know the specific name of the forum to begin with and utilize the search by site domain operator.

Another problem these days, unrelated to search but dooming it in the process, is all these companies naming themselves or their products after irrelevant existing english words, rather than making up something unique. It's usually fine with major companies, but I think a lot of smaller companies/products shoot themselves in the foot with this and don't realize it. I was once looking for some good discussion and comparison on a bibliography tool called Papers, and that was just a pit of suffering getting anything relevant at all with a name like that.


Add inurl:forum to the query. Google used to have a filter "discussions", but they removed it for some reason. Nowadays I usually start with https://hn.algolia.com/ and site:reddit.com when I want to find a discussion.


The fact that Wikipedia exists, is frequently (though not always) quite good, has citations and references, and ranks highly or is used directly for "instant answers" ...

... still does nothing to answer the point that Web search itself is unambiguously and consistently poorer now than it was 5-10 years ago.

Yes, I find myself relying far more on specific domain searches, either in the Internet sense, or by searching for books / articles on topics rather than Web pages. Because so much more traditionally-published information is online, this actually means the net of online-based search has improved, but not for the most part because of improved Web-oriented resources (Webpages, discussions, etc.), but because old-school media is now Web-accessible.


This. More and more, I have been finding that good books provide better learning than the internet.

You search for quality books online, mostly through discussion forums (since web search fails here) or by following references from other books and articles. Then you spend time digesting them.


Wikipedia is surface-level knowledge. Using Wikipedia, what is the AM4 socket's pinout? Do a Google search and you find several people asking the question but no answers. On the other hand, you can easily find that for an old 8086 CPU.

What’s sad is Google has generally indexed the pages I want, it’s just getting harder to actually find them.


Did you ever find the pin out manual for AM4? Your comment sent me down a google hole...

Clearly, they don’t want it available because their tech docs they host stop at AM3b. I was hoping an X470 (or other flavor) motherboard manufacturer would have something floating around...


Basically, I think you can divide search into commercial-interest searches and non-commercial-interest searches. I can find deep discussions of algorithms curated quite nicely. But information on, say, curtains will be as bad as the OP says.


> A new breakthrough heuristic today will look something totally different, just as meritocratic and possibly resistant to gaming.

I wonder how much of this could be obtained back by penalizing:

1. The number of javascript dependencies

2. The number of ads on the page, or the depth of the ad network

This might start a virtuous circle, but in the end, this is just a game of cat-and-mouse, and website might optimize for this as well.

What we might need to break this is a variety of search engines that uses different criteria to rank pages. I suspect it would be pretty hard, if not impossible, to optimize for all of them.

And in any case, frequently change the ranking algorithms to combat over-optimization by the websites (as that's classically done against ossification for protocols, or any overfitting to outside forces in a competitive system).
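As a sketch of what such penalties might look like, here is a toy re-ranker that downranks pages by script-tag count and by occurrences of known ad-network domains. The weights, the base relevance scores, and the ad-domain list are all invented purely for illustration:

```python
import re

# Toy re-ranking that penalizes pages for script tags and known
# ad-network URLs. Weights and the ad-domain list are invented.
AD_DOMAINS = ("doubleclick.net", "adsystem", "taboola.com")

def quality_penalty(html: str) -> float:
    scripts = len(re.findall(r"<script\b", html, re.IGNORECASE))
    ads = sum(html.count(d) for d in AD_DOMAINS)
    return 1.0 * scripts + 5.0 * ads  # ads weighted more heavily

def rerank(pages):
    """pages: url -> (base_relevance, html). Higher adjusted score ranks first."""
    adjusted = {url: base - quality_penalty(html)
                for url, (base, html) in pages.items()}
    return sorted(adjusted, key=adjusted.get, reverse=True)

pages = {
    "lean.example":  (10.0, "<p>content</p><script src=app.js></script>"),
    "bloat.example": (12.0, "<script></script>" * 8 +
                            '<script src="https://x.doubleclick.net/ads.js"></script>'),
}
order = rerank(pages)
```

The cat-and-mouse point from the comment above shows up immediately: a site could merge all its scripts into one bundle and first-party its ad requests, which is exactly the over-optimization that rotating the ranking criteria is meant to counter.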


You could even have all this under one roof: one common search spider that feeds this ensemble of different ranking algorithms to produce a set of indices, and then a search engine front end that round-robins queries out between the different indices. (Don’t like your query? Spin the algorithm wheel! “I’m Feeling Lucky” indeed.)


The Common Crawl is a thing already. Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage, and I can't think of anything that could change that in the foreseeable future. That's why I think providing a federated Web directory standard, ala ODP/DMOZ except not limited to a single source, would be a far more impactful development.


Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage

Maybe instead of a problem, there is an opportunity here.

Back before Google ate the intarwebs, there used to be niche search engines. Perhaps that is an idea whose time has come again.

For example, if I want information from a government source, I use a search engine that specializes in crawling only government web sites.

If I want information about Berlin, I use a search engine that only crawls web sites with information about Berlin, or that are located in Berlin.

If I want information about health, I use a search engine that only crawls medical web sites.

Each topic is still a wealth of information, but siloed enough that the amount of data could be manageable to a small or medium-sized company. And the market would keep the niches from getting so small that they become useless. A search engine dedicated to Hello Kitty lanyards isn't going to monetize.


I'd be happy with something like Searx [1,2,3]

[1] https://en.wikipedia.org/wiki/Searx [2] https://asciimoo.github.io/searx/ [3] https://stats.searx.xyz/

featuring the semantic map of [4] https://swisscows.ch/

incorporating [5] https://curlie.org/ and Wikipedia and something like Yelp/YellowPages embedded in Open Streetmaps for businesses and points of interest, with a no frills interface showing the history (via timeslide?) of edits.

Bang! Done!


That's the problem that web directories solve. It's not that you're wrong, it's just largely orthogonal to the problem that you'd need a large crawl of the internets for, i.e. spotting sites about X niche that you wouldn't find even from other directly-related sites, and that are too obscure, new, etc. to be linked in any web directory.


That's the problem that web directories solve

Not really. A web directory is a directory of web sites. I can't search a web directory for content within the web sites, which is what a niche search engine would do.


On the other hand, the niche search engine depends upon having such a web directory (the list of sites to index).



Don't forget the search engine search engine!


You don’t really need to store a full text crawl if you’re going to be penalizing or blacklisting all of the ad-filled SEO junk sites. If your algorithm scores the site below a certain threshold then flag it as junk and store only a hash of the page.

Another potentially useful approach is to construct a graph database of all these sites, with links as edges. If one page gets flagged as junk then you can lower the scores of all other pages within its clique [1]. This could potentially cause a cascade of junk-flagging, cleaning large swathes of these undesirable sites from the index.

[1] https://en.wikipedia.org/wiki/Clique_(graph_theory)
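The cascade idea can be sketched as follows. Exact clique detection is expensive, so this sketch uses a simplification (my assumption, not the comment's exact proposal): when a page is flagged as junk, penalize its mutually-linked neighbors once each, and let new flags propagate. All scores, thresholds, and site names are invented:

```python
# Sketch of propagating a "junk" flag through a mutual-link graph.
# Instead of exact clique detection, each flagged page penalizes its
# direct mutual-link neighbors once; drops below the threshold cascade.
# All numbers and names are invented for illustration.
def propagate_junk(edges, scores, seed_junk, threshold=0.3, penalty=0.5):
    """edges: set of frozensets {a, b} for mutual links;
    scores: dict page -> quality in [0, 1]; seed_junk: initially flagged pages."""
    junk = set(seed_junk)
    applied = set()  # (flagged, neighbor) penalties already applied
    changed = True
    while changed:
        changed = False
        for edge in edges:
            a, b = tuple(edge)
            for bad, other in ((a, b), (b, a)):
                if bad in junk and other not in junk and (bad, other) not in applied:
                    applied.add((bad, other))
                    scores[other] *= penalty
                    changed = True
                    if scores[other] < threshold:
                        junk.add(other)
    return junk

edges = {frozenset(p) for p in [("seo1", "seo2"), ("seo2", "seo3"), ("seo1", "blog")]}
scores = {"seo1": 0.2, "seo2": 0.5, "seo3": 0.5, "blog": 0.9}
flagged = propagate_junk(edges, scores, {"seo1"})
```

In this toy run the cascade sweeps through the tightly interlinked SEO sites, while the high-quality blog that happens to share one link survives with a dented but above-threshold score, which is roughly the behavior you'd want from such a scheme.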


Javascript, which Google coincidentally pushed and still pushes for, doesn't exactly make the web easier to crawl either.


What if: SEO consultants aren't gaming the system, but rather search and the web are being optimized for "measurable immediate economic impact", which at this moment means ad revenue, because the web itself is otherwise un-monetizable and unable to generate value.

I don't like the whole concept of SEO, and I don't like the way the web is today, but I think we should stop and think before resorting to the "an immoral few are destroying things; if we unfuck it, we reclaim what we deserve" type of simplification.


> the search and web is being optimized for “measurable immediate economic impact” that is ad revenue at this moment

So much is obvious. The discussion is about whether there is a less shitty metric.


Merging js deps into one big resource isn't difficult. The number of ads point is interesting though. How would one determine what is an ad and what is an image? I have my ideas, but optimizing on this boundary sounds like it would lead to weird outcomes.


Adblockers have to solve that problem already. And it's actually really easy because "ads" aren't just ads unfortunately, they're also third-party code that's trying to track you as you browse the site. So it's reasonably easy to spot them and filter them out.


Advertisers are already using first-party redirection, though. The future of adblockers is bleak.

https://github.com/uBlockOrigin/uBlock-issues/issues/780/


Back in the early days of banner ads, a CSS-based approach to blocking was to target images by size. Since advertising revolved around specific standards of advertising "units" (effectively: sizes of images), those could be identified and blocked. That worked well, for a time.

This is ultimately whack-a-mole. For the past decade or so, point-of-origin based blockers have worked effectively, because that's how advertising networks have operated. If the ad targets start getting unified, we may have to switch to other signatures:

- Again, sizes of images or DOM elements.

- Content matching known hash signatures, or constant across multiple requests to a site (other than known branding elements / graphics).

- "Things that behave like ads behave" as defined by AI encoded into ad blockers.

- CSS / page elements. Perhaps applying whitelist rather than blacklist policies.

- User-defined element subtraction.

There's little in the history of online advertising that suggests users will simply give up.
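The first signature on that list (sizes of images or DOM elements) is easy to sketch. The ad-unit dimensions below are real standard IAB sizes; the element representation and the matching tolerance are invented for illustration:

```python
# Sketch of size-signature ad blocking: hide elements whose
# dimensions match standard IAB ad units. The unit list is a real
# subset of IAB sizes; the element data is invented.
IAB_AD_UNITS = {
    (728, 90),   # leaderboard
    (300, 250),  # medium rectangle
    (160, 600),  # wide skyscraper
    (320, 50),   # mobile banner
}

def looks_like_ad(width: int, height: int, tolerance: int = 2) -> bool:
    """Match within a small tolerance, since ad slots are sometimes padded."""
    return any(abs(width - w) <= tolerance and abs(height - h) <= tolerance
               for w, h in IAB_AD_UNITS)

elements = [("hero image", 1200, 400), ("banner", 728, 90),
            ("logo", 160, 60), ("sidebar box", 301, 249)]
blocked = [name for name, w, h in elements if looks_like_ad(w, h)]
```

The whack-a-mole dynamic is also visible here: as soon as blockers key on standard sizes, advertisers can shift to nonstandard dimensions, which is why the comment's later signatures (hashes, behavior) would have to pick up the slack.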


Some of those techniques will make the whole experience slow compared to the current network request filters and dns blockers.

And that will probably be blocked or severely locked down by the most popular browser, Chrome.

I don't need to give advertisers my data myself when someone else I know can. I really doubt it is easy to throw off the Chrome monopoly at this stage. I presume we will see a chilling effect before anything moves the way IE did.


At this point, I'll take slow over shitshow.


That is fixed since 1.24.3b7 / https://github.com/gorhill/uBlock/releases ?


In the early days of DMOZ, some editors would rank sites lower based on the number of ads they had.


I don't think DMOZ had ranking per se? They could mark "preferred" sites for any given category, but only a handful of them at most, and with very high standards, i.e. it needed to be the official site or "THE" definitive resource about X.


You are correct, the sites weren't "ranked" the same way that Google ranks sites now. But there were preferred sites, and each site had a description written by an editor who could be fairly unpleasant if they wanted to.

I had a site that appeared in DMOZ, and the description was written in such a way that nobody would want to visit it. But it was one of only a few sites on the internet at the time with its information, so it was included.


Given that Google makes money off the ads, that would be hard. DuckDuckGo could pull it off. You need another revenue stream though.


Google has taken on so many markets that I don't think they can do anything reasonably well (or disruptive) without conflicting interests. A breakup is overdue: if they didn't control both search and ads, the web would be a lot better nowadays. If they didn't control web browsers as well, standards would be much more important.


> What we might need to break this is ...

Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay. A multitude of user-side apps may then query that protocol, with each app using different algorithms, heuristics and offering different options.
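To make the idea concrete, here is one hypothetical shape such a protocol could take. No such standard exists: the payload format, field names, and the whole server-published-index model are assumptions for the sake of illustration. A server publishes an inverted index of its own content, and any client-side app queries and merges those indices with its own ranking:

```python
# Hypothetical sketch of a server-offered search index. No such
# standard exists; the payload shape and field names are invented.
import json

def build_site_index(pages):
    """pages: dict url -> page text. Emits an inverted keyword index
    the server could publish for crawler-free client-side search."""
    index = {}
    for url, text in pages.items():
        for word in set(text.lower().split()):
            index.setdefault(word, []).append(url)
    return json.dumps({"version": 1, "index": index})

def client_search(site_indices, query):
    """Query several sites' published indices entirely client-side;
    ranking policy is left to the client app, per the proposal."""
    hits = []
    for payload in site_indices:
        hits.extend(json.loads(payload)["index"].get(query.lower(), []))
    return hits

site_a = build_site_index({"a.example/1": "cats and dogs", "a.example/2": "only dogs"})
site_b = build_site_index({"b.example/1": "cats forever"})
results = client_search([site_a, site_b], "cats")
```

The division of labor matches the comment: the server offers the index of everything it serves, and the multitude of competing client apps decide how to weigh and order the merged results.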


I've been thinking along similar lines for a year or so now.

There are several puzzling omissions from Web standards, particularly given that keyword-based search was part of the original CERN WWW discussion:

http://info.cern.ch/hypertext/WWW/Addressing/Search.html

IF we had a distributable search protocol, index, and infrastructure ... the entire online landscape might look rather different.

Note that you'd likely need some level of client support for this. And the world's leading client developer has a strongly-motivated incentive to NOT provide this functionality integrally.

A distributed self-provided search would also have numerous issues -- false or misleading results (keyword stuffing, etc.) would be harder to vet than the present situation. Which suggests that some form of vetting / verifying provided indices would be required.

Even a provided-index model would still require a reputational (ranking) mechanism. Arguably, Google's biggest innovation wasn't spidering, but ranking. The problem now is that Google's ranking ... both doesn't work, and incentivises behaviours strongly opposed to user interests. Penalising abusive practices has to be built into the system, with those penalties being rapid, effective, and for repeat offenders, highly durable.

The problem of potential for third-party malfeasance -- e.g., engaging in behaviours appearing to favour one site, but performed to harm that site's reputation through black-hat SEO penalties, also has to be considered.

As a user, the one thing I'd most like to be able to do is specify blacklists of sites / domains I never want to have appear in my search results. Without having to log in to a search provider and leave a "personalised" record of what those sites are.

(Some form of truly anonymised aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge.)
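The blacklist feature asked for above can live entirely client-side (say, in a browser extension), so the list never reaches the search provider. A minimal sketch, with the flat-list result format being an invented stand-in:

```python
# Sketch of client-side filtering of search results against a
# personal domain blocklist, so the provider never sees the list.
# The result format is invented for illustration.
from urllib.parse import urlparse

def filter_results(results, blocklist):
    """results: list of result URLs; blocklist: set of domains to
    drop, including their subdomains."""
    kept = []
    for url in results:
        host = urlparse(url).hostname or ""
        if not any(host == d or host.endswith("." + d) for d in blocklist):
            kept.append(url)
    return kept

results = ["https://goodsite.example/page",
           "https://www.contentfarm.example/listicle",
           "https://contentfarm.example/other"]
kept = filter_results(results, {"contentfarm.example"})
```

Since the filtering happens after the results arrive, no "personalised" record is left with the provider, which is exactly the property the comment asks for; the cost is that a page of results may come back shorter than requested.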


I too have been thinking about these things for a long time, and I also believe a better future is going to include "aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge."

I decided it is time for us to have a bouncer-bots portal (or multiple) - this would help not only with search results, but also could help people when using twitter or similar - good for the decentralized and centralized web.

My initial thinking was these would be 'pull' bots, but I think they would be just as useful, and more used, if they were perhaps active browser extensions..

This way people can choose which type of censoring they want, rather than relying on a few others to choose.

I believe in creating some portals for these, similar to ad-block lists - people can choose to use Pete'sTooManyAds bouncer, and/or Sam'sItsTooSexyForWork bouncer..

ultimately I think the better bots will have switches where you can turn on and off certain aspects of them and re-search.. or pull latest twitter/mastodon things.

I can think of many types of blockers that people would want, and some that people would want part of - so either varying degrees of blocking sexual things, or varying bots for varying types of things.. maybe some have sliders instead of switches..

make them easy to form and comment on and provide that info to the world.

I'd really like to get this project started, not sure what the tooling should be - and what the backup would be if it started out as a browser extension but then got booted from the chrome store or whatever.

Should this / could this be a good browser extension? What language / skills required for making this? It's on my definite to do future list.


There are some ... "interesting" ... edge cases around shared blocklists, most especially where those:

1. Become large.

2. Are shared.

3. And not particularly closely scrutinised by users.

4. Via very highly followed / celebrity accounts.

There are some vaguely similar cases of this occurring on Twitter, though some mechanics differ. Celebs / high-profile users attract a lot of flak and take to using shared blocklists. Those get shared not only among celeb accounts but also among their followers, though, and because celebs themselves are a major amplifying factor on the platform, being listed effectively means disappearing from it. That is particularly critical for those who depend on Twitter reach (some artists, small businesses, and others).

Names may be added to lists in error or malice.

This blew up summer of 2018 and carried over to other networks.

Some of the mechanics differ, but a similar situation playing out over informally shared Web / search-engine blocklists could have similar effects.


Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay

Isn't that pretty much a site map?

https://en.wikipedia.org/wiki/Sitemaps


A sitemap simply tells you what pages exist, not what's on those pages.

Systems such as lunr.js are closer in spirit to a site-oriented search index, though that's not how they're presently positioned, but instead offer JS-based, client-implemented site search for otherwise static websites.

https://lunrjs.com
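To make the distinction concrete, here's a toy sketch of the kind of index a site could publish itself: a tiny inverted index mapping terms to the pages that contain them. All URLs and page text here are invented examples, not anything from lunr.js itself.

```python
# Toy inverted index: the kind of thing a site could build at deploy
# time and serve alongside its pages. All data below is made up.
import re
from collections import defaultdict

pages = {
    "/posts/search-quality": "why web search results feel worse every year",
    "/posts/static-sites": "building fast static sites without javascript",
}

index = defaultdict(set)
for url, text in pages.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        index[term].add(url)

def search(query):
    """Return pages containing every query term (simple AND semantics)."""
    terms = re.findall(r"[a-z]+", query.lower())
    if not terms:
        return set()
    results = index[terms[0]].copy()
    for term in terms[1:]:
        results &= index[term]
    return results

print(search("static javascript"))  # -> {'/posts/static-sites'}
```

A search engine could fetch a serialized index like this instead of re-crawling every page - which is exactly where the trust problem discussed below comes in.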


How would this help anything? It would make the blackhat SEO even easier, if anything.


The results could be audited.

Fail an audit, lose your reputation (ranking).

The basic principle of auditing is to randomly sample results. BlackHat SEO tends to rely on volume in ways that would be very difficult to hide from even modest sampling sizes.
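As a hedged sketch of that sampling idea (all data and the `fetch_text` stand-in are hypothetical): pick random (term, url) claims from a self-reported index and check whether the term actually appears on the page.

```python
# Sketch of auditing a self-reported index by random sampling.
# `claimed_index` maps terms to URLs the server says contain them;
# `fetch_text` is a stand-in for actually downloading the page.
import random

claimed_index = {
    "ironing": ["/reviews/irons", "/spam/keyword-stuffed"],
    "cordless": ["/reviews/irons"],
}

page_texts = {  # what the pages really say (toy data)
    "/reviews/irons": "our cordless ironing roundup",
    "/spam/keyword-stuffed": "buy cheap pills now",
}

def fetch_text(url):
    return page_texts.get(url, "")

def audit(index, samples=2, seed=0):
    """Sample (term, url) claims and return the fraction that check out."""
    rng = random.Random(seed)
    claims = [(t, u) for t, urls in index.items() for u in urls]
    picked = rng.sample(claims, min(samples, len(claims)))
    ok = sum(1 for term, url in picked if term in fetch_text(url))
    return ok / len(picked)

score = audit(claimed_index, samples=3)  # 2 of 3 claims check out
```

A server stuffing thousands of bogus (term, url) claims into its index would fail even a small sample badly, which is the point: reputation loss becomes cheap to detect.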


How do you stop the server lying?

If a good site is on shared hosting will it always be dismissed because of the signal of the other [bad] sites on that same host? (you did say at DNS level, not domain level)


> Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay.

So, back to gopher? That might actually work!


Google already penalizes based off payload size/download speed.


> 1. The number of javascript dependencies

How about we don't start looking at /how/ a site is made, when it's already difficult enough to sort out /what/ it is.


Goodhart's Law in action. I wonder how we make a new measure that buys us more time?

https://en.wikipedia.org/wiki/Goodhart's_law


The backtick in your link broke it:

https://en.wikipedia.org/wiki/Goodhart's_law

(Were you on mobile and using a smart keyboard?)


I manually added the comma on a mobile smart keyboard. :) Didn't know that doesn't work haha.


That damn ‘Smart Quotes’ misfeature is still causing havoc even after 30 years.


Nitpick: It's actually an apostrophe " ' ", not a backtick/grave accent " ` " or comma " , " :D

https://en.wikipedia.org/wiki/Apostrophe

https://en.wikipedia.org/wiki/Grave_accent#Use_in_programmin...

https://en.wikipedia.org/wiki/Comma


Oh yes sorry I’m full of mistakes today. Of course, not a comma!


In the specific case of date-based searches, they are pretty difficult because of how pages are ranked. For a long time (and still to a large extent) Google has ranked 'newer' pages higher than 'relevant' pages. At Blekko[1] there was a lot of code that tried to figure out the actual date of a document (be it a forum post, news article, or blog post). That date would often be months or years earlier than the 'last change' information would have you think.

Sometimes it's pretty innocuous: a CMS updates every page with a fresh copyright notice at the start of each year. Other times it's less innocuous: the page simply updates the "related links" or sidebar material and refreshes the content.

It is still an unsolved ranking-relevance problem when a student-written, 3-month-old description of how AM modulation works ranks higher than a 12-year-old, professor-written description. There isn't a ranking signal for 'author authority'. I believe it is possible to build such a system, but doing so doesn't align well with the advertising goals of a search engine these days.

[1] disclaimer I worked at Blekko.
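A rough, hypothetical sketch of that kind of date guessing: collect every date-like string on a page and prefer the earliest plausible one, on the theory that copyright footers and sidebars get refreshed but the original byline doesn't. Real systems handled far more formats and signals than this toy version.

```python
# Guess a document's "true" date: find all ISO-style dates in the
# page and take the earliest valid one. Heuristic and illustrative only.
import re
from datetime import date

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def guess_document_date(html):
    candidates = []
    for y, m, d in DATE_RE.findall(html):
        try:
            candidates.append(date(int(y), int(m), int(d)))
        except ValueError:
            continue  # e.g. impossible dates like 2020-13-40
    return min(candidates) if candidates else None

html = """
<p>Posted 2008-03-14 by a professor</p>
<footer>Copyright updated 2020-01-01</footer>
"""
print(guess_document_date(html))  # earliest date wins: 2008-03-14
```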


“I can possibly not find anything deep enough about any topic by searching on Google anymore. It's just surface-level knowledge that I get from competing websites who just want to make money off pageviews.”

Is it possible that there is no site providing non fluffy content on your query? For a lot of niche subjects, there really are very few if any substantial content on that topic.


> Is it possible that there is no site providing non fluffy content on your query? For a lot of niche subjects, there really are very few if any substantial content on that topic.

“Very few if any substantial”

The problem is that google won’t even show me the very few anymore. It’s just fluff upon fluff and depth (or real insight at least) is buried in twitter threads and reddit/hn comments, and github issue discussion.

I fear the seo problem has not only killed knowledge propagation, but also thoroughly disincentivized smart people from even trying. And that makes me sad.


I mirror your sentiment. It used to be that you could use your google-fu and find a dozen relevant forum posts or mail chains in plain text. It's much, much harder to get the same standard of results now. Pasting stack traces and error messages, needle-in-a-haystack phrases from an article or book - none of it works anymore.

Yeah, if I know where I'm looking (the sites) then google is useful since I can narrow it to that domain. But if I don't know where to look then I'm SOL.

The serendipity of good results on Google.com is no longer there. And given the talent at google you have to wonder why.


The devil is in this detail: Regular users don’t want those “weird looking” results. Normies prefer the fluff.

And guess what: most users are normal. We here on HN are weird.


This is such an interesting and common anti-pattern on the internet:

1. Something is or provides access to good quality content.

2. Because of this quality, it gets more and more popular.

3. As popularity grows, and commercialization takes over, the incentive becomes to make things "more accessible" or "appealing" to the "average" user. More users is always better right!?

4. This works, and quality plummets.

5. The thing begins to lose popularity. Sometimes it collapses into total unprofitability. Sometimes it remains but the core users that built the quality content move somewhere else, and then that new thing starts to offer tremendous value in comparison to the now low quality thing.


It is only solvable for a short period of time. Then, when whatever replaces the current search is successful enough there will be an incentive to game the new system. So the only way to really solve this is by radical fragmentation of the search market or by randomizing algorithms.


Someone in a previous thread, whose name I unfortunately can't remember, suggested that it might not just be the SEO but the internet that changed. Google used to be really good at ascertaining meaning from a panoply of random sources, but those sites are all gone now. The Wild West of random blogs and independent websites is basically dead in favor of content farms and larger-scale media companies.


I’ve found more and more that I have reverted to finding books instead of searching to find deeper knowledge. The only issue is that it is easy to publish low-quality books now. Depending on the topic, often if a book stands the test of time it is a worthwhile read. With tech books you have to focus on the author’s credentials.


> However, it will be interesting to figure the heuristics to deliver better quality search results today.

If only there were some kind of analog for effective ways to locate information. Like if everything were written on paper, bound into collections, and then tossed into a large holding room.

I guess it's past the Internet's event horizon now, but crawler-primary searching wasn't the only evolutionary path to search.

Prior to Google (technically: AdWords revenue funding Google) seizing the market, human-curated directories were dominant [1, Virtual Library, 1991] [2, Yahoo Directory, 1994] [3, DMOZ, 1998].

Their weakness was always cost of maintenance (link rot), scaling with exponential web growth, and initial indexing.

Their strength was deep domain expertise.

Google's initial success was fusing crawling (discovery) with PageRank (ranking), where the latter served as an automated "close enough" approximation of human directory building.
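For the curious, that "close enough" approximation can be sketched in a few lines - a toy power-iteration version of PageRank over a hand-made link graph (the graph itself is invented):

```python
# Toy power-iteration sketch of PageRank over a hand-made link graph.
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for p, outs in links.items():
            if not outs:  # dangling page: spread its score evenly
                for q in pages:
                    new[q] += damping * rank[p] / n
            else:
                for q in outs:
                    new[q] += damping * rank[p] / len(outs)
        rank = new
    return rank

# "hub" is linked to by both other pages, so it should rank highest.
graph = {"a": ["hub"], "b": ["hub"], "hub": ["a"]}
ranks = pagerank(graph)
```

The whole scheme rests on links being honest votes of usefulness, which is exactly the assumption the modern web broke.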

Unfortunately, in the decades since, we seem to have forgotten how useful hand-curated directories were, in our haste to build more sophisticated algorithms.

Add to that that the very structure of the web has changed. When PageRank first debuted, people were still manually tagging links to their friends' / other useful sites on their own. Does that sound like the link structure we have in the web now?

Small surprise results are getting worse and worse.

IMHO, we'd get a lot of traction out of creating a symbiotic ecosystem whereby crawlers cooperate with human curators, both of whose enriched output is then fed through machine learning algorithms. Aka a move back to supervised web-search learning, vs. the currently dominant unsupervised.

[1] https://en.m.wikipedia.org/wiki/World_Wide_Web_Virtual_Libra... , http://vlib.org/

[2] https://en.m.wikipedia.org/wiki/Yahoo!_Directory

[3] https://en.m.wikipedia.org/wiki/DMOZ , https://www.dmoz-odp.org/


Mixing human curation with crawlers is probably something that'd help with search results quality, but the issue comes in trying to get it to scale properly. Directories like the Open Directory Project/DMOZ and Yahoo's directory had a reputation for being slow to update, which left them miles behind Google and its ilk when it came to indexing new sites and information.

This is problematic when entire categories of sites were basically left out of the running, since the directory had no way to categorise them. I had that problem with a site about a video game system the directory hadn't added yet, and I suspect others would have it for say, a site about a newer TV show/film or a new JavaScript framework.

You've also got the increase in resources needed (you need tons of staff for effective curation), and the issues with potential corruption to deal with (another thing which significantly affected the ODP's usefulness in its later years).


Federation would help with both breadth and potential corruption, compared to what we had with ODP/DMOZ. A federated Web directory (with common naming/categorization standards, but very little beyond that) would probably have been infeasible back then simply because the Internet was so much smaller and fewer people were involved (and DMOZ itself partially made up for that lack by linking to "awesome"-like link lists where applicable) - but I'm quite sure that it could work today, particularly in the "commercial-ish" domain where corruption worries are most relevant.


The results are human-curated, as much as google would like to publicly pretend otherwise.

I think a more fundamental problem is that a large portion of content production is now either unindexable or difficult to index - Facebook, Instagram, Discord, and YouTube to name a few. Pre-Facebook, the bulk of new content was indexable.

YouTube is relatively open, but the content and context of what is being produced are difficult to extract, if only for the reason that people talk differently than they write. That doesn’t mean, in my opinion, that the quality of a YouTube video is lower than what would have been written in a blog post 15 years ago, but it makes it much more difficult to extract snippets of knowledge.

Ad monetization has created a lot of noise too, but I’m not sure without it, there would be less noise. Rather it’s a profit motive issue. Many, many searches I just go straight to Wikipedia and wouldn’t for a moment consider using Google for.

Frankly I think the discussion here is way better than the pretty mediocre to terrible “case study” that was posted.



Immediately before Google were search engines like AltaVista https://en.wikipedia.org/wiki/AltaVista (1995) and Lycos https://en.wikipedia.org/wiki/Lycos (1994) which were not directories like Yahoo. Google won by not being cluttered with non-search web portal clutter, and by the effectiveness of PageRank, and because by the late 1990s the web was too big to be indexed by a manually curated directory.


"Halt And Catch Fire" had a cool way of taking these 2 approaches of search into their plot line.


“When a measure becomes a target, it ceases to be a good measure” - Charles Goodhart


Perhaps it’s not always a new heuristic that is needed, but a better way to manage the externalities around current/preceding heuristics.

From a “knowledge-searching” perspective, at a very rudimentary level, it makes sense to look to sites/pages that are often cited (linked to) by others as better sources to rank higher up in the search results. It’s a similar concept to looking at how often academic papers are cited to judge how “prominent” of a resource they are on a particular topic.

However, as with academia, even though this system could work pretty well for a long time at its best (science has come a long way over hundreds of years of publications), that doesn’t mean it’s perfect. There’s interference that could be done to skew results in one’s favor, there’s funneling money into pseudoscience to turn into citable sources, there’s leveraging connections and credibility for individual gain, - the list goes on.

The heuristic itself is not innately the problem. The incentive system that exists for people to use the heuristic in their favor creates the issue. Because even if a new heuristic emerges, as long as the incentive system exists, people will just alter course to try to be a forerunner in the “new” system and grab as big a slice of the pie as they can.

That’s a tough nut for google (or anyone) to crack. As a company, how could they actually curate, maintain, and evaluate the entire internet on a personal level while pursuing profitability? That seems near impossible. Wikipedia does a pretty damn good job at managing their knowledge base as a nonprofit, but even then they are always capped by amount of donations.

It’s hard to keep the “shit-level” low on search results when pretty much anyone, anywhere, anytime could be adding in more information to the corpus and influencing the algorithms to alter the outcomes. It gets to a point where getting what you need is like finding a needle in a haystack that the farmer got paid for putting there.


> I have been thinking about the same problem since a few weeks. The real problem with search engines is the fact that so many websites have hacked SEO that there is no meritocracy left.

That's not actually the problem described here. His problem is actually a bit deeper rooted, since he specified the exact parameters of what he wants to see but got terrible results. He specified a search for "site:reddit.com", but the results he got were irrelevant and worse than the results that he would have got when searching reddit directly.

I'm not saying that SEO, sites that copy content and only want to generate clicks, and large sites that accumulate everything aren't bad for the internet of today, but the level of results we get off of search engines today is, in one word, abysmal.


Wrong. The site query worked. The issue is that there is no clear way to determine information date, as pages themselves change. Since more recent results are favored, SEO strategy of freshness throws off date queries. https://www.searchenginejournal.com/google-algorithm-history...


Wrong. In the article he elaborates.

> As you can see, I didn’t even bother clicking on all of them now, but I can tell you the first result is deeply irrelevant and the second one leads to a 4 year old thread.

He also wrote

> At this point I visibly checked on reddit if there’ve been posts about buying a phone from the last month and there are.

Duckduckgo even recognized the date to be 4 years old and reddit doesn't hide the age of posts. There are newer more fitting posts, but they aren't shown. And again a quote

> Why are the results reported as recent when they are from years ago, I don’t know - those are archived post so no changes have been made.

So your argument (though it really is a problem) in this case is a red herring. The problem lies deeper, since google seems to be unable to do something as simple as extracting the right date, and ddg ignores it. Also, since all the results are years old, it adds to the confusion as to why the results don't match the query. (He also wrote that the better matches were indeed indexed, but not found.)


You said this wrong thing: He specified a search for "site:reddit.com" and claimed it was irrelevant. THAT IS NOT A RELEVANCY TERM. It correctly scopes searches to the specified site.

The entirety of the problem is the date query is not working, because of SEO for freshness. You also said this other wrong thing: "That's not actually the problem described here" . That is the problem here. The page shows up as being updated because of SEO.

The date in a search engine is the date the page was last judged to be updated, not one of the many dates that may be present on the page. When was the last reddit design update? Do you think that didn't cause the pages to change? Wake up.


>but the results he got were irrelevant and worse than the results that he would have got when searching reddit directly

Wrong. Internal reddit search has bad results too - so why even let you filter by last month?


Not to totally detract from your point, but my previous experience with SEO people showed that some SEO strategies improve not only page ranking but also actual usability.

The first, and the most important perhaps was page load speed. We adopted a slightly more complicated pipeline on the server side, reduced the amount of JS required by the page, and made page loading faster. That improved both the ranking and actual usability.

The second was that the SEO people told us our homepage contained too many graphics and too little text, so search engines didn't extract much content from our pages. We responded by adding more text in addition to the fancy eye-catching graphics. That improved both the ranking and the actual accessibility of the site.


I have noticed most HN comments with SEO in them take it as being bad, bad, bad - the reason for the death of good search, the need for powerful whatever.

I really wish everyone would qualify, and not just as black-hat/white-hat - there are many types of SEO, often with different intentions.

I understand there has been a lot of google koolaid (and others') about how seo is evil, it's poisoning the web, etc...

But now - or has it been a couple years now? - google had a video come out saying an SEO firm is okay if they tell you it takes 6 months... they have upgraded their pagespeed tool, which helps with seo, and they were quite public about wanting ssl/https on everything and that it would help with google seo.

So there are different levels of SEO. Someone mentioned an seo plugin I was using on a site as being a negative indicator, and I chuckled - the only thing that plugin does is try to fix some of the inherent obvious screwups of wordpress out of the box... things like missing meta descriptions, which google flags in webmaster tools as multiple identical meta descriptions. It also tries to alleviate duplicate-content penalties by no-indexing archives and categories and whatever.

So there is SEO that is trying to work with google, and then there is SEO where someone goes out and puts comments on 10,000 web sites for the sole purpose of ranking higher. To me that is kind of grey hat if it's a few comments, but shady if it's hundreds, and especially if it's automated.

But real blackhat stuff - hacking other web sites and adding links, or having a site that is selling breast-enlarging pills and trying to catch people who type in a keyword for britney spears - that is trying to fool people.

I have built sites with good info and had to do things to make them prettier for the ranking bot, but they are giving the surfer what they are looking for when they type 'whatever phrase'... I have also made web sites better when trying to also get them to show up in top results.

So it's not always seo=bad; sometimes seo=good, for the algorithm and the users.

And sometimes it's madness - like the extra fluff on recipe pages to keep a visitor on the page longer to keep google happy - haha. Many different flavors of it, and different intentions.


I've often thought one approach, though one I wouldn't necessarily want to be the standard paradigm, would be exclusivity based on usefulness.

So for example duckduckgo is still trying to use various providers to emulate essentially "early google without the modern google privacy violations", but when I start to think about many of the most successful company-netizens, one thing that stands out is early day exclusivity has a major appeal.

So I imagine a search engine that only crawls the most useful websites and blogs and works on a whitelist basis. Instead of trying to order search results to push bad results down, just don't include them at all or give them a chance to taint the results. It would have more overhead, and it would take a certain amount of time to make sure it was catching non-major websites that are still full of good info... but once that was done it would probably be one of the best search experiences in existence. To streamline this - and I know it's cliche - surely some ML analysis could be applied to figuring out which sites are SEO-gaming or click-baiting regurgitators and weed them out.

Just something I've been mulling over for a while now.


And how do you determine which websites are good, other than checking if they are doing seo? Is reddit.com good or bad? If a good site does seo, should it be taken out?

And what if what you're searching for exists only on a non-good website? Isn't it better to show a result from a non-good website than to show nothing?


> It's just surface-level knowledge that I get from competing websites who just want to make money off pageviews.

Can you give some examples of queries/topics? Not that I disagree, I often have the same problem, but have found ways to mitigate.


Can you elaborate on some of these ways? I just have a big minus list of sites.


I would need to hear examples, queries or topics that results in solely superficial information, like OP stated.


I too am asking people to start making lists of lame query returns. I have taken screenshots of some, and even made a video about one... but a solid list, in a spreadsheet perhaps, would be helpful. Of course, with results varying for different people / locations and month to month / year to year, having some screenshots would be helpful too. Not sure if there is a good program for snapping some screenshots, pulling some key phrases, and putting it all together well...


This is very much misguided.

Many websites do have "hacked" (blackhat/shady) SEO, but these websites do not last long, and are entirely wiped out (see: de-ranked) every major algorithm update.

The major players you see on the top rankings today do utilize some blackhat SEO, but it's not at a level that significantly impacts their rankings. Blackhat SEO is inherently dangerous, because Google's algorithm will penalize you at best when it finds out -- and it always does -- and at worst completely unlist your domain from search results, giving it a scarlet letter until it cools off.

However, the bulk of all major websites primarily utilize whitehat SEO, i.e. "non-hacked," i.e. "Google-approved" SEO to maintain their rankings. They have to, else their entire brand and business would collapse, either from being out-ranked or by being blacklisted for shady practices.

Additionally, Google's algorithm hasn't changed much at all from PageRank, in the grand scheme of things. If you can read between their lines, the biggest SEO factor is: how many backlinks from reputable domains do you have pointing at your website? Everything else, including blackhat SEO, is a small optimization for breaking ties. Sort of like PED usage in competitive sports; when you're at the elite level, every little bit extra can make a difference.

Google's algorithm works for its intended purposes, which is to serve pages that will benefit the highest amount of people searching for a specific term. If you are more than 1 SD from the "norm" searching for a specific term, it will likely not return a page that suits you best.

Google's search engine is based on virality and pre-approval. "Is this page ranked highly by other highly ranked pages, and does this page serve the most amount of people?" It is not based on accuracy or informational integrity -- as many would believe after the latest Medic update -- but simply "does this conform to normal human biases the most?"

If you have a problem with Google's results, then you need to point the finger at yourself or at Google. SEO experts, website operators, etc. are all playing a game that's set on Google's terms. They would not serve such shit content if Google did not: allow it, encourage it, and greatly reward it.

Google will never change the algorithm to suit outliers, the return profile is too poor. So, the next person to point a finger at is you: the user. Let me reiterate, Google's search engine is not designed for you; it is designed for the masses. So there is no logical reason for you to continue using it the way you do.

If you wish to find "deep enough" sources, that task is on you, because it cannot be readily or easily monetized; thus, the task will not be fulfilled for free by any business. So, you must look at where "deep enough" sources lay: books, journals, and experts.

Books are available from libraries, and a large assortment of them are cataloged online for free at Library Genesis. For any topic you can think of, there is likely to be a book that goes into excruciating detail that satisfies your thirst for "deep enough."

Journals, similarly. Library Genesis or any other online publisher, e.g NIH, will do.

Experts are even better. You can pick their brains and get even more leads to go down. Simply, find an author on the subject -- Google makes this very easy -- and contact them.

I'm out of steam, but I really felt the need to debunk this myth that Google is a poor, abused victim, and not an uncaring tyrant that approves of the status quo.


> Google's algorithm works for its intended purposes, which is to serve pages that will benefit the highest amount of people searching for a specific term.

Does it? So for any product search, thrown-together comparison sites without actual substance but lots of affiliate links are really among the best results? Or maybe they are the most profitable result, and thus the one most able to invest in optimizing for ranking? Similarly, do we really expect results on (to a human) clearly hacked domains to be the best for anything, but Google will still put them in the top 20 for some queries? "Normal people want this crap" is a questionable starting point in many cases.


Over the long-term, Google's algorithm will connect the average person to the page most likely to benefit them, more than it won't.

There is no "best result."

Any page falling under "thrown-together comparison sites without actual substance but lots of affiliate links" are temporal inefficiencies that get removed after each major update.

Will more pop up? Yes, and they will take advantage of any inefficiency or edge case in the algorithm to boost their rankings to #1.

Will they stay there for more than a few months? No. They will be squashed out, and legitimate players will over time win out.

This is the dichotomy between "churn and burn" businesses and "long term" businesses. You will make a very lucrative, and quick, buck going full blackhat, but your business won't last and you will consistently need to adapt to each successive algo update. Meanwhile long-standing "legit" businesses only need to maintain market dominance -- something much easier to do than breaking into the market from ground zero, which churn-and-burners will have to do in perpetuity until they burn out themselves.

If you want to test this, go and find 10 websites you think are shady, but have top 5 rankings for a certain search phrase. Mark down the sites, keyword, and exact pages linked. Now, wait a few months. Search again using that exact phrase. More likely than not, i.e more than 5 out of 10, will no longer be in the top 5 for their respective phrases, and a couple domains will have been shuttered. I should note that "not deep info" is not "shady," because the results are for the average person. Ex. WebMD is not deep, but it's not shady either.

I implore people to try to get a site ranked with blackhat tricks and lots of starting capital, and see just how hard it is to keep it ranked consistently using said tricks. It's easy to speculate and make logical statements, but they don't hold much weight without first-hand experience and observation.


>Will they stay there for more than a few months? No. They will be squashed out, and legitimate players will over time win out.

This isn't true at all in my experience. As a quick test I tried searching for "best cordless iron", on the first page there is an article from 2018 that leads to a very broken page with filler content and affiliate links. [1] There are a couple of other articles with basically the exact same content rewritten in various ways also on the first page.

A quick SERP history check confirms that this page has returned in the top 10 results for various keywords since late 2018.

>It's easy to speculate and make logical statements, but they don't hold much weight without first-hand experience and observation.

This statement is a bit ironic given that it took me 1 keyword and 5 seconds of digging to find this one example.

[1] https://www.theironingroom.com/best-cordless-iron-reviews-of...


Here's an xkcd [0] inspired idea. We have several search engines, each with some level of bias. We're not looking to crawl the whole internet, because we can't compete with their crawlers. However, we could make a crawler to crawl their results and re-rank the top N from each engine according to our own metric. Maybe even expose the parameters in an "advanced" search. I'm assuming this would violate some sort of EULA, though. Any idea if someone has tried this approach?

Edit: thinking more about this post's specific issue, I'm not sure what to do if all the crawlers fail. We could always hook into the search APIs for github, reddit, SO, wiki, etc. Full shotgun approach.

[0] https://xkcd.com/927/
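One hedged way to do the re-ranking step, assuming you can scrape or API the top N from each engine (all the result lists below are invented), is reciprocal rank fusion: reward URLs that appear high up across several engines, without needing any engine's internal scores.

```python
# Sketch of merging ranked result lists from several engines with
# reciprocal rank fusion (RRF). The engine outputs are made up.
from collections import defaultdict

def rrf(result_lists, k=60):
    """Score each URL by the sum of 1/(k + rank) across engines."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["site1.example", "site2.example", "spam.example"]
engine_b = ["site2.example", "site1.example"]
engine_c = ["site2.example", "spam.example"]

merged = rrf([engine_a, engine_b, engine_c])
# site2.example appears high in all three lists, so it ends up first.
```

Exposing `k` (or per-engine weights) would be one form of the "advanced" parameters mentioned above.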


Isn't this basically what DuckDuckGo does?


Good to hear your concerns.

> The real problem with search engines is the fact that so many websites have hacked SEO that there is no meritocracy left.

I intend to announce the alpha test of my search engine here on HN.

My search engine is immune to all SEO efforts.

> I can possibly not find anything deep enough about any topic by searching on Google anymore.

In simple terms my search engine gives users content with the meaning they want and in particular stands to be very good, by far the best, at delivering content with "deep" meaning.

> I need something better.

Coming up.

> However, it will be interesting to figure the heuristics to deliver better quality search results today.

Uh, sorry, it's not fair to say that my search engine is based on "heuristics".

I'm betting on my search engine being successful and would have no confidence in heuristics.

Instead of heuristics I took some new approaches:

(1) I get some crucial, powerful new data.

(2) I manipulate the data to get the desired results, i.e., the meaning.

(3) The search engine likely has by far the best protections of user privacy. E.g., search results are the same for any two users doing the same query at essentially the same time and, thus, in particular, independent of any user history.

(4) The search engine is fully intended to be safe for work, families, and children.

For those data manipulations, I regarded the challenge as a math problem and took a math approach complete with theorems and proofs.

The theorems and proofs are from some advanced, not widely known, pure math with some original applied math I derived. Basically the manipulations are as specified in math theorems with proofs.

> A new breakthrough heuristic today will look something totally different, just as meritocratic and possibly resistant to gaming.

My search engine is "something totally different".

My search engine is my startup. I'm a sole, solo founder and have done all the work. In particular I designed and wrote the code: It's 100,000 lines of typing using Microsoft's .NET.

The typing was without an IDE (integrated development environment) and, instead, was just into my favorite general purpose text editor KEdit.

It's my first Web site: I got a good start on Microsoft's .NET and ASP.NET (for the Web pages) from

Jim Buyens, Web Database Development, Step by Step, .NET Edition, ISBN 0-7356-1637-X, Microsoft Press.

The code seems to run as intended. The code is not supposed to be just a "minimum viable product" but is intended for first production to peak usage of about 100 users a second; after that I'll have to do some extensions for more capacity. I wrote no prototype code. The code needs no refactoring and has no technical debt.

While users won't be aware of anything mathematical, I regard the effort as a math project. The crucial part is the core math that lets me give the results. I believe that that math will be difficult to duplicate or equal. After the math and the code for the math, the rest has been routine.

Ah, venture capital and YC were not interested in it! So I'm like the story "The Little Red Hen" that found a grain of wheat, couldn't get any help, then alone grew that grain into a successful bakery. But I'm able to fund the work just from my checkbook.

The project does seem to respond to your concerns. I hope you and others like it.

How should I announce the alpha test here at HN?


This sounds extremely implausible; especially claiming "immune to SEO" is like declaring encryption "unbreakable". A lot of human effort would be devoted to it if your engine became popular.


How to be immune to SEO? Easy once you see how.

More details are in my now old HN post at

https://news.ycombinator.com/item?id=12404641

For a short answer, SEO has to do with keywords. My startup has nothing to do with keywords or even the English language at all. In particular, I'm not parsing the English language or any natural language; I'm not using natural language understanding techniques. In particular, my Web pages are just so dirt-simple to use (user interface and user experience) that a child of 8 or so who knows no English should be able to learn to use the site in about 15 minutes of experimenting and about three minutes of watching someone use the site, e.g., via a YouTube video clip of screen captures. E.g., lots of kids of about that age get good with some video games without much or any use of English.

E.g., how are you and your spouse going to use keywords to look for an art print to hang on the wall over your living room?

Keyword search essentially assumes (1) you know what you want, (2) know that it exists, and (3) have keywords that accurately characterize it. That's often the case, and keyword search commonly works great, well enough for Google, Bing, and, IIRC, their counterparts in Russia and China. It also long worked for the subject index of an old library card catalog.

But as in the post I replied to, it doesn't work very well when trying to go "deep".

Really, what people want is content with some meaning they have at least roughly in mind, e.g., a print that fits their artistic taste, sense of style, etc., for over the living room sofa. Well, it turns out there's some advanced pure math, not widely known, and still less widely really understood, for that.

Yes I encountered a LOT of obstacles since I wrote that post. The work is just fine; the obstacles were elsewhere. E.g., most recently I moved. But I'm getting the obstacles out of the way and getting back to the real work.


> Really, what people want is content with some meaning they have at least roughly in mind

Yes, but capturing meaning mathematically is somewhat an unsolved problem in both mathematics, linguistics and semiotics. Your post claims you have some mathematics but (obviously as it's a secret) doesn't explain what.

SEO currently relies on keywords, but SEO as a practice is humans learning. There is a feedback loop between "write page", "user types string into search engine" and "page appears at certain rank in search listing". Humans are going to iteratively mutate their content and see where it appears in the listing. That will produce a set of techniques that are observed to increase the ranking.


> Yes, but capturing meaning mathematically is somewhat an unsolved problem in both mathematics, linguistics and semiotics.

I've been successful via my search math. For your claim, as far as I know, you are correct, but actually that does not make my search engine and its math impossible.

> That will produce a set of techniques that are observed to increase the ranking.

Ranking? I can't resist, to borrow from one of the most famous scenes in all of movies: "Ranking? What ranking? We don't need no stink'n ranking".

Nowhere in my search engine is anything like a ranking.


So, do you only ever display one single result? Or do you display multiple results? Because if you display multiple results, they will be in some visual order, whether that's top to bottom or left to right, and that is a ranking.

People pay tens or even hundreds of thousands of dollars to move their result from #2 to #1 in the list of Google results.


> #2 to #1 in the list of Google results.

My user interface is very different from Google's, so different there's no real issue of #1 or #2.

Actually that #1 or #2, etc. is apparently such a big issue for Google, SEO, etc. that it suggests a weakness in Google, one that my work avoids.

You will see when you play a few minutes with my site after I announce the alpha test.

Google often works well; when Google works well, my site is not better. But the post I responded to mentions some ways Google doesn't work well, and for those and some others my site stands to work much better. I'm not really in direct competition with Google.


stop vaguebooking and post it up on HN. if you're comfortable with where the product is at the current moment then share it. it will never be finished so share it today.


I was not really announcing a new Web site, which I do intend to do at HN once my omelet is ready for actual eating, but just replying to the post

https://news.ycombinator.com/item?id=22092248

of rahulchhabra07.

His post was interesting to me since it mentioned some of the problems that I saw and that got me to work on my startup. And my post might have been interesting to him since it confirms that (i) someone else also sees the same problems and (ii) has a solution on the way.

For explaining my work fully, maybe even going open source, lots of people would say that I shouldn't do that. Indeed, that anyone would do a startup in Internet search seems just wack-o since they would be competing with Google and Bing, some of the most valuable efforts on the planet.

So that my efforts are not just wack-o, (i) I'm going for a part of search, e.g., solving the problems of rahulchhabra07, not currently solved well; (ii) my work does not really replace Google or Bing when they work well, and they do, what, some billions of times a second or some such?; (iii) my user interface is so different from that of Google and Bing that at least first cut my work would be like combining a racoon and a beaver into a racobeaver or a beavorac; and (iv) at least to get started I need the protection of trade secret internals.

Or, uh, although I only just now thought of this, maybe Google would like my work because it might provide some evidence that they are not all of search and don't really have a monopoly, an issue in recent news.


Nah - unbreakable encryption is actually possible with one time pads.

The only way SEO could be impossible is if there was no capacity to change search ranking no matter what - which would be both useless and impossible.


Get feedback before you launch. I'd be happy to test it.


Thanks. I intend to announce the alpha test here at HN, and I will have an e-mail address for feedback (already do -- at least got that little item off my TODO list although it took 36 hours of on the phone mud wrestling with my ISP to set it up).


An AI crawler is needed.


From my end, it looks like google search is very strongly prioritising paid clients and excluding references to everything else. Try a search or view from maps, it shows a world that only includes google ad purchasers.

Google has become a not very useful search - certainly not the first place I go when looking for anything except purchases. They've broken their "core business".


It also favors itself. I searched for "work music". The first 9 results are from YouTube.


This should probably be a separate submission but why is search so bad everywhere?

- Confluence: Native search is horrible IME

- Microsoft Help (Applications): .chm files. Need I say more?

- Microsoft Task Bar: Native search okay and then horrible beyond a few key words and then ... BING :-(

- Microsoft File Search: Even with full disk indexing (I turned it on) it still takes 15-20 minutes to find all jpegs with an SSD. What's going on there?

- Adobe PDFs: Readers, all versions. What? You mean you want to search for TWO words? Sacrilege. Don't do it.

Seriously though with all the interview code tests bubble sort, quick sort, bloom filters, etc. Why can't companies or even websites get this right?

And I agree with other commenters as far as Google, Bing, DDG, or other search sites: it's been going downhill, but the speed of uselessness is picking up.

The other nagging problem (at least for me) is that explicit searches which used to yield more relevant results are now front-loaded with garbage. If I'm looking for a datasheet on an STM (STMicroelectronics) chip and I start the search with STM, as of today STM is no longer relevant (it is, meaning it shows up after a few pages). But wow, it seems like the SEOs are winning; companies that use this technique won't get my business.


Or MacOS Spotlight. Good lord. Most common occurrence: searching for Telegram, an app I have open 24/7 and interact with dozens of times a day.

CMD+Space

"T": LaTeXIT.app (an app I have used fewer than a dozen times in two years)

"E": LaTeXIT.app

"L": Telegram.app

"E": Electrum.app (how on earth??)

"G": telemetry.app (an app which cannot even be run)

"RAM" : Telegram

Similar experience searching for most apps, files, and words. It's horrendous.

MacOS Mojave 10.14.6 on a MacBook Pro (Retina, 15-inch, Mid 2015)


Yeah, I never understood that. Why don't they order it by usage frequency? They have those metrics, since they introduced Screen Time.
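A frequency-ordered launcher really is only a few lines. A toy sketch with hypothetical usage counts mirroring the Spotlight example above (this is an illustration of the idea, not how Spotlight actually ranks):

```python
def rank_apps(query, apps, usage_counts):
    """Return apps matching `query` (case-insensitive substring),
    most-launched first. usage_counts maps app name -> launch count."""
    q = query.lower()
    matches = [a for a in apps if q in a.lower()]
    return sorted(matches, key=lambda a: usage_counts.get(a, 0), reverse=True)

# Hypothetical data: Telegram launched daily, LaTeXIT almost never.
apps = ["LaTeXIT", "Telegram", "Electrum", "telemetry"]
usage = {"Telegram": 500, "LaTeXIT": 3, "Electrum": 12}
ranked = rank_apps("te", apps, usage)
# -> ['Telegram', 'LaTeXIT', 'telemetry']
```

With any usage signal at all, "T" would surface Telegram immediately instead of a rarely-used app.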


Anything Microsoft other than Office+Outlook sucks. I don't know about Azure, though, as I have not endured it yet.

Adobe wants to have you by your balls the moment you install their installer :-) I keep a separate computer for Adobe stuff just for that reason. Actually to run some MS junk too.

>Seriously though with all the interview code tests bubble sort, quick sort, bloom filters, etc. Why can't companies or even websites get this right?

I have seen some of the stinkiest stuff created by people who will appear smartest in any test these companies can throw at them. Some people are always gambling/gaming and winging it. They leave a trail... unfortunately.


I envy your luck if you think Office and Outlook don't suck! Performance and reliability are terrible in my experience.


Performance and reliability are indeed terrible. It is a mystery that a word processor crashes so often and takes tens of seconds just to quit. But the fact is that they get the job done, and I haven't seen any decent alternatives to Word, Excel, and for that matter even Outlook. If you know something reasonably close, then please share.


> - Microsoft File Search: Even with full disk indexing (I turned it on) it still takes 15-20 minutes to find all jpegs with an SSD. What's going on there?

I use this software utility called Search Everywhere; it's surprisingly good, fast, and fairly accurate most of the time :)


> - Microsoft File Search: Even with full disk indexing (I turned it on) it still takes 15-20 minutes to find all jpegs with an SSD. What's going on there?

Does turning it off speed it up? I think disk indexing (the way Windows does it) is a remnant from HDD times, and might make things worse when used together with a modern SSD.

> Adobe PDFs: Readers all versions. What? You mean you want to search for TWO words. Sacrilege. Don't do it.

If you're just viewing and searching PDFs (and don't have to fill out PDF forms on a regular basis), check out SumatraPDF. Fastest PDF reader on Windows I've come across so far.


> it still takes 15-20 minutes to find all jpegs with an SSD. What's going on there?

What is going on there? I'm working on a file system indexer in Go, and walking the tree and mapping extensions to mimetypes runs at several thousand images a second, even over NFS. Windows is full of "why is this taking so long?" headscratchers.
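For scale, the walk-and-classify loop described above is tiny in any language (the parent's indexer is in Go; this is the same idea as a Python stdlib sketch). The extension lookup is just a table hit, so the directory walk is the only real cost:

```python
import mimetypes
import os

def find_images(root):
    """Walk `root` and yield paths whose file extension maps to an
    image/* mimetype. No file contents are read at all."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            mime, _ = mimetypes.guess_type(name)
            if mime and mime.startswith("image/"):
                yield os.path.join(dirpath, name)
```

On an SSD this kind of loop finishes a whole home directory in seconds, which makes 15-20 minutes for an indexed search all the more puzzling.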


My gut feeling is search is bad everywhere because no one provides a pure-text API to the content. Cleaning data is hard, and it's easier to chuck in all the text blasted off an HTML page than to exclude everything non-signal.

I have no answers for Microsoft File Search, it never returns any results for me, I wonder if they even tested it sometimes.
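That "chuck in all the text blasted off an HTML page" step can at least be done cheaply. A minimal visible-text extractor using Python's stdlib html.parser, keeping text nodes and dropping script/style contents (real cleaning, e.g. stripping nav boilerplate, is the genuinely hard part this sketch skips):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping everything inside <script>/<style>."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0      # nesting level inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

def page_text(html):
    p = TextExtractor()
    p.feed(html)
    return " ".join(p.chunks)

page_text('<p>Hello <b>world</b></p><script>var x = 1;</script>')
# -> 'Hello world'
```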


With Google you can use their search operators to find some relevant content, and I wish more search engines would support the minus (-) operator to exclude content with certain keywords.

https://ahrefs.com/blog/google-advanced-search-operators/


Google has definitely stopped being able to find the things I need.

Pasting stack traces and error messages. Needle in a haystack phrases from an article or book. None of it works anymore.

Does this mean they are ripe for disruption or has search gotten harder?


> Pasting stack traces and error messages.

I cannot fathom the number of times I've pasted an error message enclosed by quotes and got garbage results, and then an hour of troubleshooting and searching later I come across a GitHub/bugtracker issue, which was nowhere in the search results, where the exact error message appears verbatim.

The garbage results are generally completely unrelated stuff (a lot of Windows forum posts) or pages where a few of the words, or similar words, appear. Despite the search query being a fixed string, not only does Google fail to find a verbatim instance of it, but instead of admitting this, they return nonsense results.

> Needle in a haystack phrases from an article or book.

I can confirm this part as well, searching for a very specific phrase will generally find anything but the article in question, despite it being in the search index.

Zero Recall Search.


It seems like pasting a very long search query should actually make it easier for the search engine to find relevant results, but the fact that this doesn't happen suggests that the search query handler is being too clever and getting in the way.


> a Github/bugtracker issue, which was nowhere in the search results, were the exact error message appears verbatim.

Did you put the error message in quotes? I've never had this problem.


I, too, have pasted error messages verbatim into Google queries only to have garbage returned. I did include the error message in quotes. I started filtering sites from the results, e.g. `-site:quora.com site:stackoverflow.com site:github.com`, etc., to start to get a hint of other developers with similar issues and/or some bug reports and/or documentation and/or source code.


> `-site:quora.com site:stackoverflow.com site:github.com`

A mock-google that excludes quora and optionally targets stackoverflow/github sounds useful.


Could it be that there just isn't a single page in the web with the exact error?


My guess is that suppressing spammy pages got too hard. So they applied some kind of big hammer that has a high false positive rate. You're getting the best of what's left.

Maybe also some quality decline in their gradual shift to less hand weighted attributes and more ML.


My guess is that Google et al are all hell-bent on not telling you that your search returned zero results. They seem to go to great lengths to make sure that your results page has something on it by any means necessary, including: searching for synonyms for words I searched for instead of the specific words I chose, excluding words to increase the number of results (even though the words they exclude are usually the most important to the query), trying to figure out what it thinks I asked for instead of what I actually asked for.

I further suppose a lot of that is that The Masses(tm) don't use Google like I do. I put in key words for something I'm looking for. I suspect that The Masses(tm) type in vague questions full of typos that search engines have to try to parse into a meaningful search query. If you change your search engine to cater to The Masses(tm), then you're necessarily going to annoy the people that knew what they were doing, since the things that they knew how to do don't work like they used to (see also: Google removing the + and - operators).
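The gap between "the words I chose" and what engines actually do can be shown with a toy matcher: strict conjunctive matching (every term required) versus the kind of relaxation being described (any few terms are enough, guaranteeing a results page). The document and query here are made up:

```python
def strict_match(query, doc):
    """Every query term must appear in the document."""
    doc_words = set(doc.lower().split())
    return all(t in doc_words for t in query.lower().split())

def relaxed_match(query, doc, min_hits=1):
    """Engine-style relaxation: any min_hits matching terms qualify,
    so the results page is never empty -- even when it should be."""
    doc_words = set(doc.lower().split())
    hits = sum(t in doc_words for t in query.lower().split())
    return hits >= min_hits

doc = "fix for obscure linker error on alpine"
strict_match("obscure linker error windows", doc)   # -> False ('windows' missing)
relaxed_match("obscure linker error windows", doc)  # -> True (filler result)
```

Strict matching says "no results" honestly; relaxed matching pads the page with near-misses, which is exactly the complaint above.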


I was going to reply with something along the same lines. Dropping the keyest keywords is a particularly big pet peeve of mine.

For those "needle in a haystack" type queries, instead of pages that include both $keyword1 and $keyword2, I often get a mix of the top results for each keyword. The problem is compounded by news sites that include links to other recent stories in their sidebars. So I might find articles about $keyword1 that just happen to have completely unrelated but recent articles about $keyword2 in the sidebar.

It also appears that Google and DDG both often ignore "advanced" options like putting exact phrase searches in quotation marks, using a - sign to exclude keywords, etc.

None of this seems to have cut down on SEO spam results either, especially once you get past the first page or two of results.

I suspect it all comes down to trying to handle the most common types of queries. Indeed, if I'm searching for something uncomplicated, like the name of the CEO of a company or something like that, the results come out just fine. Longtail searches probably aren't much of a priority, especially when there's not much competition.


Surely most engineers want the power of strict searching and fewer of the comforts of always getting filler results, right?

So... is there an internal service at Google that works correctly but they're hiding from the world?

It might be useful for Google to make different search engines for different types of people. The behaviors of people are probably multi-modal, rather than normally distributed along some continuum where you should just assume the most common behavior and preferences.

It would even be easier to target ads...

Or maybe this doesn't exist and spam is too hard.


> They seem to go to great lengths to make sure that your results page has something on it by any means necessary

You just described how YouTube's search has been working lately. When you type in a somewhat obscure keyword - or any keyword, really - the search results include not only the videos that match, but videos related to your search. And searches related to your keywords. Sometimes it even shows you a part of the "for you" section that belongs to the home page! The search results are so cluttered now.


Searching gibberish to try to get as few results as possible.

I got down to one with "qwerqnalkwea"

"AEWRLKJAFsdalkjas" returns nothing, but youtube helpfully replaces that search with the likewise nonsensical "AEWR LKJAsdf lkj as" which is just full of content.


> I put in key words for something I'm looking for. I suspect that The Masses(tm) type in vague questions full of typos that search engines have to try to parse into a meaningful search query.

Yeeaap, sometime in gradeschool - I think somewhere around 5th grade, age 11 or so, which would be around 1999 - we had a section on computers, where we'd learn the basics about how to use them. One of the topics I remember was "how to do web searches", where a friend was surprised at how easily I found what I was looking for - the other kids had to be trained to use keywords instead of asking it questions.


It's surprisingly easy to get zero results returned pasting cryptic error messages. It doesn't mean there is nothing, though. Omit half the string, and there are the dozen Stack Overflow threads with the error. Maybe it didn't read over the line break on Stack Overflow or something, but I haven't tested anything.


Tyranny of the minimum viable user.


Two anecdotes: It’s really fascinating.

1. My work got some attention at CES so I tried to find articles about it. Filtering for items from the last X days and searching for a product name found pages and pages of plagiarized content from our help center. Loading any one of the pages showed an OS-appropriate fake “your system is compromised! Install this update” box.

What’s the game here? Is someone trying to suppress our legit pages, or piggybacking on the content, or is that just what happens now?

2. I was looking for some OpenCV stuff and found a blog walking through a tutorial, except my spidey sense kept going off because the write-up simply didn’t make sense with the code. Looking a bit further, I found that some guy’s really well-written blog had been completely plagiarized and posted on some “code academy tutorial” sort of site, with no attribution. What have we come to?


The first seems big right now, on weird subdomains of clearly hacked sites. E.g. some embedded Linux tutorial on a subdomain of a small-town football club.


Yup. Entertainingly I just saw an example of the “lying date” the original article pointed out: according to Google the page is from 17 hours ago. However, right next to this it says June xx, 2018. Really?


Well that “big hammer” so to speak is that they tend to favor sites that have a lot of trust and authority.

Someone mentioned that the sites that have the answer are typically buried in the results. That’s because they tend to favor big brands and authoritative sites. And those sites oftentimes don’t have the answer to the search query.

Google’s results have gotten worse and worse over the years.


This! I think this is the biggest piece of the puzzling issue.

Was it the Panda update, or that one plus the one after? It took out so much of the web and replaced it with "better netizens" who weren't doing this bad thing or that bad thing.

Several problems with that - 1 - they took out a lot of good sites. Many good sites did things to get ranked and did things to be better once they got traffic.

The overbroad ban hammer took many down, including many people that likely paid an SEO firm not knowing that SEO firms were bad in Google's eyes (at the time). So lots of mom-and-pops and larger businesses got smacked down and put out of the internet business, just like how many blogs have shut down.

Of course, local results taking a lot of search space and the instant answers (50% of searches never get a click because Google gives them the answer right on the results page, often stolen from a site) are compounding this.

They tried having the disavow tool to make amends, but the average small business doesn't know about these things, and getting help on the webmaster forum is a joke even if you are tech-inclined; imagine what an experience it is for small business owners.

I miss the days of Matt Cutts warning people "get your Press Releases taken down or nofollowed or it's gonna crush you soon" - problem is most of the people who were profiting from no-longer-allowed seo techniques were not reading Matt's words.

I also appreciated his saying 'tell your users to bookmark you, they may not find you in google results soon' - yeah, at least we were warned about it.

The web has not been the same since those updates, and it's gotten worse since. This does help adwords sell and the big companies that can afford them though.

In these ways google has been kind of like the walmart of the internet, coming in, taking out small businesses, taking what works from one place and making it cheap at their place.

I'd much rather have the results of pre-Penguin days and let the surfers decide by choosing to remain on a site that may be good that also had press releases and blog links... rather than losing all the sites that had links on blogs. I am betting most of the users out there would prefer the results of days past as well.


I've been using DDG as a good enough search engine for most things, but when I sometimes fall back to Google, it blows me away how many ads are on the page pretending to be results!


'if not ddg(search) { ddg("!g " + search) }' has been my go-to method for a while now; but as time has progressed, the results from DuckDuckGo have either been getting better, or the Google results have been getting worse, because usually if I can't find it on DDG now, I can't find it on Google either.


I use DDG by default, but I can feel myself mentally flinching unless I basically know what I'm looking for already (i.e. I know I'll end up on StackOverflow). When I'm actually _searching_, it's useless, and I'll always !g.


Same here, I actually prefer DDG to Google now, even for regional (Germany) results.

When I switched, about a year and a half ago, I felt like I was switching to a lesser quality search engine (it was an ethical choice and done because I can), that, however, gradually and constantly got better, whereas Google went the opposite path.

Nowadays I only really use Google to leech bandwidth off their maps services. Despite there being a very good alternative available, OpenStreetMap, they unfortunately appear to have limited (or at least way less than Google) bandwidth at their disposal... A pity though, because their maps are so awesome; the bicycle map layer with elevation lines is any boy scout's wet dream... but yeah, to find the nearest hairdresser, Google'll do.

Speaking of bandwidth and OSM reminds me, is there an "SETI-but-for-bandwidth-not-CPU-cycles" kind of thing one could help out with? Like a torrent for map data?

EDIT: Maybe their bandwidth problems are also more the result of a different philosophy about these things. OSM is likely "Download your own offline copy, save everybody's bandwidth and resources" (highly recommended for smartphones, especially in bandwidth-poor Germany) whereas Google is "I don't care about bandwidth, your data is worth it".


> Speaking of bandwidth and OSM reminds me, is there an "SETI-but-for-bandwidth-not-CPU-cycles" kind of thing one could help out with? Like a torrent for map data?

OSM used to have tiles@home, a distributed map rendering stack, but that shut down in 2012. There is currently no OSM torrent distribution system, but I'd like to set that up.


Google images isn't even worth using at all anymore, after that Getty lawsuit that made them remove links to images (the entire damn point of image search as far as I'm concerned..)


I think the Web just kind of stopped being full of searchable information.


Imagine if instead of kneecapping XHTML and the semantic web properties it had baked in, Google had not entered into the web browser space. We might be able to mark articles up with `<article>`, and set their subject tags to the URN of the people, places, and things involved. We could give things a published and revised date with change logs. Mark up questions, solutions, code and language metadata. All of that is extremely computer friendly for ingestion and remixing. It not only turned search into a problem we could all solve, but gave us rails to start linking disparate content into a graph of meaningful relationships.

But instead Google wanted to make things less strict, less semantic, harder to search, and easier to author whatever the hell you wanted. I'm sure it has nothing to do with making it difficult for other entrants to find their way into search space or take away ad-viewing eyeballs. It was all about making HTML easy and forgiving.

It's a good thing they like other machine-friendly semantic formats like RSS and Atom...

"Human friendly authorship" was on the other end of the axis from "easy for machines to consume". I can't believe we trusted the search monopoly to choose the winner of that race.


I work for Google but not on search.

I think in this case the semantic web would not work unless there was some way to weed out spam. There are currently multiple competing microdata formats out there that enable you to specify any kind of metadata, but they still won't help if spammers fill those in too.

Maybe some sort of webring of trust where trusted people can endorse other sites and the chain breaks if somebody is found endorsing crap? (as in, you lose trust and everybody under you too)
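That chain-breaking endorsement idea can be sketched as a graph walk over an endorsement graph: trust flows from a set of trusted roots, and banning an endorser cuts off everything reachable only through them. A toy version with a hypothetical webring (real trust propagation would also need revocation, Sybil resistance, and handling of multiple endorsement paths):

```python
def trusted(roots, endorsements, banned):
    """Walk the endorsement graph from trusted roots. A banned endorser's
    whole subtree loses trust unless it is reachable via another clean chain."""
    seen = set()
    stack = [r for r in roots if r not in banned]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for child in endorsements.get(node, []):
            if child not in banned:
                stack.append(child)
    return seen

# Hypothetical webring: root endorses a and b; b endorses spammy c.
graph = {"root": ["a", "b"], "b": ["c"], "a": []}
trusted({"root"}, graph, banned=set())   # -> {'root', 'a', 'b', 'c'}
trusted({"root"}, graph, banned={"b"})   # -> {'root', 'a'} (c's chain broke)
```

Banning "b" automatically untrusts "c" too, which is the "everybody under you" property described above.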


> I think in this case semantic web would not work, unless there was some way to weed out spam.

That's not so hard. It's one of the first problems Google solved.

PageRank, web of trust, pubkey signing articles... I'd much rather tackle this problem in isolation than the search problem we have now.

The trust graph is different from the core problem of extracting meaning from documents. Semantic tags make it easy to derive this from structure, which is a hard problem we're currently trying to use ML and NLP to solve.


>Semantic tags make it easy to derive this from structure

HTML has a lot of structure already (for example all levels of heading are easy to pick out, lists are easy to pick out), and Google does encourage use of semantic tags (for example for review scores, or author details, or hotel details). For most searches I don't think the problem lies with being able to read meaning - the problem is you can't trust the page author to tell you what the page is about, or link to the right pages, because spammers lie. Semantic tags don't help with that at all and it's a hard problem to differentiate spam and good content for a given reader - the reader might not even know the difference.


> PageRank, web of trust, pubkey signing articles...

What prevents spammers from signing articles? How do you implement this without driving authors to throw their hands in the air and give up?


In the interests of not causing a crisis when Top Level Trust Domain endorses the wrong site and the algorithm goes, "Uh uh," (or the endorsement is falsely labeled spam by malicious actors, or whatever), maybe the effect decreases the closer you are to that top level.

But that's hierarchical in a very un-web-y way... Hm.


The internet is still kind of a hierarchy though, "changing" "ownership" from the government DARPA to the non-profit ICANN.

And that has worked... quite fine. I have no objections (maybe they're a bit too liberal with the new TLDs).

Most of the stuff that makes the hierarchies seem bad are actually faults of for-profit organizations (or other unsuited people/entities) being at the top, and not just that someone is at the top per se. In fact, in my experience, and contrary to popular expectation, when a hierarchy works well, an outsider shouldn't actually be able to immediately recognize it as such.


> Imagine if instead of kneecapping XHTML and the semantic web properties it had baked in, Google had not entered into the web browser space. We might be able to mark articles up with `<article>`, and set their subject tags to the URN of the people, places, and things involved. We could give things a published and revised date with change logs. Mark up questions, solutions, code and language metadata.

Can you explain in technical details what you think was lost by Google launching a browser or what properties were unique to XHTML?

Everything you listed above is possible with HTML5 (see e.g. schema.org) and has been for many years so I think it would be better to look at the failure to have market incentives which support that outcome.


Good machine-readable ("semantic") information will only be provided if incentives aren't misaligned against it, as they are on much of the commercial (as opposed to academic, hobbyist, etc.) Web. Given misaligned incentives, these features will be subverted and abused, as we saw back in the 1990s with <meta description="etc."> tags and the like.


I don't think there's any reason to think google was responsible for the semantic web not taking off. People just didn't care that much. It may have been a generally useful idea, but it didn't solve anyone's problem directly enough to matter.


It wouldn’t matter. 0.0001% of content authors would employ semantic markup. Everyone else would continue to serve up puréed tag soup.


If WordPress outputs semantic output that instantly gives you a lot more than 0.0001%. The rest would follow as soon as it improves discoverability of their content


Wordpress can't magically infer semantic meaning from user input any better than Google can. The whole point of the semantic web is to have humans specifically mark their intention. A better UI for semantic tagging would help for that, but it would still be reliant on the user clicking the right buttons rather than just using whichever thing results in the correct visual appearance.


> 0.0001% of content authors would employ semantic markup.

You don't think we'd have rich tooling to support it and make it easy to author?

Once people are using it with success, others will follow.


The breakthrough would be when Google were to rank pages with proper semantic markup higher. Just look at AMP.

(Of course that won't ever happen, but that's what would be needed.)


Did you try putting them in quotes?

EDIT: I don't know why this is being downvoted. This is a genuine question to understand whether the problem is the size of the index or the fuzzy matching that search engines do.


Quotes don't work reliably anymore; this is a big part of the problem. Googlers have been really busy the last 10 years doing everything except:

- fixing search (it has become more and more broken since 2009, possibly before. Today it works more or less like their competitors did before: a random mix of results containing some of my keywords.)

- fixing ads (Instagram should have way less data on me and yet manages to present me with ads that I sometimes click instead of ads that are so insulting I go right ahead and enable the ad blocker I had forgotten.)

- saving Reader

- etc


> I don't know why this is being downvoted.

tbh, it's one of those "Are you sure you're not an idiot?" replies.


Google blatantly disregard quotes.


I think the behavior is more complex. I do get disregarded quotes from time to time so I typically leave them off. However, for the query 'keyword1 keyword2', if I get a lot of keyword1 results with keyword2 struck through, and I search again with keyword2 in quotes, it works as expected.


Reference?


Will you take my word for it?

They not only disregard quotes but also their own verbatim setting.


Asking for a reference helps:

- Establish the behaviour as documented.

- In representing and demonstrating this to others.

It's not that I doubt your word, but that I'd like to see a developed and credible case made. Because frankly that behaviour drives me to utter frustration and distraction. It's also a large part of the reason I no longer use, nor trust, Google Web Search as my principal online search tool.


I see. I'll try to make a habit out of collecting those again.

That said, I might have something on an old blog somewhere. I'll see if I can find it before work starts...

Edit: found it here http://techinorg.blogspot.com/2013/03/what-is-going-on-with-... . It is from 2013 and had probably been going on for a while already at that point.

Edit2: For those who are still relying on Google, here's a nice hack I discovered that I haven't seen mentioned by anyone else:

Sometimes you might feel that your search experience is even worse than usual. In those cases, try reporting anything in the search results and then retry the same search 30 minutes later.

Chances are it will now magically work.

It took quite a while for me to realize this and I think in the beginning I might not have realized how fast it worked.

It seemed totally unrealistic, however, that a fix would have been created and a new version deployed in such a short time, so my best explanation is that they are A/B-testing some really dumb changes and then pulling whoever complains out of the test group.

Thinking about it this might also be a crazy explanation for why search is so bad today compared to ten years ago:

There's no feedback whatsoever, so most sane users probably give up talking to the wall after one or two attempts. This leaves Google with the impression that everyone is happy, so they just continue on the path back to becoming the search engines they replaced.


Thanks.

I'm getting both mixed experiences and references myself looking into this. Which is if anything more frustrating than knowing unambiguously that quoting doesn't work.

I've run across numerous A/B tests across various Google properties. Or to use the technical term: "user gaslighting".


If they were ripe for disruption, and disrupting them were as easy as returning better search results, then I suppose all the other functioning businesses that have a stake in web search would already be doing that disrupting.

Search disrupted catalogs. What will disrupt search?


Boutique hand crafted artisanal catalogs?

Not joking I have a feeling subject specific topics will be further distributed based on expertise & trust.


That's exactly what github's Awesome lists are: Decentralized, democratized handcrafted subject-specific catalogs


If they became important sources of information outside technically competent people I suppose we would end up with a bunch of Awesome lists of Content Farms!


Return of the Yahoo! Directory and DMOZ? Heh.


> Boutique hand crafted artisanal catalogs?

I think those are called books. ;-)


Pubmed is an excellent example of a boutique search engine.


So the old Yahoo web index, basically.


Are you sure there's a page on the web that has the stack trace you search for? Maybe there just isn't anything.


Perhaps expectations have risen over time


In my opinion, Google is getting worse constantly, which boils down to basically the following aspects for me:

1. I don’t like the UI anymore. I preferred the condensed view, with more information and less whitespace.

2. Popping up some kind of menu when you return from a search results page shifts down the rest of the items resulting in me clicking search links I am not interested in.

3. It tries to be smarter than me, which it fails in understanding what I am searching for. And by “understanding” I basically mean to honor what I typed and not replacing it with other words.

I try to use DDG more often but Google gives me the best results most of the time if I put in more time.


Yeah, number 3 really pisses me off recently. If I type in 3 words I would like to search by those 3 words. What ends up happening is Google just decides that it's too much of a hassle or that I've made a mistake and just searches using 2. So now I have to input all the words in quotes so that it works like it's supposed to in the first place.

This functionality has literally never helped me during a search. Not once.


"try" "putting" "the" "words" "in" "respective" "quotes" "like" "this"


That's what I'm doing, sorry it wasn't clear ;)


You don't even seem to be able to simply get the URL of a search result any more: some hierarchical token thing is used instead to display each search result's address, and copying any link just gives you:

https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&c...

with the url buried somewhere in the GET params.


Everything that you're looking for is (fortunately) still in the HTML source code. When Google started messing with the URLs I wrote a filter for my MITM proxy to put them back. Then it recently changed the format so I had to change the filter again. It's annoying for sure.


Agreed on all three but especially #1. The "modern" web is full of so much whitespace it's infuriating


I've been thinking about this for years[1]. The truth is, what Google solved was parsing the search query, not identifying the best results. In fact, Google is not incentivized to give you the best results, they are designed to maximize their revenue, derived from getting you to view / click ads.

Google is not a search company, they are an advertising company. The more searches you make, the more revenue they make. Their goal is to quickly and often get you to search things. As long as you keep using their platform, the more you search the better.

[1] https://austingwalters.com/is-search-solved/


The ‘Google is an advertising company’ is said often on HN. I agree to some extent but, doesn’t that imply that every newspaper company is also just an advertising company? Google solves a real problem and this works well, for them, with an advertising based revenue model. Do they compromise their search to that end? Probably. Do newspapers? Hopefully not, but maybe. To me, that doesn’t make them advertising companies. Apologies if this is pedantic.


Of course they are. Page ads are probably as old as print. Most reviews are ads. Travel sections are ads. There are real estate sections, and classifieds as well. Perhaps the most honest reporting is found in the local sports section.


> but, doesn’t that imply that every newspaper company is also just an advertising company?

Historically, at least, they sold subscriptions.


https://www.opendemocracy.net/en/opendemocracyuk/why-i-have-...

Hsbc demanded The Telegraph pull negative stories or they would pull their advertising.

All newspapers are full of stories about property "investment" and have a separate segment once a week for paid advertising of property.


Subscriptions for the “modern newspaper” did not pay the bills, but were proof that people were actually reading the newspaper.

Prior to that there were papers which did indeed make their money from subscriptions. But their content was different as well: explicitly ideological and argumentative. The NYT or Wapo idea of neutral journalism was a later development.


According to this, the newspaper subscription only covered about 18%. The rest was from advertising.

https://idiallo.com/blog/we-never-paid-for-journalism


I used to work for a company that did product search. Search is hard. The original idea behind PageRank was really insightful and made search a lot easier, at least until SEO arrived, and then people stopped linking as much as they used to. The other trick Google figured out was good ML on query results, so all the popular queries have decent results. That still leaves search being hard.


Is it time for paid search engines? Make users vote with their wallets and pay for the eternal arms race. Problem is, whoever is behind something like that would have to start with an already sufficiently superior experience (or massive geek cred) to make people pay from the early days. Maybe going back to manually curated results of a subset of topics would work? Or some Stack Overflow-esque model of user powered metadata generation?


No. Just because you pay for something doesn't mean you aren't also the product. The other ideas are specifically ones which failed. It didn't even work that well in the 90s. User metadata isn't even a solution in itself but an optimization layer at best.


Simpler idea:

Paid search engine that ranks sites based on how often the users click results from that site (and didn't bounce, of course). The fact that it's paid prevents sybil attacks (or, at least, turns a sybil attack into a roundabout way of buying ads).

Of course, at this point, you are now the product even though you paid. But it's a tactic that worked for WoW for ages.
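The click-minus-bounce ranking described above can be sketched in a few lines. This is purely illustrative: the `Click` record, the paid-user gate, and the scoring rule are all invented for the sake of the example, not any real engine's algorithm.

```python
# Hypothetical sketch of ranking sites by non-bounced clicks from paying
# users. All names and weights here are invented for illustration.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Click:
    site: str
    paid_user: bool   # only paying accounts count, to blunt sybil attacks
    bounced: bool     # user came straight back to the results page

def site_scores(clicks):
    """Score each site by the share of non-bounced clicks from paid users."""
    totals = defaultdict(int)
    good = defaultdict(int)
    for c in clicks:
        if not c.paid_user:
            continue          # free traffic is ignored entirely
        totals[c.site] += 1
        if not c.bounced:
            good[c.site] += 1
    return {s: good[s] / totals[s] for s in totals}

clicks = [
    Click("example.org", True, False),
    Click("example.org", True, True),
    Click("spam.biz", False, False),   # unpaid: ignored, so sybils cost money
]
print(site_scores(clicks))  # {'example.org': 0.5}
```

The paid-account gate is doing the anti-sybil work here: flooding the system with fake clicks means buying accounts, which, as the comment notes, just becomes a roundabout way of buying ads.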


Google already includes clicks and bounces in its ranking factors.


Probably not. The problem with current search engines is that they need some way to rank pages, and hence are inherently susceptible to clever SEO (which curates pages to rank high for certain types of queries).

So having a paid search engine does not fully solve the SEO problem. Having no ads does not take away the SEO problem of boosting pages to the top.


If the paid-for search engine ignores SEO "optimizations" and actually ranks on content, AND these results prove better and attract actual paying users, then we don't have to wait for SEO to die. Just as SEO rose to prominence to win Google result rankings because that's where the users were, the sites would stop the SEO crap.

Getting this hypothetical site to get users is the real problem. Same thing with getting users to a Facebook alternative.


I somehow doubt that all of Google’s internal decisions are as simple as “well we’re an ads company, let’s just not worry about search quality.”


Why do you doubt this? As a company in total, I believe that all decisions are made on "will this sell more ads". At the same time, I believe that some of the ideas from the software devs are born from doing something cool. It then gets rolled into "how can we use this new thing to sell more ads"?


Because search quality drives ads.

There were search engines around but when Google came out with superior search results everyone switched and those search engines quickly vanished. There are search engines in direct competition with Google today. If Google does not provide the best service there's an extremely low switching cost. Bing, Baidu, DuckDuckGo et al would be happy to take your traffic.


This article is essentially just complaining that DDG and Google don't have special parsing for reddit pages ("How come it doesn't know that thread didn't get many upvotes?", "How come it thinks some change to the site's layout was an update to the page?")

Maybe if you want to search reddit, the best search engine is the search bar on reddit.com.


But those pretend complaints aren't his complaints. His complaint is "why does this archived reddit page from six years ago without any updates come up on search results for 'things within the past month'?"

Which is... reasonable.


It is reasonable. It is also likely that whatever meta information reddit is sending back (in headers or tags) is probably not dated correctly for the time of the origin post.

Google COULD offer more time machine features and perform diffing on pages. But a reddit "page" will always have content changes, as everything is generated from a database and kept fresh on the page. The ONLY metric therefore Google could use would be whatever meta tag or header tag that reddit provides.


> It is also likely that whatever meta information reddit is sending back (in headers or tags) is probably not dated correctly for the time of the origin post

That doesn't explain why Google lists the old search results as being from this month, while Duck correctly lists them as being from years past.


Google does cache results, and could notice changes by comparing against the cache, then claim the page was updated sometime in between.

I've always wondered how search engines get hold of timestamps. Locally with a cached sample, like I explained above? By parsing a page's content or some metadata? It's not like the HTTP protocol sends me a "file created/last modified date" along with the payload, does it?
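For what it's worth, HTTP does define an optional `Last-Modified` response header (RFC 7231), though dynamically generated pages often omit it or set it to the render time, which is exactly the noise being discussed. A minimal stdlib sketch for reading it; the function name and URL are illustrative:

```python
# Read a page's Last-Modified header, if the server bothers to send one.
from email.utils import parsedate_to_datetime
from urllib.request import urlopen

def fetch_last_modified(url):
    """Return the Last-Modified header as a datetime, or None if absent."""
    with urlopen(url) as resp:
        value = resp.headers.get("Last-Modified")
    return parsedate_to_datetime(value) if value else None

# The header uses the standard HTTP date format, e.g.:
stamp = parsedate_to_datetime("Wed, 15 Jan 2020 08:30:00 GMT")
print(stamp.isoformat())  # 2020-01-15T08:30:00+00:00
```

Even when present, the header describes when the response was generated, not when the underlying post was written, so a crawler still can't trust it for forum content.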


> It is reasonable. It is also likely that whatever meta information reddit is sending back (in headers or tags) is probably not dated correctly for the time of the origin post.

That could explain the first screenshot, but definitely not the second, where google has it tagged as years old.


That's DDG, not google.


Seems like an easy solution to this problem would use two functions.

One function takes the output of the page and renders it so that only what's user-visible actually gets indexed. So no headers, no JSON data, no nothing, unless it's actually in the final outcome of the page when rendered. This would require jsdom or some other DOM implementation. Hardly hard for Google (Chrome) to achieve, and it's been done multiple times.

The second function does the same call twice, passing the page to function one each time, then compares the two. If you make two calls right next to each other and some data is different, you discard that data from your search index. Instead you only index data that appears in both calls.

Now you don't have the issue of "dynamic content" anymore...
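A minimal sketch of the two-pass scheme described above. `fetch_rendered_text` is a stand-in for a real headless-browser render (the DOM step mentioned earlier); here it's faked with a small iterator so the diffing logic is visible:

```python
# Fetch-twice-and-diff: index only content that is identical across two
# back-to-back renders, so per-request noise (ads, timestamps) drops out.
def stable_content(fetch_rendered_text, url):
    first = fetch_rendered_text(url).splitlines()
    second = set(fetch_rendered_text(url).splitlines())
    # Keep lines present in both renders, preserving the first render's order.
    return [line for line in first if line in second]

# Simulated renders of the same URL, differing only in the ad slot.
renders = iter([
    "Post title\nGreat comment\nAd #1234",
    "Post title\nGreat comment\nAd #9876",
])
print(stable_content(lambda url: next(renders), "https://example.com"))
# ['Post title', 'Great comment']
```

As the reply below notes, real dynamic content often changes on a cadence of minutes or days rather than seconds, so the two fetches would likely need to be spaced out per site rather than made back to back.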


Typically dynamic content doesn't change from second to second, it changes after 5 minutes or an hour or 1 day, actually it is extremely site specific too.

But I do like your idea.

To go a bit further on your idea - you could apply machine learning to analyse the changes. So for example, ML could determine what is probably the "content area" of the page simply by building out an NN for each website that self-expires its training data at about 1 month (to account for redesigns over time).

The major problem will still be "ads" in the middle of the content, especially odd scroll designs ads that have a different "picture" at each scroll position, as well as video ads that are likely to be different at each screen shot.

Another form of ad being the "linked" words like when you see random words of the paragraphs becoming links that go to shitty websites that define the word but show a bunch of other ads.

I suppose Google could simply install uBlock in its training data collector harness to help with that stuff. >()


Yes, admittedly I do expect a search engine to be able to parse one of the biggest websites in the world, just how it used to for roughly a decade.

Obviously, it doesn't need to consider the upvotes directly but maybe the text inside the page. Or the date.


Does Reddit use semantics like <time> [or a microformat like <span class="dtstart"><span class="value">...] to allow proper parsing? If not then they should share at least half, probably most, of the blame IMO.
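For reference, the semantic element in question looks like `<time datetime="2014-01-19T12:00:00Z">6 years ago</time>`, and a crawler can read it with a few lines of stdlib Python. The class name and sample markup below are illustrative, not anything Reddit actually emits:

```python
# Extract machine-readable dates from semantic <time> elements.
from html.parser import HTMLParser

class TimeExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.datetimes = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "time":
            for name, value in attrs:
                if name == "datetime":
                    self.datetimes.append(value)

p = TimeExtractor()
p.feed('<p>Posted <time datetime="2014-01-19T12:00:00Z">6 years ago</time></p>')
print(p.datetimes)  # ['2014-01-19T12:00:00Z']
```

If sites emitted this consistently, the "archived thread shown as recent" failure in the article would be a trivial parse rather than a guessing game.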


The search bar on reddit.com only searches through post titles and not comments. Using Google/DDG/<any other search engine> searches through all content posted on Reddit and not just post titles. Until Reddit implements proper search, people will keep using search engines to search for content on it.


appending site:reddit.com to your search is very useful.


It doesn't have to be special parsing. Upvotes are just a heuristic for content. Good content generates upvotes, not the other way around. And site layout shouldn't fool the algorithms - or is that something that only Reddit does?


everyone knows the best way to search reddit is via google. The search engine bar in reddit is for optics only.


I prefer https://redditsearch.io/ but Google works too


Thanks, never heard of it. Already returning better results for me.


But does anyone know why search on Reddit is broken? Perhaps intentionally? I don't want to get tin foil hatty but perhaps more not readily apparent false positives = more user clicks = more revenue via ad serving?


I often wonder why some fairly large companies that rely heavily on their own website don't seem to put more than a sole web developer worth of resources into them. Reddit fits into that category for me (Reddit has 400 employees).

Initially I had the impression that search was hard to implement. However, spending a work week figuring it out with ElasticSearch, Solr and Sphinx changed my mind. Getting the solution to work with the scale of the website would take more work, but all the know-how is there, and they could put a whole team to the task for a month.


I wouldn't say it's a trivial ask, but yeah, if you have 400 employees at least assign some resources to get it right. Unless it's intentionally broken. Facebook's prioritization but also randomization of the feed is a feature not a bug.


Because search is difficult to get right. So most sites just implement a basic feature and then assume users will use Google.


Simple: because Google has a ton more data on what content is relevant on Reddit than Reddit itself does.


Given how relevant forums and discussion spaces are, one would think there would be some standardized structure for them, so you could search for comments or posts with criteria all over the internet.


There are 'standards' (and an XKCD comic) for that. See schema.org for example.

Google does use it, last time I used it there was even tooling for it in the 'check how my site is to crawl' console, whatever it's called.


https://schema.org/upvoteCount

Yet I haven't seen even one instance of this anywhere. :(


BRB, need to put `upvoteCount="10000000"` on all my blog articles. :)


Google is a TRILLION dollar company. Indexing Reddit properly would take what? 2-3 engineers? Cmon.


Would take 2-3 engineers how long? I can't really see it being more than a couple of weeks' project. But why would they? Seems they'd only do it if there's an ROI - is there?

If Google gives you what you want straight away then you leave; sure, you come back, but they want to be bad enough to keep you on there and good enough to be better than other searches. Their reach and resources cures the latter.


Google adding dedicated optimizations for popular sites seems like a bad trend.


They do it all the time.


The first time I realized that Google search was bad was when del.icio.us got big. I was an avid user - and I stopped using Google except for basic things. You could search tags on del.icio.us - and the results were incredibly good, far better than Google, especially for niche areas.

I think, unfortunately, this kind of curated, social approach to search will never be compatible with monetization by ads. I'm not quite sure how to make a search engine profitable without significantly distorting its results. Maybe, depressingly, Google is the best thing possible given the constraint of making a profit?


The worst thing for me is that I have become accustomed to search working a certain way. If I put a word in the query, it had better fracking be in the results. That's why I put it in the fracking query.

I guess whatever sauce Google applies to the query maybe works better according to some metrics for some users, but it is a source of endless frustration for me.


The problem as I see it is that popularity ranking worked fine in the pre Eternal September era for the web (~10 years ago?). I think it is safe to say that most HN users skew toward searching for more technical, intellectual or scientific topics and get frustrated by their searches getting swamped by popular topics. What I'd like to see is a check box or slider bar to exclude or adjust the weighting for popularity in a search. I don't need to see links for the latest Taylor Swift breakup or what the Kardashians are up to that appear in a technical search due to a randomly shared keyword. Often, the topics I am searching for will never be popular and current search operates on the assumption that it will.

A second problem is that now that Google likes to rudely assume it knows what you want, i.e. ignoring quotes and negation in search or even modifying keywords, it's even harder to find what you want, especially if it's not on the first page or two of results. Because of this interference, even changing your search parameters doesn't change the results much and you see essentially the same links. What I'd like to see is a search engine that will do a delta between, say, Google and Bing and drop the links common to the two services. This might lead to uncovering the more esoteric or hidden links buried by the assumptions of the algorithm.

Finally, a last problem that I see right now is the filter bubble effect. I had to search for how to spell "kardashians" in the above paragraph. Now my searches and ads for the next 2-3 weeks will be poisoned by articles or ads about the Kardashians. Taking one for team to make my point, I suppose.


> pre Eternal September era for the web (~10 years ago?)

LOL, was that by chance the time that single from Green Day was released? ;)

["Eternal September or the September that never ended is Usenet slang for a period beginning in September 1993"]: https://en.wikipedia.org/wiki/Eternal_September

EDIT: TIL Green Day's single has nothing to do with that September. Huh. I gave them more nerdcred than they deserved...


Don't get me started on Outlook search. For a company that runs a global search engine, the most prominent mail client in the world is absolutely shit for searching.


Outlook is poor in every aspect.


I increasingly just use google to search sites I already know have better content than the web at large, since it so often feels like it sucks for any depth of information.

Want to get in touch with someone? name job site:linkedin.com. Want to find how to solve a tech issue? "issue" site:stackoverflow.com, and so on.

Google's search of sites like this is pretty good (although the recency working well would be really good... but perhaps impossible to solve well), and often better than in site search. But that + very basic fact finding "how much is a lb in kg", "where is restaurant X" are pretty much all I feel it's good for. Then again, I guess it's not supposed to be an encyclopaedia (or it can be, site:wikipedia.org!)


Public information should not be filtered by one single private entity. We need a distributed system with an open standard. Search should work more like DNS... In order to get your web site indexed, you only have to publish your search URL. There should be many index cache servers, so that your search URL only gets a hit when a cached search string expires...


Using shopping engines is even worse. Google shopping and Amazon, I've been having an incredibly difficult time finding products within a price range and sorting it by price. Searching for items in quotes on Google Shopping often returns all sorts of irrelevant results. In Amazon, the 'price low to high' filter doesn't even seem to work most of the time and it includes sponsored results way out of my price range in the middle of the results. Amazon also seems to have removed any type of price range filter on the left sidebar.


I think web information access needs a new paradigm, that needs to make "search" itself irrelevant. Much like search replaced the taxonomy-based browsing (Yahoo/"Portals" of the 90s).

I don't know what that is, but there needs to be paradigmatic change.


Try using natural language phrases in search like "reddit on best phone to buy" - most search engines are NLP enabled and can give more relevant results.


This is good advice. I used to be very good at formatting queries in search-engine-ese, but that stopped working well. I finally tried switching to asking questions in normal language, and my results started getting good again. It seems that AI has reached a point where natural language is better than trying to talk like a computer.


A big part of the problem is that it returns results from years ago, even when I specify that I want only recent results. I tried a few different variations and the results were still bad.


To be fair, Google has always been trash at searching for recentness. I don't remember a time when searching "recommended X (current year)" didn't return old out-of-date info. I think Google has had a habit of returning 3-year-old pages since its inception.


Did you try clicking Search Tools and limiting the date range there instead of in the query string? Always works for me.


searching for

  best cell phone to buy site:reddit.com
and setting it to results from last month works fine...


Did you click on the results? I just checked that exact query and it's mostly the same links - the first one is the one from 6 years ago, as with my original search.


Still getting results that are outdated if you click on the links, even though the date on the search results page is recent.


I'm a bit cynical with this but I believe a lot of the "super smart" AI tech they no doubt run over all their search these days isn't actually that super smart. If we actually handled metadata properly "last month" would be a trivial, "dumb" thing to search for but apparently that's broken. Why?


You might be surprised at how many ways "last month" is encoded just in a single data set. One of the biggest problems with search -- just about any kind of search-- is the low quality of markup/metadata. If only data was structured properly, we would barely need search in the first place!
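To illustrate the "many encodings" point: the same calendar date shows up in the wild in a pile of mutually incompatible renderings, before you even get to relative dates like "6 years ago". A toy multi-format parser (the format list is a small, invented sample, not an exhaustive one):

```python
# Try a handful of common date renderings; real crawlers face far more.
from datetime import datetime

FORMATS = [
    "%Y-%m-%d",    # 2020-01-19
    "%d/%m/%Y",    # 19/01/2020
    "%b %d, %Y",   # Jan 19, 2020
    "%d %B %Y",    # 19 January 2020
]

def parse_date(text):
    """Return a date for the first format that matches, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            pass
    return None

print(parse_date("Jan 19, 2020"))  # 2020-01-19
print(parse_date("6 years ago"))   # None - relative dates need more work
```

And this only covers the easy cases: ambiguous forms like 01/02/2020 parse "successfully" under both day-first and month-first formats, so even a match is no guarantee of the right date.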


Right! It's just... the internet isn't exactly in beta anymore. You'd think there's a timestamp type value that just tells you the date of an article or forum post that's linked. But no, we have to run a sophisticated algorithm to search for it on the page – and fail.

Honestly, my second theory is that google knows exactly when a given reddit article was posted but doesn't trust the user to judge the relevancy of that information. Which might even be reasonable in many, many cases. But it's also annoying. I'm definitely seeing a trend towards "editorializing" search results on Google, where often the first half of the page isn't even websites anymore but some random info box and whatnot, and your search terms are interpreted very liberally, even when in quotes. It's one of those things that is probably better for 99% of users/uses but super annoying when you actually want precise results.


> This is a query for checking out what reddit thinks in regards to buying a phone

[reddit phone to buy]

What is astonishing, is that the most obvious part of the non-progress in search over the last two decades, is just accepted as the starting point to the way things are.


I don't agree with the premise of the article, although I accept that the example given is clearly terrible.

I haven't noticed it before, is it a recent bug? It certainly seems a significant one, but not representative of my experience using Google - which I also acknowledge is skewed according to the data they have on you and SEO gaming.

But generally - those constraints acknowledged - I still find Google's search to be one of the modern wonders of the world, still the go to, and - yes - not perfect.


The article hardly supports its conclusion with these cherry-picked examples; however, the core reason these results don't meet the author's expectations is that Google's AI does not understand the content of webpages well enough to identify the publication date accurately (at least anywhere near as accurately as a human can). Google's publication date is based on whether it found changes to the HTML on its own crawl date (which is very noisy given today's dynamically generated websites), or on schema.org/microdata, which, as other commenters point out, is game-able for purposes of SEO, or simply missing on most sites.

As a contrast, take a look at how Diffbot, an AI system that understands the content of the page by using computer vision and NLP techniques on it, interprets the page in question:

https://www.diffbot.com/testdrive/?url=https://www.reddit.co...

It can reliably extract the publication date on each post, without resorting to using site-specific rules. (You can try it on other discussion threads and article pages, that have a visible publication date).


SEO is the new spam. We solved spam pretty well, but it was a very different solution space to what is available for the web:

- Spammers had basically two ways to verify their efficacy – they could either sign up to every provider under the stars and test each email with each of them individually, or they could use the absence of a signal as "proof" of being caught by a filter. But neither of these are very efficient. An SEO expert can simply wait for the search engine to detect their changes and verify the result with two or three search engines quickly and automatically.

- For practical purposes whether an email is spam is answered in a binary form: either it ends up in your spam box or it does not. Removing spam-looking things from search results entirely would be devastating for any site victim of a false positive. And how do you implement the equivalent of a spam box in a search engine in a useable way?

- Spam filtering was implemented in different ways on every mail provider, so the bar to entry was "randomized" and spammers would have to be quite careful to pass the filters on a large subset of providers. ISPs and users currently have nowhere near the resources to implement their own ranking rules, but maybe this could be a solution in the mid to long term with massively cheaper hardware.


I think a main issue is the vagueness of the query.

I was searching for "what phone should i buy in 2020 site:reddit.com" and while a few results were from a year ago, most were from January 2020.


Google has moved on from allowing exact searches to be made. Sometimes you have to think "if I didn't know what I actually wanted to see, and was trying to make a question-form query about it, what would I write?" That gets better results, IME, than a query where you know the exact words you want (which even with "" might not be in the link).


No, I didn't use quotes (exact match) in my query. This was only to express delimiters. Sorry for the confusion.


Adding 2020 might kind of work in January. It wouldn't help as much for getting this month's results in November.


I tried it and I think you're not right. The reason this works most of the time (at least on reddit) is that there are lots of threads which put the date in the title.

Query was:

what phone should i buy nov 2019 site:reddit.com


I just did a google search for "piano". Just the word "piano"

Only one link on the first page, the wikipedia entry for "piano" had anything to do with pianos, (i.e., the instrument invented in Italy 300+ years ago that has hammers, strings, and an iron frame).


What do you get when you search for that? Did a test right now, and apart from the Wikipedia page, I get videos about piano music/pianos, shop pages for buying pianos and local businesses that sell either pianos or piano lessons.

So I'm curious whether the issue is that there are too many shopping/business related pages (which is fair, but at least those seem to be piano related), or whether you're getting something completely different.


The first three links after the ad were for the same "virtual piano" (not a piano) on different websites.

See https://imgur.com/a/cMC9wQH

Then the wikipedia page, then a couple of "online" non-pianos, then a company that happens to be called piano.io

https://imgur.com/a/uRcyx84

Shopping pages are fine, if we'd get links to, say Steinway, Yamaha, and Bosendorfer, or links to Lang Lang's home page, or something that has more to do with _pianos_.


Well, how do you actually determine the age of a web page? Is it the post date? How do you even find that out? Is it the last comment post date? Is it the last edit of the main post, or the last edit of a comment? How do you find this out automatically? Is it the last change the HTTP server responds with? Is it the last time the entire page has been modified? If the page is built up of multiple components like iframes, do their post dates matter? Do ads matter? If the page is dynamic, everything gets a few orders of magnitude more complicated.

Point is, it is not a trivial task at all to automatically find out the time that corresponds to the intuitive understanding of the "age" of a web page.
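To make the parent's point concrete, here is a hedged sketch of the kind of fallback heuristics a crawler might use — a `/YYYY/MM/DD/` pattern in the URL, then the HTTP `Last-Modified` header — both of which are frequently absent or misleading, which is exactly the problem:

```python
import re
from datetime import datetime

def guess_page_age(url: str, headers: dict):
    """Heuristic 'age' guess from signals that are often wrong:
    a /YYYY/MM/DD/ pattern in the URL, else the HTTP Last-Modified
    header. Returns (datetime or None, name of the signal used)."""
    m = re.search(r'/(\d{4})/(\d{1,2})/(\d{1,2})/', url)
    if m:
        y, mo, d = map(int, m.groups())
        return datetime(y, mo, d), "url-pattern"
    lm = headers.get("Last-Modified")
    if lm:
        # e.g. "Sun, 19 Jan 2020 10:00:00 GMT"
        return datetime.strptime(lm, "%a, %d %b %Y %H:%M:%S %Z"), "last-modified"
    return None, "unknown"

when, source = guess_page_age("https://blog.example/2014/03/02/my-post/", {})
print(when.date(), source)  # -> 2014-03-02 url-pattern
```

Neither signal says anything about the last comment, the last edit, or dynamically injected content — hence the ambiguity the parent describes.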


It's the first time Google saw the page. Reddit posts have unique URLs and Google scans popular sites very regularly (in fact is rumoured to have site-specific optimizations).


That can't be right if, according to OP, the first result was a reddit post from six years ago, yet the date according to Google was Jan 11, 2020. So the first time Google saw that page would likely have been the day it was published.


Not sure if DDG patched this, but querying DDG w/o the month based tick results in a result that'll point you to the correct subreddit for finding which phone to buy in <current month> [0].

Although it's not using the "This Month" dropdown, doing the vanilla search still gets you the "most correct" answer imho.

[0] https://jszym.com/dl/imgs/20200119-ddg-example.png


> At any rate, I got annoyed at this point (mentioning for those who couldn’t tell), so I switched to DuckDuckGo.

For those who might be misled like I used to be DuckDuckGo is just a proxy for Bing.


Yes, it gets the majority of its results from Bing. But it's not _just_ a proxy, it's an anonymizing proxy, at least if we believe their pinky swear.


This line from the wikipedia article about DDG somewhat contradicts you, but it is rather vague:

> DuckDuckGo's results are a compilation of "over 400" sources,[15] including Yahoo! Search BOSS, Wolfram Alpha, Bing, Yandex, its own Web crawler (the DuckDuckBot) and others.

Could you elaborate on your comment, I am sincerely interested in learning the details.


You can open up two web pages side by side, search the same term in both Bing and DuckDuckGo, and see many of the results are the same in similar order (at least they were the last time I did this, maybe two months ago). DuckDuckGo appears to get most of its results from Bing.
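That side-by-side comparison can be made slightly more quantitative with a top-k overlap score. A trivial sketch (the result lists here are made up):

```python
def result_overlap(a, b, k=10):
    """Jaccard overlap of the top-k URLs from two engines,
    order ignored. 1.0 means identical result sets."""
    top_a, top_b = set(a[:k]), set(b[:k])
    return len(top_a & top_b) / max(len(top_a | top_b), 1)

bing = ["u1", "u2", "u3", "u4"]
ddg  = ["u2", "u1", "u3", "u5"]
print(result_overlap(bing, ddg))  # -> 0.6
```

A consistently high overlap across many queries would support the "mostly Bing" claim better than eyeballing two browser windows.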


For those who might be misled like I used to be DuckDuckGo is just a proxy for Bing.

Every time the topic of search comes up on HN, someone always jumps in and says this.

Then there are a bunch of other people who jump in and say that Duck is much more than that.

So, which is correct?


https://help.duckduckgo.com/duckduckgo-help-pages/results/so...

Interpret it as you wish. To me it sounds like they are using 400 sources and their own crawler for the Instant Answers stuff but get all their "traditional links in the search result" from Verizon(?) and Bing.


Well, of course, it can be "much more than that" depending on how you interpret that statement. When I hear marketing blurbs like _"over 400" sources,[15] including Yahoo! Search BOSS, Wolfram Alpha, Bing, Yandex"_ I inevitably start yawning and realize "much more than that" for what it is: marketing that worked on people.

You can run an experiment: would you ever have to be persuaded that, for instance, Google is much more than just X? Well, no, because it's usually proven that it is, without someone on the internet having to tell you about it.

Crawling most of the web on a regular basis is incredibly difficult and requires resources that only a company like Google can provide.


I've been screaming about this for years and only recently have people begun agreeing with me - and I know exactly what the major problems are, and they are synergistic:

1. SEO has totally warped result rankings. Now instead of getting results which naturally match my keywords because of content, I'm presented with almost exclusively commercial websites which are trying to sell me something. Gone are the days where you could search for technical terms and not be bombarded by marketing websites.

2. Google's AI is far too aggressive for technical searching. It is clear that Google is using NLP to parse queries and substitute synonyms based on some sort of BERT-like encoding. The problem is that a given word may have synonyms that are actually orthogonal in meaning space. For example, if I search for trunk, Google may return results for "boot" as in car trunk, instead of anything related to SVN. Contrived example, sure - but here's where the real problem is: Google's AI is regressing to the layman's mean. It is effectively overfitting to Grandma's average search query. Think of it as the endless summer of search...and since there's no way to usefully customize your search now (can't give people too many options or they might get confused!), you're stuck combing through unrelated results and it is increasingly difficult to disambiguate your search query. Remember when advanced search existed and typing in a question to search was terrible practice? That shouldn't have changed - but as more and more non-technical people started searching, Google (rightly, from a marketing perspective) seized the opportunity aggressively.

3. Primarily because of a combination of points 1 and 2 above, and the endless summer of non-technical users, informative websites have all but disappeared in search results, replaced by shitty SEO optimized blog spam and commercial websites which offer high level summaries primarily to generate traffic and sell you shit. Curious about how to repair your own roof in detail? Well don't bother searching "roof repair" (and quotes seem to be broken too btw) because the first two pages will be full of roof repair company websites.

So what is the result? The portal to the greatest asset in the history of the civilization, the internet, has gradually turned into a neutered, commercialized corporate service where users are a product. It's tragic to see all of that empowerment thrown away in the name of profit. As they say, if the user doesn't see it, it isn't there, and for this reason Google is effectively killing the internet.

I haven't even gotten into the demonstrated potential for search curation and autocomplete abuse, where Google becomes an effective, centralized arbiter of truth as the defacto portal to the internet, and how dangerous such a concentrated power over society can be.

Google really was admirable when it wasn't evil - now I'm about convinced that it needs to die.


Well, '"roof repair"' was never a good search for obvious reasons. But you'd expect a search like '"how to" repair roof' to mostly filter out sites providing roof-repair services - and if a search engine doesn't do that properly (because it ignores the "how to" part as irrelevant even though it's in quotes!) that's just broken.


I tried "how to repair roof" and the first organic result is the featured snippet of how to fix shingles, with a bunch of YouTube videos on various roof repair methods next, followed by DIY repair sites. So I don't see anything broken there?


Google has fallen to Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." For years, Google moved the goalposts for what measurements of a good website were and the broad internet moved in lockstep to meet those goals. Until Google settled on a "thin veneer of content but actually an ad for a service" as not too spammy to blacklist. So that is now the advice for any business. Hey, want to place highly for "St. Louis dentist"? Make a top 10 list of toothbrushing mistakes, etc.

Second, Google shifted about 7-10 years ago from searching for webpages to searching for answers. This was reflected in how they communicated about search and also in stuff like showing infoboxes and AMP results. I think this was move into mobile, but it leaves the actual web underserved.

The worst part is that tons of older content which is still on the web, and still valuable, has become no longer surfaceable, even with direct quoted phrases.


Does anyone else no longer see the date at all in their google search results?

Drives me mad, especially when I only want results from a certain year/month.


A shoutout for duckduckgo here. I tried both qwant and duckduckgo about a couple of years ago but the quality of search results forced me back to google in a couple of days. But I gave it another shot a few weeks ago and I have been pretty happy with the search results from duckduckgo and the fact that the ads do not follow me around the web based on my searches is rather pleasant. I must admit that google hasn't done anything bad to me but their sheer size and scale is scary enough for me to look for alternatives.

Now back to the search quality: I wanted to find out when exactly World Economic Forum 2020 is happening, and Google won hands down (search term = world economic forum 2020 dates). But for now, and for most of my day-to-day search terms, duckduckgo is doing okay. I know this won't last long, as they too are looking at advertising dollars, but I hope someone else will then stand up to challenge duckduckgo+google combined.


I agree google is shit, which is why I never use it. They're too busy being woke to run a search engine which compares to what they were doing 10-15 years ago. It's pretty obviously entirely that; smart people don't want to work in dying Brezhnevian bureaucratic hellscapes.

Try Yandex for an example of a much smaller company doing a fine job at search:

https://yandex.com/search/?text=reddit%20phone%20to%20buy&lr...

Produces exactly what OP was looking for. QED.

Qwant isn't bad either; don't remember if they piggyback off of other search results:

https://www.qwant.com/?q=reddit%20phone%20to%20buy&t=web


IIRC, Qwant piggybacks (-ed?) off Bing.


The point no one is making in these comments is: How is search so good? I've tried to implement search on small websites and always failed miserably. Results were always terrible no matter what library/database/indexer I used. I cannot even imagine how one would proceed to implement search over the entire internet.

I also have the experience that Google is much worse than some years ago, and I'm also frustrated by everything. But still, it does the job much better than anyone else, probably -- otherwise someone would have better results overall.

(Also, yes, I use DuckDuckGo and I don't think it's so much worse than Google, which is good, because I used to think no one would ever be able to be as good as Google, but today there are many competitors that come close.)


I feel like Google has often turned strict commands into fuzzy searching, maybe for a decade?

I never heard a clear explanation as to why, I just imagined that it was some sort of A/B tested paternalism. Maybe most users really want fuzzy searches when using the commands I use for a strict search.


I think it's simply a human bias in action - people don't realize when their queries benefit from the "fuzzy matching", and they only notice/remember when they don't get what they want from search and then (often mistakenly) blame fuzzy matching for it as that's what's visible to them.


When I'm ready to flip the table over about search results, I remember that https://millionshort.com/ exists and I give that a whirl. Then when I'm sick of seeing links to sites that are highly-SEOed but low-signal, Tampermonkey is there to give me a nice little [block] button to remove them.

Fantasy-future: Mozilla could "widen out" their library and hire 20,000 librarians to curate the "New Web" in a non-wiki-format. If you paid each librarian about $150K/annum, that's about $0.60/annum from 5B subscribers, just for their salaries for the advertising-free library.


Everyone can make a better search engine in the comments, but strangely, barely anyone is actually commenting on the actual case study: search based around time of publishing.

I guess another committee to paint the whole garden shed is easy; talking about what paint to use is hard.

I suspect Reddit needs to add meta data of the publishing date.

It is complicated in a forum: is it the publish date or the last comment date? But Google is still getting the basics wrong, i.e. every comment on a page is a year old, yet the page still shows up under a less-than-a-month filter in search.

It still doesn't help many time based issues (News always displays new headlines on old articles. So you'll see Iran funeral crowd crush on 'old' news articles in search) but it's a start.


Google search is abysmal at generic questions like the one the author mentioned about the best phones, or others like "Redux vs. Mobx" or "styled-components vs emotion". I end up just searching Reddit (or sometimes stackoverflow) not because Reddit is particularly good for technical discussions, but because Google's default search literally just returns blogspam from dev agencies.

Why hasn't a superior competitor emerged yet?

I get that the web is super broad and indexing the entire thing is an enormous task. But perhaps there's room for more niche search engines (eg. focused on tech) to stab away at this Google search monopoly.


It's worse than the author described. Google won't even sort by date correctly. I ran into this problem years ago and it still exists. Example: https://www.google.com/search?q=where+to+buy+a+phone&hl=en&t...


As a point of clarification, "Past Month" is not the same thing as "Last Month" or "Previous Month". "Past Month" in this context actually means the past 30 days from today. It's a really subtle and confusing nuance in English. In any case, Google Search having results from 5 days ago is accurate when filtering for Past Month. Google (being a global resource) really should reword "Past Month" to "Past 30 Days" to eliminate any confusion.


But still, none of the reddit posts he referenced are from the past month. They say things like Jan 11, 2020, which would be correct, but the actual post on Reddit is from 6 years ago.

Whether this is Google or Reddit causing this is another issue.


Isn’t "last month" supposed to be last month chronologically speaking, instead of a calendar month? If I search "last month" today, I would expect to get results from between Dec 19, 2019 and Jan 19, 2020.


Why can't every web server just remember (index) all the content that it serves?

Much like resolving DNS queries, I could then ask every server near me for specific terms. If they serve that content, they'll return a list of links containing those search terms.

We could have different apps with different algorithms to sort the bare results according to the criteria that's most relevant to each individual user.
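As a sketch of that idea — and only a sketch, since the hard parts (ranking, trust, spam) are exactly what it punts on — here is a toy inverted index a single web server could keep over its own pages, answering multi-term queries locally:

```python
from collections import defaultdict

class LocalIndex:
    """Toy inverted index a web server could maintain over the content
    it serves, answering term queries itself; ranking is left to the
    client app, as the parent comment suggests."""
    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of URLs

    def add(self, url, text):
        for term in text.lower().split():
            self.postings[term].add(url)

    def search(self, query):
        """Return URLs containing ALL query terms (AND semantics)."""
        terms = query.lower().split()
        if not terms:
            return set()
        hits = self.postings[terms[0]].copy()
        for t in terms[1:]:
            hits &= self.postings[t]
        return hits

idx = LocalIndex()
idx.add("/piano", "grand piano with hammers and strings")
idx.add("/virtual", "play a virtual piano online")
print(idx.search("piano strings"))  # -> {'/piano'}
```

The DNS analogy breaks down at the "ask every server near me" step, though: unlike DNS, there is no hierarchy telling you which of the billions of servers might hold relevant content.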


We need a taxonomy. In the old days finding links was done by going to a directory: dmoz or Yahoo.

I'm not saying we regress back to the past... I'm sure there is some hidden underlying directory structure in web search today. What I'm suggesting is making it more accessible to users as a way of navigating the web for good content.


I find I agree with the article, but mainly on queries that are purchase related, or those that overlap with some kind of business. If I'm looking for some coding question, usually the right SO question comes up on the first page.

Makes sense that it is so. Clearly there's incentives to show up first on a "where to buy phone" search. The actual answer without vested interests has nobody to pay for it. I bet there's some street in Berlin where there's a number of phone shops, but without coordination (eg a shopping centre) there's no way any of them will tell you that the selection is best if you just show up somewhere on that street.

Also on a deeper note, I fear the internet is being flooded with terrible authority sites: superficial articles that sound right and have lots of affiliate links. But the incentive is not to be informative and correct, rather to look right to people who don't know better - that's why they're there! - in order to funnel them to certain shops. I can just imagine there's an anti-vaxx site somewhere that purports to have real evidence and sells vaginal anti-cancer eggs.


This is something I'm currently focused on solving, actually. The alpha is currently in a Show HN a ways down, but the basic gist is a search engine focused on discovery while maintaining topical relevance. We thoroughly agree that search results are kind of garbage on major engines at the moment.


Maybe it is time for a personal meta search aggregator with an indexing proxy: one that delegates the query to Reddit/SO/GitHub/GitLab search and filters the results down to those that actually match your query. The proxy indexes all the pages you surf over and includes those in the search.


From the author's search result and a quick test I ran, it appears that the specific problem here is Google doesn't seem to "understand" Reddit. If it did, search results would be based on relevancy of comments on the Reddit page, posted in the specified timeframe.


All search is being devoured by SEO


This. For every engineer that somehow works on search quality, there are thousands of experts who are working to subvert SERPs in some fashion. I'm pretty sure that if we gained true knowledge of the challenges search faces due to abuse, it would be like facing one of Lovecraft's cosmic horrors.


Yawn. An anecdote doesn't make it data. Search is now a sufficiently complicated and difficult problem that it can't be represented by a few queries. You need qualitative analysis as well as quantitative to draw a meaningful comparison.


Any suggestions on how could one start an open source search engine business? Being able to index the whole internet as an MVP and quickly serve queries on top of this data has a major initial and operating cost.


For power users it is. Probably we are just too niche to be worth supporting


I always wondered why a search engine can't use SEO tactics as a kind of anti-signal. What comes to the top of a google search if you filter out anyone gaming SEO?


Likely because a lot of SEO tactics are not necessarily things that hurt the quality of the site for the user. Using the right heading tags, image alt texts, meta data/schema markup, a good title and meta description etc are all SEO tactics, and they're also all things that help the user experience.

Similarly, a lot of inbound/offsite SEO tactics are theoretically things that help the user as well. Providing content people want to link to, getting authorities to link to said relevant content, etc are all things a user would appreciate.

Using SEO tactics as an anti signal would boost poorly designed sites, inaccessible sites, etc. What really needs to be done is something that filters out sites creating thin content just for the purposes of getting traffic, and that's harder to filter out.


"The site you're after" will also try SEO tactics, because they also want to be seen. And if not using SEO gave you a better rank, then that would also become an SEO tactic.

Turtles all the way down.


Google filters the results on the date of the web page update, not the date of the Reddit post's creation, which is the correct behaviour.

And you are just making unjustified bold statements


Well, searching for "phone to buy" isn't going to do much good, is it.

"Reddit best 2020 phone" works for me much better, though ...

User friendly it is? No.


It should not be hard to solve if they had an algorithm to search Reddit built in. But perhaps they do not, in which case it gets much harder.


The user probably would be less pissed off if they had no way to filter by time. An interesting lesson in UX.


What was the query the author used? Did he try something like this

"https://www.google.com/search?sitesearch=reddit.com&q=best+p...


Watch out with doing more complicated searches like extending those to a date range - they get people “banned” from using google search.


Even browsing beyond the first few pages of results can end you up having to play with captchas.

Also it's funny how they lie about having about a quadrillion results, and then when you're on page 12, suddenly that's all the 150 results they have, sorry.


That quadrillion results bit is like a relic from a bygone era, when people actually cared.

I have not hit captchas very often. Even IP bans only last hours. It is annoying, but low risk. I do searching from shell prompt, never browser.

What always amazed me about Google is that they are not willing to let users skip pages 1-11 and immediately jump to, say, page 12.

Sometimes queries are non-commercial and there are no ads. Still, jumping straight to page 12 can trigger a captcha.

I wrote a script that reverses or randomises the order of Google results, as an experiment.


Elaborate.

I was doing some deep searching in very specific date ranges with lots of modifiers and google gave me the "your query reminds us of bots or some shit try again later" and I had to stop because it wouldn't let the queries through anymore. Is that what that was?


Yes, that sounds like that same thing.

I've encountered Google ReCaptcha as well a few times when searching from a logged out browser.


What kind of cluster do I need to just grep the web?


As for the Case Study part, and me saying this isn’t simply a rant - I lied, hence the quotation marks in the title

So it's NOT a case study...


For what it's worth, I'm not sure why they changed the title here. The flippancy in my original title makes the level of diligence a bit clearer from the get-go.


A person using Google search is not Google’s customer; there is no money in building a good search engine itself.



How is search so bad?

My broad take is that previously search worked (Altavista era through early-mid Google) because it referenced organic links put in place by real humans and keywords, plus basic metadata like physical location of servers, freshness of content, frequency of update, metadata behind domains, etc.

Since the mid 1990s that has increasingly been gamed heavily, PageRank style approaches have come and sort of gone, and a vast majority of content accessed by consumers has moved to one of a small number of platforms or walled gardens, often mobile applications. I don't know for sure, but I'd assume with confidence that the majority of result inclusion decisions made by Google are now based on rejection blacklists, 'known good' safe hits and effectively minimizing anomalous results above the fold. Simultaneously, the internet has become an international place and the bar has been raised for new entrants such that an incapacity to return meaningful results in multiple languages bars a search engine from any significant market position. A huge percentage of results are either Wikipedia/reference pages, local news or Q&A sites. Further, huge amounts of what is out there are behind Cloudflare or similar firewalls, which will probably frustrate new and emerging spiders.

The existing monopolies, having some established capacity and reputation in this regard, may have become somewhat entrenched and lazy, and do not care enough about improvement. They are literally able to sail happily on market inertia, while generating ridiculous advertising revenues. In China we have Baidu, and in most rest of the world, Google.

Who will bring about a new search engine? Greg Lindahl https://news.ycombinator.com/user?id=greglindahl who formerly made Blekko is apparently working on another one.

I once wrote a small one (~2001) which was based upon the concept of multilingual semantic indices (a sort of non-rigorously obtained language-neutral epistemology was the core database). I still think this would be a meaningful approach to follow, since so much is lost in translation, particularly around current events. One problem with evolving public utilities in this area is that such approaches border on open source intelligence (OSINT), and most people with linguistic or computational chops in that area leave academia and get eaten up by the military industrial complex or Google.

Now we have https://commoncrawl.org/the-data/get-started/ which makes reasonable quality sample crawl data super-available. Now "all" we need is people to hack on algorithms and a means to commercialize them as alternatives to the status quo.
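For anyone tempted to hack on that: Common Crawl's CDX index server (index.commoncrawl.org) lets you look up captures for a URL pattern without downloading whole crawls. A sketch with no network calls — the endpoint shape and field names here follow the public API as I understand it, and crawl IDs change with each release:

```python
import json
from urllib.parse import urlencode

def cdx_query_url(crawl_id: str, url_pattern: str) -> str:
    """Build a query URL for the Common Crawl CDX index API.
    crawl_id is e.g. "CC-MAIN-2020-05" (changes per crawl)."""
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{crawl_id}-index?{params}"

def parse_cdx_line(line: str) -> dict:
    """Each response line is a JSON record pointing into a WARC file
    in the crawl data, which you can then range-request."""
    rec = json.loads(line)
    return {"url": rec["url"], "timestamp": rec["timestamp"],
            "warc": rec["filename"]}

print(cdx_query_url("CC-MAIN-2020-05", "svilentodorov.xyz/*"))
sample = ('{"url": "https://example.com/", "timestamp": "20200119000000",'
          ' "filename": "crawl-data/.../x.warc.gz", "mime": "text/html"}')
print(parse_cdx_line(sample)["timestamp"])  # -> 20200119000000
```

From there, "grepping the web" is mostly a matter of streaming WARC files through whatever algorithm you want to try.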


The author is not using a "search" engine, Google is a recommendation engine.


Cuz it is goddamn hard!?


You're located in Berlin. It found a page on Reddit about buying phones in Berlin.

Not saying you're wrong about the dates but... I dunno... seems like an odd query. "Phone to buy?" And why not just search Reddit?

This site feels like we're just complaining about nothing these days. (Downvote away!)


I'm downvoting you because your response is totally useless.

He wasn't looking for where to buy a phone in Berlin, and besides, it's an old reddit thread.


Agreed. OP does a poorly formed search query and complains. Try "reddit best phone to buy 2020". You're welcome!


It's a hard problem because what is relevant is inherently subjective and context specific and only a minority of users uses the advanced search functionality so it is also not a big priority to solve it. Both Google and Duck Duck Go optimize for the simple use case where there's a bit of user context and some short query that the user typed. That's what needs to work well. For that Google is still pretty good. I try duck duck go once in a while but it's just not good enough for me right now. And of course when Google fails me, that's probably also a hard case for Duck Duck Go.

The other problem is that websites provide very inconsistent meta-data, and worse, are actively trying to game the system by abusing that metadata. So, things like timestamps are not standardized at all (well, a little bit via things like microformats). So recency of data is important as one of many relevance signals but not necessarily super accurate. And given that it's a relevance signal, you have people doing SEO trying to game that as well.

Anyway, Hacker News could also do with some search improvements to its ranking. It always pulls up some ancient article as the most relevant thing, as opposed to the article from last week that I remembered and wanted to find back. I consult people on building search engines with Elasticsearch, so I have some idea what I'm talking about. It seems the ranking is basically "sort by points". Probably not that hard to fix that with some additional ranking signals. I just searched for "search" expecting to find this article near the top 5 (because it is recent and has search in the title). Nope; not a thing.


Hacker News doesn't even have search at the moment. It just redirects you to some crappy external site.


Giving the YC startup running the search backend some visibility doesn't mean it somehow "doesn't even have search".


Exactly, there's a search box on the web site. How it's implemented is an implementation detail. Given that it happens to be something by a company (Algolia) selling this as a SAAS solution, I don't think this is a great advertisement for them either.


Algolia is a YC company.



