
> A new breakthrough heuristic today will look like something totally different, just as meritocratic and possibly resistant to gaming.

I wonder how much of this could be won back by penalizing (a rough sketch follows below):

1. The number of JavaScript dependencies
2. The number of ads on the page, or the depth of the ad network

This might start a virtuous circle, but in the end, this is just a game of cat-and-mouse, and websites might optimize for this as well.

What we might need to break this is a variety of search engines that use different criteria to rank pages. I suspect it would be pretty hard, if not impossible, to optimize for all of them.

And in any case, the ranking algorithms should change frequently to combat over-optimization by websites (as is classically done against ossification in protocols, or against any overfitting to outside forces in a competitive system).
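
A minimal sketch of what such a penalty-based re-ranker could look like, assuming the crawler already records these signals (the signal names, weights, and baseRelevance score below are all hypothetical stand-ins, not any real engine's API):

    // Hypothetical per-page signals a crawler could record.
    interface PageSignals {
      baseRelevance: number;   // whatever textual relevance score the engine already computes
      scriptOrigins: number;   // distinct third-party hosts serving JavaScript on the page
      adSlots: number;         // elements identified as ad placements
      adNetworkDepth: number;  // longest observed chain of ad-network redirects
    }

    // Illustrative weights only; real values would need tuning and would drift
    // as sites adapt (the cat-and-mouse problem mentioned above).
    function penalizedScore(p: PageSignals): number {
      const penalty =
        0.5 * Math.log1p(p.scriptOrigins) +
        1.0 * Math.log1p(p.adSlots) +
        2.0 * p.adNetworkDepth;
      return p.baseRelevance - penalty;
    }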




You could even have all this under one roof: one common search spider that feeds this ensemble of different ranking algorithms to produce a set of indices, and then a search engine front end that round-robins queries between the different indices. (Don’t like your query? Spin the algorithm wheel! “I’m Feeling Lucky” indeed.)
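
As a rough sketch of that front end, assuming one shared index and a set of interchangeable rankers (the two rankers here are trivial stand-ins just to keep the example runnable):

    // One shared crawl feeding several independent ranking functions.
    type Ranker = (query: string, docIds: string[]) => string[];

    // Stand-in rankers; real ones might use link graphs, freshness, plain tf-idf, etc.
    const alphabetical: Ranker = (_q, ids) => [...ids].sort();
    const shortestUrlFirst: Ranker = (_q, ids) => [...ids].sort((a, b) => a.length - b.length);
    const rankers: Ranker[] = [alphabetical, shortestUrlFirst];

    let next = 0;
    function search(query: string, docIds: string[], feelingLucky = false): string[] {
      // Round-robin by default; a random ranker for "I'm Feeling Lucky".
      const i = feelingLucky
        ? Math.floor(Math.random() * rankers.length)
        : next++ % rankers.length;
      return rankers[i](query, docIds);
    }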


The Common Crawl is a thing already. Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage, and I can't think of anything that could change that in the foreseeable future. That's why I think providing a federated Web directory standard, à la ODP/DMOZ except not limited to a single source, would be a far more impactful development.


> Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage

Maybe instead of a problem, there is an opportunity here.

Back before Google ate the intarwebs, there used to be niche search engines. Perhaps that is an idea whose time has come again.

For example, if I want information from a government source, I use a search engine that specializes in crawling only government web sites.

If I want information about Berlin, I use a search engine that only crawls web sites with information about Berlin, or that are located in Berlin.

If I want information about health, I use a search engine that only crawls medical web sites.

Each topic is still a wealth of information, but siloed enough that the amount of data could be manageable for a small or medium-sized company. And the market would keep the niches from getting so small that they become useless. A search engine dedicated to Hello Kitty lanyards isn't going to monetize.


I'd be happy with something like Searx [1,2,3]

[1] https://en.wikipedia.org/wiki/Searx [2] https://asciimoo.github.io/searx/ [3] https://stats.searx.xyz/

featuring the semantic map of [4] https://swisscows.ch/

incorporating [5] https://curlie.org/ and Wikipedia and something like Yelp/YellowPages embedded in OpenStreetMap for businesses and points of interest, with a no-frills interface showing the history (via timeslide?) of edits.

Bang! Done!


That's the problem that web directories solve. It's not that you're wrong, it's just largely orthogonal to the problem that you'd need a large crawl of the internets for, i.e. spotting sites about X niche that you wouldn't find even from other directly-related sites, and that are too obscure, new, etc. to be linked in any web directory.


> That's the problem that web directories solve

Not really. A web directory is a directory of web sites. I can't search a web directory for content within the web sites, which is what a niche search engine would do.


On the other hand, the niche search engine depends upon having such a web directory (the list of sites to index).



Don't forget the search engine search engine!


You don’t really need to store a full text crawl if you’re going to be penalizing or blacklisting all of the ad-filled SEO junk sites. If your algorithm scores the site below a certain threshold then flag it as junk and store only a hash of the page.

Another potentially useful approach is to construct a graph database of all these sites, with links as edges. If one page gets flagged as junk then you can lower the scores of all other pages within its clique [1]. This could potentially cause a cascade of junk-flagging, cleaning large swathes of these undesirable sites from the index.

[1] https://en.wikipedia.org/wiki/Clique_(graph_theory)
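
A rough sketch of both ideas combined: threshold-based junk flagging plus lowering the scores of pages that link heavily into flagged junk. True clique detection is simplified here to out-link neighbourhoods, and every threshold below is made up:

    // Link graph: page URL -> URLs it links to.
    type Graph = Map<string, Set<string>>;

    const JUNK_THRESHOLD = 0.2;     // made-up cutoff
    const NEIGHBOUR_PENALTY = 0.5;  // made-up score multiplier

    function flagJunk(scores: Map<string, number>, graph: Graph): Set<string> {
      const junk = new Set<string>(
        [...scores].filter(([, s]) => s < JUNK_THRESHOLD).map(([url]) => url),
      );
      const penalized = new Set<string>();
      // Propagate: pages linking mostly to junk get their scores cut, which can push
      // them under the threshold too; that is the cascade described above.
      let changed = true;
      while (changed) {
        changed = false;
        for (const [url, links] of graph) {
          if (junk.has(url) || penalized.has(url) || links.size === 0) continue;
          const junkLinks = [...links].filter((l) => junk.has(l)).length;
          if (junkLinks / links.size > 0.5) {
            penalized.add(url);
            scores.set(url, (scores.get(url) ?? 0) * NEIGHBOUR_PENALTY);
            if ((scores.get(url) ?? 0) < JUNK_THRESHOLD) {
              junk.add(url);
              changed = true;
            }
          }
        }
      }
      return junk; // for these, store only a hash of the page rather than the full text
    }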


Javascript, which Google coincidentally pushed and still pushes for, doesn't exactly make the web easier to crawl either.


What if SEO consultants aren’t gaming the system, but search and the web are being optimized for “measurable immediate economic impact”, which at this moment means ad revenue, because the web itself is hard to monetize and unable to generate value on its own?

I don’t like the whole concept of SEO, and I don’t like the way the web is today, but I think we should stop and think before resorting to the “an immoral few are destroying things; let’s unfuck it and reclaim what we deserve” type of simplification.


> the search and web is being optimized for “measurable immediate economic impact” that is ad revenue at this moment

So much is obvious. The discussion is about whether there is a less shitty metric.


Merging JS deps into one big resource isn't difficult. The number-of-ads point is interesting, though. How would one determine what is an ad and what is an image? I have my ideas, but optimizing on this boundary sounds like it would lead to weird outcomes.


Adblockers have to solve that problem already. And it's actually really easy because "ads" aren't just ads unfortunately, they're also third-party code that's trying to track you as you browse the site. So it's reasonably easy to spot them and filter them out.
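
A simplified sketch of that test: treat a request as blockable when it is third-party relative to the page and its host matches a filter list. The three domains below are a tiny stand-in for real lists like EasyList/EasyPrivacy, and the third-party check is deliberately naive:

    // Tiny stand-in for a real filter list.
    const TRACKER_DOMAINS = ["doubleclick.net", "adnxs.com", "scorecardresearch.com"];

    function isBlockableRequest(pageUrl: string, requestUrl: string): boolean {
      const page = new URL(pageUrl);
      const req = new URL(requestUrl);
      // Naive third-party check; a real blocker would compare registrable domains (eTLD+1).
      const thirdParty =
        req.hostname !== page.hostname && !req.hostname.endsWith("." + page.hostname);
      const onList = TRACKER_DOMAINS.some(
        (d) => req.hostname === d || req.hostname.endsWith("." + d),
      );
      return thirdParty && onList;
    }

    // isBlockableRequest("https://example.com/article", "https://stats.doubleclick.net/px") === true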


Though advertisers are already using first-party redirection. The future of ad blockers is bleak.

https://github.com/uBlockOrigin/uBlock-issues/issues/780/


Back in the early days of banner ads, a CSS-based approach to blocking was to target images by size. Since advertising revolved around specific standards of advertising "units" (effectively: sizes of images), those could be identified and blocked. That worked well, for a time.

This is ultimately whack-a-mole. For the past decade or so, point-of-origin based blockers have worked effectively, because that's how advertising networks have operated. If the ad targets start getting unified, we may have to switch to other signatures:

- Again, sizes of images or DOM elements (a sketch follows this list).

- Content matching known hash signatures, or constant across multiple requests to a site (other than known branding elements / graphics).

- "Things that behave like ads behave" as defined by AI encoded into ad blockers.

- CSS / page elements. Perhaps applying whitelist rather than blacklist policies.

- User-defined element subtraction.

There's little in the history of online advertising that suggests users will simply give up.
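
For the size-based signature above, a sketch of matching page elements against a few of the classic IAB ad unit dimensions (the size list and tolerance are illustrative, not a real filter):

    // A few classic ad unit sizes: medium rectangle, leaderboard, wide skyscraper, mobile banner.
    const AD_SIZES: Array<[number, number]> = [[300, 250], [728, 90], [160, 600], [320, 50]];
    const TOLERANCE = 2; // pixels

    function looksLikeAdUnit(el: Element): boolean {
      const { width, height } = el.getBoundingClientRect();
      return AD_SIZES.some(
        ([w, h]) => Math.abs(width - w) <= TOLERANCE && Math.abs(height - h) <= TOLERANCE,
      );
    }

    // Images and iframes are the usual suspects for this kind of matching.
    const suspects = [...document.querySelectorAll("img, iframe")].filter(looksLikeAdUnit);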


Some of those techniques would make the whole experience slow compared to the current network-request filters and DNS blockers.

And that will probably be blocked or severely locked down by the most popular browser, Chrome.

I don't need to give advertisers my data myself when someone else I know will. I really doubt it is easy to throw off the Chrome monopoly at this stage. I presume we will see a chilling effect before anything moves the way it did with IE.


At this point, I'll take slow over shitshow.


That has been fixed since 1.24.3b7, hasn't it? https://github.com/gorhill/uBlock/releases


In the early days of DMOZ, some editors would rank sites lower based on the number of ads they had.


I don't think DMOZ had ranking per se? They could mark "preferred" sites for any given category, but only a handful of them at most, and with very high standards, i.e. it needed to be the official site or "THE" definitive resource about X.


You are correct, the sites weren't "ranked" the same way that Google ranks sites now. But there were preferred sites, and each site had a description written by an editor who could be fairly unpleasant if they wanted to.

I had a site that appeared in DMOZ, and the description was written in such a way that nobody would want to visit it. But it was one of only a few sites on the internet at the time with its information, so it was included.


Given that Google makes money off the ads, that would be hard. DuckDuckGo could pull it off. You need another revenue stream though.


Google has taken on so many markets that I don't think they can do anything reasonably well (or disruptive) without conflicting interests. A breakup is overdue: if they didn't control both search and ads, the web would be a lot better nowadays. If they didn't control web browsers as well, standards would be much more important.


> What we might need to break this is ...

Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay. A multitude of user-side apps may then query that protocol, with each app using different algorithms, heuristics and offering different options.


I've been thinking along similar lines for a year or so now.

There are several puzzling omissions from Web standards, particularly given that keyword-based search was part of the original CERN WWW discussion:

http://info.cern.ch/hypertext/WWW/Addressing/Search.html

IF we had a distributable search protocol, index, and infrastructure ... the entire online landscape might look rather different.

Note that you'd likely need some level of client support for this. And the world's leading client developer has a strongly-motivated incentive to NOT provide this functionality integrally.

A distributed self-provided search would also have numerous issues -- false or misleading results (keyword stuffing, etc.) would be harder to vet than the present situation. Which suggests that some form of vetting / verifying provided indices would be required.

Even a provided-index model would still require a reputational (ranking) mechanism. Arguably, Google's biggest innovation wasn't spidering, but ranking. The problem now is that Google's ranking ... both doesn't work, and incentivises behaviours strongly opposed to user interests. Penalising abusive practices has to be built into the system, with those penalties being rapid, effective, and for repeat offenders, highly durable.

The potential for third-party malfeasance -- e.g., behaviour that appears to favour one site but is actually performed to harm that site's reputation through black-hat SEO penalties -- also has to be considered.

As a user, the one thing I'd most like to be able to do is specify blacklists of sites / domains I never want to have appear in my search results. Without having to log in to a search provider and leave a "personalised" record of what those sites are.

(Some form of truly anonymised aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge.)
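
A client-side sketch of that last idea: the blocklist lives with the user (say, in a browser extension) and results are filtered locally, so the search provider never sees it. The result shape and the domains are hypothetical:

    interface SearchResult { url: string; title: string; }

    // Maintained locally by the user; never sent to the search provider.
    const personalBlocklist = new Set(["contentfarm.example", "seo-spam.example"]);

    function filterResults(results: SearchResult[]): SearchResult[] {
      return results.filter((r) => {
        const host = new URL(r.url).hostname;
        // Drop the result if its host, or any parent domain, is blocklisted.
        return ![...personalBlocklist].some(
          (blocked) => host === blocked || host.endsWith("." + blocked),
        );
      });
    }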


I too have been thinking about these things for a long time, and I also believe a better future is going to include "aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge."

I decided it is time for us to have a bouncer-bots portal (or multiple) - this would help not only with search results, but could also help people when using Twitter or similar - good for the decentralized and centralized web alike.

My initial thinking was these would be 'pull' bots, but I think they would be just as useful, and more used, if they were perhaps active browser extensions..

This way people can choose which type of censoring they want, rather than relying on a few others to choose.

I believe in creating some portals for these, similar to ad-block lists, where people can choose to use Pete'sTooManyAds bouncer and/or Sam'sItsTooSexyForWork bouncer..

Ultimately I think the better bots will have switches where you can turn certain aspects of them on and off and re-search, or pull the latest Twitter/Mastodon things.

I can think of many types of blockers that people would want, and some that people would want part of - so either varying degrees of blocking sexual things, or varying bots for varying types of things.. maybe some have sliders instead of switches..

make them easy to form and comment on and provide that info to the world.

I'd really like to get this project started, not sure what the tooling should be - and what the backup would be if it started out as a browser extension but then got booted from the chrome store or whatever.

Should this / could this be a good browser extension? What language / skills required for making this? It's on my definite to do future list.


There are some ... "interesting" ... edge cases around shared blocklists, most especially where those:

1. Become large.

2. Are shared.

3. And not particularly closely scrutinised by users.

4. Via very highly followed / celebrity accounts.

There are some vaguely similar cases of this occurring on Twitter, though some mechanics differ. Celebs / high-profile users attract a lot of flak, and take to using shared blocklists. Those get shared not only among celeb accounts but also among their followers; and because celebs themselves are a major amplifying factor on the platform, being listed effectively means disappearing from it. That is particularly critical for those who depend on Twitter reach (some artists, small businesses, and others).

Names may be added to lists in error or out of malice.

This blew up summer of 2018 and carried over to other networks.

Some of the mechanics differ, but a similar situation playing out over informally shared Web / search-engine blocklists could have similar effects.


> Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay

Isn't that pretty much a site map?

https://en.wikipedia.org/wiki/Sitemaps


A sitemap simply tells you what pages exist, not what's on those pages.

Systems such as lunr.js are closer in spirit to a site-oriented search index, though that's not how they're presently positioned; instead, they offer JS-based, client-side search for otherwise static websites.

https://lunrjs.com
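
For example, a static site could build its own lunr index at publish time and expose the serialized result for clients or aggregators to fetch; a minimal sketch (the document fields and the export path are placeholders):

    import lunr from "lunr";

    const pages = [
      { url: "/about", title: "About", body: "Who we are and what we do." },
      { url: "/posts/search", title: "Search on the web", body: "Notes on distributed search." },
    ];

    // Build a per-site index at publish time...
    const idx = lunr(function () {
      this.ref("url");
      this.field("title");
      this.field("body");
      pages.forEach((p) => this.add(p));
    });

    // ...then serialize it (e.g. to /search-index.json) so others can load and query it.
    const serialized = JSON.stringify(idx);
    const results = lunr.Index.load(JSON.parse(serialized)).search("distributed search");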


How would this help anything? It would make the blackhat SEO even easier, if anything.


The results could be audited.

Fail an audit, lose your reputation (ranking).

The basic principle of auditing is to randomly sample results. Black-hat SEO tends to rely on volume in ways that would be very difficult to hide from even modest sample sizes.
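
A sketch of such an audit, assuming each site self-publishes an index mapping claimed terms to URLs (the entry format is entirely hypothetical; only the sampling idea matters):

    // Hypothetical self-published index entry: a term the site claims to cover, and where.
    interface IndexEntry { term: string; url: string; }

    async function auditSample(entries: IndexEntry[], sampleSize = 20): Promise<number> {
      // Take a random sample of the site's claims (a naive shuffle is fine for a sketch).
      const sample = [...entries].sort(() => Math.random() - 0.5).slice(0, sampleSize);
      if (sample.length === 0) return 0;
      let failures = 0;
      for (const { term, url } of sample) {
        const text = (await (await fetch(url)).text()).toLowerCase();
        // The claimed term should actually appear in the served page.
        if (!text.includes(term.toLowerCase())) failures++;
      }
      return failures / sample.length; // a high failure rate feeds a reputation (ranking) penalty
    }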


How do you stop the server lying?

If a good site is on shared hosting, will it always be dismissed because of the signal from the other [bad] sites on that same host? (You did say at the DNS level, not the domain level.)


> Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay.

So, back to gopher? That might actually work!


Google already penalizes based on payload size/download speed.


> 1. The number of javascript dependencies

How about we don't start looking at /how/ a site is made, when it's already difficult to sort out /what/ it is.



