> A new breakthrough heuristic today will look like something totally different, just as meritocratic and possibly resistant to gaming.
I wonder how much of this could be won back by penalizing:
1. The number of JavaScript dependencies
2. The number of ads on the page, or the depth of the ad network
This might start a virtuous circle, but in the end this is just a game of cat-and-mouse, and websites might optimize for this as well (a rough sketch of such a penalty follows below).
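Roughly what I have in mind, as a minimal Python sketch; the ad-network domain list and the weights are invented placeholders, not real data:

```python
# Sketch of a ranking penalty based on script count and ad-network references.
# AD_NETWORK_DOMAINS and the weights below are made-up placeholders.
from html.parser import HTMLParser
from urllib.parse import urlparse

AD_NETWORK_DOMAINS = {"ads.example.com", "tracker.example.net"}  # hypothetical

class ResourceCounter(HTMLParser):
    """Collect external script and iframe sources from a page."""
    def __init__(self):
        super().__init__()
        self.script_srcs = []
        self.iframe_srcs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("src"):
            self.script_srcs.append(attrs["src"])
        elif tag == "iframe" and attrs.get("src"):
            self.iframe_srcs.append(attrs["src"])

def penalty(html: str) -> float:
    counter = ResourceCounter()
    counter.feed(html)
    n_scripts = len(counter.script_srcs)
    n_ad_refs = sum(
        1
        for src in counter.script_srcs + counter.iframe_srcs
        if urlparse(src).hostname in AD_NETWORK_DOMAINS
    )
    # Arbitrary weights: each external script costs a little, each ad reference a lot.
    return 0.05 * n_scripts + 0.5 * n_ad_refs
```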
What we might need to break this is a variety of search engines that use different criteria to rank pages. I suspect it would be pretty hard, if not impossible, to optimize for all of them.
And in any case, the ranking algorithms should change frequently to combat over-optimization by websites (as is classically done against ossification in protocols, or any overfitting to outside forces in a competitive system).
You could even have all this under one roof: one common search spider that feeds this ensemble of different ranking algorithms to produce a set of indices, and then a search engine front end that round-robins queries out between the different indices. (Don’t like your query? Spin the algorithm wheel! “I’m Feeling Lucky” indeed.)
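A toy sketch of that front end, assuming each index object exposes some `search(query)` method (the interface here is made up for illustration):

```python
import itertools

class RoundRobinSearch:
    """Rotate queries across an ensemble of independently-ranked indices."""

    def __init__(self, indices):
        # `indices` is any iterable of objects exposing search(query) -> list of results.
        self._cycle = itertools.cycle(indices)

    def search(self, query):
        index = next(self._cycle)  # spin the algorithm wheel
        return index.search(query)
```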
The Common Crawl is a thing already. Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage, and I can't think of anything that could change that in the foreseeable future. That's why I think providing a federated Web directory standard, a la ODP/DMOZ except not limited to a single source, would be a far more impactful development.
> Unfortunately, a "full" text crawl of the internets is a YUUUGE amount of data to manage
Maybe instead of a problem, there is an opportunity here.
Back before Google ate the intarwebs, there used to be niche search engines. Perhaps that is an idea whose time has come again.
For example, if I want information from a government source, I use a search engine that specializes in crawling only government web sites.
If I want information about Berlin, I use a search engine that only crawls web sites with information about Berlin, or that are located in Berlin.
If I want information about health, I use a search engine that only crawls medical web sites.
Each topic is still a wealth of information, but siloed enough that the amount of data could be manageable for a small or medium-sized company. And the market would keep the niches from getting so small that they stop being viable. A search engine dedicated to Hello Kitty lanyards isn't going to monetize.
Incorporating https://curlie.org/ [5] and Wikipedia and something like Yelp/YellowPages embedded in OpenStreetMap for businesses and points of interest, with a no-frills interface showing the history (via timeslide?) of edits.
That's the problem that web directories solve. It's not that you're wrong, it's just largely orthogonal to the problem that you'd need a large crawl of the internets for, i.e. spotting sites about X niche that you wouldn't find even from other directly-related sites, and that are too obscure, new, etc. to be linked in any web directory.
Not really. A web directory is a directory of web sites. I can't search a web directory for content within the web sites, which is what a niche search engine would do.
You don’t really need to store a full text crawl if you’re going to be penalizing or blacklisting all of the ad-filled SEO junk sites. If your algorithm scores the site below a certain threshold then flag it as junk and store only a hash of the page.
Another potentially useful approach is to construct a graph database of all these sites, with links as edges. If one page gets flagged as junk then you can lower the scores of all other pages within its clique [1]. This could potentially cause a cascade of junk-flagging, cleaning large swathes of these undesirable sites from the index.
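A minimal sketch of that cascade, using direct link neighbours as a cheap stand-in for proper cliques; the threshold and penalty size are arbitrary placeholders:

```python
# Sketch: propagate junk flags through a link graph.
# `links` maps each site to the set of sites it is associated with (links treated
# as undirected association). Thresholds and penalties are placeholders.
from collections import deque

def propagate_junk(scores, links, flagged, threshold=0.3, penalty=0.2):
    """Lower the scores of sites associated with flagged sites; if a neighbour
    drops below the threshold, flag it too and keep cascading."""
    queue = deque(flagged)
    junk = set(flagged)
    while queue:
        site = queue.popleft()
        for neighbour in links.get(site, ()):
            if neighbour in junk:
                continue
            scores[neighbour] -= penalty
            if scores[neighbour] < threshold:
                junk.add(neighbour)
                queue.append(neighbour)
    return junk
```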
What if SEO consultants aren't gaming the system, but search and the web are being optimized for "measurable immediate economic impact", which at this moment means ad revenue, because the web itself is un-monetizable and unable to generate value on its own?
I don't like the whole concept of SEO, and I don't like the way the web is today, but I think we should stop and think before resorting to the "an immoral few are destroying things; we'll unfuck it and reclaim what we deserve" type of simplification.
Merging JS deps into one big resource isn't difficult. The number-of-ads point is interesting, though. How would one determine what is an ad and what is an image? I have my ideas, but optimizing on this boundary sounds like it would lead to weird outcomes.
Adblockers have to solve that problem already. And it's actually really easy because "ads" aren't just ads unfortunately, they're also third-party code that's trying to track you as you browse the site. So it's reasonably easy to spot them and filter them out.
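A bare-bones sketch of that kind of check; real blockers use full filter lists such as EasyList, and the two domains below are just examples:

```python
from urllib.parse import urlparse

# A tiny stand-in for a real filter list (EasyList and friends are far larger).
BLOCKLIST = {"doubleclick.net", "googlesyndication.com"}

def is_third_party(page_url: str, resource_url: str) -> bool:
    return urlparse(page_url).hostname != urlparse(resource_url).hostname

def looks_like_ad(page_url: str, resource_url: str) -> bool:
    host = urlparse(resource_url).hostname or ""
    on_blocklist = any(host == d or host.endswith("." + d) for d in BLOCKLIST)
    return on_blocklist and is_third_party(page_url, resource_url)
```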
Back in the early days of banner ads, a CSS-based approach to blocking was to target images by size. Since advertising revolved around specific standards of advertising "units" (effectively: sizes of images), those could be identified and blocked. That worked well, for a time.
This is ultimately whack-a-mole. For the past decade or so, point-of-origin based blockers have worked effectively, because that's how advertising networks have operated. If the ad targets start getting unified, we may have to switch to other signatures:
- Again, sizes of images or DOM elements.
- Content matching known hash signatures, or content that stays constant across multiple requests to a site (other than known branding elements / graphics).
- "Things that behave like ads behave" as defined by AI encoded into ad blockers.
- CSS / page elements. Perhaps applying whitelist rather than blacklist policies.
- User-defined element subtraction.
There's little in the history of online advertising that suggests users will simply give up.
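For the size-based idea above, a minimal sketch that flags elements matching a few classic IAB ad-unit dimensions; real pages would need fuzzier matching and a longer list:

```python
# Sketch: flag images/DOM elements whose dimensions match classic IAB ad units.
STANDARD_AD_SIZES = {
    (728, 90),    # leaderboard
    (468, 60),    # full banner
    (300, 250),   # medium rectangle
    (160, 600),   # wide skyscraper
    (320, 50),    # mobile banner
}

def looks_like_ad_unit(width: int, height: int, tolerance: int = 2) -> bool:
    return any(
        abs(width - w) <= tolerance and abs(height - h) <= tolerance
        for w, h in STANDARD_AD_SIZES
    )
```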
Some of those techniques will make the whole experience slow compared to the current network request filters and DNS blockers.
And that will probably be blocked or severely locked down by the most popular browser, Chrome.
I don't need to give advertisers data myself when someone else I know can do it for me. I really doubt it is easy to throw off the Chrome monopoly at this stage. I presume we will see a chilling effect before anything moves the way IE did.
I don't think DMOZ had ranking per se? They could mark "preferred" sites for any given category, but only a handful of them at most, and with very high standards, i.e. it needed to be the official site or "THE" definitive resource about X.
You are correct, the sites weren't "ranked" the same way that Google ranks sites now. But there were preferred sites, and each site had a description written by an editor who could be fairly unpleasant if they wanted to.
I had a site that appeared in DMOZ, and the description was written in such a way that nobody would want to visit it. But it was one of only a few sites on the internet at the time with its information, so it was included.
Google has taken on so many markets that I don't think they can do anything reasonably well (or disruptive) without conflicting interests. A breakup is overdue: if they didn't control both search and ads, the web would be a lot better nowadays. If they didn't control web browsers as well, standards would be much more important.
Create a core protocol at the same level as DNS etc., that web servers can use to offer an index of everything they serve/relay. A multitude of user-side apps may then query that protocol, with each app using different algorithms, heuristics and offering different options.
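To make that concrete, here is a hypothetical consumer sketch; the `/.well-known/site-index` endpoint and the record format are invented for illustration, and no such standard exists today:

```python
# Hypothetical consumer for a made-up "/.well-known/site-index" endpoint that a
# server could use to publish its own index as a JSON list of
# {"url": ..., "title": ..., "terms": [...]} records.
import json
from urllib.request import urlopen

def fetch_site_index(origin: str):
    with urlopen(f"{origin}/.well-known/site-index") as resp:
        return json.load(resp)

def local_search(index_records, query_terms):
    query = {t.lower() for t in query_terms}
    # Rank by simple term overlap; a real client would apply its own heuristics here.
    scored = [
        (len(query & {t.lower() for t in rec.get("terms", [])}), rec)
        for rec in index_records
    ]
    return [rec for score, rec in sorted(scored, key=lambda x: -x[0]) if score > 0]
```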
IF we had a distributable search protocol, index, and infrastructure ... the entire online landscape might look rather different.
Note that you'd likely need some level of client support for this. And the world's leading client developer has a strongly-motivated incentive to NOT provide this functionality integrally.
A distributed self-provided search would also have numerous issues -- false or misleading results (keyword stuffing, etc.) would be harder to vet than the present situation. Which suggests that some form of vetting / verifying provided indices would be required.
Even a provided-index model would still require a reputational (ranking) mechanism. Arguably, Google's biggest innovation wasn't spidering, but ranking. The problem now is that Google's ranking ... both doesn't work, and incentivises behaviours strongly opposed to user interests. Penalising abusive practices has to be built into the system, with those penalties being rapid, effective, and for repeat offenders, highly durable.
The potential for third-party malfeasance -- e.g., behaviour that appears to favour one site but is actually performed to harm that site's reputation by triggering black-hat SEO penalties -- also has to be considered.
As a user, the one thing I'd most like to be able to do is specify blacklists of sites / domains I never want to have appear in my search results. Without having to log in to a search provider and leave a "personalised" record of what those sites are.
(Some form of truly anonymised aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge.)
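A sketch of how purely client-side filtering could look, so the provider never sees the blocklist; the result shape and the blocklist file location are assumptions:

```python
# Sketch: a local, never-uploaded blocklist applied to search results.
# Results are assumed to be dicts with a "url" field.
import os
from urllib.parse import urlparse

def load_blocklist(path="~/.search-blocklist"):
    with open(os.path.expanduser(path)) as f:
        return {
            line.strip().lower()
            for line in f
            if line.strip() and not line.startswith("#")
        }

def filter_results(results, blocklist):
    kept = []
    for result in results:
        host = (urlparse(result["url"]).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in blocklist):
            continue
        kept.append(result)
    return kept
```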
I too have been thinking about these things for a long time, and I also believe a better future is going to include "aggregation of such blocklists would, of course, be of some use, and facilitating this is an interesting challenge."
I decided it is time for us to have a bouncer-bots portal (or multiple) - this would help not only with search results, but could also help people when using Twitter or similar - good for both the decentralized and centralized web.
My initial thinking was that these would be 'pull' bots, but I think they would be just as useful, and more used, if they were active browser extensions.
This way people can choose which type of censoring they want, rather than relying on a few others to choose.
I believe we could create some portals for these, similar to ad-block lists - people could choose to use Pete'sTooManyAds bouncer and/or Sam'sItsTooSexyForWork bouncer.
Ultimately I think the better bots will have switches where you can turn certain aspects of them on and off and re-search, or pull the latest Twitter/Mastodon things.
I can think of many types of blockers that people would want, and some that people would only want part of - so either varying degrees of blocking sexual content, or varying bots for varying types of things. Maybe some would have sliders instead of switches (see the sketch below).
Make them easy to form and comment on, and provide that info to the world.
I'd really like to get this project started. I'm not sure what the tooling should be - or what the backup would be if it started out as a browser extension but then got booted from the Chrome store or whatever.
Should this / could this be a good browser extension? What language / skills are required for making this? It's definitely on my future to-do list.
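A toy sketch of the composable-bouncer idea mentioned above; the bouncer names and predicates are illustrative only:

```python
# Sketch: composable "bouncer" filters with per-user switches, in the spirit of
# shareable ad-block lists. Item fields and heuristics are placeholders.
def too_many_ads(item):
    return item.get("ad_count", 0) > 5        # placeholder heuristic

def too_sexy_for_work(item):
    return "nsfw" in item.get("tags", [])     # placeholder heuristic

BOUNCERS = {
    "petes_too_many_ads": too_many_ads,
    "sams_too_sexy_for_work": too_sexy_for_work,
}

def apply_bouncers(items, enabled):
    """Drop any item rejected by an enabled bouncer; `enabled` is the user's switch set."""
    active = [fn for name, fn in BOUNCERS.items() if name in enabled]
    return [item for item in items if not any(fn(item) for fn in active)]
```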
There are some ... "interesting" ... edge cases around shared blocklists, most especially where those:
1. Become large.
2. Are shared.
3. Are not particularly closely scrutinised by users.
4. Circulate via very highly followed / celebrity accounts.
There are some vaguely similar cases of this occurring on Twitter, though some mechanics differ. Celebs / high-profile users attract a lot of flack, and take to using shared blocklists. Those get shared not only among celeb accounts but among their followers, and because celebs themselves are a major amplifying factor on the platform, being listed effectively means disappearing from the platform. That is particularly critical for those who depend on Twitter reach (some artists, small businesses, and others).
Names may be added to lists in error or out of malice.
This blew up in the summer of 2018 and carried over to other networks.
Some of the mechanics differ, but a similar situation playing out over informally shared Web / search-engine blocklists could have similar effects.
A sitemap simply tells you what pages exist, not what's on those pages.
Systems such as lunr.js are closer in spirit to a site-oriented search index, though that's not how they're presently positioned: they instead offer JS-based, client-implemented site search for otherwise static websites.
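The underlying idea, sketched in Python for brevity (lunr.js itself is JavaScript and does this client-side): a tiny per-site inverted index built from page text.

```python
# Sketch: a minimal per-site inverted index; pages maps URL -> page text.
import re
from collections import defaultdict

def build_index(pages):
    index = defaultdict(set)
    for url, text in pages.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return index

def search(index, query):
    terms = re.findall(r"[a-z0-9]+", query.lower())
    if not terms:
        return set()
    results = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        results &= index.get(term, set())   # require all query terms
    return results
```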
The basic principle of auditing is to randomly sample results. BlackHat SEO tends to rely on volume in ways that would be very difficult to hide from even modest sampling sizes.
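A minimal sketch of such an audit, assuming some `is_junk` detector already exists; the sample size and acceptable junk rate are arbitrary:

```python
# Sketch: audit an index (or a site's provided index) by random sampling.
import random

def audit(results, is_junk, sample_size=50, max_junk_rate=0.05):
    sample = random.sample(results, min(sample_size, len(results)))
    junk_rate = sum(1 for page in sample if is_junk(page)) / max(len(sample), 1)
    return junk_rate <= max_junk_rate   # fail the audit if too much junk turns up
```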
If a good site is on shared hosting, will it always be dismissed because of the signal of the other [bad] sites on that same host? (You did say at the DNS level, not the domain level.)