Hacker News
Ask HN: Can we create a new internet where search engines are irrelevant?
374 points by subhrm on June 26, 2019 | 373 comments
If we were to design a brand new internet for today's world, can we develop it in such a way that:

1- Finding information is trivial

2- You don't need services indexing billions of pages to find any relevant document

In our current internet, we need a big brother like Google or Bing to effectively find any relevant information in exchange for sharing with them our search history, browsing habits etc. Can we design a hypothetical alternate internet where search engines are not required?

I think it would be helpful to remember to distinguish two separate search engine concepts here: indexing and ranking.

Indexing isn't the source of problems. You can index in an objective manner. A new architecture for the web doesn't need to eliminate indexing.

Ranking is where it gets controversial. When you rank, you pick winners and losers. Hopefully based on some useful metric, but the devil is in the details on that.

The thing is, I don't think you can eliminate ranking. Whatever kind of site(s) you're seeking, you are starting with some information that identifies the set of sites that might be what you're looking for. That set might contain 10,000 sites, so you need a way to push the "best" ones to the top of the list.

Even if you go with a different model than keywords, you still need ranking. Suppose you create a browsable hierarchy of categories instead. Within each category, there are still going to be multiple sites.

So it seems to me the key issue isn't ranking and indexing, it's who controls the ranking and how it's defined. Any improved system is going to need an answer for how to do it.

Some thoughts on the problem, not intended as a complete proposal or argument:

* Indexing is expensive. If there's a shared public index, that'd make it a lot easier for people to try new ranking algorithms. Maybe the index can be built into the way the new internet works, like DNS or routing, so the cost is shared.

* How fast a ranking algorithm is depends on how the indexing is done. Is there some common set of features we could agree on that we'd want to build the shared index on? Any ranking that wants something not in the public index would need either a private index or a slow sequential crawl. Sometimes you could do a rough search using the public index and then re-rank by crawling the top N, so maybe the public index just needs to be good enough that some ranker can get the best result within the top 1000.

* Maybe the indexing servers execute the ranking algorithm? (An equation or SQL-like thing, not something written in a Turing Complete language). Then they might be able to examine the query to figure out where else in the network to look, or where to give up because the score will be too low.

* Maybe the way things are organized and indexed is influenced by the ranking algorithms used. If indexing servers are constantly receiving queries that split a certain way, they can cache / index / shard on that. This might make deciding what goes into a shared index easier.
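To make the "rough search on the public index, then re-rank the top N" bullet concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the toy corpus `DOCS`, the term-count index layout, and the `distinct_term_scorer` private ranker are hypothetical, not a proposal for what a real shared index should store.

```python
from collections import defaultdict

# Hypothetical toy corpus standing in for "the web".
DOCS = {
    "a": "penguin exhibits in michigan zoos",
    "b": "michigan penguin penguin penguin tickets cheap",
    "c": "history of the detroit zoo penguin center in michigan",
    "d": "linux kernel scheduling notes",
}

def build_public_index(docs):
    """Shared inverted index: term -> {doc_id: number of occurrences}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def rough_search(index, query, top_n):
    """Stage 1: cheap candidate retrieval using only the public index."""
    scores = defaultdict(int)
    for term in query.split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    ranked = sorted(scores, key=lambda d: (-scores[d], d))
    return ranked[:top_n]

def rerank(candidates, docs, query, scorer):
    """Stage 2: fetch each candidate and apply a private scoring function."""
    return sorted(candidates, key=lambda d: (-scorer(docs[d], query), d))

def distinct_term_scorer(text, query):
    """Example private ranker: count distinct query terms, ignoring repetition."""
    words = set(text.split())
    return sum(1 for term in set(query.split()) if term in words)

index = build_public_index(DOCS)
candidates = rough_search(index, "penguin exhibits michigan", top_n=3)
final = rerank(candidates, DOCS, "penguin exhibits michigan", distinct_term_scorer)
```

In this toy case the public index puts the keyword-stuffed page "b" first, and the private re-ranker moves "a" to the top. The public index only had to get the right page into the candidate set.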

> Indexing is expensive. If there's a shared public index, that'd make it a lot easier for people to try new ranking algorithms. Maybe the index can be built into the way the new internet works, like DNS or routing, so the cost is shared.

But what are you storing in your index? The content that is considered in your ranking will vary wildly by your ranking methods. (For example, early indexes cared only for the presence of words. Then we started to care about the count of words, then the relationships between words and the context, then about figuring out if the site was scammy, or slow.)

The only way to store an index of all content (to cover all the options) is to...store the internet.

I'm not trying to be negative - I feel very poorly served by the rankings that are out there, as I feel on 99% of issues I'm on the longtail rather than what they target. But I can't see how a "shared index" would be practical for all the kinds of ranking algorithms both present and future.

> The only way to store an index of all content (to cover all the options) is to...store the internet.

An index cannot hope to cover all options; these ideas are antithetical.

this is a pretty killer idea

How about open sourcing the ranking and then allowing people to customize it. I should be able to rank my own search results how I want to without much technical knowledge.

I want to rank my results by what is most popular to my friends (Facebook or otherwise) so I just look for a search engine extension that allows me to do that. This could get complex but can also be simple if novices just use the most popular ranking algorithms.

I think Facebook really missed the boat on building their own "network influenced" search engine. They made some progress in allowing you to search based on friends' posting and recommendations to some degree but it seems to have flatlined in the last few years and is very constricting.

One thing I haven't seen much on these recent threads on search is the ability to create your own Google Custom Search Engine based on domains you trust - https://cse.google.com/cse/all

Also, not many people have mentioned the use of search operators, which allow you to control the results returned, such as "Paul Graham inurl:interview -site:ycombinator.com -site:techcrunch.com"

That would lead to an even bigger filter bubble issue: a techno élite that is capable, willing and knowledgeable enough to feel the need to go through the hassle, while everyone else navigates an indexed mess that paves the way for all sorts of new gatekeepers belonging to that same tech élite. It’s not a simple issue to tackle; perhaps public scrutiny of the ranking algorithms would be a good first step.

I disagree. The people who don't know anything and are unwilling to learn wouldn't be any worse off than they are today and everyone else would benefit from an open source "marketplace" of possible ranking algorithms that the so called "techno elite" have developed.

I think the proposed improvement to the web should, in its intentions, mostly benefit the "ignorant", not those who can already navigate the biases of today's technological gatekeepers. Please note, the ignorant are not at fault for being so. Especially when governments cut funds for public education, and media leverages (and multiplies) ignorance to produce needs and sales, fears and votes. Any solution must work first to make the weak stronger, more conscious. A better and less biased web can help people grow their unbiased knowledge, and therefore exercise their right to vote with a deeper understanding of the complexity. Ignorant voters are an opportunity for ill-intentioned politicians, as much as they are a problem for me, you and the whole country.

blekko and bing both implemented ranking by popularity with your Facebook friends, and the data was too sparse to be useful.

If the details of a ranking algorithm are open source, it would be easy to manipulate them.

Open sourcing the ranking... YES!!!

I wonder if indexing and ranking could be decentralized. Let's say we design some data formats and protocols to exchange indexing and ranking information. Then maybe instead of getting a single Google, we could have a hierarchical system of indexers and rankers and some sort of consensus and trust algorithm to aggregate the information between them. Maybe offload indexing to the content providers altogether, i.e. if you want your website found, you need to maintain your own index. Maybe do a market on aggregator trust: if you don't like a particular result, the corresponding aggregator loses a bit of trust and its rankings become a bit less prominent.
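A minimal Python sketch of the "aggregator loses a bit of trust" mechanic, under stated assumptions: the class name `TrustWeightedAggregator`, the multiplicative `penalty`/`reward` factors, and the reciprocal-rank merge are all hypothetical choices made for illustration, not part of any existing protocol.

```python
class TrustWeightedAggregator:
    """Merges result lists from several rankers, weighted by per-ranker trust."""

    def __init__(self, ranker_names, penalty=0.9, reward=1.05):
        # Every ranker starts with the same trust (assumption).
        self.trust = {name: 1.0 for name in ranker_names}
        self.penalty = penalty
        self.reward = reward

    def merge(self, results_by_ranker):
        """results_by_ranker: {ranker: [url, ...]} with the best result first.

        Each ranker contributes trust / (position + 1) to a URL's score.
        """
        scores = {}
        for ranker, urls in results_by_ranker.items():
            for position, url in enumerate(urls):
                scores[url] = scores.get(url, 0.0) + self.trust[ranker] / (position + 1)
        # Highest score first; ties broken alphabetically for determinism.
        return sorted(scores, key=lambda u: (-scores[u], u))

    def feedback(self, ranker, liked):
        """User feedback nudges a ranker's trust up or down."""
        self.trust[ranker] *= self.reward if liked else self.penalty


agg = TrustWeightedAggregator(["A", "B"])
results = {"A": ["x", "y"], "B": ["y", "x"]}
before = agg.merge(results)
agg.feedback("A", liked=False)   # the user disliked ranker A's top pick
after = agg.merge(results)
```

With equal trust the two rankers cancel out; after one negative vote against "A", ranker "B"'s preferred result moves to the top. This says nothing about how to stop people gaming the feedback itself.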

Spitballing here, but what if instead of a monolithic page rank algorithm, you could combine individually maintained, open set rankings?

===Edit=== I mean to say you as the user would gain control over the ranking sources, the company operating this search service would perform the aggregation and effectively operate a marketplace of ranking providers. ===end edit===

For example, one could be an index of "canonical" sites for a given search term, such that it would return an extremely high ranking for the result "news.ycombinator.com" if someone searches the term "hacker news". Layer on a "fraud" ranking built off lists of sites and pages known for fraud, a basic old-school page rank (simply order by link credit), and some other filters. You could compose the global ranking dynamically based off weighted averages of the different ranked sets, and drill down to see what individual ones recommended.
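As a rough illustration of composing a global ranking from weighted averages of separately maintained ranked sets, here is a small Python sketch. The source names, the `.example` domains, the weights, and the choice to score a missing URL as 0.0 are all assumptions made up for this example.

```python
def compose_ranking(score_sets, weights):
    """Combine per-source scores into one ranking via a weighted average.

    score_sets: {source_name: {url: score in [0, 1]}}
    weights:    {source_name: non-negative weight}
    Returns a list of (url, combined_score, breakdown), best first, where
    breakdown shows what each individual source recommended (the drill-down).
    """
    total_weight = sum(weights.values())
    urls = set().union(*(scores.keys() for scores in score_sets.values()))
    rows = []
    for url in urls:
        # Assumption: a URL a source has never heard of scores 0.0 there.
        breakdown = {src: scores.get(url, 0.0) for src, scores in score_sets.items()}
        combined = sum(weights[src] * breakdown[src] for src in breakdown) / total_weight
        rows.append((url, combined, breakdown))
    rows.sort(key=lambda row: (-row[1], row[0]))
    return rows


# Hypothetical ranking providers for the query "hacker news".
score_sets = {
    "canonical": {"news.ycombinator.com": 1.0},
    "not_fraud": {"news.ycombinator.com": 1.0, "hn-mirror.example": 1.0,
                  "free-prizes.example": 0.0},
    "link_rank": {"news.ycombinator.com": 0.8, "hn-mirror.example": 0.6,
                  "free-prizes.example": 0.9},
}
weights = {"canonical": 3.0, "not_fraud": 2.0, "link_rank": 1.0}
ranking = compose_ranking(score_sets, weights)
```

The weights are the part a user (or the aggregator) would tune. The per-source breakdown is kept so you can see which list pushed a result up or down.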

Seems hard to crunch in real time, but not sure. It'd certainly be nicer to have different orgs competing to maintain focused lists, rather than a gargantuan behemoth that doesn't have to respond to anyone.

Maybe you could even channel ad or subscription revenue from the aggregator to the ranking agencies based off which results the user appeared to think were the best.

Well I suppose Google has some way of customizing search for different people. The big issue for me is that google tracks me to do this. Maybe there could be a way to deliver customized search where we securely held the details of our customization. Or we were pooled with similar users. I suppose if a ranking algorithm had all the possible parameters as variables, we could deliver our profile request on demand at the time of search. That would be nice. You could search as a Linux geek or as a music nut or see the results different political groups get.

Building something like this becomes much easier with Xanadu-style bidirectional links. Of course, building those is hard, but eliminating the gatekeeper-censors may finally be the incentive required to get bidi links built. It's also worth noting that such a system will have to have some metrics for trust by multiple communities (e.g. Joe may think say, mercola.com is a good and reliable source of health info, while Jane thinks he's stuck in the past - People should be able to choose whether they value Joe's or Jane's opinion more, affecting the weights they'll see). In addition (and this is hard, too), those metrics should not be substantially game-able by those seeking to either promote or demote sites for their own ends. This requires a very distributed trust network.

I like the idea of local personalized search ranking that evolves based off of an on-device neural network. I'm not sure how that would work though.

Sounds like ad-Blocker repos, nice!

Not to mention all the people who will carefully study whatever new system, looking for their angle to game the ranking.

> When you rank, you pick winners and losers.

...To which people responded with various schemes for fair ranking systems.

...To which people observed that someone will always try to game the ranking systems.

Yep! So long as somebody stands to benefit (profit) from artificially high rankings, they'll aim for that, and try to break the system. Those with more resources will be better able to game the system, and gain more resources... ad nauseam. We'd end up right where we are.

The only way to break that [feedback loop](https://duckduckgo.com/?q=thinking+in+systems+meadows) is to disassociate profit from rank.

Say it with me: we need a global, non-commercial network of networks--an internet, if you will. (Insert Al Gore reference here.)

(Note: I don't have time to read all the comments on this page before my `noprocrast` times out, so please pardon me if somebody already said this.)

This is a bang on distillation of the problem (or at least one way to view the problem, per "who controls the ranking and how it's defined").

That’s a very useful distinction, which brings me to a question: are we sure that automating ranking in 2019, on the basis of publicly scrutinized algorithms, would bring us back to a pre-Google accuracy? Also, ranking on the basis of the query alone, instead of the individual, would lead to much more neutral results.

Absolutely spot on... I've been using DDG as my default search engine for a couple months. But, google has a huge profile on me. I find myself falling back to google a few times a day when searching for technical terms/issues.

Couldn't you just randomize result ordering?

You know how google search results can get really useless just a few pages in? And it says it found something crazy like 880,000 results? Imagine randomizing that.


Unrelated I searched for "Penguin exhibits in Michigan". Of which we have several. It reports 880,000 results but I can only go to page 12 (after telling it to show omitted results). Interesting...


Think of it as being like an old fashioned library or an old fashioned Blockbuster video store.

Sure you could read any book ever printed in the English language in the local library. They might have to get it in from the national collection or the big library in the city. But you ain't going to see every book in the local library. There is more than you could wish for and you will never read every book in the local library. But all the classics are there, the talked about new books are there (or out on loan, back soon). All the reference books that school kids need are there, there is enough to get you started in any hobby.

Google search results are like that. Those 880,000 'titles' are a bit like the Library of Congress boasting how big it is, it is just a number. All they have really got for you is a small selection that is good enough for 99% of people 99% of the time. Only new stuff by people with Page rank (books with publishers) get indexed now and put into the 'main collection'.

Much like how public libraries do have book sales, Google do let a lot of the 880,000 results drop off.

It's a ruse!

I heard that they also filter results by some undisclosed parameters, like they don't show you anything that hasn't been modified in the last ~10 years, no matter how hard you try

Yeah, this is a real problem for research into older things that have no need to change. Google seems to think that information has a half-life. That's really only true in the social space. Truth is eternal.

Sure, but then whoever gets to populate the index chooses the winners and losers, because you could just stuff it with different versions of the content or links you wanted to win and the random ranking would show those more often, because they appear in the pool of possible results more often.
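To put a number on the stuffing argument, here is a short Python calculation. It assumes a deliberately simple model that is not from the thread: a results page is a uniformly random sample of `page_size` entries from a pool of `pool_size` matching entries, of which `copies` belong to one publisher. The function name and the figures are hypothetical.

```python
from fractions import Fraction
from math import comb

def chance_on_first_page(pool_size, copies, page_size):
    """Probability that at least one of a publisher's `copies` entries lands
    on a uniformly random page of `page_size` results from `pool_size` entries."""
    if copies > pool_size:
        raise ValueError("copies cannot exceed pool_size")
    page_size = min(page_size, pool_size)
    others = pool_size - copies
    # P(no copy on the page) = C(others, page_size) / C(pool_size, page_size)
    miss = Fraction(comb(others, page_size), comb(pool_size, page_size))
    return 1 - miss

honest = chance_on_first_page(pool_size=100, copies=1, page_size=10)
# Same publisher after adding 49 near-duplicate pages to the pool.
stuffed = chance_on_first_page(pool_size=149, copies=50, page_size=10)
```

Under that model, one entry among 100 has exactly a 1-in-10 chance of making a 10-result page, while 50 copies among 149 entries put the publisher on the page more than 98% of the time.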

That would make it waaaay less useful to searchers and wayyy easier to game by stuffing results with thousands of your own results

I suspect just randomizing the first 20 or so results would fix most problems. The real issue is people putting effort into hitting the first page, so if you took the benefit out of doing that people would look for other ways to spend their energy.

If you find nothing useful, just refresh for a new set. It would also help discovery.
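A minimal Python sketch of "randomize only the first 20": the top `k` results are shuffled among themselves and everything below keeps its order. The function name `shuffle_top` and the injectable `rng` are assumptions added for the example.

```python
import random

def shuffle_top(results, k=20, rng=None):
    """Return a copy of `results` with only the first k entries shuffled."""
    rng = rng or random.Random()
    head = list(results[:k])
    rng.shuffle(head)          # in-place shuffle of the top-k slice only
    return head + list(results[k:])

ranked = [f"result-{i}" for i in range(1, 31)]
page = shuffle_top(ranked, k=20, rng=random.Random(2019))
```

Calling it again with a fresh `rng` is the "refresh for a new set" part. Note that it does nothing about who gets into the top 20 in the first place.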

Sounds like a great ux

Yes, it was called Yahoo and it did a good job of cataloging the internet when hundreds of sites were added per week: https://web.archive.org/web/19961227005023/http://www2.yahoo...

I'm old enough to remember sorting sites by new to see what new URLs were being created, and getting to the bottom of that list within a few minutes. Google and search were a natural response to that problem as the number of sites added to the internet grew exponentially...meaning we need search.

Directories are still useful - Archive of Our Own (https://archiveofourown.org/) is a large example for fan fiction, Wikipedia has a full directory (https://en.wikipedia.org/wiki/Category:Main_topic_classifica...), Reddit wikis perform this function, Awesome directories (https://github.com/sindresorhus/awesome) or personal directories like mine at href.cool.

The Web is too big for a single large directory - but a network of small directories seems promising. (Supported by link-sharing sites like Pinboard and HN.)

Yes! But, of course, for directories outside of Wikipedia. This is very interesting for its classification structure. It's so typical of Wikipedia that a 'master list of lists' (by my count, there are 589 list links on this page) contains lists such as "Lists of Melrose Place episodes" and "Lists of Middle-earth articles" alongside lists such as "lists of wars" or "lists of banks".

Ao3 isn't really a directory since they do the actual hosting

Yes, thank you - I only mean in terms of organization.

I used Yahoo back in those days, and it literally proved the point that hand-cataloging the internet wasn't tractable, at least not the way Yahoo tried to do it. There was just too much volume.

It was wonderful to have things so carefully organized, but it took months for them to add sites. Their backlog was enormous.

Their failure to keep up is basically what pushed people to an automated approach, i.e. the search engine.

I found myself briefly wondering if it were possible to have a decentralized open source repository of curated sites that anyone could fork, add to, or modify. Then I remembered dmoz, which wasn't really decentralized -- and realized that "awesome lists" on GitHub may be a critical step in the direction I had envisioned.

I think this could work for small, specific areas of interest. For example, there are only so many people writing about, and interested in reading about, programming language design. Those small communities could stand ready with their community-curated index when an "outsider" wants to research something they know well.

You don't have to go all the way back into Yahoo-era when it comes to manually curated directories: DMOZ was actively maintained until quite recently, but ultimately given up for what seems like good reasons.

This is true, and DMOZ was used heavily by Google's earlier search algorithms to rank sites within Google. Early moderators of DMOZ had god like powers to influence search results.

Earlier than that there was a list of ftp sites giving a summary of what was available on each.

I wonder if you could build a Yahoo/Google hybrid where you start with many trusted catalogs run by special interest groups then index only those sites for search. Doesn't fully solve the centralization problem, but interesting nonetheless.

Everyone has missed the most important aspect of search engines, from the point of view of their core function of information retrieval: they're the internet equivalent of a library index.

Either you find a way to make information findable in a library without an index (how?!?) or you find a novel way to make a neutral search engine - one that provides as much value as Google but whose costs are paid in a different way, so that it does not have Google's incentives.

The problem is that current search engines are indexing what is essentially a stack of random books thrown together by anonymous library goers. Before being able to guide readers to books, librarians have to do the following non-trivial tasks over the entire collection:

- identify the book's theme

- measure the quality of the information

- determine authenticity / malicious content

- remember the position of the book in the colossal stacks

Then the librarian can start to refer people to books. This problem was actually present in libraries before the revolutionary Dewey Decimal System [1]. Libraries found that the disorganization caused too much reliance on librarians and made it hard to train replacements if anything happened.

The Internet just solved the problem by building a better librarian rather than building a better library. Personally I welcome any attempts to build a more organized internet. I don't think the communal book pile approach is scaling very well.

[1]: https://en.wikipedia.org/wiki/Dewey_Decimal_Classification

>I welcome any attempts to build a more organized internet. I don't think the communal book pile approach is scaling very well.

Let me know if I misunderstand your comment but to me, this has already been tried.

Yahoo's founders originally tried to "organize" the internet like a good librarian. Yahoo in 1994 was originally called, "Jerry and David's Guide to the World Wide Web"[0] with hierarchical directories to curated links.

However, Jerry & David noticed that Google's search results were more useful to web surfers and Yahoo was losing traffic. Therefore, in 2000 they licensed Google's search engine. Google's approach was more scalable than Yahoo's.

I often see several suggestions that the alternative to Google is curated directories but I can't tell if people are unaware of the early internet's history and don't know that such an idea was already tried and how it ultimately failed.

[0] http://static3.businessinsider.com/image/57977a3188e4a714088...

I remember trying to get one of my company's sites listed on Yahoo! back in the late 1990s. Despite us being an established company (founded in 1985) with a good domain name (cardgames.com) and a bunch of good, free content (rules for various card games, links to various places to play those games online, etc.), it took months.

That was not a bad thing. It was curated. Most of the crap never made it in the directory precisely because humans made decisions about what got in. If you wanted in the directory faster, you could pay a fee to get to the front of the queue. The result is that Yahoo could hire people to process the queue and make money without ads.

Isn't paying money to jump to the front of the queue just another form of advertising?

That was my experience as well. For old companies and new. Yahoo was just really slow.

> I often see several suggestions that the alternative to Google is curated directories but I can't tell if people are unaware of the early internet's history and don't know that such an idea was already tried and how it ultimately failed.

Why not both?

1) The idea is that a more organized structure is easier for a librarian to index. Today, libraries still have librarians. The book pile just wouldn't take decades to build familiarity.

2) Times change. New technology exists, people use the internet differently, and there's more at stake. Just because an approach didn't work before doesn't mean that it won't work now.

There are real problems with an organizational approach, but I don't see why the idea isn't worth a revisit.

There are plenty of these; Wikipedia has a list [1].

I think these efforts get bogged down in the huge amount of content out there, the impermanence of that content and also the difficulty in placing sites into ontologies.

And at the end of the day, there's not a large enough value proposition to balance the immense effort.

I think, if you were to do it today, you would want to work on / with the internet archive, so at least things that were categorized wouldn't change or disappear (as much)

[1] https://en.m.wikipedia.org/wiki/List_of_web_directories

Obviously a naïve web directory isn't going to cut it.

What would make the approach viable is if there were a nice way to automate and crowd source most/all of the effort. Maybe that means changing the idea of what makes a website. Maybe there could just be little grass roots reddit-esque communities that are indexed/verified (google already favors reddit/hn links). Who knows, but it's an interesting problem to kick around.

>What would make the approach viable is if there were a nice way to automate and crowd source most/all of the effort.

But to me, crowdsourcing is also what Jerry & David did. The users submitted links to Yahoo. AltaVista also had a form for users to submit new links.

Also, Wikipedia's list of links are also crowdsourced in the sense that many outside websurfers (not just staff editors) make suggested edits to the wiki pages. Looking at a "revision history" of a particular wiki page makes the crowdsourced edits more visible: https://en.wikipedia.org/w/index.php?title=List_of_web_direc...

Sometimes it just takes a small change to make an idea work. Neural networks weren't viable until GPUs/backpropagation. Dismissive comments like this aren't very useful.

>Dismissive comments

I wasn't being dismissive. I was trying to refine your crowdsourcing idea by explicitly surfacing what's been tried in the past.

The thread's op asks: "Can we create a new internet where search engines are irrelevant?"

If the current best answer for op is: "I propose crowdsourced curated directories as the alternative to Google/Bing -- but the implementation details are left as an exercise for the reader" ... it's fine that our conversation terminates there and we don't have to go around in circles. The point is I didn't know this thread's discussion ultimately terminates there until I ask more probing questions so people can try to expand on what their alternative proposal actually entails. I also don't know what baseline knowledge the person proposing the idea has. I.e. does the person suggesting an idea have knowledge of the internet's evolution, and has that been taken into account?

> Maybe there could just be little grass roots reddit-esque communities that are indexed/verified

Verified by who, exactly?

I know, I know... "dismissive comment", but it's an important thing to think about: Who decides what goes in the library? It's an evergreen topic, even in real, physical libraries, as those tedious lists of "Banned And Challenged Books" attest. It seems every time a copy of Huckleberry Finn gets pulled from an elementary school library in Altoona everyone gets all upset, so can you imagine what would happen if the radfems got their hands on a big Web Directory and cleansed it of all positive mentions of trans people?

I imagine the communities would kind of serve as a public index in aggregate that have a barrier to entry / reputation. If one turns to crap just ignore it with whatever search tool you're using.

It wouldn't be about policing, just organizing.

Consider the sheer size of the internet now. Even if you could categorize and file that many websites accurately, how do you display that to the user in a way that's usable? It will probably look a lot like a search engine, no matter which way you frame it.

The underlying goal: "Get a user the information they want when they don't know where it lives" isn't really going to be helped by a non-searchable directory of millions of sites.

The current search engines are also indexing books maliciously inserted in the library in a way to maximize their exposure e.g. a million "different" pamphlets advertising Bob's Bible Auto Repair Service inserted in the Bible category.

A "better library" can't be permissionless and unfiltered; the Dewey Decimal System relies on the metadata being truthful, and the internet is anything but.

You can't rely on information provided by content creators; manual curation is an option but doesn't scale (see the other answer re: early Yahoo and Google).

Perhaps there exists a happy medium between manual curation and unfiltered.

PageRank is kind of a pseudo manual curation. The manual effort is just farmed out to the greater internet and analyzed.
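For readers who haven't seen it, here is a minimal Python sketch of the textbook power-iteration form of PageRank, to show how links act as farmed-out votes. It is not Google's production algorithm; the damping factor of 0.85, the fixed iteration count, the dangling-page handling, and the toy link graph are assumptions for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}. Returns {page: score}; scores sum to 1."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline (the "random jump" term).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its current rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical four-page web: three authors each "vote" for page c.
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(toy_links)
```

Here "c" ends up on top because three pages link to it, and "a" outranks "b" and "d" only because the well-regarded "c" links to it. That is the curation-by-proxy effect, and also the thing link farms exploit.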

The really hard part of this to scale is the quality metric. Google was the first to really scale quality measurement by outsourcing it to the web content creators themselves.

Any attempt to create a decentralized index will need to tackle the quality metric problem.

Also, there's a massive economic market built on top of what is on the closest shelves. Libraries are less sensitive to these forces.

They are also a spam filter. It's not just an index of what's relevant, but removal of what maliciously appears to be relevant at first glance.

This. Everyone's missing the point of a search engine.

We're talking about billions of pages and if not ranked (authority is a good heuristic), filtered (de-ranked), etc then good luck finding valuable information because everyone is gaming the systems to improve their ranking.

I think this is part of the reason you get a lot of fake news on social media. It's a constant stream of information (a new dimension of time has been added to the ranking, basically) that needs to be ranked and with humans in the loop, there's no way to do this very easily without filtering for noise and outright malicious content.

i disagree that there isn't a way, just that nobody's tried a good one yet.

take reddit for example. it should be very easy to establish a few voters who make "good" decisions, and then extrapolate their good decisions based on people with similar voting patterns. it would combine a million monkeys with typewriters with expert meritocracy. you want different sorting, sort by different experts until you get the results you want. it seems every platform is too busy fighting noise to focus on amplifying signal, or are focused on teaching machines to do the entire task, instead of using machines to multiply the efficiency of people with taste who can make a good judgement call with regard to whether something is novel or pseudo-intellectual. Not to pick on them, but I would suspect an expert to be better at deranking aeon/brainpickings type clickbait than an eruditelike ai, if only because humans can still more easily determine if someone is making an actual worthwhile point, vs repeating a platitude, conventional wisdom, or something hollow.

It should, but if anyone knows who these kingmakers are, it's still probably just a matter of time before they accrue enough power for it to be worth someone's time to at least try to track them down and manipulate their decisions (bribe, blackmail, sponsor, send free trials, target with marketing/propaganda campaigns, etc.)

Who says it even has the same kingmakers every day? Slashdot solved that part of metamoderation two decades ago.

A person might be an expert in cars but not horses. A car expert might be superseded. The seed data creators could be a fluid thing.

This is a technocracy. No one wants this but Hacker News.

Let's say you have a subreddit like /r/cooking. You think exposing a control in the user agent (browser, app, ui) that lets you sort recipe results by lay democracy, professional chefs, or restaurant critics' taste is a technocracy?

Are consumer reports and wirecutter less valuable than Walmart's best sellers? Is techmeme.com worse than Hackernews by virtue of being a small cabal of voters? Should I dismiss longform.org and aldaily as elitist because they aren't determining priority solely from the larger population's preferences? Is Facebook's news algorithm better because it uses my friends to suggest content?

Is it a technocracy that metacritic and rotten tomatoes show both user and critic score? I'm proposing an additional algorithm that compares critic score with user score to find like voters and extrapolate how a critic would score a movie they have never seen. I think that would be useful without diminishing the other true scores. I would find it useful to be able to choose my own set of favorite letterboxd or redef voters and see results it predicts they would recommend, despite them never having actually voted on a movie or article. Instead of seeding a movie recommendation algorithm with my thoughts, I could input others' already well-documented opinions to speed up the process.

This idea would work better if people voted without seeing each other's votes until after they vote. It might be hard to extrapolate Roger Ebert's preferences if voters formed their opinions of movies based on his reviews. You'd end up with a false positive that mimics his past but poorly predicts his future.
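Here is a minimal Python sketch of that "find like voters and extrapolate" step, assuming votes are encoded as +1 (liked) and -1 (disliked). The cosine similarity, the decision to ignore users who disagree with the critic, and all names and data are hypothetical choices for illustration, not how any of the sites mentioned actually work.

```python
from math import sqrt

def similarity(a, b):
    """Cosine similarity of two {item: vote} dicts over the items both voted on."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_expert_vote(expert, users, item):
    """Estimate the expert's vote on an unseen item from like-minded users.

    Users who tend to disagree with the expert get weight 0 (assumption).
    Returns None when no like-minded user has voted on the item.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for votes in users.values():
        if item not in votes:
            continue
        weight = max(similarity(expert, votes), 0.0)
        weighted_sum += weight * votes[item]
        total_weight += weight
    return weighted_sum / total_weight if total_weight > 0 else None

critic = {"m1": 1, "m2": -1, "m3": 1}          # has never seen "m4"
users = {
    "u1": {"m1": 1, "m2": -1, "m3": 1, "m4": 1},    # agrees with the critic
    "u2": {"m1": -1, "m2": 1, "m3": -1, "m4": -1},  # opposite taste, ignored
    "u3": {"m1": 1, "m2": 1, "m3": 1, "m4": -1},    # partial agreement
}
predicted = predict_expert_vote(critic, users, "m4")
```

In this toy data the prediction for "m4" comes out at 0.5, a mild thumbs-up: the user who matches the critic exactly liked it, and the partially matching user's dislike is discounted.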

The reverse is a problem too, Google filtering things out based on their political leanings in an attempt to shape public opinion.

I haven't seen any examples which were anything other than runaway persecution complexes of those who found their world view was less popular than they believed - which were greeted with exasperation by testifying engineers who had to explain how absurdly unscaleable it would be to do it manually.

I think heavy reliance on human language (and its ambiguity) is one of the main problems.

Maybe personal whitelist/blacklist for domains and authors could improve things. Sort of "Web of trust" but done properly.

Not completely without search engines, but for example, if every website was responsible for maintaining its own index, we could effectively run our own search engines after initialising "base" trusted website lists. Let's say I'm new to this "new internet", I ask around what are some good websites for information I'm interested in. My friend tells me wikipedia is good for general information, webmd for health queries, stackoverflow for programming questions, and so on. I add wikipedia.org/searchindex, webmd.com/searchindex and stackoverflow.com/searchindex to my personal search engine instance, and every time I search something, these three are queried. This could be improved with local cache, synonyms, etc. As you carry on using it, you expand your "library". Of course it would increase the workload of individual resources, but it has the potential to give the feel of that web 1.0 once again.
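To make the idea concrete, here is a minimal sketch of such a personal search instance. The per-site `/searchindex` endpoints are stood in for by in-memory dicts (`SITE_INDEXES` is hypothetical); a real version would query each site over HTTP. The "library" grows as you add sites, and results are cached locally.

```python
# Hypothetical stand-ins for each site's /searchindex endpoint. A real
# version would fetch these over the network; here they are plain dicts.
SITE_INDEXES = {
    "wikipedia.org": {"python": ["wikipedia.org/wiki/Python"]},
    "webmd.com": {"flu": ["webmd.com/flu"]},
    "stackoverflow.com": {"python": ["stackoverflow.com/q/python-list"]},
}

class PersonalSearch:
    def __init__(self):
        self.library = []  # trusted sites, in the order they were added
        self.cache = {}    # query -> merged results (the "local cache")

    def add_site(self, domain):
        if domain not in self.library:
            self.library.append(domain)
            self.cache.clear()  # cached answers may be missing the new site

    def search(self, query):
        key = query.lower()
        if key not in self.cache:
            hits = []
            for domain in self.library:
                hits.extend(SITE_INDEXES.get(domain, {}).get(key, []))
            self.cache[key] = hits
        return self.cache[key]
```

Results are simply concatenated in library order. Merging and ranking hits from different sites is the hard part, and this sketch does not attempt it.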

This was devised by Amazon in 2005. They called it OpenSearch (http://www.opensearch.org/). Basically it was a standard way to expose your own search engine on your site. It made it easy to programmatically search a bunch of individual sites.

This would be ludicrously easy to game. Crowdsourcing would also be ludicrously easy to game.

The problem isn't solvable without a good AI content scraper.

The scraper/indexer either has to be centralised - an international resource run independently of countries, corporations, and paid interest groups - or it has to be an impossible-to-game distributed resource.

The former is hugely challenging politically, because the org would effectively have editorial control over online content, and there would be huge fights over neutrality and censorship.

(This is more or less where we are now with Google. Ironically, given the cognitive distortions built into corporate capitalism, users today are more likely to trust a giant corporation with an agenda than a not-for-profit trying to run independently and operate as objectively as possible.)

Distributed content analysis and indexing - let's call it a kind of auto-DNS-for-content - is even harder, because you have to create an un-hackable un-gameable network protocol to handle it.

If it isn't un-gameable it becomes a battle of cycles, with interests with access to more cycles being able to out-index those with fewer - which will be another way to editorialise and control the results.

Short answer - yes, it's possible, but probably not with current technology, and certainly not with current politics.

Just want to point out that you're on a site that successfully uses crowd sourcing combined with moderation to curate a list of websites, news, and articles that people find interesting and valuable. Why not a new internet built around communities like this where the users actively participate in finding, ranking, and moderating the content they consume? It's not a stretch to add a decent search index and categories to a news aggregator, most do it already. If these tools could be built into the structure of the web we'd be half way there.

Edit: I had myself convinced that comments have a different ID space from submissions, but that obviously isn't true. I've partly rewritten to correct for an over-guess on how many new submissions there are each day.

I agree with your general suggestion, but just want to highlight that scale issues still make me think whatever finds traction on HN is a bit of a crapshoot.

It looks like there were over 10k posts (including comments) in the last day, and the list of submissions that spent time on the front page yesterday has 84 posts. I don't know how normal the last 2 days were, but by eyeball I'd guess around a quarter of the posts are comments on the day's front-page posts. This means there are probably a few thousand submissions that didn't get much if any traction.

Any time I look at the "New" page, I still end up finding several items that sound interesting enough to open. I see more than 10 that I'm tempted to click on right now. The current new page stretches back about 40 minutes, and only 10 of the 30 have more than 1 point (and only 1 has more than 10). Only 2 of the links I was tempted to click on have more than 1 point.

I suspect that there's vastly more interesting stuff posted to HN than its current dynamics are capable of identifying and signal-boosting. That's not bad, per se. It'd be an even worse time-sink if it were better at this task. But it does mean there are pitfalls using it as a model at an even larger scale and in other contexts.

User's search engine doesn't have to trust suggestions verbatim, it can always run its own heuristic on top of returned results. And the user could reduce the weight of especially uncooperative domains or blacklist them altogether.

So long as there is a mechanism for categorizing information and ranking the results, people will try to game the mechanism to get the top spot regardless of your own incentives.

Despite their incentives to make money, Google have actually been trying for years to stop people from gaming the system. It's impressive how far they've been able to come, but their efforts are thwarted at every turn thanks to the big budgets employed to get traffic to commercial websites.

The only assured way to have a "neutral" search engine is to run your own spiders and indexers which you understand completely.

Neutral in that sense is only "not serving the agenda or judgement of another" at the obvious cost of labor and not just as a one off thing as the searched content often attempts to optimize for views. It isn't like a library of passive books to sort through but a Harry Potter wizard portrait gallery full of jealous media vying for attention.

And pedantically it isn't true neutral - but serves your agenda to the best of your ability. A "true neutral" would serve all to the best of their ability.

Besides neutrality in a search engine on a literal level is oxymoronic and self defeating - its whole function is to prioritize content in the first place.

A few years ago there was that blogs thing, with rss... all things that favoured federation, independent content generation, etc. Now it's all about platforms. I understand that "regular people" are more comfortable with Facebook but, other than that, why are blogs and forums less popular now?

The problem with forums is that you end up visiting 5~10 different forums, each with their own login, and some of them might be restricted at work (not that you should visit them often).

So it's easier to have 2~4 aggregators in where all the information you desire resides, even if in each of them there are different forums.

A unified entry point helps adoption.

So, instead of platforms, other option would be a client software for different forums. Like Tapatalk. Is there anything like that but libre and/or desktop?

Reddit really did a good job of moving the masses away from site-specific forums.


I'd argue that forums and blogs require more effort.

Read a cool blog post? Nobody around you will ever give a shit, because in order to do so, they'd have to read it too. Shared a photo from a vacation? It might start a conversation or two with people around you, while you receive dozens or hundreds of affirmations (in the form of likes).

I don't like to use social networks, but that's what I fall back on when I have a few minutes to spare. I rarely look at my list of articles I've saved for later — who has time for that?

>I don't like to use social networks, but that's what I fall back on when I have a few minutes to spare. I rarely look at my list of articles I've saved for later — who has time for that?

Plenty of people. Ever push an article to a reader view service and see how long it takes to read? Most articles posted here on HN or the nyt front page can be read in 3-5 mins. Occasionally you'd get a 20 min slog.

I used to use social media way more, and by far my biggest wastes of time on the platform were those spare minutes you get a dozen times a day. On the elevator, waiting for the bus, waiting on food, anytime I could sit still the phone went out and my head went down because that's what everyone around me was also doing while waiting on their coffee.

Eventually I realized I was just idly scrolling and not retaining anything at all from those 30s-2m sessions on instagram. Just chomping visual popcorn. Now, anytime I have a spare 10 mins, I'll read an article or two from my reading list. Anytime I have less than a spare 10 mins, I'll twiddle my thumbs and keep the phone in the pocket.

I used to be much more scatterbrained and had trouble winding down for the evening and getting good rest. Now, I feel like a monk.

the problem is multiple actually: a) most internet-connected devices these days favor content consumption vs content creation (blogs vs instagram),

b) mainstream culture > closely-knit communities (facebook > forums)

c) big-player takeovers (facebook for groups, google for search) over previously somewhat niche areas and, actually, internet infrastructure

d) if you're not a big player, you don't exist... and back to c)

> a) most internet-connected devices these days favor content consumption vs content creation (blogs vs instagram),

You chose Instagram as your example, to make the point that phones favor consumption over creation?

Yes! Instagram has the appearance of a OC/creation platform, but, typically of such platforms (such as twitter/fb) the "content" is low-effort "convenient" opportunistic trivia, and the product consumed is likes, followers, etc.

A search engine is more like putting the books in a paper shredder and writing the book title on every piece, then ordering the pieces by whatever words you can find on them, putting all pieces that have the word "hacker" on them in the same box. The problem then becomes how you sort the pieces. Want to find a book about "hacking"? This box has all the shreds that have the word "hacker" on them; you can find the book title on the back of each piece. The second problem becomes how relevant the word is to the book.
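The shredder picture is essentially an inverted index. A tiny sketch, with made-up book titles and texts, where each word is a "box" holding the titles written on the back of its shreds:

```python
import re
from collections import defaultdict

# Made-up books for illustration: title -> text.
books = {
    "Hackers": "The hacker ethic",
    "Cooking": "A hacker in the kitchen",
    "Birds": "Owls and wrens",
}

def build_index(books):
    """Map each word (a box) to the set of titles whose text contains it."""
    index = defaultdict(set)
    for title, text in books.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(title)  # the shred carries its book title
    return index

def lookup(index, word):
    """Return the titles in one box, in alphabetical order."""
    return sorted(index.get(word.lower(), set()))

index = build_index(books)
```

`lookup(index, "hacker")` returns both "Cooking" and "Hackers" with no sense of which is more relevant, which is exactly the sorting problem described above. There is also no stemming here, so "hacking" and "hacker" land in different boxes.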

The library index only indexes the information that fits on a card catalog card. That's extremely unlike a web search engine.

If you'd like to see an experimental discovery interface for a library that goes deeper into book contents, check out https://books.archivelab.org/dateviz/ -- sorry, not very mobile friendly.

Not surprisingly, this book thingie is a big centralized service, like a web search engine.

maybe crowdsourcing would be a solution - something similar to "@home" project, only for web indexes/cache - maybe even leverage the browsers via plugin for web scraping. It already kind of works for getpocket.

I don't think it would be an issue if Google wasn't creating "special" rules for specific winners and losers (overall). Hell, I really wish they'd make it easy to individually exclude certain domains from results.

The canonical example to me of something to exclude would be the expertsexchange site. After stack overflow, ee was more than useless, and even before it was just annoying. There are lots of sites with paywalls, and other obfuscations to content and imho these sites are the ones that should be dropped/low-ranked.

But the fact that there's no autocomplete for "Hillary Clinton is|has" (though "Donald Trump is" is also filtered). Yes, it's been heavily gamed. It's also had active meddling. And their control over YouTube seems to be even worse, with disclosed documents/video that indicate they're willing to go so far as outright election manipulation. With all indications that Facebook, Pinterest and others are going the same route.

> or you find a novel way to make a neutral search engine

Just because nobody's said it in this thread yet: blockchain? I never bought into the whole bitcoin buzz, but using a blockchain as an internet index could be interesting.

How would Merkle DAGs be relevant?

even better, have something like git for the web - effectively working as an archive.

The problem with git is countering nefarious forces. The blockchain is better in that regard because the consensus algorithm can be used to verify that the listings are legitimate.

content change signed by creators private key, otherwise merge is rejected?

or, wiki approach...

Just signing with a private key isn't a guarantor of anything other than that if you trust that the person with the key is who they say they are, then the actual content is from them. But that would require a massively large web of trust in itself: that all the private keys would be trusted. And if you only let in private keys that you explicitly trusted, then it's very likely you could end up with an echo chamber

good point, but we already have the PKI in place, and use it for SSL.

I think Apple's current approach, where all the smarts (Machine Learning, Differential Privacy, Secure Enclave, etc.) reside on your device, not in the cloud, is the most promising. As imagined in so much sci-fi (eg. the Hosaka in Neuromancer) you build a relationship with your device which gets to know you, your habits and, most importantly in regard to search, what you mean when you search for something and what results are most likely to be relevant to you. An on-device search agent could potentially be the best solution because this very personal and, crucially, private device will know much more about you than you are (or should be) willing to forfeit to the cloud providers whose business is, ultimately, to make money off your data.

>, where all the smarts [...] reside on your device, not in the cloud, is the most promising. [...] An on-device search agent could potentially be the best solution [...]

Maybe I misunderstand your proposal but to me, this is not technically possible. We can think of a modern search engine as a process that reduces a raw dataset of exabytes[0] into a comprehensible result of ~5000 bytes (i.e. ~5k being the 1st page of search result rendered as HTML.)

Yes, one can take a version of the movies & tv data on IMDB.com and put it on the phone (e.g. like copying the old Microsoft Cinemania CDs to the smartphone storage and having a locally installed app search it) but that's not possible for a generalized dataset representing the gigantic internet.

If you don't intend for the exabytes of the search index to be stored on your smartphone, what exactly is the "on-device search agent" doing? How is it iterating through the vast dataset over a slow cellular connection?

[0] https://www.google.com/search?q="trillion"+web+pages+exabyte...

The smarts living on-device is not necessarily the same as the smarts executing on-device.

We already have the means to execute arbitrary code (JS) or specific database queries (SQL) on remote hosts. It's not inconceivable, to me, that my device "knowing me" could consist of building up a local database of the types of things that I want to see, and when I ask it to do a new search, it can assemble a small program which it sends to a distributed system (which hosts the actual index), runs a sophisticated and customized query program there, securely and anonymously (I hope), and then sends back the results.

Google's index isn't architected to be used that way, but I would love it if someone did build such a system.

To some extent, doesn't Google already do this? Meaning that based on your location/Google account/other factors such as cookies or search history, it will tailor your results. For instance, searching the same query on different computers will result in different results.

Though to your point, google probably ends up storing this information in the cloud

Also instant search results, which were common search terms that were cached at lower levels of the internet.

I think you're suggesting homomorphic encryption to execute the user's ranking model. Unfortunately, homomorphic encryption is pretty slow, and the types of operations you can do are limited. But it's viable if the data you're operating on is relatively small - e.g. just searching through (encrypted) personal messages or something.

I think you've got the right general idea, but I don't know that it has to be homomorphic encryption. After all, an index of the public web is not really secret, and the user doesn't have a private key for it.

In the simplest case, you could make a search engine in the form of a big, public, regularly-updated database, and let users send in arbitrary queries (run in a sandbox/quota environment).

That's essentially what we've got now, except the query parser is a proprietary black box that changes all the time. I don't see any inherent reason they couldn't expose a lower-level interface, and let browsers build queries. Why can't web browsers be responsible for converting a user's text (or voice) into a search engine query structure?

Or even an online search engine that was configurable where you could customize the search engine and assign custom weights to different aspects.

I'd love to be able to configure rules like:

+2 weight for clean HTML sites with minimal Javascript

+5 weight for .edu sites

-10 weight for documents longer than 2 pages

-5 weight for wordy documents

I'd also like to increase the weight for hits on a list of known high quality sites. Either a list I maintain myself, or one from an independent 3rd party.
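Rules like these could be a small list of (weight, predicate) pairs applied on top of whatever base score the engine returns. A sketch, where the `Page` fields, the thresholds, and the bonus for trusted sites are all assumptions (the "wordy documents" rule is left out because it needs a definition of wordy):

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str            # written without a scheme, e.g. "example.edu/paper"
    base_score: float   # whatever relevance score the engine supplies
    script_count: int   # number of scripts on the page
    pages_long: int     # document length in printed pages

def host(url):
    return url.split("/")[0]

# Each rule is (weight, predicate). Thresholds are guesses for illustration.
RULES = [
    (+2, lambda p: p.script_count <= 1),          # minimal Javascript
    (+5, lambda p: host(p.url).endswith(".edu")),  # .edu sites
    (-10, lambda p: p.pages_long > 2),            # longer than 2 pages
]
TRUSTED = {"example.edu"}  # hand-maintained or third-party list
TRUSTED_BONUS = 3          # assumed; the comment gives no number

def score(page):
    total = page.base_score + sum(w for w, test in RULES if test(page))
    if host(page.url) in TRUSTED:
        total += TRUSTED_BONUS
    return total

def rerank(pages):
    """Order pages by adjusted score, best first."""
    return sorted(pages, key=score, reverse=True)
```

With this, a short paper on a trusted .edu host with a base score of 1.0 ends up at 11.0, while a long script-heavy page with a base score of 5.0 drops to -5.0.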

Once upon a time I tried to use Google's custom search engine builder with only hand curated high quality sites as my main search engine. It was too much trouble to be practical, but I think that could change with an actual tool.

I think this is not what was the original question. A device that knows You still needs indexing service to find data for You. IMHO.

I remember hearing something about Differential Privacy from a WWDC keynote a few years back, however I haven't heard much lately. Can you say how and where Apple is currently using Differential Privacy?


Apple uses local differential privacy to help protect the privacy of user activity in a given time period, while still gaining insight that improves the intelligence and usability of such features as: • QuickType suggestions • Emoji suggestions • Lookup Hints • Safari Energy Draining Domains • Safari Autoplay Intent Detection (macOS High Sierra) • Safari Crashing Domains (iOS 11) • Health Type Usage (iOS 10.2)

Found via Google...

I see a lot of good comments here, I got inspired to write this:

What if this new Internet instead of using URI based on ownership (domains that belong to someone), would rely on topic?

In examples:

netv2://speakers/reviews/BW
netv2://news/anti-trump
netv2://news/pro-trump
netv2://computer/engineering/react/i-like-it
netv2://computer/engineering/electron/i-dont-like-it

A publisher of webpage (same html/http) would push their content to these new domains (?) and people could easily access list of resources (pub/sub like). Advertisements are driving Internet nowadays, so to keep everyone happy, what if netv2 is neutral, but web browser are not (which is the case now anyway)? You can imagine that some browsers would prioritise some entries in given topic, some would be neutral, but harder to retrieve data that you want.

Second thought: Guess what, I'm reinventing NNTP :)

Inventing/extending a new NNTP is nice idea too.

The Internet has become synonymous with the web/http protocol. The web alternatives to NNTP won instead of newer versions of Usenet. New versions of IRC, UUCP, S/FTP, SMTP, etc., instead of webifying everything would be nice. But those services are still there and fill an important niche for those not interested in seeing everything eternal septembered.

I believe there is/was an extension to NNTP for full text search or at least a draft proposal no?

Another inspiration: DNS for searching.

What if we implement DNS-like protocol for searching. Think of recursive DNS. Do you have "articles about pistachio coloured usb-c chargers"? Home router says nope, ISP says nope, Cloudflare says nope, let's scan A to Z. Eventually someone gives an answer. This of course can (must?) be cached, just like DNS. And just like DNS, it can be influenced by your not-so-neutral browser or ISP.
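A toy version of that lookup chain, assuming each hop either answers from its own data or asks the next hop up, and caches whatever comes back. The `Resolver` class and hop names are hypothetical, and there is no expiry (TTL) on cached answers, which real DNS has.

```python
class Resolver:
    """One hop in the chain: answers from its own data or asks upstream."""

    def __init__(self, name, known=None, upstream=None):
        self.name = name
        self.known = dict(known or {})  # queries this hop can answer itself
        self.upstream = upstream
        self.cache = {}

    def resolve(self, query):
        """Return (answer, name of the hop that supplied it)."""
        if query in self.cache:
            return self.cache[query], self.name
        if query in self.known:
            return self.known[query], self.name
        if self.upstream is None:
            return None, self.name  # end of the chain, nobody knows
        answer, source = self.upstream.resolve(query)
        if answer is not None:
            self.cache[query] = answer  # cache on the way back, like DNS
        return answer, source
```

For example, with a chain router -> isp -> root where only root knows the answer, the first `router.resolve(...)` is supplied by "root" and the second by "router" from its cache. Whoever runs a hop can also rewrite what it returns, which is the not-so-neutral part.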

The proliferation of Black hat SEOs would render this useless.

How would topic validity get enforced?

For example, if a publisher has a particular pro-Trump article, they would likely want (for obvious financial reasons) to push it to both netv2://news/anti-trump and netv2://news/pro-trump . What would prevent them from doing that?

Also, a publisher of "GET RICH QUICK NOW!!!" article would want to push it to both netv2://news/anti-trump and netv2://computer/engineering/electron/i-dont-like-it topics.

You can't simply have topics, you can have communities like news/pro-trump that are willing to spend the labor required for moderation i.e. something like reddit. But not all content has such communities willing and able to do so well.

I like this idea of people dreaming about a new internet :D

The idea of moving to a pub-sub like system is a good one. It makes a lot of sense for what the internet has become. It's more than simple document retrieval today.

To me it seems that you’ve just recreated Reddit.

You want to silo information and create built-in information echo chambers? That seems so bad for polarization.

I'm starting to think echo chambers are just something that will forever be prevalent, and it's up to the users to try to view alternate viewpoints.

If netv2 is neutral, I would just stuff all of the topics with my own content millions of times, so everyone can only see my content.

Who maintains, audits, and does validation for content submitted to these global lists of topics?

That was what the early internet was like (I was there). People built indexes by hand, lists of pages on certain topics. There was the Gopher protocol that was supposed to help with finding things. But this was all top-down stuff, the first indexing/crawling search engines were bottom-up and it worked so much better. And for a while we had an ecosystem of different search engines until Google came along, was genuinely miles better than everything else, and wiped everything else out. Really, search isn't the problem, it's the way that search has become tied to advertising and tracking that's the problem. But then DuckDuckGo is there if you want to avoid all that.

In the very early days, you didn't need a search engine because there weren't that many web sites and you knew most of the main ones anyway (or later on had them in your own hotlists in Mosaic). Nowadays you need a search because there is so much content.

The problem is that the amount of content and the size of the potential user base are so large that it is impossible to offer search as a free service, i.e. it has to be funded in some way. Perhaps instead of having a free advertising-driven search, there would be space for a subscription-based model? Subscription based (and advert free) models seem to be working in other areas, e.g. TV/films and music.

Another problem though is that more and more content seems to be becoming unsearchable, e.g. behind walled gardens or inside apps.

Exactly my thought. But it definitely wouldn't get mass adoption which is good because mass-market content websites are questionable in terms of user experience (they also need to cover content creating costs by popups/ads/pushes). One thing, though, ad based search engines lift ad based websites because they can sell ad on a second end.

Maybe we'll see advent of specialised paid search engines SaaSs with authentic and independent content authors like professional blogs.

Search is the problem. If you don’t rank in google you don’t exist on the internet. There is an entire economy built on manipulating search that is pay to play in addition to google continually focusing on paid search of natural SERPs. Controlling search right now is controlling the internet.

>If you don’t rank in google you don’t exist on the internet.

Maybe in 2009. Today there are businesses that exist solely on Instagram, Facebook, Amazon, etc.

Whatever you replace Search with would be gamed in the same way.

true, but when it was lycos, hotbot, altavista, google, webcrawler, aol, gopher, archie, usenet and so many other sources it was much easier to exist in many ways (harder to dominate) - people used to ‘surf the web’, join “webrings” and share stuff.. now they consume and post memes. so i blame behavior as much as monopoly

A lot of other things have changed since then, so the difference in tone you are noticing might not have much to do with search engines. In 1996 there were only about 16 million people on the internet, and usage obviously skewed towards the more technical nerdy crowd. Now there are 4,383 million people on the internet. Which is about 57% of everyone.

I see this a lot on HN. People forget that a lot of things in the early days of the Internet only worked because there were so few people on the Internet.

If you were rich and had a T1 in your home in the days everyone was on dialup, sure you could host a website yourself. But these days, even if you're one of the lucky residents on a gigabit symmetrical connection, there's a limit to how much you can serve. Self-hosting isn't an option unless your website is a niche.

More people and fewer companies dominating how everything is found... i don't think that change is for the better.

If your target audience isn't on Google, then you don't have to rank there.

Almost all of my customers find me through classified advertising websites. Organic and paid search visitors to my site tend to be window shoppers.

I think in one sense the answer is it always depends who or what you are asking for your answers.

The early Web wrestled with this, early on it was going to be directories and meta keywords. But that quickly broke down (information isn't hierarchical, meta keywords can be gamed). Google rose up because they use a sort of reputation system based index. In between that, there was a company called RealNames, that tried to replace domains and search with their authoritative naming of things, but that is obviously too centralized.

But back to Google, they now promote using schema.org descriptions of pages, over page text, as do other major search engines. This has tremendous implications for precise content definition (a page that is "not about fish" won't show up in a search result for fish). Google layers it with their reputation system, but these schemas are an important, open feature available to anyone to more accurately map the web. Schema.org is based on Linked Data, its principle being each piece of data can be precisely "followed." Each schema definition is crafted by participation from industry and interest groups to generally reflect its domain. This open world model is much more suitable to the Web, compared to the closed world of a particular database (but, some companies, like Amazon and Facebook, don't adhere to it since apparently they would rather their worlds have control; witness Facebook's open graph degeneration to something that is purely self-serving).

The deeper problem is advertising. It is sort of a prisoner's dilemma: all commercial entities have a shouting contest to attract customer attention. It's expensive for everybody.

If we could kill advertisement permanently, we can have an internet as described in the question. This will almost be like an emergent feature of the internet.

We could supercharge word of mouth. I've been thinking about an alternative upvote model where content is ranked not primarily based on aggregate voting but by:

- ranking content that users you have upvoted higher

- ranking content that users with similar upvote behaviour higher

While there is a risk of upvote bubbles, it should potentially make it easier for niche content to spread to interested people and make it possible for products and services to spread using peer trust rather than cold shouting.
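One possible reading of those two signals, sketched in code: score each unseen item by the Jaccard overlap between your upvotes and those of everyone who upvoted it, then add a bonus if it was posted by someone whose content you have upvoted before. The data layout, the bonus value, and the use of Jaccard similarity are assumptions for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of upvoted items, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def rank_for(me, upvotes, authors, author_bonus=1.0):
    """Rank items `me` has not upvoted yet, best first.

    upvotes: user -> set of item ids that user upvoted
    authors: item id -> user who posted it
    Only items upvoted by at least one other user are considered.
    """
    mine = upvotes[me]
    liked_authors = {authors[i] for i in mine if i in authors}
    scores = {}
    for user, items in upvotes.items():
        if user == me:
            continue
        sim = jaccard(mine, items)  # similar upvote behaviour
        for item in items - mine:
            scores[item] = scores.get(item, 0.0) + sim
    for item in scores:
        if authors.get(item) in liked_authors:  # users you have upvoted
            scores[item] += author_bonus
    return sorted(scores, key=lambda i: (-scores[i], i))
```

Making the two weights visible and adjustable by the user would be one way to keep the bubble risk in the open rather than hidden in the algorithm.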

> ranking content that users with similar upvote behaviour higher

This is what Reddit originally tried to do before they pivoted.


Oh, interesting!

Makes me think that their original plan could still work if they just put a bit more effort into crafting that algorithm.

For example, the main criticism brought up is that things that you dislike that your peers like keep getting recommended. Why not add a de-ranking aspect into it and try adding downvote-peers in addition to upvote peers.

I imagine you could create this interesting query language that could answer questions like: what things do you like if you like X and Y but not Z? (I kind of remember that something akin to this has been hacked together using subreddit overlap.)
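The "X and Y but not Z" question reduces to set operations over per-user upvotes. A minimal sketch, assuming upvotes are available as a mapping from user to a set of item ids (the function name and data shape are made up):

```python
def liked_by_matching(upvotes, likes, dislikes=()):
    """Count what else is liked by users who like all of `likes`
    and none of `dislikes`. Returns (item, count) pairs, most common first."""
    likes, dislikes = set(likes), set(dislikes)
    counts = {}
    for items in upvotes.values():
        if likes <= items and not (dislikes & items):
            for item in items - likes:
                counts[item] = counts.get(item, 0) + 1
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

For instance, `liked_by_matching(upvotes, ["X", "Y"], ["Z"])` drops every user who upvoted Z before counting. Here "not Z" means "did not upvote Z"; a real downvote signal would need separate data.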

As long as there are big companies making money off their products, you can be sure they'll find a way to advertise them to you.

I've had similar ideas recently. Especially niche content (or shared research) would probably be notoriously hard (WRT false positives) for machine learning to decide whether it is relevant to you, people with similar interests know that much better.

I was also wondering what would be good options to store votes/upvotes in a decentralized way.

> people with similar interests know that much better

Yeah, I wonder if there is a cheap way to test this. Actually! There could be! Like using favorites here on Hacker News. That could be mined and visualized in various ways. (Although a quick sample shows me that it's a rarely used feature.)

> I was also wondering what would be good options to store votes/upvotes in a decentralized way.

Yeah there are a lot of interesting optimization challenges if you really want to utilize upvote graphs for ranking.

Not to echo a R&M quote on purpose but that just sounds like targeted advertising with extra steps.

> ranking content that users with similar upvote behaviour higher

That's how you make echochambers

All social media have echo chamber characteristics. You have to counteract it with transparency and opt-in/out.

So, basically Facebook?

This sounds so much like Facebook.

Any "social" ranking algorithm is going to sound at least superficially similar to what's already out there.

Maybe if IPFS (~web 3.0) succeeds in the future, you could solve the advertising problem by inventing a meta network, where all the sites involved would agree to follow certain standardized criteria of site purity. You'd tag the nodes (or sites), and then have an option to search only sites from the pure network. Just a thought. edit: Maybe this would lead to a growing interest in the site purity, and as the network's popularity would grow, you could monetize the difference to its advance.

Be careful what you wish for, as you might get AMP or some proprietary Facebook format as a standard instead.

Well, I was thinking we could have endless number of (meta) networks / network configurations / standards. I mean each node could have as many tags as needed, e.g. #safe_for_children_v1.1 #pure_web_v2.0. Then you could configure your search engine / browser according to these tags. You could also stack tags to simplify things, e.g. pure_stack would include both #safe_for_children and #pure_web, etc. Maybe I'm missing something, but it seems doable.
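The tag-and-stack idea is easy to prototype. In this sketch a stack is just a named bundle of tags, and a site passes the filter only if it carries every required tag. Who assigns and verifies the tags is the open question and is not modelled; the site names are made up.

```python
# Hypothetical tag stacks: a stack name expands to the tags it bundles.
STACKS = {"pure_stack": {"safe_for_children_v1.1", "pure_web_v2.0"}}

def expand(required):
    """Replace stack names with their member tags; plain tags pass through."""
    tags = set()
    for name in required:
        tags |= STACKS.get(name, {name})
    return tags

def filter_sites(sites, required):
    """Keep sites (site -> set of tags) that carry every required tag."""
    need = expand(required)
    return sorted(site for site, tags in sites.items() if need <= tags)
```

A browser or search engine configured with `["pure_stack"]` would then only show sites tagged with both member tags.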

If we kill advertisement, you can say goodbye to the vast majority of content on the internet. The better approach is to make advertising a better experience and to create incentives for advertisers to spend ad dollars on quality content.

There will always be bottom-feeders as long as there is a market where people are not forced to choose with their wallets. Killing the "vast majority of content on the internet" seems like a good thing to me, honestly.

> Killing the "vast majority of content on the internet" seems like a good thing to me, honestly.

I sure hope my content of preference beats out yours for not getting killed.

I am reasonably sure that even if our preferences are complete opposite and we eliminate 99% of content in general, you would still have enough quality content for what your interests are. But just to be extra sure, please vote with your wallet and actively support the things you like and don't let advertisers do the choosing for you.

Advertisement just should not be the central means of income of content producers. I really hope this point of view gets killed together with advertisement.

> Advertisement just should not be the central means of income of content producers.

Can you propose any viable alternative?

Ads are placed via an automatic auction upon pageview. GM and Ford both want to show me an ad when I google "what car to buy", and have automatic systems that decide how much they'd be willing to pay to show me that ad based on my likelihood of purchase (income, sex, location, etc). Why not have a system that follows me around and outbids them using funds from my bank account, to show me an ad which is just a transparent image? That way I don't have to see ads but content creators still get what they need?
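A rough sketch of that outbidding agent, assuming a simple highest-bid-wins auction where the winner pays its own bid. Real ad exchanges use other auction rules; the function name `outbid`, the bid increment, and the budget handling are all invented for illustration.

```python
def outbid(advertiser_bids, budget_left, increment=0.01):
    """Decide who wins one ad slot.

    Returns ("blank", price) when the user's agent can afford to beat the
    top advertiser bid, otherwise (advertiser, bid) for the top advertiser.
    """
    top_advertiser = max(advertiser_bids, key=advertiser_bids.get)
    top_bid = advertiser_bids[top_advertiser]
    my_bid = round(top_bid + increment, 2)
    if my_bid <= budget_left:
        return "blank", my_bid
    return top_advertiser, top_bid


result_with_budget = outbid({"GM": 0.40, "Ford": 0.35}, budget_left=2.00)
result_without_budget = outbid({"GM": 0.40, "Ford": 0.35}, budget_left=0.10)
```

With enough budget the blank image wins; once the budget runs out the top advertiser gets the slot again.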

What you are describing is exactly what Google Contributor is trying to do. We'll have to see how it turns out.

It says it only works with "participating sites". I wonder why

The first version worked exactly as you proposed. The UX however was meh. You'd place a monthly limit on your ad (outbidding) spend (eg. $2) and it ended up outbidding only some of the ads: those served by Google which were also outbid by your amount.

So from a user's perspective it didn't fully work. Also the ad space wasn't fully removed (perhaps due to technical reasons) but was replaced with a blank image. It also didn't catch on much.

So they tried to pivot, and now the program works with certain cooperating websites to fully get rid of all ads. But I'm sure bigger websites would rather be in total control of monetizing themselves, and they can spend on the necessary IT infrastructure, similar to most online newspapers these days.

I think an advertiser (eg. a legal firm) might be willing to pay eg. $10 per ad impression but no user is willing to outbid it so I think the first model (outbid in the auction) is more sustainable and profitable for both parties but needs to have all ad exchanges on board.

So in short, it's been tried but wasn't an instant (or even a slow) success and idk whether Google will continue investing in it or not.

Are you actually proposing for people to gasp! pay gasp! for content?

Google makes around $30 billion/quarter on ads. Assuming most of that comes from 200 million users (they have more than that but I assume a lot are not worth very much to advertisers), and their ad revenue comes from a 50% cut of the total ad payments, that comes out to around $300/quarter or $100 a month. I'd pay it, but I think most wouldn't.

Certain % of your internet bill goes to helping pay to host the sites you are visiting every billing period. If a site is large enough hosting would be sustained by the visiting userbase rather than the site owner. If a site is too small for that, chances are hosting has been cheap anyway.

Subscription. It is only viable for content that well off people use a lot of though, even then only when you are much better than the free competition.

whatever wikimedia organisation does :)

1) Not to most of the best content, 2) other business models may have an actual chance when not competing with "free", 3) actually-free, community-driven sites and services (and standards and protocols—those used to be nice) will have a larger audience and larger creator interest when not competing with "free" (and well-bankrolled).

The vast majority of content is absolute shit though, so speaking strictly for me, I'm willing to try

The question was about search engines, not about content.

But I think the combination of advertising+search engines is particularly bad, so paying for search would be a great first step.

maybe it's worth saying goodbye to "8 reasons why current internet sucks that drive spammy copywriters mad". The whole more-clicks-more-revenue based approach did not do good things to the online content.

I wrote up a proposal on this, changing the economics to adapt to and account for post-scarce resources like information:


To kill advertising would mean the web would live behind many walled gardens where each site requires membership.

For the remaining free sites you will see advertising in different forms (self-promotion blogs, the upsell, t-shirt stores on every site, spam-bait).

Advertising saved the internet.

Now tracking.. for advertising or other purposes is the real problem.

Other than a completely new approach for producing value such as the 'Freeism' one described in the article suggested in this comment https://news.ycombinator.com/item?id=20282851 (which I hadn't time to read yet and hence I'm neither in favour of or against) this simply boils down to the questions of who will pay for relevant content and what the business model will be.

By and large, people don't seem to be willing to pay for content on the web. Hence, advertising became the dominant business model for content on the web.

Find another way for someone to pay for relevant content and you can do away with advertising. It's as simple as that.

> By and large, people don't seem to be willing to pay for content on the web. Hence, advertising became the dominant business model for content on the web.

I don't think the causality is right here. People might not be willing to pay for content on the web because advertising enables competitors to offer content for free. If you removed that option, if people had no choice but to pay, it might just turn out that people would pay.

How would you achieve that? By outrightly outlawing advertising?

There absolutely are paid options on the web. It's just that they don't seem to appeal to a sufficient number of buyers so advertising could become irrelevant.

> How would you achieve that? By outrightly outlawing advertising?


> There absolutely are paid options on the web. It's just that they don't seem to appeal to a sufficient number of buyers so advertising could become irrelevant.

They aren't appealing in the presence of ad-subsidized free alternatives. Remove the latter, and they just might become appealing again.

Few things sound less likely to improve the internet than some entity having the power to content-police the web and remove anything it accuses of the thoughtcrime of advertising...

You can block third-party advertising structurally, so that a content-cop isn't required. First-party advertising cannot be blocked, of course, since that's just content.

For example, using browsers that impose a Content Security Policy that prevents anything from being loaded from domains other than the origin.
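To make that concrete: the policy in question would be something like `default-src 'self'`. Below is a simplified sketch of the same-origin test a browser would apply under such a policy. It is an approximation for illustration only (real CSP matching has more rules), and `origin` and `may_load` are hypothetical helper names.

```python
from urllib.parse import urlsplit

# The policy a browser would impose on every page in this scenario.
POLICY = "default-src 'self'"

DEFAULT_PORTS = {"http": 80, "https": 443}


def origin(url):
    """Return the (scheme, host, port) triple that identifies an origin."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def may_load(page_url, resource_url):
    """Allow a resource only if it shares the page's origin."""
    return origin(page_url) == origin(resource_url)


first_party = may_load("https://example.com/article", "https://example.com/img/logo.png")
third_party = may_load("https://example.com/article", "https://ads.example.net/banner.js")
```

Under this rule the ad network's script is refused, while anything the site hosts itself still loads.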

Sure, but if the only ad restriction was mandatory blocking of third party content, you'd just see ad agencies work out ways they can get the content they want to serve hosted locally (and lots of more interesting third party embedded content cease to exist due to it not having the same commercial rationale for workarounds...). If you start forcing companies not to promote third party products with anything that even looks like an ad, you'll just see a greater proportion of the free-to-access internet turn into paid-for reviews and influencer marketing. Not sure that'd be an improvement, and I'm pretty sure the next logical step of getting the content cops ruling which content looks too commercially-oriented for us proles to look at is even worse.

You can block third party advertising structurally using uBlock without ruining the internet for everyone else.

Advertising isn't a thoughtcrime, it's a cognitive/psychological assault.

I think a combination of consumer protection laws, truth in advertising laws and data protection laws, all turned up to 11 (even GDPR), could achieve most of the desired outcome on the Internet without much problematic "content-policing". But I'm not sure. You won't eliminate advertising from the Internet entirely, but making it illegal would make undesirable advertising more expensive, by creating vast amount of risk for advertisers and simultaneously destroying the adtech industry, thus rendering most of the abusive practices that much less efficient.

(Also, to be clear, I want all advertising gone. Not just on-line, the meatspace one too.)

Huh. That sounds like a free market model.

Isn't this what different newspapers like NYT and WSJ are moving towards? Why can't both models coexist?

Because one totally destroys the other.

Slave labour, selling poison or dumping waste into rivers are all superior business models too, but that doesn't mean they should exist in a civilized society.

The train also destroyed the horse drawn wagon train for bulk land transport.

Just because it totally destroys another business model doesn't mean it is wrong. Felony interference with a business model protectionism isn't good for societies. Historically this stagnant "stability" gets them lapped and forced into the modern world if lucky or conquered if not no matter how vigorously they insist that it is the only and right way.

Of course. I'm not saying displacing business models is bad per se. I'm saying that just because one business model can displace a different one, doesn't immediately mean it's good. Plenty of business models are morally bankrupt, and I believe "free but subsidized by advertising" is such, by virtue of advertising itself[0] being morally bankrupt.

[0] - as seen today; not the imaginary "informing customers about what's on the market" form, but the real "everyone stuck in a shouting contest of trying to better manipulate customers" form.

> Find another way for someone to pay for relevant content and you can do away with advertising. It's as simple as that.

Not so simple. What is relevant for me may be irrelevant for you.

You pay for content that's relevant to you. I pay for what's relevant to me.

Oh, okay. I was assuming we had someone like the government pay for content.

Promotion is a need, and a very important need for ideas to spread. We all know that the concept of "if you build it they will come" doesn't work. Google's adaptation for this was to make advertising relevant... which is actually a considerable improvement over historical media models...

There's a saying in sales: "people hate to be sold, but they love to buy"... which is akin to what you are saying here. Advertising isn't the problem... the problem is that the reasons why people are promoting aren't novel enough... (rent seeking... which creates noise)

The only way to kill advertising is to have perfectly efficient markets.

Until then, you're going to have demand for ferrying information between sellers and buyers, and vice versa, because of information asymmetry. You may disagree with some of the mediums currently used, finding them annoying, but advertising is always evolving to solve this problem, as is evident in the last three decades.

Yes, we need search engines, but they don't need to be monolithic. Imagine that indexing the text of your average web page takes up 10k. Then you get 100,000 pages per gig. It means that if you spend ~270 USD on a consumer 10 TB drive, you can index a billion web pages. Google no longer says how many pages they index, but it's estimated to be within one order of magnitude of that.
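The arithmetic above is easy to check. A small sketch, using decimal units (1 GB = 10^9 bytes) and taking the 10k-per-page figure at face value; the function names are just for illustration.

```python
BYTES_PER_PAGE = 10_000  # "10k" of index data per page, as assumed in the comment
GB = 10**9
TB = 10**12


def pages_per_gb(bytes_per_page=BYTES_PER_PAGE):
    """How many page entries fit in one gigabyte."""
    return GB // bytes_per_page


def pages_on_drive(drive_tb, bytes_per_page=BYTES_PER_PAGE):
    """How many page entries fit on a drive of the given size."""
    return drive_tb * TB // bytes_per_page


def drives_needed(pages, drive_tb=10, bytes_per_page=BYTES_PER_PAGE):
    """Number of drives needed for a given page count, rounded up."""
    total_bytes = pages * bytes_per_page
    return -(-total_bytes // (drive_tb * TB))


per_gb = pages_per_gb()
per_drive = pages_on_drive(10)
```

So the numbers hold: 100,000 pages per gig and a billion pages on a 10 TB drive, as long as 10k per page is a fair estimate.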

This means that in terms of hardware, you can build your own Google. Then you get to decide how it rates things, you don't have to worry about ads, and SEO becomes much harder because there is no longer one target to SEO. Google obviously doesn't want you to do this (and in fairness Google indexes a lot of stuff that isn't keywords from web pages), but it would be very possible to build an open source configurable search engine that anyone could install, run, and get good results out of.

(Example: The piratebay database, that arguably indexes the vast majority of available music / tv / film / software was / is small enough to be downloaded and cloned by users)

Google's paper on Percolator from 2010 says there are more than 1T web pages. 9 years later there are surely way more than that.

The real issue would be crawling and indexing all those pages. How long would it take for an average user's computer with a 10Mb internet connection to crawl the entire web? It's not as easy a problem as you make it seem.

I'm not saying it's easy, it's not, but people tend to think that because Google is so huge, you have to be that huge to do what Google does. My argument is that in terms of hardware Google needs expensive hardware because they have so many users, not because what they do requires that hardware to deliver the service for one or a few users.

I have a gigabit link to my apartment (go Swedish infrastructure!). At that theoretical speed I get 450 gigs an hour, so I could download ten tera in a day. We can easily slow that down by an order of magnitude and it's still a very viable thing to do. If someone wrote the software to do this, one could imagine some kind of federated solution for downloading the data, so that every user doesn't have to hit every web server.
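A quick check of those bandwidth numbers, assuming the link is fully saturated and ignoring protocol overhead and politeness delays toward web servers. The function names are illustrative only.

```python
def gb_per_hour(link_gbit_per_s):
    """Gigabytes transferred per hour on a fully used link."""
    return link_gbit_per_s / 8 * 3600


def hours_to_download(total_tb, link_gbit_per_s):
    """Hours needed to move total_tb terabytes over the link."""
    return total_tb * 1000 / gb_per_hour(link_gbit_per_s)


full_speed = gb_per_hour(1.0)                # gigabit link
ten_tb_hours = hours_to_download(10, 1.0)    # a bit over 22 hours
ten_tb_slow = hours_to_download(10, 0.1)     # ten times slower: about 9 days
```

At full speed 10 TB takes a little under a day; an order of magnitude slower it takes a bit over nine days.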

Could be done with a p2p "swarm". Peers get assigned pages to index then share the result.
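One simple way peers could agree on who indexes what, without a coordinator, is to hash each URL and map the hash onto the peer list. This is only a sketch under the assumption that every peer knows the same ordered list of peers; `assign_peer` and `shard` are hypothetical names, and a real swarm would also need to handle peers joining, leaving, and lying.

```python
import hashlib


def assign_peer(url, peers):
    """Deterministically pick the peer responsible for a URL."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


def shard(urls, peers):
    """Group URLs by the peer that should crawl and index them."""
    work = {peer: [] for peer in peers}
    for url in urls:
        work[assign_peer(url, peers)].append(url)
    return work


peers = ["peer-a", "peer-b", "peer-c"]
urls = [f"https://example.com/page/{i}" for i in range(30)]
work = shard(urls, peers)
```

Every peer running the same function over the same peer list reaches the same assignment, so no page is crawled twice.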

How would you begin indexing everything?

This is a good question. Crawling and storing the pages is the easy part... searching them with a sub 1 second response time is much harder. Which current DB platforms can handle the size of data that Google indexes?

Almost definitely not.

Search engines are there to find and extract information in an unstructured trove of webpages - no other way to process this than with something akin to a search engine.

So either you've got unstructured web (the hint is in the name) and GoogleBingYandex or a somehow structured web.

The latter has been found to be not scalable or flexible enough to accommodate unanticipated needs - and not for a lack of trying! This has been the default mode of web until Google came about. Turns out it's damn near impossible to construct a structure for information that won't become instantly obsolete.

> A structured web ... has been found to be not scalable or flexible enough to accommodate unanticipated needs - and not for a lack of trying!

Linked Open Data (the latest evolution of Semantic Web technologies) is actually working quite well at present - Wikidata now gets more edits per unit of time than Wikipedia does, and its data are commonly used by "personal assistant" AIs such as Amazon's Alexa. Of course, these can only cover parts of the web where commercial incentives, and the bad actors that sometimes pursue them, are not relevant.

I've had this idea floating in my head for a while, that one thing that might make the world better is some kind of distributed database, and a gravitation back to open protocols (though instead of RFCs, maybe we could maintain an open source library for the important bits). I was thinking the architecture of DNS is a good starting point. From there we can create public indexes of data. This includes searchable data, but also private data you want to share, which could be encrypted and controlled by you (think PGP). I'd modify browsers so that I don't have to trust a 3rd party service.

Centralization happens because the company owns the data, which becomes aggregated under one roof. If you distribute the data it will remove the walled gardens, multiple competitors should be able to pop up. Whole ecosystems could be built to give us 100 googles.... or 100 facebooks, where YOU control your data, and they may never even see your data. And because we're moving back to a world of open protocols, they all work with each other.

These companies aren't going to be worth billions of dollars any more.... but the world would be better.

You've just more or less described Solid. https://solid.mit.edu/

I think a lot of people dismiss Solid based on its deep origins in Semantic Web, or because it's a slow project, based on Web standards, intended to solve long term problems.

But being part of the Web is a huge process, and with DIDs it maps just fine into decentralized worlds.

Unless there's a significant change in human behaviour, convenience is always going to trump everything else including privacy - we have seen over and over again that people will happily hand over their personal data in return for a free service that is simple to use. So any solution where you control your own data is going to have to be as convenient as alternatives, otherwise there'll be an opening for a new centralised "we'll do all the hard work of owning your data for you" mega corp tech titan.

i really like that idea.

The 2 core flaws of the Internet (more precisely the World Wide Web) are lack of native search and native payments. Cryptocurrencies have started to address the second issue, but no one that I know of is seriously working on the first.

Fast information retrieval requires an index. A better formulation of the question might be: how do we maintain a shared, distributed index that won't be destroyed by bad actors.

I wonder if the two might have parts of the solution in common. Maybe using proof of work to impose a cost on adding something to the index. Or maybe a proof of work problem that is actually maintaining the index or executing searches on it.

Why does there need to be one central source of truth on the internet? It seems like it would be impossible to implement. Even if google worked like it did 15 years ago and you got decently relevant results to your search terms, that's still not even scraping the surface of the whole internet that is relevant to your search terms.

It's an impossible problem to solve because we don't have good consistent metadata to draw on. Libraries work because they have good metadata to catalog their collections. Good metadata needs to be generated by hand, doing it automatically is bound to lead to errors and special cases that will pollute your search results.

I say we abandon the idea of the ideal search engine, accept the fact that we will never be able to find every needle in every haystack, and defer to a decentralized assortment of thousands of topic-specific indexes of relevant information. Some of them will be shit, but that's fine, the internet has always been a refuge for conspiracy theorists and other zany interests. The good stuff will shine through the mud, as it's always done.

My approach to answering this would entail:

1) Determining what percentage of search engine use is driven by the need for a shortcut to information you know exists but don't feel like accessing the hard way

2) Information you are actually seeking.

My initial reaction is that making search engines irrelevant is a stretch. Here is why:

Regarding #1, the vast majority of my search activity involves information I know how and where to find but seek the path of least resistance to access. I can type in "the smith, flat iron nyc" and know I will get the hours, cross street and phone number for the Smith restaurant. Why would I not do this instead of visiting the yelp website, searching for the Smith, set my location in NYC, filtering results etc. Maybe I am not being open minded enough but I don't see how this can be replaced short of reading my mind and injecting that information into it. There needs to be a system to type a request and retrieve the result you're looking for. Another example, when I am looking for someone on LinkedIn, I always google the person instead of utilizing LinkedIn's god awful search. Never fails me.

Regarding #2, in the minority of cases where I am actually looking for something, I have found that Google's results have gotten worse and worse over the years. It will still be my primary port of call, and I think this is the workflow that has potential for disruption. Other than an index, I don't know what better alternatives you could offer.

You'd still want to be able to retrieve "useful" information which can't be tampered with easily which I think is the biggest issue.

You can't curate manually.. That just doesn't scale. You also can't let just anyone add to the index as they wish or any/every business will just flood the index with their products... There wouldn't be any difference between whitehat/blackhat marketing.

You also need to be able to discover new content when you seek it, based on relevancy and quality of content.

At the end of the day, people won't be storing the index of the net locally, and you also can't realistically query the entire net on demand. That would be an absolutely insane amount of wasted resources.

All comes back to some middleman taking on the responsibility (google,duckduckgo,etc).

Maybe the solution is an organization funded by all governments, completely transparent, where people who wish to can vote on decisions/direction. So non profit? Not driven by marketing?

But since when has government led with innovation and done so at a good pace? Money drives everything... And without a "useful" amount of marketing/ads etc, the whole web wouldn't be as it is.

So yes, you can.. But you won't have access to the same amount of data as easily, and will likely have a harder time finding relevant information (especially if it's quite new) without having to parse through a lot of crap.

If we were to design a brand new DATABASE ENGINE for today's world, can we develop it such a way that:

1. Finding information is trivial

2. You don't need services indexing billions of rows to find any relevant document

How far can one get with content-addressable storage? It's not obvious to me how to emulate search results ranking (well, anyhow), but it could give you a list of documents satisfying some criteria according to the authors who stored them.
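As a toy illustration of that idea: a content-addressable store keys every document by the hash of its bytes, and authors attach their own criteria when storing. The class below is an in-memory sketch with invented names (`ContentStore`, `put`, `find`); it returns matches in no meaningful order, which is exactly the missing ranking part.

```python
import hashlib


class ContentStore:
    """In-memory content-addressable store with author-supplied tags."""

    def __init__(self):
        self._blobs = {}
        self._by_tag = {}

    def put(self, data, tags=()):
        """Store bytes under their SHA-256 digest and record the tags."""
        key = hashlib.sha256(data).hexdigest()
        self._blobs[key] = data
        for tag in tags:
            self._by_tag.setdefault(tag, set()).add(key)
        return key

    def get(self, key):
        """Fetch a document by its content address."""
        return self._blobs[key]

    def find(self, *tags):
        """Return the keys of documents carrying all of the given tags."""
        sets = [self._by_tag.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*sets)) if sets else []


store = ContentStore()
k1 = store.put(b"hello", tags=["greeting", "english"])
k2 = store.put(b"hola", tags=["greeting", "spanish"])
spanish_greetings = store.find("greeting", "spanish")
```

Lookups by address are trivial; anything beyond exact tag matching still needs an index and some notion of ranking.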

Google throws billions of dollars at this problem, nothing about it is "trivial".

>In our current internet, we need a big brother like Google or Bing to effectively find any relevant information in exchange for sharing with them our search history, browsing habits etc.

The evil big brothers may not be necessary. We just need to expand alternative search engines like YaCy.

I can't imagine how this is possible. Imagine I have a string of words (a quote from a book or an article, a fragment of an error message, etc), and I want to find the full text where it appears (or pages discussing it). How would you do that without a search engine?

I think OP's idea is that search services would be built into the Internet, and not provided by a third party. That is, when a website is published or updated, it is somehow instantly indexed and made available for search as a feature of the platform on which it was published.

But you still need a third party to rank the results. I don't just want any page about my error message, I want the best page.

The page rank could be a transparent algorithm, which is regularly updated by a consortium like W3C.

The question is whether this would work in an adversarial setting where every party tries to inflate their page rankings by any trick they can find.

Not a chance it would survive. Google has enough problems fighting SEO right now and they don't publish their algorithm and have incredibly deep pockets.

Personally, I don't want "The page rank algorithm" I want 100 page rank algorithms made by 100 people. Transparency is important, but I think competition is more important.

The platform could provide useful metadata, leaving the ranking up to the client..

> built into the Internet

Uh...what? How do you define this?

Indexed where?

You ask a question (possibly on Stack-Something) and don't get ridiculed for not using Google, since you live in a world where search engines don't exist.

And how would you find out if that question has been answered before? That would only work if there was single unified centralised question site. And then we are pretty much back at Google's single search field.

You don't need to. It's not a problem to ask and answer the same question repeatedly. School never had a problem with that.

School also never had navigability of past questions / answers as its explicit objective.

Neither would the replacement for search engines.

And then 5 other users ask the same question because they have no search engine.

I think this gets boring quick...

We could have website-centered search engines. You ask the question on whatsthatquote.com and find out if someone has already asked it. If yes, you have your answer, if not someone answers and no one is annoyed. Stack Overflow does that. You don't get ridiculed for asking a question on SO that has already been asked on another website you don't know of.

I guess that would be the age of smaller communities centered around a few websites only? Maybe, I don't know if we can consider Google as enabling a real global community as of today. I pretty much browse around the same websites. Anything I want to find without a precise source of information in mind, I use Google and stumble upon ads and ads and sometimes ads, but rarely an answer.

I sometimes still search for stuff manually, browsing through websites' indexes. Some things are difficult to find with keywords. Equations whose name you forgot. Movies with a plot so generic that billions of results would be associated with it on a search engine. That piece of music of which you could write the notes on a sheet but don't remember the title.

Teachers teaching the same class every year can't use this as an excuse either.

The most informative answers I've encountered on StackOverflow are either a product of research (benchmarking, analyzing multiple sources) or very specific knowledge, sometimes written by the author of the framework/library in question. I'm not sure your analogy applies since these answers demand substantially more effort than the (usually) predictable and repetitive questions teachers face in class.

At some point someone has to write it, like school textbooks. After that you're just looking at distributing that knowledge. The replacement for search engines solves the latter problem.

Have an ip addressing and redirecting system designed around a Tower of Babel lookup.

Most of the search engines nowadays have the advantage of being closed source (you don't know how their algorithms actually work). This makes the fight against unethical SEO practices easier.

With a distributed open search alternative the algorithm is more susceptible to exploits by malicious actors.

Having it manually curated is too much of a task for any organization. If you let users vote on the results... well, that can be exploited as well.

The information available on the internet is too big to make directories effective (like they were 20 years ago).

I still have hope this will get solved one day, but directories and open source distributed search engines are not the solution in my opinion unless there is a way to make them resistant to exploitation.

I've been thinking that the only way to get around the bad-actor (or paid agent) problem when dealing with online networks is to have some sort of distributed trust mechanism.

I feel like manually curated information is the way to go, you just have to find some way to filter out all the useless info and marketing/propaganda. You can't crowd source it because it opens up avenues for gaming the system.

The only solution I can think of is some sort of transitive trust metric that's used to filter what's presented to you. If something gets by that shouldn't have (bad info/poor quality), you update the weights in the trust network that led to that action so they are less likely to give you that in the future. I never got around to working through the math on this, however.

Domain authority is a distributed 'trust' system?

But you want 'manually curated' but not 'crowd sourced', which suggests you want an individual or small group to find, record, and curate all pages (? or domains, or <articles>, or ...) across more than 60 billion pages of content??

There's something like 1000 FOSS CMSs - I would be surprised if there's a million domains with relevant info to sift through just for that small field.

There's no way you're curating _all_ that without crowd sourcing.

Of course you don't have to look at everything to curate, but how are you going to filter things ... use a search engine?

Is there a way to get a list of every domain in existence?

This https://www.whoisxmlapi.com/whois-database-download.php is a start, but there's new ones every second so you're going to need to update a lot.

I think it's not possible as domains don't have to be registered necessarily - a server can serve a domain at a particular IP so long as the requesting client uses a domain that the server responds to.

Obviously if it's registered domains then you can in theory just get the list from the registrar. They probably sell the full list for a price.

I imagine you can harvest a list with sufficient resources.

Thank you! this is very helpful for me.

That's very workable. Any agent should have a private key with which it signs its pushes. Age of an agent and score of feedback for that agent determine its ranking. Though that still leaves gaming possible with the feedback. But heavy feedback like "this is malicious content" could be moderated. (So that people can't just report stuff they don't like.)

The reason I mentioned that the trust metric should be transitive and distributed is so that it prevents gaming as much as possible. You wouldn't want to have a trusted central authority (for everyone) because that could always be corrupted or gamed if it's profitable enough. Rather every individual would have a set of trusted peers with different "trust" weights for each based on the individual's perception of their trustworthiness, that could be changed over time.

This trust (weighting) should be able to propagate as a (semi-)transitive property throughout the network to take advantage of your trusted peers' trusted peers. This trust weight propagation would need to converge, and when you are served content that has been labeled incorrectly ("high-value" or "trustworthy" or whatever metric, when you don't see it that way), then your trust weights (and perhaps your peers') would need to re-update in some sort of backpropagation.
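To make the propagation part a bit more concrete, here is one possible reading, not a worked-out design: treat trust as weights in [0, 1] on a directed graph and let my trust in a stranger be the best product of weights along any chain of peers. That can be computed with a Dijkstra-style search. The graph, the weights, and the name `propagated_trust` are all invented for illustration, and the re-weighting step after bad content is not modelled.

```python
import heapq


def propagated_trust(graph, source):
    """Best multiplicative trust from `source` to every reachable peer.

    `graph[a][b]` is how much a directly trusts b, a weight in [0, 1].
    """
    best = {source: 1.0}
    heap = [(-1.0, source)]
    while heap:
        neg_trust, node = heapq.heappop(heap)
        trust = -neg_trust
        if trust < best.get(node, 0.0):
            continue  # stale entry, a stronger chain was already found
        for peer, weight in graph.get(node, {}).items():
            candidate = trust * weight
            if candidate > best.get(peer, 0.0):
                best[peer] = candidate
                heapq.heappush(heap, (-candidate, peer))
    del best[source]
    return best


graph = {
    "me": {"alice": 0.9, "bob": 0.5},
    "alice": {"carol": 0.8},
    "bob": {"carol": 1.0},
}
trust = propagated_trust(graph, "me")
```

Here carol ends up at about 0.72 via alice rather than 0.5 via bob. Since each person runs this from their own node, there is no single global score for an attacker to inflate.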

The hard part is keeping track of the trust-network in a way that is O(n^c) and having the transitive calculations also be O(n^c) at most. I'm quite sure there are ways of doing this (at least with reasonably good results) but I haven't been able to think through them.

>But heavy feedback like "this is malicious content" could be moderated. //

You're just shifting around your trust problem. You need to handle 4chan level manipulation (millions of users coordinating to manipulate polls), or Scientology depth (getting thousands of people into USA government jobs in order to get recognised as a religion). If it's "we'll catch it in moderation" then whoever wants to manipulate it just gets a moderator ...

"Super-moderation": will a dictatorship work here? I don't see how.

"Meta-moderation": you're back to bad actors manipulating things with pure numbers.

You can't get around the problem of manipulation if your trustworthiness metric for content will be the same for all people, as it is on reddit, hacker news, or Amazon for example. Having moderators just concentrates the issue into a smaller number of people and you haven't solved the central problem--manipulation is profitable.

But think of how we solve this problem in our personal interactions with other people, and this should be a clue for how to solve it with computational help. We have a pretty good idea of which people are trustworthy (or capable, or dependable, or any other characteristic) in our daily lives, and based on our interactions with them we update these internal measures of trustworthiness. If we need to get information from someone we don't know, we form a judgement of their trustworthiness based off of input from people we trust--e.g. giving a reference. This is really just Bayesian inference at its core.

We should be able to come up with a computational model for how this personal measure of trustworthiness works. It would act as a filter over content that we obtain. Throw a search engine on top of this, sure, but in the end you'd still need to get trustworthiness weights onto information if you want it to be manipulation-resistant. This labeling is what I mean by manual curation. You can't leave that up to the search engine or the aggregator because those can be gamed, like the examples you gave for aggregators and SEO for search engines have shown.
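One possible starting point for such a model, offered as a sketch rather than a claim about how people actually do it, is a Beta-Bernoulli update: keep pseudo-counts of good and bad interactions per person, and seed the counts for a stranger from references given by people you already trust. The class `TrustEstimate`, the function `prior_from_references`, and the `strength` parameter are all assumptions made up for illustration. The way references are blended into a prior is a heuristic, not a derived result.

```python
from dataclasses import dataclass


@dataclass
class TrustEstimate:
    good: float = 1.0  # Beta "alpha": pseudo-count of good interactions
    bad: float = 1.0   # Beta "beta": pseudo-count of bad interactions

    def observe(self, was_good: bool) -> None:
        """Bayesian update after one interaction."""
        if was_good:
            self.good += 1
        else:
            self.bad += 1

    @property
    def mean(self) -> float:
        """Posterior mean probability that the next interaction is good."""
        return self.good / (self.good + self.bad)


def prior_from_references(references, strength=2.0):
    """Seed an estimate for a stranger.

    references: list of (my_trust_in_referrer, referrer_rating), both in [0, 1].
    strength: how many pseudo-observations the references are worth in total.
    """
    total = sum(t for t, _ in references)
    if total == 0:
        return TrustEstimate()  # nobody I trust has an opinion: flat prior
    rating = sum(t * r for t, r in references) / total
    return TrustEstimate(good=1.0 + strength * rating,
                         bad=1.0 + strength * (1.0 - rating))


# A highly trusted friend vouches for the stranger; a barely trusted one does not.
stranger = prior_from_references([(0.9, 1.0), (0.1, 0.0)])
before = stranger.mean
stranger.observe(False)  # two bad first-hand experiences follow
stranger.observe(False)
after = stranger.mean
```

The point of the low `strength` is that references only set a starting guess, and first-hand experience quickly outweighs them.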

>We have a pretty good idea of which people are trustworthy (or capable, or dependable, or any other characteristic) in our daily lives //

We really don't. People get surprised all the time that someone had an affair, or cheated, or ripped someone off, or whatever. "But I trusted you" ...

It's actually relatively easy to fool people into trusting you, as many red team members will probably confirm.

Look at someone like Boris Johnson: people are trusting him to lead the country knowing that he's well known to betray people's trust and that he even had a court case lodged against him based on his very blatant lying to the entire country. You can even watch the video of him being interviewed where the interviewer says (paraphrasing) "but we all know that's a half truth" and BoJo just pushes it and pushes it and refuses to accept that it's anything other than absolute truth.

>If we need to get information from someone we don't know, we form a judgement of their trustworthiness based off of input from people we trust--e.g. giving a reference. //

This is domain authority again - trust some domains manually, let it flow from there. If that domain trusts another domain then they link to it, trust flows to the other domain, and so on. Maintaining such trust for a long time adds to a particular domain's trust factor; linking to domains not trusted by others detracts from it.

So how do _you_ make any sort of judgments based off of what people say? What information do you use to judge whether their statements are accurate? Or do you always start with the assumption that everything everyone says is suspect? What sort of information do you use to come to any sort of conclusion, and how do you determine the trustworthiness of that information?

>This is domain authority again - trust some domains manually, let it flow from there. If that domain trusts another domain then they link to it, trust flows to the other domain, and so on. Maintaining such trust for a long time adds to a particular domain's trust factor; linking to domains not trusted by others detracts from it.

This can be gamed if you're able to update the trustworthiness of a domain for other people, and that's why a trust metric needs to be mostly personal, and should update dynamically based on your changing trust valuations.

Pyrrhonism: you start on the assumption that no-one [else] even exists and go from there ... ;o)

Seriously, I'm not so sure -- I try to trust first and then update that status as more information becomes available; but that's more of a religious position.

I don't think it's necessarily instructive to look at my personal modes here. I guess my main point is that if you're going to say "well humans have cracked trust, we'll just model it on that" then I think you're shooting wide of the mark.

Any trust needs some kind of root. The big problem is that you need to prevent a billion real users from being "outvoted" in that Bayesian inference by a billion fake agents (augmented by thousands of paid 'influencers') saying that spam is ham and vice versa, and ensuring that they all have good reputation.
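A toy illustration of the outvoting problem, under invented numbers: a single global average can be flipped by sheer count, while a score weighted by trust that is rooted in one particular user ignores voters that user has no trust path to. The functions `global_score` and `personal_score` and all the data are hypothetical. This does nothing about fake agents that manage to earn trust from someone you rely on.

```python
def global_score(votes):
    """One score for everybody: the plain average of all votes."""
    return sum(votes.values()) / len(votes)


def personal_score(votes, trust):
    """Average weighted by my own trust; voters I have no trust in count zero."""
    weighted = [(trust.get(voter, 0.0), vote) for voter, vote in votes.items()]
    total = sum(weight for weight, _ in weighted)
    if total == 0:
        return None  # nobody I trust has voted
    return sum(weight * vote for weight, vote in weighted) / total


# 1.0 means "this is ham", 0.0 means "this is spam".
votes = {"real1": 0.0, "real2": 0.0, "real3": 0.0}
votes.update({f"sybil{i}": 1.0 for i in range(10)})

my_trust = {"real1": 0.9, "real2": 0.5}  # rooted in people I actually know

everyone = global_score(votes)
mine = personal_score(votes, my_trust)
```

Here the ten fake accounts push the global score well above one half, while the personally weighted score stays at zero because none of them is reachable from the trust root.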

> Having it manually curated is too much of a task for any organization.

ODP/DMOZ worked quite well while it was around. I don't think it would work equally well nowadays as a centralized project, because bad actors are so much more common today than they were in the 1990s and early 2000s; and because the Internet is so astoundingly politicized these days that people will invariably try to shame you and "call you out" for even linking to stuff that they disagree with or object to in a political sense (and there was a lot of that stuff on ODP, obviously!). But federation could be used to get around both issues.
