Hacker News
Almost all searches on my independent search engine are now from SEO spam bots (searchmysite.net)
696 points by m-i-l on May 16, 2022 | 377 comments

So spammers have latched onto your search engine because they are getting useful results. They are able to systematically discover websites built on certain platforms that allow users to post content containing links, which they can target for link spam. It is very difficult to fight this on a technical level because there is an entire industry built around blackhat SEO, with all kinds of software and services dedicated to thwarting your defensive efforts. Even Google struggles to keep up with this.

However, they are also systematically feeding you their footprint lists. I imagine you could put together a footprint blacklist pretty quickly, and just stop returning results for any obvious spam queries like those containing "powered by wordpress".
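A minimal sketch of that blacklist idea in Python (the footprint strings and the function name here are illustrative examples, not OP's actual list):

```python
# Reject queries that match known SEO "footprint" patterns before they
# ever hit the index. These footprints are just common examples spammers
# use to find platforms that accept user-posted links.
FOOTPRINTS = [
    "powered by wordpress",
    "powered by vbulletin",
    "leave a comment",
    "add new comment",
]

def is_spam_query(query: str) -> bool:
    """Return True if the query contains any known spam footprint."""
    q = query.lower()
    return any(fp in q for fp in FOOTPRINTS)
```

A request whose query trips this check could simply get an empty result page, costing the bot its reward at almost no cost to the engine.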

It's not a very elegant solution I'll admit. It won't stop the bots from trying, and you may have to circle back periodically to add new footprints as they surface. But it's a potentially quick and easy way to stop rewarding their efforts, and the blackhat world is pretty used to burning out their resources so hopefully they will figure out it's a dead end and move on.

> So spammers have latched onto your search engine because they are getting useful results.

I'm not sure about this. At least with my search engine, it doesn't really seem to matter what response they get; I don't even think they look at the responses. They keep hammering away with tens of thousands of queries per day, even though they've seen nothing but HTTP status 403 since last October or so.

My best guess is they're going after search engines in general in case those engines forward queries to Google, in order to manipulate Google's typeahead suggestions.

Put a CloudFlare web application firewall in front of the site and then use its rate limiting / CAPTCHA features to throttle traffic. It is the easiest way to get rid of parasitic scraping and API abuse. Cost is $0.

Yeah, that's essentially what I've done, except I'm paying for their cheapest non-free tier to have a bit more control over it. I really wish I didn't have to route all my traffic through an untrusted 3rd party like that, but I guess we can't have nice things on the Internet anymore.

> I guess we can't have nice things on the Internet anymore

Not since it left the larval stage and became "pay for play", no.

Oh, well, those taxpayer-funded years were nice for those of us who were around.

I think I remember wondering, after the dotcom bust, if the whole web thing would actually take off.

The reasoning I vaguely remember reading was that the internet required government subsidy to exist - at first directly, then in the form of universities, and the bust was a sign that it couldn't exist without one.

I don't remember how prevalent the view was at the time though. Obviously it turned out to be wrong.

Putting authentication on the site would be easier.

There's a rub here, in that people expect to search without being logged in. But if you don't require login, anyone can come calling, including bots. That pushes you to get a third party to filter the traffic, which in turn affects users by rerouting their requests through someone else just to shed the visits you don't want from the bots.

And round and round.

Simple authentication to the site with tokens might solve the problem. If an IP comes calling without authentication, or payment, then hang the connection.
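A rough sketch of what that token gating could look like (the secret, the names, and the issuance flow are all hypothetical; presumably tokens would be handed out at signup or payment):

```python
# Hang up on callers that don't present a valid token. Tokens are HMACs
# of the user id, so the server can verify them without a database hit.
import hashlib
import hmac

SECRET = b"server-side-secret"  # assumption: kept out of source control

def make_token(user_id: str) -> str:
    """Issue a token for a known user (e.g. after signup or payment)."""
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()

def is_authenticated(user_id: str, token: str) -> bool:
    """Constant-time check; anyone failing this gets the connection hung."""
    return hmac.compare_digest(make_token(user_id), token)
```

`compare_digest` avoids timing side channels when comparing the presented token against the expected one.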

> Cost is $0.

Cost is the slow enclosure of the internet by a handful of giant companies, and once attestation is universal, anyone without a locked-down device will be locked out of most of the internet unless they provide endless free labour.


Huh, well I guess there goes my theory about the incentive. What a bummer. I would have thought that at least with search engine scraping, they would stop expending the effort once the results dried up.

Or put those query results behind an anti-bot/"capcha" test.

That would probably help, but it's also a continuation of the cat and mouse game. There are plenty of captcha breaking services out there; it only costs about $1 to programmatically solve 1000 captchas.

> There are plenty of captcha breaking services out there

Give it a try and see what happens.

People said greylisting against email spam wouldn't work, since spammers would just resend. It has worked for 20 years. To get your IP off the DNSBL NiX Spam, you just have to follow a link. People said spammers would automate that process. Never happened in 19 years. Sometimes spammers are just lazy.
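For reference, greylisting is tiny to implement. A toy sketch (the retry window and names are illustrative, not any particular MTA's implementation):

```python
# Greylisting: temporarily reject the first delivery attempt from an
# unknown (sender IP, from, to) triplet; accept a retry after a delay.
# Legitimate MTAs retry on a 4xx temporary failure; most spam cannons don't.
import time

GREYLIST = {}      # triplet -> timestamp of first attempt
RETRY_DELAY = 300  # seconds the sender must wait before a retry succeeds

def check(triplet, now=None):
    """Return 'tempfail' (send a 4xx) or 'accept' for this attempt."""
    now = time.time() if now is None else now
    first_seen = GREYLIST.setdefault(triplet, now)
    if now - first_seen < RETRY_DELAY:
        return "tempfail"
    return "accept"
```

The whole trick is that the cost of compliance (retrying) is trivial for honest senders and structurally awkward for fire-and-forget spam software.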

Sure, but it increases friction that forces a re-eval of cost/benefit of the bot(s).

The newest captcha services return a prediction score rather than a verification screen, and you can feed polluting data to the bots you are certain exist.

Agreed. I suspect that this is an arbitrage game on the part of the SEO spammers. Each search is cheaper for them than it is for a competitor who's using a major search engine with more extensive anti-spammer protections, and that difference equals $$$. A captcha doesn't have to be an unbeatable solution. It just has to provide enough of a barrier to equalize the cost.

I'm not so sure about this. The spammers' goal is to build up as big a list of link spam targets as possible. If one spammer chooses to only scrape minor engines and another only major engines, the one scraping the major engines will probably come out on top despite the higher cost. Whoever is abusing OP's search engine is likely doing it to supplement the data they are already scraping from the major engines.

For OP, I think simply not returning results at all is a more practical measure because it removes the reward completely. Captchas and bot detection keep the reward in play, while taking away the results entirely makes the entire pursuit futile.

It might be a better idea to return low quality results than nothing at all. When a bot receives no results at all, it's pretty obvious to its operator that it has been banned; having to look at the results manually to determine whether one is banned is a much more time consuming endeavor.

Well what I'm suggesting isn't about blocking the bots, it's about removing the incentive. So in this case, I think the more obvious it is the better. I would want them to realize as soon as possible that they are 100% wasting their time.

If anything, it might be best to return a page that explicitly states "Sorry, this search engine no longer supports SEO footprint search queries."

*edit for typo & wording

On the other hand, making content difficult to parse is easy to do and a very strong weapon. Make them waste dev time... It is much easier to make variants of HTML than it is to parse it. You can even automate it to some degree.
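A toy illustration of that idea: rotate the markup per response so a scraper's fixed selectors keep breaking (the class names and wrapper templates here are arbitrary examples):

```python
# Serve the same result content inside randomly varying HTML structures.
# A scraper keyed on a specific tag/class layout has to be re-written
# each time, while a human-facing browser renders all variants fine.
import random

def render_result(title: str, url: str, rng: random.Random) -> str:
    cls = "r%04d" % rng.randrange(10000)  # rotating, meaningless class name
    wrappers = [
        '<div class="{c}"><a href="{u}">{t}</a></div>',
        '<span class="{c}"><b><a href="{u}">{t}</a></b></span>',
        '<li class="{c}"><a href="{u}"><span>{t}</span></a></li>',
    ]
    return rng.choice(wrappers).format(c=cls, u=url, t=title)
```

Generating a variant is one `choice` call; parsing all possible variants robustly is the scraper's ongoing dev-time problem.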

Deliberately feeding the spam bots into an endless loop of captchas might slowly drain their accounts if they are paying 3rd party captcha farms.

Then monetize by setting up your own captcha farm, but instead of paying for compute, send the captcha to the spam bots, who send it to another captcha farm and solve it for you.

As I understand it, the main point of CAPTCHAs isn’t to keep out bots completely, but to give enough friction to make automated attacks or uses infeasible, while keeping the friction low enough that normal users can still use it normally.

... and there are the "click farms" with human beings.

If someone pays people to collect data, you could outright sell the data to them.

Captcha breaking is SO easy these days; even the modern captchas are easy to defeat.

How about serving bots with one link per page, and taking a minute to serve each page? Would this impact their efficiency?
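Something like this tarpit sketch (the one-minute delay is arbitrary, and a real server would sleep asynchronously rather than block a worker thread):

```python
# A "tarpit" for suspected bots: each page takes a long time to arrive
# and yields exactly one result link plus a link to the next page,
# so harvesting N links costs roughly N minutes per crawler connection.
import time

def serve_tarpit_page(links, page: int, delay: float = 60.0) -> str:
    time.sleep(delay)  # stall the client; use an async sleep in production
    link = links[page % len(links)]
    return f'<a href="{link}">{link}</a> <a href="/page/{page + 1}">next</a>'
```

Whether it's worth it depends on whether the bots parallelize; a single slow endpoint mostly taxes naive sequential crawlers.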

Considering that as of Mar 12, this search engine only has 1001 sites indexed, I am not sure how useful this site is for getting SEO backlinks. Speaking of which, are backlinks still a thing these days?

They are, but the useful ones are those coming from sites with higher domain authority rankings.

That's why you'll see fluff pieces (aka paid content) from online publications like Forbes for the better-funded entities.

Another approach is to reach out to site operators with offers of writing content, or asking them to link to your site's content from their existing content.

It’s expensive and/or incredibly time consuming to get back links that matter.

If the confidence was high enough, perhaps return garbage data?

> It is very difficult to fight this on a technical level

It is when your base assumption is that you won't hire outside of engineering. There are more bored teenagers with phones than people creating quality content, so I'm not sure why you wouldn't just brute-force checks against bad actors.

Just to throw out ideas: what if he decided to charge for each search, say 1 cent or so? Users could purchase searches in bulk, say 100 for $1.

The world is getting more and more desperate for a better search engine. The day may come when people are willing to pay for better results.

What is the end goal here? I understand it's about making money somewhere down the road, but how?

Since everyone in this thread wants to jump down OP's throat about the quality of his web site, another interesting search engine is millionshort.com, which allows you to filter out the top N web sites from the results of your search. It's a great tool for looking past sites with good SEO; all you have to do is fiddle with the value of N.

For example, searching for "electronic music box" as /u/ajnin suggested, with the top 100K web sites removed from the results, filters out the following:

> These 23 sites were removed from your results:

> alibaba.com (1 result removed)

> aliexpress.com (1 result removed)

> allaboutcircuits.com (1 result removed)

> amazon.com (2 result removed)

> apple.com (1 result removed)

> bestreviews.com (1 result removed)

> ebay.com (1 result removed)

> etsy.com (2 result removed)

> facebook.com (1 result removed)

> instructables.com (2 result removed)

> lightinthebox.com (2 result removed)

> lumberjocks.com (1 result removed)

> mapquest.com (1 result removed)

> reverb.com (1 result removed)

> twitter.com (1 result removed)

> wikipedia.org (1 result removed)

> yelp.com (1 result removed)

> youtube.com (2 result removed)

And the top result ends up being https://midiguy.com/.
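The mechanism Million Short describes can be sketched in a few lines (the top-sites set below is a tiny stand-in for their top-100K list, and the two-label domain heuristic is deliberately naive):

```python
# Million Short-style filtering: drop results whose registrable domain
# appears in a top-N popularity list, and report what was removed.
from urllib.parse import urlparse

TOP_SITES = {"amazon.com", "wikipedia.org", "youtube.com"}  # stand-in list

def strip_top_sites(results):
    """results: list of URLs. Returns (kept URLs, {domain: removed count})."""
    kept, removed = [], {}
    for url in results:
        host = urlparse(url).hostname or ""
        domain = ".".join(host.split(".")[-2:])  # naive eTLD+1 guess
        if domain in TOP_SITES:
            removed[domain] = removed.get(domain, 0) + 1
        else:
            kept.append(url)
    return kept, removed
```

A production version would use the Public Suffix List instead of the last-two-labels guess, since that heuristic misfires on hosts like `example.co.uk`.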

That's an outstanding concept. One problem though: wouldn't it also filter out high quality curated results?

Yes, such as Wikipedia, for example.

I assume the idea is this is a secondary search, after Google has failed once again to return anything other than Etsy and Pinterest results.

It also seems fairly customisable, like I can search and include all results but choose to remove ecommerce, or sites with live chat (weird filter, but I like it).

Million Short also has an option to remove only e-commerce results which is invaluable if you still want results from sites like Twitter, Wikipedia and YouTube but don't want online shopping spam.

Would this also work for the fake-sites-stealing-text-to-look-legit sites since they quickly end up in the top results?

This made me curious to try that search engine, so I typed "electronic music box" (the first thing that came to mind). As far as I can tell, none of the 10+ pages of results include all 3 of those words. I mean, you might not have any relevant sites in your database (likely if there are only 1000 sites or so, as another of your blog posts implies), and I understand you want to show some result to the user, but if I want irrelevant links I might as well go to google.com...

What the heck is an 'electronic music box'? I personally wouldn't expect those three words to show up on any sites served by a small search engine.

It's a music box that produces sound electronically, as opposed to traditional mechanical ones. I don't think it is that foreign a concept. It might be present, or not, in the search results; it depends entirely on the niche, and I could not know which it was by just reading the blog post. Anyway, that was not really the point of my test.

Yeah, same. I searched for Leeds grand theatre and the top result is something titled "June 2012 – Sam's Blog", which just mentions the word grand.

You mention the "Dead Internet Theory" (not heard that phrase before!).

I agree: the WWW Internet is dead, that is your problem. No-one visits websites anymore, everyone has moved to the 10 biggest websites and all data is now siloed there.

If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.

The general Internet is dead: it's just legacy content and spam.

If you think it's bad for you, imagine what it is like for Google Search! Their entire business is indexing a medium which no longer has any relevancy. People complain that Google no longer delivers good results. But what can Google do? The "good content" is no longer available for them to index.

Want to become rich? Make a search engine which indexes the fresh relevant data from the big siloed websites, and ignores the general dead Internet.

I built my search engine in part to explore whether this was actually true, and I don't think it actually is.

There's still a lot of organic human-made content out there, possibly more than ever; it's just not able to compete with the SEO industry that completely displaces it from Google and social media.


> If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc. The general Internet is dead: it's just legacy content and spam.

The "general" Internet is not dead. Though if you only want to participate in Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, and Discord, you might well think that.

Users of marginalia (author above), Mojeek (disclosure: CEO) and others [0] are well aware that there are riches of organic human-made content; from years back and new. Yes, a lot of noise too, which Google has a bigger (SEO) struggle to compete against. But still there is good and different content available.

To find good content using search, you need to use "search" engines which enable discovery, as Google used to do. I stress the "search" because the emphasis of Google, Bing, and thus their syndicates is increasingly on being "answer" engines.

[0] https://seirdy.one/2021/03/10/search-engines-with-own-indexe...

> The "general" Internet is not dead.

For some things it is. Good luck getting a non-sponsored/SEO-gamed review of a kitchen appliance or a particular kind of vacation such as a cruise. It's flabbergasting.

Most times I just stick "inurl:reddit.com" in my search and try to get discussion threads about the thing I'm researching, but even that's getting filled up with shills.

I think search engines are broken, but the Internet itself is probably not "dead". It's just our accessibility to that information. That's not super helpful until we have better search engines (which steer us away from this SEO stuff), but the good news is that building a better search engine is easier than resurrecting the Internet. In particular, there's a good chance that a niche, naive search engine might be able to significantly improve accessibility (e.g., high rankings for pages that answer user queries in the fewest bytes).
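A naive sketch of that "fewest bytes" idea (a purely hypothetical scoring rule, not any engine's actual algorithm): among pages matching the query, rank the lightest first.

```python
# Hypothetical ranking: among relevant pages, prefer those that answer
# the query with the least page weight, penalizing bloated SEO content.
def rank_by_bytes(pages, query):
    """pages: list of (url, html). Returns matching URLs, smallest first."""
    hits = [(len(html), url) for url, html in pages if query in html]
    return [url for _, url in sorted(hits)]
```

This single signal is obviously gameable too (strip a page to nothing but the keyword), so it would only work blended with ordinary relevance scoring.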

These websites seem to be last updated decades ago, which is prehistoric to most casual browsers. There's no doubt there is great content on the general internet, but these examples I would classify as "legacy".

I can see why the website owners would be interested in getting traffic to recent websites, but why would you be interested in recently updated websites?

Stores typically stock recently manufactured products. Once the manufacturer discontinues a model and inventory is gone, that's a wrap. Sometimes the product was good and gets replaced by an inferior one (in the spirit of old Burger King vs new post-acquisition Burger King), other times it's just small tech refresh tweaks, and everything in between.

A real litter of inconsistency between unrelated external organizations and varying markets and skill sets.

Most of these are spam. They contain affiliate links to Amazon to buy the product which is being reviewed, therefore the review cannot be trusted.

"Which" looks to be the exception, but that is a paid-for service.

It's a sad state of affairs.

I understand your opinion about affiliate links - but I use several review websites that use such links for all products they review, and have both positive and negative reviews for products. So I wouldn’t say it necessarily follows that affiliate links = biased reviews.

How often do they give their best review score or opinion to a product without an affiliate link? Not every product will have an accessible affiliate link.

Isn't Amazon commonly used for most affiliate links, or has that changed in recent years? Amazon isn't the cheapest all the time any more, nor is its customer support the best any more.

Also, I've noticed that the list of products reviewed is limited to only those that _have_ Amazon affiliate links. If a product is only available on not-Amazon stores, they don't even get mentioned. Which is a bias in itself.

Yeah that’s what I was thinking too. A big bias right there.

Everyone is trying to game the Google algorithm. The net result is all this long form content and cooking recipes that are 10 pages long.

There seems to be a big disconnect between a typical user's attention span and the length of a post.

I thought the recipe thing was to be able to copyright them

That’s just gaming a different algorithm.

Yes, but importantly not one that Google or any search engine can do anything about.

Sounds like we’re back to AskJeeves and a number of failed answer engines from a couple of decades ago!

AskBERT but now MUM knows best.

This matches my findings 100%. The WWW is active and bubbling, but virtually all the cool websites I've found in the last 10 years or so came through friends, small IRC channels, or more recently through marginalia.nu :-). Google and friends are facilitators for the SEO and tracking industries, so of course they have zero interest to prioritize these things over content spam -- their whole business runs on content spam. But the WWW is as alive as it gets.

I take myself as an example.

People that know me and don't meet me regularly might know the URL of my web site and might care to look at it once per year to check if there is something new. Usually pictures and tales from holidays. Covid made those holidays less memorable, so I haven't made any updates since fall 2019. People that meet me regularly don't need that website; I'm telling them the tales first hand and showing them the pictures without being obnoxious. I guess that this website is a target for your search engine, except it's not in English and your search engine seems to want English search phrases.

I don't have anything of value to share on a public chat like Twitter and I don't have an ego to pretend I do. I also don't use Facebook anymore. I go there once per year to like the messages that wish me happy birthday. I think it's polite to do so. All my media production is on WhatsApp or Telegram in group chats with people I know in real life.

If I really cared about producing content for the world I'd probably be using Twitter, Medium or the fad of the year and they'd take care of my SEO (do they?) or I'd be trying to score points on StackOverflow.

To recap: I never intended to compete on SEO. I'm really OK that my website is only for friends and spreads by word of mouth. It probably never did, I bet it's been on a flatline since I created it 20+ years ago.

All open systems are destroyed by spam once they become popular enough to be profitable targets. This will eventually happen to the Fediverse too. If there is money to be made pissing all over the commons, the commons will be pissed all over.

It even happens to proprietary silos if they are too open. Look at how many bots and spammers infest social media. Propaganda and disinformation can also be considered a form of spam.

I realize this sounds cynical but don’t shoot the messenger. It’s just something I’ve learned watching the Internet evolve since the middle 1990s. Spam eats everything it can.

IMHO the future is enclaves and invite only communities. The Internet is a dark forest.

It's not cynical, it's how every system in nature works. Everything alive must develop an immune system or it is attacked and eaten.

As old open systems are destroyed, new ones are created to replace them. The Internet exists in a constant state of rebirth and transformation. You really can't step into the same river twice.

> You really can't step into the same river twice.

I love the maxim and philosophy of eternal refreshment.

Seems like the problem is more akin to having nuclear waste dumped into our rivers though.

> This will eventually happen to the Fediverse too.

Oh, don’t worry, the Fediverse will never catch on.

Why? Serious question.

You are probably right about the future; not necessarily because of spam, though that's a part of it, but just because of the toxicity of global, open to the world, mostly public social media. The Fediverse has mostly coasted by so far on obscurity, but it's not great, and it's bound to get worse. All of my online socializing these days is either through short-lived pseuds on topic-oriented fora, or invite-only Matrix rooms.

How do you surface organic human content? I happen to linger around the fediverse/tildeverse sphere where I see organic content from people I personally have a direct (digital) connection to (and I started self-hosting my music after Epic bought Bandcamp), but I'm not clear on how I'd go about digging that kind of stuff up in the more general case.

It's not about surfacing organic human content, it's about only indexing organic human content. The problem is automated indexing. So long as indexing works according to defined rules, the advantage will be to those able to shape their content to those rules, and the spammers and scammers will win.

An idea I've had for a few years is making a social-network based index engine. The only pages that get indexed are pages that users themselves mark as worth indexing, and the only pages returned in your results are pages that were marked for indexing by people you added to your circles, or the people in their circles, or the people in those circles, etc (probably up to 5 or 6 degrees of separation).
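A toy sketch of that circle-limited index (the graph shape and all names are made up; a real system would need an inverted index and a much smarter trust model):

```python
# Only return results vetted by someone within k degrees of the user.
from collections import deque

def trusted_circle(graph, user, max_depth=3):
    """BFS over the circles graph; graph maps user -> set of trusted users."""
    seen, frontier = {user}, deque([(user, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for friend in graph.get(node, ()):
            if friend not in seen:
                seen.add(friend)
                frontier.append((friend, depth + 1))
    return seen

def search(index, graph, user, query, max_depth=3):
    """index maps url -> (page text, set of users who marked it indexable)."""
    circle = trusted_circle(graph, user, max_depth)
    return [url for url, (text, vetters) in index.items()
            if query in text and vetters & circle]
```

Spam only enters your results if someone in your transitive circle vouched for it, which also gives you an obvious person to drop when it does.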

...so, blogrolls?

Not familiar with blogrolls, but not quite. The idea is more to have standard search engine user experience, but with the requirement that each result is vetted by someone the user trusts, or trusts by proxy.

> Not familiar with blogrolls

Not directed at you specifically but this is the actual problem.

We already had a good system for these things. Delicious, blogrolls, RSS, the folksonomy ..

> up to 5 or 6 degrees of separation

So basically everyone on earth?

Alright, 2 or 3!

Sounds like a great idea, execution will be key...

I do a traditional web crawl and exclude anything that looks too much like it wants a high google ranking. Nothing to it.

This might be controversial, but I wish Google would exclude those websites too.

Google started punishing keyword spam, then it started punishing black-hat comment spam. Even Youtube backtracked on the "videos have to be 10 minutes to rank".

I wish they would do the same for carefully manicured SEO content farms too, as those sites are causing a harm worse than keyword-spammer sites did.

They're probably doing all they can. The problem is their dominance: it means they have effectively an entire industry looking for loopholes in everything they do, as well as legal considerations (arbitrarily punishing individual smaller actors might skirt the territory of anti-competitive behavior).

I fear that Google also has a conflict of interest here. A lot of these non optimized sites are not interested in making money via ads. So Google wouldn't profit additionally from leading people there.

And a lot of people (myself often times included) are looking for a quick answer. A good enough answer. So good enough, SEO optimized is being surfaced. The result of an optimization war on both sides combined with the inevitable monetary interests.

I don't have a solution. Sadly.

I think there are two kinds of SEO spam going on.

The black-hat kind is definitely made to extract money from ads. But those are easy to avoid for web veterans IMO. And I also feel that Google is doing its part, even though it's costing them money from those sweet ads!

But the white-hat kind, also known as content marketing, is made to let legit companies save money. Instead of paying for Google Advertisement, they get traffic by means of organic content. Think "Michelin Guide" or "Red Bull". Which is a jolly fine idea and responsible for a lot of good stuff, but the problem is that this has been taken to extremes, and now the web is littered with low-effort content made by freelancer writers getting peanuts.

I would personally prefer if those freelancer writers were doing 10 interesting Red Bull articles per month rather than 500 rehashes of contents from other websites. But who am I to judge.

In the news industry things are also very similar.

The "white-hat kind" can trivially be filtered out (or deterred) by downranking any of the crap these marketers use to measure their conversion rate - analytics, etc.
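As a sketch, such a downranking signal could be as simple as the following (the marker list is a small illustrative sample, and the penalty numbers are arbitrary):

```python
# Downrank pages that embed common tracking/analytics snippets, on the
# theory that heavily instrumented pages correlate with content marketing.
TRACKER_MARKERS = [
    "googletagmanager.com",
    "google-analytics.com",
    "connect.facebook.net",
    "hotjar.com",
]

def tracker_penalty(html: str, per_hit: float = 0.2) -> float:
    """Multiplicative ranking penalty in (0, 1]; 1.0 means no trackers found."""
    hits = sum(marker in html for marker in TRACKER_MARKERS)
    return max(0.1, 1.0 - per_hit * hits)
```

The final score would be something like `relevance * tracker_penalty(html)`, so tracker-free pages win ties against otherwise equally relevant marketing content.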

I love this idea. Would be nice to see it in a search engine, or at least a browser extension showing how much analytics junk a site has before you click it.

Kagi has a non-commercial filter that I suspect uses the presence of ads/analytics as a signal.

Does anyone have an ad free search engine? You'd start with blacklists from ublock origin, pi-hole, and similar, don't bother even crawling those, then have easy reporting for new or self hosted ads. Not much money in it if any, but it would be refreshing. Might even have a mode to nix anything with a payment method on the site, or that links to a site with a payment method.

> Does anyone have an ad free search engine

kagi.com search.marginalia.nu

Maybe back to Yahoo model of the 90s? Manually created collection of curated links?

Yes. We have enough users now.

I love your search engine. Should I stop recommending it to friends to keep it safe?

I jest a little bit, but your comment genuinely makes me wonder if Marginalia++ is search results - Google - Marginalia

Welcome to the billion dollar question. Any place that is authentic will face the zombie horde attempting to fake authenticity in order to capture attention.

I think you're almost right, but it's not necessarily authenticity... I think it's just money.

Large "authentic" search engines can exist to serve the rest of the web, those personal blogs and other small communities. Those sites have a natural tendency to not be trying to turn everything into a revenue stream, so if that was the prerequisite for an engine, it would be a perfect match and naturally dissuade marketing types.

Authenticity is worth money.

When you have a 'real' community you're talking about real people with real salaries and desires, add in that you tend to develop a real trust between members. Think of this as fertilized soil. You can grow crops in it, but weed seeds will eventually land and try to take over it.

HackerNews is a good example of this; it takes a healthy amount of moderation to keep things on topic, where things like politics get pared pretty ruthlessly. If for a minute Dang gave in and found ways to additionally monetize the forums, something that would be profitable for a while at least, things would start down a bad path.

I can only agree with my sister comment. I find this industrialized web more and more shallow and taxing to use.

While professionally I need to help (smaller, local) clients to reach their audiences I become more and more weary.

It is like walking through a supermarket with industrialized fast convenience food shouting in bright colors and advertising while ultimately not nourishing me like slow, real food could.

I am still looking for this digital slow food movement.

> I am still looking for this digital slow food movement.


Please read it, and if you enjoy it please suggest it to friends.

Read the intro. So you find vegans annoying (because they 'are the future'), and you're not a vegan yourself – and you write that digital veganism is more important than actual veganism. Now that's a way to start off well!

I second that independent sites exist - I maintain my own website on a personally run server. There are dozens of us! to quote a quaint phrase.

And who uses your search? I had never heard of "you" until just now. And there is the problem with "new" search engines. Unless you can come up with what would have to be one of the greatest ad campaigns the world has ever seen, no significant number of users will know you exist. Where does the money to pay for that ad campaign come from? How will a search engine generate money to stay relevant? Once people see you becoming relevant, they will figure out how to game your system. It's just the nature of the beast. I don't think I'm being overly cynical about this either.

Why would I need to generate money to stay relevant?

Edit: the first "relevant" was the wrong word; "sustainable" would be more appropriate. On the assumption that hosting the search engine isn't free, and unless it is supported by a generous benefactor, it will need a way of generating money to keep the servers running.

I'm self hosting so my operational cost is like $50/mo.

then he must be relevant

Agreed, the general internet is not dead, but the majority of internet users are on Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.

From my perspective, we onboarded a lot (if not most) people to the internet after 2007 (the explosion of social media). People sticking to big sites really speaks to an inability to explore the larger internet and a lack of knowing why you would even want to.

I think the answer is in the name: "social" media.

Most (99%) people use the Internet most (99%) of the time to see or hear what other people are up to. The big sites are where all the other people are. QED.

(This comment falls into that space)

I added it to my list of search engines on Firefox... your favicon is really small, that's on purpose?

> No-one visits websites anymore, everyone has moved to the 10 biggest websites and all data is now siloed there.

Really? We make our living running a small web based publication; around 40k readers a month. I know of many other sites like this. Google, and other search engines, depends on niche websites to provide quality search results. Without sites like ours, the internet would truly be dead, and search would be mostly useless. Our "traffic sources" come from a mix of Facebook, Search, Reddit, etc, in addition to our many loyal readers.

Others in our niche are producing blog spam, which looks nearly identical to people who aren't experts in the field, but we have real experts, fact checkers, etc, as part of our production process. This is a big problem: These low quality websites get similar rankings to our own, which does make it much harder for people to get quality information via search. (Hence the general shift towards trusting social recommendations, such as from Reddit.)

In short, the WWW is alive and well, it's just buried under a bunch of #$#$%.

> Our "traffic sources" come from a mix of Facebook, Search, Reddit, etc, in addition to our many loyal readers.

40k/mo is a pretty good number for an independent website. As a word of warning though, relying on social media reach is a dangerous game, as there is anecdotal evidence that tweets with outbound links don't get as many impressions as those that link to in-site content, like another Twitter post.

As for Facebook, well, there's a good comic from The Oatmeal (enormously popular on FB back in 2010) that talks about what happened in the long run:


The internet itself is probably gonna die soon anyway. Every country wants to impose its own laws on it. I think it'll eventually fragment into multiple segregated continental networks, if not national ones, all with heavy filtering at the borders.

I'm happy to have experienced the free internet. Truly a jewel of humanity.

I think this was inevitable all along, something similar happened to radio if I'm not mistaken.

However, the good news is that we will never stop reinventing everything. The real value of the old internet was showing us what is possible.

> The real value of the old internet was showing us what is possible.

Of equal value is that it showed us what not to do.

We have 30 years of documentation for research on exactly what a successful intra-planetary network needs to be immune to. A successful future network must build in resistance to all forms of human psychopathology from the ground up.

This is a nice fantasy, but it's a fantasy. The tech stack and network we have is too dense a forest to be replaced by clean slate designs. But maybe some of the problems could be improved with some new platforms and APIs. Mind you, ML is making so much progress so quickly that what happened over the last thirty years is at best a partial model of the problem we have to solve now, and the tools we have to do it with...

> ML is making so much progress so quickly that what happened over the last thirty years is at best a partial model of the problem we have to solve now, and the tools we have to do it with...

Sorry I don't see how ML can help here. It seems like another thing to pin hopes of repairing an already too broken system on.

"We cannot solve our problems with the same thinking we used when we created them." -- Albert Einstein

"A new scientific truth does not triumph by convincing its opponents and making them see the light, but rather because its opponents eventually die, and a new generation grows up that is familiar with it." -- Max Planck

We are the dying generation my friend. We built it. They came. It didn't work. Surely if ML can do anything it's telling us that we need to tear down the old system completely and start again, don't you think? Adding sticking tape won't help.

edit: turning a grunt into an honest question

> I think it'll eventually fragment into multiple segregated continental networks, if not national ones

That's exactly the world in which the Internet grew. There were multiple segregated national and sub-national networks, and the Internet was built as a means to interconnect them. After some time, the Internet protocols ended up being used even within these networks, but that was not originally the case. And even today, there are still things like the AS (Autonomous System) concept which permeates the core of the top-level Internet routing protocols, which still reflect the Internet being a "network of networks" instead of a single unified network.

That's why I'm not too worried about the Internet fragmenting; we've seen this before. What happens next is gateways between the networks, and there are already shades of these in the VPN providers which allow one to connect as if one were located in a different network, often from a different country.

This made me sad. The optimist in me believes that some alternative will be built that could take us back to those days. Honestly, for most of my life I experienced a mostly American Internet (from South Africa); as long as one can still hop from one internet to another, in as simple a manner as possible, it might not be as bad as it could be.

I'm sad as well. To me it feels like we're already living in a cyberpunk nightmare, things just keep getting worse and there's nothing anyone can do to stop it.

The networking may have been open like that, but I'm not sure the content ever was. It seems to me like a lot of internet users consume mainly the content of sites from their country. Kind of hard to blame them when that content is probably going to download fastest. But the language barrier has also kept the internet from becoming truly global.

> I think it'll eventually fragment into multiple segregated continental networks

i think it already has.

the Great Firewall of China is the classic example, but I think the trend started in the west with the Right to be forgotten/right to erasure in Europe, and subsequent HTTP Status 451 Unavailable For Legal Reasons. GDPR just further cemented the split between Europe and the rest, and the new DMA & DSA regulation in the European Union finally makes it clear. The writing is of course on the wall, so countries like India or Australia aren't too far behind. Places like California also have their own "right to be forgotten", and I'm sure the US will not be left behind for too long before we see regulation further splitting their internet from the RoW. And I don't think the RoW will hold off much longer till it also splits into multiple big blocks. It's the start of the new "nationalist" internet, and I'm sure we'll all be poorer because of it.

Exactly what I mean. There is no way to have an international network with national borders. Telecommunications providers have always been centralized and have always been in bed with the government. Only way we'll ever be free is if someone invents some kind of decentralized long range wireless mesh network.

Good luck, spectrum is highly regulated in every country I can think of. If national governments don’t want you networking across borders, you’re definitely not going to be broadcasting long range radio transmissions that way. In fact, it’s currently illegal to transmit encrypted data or to relay packets via ham radio in the US.

Who knows? The whole point of decentralization is for there to be so many nodes in the network they can't possibly take them all down so that it's pointless to even try. What if all smartphones formed a mesh network? There aren't enough prisons in my country for all those criminals.

I agree with your ethos, but I don't share your optimism. If the state wants to enforce networking firewalls along national boundaries, no technological solution will save us in general. As a resourceful techie with the right know-how you may be able to sneak your packets through, just like people in Cuba receive a literal packet of data via sneakernet, but if the state doesn't want widespread meshnets circumventing their firewall, they will imprison you for emitting pirate radio signals, they will penalize any electronics manufacturer that makes non-compliant hardware, and rest assured that companies will go right along. Liberty requires more than technical solutions.

I'm saying this as someone who once wrote a decentralized P2P mesh for instant messaging[1]. I was inspired by the HK protests going on ~2014 after hearing that they were using Bluetooth chat apps. Luckily Matrix, Telegram, Signal, etc. mostly solved the problem. Still, I don't think any amount of mesh networking would turn back the tide of Hong Kong now.

[1]: https://github.com/zacstewart/comm/

>What if all smartphones formed a mesh network? There aren't enough prisons in my country for all those criminals.

There don't need to be. You publicly gruesomely execute the first 100 or so you catch, and the practice of running a mesh node on your cell phone will fall so far out of fashion that the network breaks.

Societal shortcomings cannot be fixed via tech alone. If you can't build a society resilient to authoritarianism in the first place, tech will not help you. It can be used to increase resilience, but that's far from fixing the problem by itself.

Like Starlink?

Starlink is maintained by a company, it's an internet service provider. One visit from the police and they'll censor anything.

The mesh network should be made out of common hardware in order to be viable. I'd suggest phones but those devices are owned before they've even left the factory.

One visit from the US police. US-unfriendly countries have no leverage over it, and similarly, the US has no leverage over satellite ISPs based in countries they aren't on good terms with.

> US-unfriendly countries have no leverage over it

"Star Wars Episode 10: The one that's not fiction."

Internet censorship isn't worth going to war over and disclosing secret anti-satellite weapons that are better saved for a rainy day.

It's probably easier to just cut off outgoing payments to Starlink anyway. They're not a charity, so if they don't get paid, they probably don't want to provide service just to send a message to some random government.

On the other hand, if you want to demonstrate that you have anti-satellite capability it's probably a better idea to shoot down a corporate satellite than a military one. The Soviet Union shot down Korean Air Lines Flight 007 and it didn't start a war, after all.

> It's probably easier to just cut off outgoing payments to Starlink anyway.

Cryptocurrencies might be a problem in this plan, and satellite internet access itself might become a currency (since unlike cryptocurrencies, this one has almost intrinsic value and provides its own infrastructure that's very hard to block, whereas cryptos rely on external sources of Internet access).

It also depends - drugs have consistently won the war on drugs despite being a physical product that needs a local supply chain and various anti-money-laundering and banking/finance regulations that should make it hard to fund the operation. Satellite internet access is likely to be even easier as it doesn't rely on a physical product (if we reach this stage there's going to be clandestine satellite terminals built locally, so blocking shipments of the real thing isn't going to cut it).

The only solution, apart from North Korea-levels of isolation (and even then, NK has the advantage of their population being isolated & indoctrinated since birth, something most other countries won't achieve even if they turned authoritarian overnight) would be detection followed by harsh punishment, but this has the downside of not only wasting the disclosure of detection capabilities (that are useful to the military) but also outsourcing the R&D of evading such capabilities into the open which enemies will no doubt pick up on too and use against you in a conflict.

Starlink connects to standard internet gateways on the ground. It cannot function without the 'regular internet', unless a replacement appears.

IIRC there was mention of it providing some p2p network style communication capabilities for Ukraine's military, and one of the reasons it's appealing to the US's military is the ability to route communications entirely within the network (well, with the gen 2 satellites which have laser interconnects).

So it can (at least eventually) function without 'regular internet', although I would still be hesitant to call it a viable infrastructure choice if the goal is to get around government control, simply from how much SpaceX have to appease the government to do anything space related.

These discussions always make me recall Jacob Applebaum. Think of him what you want, but this statement of his really stuck with me at the time. Paraphrasing:

The real dark-net is facebook. Everything that goes in there never comes out again and is basically invisible to the world, except if you join facebook yourself.

My own prime example of that used to be pinterest: it seems to be a 100% sink in the directed graph of internet links. But since Applebaum stated this, instagram (also facebook of course) is trying hard to push pinterest off that particular throne.

To me this is also Discord, which seems to have become the chosen alternative to online forums for many communities and basically hides what used to be the public face of those communities.

Interesting thought. I just went through my browser history and realised that almost every time I use google search, I already know what website I want, I just don't know the exact link/page. I'll use google because the search on stack overflow or reddit sucks, but I know I'm looking for a page on one particular site.

I realized this too. I disabled search from address bar and started bookmarking everything even remotely sane I see. I often add a few personal keywords to the bookmark bar.

It is starting to pay dividends. Instead of weird stuff thrown up by google when I type in something, I get the "oh yeah, that was the page" from a short list of bookmarks shown to match the words.

I had the same realization and ended up setting up a simple Cloudflare script to automatically do an “I’m Feeling Lucky” style search to return the first result: https://notes.npilk.com/custom-search
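For illustration only - this is not the linked post's code - the core of such a redirect can be sketched using DuckDuckGo's "\" operator, which jumps straight to the first result (the function name here is my own invention):

```javascript
// Build an "I'm Feeling Lucky" style URL: DuckDuckGo treats a query
// beginning with a backslash as "go directly to the first result".
function luckyRedirectUrl(query) {
  return "https://duckduckgo.com/?q=" + encodeURIComponent("\\" + query);
}
```

In a Cloudflare Worker you would then read the `q` parameter from the incoming request's URL and return `Response.redirect(luckyRedirectUrl(q), 302)`.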

I think this is a tad reductive, but I will say that we sure let a lot of big companies convince a huge portion of the population to create all of their content on platforms that they have no real control over.

The problem is, many of them didn’t realize this was a problem until recently.

That said, plenty of exciting stuff is happening outside of the walled garden, as long as you know how to find it.

And not only did this happen already over a decade ago, a lot of the current internet users have never known anything else.

We had a discussion with coworkers and somebody mentioned irc. Explaining to younger colleagues what it was - that it was not a product of a company, but that operators ran servers that formed a network, more like infrastructure - felt weird.

Most of the kids in my 3rd graders peer group understand federated infrastructures quite well because of Minecraft.

Perhaps it wasn’t the federated nature of irc that was surprising but the fact that it was irc?

Isn't minecraft more decentralised than federated?

IRC networks usually have multiple servers connected together (historically, often run by a bunch of different people) and I didn't think people self-hosting minecraft servers usually did that?

I think honestly it highlights the power of marketing as much as anything else. In some ways, building an open network is always going to put you at a disadvantage to a company that can throw money at user acquisition and PR teams. That federated networks like Mastodon have seen growth reflects the fact that word of mouth still means something in 2022.

isn't Discord a bit like IRC used to be?

How do I connect to a self hosted discord, and then connect it to my friends self hosted one?

And where do I get the RFC for the protocol so that I can write my own compatible implementation?

IRC isn't a product. It's a standardized protocol sufficiently simple to implement in a day or two.
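To give a sense of how small the protocol is, here is a hedged sketch (my own, not from any particular IRC library) of parsing one RFC 1459 message line - an optional ":prefix", a command, then parameters, with an optional trailing parameter introduced by " :":

```javascript
// Parse a single IRC protocol line into its prefix, command, and params.
function parseIrcLine(line) {
  let prefix = null;
  if (line.startsWith(":")) {
    // ":nick!user@host COMMAND ..." - strip off the prefix.
    const sp = line.indexOf(" ");
    prefix = line.slice(1, sp);
    line = line.slice(sp + 1);
  }
  let params;
  const t = line.indexOf(" :");
  if (t !== -1) {
    // Everything after " :" is one trailing parameter (may contain spaces).
    params = line.slice(0, t).split(" ").concat([line.slice(t + 2)]);
  } else {
    params = line.split(" ");
  }
  return { prefix, command: params[0], params: params.slice(1) };
}
```

The rest of a minimal client is little more than a TCP socket, NICK/USER registration, and replying to PING with PONG.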

I no longer see Google as a neutral "search engine" the way it used to be. Now it's just another company that owns and promotes certain types of content, no different from reddit. For some things Google has the best content, for some things Twitter or Reddit have the best content.

Back in 2000s Google used to be the place for any type of search (IIRC).

Now, I've been conditioned to use it only for specific use cases, mostly for convenience. Some examples include:

1. Anything programming related (searching for man pages, error codes etc) is straightforward. (I do have some UBO filters to exclude SO copycats)

2. Utility stuff like currency conversion, finding time in another city, weather etc.

Where Google has really fallen behind is in multimedia search. Not sure if it's due to copyright issues, but Bing and Yandex provide a way better service in this regard.

Not to mention the "reddit" suffix I need to add to any search that even remotely calls for public opinion. In many cases, Google is just a shortcut to take me to the relevant subreddit.

Programming-related stuff seems to have gotten a lot worse in the last couple of years. Now most terms, at least for common things, return a ton of blogspam, when the official docs or SO are usually the best source.

Another thing seems to be prioritizing current news over past news, which makes searching for old articles you've read quite difficult.

I find one of the best ways to find interesting content on specific subjects using Google is now to start blocking all their top returns (a lot of SEO spam). This is somewhat tedious (lots of -site:seospam.com) and Google doesn't like automated queries. However, a few rounds of this often turns up interesting content down low in the search results. Just don't take what's on offer on page one of search results, basically.

Where it's gotten really bad is on news searches, as Google either now has some kind of shitlist of independent news sites that it won't allow to show up in, for example, site:youtube.com searches - or it's filtered through a guest list. It's hard to tell which strategy they're using, but news is definitely being heavily filtered based on very dubious, propaganda-smelling agendas.
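The manual `-site:` exclusion approach described above is easy to script. Here is a hypothetical helper (the function name and domains are made up for illustration) that appends a personal blocklist to a query using the standard `-site:` operator:

```javascript
// Append -site: operators for each blocked domain to a search query.
function excludeSpam(query, blockedDomains) {
  return [query, ...blockedDomains.map((d) => `-site:${d}`)].join(" ");
}
```

Note that search engines cap query length, so a long blocklist eventually stops fitting; browser-side filtering scales better for big lists.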

You might be interested in using uBlockOrigin and https://letsblock.it/filters/search-results to easily block these domains. In addition to your own domain list, you can use the community-maintained SO / github / npm copycat lists.

Google is an advertising company. It has been for a good while.

Yeah I use you.com and kagi.com. No advertising on either. Less SEO spam too it seems.

I don't believe the WWW internet is dead; there's still millions of webpages being made and published every day. However, the traffic numbers are skewed in favor of the big socials and aggregators; I wouldn't be surprised if the 80/20 rule applies there.

There seems to be a tendency towards video that undercuts the "old internet". I prefer instructions in a text or list format, but that's almost impossible to find for things like changing the headlight bulb on my Traverse.

1. turn the wheel so it is pointed hard in the direction of the bulb you are changing.

2. remove the hex screws from the shroud in the wheel well

3. pull the shroud down, it's pretty flexible plastic.

4. reach up and change the bulb. The wires are a bit short so you might need to get both hands in there. I have big hands and I'm able to do it.

---- There are innumerable videos explaining this process, but very few text directions.

I think this is actually because real, fluent literacy is still rare even in highly developed places. It may be easier for a very literate person to dash off those instructions, but most people are 1000x more comfortable making a little video. Same goes for reading vs. watching the video.

This is my same theory about meetings being universally preferred to asynchronous email, even when literally all the questions someone asks at a meeting have already been answered in my long form email.

Most people, even if they can read, are not really comfortable with it. Doubly so for writing. There used to be no choice to function in society, but increasingly we can use technology to substitute for reading and writing effectively, so people do.

You're probably right, it's just so frustrating.

I think I'm going to start compiling stuff like this in my git repo.

Even something like that flounders on the question "these instructions say to pull down the shroud, what is a shroud?" or "I can't find those hex screws, where are they located?" Repairs are inherently visual, although text with illustrations might work.

But Twitter, Reddit, HN, and most other such places are just websites and can be indexed fine. Same with Wikipedia, which is very much a silo (they don't have regular links in text in the hypertext spirit, but only footnotes).

Facebook and Instagram are more of a walled garden, like Quora, but there is a lot of junk there anyway.

It's sad for the WWW, but I don't really think it is a fundamental problem for search engines. In fact Twitter for example gives a direct pipe to Google. If you tweet something, it is immediately findable. Similar for StackExchange, but there I think the site is so "small" that Google can afford to just continuously index it.

Twitter and Reddit still can be indexed, but they've also become increasingly hard to use without an account. Reddit doesn't let you fully expand threads when you're unlogged. Twitter limits the amount of things you can read and shows a modal. Both of them heavily limit usage on mobile devices without installing an app.

Sure, an account is free but might require giving information you don't want to give. Twitter asks me for a phone number a few minutes after creating an account, even if I don't post anything. Reddit at least lets you skip giving an email.

Sure, there are workarounds such as using lite versions (old Reddit, mobile Twitter), but that's not known to all people coming from a search engine.

It feels as if HN are the only one that's not a partially walled garden yet (and Wikipedia of course).

> Reddit doesn't let you fully expand threads when you're unlogged.

that's what old.reddit.com is for!

old.reddit will be gone soon, it is inevitable. Especially once they go public.

Yup. It's bound to happen. And when it does, Reddit will no longer exist in my eyes.

Agreed. IDK how I feel about Reddit. I've been on it since 2010 when Fark lost its spark. I remember some great times but a lot of it was "junk" content that in the end was very meaningless. I wish I could say I used it to develop my career in tech but that isn't true either; I use specific blogs, books, and tutorial sites to learn instead.

I suppose I mostly view it as a continuous party, yeah it's fun if you attend but after a few hours I wish I was doing something more productive.

Isn't it a bit ironic that a site - or its operator - 'going public' means all the content on said site actually 'goes private'?

Exactly, I mentioned it. But not only it's bound to go away sometime, it's also not trivial to find to anyone who's not an expert Reddit user, unfortunately.

And isn't it great to get a link to Reddit or Twitter, and you click the link, and try to navigate to the comments for context or the answer, and you go to click the link to expand it, and then you get a demand to log in and install their app? Don't talk about walled gardens and not include Reddit or Twitter just because they let you look at one brick before demanding their tax.

This is not true, maybe for a subset of Internet users.

For example you have wikis and forums. Wikis are good for communities that are passionate about a topic and collaborate on building content for their passion. Reddit is a valid alternative to forums, but if the community is older and has members that are technically competent, then they usually have a forum customized for their purpose, and the forum will continue to exist, especially if you want to avoid third-party censorship.

I never ever search for something and found answers on Facebook, sometimes very rare I find something that points to Instagram blogs/posts but never Facebook.

Probably depends on your location and what you search for, so it might be possible that 99% of your Internet consumption is satisfied by 5-10 websites.

I am not so sure...

I think what happened is this: the WWW was everything back in the day. But in the "old days," only 10% of all people were online, the web elite. Then AOL came, and the rest came online slowly but surely. The so-called "mainstream" users were no geeks, just ordinary people. Almost all of them were captured by what you call "big websites".

Now, we see the 100% being dominated by the 90%. That's why "Google results are bad". Bad for us! But maybe (most probably) not for them.

Eternal September was Sep 1993. AOL hit the internet in March 1994.

Netscape didn't launch until December 1994 (and the WWW was nothing before that; I subscribed to a mailing list announcing new sites, and I'd visit most new websites on the internet with the Cello browser in my uni labs most days).

AOL users have been there since the beginning of the WWW.


My recollection is that the AOL event you reference was only making usenet accessible - a point that makes good sense in the context of the eternal September.

But when talking about the WWW, that's a very different story. I think that AOL didn't incorporate a web browser until quite some time after that.

The WWW took off when Netscape shipped in late 1994.

AOL users could use Netscape from the beginning.

This is so incredibly false. I've been working on a project for the last six months, and month over month I've seen a steady increase in usage - tbh much, much higher usage than I expected. Most users find my site via Google or Facebook, however they are looking for content that's not in those silos and have no problem leaving them.

If you have high quality content and you get it indexed properly by Google, users will come.

There are reasons users are not using your website.

1. It's not solving a problem people have.

2. Users can't find it.

Who, in their right mind, searches for search engines? Nobody I know.

If you want users you have to go out and get them (literally pound the pavement and talk to people) or create a LOT more content ironically, so they can find your site on the search engines they are using today.

Based on my observations over the past year, I’m certain that Google and Bing choose not to show us most of the web anymore.

I usually find what I’m looking for. It just takes literally three orders of magnitude longer than it used to for the same kind of stuff. I used to use Google a lot to jog my memory about various things I vaguely remembered. Type a few associative words and snippets, press Enter, done. Google’s useless for that now.

If you’re looking for hot pop shit in trendy publications, things to buy, commercial services to subscribe to - G has you covered. That’s what they do now.

"Want to become rich? Make a search engine which indexes the fresh relevant data from the big siloed websites, and ignores the general dead Internet."

Did that to some degree. Unscatter.com pulls from reddit and twitter to source links.

I found reddit only created an echo chamber bubble of obvious bias and twitter only diluted it a little.

As you describe this, it makes me think about how populations tend to migrate to cities and away from rural areas. There’s even a parallel to white flight in the emerging popularity of the chan/gab fora.

> Want to become rich? Make a search engine which indexes the fresh relevant data from the big siloed websites, and ignores the general dead Internet.

That would be a great service, but it certainly wouldn't make you rich. Where's the money going to come from? Google got rich because they acquired an ads platform (DoubleClick) and an analytics platform (Urchin) and started monetizing the vast amounts of data they had. That was years after Google had established goodwill as the best search engine.

I use beta search engines. On kagi.com and you.com you can preference and filter top sites. There's also no advertising on either. I've just stopped using Google altogether and its improved search so much.

I think you're generalizing your own behavior. I regularly use google to search for topics that cross my mind, and I end up on many websites that are not one of the giants in your list. It's a fun activity. If people stick to the same 10 websites, that's on them. Nothing prevents you from exploring the web.

> Nothing prevents you from exploring the web.

What prevents you from exploring the web is that you can't find anything but the same 10 sites through search engines.

> If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.

Maybe we’re searching for different content, but I disagree. While Google results are not without noise, I think it’s a huge exaggeration to suggest it’s useless. I still regularly find quality results from a quick skim of the first or second page of Google results.

Meanwhile places like Reddit, Twitter, and Hacker News are full of very strong opinions that feel truthy, but are mostly noise. Unless you go in with enough baseline knowledge to filter out 9/10 underinformed comments to dig out the 10% who actually have direct knowledge of the subject and aren’t just parroting some version of something they read from other comments, skipping straight to social sites becomes a source of misinformation.

If you want to be rich, solve search without full-text indexing of sites. PageRank only ever worked because of human curation of webrings. Full-text search made it easier to find content, and opened the door for spammers. The only viable route forward for search will be to replace full-text indexing with human curation, somehow. Solve how to scale that up instead, so that when everyone else realizes we need it for the health of the Web, you're ready.

Doesn't this site, and all of the content it links to, pretty much disprove your theory?

Yes, sure, I often do go to the "top sites" when searching for content, but I still usually start at Google. And, despite all the SEO spam, Google still does a fairly decent job of landing me on, for example, the appropriate Wikipedia page, Stackoverflow post, travel site, etc.

> If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.

High chances you will find a link to an external site over content actually on those big named sites though, right? That tells us the organic web isn't dead, it's just hard to discover/navigate - because of SEO wars, most probably... The problem isn't the lack of content, it's the number of shitty spammy sites standing in your way of the sites you actually want to see. Like a sleazy salesman trying to direct you to the crap laden three wheeled rust bucket when you were heading toward the family sedans.

This MUST be the reason they threw their purchase of Postini in the garbage and my GMAIL INBOX is filled with spam, and my "social" and "promotions" tabs don't filter....

GMAIL is garbage now; I literally use it as my spam email these days. Which sucks, because I have had it for a really long time.

Anecdote on Yahoo! Mail: years ago I wrote to Yahoo support asking when I created my Yahoo Mail account (I'd had it since the '90s, when it was very early available...)

And support told me that they couldn't tell me when my account was created, as that was *proprietary company information*.

So I deleted my Yahoo account. I'm about to download all my Gmail and do the same.

It has been dead for a while now, and the whole of society feels it globally. Things were getting so good, then things became horrible, and whoever cracks the path to the good stuff again will find great riches at the end of that path.

I agree that this seems way too reductive. I was recently reflecting on this and noticed that I constantly run across new blogs and sites whenever trying to learn something. I just don't usually pay much attention to the site name in the way that I remember HN, Reddit, Twitter etc.

So, while I would agree that some aspects of the old internet are dead (like 'small' ~1000 user forums focused on specific topics having largely been replaced by generally inferior subreddits and discord servers), I think it hasn't gotten as bad as you're making it out to be.

Unfortunately, correct. The average Internet user accesses it via a phone, not a desktop, laptop, or even tablet these days. Most of that access is through apps, not a browser. To the extent that a user is looking for a factoid answer and does a search, a Google Knowledge Graph result with a Wikipedia link is probably enough in most cases. If they want a technical question answered, Stack Exchange; a product review, Reddit; nearby restaurants with reviews, Google Maps; etc.

I don't get how TFA shows evidence of the Dead Internet Theory just because their site manages to attract ~zero users.

Just host a <form><textarea><button></form> at an IP address and notice it's just spambots submitting it with backlinks, not actual users. Doesn't mean the internet is dead nor that the indieweb is dead.

It doesn't really show anything other than the only people able to extract value from your creation are the spammers.

I think you're thinking too narrowly, about general chit-chat content. E-commerce, for example, still very much revolves around having your own website. The same goes, I would say, for documentation, e-learning, SaaS, company information, etc. It's a more purposeful web.

What is dead though is the general blog like content and community platforms of old, the era of Wordpress blogs, forums and hobbyist websites is certainly gone.

> Make a search engine which indexes the fresh relevant data from the big siloed websites, and ignores the general dead Internet

I don’t understand why Google themselves don’t do this. LinkedIn v. hiQ demonstrated that they won’t get in trouble for scraping users’ subjective views of data within these silos and then stitching them together to form a cohesive whole. So where’s the effort to do so? It seems like the obvious step.

I think the Dead Internet Theory bit is just a bait to get more comments. It's a bit of a stretch to conclude that the internet is mostly robots just because one website sees mostly robots. This extrapolation would be convincing if that one website is a high ranking website that sees a lot of traffic, but searchmysite.net does not appear to be one of the top websites.

> I agree: the WWW Internet is dead

I've heard this claim a lot, with 0 supporting evidence. Do you have any?

My own experience is that there are thousands of content-rich, high-quality blogs still being written by real humans, because I regularly find and bookmark new ones weekly, without even looking for them, so: please provide evidence for this claim that runs counter to my lived experience.

> If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.

Interesting. When I search for something topical I search those sites using Google, because almost all of those sites (I don't use some, like FB and Insta) have really shitty search.

If they wanted good content, they shouldn't have coerced everyone with good content into turning it into illegible seo spam in order to appear in the top 10 pages. The writing was on the wall the first time a recipe site had to start writing stupid stories about their dog.

I agree with you to an extent. The web is less useful than it used to be. BUT I would say a lot of that usefulness has been diverted into YouTube. There are people who would previously have made sites who are making YouTube videos instead, which of course is owned by Google.

The big siloed websites are just indexes of fresh content though.

With a generic way to place comments on it.

> If you think it's bad for you, imagine what it is like for Google Search! Their entire business is indexing a medium which no longer has any relevancy.

Google was the one (among many) that killed it - so I am not gonna shed any tears.

Google is still pretty good at searching reddit. Maybe reddit can acquire them.

site:reddit is just the best search engine at this point. I still don't like Google though.

I was once on this bandwagon, but I think it was just confirmation bias reflecting the way I used the internet at the time. The non-siloed internet is bigger than the pre-siloed internet ever was.

To a fish the world is made of water and there can't possibly be anything else worthwhile. This is more indicative of how you spend your time online vs reality.

People are doing this already. You just have to include the site name in the search on Google, e.g. reddit. Search on these platforms is often broken.

"I agree: the WWW Internet is dead, that is your problem. No-one visits websites anymore, everyone has moved to the 10 biggest websites and all data is now siloed there."

That is not the Dead Internet Theory. That's just something anyone can see by looking at the world.

The Dead Internet Theory is that the Internet is already an echo chamber custom fed to you by a collection of bots and other such things, and that a lot of the "people" you think you're interacting with are already, today, faked. You're basically in a constructed echo chamber designed only with the interests of the creators of that chamber in mind, using the powerful social cues of homo sapiens effectively against you.

In particular, those silos aren't where people are communicating. Those silos are where you think you're communicating.

It is obviously not entirely true. When we physically meet friends, sometimes topics wander to "Did you see what I posted on Facebook?" So far, we've not caught Facebook actively forging posts from our real-life friends that we physically know. (Though we have caught them failing to disseminate posts in what seems to be a distinctly slanted manner.)

I am also not terribly convinced that the bots have mastered long-form content like you see on HN. I think we've had some try, and while they can sort of pass, they seem to expend so much effort on merely "passing" that they don't have much left over to actually drive the conversation. HN probably still requires real humans to manipulate things.

Where I do seriously wonder about this theory is Twitter. AI has progressed to the point that short-form content like that can be effectively generated and driven in a certain direction. There's been some chatter on the far-out rumor mills about just how bot-infested Twitter may be, how many people think they have thousands of followers, even having interacted with some of them as "people", and in fact may only have dozens of flesh-and-blood humans following them, if that. Stay tuned, this one is developing.

(Note that while this could be "a big plan", it is also a possible outcome of many groups independently coming to the conclusion that a Twitter bot horde could be useful. A few hundred from X trying to nudge you one way, a few hundred from Y trying to nudge you another, another few thousand from Z trying to nudge you yet another, before you know it, the vast vast majority of everyone's "followers" is bots bots bots, and there was no grand plan to produce that result. It just so happens that Twitter's ancient decision to be dedicated to short-form content, with no particular real-world connection to the conversation participants, where everyone is isolated on their own feed (even if that is shared in some ways) made it the first place where this could happen. Things with real-world connections, things where everyone is in the same "area" like an HN conversation, and long-form content will all be three things that will be harder for AIs to manipulate. Twitter is like the agar dish for this sort of thing, by its structure.)

I agree - I don't believe that there is a grand master plan of a conspiratorial or other nature. I think it is simply, as you stated, a co-evolution of independent actors.

> (Though we have caught them failing to disseminate posts in what seems to be a distinctly slanted manner.)

I haven't seen this, but I'd be interested in reading about it, if you have a link!

I think this is a very "consumer focused" take. Yes. A lot of interesting people data is now "locked" behind these aggregators and platforms (and also hard to handle because of GDPR). But most interesting company data is still out there.

On a tangential note, I remember a time when Google had the option to search only for 'discussions'. The results were amazing and accurate as it scoured online forums. Almost every issue I had (I was following the rooting scene closely back then) was quickly resolved. Then suddenly it got removed, for reasons unknown to me. Does anyone know if it's replicable today?

I have a suspicion they removed it because of the amount of spam on those forums. There's tons of abandoned forums that are only occupied by spambots.

There's even pretty convincing looking accounts and messages that turn out to be spam in the end, once they start trying to post links.

I have Akismet on the comment section of the Wordpress front-end of the site I run, it basically said something like 99.99% of attempted comments were spam. I'm sure the same applies to e-mail and the like.

Reminds of those "fake forums" I sometimes see when exhausting google's results. Found a screenshot of the concept here: https://www.reddit.com/r/Scams/comments/jxtr1k/but_it_requir...

Everyone is a spammer according to Akismet. I wouldn't be surprised if 99% of that 99.9999% is false positives.

You could start a website for people you don't like, flag all the comments as spam, and they won't be allowed to post anything elsewhere - forever!

That percentage sounds about right to me. I've seen comments on blogs from ~10-15 years ago that continue to have spam posted to them. The first 2-3 comments will be relevant, but comments 50-100 may have a single relevant comment among them, with a total of anywhere from 300-3000 comments. Older comments link mainly to blogs (*.WordPress.com) and such, while newer comments link to Facebook and Instagram.

Brave Search recently implemented "discussions". From what I've seen it is mostly Reddit results but StackExchange also can appear there.

Sometimes adding "reddit" to a search query produces fantastic results.

I have had some success adding "forum", when looking for trade discussions; eg: controls & automotive. With all the walled silos on the net, this is much less useful with every passing day. On the bright side, I don't have to use -twitter & -facebook, so there's that.

This is great but it seems reddit has done something to mess with their date reporting. When looking for recent posts, I might see a result on Google that says it was posted in the last few days, but on clicking the result will actually be from years ago.

Messed up dates, plus irrelevant topics showing up because there are matched snippets in “more posts from…”.

Might also be Google. I've noticed inaccurate dates that don't appear anywhere for some of my pages. My only theory as to why these were displayed is that Google interpreted a (server-side) randomly generated number in an inline script as a timestamp (but I can't know for sure that's what happened).

I use "site:reddit.com" to fully restrict to that. You can even filter by subreddit that way.

Works well with HN and other sites, too.

Not sure for how much longer this is going to work. Plenty of marketers make fake posts there in grassroots campaigns. Reddit itself is an advertising company.

God I hope they never find out about this site.

I do this all the time

Brave Search does have a discussion search section.

> I didn’t notice at first because the web analytics only shows real users, and the unusual activity could only be seen by looking at the server logs.

Sounds like everyone blocking analytics (Plausible in this case), e.g. myself just now, is lumped in with spam bots.

Of course, analytics blocking can’t meaningfully swing the ~99.99% statistic.

If you self-host Plausible, it's also possible to bundle the analytics package with the website, so that there isn't an "ad-blockable" lone request for the .js file.

Yeah there is. I surf with JS off because of people like you.

Most of the data you can collect with Plausible could just be collected server side instead, it's nothing like Google Analytics.

> Most of the data you can collect with Plausible could just be collected server side instead

Then why not just use that instead?

SPAs & marketing teams are used to snippets

You surf with JS off because of sites abusing their users' data. This is not it.

Collecting data that a user doesn't want collected is abuse. It doesn't matter what you do with it.

Oof. Hard disagree on that one, way too black & white of a position for me in the face of such a broad concept as "data".

> You surf with JS off because of sites abusing their users' data. This is not it.

Wrong. I surf with JS off because of sites that use JS to collect information about me.

If it's available on the server, then sure that might be considered fair game. But using javascript (or any other client-side tool) to do what you should instead do server-side is abusing users (or their data).

Putting analytics inline so it's "not ad-blocked by a url request" is absolutely disrespecting users and a perfect reason to turn off javascript.

> Wrong. I surf with JS off because of sites that use JS to collect information about me.

Plausible doesn't collect information about you, but the site's usage. Do you also object to physical stores putting up cameras?

Here's their own instance, open to public.

> If it's available on the server, then sure that might be considered fair game. But using javascript (or any other client-side tool) to do what you should instead do server-side is abusing users (or their data).

That's quite the affirmation. Is this fact or opinion?

> Plausible doesn't collect information about you, but the site's usage. Do you also object to physical stores putting up cameras?

The difference is that the cameras don't get attached to my physical body, don't have any ability to monitor my actions after I have left the physical store, and can't force me to take any physical item or action.

Javascript, on the other hand, has the capability to become persistent, can monitor my computer's activity outside of your website, and can leave a lot (!) of additional data on my computer without my permission.

> doesn't have any ability to monitor my actions after I have left the presence of the physical store

What a coincidence, Plausible doesn't either.

Also notice how I said "analytics package" and not "tracking" in my comment, because there is no tracking. I mean, unless you're the only visitor from a specific country, there is literally 0 identifying data in Plausible.

Analytics is still unnecessary JS and a bandwidth hog, so it has to go.

I would argue that yes, it can. If the only people who are interested in using the website are those who block analytics - and, given the demographic of a niche search engine, it doesn't sound entirely implausible - then there's no telling how the 99.99% splits into bots and nerds.

Not every "nerd" uses a blocker. I know many who don't. Some want to support the sites they visit; some want to see the web as it is for most people; some say their mental filters are so well developed that ads don't bother them; etc.

You could guesstimate by checking the IP address - blocks assigned to residential users are likely humans, blocks assigned to cloud providers etc. likely bots.
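A rough sketch of that heuristic in Python, assuming you maintain a list of cloud-provider CIDR blocks (the two ranges below are purely illustrative; real lists would come from the providers' published IP range feeds):

```python
import ipaddress

# Hypothetical example ranges standing in for real cloud-provider
# CIDR feeds (AWS, GCP, etc. publish theirs as JSON downloads).
CLOUD_RANGES = [ipaddress.ip_network(c) for c in ("3.0.0.0/9", "34.64.0.0/10")]

def likely_bot(ip: str) -> bool:
    """Flag an IP as a probable bot if it falls inside a cloud block."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in CLOUD_RANGES)

print(likely_bot("3.5.1.2"))      # True  (inside an example cloud block)
print(likely_bot("81.2.69.142"))  # False (not in the example list)
```

It's only a guesstimate, as the reply below points out: residential proxies and botnets put plenty of bots behind residential addresses too.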

This is far from true. Whether via trojans, botnets, "crowd-sourced VPNs", or of course Tor relays, residential IPs are the source of many bots - in fact, the overwhelming majority of spam sources (after you block a few data centers in NL).

Even if there are 99 people blocking analytics for every person who doesn't, the figure is still 99%.
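In round numbers (made up for illustration, not the site's real figures), the arithmetic works out like this:

```python
# Suppose the server logs show 10,000 requests and the analytics
# script only counted 1 human visitor (i.e. "99.99% bots").
total_requests = 10_000
analytics_humans = 1
blockers_per_user = 99   # assumption: 99 blockers per counted user

# Scale the counted humans up by the assumed blocking ratio.
true_humans = analytics_humans * (blockers_per_user + 1)   # 100
bot_share = (total_requests - true_humans) / total_requests
print(f"{bot_share:.0%}")  # 99% - still overwhelmingly bots
```

So even an implausibly high blocking rate barely moves the headline number.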

I'm disappointed that Search My Site isn't seeing many legitimate viewers.

Just wanted you to know that I'm a fan. I love reading people's personal websites, and Search My Site has been great for discoverability. I visit the Newest Pages and Browse Sites pages once or twice a week to check out the new sites being indexed.

I don't know what the answer is to the spam bots, but you do have some real visitors out there. :)

This guy throws multiple reasons/conspiracies out there for why the website is struggling to gain literally any sort of traction: the web is all bots, search engines not promoting competitors, being drowned out by SEO spam. Yet he's failing to see the most obvious reason, the reason nearly all websites don't gain traction...

Because it's a bad website. It provides no value to the user. I put in a few search terms and had no relevant search results back. What use is a search engine that can't find what I'm searching for?

Maybe if that was improved he may see traction.

Hi, "this guy" here:-) If people come to a site but don't come back then it is reasonable to conclude that "it's a bad website", but as the blog entry put it "without any real users in the first place it is hard to gauge whether people like it or not".

Note also that it isn't intended to be a general purpose search engine, but a niche search engine to try and find some of the fun and interesting content, e.g. relating to hobbies and interests, which used to be at the core of the web but which can be difficult to find anywhere nowadays.

How exactly is a "general purpose search engine" different than a "search engine to try and find some of the fun and interesting content"?

The general purpose search engines search the whole internet, and as a result claim that you can search for anything on the whole internet, even going beyond that to answer questions which aren't on the internet as such, e.g. "What is my IP?" and "What time is it?". However, niche search engines only search specific parts of the internet, and only claim to be able to deliver results relating to their specific topic, e.g. you wouldn't ask the search on a car forum what the weather is today.

I am a search guy and I would like you to succeed. But I don't get it. The name of the site is bland and makes me think you are a white-label search service for websites. The homepage says "Open source search engine and search as a service for personal and independent websites," but it offers no reason why I (or anyone) would want to use it. The content it actually searches is random and of no real particular value as far as I can tell. Also, you are trying to avoid spam sites, but once you reach a certain size, all you will see is people submitting spam sites. If you blocked people from submitting, you would never get all the diamonds in the rough you are trying to expose.

You need to find an actual niche that solves a real problem people have and can understand and orient everything you do to tackling that. Then expand from there.

> general purpose search engines search the whole internet, and as a result claim that you can search for anything on the whole internet, even going beyond that to answer questions which aren't on the internet as such, e.g. "What is my IP?"

I think there are two distinct things here:

1) Searching the whole internet

2) Returning results that aren't necessarily from the Internet, but instead are convenience features of the engine

I understand that you're not trying to replicate things like "What's the weather today", but when I want results about <very specific classic car X>, how can you return meaningful results without searching the whole Internet?

Put another way, if you don't search the whole Internet, the results are going to be limited to only the curated list of sources you do search. This can be useful in its own way - i.e. if you are positioning this as "search this list of curated sources", but also means the site will only be as useful as the curation you provide.

For example, I dabble with Software Defined Radio. If I search your site for "rtlsdr", a very popular package, I get three results. Those results are somewhat interesting, but I know there's a whole world of content out there related to rtlsdr that I'm not seeing here.

So adding a bit to what the parent commenter was saying - if I'm using your site to look for my particular niche, and I only see three results when I know there are many more, I'm not likely to continue using your site to search for rtlsdr.

It then leads me to wonder what I can search for, or if there's much utility to searching at all.

Please take these comments in the spirit they are intended - I think a search engine that helps find things on the "old" web, or just helps me cut through all of the SEO optimized crap is a great idea. It's something I want to use. But I can also understand why someone might try a search and move on.

Just an idea, but maybe providing a way for independent creators to submit their site for indexing (or for an interested user like me to submit a site) would help increase your reach.

Ok, but answering questions like "what time is it?" doesn't subtract from the usefulness of a search engine. Seems like you're saying it makes your search engine better somehow because it can't do the above.

Google is demonstrating this nicely now. It's become almost useless, replacing the query I actually typed with something more popular. And when that doesn't happen, the results are likely SEO'd junk. (The latter is not purely Google's fault; it's just that smaller search engines aren't targeted as much.)

Try looking up a phone number (by number) in google for a great example of nothing but spam results.

Well, it's worse than that. The whole schtick is that it's only pure, real content by folksy people like us. The top reason to use it on the about page is:

Indexes only user-submitted sites with a moderation layer on top, for a community-based approach to content curation, rather than indexing the entire internet with all of its spam, "search engine optimisation" and "click-bait" content.

So I tried searching [kotlin] and got 123 results ...

... of which the 9th result is SEO spam! It reads:

PersonalSit.es | Yes we got hot and fresh sites https://personalsit.es/ ... Shandilyahttps://msfjarvis.devTagsandroid, kotlin, rust Go to feed Go to siteradoslawkoziel.plradoslawkoziel.pl ...

That looks like junk to me. How is that possible if what the developer says is true, that it's all verified and pre-moderated?

Thanks for your feedback. It is just the home page which is moderated before indexing (and reviewed annually). When https://personalsit.es/ was listed it looked legitimate, but agreed the results for that site look infected with spam now. I've found at least one other site today where the home page and blog look genuinely legitimate, but which has a complete spam subdomain, quite possibly the victim of a subdomain takeover attack by spammers. I've delisted both. Unfortunately it isn't an easy task trying to defeat a vast army of well funded spammers in your spare time!

As someone who has a few sites that can get user-generated content - I must say that it saddens me that spam stuffing would get the main domain and site delisted - and likely never re-listed.

A couple times a year I get hit with a bunch of spam blogs / user profiles and when I discover and clean them up, I assume that at least google/bing see that the spam-to-real ratio has been fixed and rank it higher again.. but I'm not sure really, especially since google took keywords out of click traffic.

What would be nice is something like the 'site has been hacked page' that I've unfortunately seen a few times for sites - that lets you clean it up and submit a re-check it's clean now button thing.

I've also suggested that google make it so you have to vouch for links which would expose people using the spam stuffing techniques.. kind of the opposite of the disavow tool - but they never read any of my disavow submissions.

Sucks to get spammed, fight spam, and then be penalized for it more ways than one.

One of my older buddypress/wpmu sites I recently turned off blog creation for users because it's just so tiring fighting the spammers - which are only doing what they do because google - meh.

Your problem is that SEOs are under no obligation to be truthful with you, and will likely pull bait-and-switches as far as making accounts goes if it ever seems like your site will catch on.

Note, I nearly spat out my food the first time I was at lunch and someone was talking about SEO a few tables away... oh, a decade or so ago now. It's sad it's gotten this bad.

I second this. Don't get me wrong, I applaud the concept and the effort, but this implementation isn't quite there.

I searched for "document management system comparison" since I am currently in the process of selecting one for our legal team at work. Some on-the-ground reports from real users would be hugely valuable. But this is the classic example of where Google utterly fails; document management is a $100 billion industry and there are absolutely no search results which are not SEO, marketing copy, or astroturfed listicles with nearly zero value.

Unfortunately, this website returned even less relevant results. Not a single result pertained to document management at all; instead it returned random matches on words like "system" and "management."

Whoever solves this problem could definitely unseat Google as the go-to search engine for most people. So it's a big prize. But it's also a super hard socio-technical problem, requiring incredibly sophisticated and powerful tech in a highly adversarial environment. However, regrettably, it looks like this attempt hasn't even got the basic search tech down.

Typed this search into Kagi and got:

- This results from an old site https://www.scanstore.com/Scanning_Software/Document_Managem... not sure if still relevant

- A bunch of discussions from reddit and other forums (probably best lead)

- One research paper https://arxiv.org/pdf/1403.3131.pdf

- Listicles grouped together so you can skip them

- The noncommercial filter gave a few more good results, but it seems like there is not much 'good' content written on this topic

I would definitely not call all Kagi results fantastic, but it does seem to be better than Google. We are trying hard to solve the problem of the nonsense on the web (Kagi founder here).

Thanks for building Kagi! Have been enjoying the experience of it this past month

Got any beta slots to share?

Is a comparison of document management systems something you actually expect to find written by humans? I wouldn't write such an article, and I don't know who would.

The only people who seem to be writing these types of comparison articles are spammers.

I typed this reply without checking, but I checked now, and yeah -- if you google "document management system comparison", you get ads for document management systems, and search engine spam. That's hardly helpful.

2nd result I got from that exact search is an article from techradar:


Do you consider that search engine spam?

Yeah, that's affiliate marketing dressed up as a review. They're getting a kickback for several of the links in the review.

The deal on DocuWare is perhaps the most obvious, but the Abbyy link also runs through an affiliate marketing redirect service.

I guess the use-case just isn't that popular. It's a good website if you want to learn what some devs are up to, but barely anyone cares about that. Most people use search engines to find answers to their questions and Search My Site just doesn't work like that.

Searching for Astral Codex Ten, a popular, well-written, non-spammy blog which I would expect is indexed...

Returns only results in which _other_ bloggers are referencing ACX. Consider me as one of the datapoints that arrived from HN and likely won't be back, I'm afraid.

Thanks for your feedback. The idea was for people to submit sites they like, and search sites other people have liked. I've submitted Astral Codex Ten, and that site is now indexed for the benefit of others.

I just searched Kagi, Google, and DDG for "Astral Codex Ten" and it was the first result on each.

Ironically the Kagi search engine is not in the first few results in Google when you search Kagi (at least in Thailand)

And when I did make it to the site, it looks like I have to sign up to use it? I'm not sure putting a locked gate in front of a search engine in 2022 makes sense but okay

The whole concept of Kagi is to be a paid service (it's still in beta and free for now, AFAIK), so you pay money instead of having ads or the search engine selling your data, and use the service that best suits your purposes and philosophy.

The concept in 2022 sounds doomed to fail on many fronts. A service that claims to offer privacy but requires identifying payment information. A required email signup so followup sales emails can happen when the service is ready.

DDG was popular on here until they censored certain websites. Does this search service censor?

Sounds like they are trying to tackle privacy but in reality users of this service will have less privacy.

I searched "best dress shoes reddit" as a test, and just got a random list of websites that had the word "shoes" on the page somewhere, including a Dinosaur Comic from 2008.

So... yeah. Won't exactly be my first choice of search engine in the future.

Looking at the blog (https://blog.searchmysite.net/posts/milestone-1000th-site-in...) I think very little of the internet is in this search engine.

It's difficult to gauge the quality of the engine itself at this point with so little content in it.

What I can say is that even remotely presenting the system as a general purpose internet search engine like the UI from https://searchmysite.net/ does is going to give people the wrong idea and make them think the system is bad. To start with I'd suggest adding the number of sites indexed to the main search page.

I also think that the https://searchmysite.net/ portal will likely never be a destination. I'd suggest trying to promote it differently: offer a search service for OG internet sites, where they opt in because they want a search widget they can embed on their site, with a filter to search just that site or all OG sites. Having website categories would also help, so people could search across tech blogs, or aquarium, or bowling sites, etc. Basically the old web ring idea, but powered by search instead of just browsing a list.

Since there is a chicken-and-egg scenario, what you really need are people who think Google sucks, are invested in a niche, and want to build a search ring out. The "only sites submitted by verified site owners" restriction needs to go; you want good curation, but this is just too restrictive. I also think "downranks results containing adverts" is too restrictive - switch that to "downranks results containing excessive adverts and SEO spam".

It doesn't index sites like Reddit, so, not too surprising Reddit wasn't in the result.

Is there a reason for that? I am not super knowledgeable on search engines.

Whether or not his site is meeting his goals is his business.

I find this a really interesting post, because I'm also dealing with excessive bot traffic (it's generally about half of my overall), and specifically how to salvage analytics data when there's so much noise. Seeing what other people are doing to combat it helps me, regardless of whether you might think of them as successful or not.

I found a few pro-terrorism sites here. I don't think it's the OP's purpose, but he's being duped by the few users who do look for sites like this, where they can add a "curated link" to their ISIS or Hezbollah or Hamas site with a slick facade.

Thanks for your feedback. If you can drop me a note I'll remove those sites - it is against the Terms of Use at https://searchmysite.net/pages/terms/ (not that spammers, terrorists, etc. care about complying with a Terms of Use). I think legitimate looking home pages as a front to other non-legitimate content is a genuine problem this model doesn't solve (also noting that some of those home pages may even be genuinely legitimate but have been hacked e.g. via a subdomain takeover).

I'm getting lots of `No results found for query = xxx.`

That sounds like a feature actually, being honest about no hits.
