However, they are also systematically feeding you their footprint lists. I imagine you could put together a footprint blacklist pretty quickly, and just stop returning results for any obvious spam queries like those containing "powered by wordpress".
It's not a very elegant solution I'll admit. It won't stop the bots from trying, and you may have to circle back periodically to add new footprints as they surface. But it's a potentially quick and easy way to stop rewarding their efforts, and the blackhat world is pretty used to burning out their resources so hopefully they will figure out it's a dead end and move on.
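A footprint blacklist like that can be a trivial substring check applied before a query ever touches the index. A minimal sketch, assuming a hand-maintained list (the footprints and function name here are illustrative, not any particular engine's implementation):

```python
# Reject queries containing known SEO "footprint" strings before
# they reach the index. The list is illustrative, not exhaustive;
# new footprints get appended as they surface in the logs.
FOOTPRINTS = [
    "powered by wordpress",
    "powered by vbulletin",
    "leave a reply",
    "add comment",
]

def is_spam_query(query: str) -> bool:
    """Return True if the query matches any known SEO footprint."""
    q = query.lower()
    return any(fp in q for fp in FOOTPRINTS)

# Spam query gets dropped; a normal query passes through.
assert is_spam_query('intitle:"Powered by WordPress" casino')
assert not is_spam_query("electronic music box")
```

A substring scan over a few hundred footprints is cheap enough to run on every request; if the list grows large, a compiled regex alternation or Aho-Corasick matcher does the same job in one pass.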
I'm not sure about this. At least with my search engine, it doesn't really seem to matter what response they get; I don't even think they look at the responses. They keep hammering away with tens of thousands of queries per day even though they've seen nothing but HTTP status 403 since last October or so.
My best guess is they're going after search engines in general in case they forward queries to google, in order to manipulate their typeahead suggestions.
Not since it left the larval stage and became "pay for play", no.
Oh, well, those taxpayer-funded years were nice for those of us who were around.
The reasoning I vaguely remember reading was that the internet required government subsidy to exist - at first directly, then in the form of universities, and the bust was a sign that it couldn't exist without one.
I don't remember how prevalent the view was at the time though. Obviously it turned out to be wrong.
There's a rub here, in that people expect to search without being logged in. But if you don't log users in, anyone can come calling, including bots. That pushes you to do things like hire a third party to filter the traffic, which then affects your users by rerouting their requests through someone else just to get rid of the visits you don't want from bots.
And round and round.
Simple authentication to the site with tokens might solve the problem. If an IP comes calling without authentication, or payment, then hang the connection.
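The token gate described above can be sketched in a few lines. This is a toy illustration, not a production design; the token store and status codes are assumptions (a real system would issue tokens on signup or payment and would rate-limit rather than keep a plain set in memory):

```python
# Toy token gate: unauthenticated callers get nothing back.
# VALID_TOKENS stands in for whatever signup/payment flow
# issues credentials in a real deployment.
VALID_TOKENS = {"abc123", "def456"}

def handle_request(token):
    """Return an HTTP status: 200 for a known token, 403 otherwise.

    Instead of 403 you could also just hold the socket open and
    never respond ("hang the connection"), which wastes the bot's
    resources rather than yours.
    """
    if token in VALID_TOKENS:
        return 200
    return 403

assert handle_request("abc123") == 200
assert handle_request(None) == 403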
The cost is the slow enclosure of the internet by a handful of giant companies, and, once attestation is universal, anyone without a locked-down device being locked out of most of the internet unless they provide endless free labour.
Give it a try and see what happens.
People said greylisting against email spam wouldn't work, since spammers would just resend. It has worked for 20 years. To get your IP off the DNSBL NiX Spam, you just have to follow a link. People said spammers would automate that process. It never happened in 19 years. Sometimes spammers are just lazy.
The newest captcha services produce a prediction score, not even a verification screen, and you can feed polluting data to the bots you're certain exist.
For OP, I think simply not returning results at all is a more practical measure because it removes the reward completely. Captchas and bot detection keep the reward in play, while taking away the results entirely makes the entire pursuit futile.
If anything, it might be best to return a page that explicitly states "Sorry, this search engine no longer supports SEO footprint search queries."
*edit for typo & wording
That’s why you'll see fluff pieces (aka, paid content) from online publications like Forbes for the better funded entities.
Another approach is to reach out to site operators with offers of writing content, or asking them to link to your site’s content from their existing content.
It’s expensive and/or incredibly time consuming to get back links that matter.
It is when your base assumption is that you won't hire outside of engineering. There are more bored teenagers with phones than people creating quality content, so I'm not sure why you wouldn't just brute force checks against bad actors.
The world is getting more and more desperate for a better search engine. The day may come when people are willing to pay for better results.
For example, searching for "electronic music box" as /u/ajnin suggested, with the top 100K web sites removed from the results, filters out the following:
> These 23 sites were removed from your results:
> alibaba.com (1 result removed)
> aliexpress.com (1 result removed)
> allaboutcircuits.com (1 result removed)
> amazon.com (2 results removed)
> apple.com (1 result removed)
> bestreviews.com (1 result removed)
> ebay.com (1 result removed)
> etsy.com (2 results removed)
> facebook.com (1 result removed)
> instructables.com (2 results removed)
> lightinthebox.com (2 results removed)
> lumberjocks.com (1 result removed)
> mapquest.com (1 result removed)
> reverb.com (1 result removed)
> twitter.com (1 result removed)
> wikipedia.org (1 result removed)
> yelp.com (1 result removed)
> youtube.com (2 results removed)
And the top result ends up being https://midiguy.com/.
It also seems fairly customisable, like I can search and include all results but choose to remove ecommerce, or sites with live chat (weird filter, but I like it).
I agree: the WWW Internet is dead, that is your problem. No-one visits websites anymore, everyone has moved to the 10 biggest websites and all data is now siloed there.
If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc.
The general Internet is dead: it's just legacy content and spam.
If you think it's bad for you, imagine what it is like for Google Search! Their entire business is indexing a medium which no longer has any relevancy. People complain that Google no longer delivers good results. But what can Google do? The "good content" is no longer available for them to index.
Want to become rich? Make a search engine which indexes the fresh relevant data from the big siloed websites, and ignores the general dead Internet.
There's still a lot of organic human-made content still out there, possibly more than ever, it's just not able to compete with the SEO industry that completely displaces it from Google and social media.
> If I want to search for something topical and relevant, I go to Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord etc. The general Internet is dead: it's just legacy content and spam.
The "general" Internet is not dead. Though if you just want to participate in just Facebook, Twitter, Reddit, HackerNews, Instagram, Google Maps, Discord you might well think that.
Users of Marginalia (author above), Mojeek (disclosure: CEO) and others are well aware that there are riches of organic, human-made content, both from years back and new. Yes, there is a lot of noise too, which Google has a bigger (SEO) struggle to compete against. But good and different content is still available.
To find good content using search, you need to use "search" engines which enable discovery, as Google used to do. I stress the "search" because the emphasis of Google, Bing, and thus their syndicates is increasingly on being "answer" engines.
For some things it is. Good luck getting a non-sponsored, non-SEO-gamed review of a kitchen appliance or a particular vacation mode such as a cruise. It's flabbergasting.
Most times I just stick "inurl:reddit.com" in my search and try to get discussion threads about the thing I'm researching, but even that's getting filled up with shills.
A real litter of inconsistency between unrelated external organizations and varying markets and skill sets.
"Which" looks to be the exception, but that is a paid-for service.
It's a sad state of affairs.
Isn’t Amazon commonly used for most affiliate links, or has that changed in recent years? Amazon isn’t the cheapest all the time anymore, nor is its customer support the best anymore.
There seems to be a big disconnect between a typical user's attention span and the length of a post.
People that know me and don't meet me regularly might know the URL of my web site and might care to look at it once per year and check if there is something new. Usually pictures and tales from holidays. Covid made those holidays less memorable so I didn't make any update since fall 2019. People that meet me regularly don't need that website, I'm telling them the tales first hand and showing them the pictures without being obnoxious. I guess that this website is a target for your search engine except it's not in English and your search engine seems to want English search phrases.
I don't have anything of value to share on a public chat like Twitter and I don't have an ego to pretend I do. I also don't use Facebook anymore. I go there once per year to like the messages that wish me happy birthday. I think it's polite to do so. All my media production is on WhatsApp or Telegram in group chats with people I know in real life.
If I really cared about producing content for the world I'd probably be using Twitter, Medium or the fad of the year and they'd take care of my SEO (do they?) or I'd be trying to score points on StackOverflow.
To recap: I never intended to compete on SEO. I'm really OK that my website is only for friends and spreads by word of mouth. It probably never did, I bet it's been on a flatline since I created it 20+ years ago.
It even happens to proprietary silos if they are too open. Look at how many bots and spammers infest social media. Propaganda and disinformation can also be considered a form of spam.
I realize this sounds cynical but don’t shoot the messenger. It’s just something I’ve learned watching the Internet evolve since the mid-1990s. Spam eats everything it can.
IMHO the future is enclaves and invite only communities. The Internet is a dark forest.
I love the maxim and philosophy of eternal refreshment.
Seems like the problem is more akin to having nuclear waste dumped into our rivers though.
Oh, don’t worry, the Fediverse will never catch on.
An idea I've had for a few years is making a social-network based index engine. The only pages that get indexed are pages that users themselves mark as worth indexing, and the only pages returned in your results are pages that were marked for indexing by people you added to your circles, or the people in their circles, or the people in those circles, etc (probably up to 5 or 6 degrees of separation).
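Restricting results to pages endorsed within N degrees of separation is essentially a breadth-first walk over the follow graph. A rough sketch under assumed data structures (`follows` and `marked` are hypothetical maps; a real engine would precompute this rather than walk the graph per query):

```python
from collections import deque

def visible_pages(me, follows, marked, max_degrees=5):
    """Union of pages marked for indexing by anyone within
    max_degrees hops of `me` in the social graph.

    follows: user -> set of users in their circles
    marked:  user -> set of URLs they marked as worth indexing
    """
    seen = {me}
    frontier = deque([(me, 0)])
    pages = set(marked.get(me, ()))
    while frontier:
        user, depth = frontier.popleft()
        if depth == max_degrees:
            continue  # don't expand beyond the separation limit
        for friend in follows.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                pages |= marked.get(friend, set())
                frontier.append((friend, depth + 1))
    return pages
```

For example, with `follows = {"a": {"b"}, "b": {"c"}}` and `marked = {"b": {"x.com"}, "c": {"y.com"}}`, user `a` sees only `x.com` at one degree and both pages at two degrees. At query time the engine would intersect this visibility set with the pages matching the query terms.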
Not directed at you specifically but this is the actual problem.
We already had a good system for these things. Delicious, blogrolls, RSS, the folksonomy ..
So basically everyone on earth?
Google started punishing keyword spam, then it started punishing black-hat comment spam. Even Youtube backtracked on the "videos have to be 10 minutes to rank".
I wish they would do the same for carefully manicured SEO content farms too, as those sites are causing a harm worse than keyword-spammer sites did.
And a lot of people (myself often times included) are looking for a quick answer. A good enough answer. So good enough, SEO optimized is being surfaced. The result of an optimization war on both sides combined with the inevitable monetary interests.
I don't have a solution. Sadly.
The black-hat kind is definitely made to extract money from ads. But those are easy to avoid for web veterans IMO. And I also feel that Google is doing its part, even though it's costing them money from those sweet ads!
But the white-hat kind, also known as content marketing, is made to let legit companies save money. Instead of paying for Google Advertisement, they get traffic by means of organic content. Think "Michelin Guide" or "Red Bull". Which is a jolly fine idea and responsible for a lot of good stuff, but the problem is that this has been taken to extremes, and now the web is littered with low-effort content made by freelancer writers getting peanuts.
I would personally prefer if those freelance writers were doing 10 interesting Red Bull articles per month rather than 500 rehashes of content from other websites. But who am I to judge.
In the news industry things are also very similar.
I jest a little bit, but your comment genuinely makes me wonder if Marginalia++ is search results - Google - Marginalia
Large "authentic" search engines can exist to serve the rest of the web, those personal blogs and other small communities. Those sites have a natural tendency to not be trying to turn everything into a revenue stream, so if that was the prerequisite for an engine, it would be a perfect match and naturally dissuade marketing types.
When you have a 'real' community you're talking about real people with real salaries and desires, add in that you tend to develop a real trust between members. Think of this as fertilized soil. You can grow crops in it, but weed seeds will eventually land and try to take over it.
HackerNews is a good example of this; it takes a healthy amount of moderation to keep things on topic, where things like politics get pared pretty ruthlessly. If for a minute Dang gave in and found ways to additionally monetize the forums (something that would be profitable for a while, at least), things would start down a bad path.
While professionally I need to help (smaller, local) clients to reach their audiences I become more and more weary.
It is like walking through a supermarket with industrialized fast convenience food shouting in bright colors and advertising while ultimately not nourishing me like slow, real food could.
I am still looking for this digital slow food movement.
Please read it, and if you enjoy it please suggest it to friends.
From my perspective, we onboarded a lot (if not most) people to the internet after 2007 (the explosion of social media). People sticking to big sites really speaks to an inability to explore the larger internet and a lack of knowing why you would even want to.
Most (99%) people use the Internet most (99%) of the time to see or hear what other people are up to. The big sites are where all the other people are. QED.
(This comment falls into that space)
Really? We make our living running a small web based publication; around 40k readers a month. I know of many other sites like this. Google, and other search engines, depends on niche websites to provide quality search results. Without sites like ours, the internet would truly be dead, and search would be mostly useless. Our "traffic sources" come from a mix of Facebook, Search, Reddit, etc, in addition to our many loyal readers.
Others in our niche are producing blog spam, which looks nearly identical to people who aren't experts in the field, but we have real experts, fact checkers, etc, as part of our production process. This is a big problem: These low quality websites get similar rankings to our own, which does make it much harder for people to get quality information via search. (Hence the general shift towards trusting social recommendations, such as from Reddit.)
In short, the WWW is alive and well, it's just buried under a bunch of #$#$%.
40k/mo is a pretty good number for an independent website. As a word of warning though, relying on social media reach is a dangerous game, as there is anecdotal evidence that tweets with outbound links don't get as many impressions as those that link to in-site content, like another Twitter post.
As for Facebook, well, there's a good comic from The Oatmeal (enormously popular on FB back in 2010) that talks about what happened in the long run:
I'm happy to have experienced the free internet. Truly a jewel of humanity.
However, the good news is that we will never stop reinventing everything. The real value of the old internet was showing us what is possible.
Of equal value is that it showed us what not to do.
We have 30 years of documentation for research on exactly what a successful intra-planetary network needs to be immune to. A successful future network must build in resistance to all forms of human psychopathology from the ground up.
Sorry I don't see how ML can help here. It seems like another thing to pin hopes of repairing an already too broken system on.
"We cannot solve our problems with the same thinking we used when we
created them." -- Albert Einstein
"A new scientific truth does not triumph by convincing its opponents
and making them see the light, but rather because its opponents
eventually die, and a new generation grows up that is familiar with
it." -- Max Planck
We are the dying generation, my friend. We built it. They came. It didn't work. Surely if ML can do anything, it's telling us that we need to tear down the old system completely and start again, don't you think? Adding sticking tape won't help.
edit: turning a grunt into an honest question
That's exactly the world in which the Internet grew. There were multiple segregated national and sub-national networks, and the Internet was built as a means to interconnect them. After some time, the Internet protocols ended up being used even within these networks, but that was not originally the case. And even today, there are still things like the AS (Autonomous System) concept which permeates the core of the top-level Internet routing protocols, which still reflect the Internet being a "network of networks" instead of a single unified network.
That's why I'm not too worried about the Internet fragmenting; we've seen this before. What happens next is gateways between the networks, and there are already shades of these in the VPN providers which allow one to connect as if one were located in a different network, often from a different country.
I think it already has.
the Great Firewall of China is the classic example, but I think the trend started in the west with the Right to be forgotten/right to erasure in Europe, and subsequent HTTP Status 451 Unavailable For Legal Reasons. GDPR just further cemented the split between Europe and the rest, and the new DMA & DSA regulation in the European Union finally makes it clear. The writing is of course on the wall, so countries like India or Australia aren't too far behind. Places like California also have their own "right to be forgotten", and I'm sure the US will not be left behind for too long before we see regulation further splitting their internet from the RoW. And I don't think the RoW will hold off much longer till it also splits into multiple big blocks. It's the start of the new "nationalist" internet, and I'm sure we'll all be poorer because of it.
I'm saying this as someone who once wrote a decentralized P2P mesh for instant messaging. I was inspired by the HK protests going on ~2014 after hearing that they were using Bluetooth chat apps. Luckily Matrix, Telegram, Signal, etc. mostly solved the problem. Still, I don't think any amount of mesh networking would turn back the tide of Hong Kong now.
There don't need to be. You publicly gruesomely execute the first 100 or so you catch, and the practice of running a mesh node on your cell phone will fall so far out of fashion that the network breaks.
Societal shortcomings cannot be fixed via tech alone. If you can't build a society resilient to authoritarianism in the first place, tech will not help you. It can be used to increase resilience, but that's far from fixing the problem by itself.
The mesh network should be made out of common hardware in order to be viable. I'd suggest phones but those devices are owned before they've even left the factory.
"Star Wars Episode 10: The one that's not fiction."
On the other hand, if you want to demonstrate that you have anti-satellite capability it's probably a better idea to shoot down a corporate satellite than a military one. The Soviet Union shot down Korean Air Lines Flight 007 and it didn't start a war, after all.
Cryptocurrencies might be a problem in this plan, and satellite internet access itself might become a currency (since, unlike cryptocurrencies, this one has an almost intrinsic value and provides its own infrastructure that's very hard to block, whereas cryptos rely on external sources of Internet access).
It also depends. Drugs have consistently won the war on drugs despite being a physical product that needs a local supply chain, and despite various anti-money-laundering and banking/finance regulations that should make it hard to fund the operation. Satellite internet access is likely to be even easier, as it doesn't rely on a physical product (if we reach this stage, there are going to be clandestine satellite terminals built locally, so blocking shipments of the real thing isn't going to cut it).
The only solution, apart from North Korea-levels of isolation (and even then, NK has the advantage of their population being isolated & indoctrinated since birth, something most other countries won't achieve even if they turned authoritarian overnight) would be detection followed by harsh punishment, but this has the downside of not only wasting the disclosure of detection capabilities (that are useful to the military) but also outsourcing the R&D of evading such capabilities into the open which enemies will no doubt pick up on too and use against you in a conflict.
So it can (at least eventually) function without 'regular internet', although I would still be hesitant to call it a viable infrastructure choice if the goal is to get around government control, simply from how much SpaceX have to appease the government to do anything space related.
The real dark-net is Facebook. Everything that goes in there never comes out again and is basically invisible to the world, unless you join Facebook yourself.
My own prime example of that used to be pinterest: it seems to be a 100% sink in the directed graph of internet links. But since Applebaum stated this, instagram (also facebook of course) is trying hard to push pinterest off that particular throne.
It is starting to pay dividends. Instead of weird stuff thrown up by google when I type in something, I get the "oh yeah, that was the page" from a short list of bookmarks shown to match the words.
The problem is, many of them didn’t realize this was a problem until recently.
That said, plenty of exciting stuff is happening outside of the walled garden, as long as you know how to find it.
We had a discussion with coworkers and somebody mentioned irc. Explaining to younger colleagues what it was and that it was not a product of a company, but operators had servers that formed a network, and it was more like infrastructure. Felt weird.
Perhaps it wasn’t the federated nature of irc that was surprising but the fact that it was irc?
IRC networks usually have multiple servers connected together (historically, often run by a bunch of different people) and I didn't think people self-hosting minecraft servers usually did that?
And where do I get the RFC for the protocol so that I can write my own compatible implementation?
IRC isn't a product. It's a standardized protocol sufficiently simple to implement in a day or two.
Now, I've been conditioned to use it only for specific use cases, mostly for convenience. Some examples include:
1. Anything programming related (searching for man pages, error codes etc) is straightforward. (I do have some UBO filters to exclude SO copycats)
2. Utility stuff like currency conversion, finding time in another city, weather etc.
Where Google has really fallen behind is in multimedia search. Not sure if it's due to copyright issues or not but Bing and Yandex provide way better service in this regard.
Not to mention the "reddit" suffix I need to add to any search that even remotely calls for public opinion. In many cases, Google is just a shortcut to take me to the relevant subreddit.
Where it's gotten really bad is on news searches, as Google either now has some kind of shitlist of independent news sites that it won't allow to show up in, for example, site:youtube.com searches, or it's filtered through a guest list. It's hard to tell which strategy they're using, but news is definitely being heavily filtered based on very dubious, propaganda-smelling agendas.
1. turn the wheel so it is pointed hard in the direction of the bulb you are changing.
2. remove the hex screws from the shroud in the wheel well
3. pull the shroud down, it's pretty flexible plastic.
4. reach up and change the bulb. The wires are a bit short so you might need to get both hands in there. I have big hands and I'm able to do it.
There are innumerable videos explaining this process, but very few text directions.
This is my same theory about meetings being universally preferred to asynchronous email, even when literally all the questions someone asks at a meeting have already been answered in my long form email.
Most people, even if they can read, are not really comfortable with it. Doubly so for writing. There used to be no choice to function in society, but increasingly we can use technology to substitute for reading and writing effectively, so people do.
I think I'm going to start compiling stuff like this in my git repo.
Facebook and Instagram are more of a walled garden, like Quora, but there is a lot of junk there anyway.
It's sad for the WWW, but I don't really think it is a fundamental problem for search engines. In fact Twitter for example gives a direct pipe to Google. If you tweet something, it is immediately findable. Similar for StackExchange, but there I think the site is so "small" that Google can afford to just continuously index it.
Sure, an account is free, but it might require giving information you don't want to give. Twitter asks me for a phone number a few minutes after creating an account, even if I don't post anything. Reddit at least lets you skip giving an email.
Sure, there are workarounds such as using lite versions (old Reddit, mobile Twitter), but that's not known to all people coming from a search engine.
It feels as if HN is the only one that's not a partially walled garden yet (and Wikipedia, of course).
that's what old.reddit.com is for!
I suppose I mostly view it as a continuous party, yeah it's fun if you attend but after a few hours I wish I was doing something more productive.
For example, you have wikis and forums. Wikis are good for communities that are passionate about a topic and collaborate on building content for their passion. Reddit is a valid alternative to forums, but if the community is older and has members that are technically competent, then they usually have the forum customized for their purpose, and the forum will continue to exist, especially if you want to avoid some third-party censorship.
I never ever search for something and find answers on Facebook; sometimes, very rarely, I find something that points to Instagram blogs/posts, but never Facebook.
Probably depends on your location and what you search for, so it might be possible that 99% of your Internet consumption is satisfied by 5-10 websites.
I think what happened is this: the WWW was everything back in the days. But in the "old days," only 10% of all people were online, the web elite. Then, AOL came, and the rest came online slowly but surely. The so-called "mainstream" people were no geeks, and these people were "just" ordinary people. Almost all were captured by what you call "big websites".
Now, we see the 100% being dominated by the 90%. That's why "Google results are bad". Bad for us! Maybe (most probably) not for them.
Netscape didn't launch until December 1994 (and the WWW was nothing before that. I subscribed to a mailing list announcing newly released sites, and most days I'd visit most new websites on the internet with the Cello browser in my uni labs).
AOL users have been there since the beginning of the WWW.
But when talking about the WWW, that's a very different story. I think that AOL didn't incorporate a web browser until quite some time after that.
AOL users could use Netscape from the beginning.
If you have high quality content and you get it indexed properly by Google, users will come.
There are reasons users are not using your website.
1. It's not solving a problem people have.
2. Users can't find it.
Who, in their right mind, searches for search engines? Nobody I know.
If you want users you have to go out and get them (literally pound the pavement and talk to people) or create a LOT more content ironically, so they can find your site on the search engines they are using today.
I usually find what I’m looking for. It just takes literally three orders of magnitude longer than it used to for the same kind of stuff. I used to use Google a lot to jog my memory about various things I vaguely remembered. Type a few associative words and snippets, press Enter, done. Google’s useless for that now.
If you’re looking for hot pop shit in trendy publications, things to buy, commercial services to subscribe to - G has you covered. That’s what they do now.
Did that to some degree. Unscatter.com pulls from reddit and twitter to source links.
I found reddit only created an echo chamber bubble of obvious bias and twitter only diluted it a little.
That would be a great service, but it certainly wouldn't make you rich. Where's the money going to come from? Google got rich because they acquired an ads platform (DoubleClick) and an analytics platform (Urchin) and started monetizing the vast amounts of data they had. That was years after Google had established goodwill as the best search engine.
What prevents you from exploring the web is you can't find but the same 10 sites through search engines.
Maybe we’re searching for different content, but I disagree. While Google results are not without noise, I think it’s a huge exaggeration to suggest it’s useless. I still regularly find quality results from a quick skim of the first or second page of Google results.
Meanwhile places like Reddit, Twitter, and Hacker News are full of very strong opinions that feel truthy, but are mostly noise. Unless you go in with enough baseline knowledge to filter out 9/10 underinformed comments to dig out the 10% who actually have direct knowledge of the subject and aren’t just parroting some version of something they read from other comments, skipping straight to social sites becomes a source of misinformation.
Yes, sure, I often do go to the "top sites" when searching for content, but I still usually start at Google. And, despite all the SEO spam, Google still does a fairly decent job of landing me on, for example, the appropriate Wikipedia page, Stackoverflow post, travel site, etc.
High chances you will find a link to an external site over content actually on those big named sites though, right? That tells us the organic web isn't dead, it's just hard to discover/navigate - because of SEO wars, most probably...
The problem isn't the lack of content, it's the number of shitty spammy sites standing in your way of the sites you actually want to see. Like a sleazy salesman trying to direct you to the crap laden three wheeled rust bucket when you were heading toward the family sedans.
Gmail is garbage now; I literally use it as my spam email these days. Which sucks, because I've had it for a really long time.
Anecdote on Yahoo! Mail: years ago I wrote to Yahoo support asking when I created my Yahoo Mail account (I'd had it from the 90s, when it was very early available...)
And support told me that they couldn't tell me when my account was created, as that was *proprietary company information*.
So I deleted my Yahoo account. I'm about to download all my Gmail and do the same.
So, while I would agree that some aspects of the old internet are dead (like 'small' ~1000 user forums focused on specific topics having largely been replaced by generally inferior subreddits and discord servers), I think it hasn't gotten as bad as you're making it out to be.
Just host a <form><textarea><button></form> at an IP address and notice it's just spambots submitting it with backlinks, not actual users. Doesn't mean the internet is dead nor that the indieweb is dead.
It doesn't really show anything other than the only people able to extract value from your creation are the spammers.
What is dead though is the general blog like content and community platforms of old, the era of Wordpress blogs, forums and hobbyist websites is certainly gone.
I don’t understand why Google themselves don’t do this. LinkedIn v. hiQ demonstrated that they won’t get in trouble for scraping users’ subjective views of data within these silos and then stitching them together to form a cohesive whole. So where’s the effort to do so? It seems like the obvious step.
I've heard this claim a lot, with 0 supporting evidence. Do you have any?
My own experience is that there are thousands of content-rich, high-quality blogs still being written by real humans, because I regularly find and bookmark new ones weekly, without even looking for them, so: please provide evidence for this claim that runs counter to my lived experience.
Interesting. When I search for something topical, I search those sites using Google, because (almost) all of those sites (I don't use some, like FB and Insta) have really shitty search.
With a generic way to place comments on it.
Google was the one (among many) that killed it - so I am not gonna shed any tears.
That is not the Dead Internet Theory. That's just something anyone can see by looking at the world.
The Dead Internet Theory is that the Internet is already an echo chamber custom fed to you by a collection of bots and other such things, and that a lot of the "people" you think you're interacting with are already, today, faked. You're basically in a constructed echo chamber designed only with the interests of the creators of that chamber in mind, using the powerful social cues of homo sapiens effectively against you.
In particular, those silos aren't where people are communicating. Those silos are where you think you're communicating.
It is obviously not entirely true. When we physically meet friends, sometimes topics wander to "Did you see what I posted on Facebook?" So far, we've not caught Facebook actively forging posts from our real-life friends that we physically know. (Though we have caught them failing to disseminate posts in what seems to be a distinctly slanted manner.)
I am also not terribly convinced that the bots have mastered long-form content like you see on HN. I think we've had some try, and while they can sort of pass, they seem to expend so much effort on merely "passing" that they don't have much left over to actually drive the conversation. HN probably still requires real humans to manipulate things.
Where I do seriously wonder about this theory is Twitter. AI has progressed to the point that short-form content like that can be effectively generated and driven in a certain direction. There's been some chatter on the far-out rumor mills about just how bot-infested Twitter may be, how many people think they have thousands of followers, even having interacted with some of them as "people", and in fact may only have dozens of flesh-and-blood humans following them, if that. Stay tuned, this one is developing.
(Note that while this could be "a big plan", it is also a possible outcome of many groups independently coming to the conclusion that a Twitter bot horde could be useful. A few hundred from X trying to nudge you one way, a few hundred from Y trying to nudge you another, another few thousand from Z trying to nudge you yet another, before you know it, the vast vast majority of everyone's "followers" is bots bots bots, and there was no grand plan to produce that result. It just so happens that Twitter's ancient decision to be dedicated to short-form content, with no particular real-world connection to the conversation participants, where everyone is isolated on their own feed (even if that is shared in some ways) made it the first place where this could happen. Things with real-world connections, things where everyone is in the same "area" like an HN conversation, and long-form content will all be three things that will be harder for AIs to manipulate. Twitter is like the agar dish for this sort of thing, by its structure.)
I haven't seen this, but I'd be interested in reading about it, if you have a link!
There's even pretty convincing looking accounts and messages that turn out to be spam in the end, once they start trying to post links.
I have Akismet on the comment section of the Wordpress front-end of the site I run, it basically said something like 99.99% of attempted comments were spam. I'm sure the same applies to e-mail and the like.
You could start a website for people you don't like, flag all their comments as spam, and they won't be allowed to post anything elsewhere - forever!
Works well with HN and other sites, too.
God I hope they never find out about this site.
Sounds like everyone blocking analytics (Plausible in this case), e.g. myself just now, is lumped in with spam bots.
Of course, analytics blocking can’t meaningfully swing the ~99.99% statistic.
Then why not just use that instead?
You surf with JS off because of sites abusing their users' data. This is not it.
Wrong. I surf with JS off because of sites that use JS to collect information about me.
Plausible doesn't collect information about you, but the site's usage. Do you also object to physical stores putting up cameras?
Here's their own instance, open to public.
That's quite the affirmation. Is this fact or opinion?
The difference is that the cameras don't get attached to my physical body, don't have any ability to monitor my actions after I have left the physical store, and can't force me to take any physical item or action.
What a coincidence, Plausible doesn't either.
Just wanted you to know that I'm a fan. I love reading people's personal websites, and Search My Site has been great for discoverability. I visit the Newest Pages and Browse Sites pages once or twice a week to check out the new sites being indexed.
I don't know what the answer is to the spam bots, but you do have some real visitors out there. :)
Because it's a bad website. It provides no value to the user. I put in a few search terms and got no relevant search results back. What use is a search engine that can't find what I'm searching for?
Maybe if that were improved he might see some traction.
Note also that it isn't intended to be a general purpose search engine, but a niche search engine to try and find some of the fun and interesting content, e.g. relating to hobbies and interests, which used to be at the core of the web but which can be difficult to find anywhere nowadays.
You need to find an actual niche that solves a real problem people have and can understand and orient everything you do to tackling that. Then expand from there.
I think there are two distinct things here:
1) Searching the whole internet
2) Returning results that aren't necessarily from the Internet, but instead are convenience features of the engine
I understand that you're not trying to replicate things like "What's the weather today", but when I want results about <very specific classic car X>, how can you return meaningful results without searching the whole Internet?
Put another way, if you don't search the whole Internet, the results are going to be limited to only the curated list of sources you do search. This can be useful in its own way - i.e. if you are positioning this as "search this list of curated sources", but also means the site will only be as useful as the curation you provide.
For example, I dabble with Software Defined Radio. If I search your site for "rtlsdr", a very popular package, I get three results. Those results are somewhat interesting, but I know there's a whole world of content out there related to rtlsdr that I'm not seeing here.
So adding a bit to what the parent commenter was saying - if I'm using your site to look for my particular niche, and I only see three results when I know there are many more, I'm not likely to continue using your site to search for rtlsdr.
It then leads me to wonder what I can search for, or if there's much utility to searching at all.
Please take these comments in the spirit they are intended - I think a search engine that helps find things on the "old" web, or just helps me cut through all of the SEO optimized crap is a great idea. It's something I want to use. But I can also understand why someone might try a search and move on.
Just an idea, but maybe providing a way for independent creators to submit their site for indexing (or for an interested user like me to submit a site) would help increase your reach.
Try looking up a phone number (by number) in google for a great example of nothing but spam results.
Indexes only user-submitted sites with a moderation layer on top, for a community-based approach to content curation, rather than indexing the entire internet with all of its spam, "search engine optimisation" and "click-bait" content.
So I tried searching [kotlin] and got 123 results ...
... of which the 9th result is SEO spam! It reads:
PersonalSit.es | Yes we got hot and fresh sites
... Shandilyahttps://msfjarvis.devTagsandroid, kotlin, rust Go to feed Go to siteradoslawkoziel.plradoslawkoziel.pl ...
That looks like junk to me. How is that possible if what the developer says is true, that it's all verified and pre-moderated?
A couple of times a year I get hit with a bunch of spam blogs / user profiles, and when I discover and clean them up, I assume that Google/Bing at least see that the spam-to-real ratio has been fixed and rank the site higher again... but I'm not sure, really, especially since Google took keywords out of click traffic.
What would be nice is something like the 'site has been hacked' page that I've unfortunately seen a few times for sites - something that lets you clean it up and submit a 'please re-check, it's clean now' request.
I've also suggested that Google make it so you have to vouch for links, which would expose people using spam-stuffing techniques... kind of the opposite of the disavow tool - but they never read any of my disavow submissions.
It sucks to get spammed, fight the spam, and then be penalized for it in more ways than one.
On one of my older BuddyPress/WPMU sites I recently turned off blog creation for users because it's just so tiring fighting the spammers - who are only doing what they do because of Google - meh.
Note, I nearly spit out my food the first time I was at lunch and someone was talking about SEO a few tables away... oh, a decade or so ago now. It's sad it's gotten this bad.
I searched for "document management system comparison", since I am currently in the process of selecting one for our legal team at work. Some on-the-ground reports from real users would be hugely valuable. But this is the classic example of where Google utterly fails; document management is a $100 billion industry, and there are absolutely no search results which are not SEO, marketing copy, or astroturfed listicles with nearly zero value.
Unfortunately, this website returned even less relevant results. Not a single result pertained to document management at all; instead it returned random matches on words like "system" and "management."
Whoever solves this problem could definitely unseat Google as the go-to search engine for most people. So it's a big prize. But it's also a super hard socio-technical problem, requiring incredibly sophisticated and powerful tech in a highly adversarial environment. However, regrettably, it looks like this attempt hasn't even got the basic search tech down.
- Results from an old site, https://www.scanstore.com/Scanning_Software/Document_Managem... - not sure if it's still relevant
- A bunch of discussions from reddit and other forums (probably best lead)
- One research paper https://arxiv.org/pdf/1403.3131.pdf
- Listicles, grouped together so you can skip them
- The noncommercial filter gave a few more good results, but it seems like there is not much 'good' content written on this topic
I would definitely not call all Kagi results fantastic, but it does seem to be better than Google. We are trying hard to solve the problem of the nonsense on the web (Kagi founder here).
The only people who seem to be writing these types of comparison articles are spammers.
I typed this reply without checking, but I checked now, and yeah -- if you google "document management system comparison", you get ads for document management systems, and search engine spam. That's hardly helpful.
Do you consider that search engine spam?
The deal on DocuWare is perhaps the most obvious, but the Abbyy link also runs through an affiliate marketing redirect service.
Returns only results in which _other_ bloggers are referencing ACX. Consider me as one of the datapoints that arrived from HN and likely won't be back, I'm afraid.
And when I did make it to the site, it looks like I have to sign up to use it? I'm not sure putting a locked gate in front of a search engine in 2022 makes sense but okay
DDG was popular on here until they censored certain websites. Does this search service censor?
Sounds like they are trying to tackle privacy but in reality users of this service will have less privacy.
So... yeah. Won't exactly be my first choice of search engine in the future.
It's difficult to gauge the quality of the engine itself at this point, with so little content in it.
What I can say is that even remotely presenting the system as a general purpose internet search engine like the UI from https://searchmysite.net/ does is going to give people the wrong idea and make them think the system is bad. To start with I'd suggest adding the number of sites indexed to the main search page.
I also think that the https://searchmysite.net/ portal will likely never be a destination. I'd suggest trying to promote it differently: offer a search service for OG internet sites, where they opt in to the service because they want a search widget they can embed on their site, with a filter to search just that site or all OG sites. Having website categories would also help, so people could search across tech blogs, or aquarium, or bowling sites, etc. Basically the old webring idea, but powered by search instead of just browsing a list.
Since there is a chicken-and-egg scenario, what you really need are people who think Google sucks, are invested in a niche, and want to build a search ring out. The "only sites submitted by verified site owners" restriction needs to go; you want good curation, but this is just too restrictive. I also think "downranks results containing adverts" is too restrictive - switch that to "downranks results containing excessive adverts and SEO spam".
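A "downrank rather than exclude" policy like the one suggested above could be as simple as multiplying each result's base relevance score by a penalty that grows with the number of adverts detected on the page. A toy sketch, with the threshold and penalty factor invented purely for illustration:

```python
def adjusted_score(relevance: float, ad_count: int,
                   free_ads: int = 2, penalty: float = 0.8) -> float:
    """Downrank pages with excessive adverts instead of excluding them.

    The first `free_ads` adverts carry no penalty; each additional
    advert multiplies the relevance score by `penalty`.
    """
    excess = max(0, ad_count - free_ads)
    return relevance * (penalty ** excess)

# A page with two ads keeps its full score; a page with six ads is
# pushed down the rankings but still appears in the results.
print(adjusted_score(1.0, 2))  # 1.0
print(adjusted_score(1.0, 6))  # 0.4096
```

The point of the multiplicative form is that a modestly ad-supported hobby site is barely touched, while an ad-stuffed SEO page sinks quickly without being censored outright.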
I find this a really interesting post, because I'm also dealing with excessive bot traffic (it's generally about half of my overall), and specifically how to salvage analytics data when there's so much noise. Seeing what other people are doing to combat it helps me, regardless of whether you might think of them as successful or not.