The most talented person in the world (matt.sh)
260 points by keyboardJones 73 days ago | 151 comments



Has anyone built a search engine that uses LLMs to pre-grade every page with metrics such as:

- Commercial bias (content compared to the source, which it learns about)

- Insincere motives

- Bloat (how many words it takes to say how little, to penalize SEO bloat)

I would assume that using LLMs, we can get a pretty good idea of what is SEO bloat and who the bad actors are by this point, and just penalize those results.


Did exactly that, hacked together a small pipeline in Nushell with simonw/llm. Even with GPT-4 Turbo and direct guidelines on common spam heuristics, it seems to perform worse than a Bayesian BoW. It describes endless trails of questions with no relation to the title as informative, the presence of affiliate links often gets lost in the mass of tokens in their tracking parameters, relevance is consistently near 0.8 even if there's no relationship between title and content, and as for insincerity, our favorite BS generator cannot for the life of it correctly recognize its own creations.

Your ideas for metrics are good, but LLMs seem to be quite terrible at any of these. A simple set of heuristics and maybe a tiny language model for named entity detection and "vibe checking" would serve you much better.

Also, a lot of the worst offenders seem to use the same Q&A +- conclusion structure, which Viktor from marginalia.nu wrote a simple heuristic for, which I recall he said did wonders for pruning it. Solving SEO spam is easy when you aren't the one being optimized against. What's left is scaling and information retrieval.
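
As a rough illustration of the "simple set of heuristics" idea, here is a minimal sketch that flags affiliate-style links by parsing the query string instead of handing the raw URL to a model. The parameter names in `AFFILIATE_PARAMS` are hypothetical examples, not a vetted list, and `affiliate_link_ratio` is a made-up helper name.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical, illustrative parameter names; a real list would need curation.
AFFILIATE_PARAMS = {"tag", "aff_id", "affiliate", "ref"}


def affiliate_link_ratio(urls):
    """Return the fraction of links carrying an affiliate-style query parameter."""
    if not urls:
        return 0.0
    flagged = 0
    for url in urls:
        # parse_qs maps each parameter name to a list of values.
        params = parse_qs(urlparse(url).query, keep_blank_values=True)
        names = {name.lower() for name in params}
        if AFFILIATE_PARAMS & names:
            flagged += 1
    return flagged / len(urls)
```

A page whose outbound links are mostly flagged this way could be penalized directly, with no tokens wasted on tracking parameters.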


What makes you think that LLMs will be better at combating spam than they are at creating it? There’s no universal rule that innovations in AI will go hand in hand with innovations in detecting AI, yet I feel like I see people talking all the time like that’s the case.

As of right now, LLMs are prolific but unreliable, which makes them extremely well suited for generating spam, but unsuited to detecting it without a large number of false positives and negatives.


I honestly don’t mind so much if it detects “ai-generated” vs. “human-generated”—the key thing to detect is whether the page is full of irrelevant SEO junk. GP suggested several attributes that ought to be detectable. Even if we don't eliminate the AI content, but succeed in promoting "better" content, maybe it's an improvement?


Classification is easier than generation


Sadly, this didn't seem to track for LLMs. Even OpenAI gave up on trying to detect its own outputs.

> As of July 20, 2023, the AI classifier is no longer available due to its low rate of accuracy. We are working to incorporate feedback and are currently researching more effective provenance techniques for text, and have made a commitment to develop and deploy mechanisms that enable users to understand if audio or visual content is AI-generated.


You're not classifying just text, you're classifying entire web pages. If it's easy to tell for a human that it's SEO spam, it's easy for a classifier.


Not with standard LLM models. Generating garbage is easier; deciding that it’s garbage much more difficult. This is an inherent problem with “prompt optimizers” like DSPy.


Yeah no I’m gonna need more than that chief. Everything I know about LLMs says the opposite.


Look up Generative Adversarial Networks, that's their basic principle.

You're not classifying just text, you're classifying entire web pages.


I'd say the whole point of GAN is that generation is cheaper than classification, therefore an effective brute-force way of making a good classifier is to generate an infinite supply of examples with a-priori known classification, and pit it against a classifier.


Yeah, but the theoretical endpoint of training a GAN is that the generator gets so good that the discriminator has to resort to guesses and become unable to tell with any sort of accuracy as to whether the example it is shown is real or generated.


I don't think that ever happens in training, at least in the image domain. The classifier can always find some subtle clue.


Sounds a bit more like you want to do something reranking-ish. Ideally, you would train a retrieval system to retrieve the most relevant pages, which would in turn have been trained on a dataset not very different from MS-Marco. This would get you a small set of documents you want to rerank.

For reranking to be able to detect commercial bias, insincerity or bloat you could use LLMs, but IIRC you train a multiclass classifier for each and then combine the probabilities from each head (calibrate too?) into a score and use it in your ranking as weights?
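
A minimal sketch of that combination step, assuming each candidate already has a retrieval relevance score and one (ideally calibrated) probability per classifier head. The `Candidate` fields and the values in `WEIGHTS` are hypothetical; the multiplicative penalty is just one plausible way to fold the heads into a single score.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    url: str
    relevance: float     # retrieval score, assumed to lie in [0, 1]
    p_commercial: float  # probability of commercial bias
    p_insincere: float   # probability of insincere motives
    p_bloat: float       # probability of SEO bloat


# Hypothetical weights; in practice these would be tuned on labeled data.
WEIGHTS = {"p_commercial": 0.5, "p_insincere": 0.3, "p_bloat": 0.2}


def rerank_score(candidate):
    """Scale relevance down by the weighted sum of the penalty probabilities."""
    penalty = sum(w * getattr(candidate, name) for name, w in WEIGHTS.items())
    return candidate.relevance * (1.0 - penalty)


def rerank(candidates):
    """Return candidates ordered from best to worst combined score."""
    return sorted(candidates, key=rerank_score, reverse=True)
```

Because the weights sum to 1, the penalty stays in [0, 1] and a page flagged by every head ends up with a score near zero regardless of its raw relevance.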


I think Kagi should add a feature where I can subscribe to the domain blocks of someone else. Every time I see a spam blog, I can easily prevent the domain from polluting future results. But it'd be great if I could also use my friends lists to rank their blocked domains to the end of my search results.
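
A minimal sketch of how that could work on the client side, assuming results arrive as an ordered list of URLs. Nothing here reflects an actual Kagi API; `apply_blocklists` and its arguments are hypothetical names.

```python
from urllib.parse import urlparse


def apply_blocklists(results, my_blocked, friends_blocked):
    """Drop my blocked domains, then push friends' blocked domains to the end.

    results: URLs in their original ranked order.
    my_blocked: set of domains I blocked myself.
    friends_blocked: list of sets, one set of blocked domains per friend.
    """
    def domain(url):
        return urlparse(url).hostname or ""

    def votes(url):
        # Number of friends who blocked this result's domain.
        return sum(domain(url) in blocked for blocked in friends_blocked)

    kept = [url for url in results if domain(url) not in my_blocked]
    # sorted() is stable, so results with equal votes keep their original order.
    return sorted(kept, key=votes)
```

Sorting by vote count instead of filtering means a friend's block only demotes a result, so one overzealous friend cannot hide a site from me entirely.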


This is definitely the best option. Community search blocklists, like uBlock Origin


We don't need to. We have LLMs now


I still would like to search and not get the dog crap that is Google's Internet, in addition to using LLMs myself.


I know it's the reverse of what you're suggesting, but I've been using Kagi for a while for the same reasons, with the results you'd expect.

Their search is much, much cleaner, that's for sure. But what made me stick (and mostly ditch DDG, which btw is also much cleaner than google), was how well their fastgpt works as a search tool.

Summaries are very good, it includes recent events and news, it goes through pdfs, always cites its sources. Does hallucinate for me sometimes, but I always can tell it's incorrect by the response itself. Plus it usually gives me links that easily clear up the confusion. Especially in the IT field I can tell I'm fed the source of my trouble (like the initial GitHub issue that introduced broken functionality, or the source pdf of a study) and less discussion around it.

Their search has some neat features as well, as you can simply choose to see fewer/no results from a given site straight from the list of search results.


Yes - and the other killer feature in Kagi is being able to uprank your own choice of sites, and set contexts for this upranking. That to me is the killer thing about it


I agree. My issue with LLMs is that it isn’t clear when it is hallucinating versus when it isn’t.


It’s always hallucinating… but sometimes the hallucinations are of things that actually happened.


Non-LLM efforts include alternative and even self-hosted search engine indexes. I'm also curious if Brave Search's concept of "goggles" could work out, where you can write your own indexing logic and share it with others.


We don't need self-hosted alternatives (just like the computer market doesn't need Linux tinkerers) as much as we need a real formidable competitor to dethrone Google and create proper website incentives so the Internet stops sucking.

We need SEO companies to realize they will go out of business if they continue to generate crap filler content for their clients.

We know Google won't do it. We need influence. We need effective results.

Self-hosted is cute but ineffective at best, selfish at worst.


Hmm, I reflexively disagree, but I disagree even after considering it from other different positions.

Need is a strong term and it is likely doing a lot of work in that claim. We, technically, do not need much bar food, water and shelter. In that sense, the post is absolutely correct. Realistically, there is zero need for self-hosting, or linux, or anything much really.

But even if we get past the need claim, why is setting up a duopoly a preferred option to people actually running their own preferred setups ( and maybe even learning something in the process )?

More importantly, why on earth would I want yet another giant corporation in charge of my digital life?

<< create proper website incentives so the Internet stops sucking.

Interestingly, linux tinkerers and self-hosters are likely one of the few reasons web does not suck AS much as it otherwise could have. In a sense, the incentives are there.

<< We need SEO companies to realize they will go out of business if they continue to generate crap filler content for their clients.

Business is just that.. a business. I don't expect a mosquito not to bite me. If anything, the whole premise is wrong. SEO companies defer to Google's wishes, and Google heard the pleas of the ad industry and declared war on adblockers.

<< We need influence. We need effective results.

Zero disagreement.

<< Self-hosted is cute but ineffective at best, selfish at worst.

I dunno. It might be selfish, but I am ok with that if that is the worst. I would worry about it being ineffective, but.. I like my various instances. They serve a purpose to me.


Switch to Kagi, IMHO it's worlds better than Google. It's worth the price, I get a lot more out of it than my Netflix subscription.


I'm sorry but LLMs notoriously provide inaccurate and otherwise awful information.

Case in point, I asked an LLM what the last non cellular windows mobile classic PDA was. (I knew the answer) And it routinely got it wrong.

This is what LLMs should be useful for. If I can't audit the results or verify how it came to the conclusion, the answer is useless.

LLMs are toys at this juncture.


That's a great example of the kind of prompt that I intuitively know wouldn't return a useful result... but I can't explain WHY I intuitively know that. Which is deeply frustrating.


Here are the results for that exact search in Google: https://www.google.com/search?q=what+was+the+last+non+cellul...

vs 4o: The last non-cellular Windows Mobile Classic PDA was the HP iPAQ 110 Classic. Released in 2008, it ran on Windows Mobile 6.0 Classic and featured a 624-MHz Marvell PXA310 processor, 256MB of Flash ROM, and a 3.5-inch screen with a 240 x 320 resolution. It included Wi-Fi and Bluetooth connectivity but lacked cellular capabilities, making it one of the final models in the declining PDA market as smartphones began to dominate. Cited sources: List of Windows Mobile devices - Wikipedia (https://en.wikipedia.org/wiki/List_of_Windows_Mobile_devices); The End of the Classic Version of Windows Mobile (AKA the PDA) (http://www.pocketpcfaq.com/commentary/end_of_WM_Classic.htm); HP iPAQ 110 Classic - PDA Like It's 1999 - WiFi Planet (https://wi-fiplanet.com/hp-ipaq-110-classic/).

I know which version I'd prefer.


What about the HP iPAQ 112 then? Wasn't the HP iPAQ 110 Classic released in November 2007?

Also why are you entering questions into the Google search prompt?


Because the commenter implied that LLMs get it wrong, as if search engines get it right. The reality is that both take digging, but the LLM response gets you to the right answer more quickly.


Which LLMs stay current and link their sources? If I have to wait on the LLM to search for me, I'd rather just do the search myself. What if I want to search for something the LLM can't show me? Or something I want to watch or interact with that isn't an LLM?


GPT-4o serves up results that look like Perplexity's, except the sources are actually relevant links.

All of that to say: solved problem? Assuming you're ok with chat as the UI


I'm saying use LLMs as preprocessors to form predetermined rankings by URI which weigh into the search. Let the crawler pipe into an LLM.


I get it - use the LLM score as an additional metric for page rank, not ask the LLM for search results


I always assumed LLMs would result in more bloated content (stupid interns) but I think you’re right that it’ll lead to more efficient prioritization (hooray interns)


I’m working on that problem right now actually, but not in a direct way. We create ad hoc content for people based on product reviews etc and have had to invest a fair amount of time filtering content and removing sponsored/shill and low quality generated content that reduces the utility of our… generated content. Ours is at least dynamically rendered so others won’t have to sift through it some day.


Literally just assess how much ad revenue Google stands to earn from the site. If it's a Google top hit (SEO) AND earns Google a disproportionate potential amount of revenue, then there's your grade.


All of those sites are labeled as "A Venture 4th Media Company" which has such wondrous important content all run by the same "writer" on:

    https://mtbinsider.com/author/jodie-chiffey/
    https://turfandtill.com/author/jodie-chiffey/
    https://www.betterwander.com/author/jodie-chiffey/
    https://artofgrill.com/author/jodie-chiffey/
    https://theathleticfoot.com/author/jodie-chiffey/
    https://altprotein.com/team-members/jodie-chiffey/
    https://digitalguyde.com/us/
    https://total3dprinting.org/author/jodie-chiffey/
Each of which is registered by NameCheap, who can never seem to kick their addiction to bottom feeders.

Each of which is behind Cloudflare, the official latrine of planet dysentery.

disclosure: I use too.


What's interesting is I keep my old blog up even though it is basically never updated. I receive 3 or 4 emails a week from people like Jodie Chiffey saying they would like me to consider hosting a "guest article" by them about some topic or other. The content and SEO mill is absolutely gigantic.


Sometimes I think about whether a nation-state may have an incentive to solve this problem. They could adopt a strict anti-SEO spam policy for their ccTLD and run a publicity campaign telling people about it, so they'd filter search results to that ccTLD, with the purpose of improving the country's reputation (by associating it with high-quality search results). That sounds kind of far-fetched, but it's hard to think of anything else. Maybe you could direct the blame at CAs too.

For now, at least we can still put "before:2020" in our search queries.


How bad does it have to get before it stops working, I mean like phone-a-friend, Britannica, yellow pages bad? The quicker and more deeply we get there, the more effort and traction gets behind some breakout solution. Maybe it's just pure cynicism: burn it all.

Remember email spam? It got so bad that we fixed it. I mean email has its issues, and how, but spam isn't one of them. I built a spam juggernaut in my day (got bills don't I :)) and I feel like I contributed a tiny bit to our almost-spamless latter days.

Progress! The world is on the march.


You raise a good point. E-mail spam is mostly fixed, and, if I remember correctly, a relatively easy solution works well enough: ask users to mark spam as spam, and e-mails that look similar[1] to spam get banned.

Which got me thinking: surely that would work for websites too? Why not let people report low-quality sites directly to the search engine? Kagi lets you ban sites you find unhelpful, but it doesn't downgrade websites similar to the one you've banned. I shan't speak of Google practices as we all know them by now.

[1] https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
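
For reference, the technique in [1] fits in a few lines. This is a minimal sketch with whitespace tokenization and add-one smoothing, assuming user reports arrive as (text, label) pairs; `NaiveBayesFilter` is a hypothetical name, and a real filter would need far better features than raw words.

```python
import math
from collections import Counter


class NaiveBayesFilter:
    """Tiny word-level Naive Bayes classifier with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        """Record one user report; label must be 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Log of P(spam | text) / P(ham | text) under the naive assumption."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        totals = {k: sum(c.values()) for k, c in self.word_counts.items()}
        # Smoothed class prior.
        score = math.log((self.doc_counts["spam"] + 1) / (self.doc_counts["ham"] + 1))
        for word in text.lower().split():
            # The extra +1 in the denominator reserves mass for unseen words.
            p_spam = (self.word_counts["spam"][word] + 1) / (totals["spam"] + len(vocab) + 1)
            p_ham = (self.word_counts["ham"][word] + 1) / (totals["ham"] + len(vocab) + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0
```

For websites, the "text" could be the page body and the labels could come from users banning or upranking a site, which is the reporting loop described above.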


I find personal email to be almost useless for anything other than mailing lists and online shopping messages. I don't consider it fixed; we already left it to burn.


I use email constantly but I can’t tell you the last time I saw actual spam in my inbox

Sure, there are occasional unwanted marketing emails but those are easily dispatched with unsubscribe and/or inbox filters


Depends on your provider. My GMail is fine, but my much older Hotmail is a spam cesspool. It's so bad I barely check it anymore the last couple of years, despite having been able to keep that email address since something like 1996. As I had "exposed" that to some email farms I probably got into some really bad spammers lists which Microsoft seems unable to stop.


I gave up on my Hotmail completely quite recently for the same reason. It seemed to just stop filtering altogether. I'd receive somewhere on the order of 50 spam emails to my main inbox a day. If anything, it filtered more email I did want than those I didn't. I haven't had the same issue with gmail.


Spam evolved. The filters catch most of the egregious kind, but our mailboxes are still flooded with the rest: marketing of legitimate companies, which you likely interacted with at some point, however briefly, and often mixed into transactional messages. This makes up the majority of most people's inboxes; I bet it's the case for you too.


That's not spam. Watch what you subscribe for and periodically unsubscribe, done.


I don't subscribe for anything of this kind, and if something somehow gets into my inbox, I report it as spam and then unsubscribe.

All that is very much spam. I'm never interested in marketing material from vendors. Doesn't stop them from padding every single e-mail with it.


Then it is spam, my bad. I get next to nothing somehow maybe once a week at most...


I often google Reddit posts to find products or services. Posts with a lot of comments usually have better solutions. I also check out YouTube reviews, especially ones with few dislikes, using addons like "Return YouTube Dislike." These methods are easy and pretty reliable.

But it means the platform will soon be flooded with bot-generated spam.


Once I heard of ReplyGuy, I knew it was the beginning of the end for forums like Reddit


It's actually one of the perks of centralized platforms like Reddit: if they want to, they have the technology and the resources to investigate bad actors like that. I don't mean just finding and blocking their IPs, but untangling their ownership, corporate structure, tooling, and so on. Hiring actual PIs if necessary. It's the bread-and-butter for many spam and abuse teams at Big Tech.

That said, just because they can doesn't mean they will. It's possible that they've grown complacent and underfunded these capabilities (instead relying on community moderators to weed out bad actors). Or it's possible that they're too focused on the short term to see the existential risks. If I recall correctly, they couldn't resist the temptation of selling user content for LLM training, same as Stack Overflow.

But in an internet overrun by spam LLMs, the future is curated, walled-garden communities, and Reddit could be the basis for that.


Anything which diminishes their usage numbers won't look good for investors, so one should assume no action will be taken by the company which diminishes their usage numbers.


> The only future of the internet is, sadly, proof-of-person and proof-of-residence on every public network interaction.

I really hope we do not give up the internet's freedom as they suggest (and I doubt this would solve the spam problem).


As much as I hate the idea, I'm convinced it would be a godsend for 99% of the population.

Just imagine when you buy your SIM card the phone shop asks you: Do you want to limit incoming calls to people who you either called before or who have ever had a permanent residence in your country? 99% of spam and scam calls blocked, just like that.

And just imagine how hilarious it would be if all those Nigerian prince emails had a note that says "actually, the sender of this email has never been to Nigeria"


99% of the population is also happy with carrying a spyware device everywhere and allowing Google to know everything you do in order to serve you better ads.

I am not saying this does not have upsides, but it would be a nightmare to have it imposed on you.


I don't think this is actually true. They are not happy about it, they just can't care because they have no real choice to do it differently.


> actually, the sender of this email has never been to Nigeria

Conversely, what if all those emails actually originate from Nigeria? Would it make them more legit?


They actually indeed originate from Nigeria almost all of the time.


That's what I'm hinting at :)


Personally I'd start with the phone system guaranteeing that the person calling me owns the number that shows up on the call display. That's how people already assume it works and would cut 90% of the spammiest of spam calls while still letting folks with family in Nigeria choose to answer those calls.


Agree, as much as we hate AI-generated content, what’s to say that the content isn’t helpful to some in some instances?

Also, as long as search engines do their job, engagement on high quality pieces will always justify having a human write art


Can you think of a single example how AI generated blogspam can be useful? I cannot. It's all just noise and bullshit.

> as long as search engines do their job

lolololololol, funniest thing i've read all day.


I think it’s the only way forward. I wrote a blog post about this recently.

I think abandoning anonymity is the only way forward and I think it’ll be glorious.


Share the blog post?

I remember in my 2005 high school political science class proposing fake legislation to require ID to go on the internet.

For the past 20 years I’ve shuddered at how awful an idea that would be and how naive I was as a kid.

But now I am curious how all of the externalities would play out.



I read said blog post. The future you dream of would require, at minimum, vastly improved privacy laws and freedom of identity.


I agree, but I just don’t see an alternative future if we want to maintain an open platform for the global exchange of ideas.

I also don’t feel like this changes the privacy debate. I’m not naive enough to think Google and 1000 other data brokers don’t know exactly who I am.

Having a verifiably unique digital identity wouldn’t change that fact. I’d still want to control when and how data associated with my identity is monetized.

Maybe regulators can force Google et al to digitally watermark data before they sell it, so if I find it elsewhere without my permission I can trace it back to Google and seek remuneration.

I think the only way we get the internet back is to hold everyone accountable that wants to participate.


What I mean, mostly, is that I shouldn't get doxxed for the crime of being queer on the internet the way some people I won't name like to do. There isn't really law surrounding this yet (in the US.)


I’m amazed by medical websites… The top 5-10 sites all seem to have identical wording for common conditions.

In the age of copyright enforcement and DMCA antics, I don’t understand how this continues year after year.


A family member works for a company that writes and publishes a mandatory handbook for GPs in Australia.

One day the WHO found them and wholesale copied large chunks of their content verbatim into their online resources without asking or informing. They only found out because THEY went looking on the WHO website for the latest information on something.

At that point I started wondering just how much of big companies' work and content is just plagiarism and gratuitous theft from reputable but less visible or popular sources (also highly dependent on country, language, etc). And that was before I discovered hbomberguy on YouTube.


I worked in a hospital system for a time—those documents are often integrated through a medical system where the hospital pays a recurring subscription, and the site developers just plug things through an API to display the documents from the central medical system.

Thus, you have one document that's identical across dozens or hundreds of different hospital and medical system websites.

A long time ago, I held more weight in results from reputable places like Mayo Clinic, but even their site seems to be the same as all the others now.


But is the purchased information not good?


They’re all the same thing, same owners likely


All major public facing organizations are slowly bought out by extremely wealthy and influential groups when they get big enough for narrative control purposes because ultimately mind control is the largest source of wealth and power.


The equation to rank links is a poor substitute for an editor. Editors used to be the gateway of information that was disbursed via newspaper, radio, and TV. They were not without fault. So what is the least faulty filter? (Even your own brain, eyes, and ears are faulty, sorry perfectionists.) You cannot trust video or audio now due to deepfakes. What can you trust as a source of information more than 50%?

There's still information in the noise, you have to become your own editor.


Google originally used back links in the equation; where the goal is to take the information by editors on a massive aggregate to rank. Brilliant until gamed.


Time to go back to Yahoo! search engine with manually added and verified links. Automated web crawling experiment has failed. Anyway, whatever I search, I end up in either reddit or wikipedia (incidentally, both are human-curated stores of knowledge).


I don't know about Wikipedia, but Reddit is full of bots. They're sophisticated enough you may not recognize some comments are not from a human though. That's worse than when it was always obvious.


Not just Reddit; Twitter, Facebook, Instagram, (basically every social media)... full of bots posting and replying to each other. Dead Internet Theory.


The reason all of this exists - the reason the entire Internet turned into an unusable amalgam of horseshit - is because we built the entire commercial side of the internet off ad revenue. It might be a death spiral at this point - I can’t imagine anyone actually being willing to Pay for whatever the fuck Google or Facebook have become, so there’s nothing left to do but keep inventing new ways to generate bullshit and new bots to view it for you.


Whatever model of the internet you have isn’t going to disrupt affiliate links. And as long as you have affiliate links, you will have websites trying to game the rankings to get their affiliate links in front of eyeballs. It’s not just a Google problem.


The internet is the largest and most elaborate monument to marketing we may ever build.


Ye of little faith. Some day we'll project that Coca-Cola logo on to the moon.


someday we'll sell the edges of our peripheral vision for ad space in lieu of a universal basic income


Aren't we already doing that?


> someday we'll sell the edges of our peripheral vision for ad space in lieu of a universal basic income

Meta and RayBan are working on that and not far: the RayBans "Meta take a picture" / "Meta take a movie" are a thing. And I doubt very much that it's done with good motives.

No HUD yet but they've already got a microphone to analyze your voice (for good reasons, we're sure) and speakers.


I write really detailed content for a living. So long as no one pays for it, I must live through other means. Fortunately, I don’t need to peddle stupid products, but I can’t disentangle myself from affiliate partnerships without state funding, and that presents its own challenges.

We have the internet we pay for.


I pay for Kagi and I have to say it's so worth it. Very fast and no more trash results, a lot of customizability, filters, lenses, etc. They even have an incentive to make my searches worthwhile since I pay a flat rate.


> is because we built the entire commercial side of the internet off ad revenue

The root problem there is it's not practical in today's world to pay the small sums individual bits of content are actually worth. The smallest practical transact-able value is about a dollar. With service fees you can't go much below that value before you make no money or lose money. Even then it's really only practical for large scale players.
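
To put rough numbers on that, here is a minimal sketch assuming a fee schedule of a fixed $0.30 plus 2.9% per transaction. Both figures are assumptions for illustration, not a quote from any specific processor, and `net_to_creator` is a hypothetical helper.

```python
from decimal import Decimal

FIXED_FEE = Decimal("0.30")     # assumed flat fee per transaction, in dollars
PERCENT_FEE = Decimal("0.029")  # assumed percentage fee per transaction


def net_to_creator(amount):
    """Return what is left of a payment after the assumed processing fees."""
    amount = Decimal(amount)
    fee = FIXED_FEE + amount * PERCENT_FEE
    return amount - fee


for price in ("1.00", "0.25", "0.05"):
    print(price, "->", net_to_creator(price))
```

Under these assumed fees a $1.00 payment loses about a third of its value, and anything below roughly $0.31 nets out negative, which matches the "about a dollar" floor described above.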

A single news article is not worth a dollar. A tweet or HN comment is worth nowhere near a dollar. Even if I found some HN comment worth money...how is my payment going to get to you and not eaten by HN?

Crypto bullshit is not the answer at all. It's worse for transactions in every way than regular money. It pretends to be a solution to micropayments by ignoring the very real and very onerous transaction fees and deflationary nature of the currency. There have been efforts to deal with micropayments but it's a hard problem. Paying individuals is difficult, and transacting in practical (sub-cent) values is extremely difficult to do.

Ads are an imperfect but working-ish solution to micropayments. They allow the customer to "pay" with attention (though now with intrusive tracking) rather than currency. AdTech has gone bonkers with tracking and targeting and has gleefully participated in facilitating the Dead Internet.


This kind of spam was already destroying the web before LLMs came along. Now it’s being accelerated by thousands of times.

Mainstream web search is probably cooked. Kagi and other niche players might have a chance if the fact that they are not beholden to advertisers lets them introduce features to do things like downrank content with ads. Kagi has “small web” which I think includes this in its weighting.

Open social media is probably cooked too. In the future it’s going to require proof of human identity and will be more heavily moderated.

The future of social is closed forums. Even those are really having to fight bots though.

The other pervasive awful trend of the moment is everything becoming like John Deere tractors: cloud connected, DRMed, with subscriptions and/or planned obsolescence.

Capitalism is supposed to reward people for creating value, but today it seems like it’s far easier and more profitable to just extract rent or scam. I am not sure how to fix this.


> Even those are really having to fight bots though.

We recently released a closed-system, iOS-only app, that has a fairly rudimentary, privacy-first system of user registration. It is currently restricted to the US and Canada.

Each signup request is manually vetted. There is no automatic registration. The app is designed for a specific demographic, and we do our best to ensure that new accounts are real people, that fit the demographic.

This is an iOS-only app (free), restricted to the North American continent, and with no accessible server API. The server is a bespoke server, and has no connections or dependencies that we don't control.

We are flooded with bots, and, most likely, scammers. So far, they have been pretty easy to spot, but that could change.


What app and serving whom?


I won’t mention it here.

Like I said, each signup is individually vetted, and the last thing we need, is hundreds of curious geeks, signing up one-shot accounts.

It’s for addicts, seeking Recovery.

We’re not interested in scale; only quality. People’s lives can depend on it.

The bots have declined, recently. I suspect that there’s a watcher bot, that triggers on new apps. When it first came out, we had a lot.


I think we'll end up moving away from shallow signifiers of trustworthiness to reputational networks and evidence someone has invested into an identity or entity.

That's not to say it's a solved problem, even in the real world with thousands of years of battle tested strategies.

A very simple example would be a web browser where I could blacklist chronically-unhelpful sites, and share that metadata among friends.


The hyper-enshittification stages look like:

1. People get on the web and make real content with care because they're just excited to share stuff with each other.

2. Advertisers talk them into putting some ads on their pages so they can get some compensation for their work.

3. Shitty people figure out you can just make content where the main incentive is to get people to go to the page and see the ads.

4. Those people then outsource writing the content to the lowest bidder.

5. The lowest bidder becomes an LLM.

6. Search engines cut out the middle-man entirely and just send your search query to an LLM, stuff some ads in, and show the result to the user without ever hitting the web (except to periodically scrape it for model training).

7. Because of 6, people stop putting new content on the web at all. The models get shittier and stupider with regards to current events.

8. To counter that, LLM companies make deals with news organizations and other primary source information providers and pay them to have direct access to content to train their models.

9. Those organizations get such a large fraction of their income from those deals that eventually they get out of the business of giving human readers direct access to it because it's not worth the effort. Newspapers become B2B companies.

10. The only way to get information is via a handful of giant tech companies sitting on top of huge LLMs saying who-knows-what trained on a slurry of actual information and giant piles of ads.

I hope that somewhere in the process people start to get tired of talking to machines all day and hop off the ride entirely and start calling up their friends and getting information the old-fashioned way.

The only consolation I have is the belief that people have a deep seated desire to connect to actual humans and know the real truth about the world.


IMHO a major part of the problem is that the Internet never had a mechanism for paying for good content. Everything is “free” therefore ads emerge as the only monetization strategy and you did a good job outlining the rest.

I’ve started trying to pay for good journalism, especially good indie journalism. I also Patreon a bunch of podcasts, buy high quality software if the price is reasonable, buy albums of my favorite music, buy films, and so on, while actively avoiding both gratuitous subscription models and the ad web.

Pay for it or it either doesn’t get made or it pays for you. Free is a lie and piracy undermines quality.

Edit:

All the paying for good stuff I outlined above averages out to around $100-$150/month. It’s less than I usually spend on restaurants and coffee shops and far less than groceries for our family. Restaurants in particular feel like a far more frivolous expense.


Who are some indie journalists you've found to be worth paying? I'd like to find some more work that's worth supporting.


Andrew Callaghan of Channel 5 News


He makes good stuff, but do the allegations give you any pause?


You're assuming that LLMs will boost spam and scams more than LLMs will tone them down by marginalizing them automatically. I have the opposite view. SMTP spam used to be mostly unavoidable, then one day it was not.

IMO, the number of engineers and moderators needed to offset one scammer is about to take a huge dive.


Your analysis overlooks a massive cost asymmetry[1] in favour of the spammer. They only have to pay the LLM cost once to generate a message, which they can then send thousands or millions of times, whereas the receiving side would need to pay an LLM to check and classify each incoming message.

[1]Either you pay a SaaS LLM provider or you pay the cost of compute to run the LLM yourself
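
To put rough numbers on that asymmetry, here is a back-of-the-envelope sketch. Every price in it is an assumption picked for illustration, not a real provider's rate, and amounts are kept in integer micro-dollars to avoid float rounding.

```python
# All prices are hypothetical, in micro-dollars (1_000_000 = $1), chosen only
# to show the shape of the asymmetry.
GENERATION_COST = 10_000      # assumed: $0.01 to generate one spam template
CLASSIFICATION_COST = 1_000   # assumed: $0.001 to classify one incoming message

def spammer_cost(templates: int) -> int:
    """The spammer pays once per template, however many copies are sent."""
    return templates * GENERATION_COST

def defender_cost(messages_received: int) -> int:
    """The receiving side pays once per incoming message."""
    return messages_received * CLASSIFICATION_COST

copies_sent = 1_000_000
attack = spammer_cost(templates=1)
defence = defender_cost(copies_sent)
print(f"spammer:  ${attack / 1_000_000:.2f}")
print(f"defender: ${defence / 1_000_000:.2f}")
print(f"ratio:    {defence // attack}x")
```

Under these made-up prices the defender spends about 100,000 times what the spammer does. A defender who caches verdicts for byte-identical messages would narrow the gap, which this sketch ignores.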


What information led to your opinion?


I think LLMs have higher complexity in the white-/black-hat "struggle domain" than the weakest-link problem, maybe O(N^2) vs. O(NlogN).

So if it used to be O(N) white against O(NlogN) black, it's now evenly matched, with O(N^2) LLMs on both sides.


My strategy has been to put my money where my mouth is and start paying for services that provide value to me (Kagi is one example - I’m a paying customer, and actually found this article using their small web site)


Strategy for what?


I was nodding my head as a member of the choir until

> The only future of the internet is, sadly, proof-of-person and proof-of-residence on every public network interaction

Yeah, go away. The problem is that Google has monopolized web aggregation. Without it these sites wouldn't be worth making. I've got to show ID to post something online to kick that can down the road?

The way I see it this is a self resolving problem. Ad driven search engines rank ad driven blog spam, people get tired of it and use methods of finding content that don't show that garbage, these guys, along with their multinational trillion dollar benefactor, go out of business. Problem solved.


> Yeah, go away. The problem is that Google has monopolized web aggregation. Without it these sites wouldn't be worth making.

So how will you find things?

> Ad driven search engines rank ad driven blog spam

We don't have actual concrete proof that Google (or others) rank content with ads higher because of the ads.

As a contrary corollary, generally as society grows, we've seen an increase in "proof-of-person and proof-of-residence on every public X". Want welfare checks or charity, prove yourself. Want to buy cough syrup or booze, prove yourself. Want to drive, want to shoot a gun, want to XXXX.... get an ID.

In early America, men used to vote by everyone going into a big room and shouting for a while (some minor exaggeration). Now you have to register in advance and show ID, and they maintain registries of everyone and their affiliated party.

Showing identity comes with a lack of trust, and volume + anonymity decreases trust as it's slowly abused.


Look, there's a difference between showing ID to vote or get a free paycheck from someone and showing ID to shout my ideas from a rooftop. I don't need your trust to say things on the internet. If you don't trust me, don't read what I have to say. You can continue trusting Google though, I won't, this is a bait and switch and they're the source of the problem, not your lack of my identifying documents. We have no proof that they're doing it on purpose, I don't care about the intent, I care about the results.

How do I find things? I'm already living life without much google in it. There are a lot of ways to find things. Aggregators like this one have a better signal to noise ratio than google or most places that publish a lot of information. There are search engines that actively blacklist anything with SEO in it. There are community groups that focus on topics of interest. I find that I only use big search engines nowadays to find a git repo for something or find out what time some place closes, that's all they're good for nowadays. I trust people more than faceless services, and I don't care anything about who any of those people are in real life.


I don't see this issue the way the author of this article does. Both the stated problem and the stated solution seem overblown to me.

First, what jumped out at me when I read the article was that I had never heard of Jodie Chiffey before reading the article. I don't see all the trash the author is talking about because I don't spend any time on websites that promote it, or dealing with spam emails from sources of it, etc., etc. I can get along just fine with simply ignoring the existence of all the trash and going to websites (like this one) that I actually want to go to.

And how am I able to go to websites like this one without being overwhelmed by trash? Because (a) this website is managed by actual people who aren't interested in spamming me, and (b) HTTPS means I know that when I go to "news.ycombinator.com" I am going to a site managed by those people. In other words, we already have the actual solution to the "how do I avoid the trash" problem, and have had it for decades. We do not need to build a huge new spyware infrastructure to "fix" the web. We already have, and are using, the tools we need to avoid the trash.


There was another article here on HN some time ago on how big media publishers are littering the web with spam recommendations https://news.ycombinator.com/item?id=39433451 (https://housefresh.com/david-vs-digital-goliaths/)

This listed review spam from Dotdash Meredith, which I used as a clue to search for and block all of their sites using uBlacklist.

Edit: Found another article saying the same thing: there are only a few big media giants dominating search https://detailed.com/google-control/


I apologize for not having a solution or complaint here. The discussion is mostly over my head.

I just want to point out Matt.sh apparently missed that Jodie’s a hardcore gamer and gaming expert as well as a trusted advisor to senior citizens. She also “… doesn’t want to hold back when it comes to getting the word out about affiliate marketing,” and will teach you her tricks. She is an avid toy insider too, loves toys from the 80’s. Not that bad looking either.


Who are you protecting? Those who would be duped by it? Are you trying to get rid of salesmen? Responding to a dark market with authoritative measures means you're trying to protect suckers. You can't stop people from being victimized. An evolutionary response is the only solution. A graceful decline of suckers is the solution. Do not try to protect them, they will only hate you for it.


While I understand the frustration of the author in searching the web today (I do too), I am not sure how they get to:

> We end up with real time global access from “low trust low income max exploitation because who is going to stop us” wankers to “high trust high income low questioning” societies and everything falls apart.

Obviously, there are "low trust high income max exploitation" folks doing the damage too, as one of the linked articles talks about frauds with wiring large sums (in 100ks of dollars) into Hong Kong, which is itself an expensive city.

Similarly, from the other direction, you've got high-income communities "exploiting" lower-income communities by getting cheaper labour (like getting things produced in China) — paying less than they would locally for the same or larger work effort.

The solution to this is obviously global equalization of salary bands, which is well under way due to an ability to do a lot of highly paid work remotely and globalization in general — but it will take some time (and it's also why China is becoming less appealing in particular: salaries are going up there as well). But that will lead to a new set of problems altogether.


I have never come across "AI generated" content in duckduckgo


Maybe a social endorsement scheme could be an alternative to an invasive proof system.


If there's a social endorsement scheme that doesn't involve real-world identity, I don't see how you won't end up with a situation where Bot_A is counted as legit, being endorsed by Bot_B...Bot_Z


because once Bot_J and Bot_Q are discovered to be bots, then everyone they've authenticated suddenly needs new people to vouch for them.

problem is, how do you figure out someone's a bot? Ah well maybe someone will make an AI tool for that, we humans are too busy doing groceries and putting round cubes into square holes.
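
For what it's worth, here is a minimal sketch of that cascade. It assumes something the thread doesn't settle: a small set of root accounts that are trusted by some other means (manual vetting, say). Every account name and the `min_vouchers` parameter are hypothetical.

```python
def trusted_accounts(vouches, roots, banned=frozenset(), min_vouchers=1):
    """Return the accounts reachable from `roots` through chains of vouches.

    `vouches` maps a voucher to the set of accounts they vouch for. An account
    becomes trusted once at least `min_vouchers` already-trusted accounts vouch
    for it. Banned accounts are never trusted and their vouches never count.
    """
    trusted = set(roots) - set(banned)
    changed = True
    while changed:
        changed = False
        # Count vouches coming from currently trusted accounts only.
        counts = {}
        for voucher in trusted:
            for target in vouches.get(voucher, ()):
                counts[target] = counts.get(target, 0) + 1
        for account, n in counts.items():
            if n >= min_vouchers and account not in trusted and account not in banned:
                trusted.add(account)
                changed = True
    return trusted

vouches = {
    "alice": {"bob", "bot_j"},
    "bob": {"carol"},
    "bot_j": {"bot_k", "carol"},
    "bot_k": {"bot_l"},
}
before = trusted_accounts(vouches, roots={"alice"})
after = trusted_accounts(vouches, roots={"alice"}, banned={"bot_j"})
```

In this toy graph, banning bot_j also strips trust from bot_k and bot_l, while carol stays trusted because bob still vouches for her. A ring of bots that only vouch for each other never gets in unless a trusted account vouches for one of them, and spotting that one bad vouch is still the hard part.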


That kind of verification could be off-protocol. Perhaps a reputation or incentive system could help as well.


Mission fucking accomplished. [0]

0 - https://xkcd.com/810/


CAs tend to revoke a domain's certificates if it's used for crime. Maybe they should do the same for SEO spam.


Like maybe a website is considered better if it is linked to by other websites? Kinda like research papers... We could call it "Page Rank".

I bet we could get a major university (eg. Stanford) to help fund the initial deployment. I think we should call it a really big silly number, like "Gazillion" to emphasize how much it knows. Obviously this won't have ads, and it should make an explicit point to not be evil....

Hmm. I think I've heard this before.
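
For anyone who hasn't seen the idea spelled out, a textbook-style power-iteration sketch of link-based ranking looks roughly like this. It is a toy illustration with the commonly cited 0.85 damping factor, not a claim about how any real search engine ranks pages today.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank: `links` maps a page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outlinks.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

Here page c ends up on top because both a and b link to it, and b ends up last because nothing links to it. That dependence on inbound links is exactly what link farms set out to game.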


She is second best to the world's true all-rounder, Johnny Sins.


> The only future of the internet is, sadly, proof-of-person and proof-of-residence on every public network interaction.

I'm going to be that pedant and point out that the Internet is not the same as the Web, and it's the Web that's sick. The Internet is fine.

It's a distinction that matters because the Internet is expensive things like satellites and undersea cables. It's an investment that's too large to just walk away from, so perhaps its future is our future.

The Web is just a bunch of conventions about how to use the Internet, it's not binding in any way. We can write a different protocol without laying new cable, we can make it less profitable for abusers, and then we can abandon the sick version that we're currently using.


Yes, but we all get what the author means. UDP or TCP, SMTP or HTTP — most of it is transporting low value sludge whose purpose is to exploit its consumers. It involves everything from somewhat benign forms of surveillance/profiling to aggressively malicious scams.

You could use the infrastructure for better and many people do. But most of the content on the internet isn’t that.


The rest of the internet has even less reputation & ability to assess than the web. Nothing else has links worth a damn, and links while fakeable also do say something sometimes.


The post you replied to didn't say anything about the Internet being sick.


Google has an incentive to keep these links going. Google makes money by you continually returning to Google. If you get the perfect result on your first click, then you move on with your day. If you get junk, then you go back to Google and click the next junk link and repeat. Each time someone visits one of these content farms and returns to Google, it means more advertising dollars for Google.


This is a classic argument and appears in many many forms:

* "Psychologists don't want to fix your problems because then you'll stop needing therapy every week."

* "Dating sites don't actually want you to find a long-term partner because then you'll stop using the site."

* "The mechanic's not trying to actually fix your car, just get it running for a few weeks so it breaks down again and you come back."

Etc. etc.

Any time there is information asymmetry and leaving a customer not fully satisfied might lead to future sales, this old canard comes up.

I'm sure in some cases it's true. But, like, people aren't entirely stupid. Consumers generally won't keep repeatedly going back to the same business if the service is kinda sucky. And businesses generally figure out that reputation matters and the most economically viable long-term strategy is just to give people what they want.


> But, like, people aren't entirely stupid. Consumers generally won't keep repeatedly going back to the same business if the service is kinda sucky. And businesses generally figure out that reputation matters and the most economically viable long-term strategy is just to give people what they want.

This isn't a law of nature. It's the result of particular conditions. Businesses in high-trust and low-trust cultures behave differently, and the descent of the US from a high-trust to a low-trust culture is going to have consequences.


High trust? The US sold literal "snake oil"


The Wild West times were low trust. The New Deal era was high-trust. We're swinging back to low trust.


Don’t forget that rubbish cleaners throw the most rubbish on the streets. If there would be no rubbish on the streets, they would be out of a job.

Now think crime and police: without crime, the police would be out of a job.

A consultant has no interest that the project they are consulting on is ever completed.

Of course these examples aren’t to be taken seriously, they merely illustrate some potential conflicts of interest of roles within society.


> long-term strategy

This is where the theory falls apart. When “long-term strategy” and “short-term quarterly earnings” get into the boardroom together at a public company, it ain’t “long-term strategy” that’s walking out.


It can be though. "Short Term" tends to win when CEO pay is contingent upon hitting a certain share price at certain times. Founders tend to not have this issue, and I'm sure other CEOs could have pay packages crafted this way if we desired.


There’s nothing stopping it except that investors solely want short term gains and are rewarding CEOs and management for delivering that and punishing them when they don’t.


It's not necessarily that Google or whoever is doing it on purpose. It might just be that Google gets lazy because the money is still coming in. Or the psychiatrist, chiropractor, etc. actually believes they are helping you and feels they can continue to be useful (or doesn't want to turn you loose too early in case of a bad consequence). There's all sorts of unintentional stuff that can still result in a bad outcome that seems predatory.


Both things can be true at the same time, it depends on the time preference of the business owner.


> this old canard comes up.

Your examples 1 & 3: I have personally witnessed those negative outcomes.

Regarding 3, A/C repair shops (drain system first, then discuss pricing) and transmission shops (disassemble first, then discuss pricing) are kind of notorious for it.

And yet there are mechanics and therapists who have earned my unquestioning trust.

I've not used 2. I may have bias.


I am a dynamic figure, often seen scaling walls and crushing ice. I have been known to remodel train stations on my lunch breaks, making them more efficient in the area of heat retention. I translate ethnic slurs for Cuban refugees, I write award-winning operas, I manage time efficiently. Occasionally, I tread water for three days in a row.

I woo women with my sensuous and godlike trombone playing, I can pilot bicycles up severe inclines with unflagging speed, and I cook Thirty-Minute Brownies in twenty minutes. I am an expert in stucco, a veteran in love, and an outlaw in Peru.

Using only a hoe and a large glass of water, I once single-handedly defended a small village in the Amazon Basin from a horde of ferocious army ants. I play bluegrass cello, I was scouted by the Mets, I am the subject of numerous documentaries. When I’m bored, I build large suspension bridges in my yard. I enjoy urban hang gliding. On Wednesdays, after school, I repair electrical appliances free of charge.

I am an abstract artist, a concrete analyst, and a ruthless bookie. Critics worldwide swoon over my original line of corduroy evening wear. I don’t perspire. I am a private citizen, yet I receive fan mail. I have been caller number nine and have won the weekend passes. Last summer I toured New Jersey with a traveling centrifugal-force demonstration. I bat 400. My deft floral arrangements have earned me fame in international botany circles. Children trust me.

I can hurl tennis rackets at small moving objects with deadly accuracy. I once read Paradise Lost, Moby Dick, and David Copperfield in one day and still had time to refurbish an entire dining room that evening. I know the exact location of every food item in the supermarket. I have performed several covert operations for the CIA. I sleep once a week; when I do sleep, I sleep in a chair. While on vacation in Canada, I successfully negotiated with a group of terrorists who had seized a small bakery. The laws of physics do not apply to me.

I balance, I weave, I dodge, I frolic, and my bills are all paid. On weekends, to let off steam, I participate in full-contact origami. Years ago I discovered the meaning of life but forgot to write it down. I have made extraordinary four course meals using only a mouli and a toaster oven. I breed prizewinning clams. I have won bullfights in San Juan, cliff-diving competitions in Sri Lanka, and spelling bees at the Kremlin. I have played Hamlet, I have performed open-heart surgery, and I have spoken with Elvis.

But I have not yet gone to college.

- well known missive from decades ago


Thanks, I've not seen that college application essay in ages, and it's written better than most of what I've read since.

https://archive.blogs.harvard.edu/sj/i-am-a-dynamic-figure/


Buries the lede. WHY would you make so many talented instances of Jodie? What's the upside? How do the financials work, and is it a stake or paid labour?


Blogspam sites are made by the exact same kind of "hustle bros" who run dropshipping companies.

The business model is very simple and requires 3 ingredients: web design, SEO, and copyediting. The first two are one-time costs. The 3rd one is a COGS and there is a whole market of professional blogwriters who charge something per 1000 words.

To answer your question, once you create a blog that starts printing money, it's in your best interest to just replicate it while changing as little of the "template" as possible, because you don't really know what element made it click.


So if Jodie the expert in fishing works, you try cricket, music, cabbage-patch dolls on the assumption .. the ideation of Jodie worked? Feels like a big assumption that the magic was Jodie, and not the context of fishing.

I.e., your "you don't really know what element made it click." above assumed Jodie was the fixed value, no matter what else.

That's what I don't get: I click on lego spam because I like lego, not because Jodie is cool and I love his wordsmithing. That follows on.

I'm probably being thick. Maybe it's because to the recipient Jodie is unique each time? It's "this is the least important part of it" so they don't change it because IT DOESN'T MATTER.


> WHY would you make so many talented instances of Jodie?

One name is twice as easy to invent as two names.


Choosing a name isn't the bulk in the cost structure.


> WHY would you make so many talented instances of Jodie?

Because Jodie is everywhere and got your girl back home

https://taskandpurpose.com/military-life/brief-history-jody-...


How bad does it have to get before it stops working, I mean like phone-a-friend, Britannica, Yellow Pages bad? The quicker and more deeply we get there, the more effort and traction gets behind some breakout solution. Maybe it's just pure cynicism: burn it all.

Remember email spam? It got so bad that we fixed it. I mean email has its issues, and how, but spam isn't one of them. I built a spam juggernaut in my day (got bills don't I :)) and I feel like I contributed a tiny bit to our almost-spamless latter days.

Progress! The world is on the march.


Did you accidentally post the same thing twice or are you making some meta point on spam?



