
I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions, and a focus on "indexability" instead of readability made the web a less-than-ideal source for such analysis long before LLMs.

It also made the web a less-than-ideal source for training. And yet LLMs were still fed articles written for Googlebot, not humans. ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.


> I agree in general, but the web was already polluted by Google's unwritten SEO rules. Single-sentence paragraphs, multiple keyword repetitions, and a focus on "indexability" instead of readability made the web a less-than-ideal source for such analysis long before LLMs.

Blog spam was generally written by humans. While it sucked for other reasons, it seemed fine for measuring basic word frequencies in human-written text. The frequencies are probably biased in some ways, but this is true for most text. A textbook on carburetor maintenance is going to have the word "carburetor" at way above the baseline. As long as you have a healthy mix of varied books, news articles, and blogs, you're fine.
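(To be concrete, the kind of measurement at stake is just corpus-level counting. A minimal Python sketch, with naive tokenization, not the actual wordfreq pipeline:)

    from collections import Counter

    def word_frequencies(documents):
        """Relative word frequencies across a mixed corpus of texts."""
        counts = Counter()
        total = 0
        for text in documents:
            tokens = text.lower().split()  # crude tokenization, for illustration
            counts.update(tokens)
            total += len(tokens)
        return {word: n / total for word, n in counts.items()}

    # A varied mix of books, news, and blogs washes out per-document
    # biases like "carburetor" spiking in a repair manual.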

In contrast, LLM content is just a serpent eating its own tail - you're trying to build a statistical model of word distribution off the output of a (more sophisticated) model of word distribution.


Isn't it the other way around?

SEO text carefully tuned to tf-idf metrics and keyword-stuffed right up to the empirically determined threshold Google still allows should have unnatural word frequencies.
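(For reference, the metric being gamed, in a minimal Python sketch; this is plain textbook tf-idf, and Google's actual ranking signals are of course not public:)

    import math
    from collections import Counter

    def tf_idf(term, doc_tokens, doc_freq, n_docs):
        # doc_freq maps each term to the number of documents containing it
        tf = Counter(doc_tokens)[term] / len(doc_tokens)  # keyword stuffing inflates this
        idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
        return tf * idf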

LLM content should just enhance and cement the status quo word frequencies.

Outliers like the word "delve" could just be sentinels, carefully placed like trap streets on a map.


But you can already see it with "delve". Mistral uses "delve" more than baseline because it was trained on GPT output.

So it's classic positive feedback. LLM uses delve more, delve appears in training data more, LLM uses delve more...
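(A toy simulation of that loop, with made-up parameters, nothing measured:)

    def word_frequency_after(p0, human_share, model_boost, generations):
        """Frequency of a word after N rounds of training on a human/model text mix."""
        p = p0
        for _ in range(generations):
            model_p = min(1.0, p * model_boost)  # the model over-uses the word
            # the next training corpus mixes fresh human text with model output
            p = human_share * p0 + (1 - human_share) * model_p
        return p

    # word_frequency_after(1e-5, 0.5, 2.0, 10) -> ~6e-5, six times baseline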

Who knows what other semantic quirks are being amplified like this. It could be something much more subtle, like cadence or sentence structure. I already notice that GPT has a "tone" and Claude has a "tone" and they're all sort of "GPT-like." I've read comments online that stop and make me question whether they're coming from a bot, just because their word choice and structure echoes GPT. It will sink into human writing too, since everyone is learning in high school and college that the way you write is by asking GPT for a first draft and then tweaking it (or not).

Unfortunately, I think human and machine generated text are entirely miscible. There is no "baseline" outside the machines, other than from pre-2022 text. Like pre-atomic steel.


Is the use of "miscible" here a clue? Or just some workplace vocabulary you've adapted analogically?

Human me just thought it was a good word for this. It implies an irreversible process of mixing, which I think characterizes this process really well.

There were dozens of 20th Century ideological movements which developed their own forms of "Newspeak" in their own native languages. Largely, natural human dialog between native speakers and between those opposed to the prevailing regime recoils violently at stilted, official, or just "uncool" usages in daily vernacular. So I wouldn't be too surprised to see a sharp downtick in the popular use of any word that becomes subject to an LLM's positive-feedback loop.

Far from saying the pool of language is now polluted, I think we now have a great data set with which to begin to discern authentic from inauthentic human language. Although sure, people on the fringes could get caught in a false positive for being bots, like you or me.

The biggest LLM of them all is the daily driver of all new linguistic innovation: Human society, in all its daily interactions. The quintillions of daily phrases exchanged and forever mutating around the globe - each mutation of phrase interacting with its interlocutor, and each drawing from not the last 500,000 tokens but the entire multi-modal, if you will, experience of each human to date in their entire lives - vastly eclipses anything any hardware could ever emulate given the current energy constraints. Software LLMs are just a state machine stuck in a moment in time. At best they will always lag, the way Stalinist language lagged years behind the patois of average Russians, who invented daily linguistic dodges to subvert and mock the regime. The same process takes place anywhere there is a dominant official or uncool accent or phrasing. The ghetto invents new words, new rhythm, and then it becomes cool in the middle class. The authorities never catch up, precisely because the use of subversive language is humanity's immune system against authority.

If there is one distinctly human trait, it's sniffing out anyone who sounds suspiciously inauthentic. (Sadly, it's also the trait that leads to every kind of conspiracy theorizing imaginable; but this too probably confers in some cases an evolutionary advantage). Sniffing out the sound of a few LLMs is already happening, and will accelerate geometrically, much faster than new models can be trained.


Humans also lag humans; the future may already be spoken, but the slang is not evenly memed out yet.

If you think that's niche, wait till you hear about man-machine miscegenation.

> LLM uses delve more, delve appears in training data more, LLM uses delve more...

Some day we may view this as the beginnings of machine culture.


Oh no, it's been here for quite a while. Our culture is already heavily glued to the machine. The way we express ourselves, the language we use, even our very self-conception originates increasingly in online spaces.

Have you ever seen someone use their smartphone? They're not "here," they are "there." Forming themselves in cyberspace -- or being formed, by the machine.


chat is this real?

1. People don't generally use the (big, whole-web-corpus-trained) general-purpose LLM base-models to generate bot slop for the web. Paying per API call to generate that kind of stuff would be far too expensive; it'd be like paying for eStamps to send spam email. Spambot developers use smaller open-source models, trained on much smaller corpuses, sized and quantized to generate text that's "just good enough" to pass muster. This creates a sampling bias in the word-associational "knowledge" the model is working from when generating.

2. Given how LLMs work, a prompt is a bias — they're one-and-the-same. You can't ask an LLM to write you a mystery novel without it somewhat adopting the writing quirks common to the particular mystery novels it has "read." Even the writing style you use in your prompt influences this bias. (It's common advice among "AI character" chatbot authors, to write the "character card" describing a character, in the style that you want the character speaking in, for exactly this reason.) Whatever prompt the developer uses, is going to bias the bot away from the statistical norm, toward the writing-style elements that exist within whatever hypersphere of association-space contains plausible completions of the prompt.

3. Bot authors do SEO too! They take the tf-idf metrics and keyword stuffing, and turn it into training data to fine-tune models, in effect creating "automated SEO experts" that write in the SEO-compatible style by default. (And in so doing, they introduce unintentional further bias, given that the SEO-optimized training dataset likely is not an otherwise-perfect representative sampling of writing style for the target language.)


On point 1, that’s surprising to me. A 2,000-word blog post would be about 10 cents with GPT-4o. So you put out 1,000 of them, which is a lot, for $100.

There are two costs associated with using a hosted inference platform: the OpEx of API calls, and the CapEx of setting up an account in the first place. This second cost is usually trivial, as it just requires things any regular person already has: an SSO account, a phone number for KYC, etc.

But, insofar as your use-case is against the TOUs of the big proprietary inference platforms, this second cost quickly swamps the first cost. They keep banning you, and you keep having to buy new dark-web credentials to come back.

Given this, it’s a lot cheaper and more reliable (you might summarize these as “more predictable costs”) to design a system around a substrate whose “immune system” won’t constantly be trying to kill it. Which means either your own hardware, or a “bring your own model” inference platform like RunPod/Vast/etc.

(Now consider that there are a bunch of fly-by-night BYO-model hosted inference platforms, that are charging unsustainable flat-rate subscription prices for use of their hardware. Why do these exist? Should be obvious now, given the facts already laid out: these are people doing TOU-violating things who decided to build their own cluster for doing them… and then realized that they had spare capacity on that cluster that they could sell.)


This makes sense. But now I’m wondering if people here are speaking from experience or reasoning their way into it. Like are there direct reports of which models people are using for blogspam, or is it just what seems rational?

But then you'll be competing for clicks with others who put out 1,000,000 posts at lower cost because they used a small, self-hosted model.

If you are a sales & marketing intern with a potato laptop and a $100 budget to spend on SEO, you aren't going to be self-hosting anything, even if you know what that means.

This is about high-volume blog/news-spam created specifically to serve ads and affiliate links, not about occasional content marketing for legitimate companies.

> LLM content should just enhance and cement the status quo word frequencies.

TFA mentions this hasn't been the case.


Would you mind dropping the link talking about this point? (context: I'm a total outsider and have no idea what TFA is.)

TFA means "the featured article", so in this case the "Why wordfreq will not be updated" link we're talking about.

To be pedantic, the F in TFA has the same meaning as the F in RTFM.

It’s the same origin. On Slashdot (the HN of the early '00s), people would admonish others to RTFA. Then they started using it as a referent: TFA was the thing you were supposed to have read.


Oh that I'm aware of, but it's softened over time too haha

I miss the old Atomic MPC forums in the ~00s.


The Fucking Article, from RTFA - Read the Fucking Article - and RTFM - Read the Fucking Manual/Manpage

  Too deep we delved, and awoke the ancient delves.

serpent eating its own tail

GOGI.


The Inhuman Centipede

At some point, though, you have to acknowledge that a specific use of language belongs to the medium through which you're counting word frequencies. There are also specific writing styles (including sentence/paragraph sizes, unnecessary repetitions, and focusing on metrics other than readability) associated with newspapers, novels, e-mails to your boss, anything really. As long as text was written by a human who was counting on at least some remote possibility that another human might read it, it's a far more legitimate use of language than text generated by a machine.

This feels like a second, orders-of-magnitude-larger Eternal September. I wonder how much more of this the Internet can take before everyone just abandons it entirely. My usage is notably lower than it was in even 2018; it's so goddamn hard to find anything worth reading anymore (which is why I spend so much damn time here, tbh).

I think it's an arms race, but it's an open question who wins.

For a while I thought email as a medium was doomed, but spammers mostly lost that arms race. One interesting difference is that with spam, the large tech companies were basically all fighting against it. But here, many of the large tech companies are either providing tools to spammers (LLMs) or actively encouraging spammy behaviors (by integrating LLMs in ways that encourage people to send out text that they didn't write).


The fight against spam email also led to mass consolidation of what was supposed to be a decentralised system, though. Monoliths like Google and Microsoft now act as de-facto gatekeepers who decide whether or not you're allowed to send emails, and there's little to no transparency around, or recourse against, their decisions.

There's probably an analogy to be made about the open decentralised internet in the age of AI here, if it gets to the point that search engines have to assume all sites are spam by default until proven otherwise, much like how an email server is assumed guilty until proven innocent.


Another problem with this arms race is that spam emails actually are largely separable from ham emails for most people... or at least they were, for most of their run. The thousandth email that claims the UN has set aside money for me due to my non-existent African noble ancestry that they can't find anyone to give it to and I just need to send the Thailand embassy some money to start processing my multi-million yuan payout and send it to my choice of proxy in Colombia to pick it up is quite different from technical conversation about some GitHub issue I'm subscribed to, on all sorts of metrics.

However, the frontline of the email war has shifted lately. Now the most important part of the war is being fought over emails that look just like ham, but aren't. Business frauds where someone convinces you that they are the CEO or CFO or some VP and they need you to urgently buy this or that for them right now no time to talk is big business right now, and before you get too high-and-mighty about how immune you are to that, they are now extremely good at looking official. This war has not been won yet, and to a large degree, isn't something you necessarily win by AI either.

I think there's an analogy here to the war on content slop. Since all the content slop wants is for you to see it so it can serve you ads, it doesn't need anything else that our algorithms could trip on, like links to malware or calls to action to be defrauded, or anything else. It looks just like the real stuff, and telling that it isn't could require rather vast amounts of input from a human just to be mostly sure. And we don't have the ability to authenticate where it came from. (There is no content authentication solution that will work at scale. No matter how you try to get humans to "sign their work", people will always work out how to automate it, and then it's done.) So the one good and solid signal that helps in email is gone for general web content.

I don't judge this as a winning scenario for the defenders here. It's not a total victory for the attackers either, but I'd hesitate to even call an advantage for one side or the other. Fighting AI slop is not going to be easy.


> but spammers mostly lost that arms race

I'm not saying this is impossible but that's going to be an uphill sell for me as a concept. According to some quick stats I checked I'm getting roughly 600 emails per day, about 550 of which go directly to spam filtering, and of the remaining 50, I'd say about 6 are actually emails I want to be receiving. That's an impressive amount overall for whoever built this particular filter, but it's also still a ton of chaff to sort wheat from and as a result I don't use email much for anything apart from when I have to.

Like, I guess that's technically usable, I'm much happier filtering 44 emails than 594 emails? But that's like saying I solved the problem of a flat tire by installing a wooden cart wheel.

It's also worth noting that if I do have an email that's flagged as spam when it shouldn't be, I then have to wade through a much deeper pond of shit to go find it. So again, better, but IMO not even remotely solved.


I’m not sure what you’ve done to get that level of spam, but I get about 10 spam emails a day at most, and that’s across multiple accounts, including one that I’ve used for almost 30 years and had used on Usenet, which was the uber-spam magnet. A couple of newer (10–15 year old) addresses which I’ve published on webpages with mailto links attract maybe one message a week, and one that I keep for a specialized purpose (fiction and poetry submissions) gets maybe one to two messages per year, mostly because it’s of the form example@example.com, so easily guessed by enterprising spammers.

Looking at the last day’s spam,¹ I have three 419-style scams (widows wanting to give away their dead husbands’ grand pianos or multi-million-euro estates) and three phishing attempts. There are duplicate messages in each category.

About fifteen years ago, I did a purge of mailing-list subscriptions, and very little comes in now that I don’t want. The most notable exception is a writer who’s a nice guy, but who interpreted my question about a comment he made on a podcast as an invitation to be added to his manually managed email list; given that it’s only four or five messages a year, I guess I can live with that.

1. I cleaned out spam yesterday while checking for a confirmation message from a purchase.


I'm having a hard time finding reliably sourced statistics here, but I suspect you're an outlier. My personal numbers are way better, both on Gmail and Fastmail, despite using the same email addresses for decades.

> but spammers mostly lost that arms race.

The advertising in your mail isn't Google's.


I hope this trend accelerates to force us all into grass-touching and book-reading. The sooner, the better.

Books printed before 2018, right?

I already find myself mentally filtering out Audible releases after a certain date unless they're from an author I recognize.


Yes, but not quite as far as you imply. The training data is weighted by a quality metric: articles written by journalists and Wikipedia contributors are given more weight than Aunt May's brownie recipe and corpo-blogspam.

> The training data is weighted by a quality metric

At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight. They're not even filtering the comically low-hanging fruit like those YouTube channels which post a new "product review" every 10 minutes, with an AI-generated thumbnail and an AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet, and is of course always a glowing recommendation, since the point is to get the viewer to click an affiliate link.

Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?


> Google has been playing the SEO cat and mouse game forever, so can startups with a fraction of the experience be expected to do any better at filtering the noise out of fresh web scrapes?

Google has been _monetizing_ the SEO game forever. They chose not to act against many notorious actors because the metric they optimize for is ad revenue, and those sites were loaded with ads. As long as advertisers didn’t stop buying, they didn’t feel much pressure to make big changes.

A smaller company without that inherent conflict of interest in its business model can do better because they work on a fundamentally different problem.


> those YouTube channels which post a new "product review" every 10 minutes, with an AI generated thumbnail and AI voice reading an AI script that was never graced by human eyes before being shat out onto the internet

The problem is that, of the signals you mention,

• the highly-informative ones (posting a new review every 10 minutes, having affiliate links in the description) are contextual — i.e. they're heuristics that only work on a site-specific basis. If the point is to create a training pipeline that consumes "every video on the Internet" while automatically rejecting the videos that are botspam, then contextual heuristics of this sort won't scale. (And Google "doesn't do things that don't scale.")

• and, conversely, the context-free signals you mention (thumbnail looks AI-generated, voice is synthesized) aren't actually highly correlated with the script being LLM-barf rather than something a human wrote.

Why? One of the primary causes is TikTok (because TikTok content gets cross-posted to YouTube a lot.) TikTok has a built-in voiceover tool; and many people don't like their voice, or don't have a good microphone, or can't speak fluent/unaccented English, or whatever else — so they choose to sit there typing out a script on their phone, and then have the AI read the script, rather than reading the script themselves.

And then, when these videos get cross-posted, usually they're being cross-posted in some kind of compilation, through some tool that picks an AI-generated thumbnail for the compilation.

Yet, all the content in these is real stuff that humans wrote, and so not something Google would want to throw away! (And in fact, such content is frequently a uniquely-good example of the "gen-alpha vernacular writing style", which otherwise doesn't often appear in the corpus due to people of that age not doing much writing in public-web-scrapeable places. So Google really wants to sample it.)


Reminds me of a Google search I did yesterday: “Hezbollah” yields a little info box with headings “Overview”, “History”, “Apps” and “Return policy”.

I’m guessing that the association between “pagers” and “Hezbollah” ended up creating the latter two tabs, but who knows. Maybe some AI video out there did a product review of Hezbollah.


> At least in Google's case, they're having so much difficulty keeping AI slop out of their search results that I don't have much faith in their ability to give it an appropriately low training weight.

I've noticed that lately. It used to be that the top Google result was almost always what you needed. Now at the top is an AI summary that is pretty consistently wrong, often in ways that aren't immediately obvious if you aren't familiar with the topic.


I don’t think they were talking about the quality of Google search results. I believe they were talking about how the data was processed by the wordfreq project.

I was actually referring to the data ingestion for training LLMs, I don't know what filtering or weighting might be done with wordfreq.

Google has those problems because the company's revenue source (Ads) and the thing that put it on the map (Search) are fundamentally at odds with one another.

A useful Search would ideally send a user to the site with the most signal and the least noise. Meanwhile, ads are inherently noise; they're extra pieces of information inserted into a webpage that at best tangentially correlate to the subject of the page.

Up until ~5 years ago, Google was able to strike a balance between these two: you'd get results with some ads, but the signal generally outweighed the noise. Unfortunately, from what I can tell from anecdotes and courtroom documents, somewhere in 2018-2019 the Ads team essentially hijacked every other aspect of the company by threatening that yearly bonuses wouldn't be given out unless everyone kowtowed to the Ads team's wish to optimize ad revenue, and there's been no sign of it stopping, since there's no effective competition to Google. (There's like, Bing and Kagi? Nobody uses Bing, though, and Kagi is only used by tech enthusiasts. The problem with copying Google is that you need a ton of computing resources upfront and are going up against a company with infinitely more money and the ability to ensure users don't leave its ecosystem; go ahead and abandon Search, but good luck convincing others to give up, say, their Gmail account, which keeps them locked to Google, and Search will be there, enticing the average user.)

Google has absolutely zero incentive to filter generative AI junk out of its search results beyond the amount that's damaging its PR, since most of the SEO spam is also running Google Ads (unless you're hosting adult content, Google's ad network is practically the only option). Their solution therefore isn't to remove the AI junk, but to reduce it just enough that a user won't get the same type of AI junk twice.


My understanding is that Google Ads are what makes Google Search unassailable.

A search engine isn't a two-sided market in itself but the ad network that supports it is. A better search engine is a technological problem, but a decently paying ad network is a technological problem and a hard marketing problem.


It certainly feels like the amount of regurgitated, nonsensical, generated content (nontent?) has risen spectacularly specifically in the past few years. 2021 sounds about right based on just my own experience, even though I can't point to any objective source backing that up.

Upvoted for "nontent" alone: it'll be my go-to term from now on, and I hope it catches on.

Is it of your own coinage? When the AI sifts through the digital wreckage of the brief human empire, they may give you the credit.


I do hope it catches on! I did come up with this myself, but I really doubt I'm the only one — and indeed: Wiktionary lists it already with a 2023 vintage:

https://en.wiktionary.org/wiki/nontent


Ooh I like “nontent.” Nothing like a spicy portmanteau!

I personally have yet to see this beyond some slop on YouTube. And I am here for the AI meme videos. I recognize the dangers of this; all I am saying is that I don't feel the effect yet.

I'm seeing it a lot when searching for some advice in a well-defined subject, like, say, leatherworking or sewing (or recipes, obviously). Instead of finding forums with hobbyists, in-depth blog posts, or manufacturers advice pages, increasingly I find articles which seem like natural language at first, but are composed of paragraphs and headers repeating platitudes and basic tips. It takes a few seconds to realize the site is just pushing generated articles.

Increasingly I find that for in-depth explanations or tutorials Youtube is the only place to go, but even there the search results can lead to loads of videos which just seem… off. But at least those are still made by humans.


There's been a ton of low-rent listicle writing out there for ages - certainly not new in the past few years. I admit I don't go on YouTube much and don't even have a TikTok account, so it's possible there's a lot of newer lousy content I'm not really exposed to.

It seems to me that the fact it's so cheap and relatively easy for people with dreams of becoming wealthy influencers to put stuff out there has more to do with the flood of often mediocre content than AI does.

Of course the vast majority don't have much real success and get on with life and the crank turns and a new generation perpetuates the cycle.

LLMs etc. may make things marginally easier but there's no shortage of twenty somethings with lots of time imagining riches while making pennies.


Looking forward to watching perfect generated videos. We need so much more power and so many more chips, but it's completely worth it. After that? Maybe generated videogames. But the video stuff will be awesome, changing video-dominated social media content forever. Virtual headsets will finally become useful, generating anything you want to see and letting you jump through space and time.

SEO grifters have fully integrated AI at this point, there are dozens of turn-key "solutions" for mass-producing "content" with the absolute minimum effort possible. It's been refined to the point that scraping material from other sites, running it through the LLM blender to make it look original, and publishing it on a platform like Wordpress is fully automated end-to-end.

Or check out "money printer" on GitHub: a tongue-in-cheek mashup of various tools that takes a keyword as input and produces a YouTube video with subtitles and narration as output.

Aunt May's brownie recipe (or at least her thoughts on it) is likely something you'd want if you want to reflect how humans use language. Both news-style and encyclopedia-style writing represent a pretty narrow slice.

That's why search engines rated them highly, and why a million spam sites cropped up that paid writers $1/essay to pretend to be Aunt May, and why today every recipe website has a gigantic useless fake essay in front of their copypasted made up recipes.

I hate how looking for recipes has become so… disheartening. Online recipes are fine for reputable sources like newspapers where professional recipe writers are paid for their contributions, but searching for some Aunt May's recipe for 'X' in the big ocean of the internet is pointless — too much raw sewage dumped in.

It sucks, because sharing recipes seemed like one of those things the internet could be really good at.


There seem to be quite a few recipe sharing sites around - e.g. allrecipes.com.

And they're all flooded with low-effort trash, rendering them useless.

The only remaining reliable source - now that many newspapers are axing the remaining staff in favour of LLMs - is pre-2020 print cookbooks. Anything online or printed later must be assumed to be tainted, full of untested sewage and potentially dangerous suggestions.


The wife and I use the internet for recipe ideas... but we hardly ever follow them directly anymore. We're not formally trained chefs, but we've been home cooks for over 20 years now, and so many of them are self-evidently bad, or distinctly suboptimal. The internet chef's aversion to flavor is a meme with us now; "add one-sixty-fourth of a teaspoon of garlic powder to your gallon of soup, and mix in two crystals of table salt". Either that or they're all getting some seriously potent spices all the time, and I'd like to know where they shop, because my spices are nowhere near as powerful as theirs.

It's not just online recipes, but also cookbooks written for the Better Homes & Gardens crowd. The ones who write "curry powder" (and mean the yellow McCormick stuff, which is so bland as to have almost no flavour) or call for one clove of garlic in their recipe.

I joke with folks that my assumption with "one clove of garlic" is that they really mean "one head of garlic" if you want any flavour. (And if the recipe title has "garlic" in it and you are using one clove, you’re lying.)


If the recipe has "garlic" in the title, I'm budgeting 1/2 head per serving.

Well there's https://www.allrecipes.com/author/chef-john/ on that particular site.

I absolutely love Chef John. Great recipes and the cadence of his speech on YouTube (foodwishes) is very soothing, while he cooks up something amazing. If you're a home cook I highly recommend his recipes and his channel.

Chef John is the best.

It's interesting to search for recipes in other languages and not find junk as we do in English.

I read Spanish and Italian fluently and stumble my way through Japanese (with translation). It's easier to find a good recipe in these languages, provided you can find the ingredients or substitutes.


I wish more people presented recipes like cooking for engineers. For example - Meat Lasagna https://www.cookingforengineers.com/recipe/36/Meat-Lasagna
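The gist of the format, as a rough mock-up (not the actual recipe): ingredients run down the left, and the columns to the right show how they combine at each stage:

    Ground beef      | brown  | simmer into sauce |
    Tomato sauce     |        |                   | layer & bake
    Lasagna noodles  | boil   |                   |
    Ricotta + egg    | mix    |                   |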

I love the table-diagrams at the end. I've never seen anything like that until now and it really seems useful for visualization of the recipe and the sequence of steps.

Interestingly, my wife has been writing recipes on post-it notes in that same style for years, with arrows instead of tables. And she's the opposite of an engineer: a psychologist (interested in people vs. objects).

When I saw them, they blew my mind. Compact to store and easy to understand.


Combined with pictures for what each step should look like. I had a few of these pages printed out back in the '00s for some recipes that I did.

And here I thought my defacement of printed recipes by bracketing everything that goes together at each stage was just me. There are, well, maybe not dozens but at least two of us! Saves a lot of bowls when you know without further checking that you can, say, just dump the flour and sugar, butter and eggs into the big bowl without having to prepare separately because they're in the "1: big bowl" bracket.

Depends on what you’re doing. For the best cookies, you want to cream the butter with the sugar, then add the eggs, and finally add the flour. If you’re interested and can find one, it’s worth taking a vegan baking class. You learn a lot about ingredient substitutions for baking, about what the different non-vegan ingredients are doing that you have to compensate for… and it does something that I’ve only recently started seeing in non-vegan baking recipes: it separates the wet ingredients from the dry ingredients.

That is, when baking, you can usually (again, exceptions for creaming the sugar in butter, etc.) take all of your dry ingredients and mix/sift them together, and then you pour your wet ingredients in a well you’ve made in the dry ingredients (these can also usually be mixed together).


No need to cakesplain; that was an example with three ingredients off the top of my head. Very, very obviously the exact ingredients and bracket assignments vary depending on what you are making.

But for shortbread or fork biscuits those three could indeed all go in the bowl in one go (but that one admittedly doesn't really need a bracket because the recipe is "put in bowl, mix with hands, bake").


OK, but what I said is true regardless of SEO, and SEO had already fed back into English before LLMs were a thing. If you only train on those subsets, you'll also end up with a chatbot that doesn't speak in a way we'd identify as natural English.

Yet. Give it time. The LLMs will train our future children.

I'm sure they already are.

The current state of things leads me to believe that Google's ranking system has somehow been too transparent for the last 2-3 years.

The top of search results is consistently crowded by pages that obviously game ranking metrics instead of offering any value to humans.


Prior to Google we had Altavista and in those days it was incredibly common to find keywords spammed hundreds of times in white text on a white background in the footer of a page. SEO spam is not new, it's just different.

Don't forget Google's AdSense rules, which penalized useful, straightforward websites and mandated that websites be full of "content". Doesn't matter if the "content" is garbage, nonsense rambling, and excessive word use - it's content, and much more likely to be okayed by AdSense!

> ML/LLM is the second iteration of writing pollution. The first was humans writing for corporate bots, not other humans.

Based on the process above, naturally, the third iteration then is LLMs writing for corporate bots, neither for humans nor for other LLMs.


Indexability is orthogonal to readability.

It should be, but sadly it’s not.

It's crazy to attribute the downfall of the web/search to Google. What does Google have to do with all the genuine open web content, Google's source of wealth, getting starved by (increasingly) walled gardens like Facebook, Reddit, Discord?

I don't see how Google's SEO rules being written or unwritten has any bearing. Spammers will always find a way.


> And yet LLMs were still fed articles written for Googlebot, not humans.

How do we know what content LLMs were fed? Isn't that a highly guarded secret?

Won't the quality of the content be paramount to the quality of the generated output, or does it not work that way?


We do know that the open web constitutes the bulk of the training data, although we don't get to know the specific webpages that got used. Plus some more selected sources, like books, of which, again, we only know that they are books, not which books were used. So it's just a matter of probability that a good amount of SEO spam was in there as well.


