This internal tension between chasing AI tooling and avoiding AI-generated content is just a prelude to the bigger shift of search engines getting reinvented around generated results instead of found results.
Fast-forward 10+ years: for knowledge-related queries, search is going to be more about generated results personalized to our level of understanding, results that at best quote pages and more likely just reference them in footnotes as primary inputs.
These knowledge-related queries are where most content farms, low quality blogs, and even many news sites get traffic from today. If the balance of power between offense (generating AI content) and defense (detecting AI content) continues to favor offense, there will be a strong incentive to just throw the whole thing out and go all-in on generated results.
The big question is how incentives play out for the people gathering knowledge about the world, which is the basis for generated results. Right now many/most make money with advertising, but so do content farms, and more generated results means more starving of that revenue source. For a portion of the info people want (e.g. factual info, not opinions, guidance, etc.), Wikipedia is an alternative fact- and context-gathering model, but if search relies on it more, that will strain Wikipedia's governance model and make it more of a single point of failure.
At some point in the scenario you outline, there will be so much ML-generated content referencing other generated content that the (already muddled) primary sources will be lost in the mix, and ML will be forced to synthesize sensible, meaningful content out of conflicting "truth".
What I'm trying to get at is that ML is currently terrible at contextualizing information, but in the future the successful knowledge-query engine will be the one that does the best job of wading through the explosion of digital content and pulling out the parts that are coherent and more connected to reality.
Such a program will need to contextualize information so that knowledge is interconnected into a larger, coherent model of reality.
Easier said than done, we'll see if that is even possible. This is getting towards AGI.
My sense is a whole chunk of the internet is going to just get washed out with the tide as it gets demonetized by search, and that's the stuff most likely to be ML-generated. Meanwhile search engines like Google will start to laser in on fewer and fewer higher quality sources that have some signal, both through manual tuning and automation.
I think it's important to distinguish between ML-assisted and whole-cloth ML generation too. I imagine many high-quality content sources are going to be using ML-assisted writing tools; for instance, at my old news company (EdSurge) the journalists spent most of their time gathering info and constructing the story, which was the real meat of the work. The writing and wordsmithing was not them operating at the top of their skillset; it was pretty inefficient manual labor, especially the drafts. So from a content-production perspective, ML could automate a lot of that away while letting the journalists tune the narrative and edit the output for better nuance - in other words, editing vs producing.
This also assumes a single output, but take the edtech lens, couple it with the power of ML, and you could easily have any given news article exist in nearly infinite varieties based on reading level, language, and even past knowledge (by including or removing context). As someone interested in new mediums and innovating on news, this is absolutely exciting to me. A really interesting question here is whether there's an opportunity for collaboration between expert sources and search engines, such that for a given set of domain-specific queries the expert sources are in charge of fine-tuning the generated outputs.
Completely agree about contextualization. The automated detection of ML content (the defense part) needs to be tuned toward extracting and analyzing the claims being made in the content; a lot of ML content is nonsensical, or introduces wild, novel claims that lack evidence and alignment with other sources. Some big questions for the analysis are whether those claims are corroborated by other, higher-quality sources, whether relevant context is included, and whether it's bringing anything new to the table.
In this kind of setup redundant sites will be ruthlessly demonetized by search. As you say, this also drives us in a direction toward expert systems with some probabilistic measurement of the truth of claims and completeness of context, which may also have actual experts at the controls of some of this (like above). This of course raises a lot of questions where unorthodox ideas may fit in this model.
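To make that concrete, here's a minimal sketch of what claim-level scoring might look like. Everything in it is a toy assumption: the claim extractor and the "trusted corpus" are stand-ins for real NLP and retrieval components, not any existing system.

    TRUSTED_CORPUS = [
        "The Eiffel Tower is in Paris.",
        "Water boils at 100 degrees Celsius at sea level.",
    ]

    def extract_claims(document: str) -> list[str]:
        # Toy stand-in: treat each sentence as one claim. A real system
        # would need actual claim extraction, which is an open NLP problem.
        return [s.strip() + "." for s in document.split(".") if s.strip()]

    def is_corroborated(claim: str) -> bool:
        # Toy stand-in: exact match against trusted sources. A real system
        # would use retrieval plus some notion of semantic entailment.
        return claim in TRUSTED_CORPUS

    def credibility_score(document: str) -> float:
        claims = extract_claims(document)
        if not claims:
            return 0.0  # nothing checkable is itself a signal
        return sum(is_corroborated(c) for c in claims) / len(claims)

    print(credibility_score("The Eiffel Tower is in Paris. The moon is cheese."))
    # -> 0.5: half the claims line up with higher-quality sources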
I'm not an ML or search expert though; my experience is more on the production side.
This sounds scarily similar to descriptions of brains dealing with incomplete information. Good thing brains aren't keen on rationalizing prior beliefs in the face of new evidence or believing spurious things.
I’m not as convinced that “coherent” or “more connected to reality” will be how the value of mass market content will be measured in the future, which makes some of these issues less of a problem for the content generator.
I flip flop between seeing it as inevitable, and overly techno-optimistic. Can definitely see it both ways.
In the arms race between ML-copy generation and its detection, eventually the edge becomes "does this make sense in a coherent world-view", so there should be a natural pressure towards that outcome.
But maybe we're just not good enough at building these tools, and the incentives will align with an endless onslaught of digital sludge.
There's also the problem of: how do legitimate fiction, poetry, love stories, and art fit into a www where only the most logical is allowed to be seen? It could be a logically puritanical nightmare.
> it will strain Wikipedia's governance model and become more of a single point of failure
Why can't the Wikipedia model be adapted to a federated, directly community-run approach? This works well enough for services such as email, Matrix, and the fediverse. There's gravitation towards centralized hosting services, but that's largely behavioral - the model itself works perfectly fine with lots of small players.
Heavyweight multimedia can be a challenge but text content itself is quite easy to serve up from very small devices.
Wikidata and Wikibase, the software it runs on, are expanding it into a "federated" network of knowledge stores. You can, for example, link from wikidata to some other instances of the software and query them transparently. It's used by a few museums that want to keep control over the description of the art, but link to wikidata for, say, the artists' place of birth. Then, you can use their query interface (SPARQL etc.) to get all the art they have from "artists born in a city that had a commercial port in 1960" without the museum ever having to enter more information than "this is a van Gogh".
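For a flavor of it, here's a runnable sketch from Python using the SPARQLWrapper library against the public Wikidata endpoint. wdt:P170 ("creator") and wd:Q5582 (Vincent van Gogh) are real Wikidata IDs; a museum's own Wikibase would federate by wrapping part of the query in a SERVICE <their endpoint> block. The rest is illustrative:

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("""
      SELECT ?work ?workLabel WHERE {
        ?work wdt:P170 wd:Q5582 .   # creator: Vincent van Gogh
        SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
      }
      LIMIT 10
    """)
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["workLabel"]["value"])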
I'm not sure such a complex system of content moderation & process could so easily be federated; I'd love to see a federated system equivalent to Wikipedia out there, or one that has successfully transitioned governance like that. Email spam, by comparison, is far, far less nuanced. Regardless, it'd be a new effort and wouldn't just work or be trusted in year one. It'd need to be tested and refined over years and years, like Wikipedia has.
I could see several nonprofits and news brands, along with Wikipedia, shifting to become a set of sources of truth for different and likely overlapping topics. That shift could happen gradually, as part of a mix of monetization-incentive changes and more explicit 'here's how you participate' coercion (medium-is-message stuff; see FB's pivot to video, or the YouTube algo changing how content creators create). The generated result that Google spits out could reference those and note them as inputs, including noting where they disagree or choose to include or exclude certain context.
I still don't see how these ideas get funded without Google directly funding them, where algorithm transparency comes in play, etc.
The problems have absolutely nothing to do with technology in the streaming-video sense of the word. It's about trust, versioning, truth, reality, and similar concepts.
Maybe Linux development is a good example, with some centralisation but other power centers of varying size connected, such as distributions or non-kernel software projects.
But then again, Wikipedia is already federated into hundreds of local communities and horizontal projects like Commons or Wikidata, and it works not nearly as terribly as one would think.
I mean it in the education and learning sense, that for knowledge to be understood it must be tailored to the person asking the question. Answering a question about physics ideally should look different for someone with a high school education vs someone who's in the field, or from someone who only has passing interest vs one that expresses interest through continued engagement. All info is filtered in some fashion, else a simple query would lead to an impossibly dense tome ('baking an apple pie from scratch').
As an edtech person who started a news company to try to solve this, one of my pet peeves about news is that it doesn't adopt an education mindset/approach despite its mission being an educational one at its core. Part of that is due to search being what drives traffic, which means there's no relationship with the reader. But the root is internal: news and journalism are still too stuck in the newspaper-mind of bespoke, inverted-pyramid one-off articles that act as islands of information, delivered to an audience they have no ongoing relationship with and who have no means of getting follow-up. It's a cultural holdover bound up in the relationship to its old medium, not what optimally teaches.
Generated results need not be in the form of articles; that's one of the constraints that ML lifts. You could give people some generated text but also drill-down tools, letting them expand on a simple summary, or read in a dialectic or debate style in areas where facts are not settled or multiple viewpoints exist. Just gotta expand your POV a bit: ML-generated results represent an opportunity for incredible leverage in educating the world.
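As a rough sketch of those "nearly infinite varieties" with today's tooling (GPT-3's completion API; the prompt, engine choice, and parameters are just illustrative, not a product):

    import openai  # assumes OPENAI_API_KEY is set in the environment

    ARTICLE = "..."  # the single canonical article a newsroom produces today

    def render_for(audience: str) -> str:
        prompt = (
            f"Rewrite the following article for {audience}. "
            f"Keep every fact; adjust vocabulary and background context.\n\n"
            f"{ARTICLE}"
        )
        resp = openai.Completion.create(
            engine="text-davinci-002",
            prompt=prompt,
            max_tokens=512,
            temperature=0.3,
        )
        return resp["choices"][0]["text"]

    # One story, many renderings: reading level, expertise, prior knowledge.
    for audience in ("a middle schooler", "a domain expert",
                     "someone with no background who wants lots of context"):
        print(render_for(audience))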
Anyway if someone in ML at OpenAI/Google/etc wants to get deeper into this, please reach out!
With Diablo 2 getting a remaster this year, it is sad to see that old websites with comprehensive information from 20 years ago have gradually been replaced by the same three or four SEO-optimised content farms at the top of Google search results.
Same with the Chrono Cross remaster this weekend: I'm already inundated with content farms spewing (bad) rehashes of tutorials from 20 years ago. We've come a long way from text-only, 200-page GameFAQs .txt guides to more-ads-on-the-page-than-text listicles that split paragraphs of content across a slideshow of multiple pages in order to run even more ads.
Holy shit this comment resonates so much with me. It used to be that you could Google something about your game and find a 150KiB text file article (no ads, no useless pagination/SEO filler) which is a treasure trove of useful information, e.g. https://gamefaqs.gamespot.com/gba/921905-pokemon-emerald-ver... .
Nowadays I'm almost happy if I find at least one sentence of useful information in an entire "article" recommended by Google.
I think the above game example shows erosion on both sides: articles have gone to shit, and the amount of care, attention, and depth in most games has gone down a lot.
A big part of this appears to be that Google massively downranks any non-HTTPS sites; most old sites, which are just documents and have no accounts or writing functionality, have had no reason to adopt HTTPS.
If you run an older internet reference, please add https. The search engines also have an obvious 'recent' bias, but everything we can do to prevent the loss of accessibility of older reference material helps.
I think they would have seen this coming, they just couldn't do anything about it.
The rise of social media has moved most human-generated content onto a few platforms, most of which don't let Google index them (Facebook being the main one).
They probably need to figure out how to pull information out of youtube (which there seem to be some initial efforts at).
However they might already be in trouble there, as TikTok is growing a huge archive of 20 second answers to common queries.
If I'm understanding the quoted interview correctly, Google is talking about AI generated spam - like when you ask GPT-3 to write you an article about XYZ topic and it spits out 500 words of well-written, plausible sounding gibberish - that you throw up on your website to try to rank in the search engines.
However, they seem to be leaving open the possibility of AI-assisted writing, where a human comes up with the information and guides the AI as it puts that information into words.
> From our recommendation we still see it as automatically generated content. I think over time maybe this is something that will evolve in that it will become more of a tool for people. Kind of like you would use machine translation as a basis for creating a translated version of a website, but you still work through it manually.
> And maybe over time these AI tools will evolve in that direction that you use them to be more efficient in your writing or to make sure that you’re writing in a proper way like the spelling and the grammar checking tools, which are also based on machine learning. But I don’t know what the future brings there.
In my opinion GPT-3 has already reached the point of being useful for this purpose - there are several GPT-3 based apps that do exactly what he's describing.
A friend (university science lab technician, in a field that didn't pay like CS) would make approx. $2/hour extra money in the evenings, from some Web site that directed her what topics to write articles about. Google the topic, rapidly skim, distill/rephrase it to a certain word count. (I suggested she could make a lot more working in a cafe, but she was physically exhausted from being on feet all day in lab, with lots of moving around heavy objects.)
I guessed it was used for filler content for SEO sites.
A question is whether that company would save money by using "AI" text generation, when they were paying real humans so little, for arguably higher quality.
> A question is whether that company would save money by using "AI" text generation, when they were paying real humans so little, for arguably higher quality.
You can power a GPU for a few cents of electricity per day, so I think the answer is a resounding yes.
My girlfriend was struggling with an assignment for university yesterday, so I put the questions into a GPT-3 site and the essays it wrote were better than either of us could have written by hand...
I put in a moderately specialist topic I'm reasonably expert in. What it wrote was highly convincing. While I wouldn't be particularly impressed, I absolutely couldn't distinguish it from a mediocre human. It's really quite terrifying to think of the internet filling up with this garbage.
Is it garbage if it writes better than most of the humans from the developing world that are currently exploited to churn out meaningless blog spam?
Also, it would have worked perfectly for my girl's homework assignment, but she has morals and decided she couldn't ethically submit it. She did use the output for ideas on what to write, though.
It writes very convincing pop-sci articles about high energy photons and the like. I see no way how even a proper AI can distinguish it from similar articles written by journalists.
I asked GPT-3 for song requests and it was like: “The first caller requested _____ by ______. The second caller requested _____ by _____. The third caller…”
It didn’t seem to know any musical artists or songs. It was weird. When I asked for song suggestions the best it could do is write bad poetry.
If it's simply a quality issue, then at the point that AI generated content becomes better than human generated content, will Google ban human generated content?
If captchas are any indicator, AIs will soon be better at convincing Google their content is human generated than humans are.
Note that, to win, the machines just need to produce approved content at a faster rate than the humans. If some AI can spew text at a measly 1 GB/s, and Google blocks 99% of it, 10 MB/s still gets through. In 2016, there were ~4.6B pages on the internet. If each page contains 10 KB of text, then the AI in my example produces 1,000 pages per second. There are ~31 million seconds per year, which means it would churn out 31B pages per year, or the entire 2016 internet every few months.
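A quick sanity check of the arithmetic:

    rate = 1e9                     # 1 GB/s of generated text
    through = rate * (1 - 0.99)    # Google blocks 99% -> 10 MB/s leaks
    page = 10e3                    # 10 KB of text per page
    pages_per_sec = through / page            # 1,000 pages/s
    pages_per_year = pages_per_sec * 31e6     # ~31e6 seconds/year
    print(f"{pages_per_sec:,.0f} pages/s, {pages_per_year:.2e} pages/year")
    # -> 1,000 pages/s, 3.10e+10 pages/year: the ~4.6B-page 2016 web
    #    roughly every couple of months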
My conclusion? Internet search is screwed. Maybe people will start paying for curation.
So Google would need to build a discriminator that detects machine-generated content. It will be interesting to see these discriminators fight the generators of other big companies.
I'll be in a front-row seat watching that show, if it happens. Perhaps in the future we'll have to deal with discriminators that approximate some originality index. It will be fun fighting with those algorithms just to interact with the internet as a normal user (to some extent we already do - proving that you're human is becoming more and more tedious).
In practice I think it's less an arms race Google has a chance in and more that Google now has another policy reason to banhammer prolific and irritating blogspammers.
Google isn't yet effective at detecting blogspam generated by naive scripts that simply swap words in the source material for other words in a thesaurus. I don't think they're going to start picking up continuity errors, factual errors or "weirdness" in GPT-3 - which often satisfies human readers - any time soon.
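For reference, the naive scripts really are about this simple; a toy version (the synonym table is invented for the example):

    import random

    # Toy article "spinner": swap words for thesaurus entries. Real spam
    # operations use full thesauri, but the technique is the same.
    SYNONYMS = {
        "big": ["large", "huge", "sizable"],
        "fast": ["quick", "rapid", "speedy"],
        "problem": ["issue", "difficulty", "trouble"],
    }

    def spin(text: str) -> str:
        return " ".join(
            random.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in text.split()
        )

    print(spin("a big problem with fast indexing"))
    # -> e.g. "a sizable trouble with rapid indexing"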
Google engineers can't even filter out GH or SO scraped sites like gitmemory, nor offer a way to let users block these sites. I'm not sure we should expect them to handle more advanced techniques like detecting word swaps any time soon.
The main reason I haven't tried this yet is I am concerned what would happen if every result is one of these spam sites that has fooled Google engineers for so long. Do I get a suspiciously blank page?
The other reason is I started using a search engine that cracked the code to let users rank and block sites (kagi) so I don't have to worry about it.
That isn't how it works. Long term refusal to do it isn't evidence their engineers are incapable of it. Blacklisting sites is trivial. They've always been able to do it and they do it routinely to this day.
The reason those results remain in your searches is because sending people to those sites makes Google money, and by definition that is of greater value to Google than the quality and relevance of your search results as a non-paying user.
I am unsure about that, but using GPT-3 in that manner would certainly trigger OpenAI's automatic or manual systems; it would violate their ToS and your account would be locked.
Try searching for a speciality business in the middle of nowhere (where there are zero providers).
I recently did this, and ended up in a warren of sites like "10 best Widget repair in Smallsville, MO", where all the addresses were for nonexistent businesses in other states, with sort of plausible names tangentially related to widgets. (High plains frobulator sales, Mesa, AZ, or whatever).
I'm pretty sure a deep learning algorithm produced that corner of the internet. It definitely contained lots of referral links and phone numbers. Never managed to reach a human or figure out what the scam was.
I've been noticing something similar with "phone-number look-up" sites.
I put that in quotes as I've entered numbers I know and find that what's being presented is apparently randomly-generated IDs.
The overt use is checking when unknown numbers call (or going through call history) and trying to get a sense of who the number is associated with.
The general location is usually plausible (so the generators are relying on local exchange numbers), but the specific street addresses and subscriber names appear generated, and specific street addresses may not actually exist.
These are usually lead gen sites. Done well they can take over smaller areas outside of major metro areas and drive sales leads to those larger companies that can travel vs the small local guy. I live in a small community and it has always been a problem usually started by the fact that the small local guy already has more work than he can handle and doesn't see the need to be online.
You need to be computationally irreducible through self-attention and conscious self-interaction.
By the same criterion we may need to consider the wattage cost of intelligence when granting rights to AI. Their kernels need to be aligned with the evolutionary wisdom encoded in our highest motivations.
Exactly like the one we have now: A cacophonous cesspit filled with the mental diarrhea of a million bots sharting their opinions into the void in a desperate bid to either sway your opinion to their cause (whether commerce or politics), or bury you under such an avalanche of bullshit that you remain paralyzed with indecision.
And the result of this? Genuine conversation progressively retreats from the internet at large as it becomes overrun with the intellectual flatus of the bot wars, and people move further and further into silos in which bot behavior can be spotted and eliminated.
Welcome to the future. How do you like it, gentlemen? All your posts are belong to us.
Really? I think HN has even fewer safeguards against AI-generated content than Google. No offense to dang and co., and I'm sure there's more going on there than I'm aware of. But still, I'm pretty sure it would be trivial to set up an account and use GPT-3 to produce the content. The only reason I'd suspect this isn't happening is because there isn't a strong financial incentive to do so. In other words, HN avoids the spam because it's still too small to matter.
While it is true that there are no good incentives for a bot to comment here, I believe that the best safeguard against AI-generated content is that here such content will be noticed easily.
I have seen a couple of times on some other even smaller technical forums what appeared to be sequences of comments generated by an AI, from a fake account (until the accounts were banned).
There could not have been any kind of financial gain from those actions, so the conclusion was that someone was testing the capability of their AI to carry a conversation on such subjects.
The messages posted were remarkably coherent, but in the end it was obvious that their source could have been only either an AI or a human with mental problems.
What made those messages stand out was that even if they contained mostly correct information, the same that might be found e.g. in technical magazines, there was always some weird disconnect between the answer and the message to which it replied.
Apparently, in most cases the AI failed to identify exactly what the previous poster's point of view was. The reply referred to various items mentioned in the parent message, but instead of addressing the essential points it latched onto non-relevant things mentioned more or less accidentally, or it tried to argue for or against some points as if contradicting the previous poster, when they had actually argued in the same direction.
They will be downvoted into oblivion. And then banned, shadow-banned, hell-banned, and a few more creative ways of banning. The account, the IP, the site, and perhaps anyone that a Bayesian filter put in the same bucket.
Yes, which will produce things that are stylistically similar to HN comments, but without any connection to external reality beyond the training data and prompt.
That might provide believable comments, but not things likely to be treated as high-quality ones, and not virtual posters that respond well to things like moderation warnings from dang.
I think it's possible to train the bot to write one-liners, but one-liners are usually downvoted here.
Replies to comments are difficult, so I guess the bot must just write somewhat related top level comments.
It will accumulate a lot of "Did you RTFA?" and also polite requests for clarification, and it will be suspicious that the bot never replies or replies with nonsense.
If every comment has a link to http://www.example.com then people will notice and start flagging.
I guess it can fly under the radar posting a few bland top-level comments per week in megathreads with a few hundred comments, but I can't imagine how this could be used to gain anything.
But HN also adores tangents, and that gives plenty of room for GPT-3 to succeed. You could affect topical discussion merely by distracting it in several directions.
Above a certain percentage it's going to poison human-generated content too. You will have to discern between AI-generated content, AI-influenced human-generated content, and genuine human-generated content.
One could argue it's already happening. How many of the people we talk to every day get their facts from SEO-spam websites and Google instant answers (which often source their content from such websites)? Even if we avoid AI-generated content, we might be getting fed it by proxy.
Human filtering of AI creativity might work, but deepfakes don't fit that model. Personally, I've made it a pattern to unsubscribe from channels that use deepfakes, ever since I saw Internet Historian using them and possibly adding to the already crippling confusion regarding UAPs. - IH is not a credible source anyway, but you can easily use the clips they produced without attribution.
I think the world would be a better place if everybody followed a similar pattern. The only reason to use deepfakes is if the victim whose identity is being stolen is not cooperating with you. - It's a new way to violate a person's integrity and their right to agency in our already fading grasp of reality. You could probably gaslight your girlfriend with it, if you are incomprehensibly evil.
>You could probably gaslight your girlfriend with it, if you are incomprehensibly evil.
I'm honestly a bit more concerned with people gaslighting courts.
The technology isn't going away, unfortunately. Society will have to adapt to these new invasive norms, as they already have time and again in the past.
In the process of remembering that little tidbit, I was reminded of a short story I read once that circulated without attribution for a few decades: Terry Bisson's "They're Made Out of Meat"[1], published in OMNI in 1990. "They're meat all the way through."
> In theory GPT-3 could fill the entire internet with approximately true but false information.
Much of the internet is currently full of not-even-approximately true (and often maliciously false) information, so I’m not particular worried about that.
Given that the AIs are trained on human-generated content to be human-like without having any understanding of the content, I would expect approximately the same, but even less correct.
Google shouldn't be the arbiter of what's OK or not on the internet. They use AI to take away all human recourse with the company, but want to tell others not to use it? It's a pretty laughable position. Good luck trying to detect GPT-3 and the like when you compare it with non-native speakers of a language. Are you suddenly going to just block them too?
If an AI can generate high-quality content, why is that any worse than human-generated content? Humans generate a ton of trash content; it's not inherently better.
Those same models on copilot are generating useful and often good code for me on a daily basis. If someone told me it's not OK to use any copilot generated code as unethical/wrong I'd laugh in their face. It's basically saving me a google search to find snippets/examples of things I wasn't sure of.
Maybe that's the threat? If we have access to AI directly (ala copilot) then I am googling less.
> If someone told me it's not OK to use any copilot generated code as unethical/wrong I'd laugh in their face. It's basically saving me a google search to find snippets/examples of things I wasn't sure of.
I'm not concerned with ethics, but both Google Search and Copilot are some version of processed human-generated content, and I think the problem both have is how to get the content for cheap and still sell the processed version. Google pays for its stuff with advertising. Copilot just munged the contents of GitHub and didn't pay anything. But if AI generators continue down the path of scanning free content and selling the processed result, someone will start working on how to spam them as well - there's a project for you: upload a whole bunch of general-purpose code to GitHub with spam in the comments.
The issue here is that the Turing test is getting ever harder - aren't you afraid of the time when you are going to start regularly failing it yourself?
Furthermore: did you think about what this means for people with, e.g., a serious cognitive disability?
Your statement argues for prohibiting them from participating in public discourse wholesale.
In the maximal case where bots are indistinguishable from humans, by definition the odds of failing the Turing test are 50/50. A coin flip per comment. Yeah, I can put up with that.
Realistically, I don't think I will ever start regularly failing it. Go ahead, call it arrogance or hubris or whatever. I've looked at the originality and predictability metrics on a large corpus of my online comments. Let's just say, you won't be replacing me with gpt-3 or similar.
This is diverging from the strict definition of "Turing test", but I've had the bad luck of having my YouTube comments silently deleted (or worse, shadowbanned), and it's so infuriating (because it seemingly punishes some of the 3 ways in which I've tried to make my comments better) that I mostly stopped posting them.
Not to mention that in cases where it results in a direct ban, you better hope that odds are better than 50/50 ! (New users are particularly susceptible.)
Which seems fair. I do not want to be bombarded by AI generated media and content that can be created at infinite volume and speed. Google is taking a massive step here for the good.
I mean, I don’t know about you, but I’d be fine with an infinite stream of AI generated media… once it’s no worse than human-authored content. Which is where Google is setting the bar. If the AI can generate articles that are as good as the ones humans are writing, why not index those?
Of course, this suggests a corollary: that Google shouldn’t be indexing the output of content farms where the writing and derivative-ness is strictly worse than what you’d get from an AI article generator. :)
What about high-quality AI content? What if someday the content generated by AI is actually more useful/higher quality than the human-generated content, and we just shoot ourselves in the foot by excluding it because it's AI generated? Seems discriminatory to me!
It's closer to whataboutism than a valid criticism. In this particular case it doesn't address at all whether AI-generated content should be downranked in search results (it probably should, since it is mostly low-quality clickbait, the same way human-generated listicle trash is also low-quality clickbait).
Using AI generally isn’t a problem for google. I don’t care if they use it to sort my calendar. The dilemma here is that they use AI to create and put their own auto generated content above search results.
Like all those countless examples where someone got a picture of a criminal mashed up with their own Wikipedia page because they happen to have the same family name. [1] Disregarding the shame factor, it showcases low-quality auto-generated content created only to put the Google name above others in search results.
The same applies to those auto-generated opening hours of restaurants etc., which are incorrect all the time. Again, low-quality AI content put above others and bypassing PageRank.
Speech police complaining about others using techniques to feign authority on things they themselves have no authority over is pretty fucking hypocritical.
> "For us these would, essentially, still fall into the category of automatically generated content which is something we’ve had in the Webmaster Guidelines since almost the beginning.
> And people have been automatically generating content in lots of different ways. And for us, if you’re using machine learning tools to generate your content, it’s essentially the same as if you’re just shuffling words around, or looking up synonyms, or doing the translation tricks that people used to do. Those kind of things.
> My suspicion is maybe the quality of content is a little bit better than the really old school tools, but for us it’s still automatically generated content, and that means for us it’s still against the Webmaster Guidelines. So we would consider that to be spam."
So are people reacting thinking this is a new policy or...?
I think it's difficult to marry this policy with the other stuff Google is claiming to be dramatically transformational. If Google came out and said "Hey, this self-driving stuff isn't really dissimilar from traditional driving assist." there would be some questions to answer. Which of course, is what the regulators actually say.
Those examples seem highly different? Waymo can and does pick you up and take you where you need to go in SF and Austin, with no one behind the wheel. I'm not aware of any traditional driver assist package that's capable of even approaching that.
That AI generated content is automated SEO, which is mostly a bunch of heuristics to please Google's ranking algorithms. Blame the algorithms, not the people who reverse engineer them.
Why blame the algorithms rather than the spammers who reverse engineer them to get their affiliate link littered ai generated "reviews" to the top of the search results.
In the arms race between Google and spammers, honest websites sometimes get caught in the crossfire. For some reason lately lots of people want to blame google for this and not the spammers.
Why must it necessarily be either/or, black/white, Google/spammer?
Why couldn't both sides reasonably be to blame? Google enables the commodification of search results. Yes, they claim they want to show the best result. But best for whom? I don't believe Google is a neutral party, as they earn more if they promote sites that lead to additional advertising clicks. How should users know that there are often better results on pages 3 or 4 and below?
And then there are the ones trying to make a living with minimal effort. No need to create great content; good enough is sufficient. As long as the content is optimized and the page receives relevant traffic from search, the revenue via affiliate links is secured.
In this arms race the other sites lose. They are the collateral damage between the two fighting parties. And people lose too: they lose great content and the diversity of the net.
Thankfully Google operates in a free market where it is possible to compete by building a better search algorithm.
I don't buy into the position that all this great content and diversity of content is lost because of Google's algorithm. It isn't as discoverable as the mainstream content that Google search returns, but that's no different from a world where Google didn't exist, except that in such a world people would have different content-discovery habits.
Complaining that Google doesn't surface your preferred set of "great content" is like complaining that prime time cable TV only shows lame sitcom reruns. It's true, but it doesn't prevent you from buying HBO, or Prime Video.
The difference is that a lot more people know Prime Video, Netflix, and the like. Most people, when asked, might be able to say that Microsoft also has a search engine. But for a majority of people, Google (or Facebook) is the internet. To them, these different kinds of search engines, as well as different kinds of content, are an unknown unknown.
They can't look for what they don't even know exists.
I know how to surface these sites and enjoy them. But in this regard I probably belong to a tiny minority of internet users.
I know there have basically always been people better equipped for knowledge discovery, and also that knowledge was not free: learning was time-consuming and expensive. Only in recent decades has this changed.
Still, the tools, as well as the awareness that there is more out there beyond one's current knowledge, imho need to be distributed more broadly.
> Thankfully Google operates in a free market where it is possible to compete by building a better search algorithm.
That is a sentiment I don't share. With market shares above 90 percent and the reign of Android, the network effects are very strong in my opinion, making it at least a few orders of magnitude more difficult than when Google entered the game. So while it might be possible, it is economically probably not a great bet to make.
Google makes a large chunk of its money from ads. If you create some SaaS where users can automate their site's SEO by using GPT or whatever, and it is reasonably priced, you could end up competing with Google Ads. Google wants to prevent that.
I can't see any other reason for spam clone farms to outrank the sites they clone except either more incompetence than I'd ascribe to Google, or the fact that the spam farms are full to the gunwales with ads which earn money for Google.
Because we wouldn’t be here if Google had supported other means of natural search engine inclusion through quality metrics vs their current only option of pay to play or cheating.
Instead of focusing on AI generation, Google should focus on content farms in general. Those will have a different signature than non-spam AI content. There's legitimate uses for such content, the most obvious being research and artistic projects, not just content farms.
Even then... a lot of financial news is written algorithmically. A peer in my Comp Ling master's program started a business doing just that, and sold it to or consulted on its use with Bloomberg, or maybe Forbes - it was more than a decade ago. That stuff was mostly mixed rules-based and stochastic methods, so I'm sure it's come much further. But it's a valuable service because so much of this news really is very formulaic: "Stock $X came in at $Y earnings for quarter $Z, beating analyst target earnings of $A. This is in line with the general rally in the $B sector, following similar estimate-beating earnings in stocks $C, $D, and $E."
Dozens of these a day (or more) could certainly look like a content farm.
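The simplest version really is just template filling; a toy sketch with invented figures:

    # Toy rules-based earnings story, in the spirit of the template above.
    def earnings_story(ticker, eps, estimate, sector, peers):
        verb = "beating" if eps > estimate else "missing"
        return (
            f"{ticker} came in at ${eps:.2f} earnings for the quarter, "
            f"{verb} analyst target earnings of ${estimate:.2f}. This is "
            f"in line with the general rally in the {sector} sector, "
            f"following similar results from {' and '.join(peers)}."
        )

    print(earnings_story("ACME", 1.42, 1.31, "industrials", ["FOO", "BAR"]))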
I am translating German content to Swedish and back again. Google has never noticed. My back-translated content even corrects all typos and grammatical errors. It ranks better than the original content, and I see no way they could ever detect it.
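The round trip is trivial to script, e.g. with the public Helsinki-NLP MarianMT models (the exact model IDs are my assumption; check the Hugging Face Hub):

    from transformers import pipeline

    # German -> Swedish -> German round trip ("back translation").
    de_sv = pipeline("translation", model="Helsinki-NLP/opus-mt-de-sv")
    sv_de = pipeline("translation", model="Helsinki-NLP/opus-mt-sv-de")

    def round_trip(text_de: str) -> str:
        sv = de_sv(text_de)[0]["translation_text"]
        return sv_de(sv)[0]["translation_text"]

    print(round_trip("Maschinelles Lernen verändert die Suche."))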
Amazes me how much HN talks about the declining quality of search results, and then in this thread weeps and wails with crocodile tears the moment Google tries to do something about it.
Ah, yes, a glimpse of the world where sites exist to placate Google into delivering some traffic, and Google is the supreme judge of whether a site gets a piece of the web.
Might be funny to see the dissonance when people begin to make decent art or memes with AIs, and Google will have to either drop the site, or ignore its own rules.
Neal Stephenson's "Fall; or, Dodge in Hell" has a subplot driven by a news-story-generating AI run amok. The AI uses reader engagement as its metric (along with really great natural-language capabilities), but it has no truth metric. So, by the time of the book, the AI's widely believed stories have constructed an alternate reality for many people, reinforcing the present polarization of media in the US.
That presents quite a threat profile. It's far more pernicious than SEO script kiddies doing whatever passes for keyword stuffing in 2022.
I hope the search crawler folks at Google are working hard to detect that sort of thing and prevent it from getting into indexes. Let's hope Neal Stephenson isn't as right about that threat as Arthur C. Clarke was about geosynchronous communications satellites.
That's basically what the Q larp, or realrawnews, is already doing today. If it's profitable to do, it doesn't take a rogue AI to make up nonsense and spread it to the masses.
So it's OK for Google to generate <title> tags, Google ad headlines, and email assistance with AI, while news agencies have robo-generated articles for a while now. The real issue is that Google knows this is gaming SEO, and they will struggle against tools that will only get better.
To give them the benefit of the doubt, perhaps they didn't do much about it because they didn't want to ban SEO sites and be accused of outright anticompetitive practices. Then they decided "AI-generated content" is a hammer that's more palatable to antitrust regulators.
it is very important to cover the legal aspect of such things now
otherwise some dumb people will want ai generated content to be allowed everywhere
i'm pretty sure they are making their move now because it causes them a ton of issues with YouTube
allowing ai generated content would mean allowing ai generated comments on youtube, which is already happening and causes a lot of issues
if you can't tell what is AI generated and you use comments/discussions/likes/dislikes in your algorithms for ranking videos, then it'll be very easy for 3rd parties to push and play the game, including for ad revenue
the inevitable will come sooner rather than later, get ready for your online passport!
Great point! I think with captchas, some humans might be able to identify obviously made up facts and very poor sentence structure to tag a text as low quality. But I don't think you'd be able to give a reliable assessment on whether a text was AI generated or not. Especially if it's a random group of people (unfamiliar with modern content generation) filling out the captchas.
Captchas can be made adversarial. Not sure if it's a good idea to try that with text since we don't know how humans react. Maybe that's what the Phenomenon is about?
The irony here is that Google absolutely prioritizes AI-generated content when it comes from Google itself. Sure, infoboxes contain content from other websites written (usually) by humans, but the selection and excerpting process is AI-driven and there are plenty of examples of that process going spectacularly wrong.
And infoboxes are usually the first result in any Google search, before the ads even.
There's an interesting variant of the Turing test here: develop an AI sufficiently intelligent to distinguish human content from AI generated content.
I might be wrong, but I think this might be a more difficult task than generating convincing dialogue: it's very easy to generate text that statistically resembles human writing, and generators trained on certain topics (i.e. some science niche) might be impossible to flag.
You approximately just described generative adversarial networks (GANs). One of their training metrics is how difficult it is for the "other" network (the adversary) to train itself to distinguish the real from the generated.
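A minimal sketch of that setup in PyTorch, on toy vectors rather than text (text GANs are much harder in practice; this just shows the adversarial objective):

    import torch
    import torch.nn as nn

    dim = 16
    G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, dim))
    D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(1000):
        real = torch.randn(64, dim) + 2.0   # stand-in for "real" data
        fake = G(torch.randn(64, 8))        # "generated" data

        # Discriminator step: label real as 1, generated as 0.
        d_loss = (bce(D(real), torch.ones(64, 1)) +
                  bce(D(fake.detach()), torch.zeros(64, 1)))
        opt_d.zero_grad()
        d_loss.backward()
        opt_d.step()

        # Generator step: make the discriminator label fakes as real.
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad()
        g_loss.backward()
        opt_g.step()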
There is also the question of what will happen when every GPT-X is trained on data that was in large part generated by GPT-(X-1) instead of humans. Maybe to avoid that endogamy we will draw a line so they are trained only on data created before the AI "eternal September".
If Google spent less time worrying about their guidelines and more time hiring people that don’t wish to turn the company into their political playground they could certainly be a lot further along than where they are
perhaps they don't want the datasets that they use to train the AI to be watered down by other AI-generated content. after all, a lot of the text data was sourced from the internet.
I think the dataset and the context are very important. I don't think Google will give a bad score if the content is generated by AI and it is helpful for people.
We often talk about this problem as if we expect Google to solve a social and economic people problem purely through math alone.
The appeal of open-to-all search as a way of navigating the web was that there was a huge long tail of interesting stuff that would be hard to manually index and categorize.
If the long tail of interesting stuff has fully drowned in a sea of spam and crap, I'm not sure that it still makes that much sense over something smaller but human-curated.
Perhaps the trick would be human curation with extensive and always-evolving AI tools to speed up the curation. You have to get past the filter to get in, versus being in by default unless you're blatant enough to get banned. There is a layer of human judgement in addition to the algorithm's score of the content, and that additionally gives you an extra training signal for the algorithms themselves - the humans should be able to help direct the development of the algorithm to fight spam more preemptively.
Would a mix of that give us a bigger internet than the entirely-manual "web directory" days of 1997 or so, but a less-shit-filled one than today's?
Seems like there should be a carve-out for content clearly marked as AI-generated. I wonder if the SEO hit is why I haven’t been able to find too many others posting funny GPT-3 outputs like (Disclosure incoming:) mine.
And if I can peddle my blog’s ‘content’: here’s Trump announcing he’s (not?) trans.
> I have been so concerned about this issue, I've held back from telling you that I'm transgender. I'm not transgender, but I'm so proud of the transgender community and their rights.
> The Democrats have been so horrible to the transgender community. They've made them live in these little closets, ya' know? And they've tried to force them to use certain restrooms.