> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
I find this to be a particularly interesting problem in this whole debacle.
Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the actual API used didn't really exist.
Now imagine a myriad of fake blogs using ChatGPT under the hood to generate blog entries explaining how to solve commonly searched-for problems, which then get spidered and fed into ChatGPT 2.0. Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.
I think image generation should be less vulnerable to this: all images need tagging to be useful, "ai generated" is a common tag that can be used to exclude old outputs from being reingested, and with artwork precision doesn't matter so much anyway. If people like the results, then it doesn't matter that much that something isn't drawn realistically.
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
This is exactly what I don't like about Copilot, maybe even more than the IP ethics of it. If it really succeeds, it's going to have a feedback loop that amplifies its own code suggestions. The same boilerplate-ish kind of code that developers generate over and over will get calcified, even if it's suboptimal or buggy. If we really need robots to help write our code, to me that means we don't yet have expressive enough languages.
Today's programmers can see Copilot output and probably think "well, that's not optimal". Fast forward five years: new CS grads are using Copilot 3.0 and are used to the specific auto-completes Copilot gives for certain tasks, as they may never have needed to go beyond some of the more basic suggestions.
It “feels” like an older programmer seeing a younger web dev and going “you’re wasting MB of memory!”
While it's true the web has gotten slower in many regards, and indeed memory may have been wasted, business value creation typically doesn't care if a few MB are sub-optimally wasted, while the previous generation does.
This is a whole class of problem, it seems to me. A semi-automated approach can seem fine when there's an executive function over the top of it, exercised by someone who has not just the knowledge, but current and honed knowledge. But over time, what's keeping that knowledge current and honed?
The airline industry has talked about this, of course, and the adoption of robotic surgery has opened up a whole new training problem, because its escape hatch when it goes wrong or can't complete a procedure is often "complete the surgery manually". Which is fine on Day 1 of robotic surgery, but what about Day 2, when surgeons typically don't have hundreds of similar procedures under their belts? And where the only time they're called on to exercise the skill is in difficult edge cases?
We have basically turned driving a standard transmission into a weird old person quirk or niche enthusiast skill in the United States. If an automatic transmission required a similar manual fallback or check, how well would that work? Well, it would work fine if basically everyone already had a lot of practice driving a manual--but now? It wouldn't work well at all. Of course, automatic transmissions don't fail like that, and are a lot better at switching gears than AI assistants are at generating code. I worry about the semi-automated approach to self-driving, where the driver may not actually have currency with their driving skill, and where--in the instance that it's necessary--a driver has to react to more complicated situations (they don't have practice with the simple ones, and they have to react not to a hazard but to their car's failure to react to a hazard).
And as also pointed out by others, it not only requires effort, but knowledge, and that knowledge will be systematically degraded the more AI-ish code generation is used.
OTOH, those with super-diligent hacker attitudes will start to learn how to find the flaws in generated code and optimize it, thus leveraging the tool, but most will just move on to the next task/ticket as soon as the AI-ish code passes the unit tests. So, super-leveraging AI-generated code will be rare.
> And as also pointed out by others, it not only requires effort, but knowledge, and that knowledge will be systematically degraded the more AI-ish code generation is used.
How is that different than the plague of junior devs we've always had? Devs will get more senior by identifying and correcting issues in AI code rather than code from their peers. Seems OK, like we just got a whole lot more coding capacity.
That, and I bet the generic boilerplate code that Copilot produces is on average much better than the boilerplate code the average dev might use. So it removes the lower-level work.
It is. There are a lot of problems with copilot, but one magical thing is the way it rewards best practices like thinking through what a function is supposed to do before starting to write it.
If you write a good comment describing name, inputs, outputs, logic, and exceptions, the "generate code from comment" capability is kind of amazing. I'm a terrible, hacky programmer, and it has wholly converted me to documentation-first.
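To make that concrete, here's a rough sketch of what I mean (the function and its details are made up purely for illustration, and the body is just the sort of thing a Copilot-style completion might produce from a comment this specific):

    # retry_fetch: download a URL with exponential backoff.
    # Inputs: url (str), max_retries (int, default 3), base_delay (float seconds, default 0.5).
    # Output: the response body as bytes.
    # Logic: attempt the request; on failure sleep base_delay * 2**attempt and retry.
    # Raises: RuntimeError after max_retries consecutive failures.
    import time
    import urllib.request

    def retry_fetch(url, max_retries=3, base_delay=0.5):
        for attempt in range(max_retries):
            try:
                with urllib.request.urlopen(url) as resp:
                    return resp.read()
            except OSError:
                time.sleep(base_delay * 2 ** attempt)
        raise RuntimeError(f"gave up on {url} after {max_retries} attempts")

Writing the comment first forces me to pin down the inputs, outputs and failure cases before any code exists, which is exactly the discipline I used to skip.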
The only Copilot-type system I'd consider using is one that did the opposite - generated comments from my code, so my lazy butt doesn't have to write them manually.
I think that can surely be the case, but just like with any AI there may need to be manual review to assess the optimal way to go about task X, and retraining.
I can see this go both ways:
Boilerplate code being a great generic solution to a set of problems, but a more seasoned programmer may say "that works, but for our use case the trade-offs don't make sense"
Or alternatively, “this code wasn’t something I knew I could do in language X, and it’s far more efficient”
But enterprise FizzBuzz is a demonstration that exactly that phenomenon will happen without AI, merely with books, YouTube videos, or blogs (the calcification) and cargo culting (the lazy application).
For real, most developers repeatedly copy/paste things they find on the internet without understanding how they work. So the AI isn’t doing anything special that humans don’t already do.
Source: Bitter old SRE who has had to fix many broken software patterns ripped out of StackOverflow and the like.
As someone in SEO, I've been pretty disgusted by the desire for site owners to want to use AI-generated content. There are various opinions on this, of course, but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.
Love the idea of having AI-Free declarations of content as it could / should help to differentiate organic content from generated content. It would be very interesting if companies and site owners wished to self-certify their site as organic with something like an /ai-free.txt.
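To illustrate, a self-certification file could be as simple as a few declared fields; the format below is entirely made up, since no such standard exists:

    # /ai-free.txt (hypothetical, self-declared, not independently verifiable)
    policy: no-ai-generated-content
    scope: /blog/ /guides/
    assistive-tools: spellcheck, grammar-checker
    contact: webmaster@example.com

Like robots.txt, it would only be as trustworthy as the site operator who publishes it, but it would at least give crawlers and readers something explicit to point to.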
Curious, isn't SEO the thing that ruined search to a big extent? A 1000-word article where the actual answer needs a fraction of that size. Or interesting content buried because it is not "SEO optimized". Or companies writing blog content and making it look helpful while actually shilling and highlighting their own product. Plus tons of other things.
So now you need something like ChatGPT to cut through the noise?
> So now you need something like ChatGPT to cut through the noise?
I once employed a journalist to write about the pros and cons of wedding insurance. Just to give you a clue how long ago this was, it was a unique article at the time.
Many years later, every article you will read about wedding insurance (there will be many thousands) is around 90% similar in style and content to the one I paid for.
I dare say you could use any other topic as an example: one thoroughly researched original and many thousands of similar copies. I can't see how ChatGPT is not going to make this situation much worse.
My guess is that ChatGPT is going to solve the SEO spam problem by changing the way we search for things. Instead of searching for webpages that have information about a topic, we're going to ask an AI.
It'll tell you what the pros and cons of wedding insurance are, and because eventually it'll have access to your calendar, it'll tailor the answer to the fact that you're having a destination wedding during monsoon season in that area.
Once this kind of AI search becomes the default way we look for information, there won't be a point to creating SEO spam anymore. It'll create other problems, of course, but that's the way it goes with new technology.
> Instead of searching for webpages that have information about a topic, we're going to ask an AI.
I think this is probably true for some people, the same sort of person who sees something on Facebook and assumes that it's true. [1] But there are quite a lot of people for whom "according to whom?" is the next question after being told something factual. For them, I think search's job is to find relevant sources and get out of the way.
But I think even finding out is a long way away. The main thing that ChatGPT has nailed is glibness. It produces text that sounds authoritative, whether or not it's correct. And it's often incorrect. People may try ChatGPT search out of novelty or because it feels human. But if they depend on it and feel the real-world impact of a confidently wrong answer, they're going to treat it as a human that's untrustworthy. A blowhard, a liar, a fool. So I'm sure the major search players are going to be very cautious rolling out chat-like things. Google has spent decades building up consumer trust, and they don't need a zillion articles about people whom a too-confident chat steered wrong.
> But I think even finding out is a long way away. The main thing that ChatGPT has nailed is glibness. It produces text that sounds authoritative, whether or not it's correct. And it's often incorrect.
Perfect example I found a while back: If you know Chinese or Japanese, ask ChatGPT for the stroke order of a certain character and watch how confidently it tells you how to draw a nonsensical scribble.
Even when you ask it for the stroke order of 一, it will tell you to draw a vertical line!
> But if they depend on it and feel the real-world impact of a confidently wrong answer, they're going to treat it as a human that's untrustworthy. A blowhard, a liar, a fool.
Maybe that's good. There are glib liars on the net, and not all of them are ChatGPT. If people learn to be skeptical of fine-sounding content on the net, maybe they'll apply it to humans, too.
The most basic technique for estimating trust in an answer today is to check the sources, see who said it, why, who is agreeing with them, etc.
If some AI just spits out an answer without any references, you cannot do that. You either have to blindly trust the answer, which will be dangerous, or you'll have to blindly distrust the answer, at which point the AI will be useless.
Regardless of spam, there is another fundamental issue with AI: accountability. Any text you've read had a real person behind it with real intentions, malice and greed or honesty and exploration. It would be very difficult to hold an AI accountable for any offence committed against accuracy or honesty. With a person, you can slowly get to the bottom of it and develop a relationship. AI will muddy the waters: instead of writing and thinking themselves, people will offload everything to an AI that can do it faster, and they'd even evade any responsibility for their text, since they could claim that the AI might have inaccuracies and does not reflect their own opinion.
I don't think what you're saying really applies to SEO articles, though. If you don't get wedding insurance because you read some SEO article that recommends against it, even if the advice is clearly bad, can you really hold them accountable? It's tough for me to imagine you'd win that lawsuit.
> With a person, you can slowly get to the bottom of it and develop a relationship.
With this kind of content (with most content on the internet, I'd argue), you really can't.
I'd agree that there is no significant difference in accountability between content farms and AI. But I think that's burying the lede, because what makes HN so appealing to many is the personally written blogposts, the experts in specific fields, and the silly hobby projects. AI has the unique ability to mimic all of that for highly specific audiences with significantly lower cost and effort. This would impact the curation efforts of many user-generated websites and could create a new status quo where a nobody is presumed guilty of AI-generated articles, SEO hacking, and upvote manipulation before anyone entertains the thought that a natural person could be making their first concerted effort to build a reputation and prove their honesty.
For now, HN is a bastion of conversation across many countries and domains.
What is the difference in accountability and ability to get to the bottom of who's responsible, between AI and someone just hiring an offshore content farm to write crap content?
>Once this kind of AI search becomes the default way we look for information, there won't be a point to creating SEO spam anymore
What about creating new, relevant, interesting content that no-one will ever see because search no-longer exists? Will site owners continue to do it knowing AI will crawl it, and never send traffic? Probably....not?
How do LLMs of the future get better if website owners are no longer incentivized to create content?
ChatGPT is trained on those same SEO spam blog posts, so I'm not sure how it solves the fundamental problem. People aren't going to create content for corporate giants to vacuum up as training material for their AI
There are various ways of looking at it and, of course, all sorts of people involved. My focus has always just been to encourage people to treat the search engines as an index that will pick up and rank quality content if you treat them that way. It still works very well today.
I don't see the point. There's lots of old content out there that won't get tagged, so lacking the tag doesn't mean it's AI generated. Meanwhile people abusing AI for profit (eg, generating AI driven blogs to stick ads on them) wouldn't want to tag their sites in a way that might get them ignored.
Does use of a search engine violate the "No AI" covenant with oneself?
Variation on the Turing Test: prove that it's not a human claiming to be a computer.
Modeling premises and Meta-analysis are again necessary elements for critical reasoning about Sources and Methods and superpositions of Ignorance and Malice.
Maybe this could encourage the recreation of the original Yahoo! (If you don't remember, Yahoo! started out not as a search engine in the Google sense but as a collection of human curated links to websites about various topics)
I consider Wikipedia to be a massive curated set of information. It also includes a lot of references and links to additional good information / source materials. Companies try to get spin added and it's usually very well controlled. I worry that a lot of ai generated dreck will seep into Wikipedia, but I am hopeful the moderation will continue to function well.
It would basically depend on how serious the search engines are about wanting created vs. generated content. Generated content is ultimately going to be a regurgitation.
>but I got into SEO out of interest in the "organic web" vs. everything being driven by ads.
If the owner of an SEO site wants to use AI for "content generation", doesn't that mean they didn't care about the human-generated content in the first place?
Seems like a choice between garbage and slightly more expensive garbage. What is interesting or organic about that? Back in the day, people used to put things on their websites because they cared about it and wanted to say it.
I read this more as meaning that they work for a legitimate company or two that is trying to organically improve their search results without stooping to nasty tricks.
But if said company doesn't care about the human-generated content quality in the first place (evidenced by the fact that they're willing to replace it with AI generation), how is that not also a "nasty trick" by your standard?
At the end of the day they just want to optimize search results. And the Overton window of acceptability currently allows "human generated SEO content" but not "AI generated SEO content". It's just an arbitrary rule.
I think the difference between created content vs. generated content is much more than arbitrary. If the leading search engines truly don't care about the difference, I'd predict a future where you have more bifurcation between a human web and a bot-driven one.
Don't worry, it won't be small shops doing it. It will be the majors, if that's where the money is.
To quote Yann LeCun:
Meta will be able to help small businesses promote themselves by automatically producing media that promote a brand, he offered.
"There's something like 12 million shops that advertise on Facebook, and most of them are mom and pop shops, and they just don't have the resources to design a new, nicely designed ad," observed LeCun. "So for them, generative art could help a lot."
How could it possibly help unless there were some independent verification mechanism though? If there's a motivation to lie about the content being "organically generated" because that's what search users prefer to find, then clearly people will.
And it's hard to imagine what that verification process would look like given current technology.
What about AI-assisted writing? E.g. improving style, grammar, readability, making explanations clearer and better structured? Especially for non-native writers this is a challenge, and not many can hire an editor or even a proofreader. I wonder if such use gets "penalized" by search engines the same way AI-generated content might?
> I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the actual API used didn't really exist.
So the next question has to be: Was this still the right answer?
I've personally had plenty of instances in my programming career where the code I was working on really needed functions which were best shopped out to a common API.
To avoid interrupting my flow and to better inform the API I'd be designing for this, I just continued to write as if the API did exist. Then I went on to implement the functions that were needed.
Perhaps the bot was right to presume that there should be an API for this. You might even be able to then prompt ChatGPT to create each of the functions in that API.
Exactly, that there is an end to the rabbit hole is a limitation of today's models.
If something does not exist, it should be generated on the spot. GPT5 should check for the existence of an API and if it exists, test and validate it. If it fails tests or doesn't exist, create it.
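As a rough sketch of the loop I mean (only the existence check below is real; the generation step is a stub standing in for a capability that doesn't exist today):

    # Hypothetical "check, test, otherwise create" loop.
    import importlib.util

    def module_exists(name):
        # True if a package with this name is importable in the current environment
        return importlib.util.find_spec(name) is not None

    def resolve_dependency(name):
        if module_exists(name):
            return f"use existing package: {name}"
        # imagined fallback: have the model generate the missing library,
        # run tests against it, and only accept it if they pass
        return f"generate, test, and validate a new implementation of: {name}"

    print(resolve_dependency("json"))        # stdlib, exists
    print(resolve_dependency("parallaxqt"))  # made-up name, triggers the fallback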
Well, this is ChatGPT, not Copilot, so I'd assume that OP was looking for a snippet using a public library rather than an internal API. In that context, suggesting you use an API that doesn't exist is just wrong.
I've definitely done this with Copilot, though—it will suggest an API that doesn't actually exist but logically should in order to be consistent, and I'll go create it.
That seems more like misplaced expectations. Someone may have given you the impression that Copilot was supposed to do things like that, whereas that expectation doesn't seem to be present for you in relation to ChatGPT.
However, as far as I know, the OpenAI team has not made it a goal to have ChatGPT only produce functional code using existing APIs. So I'm not sure we can call that an incorrect answer based on context.
If the API it demonstrated using logically should exist, it seems like the right answer is still to just go create it.
If I asked a coworker how to do X in framework Y, giving me the name of a function that should exist but doesn't is not a correct answer. If they told me "well, that's the function that should exist, you should just go submit a PR to framework Y", I would stop asking that coworker for help.
The difference I was drawing between ChatGPT and Copilot wasn't that Copilot has functionality ChatGPT doesn't, it's that it has context ChatGPT doesn't, so it suggests things related to internal APIs. In a conversation with ChatGPT it would be very difficult to get help with internal APIs, hence my assumption that OP wasn't referring to APIs they have any control over.
I was not suggesting the bot was telling the user to go make a PR to an open source framework but rather that they could create a library that contains those functions if that was the logical thing to do. Which is why I asked if that actually seemed the longer term right thing to do.
While I can easily agree that Copilot is probably the better tool for such questions, it is not clear from the parent comment whether their prompt to ChatGPT was asking to create code to do X or create code to do X using only existing publicly available libraries.
It's not immediately obvious that the bot failed to understand the question or that the answer was an example of the bot failing to understand the programming language. It could easily be that the user had an implied expectation of usability in a context they did not give to the bot.
That scenario is more like you asking a random person on the street, who happens to know Y framework, how to do X in that framework. Your coworker can be expected to get that you are looking for an answer that gets your current task done faster than you would be able to do without their assistance. The person on the street could not reasonably be expected to get that unless you give them that context.
>I've already run into scenarios where ChatGPT generated code that looked perfectly plausible, except that the actual API used didn't really exist.
Yes! I remember generating a seemingly reasonable R script except that the library that it called to do most of the work didn't exist! It was like code from an alternate dimension!
I asked if there were any Open Source libraries that implemented a certain algorithm. It gave me links to 3 different GitHub repos, none of which existed.
This take strikes me as a little off. Programming languages are language. Unlike natural languages, they are also based on context-free grammars. So an understanding of programming languages should actually be easier for even a general language model to incorporate than natural languages.
We can expect a bot like this to not really get context clues in natural language, although they seem to be getting better at that, but context is not necessary to have a true and functional understanding of a programming language. That was the point of creating such languages.
Using an API that doesn't exist but logically should once the use cases are demonstrated is not an example of lacking understanding, it is an example of advanced insight. A human might have invented the necessary functions inline with the rest of the project but if they are expressing functionality that is commonly applicable, then a common API for those functions is what the humans would eventually converge upon to clean up the code from the initial inline implementation, making it more consistent and readable.
My answer was mildly tongue in cheek, and I see where you’re going.
On the other hand, one of the other posters asked "to generate a parallax effect in Qt/QML. It simply used a QML Element with the name Parallax". Is this an insight, or is this answering "yes, I could" to "could you pass me the salt?". Maybe the line between the two is a fine one, and I just didn't realize that yet.
In general, it feels like copying part of the question (“write parallax code”) in the answer is the easy part of the task…
Saw that, and yea, that's totally a fall-back cop-out type answer.
I was pretty much just questioning everyone dismissing the whole category of answers like the R script one.
To me there does seem to be some nuance here that's worth noticing. Some examples of this type of response are indeed too cheap and can be chalked up to lack of training data or something.
But in other cases it's actually not immediately obvious whether the answer the user got was their fault for not specifying that they are expecting code that works without additional supporting libraries.
A language model can't reasonably be expected to understand an expectation of usability or fitness for purpose in a context the user didn't specify.
The failure to realize that the API doesn’t exist and therefore the code won’t work in practice, however, is a major lack of insight and understanding.
Agreed. That does seem to be an example of a language model failing to understand the context in which the question was asked.
The user was implicitly expecting code that would function when executed immediately and as written, with no additional supporting libraries included. This is different from code that would function correctly when executed after having downloaded relevant existing packages. Which is different from code that would function if executed alongside additional supporting code from private libraries the user might not have access to. Etc...
Yet any of those answers fit for the same prompt, "create an R script that does such and such". The bot's lack of insight is on the likely intention behind the prompt rather than on the requested language. I'd say if it produces any code that fits the syntax and grammatical structure of the requested language, that's enough to say it understands the language.
Can confirm that this happened to me when I asked ChatGPT to generate a parallax effect in Qt/QML. It simply used a QML Element with the name Parallax.
Yeah, a few times when I ask for a reference to something outlandish, it generates a perfectly realistic looking paper alongside a DOI link that's completely made up. Both the paper and the link do not exist!
I have to wonder, were any of the hypotheses in those papers plausibly viable areas of inquiry?
Perhaps it could be useful if they train the bot to identify cases like this and state that no such references exist but also provide a thesis or suggest a line of study that would produce such a reference.
> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
Every discussion on AI takes the example of ChatGPT and its inherent flaws, but AI-generated content doesn't have to be dull and low quality.
One question that bothers me is: does it really matter? If AI-generated content is on par with human-made or even better, does it matter anymore that an AI generated it?
Maybe it's the sentimental value, empathy, fidelity?
If an AI had written Mozart's Requiem would it lessen its interest, its beauty?
I don't think AI has to be low-quality for GP's concern to be valid.
Humans get inputs from a large variety of sources, but if an AI's input is just text, then there's the potential for AI's input to mostly consist of its prior output. Iterate this, and its model could gradually diverge from the real world.
The equivalent in human society is groupthink, where members of a subgroup get most of their information from the same subgroup and end up believing weird things. We can counter that by purposely getting inputs from outside of our group. For a text-model AI, this means identifying text that wasn't produced by AI, as the article suggests.
AI is often limited in ways we aren't, but it also trivially consumes more than we can in a lifetime.
"Just text" in the case of GPT-3, but also it is trained on a token count exceeding the number of times an average synapse in a human brain will fire in a lifetime.
It can still get biases from the training set; while I'm not sure if "group think" is quite the right phrase, it does seem to "want" everyone to get along even when asking it to create multiple characters engaged in a conflict. (Or perhaps that's just an artefact of it estimating that I want that). Reminds me — in a bad way — of Jules Verne's From The Earth to the Moon.
Consuming and producing vast amounts of information is what makes the problem potentially worse than human groupthink. It enables the situation where AI is mostly consuming information produced by AI. That's the feedback loop I'm calling "groupthink." It could end up diverging from reality in the same way that chaotic functions diverge widely due to tiny differences in the initial conditions. The same problem exists if the AI consumes other types of information that it also produces.
Humans are more grounded by having a presence in the physical world. Plus they draw on various sources considered more reliable, like formal training, scientific papers, textbooks, quality journalism, etc. If we want AI to be reliable, we'll need it to put the most weight on similar sources, and maybe even have some real-world presence with sensors and robots.
Eventually AI will be able to produce new reliable information itself. But for that, it would have to recognize factual inconsistencies between sources and logical inconsistencies in arguments, and figure out how to resolve those, and do math correctly. I don't know what the state of the art is here, though ChatGPT tends to fail at basic arithmetic.
> That's the feedback loop I'm calling "groupthink."
While I think I get your point, I'd call that failure mode "believing its own BS", and (perhaps I'm just being cynical here) I think humans collectively also have this failure mode.
That said, there's an old saying: "To err is human, to really foul up requires a computer" — it is quite possible for a machine, with merely the same category of flaws we have and no others, to be really bad for the world just because it's really fast and doesn't sleep.
There’s nothing saying AIs have to have text input, it’s just the method with the lowest friction of imagination. That’s why books of text have been around for so long.
There are already AIs that take input via image, video, and audio. The AI tech is input agnostic and only requires that someone figures out a way to get the input in.
> Every discussion on AI takes the example of ChatGPT and its inherent flaws, but AI-generated content doesn't have to be dull and low quality.
To get away from that we'd have to dramatically change our approach. The LLMs we have are trained on as much content as possible and essentially average out the style of their training data. What it writes reads like a B-grade high school essay because that is what you get when you average all the writing on the internet.
It's not obvious to me that a creative approach that boils down to "pick the most likely next word given the context so far" can avoid sounding bland.
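As a toy illustration of that point (a made-up two-word context table, not a real model): greedy decoding always returns the same most-likely continuation, and only sampling at some temperature ever produces anything surprising.

    # Greedy "pick the most likely next word" vs. temperature sampling,
    # over a tiny hand-made conditional distribution (not a real LM).
    import random

    next_word = {
        ("the", "weather"): {"is": 0.7, "was": 0.2, "seems": 0.1},
        ("weather", "is"): {"nice": 0.6, "bad": 0.3, "electric": 0.1},
    }

    def greedy(context):
        probs = next_word[context]
        return max(probs, key=probs.get)               # always the blandest choice

    def sample(context, temperature=1.5):
        probs = next_word[context]
        weights = [p ** (1.0 / temperature) for p in probs.values()]
        return random.choices(list(probs), weights=weights)[0]

    print(greedy(("the", "weather")), greedy(("weather", "is")))   # "is nice", every time
    print(sample(("the", "weather")), sample(("weather", "is")))   # varies run to run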
>One question that bothers me is: does it really matter? If AI-generated content is on par with human-made or even better, does it matter anymore that an AI generated it?
> If an AI had written Mozart's Requiem would it lessen its interest, its beauty?
I think it's about intent. Art is interesting and beautiful to us because there is an undeniable human intent in creating it and a vision behind it.
ChatGPT and DALL-E are pretty cool, but I think until AI gets its own intent and goals, it's pretty fair to try to separate human art and AI art.
I've seen plenty of images generated by Midjourney and Stable Diffusion that I would describe as "interesting" and/or "beautiful".
For that matter, nature can obviously be both, and it doesn't have intentional design nor vision behind it. So it's clear that it's not a universal requirement.
What I meant is that something that is "interesting" and/or "beautiful" is just artistic, such as nature as you pointed out. For it to be art, there has to be intent behind it, otherwise it's just aesthetically pleasing.
My point was that art is more than just something that's aesthetically pleasing.
I would say that art is something that is deliberately created to be aesthetically pleasing. If it's done by an AI that was designed with intent to generate such things, I would consider them art, as well.
But if we're talking about definitions, surely what really matters is how most in society understand "art"? Now suppose we went around showing Midjourney-generated pics to random people on the streets and asking them whether it's art or not; how many do you think would say "no", or ask questions about artist's intent before giving an answer?
Most would say it's art, which is why these models are so interesting and have had such quick adoption. If you were to reveal that it's AI-made, some would also change their mind or say that it's art but not exactly in the same sense that human-made art is art.
I just think that until AI has real intent, it's a different thing than what we do. Which is not saying anything bad about it, quite the contrary.
The definitions will inevitably shift over time, they're already ultimately subjective. If you believe in god, it's not hard to argue that nature is art and quite a lot of people believe in god.
Yes! There's a common problem where people think that an ecosystem is infinite, or at least sufficiently large, when it's not. We've done similar with dumping in the ocean, and now we've all got plastics in our blood, and we assumed soil quality was a given over time, too. AI content released into the wild will be consumed by AI; how can it not be? You've got a system which can produce content at a rate several orders of magnitude higher than a human's, so of course the content ecosystem will be dominated by AI-generated content, and of course the quality of content generated by AI systems, which rely on non-AI-generated content to train, will go down over time.
I feel like we are nearing peak "uncurated" content, both for humans and machines. Humans are still grappling with our novel abundance problems.
As we move forward, I suspect we will see an increase in curation services, and AI models will do more with less. You can bootstrap a productive adult human on an almost infinitesimal slice of the training sets we are using for the current gen of AI; can't imagine future approaches are going to need such an unbounded input to get better results - but might be wrong!
If content is curated for its quality, whether or not it's AI generated (or assisted) doesn't matter.
We’d have to adjust capitalism to deal with the “novel abundance” problems. Most of the drive for novel content/audiences is simply to decide which people get a cut of the revenue (and/or audience).
If we focused on quality and stopped caring about who gets paid for what I suspect that not only would we have better quality overall but we’d also push the boundaries much faster thus making things even more interesting.
Strong disagree. Capitalism (can argue whether that's what we have or not) uses capital to respond to markets. Markets are signals of what people want.
The scarcity mindset in an era of information abundance isn't being perpetuated by capitalism, it's being perpetuated by people making their own decisions one transaction at a time.
You make some strong logical cases so I'd like to use some of your language to try to give another perspective on my points:
Markets in our current system are (IMO) a distorted view of what people want. For example TVs: How much of the market is people who simply want a nice clean picture and how much is people who are feature-chasing based on the TV marketing? And then how much of the 'market' is actually smoke and mirrors accounting to increase the value of companies?
Some things that I think are only common because of capitalism (or "capitalism") and could therefore be phased out if we transitioned to an AI-supported system:
- "Keeping up with the Jones'"
- Needing short hits of artificial 'excitement' (such as the kind of novelty drip-fed to us by the movie industry)
- Keeping ideas secret as a form of Intellectual Property protection, and therefore the scarcity mindset that there are no good solutions and no new problems to solve (since we'd all be exposed to the latest of those if we chose)
- Capital being used for poor-quality projects (since many of these again are just about trying to grab more money for the people making them)
- People making their own decisions one transaction at a time. This is definitely underrated. Sure some people will continue to think short-term just because that's how their brains are wired, but I suspect the majority will find themselves able to breathe slowly for the first time and they will naturally unwind in to longer and longer-term thinking.
I think that in addition to influencing future models, AI content will also influence how humans think and write. People will start ironically and unironically copying GPT's style in their own writing, causing human produced content to increasing resemble AI content.
High school students that are prohibited from using AI for their essays will have a bad time. Even if they don't use AI chatbots themselves, they will unknowingly cite sources that were written by AI, or were written by someone who learned about the topic by asking ChatGPT.
Hmm, forgetting natural language for a moment and instead considering programming languages: it's pretty easy to generate nonsense but semi-plausible looking ASTs without the help of AI. Could this be used to attack GitHub's Copilot?
Step 1. Release a tool that generates nonsense code across a thousand repositories, and allow anybody to publish crap to GitHub.
Step 2. Copilot trains on those nonsensical repositories because it can’t distinguish them from the real thing.
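Step 1 doesn't even need an AST library in any deep sense; a template filler with random identifiers already produces code that parses but means nothing. A minimal sketch (all names generated at random):

    # Generate "nonsense but syntactically valid" Python by filling a fixed
    # template with random identifiers, then confirm it parses.
    import ast
    import random
    import string

    def rand_name(k=8):
        return "".join(random.choices(string.ascii_lowercase, k=k))

    def nonsense_module():
        fn, arg, var = rand_name(), rand_name(), rand_name()
        src = (
            f"def {fn}({arg}):\n"
            f"    {var} = [{arg} * i for i in range({random.randint(2, 99)})]\n"
            f"    return sum({var}) % {random.randint(3, 7)}\n"
        )
        ast.parse(src)  # raises SyntaxError if invalid; this template never does
        return src

    print(nonsense_module())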
Imagine this as a security attack vector. Instead of nonsense, spam a bunch of repos with code that does a specific thing but in a very hard to understand way. Then add in a small piece of very hard to understand, but legit looking malicious code. Copilot trains on it and then starts feeding it to developers around the world. Probably easier ways to achieve this, but interesting to think about.
Those blogs already exist. Pretty much 90% of the results I see in Google for non-technical household related queries. Just incoherent rambling that sounds plausible but is complete nonsense.
> Published content will be later used to train subsequent models, and being able to distinguish AI from human input may be very valuable going forward
Not all AI content is necessarily bad, nor all human content good. We need a way to separate the good from the bad, not AI from human, and it might be impossible to do 100% correctly anyway.
I think I would compare it to Stack Overflow. Some of the solutions do exist there, but not all are applicable to the use case or the exact circumstances of the person asking, and yet the prompt used by the AI would remain the same. SO has its rating system, but it has the same issue as the sentence above. From that perspective, we have identified potentially good human output ( assuming it wasn't already polluted with AI output, which seems less and less likely ) that should only be accessible by humans, and we would need a separate forum for bad AI output ( that should be verified by humans as bad but maybe only be accessible by AI once verified ).
I am just spitballing. I do not really have a solution in mind. It just sounds like an interesting problem going forward.
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits?
No, because models are already trained like that. Datasets for large models are too vast for a human to even know what's inside, let alone label them manually. So instead they are processed (labeled, cropped, etc) by other models, with humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.
So it's actually the opposite - the hybrid bootstrapping approach that combines human curation and ML labeling of bulk low-quality data typically delivers far better results than training on a small but 100% manual dataset.
> They are processed by other models, humans overseeing and tweaking the process. Often it's a chain with several models training each other, bootstrapped from whatever manual data you have, and curated by humans in key points of the process.
A great description of what actually happens when you deal with massive datasets. One way to inspect a large dataset is to cluster it, and then look at just a few samples from each cluster, to get an overview.
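A minimal sketch of that inspect-by-clustering workflow, with a tiny TF-IDF plus k-means pipeline standing in for whatever embeddings and clustering a real pipeline would use:

    # Cluster a pile of texts, then print a couple of samples per cluster
    # to get a quick overview of what the dataset actually contains.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    texts = [
        "how to bake sourdough bread at home",
        "sourdough starter feeding schedule",
        "install the nvidia driver on ubuntu",
        "fix black screen after ubuntu driver update",
        "best wedding insurance policies compared",
        "is wedding insurance worth the cost",
    ]

    vectors = TfidfVectorizer().fit_transform(texts)
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(vectors)

    for cluster in range(3):
        samples = [t for t, l in zip(texts, labels) if l == cluster][:2]
        print(f"cluster {cluster}: {samples}")

With billions of documents you would obviously swap in proper embeddings and a scalable clusterer, but the "look at a few samples per cluster" step stays the same.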
Okay, for image models I think humans could help a lot more than we give them credit for. We can read and parse images WAY faster than you might think.
What if we just crowdsource and have a new Folding@home protein thing, but this time it's for classifying data sets? LAION-5B has 5 billion image-text pairs; if we got 10,000 people together that'd just be… 500,000 per person, which would take… a while, but not forever. Humans can notice discrepancies super quickly. Like a slide show: display the image and the text pair at a speed set by the user, and pause and tweak ones that are outright wrong or shitty.
Boom, refined image set.
Maybe? I’m looking at the LAION-5B example sets on their website and it seems to literally be this simple. A lot of the images seemed pretty poorly tagged. You get a gigantic manually tagged data set, at least for image classification.
I assume at some point, ChatGPT needs some kind of text ranking. Popular texts are usually correct (content and presentation) and useful, so they should rank higher. At some point, low-quality texts are filtered out. Personally, I don't care if a text is written by a human or a machine as long as it's good.
> Could we end up having AI quality trend downwards due to AI ingesting its own old outputs and reinforcing bad habits? I think it's a particular risk for text generation.
I just had my eyes opened reading that, because humans also do exactly that, inadvertently.
This isn't an issue, because it's possible to add prose quality and content accuracy scores to training data and train the model to predict those quantities during generation, which would allow you to condition the generation on high prose quality/accuracy. It just requires a small update to the model, and a shit ton of data set annotation time.
Likewise, images can be scored for aesthetics and consistency and models updated to predict and condition in the same way.
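One concrete way to wire up the text side of that (a control-token scheme rather than literally predicting the scores, but the data-prep looks similar) is to bucket the human ratings and prepend them to each training example, then prompt with the bucket you want at generation time. A rough sketch, with bucket names and thresholds invented for illustration:

    # Turn human quality/accuracy ratings into control tokens prepended to each
    # training example, so generation can later be conditioned on them.
    def bucket(name, score):
        # score assumed to be a 0-1 human rating; thresholds are arbitrary
        level = "high" if score >= 0.8 else "mid" if score >= 0.5 else "low"
        return f"<{name}:{level}>"

    def to_training_example(text, quality, accuracy):
        return f"{bucket('quality', quality)} {bucket('accuracy', accuracy)} {text}"

    # At inference you'd prompt with "<quality:high> <accuracy:high> ..." and let
    # the model continue in that register.
    print(to_training_example("Wedding insurance typically covers cancellation costs.", 0.9, 0.85))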
How would you score them at scale without training some model to differentiate real vs. AI content? If you need to train such a model, where would you get the data from?
We don't need to differentiate AI vs Human, just accurate and well written vs not. We'd do that the same way we've scored stuff at scale so far - grad students and crowdsourcing.
That just isn't true. It's expensive, but entirely doable. Also, it's perfectly normal to perform initial model training on a large data set to capture the statistical properties of language, then perform a second stage of model training on more curated data to cause the model to actually do what you want.
Current LLMs are trained on as much data that can be scraped from the public internet. It’s simply not possible to annotate that much data, even with crowdsourcing. It’s not even a matter of cost. You’d basically need to duplicate the amount of data on the internet. I don’t think you’re appreciating the scale of the data involved in training these models.
Not necessarily. The bloom model (a GPT competitor and similarly sized) was trained on 1.5T of text, which reduces down to 350B unique tokens. If you took a histogram of those unique tokens, it would have a very long tail with probably 1% or less being well represented. That leaves 350M common tokens to serve as the basis for token tuples being fed into crowdsourcing. There are probably ~2-5B very common token sequences, if you had 5 people view each token sequence and give it a few scores, and that process took ~1-2 minutes (these sequences are short), that leaves a conservative estimate of 50 billion person minutes, or ~34 million person days. If you paid these workers $15/hour, that comes out to $12.5 billion dollars, which is not prohibitively expensive for any big tech company when spread out over several years, particularly when it provides a massive competitive advantage.
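Spelling that estimate out, taking the high end of each range quoted above:

    # Back-of-envelope check of the annotation cost estimate above.
    sequences = 5e9              # very common token sequences (high end)
    raters = 5                   # people scoring each sequence
    minutes_each = 2             # minutes per rating (high end)

    person_minutes = sequences * raters * minutes_each     # 5e10, i.e. ~50 billion
    person_days = person_minutes / (60 * 24)               # ~3.5e7, i.e. ~34 million
    cost_usd = (person_minutes / 60) * 15                  # at $15/hour: ~1.25e10

    print(f"{person_minutes:.2e} person-minutes, {person_days:.2e} person-days, ${cost_usd:.2e}")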
BLOOM isn’t as good as GPT-3 because it doesn’t use as much training data. LLM quality is still data bound [0]. Further limiting data by requiring annotation is not going to work, at least with the current LLM modeling approach.
As the scale of input data goes up linearly, the scale of commonly observed input patterns goes up logarithmically. If we bumped the scale up an order of magnitude in terms of common input tokens, that still means we could annotate the important part of a 150TB text corpus for 125B worth of human annotation. Given that could break the budget of even large corporations, realistically we'd probably train a model to predict the scores of interest using a fraction of that much human annotation, which would be inferior but still a massive improvement. It is also likely that corporations would team up with indirect competitors to share the cost of annotation and gain an advantage against direct competitors.
How do you figure? Let's say a commonly observed input pattern comprises 1% of training data. For a data set of size N, 0.01 * N examples will contain the pattern. If we increase the size to 2N, 0.01 * 2 * N examples will contain the pattern. Why is the growth logarithmic?
As the data set size increases, the average frequency of non-trivial most frequent patterns will go down, and the tail will get much larger. Thus if you had a 1% cutoff, the percentage of the data set hitting this cutoff goes down as the data set size goes up. Take a look at pareto distributions with high alpha to understand the statistics of it.
Of course, this is only true if new data is distinct from old data. If you just copied your data set 10x and pretended it was a 10x larger data set, it would behave like you expect.
Hmm, I'm still not convinced. Gathering training data can be thought of as sampling the underlying distribution of the data. In that sense, you'd expect the proportions of things to converge towards the underlying distribution as you gather more data.
That would be true if we were sampling from the underlying distribution in an unbiased and balanced way from the beginning. Instead data is generated and incorporated one set at a time, and each set is biased. Jargon and terms vary, but the language plumbing is the same - new sets bolster common phrases/idioms and lengthen the tail with specific tokens.
Keep in mind though, language isn't a stationary process.
Even if each dataset is biased, I'm still not sure how you derived logarithmic growth from the general notion of bias in data. For instance, assuming the data is biased, perhaps it is biased in the other direction and contains more common patterns compared to the underlying distribution.
Ah, I see what you mean: the number of unique examples increases logarithmically with data size, which kind of makes sense. Language, in this case, follows a power law.
I think your argument is that this means smaller datasets are ok because they contain "most" of what the larger datasets contain. But I think this data-power-rule implies the opposite. ML models can often get to 80-90% accuracy on some task. Unfortunately, these models often aren't that useful because that missing 10% of accuracy matters a lot to users. So what this data-power-rule implies is that, in order to get the last 10% of gains, you need 10x the amount of data.
Well, to get back to my original point, if we're trying to improve the quality and accuracy of model writing, and we want to do that by adding quality and accuracy scores to short token sequences, the power law distribution means we could get coverage on a significant portion of the data set by scoring just the most frequent sequences that aren't linguistic trivia. We could probably get to 50% average coverage fairly cheaply, and while diminishing returns would kick in and make getting to 80 or 90% much more expensive, at that point we could use a model to estimate the remainder, and have a perfectly suitable quality/accuracy scores to condition the model on. The model would output those quality/accuracy scores for the generated token sequence as well, so portions of output that were low quality/of questionable accuracy could be flagged.
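A toy simulation of that coverage curve, with a Zipf-distributed pattern frequency (exponent chosen arbitrarily) standing in for real token-sequence statistics:

    # Under a heavy-tailed (Zipf-like) frequency distribution, scoring only the
    # most frequent patterns already covers most occurrences, with diminishing
    # returns after that. Sizes and exponent are arbitrary.
    import numpy as np

    rng = np.random.default_rng(0)
    draws = rng.zipf(1.5, size=1_000_000)        # each draw = one pattern occurrence

    _, counts = np.unique(draws, return_counts=True)
    counts = np.sort(counts)[::-1]               # pattern frequencies, most common first
    coverage = np.cumsum(counts) / counts.sum()

    for top_k in (10, 100, 1_000):
        print(f"top {top_k:>5} patterns cover {coverage[top_k - 1]:.0%} of occurrences")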
An arms race between generative AI and discriminative AI is a good thing, it'll lead to higher quality models since the only way to "win" the game is to make output that is indistinguishable from legitimate content.
In fact, that is basically the GAN algorithm in meta form.
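For anyone who hasn't seen it spelled out, the GAN loop really is that arms race in miniature. A stripped-down sketch on toy 1-D data (all sizes and hyperparameters arbitrary):

    # Stripped-down GAN loop: a generator learns to mimic samples from N(4, 1.25)
    # while a discriminator learns to tell real from generated.
    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))   # noise -> fake sample
    D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))   # sample -> real/fake logit
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    bce = nn.BCEWithLogitsLoss()

    for step in range(2000):
        real = torch.randn(64, 1) * 1.25 + 4.0          # "legitimate content"
        fake = G(torch.randn(64, 8))                    # "generated content"

        # Discriminator: push real toward 1, fake toward 0
        d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator: fool the discriminator into calling fakes real
        g_loss = bce(D(fake), torch.ones(64, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    print(G(torch.randn(1000, 8)).mean().item())        # should drift toward ~4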
There will likely be selective pressure from human interaction with the data to curate good content above bad.
After all, we had the issue of millions of auto-generated bad pages in the web 1.0 SEO days. Search engines addressed it by figuring out how to rely more heavily on human behavior signals as an indication of value of data.
The thing that concerns me is that we may end up with a downward trend in accuracy.
If AI writes the bulk of the content, how long will it be before people simply do not put in the work to make sure things are true or put in the work to discover and understand new true things?
>Such things could end up creating a downwards trend in quality, as more and more of such junk gets posted, absorbed into the model and amplified further.
I feel like a similar thing already happened with YouTube recommendations
To your point, anecdotally, the system is heavily gamed. The other day, I saw reviews pop up for a restaurant that had not even opened yet. Either reviewers got a sneak peek behind the chef's curtain or those reviews are not quite true.
Sadly, word of mouth again becomes de facto the only semi-reliable way to separate crap from non-crap, and even that comes with its own set of issues and attempts at gaming.
There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.
One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?
With all of that out of the way, yes, I would rather read text produced by human beings, not because of its quality--the AI knows, sometimes humans can't help themselves and just keep writing the same thing over and over, especially when it comes to fiction--but just to defend human dominance.
>What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
We could get this whole discussion back to some semblance of sanity if we stopped calling any form of remotely complicated automation "AI". The term might as well be meaningless now.
Nothing about any of all these "AIs" is intelligent in the sense of the layman's understanding of artificial intelligence, let alone intelligence of biological and philosophical schools of thought.
> There is so much stuffing for a simple idea that I'm not sure if this piece deserves its own title, but I'll give it the benefit of the doubt.
Frankly I had the same thought writing it :D
It's more of a stake in the ground sort of a thing I guess?
What I really want is somebody saying "hey, there is an open standard already here" so I can use it.
The idea has some legs, but they are weak for the many reasons pointed out to me by fair criticism of "digital veganism". The main one is that labelling is one small part of quality. Tijmen Schep in his 2016 "Design My Privacy" [1] proposed some really cool ideas around quality and trustworthiness labelling of IoT/mobile devices, but ran into the same issues. Responsibility ultimately lies with the consumer, and so long as consumers remain uneducated as to why low quality is harmful, and cannot verify the provenance of what they consume or the harmful effects, nothing will change.
Right now we seem to be at the stage of "It's just McDonald's/KFC for data - junk food is convenient, cheap and not a problem - therefore mass production generative content won't be a problem".
The food analogy is powerful, but has limits, and I urge you to dig into Digital Vegan [2] if you want to take it further.
>One thing that I wonder though is how we will draw the line. If I'm writing a piece and do a Google search, and in that way invoke BERT under the hood, is anything that I write afterwards "AI-tainted"? What about the grammar checker? Or the spot removal tool in photoshop or gimp? Or the AI voice that reads back to me my own article so that I can find prose issues?
>And that brings the other problem: do the general public really know the extent of AI use today, never mind in the future?
The line is drawn at human ownership/responsibility. A piece of content can be 'AI-tainted' or '100% produced by AI'; what makes the difference is whether a human takes responsibility for the end product or not.
Responsibility and ownership always lies with the humans. Even supposedly 100% AI generated content is still coming from a process started and maintained by humans. Currently also prompted by a human.
The humans running those processes can attempt to deny ownership or responsibility if they so choose but whenever it matters such as in law or any other arena dealing with liability or ownership rights, the humans will be made to own the responsibility.
Same as for self-driving cars. We can debate about who the operator is and to what extent the manufacturers, the occupants, or the owners are responsible for whether the car causes harm but we'll never try to punish the car while calling all humans involved blameless. The point of holding people responsible for outcomes and actions is to drive meaningful change in human behaviors in order to reduce harms and encourage human flourishing.
In terms of ownership and intellectual property, again the point of even having rules is to manage interactions between humans so we can behave civilly towards each other. There can be no meaningful category of content produced "100%" by AI unless AI become persons under the law or as considered by most humans.
If an AI system can ever truly produce content on its own volition, without any human action taken to make that specific thing happen, then that system would be a rational actor on par with other persons and we'll probably begin the debate over whether AI systems should be treated as people in society and under the law. That may even be a new category distinct from human persons such as it is with the concept of corporate persons.
> ... yes, I would rather read text produced by human beings, not because of its quality ... (snip) ... but just to defend human dominance.
One could make a strong argument that defending moral principles is preferable to preferring the underlying creative force to have a particular biological composition.
As an example, I don't want a system to incentivize humans kept as almost-slaves to retype AI generated content.
How can one tell the difference between all the gradations of "completely" human generated to not?
> One of my favorite products is “100% Fat-Free Pickled Cucumbers Fit (Gluten Free), “ which I once saw at the grocery store.
On my first flight to the US, in the '90s, a rather obese lady in the row in front of me asked the flight attendant: "Excuse me. Do you have fat-free water?"
The flight attendant hesitated a split second, her face not moving an inch. Then she smiled and replied: "We certainly have fat-free water, madam. I'll fetch you a bottle straight away."
A few years ago, in a hotel in London, I had complimentary water bottles on the night stand. The label said "Organic Water from Scotland". I was like: uhh, organic water from Scotland is probably cattle piss. I prefer inorganic, non-bio water.
The long term impact of the ease of generating low nutrition digital content using language models may be that people put down their devices and return to the real world. We’re already far down that path with the existing internet where most content is generated for SEO.
Anything you’re consuming on the internet or even on a TV may just be random noise generated by some model so why waste your precious time consuming it?
On the flip side why waste your time producing content if it’s going to be drowned in a sea of garbage mashed together by some language model?
> The long term impact of the ease of generating low nutrition digital content using language models may be that people put down their devices
The problem is people don't always make the wise decision. Evidence: the junk food industry is alive and kicking.
Some people will disconnect from devices, but others may just say "this is the way things are now" and adjust themselves to the flavor of junk content.
Why are you assuming that it will make writing worse not better?
Just because it can be used by non-experts to create crappy written work doesn't mean it can't also be used by people who work with it to augment and improve their existing written work.
<< Did the printing press also increase the amount of crap in circulation?
<< Why are you assuming that it will make writing worse not better?
Both of these are fascinating questions and, to me anyway, both can be answered with yes. The sheer amount of writing increased exponentially once more people could read, write and publish their own writings (and the internet only exacerbated this trend). In accordance with the Pareto principle, most of it was of poor quality, but the upside was that good output likely increased in absolute numbers as well (a few people are bound to write something decent).
I think the parent is looking back at history and reasonably inferring the likely result (more crap).
Sorry, but all I can think of after reading this blog post is the evil bit RFC: https://www.ietf.org/rfc/rfc3514.txt which has had just as much effect on internet security as this proposal will have on controlling ai generated content.
This post reminds me of Samuel Butler's novel, Erewhon.
"Assume for the sake of argument that conscious beings have existed for some twenty million years: see what strides machines have made in the last thousand! May not the world last twenty million years longer? If so, what will they not in the end become? Is it not safer to nip the mischief in the bud and to forbid them further progress?"
That reminds me of the epilogue to H.G. Wells's The Time Machine:
“ He, I know—for the question had been discussed among us long before the Time Machine was made—thought but cheerlessly of the Advancement of Mankind, and saw in the growing pile of civilization only a foolish heaping that must inevitably fall back upon and destroy its makers in the end. If that is so, it remains for us to live as though it were not so.”
That passage has haunted me since. I often wonder if that is the answer to the Fermi paradox. Civilization might be but a brief spark in the long night, separated from others by both time and distance insurmountable.
2. By someone who has consumed AI-generated content
3. With a ~2000-era spellchecker
4. By someone using ~2020-era neural speech-to-text software
5. With a ~2020-era spellchecker
6. By someone with an AI in the traditional editor role (reads it, gives detailed feedback)
7. By a human and an AI working together where the AI generates everything initially but the human fixes things and nothing goes out without human endorsement.
I'd probably draw the line at 7, but you could also argue for 6 or even 5.
I've always maintained that for any food product labeled "Home Style" or "Home-Made Flavor", the product must also feature a photograph of the factory floor where the product is made.
I was thinking about this exact problem a few days ago when I created a site hosting poems that were either 100% AI-written or 100% human-written.
https://news.ycombinator.com/item?id=34472478
Then I asked people to guess the authorship. Amazingly, people guess correctly only about 70% of the time. https://random-poem.com/
I'm guessing it will get even harder to tell as the AI improves further down the road.
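Out of curiosity, a quick back-of-the-envelope check that 70% really is better than coin-flipping. This is only a sketch: the number of guesses below is an assumption, since I only quoted the rate above.

    from math import comb

    def p_at_least(k: int, n: int, p: float = 0.5) -> float:
        # Probability of getting at least k of n guesses right by pure chance.
        return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

    # Assumed sample size; only the ~70% accuracy figure is reported above.
    n_guesses = 200
    n_correct = round(0.70 * n_guesses)
    print(f"chance of >= {n_correct}/{n_guesses} correct by guessing: "
          f"{p_at_least(n_correct, n_guesses):.1e}")

With a couple hundred guesses, 70% accuracy is far beyond luck, yet still a long way from reliably telling the two apart.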
Good question. However, this is 100% ChatGPT. Perhaps it knows that Weird Al intentionally misspells words in his work, hence it has intentionally introduced this typo, which makes this type of AI even more awesome.
--edit
.. weird:
I asked ChatGPT if there is a typo in this poem. This is what it responded with:
ChatGPT: It appears that there is an intentional typo in the first line of the poem "Weird Al, oh Weird Yankovic Al" instead of "Weird Al Yankovic". Yankovic being the surname of the artist, this addition can be seen as a playful and humorous way to refer to the artist, and give the poem a personal touch.
This reminds me of the ‘portrait drawings’ to camera transition.
LLMs have given us a more interesting corridor in the Library of Babel - https://en.m.wikipedia.org/wiki/The_Library_of_Babel - but separating the wheat from the chaff will still be the human endeavor because of the infinite possible BS.
I'm not sure how to best describe it, but every time I interact with AI, there is very little emotional response from it. It's a "good enough" response rather than one that provokes a sense of awe or horror.
I've been experimenting with writing recently and wanting to provide some AI imagery to match the emotions I'm expressing. A painting like "Wanderer above the Sea of Fog" evokes many emotions. But when I use the same description such as:
"a man standing upon a rocky precipice with his back to the viewer; he is gazing out on a landscape covered in a thick sea of fog through which other ridges, trees, and mountains pierce, which stretches out into the distance indefinitely."
I get the store-brand version that doesn't evoke any emotion. It is "good enough" to get the point across, but lacks the response. As with the countless recreations of the Mona Lisa, there is just something about organic perfection. I'm sure AI will get there one day, but who knows if we will react to it with this sense of wonderment.
Do you think your emotional response is related to the external knowledge of who produced the painting or is it only based on the visual impression?
We probably can still figure out if a painting is original or AI-generated, but I don't think we will be able to much longer as AI improves.
The question would be whether we could feel emotions even if the source material is artificial. I think the answer is yes. The human brain can easily be tricked.
I guess it is more that AI doesn't struggle to create. Humans go through many emotions in the process of creating art. The outcome isn't as important as the appreciation that the artist went through that grueling process; that appreciation, and the outcome being what it is, is the response I'm trying to express.
So more to the human struggle of the production. At least for me.
>a man standing upon a rocky precipice with his back to the viewer; he is gazing out on a landscape covered in a thick sea of fog through which other ridges, trees, and mountains pierce, which stretches out into the distance indefinitely."
Might want to look into better prompt engineering. This is pretty generic and there are better prompts you can use.
We just need to give the AI memories from real people and give them a life span short enough that they won't realize it... and then if they go bananas and start causing harm, make another AI to terminate them... I swear I have heard this strategy used elsewhere.
Hear me out, the problem is not AI generated content but what this content stands for.
Why do we use text? Half of it is about getting something from someone else, but even more importantly we write text to change something in the world using our words.
The problem with AI pretending to be a human only exists when that AI doesn't get anything from us and is only good for extracting information.
It's utterly futile to argue with AI-generated content here on HN, but it's an amazing experience on ChatGPT, precisely because when we write something to each other here on HN, we expect that our words will create some impact somewhere. Some opinion will change, or we will befriend someone.
I have zero problems interacting with an AI that is an individual machine in Lisbon, learning and evolving as life happens. On the other hand, I hate the idea that I'm talking with a machine in Palo Alto and the only output of my conversation is some statistics that a VC will gaze over and optimize for his own gain.
Just think about how meaning is transferred from human to human: we can compress meaning into a few markings and extract it on the other side only because, as individuals, we have experienced life; a marking saying "bored" is enough to transfer a very complex situation from person to person. In the current state of AI, being bored doesn't mean anything to LLMs, but if individual AI machines lived human-like lives, I think it would start having meaning for them too.
IMHO, the problem with AI-generated content currently is exactly the same as with SPAM or other non-genuine content, and has nothing to do with its origins being biological or electronic.
We humans will put eyes on a ball and call it our friend; we don't have a problem with that.
Proving content to be AI-free is fraught. It might be practical in narrow contexts, but I think it will be mostly impractical. It will even be theoretically impossible in many cases.
So, the article's underlying philosophy is "not there yet". It does not adequately address various real world challenges.
1. How is AI-generated content different from algorithms? I'd suggest drawing a line may be nonsensical.
2. What is the precise ethical motivation for wanting to avoid AI-influenced computation? I don't see a compelling case.
Examples:
A. Do we want civil engineers to use optimization software? Yes.
B. Do we like spelling and grammar checkers? Yes.
C. Do we want content generation software to suggest topics and hyperlinks? Yes.
D. Do we want to try out AI music? Yes. And we want to remix it.
E. Do we want AI to make our health care more accessible and affordable? Yes.
If our goals include protecting human rights, health, dignity, and so on, we had better darn well formulate our philosophy and policy goals in a coherent way.
You know that to certify something as kosher, a rabbi has to be paid to stand there all day next to the kishke stuffing machine or whatever. Makes the consumable a lot more expensive. On the other hand, I see job opportunities.
[edit]
Actually, just to expand on this a bit, this is effectively an argument for establishing a mark of authenticity that all literate civilizations have always striven to place on their intellectual output. There has never been a single rule uniting these efforts, but civilizations which placed more emphasis on safeguarding and defending the precision of the written word have tended to be rewarded with greater longevity. There's no reason that trend shouldn't accelerate when faced with the threat of inundation by meaningless language models. Just like gold coinage, we're looking at a period of debasement and inflation.
You could actually argue that language models themselves are an expression of anti-semitism, in the sense that they're an attempt to undermine the sacredness of the written word, to destroy or wash out the way that the meaning of words ennobles humanity, and to eradicate the special relationship that the law of language and the language of law create between God and Man. I only say it seems anti-semitic because that particular concept, as a high/sacred value, seems unique to Judaism (from my perspective, I can't think of another culture that considers it an inviolable precept) and so this attempted abolishment of the human hand in the written word seems particularly targeted at those who consider the word sacred; maybe this is yet to be threshed out. Maybe Bari Weiss will write about it once some nazis have ChatGPT come up with a totally bunk but plausible corruption of the Talmud. But love of the written word is something that should rightly be a general human value, because we'll live or die with it, Jews and everyone else, whether we want to or not. All civilizations fall when their coin is debased, and our coin today is information.
[edit2] also, I'm drunk, and I love y'all. I hope this stimulates debate, not hate.
I hate to say it, but it does not sound that far-fetched. In some weird ways, tech already feels like magic to a lot of people, and the people surrounding it are today's wizards and priests willing that power into existence.
In a practical sense, as a society, we do need something to separate good from bad. A technopriest caste does sound like a fun job description to me (which will inevitably be corrupted and result in its own schism).
Is "AI-free" clear? If I google something during the making of my project, is the project AI free? What if I type on a phone keyboard with autosuggestions? There is machine learning involved in all kinds of stuff and AI is a pretty general term.
Edwin Brady often talks about his (digital) lab assistant. I think that’s a good mental image: an AI assistant is what we’ve used so far and I think is fine. They help us bounce ideas back and forth (e.g. searching) and correct typical mistakes (typos, type errors in Edwin’s case). But as soon as they’re running the lab and we become the assistant, I think that’s where most people would draw the line.
Those pushing the concept of "AI-Free" have yet to nail down what amounts and types of automation may be used in the production of a thing before it can no longer be labeled "AI-Free".
On the other hand, "Organic" is very well defined at this point. In the US there is a whole legal framework around labeling foods, drugs, and cosmetics.
There is even a training and accreditation system in place for qualifying people as certifiers of organic practices as well as an application and review process for farms and manufacturers to be certified to use the label.
I disagree. We have called online distribution "content distribution" and the people who contribute stories, videos, and art "content creators" for at least a decade. It's a catch-all generic term for all content that another person would consume via a platform. In a general work setting you're probably not producing content for a platform, but you are certainly producing content for your coworkers and customers. It's so you don't have to say: "stories, blogs, vlogs, photos, digital art, products and any other thing that is consumed by others."
Agreed. If anything, the word 'content' captures a broader array of work output in the mind of the general public than 'work', which conjures up a lot of things, but rarely artistic endeavors. I do dislike using it to describe my own habits, but it would appear that ship has sailed. In both my tech and non-tech circles, the stuff they do is content consumption. I am not sure when it became so prevalent.
Well, we've got part of the Mentat angle covered already.
"It is by caffeine alone I set my mind in motion. It is by the juice of Java that thoughts acquire speed, the teeth acquire stains, the stains become a warning. It is by will alone I set my mind in motion."
> Customers and the audience should be able to know and choose if they are interacting with AI-generated content. More importantly, they need to be able to choose to interact exclusively with artisanal, human-produced ideas.
good that automattic is setting this policy for themselves, but as long as this is opt-in there is zero chance this fantasy will be reality
also @simonw’s “AI veganism”: https://news.ycombinator.com/item?id=32639643 it's a good term because it will be about as popular and have as loud and passionate and few adherents as actual veganism
> good that automattic is setting this policy for themselves
To clarify, this is not a policy but a design choice and the intent of one PM :D
I cannot speak for the actual company policy because we are at an exploration phase.
> also @simonw’s “AI veganism”: https://news.ycombinator.com/item?id=32639643 it's a good term because it will be about as popular and have as loud and passionate and few adherents as actual veganism
Diets are such a rich metaphor for the approach to content!
Fast food on the consumption side, veganism, kosher, paleo etc. on the production side
> Some of the AI-generated output is factually wrong
While this is undoubtedly true and a problem that needs to be addressed, it's worth considering that humans get things factually wrong too sometimes (intentionally or not). So perhaps a more interesting question is how much more or less correct an AI is than a human on a given task.
People talk about this with self-driving cars all the time. Arguably a self-driving car does not need to drive perfectly (not that perfection isn't a good goal), but if it can drive more safely than the average human driver, there's still a significant chance of improving overall road safety.
Not saying there won't be those who do, but caring if something was created by AI instead of a human is about as intelligent as being bothered by your neighbor being a certain nationality, or boycotting a movie because one of the actors has the "wrong" skin color.
It matters what we consume, plenty of humans peddle rotten ideas and false narratives. With AI, I'd look into the models and the intentions of those who created/run it. What sort of content it delivers. Just as it makes sense with human authors to look up their biography and publishing history.
I disagree. I think humans have a higher creative ceiling, and so the truly moving, life-altering pieces of art will always need that human element; I'd rather not waste my time consuming the AI content when its peaks are lower.
AI content will be like modern pop songs and superhero movies. They fit a formula and people will still consider them "great" because the bar has been lowered by algorithms and a reduced appetite for risk. But deep down we know they are not in the same league as works from The Beatles, Sting, or Martin Scorsese.
Here [1] is a picture of the Mona Lisa. Even if one isn't really into art, you can't help but be drawn into that smile. What did it mean? What was she thinking? What did Leonardo see, and when painting it did he intend to convey what he saw? Or was it a mystery to himself as well, something unbelievably well conveyed in the painting itself?
The reason these questions are so fun to consider is that we know the answers exist, even if we also know we shall never know them. By contrast, had this image just been something output by yet another neural network generator, there would be little to no interest of the same sort, as you would know that anything you infer beyond the most surface level is exactly the same as seeing shapes in the clouds.
Being into art, I think your example perfectly demonstrates why we can't get AI soon enough. There is no good reason for thousands of tourists to crowd around the Mona Lisa on a daily basis. They don't even understand what they're looking at.
Would people pay millions of dollars for an AI generated black square NFT? Maybe some would too, it really shows the retardation of the entire space. With the mainstream it's not even about art and craftsmanship but about big names and make-believe. And that's not just art but all fields including science to a degree. When we get AI, hopefully more people are going to start focusing on substance instead of marketing and illusion.
It's not clear that people actually care about and want AI-free OC, at least if you look at what kind of content is being consumed. Right now Google search seems to prioritize non-organic content, with search results often being a stream of blogspam and reddit/quora shill/astroturf crap that if not AI-generated, is close enough in terms of tone, accuracy, and originality, that it might as well be.
Meanwhile you never get any results from 4chan or KiwiFarms (sites with much more organic content), unless you go out of your way to specifically ask for it.
Google not prioritizing 4chan and KiwiFarms is a terrible example of "not wanting organic content". They're not just "organic content", they're cesspools of the worst kind of organic content. 4chan is notoriously filled with questionable to outright illegal content/activity, and KiwiFarms is just a website to organize doxxing and online harassment. I don't understand why that's your standard for "organic content". I would have to heavily question your activity if you're being thwarted by a lack of those sites in your search results. I'm personally very happy with 4chan links not showing up in search results.
I also think you're misunderstanding how Google prioritizes content. It's not showing the content people want, as much as content that's optimized with SEO to look appealing.
> KiwiFarms is just a website to organize doxxing and online harassment
KF is a gossip site, like Lipstick Alley or DataLounge, but with a focus on e-celebs and internet phenomena. I go there whenever I want to get the dirt (ie. truth) about e-celebs or some weird internet trend or subculture. Its users literally just passively document and discuss this stuff, and there's a policy (and ethos) of "no touch." You and others endorsing its erasure (despite it not breaking any law) would know this (and perhaps secretly do) if you actually lurked there for a while instead of just credulously accepting its critics' characterization. It's also strange how the site is supposedly such a hive of doxxers, harassers, etc., yet nothing seems to happen to the journalists and tech people organizing against it.
I think the real reason it's in hot water is precisely because they just passively, permanently document so much, some of which some people with money and influence don't want documented. Maybe they documented something someone (some billionaire?) secretly doesn't want out there?
The sorts of people being "documented" on KiwiFarms are not celebrities. They're usually vulnerable people with some sort of mental illness who are struggling. And I don't buy for a moment the "no touch" policy. Just because you can't use a specific website to harass someone, does not mean you can't use the information on the website to harass them off-platform. This is a bad take. There's a major difference between journalism, tabloid journalism (which I also consider worthless and wrong), and stalking vulnerable people on the internet. Or as you call the latter, "documenting".
> That's because you haven't spent much, if any, time on the site. Lurk more.
Zero interest. You don't have to eat white lead to know it will give you cancer.
> there is some truly bizarre social phenomena that you can't find documented elsewhere
There's a word for this: stalking. Many of these "social phenomena" are people, and like I said, ones who suffer with mental illness. You're quite literally advocating for, and enabling real-world stalking when you endorse write-ups about these sorts of people.
> apps which literally can be (and have been) used to coordinate harassment
Used for, vs created to enable. There's a difference you're either ignoring or willfully blind to here.
4chan constantly deletes its own content. There are a limited number of slots for threads, and making a new one causes the oldest one to disappear. Google does not care much for dead links.
Additionally, image submissions are basically never described in the text, so they are unsearchable even when they are live. There's some exceptions with archives but now you're in power-user territory.
Not being compatible with search engines is a side-effect that its userbase actually likes. You are either currently there or you are not.
There are archives, like rebeccablacktech and 4plebs, which Google likewise blackholes. You could argue Google also does not care much for archives, yet I still get StackOverflow clone results, for some reason.
The solution that comes to mind is something like the PGP web of trust, except the web would consist of verified humans.
This didn't work for PGP because people in general don't care about that. And I think people in general don't care if their content is AI-generated or not.
It's not like all human-generated content on the web is tremendously accurate or well-written. Hell, maybe the AI will even be better. :)
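A minimal sketch of the attestation half of that idea, using Ed25519 signatures from Python's cryptography package. The hard parts of any web of trust (verifying that a key really belongs to a human, and getting others to countersign it) are left out, and the function names are my own, not part of any existing standard.

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # A verified human generates a keypair once; the public key is what other
    # participants would countersign in a web-of-trust model.
    author_key = Ed25519PrivateKey.generate()
    author_pub = author_key.public_key()

    def sign_content(private_key: Ed25519PrivateKey, content: str) -> bytes:
        # Sign a piece of content, attesting "a verified human published this".
        return private_key.sign(content.encode("utf-8"))

    def is_endorsed(public_key, content: str, signature: bytes) -> bool:
        # Check a signature against the claimed author's public key.
        try:
            public_key.verify(signature, content.encode("utf-8"))
            return True
        except InvalidSignature:
            return False

    post = "This comment was written by a human."
    sig = sign_content(author_key, post)
    print(is_endorsed(author_pub, post, sig))        # True
    print(is_endorsed(author_pub, post + "!", sig))  # False

The signing part is trivial; as with PGP, the unsolved part is getting enough people to care about exchanging and verifying keys.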
OpenAI is working on a watermark for their models (if not already there) that would recognize content as AI-generated. If it's good enough, it should make it possible to filter out AI content when training new models.
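Nothing public says how such a watermark would be exposed to people assembling training data, but if a detector existed, the filtering step before retraining could look roughly like this; the watermark_score function here is purely hypothetical.

    from typing import Callable, Iterable, List

    def filter_training_corpus(
        documents: Iterable[str],
        watermark_score: Callable[[str], float],  # hypothetical detector: ~0.0 human, ~1.0 watermarked
        threshold: float = 0.5,
    ) -> List[str]:
        # Keep only documents the detector considers unlikely to be AI-generated,
        # so a new model is not trained on its predecessor's own output.
        return [doc for doc in documents if watermark_score(doc) < threshold]

    # Usage with a stand-in detector; a real one would have to come from the model vendor.
    corpus = ["Some human-written blog post.", "Some suspiciously fluent filler text."]
    kept = filter_training_corpus(corpus, watermark_score=lambda doc: 0.0)
    print(len(kept), "documents kept for training")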
Although I like most of the post, some thoughts:
First of all, with every invention we have assumed that human beings would no longer be required. In the end we get more efficient, but we still need workers. Since industrialisation the trend has been toward core business and standardisation; I see AI as an opportunity to have more custom, service-driven business on an industrial scale, and that is the real promise. This could start with service desks providing actual service rather than the unusable rubbish they do now.
As for art and text: this industry is relatively new, and came with the internet and PowerPoint decks. And yes, this will probably be impacted by AI, but do we really have to be sorry about that? We will have the same page-filling nonsense created by AI that is now created by cheap labour. Look at recipe sites: all kinds of crap to pad recipes into content so that copyright is somewhat addressed, but nobody reads it anyway.
As with all other changes we have had before: there will be some impact on business and life as we know it, but not as much as anyone expects upfront.
The problem is -- just like the food products the author uses as analogy -- there is no reasonable agreed-on line between "natural" and "unnatural", "homemade" vs "factory", and "AI" vs "not-AI" because the processes at work are a gradient.
Is my spell or grammar check an AI? No, why not? What if I translate a word from another language while writing, using an automatic translation tool? What about the algorithms that boost (or don't boost) my post through various networks, and pick an audience for it? And the analytics tools I use? etc. etc. And all of this will just become murkier and murkier.
"AI" is itself a problematic term, since as many of us know many of the things so-marketed are little more than a bundle of heuristics or a complicated statistical model. And many of the "authoring" tools generating texts or images or code are really just copy-and-pasting-and-mutating things without a lot of magic in-between, but in clever ways that "trick" us into seeing it as original authorship. (Ok us humans do this too)
Perhaps the most reasonable dividing line is: is there a machine here that is emulating a human, pretending to be one? That is perhaps the thing that bothers me most about the recent wave with ChatGPT, Google's Meena etc -- the authors of these systems didn't have to create systems that present a human-like identity, with pronouns and the illusion of selfhood. But they did, and now we have the Lemoine incident etc.
Reminds me very much of Dune, and its "Orange Catholic Bible"'s “Thou shalt not make a machine in the likeness of a human mind.” I think Herbert was onto something -- the question of whether something is truly human (or human created) or not might become the key important issue of our time and the ambiguity and confusion around that is going to become troublesome.
Great piece, it seems to have produced a couple of non-ai-augmented thoughts...
1. When ambitious young cooks / farmers etc set out to make a name for themselves in the food industry, I don't get the impression they're rushing towards kosher food development en masse. But I could be wrong.
2. Over the course of my life I've observed (and lived) the response to disillusionment with mass-market factory-farm food: a movement towards local, organic, small-batch, and visibly produced / performed food production.
3. I want to believe a distaste or boredom with ai produced content will lead us all offline and make us more focused on art, music, writing, advice etc produced by people we actually know in real life. Like a farmers market but for everything.
4. If I had to guess, though, I'd say it's more likely in the short term to just increase the popularity of live streaming everything.
I should really have had chatGPT proof read this for me and optimize it for upvotes.
I felt reading the article that AI/ML vs "organic" content will become the difference between fine hand-crafted furniture and flat-pack Ikea stuff.
Sure, the flat-pack is better than nothing for a temporary apartment, but it is something to move beyond, not what we want to be stuck with for the rest of our lives and in all situations.
Quality furniture, especially hand-crafted, is expensive and far from being universally affordable. It's quite likely that the same thing will apply here wrt "artisan hand-crafted content", and thus AI-generated stuff will still constitute the vast majority of what the society consumes.
That said, if it gets good enough that you have to be an expert to tell AI-generated stuff from human-made, does it really matter? The upper classes will always play the silly "look what I can afford!" status games, and these often don't have a utilitarian component to them - something doesn't need to be actually better, it just needs to be recognized as more refined. Well, and cost more.
We take it for granted these days that most text we read is generated by humans. But imagine a future where it is assumed most text is AI by default, and human text is only as common as computer generated text is today. Imagine 9 out of 10 comments on hackernews being AI authored. This comment, AI authored.
The organic analogy to AI refers to locally-produced crops grown with natural fertilizers and pre-industrial pest and weed control strategies, as an antidote to industrial-fertilizer soaked, herbicide-and-pesticide drenched, antibiotic-and-synthetic-hormone laden, globally-transported heavily-processed foodstuffs.
While this does sound good to most people, it's always worth remembering that Old South plantations produced nothing but 100% organic cotton, despite their atrocious practices of human slavery.
If we include this perspective, AI has some extremely positive potential. One of the most promising trends in agriculture is the adoption of AI for weed and pest control, as well as fertilizer use. AI robots crawling over fields can use image recognition to identify weeds (and kill them with IR lasers) as well as pest infestations in their early stages. Application of fertilizer to individual plants on an as-needed basis (rather than just dumping large volumes on the entire field) also becomes possible. This can be done far faster and more efficiently than by human laborers walking up and down rows of crops, weeding by hand. Potentially, this could make organic-style agriculture as cost-effective as the industrial variety.
Sure, this means fewer field labor agricultural jobs - which are pretty tough, backbreaking jobs by any measure. Similar arguments apply to a lot of the drudgery in creative and artistic endeavors.
Of course, AI could also replace most corporate board positions, and a lot of upper management as well, and hey, why not the shareholders and investors too? Are they of any more value to the operation of a business than the replaceable grunt workers are?
Let's make the world's smartest AI and have it make decisions about how capital should be allocated to promising startups, removing fallible humans from the loop. Resource optimization on steroids!
Yes, I read too much science fiction. See William Gibson, Iain M. Banks, Hannu Rajaniemi, and Adrian Tchaikovsky for examples of what can go wrong (and maybe right).
> do you really want to cry watching a movie that was 100% produced by robots?
I've had a lot of fun having ChatGPT write stories for me: I'd ask it to make changes, to add a character, add a motivation, etc. I'm just playing around, and it's 100% produced by a robot and I enjoy it. I don't personally mind having an emotional response to a story generated by a "robot". I don't really understand how its being bot-generated cheapens the experience. The emotions that I feel are elicited by my thoughts and reflections based on what I've read and experienced, not by the robot.
Our emotions and desires are mediated by machines literally day in and day out. There is no point whatsoever in drawing an arbitrary line at a damn movie.
Maybe a good step would be figuring out a way to not immediately commodify all human activity and then smugly tell the affected to learn how to do something new when their life's work is rendered obsolete by technology.
There is a case to be made that talking to a human is the bland option now. Amidst all the censoring and self-censoring and assumption of bad faith, there is a lot that is not possible to discuss in polite company, even online, but is interesting. AIs seem more willing to give me a straight answer rather than going off on moral tangents or ego trips. I have had discussions with the chat that went deeper than with the smartasses on Reddit. I thought that was a lost art.
So, just like with good cheese, maybe I will go with full-fat.
I think we should judge content independently of whether it was created by AI or humans. Labeling AI-free content doesn't seem to add much value to me in general (with exceptions).
> How much of it will be being certain that you are reading something generated by human before you're willing to commit to having an emotional response, even if the output is identical, right?
... Interesting. Like, do you really want to cry watching a movie that was 100% produced by robots? Maybe not.
I certainly wouldn't mind. AI-generated content is a statistical summarization of knowledge produced by infinite humans randomly typing on typewriters, after all...
I believe AI will soon become a helpful tool for many people, just like how smartphones made photography accessible to more people and JavaScript made it easier for people to enter the tech field.
However, we will have a difficult time when it comes to social media, because AI content will become more and more prevalent.
It seems plausible the next social media will remove the social part completely, and just have people as consumers and creators.
I've started to 'learn' the ChatGPT tone (you can even give ChatGPT some text and ask it whether it thinks it was AI-written). Now that I know the general structure and language it uses, I've been spotting it all over Reddit, HN and some blog posts.
I have noticed it makes me bored of reading content; I start to skim through it, assuming it's just AI-generated waffle.
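I even tried turning that 'learned tone' into a toy filter. A sketch along these lines (the phrase list is just my own guess at the tells, nothing like a real detector) already flags a surprising amount of what I skim past:

    # Stock phrases that, in my experience, show up disproportionately in
    # ChatGPT-style output; a hand-picked, purely illustrative list.
    TELLTALE_PHRASES = [
        "it's important to note that",
        "it is important to note that",
        "as an ai language model",
        "in conclusion,",
        "overall,",
        "there are several factors to consider",
    ]

    def chatgpt_tone_score(text: str) -> float:
        # Crude score: fraction of telltale phrases present in the text.
        lowered = text.lower()
        hits = sum(1 for phrase in TELLTALE_PHRASES if phrase in lowered)
        return hits / len(TELLTALE_PHRASES)

    sample = "It's important to note that these symptoms can vary. Overall, consult a doctor."
    print(f"tone score: {chatgpt_tone_score(sample):.2f}")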
I think the quality will keep improving, because humans will keep curating the training data to compete for best results. The downside is that "quality" is conflated with "performs best for a particular audience and platform" which could just as easily mean re-ingesting junk and spitting it back out... because that's what people respond to.
Reminds me of the Dovetail phyle in Neal Stephenson's The Diamond Age where the rich value purely human made goods in a world of matter compilers that can almost instantly make or duplicate just about anything.
On the other hand, I find Stable Diffusion the most interesting thing going on in art at the moment...
You'll need a special AI soon just to read and process the flood of superficially reasonable sounding bullshit AI will be throwing at you from all kinds of directions. On the bright side, the quality of writing style might overall increase with AI paraphrasing and style correction/adaptation tools.
Funny, just yesterday I asked ChatGPT for some nutritional advice:
>can you give me the most common symptom of every vitamin deficiency
Sure, here is a list of common symptoms associated with deficiency in certain vitamins:
Vitamin A: night blindness, dry skin, and poor immune function
Vitamin B1 (thiamine): fatigue, irritability, and muscle weakness
Vitamin B2 (riboflavin): cracks and sores at the corners of the mouth, itchy and scaly skin, and light sensitivity
Vitamin B3 (niacin): pellagra, which includes symptoms such as skin inflammation, diarrhea, and dementia
Vitamin B5 (pantothenic acid): fatigue, insomnia, and muscle cramps
Vitamin B6 (pyridoxine): anemia, skin disorders, and depression
Vitamin B7 (biotin): hair loss, scaly skin, and brittle nails
Vitamin B12: pernicious anemia, which can cause fatigue, muscle weakness, and nerve damage
Vitamin C: scurvy, which can lead to bleeding gums, easy bruising, and poor wound healing
Vitamin D: rickets in children, and osteomalacia in adults, which can cause muscle weakness, bone pain, and fractures
Vitamin E: nerve damage, muscle weakness, and impaired vision
Vitamin K: bleeding disorders, easy bruising, and poor blood clotting
It's important to note that these symptoms can also be caused by other factors and deficiency in vitamins can manifest differently between people. Consult with a doctor if you suspect a deficiency.
>can you list the most recommended food source for each vitamin
Sure, here is a list of some food sources that are high in certain vitamins:
Vitamin A: sweet potatoes, carrots, kale, spinach, and apricots
Vitamin B1 (thiamine): whole grains, nuts, seeds, and legumes
Vitamin B2 (riboflavin): almonds, mushrooms, and dairy products
Vitamin B3 (niacin): chicken, fish, and peanuts
Vitamin B5 (pantothenic acid): mushrooms, avocado, and sweet potatoes
Vitamin B6 (pyridoxine): chicken, fish, and chickpeas
Vitamin B7 (biotin): eggs, nuts, and sweet potatoes
Vitamin B12: fish, meat, and dairy products
Vitamin C: oranges, strawberries, kiwi, and bell peppers
Vitamin D: fatty fish such as salmon, mackerel and sardines, egg yolks and mushrooms
Vitamin E: almonds, sunflower seeds, and spinach
Vitamin K: leafy greens like spinach, kale, and broccoli, also found in parsley, Brussels sprouts and even kelp.
> do you really want to cry watching a movie that was 100% produced by robots?
The fear reminds me of the early days of hip hop, when songs remixed from the past were blasted as unoriginal. I think we've all mostly agreed now that you can build new content that honors the old while being completely fresh.
This sums it up perfectly in my mind; the analogy is extremely apt IMHO. As a musician I find it impossible to spontaneously produce melodies, rhythms, etc., that aren't heavily influenced by what I've listened to - perfectly analogous to how AI-generated art is heavily influenced by the billions of training images it has ingested. The difficulty in composing is finding a way to transform, combine, and synthesize the influences into something new. I have been approaching generative AI in the same way and can't help but feel that some of the fears are overblown. It takes work to get coherent results. If your own work is incoherent or wholly derivative, your AI outputs will also be. If your work isn't incoherent or highly derivative, then I'd argue you've already done a major part (if not most) of the creative work in your mind - the AI is just another tool to realize it, one which empowers human artists to be creative like never before.
EDIT: Okay, now I'm shocked to see your comment grayed out all the way at the bottom. Interesting times we live in.
This strikes me as a distinction without a difference. Content creation is not even close to the same as GMO or processed foods. People are justified in being concerned about how food is made precisely because there are meaningful differences in the nutritional value and trace chemical content in the end products. And that matters because we don't fully understand the role those factors play in long term human health. Created content does not share similar distinctions based on how it was produced.
At the level of our personal experience of a story, image, or sound it does not matter if the content was human generated or even if the stories are real; they either elicit an emotional response or they don't.
This is why we can enjoy fiction and art in the first place. It's also why a given piece of art can mean different things to different people.
Deliberately cutting yourself off from the meaning a piece could have for you just because you aren't yet sure that a human produced it is unnecessarily speciesist and only acts to your own detriment. Should one likewise refuse to engage with works that had assistance from the artist's pets?
With regard to news stories and other informational content, if two stories about the same event are both accurate, timely, informative, and/or insightful, it does not matter if one was written by a human on a typewriter and the other written by a bot after having been prompted by a human. The present state of the art still requires some human editing to hit those marks but we're quickly approaching a time where there will not be any detectable differences.
This idea that people deserve to invent a new category of things to be prejudiced about seems silly and unhelpful to me. Why bother inventing arbitrary distinctions just so people can play favorites and be judgemental about it? People are already free to judge content based on relative merits. That seems like enough to me.
And it's important to account for the fact that the end goal with these tools is that there will be no differences between human and AI generated content. The teams working on these things have already made considerable progress to that end, it's not hard to imagine the next couple of iterations actually achieving that.
So anyone trying to create these arbitrary labels is starting an uphill battle for a fundamentally unhelpful end state.
We really don't need to put that kind of energy into reinforcing people's desire to feel superior. We have more than enough elitism/tribalism/racism/otherism around as it is.
At the end of the day, good content is good content regardless of who or what produced it. We should focus on recognizing and celebrating the beauty, regardless of its origin.
The best part is the “gluten free” anti AI crowd will be left in the dust. They may think they’re being classy snobs but being against AI is going to be tantamount to saying you eat crayons and don’t read well.
I think it’s pretty on-point though, because while Celiac disease is real (gluten intolerance), most people saying no to gluten don’t have it. The notion of gluten “sensitivity” or allergy is debated by medical science to say the least. It’s largely hypochondria, mass hysteria, and often a trendy meme for snobs and hipsters.
Respectfully, you’re wrong. Celiac is not the only kind of gluten intolerance, and I’m speaking from personal experience. I eat gluten, and I get severe issues with digestion. I stop eating gluten, the issues magically disappear. I thought my issues were just “normal” until a couple of years ago, and that gluten intolerance was a BS made-up thing. I was wrong on both counts.
The easiest way to signal that your content is AI free is to publish content that would be very difficult to do with AI, eg conversations between two known personalities on recent news stories.
FWIW I would like to shamelessly plug https://www.humanproved.com here. The idea is that a human needs to pledge that their content is human-generated to get a unique badge, which they can display alongside their content along with a URL. Someone can use the URL to verify the timestamp at which the badge was issued.
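I haven't looked at how the site implements this under the hood, so purely as an illustration of the general idea (hash the content into a timestamped pledge that the issuer records and anyone can later re-check), a rough sketch:

    import hashlib
    import json
    import time

    def issue_badge(author: str, content: str) -> dict:
        # Record a pledge: who claimed authorship of what, and when.
        # In a real service this record would be stored and served by the issuer.
        return {
            "author": author,
            "content_sha256": hashlib.sha256(content.encode("utf-8")).hexdigest(),
            "issued_at": int(time.time()),
        }

    def verify_badge(badge: dict, content: str) -> bool:
        # Anyone holding the badge record can check it matches the published content.
        return badge["content_sha256"] == hashlib.sha256(content.encode("utf-8")).hexdigest()

    article = "An essay written entirely without AI assistance."
    badge = issue_badge("example-author", article)
    print(json.dumps(badge, indent=2))
    print(verify_badge(badge, article))  # True

Of course, a hash and a timestamp only prove when the pledge was made, not that a human actually wrote the text, which is why the pledge itself has to carry the weight.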
>I don’t think you can stop technological progress. Humanity moves in the direction of better and more sophisticated tools to make life more convenient, and it will be more convenient to introduce more and more AI-produced content into the culture.
That is a rather boldly positive perspective.
History shows us that great technologies can come and go and are never distributed uniformly. Think of sewers and drainage systems, for example.
Obviously, sophisticated technologies can be used in odious endeavors, where more efficiency means more atrocity. Think of genocides, for example.