Hacker News
Ask HN: DALL-E was trained on watermarked stock images?
266 points by whycombinetor on Aug 24, 2022 | 227 comments
I just got a Dall-E render with a very intact "gettyimages" watermark on it. I'm no legal expert on whether you have to own the license to something to use it as training input to your AI model, but surely you can't just... use stock photos without paying for the license? Maybe I'm just old fashioned.

Prompt: "king of belgium giving a speech to an audience, but the audience members are cucumbers"

All 4 results (all no good as far as the prompt is concerned): https://ibb.co/gz5RDkB

Fullsize of the one with the watermark https://ibb.co/DzGR063

I am not a lawyer, but I've had to argue about copyright with several.

In the United States, there are two bits of case law that are widely cited and relevant: In Kelly v. Arriba Soft Corp (9th), the court found that making thumbnails of images for use in a search engine was sufficiently "transformative" that it was ok. Another case, Perfect 10 (9th), found that thumbnails for image search and cached pages were also transformative.

OTOH, cases like Infinity Broad. Corp. v. Kirkwood found that retransmission of radio broadcasts over telephone lines is not transformative.

If I understand correctly, there are four factors in the US courts' fair use test, with transformativeness weighed under the first: (1) character of use, (2) creative nature of the work, (3) amount or substantiality of copying, (4) market harm.

I'd think that training a neural network on artwork--including copyrighted stock photos--is almost certainly transformative. However, as you show, a neural network might be overtrained on a specific image and reproduce it too perfectly--that image probably wouldn't fall under fair use.

There are also questions of whether they violated the CFAA or some agreement when crawling the images (but hiQ v. LinkedIn makes it seem like it's very possible to do legally) and whether they reproduced Getty's logo in a way that violates trademarks (are they trying to use it in trade in a way there could be confusion though?)

Search engines don't create market harm for a work because they don't compete with it. In fact, they do the opposite: they advertise the work, making it more accessible and increasing exposure.

These AI tools on the other hand seem to do the exact opposite. They can (or could, if they got good enough) absolutely compete with a work, and therefore seem like they create substantial market harm. The character of use also seems vastly different; AI tools are creating images explicitly to be consumed, vs a search engine is basically just an index, and only shows the image in so far as it needs to make it discoverable.

So three of the four tests for fair use seem clearly against AI image generation, at least to me. The only test that possibly goes in favor of AI is the amount or substantiality of copying, but AIs can easily reproduce images, or if not entire images, other substantial subsets of a composition.

I just don't get how these could possibly be fair use.

As I see it, 3 of the 4 tests are strongly in OpenAI's favor; the 'market effect' is mixed.

(1) The use is highly transformative;

(2) the images used were offered to the anonymous browsing public (with watermarks);

(3) the end effect of training will only retain a tiny spectral distilled essence of any individual photo, or even a giant source corpus;

(4) there's a potential risk of market competition from the ultimate model output, for some uses – but that's also the most 'transformative' aspect.

Getty et al could potentially just ask creators of such models not to include their images – perhaps by blocking their crawling 'User-Agent' – and it might not make any real difference in the models.
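For what it's worth, the User-Agent blocking mentioned above is usually expressed through a site's robots.txt. Below is a minimal sketch of how a well-behaved crawler might check it, using Python's standard `urllib.robotparser`. The crawler name `ExampleImageBot`, the robots.txt contents, and the URL are all made up for illustration and are not any real lab's or site's configuration. Compliance with robots.txt is voluntary on the crawler's side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a stock-photo site might serve. "ExampleImageBot"
# is an invented crawler name, not any real lab's User-Agent.
ROBOTS_TXT = """\
User-agent: ExampleImageBot
Disallow: /

User-agent: *
Allow: /
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/photos/1.jpg"  # placeholder URL
    print(may_fetch("ExampleImageBot", url))  # the blocked crawler
    print(may_fetch("SomeBrowser", url))      # everyone else
```

A crawler building a training set would call something like `may_fetch` before each download and skip any URL it is not allowed to fetch.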

I'm still not seeing the "transformative" argument: the point of transformation isn't "it is in a different format" but (to quote Wikipedia, which is, of course, dumb... I'm sorry ;P) where one "builds on a copyrighted work in a different manner or for a different purpose from the original". The reason a search engine thumbnail is transformative isn't because it has been transformed to make it smaller... it is because the purpose of the resulting use of the image is somewhat unrelated to the use the original author was going for when they made the original image. At issue here is then that, rather than using an original image from Getty Images, someone decided to take all of the images from Getty Images and churn them through some algorithm that generated an image that directly competed with the original images from Getty Images. So like, sure: if you really only narrowly want to talk about OpenAI, what they are themselves doing (training and distributing a model) might potentially be legal, but the people using the result would seem to be in serious hot water... oh, and actually, I think they run it all as a service, don't they? So no: I don't even think that defense works, as OpenAI is in some sense not even selling a model, they are merely directly competing with Getty Images to sell photos to people.

Autogenerated, often fantastical, never-seen-before AI images strike me as a paradigmatically 'transformative' use. It's novel. It's shocking to many practitioners how flexible & high-quality the images can be. It will unlock all sorts of new downstream creation.

The representation that feeds the generation is statistical, even to the point of being plausibly factual: these things/people/places/concepts can be abstractly represented as the balanced weights inside the model. And under US law, facts aren't copyrightable.

I could see a case being factored as: (1) the scraping/training/ephemeralization itself involves the usual copying of downloading/locally-processing images, like indexing, but all those 'copying' steps are fair-use protected, as science/transformative/de minimis/whatever; (2) any subsequent new-image generation no longer involves any 'copying', only new creation from distilled patterns of the entire training corpus, in which Getty retains no 'trace tincture' of copyright-control. So there are no specific acts of illegal copying to penalize.

Also, a human artist would be allowed to review related Getty/etc preview images, free on the web, to familiarize themself with a person or setting, before drawing it themself, with their own flair – as long as they don't copy it substantially. Why wouldn't an AI artist?

"AI artist" doesn't add any of its "own flair". It builds exclusively on past experience and work of humans. And it also directly competes with them without any thought of credit or compensation.

People are really underplaying how damaging this is going to be for the industry. It's going to completely decimate it. You can already see people using names of artists in the DALL-E prompt to get "their" work for a few dollars, avoiding any copyright or social issues.

Artists will suddenly be competing with AI on price and time: why should we pay you a living wage when we can instantly generate something close enough?

Why would anyone try to create some new aesthetic or push anything further if their effort will be replicated next week when the model gets updated with new source data? Everything is gonna get stuck in the aesthetic of 2025 and before.

It's completely inhuman.

The synergistic effect of all the AI's inputs absolutely results in a unique new 'flair', with extensions, reversals, and mash-ups of styles just as in human-made artistic styles.

And AI "builds exclusively on past experience and work of humans" just like any young new human artist equally does. In many cases, you can even tell the different models' outputs apart, not by raw quality or glitches, but by hard-to-describe aesthetic tendencies.

I share your concern on the effect on human artists – both the market for their work, and even their morale, when learning, knowing that decades of practice will still be outproduced by seconds of computation.

But I don't think the genie will be put back in the bottle, by either expansive interpretation of existing copyright law, or even new laws.

Indeed the genie is out. And while we will get some interesting AI uses, ultimately this is degenerative tech. In the end we end up with less authentic, less unpredictable and less delightful art. Instead we get the perfectly suited to us, predictable, mediocre stuff.

I said it in a comment above: yes, people build on the work of others, but they also bring lots of their originality and intellect. Part of what people do is truly uniquely theirs, and piece by piece we progress as a whole.

The crucial detail is that AI learns only from visual patterns from the past and can't think at all. And humans learn from everything around them and think about it deeply.

I don’t believe we will lose the capability to create new original styles. If a prompter can describe the creation of a new style, the AI can create it. Using both iterations of image & text prompts, unique styles will come.

The thinking is still done by the human prompter.

The value of the image is in the human prompter (in the overall concept) but the overall style, the aesthetic, is stuck in the past. It’s almost impossible to describe an aesthetic in text without referencing examples of that aesthetic. It’s a case of one image saying more than a thousand words. It has to be seen.

I am not sure finding new aesthetics is even the playing field nowadays. It’s probably not, because we’ve been stuck for decades. It’s more about cyclic trends of things forgotten. So who cares. But this will just solidify that even more. But yeah, it has already happened, and since the tech will be firmly in private hands everybody will be just exploited and pushed by it instead of it helping anyone.

One could make the same case about humans, nobody works in a vacuum. Even though he used it in a pejorative sense, Sir Isaac Newton, the famous English scientist, once said, “If I have seen further, it is by standing on the shoulders of giants.”

Even humans' capacity to develop their own style could be argued to be just an intermixing of previous work they've seen, combined in a different way, which is effectively what these generative systems do.

Of course humans build on the work of other people. And what they do is partially a mashup. But their work is not only replication of visual patterns. It's thinking, it's other non-visual experiences, it's their politics and world views combined in their work. Often it's their life project.

To think that artists only mash up what was before them is quite obviously wrong.

But it's exactly the only thing the tech does.

I'd argue that reproducing an artefact such as a watermark is copying more substantially than any human would, and a human who did so would at best be labelled unoriginal or very derivative, or be in violation of copyright.

Perhaps I’m misunderstanding your argument, but my counterexample would be: if a human digital artist transformed a Getty image, resulting in a fantastical, never-before-seen result, using software like Photoshop, that use would be no more defensible. If anything, the vast scale at which this occurs in AI makes it worse.

I think your hypothetical would depend on the character & extent of the transformation. Mere filters that leave the original recognizable? Probably an infringement. But creative application of transformations to express new ideas? Maybe not – especially if the derivative is a comment/parody on the original, that actually increases interest in it. Most art is a conversation with the past, reusing recognizable motifs & often even exact elements.

For example:

Andy Warhol died in 1987, 35 years ago. One of his 'Prince' collages dating to the early 80s used another photographer's photo, without permission. In 2019, one federal judge ruled that was not infringement. An appeals judge then said it was.

The Supreme Court has decided to take the case.

The US Copyright Office & Department of Justice agree with the photographer in briefs filed with the court... but the mere fact the Supreme Court took the case indicates they think there might be issues with the appeals court ruling. They might agree with the original judge!

Oral arguments come this October. See:


So, when all the (possible) disputes over AI-training-on-copyrighted-images resolve – maybe in the 2030s or 2040s? – what will the laws say, & courts decide? It'll depend a lot on other specifics, & reasoning, that may not be evident now.

Thanks, that is a thorough and interesting reply.

I find legal disputes in fine art interesting, however—IANAL, of course—I understand that fine artists (Richard Prince comes to mind) are subject to very different copyright restrictions than graphic artists under commercial use.

It’s, as you said, up to courts to decide. But AI generated imagery is frequently commercial in nature (KFC, already). AI services are trained on unlicensed commercial stock images, and are able to reproduce enormous quantities of derivative images, and do so at a profit. I think that’s categorically different from a fine artist appropriating imagery in a single artwork or even series of artworks in an entirely different context.

These AI generated images are directly competing with stock images. AI tools are selling images to blogs and other customers that often would purchase stock images instead.

The "character of use" is not in favor of dall-e, it is a commercial use.

Copyright law does not require Getty to block user agents or ask them not to include their images.

Another issue here is that removing copyright management info like a watermark is a violation of the DMCA, separate from fair use or copyright infringement. These cases have statutory damages and attorneys' fees awarded.

Whether something is directly competing for the same business would have to be evidenced, and copyright doesn't mean protection from all possible competition - it's just one factor weighed. And fair use protects many commercial uses, too, depending on proportion/character-of-original/etc.

But also, none of these images are direct, or even necessarily substantial, "copies" of other images. The generator learned from other images – the same as any human artist might.

No watermark has been removed; the bigger issue may be that the spectral watermark violates a trademark. (But, I doubt consumers are likely to be confused.)

"The generator learned from other images – the same as any human artist might."

A lot of people seem to make this comparison, but I don't think it's fair. It's wrong. A computer is capable of ingesting/processing and "learning" from images at a rate no human can possibly come close to matching. To elaborate, it is not actually learning in the way we normally think of it, as its "brain" is completely different from a human's brain. It is doing something entirely different that should have its own word. Human artists learn from other human artists' work. An AI does something else.

It's also worth noting that the art the AI was trained on was posted online when the technology didn't exist (or if it did in some form it was not in the state it is in now). So an artist having posted their art online for public consumption can't be equated with somehow consenting to its consumption by a web scraper / AI.

It's great that human artists learn from, & introduce into their work, influences other than just patterns seen in other works.

But it's also great that AI artists can learn from more examples in a few minutes than a human artist might see in a lifetime.

To say that's "not actually learning in the way we normally think of it" is superficially true, but it doesn't mean it's "not actually learning", or necessarily any worse than typical learning. It's so new, & we barely understand fully how it works or what its limits are. It might be better in many relevant & valuable aspects!

Fair, I don't know what it's actually doing. I just know you can't equate it with anything a human does, and the use of the word "learn" is misleading, or vastly oversimplifies what is happening, to the point that it allows for false analogies.

That said, my main objection to this technology is that:

- The AI's work is based on human artists' work

- Companies are then profiting off of the AI's work

- The companies are indirectly?/directly? profiting off of artists' work

- The companies do not get artists' consent or compensate them in any way

- The companies are essentially stealing from artists

Companies should be forced to obtain the creator's consent when using art to train their models.

It’s going to be interesting what the stock companies will do. Maybe they will make their own Image Generator. Perhaps we will see a case based on the new factor that is AI. An AI is not an artist; they can’t be conflated. A decent artist can churn out maybe 5-10 works if he is productive. AI can churn out hundreds or thousands if needed. The process also isn’t the same.

Anyway it will be interesting to watch this space.

AI generated images can't be copyrighted.

Given the iterative contribution of an artistically talented human prompter, I'm not sure that precedent – set by the Copyright Office in the US, rather than a clear statute or court decision – will hold up. A court might decide differently, or a statutory update could overrule the Copyright Office, especially in cases where an individual output is the mix of human & AI effort.

I have a hard time agreeing with 3, given https://ibb.co/DzGR063

Aside from whether the image itself infringes copyright, the use of the Getty watermark probably raises a bunch of issues.

> Search engines don't create market harm for a work because they don't compete with it. In fact, they do the opposite: they advertise the work, making it more accessible and increasing exposure.

AMP, snippets, Knowledge Base and in-app browsers would like to have a word with you

Knowledge Base I grant you, but snippets are a crucial feature to trust a result is correct before clicking through.

AMP is completely unrelated so I'm not sure why you mention it. Website owners have to create a specific version of their own site for AMP to even work.

It seems it is possible to generate images which are very similar to the existing stock photos if you feed getty images' description into DALL-E.

I tried it with a distinctive banana image:


"very similar" insofar as it's following the narrow prompt, sure.

> Different runs can generate different size, orientation and placement of the bananas, as well as different shades of pink.

At that point it's definitely the curation causing any possible derivation. The image generator is innocently doing what you ask in an unbiased way.

Those bananas are completely different. There's no copyright infringement there. I could take a photo of a banana and photoshop it repeatedly onto a pink background. That would look just as similar, and there's no copyright problem there.

You can't copyright an idea.

Images are different, but it appears that DALL-E is inspired by the aesthetics and the layout of the copyrighted material.

Another example, picking a random image from the Getty Images site. "A young parkour flips through the city,guangzhou,china, - stock photo":


The images are obviously different, but it appears that DALL-E maps the getty images description to similar tone, similar perspective, similar background, and similar weather conditions. I'm sure there are thousands of possible backdrops in Guangzhou, and many ways to show a parkour flip. Even in the Google image search results there's more variance than in the output of DALL-E.

So you can't copyright an idea, but you can certainly scrape a copyrighted DB with image metadata, and use it to create your own product. My point is that DALL-E itself might be a derivative work of Getty Images and thousands of other online catalogs.

Interesting. Adding "stock photo" to the string generated that getty tag? That is probably the most attackable (alas easy to fix) part of the issue. It will be an interesting question how close to the original a picture has to be to be considered the same (I'm sure there's some case law) and maybe there's some new research to be done regarding how to recreate the training data images with the correct search string (I suppose one could build an ML model for that).

Fun times ahead

No, I didn't get the tag. But I suppose that Getty metadata as well as the images were used for training.

From what I understand, the actual process of fair use boils down to "the judge decides in his/her gut if the use is fair, and then writes up the analysis to justify coming to that conclusion." If you look at the recent SCOTUS opinion in Google v Oracle, you can see how two judges can look at the same facts and come to almost diametrically opposed fair use analyses. My further understanding is that generally the #1 overriding concern in fair use analysis is money, which means you're more likely to see analysis along Thomas's dissent than Breyer's opinion.

In this case, let me give a fair use analysis that is going to suggest that this isn't fair. Factor 1 weighs against fair use: it's not transformative because, well, transformative is extremely narrowly interpreted against fair use. Factor 2 weighs against fair use because, well, it's factor 2 and it weighs against fair use unless the underlying copyright was paper-thin in the first place. In factor 3, it's weighing against fair use because it's not copying the minimal amount of the original work to get what it needs (it copied the watermark after all!). And factor 4 of course weighs against fair use because you're essentially creating stock images which is naturally in the exact same market that a stock image provider is in.

If you wanted to write a fair use analysis that finds fair use, you'd argue instead that the work was transformative, and the amount copied also weighs in favor of fair use (thus converting factors 1 and 3 to weigh in favor of fair use). You might try to argue that it's a completely different market, but I'm incredibly skeptical that such an argument could win over both a district court and an appeals court (although Breyer's opinion in Google v Oracle did basically follow this thread of analysis, its repetition is unlikely since everyone wants to pretend that Google v Oracle has 0 impact on anything outside of software). Such an analysis is possible, but unlikely, since the unspoken factor of "could you have paid for this" tends to be the factor that wins out over everything else.

Note that we are going to have a SCOTUS case in the fall that will specifically explore transformative uses in the context of fair use: Warhol v Goldsmith (https://www.scotusblog.com/case-files/cases/andy-warhol-foun...). I'm not going to hold my breath that the use will be found fair, though.

Putting aside the core question of the legality of training data on licensed material - what about the false advertising/copyright aspect that comes with slapping a "GettyImages" logo on some random nonsense generated by a "neural network"?

It's not worth discussing Getty so much. AI labs will collect a dataset to predict if an image is watermarked. They will crawl and index the Getty images to make sure they are not in the training set. Then retrain, and in 2 months the problem is solved. They can cut out a sizeable part of the training set without a problem; the model will still be good.

They can also OCR the output to make sure there are no blacklisted words and use an index to skip all images that look too similar to the training data. Then the argument of copyright defenders is going to be weakened.
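To make that concrete, here is a rough sketch of the kind of output filter described above. Everything in it is an assumption for illustration: the blocklist contents, the function names, the distance threshold, and the use of a simple average hash as the "looks too similar" check. The OCR step is assumed to happen elsewhere and hand over plain text, and images are assumed to be already downscaled to small grayscale grids. Nothing here is how any actual lab does it.

```python
from typing import Iterable, Sequence

# Illustrative blocklist; a real one would be longer and maintained separately.
BLOCKLIST = {"gettyimages"}


def contains_blocked_text(ocr_text: str, blocklist: Iterable[str] = BLOCKLIST) -> bool:
    """Check OCR output for blocked terms, ignoring case, spaces and punctuation."""
    normalized = "".join(ch for ch in ocr_text.lower() if ch.isalnum())
    return any(term in normalized for term in blocklist)


def average_hash(pixels: Sequence[Sequence[int]]) -> int:
    """Average hash of a small grayscale grid: one bit per pixel, set if above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def too_similar(pixels: Sequence[Sequence[int]],
                training_hashes: Iterable[int],
                max_distance: int = 2) -> bool:
    """True if the image hash is within max_distance bits of any training hash."""
    h = average_hash(pixels)
    return any(hamming(h, t) <= max_distance for t in training_hashes)


def should_release(ocr_text: str,
                   pixels: Sequence[Sequence[int]],
                   training_hashes: Iterable[int]) -> bool:
    """Release an output only if it passes both the text and similarity checks."""
    hashes = list(training_hashes)
    return not contains_blocked_text(ocr_text) and not too_similar(pixels, hashes)
```

An average hash only catches near-identical images, so a production filter would presumably use a learned embedding index instead. The control flow would be the same: check the text, check similarity, and drop the output if either check trips.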

The fact that a prompt and curation are necessary also goes against the "AI works can't be copyrighted" narrative - it's generated by a human-AI team, so human work is part of the process.

The core of the issue I see is that human and AI both learn from the published media but an AI can both "see" and "draw" more than a human, so there is an important distinction there.

I understand that there are (both practical and theoretical) ways to reduce the chances of an AI generating an image that has copyrighted elements in it (such as the "GettyImages" logo).

I'm mostly curious about the legal aspects of having a black-box system that can - under some unknown circumstances - attach openly copyrighted or trademarked elements (such as a company logo) to a piece of work.

> (2) creative nature of the work

Is AI even capable of having a creative nature. All that I see is re-use of source images.

All large-scale public machine learning stuff is depending on being exempt from copyright restrictions, under fair use doctrine. Look at my responses to all of the threads about Copilot + GPL for more info about that application of it: https://hn.algolia.com/?query=chrismorgan+copilot+gpl&type=c....

When that is finally tried in court, if it fails to any meaningful extent at all (including going all the way up to Supreme Courts as it doubtless will), then Copilot is dead, DALL·E is dead, GPT-3 is dead, all of these things will be immediately discontinued in at least the affected jurisdictions, at least until such a time as they get the laws changed or judgements overturned.

To me this feels like the argument that we should allow Uber and Airbnb because they're sufficiently "transformative" use cases. When clearly they are playing fast and loose by the rules and have taken advantage of being early enough to do so. As soon as the rulemakers caught up, it became obvious that they didn't have a license to operate differently from everyone else, just because they're new and popular.

Personally, I agree that a strike against AI fair use would kill this current generation of tools. But I don't see why that would be the end of it. What it would do is to create a market for open source data sets with liberal licenses. We'd lose something by not being able to train models on every piece of media that has ever been on the internet anywhere, but it's not obvious to me that was ever really reasonable in the first place. If the only way to make AI that can produce good writing is to train it on every piece of writing ever produced in the history of the human race... aren't we missing something? Surely if AI has a future, it'll have to overcome this at some point.

> What it would do is to create a market for open source data sets with liberal licenses.

This is exactly right. Open Datasets is the way to go. I would also say that in the spirit of the Open Access movement for journals and publications, it might be useful to set up an Open Access protocol for training data sets, methods (these are just the algorithms; publishing them openly might be the way to go) and computed models.

This will ensure that models are evaluated for risk by a large set of people and any risks/shortcomings could be addressed soon. Quite similar to how cryptographic algorithms are designed / analyzed in public. Obfuscation might look like it helps, but it doesn't in the long run and just creates more headaches.

> As soon as the rulemakers caught up, it became obvious that they didn't have a license to operate differently from everyone else, just because they're new and popular.

This is well said. One of the primary advantages of these businesses was evading the regulation and taxation that their competitors were subject to.

Indeed, but this is the proof the market didn't want the regulation and taxation and wanted better technology.

Shouldn't we listen to what people want over bureaucrats?

People want a lot of things, but people can be egoistic. What people want is not necessarily good for the society as a whole. That's why laws exist.

There is no "fair use" when it comes to laws and regulations.

It's not going to fail: the US courts are big company biased, and all the big companies are going to show out in force and money to ensure they get the result they want.

But even extending that: knocking copyright'd images out isn't going to stop these systems. We know they work now, so if you have to be careful about licensing then that's just going to be done.

The idea that any of these platforms will "die" if copyright fair use doesn't automatically apply is magical thinking. Most art is worthless - companies hoovering up huge corpuses with the correct rights assignment for machine learning is going to be the new business.

A company like Disney will drop every piece of output from their staff into a dataset for Disney, then license it under terms to other companies - the tech works, so "invent me a disney character looking like..." would have value internally, just to Disney, for idea generation and refinement - arguably a lot more than to anyone else because they would still retain the artist resources to capitalize on it.

Right now, a bunch of people who told themselves that despite the pay, they weren't going to be replaced by AI are shrieking that it's turned out not to be the case (it was obvious for a few years something like this was coming though). They're reaching for every legal tool that they hope will kill these things, forgetting that it's never worked out like that. Copyright being a problem when it happens to you as an individual, is different to when it happens to MegaCorp Inc. which is constantly being sued, has limited liability, and puts payouts down as a line-item expense.

> A company like Disney will drop every piece of output from their staff into a dataset for Disney, then license it under terms to other companies

Knowing Disney, those terms would 100% include "subject to Disney's approval obtained prior to publication", and a chunk of money extracted from you. That company is very controlling of their IP.

Either that, or you're talking about a dataset that generates very specific looking images, that do not remind you of any major classic Disney property (so not "every piece"). Your own Jedi / Avenger avatar creator maybe, custom Mickey Mouse world character absolutely not.

For my part, I think you’re right, and that it’s unlikely to fail in at least the USA, and, after consideration of what you wrote, that if it did fail in any meaningful way it would push things in the direction of copyright pools (like patent pools); but at the very least, it would be a massive disruption which would take some time to be sorted out and require a certain degree of starting from scratch in data sets; and all up it’d probably favour big business even more heavily than the current informal consensus. I think there’s also a fair chance in such a situation that European countries with their different approach to copyright philosophy would act as a balancing force, striking down overly-general copyright-assignment-equivalent clauses in terms of service and the likes, which would be the only real way of sucking up as much everything as these models need to work well, especially for providing retroactive relicensing (to avoid a big hole in their sources).

I personally feel like that lawsuit happens the moment someone builds the version of this that works on music; in my experience arguing before the Copyright Office at the Library of Congress, the party that tends to be the most omnipresent is the RIAA, and when someone releases an AI-generated piece of music that sort of sounds like some recent Taylor Swift song but using that infamous sample from Under Pressure / Ice Ice Baby, the lawsuit will be filed within days.

Great points but scary. If training ML models on copyrighted data becomes illegal in the US but remains legal in say China or Russia then the US will quickly fall behind on ML capabilities - major national security implications at the very least. I suspect if the decision went the way you suggest Congress would have to change the law to allow training.

Isn't that true for all technology? In the U.S. we have the specwriter system which leads to inefficiency to get around copyright. In China or Russia they just copy the code and iterate.

As a Russian programmer who has worked in companies big and small, I cannot say this is even remotely right.

All companies I worked with really cared about cleanness of the origin of the code.

My feeling is that while these things may be technically falling under fair use, I really feel like they are running roughshod over a lot of ethical and moral lines and that perhaps "fair use" needs to be redefined to explicitly exclude this kind of processing.

And if it kills these things, oh well. "Being an artist" is a precarious enough existence in this world as is, I'd be delighted to stop worrying about having to compete with an endless sea of algorithmically-generated barely-good-enough spam.

You will still be competing against it. It'll just be outsourced and presented as manual work.

True. But there wouldn’t be a growing number of easily accessible websites happily letting everyone type whatever prompt they desire, and there wouldn’t be further development being done in this area.

“Zero people trying to disrupt my job with art generators” would be ideal, but “a lot fewer people trying to disrupt my job with illegal art generators they can get in trouble for using” is still better than “hey we released this great copyright infringement machine for anyone to use for free”.

There will be. They'll just be outside of your country's jurisdiction.

Exactly this. It will just become a tool used in secret by those who can afford it, no matter what copyright it breaks.

I have been saying this for a long time. These big companies with an investment in getting these systems working could collaborate with Twitter and Instagram to include image license options. They could give users the option to opt in to data collection. This would help the public feel like they’re giving consent, and it would also lead to the creation of massive, properly licensed open data sets. There are already some big open image datasets to start with.

And of course, they could always work more on sample efficiency.

I’m not particularly familiar with open image data sets, but I doubt that most of them are suitable. Open Images, for example, uses CC-BY images (https://storage.googleapis.com/openimages/web/factsfigures.h...). Without the fair use exemption, this would suggest that if you used a model trained on that, you would have to comply with the license of every image, which would mean providing attribution for every single item in the set, which is somewhere between infeasible and impractical.

The only types of licenses suitable are ones that require nothing like attribution. This is why you’d mostly be limited to public domain materials (though if it went down this way, you’d find terms of service popping up that included a license grant for model training and selling either your data for model training or trained models without any sort of attribution or remuneration).

Actually, I think they could comply by providing a list of every author in the dataset, though this is following the letter rather than the spirit.

They could also do research on ways to get the model to return the top ten influential works for some output, and make a legal argument that this is a best effort given technical challenges with tracing every source.

For the first point, I get that by reading the license at this image:


“attribution – You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

share alike – If you remix, transform, or build upon the material, you must distribute your contributions under the same or compatible license as the original.”

So, output has to have the same license and worst case the image is accompanied by a link to a list of every author in the dataset. This is a far cry from the “this research would be useless if copyright was enforced” as some people suggest.
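To make the "link to a list of every author" idea concrete, here is a minimal Python sketch of building such a page from dataset metadata. Everything in it is an assumption for illustration: the record fields (`author`, `title`, `license`, `url`), the helper name `build_attribution_page`, and the example.org entries are hypothetical and not taken from any real dataset. Whether a list like this would actually satisfy a given license is a legal question the code cannot answer.

```python
from collections import defaultdict


def build_attribution_page(records):
    """Group works by author and return a plain-text credits listing.

    Each record is a dict with hypothetical keys:
    'author', 'title', 'license', 'url'.
    """
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec["author"]].append(rec)

    lines = []
    for author in sorted(by_author):
        lines.append(f"{author}:")
        # List each work with its license and source link.
        for work in sorted(by_author[author], key=lambda r: r["title"]):
            lines.append(f"  - {work['title']} ({work['license']}) {work['url']}")
    return "\n".join(lines)


# Made-up metadata standing in for a real dataset's index.
records = [
    {"author": "Bob Example", "title": "Harbor at dusk",
     "license": "CC BY 2.0", "url": "https://example.org/2"},
    {"author": "Alice Example", "title": "Cucumber field",
     "license": "CC BY 2.0", "url": "https://example.org/1"},
    {"author": "Alice Example", "title": "Apple stall",
     "license": "CC BY-SA 4.0", "url": "https://example.org/3"},
]

page = build_attribution_page(records)
print(page)
```

The point is only that the bookkeeping scales linearly with the dataset; the harder part is getting reliable author and license metadata in the first place.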

Here’s an open dataset I found which does not require attribution: https://www.pexels.com/creative-commons-images/

And Wikimedia commons, some of which require attribution: https://commons.m.wikimedia.org/wiki/Category:Images

And it is easy to go take a 4k video camera and start collecting tens of thousands of frames of your own images.

My point is that people are throwing up their hands saying, oh, respecting the copyright of artists is impossible. But it feels very unfair that these huge companies are walking all over the copyright of small artists, when if we took their code to reuse, their lawyers would sink us overnight. This upsets a lot of people and it’s a bad look. I don’t actually like copyright, but if everyone else has to follow the rules I don’t like giving them a free pass.

> They could give users the option to opt-in to data collection. This would help the public feel like they’re giving consent

Would we get paid? I'd care a lot less about OpenAI profiting from my work if I was getting commission every time my work was hit in the training data.

I guess this is what those NFT supporters were talking about.

Nah, you could zap the training sets tomorrow and start over with public domain material and it would be fine. In fact I think you could easily get paid to generate more content for it.

For text (the GPT-3 case), that’d work to train a model that had no knowledge of the last century of popular culture or idiom, and was significantly biased to more formal and traditional writing styles. The effects of this would be really quite interesting, but I think it would significantly limit the places it could be usefully applied.

For DALL·E and Copilot, I’m confident that you couldn’t find anywhere near enough material to produce results anywhere near as good as what there is now. I strongly suspect the results would be too poor to be useful in most places where they may be useful now.

IANAL, but what you should be able to do is have a set of quotes of one or two sentences each from various sources (books, TV, movies, etc.) that contain the modern word or idiom you are specifying. That should then be small enough to fall under fair use (the way IMDB, Wikiquote, etc. have quotes from films, and Goodreads, dictionaries, etc. have quotes from books/text), and be plenty to capture the meaning of the words/idioms.

You could create your own sentences that you control the copyright of containing the word or idiom, as those words and idioms themselves are not copyrightable. For example: "I fracking hate ice cream!"

For the rest, there is a lot of text up to 1926 (depending on when the author died) that is available for use, so you only need to capture words and idioms that have changed since then, including any pop culture terms.

- "E.T. phone home" was considered copyrightable enough to stop others from making money with it[1]

- trademarks are protected even when the content itself wouldn't be copyrightable; you can't sell AI-generated T-shirts that "happen to" include the word Nike.

- NBC has a trademark on three tones, total length under 2 seconds[2]

[1]: https://fairuse.stanford.edu/2003/09/09/copyright_protection...

[2]: https://en.wikipedia.org/wiki/NBC_chimes

There are tons of English text and images with permissive licensing. All of Stack Overflow and Wikipedia is Creative Commons. Anything created by the US government or many other governments is public domain.

The terms of the CC-BY-SA licenses that Stack Overflow and Wikipedia largely use cannot practically be satisfied in a data model. By design, all outputs derive from all sources to some extent, and the licensing requires that they generally be specifically identified, so you can’t just say “from Wikipedia” or “from Stack Overflow” but “from such-and-such a page, by so-and-so”.

“Permissive” is not enough. You need no-strings-attached, and attribution is a string. Hence mostly talking about public domain materials, which make up the vast majority of suitable materials.

I suspect that covered works of the USA federal government would be quite a large fraction of the public domain material (as reckoned by the USA) from the last 70 years. I don’t believe it’d be enough to be particularly useful, certainly not for pop culture knowledge or colloquial idiom.

Big user-generated-content websites (reddit, Facebook, etc) could start new business models of licensing their text for training purposes.

They will be discontinued, but of course the profits made during all this time-- with everyone including those companies knowing how it is basically laundering intellectual property-- will stick.

As a private citizen, what can I do to hasten this trial in court? I have a feeling that the sooner the better, before this becomes a big industry that the US government might not want to hurt.

These are the absolute worst DALL-E images I've seen. Do people generally just share the amazing ones and most of the output is actually complete shite? Like Instagram presenting the top 1% of people's lives.

Top 1% is a bit exaggerated, but there is definitely a lot of not good stuff. I find that Dall-E does especially poorly with underspecified prompts too, unlike something like Midjourney which can give visually pleasing photos for even the most abstract concepts. Dall-E tends to do better with concrete and specific prompts.

Here's an example: Stressful Shapes

Dall-E: https://i.imgur.com/JBkSh0y.png

Midjourney: https://i.imgur.com/C02Zq3i.png

On the other hand, here's a specific prompt: "nerdy yellow duck reading a magical book full of spells"

Dall-E: https://i.imgur.com/FMKZ8zc.png

Midjourney: https://i.imgur.com/lpsg6af.png

But still "king of belgium giving a speech to an audience, but the audience members are cucumbers" is very specific.

And I don't see the king of Belgium anywhere, two pictures have absolutely nothing to do with the prompt (no king, no speech, no audience, no cucumber), one has the speech and audience but no king or cucumber. Graphically, they are deep into the uncanny valley. Only the third image is kind of right, if you really stretch your imagination.

I’ve been comparing Dall-E, MidJourney, and StableDiffusion. Goes to show how much training set and implementation choices matter. But in all cases, you have to think of the underlying labeled text-to-image sets as paint colors to mix, and prepare a palette accordingly. Still haven’t figured out how to get what I want, but to your point, one can get closer.

- - -

Not sure if this is why, but with OpenAI’s Dall-E, you can’t use public figures. You can use proxies, such as “60 year old banker with salt and pepper hair” and then fill in the rest, e.g. “handsome 60 year old banker with salt and pepper hair giving a speech while standing above 12 cucumbers”:


Telling it oil painting can fudge who the person is, then pick one that’s close and generate variations:


Or use a reasonable photo and then use edit and in-painting to try to improve the implausible subject. This takes a photo from the first prompt above, erases the lower half of image, and makes a new prompt for the lower half, while keeping just enough of the upper half to orient the collage, e.g. “[photo_edit] + banker giving a speech to cucumbers bin full of cucumbers”:


- - -

Over on MidJourney, where it’s happy to use public figures so long as you’re not violating terms of service about their use, first a couple prompt experiments with King Philippe of the Belgians.



Then one upsized plausible painting from among those, where the actual command was “King Philippe of Belgium talking in a large group of cucumbers --q 2 --uplight” which is pretty basic.


> you have to think of the underlying labeled text-to-image sets as paint colors to mix, and prepare a palette accordingly.

Very insightful tip on how to harness the "creativity" of Dall-E and the like.

I see how the phrase "king of belgium" was too vague for Dall-E, so it didn't produce anything recognizable - but changing the words into known details, like "banker" and "salt and pepper hair", worked effectively to generate concrete imagery.

Hilarious results. :)

It's not that it's "vague"; they intentionally throw it off when you try to generate a photo of a named person. It's an intentional protection they put in. If you just do "king" it'll likely do fine, but if it's referring to a specific person it won't.

Ah I see what you mean - "king of belgium" is a real person, so they put some safeguards in DALL-E to prevent recognizable images for such queries. Makes sense.

IIRC, DALL-E filters requests related to politicians/celebrities. A friend had tried to make some funny stuff involving the Greek PM a couple of months ago, and it plainly refused. Now, it seems to process the request, but it will not show anyone resembling the person you asked for.

Could that have to do with Sophie Wilmès, Belgian PM 2019–2020? (https://en.wikipedia.org/wiki/Prime_Minister_of_Belgium#Livi...)

I gather Midjourney was trained primarily using Journey album covers?

I find Midjourney to be biased towards an artistic representation (for some definition of artistic), whereas Dall-E is happy to produce children's scribbles or poor imitations.

Try 'poorly drawn ... by a 5 year old using crayons' in Midjourney.

Even then Midjourney is more high-quality :)

See https://imgur.com/gallery/U5zJMcU

Comparison of two prompts, "poorly futuristic landscape by a 5 year-old" and "poorly drawnn highly detailed futuristic landscape dotted by mahcinery and tall buildings by a 5 year-old"

Also, https://imgur.com/gallery/jvEClos

Comparison of "poorly drawn red sports car in the street of a city by a 5 year-old"

Edit: forgot about crayons :D

That's honestly one of my favorite prompts. It's funny to think I use this state of the art AI to generate crayon drawings, but they look so great!


Now that I'm aware and biased, DALL-E's first image indeed looks very much like stock photo training. This would also make sense given how they can correlate the image with words completely for free due to pretty extensive metadata.

What puzzles me is if the Getty Images logo can sometimes appear. If you only have a Getty account, you get rid of the logo and can legally use them royalty free?

No, but you can feed your Getty image into Stable Diffusion img2img and see what comes out.

> On the other hand, here's a specific prompt: "nerdy yellow duck reading a magical book full of spells"

> Dall-E: https://i.imgur.com/FMKZ8zc.png

How well it learned all the common prejudices!

"nerdy" == wears glasses

I'm applauding.

I'm already looking forward to AGI based on the current approaches… It will lead us finally into a better world, for sure. /s

How do you visually show 'nerdy' without resorting to the glasses stereotype? Your prompt is specifically requesting a prejudiced image.

Sure. And the AI serves the expected stereotype.

Isn't that great? The world will become a better place with AI everywhere.

We need especially more AI in law enforcement, and such…

AI should make important decisions. Because it bears the same prejudices as humans. So it can replace humans just great. ;-)

The AI is making no decision. The person entering the prompt made the decision to include the term. It is performing the same function as a pencil.

Now, if 'criminal' rendered as a black male 90% of the time rather than a crouched white male wearing a cheesy burglar mask and a sack over his shoulder, then I could see your point about perpetuating prejudice rather than stereotypes.

OP constructed a horrible prompt. First of all, using King Philippe I. is against the ToS, so let's go with a generic "king".

Let's not confuse the AI with "buts", just say that he is giving the speech to cucumbers.

Lastly, specify some style, because this would probably not work out as a photo.

My single try is not bad at all and it could definitely be improved.


I tried it with Stable Diffusion as well. You can use actual people and the model is even pretty decent at many of the famous ones.

On the other hand, it is more difficult to get it to produce absurd results like these.

my prompt: King Philippe I. of Belgium giving a speech surrounded by [[[[large green vertical cucumbers]]]], digital art in the style of Greg Rutkowski


Is the usage of the surrounding brackets some kind of keyword weights specific to stable diffusion?

I think it's specific to SD. [Square brackets] increase the weight while (simple brackets) decrease the weight. In the cucumber case I used them to force the model to take into account the less believable part of the prompt, because otherwise stable diffusion often ignores such parts.

There is always some cherry-picking, but prompt engineering is an art per se, you become better and better working on it. I just started this experiment https://www.instagram.com/unshushproject or without Instagram https://unshush.com and spent A LOT of hours and patience to become good at it. Now I'm very proud of my results and I'm working on doing better.

It's a bit risky to invest too much time because every generator is different and they change the underlying model frequently (see the beta of MidJourney yesterday), but if you do it for passion or curiosity there is no problem.

Now I'm experimenting with a local installation of Stable Diffusion (well, not really "local" because I have an old computer) and the prompt is only one of the things you can tweak. There are num_inference_steps, guidance_scale and other parameters.
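For anyone curious what tweaking those two knobs looks like in practice, here is a rough Python sketch that only builds a grid of settings to try; it does not call any model. The helper name `build_sweep` and the specific values are made up for illustration. The keyword names are assumed to match the `num_inference_steps` and `guidance_scale` arguments of a Stable Diffusion pipeline call (as in Hugging Face diffusers), so each dict could in principle be unpacked into such a call.

```python
from itertools import product


def build_sweep(prompt, steps_options, guidance_options):
    """Return one settings dict per (steps, guidance) combination."""
    return [
        {
            "prompt": prompt,
            "num_inference_steps": steps,  # more steps: slower, often cleaner
            "guidance_scale": scale,       # higher: follows the prompt more strictly
        }
        for steps, scale in product(steps_options, guidance_options)
    ]


# Illustrative values only; good ranges depend on the model and sampler.
sweep = build_sweep(
    "a king giving a speech to an audience of cucumbers, digital art",
    steps_options=[20, 50],
    guidance_options=[5.0, 7.5, 10.0],
)

for settings in sweep:
    # With a real pipeline object this is where pipe(**settings) would go.
    print(settings)
```

Keeping the random seed fixed across the sweep (not shown here) makes it easier to see what each parameter changes on its own.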

Prompt Engineering can help a lot but yes, you're basically right: People are generating many, many images and sharing only the best ones with the fewest artifacts.

For simple prompts with little additional guidance, all the diffusion image generators I've seen/used will produce output about like what the author linked most of the time. There are always a few gems, and honing in via prompt engineering helps immensely.

I disagree, with proper promptcrafting you can expect far far better results than the one in the op, before any cherry picking. (see my comment sibling to yours)

With Dalle-2 I get a satisfying result in >50% of attempts and I'm a beginner.

With Midjourney the result almost always looks great, but often misses some part of what I wanted. I'd say Stable Diffusion is similar. The results are seldom crap, but it's difficult to bend it to produce unusual situations. And in SD it's difficult to keep the entire objects in the frame, but that's a different problem.

Um, have you read the prompt? It looking weird is simply the result of "the audience members are cucumbers". The more crazy your prompt is, the worse the results will generally get.

On top of that, DALL-E2 generally has issues with anything dealing with multiple objects. A single person will render fine; groups of people will generally give artifacts. Attributes will also be spread across all objects in the scene, not just the ones you specified in your prompt, so doing anything more complex will require manual uncropping and inpainting, not just a single prompt.

Anyway, if you avoid the obvious weak spots and holes in the training set, DALL-E2 output is for the most part pretty amazing out of the box. It's really more a top 50% than a top 1%.

The biggest bias when it comes to published DALL-E2 images is the prompts. Most prompts you see online are not the actual prompts, but funny descriptions made by a human after the fact. The actual prompts are often much longer and sometimes completely different.

I have found being as direct as possible and removing duplicate or superfluous words works best.

Perhaps this rewrite may yield better results:

"King of Belgium gives a speech to an audience of cucumbers"

I've been reading some folks saying that "prompt engineering" is a legit future vocation in a world where AI has taken over a lot of creative work

And from my experience getting high-quality output from AIs takes a bit of finesse. Not quite unlike crafting a good Google query

so... yes

doubtful - I'm tuning GPT3 with good midjourney prompts as we speak

Is command language the ultimate interface? I doubt it. Similar to how GUI supersedes CLI in most use cases, we should be able to indicate "warmer" / "colder" preference to generate new images from previous attempts.

I'm not a huge fan of controlling things with text prompts but it does seem to be the best way to describe the image you're looking for

For diversity, Dalle 2 has a random chance of injecting "women" or "black" after a prompt. When this happens, at least for me, it generally destroys the quality of the images. Probably "King" was identified as a gendered word. You can find some discussion of this on the subreddit r/dalle2. Sometimes the images are quite poor on their own, but in this case, OpenAI is doing additional tampering.

A Twitter user figured out which words they were using by generating a lot of images with the starting prompt "A sign being held that says "

As I mentioned here a couple of weeks ago [1], I tested DALL-E with prompts for paintings and drawings in three standard genres: still life, landscape, and portrait. The prompts for portraits yielded a lot of grotesquely unacceptable faces, but almost all of the DALL-E output for the still lifes and landscapes was perfectly fine.

[1] https://news.ycombinator.com/item?id=32433821

They are the worst I’ve seen as well.

Yes, people tend to share the best of the best. However these results seem especially bad, like bottom 10% bad.

Yes, "prompt engineering" is actually a thing. People share various tips and tricks on the internet to engineer their prompts for best results. Lots of trial and error required.

Example: https://news.ycombinator.com/item?id=32088718

Of course people are more likely to share the best images – or in this case, the one most illustrative of their concern (about watermarks).

Also: my sense is that getting the best results often requires a lot of extra coaching with style/detail words. As we can't see the prompt here, we don't know what sort of style/details were requested. GIGO.

You're right. This shows the prompt and it doesn't have such style directives


Also, a construction like 'but' that tries to override another expectation may be suboptimal. I gave the same concept a few tries, with more 'sweeteners'. First batch, for prompt "news photo of the King of Belgium giving a speech to an audience that is entirely cucumbers, award-winning, well-composed, detailed surroundings" – & it's a bit better:



https://labs.openai.com/s/M0i029fZnYjQXHFobUpw7eun (best of batch imo)


A few more tries didn't manage to create any photorealistic shots with actual cucumbers-in-seats – perhaps due to the absurd contrasts required – but shifting to a 'cartoon' style with the prompt "editorial cartoon of the King of Belgium giving a speech to many cheering cucumbers, professional illustrator" got a lot closer:

https://labs.openai.com/s/4IonSKYkl0okhNvzJmEAH30K (good)

https://labs.openai.com/s/ZeadCzZ9WqeASYXPlOb13wDV (good)

https://labs.openai.com/s/nERf6bALKEBsQQBvPVsAH7o4 (good)


If I had more time & credits to burn, I suspect working off those could eventually hit something really apt... but it takes some work & tinkering.

OP did say what prompt they used

I've often seen people show off their autogenerated images and report only approximate paraphrases of their actual prompts.

There's one screenshot showing the prompt – but in $CURRENT_YEAR, I view all screenshots with at least a little suspicion, especially when there was a way to highlight the pseudo-watermarked image – OpenAI's native 'Share' – that would've provided stronger proof, direct from OpenAI, of exactly the prompt associated with an image. Hoaxes are everywhere! I've added DALL-E bottom-right color-squares to non-DALL-E images, & seen others do the same, as a subtle joke.

So I generally believe the OP, but don't rule-out the possibility there's been tampering to make some point.

It definitely requires some very detailed descriptions and sifting through to find a good one. One time I've regenerated a prompt as well because the existing 4 were just not that good. But I did get some great ones at a pretty good usable:unusable ratio.

I have access. You get four results for each query. I have to say that usually only one of them is good. Sometimes you need to refine your query. I'm pretty impressed as a user.

They can definitely be that bad quite frequently. I've actually been a lot happier with stable diffusion outputs lately (doesn't hurt that they're free too).

People here, as always, get hung up on legalese bullshit, but miss the overall picture.

The dynamic in play is highly questionable. Countless artists and photographers put effort into creating their works. They put their work online to get some attention and recognition. A company comes along, scrapes all of it and starts selling access to a model that generates something that looks highly derivative. The original cohort of artists and photographers not only get zero money or attention from this new endeavor, they are now in competition with the resulting model.

In short, someone whose work was essential to building a thing gets no benefits and possibly even gets (financially) harmed by that thing. Just because this gets verbally labeled "fair use" doesn't make it fair.

Additional point:

Just a few years ago a bunch of tech companies were talking about "data dignity". Somehow, magically, this (marketing) term is no longer used anywhere.

Did I hear "fair use"?

Here's some more fair use:


(previous HN discussion: https://news.ycombinator.com/item?id=27796124 )

I'm concerned that, and predict that, we will continue to see legal efforts from large data companies to prevent their own data from being used to train similar models. They can use our data, but we can't use theirs. Time will tell.

I also fear our governments are incapable of acting on behalf of the people (non-corporations) in this matter.

The law fundamentally needs to evolve. Latent embeddings of large corpuses of copyrighted works is something we are going to have to wrangle with more directly, it’s not clear to me how we even ought to want it to work in terms of the rights of the copyright holders for data it was trained on.

With the release of Stable Diffusion, arguably the genie is out of the bottle now, so there’s probably a ceiling on how much the law can evolve to claw back any retroactively determined rights to copyright holders.

Somewhat ironically, wasn't it OpenAI's main mission for AI to benefit humanity?

Reminds me of the discussion about GitHub Copilot using the entirety of GitHub as training data. I was honestly baffled how many people, even experts in the field, saw use as training data as non-infringing, with the corollary that it's apparently perfectly legal to "copyright-wash" a work by feeding it to an AI and having that AI generate a slightly different but extremely similar work.

Considering how strict and heavy-handed copyright handling has been otherwise, this has added to my belief that copyright in practice is really just enforcement of the interests of whatever industry has the most power at a given time: When entertainment and content generation was the biggest revenue generator, copyright couldn't be strict enough, now all money is on AI and suddenly loopholes the size of barn doors pop up.

Written laws are vague, practical verdicts are based on case law, cases are won by better-funded lawyers, rich industries prevail.

It's a bit of an exaggeration but maybe not too much.

"Copyright washing" seems a lot like clean room reverse engineering to me; this is usually done by having one person read the copyrighted code and describe what it does to another person, who then designs an implementation based on the description.

At least, I can't see a substantial difference in the result.

These loopholes are purely theoretical until tested in court. At some point a generating AI will hurt the wrong company, and they will either make a public spectacle out of it in court, or if they see no chance of winning lobby congress to introduce laws that make the case winnable.

Yeah, things should get interesting when the first model makes use of Rings of Power or House of the Dragon footage or whatever the latest superhero movie is.

I wonder if we'll see a "Hollywood vs Silicon Valley" lobbying battle. Or possibly "Amazon media division vs Amazon AI division"...

I think Silicon Valley would win. I saw some analysis a long time ago that basically indicated a couple of big companies could likely buy out all of Hollywood and the music industry, fully own them, and make most copyright issues go away.

I don't know if it's still true, but there's really a big difference in how much capital, revenue, and money is involved. Hollywood is pennies in comparison. I think Big Tech would easily win.

> but surely you can't just... use stock photos without paying for the license?

They aren't hosting the infringing content. Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

> Training on the data is probably covered under fair use. Generations are of _learned_ representations of the dataset, not the dataset itself. This makes it closer to outputting original works (probably owned by the person who used the model).

"Probably" is doing a lot of heavy lifting in that sentence.

As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

> The players involved here are known for being litigious, however. I wouldn't be surprised if OpenAI did in fact pay some hefty fee upfront to get full permission to use these images.

I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

As with Copilot, I suspect the DALL-E terms of use puts the onus on the user to avoid using infringing items.

> "Probably" is doing a lot of heavy lifting in that sentence.

Indeed, that's why I used it. It wasn't long ago that DALLE-2 outputs were owned by OpenAI (they recently changed it so the owner is the user). Definitely plenty of room for debate on who the owner should be.

> As for "_learned_", that's pretty debatable considering it's reproducing recognizable trademark infringement.

I guess. I meant this strictly in the machine learning sense, where "learned" is typically used to describe models trained via stochastic gradient descent.

> I have no idea why anyone would assume the "move fast and break things" disruption mindset that pervades tech companies these days, especially in spaces like ML/"AI", would mean they considered the legality, ethics, or good business sense of their training dataset.

I agree mostly, except that companies like Alamy have their hooks in everywhere so they can seek rent. I just figured they might be cautious about this if e.g. Microsoft (OpenAI's business partner) had an existing agreement in place for Bing or something.

Unlike Copilot, DALL-E et al. don't produce verbatim copies of trained data.

Copying ideas and styles has always been a fundamental part of art history, so an artwork's rights holder might have a hard time successfully suing a user because the user's generated image looks similar to the rights holder's artwork.

"Verbatim" is an interesting term since I'm not certain it matters. In this case OP here demonstrated DALL-E generating a trademarked watermark on top of an image. I doubt the courts, looking at that, would believe that that's not close enough to their trademark to infringe.

The art world's copyright suits are all over the place in terms of what's sufficient to meet the threshold of "fair use" or "not a copy".

It's hard for me as a layperson to see works by Richard Prince[1] as substantially transformative (clearly one work is derived from the other) and even the different courts couldn't agree on this as it was initially found in favor of the plaintiffs but then Prince won his appeal.

My approach to this kind of thing is simply this: Does this technology inherently open me up to lawsuits in undecided or highly unreliable legal territory? If yes, steer well clear of using it in any capacity.

[1]: https://www.artnews.com/art-in-america/features/richard-prin...

The case you refer to (Cariou v. Prince) is also a case where part of the artwork is reproduced verbatim :)

If they had been paying for the images upfront, wouldn't you expect them to train the model on the non-watermarked versions?

Good point! They certainly had no obligation to pay, either. Perhaps they just scraped it all.

The watermarked version might be more prolific with better metatags and descriptions around them.

The non watermarked versions are likely internal only and have far less diverse descriptions.

if they paid for access, or permission, why train on the watermark versions?

I’m guessing they assumed fair use and there will be lawsuits.

Is that representation of the watermark a trademark? If so, then copyright infringement might not matter, but use of the trademark may.

I would be very surprised if OpenAI paid anything for these, because it would set precedent that copyright infringement was applicable, which would be fatal down the road. (The only argument they could possibly mount in their defence would be that they wanted to train on the original images without watermarks.)

What if my dataset is just the one Getty image I don’t want to pay for.

What if I write a machine learning algorithm that only generates images that it has seen in the training dataset, with one pixel slightly different.

It won't be transformative enough and you'd probably lose the case.


What about two pixels?

Not enough... three... four... I think at some point there's a blurry gray area where a human judge would decide it is infringement or not. Of course not with a few pixels but at whole image level.
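To make the "how many pixels" question concrete, here is a minimal sketch that measures what fraction of an image a near-copy actually changes. Everything in it is hypothetical: the 8x8 grayscale image, the function names `one_pixel_copy` and `changed_fraction`, and the idea that a raw pixel fraction has any legal meaning at all (it almost certainly does not; a judge would look at the whole image).

```python
def one_pixel_copy(image, row, col, delta=1):
    """Return a copy of `image` with a single pixel nudged by `delta`."""
    copy = [list(r) for r in image]
    # Clamp to the usual 0..255 grayscale range.
    copy[row][col] = max(0, min(255, copy[row][col] + delta))
    return copy


def changed_fraction(a, b):
    """Fraction of pixel positions whose values differ between a and b."""
    total = sum(len(r) for r in a)
    changed = sum(
        1
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
        if pa != pb
    )
    return changed / total


original = [[100] * 8 for _ in range(8)]  # hypothetical 8x8 grayscale image
near_copy = one_pixel_copy(original, 3, 4)
print(changed_fraction(original, near_copy))  # 1 of 64 pixels differs
```

On a real multi-megapixel photo the same one-pixel edit would change a vanishingly small fraction, which is roughly why the "gray area" only shows up at the level of the whole image.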

Kids in school are also trained on stock images


I'm not even an American, and I've heard all about the Getty Images Address.

I think you're technically right, but that this will be overlooked from a legal perspective because it's less obvious that humans have been training ourselves on the prior art of others. We tend to blend in additional things besides prior art. (eg. nature, sensations, etc.)

That's technically not a stock image, it's a portrait that has been public domain for a long time.

But you've seen many PD images reshared by stock imagery companies. It raises the question of why false assertions of ownership aren't easily prosecuted, given that they constitute a kind of fraud upon the public.

That's what trademark law is. But I don't think the person who painted the portrait is going to be suing Getty for impugning his good reputation.

If something's public domain, anybody can use it for anything they want, even if that's just rehosting it with your watermark.

I think it’s amusing that many commenters here are perfectly willing to defend DALL-E, but mention Copilot and the discussion looks radically different.

Based on the new scraping ruling with LinkedIn [0], anything that is "open gate" (as in, accessible without logging in) can be scraped and (I assume) be used by neural networks. The onus, it appears, is to not use it to generate copyrighted works, like Iron Man from Marvel, just as one can use Photoshop as a tool but is still barred from making and selling an Iron Man digital painting.

[0] https://cdn.ca9.uscourts.gov/datastore/opinions/2022/04/18/1...

> Based on the new scraping ruling with LinkedIn [0], anything that is "open gate" (as in, accessible without logging in) can be scraped and (I assume) be used by neural networks.

The ruling you are linking to is about whether scraping violates the Computer Fraud and Abuse Act.

This isn't really applicable here. First of all, that's a separate issue from copyright. Just because scraping publicly accessible data doesn't violate the CFAA doesn't mean that suddenly all images posted on the internet are public domain or that you can use copyrighted images from websites for whatever you want, for example.

Furthermore, how copyright applies to training neural networks on copyrighted works is an open question right now.

Perhaps a neural network's outputs can be deemed as transformative use and thus fair use. I don't know though, I'm not a lawyer.

I would assume that for cases like this it is more a matter of whether you can redistribute copyrighted work that has not had any of the usual "creative use" things applied, rather than whether the original scanning was protected.

I remember when people used to say ianal. Innocent times when we thought there was an objective law and lawyers knew it. But that's not how these things work. The truth is that no one knows. Ultimately a bunch of people will decide how they feel about it. Well-read legal scholars trying really hard to be fair, but still just people. No one can predict with full certainty which way it will go.

>>No one can predict with full certainty which way it will go.

Until somebody tries to float a trial balloon (case) in court.

> Ultimately a bunch of people will decide how they feel about it

Some would argue that technically these people _discover_[0] the law, but it amounts to the same thing

[0] https://www.jstor.org/stable/3143421

Legally wouldn't it just boil down to the license on the watermarked image?

BTW you can add 'royalty free' to the prompt to get rid of those most of the time (lol?).

> royalty free

Wouldn’t that remove the king of Belgium? Or add a “down with the king” placard?

My personal opinion is that it's unethical (and possibly illegal, in a subset of cases) to train models on data without explicit consent of the creators of that data. And that really encompasses all data - generative models were not a thing when said data was created and no matter how it was licensed before, explicit consent about using it for model training must be obtained from the creators themselves.

That being said, arguments about copyright are just a fig leaf as far as I am concerned. The outcome of whether this is allowed or not will depend on the net impact of using those models on the job market and whether society will be willing to tolerate it.

You may want to use the native 'Share' option, especially on the one with the watermark.

You'll get a public link, at `labs.openai.com` rather than some random image-sharing site, which will show the image & the prompt used to generate it (including a credit to "your-first-name × DALL·E").

What is interesting is a human analogy.

Say you were an artist who went to every art show and museum and studied all the art there.

If you produced a work of art solely from memory that contained large portions of other people's copyrighted art, would that still fall under copyright/require licensing?

Precedent in music says sometimes-yes. The "Blurred Lines" lawsuit found that Pharrell and Robin Thicke were liable in the tune of $7m for producing a work of art solely from memory that copied the "signature phrases, hooks, bass lines, keyboard chords, harmonic structures and vocal melodies" of a Marvin Gaye song. https://en.wikipedia.org/wiki/Pharrell_Williams_v._Bridgepor... https://www.npr.org/2015/03/11/392375390/-7-million-verdict-...

Yes, but that's an outlier ruling that was widely criticized.

Widely criticized doesn’t change that it’s case law other judges might consider.

Another human analogy could be: you take a photo from every art show and museum, and use those for reference as you paint.

The analogy of using your cortex is more apt.

My understanding of neural networks is that there are no remnants of the original inside it. The training data is used to back propagate a bunch of weights.

Your brain works like those neural network neurons; they learn when to fire, but they don’t know the intricate detail like a photo. Hence why many claim eyewitness testimony is bogus.

> My understanding of neural networks is that there are no remnants of the original inside it.

Can you prove that? I can prove the opposite.
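A toy sketch of how trained weights can end up containing the training data. This is not DALL-E or any real architecture: the "model" is just a parameter vector with no input, the five-value "image" and the function name `train_on_single_image` are made up for illustration, and the only claim is that gradient descent on a single example drives the weights to a copy of that example.

```python
def train_on_single_image(image, steps=200, lr=0.1):
    """Fit a parameter vector to one image by gradient descent on squared error."""
    weights = [0.0] * len(image)
    for _ in range(steps):
        # Gradient of sum((w - x)^2) with respect to each w is 2 * (w - x).
        grads = [2.0 * (w - x) for w, x in zip(weights, image)]
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights


image = [0.0, 0.25, 0.5, 0.75, 1.0]  # hypothetical flattened "image"
weights = train_on_single_image(image)
max_error = max(abs(w - x) for w, x in zip(weights, image))
print(max_error)  # very close to zero: the weights now hold the image
```

With millions of training images and a model far smaller than the dataset, most individual images cannot be stored this way, but anything repeated often enough can still be.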

Some networks use databases of images rather than storing all data directly in weights.

Alternatively, if you memorized some GPL code, can you write a copy of it and put in a proprietary licence?

Of course not! That's why people do black-box reimplementations without ever seeing the original source code.

There are definitely lines to be crossed. Let me tell you about one of them.

There is a comics creator named Keith Giffen. He's done a lot of solid work over the years for DC and Marvel; there's a playful love of the medium and its history that flows through a lot of his work. At first his style was pretty middling; nothing terrible, nothing to really stand out from the pack. Then one day his work changed dramatically - he got a lot more daring in spotting his blacks, inking with a heavier brush, and doing a lot of panels that were a closeup of a backlit head with rim lighting, and eyes and teeth standing out in white. It was grounded in observation but had a lot of fresh ways to abstract a scene in the service of story. It was like nothing else on the racks and really striking.

It was also completely swiped from the work of an Argentinian artist named José Muñoz. Pick up one of Muñoz's shadow-drenched crime stories, put it next to one of Giffen's superhero tales, and you could clearly see the influence. And not just the influence, influence is okay - Giffen had started entirely cloning Muñoz's style, completely dropping all his other influences in the process. Muñoz was not happy when he heard about this, and neither were other artists in the field of comics. Influence is one thing, everyone's influenced by other artists, and if you're familiar with an artist's influences you can tell. But dropping all your other influences to start drawing almost exactly like a new one? That's just not done.

Giffen got a lot of shit for this. Giffen quit comics for a couple of years after this, and when he came back he had a new look. He still does the Shadowy Muñoz Face now and then but it's more along the lines of one of the many things he's borrowed from his multiple influences rather than one of the ways he was wholesale ripping off Muñoz.

"Style theft" is completely legal in the eyes of the court. There was nothing legally actionable going on here. But in the court of his fellow artists, Giffen was judged, and found guilty.

There's a range here. Nobody's going to care if you pick up a collection of Winsor McCay's pioneering 19xx comic strip "Little Nemo" and do a dream-themed story that borrows his distinctive panel composition, lettering, and inking choices. Nobody's going to care if you do one drawing that precisely lifts Mike Mignola's heavy use of black and thin, clear lines. If you do superheroes long enough then you're pretty much obligated to do at least one story that emulates Jack Kirby as closely as you can. If you worked as someone's assistant for half a decade then you are very much allowed to bust out a perfect rendition of their style at any point in your entire life. But there is definitely a line you can cross where every artist (and a lot of non-artists) who sees a side-by-side view of what you're doing and what you're swiping from will say "dude, not cool, stop swiping their style".

These image generators actively encourage adding the names of prominent, living artists to your prompts to get the results you want. Is this crossing the same line Keith Giffen did?

yeah except this artist wont go around painting watermarks



(Yes I get it's not technically a watermark, but it certainly qualifies as a trade mark in a similar fashion)

Ceci n'est pas une pipe … er … soupe

my point being if he tries to draw a dog his idea of a dog will not contain the Getty Images watermark and even if he did he would just not draw it

There's a big difference here. DALL-E (or any other similar tool) is not an artist, it is a commercial product that can produce something that might be considered art. The usage might be ok, but the product itself might not... if I paint using stolen pigments (not saying that DALL-E is alike to this), my painting would be legal, but the fact of having stolen the pigment is very illegal...

The product of a machine cannot be copyrighted. Only humans have the right to copyright something. Dall-e is like a kaleidoscope, then. Anybody, even a monkey, can generate similar-looking images using Dall-e with the same prompt. So there are two options:

a) Dall-e images are without copyright.

b) Original authors of work used by Dall-e can claim copyright over images generated.

Copying signatures, one enters forgery territory…

If you read the licence from Getty, they say, you are not allowed to use Getty pictures for ML.

What that license text says is irrelevant, because they’re not using it under that license, but under fair use exemptions in copyright law.

Fair use, until it disrupts the market, sidelines creators, gobbles up the market, makes itself ubiquitous, locks people in, then lets the regulators craft some bs law that will change nothing and compensate no one. ;)

This interesting era of AI will surely teach us the meaning of that old phrase "great artists steal", or more subtly rephrased, "everything is a derived work".

Got the exact same girl from the picture in the ad at the bottom. Creepy! https://ibb.co/dBLNxQ6

It doesn't matter. I could put a Getty watermark on anything. Getty would have to show that a generated image was at least in part the same as their image.

No. You could put the Getty watermark on anything, and that wouldn't be copyright infringement... but it would be pretty clear trademark infringement.

I'm finding it amusing that everyone immediately assumes infringement, OpenAI is a company that will not be inviting lawsuits.

We can't assume any licensing behind closed doors, my guess is that OpenAI has an agreement with Getty, take a look at the licensing in this Observer piece, it's been licensed by Getty, this would indicate that Getty are happy with scraping.


Besides, this is not infringement in principle, the AI has been trained to think that high-quality news images have watermarks.

I don't care much for what laws say. If the only way someones service can work is by ingesting the work of someone else, without compensation, and then compete with that same person, that is wrong.

If a company reverse engineers a competitors product, they still buy the product to tear it apart and figure out how it works.

If a student learns from their teacher, then goes on to sell a similar kind of work as what their teacher makes, at least the student paid for the classes.

This arrangement offers none of that. As long as theft is illegal, this should be. I'd call it parasitic, but it isn't; this is a parasite whose sole intent is to kill the host.

>but surely you can't just... use stock photos without paying for the license?

You'd be surprised...

Is there a copyright protection in terms of consuming a copyright-protected image? I thought it was only for the purpose of displaying that image. If you're reading the file and reading the data, but not displaying it, is that also protected?

Copyright, as the name implies, is mostly for restricting copying (as in printing copies of a book), but also restricts distribution, adaptation, display, and public performance of that work. In the case of AI, it’s the “adaptation” part which is up for debate. If a person uses an image as part of a training set for an AI image generator, and then uses said AI to generate new images, are those images “adaptations” of the images in the training set? I would suggest that the answer is yes, but current behavior by AI vendors is not in concordance with that view.

Just wait until they build an AI watermark identifier and remover (which is a problem subset) and then use its output to train/update their model.

They probably already have specialized filtering models built to filter out censorable terms. They may be imperfect, but they are there. A watermark remover might be an easy addition.

When Stable Diffusion released their model playground, I used the prompt Peter at the pearly gates dressed as a security guard and got three images two of which were censored and one that was an ordinary image. So, the capability is there already. Just a matter of time before they get good at watermark removal.

Probably just some stock photos with watermark sneak in.

There are lots of photos with watermark circulating on web, for example in memes and unfinished webpages (when finished, these will be replaced with paid variant without watermark).

Yeah, I've seen an image get generated with a very recognizable watermark for a certain stock image company. This happened with a totally unrelated prompt.

Did you try reverse image searching the generated image?

I don't know about the images, but what about the watermark itself? Can I just take any photo and add a proprietary watermark?

Similar thing with GH Copilot. I'd say it is still fair use though, even though such things should be filtered out.

A couple of people here have asserted that it's "probably" fair use, but are there any rulings on the subject?

Yes, Imagen and everything based on LAION 400M or 2B, too.

BTW, Copilot also ignored all licenses of the source code it memorized.

Datasets are the new capital. If they could, most employees would probably also object to their company using the result of their work to replace their job. But they can't. It's the same with artists here.

The first thing that I try after generating an image from DALL-E is using reverse image search. I do it on every image that I intend to use, more often than not, I find a very similar image, in this case I discard it and vary my prompts.

> more often than not, I find a very similar image

Can you give an example? I was also doing reverse image searches, and I haven't seen a single case of an image being closely related to another unless it was used as the base for inpainting.

At first, you can experiment with "extreme" input that can cause overfitting, like "a photo of Marilyn Monroe", etc. For the reverse image search I use Yandex, and I downsample the image first.

What are the best apps and subscriptions to generate these? No private beta, just sign up, put a credit card on file, and use? (Low volume, perhaps 100 images per month, so 300-500 attempts.)

Could be great for featured images for blog posts.

some people will post images with watermarks on social media or other sites with user generated content. if their dataset included images scraped from them, then it could have gotten in that way

Relevant earlier discussion about this issue: https://news.ycombinator.com/item?id=32436203

Wondered the same thing recently … https://news.ycombinator.com/item?id=31159231

No one fucking cares. For 1 "copyrighted" image there's a thousand free with the same quality or almost.

You are wasting CO2 even discussing it

Obviously you could send it to the copyright holder and find out. In the case of Copilot, Oracle certainly would sue.

Seems more likely to me that they add uploaded images into their data set and someone uploaded a watermarked image.

Educational is a fair use category. These tools advance science. I wouldn't expect them to respect copyright.

sometimes people will post stock images on sites with user generated content. if their training data included images scraped from those sites, then it could have gotten in that way unintentionally

Last time I checked you can source from whatever you want, legislation doesn't care.

The last time I checked was when Copilot went public; they could have trained it only on GPL code. The source license/copyright et al. don't matter.

So what happens if I start selling Dali like pieces?

You transformed the original enough so it's ok

Regardless of whether or not training an AI on stock images violates the license, there's a very real problem with that watermark being present, which is that it proves their AI is prone to copying large swaths of images from gettyimages unaltered, and that definitely is a license violation.

This makes me think back to the controversy over GitHub Copilot; if these AIs are going to be trained on other people's IP then somebody needs to be held accountable when they commit plagiarism.

Otherwise, I'm sure Microsoft won't mind my new "gamemaker AI" that I trained on that new Halo game last year, or this "OS AI" that I trained on Windows 11.

Just because it contains the text of the watermark does not mean that it's reproducing large swaths of the image - it's doubtful even the most generous perceptual hash would retrieve any matching images in the Getty repository.
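For readers unfamiliar with the hash-matching idea mentioned here, a minimal sketch of one simple variant, the average hash, compared by Hamming distance. The 4x4 grayscale grids (`getty`, `near_copy`, `generated`) are invented toy data, the bright pixel in the corner stands in for a watermark, and real systems downscale full images first and use more robust hashes; nothing here reflects how Getty or OpenAI actually match images.

```python
def average_hash(pixels):
    """Average hash: one bit per pixel, set when the pixel is above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p > mean else 0 for p in flat]


def hamming_distance(h1, h2):
    """Number of bit positions where the two hashes disagree."""
    return sum(b1 != b2 for b1, b2 in zip(h1, h2))


# Hypothetical stock photo; the 255 in the corner plays the watermark.
getty = [[10, 20, 30, 40], [50, 60, 70, 80],
         [90, 100, 110, 120], [130, 140, 150, 255]]
# Same photo with one pixel slightly altered.
near_copy = [[12, 20, 30, 40], [50, 60, 70, 80],
             [90, 100, 110, 120], [130, 140, 150, 255]]
# Unrelated content that happens to carry the same "watermark" pixel.
generated = [[160, 150, 140, 130], [120, 110, 100, 90],
             [80, 70, 60, 50], [40, 30, 20, 255]]

d_copy = hamming_distance(average_hash(getty), average_hash(near_copy))
d_generated = hamming_distance(average_hash(getty), average_hash(generated))
print(d_copy, d_generated)
```

In this toy setup the near-copy lands at a small distance and the unrelated image at a large one, even though both carry the watermark pixel, which is the distinction being drawn.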

and yet the watermark is there. if it can't copy parts of the training images then how and why did it copy the watermark?

It's prone to copying things it has seen thousands of times, such as those watermarks. The content itself is unrelated.
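A loose analogy for why a feature seen thousands of times survives while the content does not. This sketch simply averages many random toy images that all share two fixed "watermark" pixels. Averaging is not how diffusion models are trained; the image size, the positions in `WATERMARK`, and the helper `make_image` are all assumptions made up for illustration.

```python
import random

WATERMARK = {0, 5}   # hypothetical pixel positions that carry the watermark
NUM_PIXELS = 16
NUM_IMAGES = 1000


def make_image(rng):
    """Random content in [0, 1), with the watermark pixels forced to 1.0."""
    return [1.0 if i in WATERMARK else rng.random() for i in range(NUM_PIXELS)]


rng = random.Random(0)  # fixed seed so the run is repeatable
images = [make_image(rng) for _ in range(NUM_IMAGES)]

# Per-pixel mean over the whole dataset.
mean_image = [sum(img[i] for img in images) / NUM_IMAGES for i in range(NUM_PIXELS)]

for i, value in enumerate(mean_image):
    label = "watermark" if i in WATERMARK else "content"
    print(i, label, round(value, 3))
```

The watermark positions stay at full intensity while every content pixel washes out toward the middle of the range, so the only thing recognizable in the aggregate is the part every image had in common.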

some people go into business models that simply have no legal protections

These copyright "issues" are against the true nature of innovation.

By the means of Artificial INTELLIGENCE, we must accept that a mind or intelligence is free to perceive external elements and use every stimulus to execute its own creative process.

The world is a perpetual iteration cycle amongst human beings. Good artists borrow, great artists steal.

This comment should be on the Wikipedia article for "parody indistinguishable from reality"

Regardless, I think I agree. I use images from all sources as inspiration for my art.

I know people who have used my pieces for inspiration as well. Intelligence and creativity aren’t bound by IP law.

Why should AI be bound by it?

Yeah! You got what I meant... We have the opportunity to make a big leap by exploring a new Era of creativity provided by these technological advances.

The same cynical people asking whether "machines are going to replace people" haven't figured out that the machines need to build themselves first.

When machines build themselves (with no humans since conception), let's see how the "patent ideas" world wouldn't fall apart.

Until then, We can get our piece of Cake by augmenting our human creative process.

Sorry but I'm not the daydreamer here. I can't live under the false premise that Intellectual Property is something that would control the input (training, learning, etc.) for all the A.I. (created or to-be-created).

If we are expected to believe AI have a "creative process" then they should abide by labor and copyright laws.
