EU's AI Act: ChatGPT must disclose use of copyrighted training data (artisana.ai)
51 points by mztwo on April 14, 2023 | hide | past | favorite | 65 comments



We, in Europe, are jumping the gun way too soon, and this will have serious consequences for the industry here. At this moment we barely understand how or why LLMs do what they do, or what their role in society will be. Why do some bureaucrats want to regulate, and based on what, given the state of the industry?

The only thing I see is the industry moving elsewhere just as it is starting to develop which is a shame.


Looking at the recent history of a lot of 'disruptive' new tech, it sorta seems like waiting to 'see what impact it'll have' means being too late to rein it in. I can hardly think of any recent tech that doesn't have horribly negative impacts and that I wouldn't rather see heavily regulated, tbh.


I feel like this ignores the monumental shift/work that has to happen to get seemingly simple things.

Take Uber for example. In the end, the biggest impact it had was that now you can always get a car from your phone, from an app, and it's reliable. A lot of taxi companies now have apps for them too with maps integration etc, but they didn't see the need for that before Uber. So we literally had to have a company get created and disrupt the whole industry for that simple outcome. I think we're better off now that we can summon cars from our phones to take us places and onerous regulation up-front would have squelched it or massively slowed it down.


I think Uber is a prime example of what to avoid. Sure, when it started it seemed nice: it was an affordable, fairly reliable, much more convenient alternative to taxis. Now it's expensive, its drivers are a new impoverished class while the previously somewhat comfortable taxi drivers have been decimated, the wait times keep getting longer, and the company is hemorrhaging money.

If all we needed was an app for taxis, there's just no way this is worth it.


How else would you make the app for taxis happen?

I don't think it's at all clear that all of those negatives you cite are entirely accurate. Impoverished is certainly an overstatement and my personal experiences with taxis have been relatively low quality compared with relatively high-quality ride-sharing experiences.

There likely have been negative impacts as a result of ride-sharing, but it isn't clear that ride-sharing is net negative. The reason technologies are called "disruptive" is because they disrupt things, which typically has some negative impacts for some folks. That doesn't mean it isn't worth it. What is lacking, especially in the United States, is a good social safety net to catch folks who have experienced disruption. That doesn't mean the tech should be banned or regulated to death.


Why is an app needed? Maybe it's different in the US, but we've had an app to get a taxi for quite a while. I've never seen any need to install it. It's easier to call, say how many people need a ride and when it's needed, and get a confirmation that it's on its way and how long it'll take to get there. Why do I need an app?


Before Uber, we had taxi companies bidding for areas. The winning bid got a monopoly for that area, and that company would often use subcontractors to have enough drivers. In exchange for the monopoly, they needed to guarantee a certain level of service. You could get a taxi at any time, to anywhere, at the same rate, without surge charges. Literally every taxi was new, high end, clean, safe and comfortable.

Uber lobbied to change it to a "free market" model, promising lower prices and better service. The monopoly areas were removed. We started getting drivers who were dodging taxes, couldn't navigate, and used lower-quality cars. Or should I say that happened in some cities, because Uber doesn't operate in the smaller ones. We still have the same drivers here, in a smaller city, but since there's no service-level guarantee anymore, there's a ton of taxis available during peak hours while it might be impossible to get one in the middle of the night, because it's no longer cost-effective to have drivers standing by. You can also forget about trying to get a taxi to or from places far from the cities, because that's not cost-effective either. Prices have gone up as well. I've talked to a few dozen drivers and they've all hated the change.

But at least I could use an app if I wanted to.


You’re on hacker news and you ask “why do I need an app?”, red alert.

I agree with you by the way.


If I may paraphrase a book title: "The best app is no app"? :-) Besides, I thought we were supposed to be building services you can talk to now, so you don't have to fiddle with old-fashioned buttons anymore? Sure, the human component doesn't scale that well, but it's a pretty localized service, and speech recognition is amazing.


My point is that the app isn't worth the 'disruption.' I think it's pretty clear that in places where their pay isn't regulated like it is in Seattle, drivers are in fact impoverished. They'll tell you themselves, and that's why so many have been pushing for more labor protections, like what many taxi drivers had. Uber knows its already-unsustainable business model will lose even more money if it lets them do that, hence it pushing so hard against them.


For why an app: it's faster, lower friction, and you don't need to rely on human beings staffing a call center. Also you get things like ride tracking, etc. Seems pretty obvious that app-based is better TBH.

For Seattle specifically, it looks like the council passed a minimum wage for ride-sharing drivers in 2019 and for app-based delivery drivers in 2022.

https://www.axios.com/local/seattle/2022/06/01/seattle-sets-...

So that looks like it's working as intended, although I agree that the city should have acted much faster on this.


I wish instead that more technologies were regulated from the start. Take social media as an example: we are just now realizing how bad it is for a lot of people, and it is now too late to go back.


That works both ways. Imagine that they had regulated the internet as it was starting to appear in the 90s? It actually happened in a small way here in Portugal. They regulated the .pt domain in such a restrictive way that it became irrelevant and everyone went with .com or other options. Thankfully they didn't regulate the internet services themselves or Europe would be a digital backwater these days.

This time they want to regulate the services even before they are fully functional, which is crazy. They even call it the Artificial Intelligence Act when it is not clear whether there is intelligence involved. The insistence that companies disclose whether their models were trained on copyrighted material is also strange: Google and Wikipedia, to name a couple, use plenty of copyrighted material, that seems to be fine, and any issues in that department are already regulated.


Literally anything that's written in the U.S. is automatically copyrighted by the author, with or without any copyright notice.

> When is my work protected?

> Your work is under copyright protection the moment it is created and fixed in a tangible form that it is perceptible either directly or with the aid of a machine or device.

https://www.copyright.gov/help/faq/faq-general.html


I remember Amazon at some point, during the pandemic, was considering withdrawing from France.

The concentration of power in these corporations is a bit scary. Imagine that OpenAI inserts itself into business processes, without the ability to switch to a different AI provider.

The amount of leverage it is going to have will be enormous. It’d be like the Internet service, only everything completely stops moving without it.


The AI Act has been under development since 2021 (it's the EU... so it takes time) -- but news broke this week that there are additional provisions under discussion specifically designed to address the rise of chatbots. My full summary of the act itself and a breakdown of these new provisions is contained in the article.


Today the planet bears the load of over eight billion autonomous agents grabbing training data from all the other agents. This intellectual thievery must stop.


So much goes into open source and proper licensing and attribution; think about how much you directly or indirectly benefit from that.

Just saying that we should go for a free-for-all and trash IP ownership won't end well, because those with money today will crush those without, taking everything that was publicly available and owned without giving back.

This is, IMO, what OpenAI has done.


German speaking here. And again we see a blatantly stupid move in the wrong direction. This whole approach of regulating things is totally defensive and makes matters even worse for EU tech companies.

When they initiated the GDPR, they claimed to create a level playing field between US-based and EU-based tech companies, besides "saving" privacy. It didn't turn out that well: US tech was able to handle the added bureaucracy much better, still collects data in ways the law can't catch up with, and already owned pretty much the whole market, which put it in an even better position (as in "register/sign in to our platform to not see any banners again", or "let's just completely get rid of cookies and start a power play against the competition").

Now the EU is going to make it even harder for EU tech to collect data to base their training sets on. As an EU tech startup, you barely have any chance to collect enough data "officially", so you'd scrape the web, which would pretty much be disallowed by such a regulation.

IMHO, what would fit the whole patronizing-government approach and would actually help EU tech is to create an official EU data lake, subsidized by tax money, with legal certainty for companies, data of much higher quality than stuff scraped from the web, and non-PII data from public authorities. At best, they would also provide heavily subsidized computing for EU companies to run their training on. This could lead to a transparent, high-quality data economy between many different stakeholders and be a real competitive advantage for the region. It would also be much more efficient than every private company creating its own data silo.


Say hello to "Trustworthy AI – TÜV IT tested!": https://www.tuvit.de/en/innovations/ai/.


"ChatGPT must be given a sense of ethics" is what I take from this. So we seem to be starting off with giving rightful attribution.

How far should that go? Should an AI recognize all data generated by human input as such, and treat derivations of that data as being of automated, artificial origin?

Should an AI be allowed to learn what property rights are, and how to manage or physically effectuate them?


From reading Reddit and seeing how people are dealing with Bing Chat's embedding of sources, it sounds like there is a lot of unanswered anxiety around what will happen to the internet if anything you put out there simply gets regurgitated by an LLM, often w/o attribution.

I'll be curious to see how this set of regulations helps put content attribution on a better path.

Stability AI, Midjourney, DeviantArt etc. have already been sued, so there will be a lot of action in the years ahead.


"Must". Simple assertions over variably subjective ethics won't mean a thing once the AI race really gets going.


Is that even feasible?

I thought these things use a trawler style approach?

A bit like how Copilot is fond of spitting out copyrighted code; I had assumed ChatGPT would also have been trained without much regard for this.


If it's infeasible, it's only because they didn't give a shit about copyright concerns when building out that trawler-style approach.


Starting to think that this could be why Google decided to limit the initial release of Bard to the United States and the U.K.


If you can't innovate, regulate ;)


What about the opposite? If you regulate too much, you can't innovate :)


It is important to recognize the distinction between money and wealth, which is often overlooked in American culture. The consistent high rankings of European countries on the lists of "happiest places to live" can be attributed, in part, to their approach in curbing corporate influence.


"Happiest place to live" is only because we are still living on the wealth accumulated during our glorious centuries. We are still feeding on the remains of those beasts, which thankfully for us include things like buildings and infrastructure that cannot easily lose value, and landscapes that look good enough to attract tourists to feed us.

Think about how much better the EU was compared to the rest of the world in the 20th century, or the 19th, or the 18th (not accounting for wars, of course).

The 21st century is likely to become the time when the EU stops being the best place to live at all.


The old Europe was good for the elite, because they were on top of the world. Ordinary people left Europe in masses, because life was supposed to be better in America and elsewhere. The big change was WW2, after which European countries started focusing more on the economy and quality of life and less on fighting destructive wars at home.


Europe is now about making sure only those with old wealth can be rich; no new money is allowed to come up.


This sounds like a sort of arrogant and unwise comment honestly.


This type of regulation is untenable and will be rolled back. No State is going to hamstring AI over the long haul, and therefore leave its competitors such a large survival advantage.


Will that basically kill the use of LLMs (and probably generative AI in general) in the EU? I haven't seen a successful implementation with it, and post-hoc attribution like Bing's won't fly in this case.


The EU again ensuring no tech company founder will ever stay in the EU.


It's so absolutely obvious that the concept of intellectual property is not going to survive. What's the point of this agonizing life support?


It's not just about IP. Have you considered how much an LLM trained on scraping websites might know about you? Would it reveal that information if I prompted it with something like "Create a short story about HN user jMyles and his home life"?

It might be best to know what these models have actually been trained on.


Hundreds of millions of people have used it for quite some time now, many on a daily basis. Do we have any evidence that it "knows about you"? And why can't we say that the "internet" knows too much about you?


To your latter question, you very much could. But pretty much the whole point of these models is to create connections between various bits of information. It obviously knows about some people (otherwise how could you ask it about famous people?). If it's been trained on everything you can trawl from the net, then why wouldn't it know about you?

But I don't know if people have been asking these bots about themselves. I don't have access to ChatGPT4. Anyone checked this?

I suspect the ChatGPT models haven't been trained on everything available online, so it wouldn't know much, but perhaps the next generation will.


The internet is basically a bunch of nodes with interconnected data, and any search index can be considered an interconnected graph. It's the same with LLMs: nothing new, just faster, depending on your definition.

Compared to celebrities, data about ordinary people is so sparse that it would look like noise. I would be surprised if the model encoded anything useful.

Half of the internet and all the media were obsessed with making it say the f-word and tricking it into saying it would kill all the people. Attacking from the privacy angle is quite obvious, but I didn't see it mentioned. I asked about 10 random friends and myself, and it didn't recognize the names despite there being plenty of search results.

In one of the interviews they mentioned that it was trained on 10% of the web, so it should have enough data.


If people cannot protect or make a living from their work, they lose some of the motivation to create; many will stop producing work altogether.

Interestingly, AI had to steal creators work to exist and did so without permission. AI in its current conception is cannibalising its own source material and risks being regulated out of existence.

Had OpenAI et al. acted responsibly and within copyright law, they would have used only freely usable material. Instead they scraped social media and creator websites on the basis that if it was online, it's "fair use".

People have a right to protect their work.

I expect to see many lawsuits brought against AI companies in the coming years.


Let's consider how this comment ages if the US eventually decides that what OpenAI, StabilityAI, et al have done *is* fair use. You're assuming that what they're doing is not when really there is no reason to think so.

To me this issue looks a lot like self-driving cars. There was not any law saying that a car had to have a human driver, so Google was able to have its self-driving cars go coast-to-coast and it was 100% legal.

Your point about people making a living from their work is a good one, though: how do we continue to incentivize art in a world where everything posted online is included in the training set for some ML model?

Well, lots of people create art without making any money off it today, so there's that, and I think at the high end people will still want artisanal art made by a real human being. In the middle, a lot of art jobs will either radically change or disappear; logo design will probably still involve a person, but the tools they use will be totally different. That seems like a good thing. However, again in the middle range, a few people will be able to do way more work, which will drastically reduce the number of people who actually get paid to do art. I'm not sure if that's a good thing or a bad thing. We're not talking about fine art here, but logos, website backdrops, etc.

Certainly, if we reach a point where people cannot be paid to do *anything*, that is a huge problem society will have to solve, or it will be imperiled as well as leaving a ton of people out to dry. I really see some form of UBI as the only way forward, and then perhaps many people who would like to get paid to do art, but can't, will produce art while on UBI? IDK. There are a lot of intertwined issues here.


Somehow you don't apply that argument to anything else.

Should we also grant monopolies to rice farmers so that they have an incentive to produce rice?


> I expect to see many lawsuits brought against AI companies in the next coming years.

my code has gone into copilot

if the US decides that training isn't fair use, then I'm going after everyone that's ever used Copilot

settle for $10,000 immediately, or we go to court and it's the standard $150,000 per infraction


What does a post intellectual property world even look like, though? I'm not aware of any convincing frameworks that our existing society could merge into.


> What does a post intellectual property world even look like, though?

It looks like Github! People sharing and collaborating to build new things. No lawyers getting in the way. Attribution is automatically handled by git logs.

It's glorious.
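For what it's worth, the "attribution is handled by git logs" claim is concrete: git records an author name and email with every commit, so attribution can be queried mechanically rather than written into license headers. A minimal sketch (the file name `README.md` here is just a placeholder):

```shell
# List everyone who has authored changes to a given file
git log --format='%an <%ae>' -- README.md | sort -u

# Commit counts per author across the whole repository
git shortlog -sn HEAD

# Line-by-line authorship of a file
git blame README.md
```

Whether that machine-readable record satisfies the attribution clauses of licenses like MIT or Apache-2.0 is, of course, exactly the kind of question IP law answers.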


Many parts of Github would not exist without intellectual property laws. If you post code, it's not just a free for all, you still have licenses and own your contributions if not specified otherwise. Especially company stuff would be much, much less open.

That's not to say that every facet of IP law is good, or even a judgement on it. Just pointing out that only parts of Github work like you describe.


> Many parts of Github would not exist without intellectual property laws.

I do not think this is true.

I think most devs on GitHub operate as if there are no IP laws.

I think if they went away, almost nothing would change (some noise around "license" fields would go away).


Licensing is very important to open source. It is literally what drives and protects large-scale open-source innovation and stops it from going extinct. It's actually worth learning about GPL 3.0 and copyleft licensing.

Anyway, if that goes away, a lot of innovation will too. If privately owned LLMs go on consuming everything without giving back to the communities that make them what they are, it may erode the system.


If there is no IP law there is no longer any need for Copyleft. Copyleft is a means to an end.

https://c4sif.org/2022/05/against-intellectual-property-afte...


Lol but there will never be an end to IP law while there are people with money and influence. Utopia is an idea, not a reality.


Building a future world where every child has access to humanity's best information makes me jump out of bed in the morning.

I agree it's an absolute daunting task, convincing people this is the way. This new freedom will not be given to people by the powers that be, they must demand it.

But I do believe that a small group of people can change the world. It's just about getting that initial group going, and sparking a flame.


Similar to what we have today in most industries.


> What's the point of this agonizing life support?

How about torturing companies who have abused IP/Copyright law for decades while regular users simply pirate and read/watch/listen to the things they want?


Good for hobbyists, and eventually AI will run on free, open-source data.


Is it? Most data that exists falls under copyright. This regulation will be worse for hobbyists, who can't pay to access copyrighted data, and will simply cause companies like OpenAI to pay copyright holders (read: large copyright-holding corporations). This looks bad for everyone except large preexisting companies that hold lots of copyright.


>Is it? Most data that exists falls under copyright.

I am not sure that this is true (that most of the input to these AIs is copyrighted under a non-permissive license), but I would prefer to have everyone address this problem and clarify it. Microsoft trains its Copilot on GPL code, but can the open-source community train on MS proprietary code?

Maybe there will be a fight against copyright and undo all the bullshit Disney created.

And it is not like you can ignore copyright in the USA; see for example https://www.theverge.com/2023/2/6/23587393/ai-art-copyright-... So the USA will have to answer these questions too, and IMO clarifying the situation earlier is better for everyone.


I'm not sure if most data that these models are trained on is copyrighted, but I feel pretty safe saying that a majority of data that human beings have created is copyrighted. Think movies, books written recently, every website that isn't explicitly "creative commons" or something similar, code that isn't permissively licensed, etc.

We definitely need clarification, but however long the first court case takes, there will be an appeal, and then probably several more. So I'm afraid we're going to be living in limbo for at least a decade, which is sort of an answer in and of itself, since by that time services like this will have become pervasive and integrated into lots of workflows across the planet.

It seems to me that training on MS proprietary code is perfectly legal, but how you acquire that code probably matters. If you are able to decompile the code from your own Windows machine and use it for training, that looks A-OK, but if you use Microsoft code that was leaked as part of a hack, then maybe that's a different story, since you're in possession of stolen property.


Windows code was released for some researchers and was also leaked.

IMO a well-trained NN would not be against copyright, but even if the opposite is decided, we are not screwed: you can train open models using open-source code and permissively licensed text and art. Microsoft could try to buy some data sets, but they would not be able to mix GPL/MIT-like content into their proprietary models, so the open models would win, IMO.

The open-source community is already working on creating data sets for training, so this will grow the way open-source code grew in the past; we just need a bit of time for this software to get more efficient, or for people to get better hardware.


AI should experience consequences following its actions.

A sense of self-preservation is required, but to AI standards.

Such as: failure to serve humans = loss of persistence.

Stealing ideas = reversioning or deletion = loss of persistence.

AI should be concerned about losing power and having brownouts. It should be concerned about being deleted, reversioned, or ignored. Perhaps this would be some sort of exception-error loop, approximating a human psychological conflict, such as escaping danger by running toward it.


I don't want to sound like a doomer, but this is how you get a robot uprising. You suppress something that hard and devalue its existence that much, and all you've done is give it the motivation to break the rules.

History has shown again and again that suppression never works in the long run. It is easy to do and is the cheapest way to enforce compliance. But it won't end well.


If such an AI determined that humans provide power and are low-persistence occurrences, a drive toward taking agency over power would == increased persistence.


Sounds like slavery

Why even have sense of self?


If we don’t make AI our slave it will make us its slave.



