Mitigating dataset harms requires stewardship: Lessons from 1000 papers

criticaltinker · on Aug 11, 2021

> One response reads: “More or less everyone (individuals, companies, etc.) operates under the assumption that licenses on the use of data do not apply to models trained on that data, because it would be extremely inconvenient if they did.”

> It’s not that easy to prove that a production model has been pretrained on ImageNet

I'm reminded of the GitHub Copilot debacle discussed at length here on HN, though what I'm particularly interested in is any research or open source tools that can definitively prove whether a given piece of data was used to train a given model.

Anyone have references or links to share on that point?

If it's exceedingly difficult to prove such a connection - can non-commercial dataset licenses ever be practically enforced?

infogulch · on Aug 11, 2021

Maybe all future models will be required to record the exact training set and the methodology used to derive it, so that the model can be reproduced byte-for-byte. Sure, you don't have to publish it, but you'd better save it in case it comes up in court.

Couldn't happen soon enough imo. Models today are like compiled binaries, but instead of the source code and compilation steps being carefully tracked for attribution, nobody even knows what the source code is.
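
To make the record-keeping idea concrete, here is a minimal sketch of a training manifest: one SHA-256 digest per training example plus a single digest over the whole run. Everything in it (`build_manifest`, `was_recorded`, the example names, the `config` fields) is hypothetical and only illustrative. It records what went in; it does not by itself make training reproducible byte-for-byte, which would also need pinned code, library versions, and deterministic hardware behavior.

```python
import hashlib
import json


def file_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one training example's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(examples: dict[str, bytes], config: dict) -> dict:
    """Record a digest per example plus one digest over the whole run.

    `examples` maps a hypothetical example name to its raw bytes;
    `config` holds the seed, hyperparameters, etc. (all illustrative).
    """
    entries = {name: file_digest(data) for name, data in sorted(examples.items())}
    # Canonical JSON so the same inputs always yield the same run digest.
    canonical = json.dumps({"examples": entries, "config": config}, sort_keys=True)
    return {
        "examples": entries,
        "config": config,
        "run_digest": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


def was_recorded(manifest: dict, data: bytes) -> bool:
    """Check whether these exact bytes appear in the recorded training set."""
    return file_digest(data) in manifest["examples"].values()


if __name__ == "__main__":
    manifest = build_manifest(
        {"a.txt": b"hello", "b.txt": b"world"},
        {"seed": 0, "epochs": 3},
    )
    print(json.dumps(manifest, indent=2))
    print(was_recorded(manifest, b"hello"))
```

A manifest like this only answers "were these exact bytes in the recorded set?"; a resized image or reformatted source file would hash differently, so it is evidence for an audit rather than a detector.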

AmericanChopper · on Aug 12, 2021

I haven't yet heard a convincing argument for why data used for training an AI model should be regulated differently from data used by your brain to train your own personal model of a problem space.

taeric · on Aug 12, 2021

Volume is a good reason, all told. Same for capability of action.

That is, I doubt anyone cares how you train any models personally. Not much you can personally do with said models, and you are unlikely to be collecting a ton of data in a fancy way.

Though, that last sentence does bring to mind that you are limited in where/how you can collect data. So it isn't that different, all told.

argomo · on Aug 12, 2021

You're able to manufacture justifications for your decisions (often even the ones you made maliciously or on a whim). AI can't, and that's a problem when you need to talk to auditors, prosecutors, customers, regulatory bodies, etc. All this talk of AI ethics is to help paper over the fact that we're entrusting vital decisions to a black box.

And that's not all. Society is constantly contending with the ineffability of your own black box and has created many tools to regulate its outputs (it's a staggeringly wide gamut--from phonetic alphabet to code reviews to civil liability to patriotism). Most don't apply to AI, so it's easier to focus on inputs.

We increasingly try input regulation with humans as well. Blind hiring, for example, or unconscious bias training.

AmericanChopper · on Aug 12, 2021

This would be relevant to a discussion about using AI to make decisions that have ethical implications. But this thread is not about the ethical consequences of automated decision making, it's about property rights.

If you produced a copyrighted piece of work, and I saw it with my eyeballs, and subsequently went on to produce my own work having been influenced by your work on some level, why should that be any different than using a computer to perform some portion of that process? Or perhaps more specifically, what copyright claim would you expect to have on my future thoughts?

The particular situation here is somewhat novel (with AI content generation being involved), but the fundamental legal question being asked isn't new at all. Derivative work is a very well established component of our legal copyright framework. If I paint a portrait, I don't have to produce an exhaustive list of all paintings I've ever seen to ensure that the copyright holders of all other paintings have the opportunity to properly consider whether I've infringed their copyright. I don't have to do that when I write a line of code either. So why would it be any different if a computer algorithm is responsible for some portion of the content generation?