The more worrying takeaway is that the winners scraped videos from people who clearly never intended their videos to be used for a deepfake detection algorithm. Yet they did not think through the ethics of using that data (did everyone in the video even have a say in its being uploaded?). I think Kaggle disqualifying the team is the right move (even if it's a painful one for the winners).
See the other comment for some of the exceptions.
What's more, if that consent is not legally required (there's a heavy "if" in this sentence; IANAL, so I don't pretend to know whether it's required e.g. under GDPR, but let's assume for a moment that it's not), then Kaggle are still perfectly within their rights to ask for that permission as a condition of qualifying for their competition. After all, it's their competition, and it's totally reasonable for them to set an ethical bar that's even higher than what the law requires.
"A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request."
The whole reason Facebook launched this challenge was to try and bury the bad PR over their data practices. If people in the external datasets had complained about the unauthorized use of their faces in the winning solution, it would've been pretty embarrassing for FB.
Documentation requirements are pretty standard in Kaggle competitions, and usually cover having to supply your code, and maybe write a blog post about it. I've never seen one that had major rules in it.
For instance, with the same scraping being used to train the deepfake GAN, would their model be more or less effective than a competitor model?
It seems like they won because of a disparity in data, not an innovative technical approach.
The correct decision was made.
Because of GDPR and friends, Facebook can't store those photos, even though their licenses are permissive and respect Kaggle's rules.
This clearly shows what Kaggle is: a way to get very cheap, high-quality data-science work. It's not for hiring people, not for truly helping the research community, and not for helping people learn. Nope, just cheap workers.
It really feels like Facebook has its whole deepfake detection strategy riding on this! They put something like $2M on the table to solve an issue that will(?) plague their whole multi-billion platform.
It is also a way for companies to get work done. And maybe the competition format allows them to do so cheaper than people are comfortable with?
But one does not exclude the other. Indeed, it’s the premise for Kaggle’s, and really any marketplace’s, success that it found a model that is beneficial for all participants.
Competitions on Kaggle tend towards goals that are of universal value, of which the detection of deepfakes is one example. Others are too far from any business case to support the idea that companies are just looking to get something done cheaply, such as Google's yearly basketball score prediction competition (which comes with a relatively big pool of prizes).
In fact, just the low number of competitions would seem to speak against the idea that cheapness is the primary motivator here: getting one challenge done for, say, half of what it might cost in-house just doesn't register on the scale of these companies.
What I would assume to be the officially stated motivation, i.e. that it is a way to get a number of different approaches, possibly with ideas that wouldn't come up within the far more homogeneous workforce of these companies, makes a lot more sense, intuitively.
I would expect the algorithm plus the model weights to be enough, and that it is all fine GDPR-wise, since you definitely can't identify people from the weights of a model trained on your picture.
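Roughly what I mean by "algorithm plus weights", as a minimal PyTorch-style sketch (the model here is a made-up toy, not anything from the competition):

    import torch
    import torch.nn as nn

    # Toy stand-in for a detector; what gets shared is this class
    # definition plus a file of learned parameters, not the images
    # that were used to fit those parameters.
    class TinyDetector(nn.Module):
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 8, kernel_size=3, padding=1),
                nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
                nn.Flatten(),
                nn.Linear(8, 1),  # single real-vs-fake logit
            )

        def forward(self, x):
            return self.net(x)

    model = TinyDetector()
    torch.save(model.state_dict(), "detector_weights.pt")  # parameters only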
In this case the disqualified participants are well respected and haven't previously been involved in any dubious behavior. They properly disclosed what they were doing, and despite other clarifications being issued, there was never one saying that personal releases would be required for CC-BY data.
Obviously this is a ridiculous requirement. There's no way for that team to be able to do that, but they did take proper care to use data that Facebook could reasonably use. It's unreasonable for FB/Kaggle to expect participants in a data science competition to suddenly know what Facebook's data ethics department is demanding this week outside what is legally required.
There is an extremely simple solution to that: Do not use the data.
There are countries in which privacy is considered important. Data being public has no influence on whether it is considered personal in the EU. Also, copyright has nothing to do with the issue.
“A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request.”
Notice that the permission is required from the people identified and depicted, not from the copyright holder. I’m not a lawyer and even I find that straightforward and clear.
We suspect that most competitors also did not realise these additional restrictions existed - a brief scan of the External Data Thread turned up no posted data that meets this threshold. During the competition, the rules on external data were repeatedly clarified, so this leaves us wondering why Kaggle never took the opportunity to clarify that external data must additionally follow the more restrictive rules for winning submission documentation.
Here's a Kaggle competition admin saying:
The deadline to declare external data is on March 3rd. So you cannot add new external datasets after that deadline, but you can use any datasets that have been declared (which are not prohibited) on this thread.
and clarifying licensing:
So it is expected that competitors understand the external data they’re using and ensure it matches the requirements in the rules.
I’ve answered the question about BY-NC not being available for use by all (non-commercial use) and therefore violating the requirement that external data be available for use by all participants.
Note nothing about there being extra "rules" in the "Winning Submission Documentation".
I think it’s reasonable to assume that because something is labeled CC-BY that it’s legal to use for situations allowed by CC-BY.
Is there some copyright certification service? Is there a way to test copyrights? If I don’t go by the author’s claimed copyright, how could I verify it?
If I watch a movie, should I not do so until I verify the personality rights of every person who appears in the film? How can I trust Disney?
edit: As for your example, copyright generally doesn't care about you consuming a work but about you sharing a work. As such, you don't have to check anything but Disney does have to check everything.
Disney, and any random YouTuber, do have to check everything; I don’t, if I’m using the work legally. However, if the copyright claim is wrong, I’m not sure where the liability falls.
The way that publishers, for example, avoid such pitfalls is by exclusively working with professional agencies that control the provenance of the work they license.
Watching something is neither copying nor showing, and is therefore safe. Practically, if you want to use any CC-licensed photos, I would advise at least a reverse image search.
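For the reverse-image-search step, something along these lines is what I have in mind; a minimal sketch assuming the google-cloud-vision client is installed and credentials are configured (any similar service would do just as well):

    from google.cloud import vision

    def find_other_copies(path):
        # Web detection acts as a crude reverse image search: it lists
        # pages and exact/partial copies of the image found elsewhere,
        # which helps sanity-check a photo's claimed CC license.
        client = vision.ImageAnnotatorClient()
        with open(path, "rb") as f:
            image = vision.Image(content=f.read())
        detection = client.web_detection(image=image).web_detection

        for page in detection.pages_with_matching_images:
            print("Also appears on:", page.url)
        for match in detection.full_matching_images:
            print("Exact copy at:", match.url)

    find_other_copies("candidate_photo.jpg")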
I suspect that even in the US there is more to it. For example, stock photo agencies treat copyright and model releases as separate things. For any non-editorial images that contain recognizable faces, they are in my experience strict about getting a proper model release.
No one is disputing that the team was disqualified fair and square. But this rule – where you must get consent from every single person appearing in your training data – seems neither standard nor sensible.
Firstly, as someone else pointed out, copyright doesn't apply here at all. You can use whatever training data you want as long as your model is sufficiently transformative. OpenAI used terabytes of copyrighted music in their training for OpenAI Jukebox; they certainly didn't get a license from every musician.
Beyond that – big companies don't play by this rule! If a BigCo wants to train on some data, you bet they'll be using it. When's the last time Google sent you an email like "Are you ok with us using your flickr photos to help improve Google Image Search?"
So my question is simple: in the context of this competition, where should I go to get a decent dataset? The winners were disqualified for doing exactly what I would have done. What's the alternative?
Also, yes, ethics are a concern. If you're concerned about ethics, aim it at big companies, not us small fries that are merely trying to win some cash. Again, no one disputes that they were disqualified for valid reasons. But it has nothing to do with ethics and everything to do with the artificial constraints imposed by this competition.
Where I work now, we work a lot with synthetic training data, because dealing with "real" data is complicated.
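Not our actual pipeline, but as a toy illustration of the idea: samples are generated procedurally, so there is no person in the data and no consent question. A minimal sketch with Pillow, with shapes and labels made up for the example:

    import random
    from PIL import Image, ImageDraw

    def make_sample(size=128):
        # Draw one random shape on a blank canvas and return it with a
        # label; a real pipeline would render far richer scenes, but the
        # privacy property is the same: nothing here depicts a person.
        img = Image.new("RGB", (size, size), "white")
        draw = ImageDraw.Draw(img)
        label = random.choice(["circle", "square"])
        box = [random.randint(10, 40), random.randint(10, 40),
               random.randint(70, 110), random.randint(70, 110)]
        if label == "circle":
            draw.ellipse(box, fill="black")
        else:
            draw.rectangle(box, fill="black")
        return img, label

    for i in range(5):
        img, label = make_sample()
        img.save(f"synthetic_{i}_{label}.png")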
From Kaggle… there is a dataset provided for each challenge.
The disqualified competitors here seem to have assumed that CC-BY meant you can do whatever you want with data, when actually that's far from true. CC-BY is solely about copyright and doesn't address other rights (e.g. model release, GDPR, etc.)
What about divided? By a fraction? :trollface: Does that fall under "increased"?
So maybe they'll get to win on lack of clarity in the specifications, but that will be unfortunate.
Competitions that allow external data are at least partially explicitly about that. If they weren't they wouldn't allow external data (which is not exactly uncommon on kaggle).
While I'm sure it's just accidental in this case, I see this all over the news and suspect attempts to steer public opinion by condemning people or institutions in the headline, before the news is actually reported on. A forum of independent thinkers should insist on not having the news presented to them in a potentially manipulative manner.