Hacker News
Facebook disqualifies leaders of Deepfake Detection Challenge for rule breach (syncedreview.com)
163 points by baylearn 28 days ago | 55 comments

The rules seem pretty clear that consent is required from any persons appearing in any external datasets that are used. The winners scraped data from Youtube videos, so I am not sure what the issue is.

The more worrying takeaway is that the winners scraped videos from people who clearly had no intention of their videos being used for a deepfake detection algorithm. Yet they did not think of the ethical considerations of using that data (did everyone in the video even have a say in the video being uploaded?). I think Kaggle disqualifying the team is the right move (even if it's a painful one for the winners).

The article states the videos used a Creative Commons license that allowed for commercial use. It is an extremely liberal license that does not state "free for commercial use except for when used with facial recognition."

For people in a video you need a model release from them. This is a mistake many people make: they use Creative Commons licenses and think they are safe. A picture or a video needs model releases for the people in it (several exemptions apply).

If that is true then basically all the photographs in Wikipedia are illegal since the only check they do is for copyright not for model release. Pretty sure that's not a legal requirement.

Of which photos are you thinking? It most certainly varies from country to country but public figures or random people captured when taking a picture of a landscape or a building are at least in some countries not subject to such rules.

But that Creative Commons licence was issued by the copyright holder of those videos, not the people in them. The people in those videos may not even have agreed to appear in the video if they were in a public place (the relevant legal term, at least here in the UK, is "reasonable expectation of privacy"). So if Kaggle requires people in the videos to consent to taking part, then that consent cannot be inferred from that licence.

What's more, even if that consent is not legally required (there's a heavy "if" in this sentence, IANAL so I don't pretend to know whether it's required e.g. under GDPR, but let's assume for a moment that it's not), Kaggle are still perfectly within their rights to ask for that permission to qualify for their competition. After all, it's their competition, and it's totally reasonable for them to set an ethical criterion that's even higher than legally required.

You're right, I missed that part of their rules. Looks like they did probably break them.

"A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request."

Yeah, with $1 million at stake, I can't believe this team of really smart people made such an incredible blunder.

The whole reason Facebook launched this challenge was to try and bury the bad PR over their data practices. If people in the external datasets had complained about the unauthorized use of their faces in the winning solution, it would've been pretty embarrassing for FB.

Note that isn't part of the rules. It's part of the "Winning submission documentation requirements" which is a separate document and wasn't mentioned at all on the "external data" Kaggle thread, which had Kaggle moderators explaining the rules.

Documentation requirements are pretty standard in Kaggle competitions, and usually cover having to supply your code, and maybe write a blog post about it. I've never seen one that had major rules in it.

I'm with you here. There are ethical concerns, legal concerns for productization, and overall this defeats the purpose of creating novel algorithms rather than a better trained model.

For instance, with the same scraping being used to train the deepfake GAN, would their model be more or less effective than a competitor model?

It seems like they won from a disparity in data not an innovative technical approach.

It's much better they learn now by being banned from a competition rather than having a lawsuit filed against them in the future.

The correct decision was made.

What if you took commercial video like a news broadcast vs youtube? Would that still be off limits?

IMO the real issue is that Facebook wanted a commercially-usable product, and they thought Kaggle had all the safeguards for that, but no:

Because of GDPR and friends, Facebook can't store those photos, even though their licenses are permissive and respect Kaggle's rules.

This clearly shows what Kaggle is: a way to get very cheap and high quality data-science work. It's not for hiring people, not for truly helping the research community, not to help people learn. Nope, just cheap workers.

It really feels like Facebook have their whole deepfake detection strategy here! They put something like $2M on the table to solve an issue that will(?) plague their whole multi-billion platform.

Kaggle very obviously is useful for the research community, has led to quite a few people getting jobs they otherwise wouldn’t, and (as I can personally attest) is an excellent resource for learning.

It is also a way for companies to get work done. And maybe the competition format allows them to do so cheaper than people are comfortable with?

But one does not exclude the other. Indeed, it’s the premise for Kaggle’s, and really any marketplace’s, success that it found a model that is beneficial for all participants.

Competitions on Kaggle tend towards goals that are of universal value, of which the detection of deepfakes is one example. There are others far enough from any business case to undermine the idea that companies are just looking to get something done cheaply, such as Google's yearly basketball score prediction competition (which comes with a relatively big prize pool).

In fact just the low number of competitions would seem to speak against the idea that cheapness is a primary motivator here. Because getting one challenge done for, say, half of what it might cost in-house just doesn’t register on the scale of these companies.

What I would assume to be the officially stated motivation, i.e. that it is a way to get a number of different approaches, possibly with ideas that wouldn't come up within the far more homogeneous workforce of these companies, makes a lot more sense, intuitively.

I think this is the main takeaway. Kaggle, corporate-sponsored hackathons, etc. are all very cheap ways for companies to get work done that would typically be very expensive. Who benefits from these situations except for companies? Even if you get a job out of your Kaggle contributions, what value was created for a company by Kaggle in comparison to the value that was created for you?

We would arguably all benefit from a robust method to detect deepfakes, and from its availability, as open source, to all platforms large and small.

That’s a benefit to the collective and I believe mnky9800n is referring to the authors of the extremely beneficial work being exploited via inadequate compensation.

Perhaps for only about a week, until the GANs can be optimized to bypass detection. It is an arms race with no end in sight if we're being honest.

Do Facebook need to store the photos to use the model commercially?

I would expect the algorithm plus the weights of the model to be enough, and that it is all fine GDPR-wise, since you definitely can't identify people from the weights of a model trained with your picture.

Presumably you'd want to retrain the model from time to time and that would require the original material.

I think the issue here is that Kaggle's statement that the top teams broke the rules is just very opaque. They stated they broke the rules on external data. The article then goes on to talk about what data the teams used, what licenses it has, and what data the teams were asked to provide. But it really is almost impossible to know what the concerns of FB/Kaggle were without them specifically stating them. Clearly, whatever the issue was, it didn't affect every team, so it may be that there were details of the licenses the disqualified teams used that weren't good enough. As I say though, it's very difficult to say, and it's kind of hard to think of a reason Facebook would arbitrarily disqualify teams for no good reason. It's perfectly possible FB were concerned about image rights or something else, but people seem to be perfectly happy just assuming some grand conspiracy.

For those who aren't aware, many Kaggle competitions allow external data (this one did) but require disclosure, and often there is some back-and-forth to clarify the exact details of what is used.

In this case the disqualified participants are well respected and haven't previously been involved in any dubious behavior. They properly disclosed what they were doing, and despite other clarifications being issued, none said that model releases for people in CC-BY data would be required.

Obviously this is a ridiculous requirement. There's no way for that team to be able to do that, but they did take proper care to use data that Facebook could reasonably use. It's unreasonable for FB/Kaggle to expect participants in a data science competition to suddenly know what Facebook's data ethics department is demanding this week outside what is legally required.

> There's no way for that team to be able to do that

There is an extremely simple solution to that: Do not use the data.

There are countries in which privacy is considered important. Data being public has no influence on whether it is considered personal in the EU. Also, copyright has nothing to do with the issue.

The requirements were clearly spelled out.

"A. If any part of the submission documentation depicts, identifies, or includes any person that is not an individual participant or Team member, you must have all permissions and rights from the individual depicted, identified, or included and you agree to provide Competition Sponsor and PAI with written confirmation of those permissions and rights upon request."

Notice the permission is from people identified and depicted, not the copyright holder. I'm not a lawyer and even I find that straightforward and clear.

That was part of the "Winning Submission Documentation", not part of the rules. It's true that it was there, but to quote the authors:

We suspect that most competitors also did not realise these additional restrictions existed - we are unable to find any data posted in the External Data Thread which meets this threshold with a brief scan. During the competition, the rules on external data were repeatedly clarified, so this leaves us wondering why Kaggle never took the opportunity to clarify that external data must additionally follow the more restrictive rules for winning submission documentation.[1]

Here's a Kaggle competition admin saying:

The deadline to declare external data is on March 3rd. So you cannot add new external datasets after that deadline, but you can use any datasets that have been declared (which are not prohibited) on this thread.[2]

and clarifying licensing:

So it is expected that competitors understand the external data they’re using and ensure it matches the requirements in the rules.

I’ve answered the question about BY-NC not being available for use by all (non-commercial use) and therefore violating the requirement that external data be available for use by all participants.[3]

Note nothing about there being extra "rules" in the "Winning Submission Documentation".

[1] https://www.kaggle.com/c/deepfake-detection-challenge/discus...

[2] https://www.kaggle.com/c/deepfake-detection-challenge/discus...

[3] https://www.kaggle.com/c/deepfake-detection-challenge/discus...

Yes, this seems like a clear case of bullying by FB and in turn by Kaggle. As a contestant I would be furious too. If I shared what data I was using during the contest with the other participants, and then I won, then it's completely fair. After the match, suddenly the Facebook legal team wakes up; they should have been awake during the whole contest. Ideally, they should award the prizes and then run another contest where the legal team is involved throughout.

If they disclosed the data, I agree with you that the decision should have been made during the competition, not after it...

Why would written consent be needed from people appearing on pictures with CC-BY licence? Was this just an overreaction or is there any legal risk for Facebook using those pictures without additional consent?

In addition to the concern that copyright doesn't necessarily cover personality rights, you cannot assume that just because someone uploaded an image to Flickr as CC-BY that they had the right to do so. In a lot of cases they do not (when people upload images found on the internet). Yes, it's their fault, but in the end you still don't have a valid license for the image.

I think the poster is responsible for correctly releasing content under the right copyright.

I think it’s reasonable to assume that because something is labeled CC-BY that it’s legal to use for situations allowed by CC-BY.

Is there some copyright certification service? Is there a way to test copyrights? If I don't go by the author's copyright claim, how could I verify it?

If I watch a movie, should I not do so until I verify the personality rights of every person who appears in the film? How can I trust Disney?

Legally, you are responsible for ensuring YOUR work follows copyright. If your work is made of other works then you need to ensure you have a license for them all. Otherwise you can be sued and saying someone uploaded it to Flickr with a CC-BY won't save you. Presumably you'd track down the original author and have them sign a piece of paper where they claim it is their work. Then you can at least sue them for damages if it's not.

edit: As for your example, copyright generally doesn't care about you consuming a work but about you sharing a work. As such, you don't have to check anything but Disney does have to check everything.

That’s what I meant. The youtube poster is responsible for the copyright. I don’t think it’s possible for me to check beyond the legal copyright.

Disney, and a random youtuber, does have to check everything, not me if I'm legally using it. However, if the copyright is wrong, I'm not sure what the liability is.

While I understand where you’re coming from, in terms of the current legal situation, you are liable for mistakenly believing some license information when that turns out to be wrong.

The way that publishers, for example, avoid such pitfalls is by exclusively working with professional agencies that control the provenance of the work they license.

Watching anything isn't copying or showing, and is therefore safe. Practically, if you want to use any CC-licensed photos, I would advise at least a reverse image search.

One way to protect yourself is buying licenses to images from a service that provides legal indemnification. Part of the reason why stock image websites can earn money is that they provide this, so (at least part of) the risk is with them.

Creative Commons is based on copyright law. In some countries the rights regarding ones likeness are covered by separate laws and Creative Commons has no influence on that.

I suspect that even in the US there is more to it. For example, stock photo agencies treat copyright and model releases as separate things. For any non-editorial images that contain recognizable faces they are, in my experience, strict about getting a proper model release.

Copyright only gets you editorial usage. For any commercial usage, you must also have a model release or property release.

It's not about legality, it's about image. If they hadn't disqualified them, the headline would've been "Creepy Facebook Contest Scrapes YouTubers' Faces for AI 'Black Box'".

As a machine learning researcher, where exactly am I supposed to get a dataset that complies with Facebook's/Kaggle's rules in this case?

No one is disputing that the team was disqualified fair and square. But this rule – where you must get consent from every single person appearing in your training data – seems neither standard nor sensible.

Firstly, as someone else pointed out, copyright doesn't apply here at all. You can use whatever training data you want as long as your model is sufficiently transformative. OpenAI used terabytes of copyrighted music in their training for OpenAI Jukebox; they certainly didn't get a license from every musician.

Beyond that – big companies don't play by this rule! If a BigCo wants to train on some data, you bet they'll be using it. When's the last time Google sent you an email like "Are you ok with us using your flickr photos to help improve Google Image Search?"

So my question is simple: in the context of this competition, where should I go to get a decent dataset? The winners were disqualified for doing exactly what I would have done. What's the alternative?

Also, yes, ethics are a concern. If you're concerned about ethics, aim it at big companies, not us small fries that are merely trying to win some cash. Again, no one disputes that they were disqualified for valid reasons. But it has nothing to do with ethics and everything to do with the artificial constraints imposed by this competition.

Big companies definitely care a lot more about these things than universities or "small-time researchers". I've been in long meetings with corporate lawyers at multiple big companies, and they are a lot more careful than you imply (because they have more to lose). It's always a risk vs reward thing. Things like "copyright doesn't apply here at all. You can use whatever training data you want as long as your model is sufficiently transformative." are not certain until they have been tested in courts, and will be different for each case (it's a lot easier to argue that a simple classifier is transformative than a GAN that might reproduce one of its training inputs, for example). An additional complication is that if you're a company that's operating globally, you have to pay attention to a lot more laws.

Where I work now we work a lot with synthetic training data, because dealing with "real" data is complicated.

> As a machine learning researcher, where exactly am I supposed to get a dataset that complies with Facebook's/Kaggle's rules in this case?

From Kaggle… There is a dataset provided for each challenge.

Ironic rules from Facebook considering Facebook has been caught multiple times harvesting data from other apps on phones without permission.

As someone who previously competed on Kaggle, this seems a reasonable decision. In previous contests it was pretty clear if you wanted to do something that used third party data you should get pre-clearance for it from Kaggle/contest organizers.

The disqualified competitors here seem to have assumed that CC-BY meant you can do whatever you want with data, when actually that's far from true. CC-BY is solely about copyright and doesn't address other rights (e.g. model release, gdpr, etc.)

> and each individual participant further waives all rights to have damages multiplied or increased.

What about divided? by a fraction? :trollface: Does that fall under "increased"?

This competition should not be about scraping and tagging skills (impressive as they may be).

So maybe they'll get to win on lack of clarity in the specifications, but that will be unfortunate.

So you think this task is learnable without any labeled data ("scraping and tagging skills")? After all, that's pretty much how FB, Google et al. do it as well: throw more data and CPU power at the problem.

That is not what I said

>This competition should not be about scraping and tagging skills

Competitions that allow external data are at least partially explicitly about that. If they weren't they wouldn't allow external data (which is not exactly uncommon on kaggle).

They're explicit that this data should be shared, obviously aiming for it not to be part of the unique contribution of the contender. They may not provide a tight enough statement about that, but it is fairly clear.

It's unfortunate that the title leads with the "backlash" from the thing that happened, not the thing itself ("Kaggle disqualifies participants over external data usage"). This suggests that the decision about the case has already been made by a plurality and with fervor. In reality, this article is the first time many HN readers are learning about this at all.

While I'm sure it's just accidental in this case, I see this all over the news and suspect attempts to steer public opinion by condemning people or institutions in the headline, before the news is actually reported on. A forum of independent thinkers should insist on not having the news presented to them in a potentially manipulative manner.

We are in the Zeitgeist of "...all over the news and suspect attempts to steer public opinion"... One of the reasons is newspapers fighting for survival (and therefore prioritizing "what people like to hear" to boost ad clicks over "what people should know"). There are also other reasons, like newspaper editors favoring one political party over the other, so they publish mostly negative news about one party and mostly positive news about the other.

I feel the same, and it was the right move imho, because if they hadn't disqualified the winners, another headline could have been "Facebook, Kaggle rewarding unauthorized use of people's data".

One possible reason for the “no external datasets” rule might be that the data which Facebook uses to judge the competition is also taken from the same publicly available sources. If this is so, then if somebody uses those same datasets, they would have trained to the test, so to speak, which obviously would not lead to good outcomes when run against future data.

The thing is that they allowed external datasets :\

Also that would be extremely lazy from Facebook, who have plenty of sources of this data without having to scrape random YouTube videos.
