Microsoft quietly deletes largest public face recognition data set (ft.com)
35 points by Turukawa on June 6, 2019 | 11 comments



"The site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed."

What in the ACTUAL fuck, Microsoft.


“The people whose photos were used were not asked for their consent, their images were scraped off the web from search engines and videos under the terms of the Creative Commons license that allows academic reuse of photos.”

Is it ethically acceptable to republish someone's CC-licensed face annotated for the purpose of training recognition algorithms?

Is the dataset inclusive as a composite whole? Does it have biases such as “primarily white-male” or “majority white”?

Should facial recognition data and training studies of people’s faces be required to adhere to the same ethical review practices as psychology and sociology studies of people?

If I were Microsoft, I would red-flag every one of these questions as reason enough to take the dataset down immediately until they were answered.


I don't personally see the ethics issue if they were using publicly available pictures. That would be like someone scraping public FB profiles.

That person posted the image/information willingly knowing they lose all control over how it is used and who it is seen by.

Could definitely introduce some bias, though, if your input set isn't filtered for some kind of diversity.


That person posted the image/information willingly knowing they lose all control over how it is used and who it is seen by.

This seems like a dangerous precedent. There have been cases where images of recognisable people that were made available with some liberal licence were then used as part of marketing for deeply offensive campaigns or illegal activities, for example.

I don't think it's reasonable to say that anyone who volunteered to let others use their image for general purposes should automatically accept the kind of portrayal that would result in a defamation lawsuit in other contexts. You can call them naive for not anticipating nasty people doing that with their image, but naivety isn't a crime. Meanwhile, being portrayed deliberately and without warning as a child abuser or a supporter of a highly unpopular politician or a drug addict or a terrorism suspect could have profound and immediate consequences for the subject, who obviously didn't intend to consent to that and may have no idea it has been done until the reality catches up to them.


Misrepresentation of the images is something Microsoft has no more power over than Google Image Search. At the very least, their dataset here didn't include names/locations/etc. I don't really see how this is any different from Google using their own data in projects like DeepMind. At least Microsoft admitted the project didn't go as planned, and they're shuttering it and cleaning up their data.


To narrow this further to the science-fiction ethical issue that deepfakes and facial recognition are both forcing to the foreground, try this on for size:

“All humanity has the inalienable right to control how their likeness is transformed by others. Consent must be given freely by either the human or their delegated representative, and no discrimination against refusal to permit transformation, whether by default or by declaration, shall be permissible under law.”


I’m not asking if they gave up copyright on their photos. They did. I’m asking, for example, if it’s ethically appropriate for Microsoft to publish annotated public domain photos without requiring a human ethical review for each use of their dataset. If I wanted to perform a sociological study on that dataset, I’d have to get a review board’s approval. Why is performing a statistical study (literally, machine learning) somehow exempt from that ethical concern?


Perhaps, meaning there are likely origin issues with the dataset's usage permissions, and they want to distance themselves from it.

Or it could be that no one wanted to manage it after this person left, and they decided to take it down rather than deal with having to maintain it.


Or to avoid privacy PR backlash. Or to gain advantage with it internally while denying that benefit to those who haven't already copied it. Or to monetize access to it differently.


Might be a valid reason. After leaving universities, I was required to remove my datasets from the institutions' websites.


Anyone have a bittorrent hash for the data set?



