Google employees are listening to Google Home conversations (translate.google.com)
871 points by bartkappenburg 8 days ago | 441 comments





I think the responses to this can be broken down into a 2x2 matrix: level of concern vs. understanding of technology.

1) Don't understand ML; not concerned - "I have nothing to hide."

2) Don't understand ML; concerned - "I bought this device and now people are spying on me!"

3) Understand ML; not concerned - "Of course, Google needs to label its training data."

4) Understand ML; concerned - "How can we train models/collect data in an ethical way?"

To me, category 3 is the most dangerous. Tech workers have a responsibility not just to understand the technologies that they work with, but also to educate themselves on the societal implications of those technologies. And as others have pointed out, this extends beyond home speakers to any voice-enabled device in general.

In conversations about this with engineers the response I've gotten is essentially: "Just trust that we [Google/Amazon/etc.] handle the data correctly." This is worrying.


I'm in the 5th category. 5) Understand ML; concerned - won't allow any of these things in my house, period, because they will always use them for things behind the scenes that they won't state. I don't care how well trained they are, or how "ethical." Ethical... according to whom, and at what time period in the future? Ethics change. The data they have on you won't. Look at all of the politicians and other people getting in trouble for things they said 15 years ago, which were generally more acceptable at the time but we've "progressed" since then. Who will be making decisions about you in the future based on last year's data? Just don't give it to them.

Before I go further: do you own a smartphone? If not, you can ignore the rest.

So, assuming you have a smartphone. That is a device which has:

- Permanent network connection. No matter what you do on your Wi-Fi, it still has the cellular data network, which is controlled by an independent processor with its own firmware

- Excellent noise-cancellation microphones. They may not be able to pinpoint the sound source like an Echo would, but they are still pretty sensitive

- Accurate location updated via GPS. If you block GPS, it can still use the cell phone towers as an approximation, plus any SSID beacons nearby

- Very powerful processor, capable of listening for 'interesting' data before sending anything

- Similar functionality to an Echo / Google Home / Apple whatever. They'll be listening for "hey siri" or "ok google"

And so on. Plus, anyone you interact with will have such a device in their pockets.

Given that, why would an Echo be of any particular concern? At least you can monitor its network activity easily. Not so with a smartphone, which is a much higher threat. And yet most people sleep with them next to their beds.


> Similar functionality to an Echo / Google Home / Apple whatever. They'll be listening for "hey siri" or "ok google"

This seems to be a pretty big assumption, without which your whole argument falls apart. In my case, at least, I have no such functionality enabled. You could argue that the manufacturer of my phone could push a software update that enables it without my knowledge (in fact, they likely couldn't, because I don't have automatic updates turned on), but even if this were true we're now arguing about a very different thing - the theoretical ability of MegaCorp to spy on me as opposed to actual known spying.


> because I don't have automatic updates turned on

...the malicious entity already pushed it out silently in the last update you did accept...

You really have to trust the entity building the software on these devices. These little half-measures are purely psychological protection.


Not really.

There is a huge difference between "we listened to your conversation, but hey didn't you accept our EULA?" and "let's spy on some random people".


EU governments and telcos spy on you by law. 10 years ago there was a law across the entire EU that forced telecom companies to store users' positions and call numbers for the past 6 months. Germany even got fined because it did not implement the law.

And in the US, all SMS messages are stored for not less than one year, also required by law.

And time and time again we hear that this data gets used for even the tiniest details of a crime, because it's cheaper than actual police work.

https://www.techspot.com/news/73776-google-receiving-police-...

https://www.propublica.org/article/no-warrant-no-problem-how...

https://www.csoonline.com/article/2222749/mobile-phone-surve...

And of course, the police have been caught abusing this data. Data acquired this way is also likely to reach you in other types of cases: divorce proceedings often use this sort of data, and it's been used in trade disputes too.

Given that all that already happens with cellphone data, what's the point of doing anything about it at this point?


> In my case, at least, I have no such functionality enabled

How do you know? Is your device _incapable_ of performing that task, or did you just click a checkbox somewhere? Because if you just politely _asked_ the OS to disable it, it doesn't really matter, does it? We are back to the matter of trusting the vendor.

> (in fact, they likely couldn't, because I don't have automatic updates turned on)

So you say. The software may or may not agree with your statement.

> the theoretical ability of MegaCorp to spy on me as opposed to actual known spying.

There's no "actual known spying". Spying implies intent, none has been demonstrated so far.

Where your argument falls apart (yep, I can use that line too) is this: you are interacting with other people also carrying powerful listening devices of their own. They may or may not have disabled "ok google" functionality, and they may or may not have disabled updates. Again, where do you draw the line? Do you ask everyone to leave their devices behind before interacting with them?


Yes, but there's still a difference between what we are worried they might be doing with our smartphones, and what we KNOW they are doing with 'voice assistants'.

I turn off voice assistant on my smartphone and don't have any google home/alexa devices either. I have a TV with no voice activation. I won't use voice activation devices.

Is it possible my phone is listening to me anyway and sending recordings to someone? It's possible. If I were smart I would never say anything that would be 'dangerous' if someone heard it with my phone in the room, although I'm sure I'm not that careful; it's probably not possible.

If I found out for sure my phone was listening to me surreptitiously, rather than just knowing that it's not impossible -- I'd do something about it.

One effect of arguing "it's all the same, your phone COULD be spying on your voice conversations anyway" is, perhaps ironically, to make people care LESS about it. If it's doing it anyway (or just if I can't prove it's not?), what do I care about Google Voice doing it too? But one disincentive for companies doing it is precisely that people will be upset if they find out. (And they're gonna find out eventually).

You don't need to go live in the woods to care about it. It's good to care about it. You don't need to either "draw the line" at living in the woods without electricity, or draw no lines. We can draw lines. They can move. If your argument is that since you can't be sure that nobody's spying on you, you shouldn't care about the people you DO know are spying on you and shouldn't get upset about it -- that's not how any of this works.

Insisting on all or nothing gets you nothing (cause it turns out 'all' isn't really an option), and who wants nothing?


I'm in this camp

I'm just inserting a comment here in the hopes that it will come before a more deeply nested soon-to-be-written XKCD comic. Although I'm fairly confident XKCD has covered this in the past already.

He's just using a classic strawman, making the argument he could win.

On my smartphone I can at least turn off the apps that would automatically listen for voice commands; I can use the phone just fine without them. And if I thought there was enough of a threat that Google or someone else was listening in using my phone's microphone without my knowledge, even with the apps turned off, I would switch to a rooted phone that would give me more control.

For the Echo/Google Home/etc. devices, I don't have any of the above options, so I simply don't buy them and don't use them.


Even a rooted phone will not help, since the monopolistic baseband processor will do it. The big ARM CPU is just for the users, the baseband is for control of the public masses and mass surveillance.

The surveillance problem is not black and white. A rooted phone may not help you if a state actor is after you, but it will help against companies that make money by building an invasive profile of who you are and what you do. Again how much a rooted phone helps in that depends on how many ties you want to cut with the digital-advertising ecosystem.

Here's a paper describing the privacy invasive and potentially harmful properties of pre-installed apps [1]. I specifically saved this paper in my files so I can have a comprehensive answer when people ask me why I go through the trouble of rooting my devices.

P.S. People who know more than me are saying that the baseband processor can be restricted with hardware practices such as IOMMU [2]. I don't really know how effective that can be.

[1]: https://haystack.mobi/papers/preinstalledAndroidSW_preprint....

[2]: https://news.ycombinator.com/item?id=20151431


> the monopolistic baseband processor will do it

The baseband processor isn't running Siri or Google Voice on my phone. Yes, if the state has a back door into the baseband processor there's not much I can do about it, but that's not the threat model that I'm avoiding by not using devices like the Echo or Google Home.


> that's not the threat model that I'm avoiding by not using devices like the Echo or Google Home.

100% this. IF a state intelligence agency is after my data they'll get it somehow (most likely just by asking). My concern is the rampant dragnet aggregation of personal data that I didn't expressly provide them by companies like Google and Facebook. I mean, Google having access to your emails that you store in Gmail? Kind of understandable. Google buying MasterCard transaction records to cross-reference against your web browsing data as tracked by any site using Google Analytics? No.


I tend to believe you, but I'd love to see some material to back up your suspicions.

There's a mute button on the Echo. You can turn voice off just the same as you can on a phone.

> There's a mute button on the Echo.

Which defeats the whole purpose of having it, namely, to respond to voice commands.

> You can turn voice off just the same as you can on a phone.

I don't have to turn off voice altogether on my phone, I just have to turn off the apps that I don't trust. On an Echo, either I turn it off and might as well not have it, or I turn it on and it's recording everything. There's no middle ground where I get some basic functionality without being spied on.


> On an Echo, either I turn it off and might as well not have it, or I turn it on and it's recording everything. There's no middle ground where I get some basic functionality without being spied on.

I'd guess that most people could effectively find a middle ground by treating the Echo as if it were someone in a cordial but not really friends relationship with a family member, such as a classmate of one of your kids over to discuss a school project, or a member of your spouse's church over to discuss planning the annual church picnic.

If they are doing things that they would not mind such a person overhearing, have the Echo on. If they are doing things that they would not do until such a person leaves, have the Echo off.

It would be interesting to give these home voice assistants faces, including eyes that look at and track people it is listening to when they speak, to help people remember that it is still listening to them. That might make it easier to remember to turn it off before discussing sensitive things in its presence.


The analogy you make is helpful, but it immediately brings attention to one thing: you generally don't invite "someone in a cordial but not really friends relationship with a family member" to stay with you all the time, day and night. People you do let stay with you are people whom you either trust, or expect to trust as the relationship develops. There's no relationship development with an Echo - it's forever an agent of a faceless corporation that's indifferent to you.

There's also the perma-mute button which you can activate by not buying one in the first place.

My kids left Alexa outside over a rainy weekend. C'est la vie!

Turns out the NSA has an implant for smart TVs that, among other features, keeps the TV on and recording you while turning off the light that indicates its on status.

You don't own the device despite paying for it. It updates itself when it feels like it and does whatever its actual owner says including compromising itself if told to.


Not without internet.

My parents' smart TV has over a gig of storage for apps.

I'd be really surprised if that couldn't act as an offline cache.


It still needs internet to fill that cache. Don't use the smart TV part, or whitelist it at the router.

I use Lineage OS (https://lineageos.org) with the cameras taped. It's not perfectly secure, but it's the best tradeoff I can make at the moment. Here's a few better options I want to switch to in the future:

Graphene OS, a security-hardened Android ROM: https://grapheneos.org

Librem 5, probably the ideal solution: https://puri.sm/products/librem-5/ Just look at the bullet-point features:

  + Does not use Android or iOS. The Librem 5 comes with the mobile version of our FSF-endorsed operating system PureOS by default, and is expected to be able to run most GNU+Linux distributions.  
  + CPU separate from baseband, isolating the blackbox that the modem may represent and allowing us to seek hardware certification of the main board by the Free Software Foundation.  
  + Hardware Kill Switches for camera, microphone, WiFi/Bluetooth, and baseband.  
  + End-to-end encrypted decentralized communications via Matrix over the Internet.  
  + We also intend the Librem 5 to integrate with the Librem Key security token in the future.
Pinephone, not as good as the Librem 5, but much cheaper: https://www.pine64.org/pinephone/

>Before I go further: do you own a smartphone?

Someone really needs to make a decent modular smartphone with a detachable radio.


The Librem 5 has a kill switch for the radio: https://puri.sm/products/librem-5/

Oooh, a Debian-derivative smartphone. I've been wanting one of these to reappear in the wild ever since having a play with a Nokia N900.

Does the radio have DMA?

nope! awesome.

It's not quite what you're asking for, but Apple does still sell the iPod Touch, and they recently revved it. One could pair that with a mobile hotspot with a power button...

I have been considering a battery-powered Raspberry Pi with a USB hotspot and a virtual phone number.

I guess it makes sense to detach the radio if you're not using your phone to receive calls...

Do you mean the part where you can pull the Sim card out of the phone and reinsert it when you need to make a call?

Pull out the SIM card? I guess once eSIMs catch on, this won't be an option.

None of which invalidates these concerns. You just fall into category 1 or 3.

I won't have one either. Well, I have my Android, which I guess is essentially the same.

I was visiting my neighbor the other day. We were wondering something about Special Counsel Mueller, and she decided to ask Alexa. This was new, and I was surprised she had this surveillance device, but I didn't say anything.

There was some back and forth, then I said to her "I think you're ordering motor oil."

"Alexa, stop! Alexa, stop!" As if a dog was chewing on the furniture.

And now Amazon knows that I was discussing politics, and my opinion. That kind of analyzed data is going to be lucrative in some future military or law enforcement or political contract.

Rule 1. Don't leave money on the table.

Rule 2. Ever.


Do they know it was you, specifically? I would say they know your neighbor was talking with someone else, and possibly that politics were being discussed (not really something they seem to have put a lot of effort into recognizing, since it doesn't lead to shopping...) If you're interacting with an Echo or gHome you can always ask it "who am I?" GOOG is sneaky enough to recognize you on any other device, Alexa, not so much (and in this case you shouldn't be on their radar to begin with).

> Do they know it was you, specifically?

Maybe not with today's technology, but in the future, they will still have the recording and will have the technology.

It's the same thing with encryption. It's safe today, but in the future it'll be decodable, and they'll still have the encrypted copies to work on.


What WalterBright says below.

They can eventually know who I am by voice recognition, especially since they can get corroboration from the fact that we live in the same building.

I imagine we probably said something that would suggest I live nearby (... "Checking database" ...) or that I'm a regular visitor (... "Checking database." ...).

They could be sending an inaudible tone, to be captured by an app on my phone. I don't have an Amazon app, but some other app might capture the tone and sell that fact to Amazon, so that they don't leave money on the table.


Are there microphone-blocking phone cases? All I can really find is https://www.privacycase.com/, and it seems like overkill.

Imagine 20 years from now you decide to run for political office. Do you want someone "ethically" going through all your old recorded conversations looking for statements that will be politically incorrect 20 years from now?

Not only you, but your kids. People have lost sponsorships and contracts because of things one of their parents said before the person in question was even born. Look at last year's incident where Eli Lilly pulled their NASCAR sponsorship of Conor Daly over something his father had said a decade before Conor existed. https://www.indystar.com/story/sports/motor/2018/08/24/eli-l...

That's nuts. Things like this should cost a company its reputation.

Fortunately, Eli Lilly took a bigger hit over this than Daly did. It's still disgusting on the principle of it, though.

I would expect it. I would watch what I say now and in a few years what I think.

Or intentionally don't watch what you say and stop giving this kind of system any power. If you run for office in 20 years planning to hide what you think now, or apologize for it, then you're in for a losing battle. Say what you think, admit when you're wrong, learn from your mistakes.

This. There’s nothing that binds today’s ethical company to voluntary ethical behavior when things get tough (or they get acquired).

There's nothing that stops the most ethical companies from being completely compromised by state actors either.

>won't allow any of these things in my house period because they will always use them for things behind the scenes that they won't state.

As OP stated, "this extends beyond home speakers to any voice-enabled device in general".

Unless you're living like the Unibomber, it's not a matter of "just don't give it to them". The moment you step outside your house and socialize with almost anyone anywhere, it's liable to being taken from you.


Neatpick: "unabomber" as in UNABOM (University and Airline Bomber).

As in Not EAT PICtures, oKay ?

"oh but i deleted all my data through the helpful portal they provided. they said i had full control?"

Data is Google's game. Once they have it, they have it. Nothing else matters. You own nothing. Your image, your voice, your ideas, your conversations, your habits, your medical records. Google wants it all. That is Google's world.

Google ain't giving that up. Everything they do is about getting more and more of it. Everything to the contrary is just PR spin.


Additionally, in order to be 'ethical' by any means, they still have to collect the data in the first place. Regardless of actual intentions around ethical/defined practices, that always leaves the possibility for someone else to access that data and do unintended things with it.

Could I ask your age?

I have been casually tracking who actually owns these types of devices, and have never met anyone under 25 who would even consider it, whereas I go to the homes of people in their late 30s and 40s and see them all the time.


If you visit your friends or relatives, you cannot fully protect yourself from this data collection by simply choosing not to own these devices yourself.

What's strange about this conversation is that your refusal to trade privacy for convenience is passed off as some kind of no-loss decision, when in reality there are all kinds of downsides to living your life the way you claim you do.

The conversation can't even be had until the facts are presented honestly, and when you present your position this way, it's not honest.


This classification is very useful to discuss this issue.

The difference between 3 and 4, noble as it is, can be caused by feasibility concerns that push people into 3, not just ignorance of the privacy impact. Human labelling of training data sets is a big thing in supervised learning. Methods that dispense with this would be valuable for purely economic reasons beyond privacy - the cost of human labelling of data samples. Yet we don't have them!

Techniques like federated learning or differential privacy can train models on opaque (encrypted or unavailable) data. This is nice, but they assume too much: that the data is already validated and analyzed. In real life modelling problems, one starts with an exploratory data analysis, the first step being looking at data samples. Opaque encrypted datasets also stop ML engineers from doing error analysis (look at your errors to better target model/dataset improvements) which is an even bigger issue, IMO, as error analysis is crucial when iterating on a model.

Even for an already productivized model, one has to do maintenance work like checking for concept drift, which I can't see how to do on an opaque dataset.
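
For what it's worth, here is a minimal sketch of the federated-averaging idea mentioned above: each client computes an update on data that never leaves its device, and only the resulting weights are averaged centrally. The toy least-squares model and all the names are mine, not anyone's production system; real deployments add secure aggregation, clipping, and differential-privacy noise on top.

    import numpy as np

    rng = np.random.default_rng(0)

    def local_update(weights, X, y, lr=0.1):
        # One step of least-squares gradient descent on a client's private data.
        grad = 2 * X.T @ (X @ weights - y) / len(y)
        return weights - lr * grad

    # Three clients, each holding a private dataset the server never sees.
    clients = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
    global_weights = np.zeros(3)

    for _ in range(50):
        # Clients train locally; only the updated weights leave each device.
        local_weights = [local_update(global_weights, X, y) for X, y in clients]
        global_weights = np.mean(local_weights, axis=0)  # federated averaging

    print(global_weights)

Note that nothing in this loop lets an engineer eyeball a misheard clip or dig into a specific error case, which is exactly the exploratory-analysis and error-analysis limitation described above.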


It's not wrong for humans to label training data. It's wrong to let humans listen to voice recordings that users believed would be between them and a computer. The solutions are obvious: sell the things with a big sticker that says, "don't say anything private in earshot," revert to old fashioned research methods where you pay people to participate in your studies and get their permission, or ask people for permission to send in mis-heard commands like how Ubuntu asks me if I want to send them my core dumps.

> ask people for permission to send in mis-heard commands

Note that you also want the "correctly" heard commands, because some of them will have been incorrect. It's frustrating when an assistant gives the "I don't know how to do that", but it's even more frustrating to get "OK, doing (the wrong thing)".

Also, another alternative: provide an actual bug reporting channel. "Hey Google, report that as a bug" "Would you like to attach a transcript of the recent interaction? Here's what the transcript looks like." "Yes."
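
As a rough illustration of that alternative (the dialog strings and field names are made up, not any real assistant API), the channel could show the user exactly what would be attached and include the transcript only on an explicit yes:

    def report_bug(recent_interaction, confirm):
        # Show the user exactly what would be sent before asking for consent.
        print("Would you like to attach a transcript of the recent interaction?")
        for line in recent_interaction:
            print("  " + line)
        attach = confirm()  # explicit yes/no from the user
        return {
            "issue": "assistant misunderstood the request",
            "transcript": recent_interaction if attach else None,
        }

    # Example: the user agrees, so the transcript is included in the report.
    report = report_bug(
        ["user: navigate home", "assistant: showing new Maps features"],
        confirm=lambda: True,
    )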


To be fair the system already has something like that. If you complain to the Home it'll ask if you want to provide feedback and give you a few seconds to verbally explain what went wrong.

I'm not sure if humans will then review that feedback or if it goes through a speech-to-text algorithm first, but the mechanism for feedback is there.


Yeah, I think I've experienced that. I was driving with Maps directions, and while I was driving Google decided to show me new things Maps can do.

I tried to voice my way back to directions, unsuccessfully. I said "Fuck you Google."

"I see that you're upset," followed by some instructions on how to give feedback. While I was driving. It sounded almost exactly like "I'm sorry Dave, I can't help you."


iOS voicemail transcription has this.

> like how Ubuntu asks me if I want to send them my core dumps

While I like how Ubuntu does it, I actually like better how Fedora does it. Not only do they ask to submit core dumps, but they also give you the ability to annotate and inspect what gets sent, as well as a bug report ID which you can use to follow up.


Agreed. I'd like to support Ubuntu development, and I often run it on bleeding-edge hardware I'd like to submit crash reports for, but the inability to sanitise the data causes me not to unless it's a "fresh" device.

Just give participants the choice to opt in for a chance to get early access to new products. Make it invite only to feel exclusive. They will have millions of willing test subjects.

Good point, there's precedent from hospitals wrt IRBs and other infrastructure involved with data gathering. Hospitals/research institutions self-regulate in this regard; it doesn't appear tech does.

Handling the data in an ethical way doesn't need to mean handling the data in a completely anonymous fashion. That would be one solution, but you can also create a trust-based system for how the data being labeled is handled, similar to HIPAA. In addition, there are simple operational methods that could help ensure the data is processed as close to anonymously as possible. For example, with voice data you could filter the voices, work with the data in segments, and ensure that metadata for the samples is only accessible by trusted individuals certified under the above framework.
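
To make the operational part concrete, here is a rough sketch of what that could look like, assuming voice clips arrive as plain sample arrays; the naive pitch shift, the function names, and the certification check are illustrative only, not any particular company's pipeline:

    import numpy as np

    def filter_voice(samples, factor=1.2):
        # Crude pitch shift by resampling; a real pipeline would use a vocoder.
        idx = np.arange(0, len(samples) - 1, factor)
        return np.interp(idx, np.arange(len(samples)), samples)

    def segment(samples, rate=16000, seconds=3):
        # Labellers only ever see short segments, never whole conversations.
        step = rate * seconds
        return [samples[i:i + step] for i in range(0, len(samples), step)]

    METADATA = {}                  # clip_id -> identifying info, stored separately
    CERTIFIED_STAFF = {"auditor-7"}

    def ingest(clip_id, samples, user, timestamp):
        METADATA[clip_id] = {"user": user, "timestamp": timestamp}
        return segment(filter_voice(samples))   # all that labellers receive

    def get_metadata(clip_id, requester):
        if requester not in CERTIFIED_STAFF:
            raise PermissionError("requires certification and a need to know")
        return METADATA[clip_id]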

In trust-based systems like HIPAA or clearances, there is a fundamental aspect of requiring 2 conditions to access data: privilege, and the necessity to know. Taking data and mining it for valuable insights isn't a "need to know", it's a "need to discover something unknown". This is where the security breaks down. In a conventional HIPAA system, only your doctor needs to access your info. You don't have to worry about some other doctors accessing your information in bulk to try and conduct a study on cancer rates. They don't NEED to know your info, they just WANT to know. When you WANT to know how to accurately fingerprint people by their voice, then obfuscating it is counterproductive.

>You don't have to worry about some other doctors accessing your information in bulk to try and conduct a study on cancer rates.

This not only happens, it's my job (though I'm not a doctor). Of course, it's tightly controlled on my end. I work for the government, but health systems have their own analysts. As part of my job, I have access to sensitive and identifying information.

This isn't to be contrarian. There are existing systems using very personal data in bulk for analysis. The wheel doesn't need to be reinvented.


Is it feasibility, or just laziness?

My car has a little blurb that explains that they collect data to use for training and gives me the choice to participate or not. Opting out doesn’t affect any functionality. Why can’t Google do the same thing?


That should never be an opt-out. It is both ethically and in some regions legally required to be opt-in.

Or just an opt, where you have to make a choice during setup.

Because Google's first allegiance is to the shareholders and data has value so it's not in their best interest to make it easy not to share your data.

The shareholder value theory is rubbish, because it has no predictive or descriptive powers for why one decision was made over another.

I can just as easily say that the best way to maximize shareholder value is to minimize public scandal, scrutiny, and the potential for legislation.

Nearly every single decision, including contradictory ones, made by every single company, everywhere, can be retroactively justified to have been done in the name of shareholder value.


> I can just as easily say that the best way to maximize shareholder value is to minimize public scandal, scrutiny, and the potential for legislation.

Scandals can get you free marketing; for example, Nike and Colin Kaepernick. For a business, attention is always better than no attention at all. Every single decision is made to increase profit, but there might be many things that need to be accomplished first, so it's hard to see the big picture. For example, a developer might want to improve a feature because they want more people to use their product. A manager gets approval to pay that developer because the investment is deemed a profitable one. Why would the person who gave them that money care about the number of users? It's not their invention and they don't even use the service. They give the money because they know that more users = more market share = more ads to sell = a return greater than the initial investment. Until a business can run with people working for free, the person paying for things always dictates what is bought and thus the direction the company is headed.

Let's say that direction is contrary to the direction another prominent member of the business wants it to go. Whether you want to believe it or not, the same calculus goes on in every person's mind: is the potential payoff of Option A greater than the potential loss of Option B, given the risk?


This is a wonderfully condescending response but it answers nothing. The question was, why can’t google do it differently? This doesn’t answer the question. We can plainly see this from the fact that other companies, operating under the same conditions you describe, make different choices.

This is the business equivalent of saying “because physics.” It’s not wrong, it’s just not useful.


Sorry, I didn't mean to be condescending. To answer your question, the reason Google can't do things differently is that they have already established themselves first and foremost as an advertisement company, and the way to do that best is to know their audience very intimately. Other businesses like Apple have established themselves as a hardware company first, so they aren't as dependent on user data; they took advantage of that and established themselves as the "secure" phone. Google is too large, and it makes too much money from its core business, which is ad-driven. As long as search and ads are their cash cow they cannot change in the way you hope.

Right! All the companies doing it differently are also trying to satisfy their shareholders.

That's what is so great about capitalism. If one company starts to take advantage of its users for profit, it opens up a niche for another company to take a different approach.

No it’s not.

Google has many primary concerns it needs to manage. That’s how you get big - by managing lots of concerns successfully.

If they drop one too long, they start going backwards very quickly.


Then explain why they changed to Alphabet. Shareholders were sick of things like Project Loon siphoning cash from Google search. You are extremely naive if you think there are many concerns of higher importance than profit. Everything else is about maintaining and growing profit, even if that means doing an ad campaign convincing people you are fighting the good fight... for profit.

> Then explain why they changed to Alphabet. Shareholders were sick of things like Project Loon siphoning cash from Google search.

Alphabet is still spending billions from Google into "other bets" like Loon, so I don't see how this explains the change.


Because now they have to report to their shareholders where the money is going, so that if the board doesn't like it they can replace the CEO. Before, since it was all Google, the money went where they said it went; there was no oversight. They had this massive R&D budget that was opaque to the investors. Money that could have been paid to shareholders as a dividend or return was instead spent on projects they had no idea about.

>To me, category 3 is the most dangerous. Tech workers have a responsibility not just understand the technologies that they work with, but also educate themselves on the societal implications of those technologies.

Do you think it's possible to be educated on the societal implications of these technologies and still not be concerned? Seems like you've written your own viewpoint into the only "logical" one here.


That's a fair point. I do think it is possible to be educated on the societal implications of these technologies and still not be concerned. I would disagree with that opinion, but it is certainly valid.

Maybe these philosophical debates are going on behind closed doors. If so, this should be communicated to the public/end consumers. Much like in legal proceedings, the process itself is just as important, if not more so, than the outcome.

That being said, based on conversations that I've had with people working on these very products, the interest level and incentive structure for engaging with the tech side of things far exceeds that of engaging with the broader societal implications. Creating the tech earns your salary, questioning its morality may get you fired. So many choose simply to not engage in the philosophical discussion, which to me is a big problem in the industry.


>To me, category 3 is the most dangerous. Tech workers have a responsibility not just understand the technologies that they work with, but also educate themselves on the societal implications of those technologies. And as others have pointed out, this extends beyond home speakers to any voice-enabled device in general.

Yes, I'm frequently amazed how many coworkers I have that are still completely plugged into Google, Facebook, and Amazon services/spyware, fill their homes with internet-enabled "smart devices", have Alexa/Google Assistant, etc., and yet they act like I'm paranoid when I try to discuss security concerns, or just flat out don't care.

As much as I hate to say it, I think there needs to be a massive breach or abuse of power from one of these organizations/services that has severe real world consequences for those that utilize/support them. Until then nothing will change.


It sounds like you're hoping for this as evidence that you were right to be concerned; have you considered that you might be wrong? What if your coworkers are right, and the risk is actually extremely low? How would you determine that?

> It sounds like you're hoping for this as evidence that you were right to be concerned; have you considered that you might be wrong?

An alternate interpretation is that if a massive breach of trust is intentional then it would be better if it happened sooner rather than later.

> What if your coworkers are right...? How would you determine that?

That is a pretty classic appeal to popularity; what most co-workers believe is not evidence of anything in this case. If they are right, they are right. If they are wrong, they might be a self-selected group of the people who don't see a risk for what it is.

At any rate 'the risk' is a bit vague, but moral panics and witch hunts are things that happen. When the tech companies get involved in one, which will happen, it could be very nasty. There is clearly some sort of new risk here making it easier to quickly and accurately identify minorities. Even ignoring rogue employees finding creative ways to use data to enrich themselves.


>It sounds like you're hoping for this as evidence that you were right to be concerned

Not at all. If the constant security breaches and the lack of response from consumers, regulators, companies, etc. aren't enough evidence for you that there is a serious problem, then you fall into the group I'm describing.


What constant security breaches? There are two major examples of security breaches I can think of that happened recently at amagoofaceoft: mis-stored passwords (fb/insta, and google for a short period earlier this year), and variants of the Cambridge Analytica attacks, which steal public information but at scale. While those certainly aren't good, I wouldn't classify either as a security breach. The first was a loss of defense in depth, and the second, like I said, just got public info, but lots of it.

Are you saying that breaches at other companies mean we shouldn't trust the big ones to be secure? Like because Equifax has terrible security practice, google by definition must also have bad security? Or...what?

(I work at Google).


Breaches was probably the wrong term to use in my example, as it brings attention to the wrong issue. The point I am trying to make is not that we shouldn't trust large companies to be secure as you say (although based on my experience with enterprise infosec I wouldn't be surprised if a majority of companies handling/storing personal data don't have appropriate security controls enforced).

The point I am making is that many large organizations such as Google and Facebook are performing worldwide, largely unchecked mass surveillance with data collection and analytical capabilities far beyond what is available to the majority of the world, and people simply don't care despite how knowledgeable they are about technology. There's also little to no way to escape it, as Google and Facebook technology is so ingrained in the existing internet. While Google may not have poor security practices and may never experience a breach where data is stolen (although again I highly doubt that), as far as I'm concerned Google itself, as well as Facebook, are malicious actors in my own life and personal opsec, as a huge portion of their business model is based on collecting and monetizing user data by any means possible, with little to no concern for the negative impact on users such as mental health problems.

Frankly, I don't like companies that make money spying on people, particularly those that abuse psychological techniques that make it more difficult for people to make informed decisions or choices about the technology they're using.

Beyond that, these technologies are sold/rented or otherwise provided to governments, law enforcement and intelligence agencies, dictators, authoritarian regimes, and others that can and are using them for personal gain.

So no, I don't believe we should trust Google, but not because other companies have experienced data breaches. That is just one of the many reasons I believe people should value their data and personal privacy far more than most do


I find it funny that I often get defensive questions from Google and Facebook engineers about their technologies/organizations when I post initial pro-privacy comments on HN, but after being called out and explaining in more detail I never get a response. I guess there's no point for them to argue it further as they're aware of the negative impact, but have made a conscious decision to choose money over morals

You said they have constant breaches, then immediately recanted when asked for details. Your argument, which got no reply, was an ideological argument which appears to be constructed to shut down debate (they shouldn't be trusted because you don't like them) as opposed to leading to a meaningful discussion. You even threw out the casual line about them not having poor security and breaches, invalidating your argument in the post they replied to.

In other words, they seemingly care about whether the technical argument has merits. Once it's clear that there's no technical substance and it moves on to your personal crusade against modern companies, people lose interest.

Disclaimer: I don't work anywhere near the companies in question


>You said they have constant breaches, then immediately recanted when asked for details

No, I didn't, go back and read again

>Your argument which got no reply was an idealogical argument which appears to be constructed to shut down debate

It wasn't meant to shut down debate. If he wants to argue the ethics of spying on people and using psychological tactics for financial gain I'd be more than happy to discuss

>You even threw out the casual line about them not having poor security and breaches, invalidating your argument in the post they replied to

Again, no I didn't. I never said google had poor security or breaches, and I clearly stated that was just a generic example I used which brings attention to the wrong things, as demonstrated by you focusing on "breaches" rather than the point I was really trying to make and elucidated in my reply.

>In other words, they seemingly care about whether the technical argument has merits. Once it's clear that there's no technical substance and it moves on to your personal crusade against modern companies, people lose interest.

That's the entire point, and why I regretted saying "breaches". You are focusing 100% on the wrong thing. The problem that I have is not a technical argument about whether or not breaches could occur


As the original person, this is almost exactly it. I can totally understand why someone might hold those opinions. I don't share them, and argument won't be productive. Litigating values doesn't get anywhere.

> Are you saying that breaches at other companies mean we shouldn't trust the big ones to be secure? Like because Equifax has terrible security practice, google by definition must also have bad security? Or...what?

Are you asserting that Google detects 100% of significant breaches, and promptly notifies the public of all of them?

My experience tells me that neither assertion is likely to be true.


"... I think there needs to be a massive breach or abuse of power from one of these organizations/services that has severe real world consequences for those that utilize/support them..."

One of my greater fears is the knowledge that, should this happen, there's a nonzero chance that no one would care.


I suppose it hasn't had mass real world consequences for folks yet, but the Equifax breach pretty much proves this?

Yes, the Equifax breach is one of the reasons I included "severe" as a qualifier. It has been demonstrated that even fairly serious breaches will be ignored by the general public. It needs to be something that makes people genuinely fear for the safety of their finances, possessions, and/or health

> there needs to be a massive breach or abuse of power from one of these organizations/services that has severe real world consequences for those that utilize/support them. Until then nothing will change.

I think even more darkly: the consequences of something severe enough to cause that change would be so great as to effectively destroy modern civilization. Consider that much of the modern economy is driven by tech companies not only ignoring privacy but often actively violating it.


> Tech workers have a responsibility not just understand the technologies that they work with, but also educate themselves on the societal implications of those technologies.

I think this goes well beyond tech workers. I think it's time for society to legally recognize the balance between the value of ML systems and the privacy concerns of customers of ML.

Doctors and lawyers obviously should understand the value of privacy, but we, as a society, have also created legal rights and duties for them. Conversations with lawyers and doctors are legally privileged; at the same time, there are specific consequences for medical companies or lawyers who do not protect that information.

Companies like Google, Apple, Amazon, etc. certainly have the resources, intelligence, and sophistication to comply with a similar regulatory regime. IMO it should be possible to construct a law that allows companies to collect, store, and tag customer data for purposes of training ML systems, but sets serious duties, with consequences, on them to do it right.

Right now, what is to keep employees at these companies from abusing these systems to stalk, to surveil, to harass, or even just to feed their own curiosity? These data systems are core trade secrets for these companies, which means they are opaque to any kind of oversight from outside the company.

The free market can't create the necessary balance because customers need information to make decisions--information that they don't have. The result will be an increasingly chaotic "hero/shithead rollercoaster" as customers make snap judgments based on scanty or wrong information about what these companies are actually doing.

This is a classic case for regulation, which prevents a "race to the bottom" of sketchy practices for short term gain, while also protecting the ability of people and companies to use this technology to create value.

Doing this right will help data-leveraging companies in the long run, just like attorney-client privilege and HIPAA have helped lawyers and doctors build trust (and therefore value) in their customer relationships.


I like that matrix!

One thing that I think gets lost in engineers (and humans) is scale.

Googazon doing {thing} might be "meh" for 10 people. But the implications look very different when it's doing {thing} for 10%+ of a country's population.

At 10 people, I may find out Ted likes to eat Italian. At 10%, I may find out an Italian chain has a sudden health issue and short their stock.

Which is in essence their original playbook: do things that only work at a scale that only we can play at.


> At 10%, I may find out an Italian chain has a sudden health issue and short their stock.

Let's use scale at both ends please.

At 10% of the population a nation state may ask Googazon to silently make changes to identify troublemakers.


Here’s an entirely plausible scenario:

The person in the story said they thought they heard domestic violence in some of the recordings.

I know some people who are into “consensual nonconsent“, a form of BDSM which I do not understand, but as these acquaintances tell me they like being on the receiving end I have reason to trust them.

Any system or person which incorrectly identifies one of these groups as the other, in either direction, has life-altering negative consequences. Note that in the UK, some forms of consensual BDSM have been prosecuted as serious assault, and the person on the receiving end has been prosecuted for conspiracy to commit assault because they consented to it.

Any system which prevents DV information reaching the police is bad. Any system which reports BDSM to the police as DV is bad. I don’t even know what the relative frequencies of the two acts are, so I cannot even make a utilitarian ethical judgement.


> Note that in the UK, some forms of consensual BDSM have been prosecuted as serious assault, and the person on the receiving end has been prosecuted for conspiracy to commit assault because they consented to it.

Can someone versed in UK law clarify how such a stupid system emerged out of their laws? I'd guess that's an unintended consequence of how laws were worded, but it is mind boggling.


I'm not versed in UK law, but live in the UK and am aware of a few cases. Bottom line - British law does not recognize the possibility of consenting to actual bodily harm.

I'm not aware of any recent case where the person on the receiving end was prosecuted for consenting, perhaps the OC can cast some light on that.


My acquaintances mostly reference Operation Spanner which is about 30 years old now. I don’t know if that counts as recent or not, however while the Law Commission recommended in 1994(!) that this law be altered, it appears their recommendation was never adopted.

Anyone remember the 3D-printed gun stuff from a few years back? I think this isn't very different from it. You can take these raw pieces and explain how they are simple and good and draw simple ethical conclusions from them, but then you add it up and the bigger picture doesn't feel quite the same way. 3D printers are good, sharing 3D printing plans is good, it's good to help your neighbor, no regulations and we're experiencing tremendous growth in the 3D printing space, people are inventing new stuff, starting new businesses, etc. All good stuff. But letting any jackass off the street print a working gun when we have how many mass shootings a year? People don't feel the same way. All the pieces are totally okay until you've got a more questionable global intention, and how can you regulate intention?

Google using the data to train models is just a tool; it's a baby step. They aren't doing it to sell the models in and of themselves; they're doing it so that they can generate data from your voice that they might consider theirs and not yours, and then feed that into other systems which generate tremendous profits for them in ways you don't even know. They have intended uses already. Is it remotely fair to talk about ethical training in this context without some idea as to the intended use and distribution of the metadata?


5) Understand ML; concerned - "Why do other people in the ML industry think it's OK to use and store people's data without informed consent? (Only those in group 3 can give informed consent; groups 1+2 don't have it.)"

Answer to that question: because people are greedy and can't be expected to do the ethical thing. That's why we need government regulation.

That's group 4. How is worrying about informed consent not just a subset of concern about the ethics of collecting training data?

No, group 4 is "we must collect training data, how do we get around the ethical questions"

Group 5 is "We must be ethical, can we still collect training data"

Different priorities, different outcomes.


I don't see the distinction between "if we wish to do a thing, we must do it ethically when doing so" (4) and "we must act ethically if we do a thing" (5).

If training data can't be collected ethically, do you think group 4 still would, or something? That just seems like trying to put yourself on an ethical high horse without a real distinction in action.


The distinction (from how I'm reading the GP's comment) is that Group 4 presupposes that data collection is necessary, and seeks to minimize unethical means of collecting that data, while Group 5 presupposes that ethics are paramount, and seeks to establish whether or not data collection can actually be ethical at all.

That is: Group 4 would be more willing to compromise ethics if absolutely necessary to get the data said group needs, while Group 5 would be more willing to compromise data collection if there's no ethical way to collect that data.


I agree. The way the categories are worded essentially excludes the possibility that maybe we shouldn't be training these models at all. We have banned potentially insightful experiments in both medicine and psychology because they are unethical. I see no reason ML should get a pass.

(4) isn't "if we wish to do a thing, we must do it ethically", it's "we're going to do the thing, how do we make it look ethical."

The Mycroft project has a better approach to this:

"Mycroft uses opt-in privacy. This means we will only record what you say to Mycroft with your explicit permission. Don’t want us to record your voice? No problem! If you’d like us to help Mycroft become more accurate, you can opt in to have your voice anonymously recorded."

(project is open source, at https://mycroft.ai/)

Let people participate in R&D if they want to, but don't force it.


Also there's a huge difference between short anonymized voice clips and taking a transcript of your entire house's audio, as a complete dataset, with your name and address on it.

I'm perhaps in a subcategory of (3) that falls under "Understand ML; concerned".

Knowing what I know about how people I have worked with have come close to or have actually mishandled data despite the best of intentions, I do not trust any of these teams without an explicit accountability mechanism that is observable by an outside entity. I'm not looking to punish slip-ups, because mistakes happen, but I am looking for external enforcement to keep people honest.

It's not that I think the engineers using this data are mustache twirling villains, it's that I think mishandling is inevitable due to inattention (yes, even you make mistakes!), and we have to design our data pipelines against that.


Exactly. Having worked in teams which handle personal data of consumers I know how easy it would be to misuse the privilege.

The legal and marketing teams that come up with the jargon and slogans about privacy are so far removed from the day-to-day operations that they have no clue about the reality. I don't think they would care even if they did.


There’s a different dimension that may or may not understand ML, but are cognizant that any data created will be viewed at least by the company that creates it.

I fall into that category, as I have neither the time nor the trust in any evaluation methods to determine if a company is using my data ethically. If I create data and store it somewhere that's not mine, then I only do that in situations where I'm comfortable with the owner doing anything they want with it.

I understand ML and know that Google has to at least use it for training. I've also worked in IT long enough to know that even in super tightly controlled environments, data are misused by administrators.


> In conversations about this with engineers the response I've gotten is essentially: "Just trust that we [Google/Amazon/etc.] handle the data correctly."

No one is afraid of power when it's in their own hands. A common failure mode is that people assume a given power that's in their hands today will always be.


I'm in both 3 and 4.

4 because not being explicit about the practice is misleading at best, outsourcing the difficult task of keeping the analysis private shows how unimportant it's considered, and because big tech companies have a tendency to decrease privacy over time. Using clients who paid for the product as a dataset generator is also wrong.

But 3 at the same time, because, well, it's important to evaluate the performance of the product in the field, not just in the lab. There were so many cases of catastrophic failures for ML models (e.g. classifying black people as gorillas) that having a tight feedback loop is important.

It has to be done right, but evaluating a product that was primarily developed for (or at least by) English speakers and transferred to other domains seems like the right thing to do.

All in all, I don't and wouldn't use one of those assistants because 4 outweighs 3, but it's not binary.


>Tech workers have a responsibility not just (to) understand the technologies that they work with

Ok, I agree completely with you, 100%. However, based on my limited worldview, tech workers barely understand the tech they work with at all [0]. Asking for the ethical implications to be mulled over is unlikely to happen considering the near-weekly HN threads on "interviewing sucks, heres how to fix it, lol". We can't even figure out how to hire someone let alone how to impedance-match with them on deep issues like ethical implications of ML/AI.

[0] https://stackoverflow.com/


Your source is stack overflow?! XD that really tickled me. It's a great point.

Get real, obvious, informed consent by asking if you would like your voice prompts to be improved on / heard by real live humans as an opt in. I bet 1/500 of the population would opt in to it.

And the first one to do it should be apple itself.


Assuming categories 1 and 3 are sufficiently large (and I assume that is the case), this is easily resolved by allowing users to choose whether to donate their data for training or not.

If the training already only happens on a 1/500 sample, skewing the sample towards "people who don't care about their privacy" will probably not significantly impact the quality of the data.

I'm surprised this wasn't already the case, but hopefully the article will help the people responsible make better decisions in the trade-off between minimizing onboarding friction and respecting user's privacy in the future.
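
A minimal sketch of what that opt-in sampling could look like (the roughly 1-in-500 review rate echoes the sample rate mentioned above; the record fields are invented for illustration): only clips from users who explicitly chose to donate their data are ever eligible for human review.

    import random

    REVIEW_RATE = 1 / 500

    def eligible_for_review(clip):
        # Users who did not opt in are never sampled, regardless of rate.
        return clip["user_opted_in"] and random.random() < REVIEW_RATE

    clips = [{"id": i, "user_opted_in": i % 3 == 0} for i in range(100_000)]
    review_queue = [c for c in clips if eligible_for_review(c)]
    print(len(review_queue), "clips queued for human review")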


> societal implications of those technologies

Asserting your point of view as "educated" and "correct" while labeling people who don't share it as dangerous. Doesn't sound like a great way to start a discussion.


I'm between 3 and 4: I just want proof that they remove PII from the audio files. If it's a bunch of audio files with unique IDs and metadata like time of day, count me as a member of group 3.

Even if I trust them to do what they say they're doing with the data I may not trust every party who comes to possess that data. And I may not trust their possession/use of it in all future contexts - as their privacy policy slowly drifts into the unknown year after year.

If they're collecting it in a way that can be requested by governments (for instance) or could be leaked by hackers that's another layer of valid "concern" not related to my understanding of the ML aspect of this.


The meta-issue in the United States is that once your data is accessible to a third party, you have no sovereignty over it, and abuse by private actors is "agreed to" by click-wrap and access by government actors is a simple subpoena.

The law needs to catch up. Sharing should require specific informed consent and legislation needs to establish a scope where data stored as a "tenant" on a third party server is given 4th amendment protection.


Essentially, a larger grid, involving

agent( tech, management ) # assuming management has power over tech worker

understanding-of-ML( yes, no )

concerned-about-ethics-and-privacy( yes, no )

The below combinations are worst in terms of ethics.

{ agent[tech], understanding-of-ML[yes], concerned-about-ethics-and-privacy[no] }

{ agent[management], understanding-of-ML[no], concerned-about-ethics-and-privacy[no] }

{ agent[management], understanding-of-ML[yes], concerned-about-ethics-and-privacy[no] }


I agree, but I think this issue is incredibly mishandled by reporting. The title in the linked article being a great example.

There is absolutely no proof of number 2 in your list, but that is by far the widest-held belief.

It's infuriating, because we can't have a useful societal dialog about the issue if the largest chunk of concerned people are, essentially, conspiracy theorists.


I think there's a third dimension: how much you've given up.

So my thoughts are like #4, but with the undertone of "who am I kidding, of course they don't care, what can I do?".

This is perhaps the most dangerous response.


Category 3 checking in here.

TBH it's not the Googles I'm worried about. They're under a fair amount of scrutiny, and ultimately I think it's mostly Chicken Little screaming regarding how they'll use this data. The ones I'd be more concerned about are the ones not on your radar. Google knows security and anonymity, they have shareholders, governments know to regulate them, etc.

What you really have to worry about are the companies getting data on you without your permission, where you have zero knowledge they are doing so and almost no way to find out.

The reality is, most people are almost completely ignorant of how much real-world data on them is out in the wild with no Google or Facebook or any tech giant involved. I can't get into too much detail on it, but the data is there. You just have to be savvy enough to find it, process it, predict, and refine.

In terms of morality? I don't know.


The reason companies use lots of live data is because people get upset when facial recognition doesn't work on people with different skin colors, voice recognition doesn't work for people with accents, etc. Doing #3 is literally the result of #4.

Keep in mind that there's no rule that says there must be a correct answer. If the only way to do something is approach A or approach B, and people don't like either approach, then other options include figuring out a new approach C or not doing the thing.

This is in no way 'the reason'. THE reason is that using live data without having to bother with consent is fast, easy and cheap. Having to get consent is seen as a nuisance, and mum's the word because we don't want to wake sleeping dogs. And we don't want to pay for generating training data, so yeah, we basically just steal the data from our uninformed users without their consent. I hope GDPR regulators will pursue this as a landmark case.

I don't think this is just an ML issue. Any device with a mic and connected to the internet could at any time be updated with new code that sends audio back, beyond what is needed for ML.

I think that issue is much bigger than just training ML on real-world data.

Laws might give you some recourse, if you can prove whether this is happening or not. But really we need to start thinking about ways we can ensure the entire stack is under the user's control.

I have said this in the past. The free software and open source movements have worked for decades to open up software and hardware for greater transparency and user control. Yet within a few years we have wiped that out with all the closed hardware and software we happily hold in our hands and place in our homes.

We need more open source hardware and software projects offering alternatives to these systems that have become a daily part of our lives.


I'd consider myself something like "4b": Understand ML; concerned - "DO NOT WANT!"

The more I see of this tech, the more disturbed I am about its existence.

In some ways, dystopia and utopia really are the same thing.


I fall into (3) but not for the reason you mentioned. "Privacy" is largely an industrial age phenomenon and practically speaking we can never go back. I think we need new constructs to talk productively about issues like this and the research in this area hasn't even really started to take shape yet. We are only beginning to understand the implications of social graphs, information security, data as a product - let alone critically analyzing their intersection and having informed discussions thereabout.

Privacy is not an Industrial Age thing... I could be private in Rome, just as I am now, just a random dude working... A small cog, no one gets to know me, no one cares to gossip about me... Be it now, be it in Renaissance Venice, be it in Rome just before the senators killed Caesar, be it in Luoyang as tensions rose between the Warring States...

At any point in time, I could have a private conversation, write and hide a document, like some stuff that was socially unacceptable and keep it a secret, meet someone in secret, and, in general, lead a private life... Stop treating urban anonymity as if it were something modern... Of course it is much more prevalent now, as the ratio vs rural residents has shifted, and sure, it is much harder to hide something in a small village of a few dozen people, or on a farm with a huge family, but in any major city of any era, you could be as hidden and private as today, if not more...


Privacy is not new, and examples go back as far as you look. Curtains, doors, walls, envelopes, seals for envelopes. Some types and bits of clothing. Privacy is built into an awful lot of what people have done, when you look for it.

Good breakdown. But I don't understand why any company thinks it is ethical to ever listen to/record/send home data from -inside- a home. There are a plethora of public or semi-public places they could use to obtain real conversation (think of a subway, restaurants, the DMV, etc.). Set up a booth and let people talk to it. Using real speech from within someone's private quarters is disturbing IMO.

I dunno if I agree with your assessment. If you buy a device that is purposely made to record your voice and send it to a company, I wouldn't find it unreasonable that the company could listen to what I sent them. It's not like they are being sneaky about when it is recording or where the recordings are going.

What we need is an AI to label the training data for ML to cut out the need for human workers to listen to people scream "Alexa" during sex. Or new laws to stop people from naming their kids Alexa or variations of it such as Alexander.

I am sure we can whip up some such AI easily with some quantum computing. Preferably in a blockchain so we can verify correct operation and scale better.

ducks


Cloud. Data lake. Immutable.

It's such a relief to hear your conclusions, after so thorough an analysis.

I understand ML and I'm concerned, but my position isn't "how can we train models/collect data in an ethical way?" It's more like "Do we really need to train models like this?"

I continually come back to the same thought: The Amish (to take a random example) consume probably thousands of times less than I do. They do without the conveniences of technology which I "enjoy". They have neither computers nor an endless stream of distracting media and social interactions.

Yet they are probably, on average, as happy as or happier than other Americans.

These technologies don't really exist to serve people - they exist to create profit, and they do that by meeting superficial needs.

There is great potential for machine learning to help the human race. But the vast majority of energy dedicated to the question by the information mega-monopolies isn't invested in that direction, except under the most facile sorts of hypercapitalist justifications.

We should, in short, regulate data collection and analysis into oblivion, with exceptions made only for democratically determined use cases which serve the public good.


I'd like to offer a counterpoint to romanticizing the Amish way of life.

Love it or hate it, our current world would not have developed the way it has if everyone was following the Amish way of life. They enjoy a massive number of benefits that were born out of alternative-to-them life styles. For example we may have "an endless stream of distracting media" but we also have a hugely increased life span and reduced mortality rate due to advances in modern science.

To me, this is a bit like the "self-made" billionaire ignoring all the societal infrastructure afforded to them.


You're not wrong, but also, I doubt it makes much of a difference in absolute happiness levels.

You're getting too lost in the weeds here.

It doesn't matter if a company takes your data, does a poor job of anonymizing it, and then decides to label it as "training" for their "AI", or if they just stick it all in a flat .txt file and process it in Fortran.

It's the exact same thing. You should be mad about the data being saved and used. Splitting hairs over implementation details to try and find a loophole is just a waste of time.


You could also compare this to hospitals sending voice files to India to be transcribed. This is not automated at all. It's not clear that hospitals are any better at getting informed consent for this than Google.

https://en.m.wikipedia.org/wiki/Medical_transcription


1) A person going into a hospital, having their voice recorded, and then having the recording sent to another hospital where it might help treat and/or save their lives

vs

2) A company exploiting the lack of regulation and public knowledge/education on the dangers of mining personal data, to mine personal data and make a profit with no regard for the safety of the individual

If explicit consent had to be obtained, with the requirement that the person consenting be fully informed on the details of what they're giving up, in which of the scenarios above do you think people would be more likely to refuse consent?


You raise safety concerns. What risks do you have in mind?

Identity theft, regular theft, harassment, stalking, sexual assault, discrimination, reputational harm, etc.

Example scenario:

I tell a friend that I voted for Trump, my Google home hears it, a Google employee eavesdrops, leaks on twitter that I voted for Trump along with my home address, the likely times I'll be in my home, and even the pin to disable my alarm, etc. Then a group of left-wing extremists uses that information to harass/rob/murder me.

Alternate scenario:

Google employee uses their access to find an attractive woman with a Google home, steal nudes, spy on conversations, etc. That escalates into stalking, and eventually sexual assault and/or murder.

Both of those scenarios are possible today, and we're just supposed to "trust" Google is being responsible because they say so.


Whether these threats are realistic depends on how good Google's internal controls are. It's likely that there are Internet companies where internal controls are very weak (random Internet of things companies) and others where they are stronger. Stalking cases have happened, so you can say it's "possible," but to assess risk we need to do better than making a binary distinction between possible versus impossible.

In the case of the contractor described in this article, it sounds like they are pretty well isolated, so I don't see these scenarios happening: On the one hand, the audio snippets are more personal, being recorded in the home. On the other hand, having any idea who they're listening to will be rare, the snippets are short, and they are unlikely to hear the same person twice. I don't see them getting enough data to do damage.

You might compare with a store employee or waitress hearing a bit of conversation, or someone eavesdropping on your conversation or screen on a bus or plane. While people should be on guard, often they're not, and an eavesdropper can find out a lot more of any one person's data.

Other Google employees might have different access (for example tech support), but they'd be foolish to basically give employees remote root on Google Home devices, and I don't think Google security is that foolish.


I don't get your point here. You start off by questioning if the threats are realistic, then questioning if they're even possible, then you end by saying it's not that bad because waitresses can overhear your conversations too.

1) Those threats are 100% possible and realistic. If you think they're not just because the guy in this article is a contractor, then you're being incredibly naive and shortsighted.

2) Google employees have complete access to this data, and to think that they don't means you've decided to trust their word. Maybe you like Google, and that's fine, but it's not smart to trust them on this whether you're a fan or not. If their internal security policies for this type of data are terrible, they're never going to admit it and will definitely lie about it.

3) What people say in a restaurant and what they say in the privacy of their own homes are completely different. Can't believe I have to explain that.

> but they'd be foolish to basically give employees remote root on Google Home devices, and I don't think Google security is that foolish.

Why would you need remote root access when Google Home already uploads conversations to Google servers by default? That's the only part that matters.


Why do you think "Google employees have full access to this data?"

It seems strange that they would have permission, unless there were some reason it was necessary for the job.

This is sort of like assuming telephone company employees can listen to whatever conversations they want. Wiretaps exist, but it's not like just anyone gets to use them.


Well, this just happened: https://arstechnica.com/information-technology/2019/07/googl...

> Why do you think "Google employees have full access to this data?"

Because they do. It's literally there on their servers. You're assuming that they have some really good policies to prevent employees from accessing that data. Maybe they do, I don't know. But it doesn't matter because those are just internal policies. If some employee just says "fuck it" and ignores those policies, then if they're caught they'll just be silently fired and we'll never hear about it. There's no external audit; this is all unregulated territory.

Since this is HN, I'll give you a scenario that might hit closer to home: let's say you want to apply to work at Google. You send in your perfect application/resume, but you never hear back because your recruiter peeked into your Google Home files and noticed that you once told your friend that the Dodgers suck. Since your recruiter is a Dodgers fan, they decided to just throw your resume in the trash.


1000000%

The one thing about these stories that keep coming out about the home assistants... they kind of create the impression that this is an issue specific to home speakers, and you can avoid it, by simply not buying them.

That's misleading.

Any voice command you use to operate any internet connected tech gadget, from phones to smart TV's, is potentially stored and flagged for human review.

You really have to avoid using voice commands at all, on all of your devices. Even that is probably insufficient. You probably have to go even further and actively disable voice command features on all of your devices, assuming they actually support such a setting. Otherwise there's still the possibility of an accidental recording taking a journey through the clouds to a stranger's ears.


And not to be anywhere near anyone else's listening devices.

Isn't there a law in some US states that there needs to be consent before recording someone? How does this fit in and who would be held responsible, the owner of the listening device or the company behind it?


Yeah, this was a big issue about a decade ago when police officers could sue someone for recording said officers doing something illegal in Massachusetts.

So they amended that law, then had to fast-track an updated version of it a few years later when someone got arrested for taking upskirt photos and it turned out that wasn't actually illegal under the existing statute. My cousin was an aide to a state lawmaker and had to explain to his boss what it meant at the time.


> And not to be anywhere near anyone else's listening devices.

No assistant records all the time -- not Google, Amazon, or Apple. They listen for the "Wake Word" on-device using more primitive (and lower power consumption) speech recognition, and only use the cloud after the "Wake Word" (or phrase) is spoken. You can confirm this using Wireshark.

You can view and listen to your recorded speech on the Google Account Dashboard.
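
If you'd rather script that check than click through Wireshark, here's a rough sketch with Python/scapy (the device IP is one you'd look up on your own router; needs root and libpcap). The point is just that sustained outbound traffic from the speaker should only appear after the wake word:

    from datetime import datetime
    from scapy.all import sniff  # pip install scapy

    DEVICE_IP = "192.168.1.50"  # replace with your speaker's LAN address

    def log_packet(pkt):
        # Timestamp and size of every packet the speaker sends out.
        print(datetime.fromtimestamp(float(pkt.time)), len(pkt), "bytes")

    sniff(filter=f"src host {DEVICE_IP}", prn=log_packet, store=False)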


No, they don't, however, accidental activation is still highly likely.

I've had other people's devices activate during conversations where I couldn't figure out which part of the sentence activated them, it just happens.

"Ok, Go get..." "Ok, Good ..." "All except..." "Sir, I..."

Then you have television shows and adverts that intentionally use language to activate these assistants.

Usually the assistants are going to say something when it accidentally detects a wake word, but if you're in another room or don't hear it for some reason, it can easily capture a conversation without you knowing.


I use the phrase “are you serious?!” a lot when I am frustrated. Almost always it wakes up Siri on at least one of my iDevices. Which makes me say “are you serious?!” again and it just spirals from there.

"Isn't there a law in some US states that there needs to be consent before recording someone?"

I doubt it - there would have been a test case involving one of the millions of people who have mobile phones capable of recording video/audio. If you're in a public place you should have no expectation of (that sort of) privacy.


These laws vary state-by-state.

I'm kind of surprised that Massachusetts's very strong laws about wiretapping permit storage of training data from Amazon Echo/Google Assistant/Apple Siri/Microsoft Cortana/Xbox, given that by their nature they naturally sometimes record incidental conversations of people who didn't intentionally trigger them.

It's a fairly mainstream view that MA wiretapping law requires written consent from every participant in a conversation for the recording of their conversation in a non-public place and does not permit implicit agreement (there isn't any kind of common-sense carveout for "you should have known you were being recorded in the background by the Echo at your neighbor's house"). See MGL chapter 272 section 99 (https://malegislature.gov/laws/generallaws/partiv/titlei/cha... , sample writeup https://www.masslive.com/news/2014/06/massachusetts_wiretap_... )

Now, MA has a bunch of laws on the books that no one actually enforces. There is a law against jaywalking which provides a $1 fine and tickets are never issued. Or for a bigger example, there is a law that requires you get a temporary permit from the Alcoholic Beverages Control Commission before importing any quantity of alcohol into the state, including for example buying beer at a NH liquor store or flying home from Europe with a bottle of wine. Last time I looked I think that law provides for a $2500 fine or 6 month jail time if violated, although it was hard to tell. ABCC will absolutely insist that this is a real requirement if you ask, and will even provide a copy of all such permits they have issued for the year under freedom-of-information rules -- I once asked and was given a copy of 46 permits issued in 2015, many to a single person who reviews wine and stubbornly files a permit for every shipment he orders from out of state apparently to protest the requirement, causing so much administrative overhead that the ABCC tried to issue him a special blanket approval to get him to go away, which he refused to accept.

To the extent that wiretapping laws are similarly not really enforced against the technology companies who make, retain, and distribute the recordings, this seems like an unknowably large regulatory risk a lot of companies are taking. Sure, the state loves its big local employers (IIRC Alexa development is in Cambridge?), and wouldn't want to lose their tax revenue, but what if the political winds change?


I think when you buy a Google Home you basically accept some terms and conditions in the app to set it up. Buried in there is probably your consent to analytics etc.

> Buried in there is probably your consent to analytics etc.

That's certainly not informed consent.


You already consented by agreeing to the terms of service. If someone else is talking to their smart device while you're talking, it's ostensibly their responsibility. There are no reasonable laws that prevent you from being overheard in the background of a recorded phone conversation.

(IANAL, but) Not accurate in the US.

Most states are either one party or two party consent states. One party = you can unilaterally record anything (not sure this includes things you're not actively involved with, e.g. spying). Two party = you must have consent of everyone in the recording.

By a plain reading of two party consent statutes, people are in violation if their home speaker records a guest without obtaining consent.

I'm sure Google and Amazon's lawyers would try to weasel out of compliance via claimed anonymization, but that's definitely not the spirit of the law.

Old, but thorough: http://www.mwl-law.com/wp-content/uploads/2013/03/LAWS-ON-RE...

You're also going to bump up against the specific wording of whether a given statute covers only telephone conversations or oral conversations generally, as most of these are phone wiretap laws that may or may not have been worded ambiguously.

Additionally, there are federal statutes that likely also bear on this.


Know that even in all-party consent states, continuing to talk after being made aware that the conversation is being recorded implies consent. This is why devices like Google Home are legal: they make a loud warning sound before they begin recording. For example, in CA the law states that:

> (a) A person who, intentionally and without the consent of all parties to a confidential communication, uses an electronic amplifying or recording device to eavesdrop upon or record the confidential communication, whether the communication is carried on among the parties in the presence of one another or by means of a telegraph, telephone, or other device, except a radio, shall be punished...

(b) For the purposes of this section, “person” means an individual, business association, partnership, corporation, limited liability company, or other legal entity, and an individual acting or purporting to act for or on behalf of any government or subdivision thereof, whether federal, state, or local, but excludes an individual known by all parties to a confidential communication to be overhearing or recording the communication.

https://leginfo.legislature.ca.gov/faces/codes_displaySectio...


> they make a loud warning sound before they begin recording

This contradicts my personal experience last week with a google-controlled music player. Music was the only response to a voice command to play music, and silence was the only response to a voice command to turn it off.


My understanding of recording law is that in One Party states you need to be part of the conversation to record it. Speculation: This would mean that the device / owner would be in violation if the owner was not in the room?

Ironically, if the owner were not in the room, I'd expect the device manufacturer would be more directly liable.

How can a homeowner be responsible for a device when they (a) don't control its operation and (b) don't control its software?

At that point, whether a device captures incidental recording seems entirely under control of the manufacturer.


Sorry for the extreme analogy, but a gun owner is responsible if their nephew accesses their gun when they're not home.

It is not farfetched to imagine that, to be in compliance with the law, you would need to unplug your listening devices to avoid them accidentally going off.


The owner of a listening device might be ignorant of audio being stored and audited by human listeners. It's farfetched to think a gun owner might be ignorant of a gun's dangers because they didn't read the terms of service in detail.

So I buy one of these things, install it, put it online, but I'm not responsible for it? I don't understand. Would you be happy with that defence from a hotel if your wife stayed in a hotel room and was killed by carbon monoxide from a faulty heater?

And of course there are children to consider:

https://www.seattletimes.com/business/amazon/suit-alleges-am...


Not in the case of the device being someone else's. I don't have any Google or Amazon smart listeners, so I never accepted them. Yet if my voice is in any of these recordings, well...

As long as you are warned that you are being recorded, the law considers you to have given consent if you decide to continue talking. This is why all the home devices make a loud sound, which you cannot disable, before they begin recording.

Except that I don't know what that sound means. I know what the one my device makes is, but I've never heard the others. Unless it's a human voice saying, "This conversation is now being recorded" I can't be expected to know what a random beep from a device means. It could just mean the person got a notification or something. (And even if it is a recognizable sentence, it assumes I understand the human language the device is set to.)

How do you square that with a security cameras? Do you need to consent every time you enter a space secured by them?

The rules regarding audio recording are different from video. This is why many security cameras do not actually record audio.

I think the laws are different primarily due to the different pace of audio vs. video recording technology. Audio recording of phone calls etc. has been feasible for a long time so laws were written about that. Ubiquitous video recording has really only become a thing in the past 2 decades or so.


Absolutely. I feel like the pace of adoption has been mostly driven by per-bit storage costs falling (and high efficiency codecs).

Above all else, people will do useful things with computers once the price to do so matches the utility. And we're far on the other side of that with cameras.

I can't wait to see what the next decade+ does to all the Facebook-esque camera startups. It's going to be hard to monetize your customer's video feeds once regulation clamps down.


I don’t believe there are laws prohibiting video surveillance in public by businesses. Some states have laws prohibiting filming in locations where one expects privacy. Other states allow filming in private spaces as long as the business notifies employees and customers they are being filmed.

Yeah, most states also have a pretty clear "if the device is obvious" rule. It's also why businesses put up "smile, you are on camera" signs, especially if their camera isn't immediately recognizable.

There are some rules concerning security cameras as well. I recently found out that private CCTV (at least in the UK) can't record public areas (e.g only your porch). Someone got sued over this recently.

There have been some stories in HN about opting out of face recognition as well. Maybe the laws for video are different as the other reply says, but there are privacy concerns in there as well.

edit: here's a list of GDPR fines (not comprehensive as I only see 2 in the UK). If you filter by CCTV you'll find a couple of examples from Austria: http://enforcementtracker.com/


> There have been some stories in HN about opting out of face recognition as well

Easiest way to do that is to wear a niqab or burka.


Face coverings are not legal in public in all countries, but even where they are, why should someone concerned about creeping surveillance go to great lengths to modify their own behaviour because someone else is unwittingly breaking the law because a product they bought was made by someone who couldn’t be bothered to do it right?

> Face coverings are not legal in public in all countries,

And that of course is the point.


Please elaborate

Video and audio can have different laws. In my state, video recording in my home or business is legal, but if I record audio, I need 2 party consent.

So this outrage is in the Netherlands. Businesses may not point security cameras at public space (which we have a lot of, unlike the US). The local government itself may place cameras, but private parties? What kind of nightmare situation would that be?

And in public or private space when there are cameras there needs to be signs everywhere to warn and inform you.

So in the Netherlands at least.. Google recording a conversation with someone who doesn't know Google is recording is definitely illegal.

The question is: will they prosecute? Then it becomes a geopolitical question because we are a small country with a disproportionate number of Google datacenters.

So to summarize:

- This is definitely illegal in the Netherlands

- There is no consent of others participating and you really do need that

- Fine print is not consent: consent via terms and conditions requires a majority of users (determined by polling or a judge's common sense) to be aware of and knowledgeable about what they consented to.

- there won't be prosecution by the Dutch public prosecutor.

- there will be a lobby for the EU to buttrape Google but it may use different reasons or context


I find this reasonable; you wouldn't sue the manufacturer of a recording device if someone made a secret recording with it. I do think this means that the smartphone owner is liable, and should be fined or jailed.

But in this particular case, it's the company doing the recordings, not the owner of the device, so it's a slightly different situation, in my opinion.

That said... the owner could be liable if, for example, it were necessary to explicitly inform others of the existence of such devices and the possibility of being recorded.


...and this is before you get to hacked/compromised devices. This is just devices working as specified.

Not to mention all the crap you type into your search bar, Gmail, the images you upload -- basically anything with any sort of machine-learning-backed enhancement. Any of that stuff could be sampled and reviewed by humans. Admittedly anonymised, but it's still potentially personal stuff, just as the voice data is.

Yes, that's true. But with another device like a phone, it is optional to turn a voice assistant on. Not so with a device whose sole purpose is being a voice assistant.

Whenever I'm in the home of a friend who's enthralled with their smart speaker gadget, I like to make red-flag-raising requests to their gadget. Asking for good sniper vantage points in Washington DC, a safe rohypnol dosage for 10-year-olds, the soonest flight to a non-extradition country, those sorts of things.

When I give a voice command, I know that my voice could be recorded. What I'm actually concerned about is devices which are always listening, even when I'm not giving a voice command, and recordings from when I'm not interacting with the device potentially falling in the wrong hands.

Any time I'm on a phone call and say "Ok, good", Google Assistant kicks in, which concerns me a lot.

I really don't understand how they don't allow setting a custom hotword.


Twilio (and their customers) listens to phone calls too no doubt.

So Google’s response is (paraphrased as fairly as I can while removing the sugar-coating):

’Yes, we hire people to listen to and transcribe some conversations from the private homes of our customers (so as to improve our speech recognition engines); but the recordings aren’t linked to personally identifiable information.’

Even assuming they have only the purest intentions here, I still don’t understand how they can possibly guarantee that these recorded conversations are not linked to personally identifiable information!

For example, what’s to stop me from saying “Hey Google, I am <full legal name / ID> and my most embarrassing and private secret is <...>”?

One might argue that they could detect this in the recognized text and omit those samples, but presumably the whole purpose of hiring people to create transcripts is because the existing speech-to-text engine isn’t perfect, and they need more training data.
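
To make that concrete: even if they ran a text-side scrubber over the transcripts, it could only catch the obvious patterns. A naive sketch (Python, patterns of my own invention) of what such a filter might look like, and why it can't amount to a guarantee:

    import re

    # Obvious red flags: long digit runs (card/phone/SSN-like) and email addresses.
    PII_PATTERNS = [
        re.compile(r"\b(?:\d[ -]?){7,}\b"),
        re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    ]

    def looks_sensitive(transcript: str) -> bool:
        return any(p.search(transcript) for p in PII_PATTERNS)

"Hey Google, I'm John Smith and my diagnosis is..." contains no digits at all, and the transcript only exists if the recognizer already got the words right -- which is exactly the failure the human review is meant to fix.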


“I rue the day I married you, Steven Robert Parker, you HIV-infected cheating scumbag! I wish I had never lied to the FBI about those classified documents you stole!”

It seems even worse than this - I'd argue your voice is personally identifiable information! The vast majority of these clips open with "Hey Google".

Meanwhile, Android allows you to personalize voice commands based on its ability to recognize that a specific person is the one saying "OK Google". Voice authentication has already reached high accuracy with a few seconds of unconstrained text, or a few words of fixed text. Voice identification on open sets takes more data, but sub-minute clips are still reasonably effective.

At the very least, Google itself could make a credible attempt to identify whether the speaker in any voice clip heard by Google Home is a regular user, and plausibly de-anonymize users of OK Google. More alarmingly, we're told that about 1 in 500 Google Home clips is heard by a human, and this employee apparently shared "thousands" of clips with a news organization. It seems plausible that anyone with access to any large voiceprint database could attempt to obtain clips from a random contractor and de-anonymize the most interesting or salacious content.
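
To illustrate how little it takes: even a crude, off-the-shelf comparison goes a long way. A rough sketch (Python with librosa; real speaker-ID systems use trained embeddings such as x-vectors rather than raw MFCC averages, so treat this as a toy):

    import numpy as np
    import librosa  # pip install librosa

    def voice_fingerprint(path: str) -> np.ndarray:
        # Average MFCCs over the clip -- a very crude stand-in for a speaker embedding.
        y, sr = librosa.load(path, sr=16000)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

    def similarity(a: np.ndarray, b: np.ndarray) -> float:
        # Cosine similarity between two fingerprints.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # similarity(voice_fingerprint("leaked_clip.wav"),
    #            voice_fingerprint("known_sample.wav"))

Anyone holding both a leaked clip and a labelled voice sample can score candidate matches; the better the model, the fewer seconds of audio it takes.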


You paraphrased it in a different way and that might be why you're confused.

Google says "the excerpts are not linked to personally identifiable information." To me that means the metadata is stripped, not that they strip anything out of the audio.


Thank you, good catch. I’ve edited my paraphrase to make it more accurate in this way.

That said, it still sounds like Google is trying to convince us that the data they capture (not just the metadata) is never linkable to personally identifiable information, which if true would genuinely ease many privacy concerns here.

As far as I know, just because data is not explicitly annotated with PII doesn’t erase the legal (and ethical) responsibilities associated with handling data that contains PII.

So even if they worded their response so that its truthfulness is legally/technically defensible, it's still a bit of a 'red herring' at least (I don't think anyone is accusing Google of explicitly associating these audio recordings with user IDs).


But in order to tell whether it contains PII, it has to be listened to by a human to transcribe it... It's like Schrödinger's audio assistant ;)

> For example, what’s to stop me from saying “Hey Google, I am <full legal name / ID>

Even more fun: if you call a bank, you often have to key in your account number (which can be easily decoded if your phone sounds back the tones, which most do), then say your name, your address, and sometimes other PII like your Social Security number or part of it. Record that call and that's a complete identity-theft package, nicely wrapped. Just replay it to the bank (whose name you've also recorded, if the user called on speaker -- which they did, because who wants to keep the phone pressed to their head the whole time while waiting and listening to the muzak) and you get full access to the user's bank account.
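
The "easily decoded" bit isn't an exaggeration, by the way: DTMF key tones are just pairs of fixed frequencies, and a few lines of the Goertzel algorithm recover them from any recording. A minimal sketch (Python, assuming a clean mono recording already split into one key press per block of samples):

    import math

    LOW  = [697, 770, 852, 941]
    HIGH = [1209, 1336, 1477, 1633]
    KEYS = [["1", "2", "3", "A"],
            ["4", "5", "6", "B"],
            ["7", "8", "9", "C"],
            ["*", "0", "#", "D"]]

    def goertzel_power(samples, sample_rate, freq):
        # Signal power at a single frequency (Goertzel algorithm).
        n = len(samples)
        k = int(0.5 + n * freq / sample_rate)
        coeff = 2 * math.cos(2 * math.pi * k / n)
        s1 = s2 = 0.0
        for x in samples:
            s1, s2 = x + coeff * s1 - s2, s1
        return s1 * s1 + s2 * s2 - coeff * s1 * s2

    def decode_key(samples, sample_rate=8000):
        # Strongest low-group frequency picks the row, strongest high-group the column.
        row = max(range(4), key=lambda i: goertzel_power(samples, sample_rate, LOW[i]))
        col = max(range(4), key=lambda i: goertzel_power(samples, sample_rate, HIGH[i]))
        return KEYS[row][col]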


I'm not sure Google devices can make calls, but if they could, the only part that would be sent to Google (which is what these people would have to analyze) is "hey Google, call bank"

From what I understand, you don't have to place the call using the Google Home device; it's enough that you call the bank while the Google Home is within earshot and something else triggers "OK Google" while you're talking.

The Google Home can make phone calls in the US and UK.

I would count a recording of my voice as "personally identifiable information" right off the bat. Voice printing is a thing, and anyone will also tell you that they recognize the voices of people they interact with regularly. If someone played an audio clip of someone I know talking to Google Assistant to me, I would recognize who it was based on their voice.

This sent me down the rabbithole of learning how identifiable voiceprints are. As you might guess, the answer is "very", although to my surprise our voices change enough that recordings lose a great deal of fidelity over a few years.

Authentication on fixed phrases is reasonably accurate within a very few words, so at minimum it should be possible to associate "Hey Google" clips with regular users of Google Assistant voice control (i.e. "OK Google"). Identifying whether someone is present in a large dataset on open phrases is much harder, but a ~30s clip could do the job fairly consistently for anyone with access to a significant amount of voice data. And if this employee (who isn't directly working for Google) shared 'thousands' of clips with a news org, the cautious bet is that some other employee might share them with anyone willing to pay for the records.


So without connecting this phrase to a person or other phrases, what information leaks? That the person exists?

In terms of GDPR:

https://gdpr-info.eu/art-4-gdpr/

> ‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;


So you’ve identified a person. What information has been revealed about that person?

It could be anything - the voice snippet could include a query about a particular medical condition or include specific financial records for example.

Anything that falls in that 'personal data' segment above that belongs to me has to be obtained with my prior, explicit consent. And these bits of data with my name or other details in them must be included if I send Google a "right to be forgotten" or "show me all the data you've got about me" request. That's GDPR in a nutshell.

This might be a grey area for now, as both GDPR and listening devices are both quite new. But Google, Amazon & co aren't super popular with EU regulators and governments, so they might side with users' rights on this one.


Without the metadata PII, there’s no evidence adversarial utterances like that are true.

It’s hard not to feel like this outrage is trumped-up anti-Google FUD. So many more worthy fronts to assail Google et al. on!

After all, they let you upload photos and video that are, per various policies and with some non-zero frequency, reviewed by humans — and users are begging them to do it more often.


This is a good point, one of the reasons to use it is for reminders "Hey Google, remind me to visit x at place y"

Even then, the voice print of someone is obviously ID-able right?

"The man, who wants to remain anonymous, works for an international company hired by Google. "

So not a Google employee at all, but a probably low-paid contractor who is in possession of thousands of audio files. Your privacy matters, except when the bottom line is involved.


What is doubly concerning here is that the contractor was in a position to demonstrate how the system worked to the reporters. That would seem to indicate they have access to that data in a non-secured environment.

I'm not familiar with EU law around these things, but I would imagine there is some kind of whistleblower mechanism available, and a right for authorities to audit/inspect such activities?


I can guarantee you that 99% of every company that has a copy of your PII has employees that can connect to work resources via VPN.

I would expect that a telecoms employee who was doing similar work on quality etc to be quite securely vetted.

If I was doing now what I did back in the day for BT / Dialcom (I had root on the UK's main ADMD), I would probably have to pass DV vetting (TS in US terms).


What about all the telecom APIs like Twilio? They have raw access to millions of phone calls every day. I doubt they have ‘secure rooms’ for debugging.

Isn't he in breach of GDPR requirements?

Sounds like he was a Turker: "For each fragment that he listens, he will receive a few cents."

Probably a company like https://scale.ai/, although I don't think they do audio.

The person is probably a temp/vendor from a consulting company (think accenture or cognizant), who should've signed the same NDA agreements as anyone working on that stuff.

But their machine monitoring, security, and usage habits almost certainly do not match the requirements Googmazon would impose on its own employees. These vendors, time and time again, end up being the weak spots in companies' and governments' handling of sensitive information.

And how are they going to enforce the NDA? Contractors are being paid peanuts in bad working conditions -- firing them might land them a better job elsewhere, and they don't have any assets worth suing for in monetary damages.

That can still bankrupt the person on the receiving end of the lawsuit. Just because damages don’t necessarily fully compensate for the loss, doesn’t mean it’s not a massive deterrent to the behaviour.

Does it matter how much they're paid? They're probably paid the right amount relative to the work they are doing.

Also how is having access to small samples of audio a privacy issue? Are they also receiving enough information to attach an identity to the audio clips? How long are the clips? Are they randomly assigned to humans? Do those humans get to listen to multiple clips from the same Home device and can they tell that's the case?


You can say a password, credit card number, or bitcoin/ethereum mnemonic in one minute without a problem, can't you?

Home, Siri, Alexa, M -- they all do. I have friends who work in this field transcribing the audio and measuring its accuracy. Sometimes it's multiple layers of contractors: an employee hands the task to a contractor, another contractor verifies the speech-to-text, and they're all managed by a contractor.

Search for languages like Portuguese, Swedish, Chinese, etc on LinkedIn and you'll find the jobs posts https://www.linkedin.com/jobs/search/?keywords=portuguese&lo...


"They all do" ... my understanding is that this expressly does not happen with HomePod conversations.

“... In some cases, teams use the audio of users’ voice requests as training data—all anonymized, Apple says.

> We leave out identifiers to avoid tying utterances to specific users, so we can do a lot of machine learning and a lot of things in the cloud without having to know that it came from [the user],” Joswiak said. In other words, Siri can learn things about users as a whole without tapping into individuals’ personal data.

> Apple holds on to six months’ worth of the user voice recordings to teach the voice recognition engine to better understand the user

> After that six months, Apple saves another copy of the recordings, sans user ID, for use in improving Siri, and these recordings can be kept for up to two years.

> The training happens on Apple’s servers, but the models only start practicing what they’ve learned when they’ve been deployed to your device.

> Once on the device, the models begin to run computations on things you type or tap into your device, or on things that are seen in the device’s camera, heard through the microphone, or sensed by the device’s sensors. Over time, this creates a massive pile of personal data on the device, as much as 200MB worth.

https://www.fastcompany.com/40443055/apple-explains-how-its-...


So I meant "they" in the sense of the companies, not necessarily the home devices. Sorry about the confusion.

I know 100% that it happens with Siri. If Apple is excluding HomePod conversations from Siri's dataset, that I don't know.


Do you have a source for human annotation of Siri recordings? Do they use subcontractors like Google?

First hand source, it was the first job several of my Brazilian friends (or their spouses) got when they relocated to the Bay Area. They use companies like Moravia or Welocalize. Take a look at some of the job posts from my link above.

As far as I am aware, Siri's audio retention policy is up to two years.

Apple still stores the audio, but they said they can't allow you to download/GDPR request your recordings (like Amazon and Google allow you to do) since they're not associated with your Apple ID whatsoever. I wouldn't be surprised if they also human-review some audio.

I'm not familiar with HomePod, but if I ever get an Alexa/Siri kind of assistant, it will be one that analyses my voice locally rather than sending it to the cloud.

Ah, who am I kidding? I just bought a new Android phone, which probably does exactly the same thing. Time to install LineageOS on it, I guess.

