More information about our processes to safeguard speech data (blog.google)
48 points by arusahni 5 days ago | 39 comments





> Audio snippets are not associated with user accounts as part of the review process.

This is a bold claim. The VRT journalists were able to re-identify a few data subjects and confirmed with them that their recordings were being listened to by Google contractors. "'This is undeniably my own voice', says one man, clearly surprised," the journalists wrote.

> The Google Assistant only sends audio to Google after your device detects that you’re interacting with the Assistant.

Well, this is not what the contractor said [1]. The contractor heard private conversations, sex scenes, and violence including "a woman who was in definite distress".

https://www.vrt.be/vrtnws/en/2019/07/10/google-employees-are...


So what I'm hearing is that I should change my safe-word from "Ok Google"?

I would be shocked if Google allowed a contract labeler access to the name of the account from which data originated, except in the circumstance where someone, as the article describes, speaks PII out loud as part of the snippet. And that is why this subcontractor is in breach: they were trusted to handle data appropriately and failed to do so. For Google to provide labelers with the account-of-origin for a snippet wouldn't be lazy; it would require a concerted, covert effort to defeat multiple layers of security-by-design in pursuit of poor scientific method. And if there is one thing I'm confident about, it's Googlers' single-minded pursuit of sound scientific method. Their business model depends on it so much that I dare say strong internal privacy controls are a side effect.

> This is a bold claim. The VRT journalists were able to re-identify a few data subjects and confirmed with them that their recordings were being listened to by Google contractors.

Well, both claims could be true. Even if the snippets are not directly associated with user accounts, there could still be identifying information in the recording itself that could be used to identify the person speaking.


I interpret "not associated" as meaning the reviewers weren't told anything else about the account the snippet came from. Not that it would be impossible to figure it out.

Wouldn't be the first time Google's lied.

> The contractor heard private conversations, sex scenes, and violence including "a woman who was in definite distress".

I think this is a bit disingenuous and intentionally sensationalist. Nobody has any reason to believe that the Assistant is always listening, or that it will trigger randomly or on demand because Google wants to listen in on particular conversations; that doesn't even really make sense. Also, various parties have access to the source code, so if it were malicious, there would be evidence of it. That there isn't any such evidence is important to mention in this context.

In reality it's just that it can randomly misclassify something as "Ok Google", and there's nothing malicious about that. The purpose of the annotators is to look at unrecognized samples, so obviously they're going to hear those cases frequently; that's sort of the whole point.

Maybe "well, duh, how do you think it knows your voice? magical pixie dust?" is a bit of a cop out, but honestly there are two options, them doing what they are doing right now, and not have Google Assistant.


My experience with my Google Home, before I unplugged it, was that it would randomly kick on even in relative silence, unless the clack of a keyboard and maybe a fan whir could be misheard as "okay Google".

I would notice because it sat on my computer desk.

To be clear, I don't think it was nefarious that it would, just that I realized I wasn't using it as much as I thought I would and decided to unplug it.


Google’s argument is that the audio they store is non-private and non-personally identifiable. So this statement doesn’t make sense:

> We just learned that one of these language reviewers has violated our data security policies by leaking confidential Dutch audio data.

Isn’t the whole point that the audio is supposed to be innocuous? “Hey google, play Coldplay” or “Hey google, play Taylor Swift.” If that’s the only audio that Google stores, then there should be no problem with leaking it to the press.

But it isn’t, which is what the translator was trying to show in the first place!


I believe this is the article that they're referring to: https://www.vrt.be/vrtnws/en/2019/07/10/google-employees-are... Basically, after listening to a bunch of audio, the reporters found some recordings where the device had turned on unprompted and captured identifiable information. Also, Google would consider any audio to be confidential even if the material in it is innocuous.

> Building products for everyone is a core part of our DNA at Google. We hold ourselves to high standards of privacy and security in product development, and hold our partners to these same standards

I guess those standards don't include clearly letting end users know that their data will be stored, shared and transcribed by 3rd parties (not that the 1st party in this case isn't bad enough).

The leak is not the problem here (and it's extremely disconcerting that they're focusing on that). Their lack of communication with consumers is the problem. This blog post makes things worse as far as I'm concerned.


I will not use Google Home unless they remove the requirement to provide both browser history and location on your Google account / cell phone. When my Google Home refused to work after I disabled those tracking features, I unplugged it.

Yeah, the home bricking itself when you turn off those settings is infuriating. The Google Assistant effectively bricks itself too.

As if they need your location and browsing history if they're gonna play that song you asked for.


Then you can find out how much your phone is tracking you without your consent!

This response is so inadequate that it is almost laughable.

They claim that their experts listen to only a small number of recordings of human speech directed at their devices. However, there is no telling what people might be saying to their Google devices. I don't use Google Home, but given that it allows one to set reminders, etc., I would think that people could be saying pretty sensitive things, including phone numbers, addresses, and credit card numbers, which combined can identify them uniquely, or at least allow those bits of information to be misused.

Also, embedded in the 0.2% figure cited as the percentage of spoken interactions that get listened to is the suggestion that this is a very small number. However, that figure means 1 out of every 500 interactions is listened to. For a family of four owning a Google Home, the number of spoken interactions with it in a year would easily run into the thousands, so Google's reviewers are listening to at least a handful of interactions every year for each family. Given the state of the art in speech-to-text, if these recordings are being converted to text and stored, it would not be hard to group the recorded interactions by family and derive some identifiable information from them.
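As a back-of-envelope check (the 0.2% review rate comes from Google's post; the household interaction counts below are purely my own guesses):

    # Rough estimate of reviewed snippets per household per year.
    # Assumptions (mine, not Google's): a 4-person household and ~10
    # Assistant interactions per person per day; 0.2% is Google's figure.
    review_rate = 0.002                    # 1 in 500 interactions
    interactions_per_year = 4 * 10 * 365   # household total
    reviewed_per_year = interactions_per_year * review_rate
    print(f"{interactions_per_year} interactions/year -> "
          f"~{reviewed_per_year:.0f} reviewed snippets/year")
    # 14600 interactions/year -> ~29 reviewed snippets/year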


I have both Home and Alexa in my house. If I give Google the full benefit of the doubt and believe that they only record when I say 'Hey/OK Google', what about Alexa? Does it send recordings to its server all the time, or only when I invoke its service?

Amazon claims to only send speech data home when you invoke it with a 'hey alexa'. Whether they're sending home just your query, or everything the echo has heard since the last invocation, they are very careful not to say.

In both cases they only send what they heard after you used the wake word.

I don't understand how this is a controversial claim that people refute when it's so easy to test with something like Wireshark. If someone is paranoid enough to spread FUD on the Internet, they should be paranoid enough to actually check with freely available tools.

If conspiracy theorists looked at facts, they wouldn't be conspiracy theorists. The common denominator of people who claim that these devices are always sending voice data is that they have zero interest in learning how they actually work, even though the information is widely and easily available.

If by 'conspiracy theorist' you mean the literal definition, then you're talking about anyone who theorizes about the existence of a conspiracy. There's nothing wrong with that.

If by 'conspiracy theorist' you mean someone who suspects the government or corporations of foul play, then don't forget about 'government surveillance', one of the biggest conspiracy theories of the last several decades, until it turned out to be true. (Did you dismiss those people as well, before Snowden put hard evidence behind the reasonable speculation?)

Either way, you shouldn't talk down to people who find minimal, available evidence worthy of speculation. It's not conducive to reasonable discussion.


When people only learn things like this well after they bought and started using these devices... Google should have made this very clear before people bought it.

The common retort is "they record everything and then upload it all when you use the wake word". The people who say that clearly don't understand audio file sizes (unless Amazon has invented a new, secret, super-compressed audio format).
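For a sense of scale, a rough sketch of what "upload everything later" would look like in raw bytes; the bitrates here are my own ballpark figures, not anything the vendors publish:

    # Approximate upload volume if a device shipped ALL audio it hears.
    SECONDS_PER_DAY = 24 * 60 * 60

    def daily_megabytes(bitrate_kbps: float) -> float:
        """Audio bytes per day at a given bitrate, in megabytes."""
        return bitrate_kbps * 1000 / 8 * SECONDS_PER_DAY / 1e6

    for label, kbps in [("16 kHz 16-bit PCM", 256), ("Opus voice (~24 kbps)", 24)]:
        print(f"{label}: ~{daily_megabytes(kbps):.0f} MB/day")
    # 16 kHz 16-bit PCM: ~2765 MB/day
    # Opus voice (~24 kbps): ~259 MB/day

Even heavily compressed, that kind of volume would be hard to hide from anyone watching the device's traffic.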

If Google doesn't clearly state what it collects, people must assume it's everything.

When it's within your power to verify but you don't take the time to do so, you lose the authority to go online and claim your conspiracy theory has any merit at all.

No. You don't lose authority, you lose credibility. Unfortunately, conspiracy theorists tend to believe and repeat information that sparks an emotional reaction in them or agrees with their paranoid worldview, regardless of the credibility of that information. However, because of Google's track record with user data, paranoid assumptions about them tend to be correct.

You can make fun of people who don't read the tiny little words hidden in a contract. But you can also say that those tiny words are a dark pattern.

Assuming (pretty safely) that the connection to google is TLS encrypted, how exactly would someone be able to examine the data with wireshark? The times I've had to look at an HTTPS session, I ended up having to set up a TLS decrypting proxy with a custom root certificate. It doesn't seem likely you'd be able to do that with a closed IoT device.

You shouldn't need to actually view the data. Just check the payload size. Compare the traffic at idle to the traffic that's sent when you speak to it. If it's comparable, now it's time to be suspicious. If there is a noticeable spike after saying the wake word, then you can be fairly certain it's actually waiting for the wake word.
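A minimal sketch of that test with scapy, assuming a hypothetical device IP and that the speaker's traffic is actually visible to the capturing machine (mirrored switch port, capture on the router, or similar):

    from collections import defaultdict
    import time

    from scapy.all import sniff  # requires scapy and capture privileges

    DEVICE_IP = "192.168.1.50"   # placeholder address of the smart speaker
    bytes_per_second = defaultdict(int)

    def tally(pkt):
        # Bucket on-the-wire bytes for this device by wall-clock second.
        bytes_per_second[int(time.time())] += len(pkt)

    # Capture for a minute: stay silent for ~30 s, then use the wake word.
    sniff(filter=f"host {DEVICE_IP}", prn=tally, timeout=60)

    for second in sorted(bytes_per_second):
        print(second, bytes_per_second[second], "bytes")

A flat trickle while idle and a clear burst only after the wake word is what you'd expect from a device that really does gate on the hotword.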

I would hope that it's more like "after and including the wake word" so that they are capturing the audio needed to better train their algorithms on false wakes. I would much rather they analyze "a stray noodle on the floor" and eventually figure out that it shouldn't have woken, than analyze "on the floor" by itself and not be learning about the nuances of noodles.

A hobby of mine is seeing how far away from "Hey Google" I can stray. The accessibility feature to beep on wake helps with this.
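If it does work that way, the usual trick is a small pre-roll buffer, something like this purely illustrative sketch (not how any particular device is actually implemented):

    from collections import deque

    class WakeWordRecorder:
        """Keep a short rolling buffer so a captured clip can include
        audio from just before the wake word fired. Illustrative only."""

        def __init__(self, sample_rate=16000, preroll_seconds=2):
            self.preroll = deque(maxlen=sample_rate * preroll_seconds)
            self.recording = None  # None = idle, list = capturing a query

        def feed(self, sample, wake_word_detected=False):
            if self.recording is None:
                self.preroll.append(sample)              # discard-as-you-go
                if wake_word_detected:
                    self.recording = list(self.preroll)  # includes the wake word
            else:
                self.recording.append(sample)

        def finish(self):
            clip, self.recording = self.recording, None
            self.preroll.clear()
            return clip  # this is what would be sent upstream

That would let the false-wake audio ("a stray noodle on the floor") be reviewed with the trigger phrase attached, which is exactly what you'd want for retraining.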


Can you provide a reference for where Amazon and/or Google say that, specifically? Because another comment mentions that they sort of weasel around that claim and don't actually clearly state it.

From my own perspective (and I certainly don't mean this in a way that's dismissive towards you or your comment), given that those companies lie constantly, I'm not sure why I would trust their own PR about their products as a reliable source of information.


> Can you provide a reference for where Amazon and/or Google say that, specifically?

This very article. But since you said you don't believe them ("those companies lie constantly"), why would referencing them help here? Either you believe them and it is settled, or you don't and it is worthless.

Hook up Wireshark and you can quite clearly see when a Google Mini or an Echo is sending an audio stream, since it is a substantial packet load relative to the normal "keep-alive" comms. The only exceptions to that, from what I've seen, are automatic software updates (but that traffic is moving down rather than up).
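To separate the software-update case from a possible audio upload in that kind of capture, you can also split the byte counts by direction; another small scapy sketch with a placeholder device IP:

    from collections import Counter

    from scapy.all import sniff, IP  # requires scapy and capture privileges

    DEVICE_IP = "192.168.1.50"       # placeholder address of the speaker
    direction_bytes = Counter()

    def classify(pkt):
        if IP not in pkt:
            return
        if pkt[IP].src == DEVICE_IP:
            direction_bytes["upload"] += len(pkt)    # device -> internet
        elif pkt[IP].dst == DEVICE_IP:
            direction_bytes["download"] += len(pkt)  # internet -> device

    sniff(filter=f"host {DEVICE_IP}", prn=classify, timeout=300)
    print(dict(direction_bytes))

Updates show up as a large download, while a device streaming audio would show up as sustained upload.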

You could also look at the device teardowns, for example this article has a good overview of the underlying workings (inc. Wake Word detection):

https://developer.amazon.com/blogs/alexa/post/2a32d792-d471-...


> given that those companies lie constantly, I'm not sure why I would trust their own PR

The only reference that would be credible would be something from the company, so if you don't believe them, then it wouldn't really matter, would it?


One would think that, with all the very smart people Google hires, such issues would be thought about and measures put in place to make sure stuff like this doesn't happen. Yet it seems to keep happening, so I question how hard they are really trying.

There’s a conspicuous lack of anything like “we ask permission to save data at setup time and don’t store anything unless the user clicks Yes.”

How many of these incidents have to become public before some prosecutor decides they can make the argument that people in the vicinity of digital assistants no longer have a "reasonable expectation of privacy"?


That was the discussion about the actual incident, not Google's response.

I'd love to see "An update on Google Assistant".


