I mean. Of course they are. Do you expect to be able to do any meaningful level of training on data that hasn't been properly labeled? At some point, a human has to go in and correct the software when the software gets it wrong. If you want services that do what Google Home does, you have to have this.
Even with that, I'm sure that the engineers are flagging voice requests that happen more than once, or where someone has to manually change or correct what the software thought was the request.
This is only creepy if you don't understand how the software works.
I wonder how people felt about handing all their personal photos over to a stranger just so they can get developed and printed, back in the 1990s. By today's standards, that would be incredibly creepy. But everyone did it because... having photos is pretty awesome.
It would be nice if companies like Google made it very clear that specific forms of user-data can/will be human reviewed for development/operational purposes. Personally, I just assume that anything I say after activating a digital assistant will be anonymized but listened to by a human one day. And I still think that Digital Home Assistants are pretty awesome despite that.
> I wonder how people felt about handing all their personal photos over to a stranger just so they can get developed and printed, back in the 1990s. By today's standards, that would be incredibly creepy. But everyone did it because... having photos is pretty awesome.
This had its own set of norms associated with it: people would not take in photos that could be considered "indecent". Possibly by the extremely conservative standards of your local chemist. Then Polaroid invented a camera where you didn't have to submit your photos to the judging eye of someone else ...
Anyone who's worked in a photo lab before will tell you that "norm" gets ignored all the time. Some people just do not think about the fact that an actual human is going to see their BDSM photos with the secretary.
I worked in a photography store in Italy one summer, 30 years ago or so. That's not always what was happening.
Neither I nor anyone else working there ever looked at anyone's private pictures, but the guy delivering them from the lab would sometimes joke about or allude to customers' photos, always from some other store. Even as a kid, I found that troubling. It also made me wonder what he told other stores about our clients' pictures.
How about Google PAY MONEY to generate training data or gather it from informed people? (“Make $100 by recording your voice for Google’s machine learning algorithms.”) Not “Arms full with an infant? This $50 device will solve all your problems” and then shipping those recordings to third-party contractors in unsecured facilities.
Figure out how to run your business unit ethically or shut down the business unit. They don’t have the right to turn their transfer learning problem into an abusive privacy policy.
Yes, if they just paid people to record their voice they would not get training data for real use cases.
I cannot find the blog post now, but quite a few years ago I recall some Google employees noticed a large number of queries for "cha cha cha cha cha..." from Android users in New York. All of the queries were done using voice search, so they listened to a few of the recordings. It turns out that their speech-to-text models were interpreting the sound of the NYC metro pulling into a station as speech.
Obviously they didn't have enough training data of people trying to talk next to a train.
We test our medicines on a small group that is representative of the entire world's population. We build soil models based on sampling a small region. We don't test all of your blood to run a medical test. I don't know what you mean by "real data", but representative sampling is how work gets done in every single domain in the world. Google can do this.
Representative sampling is how we formerly did this kind of work. It wasn't particularly good or effective, but we didn't have the methods or compute to go beyond that. No longer.
You're free to have your own opinion, but anything specific beyond "it doesn't work"? I work in pharma, and we use representative sampling every single day in every single thing we do, and it works.
Representative sampling does 'work' in the sense that it may or may not 'prove' whatever it is you had a question about. But the issue is that you're effectively building your assumptions about what is 'representative' into your sample. It's (imo) the central issue in the reproducibility crisis: our assumptions about the world and how that impacts the questions we ask about it.
It was previously intractable to do a census rather than a sample, and maybe for your purposes a sample is good enough or a census remains intractable. In my field, this is how things were done for decades (and still largely is), and even though (imo) it did a piss-poor job, it was good enough for some purposes. A piss-poor job is still better than knowing nothing. Maybe this is good enough for your purposes.
There's a third way however, which is to move beyond sampling and perform a census. This is the difference I'm speaking of. We're at the point where we don't have to sample because we can measure everything. Effectively, this is what modern data science is. We've always had the ability to sample and interpolate. It doesn't work very well (imo: https://en.wikipedia.org/wiki/Replication_crisis) and usually reflects back to us something about our assumptions in how we sampled. But that's just it. We don't have to rely on a sample if we can take a census.
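A toy sketch of the distinction, with entirely made-up numbers (nothing here reflects anyone's real pipeline): estimating an error rate from a random sample versus measuring it over the full "census".

    # Toy illustration only: sample estimate vs. full census measurement.
    # The population size and error rate are invented for the example.
    import random

    random.seed(0)

    # Pretend population: 1,000,000 utterances, each transcribed
    # correctly (0) or incorrectly (1).
    population = [1 if random.random() < 0.03 else 0 for _ in range(1_000_000)]

    # Census: measure everything.
    census_error_rate = sum(population) / len(population)

    # Sample: measure a random 1,000-utterance subset and extrapolate.
    sample = random.sample(population, 1_000)
    sample_error_rate = sum(sample) / len(sample)

    print(f"census error rate: {census_error_rate:.4f}")
    print(f"sample estimate:   {sample_error_rate:.4f}")

The sample estimate will usually land close to the census number, but only to the degree that the sample really is representative; rare subpopulations (an accent, a noisy subway platform) are exactly what a sample built on the wrong assumptions misses, which is the disagreement in this subthread.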
>But the issue is that you're effectively building your assumptions about what is 'representative' into your sample.
Even if I agree with your premise, Google is not going to build a custom voice model for every individual anyway. There will be simplifications made. There will be assumptions made, and they will end up with a representative model anyway. So you're actually just bolstering my point. It makes a ton of sense to record people in a known, controlled environment and tweak variables one by one, such as the size of the room, the location of the microphone, introducing varying amounts of background chatter, etc. This is how normal science happens all the time, and it has worked for us so far. And we haven't even addressed the ethics of spying on people in such a blatant manner. That is a whole other conversation.
Modelling aggregate human behavior/psychology is not a proper science. The same is true of macro economics and other such non-exact fields. Problems in those fields do not apply across other fields.
This is a very different kind of problem than the ones you listed. One drop of blood is going to be very similar to any other in an individual. That's not true when it comes to language data (or many other types of data, for that matter). The data you would record in a prepared setting (i.e. reading from some predefined set of phrases) is typically not even close to representing the full distribution of phrases/dialogues that humans use.
Furthermore, Google/Amazon/FB do use representative sampling of real user data; it's not feasible to transcribe every interaction with Google Home/Alexa/Siri. This is akin to what you're suggesting, but it in no way addresses the privacy concerns. The only real way to do that is to use authorized data or scripted interactions, which, as described above, are not actually representative samples. It is a complicated and nuanced problem.
>One drop of blood is going to be very similar to any other in an individual. That's not true when it comes to language data (or many other types of data, for that matter).
Why? Please do explain. If you claim that our biology doesn't vary at all in one domain but varies significantly in another, it should be easy to show this scientifically, and more specifically, to show how this variance is applicable in the context of voice recognition.
Just to take the simple example of blood glucose: using continuous glucose monitors attached at various subcutaneous sites on the body, it is trivial to show that the local glucose level is not identical at all sites.
I'm sure they did. And probably still do. But no amount of paid-for training data is going to cover all situations that occur in real life. Accents, mannerisms, speech impediments - you can't cover all possible permutations.
> I'm sure that the engineers are flagging voice requests that happen more than once
As a user, I expect to see feedback declaring that it's having trouble and specifically requesting permission to send my unparseable requests to some review queue. I expect to see feedback indicating what could resolve the issue: Was it something I said? Was there too much noise in the background? Was there a software fault, and will a fix be deployed at such-and-such a date?
The fact that I do not get any feedback whatsoever when things go wrong leads me to have no faith that the problem will be resolved in any satisfactory manner. When combined with a complete lack of consumer/business interaction options in general (w.r.t. Google), it leads to some very dissatisfied consumers.
You think engineers are flagging voice requests that happen more than once? How would engineers have access to that data in the first place if it's supposed to be anonymous and private?
>The fact that I do not get any feedback whatsoever when things go wrong leads me to have no faith that the problem will be resolved in any satisfactory manner. When combined with a complete lack of consumer/business interaction options in general (w.r.t. Google), it leads to some very dissatisfied consumers.
I guess you can ask for your money back? Let's stay grounded in reality here: Google is paying money on your behalf to improve your experience. They could, and from a financial perspective maybe should, try getting you to annotate your own data. But all that annotated data gets pooled and used to improve the models. I don't think their goal here is to give you faith they are resolving your issue. The algorithm is going to get some things wrong. It's a matter of improving the overall accuracy and precision of the algorithm.
>You think engineers are flagging voice requests that happen more than once? How would engineers have access to that data in the first place if it's supposed to be anonymous and private?
100% they have to be. It's too costly and time-intensive to slog through all the data. Probably they have a flag that goes off when a voice request is not intelligibly interpreted so many times in a row, or when a user has to go manually do something after repeatedly making a request. This flags the interaction for manual review. Then it's processed through some algorithm to strip it of identifying data. Then it gets put in front of a warm body for review, roughly like the sketch below.
Trust me that the company wants to avoid putting things in front of warm bodies as much as possible. It's expensive.
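To be clear, this is pure speculation; a minimal sketch of the kind of heuristic described above, with every name and threshold invented for illustration:

    # Speculative sketch only: not Google's code, all names/thresholds invented.
    from dataclasses import dataclass

    @dataclass
    class Interaction:
        audio_id: str
        transcript_confidence: float   # how sure the model was of its transcription
        retries_in_a_row: int          # same request repeated by the user
        user_corrected_manually: bool  # user fixed the result by hand afterwards

    def should_flag_for_review(ix: Interaction) -> bool:
        """Flag interactions the model probably got wrong."""
        return (
            ix.transcript_confidence < 0.5
            or ix.retries_in_a_row >= 3
            or ix.user_corrected_manually
        )

    def strip_identifying_data(ix: Interaction) -> dict:
        """Reduce the record to the minimum a human reviewer would need."""
        return {"anonymized_audio_ref": hash(ix.audio_id), "needs_transcription": True}

    interactions = [
        Interaction("a1", 0.92, 0, False),  # clean request: never reviewed
        Interaction("a2", 0.31, 4, True),   # repeated and corrected: flagged
    ]

    review_queue = [
        strip_identifying_data(ix)
        for ix in interactions
        if should_flag_for_review(ix)
    ]
    print(review_queue)

The whole point of a filter like this is to keep the fraction of audio that ever reaches a human reviewer as small as possible, because reviewer time is the expensive part.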
> Let's stay grounded in reality here: Google is paying money on your behalf to improve your experience.
Let's stay grounded in reality here: Google is not doing this for me or on my behalf. Google just invests a bit to take a bunch of very valuable user data and then monetizes it for far more than it took to obtain it. This relies on the fact that most users are unaware of how valuable their data is.
As a business model there's nothing illegal about this. But in any other sense it's no different from tricking an uneducated individual into selling their kidney for $2,000 just so you can resell it for $20,000 and pretend you're doing them a favor by giving them money.
And as long as the vast majority of customers are in the dark regarding the value of their data and what they're actually trading when using such a system then yes, this is Google (and not only) abusing ignorance to line their pockets.
Can you imagine how valuable voice data could be if it could be mined to show what products, politics, opinions are being discussed in the real world?
Can you imagine how valuable your voice data would be to a marketing campaign which you didn't even know you had participated in?
Can you imagine how much valuable information is contained in the vocal enunciation of A/B testing? Every little "hmmm" or "how do I go backwards?" or "what does this button do?" that people don't even realize they're saying.
Can you imagine how much that violates someone's privacy?
> other than by selling devices that use voice recognition
That's one way, but what's the actual question? My point is they are not doing this as a favor to the user (as GP seemed to suggest); they are doing it for a profit by getting the user's data far too cheaply. And for this they rely on the user staying unaware of the value of their data, how it will be used, and how much is collected in the first place.
The voice recognition tech itself (and any ancillary parts) can be sold or licensed to third parties so they can build similar systems. The actual data obtained by the voice recognition can be used exactly like any other data Google collects. They literally have access to what you say around their microphone. You can't seriously claim there's no way they could make money from this.
20 years ago monetizing the free search was just as much of a mystery for many, including seasoned investors.
It is still slightly creepy because of which humans are doing it. Is there any reason that these systems can't ask the users themselves about the accuracy of the speech-to-text conversion? Not everyone would do it, but a percentage would.
Unsupervised learning is totally a thing! Word embedding models are achieving SOTA results by training on massive unlabeled corpora. That's what's powering most of the new results in NLP.
Clustering and Dimensionality Reduction folks just sitting here, hung out to dry too...
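For anyone wondering what that looks like in practice, a minimal unsupervised example (scikit-learn on toy data; the feature vectors here just stand in for, say, acoustic or text features):

    # Minimal unsupervised-learning example: no labels anywhere.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))      # 500 unlabeled feature vectors

    # Dimensionality reduction, then clustering: structure found without labels.
    X_reduced = PCA(n_components=2).fit_transform(X)
    clusters = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)

    print(clusters[:10])                # cluster assignments, learned label-free

Word embedding pretraining is the same idea at much larger scale: the "label" is just the surrounding context, so no human annotation is needed, which is what the parent comment is pointing at.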
I thought only samples of audio from speech that is addressed to the Home device (starting with "OK Google...") are sent to humans. That by definition makes it not a conversation in the privacy of your home; it's a conversation with the assistant, just like having a phone conversation with someone else in the privacy of your home implies that someone else has access to the conversation.
You will note that the mute button is not a physical switch that cuts the signal from the microphone, but a soft button. It will probably do what it claims most of the time, but if Google for whatever reason wanted to secretly unmute it remotely, I have no doubt they could.
Wireshark? Sure, but what am I even looking for? They could be holding on to recordings in storage to send them later when the device is unmuted again. They could be embedding audio in encrypted form into other innocuous-looking packets -- I doubt a device like this is quiet on the network even when muted.
Until all the software is open source and auditable and the switch verifiably breaks the physical signal path to the mic, a device like this can never be trusted.
It's precisely knowing how the software works, and the limitations of the current technology, that makes it so creepy. With this system there's no way to avoid humans hearing a clip they shouldn't have heard.