That would either increase the cost of the device

On-chip speech-to-text has been around since the 1980s. It's not expensive to implement, especially with modern manufacturing.

or degrade the performance of the speech-to-text

How accurate does it need to be? All the spy corporations need is a stream of keywords.

