The pessimist in me thinks that a determined actor could simply capture non-trigger voice data offline and bundle it with the rest of the traffic the next time a trigger word occurs. But I am talking out my ass and have in no way verified any of this.
If data is being buffered and only sent after the trigger word, wouldn't the amount of data transmitted vary depending on how much was said before the trigger word?
Maybe. All uploads could be padded with the maximum buffer size so you can't tell the difference. The buffer could flush only small amounts at a time. Some compression algorithm could be used that becomes more efficient with larger recordings.
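To make the padding idea concrete, here is a minimal Python sketch. It is purely illustrative: the 4096-byte maximum buffer, the 4-byte length header, and the function names are all assumptions of mine, not anything taken from a real device.

```python
import struct

# Hypothetical maximum buffer size; a real device's value is unknown.
MAX_BUFFER_BYTES = 4096


def pad_upload(audio: bytes, size: int = MAX_BUFFER_BYTES) -> bytes:
    """Pad a recording to a fixed size so every upload has the same length."""
    if len(audio) > size:
        raise ValueError("recording exceeds the maximum buffer size")
    # A 4-byte big-endian length header lets the receiver strip the padding.
    header = struct.pack(">I", len(audio))
    return header + audio + b"\x00" * (size - len(audio))


def unpad_upload(blob: bytes) -> bytes:
    """Recover the original recording from a padded upload."""
    (length,) = struct.unpack(">I", blob[:4])
    return blob[4:4 + length]


short = pad_upload(b"a" * 10)
long_ = pad_upload(b"b" * 3000)
print(len(short), len(long_))
```

Both uploads come out the same length on the wire, so traffic size alone would not reveal how much was buffered. The payload would still need to be encrypted for the length header itself to stay hidden.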
What you should be asking with any "smart" device is "can I prove this device will do no harm to me?"
Honestly I have never understood the value proposition of any smart device. Why would I want any of that functionality? Never once in my life have I ever wanted to talk to my TV. I'm beginning to (again) question the wisdom of carrying a smartphone.