Given that voice recognition is possible offline on a Raspberry Pi Version 1 [1], I'm wondering why they have to send the recorded audio to the cloud in the first place.
Cloud-based versions work significantly better. They are able to put perhaps 10,000 times* more processing power into recognizing what you said. They are better able to deal with different people, background noise, and thick accents. When you are making a consumer device, this is critical.
Android voice recognition can now be used offline[1]. You download the trained recognition model (which took much, much more than 10,000 times more processing power to train), and then it works without a network connection.
[1] https://jasperproject.github.io/