The main reason this kind of thing is outsourced to the cloud nowadays is the deep neural network speech recognition models involved. Most of these models are too hefty to run inference on-device. Centralizing STT in the cloud also enables online learning, so the system gets better the more it's used.
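For illustration, the on-device half of that arrangement is usually just a thin client: capture audio locally, ship it off, get text back. A minimal sketch in Python, where the endpoint URL and response shape are hypothetical placeholders, not any real provider's API:

```python
import requests

# Thin-client pattern: audio is captured on-device, but the heavy
# model inference happens server-side. STT_ENDPOINT and the JSON
# field below are made-up placeholders, not a real STT service.
STT_ENDPOINT = "https://stt.example.com/v1/recognize"

def transcribe(wav_path: str) -> str:
    with open(wav_path, "rb") as f:
        audio = f.read()
    resp = requests.post(
        STT_ENDPOINT,
        data=audio,
        headers={"Content-Type": "audio/wav"},
    )
    resp.raise_for_status()
    # Assumed response shape: {"transcript": "..."}
    return resp.json()["transcript"]

if __name__ == "__main__":
    print(transcribe("utterance.wav"))
```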
Your iPhone is also a $1k device that's faster than some laptops. And it still can't do convincing on-device text-to-speech à la Google's Tacotron, and its NLU capabilities _even in the cloud_ leave much to be desired.
Much of the cost of the iPhone is in the screen, battery, form factor, and fashion-accessory premium. Take that away and you're much closer to Raspberry Pi territory.
An FPGA on which it's worthwhile to do deep learning costs more than an iPhone and consumes a lot more power. Your best option starting next year will be sub-$100 Chinese chips with a TPU-like unit built in. The only one I know of is the RK3399Pro, which was supposed to come out this year but didn't make it, apparently because the die had to be larger than planned.