Two things I'm curious about...
1) What are you using for hardware with the Pi? It seems a high-quality microphone is important for this application, and the only array microphone I've been able to find is the MATRIX Creator, which seems steeply priced when I could just buy an Amazon Echo Dot.
2) Your numbers indicate significantly better performance than Google. How are you able to achieve that? Where does your training data come from if nothing is supposedly leaving my device?
I'd really like a system that doesn't rely on the cloud, but I just don't see how you can get enough training data to be anywhere near as accurate as a cloud provider. That led me to think the next best thing would be a setup with Snowboy hotword detection, where I know nothing leaves my device until my own programmed hotword is spoken.