A long while ago, I wrote a little tutorial[0] on quantizing a speech commands network for the Raspberry Pi. I used that to control lights directly and also for wake word detection.
More recently, I found that I can just use a more classic VAD (voice activity detection), because my use cases typically don't suffer if I turn the microphone on and off. My main goal is to avoid getting out my mobile phone for information. That reduces the processing when I turn on the radio...
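For readers unfamiliar with the term, here is a minimal sketch of one classic VAD approach: a short-term energy threshold with a few frames of hangover. The function names, the threshold, and the hangover length are illustrative assumptions, not the setup described in this comment, and real signals would likely need a tuned or adaptive threshold.

```python
import math
from typing import Iterable, List, Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square level of one frame of samples scaled to [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def energy_vad(
    frames: Iterable[Sequence[float]],
    threshold: float = 0.02,  # assumed RMS level separating speech from noise
    hangover: int = 2,        # assumed number of quiet frames still kept "active"
) -> List[bool]:
    """Return one speech/no-speech decision per frame.

    A frame is active if its RMS reaches the threshold, or if it falls
    within `hangover` frames after the last frame that did.
    """
    decisions = []
    quiet_frames = hangover + 1  # start in the silent state
    for frame in frames:
        if frame_rms(frame) >= threshold:
            quiet_frames = 0
        else:
            quiet_frames += 1
        decisions.append(quiet_frames <= hangover)
    return decisions


if __name__ == "__main__":
    silence = [0.0] * 160  # 10 ms at 16 kHz
    loud = [0.5] * 160
    print(energy_vad([silence, loud, silence, silence, silence, silence]))
```

The hangover keeps short pauses between words from chopping an utterance into pieces, which matters if the decision gates a microphone or a heavier recognizer.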
Not as high-end as your solution, but nice enough for my purposes.
If you don't mind my asking, what do you mean by "if it hears something interesting"? Is that based on a wake word, or is it always listening and processing?