Year of the Voice – Chapter 2: Let's talk (home-assistant.io)
199 points by balloob on April 27, 2023 | 57 comments



Founder of Home Assistant here. Let me know if anyone has any questions.

Edit: if you want to stay in the loop on the work we're doing, subscribe to our free monthly newsletter @ https://building.open-home.io/


No questions, only a comment, you kick ass.


Can I repurpose my Mycroft device with this? I'm worried it is now a dead rock.


Mark I or Mark II?


I have a Mark I that hasn't been plugged in for years. IIRC the SD card is bad.


Mark II.


I used to work for Mycroft, so I'm hoping to eventually create an image that's compatible with Home Assistant pipelines.

For now, though, you may want to check out OVOS: https://openvoiceos.com/


I checked it out but didn't see that it runs on my hardware. I'm worried this gets me closer, not farther, to brick status with my device.


Neon AI and OVOS are taking over development of Mycroft Core and the Mark II: https://www.reddit.com/r/Mycroftai/comments/1212h87/neon_ai_...


First of all, thank you so much for HA. My family uses it daily, and despite some of them being resistant at first to anything smart, they now can’t imagine life without it.

I have a general question about HA: the main issues I’ve had with it come down to two specific points.

1. The recorder module brought the system to a crawl when a single rogue device on my network kept spamming MQTT with power-consumption status updates, and it eventually killed the SD card the system was running on. Some aggressive exclude rules in the config fixed the issue, but I still see this as a major pain point because there is no indication that anything is wrong. Is there any plan to introduce some kind of housekeeping into the recorder module beyond periodic purges (which in my case were not enough to actually keep the system running)?

2. I like to have control over my computers, so I did not want to use a pre-installed HA image for the RPi. But the alternative, installing it in a Python virtual env and keeping everything updated by hand, is cumbersome and not ideal. Are there any plans to improve this installation path? Or, alternatively, some sort of advanced version of the HA image that would allow someone like me full control over the base OS while keeping the benefits of the cohesive, self-updating HA distribution?


1. We don't currently have anything for that besides looking at the database directly (one way to do that is sketched below).

2. We offer VM images if you want to keep flexibility but still benefit from all our work on an integrated system. You can also do a Supervised installation, but then it's your own responsibility to keep the host OS meeting HA's requirements.
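
For anyone hitting the same recorder problem in the meantime, one way to "look at the database directly" is to count recorder rows per entity and spot the spammy ones. A rough sketch, not an official tool, assuming the default SQLite backend and the older schema where the states table still carries an entity_id column (newer releases move it into a states_meta table):

    # Rough sketch: count recorder rows per entity to spot "spammy" devices.
    # Run against a copy of the database, or while HA is stopped.
    # Assumes the older schema where `states` still has an `entity_id` column;
    # newer schemas move it into a `states_meta` table.
    import sqlite3

    DB_PATH = "/config/home-assistant_v2.db"  # adjust to your install

    conn = sqlite3.connect(DB_PATH)
    rows = conn.execute(
        """
        SELECT entity_id, COUNT(*) AS n
        FROM states
        GROUP BY entity_id
        ORDER BY n DESC
        LIMIT 20
        """
    ).fetchall()

    for entity_id, count in rows:
        print(f"{count:>8}  {entity_id}")

    conn.close()

The top of that list usually makes it obvious which entity is responsible for most of the writes and belongs in the recorder excludes.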


I’m an HA Container user and I can vouch for this path. It gives me the ability to run whatever other stuff I want alongside HA in Docker containers with minimal effort.

Works really well. Updating is low effort (though not automatic), but you could probably automate it pretty easily if you wanted to.


I'm also a container user and for the most part it works great, but there are downsides: you can't use certain functionality that's currently only exposed on HAOS (like the new SkyConnect stuff... they said it's coming for container users, but I think any feature like that is going to take a little longer).


I do this as well and leverage Watchtower [0] to upgrade the container for me and keep the image cache flushed. I just track the `stable` tag and make sure to keep an eye on changes. I only check for updates once a week and it has worked out great.

[0] https://containrrr.dev/watchtower/


That's pretty good. Amazon Echo integration is one of the main Cloud dependencies I have.

I also like that you are directly supporting ESPHome and I'll definitely take that route and build something nice.

However, for the more casual users, are you thinking about releasing an off-the-shelf "voice assistant" device?


Yes, that's the plan.

We're still exploring different ML-on-the-edge chips to help with wake word and audio processing. One hard requirement we have is that the tooling is openly available, so that anyone can convert an AI model and run it on the chip. We don't accept tooling that is only available as a website or under NDA. That challenge is harder than we expected!


What does an ML-on-the-edge system chip do differently from other chips? Does it have a GPU or other accelerator on it?


Domain-specific fixed-function hardware can consume 10-100x less power than a GPU for the same workload. Think video decode or ‘YouTube battery life’ benchmarks, but for speech recognition.


Some labs are developing analogue chips instead of digital ones.


Does the API of the Agent support the phrase "In ten minutes, turn off any light bulb beginning with the letter B"?

I hope you saw the recent demo of a ChatGPT plug-in for HA; it could add a lot of functionality to HA if it can be used to construct action plans and then execute them.

https://www.reddit.com/r/homeassistant/comments/12l10se/intr...


In Home Assistant, a voice assistant is a combination of speech-to-text, conversation, and text-to-speech engines. A user can create as many combinations as they want. For a conversation agent, it's text command in, text response out. We already have the option to use OpenAI GPT 3.5, so it's possible in the future to hook it up to a large language model and allow it to control your house.
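
For illustration, that "text command in, text response out" flow is also reachable over the REST API. A minimal sketch, assuming a long-lived access token and a default install reachable at homeassistant.local (the exact response fields can vary between releases):

    # Minimal sketch: send a text command to Home Assistant's conversation
    # endpoint and print the agent's spoken reply.
    # Assumes a long-lived access token created under your user profile.
    import requests

    HA_URL = "http://homeassistant.local:8123"  # adjust for your install
    TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"

    resp = requests.post(
        f"{HA_URL}/api/conversation/process",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"text": "turn on the kitchen light", "language": "en"},
        timeout=10,
    )
    resp.raise_for_status()

    # Response shape as of the 2023.x releases; double-check against your version.
    reply = resp.json()
    print(reply.get("response", {}).get("speech", {}).get("plain", {}).get("speech"))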

Fun fact: we supported Almond (later Genie), the IoT AI from Stanford, which they released a couple of years ago. It would convert natural language into their ThingTalk language, which could do things like you mention. It was not popular at all, and Stanford ended up shutting it down.


I remember the Almond days. Back then they were up against Vera which, IMO, was the clear leader. Then the guys from SmartThings came out with their product, which I saw at a MinneStar event. I asked one of the founders after the preso what their perspective on cloud-connected services was, and the guy couldn't believe I had an opinion on a home automation system that couldn't operate without cloud services. They were clearly going after a much simpler consumer market and looking to sell, which they did. But it was a disaster for anyone used to the configurability and flexibility of Vera.

Then Vera started to lose focus, and HomeSeer seemed to come up as a viable replacement, along with the early days of Home Assistant. It's interesting that HA is only getting better while all the rest are slowly dying off. HomeSeer v4 is still a pretty rough transition from v3. I only hope HA stays the course and continues to hold its high standards. It's such a phenomenal project. From simple to complex automation, it is truly a great work of open source.


Have you considered integrating RHVoice in addition to the Piper models? It's nowhere near as exciting (it uses HMMs instead of cool new shiny neural network goodness), but the use-cases are sort of similar (it was originally developed for screen-readers, which require extremely low latencies and high responsiveness.) It doesn't support as many languages, but for some of the languages it does support (Polish for example), I find the quality superior to what Piper can offer.


Are you looking into making fine-tuning easy for the speech-to-text model? I feel like that's the only thing missing before it's perfect.

My entity names can contain English or other text that just won't be picked up by the speech-to-text, so if there were a way to record a couple of pronunciations for an entity name and have Home Assistant fine-tune Whisper on them in the background, it would be wonderful!


First off, thank you!

Wondering if you all are thinking about multiple remote microphones, and choosing which microphone will respond to the speaker?

Ideally, I'd love to have multiple microphones/speakers throughout the house, all of which can listen and answer, however, only the one that hears me the clearest/loudest/etc. actually answers.

Make sense?


Once we get our hardware done, this is a feature that we definitely plan to include (but probably as a later software update).


That is incredibly rad.


I built an app called Homechart that I think would be a good partnership, and I would appreciate help with integrating it into Home Assistant (or you could just acquire it =D). Contact details are in my profile.


What's the relationship between this work and rhasspy 3?


Home Assistant and Rhasspy 3 will be able to share voice services thanks to the shared Wyoming protocol. Rhasspy 3 will have more options, including lots of experimental services.


How does this setup cope with multiple microphones? i.e., potentially overlapping ones.

If I toggle a light I wouldn’t want it to execute the command twice haha


A local voice assistant is the last link missing in my entirely local smart home setup, so this is exciting news. I would love it if I could convert a Google Home Mini that I have on hand to use with this, but my understanding is that the hardware is too locked down for tinkering.

I love the VOIP integration shown off that can hook up to an old phone. One of my guilty pleasures is using peak forms of technology from the 20th century when things were more analog. It could be a lot of fun to bring an old phone into the mix to complement my turntable and PVM.


The most exciting thing about Home Assistant's "Year of the Voice", for me, is that it is apparently enabling/supporting @synesthesiam's continued phenomenal contributions to the FLOSS off-line voice synthesis space.

The quality, variety & diversity of voices that synesthesiam's "Larynx" TTS project (https://github.com/rhasspy/larynx/) made available completely transformed the Free/Open Source Text To Speech landscape.

In addition "OpenTTS" (https://github.com/synesthesiam/opentts) provided a common API for interacting with multiple FLOSS TTS projects which showed great promise for actually enabling "standing on the shoulders of" rather than re-inventing the same basic functionality every time.

The new "Piper" TTS project mentioned in the article is the apparent successor to Larynx and, along with the accompanying LibriTTS/LibriVox-based voice models, brings to FLOSS TTS something it's never had before:

* Too many voices! :)

Seriously, the current LibriTTS voice model version has 900+ voices (of varying quality levels); how do you even navigate that many?! [0]

And that's not even considering the even higher quality single speaker models based on other audio recording sources.

Offline TTS, while immensely valuable for individuals, doesn't seem to be an attractive domain for most commercial entities due to the lack of lock-in/telemetry opportunities, so I was concerned that we might end up missing out on further valuable contributions from synesthesiam's specialised skills & experience due to financial realities & the human need for food. :)

I'm glad we instead get to see what happens next.

[0] See my follow-up comment about this.


Thank you for the kind words, @follower!

I'm the author of Piper; it is a successor to Larynx (originally named Larynx 2). Piper uses the same underlying model as Mimic 3, which I developed before joining Mycroft. However, Piper uses a different library to get word pronunciations, so the voices aren't compatible between the two projects.

It's been an awesome year so far with Nabu Casa, and I'm very fortunate to be able to work on something I love. I hope to contribute to the open source voice space for many years to come :)


Piper sounds a lot like Mycroft's Mimic3. Do you know if they're both using Larynx underneath?


My interest in offline TTS is actually entirely unrelated to the automation space:

I'm interested in Text to Speech for creative pursuits, such as video game voice dialogue and animated videos.

This is one of the reasons why the range & quantity of available voices is particularly important to me.

After all, you can't really have a scene set in a board room with nine characters[3] if you've only got three voices to go around. :)

I've actually been spending time this week on updating my "Dialogue Tool"[1] application (originally created to work with Larynx to help with narrative dialogue workflows such as voice "auditioning", intelligent caching & multiple voice recordings) to work with Piper.

Which is where I ran into the question of how to navigate/curate a collection of 900+ voices.

The main approaches I'm using so far are:

(1) Random luck--just audition a bunch of different voices with your sample dialogue & see what you like.

(2) Curation/sorting based on quality-related meta-data from the original dataset.

(3) Generating a different dialogue line for each voice that includes its speaker number for identification purposes and (hopefully) isn't tedious to listen to across 900+ voices. :)

I haven't quite finished/uploaded results from (3) yet but example output based on approaches (3) & (2) can be heard here: https://rancidbacon.gitlab.io/piper-tts-demos/
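
For anyone who wants to reproduce approach (3), it can be scripted as a loop over speaker IDs that pipes a self-identifying line into the piper binary. A rough sketch, assuming the multi-speaker LibriTTS voice and that your piper build accepts a --speaker flag (check piper --help for your version):

    # Rough sketch of approach (3): render one self-identifying line per speaker
    # so the whole multi-speaker model can be skimmed by ear.
    # Assumes the piper binary is on PATH and the build supports --speaker.
    import subprocess

    MODEL = "en-us-libritts-high.onnx"  # multi-speaker LibriTTS voice (adjust path)
    NUM_SPEAKERS = 900                  # model ships 900+ speakers; adjust to the exact count

    for speaker_id in range(NUM_SPEAKERS):
        line = f"Speaker number {speaker_id}, reporting for the audition."
        subprocess.run(
            [
                "piper",
                "--model", MODEL,
                "--speaker", str(speaker_id),
                "--output_file", f"audition_{speaker_id:04d}.wav",
            ],
            input=line.encode("utf-8"),
            check=True,
        )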

The recording has two sets of 10 voices which had the lowest Word Error Rate scores in the original dataset; that doesn't mean the resulting voice models are necessarily good, but it's at least a starting point for exploring.

I'd also like to explore more analysis-based approaches to grouping/curation (e.g. vocal characteristics such as "softer", "lower", "older"), but as I'm not getting paid for this[2], that's likely a longer-term thing.

A different approach which I've previously found really interesting is to use voices as a prompt for writing narrative dialogue. It really helps to hear the dialogue as you write it and the nuances of different voices can help spur ideas for where a conversation goes next...

[1] See: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to... & https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/featur...

[2] Am currently available/open to be though. :D

[3] Will try to upload some example audio of this scene because I found it pretty funny. :)


A shameless plug: my colleagues made a demo where you can create a virtually infinite number of voices, with some control over how they sound: https://huggingface.co/spaces/Flux9665/ThisSpeakerDoesNotExi...


> 900+ voices

Where can I find all these voices? https://github.com/rhasspy/piper/releases/tag/v0.0.2 lists "only" ~50 files.


The LibriTTS file has 900+ speakers inside it.
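
If it helps, the speaker list lives inside the voice's accompanying JSON config rather than as separate downloads. A quick sketch for peeking at it (the num_speakers and speaker_id_map field names are assumed from the multi-speaker Piper configs; double-check against your downloaded file):

    # Peek inside a multi-speaker Piper voice config to see how many speakers
    # it contains. Field names assumed; verify against your file.
    import json

    with open("en-us-libritts-high.onnx.json") as f:
        config = json.load(f)

    print("num_speakers:", config.get("num_speakers"))
    speakers = config.get("speaker_id_map", {})
    print("first few speakers:", sorted(speakers.items(), key=lambda kv: kv[1])[:5])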


> On a Raspberry Pi 4, Piper can generate 2 seconds of audio with only 1 second of processing time.

> On a Raspberry Pi 4, voice commands can take around 7 seconds to process with about 200 MB of RAM used.

Have you looked into supporting something like the Coral Accelerator[0], which can drastically speed up machine learning inference on a Raspberry Pi?[1]

It used to be available for $60, but it is hard to find in stock at the moment except for way over MSRP.

[0]: https://coral.ai/products/accelerator

[1]: https://www.hackster.io/news/benchmarking-machine-learning-o...


I bought an M.2 Coral TPU directly from Mouser; it took 5 months to get here, but it was only ~$30 IIRC.


It’s been a few years since I’ve been down this path, but the last time I went exploring, one of the main challenges was getting decent hardware. The microphone array on something like an Echo at the time was far better than anything I seemed to be able to achieve without buying into the Amazon or Google ecosystems.

Is there better consumer stuff available now?


The aesthetics are hard, too. If it can't be completely hidden and has to sit out in the open, it needs to look decent (aka "spouse acceptance factor").

Amazon and Google have both done pretty well here, and I hope there's a good option to replace them in the future. Maybe it's possible to transplant an ESP32 into one of those devices? Maybe someone will start selling something.


The same as back then: you can get pretty damn close with a used PlayStation Eye webcam (I have 5 lying around for that reason). Unless something has changed very recently, the next steps up cost more than a whole Alexa device (e.g. ReSpeaker Core).


Is there a more open alternative to Whisper?


What restrictions are you hitting with MIT?


For example, Whisper is unlikely to meet Debian's requirements around ML models, such as it being possible for third parties to retrain the Whisper models from scratch using only data and code publicly available under FOSS licenses.

https://salsa.debian.org/deeplearning-team/ml-policy https://blog.opensource.org/episode-5-why-debian-wont-distri... https://deepdive.opensource.org/podcast/why-debian-wont-dist...


Eh, call me a naysayer, but nobody cared about AI agents because they largely weren't useful until ~GPT 3.5, and Alpaca LoRA 7B is about as useful as a vestigial nipple and requires state-of-the-art hardware to run locally.

I'm also going to take the opportunity to pooh-pooh this pet peeve:

>More powerful CPUs, such as the Intel Core i5, can generate 17 seconds of audio in the same amount of time.

Oh, really?

...Intel Core i5 brand microprocessors... Introduced in 2009...

About as accurate as saying "users of four-wheeled vehicles" when carriages were still around.

At the very least, you need to provide the node.


The reason I, and I assume many others, really dislike these smart assistants is that you then have to have creepy companies listening to everything you do.

The concept itself is great, and if I can finally have a good local only assistant, then that's fantastic.

And yes, one of the reasons people are more excited than ever is that the latest versions of ChatGPT are actually really good.


Coauthor of the blog post here. You're right, I said it on the live stream we had today but forgot to mention it in the blog post: the i5 is from a Lenovo ThinkCentre M72e. They're available refurbished for less than the cost of a Pi 4 these days, so it seemed to be a good comparison!


Man, I switched my Home Assistant box from a Raspberry Pi to one of these machines, because I wanted the Raspberry Pi to run one of my 3D printers, and it has made such a beautiful difference in terms of the snappiness of everything Home Assistant does.

I super recommend it, if you can afford the extra 10 to 15 watts of power.


The ThinkCentre runs at 65W; a Pi 4 runs at ~8W. There's a slight energy crisis (in Europe), so you need to optimise for workloads that can run at lower power levels. Voice alone, in my opinion, does not justify a 65W constant draw.


No personal offense; this is a very common affectation that has just been a personal bugaboo for me lately, due to technical documentation containing these non-technical statements.


The article was specifically comparing those CPUs to the Raspberry Pi 4. The point is that they’re targeting local-only TTS on low-power hardware.


I read it.

A Pi 4 has a very specific performance-to-power profile. An "i5" has no specificity.

An underclocked i5 could either be a whisper or an inferno compared to this ARM chip.

It also mentioned useful agents like GPT 3.5 as a sort of distant afterthought, which is both paid and highly non-local.


Oh, my mistake. I totally missed your point. Yeah, an i5 covers a huge range. In my mind I was just assuming they meant the slowest i5, but yeah, that doesn't make sense.


I interpreted the whole comparison as expectation management for RPi 4 users: a 2:1 ratio of audio length to processing time is well above what I would have guessed TTS would take, so highlighting this limitation is important.



