Show HN: Pi-C.A.R.D, a Raspberry Pi Voice Assistant (github.com/nkasmanoff)
344 points by nkaz123 10 days ago | hide | past | favorite | 92 comments
Pi-card is an AI-powered voice assistant running locally on a Raspberry Pi. It is capable of doing anything a standard LLM (like ChatGPT) can do in a conversational setting. In addition, if a camera is attached, you can also ask Pi-card to take a photo, describe what it sees, and then ask questions about it.

It uses distributed models so latency is something I'm working on, but I'm curious about where this could go, if anywhere.

Very much a WIP. Feedback welcome :-)

I wanted to create a voice assistant that is completely offline and doesn't require any internet connection. This is because I wanted to ensure that the user's privacy is protected and that the user's data is not being sent to any third party servers.

Props, and thank you for this.

I would love for Apple/Google to introduce some tech that would make it provable/verifiable that the camera/mic on the device can only capture when the indicator is on, and that it isn't possible for apps or even higher layers of the OS to spoof this.

ThinkPads come with that: an unspoofable indicator that tells you with 100% certainty that your image is not being recorded, or even recordable, unless the physical operator of the machine allows it. Can't beat a physical cover if you really want to be sure!

Now make a cover for the microphone :)

Good point, I did conveniently gloss over that one didn't I!

A switch to break the physical wire connection is about the only thing I can come up with offhand right now. I believe the PinePhones come with physical switches for their antennas, so there's that.

Framework laptops have physical power switches for the camera and microphone. The devices completely disappear from the OS when the switches are moved.

What feels like a self-inflicted issue though is that the switches are fail-open, not fail-closed.

To allow for swappable bezels, the switches (on the plastic bezel) in fact just introduce obstructions in optocouplers (on the camera board screwed to the metal lid), which do what exactly, I don't know, given Framework's refusal to release the schematics, but I'd guess they just cut the power line of the camera and the signal of the microphone using a couple of MOSFETs.

The problem is, the camera and the microphone are live if the obstruction is absent and disconnected if it's present, not the other way around. So all it takes for a hypothetical evil maid to make the switches ineffective is to pick up the edge of the bezel with a nail (it's not glued down, this being a Framework) and snip two tiny bits of plastic off, leaving the switches nonfunctional while still feeling normal. This is not completely invisible, mind you: the camera light will still function, and you'll still see the camera in the device list if you look. It's probably not very important. But I can't help feeling that doing it the other way around would be better.

I vaguely remember another vendor (HP?) selling laptops with a physical camera switch, but given the distance between the switch (on the side near the ports) and the camera (on the top of the display), I’m less than hopeful about it being a hardware one.

You could probably add that if you were sufficiently motivated. Either by adding an LED on the power or data path, or by adding a physical switch. I think it should be fairly easy on laptops; I'm not sure where you'd jam the hardware in a phone or if you can access the cables for the camera/mic without messing with traces.

I'm a little curious if iPhones could be modified to route the {mic,camera} {power,data} through the silent mode switch, either instead of or in addition to the current functionality. I don't really have a need for a physical silent mode toggle, I'm totally fine with doing that in the settings or control panel.

That’s allegedly the case in iOS (not the provable part, but I wonder if anyone managed to disprove it yet?)

I'm thinking perhaps a standardized open-design circuit that you can view by opening up the back cover and zooming in with a microscope.

Feels like privacy tech like this, which once seemed wildly overkill for everyday users, becomes necessary as the value of collecting data and profiling humans goes through the roof.

The value of the data you willingly transmit (both to data brokers, as well as in terms of the harm that it could do to you) via the use of apps (that are upfront about their transmission of your private data) and websites is far, far greater than the audio stream from inside your house.

If you don’t want your private information transmitted, worry about the things that are verifiably and obviously transmitting your private information instead of pointlessly fretting over things that are verifiably behaving as they claim.

Do you have the Instagram or Facebook apps on your phone? Are you logged in to Google?

These are much bigger threats to your privacy than a mic.

The sum total of all of your Telegram, Discord, and iMessage DMs (all of which are effectively provided to the service provider without end to end encryption) is way more interesting than 86400 images of you sitting in front of your computer with your face scrunched up, or WAVs of you yelling at your kids. One you knowingly provide to the service provider. The other never leaves your house.

Definitely not. Neither Face ID nor the recently added "screen distance" feature in Screen Time uses the camera indicator, and both use cameras (and lidar scanners) continuously.

It's annoyingly difficult to find an article about it, but IIRC there was a hack for the MacBook Pro's green camera LED where toggling the camera on and off super quickly wouldn't give the LED time to light up.

I didn't know that, but I'm glad I have my MacBook in clamshell mode with an external camera with a physical cover.

I mean, I appreciated the little green light, but the fact that it seemed necessary indicates to me that humanity still needs some evolving.

Red nail polish would like a word

I missed the reference. Red nail polish?

Used to paint over the indicator LED

+1. That's the #1 feature I want in any "assistant".

A question: does it run only on the Pi 5, or also on other (including non-Raspberry Pi) boards?

I think so. I plan to update the README in the coming days with more info, but realistically this is something you could run on your laptop too, barring some changes to how it accesses the microphone and camera. I assume this is the case too for other boards.

The only thing which might pose an issue is the total RAM size needed for whatever LLM is responsible for responding to you, but there's a wide variety of those available on ollama, Hugging Face, etc. that can work.

Thanks. My ideal use case would probably involve a mini PC, of which I have a few spares and most of them are faster and can be upgraded to more RAM than the Pi5, which especially in this context would be welcome.

Extra kudos for the name - and extra extra for using the good old "Picard facepalm" meme.

But seriously - the name got my attention, then I read the introduction and thought "hey, Alexa without uploading everything you say to Amazon? This might actually be something for me!".

> The default wake word is "hey assistant"

I would suggest "Computer" :) And of course it should have a voice that sounds like https://en.wikipedia.org/wiki/Majel_Barrett

Seriously yes, that's what I came here to say - the default wake word should be "Computer", with edit access to a captain's log granted if spoken with sufficient authority in proper Shakespearean English.

"Hey HAL, open the pod bay doors please"


All I need is a voice assistant that:
- an RPi 4 can handle,
- I can integrate with HomeAssistant,
- is offline only, and doesn't send my data anywhere.

This project seems to be ticking most, if not all of the boxes, compared to anything else I've seen. Good job!

While at it, can someone drop a recommendation for an RPi-compatible mic for an Alexa-like use case?

Check out Rhasspy.

You won't get anything practically useful running LLMs on the 4B but you also don't strictly need LLM-based models.

In the Rhasspy community, a common pattern is to do (cheap and lightweight) wake-word detection locally on mic-attached satellites (here 4B should be sufficient) and then stream the actual recording (more computational resources for better results) over the local network to a central hub.
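The satellite/hub split described above can be sketched as a tiny framing protocol: after local wake-word detection, the satellite streams length-prefixed audio frames, and the hub reassembles them for transcription. This is only a minimal illustration of the pattern, not what Rhasspy actually does (it has its own protocols, e.g. Hermes over MQTT); `stream_to_hub` and `read_utterance` are hypothetical names.

```python
import socket
import struct

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-frame")
        buf += chunk
    return buf

def stream_to_hub(frames, sock):
    """Satellite side: send each audio frame length-prefixed, then a
    zero-length frame to mark the end of the utterance."""
    for frame in frames:
        sock.sendall(struct.pack("!I", len(frame)) + frame)
    sock.sendall(struct.pack("!I", 0))

def read_utterance(sock):
    """Hub side: collect frames until the end-of-utterance marker,
    returning the raw audio ready for transcription."""
    chunks = []
    while True:
        (n,) = struct.unpack("!I", _recv_exact(sock, 4))
        if n == 0:
            return b"".join(chunks)
        chunks.append(_recv_exact(sock, n))
```

The point is just that the satellite only needs enough compute for wake-word detection; everything after the end-of-utterance marker happens on the hub.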

>> and then stream the actual recording (more computational resources for better results) over the local network to a central hub.

This frustrates me. I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

> IMHO the end game is on-device speech recognition and anything streaming audio somewhere else for processing is delaying this result.

Why? There's practically no latency to a central hub on the local network. A Raspberry Pi is probably over-specced for this, but I do very much see value in buying 5 $20 speaker/mic streaming stations and a $200 hub instead of buying 5 $100 Raspberry Pis.

If anything, I would expect the streaming to a hub solution to respond faster than the locally processed variant. My wifi latency is ~2ms, and my PC will run Whisper way more than 2ms faster than a Raspberry Pi. Add in running a model and my PC runs circles around a Raspberry Pi.
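The argument above can be made with back-of-envelope numbers. These are illustrative assumptions, not benchmarks: even a generous network round trip is dwarfed by the gap in transcription speed.

```python
def total_latency(utterance_s, realtime_factor, network_s=0.0):
    """Seconds until a transcript is ready: audio length times the
    transcription real-time factor, plus any network round trip."""
    return utterance_s * realtime_factor + network_s

# Assumed real-time factors: a hub GPU transcribing a 3 s command at
# 0.1x realtime vs a Pi at 5x realtime, with a 2 ms wifi round trip.
hub = total_latency(3.0, 0.1, network_s=0.002)  # ~0.3 s
pi = total_latency(3.0, 5.0)                    # ~15 s
assert hub < pi
```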

> I ran Dragon Dictate on a 200MHz PC in the 1990s. Now that wasn't top quality, but it should have been good enough for voice assistants. I expect at least that quality on-device with an R-Pi today if not better.

You should get that via Whisper. I haven't used Dragon Dictate, but Whisper works very well. I've never trained it on my voice, and it rarely struggles outside of things that aren't "real words" (and even then it does a passable job of trying to spell it phonetically).

No idea about resources. It's such an unholy pain in the ass to get working in the same process as LLMs on Windows that I'm usually just ecstatic it will run. One of these days I'll figure out how to do all the passthroughs to WSL so I can yet again pretend I'm not using Windows while I use Windows (lol).

In the same vein, there was a product called "Copernic Summarizer" that was functionally equivalent to asking ChatGPT to summarize an article, 20 years ago.

I have a decent home server. Is there a system like this where the heavy lifting can be run on my home server, but the front end piped to a Raspberry Pi with speaker/camera, etc.?

If not, any good pointers to start adapting this platform to do that?

I've been a happy Rhasspy user and am now even more excited for the future, because I'm hoping the conversational style (similar to what OpenAI demo'd yesterday) will eventually come to the offline world as well. It's okay if I've gotta buy a GPU to make that happen, or maybe we get luckier and I won't? OK, maybe I'm getting too optimistic now.

Someone posted this on the HA sub yesterday - https://www.youtube.com/watch?v=4W3_vHIoVgg

He's using GPT-4o. Although not the stuff that was demo'd yesterday, since that hasn't been widely rolled out yet. But I think it's entirely possible in the near future.

Awesome. Side note: now my living room lights are white because I listened to this on speakers lol. I have the same wakeword xD

Have you checked out the Voice Assistant functionalities integrated into HA? https://www.home-assistant.io/voice_control/

NabuCasa employed the Rhasspy main dev to work on these functionalities and they are progressing with every update.

I have got good results with Playstation 3 and Playstation 4 cameras (which also have a mic). They are available for $15-20 on ebay.

> Why Pi-card?
> Raspberry Pi - Camera Audio Recognition Device.

Missed opportunity for LCARS - LLM Camera Audio Recognition Service, responding to the keyword "computer," naturally. I guess if this ran elsewhere from a Pi, it could be LCARS.

Pi-C.A.R.D is perfect. Read it 100% as Picard, and it's more recognizable than LCARS.

Just configure it to respond to "Computer" and you're good to go.

The wake word detection is an interesting problem here. As you can see in the repo, I keep a list of mis-heard versions of the wake word, which in this case is "Raspberry". Since the system heats up fast you need a fan, and with the microphone directly on a USB port next to the fan, I needed something distinct; "computer" wasn't cutting it here.

Changing the transcription model to something a bit better, or moving the mic away from the fan, could help.
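One way to make use of a list of mis-heard spellings is fuzzy matching over the transcript. This is only a sketch of the idea, not the repo's actual code, and the variant list here is hypothetical:

```python
import difflib

# Hypothetical variant list; the repo keeps its own set of spellings
# that the transcription model commonly produces for the wake word.
WAKE_VARIANTS = ["raspberry", "rasberry", "rasp berry", "razz berry"]

def heard_wake_word(transcript, variants=WAKE_VARIANTS, threshold=0.8):
    """Return True if any one- or two-word window of the transcript is
    close enough to a known (mis)transcription of the wake word."""
    words = transcript.lower().split()
    windows = words + [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
    return any(
        difflib.SequenceMatcher(None, w, v).ratio() >= threshold
        for w in windows
        for v in variants
    )
```

The threshold trades false accepts against false rejects; a noisier mic placement (like one next to a fan) would push you toward a longer variant list rather than a looser threshold.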

Have a look at the openWakeWord model, which is built specifically for detecting wake words in a stream of speech.

"Number One" would be my code word...

And finally saying "make it so" to make the command happen.

"Engage" to order some travel solution?

Captain might be funnier.

Or just use the mouse.

It's just not very realistic - if you think you can give orders to your captain, you'll be out of Starfleet in no time!

How quaint.

As a professional technology person I say "computer" about 1 megatoken per day.

Yeah, but how often do you say "computer" in a querying/interrogatory tone?

That's a perfect opportunity to get better at cosplaying a Starfleet officer.

(Seriously though, a Federation-grade system would just recognize from context whether or not you meant to invoke the voice interface. Federation is full of near-AGI in simple appliances. Like the sliding doors that just know whether you want to go through them, or are just passing by.)

While totally true it’s not a good reason to use it as a wake word in 2024 with my raspberry pi voice assistant ;-)

This is why we can't have nice LCARS things: https://en.wikipedia.org/wiki/LCARS#Legal

Or LLM Offline Camera, User Trained Understanding Speech


s/Offline/Online/ and make sure it has all the cloud features enabled, so you and your friends and loved ones can become one with the FAANG collective.

It should be really something like Beneficial Audio Realtime Recognition Electronic Transformer.

I'm looking forward to trying this. Hopefully this gains traction, as (AFAIK) an open, reliable, flexible, privacy focused voice assistant is still sorely needed.

About a year ago, my family was really keen on getting an Alexa. I don't want Bezos spy devices in our home, so I convinced them to let me try making our own. I went with Mycroft on a Pi 4 and it did not go well. The wake word detection was inconsistent, the integrations were lacking and I think it'd been effectively abandoned by that point. I'd intended to contribute to the project and some of the integrations I was struggling with but life intervened and I never got back to it. Also, thankfully, my family forgot about the Alexa.

Some "maker" products sold at Target had a cardboard box with an arcade RGB-LED button on top, a speaker, and 4 microphones on a "hat" for an RPi... nano? pico? Whichever one is roughly the size of a SODIMM. It didn't have a wake word; you'd push the white-lit button, and it would change color twice: once to acknowledge the press, and again to indicate it was listening. It would change color once you finished speaking, and then it would speak back to you.

It used some Google thing on the backend, and it was really frustrating to get set up and keep working, but it did work.

I have two of those devices, so I've been waiting for something to come along that would let me self-host something similar.

I'm really inspired reading this and hope it can help! I'm planning to put more work into this. I have a few rough demos of it in action on youtube (https://www.youtube.com/watch?v=OryGVbh5JZE) which should give you an idea of the quality of it at the moment.

Is it possible to run this on a generic linux box? Or if not, are you aware of a similar project that can?

I've googled it before, but the space is crowded and the caveats are subtle.

A Raspberry Pi is very much like a generic Linux box; the biggest difference is that it has an ARM rather than an Intel/AMD CPU, which makes things slightly less widely supported.

But overall, Pi-C.A.R.D seems to be using Python and C++, so there shouldn't be any issues whatsoever running this on whatever Python and C++ can be run/compiled on.

I tried to build this on an early-gen RPi 4 about three years ago, but ran into limitations in the hardware (and in my own knowledge). Super cool to see it happening now!

It would be cool to see some RasPi HATs you could plug a GPU into, though I'm unsure how practical or feasible that would be. Today's graphics cards are tomorrow's e-waste; perhaps they could get a second life beefing up a DIY RasPi project like this.

Most of the USP, outside of the ecosystem developed around the single platform, has to do with form factor and power draw. Adding in the GPU/adapter/PSU to leverage cheap CUDA cores probably works out worse power/price/form-wise than going for a better SoC or an x86 NUC solution.

For crypto-mining, you'd convert a single PCIe slot into 4 x1 PCIe slots, or just have a board with 12+ x1 PCIe slots. I'm not sure what magic goes into PCIe, but I do know that at least one commodity board had an "exposed" PCIe interface: the Atomic Pi.

Anyhow, the GPU would sit on a small PCB that would connect to an even smaller PCB in the PCIe slot on the motherboard, via a USB3 cable. My point here is merely that whatever PCIe is, it can be transported to a GPU to do work via USB3 cables.

I see that a speaker is in the hardware list - does this speak back?

Yes! I'm currently using https://espeak.sourceforge.net/, so it isn't especially fun to listen to, though.

Additionally, since I'm streaming the LLM response, it won't take long for the reply to start. Since it speaks a chunk at a time, though, occasionally only parts of words get said. How long you need to wait also depends, of course, on which model you use and what the context size is.
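One common way to avoid speaking partial words is to buffer the streamed tokens and hand the TTS engine only complete clauses. A minimal sketch of that idea (not the project's actual code; the punctuation rule is an assumption):

```python
import re

def sentence_chunks(token_stream):
    """Accumulate streamed LLM tokens and yield only complete clauses,
    so the TTS engine never receives a partial word."""
    buf = ""
    for token in token_stream:
        buf += token
        # Split on clause-ending punctuation followed by whitespace.
        while True:
            m = re.search(r"[.!?;:]\s", buf)
            if not m:
                break
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush whatever remains at end of stream
```

Each yielded chunk could then be handed to espeak, e.g. `subprocess.run(["espeak", chunk])`, trading a little latency on the first clause for never cutting a word in half.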

When I did a similar thing (but with less LLM) I liked https://github.com/coqui-ai/TTS but back then I needed to cut out the conversion step from tensor to a list of numbers to make it work really nicely.

Why does Picard always have to specify the temperature preference for his Earl Grey tea? Shouldn't the super smart AI have learned his preference by now?

> Why does Picard always have to specify temperature preference for his Earl Gray tea?

Completely OT, but, most likely, he doesn't. Lots of people in the show interact with the replicators more fluidly; "Tea, Earl Grey, Hot" seems like a Picard quirk, possibly developed with more primitive food/beverage units than the replicators on the Enterprise-D.

Perhaps he must be specific to override a hard, lawsuit proof, default that is too tepid for his tastes.

Will there still be lawsuits in the post-scarcity world? Probably.

Force of habit?

Most of Starfleet folks seem to not know how to use their replicators well anyway. For all the smarts they have, they use it like a mundane appliance they never bothered to read the manual for, miss 90% of its functionality, and then complain that replicated food tastes bad.

If anything he's not being specific enough https://i.redd.it/hluqexh3oqc91.jpg

Well, the one time he just said "Tea, Earl Grey", the computer assumed "Tea, Earl Grey, lukewarm".

How does the wake word work? Does it keep listening and ignore the audio if the last few seconds don't contain the wake word/phrase?

That's the general idea, yes. Or rather: store several chunks of audio and discard the oldest, aka a "rolling window".
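The rolling window described above can be sketched with a bounded deque, which discards the oldest chunk automatically. This is an illustration only; `transcribe` stands in for the actual Whisper call, and the chunk count is an assumption:

```python
from collections import deque

WINDOW_CHUNKS = 6  # e.g. six 0.5 s chunks, roughly 3 s of recent audio

def listen_for_wake_word(audio_chunks, transcribe, wake_word="raspberry"):
    """Keep a rolling window of the most recent audio chunks;
    deque(maxlen=...) drops the oldest once the window is full.
    Returns True as soon as the wake word shows up in a transcript."""
    window = deque(maxlen=WINDOW_CHUNKS)
    for chunk in audio_chunks:
        window.append(chunk)
        if wake_word in transcribe(b"".join(window)).lower():
            return True
    return False
```

In practice you would transcribe the window on a timer rather than after every chunk, since the transcription itself is the expensive step.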

What latency do you get? I'd be interested in seeing a demo video.

It fully depends on the model and how much conversational context you provide, but if you keep things to a bare minimum, under ~5 seconds from message received to starting the response using Llama 3 8B. I'm also using a vision language model, https://moondream.ai/, but that takes around 45 seconds, so the next idea is to take a more basic image captioning model, insert its output into context, and try to cut that time down even more.

I also tried using Vulkan, which is supposedly faster, but the times were a bit slower than plain CPU for llama.cpp.

How hard is it to make an offline model like this learn?

The readme mentions memory that lasts only as long as each conversation, which seems like a hard limitation to live with.

Funny, I just picked up a device for use with https://heywillow.io for similar reasons

Me too, but I bricked mine when flashing the BIOS. Just a fluke, nothing to be done about it.

I watched the demo, to be honest if I saw it sooner I probably would have tried to start this as a fork from there. Any idea what the issue was?

No, but the core concept had changed so much, between the firmware version when I bought it, and what it is now, that I'm not surprised if the upgrade is a buggy process.

It's a shame, because I only wanted a tenth of what it could do- I just wanted it to send text to a REST server. That's all! And I had it working! But I saw there was a major firmware update, and to make a long story short, KABOOM.

I'm the founder of Willow.

Early versions sent requests directly from devices and we found that to be problematic/inflexible for a variety of reasons.

Our new architecture uses the Willow Application Server (WAS) with a standard generic protocol to devices and handles the subtleties of the various command endpoints (Home Assistant, REST, OpenHAB, etc). This can be activated in WAS with the "WAS Command Endpoint" configuration option.

This approach also enables a feature we call Willow Auto Correct (WAC) that our users have found to be a game-changer in the open source voice assistant space.

I don't want to seem like I'm selling or shilling Willow, I just have a lot of experience with this application and many hard-learned lessons over the past two decades.

I'm the founder of Willow.

Generally speaking the approach of "run this on your Pi with distro X" has a surprising number of issues that make it, umm, "rough" for this application. As anyone who has tried to get reliable low-latency audio with Linux can attest the hoops to get a standard distro (audio drivers, ALSA, pulseaudio/pipewire, etc) to work well are many. This is why all commercial solutions using similar architectures/devices build the "firmware" from the kernel up with stuff like Yocto.

More broadly, many people have tried this approach (run everything on the Pi) and frankly speaking it just doesn't have a chance to be even remotely competitive with commercial solutions in terms of quality and responsiveness. Decent quality speech recognition with Whisper (as one example) itself introduces significant latency.

We (and our users) have determined over the past year that in practice Whisper medium is pretty much the minimum for reliable transcription for typical commands under typical real-world conditions. If you watch a lot of these demo videos you'll notice the human is very close to the device and speaking very carefully and intently. That's just not the real world.

We haven't benchmarked Whisper on the Pi 5, but the HA community has, and they find it to be about 2x the performance of the Pi 4 for Whisper. If you look at our benchmarks[0], that means you'll be waiting at least 25 seconds just for transcription with typical voice commands and Whisper medium.

At that point you can find your phone, unlock, open an app, and just do it yourself faster with a fraction of the frustration - if you have to repeat yourself even once you're looking at one minute for an action to complete.

The hardware just isn't there. Maybe on the Raspberry Pi 10 ;).

We have a generic software-only Willow client for "bring your own distro and device", and it does not work anywhere close to as well as our main target of ESP32-S3 BOX-based devices (acoustic engineering, targeted audio drivers, real-time operating system, known hardware). That's why we haven't released it.

Many people also want multiple devices and this approach becomes even more cumbersome when you're trying to coordinate wake activation and functionality between all of them. You then also end up with a fleet of devices with full-blown Linux distros that becomes difficult to manage.

Using random/generic audio hardware itself also has significant issues. Many people in the Home Assistant community that have tried the "just bring a mic and speaker!" approach end up buying > $50 USB speakerphone devices with the acoustic engineering necessary for reliable far-field audio. They are often powered by XMOS chips that themselves cost a multiple of the ESP32-S3 (as one example). So at this point you're in the range of > $150 per location for a solution that is still nowhere near the experience of Echo, Google Home, or Willow :).

I don't want to seem critical of this project, but I think it's important to set very clear expectations before people go out and spend money thinking it's going to resemble anything close to a commercial solution.

[0] - https://heywillow.io/components/willow-inference-server/#com...

I really like having a voice assistant that focuses on privacy first. I will definitely try it. Always add a video to show how it works, though. Thank you!

Can't wait until I can buy a licensed Lieutenant Data voice pack inside ChatGPT.

I didn't see a mention of languages in the readme. Does this understand languages other than English?

There are two models in use here: Whisper tiny for transcribing audio, and then Llama 3 for responding.

Whisper tiny is multilingual (though I am using the English-specific variant), and I believe Llama 3 is technically capable of being multilingual, but I'm not sure of any benchmarks.

I think it could be made better, but for now the focus is English. I'll add this to the readme though. Thanks!
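The two-stage setup described above can be sketched as a single function per conversational turn. Here `transcribe` and `generate` are hypothetical stand-ins for the Whisper and Llama 3 calls, injected so the control flow is clear on its own:

```python
def respond(audio, transcribe, generate, history):
    """One turn of the two-stage loop: speech -> text via the ASR model
    (Whisper tiny), then text -> reply via the LLM (Llama 3), with the
    exchange appended to the in-memory conversation history."""
    text = transcribe(audio)
    history.append({"role": "user", "content": text})
    reply = generate(history)
    history.append({"role": "assistant", "content": reply})
    return reply
```

Since the history list is the only state, this also shows why memory lasts only as long as the conversation: clearing the list clears everything the assistant "knows".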

The suggested model for vision capabilities is English-only.

> It uses distributed models so latency is something I'm working on,

I think it uses local models, right?

I would have hidden the name behind a bit of indirection by calling it Jean Luc or something.

Thank you! This is a high value effort!

