Show HN: Project S.A.T.U.R.D.A.Y. – open-source, self hosted, J.A.R.V.I.S. (github.com/grvydev)
121 points by GRVYDEV on July 2, 2023 | 30 comments
Welcome to Project S.A.T.U.R.D.A.Y. This is a project that allows anyone to easily build their own self-hosted, J.A.R.V.I.S-like voice assistant. In my mind, vocal computing is the future of human-computer interaction, and by open-sourcing this code I hope to expedite us on that path.

I have had a blast working on this so far and I'm excited to continue to build with it. It uses whisper.cpp [1], OpenAI [3] and Coqui TTS [2] for speech-to-text, text-to-text and text-to-speech inference, all 100% locally (except for the text-to-text, which currently calls out to OpenAI). In the future I plan to swap OpenAI out for llama.cpp [4]. It is built on top of WebRTC as the media transmission layer, which will allow this technology to be deployed anywhere, as it does not rely on any native or third-party APIs.

The purpose of this project is to be a toolbox for vocal computing. It provides high-level abstractions for dealing with speech-to-text, text-to-text and text-to-speech tasks. The tools remain decoupled from the underlying AI models, allowing for quick and easy upgrades when new technology is released. The main demo for this project is a J.A.R.V.I.S-like assistant, however it is meant to be usable for a wide variety of use cases.
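
To make the abstraction concrete, here is a rough sketch in Python (hypothetical names, not the project's actual API): each stage is an interface, so any model implementation can be swapped in behind it.

    from abc import ABC, abstractmethod

    # Hypothetical interfaces: an illustration of decoupling the pipeline
    # stages from the concrete models behind them, not the project's API.

    class SpeechToText(ABC):
        @abstractmethod
        def transcribe(self, audio: bytes) -> str: ...

    class TextToText(ABC):
        @abstractmethod
        def respond(self, prompt: str) -> str: ...

    class TextToSpeech(ABC):
        @abstractmethod
        def synthesize(self, text: str) -> bytes: ...

    class Assistant:
        """Wires the three stages together; any backend can be swapped in."""
        def __init__(self, stt: SpeechToText, ttt: TextToText, tts: TextToSpeech):
            self.stt, self.ttt, self.tts = stt, ttt, tts

        def handle_utterance(self, audio: bytes) -> bytes:
            transcript = self.stt.transcribe(audio)  # e.g. whisper.cpp behind this
            reply = self.ttt.respond(transcript)     # e.g. OpenAI today, llama.cpp later
            return self.tts.synthesize(reply)        # e.g. Coqui TTS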

In the coming months I plan to continue to build (hopefully with some of you) on top of this project in order to refine the abstraction level and better understand the kinds of tools required. I hope to build a community of like-minded individuals who want to see J.A.R.V.I.S finally come to life! If you are interested in vocal computing come join the Discord server and build with us! Hope to see you there :)

Video demo: https://youtu.be/xqEQSw2Wq54

[1] whisper.cpp: https://github.com/ggerganov/whisper.cpp

[2] Coqui TTS: https://github.com/coqui-ai/TTS

[3] OpenAI: https://openai.com/

[4] llama.cpp: https://github.com/ggerganov/llama.cpp




Hey nerds: maybe include a brief explanation of what J.A.R.V.I.S. is in the readme just in case!

J.A.R.V.I.S. is a fictional character voiced by Paul Bettany in the Marvel Cinematic Universe film franchise, based on the Marvel Comics characters Edwin Jarvis and H.O.M.E.R., respectively the household butler of the Stark family and another AI designed by Stark.


I was going to say that when H.O.M.E.R. appeared, the name "Homer" had a different connotation, but actually The Simpsons started four years before the Marvel character was created.


My skepticism at such overarching efforts is usually quite high. Apologies for discriminating on name/using popularity contest metrics, but seeing @grvydev's name on the repo is... great! They've become a proven master of web media already, pioneering low-latency game streaming with Lightspeed. They've selected really good off-the-shelf tech to work with for this effort. This is an effort I can believe in. Good stuff, go Garrett go!


Thank you for the very kind words :)


"In my mind vocal computing is the future of human-computer interaction"

...really though? That's the future? Not telepathic unspoken 1:1 mindmelding? Who wants to flap their vocal cords and waste breath having to talk to an anthropomorphic AI in the future if you can interact entirely through thought?

I think the future you're hoping for is a bit...near term. With a shelf life.


Something can be "the future" without being the endgame


Wait, so we are not in the endgame now? _confused Dr. Strange Noises_


In the future they will disconnect the telepathic mindmelding interface when they realise that the computer is better able to serve the user without the contradictions of what the user thinks they want.


In the future the computer will tell the user what they want.


Given how easily 24-hour access to mics and cameras can be abus… commercialized, I think I'd rather waste my breath.


If you really think about it every future is near term with a shelf life :)


Maybe tie in the ESP BOX as an open-ish Alexa front end, like Willow does: https://news.ycombinator.com/item?id=35948462


I love the idea.

Is adding long term memory on the roadmap? How does it handle conversations now? Do you give the text to text engine the last 15 messages or something?


Yeah, adding memory is absolutely on the roadmap! For now everything is quite simple to allow us to build over time, but I plan to put a lot of work into the text-to-text engine in the coming months.
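
A minimal sketch of the "last N messages" approach the parent comment describes (a hypothetical class, not the project's current implementation):

    from collections import deque

    class ConversationMemory:
        """Rolling window of recent turns sent as context with each request.
        Hypothetical sketch, not the project's actual implementation."""

        def __init__(self, max_turns: int = 15):
            self.turns = deque(maxlen=max_turns)

        def add(self, role: str, text: str) -> None:
            self.turns.append({"role": role, "content": text})

        def as_prompt(self, user_text: str) -> list:
            # System prompt + recent history + the new user message.
            return (
                [{"role": "system", "content": "You are a helpful voice assistant."}]
                + list(self.turns)
                + [{"role": "user", "content": user_text}]
            )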


Looks super cool, I was planning to do the same thing! Are you planning to add hotword detection instead of mute/unmute? I saw this https://picovoice.ai/platform/porcupine/ but didn't have time to put all the pieces together. Are you planning to use it or something else?
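
For reference, a minimal sketch of Porcupine-based hotword detection with the pvporcupine Python package ("jarvis" is one of its built-in keywords; the access key and what happens after detection are placeholders):

    import struct
    import pyaudio
    import pvporcupine

    # Placeholder access key; get a free one from the Picovoice console.
    porcupine = pvporcupine.create(access_key="YOUR_PICOVOICE_KEY",
                                   keywords=["jarvis"])
    pa = pyaudio.PyAudio()
    stream = pa.open(rate=porcupine.sample_rate, channels=1,
                     format=pyaudio.paInt16, input=True,
                     frames_per_buffer=porcupine.frame_length)
    try:
        while True:
            pcm = struct.unpack_from("h" * porcupine.frame_length,
                                     stream.read(porcupine.frame_length))
            if porcupine.process(pcm) >= 0:
                # Wake word heard: start streaming audio to the STT stage here.
                print("wake word detected")
    finally:
        stream.close()
        pa.terminate()
        porcupine.delete()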


I'd love to add hotword detection! Or, even better, you (or someone in the community) could add it :)


Just wondering if there is a video or demo, or anything so we can see it before going through and trying it out? It seems super cool


Yeah! There is a YouTube video linked in the repo. I’ll edit the description to add it to the HN post as well :)


Somewhat covered by other comments, but what I would want to see to evaluate a project like this is your current response latency and your goal latency for each step, or ideas about reducing it.


Had a thought, maybe someone here will know the answer -- has anyone trained a voice system on how long to wait before replying to an utterance? It occurred to me that if your STT hears "hey how's it going" it should reply immediately, but if it hears "I've been thinking about <x>." then it'd be appropriate to wait a second or two in case there's more coming.
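
I'm not aware of a trained model for this, but as a crude illustration of the policy such a system would approximate, a heuristic sketch (the categories and delays are made up):

    def reply_delay_seconds(transcript: str) -> float:
        # Made-up heuristic: estimate how long to wait before replying.
        text = transcript.strip().lower()
        words = text.rstrip(".!?,").split()
        if not words:
            return 1.0
        if text.endswith("?") or words[0] in {"hi", "hey", "hello"}:
            return 0.2  # greetings and questions: reply right away
        if words[-1] in {"and", "but", "so", "because"} or text.endswith(","):
            return 2.0  # sounds unfinished: wait in case more is coming
        return 1.0      # complete-sounding statements: short pause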


Why would I need to look at a video of myself during speech input?


Nice! I'm the creator of Willow[0] (which has been mentioned here).

First of all, we love seeing efforts like this and we'd love to work together with other open source voice user interface projects! There's plenty of work to do in the space...

I have roughly two decades of experience with voice, and one thing to keep in mind is how latency-sensitive voice tasks are. Generally speaking, when it comes to conversational audio, people have very high expectations regarding interactivity. For example, in the VoIP world we know that conversation between people starts getting annoying at around 300ms of latency. Higher latencies for voice assistant tasks are more-or-less "tolerated", but latency still needs to be extremely low. Alexa/Echo (with all of its problems) is at least a decent benchmark for what people expect for interactivity, and all things considered it does pretty well.

I know you're early (we are too!) but in your demo I counted roughly six seconds of latency between the initial hello and response (and nearly 10 for "tell me a joke"). In terms of conversational voice this feels like an eternity. Again, no shade at all (believe me I understand more than most) but just something I thought I'd add from my decades of experience with humans and voice. This is why we have such heavy emphasis on reducing latency as much as possible.

For an idea of just how much we emphasize this you can try our WebRTC demo[1] which can do end-to-end (from click stop record in browser to ASR response) in a few hundred milliseconds (with Whisper large-v2 and beam size 5 - medium/1 is a fraction of that) including internet latency (it's hosted in Chicago, FYI).

Running locally with WIS and Willow we see less than 500ms from end of speech (on device VAD) to command execution completion and TTS response with platforms like Home Assistant. Granted this is with GPU so you could call it cheating but a $100 six year old Nvidia Pascal series GPU runs circles around the fastest CPUs for these tasks (STT and TTS - see benchmarks here[2]). Again, kind of cheating but my RTX 3090 at home drops this down to around 200ms - roughly half of that time is Home Assistant. It's my (somewhat controversial) personal opinion that GPUs are more-or-less a requirement (today) for Alexa/Echo competitive responsiveness.

Speaking of latency, I've been noticing a trend with Willow users regarding LLMs - they are very neat, cool, and interesting (our inference server[3] supports LLamA based LLMs) but they really aren't the right tool for these kinds of tasks. They have very high memory requirements (relatively speaking), require a lot of compute, and are very slow (again, relatively speaking). They also don't natively support the kinds of API call/response you need for most voice tasks. There are efforts out there to support this with LLMs but frankly I find the overall approach very strange. It seems that LLMs have sucked a lot of oxygen out of the room and people have forgotten (or never heard of) "good old fashioned" NLU/NLP approaches.

Have you considered an NLU/NLP engine like Rasa[4]? This is the approach we will be taking to implement this kind of functionality in a flexible, assistant-platform- and integration-agnostic way. By the time you stack up VAD, STT, understanding user intent (while allowing flexible grammar), calling an API, execution, and the TTS response, latency starts to add up very, very quickly.

As one example, for "tell me a joke" Alexa does this in a few hundred milliseconds and I guarantee they're not using an LLM for this task - you can have a couple of hundred jokes to randomly select from with pre-generated TTS responses cached (as one path). Again, this is the approach we are taking to "catch up" with Alexa for all kinds of things from jokes to creating calendar entries, etc. Of course you can still have a catch-all to hand off to LLM for "conversation" but I'm not sure users actually want this for voice.
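
A rough sketch of that fast path (hypothetical names; the keyword match stands in for a real NLU engine like Rasa): match a simple intent, serve a pre-generated TTS clip from a cache, and only fall back to the LLM for open-ended conversation.

    import random

    JOKE_CLIPS = ["joke_01.wav", "joke_02.wav", "joke_03.wav"]  # pre-synthesized TTS

    INTENT_HANDLERS = {
        "tell_joke": lambda: ("cached_audio", random.choice(JOKE_CLIPS)),
    }

    def detect_intent(transcript: str):
        # Stand-in for an NLU engine; real systems classify intents and
        # extract entities rather than keyword-matching.
        if "joke" in transcript.lower():
            return "tell_joke"
        return None

    def handle(transcript: str):
        intent = detect_intent(transcript)
        if intent in INTENT_HANDLERS:
            return INTENT_HANDLERS[intent]()  # fast path: milliseconds + cached audio
        return ("llm", transcript)            # slow path: hand off to the LLM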

I may be misunderstanding your goals but just a few things I thought I would mention.

[0] - https://github.com/toverainc/willow

[1] - https://wisng.tovera.io/rtc/

[2] - https://github.com/toverainc/willow-inference-server/tree/wi...

[3] - https://github.com/toverainc/willow-inference-server

[4] - https://rasa.com/


Hey! First of all thank you for this really detailed response! I am very new to the voice space and definitely have a TON to learn. I'd love to connect and chat with you sometime :)

I totally agree with you about latency. It is very, very important for use cases such as a voice assistant. I also think there are use cases in which latency doesn't matter that much. One thing I think I may have understated about S.A.T.U.R.D.A.Y. is the fact that, at its core, it is simply an abstraction layer over vocal computing workloads. This means it is 100% inference implementation agnostic. Yes, for my demo I am using whisper.cpp however there is an implementation that also uses faster-whisper.

I also want to call out that I have spent very little time optimizing and reducing the latency in the demo. Furthermore, when I recorded it I was on incredibly shoddy WiFi in northern Scotland, and since the demo still depends on OpenAI, a large chunk of the latency was introduced by the text-to-text inference. That being said, there are still a ton of areas where the latency in the current demo could be reduced, probably to the neighborhood of 1s-1.5s. This will get better in the future :)

I want to touch on something else you mentioned: GPUs. I intentionally tried to avoid using any GPU acceleration with this demo. Yes, it would make it faster, BUT I think a large part of making this kind of technology ubiquitous is making it accessible to as many clients as possible. I wanted to see how far you can get with just a CPU.

Regarding your comments about NLU/NLP, I have not dug into using them in place of LLMs, but this seems like an area in which I need to do more research! I am very far from an AI expert :) I have a bunch of ideas for different ways to build the "brains" of this system; I simply have not had time to explore them yet. The nice part about this project and demo is that it doesn't matter whether you are using an LLM or an NLU/NLP model: either will plug in seamlessly.

Thank you again for your response and all of this information! I look forward to hopefully chatting with you more!


> Yes, for my demo I am using whisper.cpp however there is an implementation that also uses faster-whisper.

The benchmarks I referenced above show a GTX 1070 beating a Threadripper PRO 5955WX by at least 5x. Our inference server implementation runs CPU-only as well and is based on the same core as faster-whisper (ctranslate2), but our feature extractor and audio handling make it slightly faster than faster-whisper. The general point is that GPUs are so vastly architecturally and physically different - a $1k CPU can barely do large-v2 in realtime, while a $1k RTX 3090 is 17x realtime (a 4090 is 27x realtime).

Many demos, etc. online that feature local Whisper use the tiny model - we've found that in the real world, under real conditions, Whisper medium is the minimum for quality speech recognition with these tasks, and many of our users end up using large-v2. Using the same benchmarks above, this puts the floor for response time at 1.5 seconds (medium) for ~3 seconds of speech on CPU - just to get the transcript. I understand you're early, but if you can eventually break five seconds with this all-local on any CPU in the world I would be very, very surprised and impressed! I suspect you'll find that even with the worst internet connection in the world, OpenAI is still faster than llama.cpp, etc. on CPU (they use GPUs, of course).

With a highly tuned Whisper, LLM, and TTS, our inference server is around three to four seconds all-in for this task - on an RTX 3090 - and I don't consider that usable (the LLM is almost all of that). Imagine trying to have a conversation with a person and every time you say something they stare at you blankly for 5-10 seconds (or more). Frustrating to say the least...

I suppose the point is that for these tasks Apple, Amazon, Google, OpenAI, etc. all use GPUs (or equivalent) for their commercial products, and that is the benchmark in terms of user expectations - and it's often still not fast enough and merely tolerated. For these tasks, if you're bringing a CPU to a GPU fight you're going to lose - an RTX 3090 (for example) has over 10,000 cores and 935 GB/s of memory bandwidth. All of the software tricks and optimization in the world can't make CPUs compete with that.

That said, what Apple is doing with the Apple Neural Engine is very exciting, but that's another accessibility issue - outside of HN, most people don't have the latest and greatest Apple hardware (or Apple hardware at all). It's not like many people just have GPUs lying around either, but for today and the foreseeable future, given the fundamental physical realities, "it is what it is" - you either have specialized hardware or you wait.

Accessibility is important to us as well (it's why we support CPU-only), but I question the value of accessibility if it isn't anywhere near practical, and for many people waiting at least several seconds for a voice response puts these kinds of tasks in "take out your phone and do it there" territory - or, if you're already on a desktop, open a browser tab, type it out, and read it.

I DM'd you on twitter from @toverainc - let's do something!


There's a lot to read here, and maybe this is already implemented, but my initial thought was that it seemed to be waiting for the full response to be generated before starting to read it.
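
One common way to hide that is to split the LLM's token stream on sentence boundaries and hand each complete sentence to TTS as it arrives. A minimal sketch (all three callables are hypothetical):

    import re

    def speak_while_generating(token_stream, synthesize, play):
        """Start speaking before the full LLM response is finished:
        buffer streamed tokens and flush each complete sentence to TTS."""
        buffer = ""
        for token in token_stream:
            buffer += token
            while (match := re.search(r"[.!?]\s", buffer)):
                sentence, buffer = buffer[:match.end()], buffer[match.end():]
                play(synthesize(sentence.strip()))
        if buffer.strip():
            play(synthesize(buffer.strip()))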


How is the diagram made?



Is skipping the final period in acronyms a stylistic choice? =]

Thanks for your work.


More so an oversight :D


Ok, we've added some periods.




