Hacker News
Show HN: Gdańsk AI – full stack AI voice chatbot (github.com/jmaczan)
100 points by yu3zhou4 on Aug 4, 2023 | 34 comments
Hi!

It's a complete product with integrations for Auth0, OpenAI, Google Cloud, and Stripe. It consists of a Next.js web app, a Node.js + Express web API, and a Python + FastAPI AI API
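At its core, the voice round trip such a product performs is an STT → LLM → TTS pipeline. Here's a minimal sketch of that flow, with stubs standing in for the real OpenAI / Google Cloud calls; every function name below is illustrative, not Gdańsk AI's actual API:

```python
# Sketch of one voice-chatbot turn: speech -> text -> reply -> speech.
# The three stubs stand in for real services (e.g. Whisper for STT, a GPT
# completion endpoint, Google Cloud TTS); names here are illustrative only.

def speech_to_text(audio: bytes) -> str:
    # Stub: a real implementation would call an STT service.
    return audio.decode("utf-8")

def generate_reply(prompt: str) -> str:
    # Stub: a real implementation would call an LLM completion endpoint.
    return f"You said: {prompt}"

def text_to_speech(text: str) -> bytes:
    # Stub: a real implementation would call a TTS service.
    return text.encode("utf-8")

def voice_round_trip(audio: bytes) -> bytes:
    """One user turn: transcribe, generate a reply, synthesize audio."""
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

Each hop in this chain adds network and inference latency, which is why the end-to-end response time is dominated by the STT and TTS legs rather than the web stack.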

I built this software because I wanted to make money by selling tokens that let users talk with the chatbot. But I think Google and Apple will include such an AI-powered assistant in their products soon, so nobody will pay me to use mine

So I'm open sourcing the product today and sharing it as GNU GPL-2 licensed software

I'm happy to help if something is unclear or needs additional docs, and to answer any questions about Gdańsk AI :)

Thanks




I've always wondered if there's a better way of making voice assistants. With this stack, the AI won't be able to answer "what is this sound?", or give you UK-specific information because it picked up on your British accent. It's bottlenecked by text. A model that can understand audio as input and output audio directly could be so much more powerful


Sure, there's a better way: https://google-research.github.io/seanet/audiopalm/examples/

There's no reason autoregressive LMs can't be used to model audio data.


Unfortunately this hasn't been released yet, and since that paper things have gone very quiet on the topic, so chances are high it (or something similar) will simply never be released.


I've honestly lost all interest in anything integrating with OpenAI at this point. Llama 2 is giving completions at ChatGPT level on a single GPU. I've replaced all of my LLM usage with it.

Open local models are the future.


Open local models are literally the past, but the very recent past. I therefore agree -- as long as they remain only on the order of a year older than the cutting edge, their future looks extremely bright.

I just hope for-profit enterprises continue wanting to push that frontier as hard as they have been. (Wait, actually also if they stop doing that it might be for the best ...)


>Open local models are literally the past, but the very recent past. I therefore agree -- as long as they remain only on the order of a year older than the cutting edge, their future looks extremely bright.

I think the qualitative difference between Llama 2 and any previous open LLM is large enough that we can call this a new epoch. It took OpenAI spending millions on free compute to show the world what these things are capable of. And because it's something you can really only believe when you see it, that's what set things off.

But the cat's out of the bag now, and it's never going back. I think OpenAI would do well to return to their roots of pure research rather than bothering with the product side of things. Come up with the latest and greatest new models, then chuck 'em over the fence for Microsoft to monetize.

>I just hope for-profit enterprises continue wanting to push that frontier as hard as they have been. (Wait, actually also if they stop doing that it might be for the best ...)

My bet is that Meta is pivoting hard right now. Llama is probably their most successful project/product/whatever since Instagram. They have the talent, the money, and (crucially right now) the hardware to do it. And this plays directly into Zuck's desire for a platform. It seems pretty obvious their play is to build an ecosystem around these things and start gradually introducing licensing fees (and/or hosted models) for big commercial users.


Which GPU is capable of all that? And do you have a write-up for this?


> Which GPU is capable of all that? And do you have a write-up for this?

Anything down to an RTX 3090. You need 16 GB of VRAM to comfortably fit llama2-7b-chat with 8-bit quantization and a 2048-token context in llama.cpp. I'm getting completions (in real time) that are easily on par with ChatGPT on a VM with a single V100. People are even getting decent performance on Apple silicon with Metal.
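A rough back-of-the-envelope check of that 16 GB figure, using Llama 2 7B's published dimensions (32 layers, 4096 hidden size, roughly 6.7B parameters). This is a simplification that ignores activation memory and framework overhead, so treat it as a lower bound:

```python
# Rough VRAM estimate: 8-bit quantized weights plus an fp16 KV cache
# at a 2048-token context. Model dimensions are Llama 2 7B's published
# ones; activations and runtime overhead are deliberately ignored.

N_PARAMS = 6_740_000_000   # ~6.7B parameters
N_LAYERS = 32
D_MODEL = 4096
N_CTX = 2048

def weights_bytes(n_params: int, bits: int = 8) -> int:
    """Memory for the quantized weights alone."""
    return n_params * bits // 8

def kv_cache_bytes(n_layers: int, n_ctx: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    """fp16 KV cache: keys + values, per layer, per context position."""
    return 2 * n_layers * n_ctx * d_model * bytes_per_elem

total = weights_bytes(N_PARAMS) + kv_cache_bytes(N_LAYERS, N_CTX, D_MODEL)
print(f"~{total / 2**30:.1f} GiB")  # ~7.3 GiB, comfortably under 16 GiB
```

The headroom between that estimate and 16 GB is what absorbs activations, scratch buffers, and longer contexts in practice.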


OpenAI released an app last month with push-to-talk that transcribes voice to text, but there's no built-in text-to-speech (use a screen reader, I guess?)

I think this is a dumb architecture. The Whisper model has been released and runs well on an 8 GB consumer GPU. Train a new head on it to produce speech until it exceeds the other voice-to-voice models, then fine-tune it to banter instead of translate. Is that possible? Sure, but it's a pretty small model, so you wouldn't expect large-LLM performance.
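For reference, running the released Whisper model locally is only a few lines with the `openai-whisper` package (the model name and file path below are illustrative). Whisper internally processes audio in 30-second windows, which the small helper makes explicit:

```python
import math

def n_windows(n_samples: int, sample_rate: int = 16_000,
              window_seconds: int = 30) -> int:
    """Number of 30 s windows Whisper will pad/chunk the audio into."""
    return max(1, math.ceil(n_samples / (sample_rate * window_seconds)))

def transcribe_file(path: str) -> str:
    """Local transcription; requires `pip install openai-whisper`."""
    import whisper                          # imported lazily: needs GPU/model weights
    model = whisper.load_model("base")      # the base model fits easily in 8 GB VRAM
    return model.transcribe(path)["text"]   # path is illustrative, e.g. "recording.wav"
```

The 30-second windowing is also part of why Whisper alone isn't a full assistant: it's a batch transcriber, not a streaming one.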


I've long thought that AI voice (in any form) is probably the most useful area of AI that hasn't yet been built in a form that's generally known and available.

So kudos to you for building something useful and, as YC says, for 'building something people want.'


Great job on Gdańsk AI! How do you handle the speech-text/API latency?

Coincidentally, I'm working on an idea that is similar to your other project https://poss.market/market/ How's it going? Would love to learn more.


Hi, thanks! STT/TTS latencies are a bit of a bottleneck here unfortunately, so it's not as fast as I'd like

I'm about to launch poss market in production! Feel free to drop me an email at jed@maczan.pl


Looks cool! Do you have a hosted version anywhere we could play around with?


I'm setting up the project on DigitalOcean right now!


Demo is live! https://bibop.app/

Let me know if you run into any trouble running it. It works in Google Chrome


I'm turning off the demo today. If you need any help setting up your own instance, don't hesitate to contact me via email (in my profile) or open an issue on the project's GitHub page


Now it just needs an avatar in front of it and an app, and it'll be like http://callannie.ai


Related: what's the current SOTA among freely available TTS models? T5 is pretty good, but the closed Google and Meta stuff seems better.


TorToiSe (https://github.com/neonbjb/tortoise-tts) produces the best-quality speech of any freely available model. However, its long inference times make it impractical for voice chatbots like Gdańsk AI.


What's the reason for the high inference latency? Any ideas on how this could be improved?


TorToiSe is composed of several large models: GPT-2 for text encodings, as well as a large VQVAE encoder and a large diffusion-model decoder.

Only the big spaghetti inference code (+ weights) has been published, so there's a high barrier to entry for re-training or improving it.


It has been sped up, but it's still not fast enough for this use case. https://github.com/manmay-nakhashi/tortoise-tts-fastest


Nice work. I noticed the Bing app does this too, and I'm curious how this would compare


Great name


He could have called it Szczebrzeszyn; it's a riot listening to Anglophones try to pronounce that.


Poles are not geese [and have a language of their own]


It's just a bit un-"international"


There are a few. Off the top of my head:

https://github.com/kopia/kopia


There are also unintentionally Polish names. When I first heard the name Zapier, I thought the founder was Polish :p


True. Let's try to internationalize Polish names :)


How do I set this up? I don't really understand (I'm not experienced with Node.js or JavaScript in general).


Out of curiosity, why GPL2 instead of 3?


No good reason behind it; I'm just not very familiar with GPL3, so I often default to 2. What could be reasons to pick 3 over 2? I can consider it


AFAIK the main difference is certain provisions designed to block the use of patents to restrict the fundamental freedoms the GPL grants the user. The main drawback is that it's incompatible with GPLv2-only software, but for new software that doesn't seem to be a problem.

Also, the AGPL could be a good license here. It adds a provision that if users access the code remotely (as on a server), you must share the source too. The plain GPL only requires that when you distribute a binary.

PS: not a lawyer; I'd be happy to be corrected if something I said was wrong



