Hacker News
Show HN: Gdańsk AI – full stack AI voice chatbot (github.com/jmaczan)
100 points by yu3zhou4 on Aug 4, 2023 | 34 comments
Hi!

It's a complete product with integrations for Auth0, OpenAI, Google Cloud, and Stripe. It consists of a Next.js web app, a Node.js + Express web API, and a Python + FastAPI AI API
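At its core, the voice round trip such a product performs is an STT → LLM → TTS pipeline. Here's a minimal sketch of that flow, with stubs standing in for the real OpenAI / Google Cloud calls; every function name below is illustrative, not Gdańsk AI's actual API:

```python
# Sketch of one voice-chatbot turn: speech -> text -> reply -> speech.
# The three stubs stand in for real services (e.g. Whisper for STT, a GPT
# completion endpoint, Google Cloud TTS); names here are illustrative only.

def speech_to_text(audio: bytes) -> str:
    # Stub: a real implementation would call an STT service.
    return audio.decode("utf-8")

def generate_reply(prompt: str) -> str:
    # Stub: a real implementation would call an LLM completion endpoint.
    return f"You said: {prompt}"

def text_to_speech(text: str) -> bytes:
    # Stub: a real implementation would call a TTS service.
    return text.encode("utf-8")

def voice_round_trip(audio: bytes) -> bytes:
    """One user turn: transcribe, generate a reply, synthesize audio."""
    transcript = speech_to_text(audio)
    reply = generate_reply(transcript)
    return text_to_speech(reply)
```

Each hop in this chain adds network and inference latency, which is why the end-to-end response time is dominated by the STT and TTS legs rather than the web stack.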

I built this software because I wanted to make money by selling tokens that let users talk with the chatbot. But I think Google and Apple will include such an AI-powered assistant in their products soon, so nobody will pay me to use mine

So I'm open sourcing the product today and sharing it as GNU GPL-2 licensed software

I'm happy to help if something is unclear or needs additional docs, and to answer any questions about Gdańsk AI :)

Thanks




I've always wondered if there's a better way of making voice assistants. With this stack, the AI won't be able to answer "what is this sound?", or give you UK-specific information because it picked up on your British accent. It's bottlenecked by text. A model that can understand audio as input and output audio directly could be so much more powerful


Sure, there's a better way: https://google-research.github.io/seanet/audiopalm/examples/

There's no reason autoregressive LMs can't be used to model audio data.


Unfortunately this hasn't been released yet, and since that paper things have gone very quiet on the topic, so chances are high it (or something similar) will simply never be released.


I've honestly lost all interest in anything integrating with OpenAI at this point. Llama 2 is giving completions at ChatGPT level on a single GPU. I've replaced all of my LLM usage with it.

Open local models are the future.


Open local models are literally the past, but the very recent past. I therefore agree -- as long as they remain only on the order of a year older than the cutting edge, their future looks extremely bright.

I just hope for-profit enterprises continue wanting to push that frontier as hard as they have been. (Wait, actually also if they stop doing that it might be for the best ...)


>Open local models are literally the past, but the very recent past. I therefore agree -- as long as they remain only on the order of a year older than the cutting edge, their future looks extremely bright.

I think the qualitative difference between Llama 2 and any previous open LLM is large enough that we can call this a new epoch. It took OpenAI spending millions on free compute to show the world what these things are capable of. And because it's something you can really only believe when you see it, that's what set things off.

But the cat's out of the bag now, and it's never going back. I think OpenAI would do well to return to their roots of pure research rather than bothering with the product side of things. Come up with the latest and greatest new models, then chuck 'em over the fence for Microsoft to monetize.

>I just hope for-profit enterprises continue wanting to push that frontier as hard as they have been. (Wait, actually also if they stop doing that it might be for the best ...)

My bet is that Meta is pivoting hard right now. Llama is probably their most successful project/product/whatever since Instagram. They have the talent, the money, and (crucially right now) the hardware to do it. And this plays directly into Zuck's desire for a platform. It seems pretty obvious their play is to build an ecosystem around these things and start gradually introducing licensing fees (and/or hosted models) for big commercial users.


Which GPU is capable of all that? And do you have a write-up for this?


> Which GPU is capable of all that? And do you have a write-up for this?

Anything down to an RTX 3090. You need 16 GB of VRAM to comfortably fit llama2-7b-chat with 8-bit quantization and a 2048-token context in llama.cpp. I'm getting completions (in real time) that are easily on par with ChatGPT on a VM with a single V100. People are even getting decent performance on Apple silicon with Metal.
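A rough back-of-the-envelope check of that 16 GB figure, using Llama 2 7B's published dimensions (32 layers, 4096 hidden size, roughly 6.7B parameters). This is a simplification that ignores activation memory and framework overhead, so treat it as a lower bound:

```python
# Rough VRAM estimate: 8-bit quantized weights plus an fp16 KV cache
# at a 2048-token context. Model dimensions are Llama 2 7B's published
# ones; activations and runtime overhead are deliberately ignored.

N_PARAMS = 6_740_000_000   # ~6.7B parameters
N_LAYERS = 32
D_MODEL = 4096
N_CTX = 2048

def weights_bytes(n_params: int, bits: int = 8) -> int:
    """Memory for the quantized weights alone."""
    return n_params * bits // 8

def kv_cache_bytes(n_layers: int, n_ctx: int, d_model: int,
                   bytes_per_elem: int = 2) -> int:
    """fp16 KV cache: keys + values, per layer, per context position."""
    return 2 * n_layers * n_ctx * d_model * bytes_per_elem

total = weights_bytes(N_PARAMS) + kv_cache_bytes(N_LAYERS, N_CTX, D_MODEL)
print(f"~{total / 2**30:.1f} GiB")  # ~7.3 GiB, comfortably under 16 GiB
```

The headroom between that estimate and 16 GB is what absorbs activations, scratch buffers, and longer contexts in practice.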


OpenAI released an app last month with push-to-talk that transcribes voice to text, but there's no built-in text-to-speech (use a screen reader, I guess?)

I think this is a dumb architecture. The Whisper model has been released and runs well on an 8 GB consumer GPU. Train a new head on it to produce speech until it exceeds the other voice-to-voice models, then fine-tune it to banter instead of translate. Is that possible? Sure, but it's a pretty small model, so you wouldn't expect large-LLM performance.
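For reference, running the released Whisper model locally is only a few lines with the `openai-whisper` package (the model name and file path below are illustrative). Whisper internally processes audio in 30-second windows, which the small helper makes explicit:

```python
import math

def n_windows(n_samples: int, sample_rate: int = 16_000,
              window_seconds: int = 30) -> int:
    """Number of 30 s windows Whisper will pad/chunk the audio into."""
    return max(1, math.ceil(n_samples / (sample_rate * window_seconds)))

def transcribe_file(path: str) -> str:
    """Local transcription; requires `pip install openai-whisper`."""
    import whisper                          # imported lazily: needs GPU/model weights
    model = whisper.load_model("base")      # the base model fits easily in 8 GB VRAM
    return model.transcribe(path)["text"]   # path is illustrative, e.g. "recording.wav"
```

The 30-second windowing is also part of why Whisper alone isn't a full assistant: it's a batch transcriber, not a streaming one.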


I've long thought that AI voice (in any form) is probably the most useful area of AI that hasn't yet been built in a form that's generally known and available.

So kudos to you for building something useful and, as YC says, for 'building something people want.'


Great job on Gdańsk AI! How do you handle the speech-text/API latency?

Coincidentally, I'm working on an idea that is similar to your other project https://poss.market/market/ How's it going? Would love to learn more.


Hi, thanks! STT/TTS latencies are a bit of a bottleneck here unfortunately, so it's not as fast as I'd like

I'm about to launch poss market in production! Feel free to drop me an email at jed@maczan.pl


Looks cool! Do you have a hosted version anywhere we could play around with?


I'm setting up the project on DigitalOcean right now!


Demo is live! https://bibop.app/

Let me know if you run into any trouble running it. It works in Google Chrome


I'm turning off the demo today. If you need any help setting up your own instance, don't hesitate to contact me via email (in my profile) or open an issue on the project's GitHub page


Now it just needs an avatar in front of it and an app, and it'll be like http://callannie.ai


Related: what's the current SOTA among freely available TTS models? T5 is pretty good, but the closed Google and Meta stuff seems better.


TorToiSe (https://github.com/neonbjb/tortoise-tts) produces the best-quality speech of any freely available model. However, its long inference times make it impractical for voice chatbots like Gdańsk AI.


What's the reason for the high inference latency? Any ideas on how this could be improved?


TorToiSe is composed of several large models: GPT-2 for text encodings, as well as a large VQVAE encoder and a large diffusion-model decoder.

Only the big spaghetti inference code (+ weights) has been published, so there's a high barrier to entry for re-training or improving it.


It has been sped up, but it's still not fast enough for this use case. https://github.com/manmay-nakhashi/tortoise-tts-fastest


Nice work. I noticed the Bing app does this too, and I'm curious how this would compare


Great name


He could have called it Szczebrzeszyn; it's a riot listening to Anglophones try to pronounce that.


Poles are not geese [and have a language of their own]


It's just a bit un-"international"


There are a few. Off the top of my head:

https://github.com/kopia/kopia


There are also unintentionally Polish names. When I first heard the name Zapier, I thought the founder was Polish :p


True. Let's try to internationalize Polish names :)


How do I set this up? I don't really understand (I'm not experienced with Node.js or JavaScript in general).


Out of curiosity, why GPL2 instead of 3?


No good reason behind it; I'm just not very familiar with GPL3, so I often default to 2. What could be reasons to pick 3 over 2? I can consider it


AFAIK the main difference is certain provisions designed to block the use of patents to restrict the fundamental freedoms the GPL grants the user. The main drawback is that it's incompatible with GPLv2-only software, but for new software that doesn't seem to be a problem.

Also, the AGPL could be a good license here. It adds a provision that if users access the code remotely (as on a server), you must share the source too. The plain GPL only requires that when you distribute a binary.

PS: not a lawyer; I'd be happy to be corrected if something I said was wrong



