Coqui, a startup providing open speech tech for everyone (github.com/coqui-ai)
174 points by doener on April 14, 2021 | 59 comments



In case it helps clarify what this is, I think the reason this is getting posted is because it was discussed earlier as a continuation of Mozilla's TTS work: https://news.ycombinator.com/item?id=26790281


Nothing to do with the tech, but I love your name! I'm just a tourist in Puerto Rico but I've been a number of times and it always warms my heart to hear the Coqui frogs at night. I even play the recordings at home sometimes if I'm having trouble falling asleep.

For anyone that hasn't heard of Coqui frogs before, they are pretty cool animals. Little guys, but a single one can be surprisingly loud and throw its voice pretty effectively. AFAIK they're only really found in Puerto Rico - apparently they can survive in other warm climates but will not sing? Maybe that's an urban legend though.

Anyway I know the sound is a little contentious (some hotels get cats to cut down on guest complaints about the Coqui) but I'd recommend checking it out: https://musicofnature.com/coqui-magic-nightscapes/


Oh, god. I took a vacation to Hawaii a few years ago. The damn Coqui frogs would never shut up. You lost sleep, because you couldn’t shut them out. People have died because they ran off the road while driving, due to lack of sleep caused by Coqui frogs.

Please kill all the Coqui frogs in the world. Or, at least imprison them all on an island that has no human habitation.

Sure, it’s cute in small quantities, but if that’s your soundtrack 24x7 at 70-80 decibels or more, non-stop, you’d probably want to commit suicide just to get out of there. That’s the kind of place that Coqui frogs drive you to.


I live in Adjuntas and no amount of cats would eat this many frogs - fortunately!

They tend to quiet down during dry spells, so our rain last week has made things sound much nicer.


Unfortunately, it is a bad name because, right or wrong, it will be easily mispronounced. Cocky AI? Who wants that?


Kind of an Americentric perspective and also doesn't give people the benefit of the doubt. I'm an American who only speaks English but I think it's an awesome and interesting name compared to most of the lame product names out there.


I agree with you, but also Puerto Rico is technically in the US. I think anyone who has vacationed there will recognize the word (especially with the frog logo), since the frogs are pretty hard to miss and there is a ton of Coqui memorabilia sold at the touristy shops. Puerto Rico is a super common vacation spot for those on the East coast, so unless I'm totally overrating the memorability of these frogs, I'd guess plenty of non-Spanish-speaking Americans will know how to pronounce it.


The computing industry could stand to be less sensitive and accept that languages other than English exist.


I agree, but also it's an onomatopoeia, the sound the frogs make actually is "Co-Kee". IIRC the indigenous people in Puerto Rico are the ones that named them a long time ago, but the spelling might have been influenced by the Spanish. Point being that you definitely do not need to speak a non-English language to know how to say the word, you could also just have visited Puerto Rico.


It will definitely be mispronounced if they misspell it without the accent (Coquí).

I understand they don't want non-ASCII in the GitHub project name, which turns into a directory name, but the one-line description contains an emoji, so I would have thought they could have allowed themselves a Latin-1 character there, for the benefit of people who know some Spanish but hadn't heard of the frogs.


I'm glad to see this. I hope they can get the TTS to sound more conversational and less like a newscaster...that being said...free is nice.


The quality is also miles better than the last open source TTS I've heard that wasn't just an Amazon SDK. I'll take newscaster-voice over robot-from-the-early-00's.


The sample links are impressive to me. I don't follow the space closely but they sound conversational.

https://soundcloud.com/user-565970875/pocket-article-wavernn...


Usually conversation is less crisp. That sounds great, but more like an audiobook or NPR. It's probably going to be hard to sound conversational with only one voice speaking though.


Yes, the woman's voice definitely has that NPR way of finishing sentences.


Some samples of the TTS voice are here[1]

[1] https://erogol.github.io/ddc-samples/


Is there any information about the training process? Which data was used, which license was that data under and which tools, drivers and hardware was used for the training?

Basically I'm wondering if these projects count as libre machine learning projects according to the Debian Deep Learning Team's Machine Learning Policy.

https://salsa.debian.org/deeplearning-team/ml-policy


They do; the issue, IIRC, is with Tensorflow support and with NVIDIA drivers.

So for the English model they use mostly free/open-source data, but some non-free data:

- train_files: Fisher, LibriSpeech, Switchboard, Common Voice English, and approximately 1700 hours of transcribed WAMU (NPR) radio shows explicitly licensed to use as training corpora. (from https://github.com/coqui-ai/STT/releases/tag/v0.9.3)

As for the hardware for training... it's basically NVIDIA (you need Tensorflow / CUDA and all that guff). For inference it works realtime on a CPU.

I'm preparing pretrained models for their STT based on Common Voice data here: https://tepozcatl.omnilingo.cc/manifest.html

There is plenty of free/open-source voice data out there; it's just a question of reaching for a sick bag and installing NVIDIA's stuff.

I don't know if they'd count as free/open-source according to Debian (I'm a Debian user myself), but the team has definitely talked about getting into Debian and would be very open to discussions about it.


Check out the ML policy, it sounds like it would currently be classified as a "toxic candy" model due to the non-free data, but it sounds like you could re-train to avoid that.

Using CUDA also means it wouldn't be considered legitimately free enough, although folks are working on getting AMD ROCm into Debian.

Tensorflow isn't yet in Debian, but there may be folks working on it.

Another problem is that Debian doesn't have the hardware for doing training.

I'd encourage you to talk to the folks on the debian-ai mailing list and IRC channel to discuss these and other issues.

https://lists.debian.org/debian-ai/ ircs://irc.oftc.net/debian-ai


AFAIK, using copyrighted data to train does not necessarily make the trained model "toxic". "Authors Guild, Inc. v. Google, Inc." case [1] is viewed as a key precedent for this view.

[1] https://en.wikipedia.org/wiki/Authors_Guild,_Inc._v._Google,....


The phrase is "toxic candy" not "toxic", see the policy for what it means.

Most data is protected by copyright, but I assume you meant proprietary rather than copyrighted. Using proprietary data might not matter under copyright law, but it does matter in terms of the Debian machine learning policy and DFSG, because the non-free data cannot be shipped in Debian main and thus cannot be used to train a model shipped in main.


Hmm, that case doesn't appear to be about ML though, could you explain how it is considered a precedent for ML?



Thanks. It's interesting that this only applies to countries with the concept of fair use, which unfortunately isn't widespread.


Yeah, ROCm is a bit of a mess. I actually have an AMD GPU in a server, but the drivers in the mainline kernel don't work properly, so I have never been able to use it.

If I were into conspiracy theories, I'd say that AMD's failure to compete in the GPU/DL space has to do with the relationship between the AMD CEO and the NVIDIA one.

Tensorflow is just awful, as is anything that touches bazel :)


> Tensorflow is just awful, as is anything that touches bazel :)

Mind if I ask why?


I wasted a week trying to replace the scorer component with a NN-based language model. Every time I made a change, the whole codebase, including Tensorflow, recompiled, so the turnaround time was about an hour per change. It was awful. I mean, I get reproducible builds etc., and if you're running stuff at Google scale it probably has all kinds of useful features. But for development on a personal laptop it was torture. Eventually I gave up.


Got it, thank you.

Fwiw, that sounds like a bug or a misconfiguration; it's absolutely supposed to have better caching behavior than that (and does in the few projects I've used it on, even on a personal laptop). If you're interested in pursuing it further (I'd understand if you aren't; that sounds frustrating), I bet the bazel team would be interested in your report.
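For what it's worth, a persistent disk cache is the usual first thing to try for rebuild pain like that; a hedged `.bazelrc` sketch (the cache paths are arbitrary placeholders):

```
# Hypothetical ~/.bazelrc tweaks for faster incremental rebuilds on a laptop
build --disk_cache=~/.cache/bazel-disk   # reuse action outputs across builds
build --repository_cache=~/.cache/bazel-repo  # avoid re-fetching external deps
```

With a warm disk cache, touching one source file shouldn't force Tensorflow itself to rebuild.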


"There is plenty of free/open-source voice data out there"

No doubt about that, but you need validated, transcribed voice data (no errors), and this is harder to get.


You don't need _no_ errors, you just need low errors.

Aside from Common Voice, there are also a lot of resources at OpenSLR. Also, the amount of data you need is often vastly overestimated, given advances in pretraining and transfer learning and the fact that most languages don't have as terrible an orthography as English.


Just a note: the colorful frog pictured on the website (https://coqui.ai/) is not a coqui. Coquis are usually brown and tiny. https://upload.wikimedia.org/wikipedia/commons/6/62/Coqui_Fr...


I think it’s like https://www.descript.com/


Descript doesn't look like it's open source though.


Exactly, that's why this is exciting!


Also like https://www.resemble.ai. I remember using that one a while ago and thinking this should all be open source.


It would be awesome if this could be integrated into GNOME, KDE and the other open desktops as an accessibility feature.


Looks like a fork of Mozilla DeepSpeech by former DeepSpeech developers. What is the relation to the original project?


tl;dr:

- Mozilla fired the developers and mothballed the project

- But wants to keep it around as a museum piece

All ongoing development is happening in the fork.


I was very sad when that happened. There were a lot of language communities organizing their efforts around that project too.


Yeah, they really dumped them in it, fortunately the devs are keeping it all going at coqui.ai and are really supportive of any community that got abandoned by Moz.


any idea how they are financed?


Looks like after NVIDIA's $1.5M grant, the devs came back ;)


How do I actually use this to turn my speech into text? It seems some docs are 404ing.

edit: I found some transcription code:

https://github.com/WebThingsIO/voice-addon

It's DeepSpeech-based, but close enough to be workable.


There is a lot of example code here: https://github.com/coqui-ai/STT-examples

If you have any more specific requirements then we can point you in the right direction. Or just join us on Matrix: https://app.element.io/#/room/#coqui-ai_STT:gitter.im :)
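A hedged sketch of the basic offline usage with the Python package (`pip install stt`): the model and scorer paths below are placeholders for files from the release page, and the exact API may vary by version.

```python
# Hedged sketch: transcribe one WAV with Coqui STT's Python bindings.
# The audio-loading helper is pure stdlib; the model/scorer paths in
# transcribe() are placeholders, not real files.
import array
import wave

def read_wav_as_int16(path):
    """Read a 16-bit mono WAV; return (samples, sample_rate)."""
    with wave.open(path, "rb") as w:
        assert w.getsampwidth() == 2 and w.getnchannels() == 1
        samples = array.array("h")
        samples.frombytes(w.readframes(w.getnframes()))
        return samples, w.getframerate()

def transcribe(wav_path, model_path="model.tflite", scorer_path=None):
    """Load a Coqui STT model and transcribe one file (paths are placeholders)."""
    from stt import Model  # imported lazily; requires `pip install stt`
    model = Model(model_path)
    if scorer_path:
        model.enableExternalScorer(scorer_path)
    audio, rate = read_wav_as_int16(wav_path)
    # The released models expect 16 kHz mono; resample first if rate differs.
    return model.stt(audio)
```

Note that some versions of the bindings want a numpy int16 array rather than an `array.array`; converting with `np.frombuffer(samples, dtype=np.int16)` should be a safe fallback.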


I want to yell at some webpage that is reading my mobile's mic and have it become text in a <textarea>.

And I don't want the company that must not be named to know what I said.


You might want something like LocalSTT if it's on mobile: https://github.com/ccoreilly/LocalSTT

Otherwise this code does streaming on a websocket: https://github.com/coqui-ai/STT-examples/tree/r0.9/web_micro...


yep, just found those

I'm looking forward to bothering you on the Matrix ;)


This seems neat, but I wish they had more examples of how to use it as a library. Most of their tutorials seem focused on training new models


There is a repository with STT examples[1]. Is what you're looking for there?

[1] https://github.com/coqui-ai/STT-examples


If I'm parsing the acronyms right, that's speech-to-text and I was hoping to try out their Text-to-Speech


There is another article about the TTS: https://news.ycombinator.com/item?id=26790951
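If you want to poke at the TTS side locally, here is a hedged sketch using the `Synthesizer` class from that repo (`pip install TTS`); the checkpoint and config paths are placeholders for files from their released models, and the constructor signature may differ between versions.

```python
# Hedged sketch: synthesize text to a WAV with Coqui TTS's Synthesizer.
# write_wav() is pure stdlib; the paths in speak() are placeholders.
import array
import wave

def write_wav(samples, path, rate=22050):
    """Write float samples in [-1, 1] to a 16-bit mono WAV file."""
    ints = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples)
    )
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(ints.tobytes())

def speak(text, out_path, tts_ckpt="tts_model.pth.tar", tts_cfg="config.json"):
    """Synthesize `text` and save it as a WAV (paths are placeholders)."""
    from TTS.utils.synthesizer import Synthesizer  # heavy import, done lazily
    synth = Synthesizer(tts_ckpt, tts_cfg)
    samples = synth.tts(text)  # typically a list/array of float samples
    write_wav(samples, out_path, rate=getattr(synth, "output_sample_rate", 22050))
```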


How does this startup plan to make money? Services?


Wow, that soundcloud sample is amazing. I wonder how long it took to produce it or if it could be produced in real time?


IIRC it is realtime, but check out the Matrix channel: https://app.element.io/#/room/#coqui-ai_TTS:gitter.im


What is open speech tech? Clicked twice (GitHub, website's homepage) and didn't really get much out of it.


They have text-to-speech software and speech-to-text software, both of which are open source.


Basically, they're building the tools to collect data, and models to let anyone implement voice interfaces in their system without needing to use a closed API.


I'm not sure but in any case they have a great collection of papers and talks in that repository.


Cocky? Might want to consider a rename before you run into the Coq situation.



