ChatTTS-Best open source TTS Model (github.com/2noise)
165 points by informal007 29 days ago | 82 comments



From the Readme:

“ To limit the use of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model, and compressed the audio quality as much as possible using MP3 format, to prevent malicious actors from potentially using it for criminal purposes.”

I’m having a hard time understanding why they have degraded the training, audio, and thus the output. It’s not like this is the first or only text to speech system.


So that it is more inconvenient to use theirs for such purposes... I would do the same.


This is pretty decent, but a bit slow on my M2 Pro. Runs better on CPU, which is strange.

Still, here's a quick guide to getting it to work on Metal:

    --requirements.txt additions--
    torchvision==0.18.0
    accelerate==0.30.1

    --gpu_utils.py patch--
    def select_device(min_memory = 2048):
        logger = logging.getLogger(__name__)
        # check for Apple Metal (MPS) before the existing CUDA/CPU logic
        if torch.backends.mps.is_available():
            device = torch.device('mps')
            return device
could probably do with support for device_map for multiple backends...
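Something along these lines might work as a starting point for multi-backend selection; it's only my guess (reusing the torch/logging imports already in gpu_utils.py), not a tested patch:

    def select_device(min_memory=2048):
        # prefer Apple Metal, then CUDA if it has enough free memory, else CPU
        if torch.backends.mps.is_available():
            return torch.device('mps')
        if torch.cuda.is_available():
            free_bytes, _total_bytes = torch.cuda.mem_get_info()
            if free_bytes / 1024 ** 2 >= min_memory:
                return torch.device('cuda')
        return torch.device('cpu')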

Edit: it also seems to hallucinate/become increasingly unreliable with longer sentences.


I have to say the Chinese female voice sounds the most natural. It's really amazing how far these have got!

Video with examples: https://b23.tv/uumKPam (bilibili)


I hadn't heard any good prosodic laugh implementations yet.

In my mind that was the last hurdle to cross before being able to fool people regularly with a non-human voice.

Great work!

Hook that DSL into a prompt, [uv_break]gg. [laugh]
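
(For anyone wanting to try that: going by the repo README, usage is roughly the below; treat it as a sketch, since the method names may have changed between versions.)

    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load_models()

    # control tokens like [uv_break] and [laugh] are written inline in the text
    texts = ["Hook that DSL into a prompt, [uv_break] gg. [laugh]"]
    wavs = chat.infer(texts)  # list of waveforms, one per input string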


Does this support voice cloning?


Not clear based on what criteria OP has determined this is the best OS model. I also don't see that claim being made anywhere in the GitHub repo so I suspect it might be a case of vibe-based benchmarking (VBB).

As pointed out by u/modeless, there is an established leaderboard and this model isn't on it (yet)

https://news.ycombinator.com/item?id=40508445


The completion level is impressive! I can hardly tell the difference from a human voice, especially with the natural pauses and laughter, which surpasses ChatGPT’s quality. However, there’s a noticeable electric noise at the end of sentences, which feels unnatural. (As a native Chinese speaker, I find the Chinese output even better in comparison.)


Yes, it's just a newborn project, looking forward to the next version.


There is also a glitch in "dialogue"


Beyond a certain level of quality, what is the purpose of improving similarity with human voice other than scamming? I’m asking because I genuinely don’t know. It seems even a rudimentary TTS is usable as long as you can tell what it’s saying.


I don’t use any TTS at the moment because they all sound weird and it’s distracting when random words or phrases are mispronounced or have odd intonations or misplaced stresses. I would use it a lot more if it sounded entirely natural, I know this because I listen to a lot of audiobooks and podcasts.


In voice assistants, robocalls, e-books, even singing, live voice interpretation/translation... a lot of stuff.


Voice assistants -> Siri sounds just fine

Robocalls -> I want to know I’m speaking to a robot

Audio books -> reasonable. An accurate tone is pleasant

Singing -> ever heard of vocaloids? They’ve existed for at least a decade or two


> Singing -> ever heard of vocaloids? They’ve existed for at least a decade or two

That it was technically already possible does not mean there isn't benefit from improved quality. In fact, Vocaloid itself has been improving and now uses AI.

Would also add making movies, podcasts, news broadcasts, etc. available automatically in a huge range of languages. You wouldn't want movies dubbed by Microsoft Sam (beyond initial comedic effect).


> You wouldn't want movies dubbed by Microsoft Sam (beyond initial comedic effect).

You'd be surprised how common something like this used to be in Poland, though admittedly we used an Ivona voice for this, which was a lot more pleasant.

Having a single narrator narrate the entire movie, overlaying the original audio track, is already common here, much more so than dubbing or subtitles. This is for historic reasons: in the communist era, obtaining the raw audio tracks for dubbing was often impossible, and all the translators had was a normal copy of the movie in its original language.

In the early 2000's, we had a lot of early / unofficial pirate releases, and they had to be translated into Polish somehow. Subtitles were certainly one method, but as we're all used to the single-narrator style, many people didn't mind listening to a somewhat decent synthetic voice instead.


Games -> AI-powered characters that interact with you in realtime

Commercials/tutorials/corporate training videos -> Voiceover work

TV shows -> Dubbing in various languages

Fast food drive-throughs -> Taking customer orders


> robocalls

E.g. scamming. For anything that is just about conveying information through audio, like voice assistants, traditional TTS already works fine.


This is why I don't get why OpenAI doesn't open source their TTS. I've read that it is because of the risk of misuse, but if they released it with only two voices, one male and one female, and ensured that these voices are recognizably AI voices (in the sense of being widely known as such), there shouldn't be any risk?

Most of us just need a machine to have the ability to speak to us in a way that sounds halfway human, but not as horrible as old open source TTS systems.


Would you settle for as "horrible" as new open source TTS systems? :D

My current go to is Piper TTS: https://github.com/rhasspy/piper

It's MIT-licensed, supports ~30 languages and multiple voices[0]/quality/licenses.

Voice output samples: https://rhasspy.github.io/piper-samples/

Discovered just now that recently there's also been some efforts to train Public Domain/CC-BY licensed voices specifically for Piper TTS too: https://brycebeattie.com/files/tts/
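
(If you'd rather script it than use the CLI directly, invocation is roughly the below, based on the Piper README; the voice/model filename is just an example, substitute whichever voice you download.)

    import subprocess

    text = "Welcome to the world of speech synthesis!"
    # piper reads text on stdin; --model points at a downloaded .onnx voice file
    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "welcome.wav"],
        input=text.encode("utf-8"),
        check=True,
    )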

---- footnotes ----

[0] e.g. One of the English voice sets has ~900 voices(!).


I do find this Piper project really amazing.

However, if I am going to listen to 3 hours of audiobook, I would still pay for a better TTS than Piper. (I buy a lot of audiobooks read by voice actors now, but not every book I want to audio-read has a market to support a voice actor being paid to read it.)

However, Piper is definitely going in the right direction; huge progress.


My interest in TTS is around "indie" game creation, animation and "radio plays".

A couple of years ago I started development of a tool to help with the generation of game audio such as NPC dialogue, "barks" or narration for those without access to/budget for human voice actors: https://rancidbacon.itch.io/dialogue-tool-for-larynx-text-to...

One thing I found interesting is that writing a small "scene" and then hearing dialogue being spoken by a variety of voices often prompted the writing of further lines of dialogue in response to perceived emotion contained in voices in the generated output. Plus it was just fun. :)

The version of the tool on that page is based on Larynx TTS which has continued development more recently as Piper TTS: https://github.com/rhasspy/piper

I'm yet to publish my port which uses Piper TTS though: https://gitlab.com/RancidBacon/larynx-dialogue/-/tree/featur...

Though I did upload some sample output (including some "radio announcer" samples in response to a HN comment :) ): https://rancidbacon.gitlab.io/piper-tts-demos/

Obviously there's variations in voice quality, and ability to control expression is currently limited but beats hearing my own voice. :D


Restoring the voices of people who have lost their ability to speak is one; making audiobooks much cheaper and more widely available, particularly for languages where they're far rarer than in English, is another.

Overall, this tech is a boon for the disability community.


Narration of marketing or educational materials without hiring voice talent. For instance, if an independent developer wants to create a short video that explains key features of their software project. The voice should sound natural enough that the listener doesn't get distracted questioning whether it is real or hear silly mispronunciations.


I'm always on a lookout for the best TTS for my language learning app.


Video games, for one.


Sounds good, but it feels like there's something slightly off about the cadence of the voice in the sample. Maybe I'm imagining it.


I think you are right. It is not completely the same.


Sounds natural and intelligible at 3x speed, which is a plus.

>The Real-Time Factor (RTF) is around 0.65.

What is the state of real-time TTS models?


0.65 isn't that big a gap. This is probably less than one jart away from being optimised to real-time.


Aura can stream responses and has a time-to-first-byte around 200-400ms


Could it be used to teach me Mandarin? Actually, since it's only voice synthesis, I guess it would still lack the voice recognition capability to judge the quality of my attempts to reproduce tonal-language sentences.


Wow - the most impressive thing about this is the control options. I’m not aware of any other TTS systems with the same balance of control, quality and language support. Looking forward to testing this out…


Is there any good voice2voice open source model?


This is somewhat off-topic, but here goes:

It seems to me that English TTS is already extremely good, even if you're looking at implementations that are far from being the best ones for English.

...and sometimes I wonder if it's really economically efficient for that many players to compete on making English TTS yet another hair's breadth better than the next guy's, while TTS for languages other than English is this vast field of unmet market demand. At least these guys are doing Chinese, so: good for them.

Last time I looked into TTS systems for German, Google was the only game in town. What I wouldn't give for a viable alternative! It doesn't even need to be open source, I'd be quite ready to pay top dollar.


> Last time I looked into TTS systems for German, Google was the only game in town. What I wouldn't give for a viable alternative! It doesn't even need to be open source, I'd be quite ready to pay top dollar.

Will you still pay top dollar if it is open source though? :D

Piper TTS[0] (MIT Licensed; developed by main dev of Larynx TTS, Mimic3 TTS & Rhasspy voice assistant) has support for ~30 languages, at least some of which have multiple voices available--in a range of quality & data licenses.

And, potentially particularly fortuitous for your needs, there's at least one German voice that was recorded[1] specifically for Piper[2] (with emotion variants and CC0-licensed, no less :) )...

Check out `thorsten` & `thorsten_emotional` on the samples page: https://rhasspy.github.io/piper-samples/

I can't speak to the quality of the German voice specifically but for English at least I've found Piper's quality & range of voices of use[3].

---- footnotes ----

[0] https://github.com/rhasspy/piper

[1] https://www.youtube.com/playlist?list=PL19C7uchWZeojjI5FUk3q...

[2] In addition to other German voices based on other sources: https://huggingface.co/rhasspy/piper-voices/tree/main/de/de_...

[3] Somewhat of an understatement.


Thanks for the pointers! Will look into those.

> Will you still pay top dollar if it is open source though? :D

There is one OS project that I do support financially. If I get commercial value out of any of the above, I imagine I would do the same.


OpenAI's API says they support German? Never tried it though.

https://platform.openai.com/docs/guides/text-to-speech


https://elevenlabs.io/ And choose German for a sample. Not sure if it’s good enough for your needs but they have a wide variety of languages and voices.


Thanks for the pointer. This does indeed look pretty good. Do you happen to know if they're using any APIs underneath the covers? Or are they fully self-contained?


ElevenLabs is their own thing, they make the models themselves


Have you looked at Azure TTS voices? They, and not Google, are the dominant player in TTS offerings.


Where is the demo that can be used?



I don't know how to code, but I found a website that I can use online: https://chattts.com/


Looks like it is yet another xtts fork


A good time to link the TTS leaderboard: https://huggingface.co/spaces/TTS-AGI/TTS-Arena

Eleven Labs is still very far above open source models in quality. But StyleTTS2 (MIT license) is impressively good and quite fast. It'll be interesting to see where this new one ends up. The code-switching ability is quite interesting. Most open source TTS models are strictly one language per sentence, often one language per voice.

In general though, TTS as an isolated system is mostly a dead end IMO. The future is in multimodal end-to-end audio-to-audio (or anything-to-audio) models, as demonstrated by OpenAI with GPT-4o's voice mode (though I've been saying this since long before their demo: https://news.ycombinator.com/item?id=38339222). Text is very useful as training data but as a way to represent other modalities like audio or image data it is far too lossy.


> In general though, TTS as an isolated system is mostly a dead end IMO

Do you mean like as a simple text to speech application? There is a huge need for better quality audiobook output.


I don't think recording an audiobook with human-level quality is "simple". It's really a kind of acting. TTS models do very poorly at acting because they generally process one sentence at a time, or at most a paragraph, and have very little context or understanding. They just kind of fake it like a newscaster reading an unrehearsed script from a teleprompter.

True human-level audiobook reading would require understanding the whole story, which often assumes general cultural knowledge, which you'll only get from a model trained on LLM-scale data. If you asked GPT-4o's new end-to-end voice mode to read an audiobook you'd probably get a better result than any TTS model. I bet it would even do different voices for the characters if you asked it to.


Well, no. This is a reasonable guess turned strangely confidently wrong and opinionated.

Voice acting is quite literally done a sentence or at most a paragraph at a time. Often the recording order is completely different from the script.

An actor may very well record his final scene on the first day of a project, after the whole character arc has transpired. But you know, acting. They get fed a line with stage direction and do a bunch of takes and somehow it works.

Heck you might be a full blown Italian who can't say a word in English but with the right kind of jacket it comes out a banger: https://www.youtube.com/watch?v=-VsmF9m_Nt8

You mention Eleven Labs being ahead; check out Suno. There is no LLM-scale anything involved there. The voice in this context is a musical instrument, and there are lots of viable ways to tackle this problem domain.


We're talking about audiobooks here. An actor recording an audiobook does not read the sentences or paragraphs in a random order without context.

Sure, voice acting for games or movies is done piecemeal. But the actor still gets information about the story ahead of time to inform their acting, along with their general cultural knowledge as a human. Most crucially, when acting is done in this way it is done with a human director in the loop with deep knowledge of the story and a strong vision, coaching the actor as they record each line and selecting takes afterward. When the directing is done poorly, it is pretty easy to tell.

Sure, for a movie or game you could direct a TTS system line by line in the same way and select takes manually, but it would be labor intensive and not at all automatic. And to take human direction the model would need more than just the text as input. Either a special annotation language (requiring a bunch of engineering and special annotated training datasets), or preferably a general audio-to-audio model that can understand the same kind of direction a human voice actor gets.


> I don't think recording an audiobook with human-level quality is "simple"

It is, though; I just did it a few days back. Once you have a clean text extracted from the book (this is actually the difficult part, removing headers, footers, footnote references, etc.), you can feed it into edge-tts (I recommend the Ava voice) and you get something that is, in my opinion, better than most human performers. Not perfect, but humans aren't either (I'm currently listening to a book performed by a human who pronounces GNU like G-N-U).
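
(For anyone curious, the synthesis step is only a few lines with edge-tts; "en-US-AvaNeural" is my guess at the Ava voice's id, check `edge-tts --list-voices` for the exact name.)

    import asyncio
    import edge_tts

    async def main():
        # the text here would be one cleaned-up chapter of the book
        communicate = edge_tts.Communicate("It was a dark and stormy night.", "en-US-AvaNeural")
        await communicate.save("chapter_01.mp3")

    asyncio.run(main())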


Something tells me one of you is speaking about fiction and one of you is speaking about nonfiction.

Inflection, emotion, tone, character-specific affects and all that can really change the audiobook experience for fiction. Your mentioning of footnote references and GNU suggests you're talking about nonfiction, perhaps technical books. For that, a voice that never significantly changes is fine and maybe even a good thing. For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.


I'm talking about both. I cannot listen to audiobooks that are acted, in fact, that was the reason I decided to go down the rabbit hole of creating my own.

> For fiction it'd be a big step down from a professional human reader who understands the story and the characters' mental states.

On the contrary, I don't want the reader to understand anything, I just want the text in audio form and I will do the interpretation of it myself.


shrug good for you. A lot of people including myself find audiobook fiction really hard to listen to if read with a flat automated voice.

I think how you tend to listen might also matter. I mostly use audiobooks when I'm driving or otherwise doing something else that is going to claim a portion of my attention. Following the narrative and dialog is easier when the audio provides cues like vocal tone changes for each speaker / narrator.


Agreed. In fact a great example of this is the Blood Meridian audio book where each of the characters seems to get a distinct "voice" despite being narrated by a single person.

You can find it on YouTube easily if you want an example.


Maybe authors can tag sentences/paragraphs with acting directions while they write, to facilitate the acting. It seems like there are ways for some human input to streamline the process.


Approaches based on tagging and interpreting metadata are tempting. Building structured human knowledge into the system seems like a good idea. But ultimately it isn't scalable or effective compared to general learning from large amounts of data. It's exactly the famous Bitter Lesson. http://www.incompleteideas.net/IncIdeas/BitterLesson.html

To the extent that authors provide supplemental notes or instructions to human actors reading their books, that information would be helpful to provide to an automated audiobook reading model. But there is no reason for it to be in a different form than a human actor would get. Additional structure or detail would be neither necessary nor helpful.


The difference is that production moves from multiple people/skills to potentially one person, the writer, who ideally already knows the emotions in each scene. The economics make sense even before one-click audiobook production, as long as it's substantially labour-reducing.


It would be better to just have a professional director guide the model the same way you would any other actor.


Not only the whole story, but also which character is currently speaking, what place and mood he is in, whether it is sarcasm or irony, and many, many more aspects.

However, in my opinion it would be a huge benefit if this kind of metadata were put into the ebook file in some way, so that it could be extracted rather than having to be detected. I think it would be enough to ID the characters and tag a gender and a mood in the book alongside the quoted dialogue, so that you could attach different speech models to different characters. That would also allow using different voices for different characters.

I wrote a little tool called voicebuilder (which I will open source next year). It's a "sentence splitter" that can extract an LJSpeech training dataset from an audio file and an epub file via length matching. It works pretty accurately for now, although the extracted model needs manual polish. Still way faster than doing it manually.

This way you can build speech sets of your favorite narrators, and although you would never be allowed to publish them, I think for private use they are great!
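
(For anyone unfamiliar with the format: an LJSpeech-style dataset is just a folder of wav clips plus a pipe-separated metadata file, roughly as sketched below with illustrative ids/text.)

    import csv, pathlib

    dataset = pathlib.Path("my_narrator_dataset")
    (dataset / "wavs").mkdir(parents=True, exist_ok=True)  # one wav per extracted sentence

    # metadata.csv: one row per clip -> id | raw transcript | normalized transcript
    rows = [
        ("chapter01_0001", "It was the best of times,", "It was the best of times,"),
        ("chapter01_0002", "it was the worst of times.", "it was the worst of times."),
    ]
    with open(dataset / "metadata.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="|").writerows(rows)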


For non-fiction books TTS is already good enough. What's needed is the convenience and speed of turning text into audio. If I can start listening with one click in my ebook app, it'll be a darn good feature for me.


This one does seem like it does multiple languages in a sentence (at least for its currently supported languages, Chinese and English): https://www.bilibili.com/video/BV1zn4y1o7iV/?share_source=co...

But the Chinese output does seem better than the English for this TTS, which makes the arena not quite as applicable to it, since the arena entries all focus on English.


Pi by Inflection.ai was doing audio-to-audio long before GPT-4o with the most realistic voice ever (human-like imperfections, micro-pauses, etc). I don’t know why it didn’t get more attention.


Was it end-to-end, or was it audio -> speech to text -> LLM -> text to speech -> audio? I imagine it's the latter.


It's end-to-end audio, in the sense that you speak and it will reply audibly, all without visibly transcribing your words into a prompt and submitting (it may in fact be employing STT->LLM on the backend, I don't know).

Works great in the car on speaker with the kids -- endless Harry Potter trivia, science questions, etc. I was completely blown away by the voice. Truly conversational.


> It's end-to-end audio, in the sense that you speak and it will reply audibly

This is not what was meant by "audio-to-audio" or "end-to-end". It's not a statement about the UI, it's a statement about the whole system.

> it may in fact be employing STT->LLM on the backend

It certainly is, and additionally TTS after the LLM, all connected solely by text. This is not audio-to-audio, it's audio-to-text-to-audio, and the components are trained separately, not end-to-end. ChatGPT has already done this too, for months.

See OpenAI's GPT-4o blog post: https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


Thanks. I hadn't actually read the announcement, just all the hullabaloo about how the voice sounded so human-like (and like ScarJo), and that's what had impressed me the most about conversing with Pi, thus my OP.


Can it understand how you feel from the intonation of your voice? Can it recognize an animal by its sound? If not, then it's probably not end-to-end; ChatGPT has already had this mode for months, where they simply use STT and TTS to let you converse with the AI.


Why isn’t Microsoft Azure’s TTS on here? (or am I missing something and it’s called something else)


Many proprietary ones are missing including OpenAI. I'd guess that they don't have budget to pay for the API usage. I think the leaderboard is more focused on open source options.


Isn’t GPT-4o voice not audio-to-audio, but audio-to-text-to-audio?


It isn't released yet, but the new one that they demoed is audio-to-audio. That's why it can do things like sing songs, and detect emotion in voices.

The one that you can currently access in the ChatGPT app (with subscription) is the old one which is ASR->LLM->TTS.


Are we sure it’s a single model behind the scenes doing that?

Practically it doesn’t really matter, but I’d like to know for sure.


It's the second paragraph in their announcement blog post. https://openai.com/index/hello-gpt-4o/

> Prior to GPT-4o, you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode is a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in text and outputs text, and a third simple model converts that text back to audio. This process means that the main source of intelligence, GPT-4, loses a lot of information—it can’t directly observe tone, multiple speakers, or background noises, and it can’t output laughter, singing, or express emotion.

> With GPT-4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network.


I'm pretty sure you can use the new GPT-4o audio-to-audio model already – even without a subscription. You can even use the "Sky" voice if you didn't update your app.


> Attribution-NonCommercial-NoDerivatives 4.0 International

Strictly speaking, this is not open source, as the commonly accepted definitions of open-source software include freedom of use and modification.

But in an industry where "OpenAI" is 100% proprietary, I guess "open-source" doesn't really mean much.


Note that the repo itself doesn't claim it's open source, it's the title of the link that (incorrectly) claims it is.


The repo does claim that something is "open source" but the only included license text is "CC-BY-NC-ND" and the README seems to restrict permissions even further.

In addition, the Hugging Face repo[-1] states the license as "Creative Commons Attribution Non Commercial 4.0" (lacking the "ND").

Unfortunately, this combination of license imprecision and restrictiveness seems par for the course with a lot of academic TTS projects. (And, even for commercial "Open Source" TTS projects it's often the case that while the code might be OSS licensed, none of the voice data/models are.)

The current version[0] of the repo README states:

* "The open-source version on HuggingFace is a 40,000 hours pre trained model without SFT." (Presumably refers to model.)

* "At the same time, we have internally trained a detection model and plan to open-source it in the future." (Not directly relevant.)

The included "Roadmap" indicates related completed & uncompleted tasks:

* "[x] Open-source the 40k hour base model and spk_stats file"

* "[ ] Open-source VQ encoder and Lora training code"

* "[ ] Open-source the 40k hour version with multi-emotion control"

However, as noted, the current LICENSE[1] file states:

* "Attribution-NonCommercial-NoDerivatives 4.0 International"

And the README also contradicts the license:

* "This repo is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes."

* "The information and data used in this repo, are for academic and research purposes only."

And this part of the "disclaimer" would make me concerned about potential licensing issues in regard to code and or data from other sources:

* "The data obtained from publicly available sources, and the authors do not claim any ownership or copyright over the data."

The code in the repo itself appears to have no license information contained within it.

My go-to actually Open Source licensed Text-To-Speech project (with a range of voice[2] model licenses[3]--including Public Domain & CC-BY[4]) is Piper TTS: https://github.com/rhasspy/piper

---- footnotes ----

[-1] https://huggingface.co/2Noise/ChatTTS

[0] https://github.com/2noise/ChatTTS/blob/f4c8329f0d231b272b676...

[1] https://github.com/2noise/ChatTTS/blob/f4c8329f0d231b272b676...

[2] Voice samples: https://rhasspy.github.io/piper-samples/

[3] Though I would also caution that (at least by my interpretation) some of the voices listed as CC0/PD or CC-BY also note that they've been "fine-tuned" on models which have more restrictive licenses and thus probably can't inherit the voice data's more permissive license.

[4] However, these Public Domain and/or CC-BY voices do appear specifically created to be data license-compliant: https://brycebeattie.com/files/tts/


You're pointing out that their license is not open source. I agree. It's not. My point is that they're not claiming it is.

I'll concede that the README is misleading but, as far as I can tell, it's not making any claims that the repo is open source. They may hint at it in an underhanded way, like having "open-source@20noise.com" as an email that you should contact them at, or making promises about open sourcing it "in the future", but they make no specific claims about their model being open source.

I also agree that corporations and many academics do a sort of "open washing" by claiming "source available" is equivalent to "open source", but my point remains: this project doesn't actually claim to be open source.

And thanks for the link to Piper. I do wish there were an "awesome (actual) open source tts/tts/misc." list somewhere, so those of us who care can figure out where the actual open source models and data are.


Yes. Piper is awesome, but it should also be noted that it relies heavily on espeak-ng, which is licensed under the GPL, so you shouldn't use it in your app unless you're prepared to release all the source code.


Yeah, I think the link title should be "ChatTTS: Best openly licensed TTS model" instead.



