Orpheus is a Llama model trained to understand/emit audio tokens (from SNAC). Those tokens are just added to its tokenizer as extra tokens.
Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.
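A rough sketch of that decode step, for the curious. The "- 10" offset, the per-position 4096 stride, and the 7-token-to-3-codebook split are recalled from the reference decoder (`gguf_orpheus.py`), so treat the constants as illustrative rather than authoritative:

```python
# Sketch: pull audio-token numbers out of the generated text, regroup each
# 7-token frame into SNAC's three codebooks, and decode to a waveform.
# Offsets/layout follow my reading of the reference decoder -- illustrative only.
import re
import torch
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def tokens_to_audio(token_text):
    nums = [int(m) - 10 for m in re.findall(r"<custom_token_(\d+)>", token_text)]
    nums = nums[: len(nums) - len(nums) % 7]          # keep whole frames only
    c0, c1, c2 = [], [], []
    for i in range(0, len(nums), 7):
        f = [t - j * 4096 for j, t in enumerate(nums[i:i + 7])]  # undo per-position offset
        c0.append(f[0])                                # codebook 0: 1 code per frame
        c1.extend([f[1], f[4]])                        # codebook 1: 2 codes per frame
        c2.extend([f[2], f[3], f[5], f[6]])            # codebook 2: 4 codes per frame
    codes = [torch.tensor(c, dtype=torch.long).unsqueeze(0) for c in (c0, c1, c2)]
    with torch.no_grad():
        return snac_model.decode(codes)                # [1, 1, samples] at 24 kHz

# usage (hypothetical): audio = tokens_to_audio(model_output_text)
```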
If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can play the audio as it streams using `sounddevice` and its `OutputStream`.
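Something like this works for the playback side; a minimal sketch, assuming the SNAC decoder hands you 24 kHz mono float32 numpy chunks:

```python
# Minimal streaming playback with sounddevice. Assumes chunks are 1-D
# float32 numpy arrays in [-1, 1], 24 kHz mono, e.g. decoded SNAC frames.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # SNAC 24 kHz model

def play_stream(chunks):
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        for chunk in chunks:
            stream.write(np.ascontiguousarray(chunk, dtype=np.float32))
```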
On a Nvidia 4090, it's producing:
prompt eval time = 17.93 ms / 24 tokens ( 0.75 ms per token, 1338.39 tokens per second)
eval time = 2382.95 ms / 421 tokens ( 5.66 ms per token, 176.67 tokens per second)
total time = 2400.89 ms / 445 tokens
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU.
Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though, I didn't realize LM Studio was just llama.cpp under the hood. I have it running now, though decoding is happening on CPU Torch because of venv issues; still running at about realtime though. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces. Sounds great though, can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with Whisper, and then feed that in as a prompt? What a fascinating architecture.
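If that guess is roughly right, the data-prep side might look something like the sketch below: transcribe a reference clip with Whisper and tokenize its audio with SNAC. How Orpheus actually expects those pieces combined into a prompt isn't shown here and would need to come from the repo; model names are just examples.

```python
# Hedged sketch of the guessed prep step: Whisper for the transcript,
# SNAC for the audio codes. Prompt assembly for Orpheus is not covered.
import torch
import torchaudio
import whisper
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
asr = whisper.load_model("base")

def prep_reference(wav_path):
    text = asr.transcribe(wav_path)["text"].strip()

    audio, sr = torchaudio.load(wav_path)             # [channels, samples]
    audio = audio.mean(dim=0, keepdim=True)           # downmix to mono
    audio = torchaudio.functional.resample(audio, sr, 24000)
    with torch.no_grad():
        codes = snac_model.encode(audio.unsqueeze(0)) # list of 3 code tensors
    return text, codes
```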
I'm always a bit skeptical of these demos, and indeed I think they didn't put much effort into getting the most out of ElevenLabs. In the demo, they used the Brian voice. For the first example, I can get this in ElevenLabs [1]. Stability was set to 20 here and all the other settings were at their defaults. Having stability at the default of 50 sounds more like what is in the demo on the site [2].
Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but the pricing is also 400 times more expensive than the rest. You easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.
Great demo on #4; and content-wise, a good lesson I needed to be reminded of.
I was such a fan of CoquiTTS and so happy when they launched a commercially licensed offering. I didn't mind taking a small hit on quality if it enabled us to support them.
And then the quality of the API outputs was lower than what the self-hosted open-source Coqui model provided... I'm thinking this was one of the reasons usage was not at the level they hoped for, and they ended up folding.
The saddest part is they still didn't assign commercial rights to the open-source model, so I think Coqui is at a dead end now.
With [4], since you've told me it's AI, my brain can say that of course it's AI, but if you hadn't told me, I might have thought that maybe this guy just speaks like this, or is reading it in a monotonous-ish way (like reading from a script?) because he wants to sound professional.
Crazy.
Though I still wish open source would get better than ElevenLabs, but it's all just a dream.
It’s kind of like ChatGPT writing, where it can easily fool people who see it for the first time, but after a while you start to recognize the common patterns.
I'm looking forward to having an end-to-end "docker compose up" solution for a self-hosted ChatGPT conversational voice mode. This is probably possible today, with enough glue code, but I haven't seen a neatly wrapped solution on par with Ollama's yet.
You can glue it together with Home Assistant right now, but it's not a simple docker compose. Piper TTS and Kokoro are the two main voice engines people are using.
Orpheus would be great to get wired up. I'm wondering how well their smallest model will run and whether it will be fast enough for realtime.
With some tweaking I was able to get the current 3B model's "realtime" streaming demo running at BF16 on my 12 GB 4070 Super with about a second of latency.
I am one of the authors of sherpa-onnx. Can you describe why you feel it is complex? If you use Python, all you need is to run pip install sherpa-onnx, then download a model and use the example code from the python-api-examples folder.
Hi, I vouched for your comment since it was dead, presumably because yours is a new account. I'm looking to use it in a Flutter app, possibly with flutter_rust_bridge FFI if needed, so I was wondering how to do that, as well as where to get the models and how to use them. I didn't see any end-to-end example in the docs.
1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. Though what's the business model?
2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
Impressive for a small model, though I think it could be improved by fixing how individual phrases sound like they were recorded separately. With subtle differences in sound quality and no natural transitions between individual words, it fails to sound realistic. I think these should be fixable as we figure out how to fine-tune on (and thus normalize) recording characteristics.
To run the llama.cpp server: `llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock`
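For reference, here is a minimal way to stream tokens back from that server; a sketch assuming llama.cpp's standard `/completion` SSE endpoint on the port used above. The Orpheus prompt template and audio-token handling should follow `gguf_orpheus.py`.

```python
# Sketch: stream generated tokens from the llama.cpp server started above.
import json
import requests

def stream_tokens(prompt, url="http://127.0.0.1:1234/completion"):
    payload = {"prompt": prompt, "n_predict": 1024, "stream": True}
    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if line.startswith(b"data: "):
                chunk = json.loads(line[len(b"data: "):])
                yield chunk.get("content", "")  # contains the <custom_token_N> text
```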