Orpheus-3B – Emotive TTS by Canopy Labs (canopylabs.ai)





GGUF version created by "isaiahbjork", compatible with LM Studio and the llama.cpp server: https://github.com/isaiahbjork/orpheus-tts-local/

To run llama.cpp server: llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock


I've been testing this out; it's quite good and especially fast. Crazy that this works so well at Q4.

Can somebody please create a Gradio client for this as well? I really want to try this out, but the complexity trips me up.

Wait, how do you get audio out of llama-server?

Orpheus is a llama model trained to understand/emit audio tokens (from snac). Those tokens are just added to its tokenizer as extra tokens.

Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass through snac decoder, and you now have a frame of audio from a 'text' pipeline.

The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.
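
Roughly, the token-to-frame step looks like this (a minimal Python sketch; the exact id offset arithmetic and the SNAC decode call live in the repo's decoder.py, so treat the details as placeholders):

  import re

  FRAME_SIZE = 7  # Orpheus emits 7 SNAC codes per audio frame

  def extract_snac_frames(generated_text):
      # Pull the numeric ids out of '<custom_token_N>' strings.
      # decoder.py additionally rebases each id by a per-position offset
      # before handing it to SNAC; that arithmetic is omitted here.
      ids = [int(m) for m in re.findall(r"<custom_token_(\d+)>", generated_text)]
      usable = len(ids) - (len(ids) % FRAME_SIZE)  # drop any trailing partial frame
      return [ids[i:i + FRAME_SIZE] for i in range(0, usable, FRAME_SIZE)]

  # Each 7-id frame then goes through the SNAC decoder
  # (convert_to_audio in decoder.py) to become a chunk of PCM audio.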


Got it, so inference in the llama.cpp server won't actually get me any audio directly.

If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can play the streaming audio directly using `sounddevice` and `OutputStream`.
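
For the streaming part, a rough sketch with sounddevice (assuming the decoder hands back int16 PCM chunks at 24 kHz; match whatever decoder.py actually produces):

  import numpy as np
  import sounddevice as sd

  SAMPLE_RATE = 24000  # assumption: adjust to the decoder's output rate

  def play_stream(audio_chunks):
      # audio_chunks: an iterable of raw int16 bytes or int16 numpy arrays,
      # e.g. the per-frame output of convert_to_audio().
      with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="int16") as stream:
          for chunk in audio_chunks:
              samples = np.frombuffer(chunk, dtype=np.int16) if isinstance(chunk, bytes) else chunk
              stream.write(samples)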

On an Nvidia 4090, it's producing:

  prompt eval time =      17.93 ms /    24 tokens (    0.75 ms per token,  1338.39 tokens per second)

         eval time =    2382.95 ms /   421 tokens (    5.66 ms per token,   176.67 tokens per second)

        total time =    2400.89 ms /   445 tokens
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU.

Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though; I didn't realize LM Studio was just llama.cpp under the hood. I have it running now, though decoding is happening on CPU torch because of venv issues; it's still running at about realtime. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces. It sounds great though, and I can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with Whisper, and then feed that in as a prompt? What a fascinating architecture.

You need to decode the tokens into audio. See `convert_to_audio` method in `decoder.py`

You can run `python gguf_orpheus.py --text "Hello, this is a test" --voice tara` and connect to the llama-server

See https://github.com/isaiahbjork/orpheus-tts-local

See my GH issue example output https://github.com/isaiahbjork/orpheus-tts-local/issues/15
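
If you'd rather skip the helper script and hit llama-server yourself, the raw flow is roughly this (the prompt string below is only a placeholder; gguf_orpheus.py builds the real template with the voice name and special markers):

  import requests

  # llama-server from the command above is listening on port 1234
  resp = requests.post(
      "http://localhost:1234/completion",
      json={"prompt": "tara: Hello, this is a test", "n_predict": 1024},
  )
  generated = resp.json()["content"]
  # 'generated' is plain text full of <custom_token_N> entries, which
  # decoder.py's convert_to_audio turns into playable audio.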


I'm always a bit skeptical of these demos, and indeed I think they didn't put much effort into getting the most out of ElevenLabs. In the demo, they used the Brian voice. For the first example, I can get this in ElevenLabs [1]. Stability was set to 20 here, and all the other settings were at their defaults. Having stability at the default of 50 sounds more like what's in the demo on the site [2].

Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but the pricing is also 400 times more expensive than the rest. You easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.

[1]: https://elevenlabs.io/app/share/3NyQKlL6EeOHpIDtL5pA

[2]: https://elevenlabs.io/app/share/TUx4yluXtV3pFTHr7Cl7

[3]: https://github.com/transformrs/trv

[4]: https://youtu.be/Ni-dKlCpnb4


Great demo on [4]; and content-wise, a good lesson I needed to be reminded of.

I was such a fan of CoquiTTS and so happy when they launched a commercially licensed offering. I didn't mind taking a small hit on quality if it enabled us to support them.

And then the quality of the API outputs was lower than what the self-hosted open-source Coqui model provided... I'm thinking this was one of the reasons usage was not at the level they hoped for, and they ended up folding.

The saddest part is they still didn't grant commercial rights to the open-source model, so I think Coqui is at a dead end now.


With [4], since you've told me it's AI, my brain can say that of course it's AI; but if you hadn't told me, I might have thought this guy just speaks like this, or is reading it in a monotonous-ish way (like reading from a script?) and wants to sound professional.

Crazy.

I still wish open source were better than ElevenLabs, but it's all just a dream.


It’s kind of like ChatGPT writing, where it can easily fool people who see it for the first time, but after a while you start to recognize the common patterns.

I'm looking forward to having an end-to-end "docker compose up" solution for self hosted chatgpt conversational voice mode. This is probably possible today, with enough glue code, but I haven't seen a neatly wrapped solution yet on par with ollama's.

You can glue it together with Home Assistant right now, but it's not a simple docker compose. Piper TTS and Kokoro are the main two voice engines people are using.

Orpheus would be great to get wired up. I'm wondering how well their smallest model will run and whether it will be fast enough for realtime.


With some tweaking, I was able to get the current 3B's "realtime" streaming demo running on my 12GB 4070 Super at BF16, with about a second of latency.

Open WebUI has this already. Works okay, though there's definitely a fair bit of latency.

Harbor is a great Docker bedrock for LLM tools; it has some TTS stuff, though I haven't tried them: https://github.com/av/harbor/wiki/1.1-Harbor-App#installatio...

Slightly less enthusiastic Californian - good - but the “British” voice feels cringe.

Aye. As a native Brit myself, I'm not entirely sure which region that accent is supposed to be from.

It's the vocal equivalent of a triple-jointed arm, or a horizon that's different on the left and right side of a portrait.


> even on an A100 40GB for the 3 billion parameter model

Would any of the models run on something like a raspberry pi?

How about a smartphone?


They're going to be releasing a few more smaller models, as small as 150M

That said, if you want something to use today on a Pi, you should check out Kokoro.


What kind of binary do you run Kokoro with for audio output?

I use sherpa-onnx, which is great because it also does Piper without any dependencies that recent python versions get angry about.

Is there some sort of better tutorial for sherpa-onnx? I tried looking into it but it seemed quite complex to get going, last I checked.

I am one of the authors of sherpa-onnx. Can you describe why you feel it is complex? If you use Python, all you need is to run pip install sherpa-onnx, then download a model and use the example code from the folder python-api-examples.
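
For reference, the offline TTS flow in that folder looks roughly like this (a sketch only; config fields differ per model family, e.g. VITS/Piper vs. Kokoro, so check the example that matches the model you download):

  import sherpa_onnx
  import soundfile as sf

  config = sherpa_onnx.OfflineTtsConfig(
      model=sherpa_onnx.OfflineTtsModelConfig(
          vits=sherpa_onnx.OfflineTtsVitsModelConfig(
              model="model.onnx",   # files from the downloaded model archive
              tokens="tokens.txt",
              lexicon="lexicon.txt",
          ),
          num_threads=2,
      ),
  )
  tts = sherpa_onnx.OfflineTts(config)
  audio = tts.generate("Hello from sherpa-onnx", sid=0, speed=1.0)
  sf.write("out.wav", audio.samples, samplerate=audio.sample_rate)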

Hi, I vouched for your comment since it was dead, presumably because yours is a new account. I'm looking to use it in a Flutter app, possibly with flutter_rust_bridge FFI if needed, so I was wondering how to do that, as well as where to get the models and how to use them. I didn't see any end-to-end example in the docs.

You can run it with Python or in the browser with WASM

Impressive for a small model.

Two questions / thoughts:

1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. Though, what's the business model?

2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)


I wonder how it can be Apache if it's based on Llama?

That's a good question - I was initially thinking that it was pretrained from scratch using the Llama arch, but https://github.com/canopyai/Orpheus-TTS/blob/main/pretrain/c... implies the use of 3.2 3B as a base.

Looks like only the code is Apache, not the weights:

> the code in this repo is Apache 2 now added, the model weights are the same as the Llama license as they are a derivative work.

https://github.com/canopyai/Orpheus-TTS/issues/33#issuecomme...


It sounds like reading from a script, or like an influencer. In that sense it's quite good: I could buy that this is human.

However, it's not a very good reading of the script, in human terms. It feels even more forced and phony than the aforementioned influencers.


Impressive for a small model, and I think it could be improved by fixing individual phrases that sound like they were recorded separately. With subtle differences in sound quality and no natural transitions between individual words, it fails to sound realistic. I think these should be fixable as we figure out how to fine-tune on (and thus normalize) recording characteristics.

A couple things I noticed:

- in the prompt "SO serious" it pronounces each letter as "ess oh" instead of emphasizing the word "so"

- there's no breathing sounds or natural breathing based pauses

Choosing which words in a sentence to emphasize can completely change the meaning of a sentence. This doesn't appear to be able to do that.

Still, huge progress over where we were just a couple years ago.


What is the difference between small and large models in the case of TTS?

For language models, I understand the thinking quality differs. But for TTS? Has anyone used small models in a production use case?


Nice, I’m particularly excited for the tiny models.

Having a NetNavi is gonna be possible at some point. This is nuts.




