Orpheus is a Llama model trained to understand/emit audio tokens (from SNAC). Those tokens are just added to its tokenizer as extra tokens.
Like most other tokens, they have text reprs: '<custom_token_28631>' etc. You sample 7 of them (1 frame), parse out the ids, pass them through the SNAC decoder, and you now have a frame of audio from a 'text' pipeline.
The neat thing about this design is you can throw the model into any existing text-text pipeline and it just works.
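A rough sketch of that decode step, for the curious. The "- 10" offset, the per-position 4096 stride, and the 7-token-to-3-codebook split are recalled from the reference decoder (`gguf_orpheus.py`), so treat the constants as illustrative rather than authoritative:

```python
# Sketch: pull audio-token numbers out of the generated text, regroup each
# 7-token frame into SNAC's three codebooks, and decode to a waveform.
# Offsets/layout follow my reading of the reference decoder -- illustrative only.
import re
import torch
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()

def tokens_to_audio(token_text):
    nums = [int(m) - 10 for m in re.findall(r"<custom_token_(\d+)>", token_text)]
    nums = nums[: len(nums) - len(nums) % 7]          # keep whole frames only
    c0, c1, c2 = [], [], []
    for i in range(0, len(nums), 7):
        f = [t - j * 4096 for j, t in enumerate(nums[i:i + 7])]  # undo per-position offset
        c0.append(f[0])                                # codebook 0: 1 code per frame
        c1.extend([f[1], f[4]])                        # codebook 1: 2 codes per frame
        c2.extend([f[2], f[3], f[5], f[6]])            # codebook 2: 4 codes per frame
    codes = [torch.tensor(c, dtype=torch.long).unsqueeze(0) for c in (c0, c1, c2)]
    with torch.no_grad():
        return snac_model.decode(codes)                # [1, 1, samples] at 24 kHz

# usage (hypothetical): audio = tokens_to_audio(model_output_text)
```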
If you run the `gguf_orpheus.py` file in that repository, it will capture the audio tokens and convert them to a .wav file. With a little more work, you can play the audio as it streams using `sounddevice` and its `OutputStream`.
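Something like this works for the playback side; a minimal sketch, assuming the SNAC decoder hands you 24 kHz mono float32 numpy chunks:

```python
# Minimal streaming playback with sounddevice. Assumes chunks are 1-D
# float32 numpy arrays in [-1, 1], 24 kHz mono, e.g. decoded SNAC frames.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 24000  # SNAC 24 kHz model

def play_stream(chunks):
    with sd.OutputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
        for chunk in chunks:
            stream.write(np.ascontiguousarray(chunk, dtype=np.float32))
```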
On a Nvidia 4090, it's producing:
prompt eval time = 17.93 ms / 24 tokens ( 0.75 ms per token, 1338.39 tokens per second)
eval time = 2382.95 ms / 421 tokens ( 5.66 ms per token, 176.67 tokens per second)
total time = 2400.89 ms / 445 tokens
*A correction to the llama.cpp server command above: there are 29 layers, so it should read "-ngl 29" to load all the layers onto the GPU.
Is there any reason not to just use `-ngl 999` to avoid that error? Thanks for the help though, I didn't realize LM Studio was just llama.cpp under the hood. I have it running now, though decoding is happening on CPU Torch because of venv issues; still running at about realtime though. I'm interested in making a full-fat GGUF to see what sort of degradation the quant introduces. Sounds great though, can't wait to try finetuning and messing with the pretrained model. Have you tried it? I guess you just tokenize the voice with SNAC, transcribe it with Whisper, and then feed that in as a prompt? What a fascinating architecture.
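If that guess is roughly right, the data-prep side might look something like the sketch below: transcribe a reference clip with Whisper and tokenize its audio with SNAC. How Orpheus actually expects those pieces combined into a prompt isn't shown here and would need to come from the repo; model names are just examples.

```python
# Hedged sketch of the guessed prep step: Whisper for the transcript,
# SNAC for the audio codes. Prompt assembly for Orpheus is not covered.
import torch
import torchaudio
import whisper
from snac import SNAC

snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
asr = whisper.load_model("base")

def prep_reference(wav_path):
    text = asr.transcribe(wav_path)["text"].strip()

    audio, sr = torchaudio.load(wav_path)             # [channels, samples]
    audio = audio.mean(dim=0, keepdim=True)           # downmix to mono
    audio = torchaudio.functional.resample(audio, sr, 24000)
    with torch.no_grad():
        codes = snac_model.encode(audio.unsqueeze(0)) # list of 3 code tensors
    return text, codes
```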
I'm always a bit skeptical of these demos, and indeed I think they didn't put much effort into getting the most out of ElevenLabs. In the demo, they used the Brian voice. For the first example, I can get this in ElevenLabs [1]. Stability was set to 20 here and all the other settings were at their defaults. Having stability at the default of 50 sounds more like what is in the demo on the site [2].
Having said that, I'm fully in favor of open source and am a big proponent of open source models like this. ElevenLabs in particular has the highest quality (I tested a lot of models for a tool I'm building [3]), but the pricing is also 400 times more expensive than the rest. You easily pay multiple dollars per minute of text-to-speech generation. For people interested, the best audio quality I could get so far is [4]. Someone told me he wouldn't be able to tell that the voice was not real.
Great demo on #4; and content-wise, a good lesson I needed to be reminded of.
I was such a fan of CoquiTTS and so happy when they launched a commercially licensed offering. I didn't mind taking a small hit on quality if it enabled us to support them.
And then the quality of the API outputs was lower than what the self-hosted open-source Coqui model provided... I'm thinking this was one of the reasons usage was not at the level they hoped for, and they ended up folding.
The saddest part is they still didn't assign commercial rights to the open-source model, so I think Coqui is at a dead end now.
With [4], since you've told me it's AI, my brain can say that of course it's AI, but if you hadn't told me, I might have thought that maybe this guy just speaks like this, or is reading it in a monotonous-ish way (like reading from a script?) because he wants to sound professional.
Crazy.
Though I still wish open source would get better than ElevenLabs, but it's all just a dream.
It’s kind of like ChatGPT writing, where it can easily fool people who see it for the first time, but after a while you start to recognize the common patterns.
I'm looking forward to having an end-to-end "docker compose up" solution for a self-hosted ChatGPT conversational voice mode. This is probably possible today, with enough glue code, but I haven't seen a neatly wrapped solution on par with Ollama's yet.
You can glue it together with Home Assistant right now, but it's not a simple docker compose. Piper TTS and Kokoro are the two main voice engines people are using.
Orpheus would be great to get wired up. I'm wondering how well their smallest model will run and whether it will be fast enough for realtime.
With some tweaking I was able to get the current 3B model's "realtime" streaming demo running at BF16 on my 12 GB 4070 Super with about a second of latency.
I am one of the authors of sherpa-onnx. Can you describe why you feel it is complex? If you use Python, all you need is to run pip install sherpa-onnx, then download a model and use the example code from the python-api-examples folder.
Hi, I vouched for your comment since it was dead, presumably because yours is a new account. I'm looking to use it in a Flutter app, possibly with flutter_rust_bridge FFI if needed, so I was wondering how to do that, as well as where to get the models and how to use them. I didn't see any end-to-end example in the docs.
1. I stumbled for a while looking for the license on your website before finding the Apache 2.0 mark on the Hugging Face model. That's big! Advertising that on your website and the GitHub repo would be nice. Though what's the business model?
2. Given the Llama 3 backbone, what's the lift to make this runnable in other languages and inference frameworks? (Specifically asking about MLX, but also llama.cpp, Ollama, etc.)
Impressive for a small model, though I think it could be improved by fixing how individual phrases sound like they were recorded separately. With subtle differences in sound quality and no natural transitions between individual words, it fails to sound realistic. I think these should be fixable as we figure out how to fine-tune on (and thus normalize) recording characteristics.
To run the llama.cpp server: `llama-server -m C:\orpheus-3b-0.1-ft-q4_k_m.gguf -c 8192 -ngl 28 --host 0.0.0.0 --port 1234 --cache-type-k q8_0 --cache-type-v q8_0 -fa --mlock`
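For reference, here is a minimal way to stream tokens back from that server; a sketch assuming llama.cpp's standard `/completion` SSE endpoint on the port used above. The Orpheus prompt template and audio-token handling should follow `gguf_orpheus.py`.

```python
# Sketch: stream generated tokens from the llama.cpp server started above.
import json
import requests

def stream_tokens(prompt, url="http://127.0.0.1:1234/completion"):
    payload = {"prompt": prompt, "n_predict": 1024, "stream": True}
    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if line.startswith(b"data: "):
                chunk = json.loads(line[len(b"data: "):])
                yield chunk.get("content", "")  # contains the <custom_token_N> text
```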