
Show HN: Text-to-speech and speech-to-text open-source software stack - ftreml
https://github.com/codeforequity-at/botium-speech-processing
======
ftreml
This project is the result of a year-long learning process in speech
recognition and speech synthesis.

The original task was to automate the testing of a voice-enabled IVR system.
While we started with real audio recordings, it very soon became clear that
this approach is not feasible for a non-trivial app and that reaching
satisfying test coverage would be impossible. At the same time, we had to
find a way to transcribe the voice app responses to text for our automated
assertions.

As cloud-based solutions were not an option (company policy), we very quickly
got frustrated, as there was no "get shit done" Open Source stack available
for doing medium-quality text-to-speech and speech-to-text conversions. We
learned how to train and use Kaldi, which according to some benchmarks is the
best available system out there, but which mainly targets academic users and
researchers. We made the heavyweight MaryTTS work to synthesize speech of
reasonable quality.

And finally, we packaged all of this in a DevOps-friendly HTTP/JSON API with a
Swagger definition.
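
For illustration, here is a minimal Python sketch of how the API is meant to
be called; the host/port and endpoint paths shown here are placeholders, so
check the Swagger definition for the authoritative routes:

    import requests

    BASE = "http://127.0.0.1"  # placeholder; use your deployment's host/port

    # Text-to-speech: request a WAV file for a given text
    tts = requests.get(BASE + "/api/tts/en", params={"text": "hello world"})
    with open("hello.wav", "wb") as f:
        f.write(tts.content)

    # Speech-to-text: post the audio back, the transcript comes back as JSON
    with open("hello.wav", "rb") as f:
        stt = requests.post(BASE + "/api/stt/en",
                            headers={"Content-Type": "audio/wav"},
                            data=f)
    print(stt.json())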

As always, feedback and contributions are welcome!

~~~
ayumu722
I've built a speech synthesis system with MaryTTS before. It works, but
unfortunately the quality is not very good; HMM and unit-selection speech
synthesis are very old approaches by now, far from the current state of the
art. You should try open-source implementations of Tacotron 2 or WaveNet; you
will surely achieve better quality.

~~~
ftreml
Are you ready for a pull request?

~~~
ayumu722
I'd be very pleased, but I can't promise anything. Good job, by the way;
speech synthesis and speech recognition are not easy subjects to work on.

------
pouta
I built something quite similar for my own product. Is there any interest in
adding more STT/TTS backends to the software? Think services like Lyrebird or
Trint.

I could contribute towards it since I have done it before.

Thank you for building this!

~~~
ftreml
Yes, absolutely! It should be a good mix of freely available packages, with a
meaningful default configuration.

------
hajimemash
Here's a sample WAV output from using their Swagger endpoint:
[https://drive.google.com/file/d/15y83NSXOCrEW9v9eQVCy6oHcWJ8...](https://drive.google.com/file/d/15y83NSXOCrEW9v9eQVCy6oHcWJ8DXGE0/view?usp=sharing)

Why does the voice/pronunciation have such drastic volume spikes and dips?

~~~
DonHopkins
"Now let's have a little taste of that old computer generated swagger." -Max
Headroom

[https://www.youtube.com/watch?v=WTN1WsUCyQc&t=3m26s](https://www.youtube.com/watch?v=WTN1WsUCyQc&t=3m26s)

------
sandreas
Could you explain what the difference is to

\- [https://github.com/gooofy/zamia-speech#asr-models](https://github.com/gooofy/zamia-speech#asr-models)

\- [https://github.com/mpuels/docker-py-kaldi-asr-and-model](https://github.com/mpuels/docker-py-kaldi-asr-and-model)

with regard to speech recognition, apart from the fact that it's easier to
use?

~~~
ftreml
zamia-speech: ASR training scripts for research purposes, plus several
pre-trained ASR models for download, based on VoxForge data. zamia-speech is
the training part (very hard in terms of know-how, hardware and software
requirements) that projects like Botium Speech Processing are based upon.

The other one is an example of packaging Kaldi in a Docker container.

~~~
sandreas
Thank you :-) I tried to get in touch with

[https://www.vorleser.net/](https://www.vorleser.net/)

in the past about providing speech training data, but they were not really
interested. I thought it would be great to improve STT with a really good and
HUGE set of German audiobooks with matching text that is publicly
available... unfortunately I had no success trying to script something for
this purpose (mainly for lack of time).

~~~
ftreml
I'll point you to this article: [https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf](https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf)

It basically describes the thing you mentioned: matching freely available
audiobooks with the source text and using some tools to preprocess the data
so it is suitable for ASR training (alignment, splitting).
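
As a rough illustration of the alignment step, here is a sketch using the
aeneas forced aligner, following its documented Python API. The file paths
are placeholders, and aeneas is just one possible tool for this, not
necessarily the exact one from the article:

    from aeneas.executetask import ExecuteTask
    from aeneas.task import Task

    # Align one audiobook chapter with its text; the output is a sync map
    # (per-sentence timestamps) that can later be used to split the audio.
    config = u"task_language=deu|is_text_type=plain|os_task_file_format=json"
    task = Task(config_string=config)
    task.audio_file_path_absolute = "/data/chapter01.mp3"   # placeholder
    task.text_file_path_absolute = "/data/chapter01.txt"    # placeholder
    task.sync_map_file_path_absolute = "/data/chapter01.json"
    ExecuteTask(task).execute()
    task.output_sync_map_file()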

------
hardwaresofton
Can anyone in the space expand on why it's increasingly rare to see people
using/building on Sphinx[0]? Do people avoid it simply because of an
impression that it won't be good enough compared to deep learning driven
approaches?

[0]: [https://cmusphinx.github.io/](https://cmusphinx.github.io/)

~~~
shakna
I've avoided Sphinx after trying to use it, because:

1\. Compiling it is hit and miss. Sometimes it works, sometimes it doesn't.
There is no official package in any Linux distribution, so packaging anything
with it is incredibly painful. There's no easy way to cross-compile your
project, so you'll end up working around the build process.

2\. The documentation is woeful.

> Recent CMUSphinx code has noise cancellation featur. In
> sphinxbase/pocketsphinx/sphinxtrain it’s ‘remove_noise’ option. In sphinx4
> it’s Denoise frontend component. So if you are using latest version you
> should be robust to noise in some degree already. Most modern models are
> trained with noise cancellation already, if you have your own model you need
> to retrain it.

> The algorithm impelmented is spectral subtraction in on mel filterbank.
> There are more advanced algorithms for sure, if needed you can extend the
> current implementation.

Inconsistent methods across the codebases prevent you from knowing where to
look, and when something is documented, it may involve spelling errors you
have to guess around (like above). There are also plenty of vague references
to other documents that may or may not even exist.

------
tianshuo
Is this using Google's Tacotron 2 or WaveNet anywhere? How does this compare
to them?

------
CommanderData
Any recommendations for a real-time solution?

I maintain a platform featuring live video events that we'd like to add
captioning to, and so far I can only see IBM Watson providing a WebSocket
interface for near-real-time STT.

~~~
ftreml
The project includes a WebSocket endpoint for realtime decoding; I will add
it to the docs.

We are already using it for a call center with around 50 parallel audio
streams.
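
Until it lands in the docs, a hypothetical Python client sketch, just to show
the general shape. The endpoint URL and the message protocol here are
assumptions, not the documented API:

    import asyncio
    import websockets  # pip install websockets

    async def stream(path):
        # assumed endpoint; check the docs once they are updated
        async with websockets.connect("ws://127.0.0.1/api/stt/en") as ws:
            with open(path, "rb") as f:
                # ~100 ms of 16 kHz / 16 bit mono audio per frame
                while chunk := f.read(3200):
                    await ws.send(chunk)
            print(await ws.recv())  # assumed: transcript as a text message

    asyncio.run(stream("call.raw"))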

~~~
ftreml
Forgot to mention: for realtime parallel processing, the default
configuration of this project is not a feasible setup. You have to run way
more decoder workers, possibly distributed across several machines.

------
briga
Is MaryTTS still as good as it gets for free TTS? I've been researching this
topic and it seems like there are some open-source implementations of
Tacotron, but the quality isn't necessarily great.

~~~
lunixbochs
The NVIDIA Tacotron implementation was much better out of the box than the
WAV in the neighboring thread.

------
tomcam
That's fantastic work and the demo is very well done. Thanks for sharing it.
You obviously put a lot of hard work into it. Feels super polished.

------
monkpit
What exactly does “low-key” mean in this context?

~~~
ftreml
It means: easy to install, easy to use, medium performance, no further
know-how needed. Compared to the total effort of selecting, training and
deploying speech recognition and speech synthesis, it provides an extremely
quick boilerplate to add voice to your pipeline. I wish there had been
something like that when we started the project.

------
bobmaxup
If MaryTTS is so good, why are many Linux distros still using
[https://en.wikipedia.org/wiki/Festival_Speech_Synthesis_System](https://en.wikipedia.org/wiki/Festival_Speech_Synthesis_System)
as the default TTS system?

~~~
ftreml
just a guess: marytts is rather heavy weight

------
polishdude20
So why is 40 gigs of free space needed?

~~~
ftreml
Speech recognition model files are quite big.

~~~
polishdude20
But 40 GB? I feel like that includes training data or something. A model
can't just be 40 GB, or else all of the audio would have to be passed through
all 40 GB of the model during inference. That seems huge.

~~~
ftreml
40 GB is maybe too much, but when building the Docker images some space is
wasted. The image size after building is around 20 GB (12 GB MaryTTS, 3 GB
Kaldi DE, 6 GB Kaldi EN).

------
mariushn
Would love to see a live demo. MaryTTS demo link is broken.

~~~
ftreml
See here: [https://speech.botiumbox.com](https://speech.botiumbox.com)

~~~
mariushn
Thanks. Hard to use and poor result :(

Good example: [https://cloud.google.com/text-to-speech/](https://cloud.google.com/text-to-speech/)

~~~
ftreml
It's an API, of course it is hard to use without any real user interface ...
but as an STT/TTS API, it won't get more easy than that ...

Of course it is not a competitor to Google in any sense.

------
dmos62
I'd like to have my laptop read out epubs or articles. Recommendations for
speech synthesis (TTS) on the command line?

~~~
ftreml
PicoTTS is a command-line tool; MaryTTS is a client/server tool. (Both are
included in Botium Speech Processing and callable with curl.)

For high quality, there are Google Cloud Speech and Amazon Polly.
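
If you want to script it, a minimal Python sketch wrapping the pico2wave CLI
(assuming picotts/libttspico-utils and an audio player such as aplay are
installed):

    import subprocess

    text = open("article.txt", encoding="utf-8").read()
    # pico2wave only accepts fairly short inputs; long texts (epubs)
    # should be split into chunks first.
    subprocess.run(["pico2wave", "--wave=out.wav", "--lang=en-US", text],
                   check=True)
    subprocess.run(["aplay", "out.wav"], check=True)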

~~~
dmos62
picotts looks cool, but it's not maintained (maybe it's perfect already?).
nanotts is a fork of picotts with improvements to its CLI interface, but it's
not very maintained either (I was not able to compile it due to it expecting
ALSA; somehow the -noalsa switch didn't help). I also discovered gtts (and
the simpler google_speech), both available through pip. They're interfaces to
Google's APIs and have routines for handling large texts properly.

------
grizzles
Facebook has released wav2letter++. I'd wager that it will outperform Kaldi
by a wide margin.

~~~
Nimitz14
It is very unlikely that an E2E toolkit can outperform Kaldi if they're
trained on the same data.

Also, I suspect these guys aren't using the latest Kaldi architectures.

~~~
grizzles
What's your reasoning? Google's latest published research takes a similarly
parsimonious approach: they are doing decoding using a simple beam search
through a single neural network. See
[https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html](https://ai.googleblog.com/2019/03/an-all-neural-on-device-speech.html)
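
To be clear about what I mean by "simple": a toy sketch of beam search
through a single end-to-end model. The model object and its step() method are
hypothetical, standing in for whatever network produces per-token scores:

    def beam_search(model, frames, beam_width=4):
        beams = [([], 0.0)]  # (token sequence, cumulative log-prob)
        for frame in frames:
            candidates = []
            for tokens, score in beams:
                # hypothetical: per-token log-probs for the next step
                for token, logp in model.step(frame, tokens):
                    candidates.append((tokens + [token], score + logp))
            # keep only the best few hypotheses
            beams = sorted(candidates, key=lambda b: b[1],
                           reverse=True)[:beam_width]
        return beams[0][0]  # best-scoring hypothesis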

~~~
ftreml
very very interesting, didnt know this one. as soon as i am having some hours
left will try to run some evaluation on this. after all, for my project only
performance in german language counts

------
monkeydust
Are there any performance metrics of this versus other offline and cloud based
services?

~~~
ftreml
yes there are plenty of them. just google for something like "kaldi vs
google".

in short: not surprising the blockbuster cloud services provide better results
as they have way more training data. tradeoff between price, privacy, quality.

------
ajaviaad
Which languages are supported?

~~~
ftreml
currently included german and english. contributions for other languages
welcome, native speakers will have better insights into quality of speech
output and recognition model

------
z3t4
Would be cool with a web demo.

~~~
ftreml
[https://speech.botiumbox.com](https://speech.botiumbox.com)

Just a small server; I hope it won't crash from posting the link here.

~~~
z3t4
Cool! Thanks! I've tried both wav2letter and DeepSpeech, and now this, but I
get very poor results even with short sentences (compared to Google's
proprietary services). Would it be possible to also make an API for passing
in training data and automatically updating the model? I'm thinking that the
results might get better if the models are trained with the specific
audio/hardware/settings and dialect of the end user.

~~~
magicalhippo
I just played with DeepSpeech (v0.6.1) and found significant improvements
from using a custom language model.

The language model is built from sentences and is rather quick to build
(~seconds), at least for the small number of sentences I used. This can then
be combined with the pre-trained neural net.
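
For reference, a minimal sketch of loading a custom LM with the v0.6 Python
API. File names, the beam width and the alpha/beta weights are placeholders;
the lm.binary/trie pair is what comes out of the KenLM + DeepSpeech tooling:

    import wave
    import numpy as np
    import deepspeech

    model = deepspeech.Model("output_graph.pbmm", 500)  # 500 = beam width
    # plug in the custom language model built from your own sentences
    model.enableDecoderWithLM("lm.binary", "trie", 0.75, 1.85)

    w = wave.open("test.wav")  # expects 16 kHz, 16 bit, mono
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    print(model.stt(audio))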

Though I hear DeepSpeech is currently fairly US-influenced when it comes to
recognizing accents. So if you're not a native speaker, consider contributing
to the open-source dataset over at
[https://voice.mozilla.org/](https://voice.mozilla.org/)

~~~
ftreml
its always an issue with domain specific utterances. for the freely available
data they are oov and have to be handled somehow (fe generate pronounciation
with seqitur)
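
For illustration, a sketch of generating pronunciations for a list of OOV
words with a pre-trained Sequitur G2P model; the model file name and the
example words are placeholders:

    import subprocess

    with open("oov_words.txt", "w", encoding="utf-8") as f:
        f.write("botium\nchatbot\n")  # domain-specific words, example only

    result = subprocess.run(
        ["g2p.py", "--model", "g2p-model-6", "--apply", "oov_words.txt"],
        capture_output=True, text=True, check=True)
    # one "word <tab> phonemes" line per word, ready to append to the lexicon
    print(result.stdout)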

