This model seems strongly overtrained on the CV test set. Usually the improvement from LM rescoring is just 10% relative. In the paper https://arxiv.org/pdf/2206.12693.pdf the improvement is from 10.1% WER to 3.64% WER (Table 6). Such a big improvement suggests that the LM is biased.
Also, the perplexity of the provided n-gram LM on the CV test set is just 86, and most of the 5-gram histories are already in the LM. This also suggests bias.
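For readers wondering how perplexity is measured: it's 10 to the power of the negative average log10 probability that the LM assigns to each word of the test text; lower means the text is less surprising to the model. A rough sketch using KenLM's C++ API (the same KenLM this tool already depends on; "de.arpa" is a placeholder path, and sentence-boundary scoring is omitted for brevity):

    #include "lm/model.hh"  // KenLM
    #include <cmath>
    #include <iostream>
    #include <string>

    int main() {
      using namespace lm::ngram;
      Model model("de.arpa");  // placeholder path to the n-gram LM
      State state(model.BeginSentenceState()), out_state;
      const Vocabulary &vocab = model.GetVocabulary();
      std::string word;
      double total_log10 = 0.0;
      long n = 0;
      while (std::cin >> word) {  // whitespace-tokenized test text on stdin
        total_log10 += model.Score(state, vocab.Index(word), out_state);
        state = out_state;
        ++n;
      }
      // KenLM scores are log10 probabilities, so perplexity = 10^(-average).
      // (Scoring of the sentence-end token </s> is omitted here for brevity.)
      std::cout << "perplexity: " << std::pow(10.0, -total_log10 / n) << "\n";
    }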
Also in Table 6, you can see that Facebook's wav2vec 2.0 XLS-R went from 12.06% WER without an LM to 4.38% with a 5-gram LM. In comparison to that, I found TEVR going from 10.10% to 3.64% unproblematic. The core assumption of my paper is that for German specifically, the language model is very important due to conserved (and usually mumbled) word endings.
Anyway, it's roughly a 64% relative reduction for both wav2vec2 XLS-R and TEVR. So if your criticism that I overtrained the TEVR model turns out to be correct, then that would suggest that the Zimmermeister 2022 wav2vec2 XLS-R was equally overtrained, which would still make it a fair comparison w.r.t. the 16% relative improvement in WER.
Or are you suggesting that all wav2vec2-derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.
Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)
BTW, regardless of the metrics, this is the model that "works for me" in production.
BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.
First of all thank you for your nice research! It is really inspiring.
> Also in Table 6, you see that Facebook's wav2vec 2.0 XLS-R went from 12.06% without LM to 4.38% with 5-gram LM.
It is probably Jonatas Grosman's model, not Facebook's. Bias is a common sin for Common Voice trainers, partially because they integrate Gutenberg texts into the LM, and partially because for some languages CV sentences intersect between train and test.
The improvement from LM is from 6.68 to 6.03 as expected.
> Zimmermeister 2022 wav2vec2 XLS-R was equally over-trained
Yes
> Or are you suggesting that all wav2vec2 -derived AI models are strongly overtrained for CommonVoice? Because they seem to do very well on LibriSpeech and GigaSpeech, too.
Not all the models are overtrained; I mainly complain about the German ones. For example, Spanish is reasonable:
> Could you explain what you mean by "perplexity" here? Can you recommend a paper about it? I haven't read about that in any of the ASR papers I studied, so this sounds like an exciting new technique for me to learn :)
> BTW, regardless of the metrics, this is the model that "works for me" in production.
Sure, but it could work even better if you take a more generic model.
> BTW, BTW, it would be really helpful for research if Vosk could also publish a paper. As you can see, PapersWithCode.com currently doesn't list any Vosk WERs for CommonVoice German, despite the website reporting 11.99% for vosk-model-de-0.21.
I've now had time to do some testing, and the CER is already pretty much excellent for TEVR even without the language model, so it appears to me that what the LM mostly does is fix the spelling. In line with that, recognition performance is still good for medical words, but in some cases the LM will actually reduce quality there by "fixing" a brand name into a sequence of regular words.
Thanks for the perplexity paper :) I'll go read that now.
Sounds interesting, although as someone not that deeply into ML, these terms don't mean a lot to me. What would "bias" mean in this case? That the model would recognize a "standard German" speaker, but not someone from Bavaria? Because that happens to a lot of (non-Bavarian) Germans, too.
The model knows the recognition text well and demonstrates good results because of it. If you test the same model on some unrelated speech which the model hasn't seen yet, the results will not be that great. The error rate might be significantly worse than that of other systems.
So, what caught my interest was this quote at the bottom:
"Alternatively, you can donate roughly 2 weeks of A100 GPU credits to me and I'll train a suitable recognition model and upload it to HuggingFace."
This takes 2 weeks of A100 GPU time to train? Current Google Cloud spot pricing puts that at about $300.
Yes, but only for me in this specific case. That's because this German model was derived from a model which was pre-trained for months on hundreds of thousands of hours of YouTube audio in various languages. So if I now train an English recognizer, I don't start from scratch. Instead, I start from a checkpoint which can already almost flawlessly recognize any human utterance. (Or all phonemes in the IPA alphabet, to be more precise.)
The "training" then only learns the mapping between IPA phonemes and text notation.
For Romanian, I believe someone would first have to collect a large dataset of speech recordings together with ground-truth text. Even if you find cheap narrators working for $10 per hour, that's still $100k for 10k hours of data.
This looks pretty cool, especially since it is offline and free! The only caveat is that it is probably not feasible to train yourself, right?
I should play with this if I have some time. I've had the idea for a while to build a voice assistant which can switch modes or datasets while you are speaking. If you say "computer, play ...." for example, it would load a recognizer that is specialized on song names. The idea is that you can mix English song names into a German prompt, and it will not be confused. Every voice assistant I know gets confused, presumably because they convert speech to plain text and only then act on the text.
> I've had the idea for a while to build a voice assistant which can switch modes or datasets while you are speaking. If you say "computer, play ...." for example, it would load a recognizer that is specialized on song names.
I would say by now the generic recognizers are so good that this is becoming less and less useful. For example, this tool handles non-existing German words quite well.
That said, the tool has a "--data_folder_path" parameter where you can specify a different acoustic and language model.
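For example, with a hypothetical alternative model directory:

    tevr_asr_tool --target_file=test_audio.wav --data_folder_path=/path/to/other/model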
BTW, I also want to build an offline voice assistant :)
That's how I got started on this journey. You might be interested in my next project, where I try to do offline real-time English recognition with a WebRTC API to make it easy for developers to connect my AI module with their own task logic. Here's the waiting list: https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5...
No. The best you can do with this AI architecture would be something like NVIDIA's NeMo, meaning you group the input audio stream into 2s blocks with 1s overlap and then run the speech recognition on that.
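To make that concrete, here's a minimal sketch of such chunking, assuming 16 kHz mono float samples; run_recognizer is a hypothetical stand-in for invoking the model on one block:

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Hypothetical stand-in for running the acoustic model on one block.
    void run_recognizer(const float* samples, std::size_t count) {
        std::printf("recognizing a block of %zu samples\n", count);
    }

    // Feed the audio stream to the recognizer in 2 s blocks with 1 s overlap,
    // so every word is fully contained in at least one block.
    void stream_in_blocks(const std::vector<float>& stream, std::size_t sample_rate) {
        const std::size_t block = 2 * sample_rate;  // 2 s window
        const std::size_t hop   = 1 * sample_rate;  // advance by 1 s => 1 s overlap
        for (std::size_t start = 0; start + block <= stream.size(); start += hop) {
            run_recognizer(stream.data() + start, block);
        }
    }

The overlapping transcripts from neighboring blocks would then still need to be merged, which is part of why true streaming is harder.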
If you want real-time speech recognition with less than 0.5s of delay between speaking a word and it being fully recognized, then one needs to implement a different architecture. And that one is much more difficult and expensive to train than this one (which was already expensive).
That said, I want a fully offline and privacy-respecting voice assistant myself.
So attempting to build the AI for real-time streamed live English speech recognition will be my next project. I plan to ship it as an OpenGL-accelerated binary with WebRTC server so that others can easily combine my recognition with their logic. But it probably won't be free since I'm looking at >$100k in compute costs to build it. In any case, here's the waiting list: https://madmimi.com/signups/f0da3b13840d40ce9e061cafea6280d5...
Yeah, I'm looking into government programs such as the EU "Prototype Fund", too. But the issue with crowdfunding is that if I want to raise $100k on Kickstarter, I need to spend $10k for an agency and another $20k for ads to promote the campaign. So it's quite wasteful (30% just for marketing) unless you already have a large audience willing to pay, which I don't have.
So I believe my best bet might be to partner up with a larger company who will pay for development and/or just charging users for a license. Nuance's Dragon Home is $200 and their Pro version is $500, so there's a lot of room for me to be cheaper while still reaching $100k in revenue with a realistic number of users.
Thanks for the information! I've been looking to build an accessibility tool for a deaf community so that they can see live captions of conversations, but some of the existing solutions I've tried seem to lag behind in conversational speech accuracy, or they're difficult/impossible to fine-tune with the community-specific words and phrases.
This type of "Digital Therapeutics" might be paid for by German health insurance companies. For example, https://gaia-group.com/en/ appears to be a successful provider of medical apps.
If you don't mind, please email me at moin@deutscheki.de and explain in a bit more detail what the needs of that deaf community are. Maybe I can forward that to the right people to get my government to pay for the app that you wish to see developed.
I mean, I agree with you: it certainly would increase life satisfaction for deaf people if they could "listen in" on conversations to know what others are gossiping about.
Original poster here, I just wanted to say thank you!
Despite some snark - which I totally deserve, but I can't change the title anymore - I also received some very helpful advice, learned something new, and got introduced to people who plan to use this technology to help others. I can't think of any better outcome for me publishing my research.
People just get confused by large numbers of similar syllables because we have to buffer them for more processing. I suspect a pure speech-to-text model doesn't need to worry so much about context and can just take the syllables one by one.
Actually, this model uses attention layers which are kind of like a query/key=>value look-up and allow the model to merge the knowledge from neighboring time-steps into the current logit prediction.
The result is that this model performs worse for highly repetitive words, just like humans do.
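For the curious, here's a toy single-query version of that query/key => value look-up (plain scaled dot-product attention, not the model's actual code):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Single-query scaled dot-product attention over T time-steps.
    // q has d entries; keys and values are T x d matrices, flattened row-major.
    std::vector<float> attend(const std::vector<float>& q,
                              const std::vector<float>& keys,
                              const std::vector<float>& values,
                              std::size_t T, std::size_t d) {
        // Score each time-step by how well its key matches the query.
        std::vector<float> scores(T);
        for (std::size_t t = 0; t < T; ++t) {
            float dot = 0.0f;
            for (std::size_t i = 0; i < d; ++i) dot += q[i] * keys[t * d + i];
            scores[t] = dot / std::sqrt(static_cast<float>(d));
        }
        // Softmax turns the match scores into mixture weights.
        float max_s = *std::max_element(scores.begin(), scores.end());
        float sum = 0.0f;
        for (float& s : scores) { s = std::exp(s - max_s); sum += s; }
        for (float& s : scores) s /= sum;
        // Output is the weighted sum of the neighboring time-steps' values.
        std::vector<float> out(d, 0.0f);
        for (std::size_t t = 0; t < T; ++t)
            for (std::size_t i = 0; i < d; ++i) out[i] += scores[t] * values[t * d + i];
        return out;
    }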
Training this pipeline was already quite expensive, so I compared against all models and papers I could find online, but I couldn't afford to train a full new model just to check wav2vec2 with BPE.
That said, I did check against exhaustively allowing all 1-4 character tokens, which is pretty similar to BPE, and that performed worse in every situation.
Without even actually retraining everything, I would be curious how different the tokens are with your technique compared to using an off-the-shelf solution like SentencePiece with the same output vocabulary size.
We tested it on CommonVoice German, which is what Mozilla used for DeepSpeech German, too. The idea behind that dataset is that arbitrary people on the internet submit their recordings (hence the "common" in the name) and then if enough other people upvote it as "understandable", it gets included in the dataset.
As such, the AI works well with a variety of accents.
I suspect that "works well" means that the model will output words in "official german" and kind of corrects pronounciation errors? I am asking because I had the use case to automatically give feedback to non native german speakers.
In my opinion, the focus here should be on the fact that this is a state-of-the-art AI which beats Facebook's wav2vec2 by a relative 16% improvement, Scribosermo (based on NVIDIA NeMo) by a relative 44%, and Mozilla's DeepSpeech German by a relative 75% improvement. People usually don't share their production-quality tools ;)
That said, I wrote "284 lines of C++" to indicate that this is compact enough for people to actually read and understand the source code. Also, compiling my implementation is super easy and straightforward ... something which can't be said for Kaldi, Vosk, or DeepSpeech.
If you try to read the CTC beam search decoder from Mozilla's DeepSpeech [1], that alone is about 2000 LOC in multiple files. If you try to read the pyctcdecode source that is used by HuggingFace [2], that's 1000+ LOC of Python.
But this implementation is all the client-side, i.e. the entire "native_client" folder hierarchy in DeepSpeech [3], narrowed down to a mere 284 lines.
Also, both DeepSpeech and HuggingFace Transformers use TensorFlow as a dependency, just like my tool does. So in my opinion, it doesn't make sense to include TF in the LOC comparison when all the AI speech recognition systems use it. That would be like including libstdc++, too.
"16% better than wav2vec2, 44% better than Scribosermo, 75% better than DeepSpeech" was more than enough for a good headline. Of course everyone was going to get sand in their panties over "284 lines of C++", and now it's time to pay the HN pedantics piper.
I wanted to specifically highlight that people can (and should) read the source code. I now see that this might have been a mistake. But I was hoping to share the joy of taking a cool tool and looking under the hood.
You cannot win. The top comments on HN are _always_ pedantry. These people will always find something to complain about or nitpick while completely missing the forest for the trees.
By what definition, principle or authority do you determine what is AI?
Don't take this the wrong way, but I find that people with more knowledge of the subject tend to be more open about what they include, whereas people with less knowledge tend to do more gatekeeping. AI has a "moving goal post" issue that is notable enough to warrant a Wikipedia page:
https://en.wikipedia.org/wiki/AI_effect
Touting the "low number of lines" on a neural network project seems kind of silly, since the logic is encoded in the weights. Kind of like if I said "Doom in 5 lines of JS", but those 5 lines just downloaded and ran 130kb of WASM
There are so many comments about the "real" length of the program.
The number 284 means something to people who work in speech recognition - these are the people that know how much _they_ write when they try to compete with this library.
This number isn't meant for people who are disinterested (have no stake) in speech recognition.
"X in N lines of code" is usually used for compressing an algorithm down to its bare essentials. "Ray tracing in 100 lines of code". In that case, it's disingenuous to have 100 lines of glue code that calls into pov-ray. If I compile 284 lines of code, I expect a few kb executable (including libraries that aren't language-libraries like libc). This hasn't really simplified the speech-recognition algorithm, it still requires large amounts of training data and time, and still does the speech recognition by solving a large matrix with Tensorflow. It's not interesting to me that you can hook up Tensorflow in a few lines of code; I know Tensorflow can solve the problem. However, I would be interested in 284 lines of code that replace Tensorflow, which is what the title suggested to me. A better title would be to focus on how the model and/or the handling of the data produces better results, because that's what this code seems to be about.
If you compare this with Mozilla's DeepSpeech repository, you will find that correctly calling TensorFlow Lite and handling the results in only 284 lines of code is impressive, too.
Also, the C++ code is mostly a custom beam search decoder based on the research for my paper, so it's not like TensorFlow is doing all the heavy lifting here: it is precisely that TEVR token decoder that produces the relative 16% performance improvement of this speech recognition AI over others.
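For readers unfamiliar with logit decoding, the simplest baseline that any beam search improves on is greedy CTC decoding; here's a sketch (not the actual tevr_asr_tool code):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Greedy CTC decoding: pick the best token per frame, collapse repeated
    // tokens, and drop blanks.
    std::string ctc_greedy_decode(const std::vector<std::vector<float>>& logits,
                                  const std::vector<std::string>& tokens,
                                  int blank_id) {
        std::string out;
        int prev = blank_id;
        for (const auto& frame : logits) {
            int best = static_cast<int>(
                std::max_element(frame.begin(), frame.end()) - frame.begin());
            if (best != blank_id && best != prev) out += tokens[best];
            prev = best;
        }
        return out;
    }

A beam search instead keeps the N most likely prefixes per frame and rescores them with the language model, which is where the extra accuracy comes from.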
I don’t want to take anything away from the achievement, because this looks very useful, but the headline misled me a bit. I’m not sure whether it’s my affinity for code golf, but when I see someone bragging about line count, I don’t expect the use of multi-million-line nonstandard libraries. :) For anyone wondering, most of the 284 lines are (of course, some might say) calls into TensorFlow. Still, I think this is really nice, just not what I expected from the title.
> I’m not sure whether it’s my affinity for code golf, but when I see someone bragging about line count, I don’t expect the use of multi-million-line nonstandard libraries.
I wouldn't say the headline is "misleading" - no one is about to be fooled into thinking 300 lines of C++ could be capable of state-of-the-art speech recognition. The headline is squarely in the territory of "complete nonsense, but you can tell that without having to read further than the headline".
I’m not sure what gives you the confidence to make absolute statements like this. It might be unlikely, but code golfers, demo sceners and the like regularly do crazy stuff with ridiculously little code.
As soon as you succeed, I'm pretty sure someone will complain that the sequence of matrix multiplications in the AI parameter file also counts as "code" in the wider sense.
Download a large number of random German-language videos off YouTube, but only ones with handmade subtitles. Correlate audio with text. Record audio, transform to text.
I posit this can be done in less than 284 lines of C++ while having an error rate equal to or better than the state-of-the-art for everyday speech.
To those criticizing the title, how would you improve it? Dependencies do the heavy lifting, but that's true even for a hello-world. It seems obvious enough that I wouldn't see a reason to clarify.
Instead, to me, this reads, "Hey C++ fans, you don't need Python for nice things. Look at what you can do in 284 lines of code!"
I don't have C++ chops, so this is nontrivial for me. I appreciate OP sharing this!
> To those criticizing the title, how would you improve it?
By not making false assertions?
Judging by the comments, the 284 lines are in addition to many hours of GPU training time, plus some magic and a huge library. I didn’t even click the link because I knew it was 284 lines of plumbing code on top of something else.
I once wrote an entire RenderMan-compatible renderer in a couple hundred lines of Python, which was really a super dumb script that generated ctypes bindings from the header of an actual renderer (Pixie, if anyone is curious).
I really don't think it's a false assertion. What code isn't "plumbing"?
For example, you can get 90% of the way to a CSV parser in Python in one line (as `[line.split(",") for line in open("some.csv").readlines()]`). Should we consider it a false assertion to call that a "one liner" and if so, how should it properly be described?
I actually wrote a generator for an SVG parser/writer, and not counting the xml library it came in at around 40k lines of code. Or maybe 70k lines; I don't recall exactly, I just know it is a lot. If I split it up into one class per file, it took something like 45 minutes to compile (it is a C-API extension), so I would dump all the code into a single file for faster compiles.
Haven't looked at it in a while, but the generator file is probably around 400-something lines. I'm certainly not going to claim it's a validating SVG library in 400-odd lines of code.
I think I had some cockamamie scheme to make an SVG-to-Grease-Pencil converter for Blender and only got around to shaving that one yak.
The core difference in my opinion is that the 284 lines of code here effect a relative 16% improvement in result quality over what until now was the best publicly available research.
That's why I wrote "State-of-the-Art" at the beginning of the title. Because this is based on new research and it works better than previous research.
Yes, this is precisely why I included it. I wanted to highlight for people who have experience with speech recognition in general that this is a magnitude easier to read than Mozilla's DeepSpeech.
Sure, but even something like `int main() {}` relies on an operating system, the C runtime, a standard library, and hardware to boot. Best build the universe to make an apple pie
That would leave out the fact that those 284 lines contain the new decoder based on our paper which leads to the relative 16% reduction in word error rate.
Also, it's 3 characters too long to be a valid HN title.
And lastly, this does contain the parameters for a new AI model which is based on my research, so it's not all "glue logic" ;)
Using operating system provided functionality, a GUI library or anything in libc, isn't something I'd hold against anyone when counting lines of code. Relying on an entire external project however, that's a bit misleading.
I’m inclined to accept that, too, but it is a moving goalpost and unfair when comparing between OSes and/or time periods. For example, consumer OSes have been shipping with speech recognition libraries for decades, and they’re getting better and better.
I don't know. It's not that it isn't impressive, and it is very illustrative of how much you can build utilizing the tools that are available to all of us. It's just a question of at what point you are writing code versus just initializing and configuring an existing tool.
It's also how you present it. If you said: "Building a TensorFlow backed German speech recognition system in only a few hundred lines of code." Then I feel like you're being more honest.
I tried to include TensorFlow in the submission headline, but that would have been too long. "Show HN: State-of-the-Art German Speech Recognition in 284 lines of C++" uses up 72 of the 80-character limit. Originally I also wanted to mention that this is an offline, cloud-free, and privacy-respecting tool, but that, too, didn't fit.
I can appreciate your dilemma, but I think the submission would have been better served if the title emphasizes the real win, which is improved accuracy over some alternatives (like you mention elsewhere in the comment thread), rather than lines of code (because that invites scrutiny on the wrong, non-salient dimension).
It doesn't. It implements a new way of decoding the logits which improves performance by a relative 16% over the previously best German speech recognition, which was Facebook's wav2vec2.
And the size is relevant to people in the industry because DeepSpeech uses 2000+ LOC for implementing their decoding, so this works better and is 10x less code.
Mozilla's DeepSpeech is so large that you can't really read it and understand it. This one is 10x less code while recognition quality is 75% better (lower relative word error rate).
So this one is small enough that you can read the source code if you want to, while DeepSpeech is not.
Good point – calling this clickbait might be too cynical.
From this perspective I can definitely get behind advertising the project with the LoC measurement.
Subjectively, I still find this to be a bit of a "not telling the whole truth", however I also only ever toyed around with speech recognition ai.
How to assemble a German speech recognition program in under 300 lines.
-
ps: I skimmed the paper cited and what I wrote above is -not- correct. The project is not simply assembling a pipeline, it is claiming ('papers with code') an innovation:
"This paper presents TEVR, a speech recognition model designed to minimize the variation in token entropy w.r.t. to the language model. This takes advantage of the fact that if the language model will reliably and accurately predict a token anyway, then the acoustic model doesn't need to be accurate in recognizing it."
...
"We have shown that when combined with an appropriately tuned language model, the TEVR-enhanced model outperforms the best German automated speech recognition result from literature by a relative 44.85% reduction in word error rate. It also outperforms the best self-reported community model by a relative 16.89% reduction in word error rate."
Yes, we actually did research and spent a significant amount of GPU time. Thanks to my luck of stumbling into the right people at the right time, I could afford cloud-scale training because OVH granted me very generous rebates...
The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
We don't train what you don't need to hear.
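To illustrate the flavor of the idea (a simplified sketch, not the exact formula from the paper): give each token a training weight proportional to how surprising it is to the language model, so the acoustic model focuses on what the LM cannot predict anyway.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Simplified sketch: weight each token's acoustic training loss by its
    // LM surprisal. lm_prob[i] is a hypothetical per-token probability
    // assigned by the language model.
    std::vector<float> surprisal_weights(const std::vector<float>& lm_prob) {
        std::vector<float> w(lm_prob.size());
        for (std::size_t i = 0; i < lm_prob.size(); ++i)
            w[i] = -std::log(lm_prob[i]);  // high when the LM is unsure
        return w;
    }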
Yes and no. With perfect audio quality, it'll write down almost verbatim what you said. But as the audio gets more noisy, it'll shift more and more towards the most likely interpretation.
> The main innovation is that we prevent loss mis-allocation during training of the acoustic AI model by pre-weighting things with the loss of the language model. Or in short:
> We don't train what you don't need to hear
This does sound a lot more interesting than the ~280 lines of code.
For a researcher, yes. But for understanding the trick there, you need to have read and understood the CTC loss paper.
For people like my industry clients, on the other hand, "code that is easy to audit and easy to install" is a core feature. They don't care about the research, they just want to make audio files search-able.
284 LoC + however many LoC in TensorFlow + however many training weights.
I think "X in Y LoC" should be limited to where Y is the LoC to do the work, not the LoC to setup/interface with some other library. We're getting ever closer to "SQL DB in only 200 LoC" that simply forwards to sqlite or some such.
That's a custom-designed beam search decoder implemented in C++ and based on the research for my TEVR paper. It increases performance by a relative 16% reduction in word error rate.
I just used the Linux tool cloc on the C++ files that I wrote.
Yes, I didn't count my 3 dependencies KenLM, Wave, or TensorFlow, because those are used by pretty much all speech recognition projects. For comparing the complexity of my code to Mozilla's DeepSpeech, it makes sense to ignore the LOCs for shared dependencies.
It was a cute gimmick for the title, nm. But you should try to avoid raw loops in favor of established iteration patterns / standard library algorithms. I found Sean Parent's talk on the subject to be educational.
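As a tiny, hypothetical before/after of what "no raw loops" means in practice:

    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Raw loop:
    float sum_positive_raw(const std::vector<float>& v) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < v.size(); ++i)
            if (v[i] > 0.0f) sum += v[i];
        return sum;
    }

    // Same intent expressed with a standard library algorithm:
    float sum_positive_std(const std::vector<float>& v) {
        return std::accumulate(v.begin(), v.end(), 0.0f,
            [](float acc, float x) { return x > 0.0f ? acc + x : acc; });
    }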
Actually the model is TEVR, which is only the wav2vec2 feature extraction but with a modified encoder that improves performance by exploiting redundancies in the German language.
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
wget "https://github.com/DeutscheKI/tevr-asr-tool/releases/downloa..."
cat tevr_asr_tool-1.0.0-Linux-x86_64.zip.00* > tevr_asr_tool-1.0.0-Linux-x86_64.zip
unzip tevr_asr_tool-1.0.0-Linux-x86_64.zip
sudo dpkg -i tevr_asr_tool-1.0.0-Linux-x86_64.deb
tevr_asr_tool --target_file=test_audio.wav
and then you'll be greeted with some TensorFlow Lite diagnostics, followed by the intermediate states of the beam-search decoder, followed by the hopefully correct transcription result.
And if that piques your curiosity, here's a short overview of the code: https://github.com/DeutscheKI/tevr-asr-tool#how-does-this-wo...