> SoTa in low-resource setting Libri-light by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5
> SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on clean data
This note isn't super specific, but if I'm understanding it correctly, it's outdated. To my knowledge, the SOTA on this data is held by Conformer 1B (a 1-billion-parameter model), at 1.4 WER clean and 2.6 noisy.
Conformer 1B is something like wav2vec 2.0 pretraining + a Conformer architecture + noisy student self-training + SpecAugment.
Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine-tuning reliably. It might need a lot of resources (e.g. hundreds of clustered GPUs).
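Of the ingredients in that Conformer recipe, SpecAugment is the easiest to show in isolation. Here's a minimal sketch with torchaudio's masking transforms; the mask sizes and the fake one-second input are placeholders, not the published Conformer 1B settings.

```python
import torch
import torchaudio

# SpecAugment in miniature: compute a mel spectrogram, then zero out random
# frequency bands and time spans. Mask widths below are placeholders.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=80)
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

wav = torch.randn(1, 16000)             # one second of stand-in 16 kHz audio
spec = mel(wav)                         # (1, 80, frames)
augmented = time_mask(freq_mask(spec))  # same shape, with random bands zeroed out
```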
I think Wav2vec-U is extremely cool.
I recently bought an RTX 3090 in hopes of playing around with some computer vision applications, but I guess having 24GB of VRAM is nothing if I want to get something SOTA working.
If you're working at a place with giant datacenters full of (T/G)PUs, you can train one giant model a few times, or train smaller models hundreds of times. Without hyperparameter search, there's a really high chance that you're just looking in the wrong region and wind up with something gigantic but kinda meh.
So, the simple strategy is to use the smaller models to find a great mix of hyperparameters, and then scale up to a gigantic model. The EfficientNet paper demonstrates some fairly reliable ways to scale up the model, changing width and depth together according to a scaling factor.
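A minimal sketch of that compound-scaling idea is below; the alpha/beta/gamma constants are the ones reported in the EfficientNet paper, while the base depth/width/resolution numbers are made-up placeholders.

```python
# Compound scaling: grow depth, width, and input resolution together by raising
# fixed per-dimension multipliers to a single compound coefficient phi.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # per-unit-phi multipliers from the EfficientNet paper

def scale(base_depth: int, base_width: int, base_resolution: int, phi: float):
    depth = round(base_depth * ALPHA ** phi)
    width = round(base_width * BETA ** phi)
    resolution = round(base_resolution * GAMMA ** phi)
    return depth, width, resolution

# Placeholder base model: 16 blocks, 64 channels, 224x224 inputs, scaled up by phi=3.
print(scale(base_depth=16, base_width=64, base_resolution=224, phi=3))
```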
But yeah, even for smaller model footprints, the ability to run tens of experiments in parallel goes a very long way. If you've got a single GPU to play with, I would instead try to focus on a well-scoped interesting question that you can answer without having to demonstrate SOTA-ness, as it will be an uphill climb.
Also remember that it's good to lean heavily on pre-trained models to save time. Anything you can do to iterate faster, really.
Don't try to chase SOTA - that's a fruitless endeavour.
24GB of VRAM is plenty for CV, and you can train some excellent models with it. Also keep in mind that you don't necessarily need to train models from scratch.
You can achieve great things by downloading a well-tested pretrained model and fine-tuning it for your particular task or application. For really big models, trying to come up with new architectures and training them from scratch is an exercise in futility.
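As a rough illustration of that workflow, here's a minimal transfer-learning sketch with torchvision; the ResNet-50 backbone, the two-class head, and the hyperparameters are all placeholders for whatever your task actually needs.

```python
import torch
import torchvision

# Load a pretrained backbone, freeze it, and train only a small new head.
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
for param in model.parameters():
    param.requires_grad = False                       # freeze the backbone
model.fc = torch.nn.Linear(model.fc.in_features, 2)   # new trainable head (2 classes)

optimizer = torch.optim.AdamW(model.fc.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

images = torch.randn(8, 3, 224, 224)    # stand-in batch; use your own DataLoader
labels = torch.randint(0, 2, (8,))
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()
```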
I usually only train smaller models (a couple million parameters), and training or fine-tuning usually takes anywhere from a few hours to a day or two. But then again, my hardware is two generations older than yours.
Consider how with word2vec there are relationships in the embedding space between semantically related words. I would expect the classic word2vec examples (e.g. king -> queen being a similar translation in the embedding space as man -> woman) to have analogues here too, but can it also do things like place regular questions and rhetorical questions in different regions of the embedding space based on the inflection in the speech?
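For the text side of that analogy, the classic word2vec arithmetic is easy to reproduce with gensim; whether wav2vec's audio representations support anything comparable for prosody is exactly the open question.

```python
import gensim.downloader as api

# Classic word2vec analogy arithmetic: king - man + woman ~= queen.
model = api.load("word2vec-google-news-300")  # pretrained Google News vectors (large download)
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# "queen" usually comes out near the top of the list.
```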
It would also be interesting to see what relationships exist between equivalent words in different languages within the embedding space. I suppose something like that is probably already used for text translation neural networks, but maybe some notable differences exist when dealing with speech directly.
Disclaimer: I am not affiliated with jaidedAI, just a satisfied user.
It performed better than expected. I only tested a few images, so please don't take my word for it.
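For anyone curious what the basic usage looks like, here's a minimal EasyOCR sketch; the language list and file name are placeholders.

```python
import easyocr

# Detect and recognize text in one call; models are downloaded on first use.
reader = easyocr.Reader(["en"])
results = reader.readtext("example_image.png")  # list of (bounding box, text, confidence)
for bbox, text, confidence in results:
    print(f"{confidence:.2f}  {text}")
```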
That led me to PaddleOCR. There is still plenty of room for improvement, but I found it way more convenient to use for my purposes than messing with Tesseract.
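For comparison, here's a rough PaddleOCR equivalent; the argument names and result layout have shifted between versions, so treat this as a sketch and check the current docs.

```python
from paddleocr import PaddleOCR

ocr = PaddleOCR(lang="en")             # downloads detection/recognition models on first use
result = ocr.ocr("example_image.png")  # one entry per page/image
for bbox, (text, confidence) in result[0]:
    print(f"{confidence:.2f}  {text}")
```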
Perhaps I should try more examples, but it doesn't look like it's ready yet.
You can train a similar system to produce audio from the output of wav2vec, though it probably won't sound similar to the input audio (accent/voice) unless you expose more features of the input than just the phonemes.
So no in this case.
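To make that concrete, here's a toy sketch of the wiring: a frozen wav2vec 2.0 encoder from Hugging Face plus a made-up upsampling decoder trained to reconstruct the waveform. A real system would use a proper neural vocoder, and as noted above this won't recover the speaker's voice from these features alone.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Frozen wav2vec 2.0 encoder (checkpoint name is just one public example).
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
encoder.eval()
for p in encoder.parameters():
    p.requires_grad = False

# Toy decoder: upsample the 768-dim features (one per ~20 ms frame) back to samples.
# Strides multiply to 320, matching wav2vec 2.0's hop of 320 samples per frame.
decoder = nn.Sequential(
    nn.ConvTranspose1d(768, 256, kernel_size=10, stride=5),
    nn.GELU(),
    nn.ConvTranspose1d(256, 64, kernel_size=8, stride=4),
    nn.GELU(),
    nn.ConvTranspose1d(64, 1, kernel_size=16, stride=16),
    nn.Tanh(),
)

wav = torch.randn(1, 16000)  # one second of stand-in 16 kHz audio
with torch.no_grad():
    feats = encoder(wav).last_hidden_state           # (1, frames, 768)
recon = decoder(feats.transpose(1, 2)).squeeze(1)    # (1, ~samples)
n = min(recon.shape[-1], wav.shape[-1])
loss = nn.functional.l1_loss(recon[:, :n], wav[:, :n])  # reconstruction loss to train on
```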
The Illustrated Wav2vec - https://jonathanbgn.com/2021/06/29/illustrated-wav2vec.html