Wav2vec Overview: Semi and Unsupervised Speech Recognition (vaclavkosar.com)
162 points by vackosar 74 days ago | 22 comments

One addendum to the linked post's notes:

> SoTa in low-resource setting Libri-light by a lot on WER clean test 100h labeled: others ~4 vs theirs ~2.5

> SoTa on high-resource noisy data (3.3 vs 3.4) close to SoTa on clean data

This note isn't super specific, but it's outdated if I'm understanding it correctly. To my understanding, the SOTA on this data is held by Conformer 1B (a 1 billion parameter model), at 1.4 clean, 2.6 noisy.

Conformer 1B is something like wav2vec 2.0 pretraining + conformer + noisy student + specaugment.



Wav2vec 2.0 is very cool, but I've had some trouble reproducing the pretraining and fine tuning reliably. It might need a lot of resources (e.g. hundreds of clustered GPUs).

I think Wav2vec-U is extremely cool.

I always wonder how people figure out these successful gigantic models if it takes hundreds of TPUs and days to train them.

I recently bought an RTX 3090 in hopes of playing around with some computer vision applications, but I guess having 24GB of VRAM is nothing if I want to get something SOTA working.

The EfficientNet paper has some good things to say on this.

If you're working at a place with giant datacenters full of (T/G)PUs, you can train one giant model a few times, or train smaller models hundreds of times. Without hyperparameter search, there's a really high chance that you're just looking in the wrong region and wind up with something gigantic but kinda meh.

So, the simple strategy is to use the smaller models to find a great mix of hyperparameters, and then scale up to a gigantic model. The EfficientNet paper demonstrates some fairly reliable ways to scale up the model, changing width and depth together according to a scaling factor.
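The compound-scaling idea can be sketched in a few lines. The base depth/width/resolution below are made up for the example; the alpha/beta/gamma defaults are roughly the values the EfficientNet paper reports finding via a small grid search on the baseline model:

```python
# Sketch of EfficientNet-style compound scaling. The base dimensions
# here are hypothetical; alpha/beta/gamma are approximately the values
# from the EfficientNet paper.
def compound_scale(base_depth, base_width, base_resolution, phi,
                   alpha=1.2, beta=1.1, gamma=1.15):
    """Scale depth, width, and input resolution together by a single
    compound coefficient phi, instead of tuning each independently."""
    depth = round(base_depth * alpha ** phi)
    width = round(base_width * beta ** phi)
    resolution = round(base_resolution * gamma ** phi)
    return depth, width, resolution

# Example: scale a small baseline up two steps.
print(compound_scale(18, 64, 224, phi=2))  # (26, 77, 296)
```

The point is that after the (cheap) search for alpha, beta, and gamma on a small model, scaling up is a single knob rather than a new hyperparameter search.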

But yeah, even for smaller model footprints, the ability to run tens of experiments in parallel goes a very long way. If you've got a single GPU to play with, I would instead try to focus on a well-scoped interesting question that you can answer without having to demonstrate SOTA-ness, as it will be an uphill climb.

Also remember that it's good to lean heavily on pre-trained models to save time. Anything you can do to iterate faster, really.

The RTX 3090 is a beast compared to what researchers had available to them just a few years ago.

Don't try to chase SOTA - that's a fruitless endeavour.

24GB of VRAM is plenty for CV, and you can train some excellent models with it. Also keep in mind that you don't necessarily need to train models from scratch.

You can achieve great things by downloading a well-tested, pretrained model and fine-tuning it for your particular task or application. Trying to come up with new models and training them from scratch is an exercise in futility for really big models.

I usually only train smaller models (couple of million parameters) and training and finetuning usually takes anywhere from a few hours to a day or two. But then again my hardware is two generations older than yours.

The real research problem is being able to buy a 3090.

So many papers casually mention their hyperparameters, neglecting to mention that those specific numbers are often necessary for performance. Something you don't realize unless you play around with their code…

I wonder how much better this would be at capturing information that doesn't translate well into text representations of speech.

Consider how with word2vec there are relationships in the embedding space between semantically related words. I would expect the examples of that for word2vec (e.g. king -> queen being a similar translation as man -> woman) to apply here too, but can it also do things like place regular questions and rhetorical questions in different regions of the embedding space based on the inflection in the speech?

It would also be interesting to see what relationships exist between equivalent words in different languages within the embedding space. I suppose something like that is probably already used for text translation neural networks, but maybe some notable differences exist when dealing with speech directly.

Does anyone know of some good open-source projects for OCR? Tesseract always seems to be the default, and then Google Cloud and other services are miles ahead. However, for those who don't want to rely on the big tech companies, are there any comparable alternatives?

In terms of setup and ease of use, JaidedAI's EasyOCR[1] is the best one I have used. I can't stress how many times I tried other alternatives (Tesseract, PaddleOCR) and kept going back to their OCR library, mainly because it supports a large variety of languages out of the box while providing nice results.

Disclaimer: I am not affiliated with JaidedAI, just a satisfied user.

[1] https://github.com/JaidedAI/EasyOCR

I recently came across CRAFT, which appears to have come out of the ICDAR 2017 Robust Reading challenge.

It performed better than expected. I only tested a few images so please don't take my word for it.

That led me to PaddleOCR. There is still plenty of room for improvement but I found it way more convenient to use for my purposes than messing with Tesseract.



I just tried it at [1] on the text of your comment and it replied with an empty result.

Perhaps I should try more examples, but it doesn't look like it's ready yet.

[1] https://www.paddlepaddle.org.cn/hub/scene/ocr

Thanks for the suggestions. I've tried Paddle before, but I was looking for a local-only version; Paddle seems to use external (cn) sources for some of its OCR features, and unfortunately that won't work in this line of work.

There is easyocr, which is good enough but lacks maturity (it was acknowledged at some point by Yann LeCun). The code base isn't ideal. I'm currently working on my own custom OCR, since easyocr isn't perfect at detecting emails, for example Www.ismaj@gmail ;com

Did anything eventually happen with Ocropus?

As someone who's an idiot about machine learning, is it possible to run this code in reverse? e.g. take the generated (or novel) vectors and convert them back into audio/waveforms?

If you look at the architecture diagram for Wav2Vec-U, the "generator" is doing exactly that: generating waveforms from the vectors. All GANs work this way; it's how websites like https://thispersondoesnotexist.com/ work. Of course, as the sibling comment notes, the results today might not be great for this task, and it is open research, but it's not as if it just can't be done at all.
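To make "a generator maps vectors to samples" concrete, here is a minimal untrained sketch: a two-layer network mapping a random latent vector to a bounded output signal. The weights are random and the dimensions invented, so this shows only the shape of the computation, not a working model:

```python
import numpy as np

# Toy GAN-style generator: latent vector in, sample out. Weights are
# random (untrained); dimensions are arbitrary for illustration.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 64))   # latent dim 16 -> hidden 64
W2 = rng.normal(size=(64, 256))  # hidden 64 -> 256 output values

def generator(z):
    h = np.tanh(z @ W1)          # hidden nonlinearity
    return np.tanh(h @ W2)       # output bounded in [-1, 1], like audio

z = rng.normal(size=(16,))       # random latent "seed"
fake = generator(z)
print(fake.shape)                # (256,)
```

Training, of course, is the hard part: the generator's weights are pushed so a discriminator can't tell its outputs from real samples.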

My reading of the generator diagram (figure 6) isn't that it is generating waveforms, but that it is generating phoneme probabilities.

You can train a similar system to produce audio on the output of wav2vec, though it probably won't sound similar to the input audio (accent/voice) unless you expose more features of the input than phonemes.
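Reading the generator's output as phoneme probabilities rather than audio looks roughly like this sketch (the phoneme inventory and the per-frame probability rows are made up): take the argmax per frame, then collapse consecutive repeats to get a phoneme sequence.

```python
import numpy as np

# Hypothetical phoneme inventory and 6 frames of generator output;
# each row is a probability distribution over the inventory.
phonemes = ["k", "ae", "t", "<sil>"]
probs = np.array([
    [0.80, 0.10, 0.05, 0.05],
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.80, 0.05, 0.05],
    [0.10, 0.70, 0.10, 0.10],
    [0.05, 0.10, 0.80, 0.05],
    [0.05, 0.05, 0.10, 0.80],
])

frame_ids = probs.argmax(axis=1)          # most likely phoneme per frame
collapsed = [phonemes[i] for i, prev in
             zip(frame_ids, [None, *frame_ids[:-1]]) if i != prev]
seq = [p for p in collapsed if p != "<sil>"]  # drop silence
print(seq)  # ['k', 'ae', 't']
```

Notice that everything speaker-specific (voice, accent, exact timing) is discarded in this representation, which is why resynthesized audio wouldn't sound like the input without extra features.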

Generalized reverse projection through even non-recurrent neural networks is still an open research problem.

So no in this case.

I wouldn't rule it out entirely; you could use these as a replacement for linguistic inputs in a tts system, for example, and I imagine it wouldn't be totally terrible. It would still end up being a pretty heavy system, though, with many other parts.

That doesn't sound like a particularly realistic problem to solve.

I agree, but all the more glory if someone does solve it then. And the field is still new enough that I don't want to be cited for decades like the iPod release "no wireless. Less space than a Nomad. Lame." slashdot comment.

Great summary! I also recently wrote a post digging into the internals of wav2vec with illustrations:

The Illustrated Wav2vec - https://jonathanbgn.com/2021/06/29/illustrated-wav2vec.html
