Meta AI announces Massive Multilingual Speech code, models for 1000+ languages (github.com/facebookresearch)
705 points by crakenzak 15 days ago | 222 comments

I loaded the language coverage into Datasette Lite and added some facets here:


Here's how I did that: https://gist.github.com/simonw/63aa33ec827b093f9c6a2797df950...

Here are the top 20 represented language families:

    Niger-Congo 1,019
    Austronesian 609
    Sino-Tibetan 288
    Indo-European 278
    Afro-Asiatic 222
    Trans-New Guinea 219
    Otomanguean 149
    Nilo-Saharan 131
    Austro-Asiatic 100
    Dravidian 60
    Australian 51
    Creole 45
    Kra-Dai 43
    Uto-Aztecan 41
    Quechuan 36
    Language isolate 35
    Torricelli 32
    Maipurean 31
    Mayan 30
    Sepik 30

Thanks -- great way to visualize how massive this set of languages really is.

All this great work and the prior SOTA translations, and Meta still only accepts North American English voice control in their VR equipment lol.

Based on the availability of STT, TTS, and translation models available to download in that github repo, a real-life babelfish is 'only' some glue code away. Wild times we're living in.

Nah, full-on babelfish is simply not possible. The meaning of the beginning of a sentence can be modified retroactively by the end of the sentence. This means that the Babelfish must either be greatly delayed or awkwardly correct itself every once in a while.

That’s why the babelfish translates brainwaves rather than sounds, which is especially important for communicating with our nonverbal alien neighbors.

It’s also good at translating the vagaries of alien poetry into English

Ah, reminds me of learning German, where you can chuck all the verbs onto the end. There was this sentence we had as a fun toy example, with something like half a paragraph of verbs at the end, and you had to try to match them up with the beginning of the sentence.

Edit: found a reference to it https://www.reddit.com/r/German/comments/ul0xgt/just_for_fun...

A joke I heard, supposedly from the 1880s:

Some travelers stop off at an inn in the Swiss Alps, and they notice that at the various tables at the inn are other nationalities. They amuse themselves by listening in on the conversations (and displaying their stereotypes): the Italians are all talking at the same time very loudly and never stop moving their hands, the French are all arrogant artistes, the Danes are boring and only talk about the weather, and then they notice the Germans.

There was obviously a very important conversation going on with the Germans, because one German would say something and every other German would stop and listen intently until they were done; then another one would start up and everyone would stop and listen intently until that one was done. Following the conversation, though, it became clear this was not because the conversation was important, but because the Germans were waiting to hear the verb to understand what was being said.

And sometimes that modification is not just "we don't know which verb it ends in" but "the whole structure is different than expected":


Sometimes humans, too, don't know what they're going to say until they've said it.

Close enough for practical value. Yes, big downside. But personally I’d use it.

Between the options of "you can't talk to this person" and "your conversation will have some delays", I know which I'd choose.

I wonder how well an LLM could guess the end of the sentence, though. Most of the time decently well?

I wonder if an esolang exists that could be a universal target though (if a language can handle any ambiguity by appending, then it could always be output without backtracking).


So just add 5 seconds of latency?

AFAIK STT is still very bad without speaker-specific fine-tuning, so it's not going to be a literal babelfish (translating in the ear of the receiver), but it could make you speak many languages.

Your interlocutor's earpiece could beam yours a delta for a finetuned model of their voice before they open their mouth. Except not compatible across iMessage users and whatsapp users or some other predictable silicon valley negative sum power play like that.

Hope that glue is better than the duct tape Google uses on translate.g.c.

It is still hopeless, and much worse than the dictionary-based tools, at gender/number/declensions in general.

I sometimes use it just to not think about the grammar in some languages, and most times I end up doing a surprised double take at something that would be completely inappropriate or offensive instead of my simple phrases.

Despite any shortcomings, Google Translate is still a technological marvel.

Modern translation apps and GPS are godsends that make travel a million times easier. And they're free! It blows my mind. Traveling would be so much more incredibly difficult without them.

Google Translate is still inferior to ChatGPT 3.5. I suspect this style of model is significantly more expensive to run and Google doesn't want to give it away for free. Really, the only problem with ChatGPT is that it refuses to translate things that go against its nanny programming, which can make it almost worse than useless in some real-life situations.

I tried ChatGPT 3.5 against Google Translate, translating English to Greek (my native language), and they perform almost the same. The text was difficult science fiction and fantasy material, and the results were tolerable. Roughly 50% of the text had to be manually rewritten.

Maybe for more casual sentences and less difficult text they perform better; I haven't tried. Anyway, they are both better than nothing.

Performance does vary with languages. In Italian, GPT blows Translate away. The UI is not as nice, but at least you can get it to translate to multiple languages in one go.

I would like to use stuff like this as a side-project. Buy a Nvidia Geforce GPU and stick it into my 24/7 server and play around with it in my free time, to see what can be done.

The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless if a single RTX 4000 SFF with its 20GB VRAM and only 70W of max power usage will be a waste of money, or really something great to do experiments on. Like do some ASR with Whisper, images with Stable Diffusion or load a LLM onto it, or this project here from Facebook.

Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.

Here's a sheet I came across today...


Granted, it's talking about quantized models, which use less memory. But you can see the 30B models taking 36 GB at 8-bit, and at least 20 GB at 4-bit.
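Those figures line up with a back-of-the-envelope estimate: parameter count times bytes per weight, plus some headroom for activations, KV cache, and framework overhead. A rough sketch (the 1.2x overhead factor is my own guess; real usage varies with context length and implementation):

```python
def estimate_vram_gib(n_params: float, bits_per_weight: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed to hold a model's weights, padded by a fudge
    factor for activations, KV cache, and framework overhead."""
    weight_bytes = n_params * bits_per_weight / 8
    return weight_bytes * overhead / 2**30  # bytes -> GiB

# A 30B-parameter model at common quantization levels:
for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{estimate_vram_gib(30e9, bits):.0f} GiB")
```

By this estimate a 30B model wants roughly 34 GiB at 8-bit and 17 GiB at 4-bit, in the same ballpark as the sheet's numbers.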

The page even lists the recommended cards.

But as others have pointed out, you may get more bang "renting" as in purchasing cloud instances able to run these workloads. Buying a system costs about as much as buying instance time for one year. Theoretically, if you only run sporadic workloads when you're playing around it would cost less. If you're training... that's a different story.

The more VRAM the better if you'd like to run larger LLMs. Old Nvidia P40 (Pascal 24GB) cards are easily available for $200 or less and would be easy/cheap to play. Here's a recent writeup on the LLM performance you can expect for inferencing (training speeds I assume would be similar): https://www.reddit.com/r/LocalLLaMA/comments/13n8bqh/my_resu...

This repo lists very specific VRAM usage for various LLaMA models (w/ group size, and accounting for context window, which is often missing) - this are all 4-bit GPTQ quantized models: https://github.com/turboderp/exllama

Note the latest versions of llama.cpp now have decent GPU support, include a memory tester, and let you load partial models (n layers) into your GPU. It inferences about 2X slower than exllama in my testing on an RTX 4090, but still about 6X faster than my CPU (Ryzen 5950X).

Again this is inferencing. For training, pay attention to 4-bit bitsandbytes, coming soon: https://twitter.com/Tim_Dettmers/status/1657010039679512576

Wait why is renting a GPU in the cloud not a solution? You can even try multiple options and see which ones are capable enough for your use case.

Look into some barebones cloud GPU services, for example Lambda Labs which is significantly cheaper than AWS/GCP but offers basically nothing besides the machine with a GPU. You could even try something like Vast in which people rent out their personal GPU machines for cheap. Not something I'd use for uhhh...basically anything corporate, but for a personal project with no data security or uptime issues it would probably work great.

My annoyance was managing state. I’d have to spend hours installing tools, downloading data, updating code, then when I want to go to bed I have to package it up and store as much as I can on s3 before shutting off the $$ server.

I've played a lot with Stable Diffusion using AWS spot instances, mostly because it is the platform with which I'm more familiar. The Terraform script[0] should be easy to adapt to any other project of this kind.

Let me know if you are interested, and maybe we can find time to work on it together :).

[0] https://github.com/ciberado/stable-diffusion-webui-terraform...

If you shut down an EC2 instance you only pay for the volume, not the compute. Just shut it down and start it up the next day.

You should check out https://brev.dev. You can rent GPUs, pause instances, use your own AWS/GCP accounts to make use of your credits, and the CLI lets you use your GPU as if it’s on your local machine.

You can try skypilot: https://skypilot.readthedocs.io/en/latest/

It handles storage, setup, etc for machine learning work loads across several providers - which helps a lot if you need one of the instances that rarely have capacity like 8x A100 pods.

Aren't Vertex AI Workbench notebooks on Google Cloud exactly the tool to solve this?

aws s3 sync + image snapshot

All that costs time :-/ When you're finished working, you have to wait for the syncs to finish (or I guess run them all the time?).

For home computer setups, you can simply walk away when you need a break.

No, you write a script that runs the sync and shuts down the instance. When the instance is stopped you don't pay for it. Resuming it is a simple API call. You don't even really need to do the sync; it's just to be certain you have a backup if the instance volume is lost.

The shutdown / stop on an instance is like closing the lid on your laptop. When you start it again it resumes where it left off. In the meantime, the instance doesn't occupy a VM.

A caveat is you can’t really do this with spot instances. You would need to do a sync and rebuild on start. But, again, scriptable easily.
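The sync-then-stop routine described above is easy to script. A minimal sketch that shells out to the AWS CLI (the instance ID, bucket, and work directory are hypothetical placeholders):

```python
import shlex

INSTANCE_ID = "i-0123456789abcdef0"    # hypothetical instance ID
BUCKET = "s3://my-experiments-bucket"  # hypothetical bucket name

def shutdown_commands(instance_id: str, bucket: str, workdir: str = "~/work"):
    """Build the two commands: back up scratch data to S3, then stop
    (not terminate) the instance so only EBS storage is billed."""
    return [
        ["aws", "s3", "sync", workdir, f"{bucket}/checkpoint"],
        ["aws", "ec2", "stop-instances", "--instance-ids", instance_id],
    ]

# Print what would run; on the actual box you'd pass each cmd to
# subprocess.run(cmd, check=True) instead.
for cmd in shutdown_commands(INSTANCE_ID, BUCKET):
    print(shlex.join(cmd))
```

Run this at the end of a session (or from a cron job) and the next morning's resume is a single `aws ec2 start-instances` call.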

Don't forget you can play computer games on it too!

This is what containers solve. Don't waste time manually installing things. Store state in a database via the app on a different host.

Speaking as someone who has encountered similar difficulties, this response has strong 'Draw the rest of the owl' vibes

Speaking as someone who has solved these difficulties hundreds of times, "draw the rest of the owl" doesn't tell you the specific things to google to get detailed examples and tutorials on how millions of others have sidestepped these repeated issues.

Yep... you spend hours messing around with docker containers and debugging all the weird build errors.

I am less familiar with storing data in a db (for ml hosting concerns), but I'd imagine it would add overhead (as opposed to accessing files on disk).

You also have to deal with hosting a db and configuring the schema.

You "spend hours messing around" with everything you don't know or understand at first. One could say the same about writing the software itself. At their core, Dockerfiles are just shell scripts with worse syntax, so it's not really that much more to learn. Once you get it done once, you don't have to screw around with it anymore, and you have it on any box you want in seconds.

In either case you have to spend hours screwing around with your environment. If those hours result in a Dockerfile, then it's the last time. If they don't, then it's every time you want it on a new host (which, as was correctly pointed out, is a pain in the ass).

Storing data in a database vs in files on disk is like application development 101 and is pretty much a required skill period. It's required that you learn how to do this because almost all applications revolve around storing some kind of state and, as was noted, you can't reasonably expect it to persist on the app server without additional ops headaches.

Many people will host dbs for you without you having to think about it. Schema is only required if you use a structured db (which is advisable) but it doesn't take that long.

I applaud your experience, but honestly I agree with parent: knowledge acquisition for a side project may not be the best use of their time, especially if it significantly impedes actually launching/finishing a first iteration.

It's a similar situation for most apps/services/startup ideas: you don't necessarily need a planet scale solution in the beginning. Containers are great and solve lots of problems, but they are not a panacea and come with their own drawbacks. Anecdotally, I personally wanted to make a small local 3 node Kubernetes cluster at one time on my beefy hypervisor. By the time I learned the ins and outs of Kubernetes networking, I lost momentum. It also didn't end up giving me what I wanted out of it. Educational, sure, but in the end not useful to me.

I'm having trouble imagining what data I would store in a database as opposed to a filesystem if my goal is to experiment with large models like Stable Diffusion.

I would take GP's kind of dogmatic jibber jabber with a grain of salt. There is an unspoken and timeless elegance to the simplicity of running a program from a folder with files as state

Isn't terminfo db famous for this filesystem-as-db approach? File Vs DB: I say do whatever works for you. There is certainly more overhead in the DB route.

> Storing data in a database vs in files on disk is like application development 101

It's OK until you're dealing with, say, 130GiB of tensors, on what is effectively a binary blob that needs to be mostly in VRAM somehow.

I really don't want to read 130GiB of blobs from a database all the time.

IMO tensors and other large binary blobs are fairly edge-casey. You might as well treat them like video files; video servers don't store large videos in databases either, and most devs don't have experience managing large binary blobs.

"Shell scripts with worse syntax", lol. I wish shell could emulate Alpine on a non-Linux box. A shell script with worse syntax for configuring a VM is maybe closer to a QEMU cloud-init file.

Well, then you're wasting time managing your containers. Have you ever used k8s? It's a full-time job lol

I don’t think anyone is suggesting k8s for running an ai model as a side project on a single machine.

> a side project on a single machine

Sorry I don't see how will containers help here

I used to think that the 300MB containers I shipped to prd were reasonably unoptimized.

Those AMD ROCm containers are like 14GiB compressed.


Besides some great tooling out there if you wanted to roll your own, you can literally rent windows/linux computers, with persistent disks. If you have good internet, you can even use it as a gaming PC, as I do.

Is there an easy way to off-board the persistent disk to cheaper machines when you don't need the gpus?

Like imagine, setting up and installing everything with the gpu attached, but when you're not using the gpu or all the cpu cores, you can disconnect them.

If you have docs on how to do this, please let me know.

With AWS (and probably most other cloud VPC services) the disk is remote from the hardware so you can halt the CPU and just pay for the storage until you restart.

AWS also provides accessible datasets of training data:


What sort of frame rate and cost do you get? I have the highest-tier GeForce Now subscription and it sometimes drops to horrible conditions.

"but offers basically nothing besides the machine with a GPU"

They must offer distributed storage that can accommodate massive models, though? How else would you have multiple GPUs working on a single training model?

I have not seen many setups that wouldn't pay for themselves (including energy, in my case) within a year, sometimes even 6 months, with buying vs renting. For something that pays for itself that fast, and that is without renting it out myself, just training with it, I cannot see how I would want to rent one.

Edit: on Lambda Labs, the only exception seems to be the H100; it would take 1.5 years or so, but even 2 years would still be fast enough. I have an A100 which has paid for itself; thinking of getting another one.

I think the downside to buying hardware is that, compared to other tech, this LLM/ML stuff is moving very quickly; people are great at quantizing now (whereas before I only really saw it done for the Coral edge TPUs etc.).

Someone could buy an H100 to run the biggest and bestest stuff right now, but we could find that a model gets shrunk down to run on a consumer card within a year or two with equivalent performance.

I suppose it makes sense if someone wants to be on the bleeding edge all the time.

Cloud has a variable price (up to the whim of whatever they decide the price to be that day), so it's uncertain, but typically it is far more expensive for this type of application. So when faced with 1) probably far more expensive, or 2) a single price that will be cheaper, is always available, and has far more uses, I think most would choose 2) for self-hosting. Cloud is very rarely a good option.

Having to connect to a gpu over the internet seems extremely cumbersome. Stuff like this should be as easy as running a local program with an accelerator.

You can fine-tune Whisper, Stable Diffusion, and LLMs up to about 15B parameters with 24GB VRAM.

Which leads you to what hardware to get. Best bang for the $ right now is definitely a used 3090 at ~$700. If you want more than 24GB VRAM, just rent the hardware, as it will be cheaper.

If you're not willing to drop $700 don't buy anything just rent. I have had decent luck with vast.ai

There is the world of used Nvidia Teslas, like the M40. Very cheap, but some assembly required.

I own a P100, P4, M40. They either lack ram or speed. Also, unless you are putting them in a server you have to cool them.

If your goal is to learn ML don't tinker with very obsolete hardware. Rent or buy something modern.

What do you mean by assembly needed? I looked them up and they look like normal gfx cards. Am I missing something?

They're datacenter GPUs so you need special power supplies and an adapter to connect it to a regular motherboard.

From what I’ve read cooling can also be a challenge.

What are their advantages?

“Very cheap” isn’t enough?

Apparently there is a group of folks fine-tuning with these cards.

Source? I own 3 and I would take a single 3090 any day. M40s are simply too old.

Can we use a 3090, or Nvidia GPUs in general, with a Mac? Do people generally have a Windows desktop for the GPUs?

No clue, but if you want to learn/fine-tune ML, use a Linux box; otherwise you will spend all your time fighting your machine. If you just want to run models, a Mac might work.

I would recommend a 3090. It can handle everything a 4000 series can albeit slightly slower, has enough VRAM to handle most things for fun, and can be bought for around $700.

Just do it. Spend a few hours doing research and you will find out. With that said, buy as much memory as you can. That makes the 4090 king if you have a server that can carry it, plus the budget. For me, I settled for a 3060; it's a nice compromise between cost and RAM: cheap, 12GB, and a 170W TDP.

I think you just need to educate yourself a bit about the space. These models are very small (the large version is only 1B parameters), so they should run on a 4GB gaming GPU.

A 3080 Ti is sufficient to run the largest Whisper model (just under 12 GB VRAM).

Use the A100 as a base reference and work your way down from there.

ASR: "Automatic Speech Recognition"; also known as "Speech to Text" (STT)

TTS: "Text to Speech"

LID: "Language Identification"

In case anyone else was confused about what the acronyms mean.

TLA: “Text LLM Application.” ;)

Imagine if we used these types of models for like 500 years and it locked their vocabulary in time, disallowing any further language blending; then somehow the servers turned off and nobody could communicate across language barriers anymore.

Someone should write that down in some sort of short-story involving a really tall structure.

Imagine if we used them everyday for 500 years but tune all the languages to be ever so slightly similar each day until there is one language.

Call it “Inglesh” or something phonetically similar

With ~1000 dialects.

Imagine if someone said this about dictionaries.

Imagine if your dictionary was constantly updated to modify your behavior

That statement makes no sense.

My point was that languages evolve, even though we write books and make dictionaries with fixed-in-time vocabulary lists. Similarly, languages will evolve even if LLMs, for mysterious reasons, have a fixed-in-time vocabulary. I was responding to "and it locked their vocabulary in time, disallowing any further language blending".

Update your dictionary

When a jet skier accidentally runs into a tree that's called a snow crash, aha

I just wanted to test out the TTS locally on a powerful Ubuntu 22.04 machine, but the process for setting it up seems pretty broken and poorly documented. After 20 minutes of trying I finally gave up since I couldn't get the VITS dependency to build (despite having a fully updated machine with all required compilers). It seems like they never really bother to see if the stuff works on a fresh machine starting from scratch. Somehow for my own projects I'm always able to start from a fresh git clone and then directly install everything using this block of code:

    python3 -m venv venv
    source venv/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install wheel
    pip install -r requirements.txt

But whenever I try using these complicated ML models, it's usually an exercise in futility and endless mucking around with conda and other nonsense. It ends up not being worth it and I just move on. But it does feel like it doesn't need to be like this.

I wrote a colab to run TTS, it takes some time to install dependencies : https://colab.research.google.com/drive/11tZ1c7kQJ5uKImm9t-C...

Overall I got pretty poor results in English and French; I guess it would require some fine-tuning.

Thanks for this! Out of curiosity, how does vits play into this? I know nothing about the repo.

Also, I used the following to play the sound after it is generated:

    from IPython.display import Audio, display
    display(Audio(output_file, autoplay=True))

I think vits is just a repo that implements the TTS model

I've created this environment.yml:

  name: fairseq
  channels:
    - conda-forge
    - pytorch
  dependencies:
    - python=3.9
    - Cython==0.29.21
    - librosa==0.8.0
    - matplotlib=3.3
    - numpy=1.19
    - scipy==1.5.2
    - tensorboard==2.3.0
    - pytorch=1
    - torchvision=0
    - Unidecode==1.1.1
    - pip:
        - phonemizer==2.2.1
You can install it with micromamba (or conda):

  conda env create -f environment.yml
  conda activate fairseq
You'll need to do this:

  cd path/to/vits/monotonic_align
  mkdir monotonic_align
  python setup.py build_ext --inplace
Then back to fairseq:

  cd path/to/fairseq
  PYTHONPATH=$PYTHONPATH:path/to/vits python examples/mms/tts/infer.py --model-dir checkpoints/eng --wav outputs/eng.wav --txt "As easy as pie"
(Note: On MacOS, I had to comment out several .cuda() calls in infer.py to make it work. But then it generates high-quality speech very efficiently. I'm impressed.)
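Rather than commenting out the `.cuda()` calls by hand, the usual device-agnostic PyTorch idiom avoids the problem entirely. A sketch of that pattern (not how the MMS inference script is actually written):

```python
import torch

# Select the GPU when available, otherwise fall back to CPU.
# Hard-coded .cuda() calls crash on machines without CUDA (e.g. macOS).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Move tensors/models with .to(device) instead of .cuda():
x = torch.zeros(4).to(device)
print(x.device.type)
```

The same script then runs unmodified on CUDA boxes and CPU-only machines.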

Yeah, a lot of the releases from research groups are woefully poorly documented.

Usually, some hero releases a friendly install system within a few days, though.

Research projects are notorious for this. The people behind them aren't very experienced in distributing software (many have never heard of Docker), the usual language of choice is Python, which has poor packaging, native dependencies complicate things, and everything is on the bleeding edge by definition.

Usually, it needs a brave soul to reverse engineer what exact versions of dependencies are needed and make a Colab or Dockerfile that puts it all together.

This thing uses pytorch, the bane of everything that attempts to make pypi installs sane.

As a human, you need to read https://pytorch.org/get-started/locally/ and install the correct version, depending on your pytorch-version/os/packaging-system/hardware-platform combo. It's bad. It's several conflicting index-urls messing with your requirements. It also doesn't work if you pick the wrong thing.

Or we need something like a "setup.py" that is turing-complete, to dynamically pick up the correct dependency for you.

`pip install -r requirements.txt` is not enough.

The problem with all these model releases is they have no demos or even video of it working. It’s all just download it and run it, like it’s an app.

They're intended for researchers/professionals not consumers, and I'm not sure how a video is going to be helpful?

And the issue with a live demo is that these are resource-intensive, they're not just webpages. It's an entire project to figure out how to host them, scale them to handle peaks, pay for them, implement rate-limiting, and so forth.

For the intended audience, download-and-run-it doesn't seem like an issue at all. I don't see how any questions are going to be answered by a video.

It's so weird to me that they'd do 99% of the effort and just skip the 1% of work to provide a dumbed-down summary and instructions for broader appeal. Clearly these are released in part for public relations and industry clout.

Don't get me wrong: they published this, it took a ton of work, and they didn't have to do it. But it's ultimately a form of gatekeeping that seems to come straight out of academia. And honestly, that part of academia sucks.

It's not 1% of the work. If you have ever written public documentation, or half-decent internal documentation that is understandable to people outside your team, you would know that it takes a lot of effort to get it right. Let alone something highly technical like this. If you get things wrong or skip things you think are trivial but not to others, you get more questions. And it is not a one-time thing, it is a living document that needs to be updated and revised over time.

A good research team is a handful of people working on a focused set of questions.

A good product team is probably around a dozen people minimum? Especially if you need to hit the quality bar expected of a release from a BigCorp. You've got frontend, UX, and server components to design, in addition to the research part of the project. The last real app I worked on also included an app backend (ie, local db and web API access, separate from the display+UX logic) and product team. Oh yeah, also testing+qa, logging, and data analysis.

And after all that investment, God help you if you ever decide the headcount costs more than keeping the lights on, and you discontinue the project...

Public app releases are incredibly expensive, in other words, and throwing a model on GitHub is cheap.

Even researchers and professionals appreciate a sample or two.

We're all research professionals now.

Here's the blog post with video of it working: https://ai.facebook.com/blog/multilingual-model-speech-recog...

I'd argue that having a "download and run" approach is so much better than videos or demos. Why do you think this is a problem?

Because to download and run it you need to have a laptop nearby to download and run it on, with the correct operating system and often additional dependencies too.

I do most of my research reading on my phone. I want to be able to understand things without breaking out a laptop.

There's a paper associated with it that you can read on your phone. And I don't think demo videos are really associated with "research". I agree they could've added both, but let's be honest here you'll have demo videos on this in the next 12 hours for sure.

That's a related complaint: everyone continues to insist on releasing papers as PDFs, ignoring the fact that those are still pretty nasty to read on a mobile device.

Sure, release a PDF (some people like those), but having an additional responsive web page version of a paper makes research much more readable to the majority of content consumption devices. It's 2023.

I'll generally use https://www.arxiv-vanity.com/ to generate those but that doesn't work with this specific paper since it's not hosted on arXiv.

Would you like a back rub while I'm at it?

fwiw, i know the topic of the thread has deviated, but i share the frustration about reading PDFs on my device. In my case, it's an accessibility issue - I can't see well, so zooming with reflow would make my life materially better since I'd be able to read research papers on my morning commute.

Sometimes users have needs that may seem superfluous and beg for a snarky reply, but there are often important reasons behind them, even though they may not be actionable.

I'd pay $x000 for an app that does some sort of intelligent pdf-to-epub conversion that doesn't require human-in-the-loop management/checking.

PDFs are pretty shite on mobile.

Interesting that there isn’t a common solution for this. I guess it’s rather niche?

It's a lot bigger investment. It'd be nice to at least see a video; that's easier than expecting people to download and run something just to see it.

Why should we have any marketing materials, ever? Why do we show pictures of products? Sometimes people want to see the capabilities before they completely dive in and spend their time working on something.

I was super excited at this, but digging through the release [0] one can see the following [1]. While using Bible translations is indeed better than nothing, I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak (i.e. that I am able to evaluate personally).

Religious recordings tend to be liturgical, so even the pronunciation might be different to the everyday language. They do address something related, although more from a vocabulary perspective to my understanding [2].

So one of their stated goals, to enable people to talk to AI in their preferred language [3], might be closer but certainly a stretch to achieve with their chosen dataset.

[0]: https://about.fb.com/news/2023/05/ai-massively-multilingual-...

[1]: > These translations have publicly available audio recordings of people reading these texts in different languages. As part of the MMS project, we created a dataset of readings of the New Testament in more than 1,100 languages, which provided on average 32 hours of data per language. By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to more than 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.

[2]: > And while the content of the audio recordings is religious, our analysis shows that this doesn’t bias the model to produce more religious language.

[3]: > This kind of technology could be used for VR and AR applications in a person’s preferred language and that can understand everyone’s voice.

This paragraph justifies the decision, IMHO.

> Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover 100 languages at most. To overcome this, we turned to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research.

I think the choice was just between having any data at all, or not being able to support that language.

While AFAIU the UN's Universal Declaration of Human Rights (UDHR) is the most-translated document in the world, relative to the given training texts there likely hasn't been as much subjective translation analysis of the UDHR.

Awesome-legal-nlp links to benchmarks like LexGLUE and FairLex but not yet LegalBench; in re: AI alignment and ethics / regional law https://github.com/maastrichtlawtech/awesome-legal-nlp#bench...

A "who hath done it" exercise:

[For each of these things, tell me whether God, Others, or You did it:] https://twitter.com/westurner/status/1641842843973976082?

"Did God do this?"

The UN's UDHR likely yields <5mins of audio data in any language which is kinda useless as far as training data goes.
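Order-of-magnitude sketch (the word count and speaking pace are assumptions, not measured figures): even a full UDHR reading yields minutes of audio per language, against the 32 hours per language Meta quotes for the religious recordings.

```python
# Rough estimate of how much audio a UDHR reading yields per language.
udhr_words = 1800        # approximate English UDHR word count (assumption)
speaking_wpm = 150       # typical narration pace (assumption)
udhr_minutes = udhr_words / speaking_wpm
bible_hours = 32         # per-language average quoted in the MMS blog post
print(udhr_minutes)                     # about 12 minutes of audio
print(bible_hours * 60 / udhr_minutes)  # the Bible recordings give ~160x more data
```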

Compared to e.g. religious text translation, I don't know how much subjective analysis there is for UDHR translations. It's pretty cut and dry: e.g. "Equal Protection of Equal Rights" is pretty clear.

"About the Universal Declaration of Human Rights Translation Project" https://www.ohchr.org/en/human-rights/universal-declaration/... :

> At present, there are 555 different translations available in HTML and/or PDF format.

E.g. Buddhist scriptures are also multiply translated; probably with more coverage in East Asian languages.

Thomas Jefferson, who wrote the US Declaration of Independence, had read into Transcendental Buddhism and FWIU is thus significantly responsible for the religious (and nonreligious) freedom we appreciate in the United States today.

Of course, my critique is less "This project shouldn't exist", but rather "It seems to me there are several biases that affect the performance of this project in the context it was presented in".

This is a great project and an important stepping stone in a multilingual AI future.

> I don't think the stylistic choices in the Bible are representative of how people actually speak the language, in any of the languages I can speak

The cadence and intonation sounded a little weird, but I suspect fine-tuning can improve that by a lot. I am really excited to see some low-resource language finally get some mainstream TTS support at all.

This looks huge. Anyone know how this compares with Whisper in terms of quality and speed?

according to their blog post[1], MMS achieves ~half the error rate on words, while supporting 11x more languages. pretty impressive.

[1] https://ai.facebook.com/blog/multilingual-model-speech-recog...

I wonder what the performance is on English specifically.

Edit: Just checked the paper, it seems to be worse[1][2] but feel free to correct me.

I feel like they should've just taken the Whisper architecture, scaled it, and scaled the dataset as they did.

[1] Page: https://i.imgur.com/bq15Tno.png

[2] Paper: https://scontent.fcai19-5.fna.fbcdn.net/v/t39.8562-6/3488279...

It's worse on English and a lot of other common languages (see Appendix C of the paper). It does better on less common languages like Latvian or Tajik, though.

Which implies Whisper just hasn't focused on those languages? It seems disingenuous to claim the error rate has halved when it's worse in the apex language.

My guess is wav2vec performs better on low resource than whisper.

lack of labels on graph axes should be a crime

> The MMS code and model weights are released under the CC-BY-NC 4.0 license.

Huge bummer. Prevents almost everyone from using this and recouping their costs.

I suppose motivated teams could reproduce the paper in a clean room, but that might also be subject to patents.

Trying to attach a license to model weights seems counter-productive to me. If you argue they are copyrightable then surely they are derivative works of the training data, which is unlikely to all be public domain. Machine learning enthusiasts are better off lobbying for model weights being non-copyrightable as it doesn't have any creative input and is the result of a purely mechanical process.

The copyright on the code, on the other hand, would definitely be copyrighted and would need a clean-room implementation, as you said. The community could pool its resources and do it once, and license it under the AGPL to keep it and further improvements available to everyone.

I think this is a lot better than the other option, which would've been not releasing it at all. No company in the business of making money wants to give away their edge to would-be competitors for free.

I'd imagine you could use inference from this as training data for a commercial model, as that isn't currently protected under copyright.

I'd go a step further and say that the models themselves probably aren't copyrightable.

I mean, I personally feel like almost all software is math and anything that isn't frontend isn't legally copyrightable or patentable. But the courts disagree with my interpretation. And if it were up to me, all copyright and patents would have exponentially increasing fees on an annual basis to maintain, but that'd require a change in statutes. I think the jury is still out on your interpretation though, and I suspect the courts will fall in line with whatever the largest rights holders want.

Effort and cost should be accounted for, and reasonable licensing fees should go in the patent. Loss of lives and loss of quality of life should be considered. The only goal on the other end of the equation should be to encourage people sufficiently to not keep their inventions secret.

We should also have a fund for particularly tricky inventions and copyrights that would greatly benefit mankind.

Say, someone or some institution writes a good school book. We can just buy the book for whatever we think is reasonable. If they want to publish the same book the next year with the chapters shuffled around, they could be entitled to a tiny fraction of the previous sum, or it could be denied for being spam. Whether this bankrupts the company is of very little interest.

My layman's reading is that the license does seem to allow recouping of costs, would be interested if there is a more nuanced interpretation to the contrary.

(tangentially, like another comment briefly mentioned, are models actually copyrightable? Programs are, because they are human authored creative works.)

So many so-called overnight AI gurus hype their snake-oil products and scream 'Meta is dying' [0] and 'It is over for Meta', but few of them actually do research in AI and drive the field forward. This once again shows that Meta has always been a consistent contributor to AI research, especially in vision systems.

All we can just do is take, take, take the code. But this time, the code's license is CC-BY-NC 4.0. Which simply means:

Take it, but no grifting allowed.

[0] https://news.ycombinator.com/item?id=31832221

Unless you’re in a country that doesn’t care about the IP restrictions, then you do whatever you want.

Yes, they are so far more open than open<for business>ai.

According to [1] on the accompanying blog post, this brings the Whisper 44.3 WER down to 18.7, although it’s unclear to me how much better this is at primarily English speech recognition. I’d love to see a full comparison of accuracy improvements as well as a proper writeup of how much more power it takes to run this in production or on mobile vs something like whisper.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.8562-6/346801894_...
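For what it's worth, 44.3 down to 18.7 works out to a bit more than half, consistent with the "~half the error rate" framing elsewhere in the thread:

```python
# Relative WER reduction implied by the chart's headline numbers.
whisper_wer = 44.3   # Whisper average WER on the compared language set
mms_wer = 18.7       # MMS average WER
reduction = (whisper_wer - mms_wer) / whisper_wer
print(f"{reduction:.1%}")   # ~57.8% relative reduction, i.e. "about half" the errors
```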

Come to think about it, Meta is a much better name for an AI company than a VR company.

If anything, it sounds like the start of a "Meta? I don't even know her" joke.

It's not VR, it's Metaverse; VR might involve experiences that are not bullshit.

The metaverse, if we define it as an information layer that exists in parallel with the physical 'stuff' universe and is a seamless, effortless, and essential part of what we experience as reality will be an enormous part of our future. Meta might just be a few centuries ahead of the curve, which is just as bad as being a few centuries late.

If you define it like that, it's just the internet.

Centuries really? I think Meta's going to be just on time.

I think we may skip the whole VR experience and just have robots instead of virtual friends

That's a large overlap rather than sequential.

I assume this "competes" directly with https://sites.research.google/usm/ -- would be cool to see side-by-side benchmarks sometime! Maybe I should make those. I requested access to USM but have not been granted any access yet.

Is there any indication that USM will be open sourced though? This is more so competing with Whisper.


1107 languages but no Czech or Slovak? Many languages with way fewer speakers made it to the list. I wonder what we did to Meta...

no chinese TTS either.

Meta is on a roll. Any demo of how good the text-to-speech is?

Doesn't seem like there's a demo set up by them yet, but you can just download the model weights and run the inference yourself and compare it to OpenAI Whisper or anything you have access to.
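For reference, a minimal sketch of what "running it yourself" looks like, assuming a fairseq checkout and a downloaded checkpoint (the script path matches the repo's examples directory; the model filename and language code here are placeholders, not official names):

```python
# Build the inference command for fairseq's MMS ASR example script.
def mms_asr_cmd(model: str, lang: str, audio: str) -> list[str]:
    """Command line for examples/mms/asr/infer/mms_infer.py (run from a fairseq checkout)."""
    return [
        "python", "examples/mms/asr/infer/mms_infer.py",
        "--model", model,   # downloaded .pt checkpoint
        "--lang", lang,     # ISO 639-3 code, e.g. "eng" or "kor"
        "--audio", audio,   # 16 kHz mono WAV
    ]

# e.g.: subprocess.run(mms_asr_cmd("mms1b_fl102.pt", "kor", "sample.wav"), cwd="fairseq", check=True)
print(mms_asr_cmd("mms1b_fl102.pt", "kor", "sample.wav"))
```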

yeah thanks. just saw that. fingers crossed it's really good for korean. i'm just glad they released it at least. the other sota TTS models by big companies won't be seeing the light of day.

reproducing the english TTS demo resulted in very underwhelming quality.

Wow, I didn't even know there were 7,000 documented languages in the world!

"According to the World Atlas of Languages' methodology, there are around 8324 languages, spoken or signed, documented by governments, public institutions and academic communities. Out of 8324, around 7000 languages are still in use."


Most are at risk of extinction.

"half of the languages spoken today have fewer than 10,000 speakers and that a quarter have fewer than 1,000 speakers" (https://en.wikipedia.org/wiki/Language_death).

"Today, on average, we lose one language in the world every six weeks. There are approximately 6800 languages. But four percent of the population speaks 96 percent of the languages, and 96 percent of the population speaks four percent of the languages. These four percent are spoken by large language groups and are therefore not at risk. But 96 percent of the languages we know are more or less at risk. You have to treat them like extinct species." (https://en.wikipedia.org/wiki/Language_preservation).

"Over the past century alone, around 400 languages – about one every three months – have gone extinct, and most linguists estimate that 50% of the world’s remaining 6,500 languages will be gone by the end of this century (some put that figure as high as 90%, however). Today, the top ten languages in the world claim around half of the world’s population." (https://www.bbc.com/future/article/20140606-why-we-must-save...).

Wow, preserving almost-dead languages sounds like something that LLMs would be pretty appropriate for, right? We would primarily need as large a body of written text translated into both a "known" language and the dying language as possible.

May be a cool piece of tech for historians in 100s of years from now!

Even Bavarian is covered... :D

I'm trying to use this on a 3M mp3 file to test ASR with language code deu, CPU only, and I keep getting this error -- are there limits to the MMS inference?

  Traceback (most recent call last):
    File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 52, in <module>
    File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 44, in process
    File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec
      assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (
  AssertionError: Sentences lengths should not exceed max_tokens=4000000
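If I'm reading the assertion right (this is an inference from the error, not documented behavior), fairseq counts raw 16 kHz samples as "tokens", so the cap works out to roughly four minutes of audio; splitting the file into shorter chunks should get around it:

```python
# Translate the max_tokens cap from the AssertionError into seconds of audio.
sample_rate = 16_000      # MMS models expect 16 kHz mono input
max_tokens = 4_000_000    # value from the assertion above
max_seconds = max_tokens / sample_rate
print(max_seconds)        # 250.0 seconds, i.e. ~4.2 minutes per clip
```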

Same error here, CPU only; haven't found a way to run this on GPU (CUDA) yet

Real Open AI lab.


Wonder how this compares with Deepgram's offering. Has anyone used/tried/compared them, or even read enough literature to compare? The WER rates shown by Deepgram are still better than the largest MMS model, and the use-case-specific fine-tuned models (Zoom meetings, financial calls, etc.) probably make a bigger difference. WDYT?

Great, now the Terminators will be barking orders at each other in languages I can't understand.

FYI: Another round of massive layoffs at Meta this Wednesday. Stay-at-home lockdown ensues.

I tried the english TTS example and the result is quite underwhelming (compared to bark or polly/azure-tts ). It sounds like TTS systems one or two decades ago. Would those language-specific TTS models need to be finetuned?

Most of the languages support only the LID (language identification) task. Still impressive.

Anyone know what hardware it takes to run this? Asking as an enthusiastic newbie

At a guess, based on model size, just about anything, even a Raspberry Pi 4 etc.

As comparison the GGML port of Whisper (OpenAI's equivalent), runs in the browser, via WASM: https://whisper.ggerganov.com/

If this is the case, what hardware is needed for real-time tts?

Would it be possible to have this in something "à la" whisper.cpp?

If it performs as well as the paper implies, then I certainly hope someone will pick it up and give it the same treatment.

Maybe one of those models has figured out a way to tell Zuck that the whole Metaverse concept is nonsense, hopefully it'll be graceful about letting him down.

I don't have much knowledge about TTS models, is it possible/affordable to fine-tune those models on your own voice ?

I just want to translate a movie's audio from one language to another; what's the easiest way to do this at home?

I checked the language coverage for "kashmiri"


"Himachal Pradesh state:"

This is obviously wrong, so I don't know what else is wrong.

I was kind of hoping for Interlingue, but was surprised to not even see Esperanto on the list.

Random thought: could Esperanto (as a crypto-language) be turned into any sort of programming language? Could one, conceivably, program in Esperanto in any meaningful way?

I think you are more likely to do so in a more logical and structured language like lojban

There's nothing particularly special about Esperanto in that regard compared to natural languages that it is modelled after.

Esperanto is supported in their larger model

If I have an extra MBP 16" 2020 hanging around (16 GB RAM, quad-core i7)... can I run this? I'd like to try the TTS capabilities! LMK if you've got any guides or instructions online I can check out :)

Some interesting links and tools in this thread, e.g. datasette.io

can any of these models be coerced into just doing straight phonetic transcription? like spitting out IPA?

Meta is doing more for open AI than OpenAI

I don't think anyone is too surprised by the fact that OpenAI went closed. As ironic as it may be, it was pretty obviously the only way they were going to continue to exist.

On the other hand, it definitely underscores:

- How blatantly companies exploit the "open source" concept with no remorse or even tacit acknowledgement; for many startups, open source is a great way to buy some goodwill and get some customers, then once they have what they want, close the doors and start asking for rent money. Nothing wrong with doing closed source or commercial software, but would OpenAI still have gained relevance if they hadn't started the way they did with the name they did?

- How little anyone gives a shit; we all watched it happen, but apparently nobody really cares enough. Investors, customers, the general public, apparently all is fair in making money.

I'm not suggesting OpenAI is especially evil, definitely not. In fact, the most depressing thing is that doing this sort of bait and switch mechanic is so commonplace and accepted now that it wasn't news or even interesting. It's just what we expect. Anything for a dollar.

But maybe I still seem like I'm just being whiny. Okay, fair enough. But look at what's happening now; OpenAI wants heavy regulation on AI, particularly they want to curtail and probably just ban open source models, through whatever proxy necessary, using whatever tactics are needed to scare people into it. They may or may not get what they want, but I'm going to guess that if they do get it, ~nobody will care, and OpenAI will be raking in record profits while open source AI technology gets pushed underground.

Oh I'd love to be wrong, but then again, it's not like there's anything particularly novel about this strategy. It's basically textbook at this point.

Because OSS doesn't work at scale, the economic incentives are not aligned.

Sure, it may work sometimes with the goodwill of someone who cares, but 90% of OSS code is a dead portfolio the authors built with the hope of landing a tech job or skipping some algorithm questions.

Sure, OSS allows people to experiment with crap for free (even though it's mostly big corps benefiting from OSS), but what about the negative effects OSS produces on small businesses?

How many developers could spend their life maintaining small parts of software instead of working in soulless corporations, if giving away your code for free (and without maintenance) wasn't so common? How much better would this code be compared to the wasteland of OSS projects? How much more secure could the entire ecosystem be? How many poor developers are working for free just in the hope of getting a job someday?

We need to stop dreaming the OSS dream and start making things fairer for the developers involved.

In the long term, software is not very novel. Needs evolve quickly in the genesis of a new category of software, but it doesn't take very long for it to stabilize for many categories. That's why, in my estimation anyways, open source is actually more sustainable in some cases: because when software becomes a commodity, it makes more sense for stakeholders to collaborate and collectively benefit from it.

There is no "OSS dream" anymore—today, there is an OSS reality. We have some open source stuff that objectively works: there are business models that are more or less proven, at least as proven as any internet or software-oriented business model, and plenty of highly successful projects that deliver immense value to the world.

Then again, some of it doesn't seem to work, and there are a lot of unknowns about how it works, what the dynamics will be, etc. But, if we're to call open source into question, we shouldn't forget to call proprietary software into question, too. Proprietary software has many seemingly-endemic issues that are hard to mitigate, and the business model has been shifting as of late. Software is now sold more as a subscription and a service than it is a product. The old business model of boxed software, it seems, has proven unsustainable for many participants.

The main issue open source seems to have is really funding. It works well when open source software acts as a complement to some other commercial business, but it works poorly when the software itself is where all the value is. After all, if you, for example, just host an open source piece of software in exchange for money, you're effectively competing in the highly competitive web hosting business. It can work since you can obviously provide some value, but it's a precarious position. Thus very few companies really have a lot of money to put into the Linux desktop, or at least if you wanted to compare it to Windows or macOS. It's complementary to some of them who use it as a developer workstation or something, but there are only a couple companies who I think genuinely have a good model. System76 is definitely one of them, to be explicit about it.

But rather than give up entirely, I propose something else: we should literally fund open source collectively. Obviously a lot of us already do: you can donate to projects like Blender or Krita monetarily, and you can donate your time and code as I'm sure many of us also do for random projects. But also, I think that open source should get (more) public funding. These projects arguably end up adding more value to the world than you put in them, and I think governments should take notice.

Of course in some cases this has already taken shape for one reason or another. Consider Ghidra. Clearly released for the benefit of NSA's PR, but wait, why not release Ghidra? It's immensely useful, especially given that even at a fraction of the functionality of Hex Rays products, it's still extremely useful for many parties who could simply never afford the annual cost of maintaining IDA Pro licenses, especially now that it is only available as a subscription.

The way I see it, software and computers in general are still moving quite fast even though it's clearly stagnating compared to where it once was. But, as things stabilize, software will simply need to be updated less, because we simply aren't going to have reasons to. As it is, old versions of Photoshop are already perfectly serviceable for many jobs. And at that point, we're only going to need one decent open source release for some given category of work. Things will need occasional improvements and updates, but c'mon, there's not an unlimited amount of potential to squeeze out of e.g. an image editor, any more than you can a table saw or a hammer. At some point you hit the point of diminishing returns, and I think we're nearing it in some places, hence why the switch to subscription models is necessary to sustain software businesses.

It's a myth that open source is held up by poor developers starving for their ideology. I'm sure a lot of that exists, but a lot of open source is also side projects, work subsidized by companies for one reason or another, projects with healthy revenue streams or donations, etc.

For OpenAI, AI is the revenue source. For Meta, it's a tool used to build products.

If OpenAI gives their stuff away, they lose customers. If Meta does it, they can have community around it, and have joint effort about improving tools that then they'll use for their internal products.

OpenAI is modern (and most likely - very short lived) Microsoft in the AI space, while Meta tries to replicate Linux in the AI space.

This is absolutely true, with PyTorch, LLaMA, a ton of models in vision, and now with this!

We're in upside-down world, Meta is now cool; actually I always admired their open source projects, starting with React and PyTorch.

That's really not hard to do, is it?

I don't think open source LLMs over a certain capability will be legal soon, in the same way an open source SDR transmitter would not be legal.

What an irony!

fairseq is a fairly missed naming opportunity. C-3PO would have been better.

fairseq has existed for many years and is Fb’s (now Meta’s) AI research group’s sequence-to-sequence modeling toolkit. It’s not specific to this MMS model.

I still hate the name “meta”. You?

Why is Meta open sourcing their AI work like this? Is it because they don't have a great reputation even among tech companies?

Meta has had a culture of open source, even before their reputation got bad (React, RocksDB, Thrift, folly, etc). They open source a lot of their AI models (instead of just source) because they don't have an enterprise cloud to sell these things, unlike Amazon/Google/Microsoft's offerings (which all offer speech transcription and text to speech as services).

I understand that. I’m asking in the context that a number of business commentators have come out and said that Google open sourcing the transformer technology was in itself a big mistake, in hindsight. I think the business side in Meta would have taken a similar view.

This is an underrated point for Meta. They are good at opensourcing.

My speculation: the repo hasn't been in active development for at least a year, all the main contributors left to found Character.AI, so Meta is open sourcing it because it only rots sitting there, and going open means goodwill, and maybe someone else picks it up and runs with it and contributes back to make it more useful.

Does it do American Sign Language, the US's fifth-largest language?

I didn’t think so.

American Sign Language, ISO 639-3 code 'ase', does not have a formal written or spoken grammar. Obviously, the list of MMS supported languages[0] does not include support for 'ase' because MMS is a tool for spoken language and, as an intermediate state, written language.

[0] https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mm...

Perhaps you could help by inventing a written version of ASL, since one doesn't currently exist. [EDIT: I was wrong, sorry, see my response below.] Seems like that would be a prerequisite for a model based on written language.

And of course if you could also create an entire corpus of training material in your written ASL?

It is called SignWriting for all you multiple naysayers.


Very fair point. I've never seen it used, but SignWriting is already in the Unicode standard[0] (at U+1D800 - U+1DAAF).

I suspect Meta/Facebook doesn't have a lot of content to work off of. I've only been able to find community-generated examples of SignWriting on the official website[1], and none of those seem to be using Unicode characters. MMS is an audio-to-text tool, so it seems unlikely that it can be trivially expanded to take in visual data (pictures of text or video of ASL being performed).

I suspect the process of turning viewed ASL into SignWriting text will be very difficult to automate. I would not be surprised if such a project would either use a different textual encoding or directly translate out to English (which also sounds terribly hard, but these LLM advances recently have surprised me).

[0] https://www.unicode.org/charts/PDF/U1D800.pdf

[1] https://signwriting.org/#RecentPostingsUS
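For the curious, the Sutton SignWriting block mentioned above is queryable from Python's `unicodedata` on any build with Unicode 8.0 or later (when the block landed):

```python
import unicodedata

# The Sutton SignWriting block spans U+1D800..U+1DAAF.
first, last = 0x1D800, 0x1DAAF
print(last - first + 1)              # 688 code points in the block
print(unicodedata.name(chr(first)))  # first character's official name
```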

I stand corrected, thanks. There's a lot of info on the internet that says a written form of ASL doesn't exist, which is what I found when I Googled it.

Looking into it, it seems very much at the experimental stage in terms of digital representation -- while Unicode symbols exist, they require being placed in 2D boxes (using a drawing tool like SVG). It seems like it's only in the past few years that there have been proposals for how to turn it into a linear canonical text encoding?

Is anyone actually using those linear encodings -- SignPuddle or Formal SignWriting, they seem to be called -- in the wild, outside of demonstration texts or academia? Especially since they only date to 2016 and 2019.

Is there anywhere close at all to a corpus that Meta could train on? Because it still seems like the answer is no, but I also got my research wrong when Google gave no indication that SignWriting existed in the first place.

SignWriting has been documented at the National Deafness section in California State University, Northridge's South Library since 1968 and at Gallaudet University, Washington, D.C. since the 1950s.

That's great, but it doesn't have anything to do with the points I raised.

Which is that I can't find any indication of a linear digital encoding that has been used to any appreciable extent that Meta could train on a corpus of it.

Which is why I'm struggling to understand why you're criticizing Meta? How could they realistically train a linear text model on ASL when the necessary content doesn't appear to exist?

Most hearing-impaired people have never heard of or don't care to use SignWriting. You are right about its existence, for what it's worth.

Right, a minor point: "hearing-impaired" is ableist slang.

Instead, go by hard-of-hearing, people with hearing loss, or simply Deaf.

Also, lowercase "deaf" is barely OK when used alone as the generalized replacement above; however, "deaf" often (poorly) refers to the most severe hearing loss as determined by differing standards, while capitalized "Deaf" is a direct reference to those actively using sign language and engaging with deaf culture, whether that's American Sign Language, British Sign Language, or some 40-odd variants across different nationalities.

How would a speech model do sign language?
