Blog Post: https://ai.facebook.com/blog/multilingual-model-speech-recog...
Languages coverage: https://dl.fbaipublicfiles.com/mms/misc/language_coverage_mm...
Here's how I did that: https://gist.github.com/simonw/63aa33ec827b093f9c6a2797df950...
Here are the top 20 represented language families:
Trans-New Guinea 219
Language isolate 35
Edit: found a reference to it https://www.reddit.com/r/German/comments/ul0xgt/just_for_fun...
Some travelers stop off at an inn in the Swiss Alps, and they notice that at the various tables around the inn are people of other nationalities. They amuse themselves by listening in on the conversations (and displaying their stereotypes): the Italians are all talking at the same time very loudly and never stop moving their hands, the French are all arrogant artistes, the Danes are boring and only talk about the weather, and then they notice the Germans.
There was obviously a very important conversation going on among the Germans, because one German would say something and every other German would stop and listen intently until they were done, and then another one would start up and everyone would stop and listen intently until that one was done. But in following the conversation, it became clear it was not because the conversation was important: the Germans were simply waiting to hear the verb to understand what was being said.
It is still hopeless, and much worse than dictionary-based tools, at gender/number/declensions in general.
I sometimes use it just so I don't have to think about the grammar in some languages, and most times I end up doing a surprised double take at output that is completely inappropriate or offensive in place of my simple phrases.
Modern translation apps and GPS are godsends that make travel a million times easier. And they're free! It blows my mind. Traveling would be so much more difficult without them.
Maybe they perform better on more casual sentences and less difficult text; I haven't tried. Anyway, they are both better than nothing.
The issue with all these AI models is that there's no information on which GPU is enough for which task. I'm absolutely clueless whether a single RTX 4000 SFF with its 20GB VRAM and only 70W of max power usage would be a waste of money, or really something great to do experiments on: doing some ASR with Whisper, generating images with Stable Diffusion, loading an LLM onto it, or running this project here from Facebook.
Renting a GPU in the cloud doesn't seem to be a solution for this use case, where you just want to let something run for a couple of days and see if it's useful for something.
Granted, it's talking about quantized models, which use less memory. But you can see the 30B models taking 36 GB at 8-bit, and at least 20 GB at 4-bit.
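Those numbers track a rough rule of thumb (my own back-of-the-envelope, not from the linked page): VRAM ≈ parameter count × bytes per weight, plus context and runtime overhead. For a 30B model, 30B × 1 byte ≈ 30 GB at 8-bit (about 36 GB with overhead), and 30B × 0.5 bytes ≈ 15 GB at 4-bit (hence 20 GB or more in practice).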
The page even lists the recommended cards.
But as others have pointed out, you may get more bang "renting" as in purchasing cloud instances able to run these workloads. Buying a system costs about as much as buying instance time for one year. Theoretically, if you only run sporadic workloads when you're playing around it would cost less. If you're training... that's a different story.
This repo lists very specific VRAM usage for various LLaMA models (with group size, and accounting for the context window, which is often missing); these are all 4-bit GPTQ quantized models: https://github.com/turboderp/exllama
Note that the latest versions of llama.cpp now have decent GPU support, include a memory tester, and let you load partial models (n layers) onto your GPU. Inference is about 2X slower than exllama in my testing on an RTX 4090, but still about 6X faster than my CPU (Ryzen 5950X).
Again, this is inference. For training, pay attention to 4-bit bitsandbytes, coming soon: https://twitter.com/Tim_Dettmers/status/1657010039679512576
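To make the partial-offload idea concrete, here's a minimal sketch using the llama.cpp CLI (the model path and layer count are illustrative, not from the thread):
# build llama.cpp with cuBLAS so layers can be offloaded to the GPU
make LLAMA_CUBLAS=1
# keep 32 layers on the GPU; whatever doesn't fit stays on the CPU
./main -m models/30B/ggml-model-q4_0.bin --n-gpu-layers 32 -p "Hello"
Raise or lower --n-gpu-layers until the model fits your VRAM.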
Look into some barebones cloud GPU services, for example Lambda Labs which is significantly cheaper than AWS/GCP but offers basically nothing besides the machine with a GPU. You could even try something like Vast in which people rent out their personal GPU machines for cheap. Not something I'd use for uhhh...basically anything corporate, but for a personal project with no data security or uptime issues it would probably work great.
Let me know if you are interested, and maybe we can find time to work on it together :).
It handles storage, setup, etc. for machine learning workloads across several providers, which helps a lot if you need one of the instance types that rarely have capacity, like 8x A100 pods.
For home computer setups, you can simply walk away when you need a break.
The shutdown/stop on an instance is like closing the lid on your laptop. When you start it again, it resumes where it left off. In the meantime, the instance doesn't occupy a VM.
A caveat: you can't really do this with spot instances. You would need to sync and rebuild on start. But, again, that's easily scriptable.
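As a sketch with the AWS CLI (the instance ID is a placeholder), the stop/resume cycle is one command each way:
# stop the instance; you keep the EBS volume and stop paying for compute
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
# later, resume exactly where you left off
aws ec2 start-instances --instance-ids i-0123456789abcdef0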
I am less familiar with storing data in a db (for ml hosting concerns), but I'd imagine it would add overhead (as opposed to accessing files on disk).
You also have to deal with hosting a db and configuring the schema.
In either case you have to spend hours screwing around with your environment. If those hours result in a Dockerfile, then it's the last time. If they don't, then it's each time you want it on a new host (which, as was correctly pointed out, is a pain in the ass).
Storing data in a database vs in files on disk is like application development 101 and is pretty much a required skill period. It's required that you learn how to do this because almost all applications revolve around storing some kind of state and, as was noted, you can't reasonably expect it to persist on the app server without additional ops headaches.
Many people will host dbs for you without you having to think about it. Schema is only required if you use a structured db (which is advisable) but it doesn't take that long.
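As a sketch of how little ceremony a schema can be (the table and columns here are made up for illustration, and sqlite3 stands in for whatever hosted db you'd actually use):
# create a table, insert a row, and read it back
sqlite3 app_state.db "CREATE TABLE IF NOT EXISTS runs (id INTEGER PRIMARY KEY, model TEXT, wer REAL);"
sqlite3 app_state.db "INSERT INTO runs (model, wer) VALUES ('mms-1b', 0.21);"
sqlite3 app_state.db "SELECT * FROM runs;"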
It's a similar situation for most apps/services/startup ideas: you don't necessarily need a planet scale solution in the beginning. Containers are great and solve lots of problems, but they are not a panacea and come with their own drawbacks. Anecdotally, I personally wanted to make a small local 3 node Kubernetes cluster at one time on my beefy hypervisor. By the time I learned the ins and outs of Kubernetes networking, I lost momentum. It also didn't end up giving me what I wanted out of it. Educational, sure, but in the end not useful to me.
It's OK until you're dealing with, say, 130GiB of tensors, which is effectively a binary blob that needs to be mostly in VRAM somehow.
I really don't want to read 130GiB of blobs from a database all the time.
Sorry, I don't see how containers will help here.
Those AMD ROCm containers are like 14GiB compressed.
Like, imagine setting up and installing everything with the GPU attached, but when you're not using the GPU or all the CPU cores, you can disconnect them.
If you have docs on how to do this, please let me know.
AWS also provides accessible datasets of training data:
These are basically the building blocks.
They must offer distributed storage that can accommodate massive models, though? How else would you have multiple GPUs working on training a single model?
Edit: on Lambda Labs, the only exception seems to be the H100; payback would take 1.5 years or so, but even 2 years would still be fast enough. I have an A100 which has paid for itself; thinking of getting another one.
Someone could buy an H100 to run the biggest and bestest stuff right now, but we could find that a model gets shrunk down to run on a consumer card within a year or two with equivalent performance.
I suppose it makes sense if someone wants to be on the bleeding edge all the time.
Which leads you to what hardware to get. The best bang for the $ right now is definitely a used 3090 at ~$700. If you want more than 24GB VRAM, just rent the hardware, as it will be cheaper.
If you're not willing to drop $700, don't buy anything; just rent. I have had decent luck with vast.ai.
If your goal is to learn ML, don't tinker with very obsolete hardware. Rent or buy something modern.
TTS: "Text to Speech"
LID: "Language Identification"
In case anyone else was confused about what the acronyms mean.
Someone should write that down in some sort of short story involving a really tall structure.
My point was that languages evolve, even though we write books and make dictionaries with fixed-in-time vocabulary lists. Similarly, languages will evolve even if LLMs, for mysterious reasons, have a fixed-in-time vocabulary. I was responding to "and it locked their vocabulary in time, disallowing any further language blending".
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
But whenever I try using these complicated ML models, it's usually an exercise in futility and endless mucking around with conda and other nonsense. It ends up not being worth it and I just move on. But it does feel like it doesn't need to be like this.
Overall I got pretty poor results in English and French; I guess it would require some fine-tuning.
Also, I used the following to play the sound after it is generated:
from IPython.display import Audio, display
display(Audio("outputs/eng.wav"))  # play the generated wav (path matches the TTS command below)
conda env create -f environment.yml
conda activate fairseq
python setup.py build_ext --inplace
PYTHONPATH=$PYTHONPATH:path/to/vits python examples/mms/tts/infer.py --model-dir checkpoints/eng --wav outputs/eng.wav --txt "As easy as pie"
Usually, some hero releases a friendly install system within a few days, though.
Usually, it needs a brave soul to reverse engineer what exact versions of dependencies are needed and make a Colab or Dockerfile that puts it all together.
As a human, you need to read https://pytorch.org/get-started/locally/ and install the correct version, depending on your pytorch-version/os/packaging-system/hardware-platform combo. It's bad. It's several conflicting index-urls messing with your requirements. It also doesn't work if you pick the wrong thing.
Or we need something like a "setup.py" that is Turing-complete, to dynamically pick the correct dependency for you.
`pip install -r requirements.txt` is not enough.
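As an illustration, the selector at pytorch.org ends up handing you a platform-specific command along these lines (the CUDA version here is just an example; yours will differ):
# CUDA 11.8 build of the pytorch stack, from PyTorch's own package index
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Pick the wrong index URL for your hardware and you silently get a CPU-only or broken build, which is exactly why a bare requirements.txt can't capture it.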
And the issue with a live demo is that these are resource-intensive, they're not just webpages. It's an entire project to figure out how to host them, scale them to handle peaks, pay for them, implement rate-limiting, and so forth.
For the intended audience, download-and-run-it doesn't seem like an issue at all. I don't see how any questions are going to be answered by a video.
Don't get me wrong: they published this, it took a ton of work, and they didn't have to do it. But it's ultimately a form of gatekeeping that seems to come straight out of academia. And honestly, that part of academia sucks.
A good product team is probably around a dozen people minimum? Especially if you need to hit the quality bar expected of a release from a BigCorp. You've got frontend, UX, and server components to design, in addition to the research part of the project. The last real app I worked on also included an app backend (i.e., local db and web API access, separate from the display+UX logic) and a product team. Oh yeah, also testing+QA, logging, and data analysis.
And after all that investment, God help you if you ever decide the headcount costs more than keeping the lights on, and you discontinue the project...
Public app releases are incredibly expensive, in other words, and throwing a model on GitHub is cheap.
I do most of my research reading on my phone. I want to be able to understand things without breaking out a laptop.
Sure, release a PDF (some people like those), but having an additional responsive web page version of a paper makes research much more readable to the majority of content consumption devices. It's 2023.
I'll generally use https://www.arxiv-vanity.com/ to generate those but that doesn't work with this specific paper since it's not hosted on arXiv.
Sometimes users have needs that may seem superfluous and beg for a snarky reply, but there are often important reasons behind them, even though they may not be actionable.
I'd pay $x000 for an app that does some sort of intelligent pdf-to-epub conversion that doesn't require human-in-the-loop management/checking.
Interesting that there isn’t a common solution for this. I guess it’s rather niche?
Religious recordings tend to be liturgical, so even the pronunciation might be different from the everyday language. They do address something related, although more from a vocabulary perspective, to my understanding.
So one of their stated goals, to enable people to talk to AI in their preferred language, might be closer, but it is certainly a stretch to achieve with their chosen dataset.
> These translations have publicly available audio recordings of people reading these texts in different languages. As part of the MMS project, we created a dataset of readings of the New Testament in more than 1,100 languages, which provided on average 32 hours of data per language. By considering unlabeled recordings of various other Christian religious readings, we increased the number of languages available to more than 4,000. While this data is from a specific domain and is often read by male speakers, our analysis shows that our models perform equally well for male and female voices. And while the content of the audio recordings is religious, our analysis shows that this doesn't bias the model to produce more religious language.
> And while the content of the audio recordings is religious, our analysis shows that this doesn't bias the model to produce more religious language.
> This kind of technology could be used for VR and AR applications in a person's preferred language and that can understand everyone's voice.
> Collecting audio data for thousands of languages was our first challenge because the largest existing speech datasets cover 100 languages at most. To overcome this, we turned to religious texts, such as the Bible, that have been translated in many different languages and whose translations have been widely studied for text-based language translation research.
I think the choice was just between having any data at all, or not being able to support that language.
Awesome-legal-nlp links to benchmarks like LexGLUE and FairLex but not yet LegalBench; in re: AI alignment and ethics / regional law
A "who hath done it" exercise:
[For each of these things, tell me whether God, Others, or You did it:] https://twitter.com/westurner/status/1641842843973976082?
"Did God do this?"
"About the Universal Declaration of Human Rights Translation Project"
> At present, there are 555 different translations available in HTML and/or PDF format.
E.g. Buddhist scriptures are also multiply translated; probably with more coverage in East Asian languages.
Thomas Jefferson, who wrote the US Declaration of Independence, had read into Transcendental Buddhism and FWIU is thus significantly responsible for the religious (and nonreligious) freedom We appreciate in the United States today.
This is a great project and an important stepping stone in a multilingual AI future.
The cadence and intonation sounded a little weird, but I suspect fine-tuning can improve that by a lot. I am really excited to see some low-resource languages finally get any mainstream TTS support at all.
Edit: Just checked the paper; it seems to be worse, but feel free to correct me.
I feel like they should've just taken the Whisper architecture, scaled it, and scaled the dataset as they did.
 Page: https://i.imgur.com/bq15Tno.png
 Paper: https://scontent.fcai19-5.fna.fbcdn.net/v/t39.8562-6/3488279...
Huge bummer. Prevents almost everyone from using this and recouping their costs.
I suppose motivated teams could reproduce the paper in a clean room, but that might also be subject to patents.
The code, on the other hand, would definitely be copyrighted and would need a clean-room implementation, as you said. The community could pool its resources and do it once, and license it under the AGPL to keep it and further improvements available to everyone.
We should also have a fund for particularly tricky inventions and copyrights that would greatly benefit mankind.
Say someone or some institution writes a good school book. We can just buy the book for whatever we think is reasonable. If they want to publish the same book the next year with the chapters shuffled around, they could be entitled to a tiny fraction of the previous sum, or it could be denied for being spam. Whether this bankrupts the company is of very little interest.
(tangentially, like another comment briefly mentioned, are models actually copyrightable? Programs are, because they are human authored creative works.)
All we can do is take, take, take the code. But this time, the code's license is CC-BY-NC 4.0. Which simply means:
Take it, but no grifting allowed.
"half of the languages spoken today have fewer than 10,000 speakers and that a quarter have fewer than 1,000 speakers" (https://en.wikipedia.org/wiki/Language_death).
"Today, on average, we lose one language in the world every six weeks. There are approximately 6800 languages. But four percent of the population speaks 96 percent of the languages, and 96 percent of the population speaks four percent of the languages. These four percent are spoken by large language groups and are therefore not at risk. But 96 percent of the languages we know are more or less at risk. You have to treat them like extinct species." (https://en.wikipedia.org/wiki/Language_preservation).
"Over the past century alone, around 400 languages – about one every three months – have gone extinct, and most linguists estimate that 50% of the world’s remaining 6,500 languages will be gone by the end of this century (some put that figure as high as 90%, however). Today, the top ten languages in the world claim around half of the world’s population." (https://www.bbc.com/future/article/20140606-why-we-must-save...).
File "fairseq/data/data_utils_fast.pyx", line 30, in fairseq.data.data_utils_fast.batch_by_size_vec
assert max_tokens <= 0 or np.max(num_tokens_vec) <= max_tokens, (
AssertionError: Sentences lengths should not exceed max_tokens=4000000
Traceback (most recent call last):
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 52, in <module>
File "/home/xxx/fairseq/examples/mms/asr/infer/mms_infer.py", line 44, in process
As a comparison, the GGML port of Whisper (OpenAI's equivalent) runs in the browser via WASM: https://whisper.ggerganov.com/
"Himachal Pradesh state:"
This is obviously wrong, so I don't know what else is wrong.
On the other hand, it definitely underscores:
- How blatantly companies exploit the "open source" concept with no remorse or even tacit acknowledgement; for many startups, open source is a great way to buy some goodwill and get some customers, then once they have what they want, close the doors and start asking for rent money. Nothing wrong with doing closed source or commercial software, but would OpenAI still have gained relevance if they hadn't started the way they did with the name they did?
- How little anyone gives a shit; we all watched it happen, but apparently nobody really cares enough. Investors, customers, the general public, apparently all is fair in making money.
I'm not suggesting OpenAI is especially evil, definitely not. In fact, the most depressing thing is that this sort of bait and switch is so commonplace and accepted now that it wasn't news or even interesting. It's just what we expect. Anything for a dollar.
But maybe I still seem like I'm just being whiny. Okay, fair enough. But look at what's happening now; OpenAI wants heavy regulation on AI, particularly they want to curtail and probably just ban open source models, through whatever proxy necessary, using whatever tactics are needed to scare people into it. They may or may not get what they want, but I'm going to guess that if they do get it, ~nobody will care, and OpenAI will be raking in record profits while open source AI technology gets pushed underground.
Oh I'd love to be wrong, but then again, it's not like there's anything particularly novel about this strategy. It's basically textbook at this point.
Sure, it may work sometimes with the goodwill of someone who cares, but 90% of OSS code is a dead portfolio the authors built with the hope of landing a tech job or skipping some algorithm questions.
Sure, OSS allows people to experiment with crap for free (even though it's mostly big corps benefiting from OSS), but what about the negative effects OSS produces on small businesses?
How many developers could spend their lives maintaining small parts of software instead of working in soulless corporations, if giving away your code for free (and without maintenance) weren't so common?
How much better would this code be compared to the wasteland of OSS projects? How much more secure could the entire ecosystem be?
How many poor developers are working for free just in the hope of getting a job someday?
We need to stop dreaming the OSS dream and start making things fairer for the developers involved.
There is no "OSS dream" anymore—today, there is an OSS reality. We have some open source stuff that objectively works: there are business models that are more or less proven, at least as proven as any internet or software-oriented business model, and plenty of highly successful projects that deliver immense value to the world.
Then again, some of it doesn't seem to work, and there are a lot of unknowns about how it works, what the dynamics will be, etc. But, if we're to call open source into question, we shouldn't forget to call proprietary software into question, too. Proprietary software has many seemingly-endemic issues that are hard to mitigate, and the business model has been shifting as of late. Software is now sold more as a subscription and a service than it is a product. The old business model of boxed software, it seems, has proven unsustainable for many participants.
The main issue open source seems to have is really funding. It works well when open source software acts as a complement to some other commercial business, but it works poorly when the software itself is where all the value is. After all, if you, for example, just host an open source piece of software in exchange for money, you're effectively competing in the highly competitive web hosting business. It can work, since you can obviously provide some value, but it's a precarious position. Thus very few companies really have a lot of money to put into the Linux desktop, at least not if you want to compare it to Windows or macOS. It's complementary for some of them, who use it as a developer workstation or something, but there are only a couple of companies that I think genuinely have a good model. System76 is definitely one of them, to be explicit about it.
But rather than give up entirely, I propose something else: we should literally fund open source collectively. Obviously a lot of us already do: you can donate to projects like Blender or Krita monetarily, and you can donate your time and code as I'm sure many of us also do for random projects. But also, I think that open source should get (more) public funding. These projects arguably end up adding more value to the world than you put in them, and I think governments should take notice.
Of course in some cases this has already taken shape for one reason or another. Consider Ghidra. Clearly released for the benefit of NSA's PR, but wait, why not release Ghidra? It's immensely useful, especially given that even at a fraction of the functionality of Hex Rays products, it's still extremely useful for many parties who could simply never afford the annual cost of maintaining IDA Pro licenses, especially now that it is only available as a subscription.
The way I see it, software and computers in general are still moving quite fast even though it's clearly stagnating compared to where it once was. But, as things stabilize, software will simply need to be updated less, because we simply aren't going to have reasons to. As it is, old versions of Photoshop are already perfectly serviceable for many jobs. And at that point, we're only going to need one decent open source release for some given category of work. Things will need occasional improvements and updates, but c'mon, there's not an unlimited amount of potential to squeeze out of e.g. an image editor, any more than you can a table saw or a hammer. At some point you hit the point of diminishing returns, and I think we're nearing it in some places, hence why the switch to subscription models is necessary to sustain software businesses.
It's a myth that open source is held up by poor developers starving for their ideology. I'm sure a lot of that exists, but a lot of open source is also side projects, work subsidized by companies for one reason or another, projects with healthy revenue streams or donations, etc.
If OpenAI gives their stuff away, they lose customers. If Meta does it, they can build a community around it and make a joint effort at improving tools that they'll then use for their internal products.
OpenAI is the modern (and most likely very short-lived) Microsoft of the AI space, while Meta tries to replicate Linux in the AI space.
I didn’t think so.
And of course, what if you could also create an entire corpus of training material in your written ASL?
I suspect Meta/Facebook doesn't have a lot of content to work off of. I've only been able to find community-generated examples of SignWriting on the official website, and none of those seem to be using Unicode characters. MMS is an audio-to-text tool, so it seems unlikely that it can be trivially expanded to take in visual data (pictures of text or video of ASL being performed).
I suspect the process of turning viewed ASL into SignWriting text will be very difficult to automate. I would not be surprised if such a project would either use a different textual encoding or directly translate out to English (which also sounds terribly hard, but these LLM advances recently have surprised me).
Looking into it, it seems very much at the experimental stage in terms of digital representation -- while Unicode symbols exist, they require being placed in 2D boxes (using a drawing tool like SVG). It seems like it's only in the past few years that there have been proposals for how to turn it into a linear canonical text encoding?
Is anyone actually using those linear encodings -- SignPuddle or Formal SignWriting, they seem to be called -- in the wild, outside of demonstration texts or academia? Especially since they only date to 2016 and 2019.
Is there anything close at all to a corpus that Meta could train on? Because it still seems like the answer is no, but I also got my research wrong when Google gave no indication that SignWriting existed in the first place.
Which is that I can't find any indication of a linear digital encoding that has been used to any appreciable extent that Meta could train on a corpus of it.
Which is why I'm struggling to understand why you're criticizing Meta? How could they realistically train a linear text model on ASL when the necessary content doesn't appear to exist?
Instead, go with hard-of-hearing, people with hearing loss, or simply Deaf.
Also "deaf" is barely ok when use alone for the above generalized replacement however "deaf" often poorly is refers to most severe of hearing-loss as determined by differing standards of hearing loss, but capitalized "D"eaf is a direct reference to those of actively using sign language and engage within deaf culture, whether it would be American Sign Language, British Sign Language, or some 40-odd variants and different nationalities.