Hacker News
ElevenLabs Launches Voice Translation Tool to Break Down Language Barriers (elevenlabs.io)
129 points by beriboy on Oct 10, 2023 | hide | past | favorite | 51 comments



Okay, here are my thoughts: it's sorta bad, but sometimes it's pretty goddamn good.

What bothered me the most is that when it transfers the "interpretation", it often puts the intonation and the stress on the wrong part of the sentence; it totally misses it. So it's just plain weird to watch someone talking like that. But occasionally it puts the word stress on exactly the right part of the sentence, and it sounds like it was dubbed by a human. Those moments aren't many, but they are there, and they give you an idea of what this technology could look like at its best.


I tried it on a video I made a few months ago. Here’s the original, in which I speak English:

https://www.youtube.com/watch?v=8JUepj7wIl0

And here are the first two and a half minutes with me dubbed into Japanese:

https://www.youtube.com/watch?v=z85FXwH6pUU

Aside from some pacing problems, the voice and intonation sound quite good to me. But anyone listening to the translation alone would find it hard to follow, as the tool seems to have mistranscribed some of my English and mistranslated some of the rest. (I understand Japanese.)

The dubbing would probably be a lot better if I had spoken English more slowly and with awareness that what I said would be interpreted into another language.


I think it did pretty well. No offense, as your original video is still really great for a native English speaker (me) to listen to, but the English speech varies a lot in pace. Sometimes you speak incredibly quickly, at such a pace that you even miss some syllables or blend them together; for example, at https://youtu.be/8JUepj7wIl0?t=194 it sounds like you said "technogial" or "technodule" rather than "tech-no-log-ical". So I can understand that a machine may make mistakes on this; don't worry, as a Kiwi I mumble and do this too, ha ha.

But the Japanese result is still very, very impressive, even though I haven't spoken it since I was 15 and over there on exchange (and I stopped learning once I realised they romanise everything; why label a bottle appuru jyusu when you have RINGO SHIRU!).


I gave it a try. Comment if you want me to try some other videos or target languages.

Elon Musk on Joe Rogan, English to Finnish: https://www.youtube.com/watch?v=OOkOafhAsDE

I'm a native Finnish speaker, and while it starts off a bit strange, by the mid to latter part it is really incredible; it's like an alternative universe where they were somehow born Finnish.

Putin talking about AI, Russian to English: https://www.youtube.com/watch?v=YSkjQJcqaFo

Paul Graham interview "What are some common mistakes founders make?" from English to Japanese: https://www.youtube.com/watch?v=Rq_G0VXJ1xk

This one didn't work out so well. While some parts are decent, others sound like nonsense. But it's super cool that it included pg blowing into the mic in the translation!

Pulp Fiction to Finnish: https://www.youtube.com/watch?v=mdLChf-0pFE

Half of the time it sounds like a professional dub, but sometimes it gets confused.


Translating the Paul Graham video to Croatian works surprisingly well.

(As a non-native speaker) several bits of the Japanese transcription were just Japanese-sounding noise, and the prosody was completely all over the place in the bits that were actually Japanese. The dub does get across most of the things being talked about, but I wouldn't consider it an acceptable dub (even for most of the bits where it is actually in Japanese).

I wonder if the reason is that the dubbing works better between more similar languages and struggles when dealing with disparate ones. I don't speak Chinese or Korean, but it would be interesting to see how good their dubs are.

Can you do another video into Japanese, to see if it's just an issue with the Paul Graham video? Also, how about translating from another language to English?


Here's another example:

Steve Jobs on death | Walter Isaacson and Lex Fridman (EN->JP): https://www.youtube.com/watch?v=6YiJLDhcWSI

Maybe it does better on longer videos, but as someone who is fairly well-versed in Japanese this one was not a good result either. There is also a bit of wonkiness in this one, where in the dub Lex says some of Walter's lines and vice-versa, and at one point both of them repeat verbatim the same exact translated phrase which doesn't correspond to anything said in the original.


Good idea, added a Putin video translated from Russian: https://www.youtube.com/watch?v=YSkjQJcqaFo


I love the slow whispered "Thaaank Yooooouuuu" at :10.


Translating the Musk video to French is weird. What's interesting is that sometimes the protagonists speak with a Quebec accent, and sometimes with the kind of old-school voice you'd hear in an old French movie. There's a laugh in the sequence, which is translated with, again, audio quality that belongs in an old French movie.


Here's Boris Johnson speaking German. It really does sound like his voice:

https://youtu.be/J4nxHqblzKA

I wonder if they're breaking YouTube's (and others') TOS by downloading the videos to work their AI magic on them?


Back to the Future to Finnish: https://www.youtube.com/watch?v=fvvqUv6YYhc

I laughed because the voices are so out there, and then it made two little funny mistakes that caught me off guard when I was already laughing, so I could only switch to crying at that point.


No matter what I try, even a fresh file from Audacity gives a "Watermark not allowed for audio input." error.


I had the same problem. Then I watched the video and saw that the upload was a video file, not an audio file, so I uploaded mine as an MP4 and it worked. Seems like some sort of weird bug.


Doesn't seem like a bug at all.

They probably need to watermark the output video to clarify that the video is not original and has been 'tampered' with.

The software is warning the user that it (rightly) refuses to watermark the video ... of an audio file.


The error message doesn't provide any insight into whether that's the case. On the contrary, the upload box says "Audio or Video File, up to 100MB", which implies they aren't accepting only video input. While I agree with your suggestion that the reason they seem to accept only video files is the watermarking of the output, the upload box still claims to accept an audio file. They should say "We only accept video file input". It's possible they accept audio-only files on higher-level paid plans, where you can uncheck the watermark box in the Advanced options.


One giant step forward towards peace on earth.

The day every Russian or Chinese can listen to English-speaking news in their native tongue will have a massive impact on how tight a leash their local thuggish governments can keep on them.

Conversely, the day every American finally has access to non US-controlled world news automatically translated to English will be a giant step towards the US finally understanding how the rest of the world functions, and perhaps they'll finally become capable of deciding democratically to stop meddling in other people's affairs.


I'm not sure. It will definitely improve things, but insofar as the informational-warfare part of the Republican party in the US has shown the rest of the world anything, it's that both Democrats and Republicans speak English, and therefore the real barrier seems to be ideological rather than communicative.

The only real solution is to remove our human/animal nature; a lot of people only care about something if it directly affects them. A lot of hot topic issues in the US at the moment seem to circle around controlling others (abortion, gay rights) and the freedom to do what you want even if it affects others (guns, owning an oversized vehicle).


Perhaps I’m a bit too cynical about my own country, but I don’t see a day in the near future where the average American even cares about factually correct news instead of ideologically validating news, let alone goes out of their way to translate foreign news to better understand the world. The people who care enough to do that are already doing it with existing technology. Making it more accessible probably won’t change too much, unfortunately.


Unfortunately, flag-carrying is hard-coded into our firmware. :/

It does make me laugh to think of some fun places we can stick USB cables to initiate a DFU.


I just tested it with a video of an introductory Greek language lesson. Quite a lot of those examples actually did not work, as the tool seemed to have a problem with words being spoken intentionally slowly and understandably, in a way that, while helpful to people who don't know the language, is not natural to it.

But the parts that didn't work were actually sometimes really funny, and the voice sounded more drunk than I would have imagined.


I want to convert articles and ebooks into audiobooks. What’s the best way to do this? Is there anything locally I can use?


I guess I will live to see an iBabelfish… we will finally overcome JHWH's Babylonian curse.


Is there any voice color/timbre change service that can be used for music production? Some sort of system that preserves the phonemes, the pitch and the timing of the input speech while changing the timbre.


RVC WebUI / VC Client? Locally fine-tunable and runs at real time.

1: https://github.com/w-okada/voice-changer


Logic Pro has a plugin for this. It's been years since I used it, but I think it was called a "formant shifter" or something to that effect.


This sounds extremely exciting. But on second thought, I fear that with this technology we'll get too lazy to actually learn a foreign language, which is also awesome.


The vast majority of people are already too lazy to learn a second language... and those with the willpower to do so won't be stopped by this.


I mean, the Internet already brought us all so much closer together than ever before, and technology like this will bring us even closer. If we can do better or perfect translations, it means we could be interacting with people in English who are native Spanish speakers, for example, without even realising it (on the Internet, at least).

And it's tough, because if you like travelling, it's not really feasible to learn every single language for every place. But in the future, if I can pop in an earbud and get a realtime translation, and they can do the same, I'd probably speak to many more strangers when I visit non-English-speaking countries. It'll be so great.


Tried it on a math video, and it was surprisingly good at getting the speaker's tone and the overall content correct.

Where it struggled a bit were terms like "X to the power of 7". Sometimes it chose the correct term and sometimes a more literal translation.

It had difficulty choosing a consistent translation of "you" (informal "du" or formal "Sie") and translated the "it's" variant incorrectly.


I have been using dubbah, which allows you to edit the translation as needed and regenerate individual lines of dialogue.


I've found their system tries to guess the language of the text, but many languages share the same words, so it will speak in the wrong language (or even the wrong accent). I hope they come up with a solution for this; otherwise I don't think it will be suitable for production use cases.


It autodetects the language by default, but you can set it to a specific one. Though you'd still have that problem with a multilingual input video.
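The ambiguity described above is easy to reproduce with a toy detector (this is purely illustrative; the stopword sets and scoring are made up for the sketch and have nothing to do with ElevenLabs' actual system): a short input made of words shared across languages simply cannot be scored decisively.

```python
# Toy sketch of naive language detection by counting common function
# words. Words that exist in several languages (here "is" and "in",
# shared by English and Dutch) produce ties on short inputs, which is
# the wrong-language/wrong-accent failure mode described above.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "nl": {"de", "en", "is", "van", "het", "in"},
    "de": {"der", "und", "ist", "von", "das", "in"},
}

def detect_scores(text: str) -> dict:
    """Score each language by how many of its stopwords appear."""
    words = set(text.lower().split())
    return {lang: len(words & sw) for lang, sw in STOPWORDS.items()}

# "is in" matches two stopwords in both English and Dutch: a dead tie.
print(detect_scores("is in"))  # {'en': 2, 'nl': 2, 'de': 1}
```

Real detectors use character n-grams and much larger models, but the underlying problem is the same: a short clip of shared vocabulary gives no signal to break the tie.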



What's the best "offline" TTS? 11Labs is great, but I want to run it locally.


I use Silero for TTS, and Whisper (open source, local) for speech-to-text/translation.


Where are good examples of Silero TTS voices / outputs?


Last I looked into it, tortoise-tts.


How does it handle translating copyrighted material? Disclaimer buried somewhere?


I saw that Spotify recently released a similar feature for selected podcasts. Could anyone recommend some recent research papers related to voice translation? I'd be really interested in reading them!



I wonder what they use for the translation part.


Assuming DeepL... not sure why they'd build NMT in house.


DeepL is very limited in context, only looking at the immediately surrounding sentences. Additionally, when you're trying to optimize for a more concise or more verbose translation, you really need a generalist model.


The transformer model was invented to attend to context over the entire sequence length. Look at how the original authors used the Transformer for NMT in the Vaswani et al. publication: https://github.com/jadore801120/attention-is-all-you-need-py... See how the source sentence is run through the encoder, and then the decoder is run in auto-regressive mode. It's been a while, so I can't recall whether the encoder embeddings are masked to the same index as the decoder input sequence, and in theory I don't see why you would even need to do this particular task auto-regressively. I found this survey, https://ar5iv.labs.arxiv.org/html/2204.09269, which outlines the pros and cons. A cool feature of transformers is their ability to pull in a great deal of surrounding context; look at how BERT was trained. Yeah, things are a bit more boring with decoder-only GPT types, but they are winning the day with their easy scaling.
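The auto-regressive decoding mentioned above can be sketched in a few lines. The "model" here is a hand-written lookup table standing in for a trained Transformer decoder (the German tokens and the table itself are made up for illustration); the point is only the control flow: each step feeds everything emitted so far back in and takes the most likely next token.

```python
# Minimal sketch of greedy auto-regressive decoding, the mode the
# Transformer decoder runs in for NMT. TOY_NEXT_TOKEN is a fake
# "model": it maps the prefix decoded so far to the next token.
TOY_NEXT_TOKEN = {
    ("<s>",): "guten",
    ("<s>", "guten"): "tag",
    ("<s>", "guten", "tag"): "</s>",
}

def greedy_decode(model, max_len=10):
    """Emit one token per step until the end-of-sequence marker."""
    tokens = ["<s>"]
    for _ in range(max_len):
        nxt = model[tuple(tokens)]   # "most likely" next token
        if nxt == "</s>":
            break
        tokens.append(nxt)           # feed it back in next step
    return tokens[1:]                # drop the start marker

print(greedy_decode(TOY_NEXT_TOKEN))  # ['guten', 'tag']
```

A real decoder replaces the table lookup with a forward pass over the encoder output plus the decoded prefix, and beam search usually replaces the greedy argmax, but the step-by-step loop is the same.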


Why not just use GPT-4?


Nice! Anybody that's still in the waiting-list phase should get their money pulled by their VCs.


This is like magic, wow.


Genuinely impressive when used as intended. Spectacular when not: English-to-English translation.

https://twitter.com/jonathanfly/status/1711607166371561805


That is hilarious, thank you.


I remember when this guy[1] launched TranslateAudio a few months ago. For those unfamiliar, it was the first tool to launch voice translation. It was built by an indie hacker.

[1] - https://twitter.com/mehulsharmamat

[2] - https://www.translateaudio.com/


Their terms of service confused and creeped me out back when I looked into these people.



