Okay, here are my thoughts: It's sorta bad, but sometimes it's pretty goddamn good.
What bothered me the most is that when it transfers the "interpretation", it often puts the intonation and stress on the wrong part of the sentence; it totally misses it. So it's just plain weird to see someone talking like that. But occasionally there are moments when it puts the word stress on exactly the right part of the sentence, and it sounds like it was dubbed by a human. There aren't many of those moments, but they're there, and they give you an idea of what this technology could look like in its best version.
Aside from some pacing problems, the voice and intonation sound quite good to me. But anyone listening to the translation alone would find it hard to follow, as the tool seems to have mistranscribed some of my English and mistranslated some of the rest. (I understand Japanese.)
The dubbing would probably be a lot better if I had spoken English more slowly and with awareness that what I said would be interpreted into another language.
I think it did pretty well. No offense, as your original video is still really great for a native English speaker (me) to listen to, but your English speech varies in pace a lot. Sometimes you speak incredibly quickly, at such a pace that you even miss some syllables or blend them together; for example, at https://youtu.be/8JUepj7wIl0?t=194 it sounds like you said "technogial" or "technodule" rather than "tech-no-log-ical". So I can understand a machine making mistakes on this; don't worry, as a Kiwi I mumble and do this too, ha ha.
But the Japanese result is still very, very impressive, even though I haven't spoken the language since I was 15 and over there on exchange (I stopped learning once I realised they romanise everything; why label a bottle appuru jyusu when you have RINGO SHIRU!).
I'm a native Finnish speaker, and while it starts off a bit strange, by the mid-to-latter part it's really incredible; it's like an alternate universe where they were somehow born Finnish.
This one didn't work out so well. While some parts are decent, others sound like nonsense. But it's super cool that it included pg blowing into the mic in the translation!
Translating the Paul Graham video to Croatian works surprisingly well.
(As a non-native speaker) several bits of the Japanese transcription were just Japanese-sounding noise, and the prosody was completely all over the place in the bits that were actually Japanese. The dub does get across most of the things being talked about, but I wouldn't consider it an acceptable dub (even for most of the bits where it is actually in Japanese).
I wonder if the reason is that the dubbing works better between more similar languages and struggles when dealing with disparate ones. I don't speak Chinese or Korean, but it would be interesting to see how good their dubs are.
Can you do another video into Japanese, to see if it's just an issue with the Paul Graham video? Also, how about translating from another language to English?
Maybe it does better on longer videos, but as someone who is fairly well-versed in Japanese this one was not a good result either. There is also a bit of wonkiness in this one, where in the dub Lex says some of Walter's lines and vice-versa, and at one point both of them repeat verbatim the same exact translated phrase which doesn't correspond to anything said in the original.
Translating the Musk video to French is weird. What's interesting is that sometimes the protagonists speak with a Quebec accent, and sometimes with the kind of old-school voice from some old French movie.
There's a laugh in the sequence, which is translated with, again, audio quality that belongs in an old French movie.
I laughed because the voices are so out there, and then it made two little funny mistakes that caught me off guard when I was already laughing, so I could only switch to crying at that point.
I had the same problem. Then I watched the video and saw that the upload is a video file, not an audio file, so I uploaded it as an MP4 and it worked. Seems like some sort of weird bug.
The error message doesn't provide any insight into whether that's the case. On the contrary, the upload box says "Audio or Video File, up to 100MB", which implies they aren't accepting only video input. While I agree with your suggestion that they may only accept video files so they can watermark the output, the upload box does seem to accept an audio file. If so, they should say "We only accept video file input". It's possible they accept audio-only files on higher-level paid plans, where you can uncheck the watermark box in the Advanced options.
The day every Russian or Chinese can listen to English-speaking news in their native tongue will have a massive impact on how tight a leash their local thuggish governments can keep on them.
Conversely, the day every American finally has access to non US-controlled world news automatically translated to English will be a giant step towards the US finally understanding how the rest of the world functions, and perhaps they'll finally become capable of deciding democratically to stop meddling in other people's affairs.
I'm not sure. It will definitely improve things, but as the information-warfare wing of the Republican party in the US has shown the rest of us, both Democrats and Republicans speak English, so the real barrier seems to be ideological rather than a matter of communication.
The only real solution is to remove our human/animal nature; a lot of people only care about something if it directly affects them. A lot of hot topic issues in the US at the moment seem to circle around controlling others (abortion, gay rights) and the freedom to do what you want even if it affects others (guns, owning an oversized vehicle).
Perhaps I’m a bit too cynical about my own country, but I don’t see a day in the near future where the average American even cares about factually correct news instead of ideologically validating news, let alone goes out of their way to translate foreign news to better understand the world. The people who care enough to do that are already doing it with existing technology. Making it more accessible probably won’t change too much, unfortunately.
I just tested it with a video of an introductory Greek language lesson. Quite a lot of the examples actually did not work, as the tool seemed to have a problem with words spoken intentionally clearly and slowly, in a way that is helpful to people who don't know the language but isn't natural to it.
But the things not working were actually sometimes really funny and the voice sounded more drunk than I would have imagined.
Is there any voice color/timbre change service that can be used for music production? Some sort of system that preserves the phonemes, the pitch and the timing of the input speech while changing the timbre.
This sounds extremely exciting. But on second thought I fear with this technology we'll be getting too lazy to actually learn a foreign language - which is also awesome.
I mean the Internet brought us all so much closer together than ever before; technology like this will make us even closer, if we can do better/perfect translations it means that we could be interacting with people in English who are native Spanish speakers, for example, without even realising (on the Internet, at least).
And it's tough, because if you like travelling, it's not really feasible to learn every single language for every place...but in the future if I can pop an earbud in & get a realtime translation & they can do the same, I'd probably speak to so many more strangers when I visit non English speaking countries. It'll be so great.
I've found their system tries to guess the language of the text, but many languages share the same words, so it will speak in the wrong language (or even the wrong accent). I hope they come up with a solution for this; otherwise I don't think it will be suitable for production use cases.
I saw that Spotify recently released a similar feature for selected podcasts, could anyone recommend some recent research papers related to voice translation? Would be really interested in reading!
DeepL is very limited in context, only looking at a small surrounding sentence. Additionally, when you're trying to optimize for a more concise or verbose translation, you really need a generalist model.
The transformer model was invented to attend to context over the entire sequence length. Look at how the original authors used the Transformer for NMT in the original Vaswani et al publication. https://github.com/jadore801120/attention-is-all-you-need-py...
See how the source sentence is run through the encoder, and then the decoder is run in an auto-regressive mode. It's been a while, so I can't recall if the encoder embeddings are masked to the same index as the decoder input sequence, and I don't see why you would even need to do this particular task auto-regressively in theory. Found this survey, https://ar5iv.labs.arxiv.org/html/2204.09269 which outlines the pros and cons. A cool feature of transformers is their ability to pull in a great deal of surrounding context; look at how BERT was trained. Yeah, things are a bit more boring with decoder-only GPT types, but they are winning the day with their easy scaling.
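To make the "attend to context over the entire sequence" point concrete, here's a toy sketch of scaled dot-product attention in plain Python (stdlib only, not a real NMT model). Every query position weights every key position; adding a causal mask, where position i can only see positions ≤ i, is exactly what makes a decoder auto-regressive:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(Q, K, V, causal=False):
    """Scaled dot-product attention; Q, K, V are lists of vectors.
    With causal=False every position attends to the full sequence;
    with causal=True position i only attends to positions <= i."""
    d_k = len(K[0])
    out, all_w = [], []
    for i, q in enumerate(Q):
        scores = [sum(qa * ka for qa, ka in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        if causal:
            scores = [s if j <= i else -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)  # attention weights, one row per query
        out.append([sum(wj * v[d] for wj, v in zip(w, V))
                    for d in range(len(V[0]))])
        all_w.append(w)
    return out, all_w

# Tiny self-attention example: 3 "tokens" with 2-dim embeddings.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = attention(X, X, X)                 # full-context (encoder-style)
out_c, w_c = attention(X, X, X, causal=True)  # masked (decoder-style)
print(w[0])    # first token's weights over the whole sequence
print(w_c[0])  # causally masked: all weight on itself
```

In a real encoder-decoder NMT setup the encoder uses the unmasked form over the source sentence, while the decoder uses the causal form plus cross-attention into the encoder outputs; the sketch above only shows the attention arithmetic itself.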
I remember when this guy[1] launched TranslateAudio a few months ago. For those unfamiliar, it was the first tool to launch voice translation. It was built by an indie hacker.