Ask HN: Real-time speech-to-speech translation
158 points by thangalin 36 days ago | 70 comments
Has anyone had any luck with a free, offline, open-source, real-time speech-to-speech translation app on under-powered devices (i.e., older smart phones)?

* https://github.com/ictnlp/StreamSpeech

* https://github.com/k2-fsa/sherpa-onnx

* https://github.com/openai/whisper

I'm looking for a simple app that can listen for English, translate into Korean (and other languages), then perform speech synthesis on the translation. Basically, a Babelfish that doesn't stick in the ear. Although real-time would be great, a max 5-second delay is manageable.

RTranslator is awkward (couldn't get it to perform speech-to-speech using a single phone). 3PO sprouts errors like dandelions and requires an online connection.

Any suggestions?




It's not exactly what OP wants out of the box, but if anyone is considering building one, I suggest taking a look at this.¹ It is really easy to tinker with and can run either on-device or in a client-server model. It has the required speech-to-text and text-to-speech endpoints, with multiple options for each built in. If you can get the LLM assistant part of the pipeline to perform translation to a degree you're comfortable with, this could be a solution.

¹ https://github.com/huggingface/speech-to-speech
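Not the repo's own API, but just to illustrate the cascade idea (STT -> translate -> TTS), here is a rough sketch with plain transformers pipelines. The model names and file name are only examples I'd reach for; swap in whatever fits your device and language pair:

    # Rough cascade sketch: speech -> English text -> Korean text
    # (TTS step left out; the speech-to-speech repo wires that part up)
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

    # NLLB covers Korean; pick a size that fits the device
    translator = pipeline("translation",
                          model="facebook/nllb-200-distilled-600M",
                          src_lang="eng_Latn", tgt_lang="kor_Hang")

    text = asr("clip.wav")["text"]                    # transcribe English audio
    korean = translator(text)[0]["translation_text"]  # translate to Korean
    print(korean)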


A similar option exists with txtai (https://github.com/neuml/txtai).

https://neuml.hashnode.dev/speech-to-speech-rag

https://www.youtube.com/watch?v=tH8QWwkVMKA

One would just need to remove the RAG piece and use a Translation pipeline (https://neuml.github.io/txtai/pipeline/text/translation/). They'd also need to use a Korean TTS model.
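Roughly something like this minimal sketch (the default models and the placeholder file name are assumptions; the Korean TTS choice is left open, the docs above cover the options):

    # Sketch: transcribe -> translate with txtai pipelines (RAG piece removed)
    from txtai.pipeline import Transcription, Translation

    transcribe = Transcription()   # speech to text
    translate = Translation()      # text to text

    text = transcribe("english.wav")
    korean = translate(text, "ko")
    print(korean)
    # Final step would be a TextToSpeech pipeline with a Korean-capable model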

Both this and the Hugging Face speech-to-speech projects are Python though.


Your library is quite possibly the best example of effortful, understandable and useful work I have ever seen - principally evidenced by how you keep evolving with the times. I've seen you keep it up to date and even on the edge now for years and through multiple NLP mini-revolutions (sentence embeddings/new uses) and what must have been the annoying release of LLMs and still push on to have an explainable and useful library.

Code from txtai just feels like exactly the right way to express what I am usually trying to do in NLP.

My highest commendations. If you ever have time, please share your experience/what led you to take this path with txtai. For example, I see you started in earnest around August 2020 (maybe before) - at that time, I would love to know if you imagined LLMs becoming as prominent as they are now and instruction-tuning working as well as it does. I know at that time many PhD students I knew in NLP (and profs) felt LLMs were far too unreliable and would not reach e.g. consistent scores on MMLU/HellaSwag.


I really appreciate that! Thank you.

It's been quite a ride from 2020. When I started txtai, the first use case was RAG in a way. Except instead of an LLM, it used an extractive QA model. But it was really the same idea: get a relevant context, then find the useful information in it. LLMs just made it much more "creative".

Right before ChatGPT, I was working on semantic graphs. That took the wind out of the sails on that for a while until GraphRAG came along. Definitely was a detour adding the LLM framework into txtai during 2023.

The next release will be a major release (8.0) with agent support (https://github.com/neuml/txtai/issues/804). I've been hesitant to buy into the "agentic" hype as it seems quite convoluted and complicated at this point. But I believe there are some wins available.

In 2024, it's hard to get noticed. There are tons of RAG and Agent frameworks. Sometimes you see something trend and surge past txtai in terms of stars in a matter of days. txtai has 10% of the stars of LangChain but I feel it competes with it quite well.

Nonetheless I keep chugging along because I believe in the project and that it can solve real-world use cases better than many other options.


I have a dozen or so tabs open at the moment to wrap my head around txtai and its very broad feature set. The plethora of examples is nice even if the Python idioms are dense. The semantic graph bits are of keen interest for my use case, as are the pipelines and workflows. I really appreciate you continuing to hack on this.


You got it. Hopefully the project continues its slow growth trajectory.


> free

> offline

> real-time

> speech-to-speech translation app

> on under-powered devices

I genuinely don't think the technology is there.

I can't even find a half-good real-time "speech to second language text" tool, not even with "paid/online/on powerful device" options.


> I genuinely don't think the technology is there.

Definitely true for OP's case, especially for non-trivial language pairs. For the best case scenario, e.g. English<>German, we can probably get close.

> I can't even find a half-good real-time "speech to second language text" tool, not even with "paid/online/on powerful device" options.

As in "you speak and it streams the translated text"? translate.google.com with voice input and a more mobile-friendly UI?


The problem with Google is its translation quality. Not sure about Korean, but Japanese/English (either way) definitely isn't there.

For Japanese to English, the transcription alone is already pretty inaccurate (usable if you know some Japanese; but then again you already know Japanese!)


This hasn't been my experience with English/Japanese translation with Google Translate. For context I used Google Translate for pair programming with Japanese clients 40 hours per week for about 6 months, until I ponied up for a DeepL subscription.

As long as you're expressive enough in English, and reverse the translation direction every now and again to double check the output then it works fine.


As I mentioned in another reply, the scenario here is translating "artistic" or "real-world" (for lack of a better term) literature accurately—whether it's a novel, a YouTuber's video, casual conversation, or blog posts/tweets with internet slang and abbreviations. In these cases, getting things 95% right isn’t enough to capture the nuances, especially when the author didn’t create the content with translation in mind (which I believe matches your experience).

Machine translation for instructional or work-related texts has been "usable" for years, way before LLMs emerged.

LLM-based translation has certainly made significant progress in these scenarios—GPT-4, for example, is fully capable IMHO. However, it's still not quite fast enough for real-time use, and the smaller models that can run offline still don't deliver the needed quality.


English -> Japanese machine translations (whether it's GT or DeepL or GPT) are fairly usable these days, in the sense that they reduce the interpretation workload to a trivial amount, especially given the typical skill set of a Japanese white-collar worker. They're not perfect in the sense that the output being a translation is always apparent to native speakers - but that is the case even with offline human translators, so it could be a moot point.

Anyway, the current state of affairs floats somewhere comfortably above "broken clock" and unfortunately below "Babelfish achieved", so opinions may vary.


Interesting, I was in Japan a few months ago and I found Google Translate to be pretty good. Even when hotels etc. provided information in English, I found it was better to use Google Lens on the Japanese information.

I can't say much about the quality of English -> Japanese translation, except that people were generally able to understand whatever came out of it.


It's usable as a tool for quick communication or reading instructional text.

But don't expect to be able to use it to read actual literature or, back to the topic, subtitling a TV series or a YouTube video without misunderstanding.


The leading LLMs are already very strong at translation (including EN<>JP; Korean is more model-dependent), so what you want is simply Google Translate but powered by a strong LLM? I'm sure there are already dozens of wrappers that offer just that.


> I can't even find a half-good real-time "speech to second language text" tool,

iOS's built-in Translate tool? I haven't tried it for other languages, but a quick test of English <> Thai seemed to handle things fine (even with my horrible Thai pronunciation and grammar), and even in Airplane mode (i.e. guaranteed on-device) with the language pack pre-downloaded.


It’s not free, but I’ve had some success using ChatGPT’s Advanced Voice mode for sequential interpreting between English and Japanese. I found I had to first explain the situation to the model and tell it what I wanted it to do. For example: “I am going to have a conversation with my friend Taro. I speak English, and he speaks Japanese. Translate what I say into Japanese and what he says into English. Only translate what we say. Do not add any explanations or commentary.”

We had to be careful not to talk over each other or the model, and the interpreting didn’t work well in a noisy environment. But once we got things set up and had practiced a bit, the conversations went smoothly. The accuracy of the translations was very good.

Such interpreting should get even better once the models have live visual input so that they can “see” the speakers’ gestures and facial expressions. Hosting on local devices, for less latency, will help as well.

In business and government contexts, professional human interpreters are usually provided with background information in advance so that they understand what people are talking about and know how to translate specialized vocabulary. LLMs will need similar preparation for interpreting in serious contexts.


> In business and government contexts, professional human interpreters are usually provided with background information in advance so that they understand what people are talking about and know how to translate specialized vocabulary. LLMs will need similar preparation for interpreting in serious contexts.

I've given some high-profile keynote speeches where top-notch (UN-level) real-time simultaneous interpreters were used. Even though my content wasn't technical or even highly specialized (more general business), spending an hour with them previewing my speech and answering their questions led to dramatically better audience response. I was often the only speaker on a given day who made the effort to show up for the prep session. The interpreters told me the typical improvement they could achieve from even basic advance prep was usually >50%.

It gave me a deep appreciation for just how uniquely challenging and specialized the ability to do this at the highest level is. These folks get big bucks and from my experience, they're worth it. AI is super impressive but I suspect getting AI to replicate the last 10% of quality top humans can achieve is going to be >100% more work.

An even more specialized and challenging use case is the linguists who spend literally years translating one high-profile book of literature (like Tolstoy). They painstakingly craft every phrase to balance conveying meaning, pace and flavor etc. If you read expert reviews of literature translation quality you get the sense it's a set of trade-offs for which no optimal solution exists, thus how these trade-offs are balanced is almost as artistic and expressive an endeavor as the original authorship.

Current AI translators can do a fine job translating things like directions and restaurant menus in real-time, even on a low-end mobile device, but the upper bound of quality achieved by top humans on the hardest translation tasks is really high. It may be quite a while before AIs can reach these levels technically - and perhaps even longer practically, because the hardest use cases are so challenging and the market at the high end is far too small to justify investing the resources in attaining these levels.


I agree with everything you wrote.

A few more comments from a long-time professional translator:

The unmet demand for communication across language barriers is immense, and, as AI quality continues to improve, translators and interpreters working in more faceless and generic fields will gradually be replaced by the much cheaper AI. That is already happening in translation: Some experienced patent translators I know have seen their work dry up in the past couple of years and are trying to find new careers.

This past August, I gave a talk on AI developments to a group of professional interpreters in Tokyo. While they hadn’t heard of any interpreters losing work to AI yet, they thought that some types of work—online interpreting for call centers, for example—would be soon replaced by AI. The safest type of human interpreting work, I suspect, is in-person interpreting at business meetings and the like. The interpreters’ physical presence and human identity should encourage the participants to value and trust them more than they would AI.

Real-time simultaneous translators are an interesting case. On the one hand, they are the rarest and most highly skilled, and national governments and international organizations have a long track record of relying on and trusting them. On the other hand, they usually work in soundproof booths, barely visible (if at all) to the participants. The work is very demanding, so interpreters usually work in twenty- to thirty-minute shifts; the voice heard by meeting participants therefore changes periodically. The result is less awareness that the interpreting is being done by real people, so users might feel less hesitation about replacing them with AI.

When I’ve organized or participated in conferences that had simultaneous human interpreting, the results were mixed. While sometimes it worked well, people often reported having a hard time following the interpretation on headphones. Question-and-answer sessions were often disjointed, with an audience member asking about some point that they seemed to have misunderstood in the interpretation and then their interpreted question not being understood by the speaker. The interpreters were well-paid professionals, though perhaps not UN-level.


I tried this a few times since my family speaks Finnish and son speaks Japanese, but the issue is that it keeps forgetting the task.

It’ll work at first for a sentence or two, then the other party asks something and instead of translating the question, it will attempt to answer the question. Even if you remind it of its task it quickly forgets again.


That happened once or twice for me, too. I wonder if an interpreting-specific system prompt would prevent that problem….


Expecting a boom in the speech-to-speech market in the following months. It's the next thing.


It's been the next thing ever since Bill Gates said so around the time of the Windows 95 launch.

But it does feel like we are close to getting a proper babelfish type solution. The models are good enough now. Especially the bigger ones. It's all about UX and packaging it up now.


It is impossible to accurately interpret with a max 5-second delay. The structure of some languages requires the interpreter to occasionally wait for the end of a statement before interpretation is possible.


‘Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation.’


Author of 3PO here: check out our latest version 2.12. Many fixes have been incorporated in the past two weeks. Cheers.


It would be interesting to have your application connected to an Asterisk/Issabel system. Have you considered this way of working?


> Many fixes have been incorporated in the past two weeks.

Thanks for the efforts! Still many fixes to go, though: I used the version from two days ago, which had numerous issues. Also, 3PO isn't offline, so I won't be pursuing it.


No problem! Some fixes were on the server side. We had a server side issue a couple days ago for a few hours. You may have been affected by it, giving you those errors. They have been fixed too. Take care!


It works really well for me, but I wish it supported more languages for the input - I guess this is a limitation of the model you are using? Do you mind giving some info about the tech stack? I love how fast it is.


Thanks for the kind words!! It is built on top of a fairly straightforward infrastructure. Client side is C#, .NET, MAUI. Server side is Firebase Cloud Functions/Firestore/Web Hosting + Azure.


Thanks - I was wondering more about the translation pipeline - I am assuming something like Whisper and M2M100? How did you manage to get the latency so low? Other examples I have seen feel really sluggish in comparison.


Fairly basic there... chunking the audio (over a websocket) and sending the chunks for translation sequentially.
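If it helps anyone picture it, the pattern is basically this Python sketch (the endpoint, file name, and chunk size are made up; the real client is C#/.NET as mentioned above):

    # Sketch: stream fixed-size audio chunks over a websocket, read results back
    import asyncio
    import websockets

    CHUNK = 32_000  # ~1 second of 16 kHz 16-bit mono; tune for latency vs. accuracy

    async def stream(path):
        # wss://example.invalid/translate is a placeholder endpoint
        async with websockets.connect("wss://example.invalid/translate") as ws:
            with open(path, "rb") as audio:
                while chunk := audio.read(CHUNK):
                    await ws.send(chunk)      # push raw audio bytes
                    print(await ws.recv())    # translated text comes back per chunk

    asyncio.run(stream("speech.raw"))

Nothing exotic - most of the latency win is just keeping the chunks small and the connection open.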


Only seems to cover half of what you're asking for... Starred this the other day and haven't gotten around to trying it out:

https://github.com/usefulsensors/moonshine


A friend recommends SayHi, which does near-realtime speech-to-speech translation (https://play.google.com/store/apps/details?id=com.sayhi.app&...). Unfortunately it's not offline though.


Is that the same app? That seems like a social/dating app. This Reddit thread suggests the SayHi app was discontinued:

https://www.reddit.com/r/language/comments/1elpv37/why_is_sa...


I've developed a macOS app, BeMyEars, which does real-time speech-to-text translation. It first transcribes and then translates between languages. All of this works on-device. If you only want a smartphone app, you can also try YPlayer; it also works on-device. Both can be downloaded from the App Store.


Does YPlayer not support Portuguese? I am interested in having spoken Portuguese translated into English in real time.


Thanks for building this. Super stoked that the translation is on-device.

I'll be downloading it and giving it a try today!!


Real-time and under-powered, no way. All the available tools (and models) today require non-negligible hardware.


I just realized I will actually see a real Babelfish hitting the market in my lifetime. Amazing times indeed.


The tech is really here. This summer, I was fascinated by the accuracy of the spoken-language auto-detection capabilities. It really works, and it only needs 1-2 seconds to catch the nuance of a specific language.

So, I ventured into building 3PO. https://3po.evergreen-labs.org

Would love to hear everyone's feedback here.


Samsung Interpreter might be the closest, but it isn't free, nor does it work on low-power devices.


I've been looking for something like this (Not for Korean though) and I'd even be happy to pay - though I'd prefer to pay by usage rather than a standing subscription fee. So far, no luck, but watching this thread!


RTranslator is quite close. It needs a TTS for the target language to be installed.


>Although real-time would be great, a max 5-second delay is manageable.

Humans can't even do this in immediate real time, so what makes you think a computer can? Some of the best real-time translators that work at the UN or for governments still have a short delay to be able to correctly interpret and translate for accuracy and context. Doing so in real time actually impedes the translator from working correctly - especially in languages that have different grammatical structures. Even in languages that are effectively congruent (think Latin derivatives), this is hard, if not outright impossible, to do in real time.

I worked in the field of language education and computer science. The tech you're hoping would be free and able to run on older devices is easily a decade away at the very best. As for it being offline, yeah, no. Not going to happen, because accurate real-time translation of even a database of the 20 most common languages on earth is probably a few terabytes at the very least.


Is this possible to do smoothly with languages that have an extremely different grammar to English? If you need to wait until the end of the sentence to get the verb, for instance, then that could take more than five seconds, particularly if someone is speaking off the cuff with hesitations and pauses (Or you could translate clauses as they come in, but in some situations you'll end up with a garbled translation because the end of the sentence provides information that affects your earlier translation choices).

AFAIK, humans who do simultaneous interpretation are provided with at least an outline, if not full script, of what the speaker intends to say, so they can predict what's coming next.


> AFAIK, humans who do simultaneous interpretation are provided with at least an outline, if not full script

They are usually provided with one, but it is by no means necessary. SI is never truly simultaneous and will have a delay, and the interpreter will also predict based on the context. Which makes certain languages a bit more difficult to work with, e.g. Japanese, sentences of which I believe often have the predicate after the object, rather than the usual subject-predicate-object order, making the "prediction" part harder.


> If you need to wait until the end of the sentence to get the verb, for instance, then that could take more than five seconds

I meant a five-second delay after the speaker finishes talking or the user taps a button to start the translation, not necessarily a five-second rolling window.


Off topic, but what's the state-of-art behind speech recognition models at the moment?

Are people still using DTW + HMMs?


HMMs haven't been state of the art in speech recognition for decades (i.e. since it actually got good). It's all end-to-end DNNs now. Basically raw input -> DNN -> ASCII.

Well almost anyway - last I checked they feed a Mel spectrogram into the model rather than raw audio samples.
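For a concrete picture, this is roughly what that looks like with the openai/whisper package linked in the OP (a minimal sketch with a placeholder file name, default decoding options, one fixed 30-second window):

    # Sketch: raw audio -> log-Mel spectrogram -> transformer decode
    import whisper

    model = whisper.load_model("base")

    audio = whisper.load_audio("clip.wav")    # resampled to 16 kHz mono
    audio = whisper.pad_or_trim(audio)        # fixed 30-second window
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    result = whisper.decode(model, mel, whisper.DecodingOptions())
    print(result.text)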


> state of the art in speech recognition for decades

Decades doesn't sound right. Around 2019, the Jasper model was SOTA among e2e models but was still slightly behind a non e2e model with an HMM component https://arxiv.org/pdf/1904.03288


FREE and OFFLINE and OPEN SOURCE and REAL-TIME on UNDER-POWERED devices?


I think it's certainly possible if you compromise on accuracy. https://en.wikipedia.org/wiki/Dragon_NaturallySpeaking has been around since the late 90s, there's various (rather robotic-sounding) free speech synths available which don't require much processing power at all (look at the system requirements of https://en.wikipedia.org/wiki/Software_Automatic_Mouth ), and of course machine translation has been an active topic of research since the mid-20th century.

IMHO it's unfortunate that everyone jumps to "use AI!" as the default now, when very competitive approaches that have been developed over the past few decades could provide decent results but at a fraction of the computing resources, i.e. a much higher efficiency.


Yes, but why OFFLINE? Today's world is so super-connected that I am wondering why OP is asking for this requirement.


> why OFFLINE

Why online? Why would I want some third-party to (a) listen to my conversations; (b) receive a copy of my voice that hackers could download; (c) analyze my private conversations for marketing purposes; (d) hobble my ability to translate when their system goes down, or permanently offline; or (e) require me to pay for a software service that's feasible to run locally on a smart phone?

Why would I want to have my ability to translate tied to internet connectivity? Routers can fail. Rural areas can be spotty. Cell towers can be downed by hurricanes. Hikes can take people out of cell tower range. People are not always inside of a city.


Hosting is annoyingly expensive. ping latency between us-east-1 and ap-southeast-1 is 230ms. So you either setup shop in one location or go multi-region (which adds up).

Also, there are many environments (especially when you travel) where your phone is not readily connected.


This is why it would be tough to be an AI startup in 2024… totally unrealistic customer expectations


Which usually stem from AI startups making wildly unrealistic promises; it's all a very unfun cat and mouse game.


> * https://github.com/openai/whisper

I would be very concerned about any LLM being used for "transcription", since they may inject things that nobody said, as in this recent item:

https://news.ycombinator.com/item?id=41968191


They list the error rate on the git repo directly; it was never good, even when it was the best.

I saw mediocre results from the biggest model even when I gave it a video of Tom Scott speaking at the Royal Institution where I could be extremely confident about the quality of the recording.


WER is a decent metric to compare models but there's a difference between mistranscribing "effect" for "affect" and the kind of hallucinations Whisper has. I've run thousands of hours of audio through it for comparisons to other models and the kinds of thing you see Whisper inventing out of whole cloth is phrases like "please like and subscribe" in periods of silence. To me it suggested that it's trained off a lot of YouTube.


Interesting; that's certainly a bigger baseline than the one hour or so that I tried, which wasn't big enough to reveal that.


This phone has been around for ages, and does the job. It's well weapon! https://www.neatoshop.com/product/The-Wasp-T12-Speechtool


moonshine?



Doesn't Google Translate do this?


maybe... Google doesn't support smaller languages for STT.

My other gripe with these tools is that if there is background noise, they are pretty useless. You can't use them in a crowded room.


In most noisy contexts, a throat microphone should work [1]. A small Bluetooth one could also connect to its small earpiece(s) to make a wearable speech interface whose bigger battery could be concealed in the mic and last much longer than usual earbuds.

[1]: https://en.wikipedia.org/wiki/Throat_microphone


When I usually need a translation app is when I am traveling. I can't just ask a stranger to tape a throat mic and wear my headphones to have a conversation with me though



