
Firefox Voice - makeworld
https://voice.mozilla.org/firefox-voice/
======
AsyncAwait
I see a lot of skeptical voices here (somewhat warranted, given it's voice
assistant technology), but the fact remains that if we want open, on-device
voice recognition, we'll have to do the work and donate sample data.

This extension is trying to provide some useful functionality in the hopes
that Mozilla gets more data for
[https://commonvoice.mozilla.org](https://commonvoice.mozilla.org)

I'd at least consider recording your voice, especially if you're a non-native
English speaker like myself, have an accent, etc.

It took many years for free software to start taking on the smartphone
segment, with previous efforts (including Mozilla's) failing, and only now are
the PinePhone and Librem 5 giving it another go; unless you're a super
hardcore enthusiast, you carry an iPhone or Android today.

I see this as a way to push back on the likes of Amazon, Google and Apple. If
regular Firefox users are able to use an on-device, privacy-respecting voice
assistant, and other open-source projects can use Mozilla's tools and datasets
to build compelling competitors to Alexa, I'd see that as proof that free
software is able to address new, emerging markets too.

~~~
panpanna
> if we want open, on-device voice recognition, we'll have to do the work and
> donate sample data.

Fair enough, but is anything stopping Mozilla from _also_ selling your voice
data to third parties, including advertisers and commercial ML interests?

(Asking this because Firefox has started sharing our data with Leanplum)

~~~
Cybiote
It still remains a risk well worth taking. With Mozilla, it is merely
uncertain but for just about every other company, it is all but guaranteed.
You can plainly see the industry's cloud-centric philosophy in its naming
conventions, where the supercomputers of yesteryear are relegated to "edge"
roles.

As of today, the open source and free software equivalents to machine learning
and AI products are sorely lacking when compared to commercial offerings.
Whether it is open-ended speech to text with good ergonomics, text to speech,
intent recognition, speaker recognition, OCR for text, OCR in the wild,
translation, object recognition, image segmentation, image to text or natural
language processing, commercial offerings are leagues ahead of what free
software can do.

If we look at one of the most impressive AI demonstrations in history, GPT-3,
it is not apparent whether open source can even replicate it, because with AI,
unlike in the past, time and skill are no longer directly fungible with money.
I would argue the concentration of such capabilities on Microsoft's and
Google's servers is a threat to the ideals of free software as great as any it
has seen before. Yet relatively little attention is spent there, because
people are too focused on yesterday's problems.

This concentration is difficult to avoid because current algorithms require
large amounts of data and computing ability, which only large corporations can
marshal. Mozilla is far from perfect, but despite their many stumbles, they're
the only large organization seriously attempting to address this imbalance. As
much as these algorithms are marketed as AI to users, ML is better thought of,
by programmers, as a set of libraries along the lines of ffmpeg. Mozilla still
does seem to care about creating a local-first offering. If everyone stops
using them, then what is gained, exactly?

~~~
marcinzm
> With Mozilla, it is merely uncertain but for just about every other company,
> it is all but guaranteed

I disagree; most large companies view this data as a competitive advantage and
won't sell it directly. They may sell the results, but the data itself is
their moat. Smaller companies, on the other hand, are more willing to
sacrifice future profits for current money.

~~~
AsyncAwait
This is an open dataset. There's no point selling a dataset that's already
free.

As far as recording things I didn't consent to, the likes of Google are way
more likely to do that [1].

1 - [https://mashable.com/article/google-assistant-microphone-smoke-alarm](https://mashable.com/article/google-assistant-microphone-smoke-alarm)

------
djsumdog
I've looked into open source voice assistants before. I found mycroft, Jarvis
and a few others, but either got bogged down in dependencies or configuration.
Many supported shipping your data to Google or Amazon, or to an open source
voice recognition tool if you configured it.

I hate this idea that our voice has to be shipped somewhere to be processed. I
remember a lot of the speech-to-text tools in the early 2000s weren't all that
great (they needed a lot of training), but why haven't we been able to advance
on-device processing? Why is everything done in "the cloud"?

So the only way to semi-accurately do voice recognition is to source
algorithms that re-train off of millions of people? We have processors in our
desktops and laptops that dwarf the compute power of that era by leaps and
bounds. We should be looking to Star Trek TNG-level voice processing on each
individual device, without some central mainframe.

But marketing, advertising revenue, data mining, free (as in beer) software
that pumps your data like an oil rig, efficiency in data centre (cloud)
design... all these factors have reduced these powerful little Intel/ARM/Ryzen
chips to nothing more than thin clients when they're not playing games.

If Mozilla really wanted to make something amazing and in the spirit of
Firefox, they'd give us an experiment where voice processing is done on our
devices.
Even if it meant I needed to download a 230GB data set, I'd gladly do it, if
it could remotely help in getting away from these data silos.

~~~
synesthesiam
> I found mycroft, Jarvis and a few others, but either got bogged down in
> dependencies or configuration.

More recently, there is also Rhasspy
([https://rhasspy.readthedocs.io](https://rhasspy.readthedocs.io)) and
voice2json ([https://voice2json.org](https://voice2json.org)). I'm the author
of both, if you have questions.

> Why is everything done in "the cloud"?

Besides being a way of collecting data and ultimately making money, it avoids
some of the "bogged down in dependencies or configuration" problem. My voice
projects need to run entirely offline on a variety of hardware and operating
systems. If each client was just a little app piping audio data to a cloud
service, it would be way easier to write and maintain.

> So the only way to semi-accurately do voice recognition is to source
> algorithms that re-train off of millions of people?

Nope. You can absolutely tune a speech model locally on your own samples and
get great accuracy. The trouble comes with open-ended speech: people expect
the voice assistant to recognize that new artist or movie they heard about
yesterday. That doesn't work without upkeep somewhere.

Rhasspy/voice2json are intended for pre-defined voice commands using a
template language. You can get almost perfect accuracy with this approach,
even with millions of possible commands. Re-training only takes a minute, so
personal upkeep isn't bad.
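
To give a rough idea of how the template approach works, here's a toy Python
sketch of expanding a template and matching against it (just the general idea,
not how Rhasspy or voice2json are actually implemented):

    import itertools
    import re

    # Toy command template: (a | b) means alternatives, [word] is optional.
    TEMPLATE = "turn (on | off) [the] (kitchen | living room) light"

    def expand(template):
        """Expand a template into every concrete sentence it covers."""
        tokens = re.findall(r"\(([^)]*)\)|\[([^\]]*)\]|(\S+)", template)
        choices = []
        for alt, opt, word in tokens:
            if alt:
                choices.append([c.strip() for c in alt.split("|")])
            elif opt:
                choices.append([opt.strip(), ""])  # optional word: present or absent
            else:
                choices.append([word])
        for combo in itertools.product(*choices):
            yield " ".join(w for w in combo if w)

    SENTENCES = set(expand(TEMPLATE))

    def recognize(transcription):
        """Exact match against the pre-expanded command set."""
        return transcription.lower().strip() in SENTENCES

    print(recognize("turn off the kitchen light"))  # True
    print(recognize("play some jazz"))              # False

The real tools compile the expanded sentences into a grammar/language model, so
the recognizer can essentially only "hear" valid commands; that constraint is
where the accuracy comes from.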

> Even if it meant I needed to download a 230GB data set, I'd gladly do it, if
> it could remotely help in getting away from these data silos.

It's a lot less than that; at most 1-2 GB for a given language, usually a few
hundred MB: [https://github.com/synesthesiam/voice2json-profiles/](https://github.com/synesthesiam/voice2json-profiles/)

~~~
vongomben
@synesthesiam your work is wonderful and I am willing to test both ASAP.

It kind of fills the gap left by snips.ai; I will check and compare with
what's on the market now.

Super recent (less than a year old for both projects, but I may be wrong) and
apparently well documented. Hats off.

------
yelloworangefog
I'm sorry, but cloud-based speech recognition in itself would already be a red
flag, even if Mozilla were doing it in-house. Outsourcing it to Google,
though? I feel like a company as ostensibly privacy-focused as Mozilla should
really know better by now...

~~~
GaryNumanVevo
They’re building their own open voice platform. Google speech to text is
presumably for testing.

> Note: In the future, we expect to enable Mozilla’s own technology for
> Speech-to-Text which enables us to stop using Google’s Speech-to-Text
> engine.

edit: s/texting/testing

~~~
posguy
Mozilla DeepSpeech is rapidly maturing, but it needs thousands of hours of
validated audio data to train each language. It's a feat that with only 2,000
hours of audio they can achieve a 5.97% word error rate.

Baidu had 5,000 hours of audio data to train their DeepSpeech and DeepSpeech 2
models; meanwhile, Google, Microsoft & IBM have people constantly giving them
fresh audio to train and validate their models with.

Firefox Voice data should help rapidly expand the Common Voice audio corpus
beyond the 1,492 hours it currently contains:
[https://commonvoice.mozilla.org/en/datasets](https://commonvoice.mozilla.org/en/datasets)

------
pixxel
Had to read the privacy policy to see they use Google.

> We share your audio recording with Google Cloud’s speech-to-text service to
> assist us in processing and carrying out your commands. Audio recordings are
> shared without personally identifiable metadata, and we’ve instructed Google’s
> service not to retain the audio or transcript associated with a command after
> it processes the command

~~~
pmontra
It's in the FAQ section of the page under "How is my audio processed?"

------
mikob
I've been working on this exact thing for Chrome for the last 3 years:
[https://www.lipsurf.com](https://www.lipsurf.com). Anyone can make an open
source plugin for it to do anything with voice
([https://github.com/LipSurf/plugins](https://github.com/LipSurf/plugins)).

I've wanted to port it to Firefox, but the HTML5 SpeechRecognition API
([https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition](https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition))
is still not available. Why not just make the API available and leave this in
addon territory for all developers?

~~~
aerovistae
I built something like this 7 years ago: Hands Free for Chrome, a
now-languishing project that I lost interest in a long time ago,
unfortunately. It made the top 10 of HN back then though! My site's design is
not nearly as nice as yours.

[https://www.handsfreechrome.com/](https://www.handsfreechrome.com/)

I just didn't get enough users or support to really care about it. But I wish
you the best. It was an exciting thing to build and using it always felt
futuristic to me.

This is just so fascinating though. It's like seeing what could have been if I
had been a better developer and found the dedication to really stick to the
project in the long term.

Edit: I see we had the exact same idea! Your "tag" is my "map." Love it. One
big difference is that mine was just a free project. I'd be super interested
to know how many users you've got. I never had more than ~1100. From looking
through your website, mine was a much less intensive project. (Oh, CWS says
4000+ for you... wow, wonder how many are paid.)

Edit2: Looking over your update history is almost nostalgic. "Fixed issue with
overlapping commands -- delaying commands that are partial matches of other
commands." Had to do the exact same thing!

Edit3: We have so many overlapping command names that I almost wonder if you
took inspiration from my project. Either that or it's just a case of
convergent evolution.

Edit4: Suggestion for dictation: a way to alternate between a special
character and actually writing the word. Doesn't look like there's a way to do
^ vs "caret" or & vs "ampersand". Something like "Enter special character
caret". Maybe you already have a plugin for this though, idk.

Edit5: God, this is so well architected! Plugins and contexts are just
fantastic ideas for this domain. Click-by-voice using hidden search-for-text
is also a perfect solution to that problem. I wonder if this could be made
more intelligent, e.g. "Click Submit in the sidebar on the left" --
challenging though.

Edit6: Wow, just noticed someone else built something called "Handsfree for
Web" somewhere along the way and theirs is ALSO way better than what I had
built. Geez. Starting to feel bad about my awful website.

~~~
mikob
Never saw yours before, but I discovered "Handsfree for Web" a few months
after I started - and thought he had ripped mine off. But I no longer think
so. Yes, it seems like many commands are the same. Shame that so much wheel
reinvention is going on. One thing that makes LipSurf "special" is the deep
integration with sites. I wanted to use Duolingo, Reddit, HN and some others
more with voice - so they get special plugins. Doing Duolingo with voice is a
game changer for language learning - and if it weren't for use cases like that
I would have likely lost interest, like you, long ago.

~~~
TheSpiceIsLife
I want hands-free for CAD.

Imagine being able to vocalise and build a model.

I did have an HN user who said they'd be happy to collaborate with me to build
it, but I dropped the ball and have since killed that email address.

~~~
franga2000
If you manage to get into the GPT-3 beta, I'd love to work on that with you
:D.

For simple models, English -> OpenSCAD sounds like it's doable given the
things I've seen on Twitter, and for normal modeling, GPT-3 would probably
make an excellent intent recognizer for voice commands.

~~~
TheSpiceIsLife
How would I go about looking into this?

I'm a metal fabricator by trade but also technically minded; I've written
AutoHotKey scripts for a few things and can do very basic Python and C# if I
need to.

I tend to learn in a very _solutions oriented_ way.

And I'd be keen to collaborate / learn / skill share.

~~~
franga2000
The GPT-3 beta is something every programmer and their dog wants to get into
these days, and most of us can't. It's a really impressive new language
processing neural network that people have managed to coax into (among many
other things) generating code from an English description of the program. If
it can do that, it might be able to generate some reasonably complex
Constructive Solid Geometry models and even something like MEL commands in
Maya.

Greg, OpenAI's CTO, occasionally manually lets people in if they convince him
of their use case (or, as one guy did, by planting a bunch of trees in his
name) in an email (gdb@openai.com). It might be worth shooting him a message explaining
the idea. From what I've heard, once you have a key, it's mostly a matter of
feeding the model examples until it does what you want.
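
And the code side, once you're in, is tiny; roughly something like this with
the Python client (the engine name, parameters and example prompt here are
just placeholders to illustrate few-shot prompting, so treat it as a sketch):

    import openai

    openai.api_key = "sk-..."  # your beta key

    # Few-shot prompt: a couple of English -> OpenSCAD pairs, then the new request.
    prompt = "\n".join([
        "English: a cube 20mm on each side with a 5mm hole through the centre",
        "OpenSCAD: difference() { cube(20, center=true); cylinder(h=30, d=5, center=true); }",
        "",
        "English: a 40mm diameter disc, 3mm thick",
        "OpenSCAD: cylinder(h=3, d=40);",
        "",
        "English: a 50x30x10mm plate with a 6mm hole in each corner",
        "OpenSCAD:",
    ])

    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt,
        max_tokens=150,
        temperature=0.2,
        stop="\n\n",
    )
    print(response.choices[0].text.strip())

The interesting work is all in curating the examples, not in the code.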

------
snoopfab
You should try Rhasspy. It's open source and respects your privacy by using
offline services. It's fully customizable (each service can be replaced by
another), all the services are containerized for easy installation, and it's
available for several architectures such as ARM (e.g. on a Raspberry Pi).
There is even an option to use Mozilla's DeepSpeech speech-to-text service.

[https://rhasspy.readthedocs.io/en/latest/](https://rhasspy.readthedocs.io/en/latest/)

------
dhaavi
"We’ve instructed the Google Speech-to-Text engine to NOT save any
recordings."

Hahaha! :D Thanks for the good laugh.

~~~
input_sh
Bigger players have the leverage to get companies to do something they don't
do out-of-the-box. They can contractually oblige them to do that, as well as
sue each other if one side breaks its part of the deal.

~~~
Terretta
If they could guarantee Google doesn’t, they’d say that. Thus the much lesser
claim “instructed” which carries no such assurance.

I agree with and applaud their truthful choice of words. There’s _no such
thing_ as “contractually oblige”.

// To keep this concrete, consider one party receiving a National Security
Letter (“NSL”) with a gag order.

------
ipsum2
I guess Mozilla's own speech to text
([https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech))
isn't good enough, so they have to use Google's?

~~~
kevingadd
Presumably the reason they want you to opt in to saving recordings is so they
can train DeepSpeech.

~~~
setzer22
But DeepSpeech has already been trained with millions of data samples!

I'd feel way better about it if they went for a slightly worse DeepSpeech-based
implementation, but kept it working in the free software spirit they have been
known for for many years.

Also, for desktop devices, inference with DeepSpeech is cheap enough that they
could even go the extra mile and work on some Wasm magic to get offline
recognition.
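
For reference, local inference with the deepspeech Python package is only a
few lines (the model file names below are the 0.9.x release artifacts, so
adjust for whatever version you grab; audio has to be 16 kHz, 16-bit mono):

    import wave

    import numpy as np
    from deepspeech import Model

    # Load the released acoustic model plus the external scorer (language model).
    model = Model("deepspeech-0.9.3-models.pbmm")
    model.enableExternalScorer("deepspeech-0.9.3-models.scorer")

    # Read a 16 kHz, 16-bit mono WAV file into an int16 buffer.
    with wave.open("command.wav", "rb") as wav:
        audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

    # Speech-to-text entirely on the local machine.
    print(model.stt(audio))

Nothing in there needs a network connection.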

That's the kind of work I'd expect from Mozilla! Not wiring up your data
collection to the Google Cloud APIs and calling it a day! I'm genuinely
disappointed with them...

~~~
posguy
The audio Mozilla DeepSpeech is trained on is not very large (about 2,000
hours) or diverse (e.g. mostly native male American English voices), so the
model has very little ability to handle noise, accents or other irregularities.

Comparatively, Baidu had 5,000 hours of English to train their versions of
DeepSpeech and DeepSpeech 2 on, and thus had better results years ago. Google,
Microsoft, IBM and other companies have users providing more audio samples on
a daily basis, enabling much better quality speech to text.

Mozilla's Common Voice project only has 1,492 hours of validated English
currently:
[https://commonvoice.mozilla.org/en/datasets](https://commonvoice.mozilla.org/en/datasets)

------
mlindner
Mozilla needs to come around more to Apple's way of thinking. These things
need to be done locally on the device, not farmed out to some cloud. Use the
cloud (CDN) to deploy the software, but run the software locally.

------
input_sh
Alright, gave it a shot. First impressions:

* "Make me laugh" always brings me to the same YouTube video.

* Had pretty much no issues with the default prompts. It was able to find some challenging Spotify playlists and open random websites (including ones with non-standard English domains when I spelled them out).

* "Read this page" uses an awful TTS engine, which is a shame considering that I might actually use this feature on a somewhat regular basis. I'm assuming it uses whatever it detects on the OS level, and so far I haven't bothered with finding a better one (on Ubuntu, if you know of one, please suggest).

* "Set a timer for X min" works just fine, which is probably the only thing I use Google's assistant on my phone (or whatever it's name might be now).

* I like the idea of routines in the app settings, which are supposed to tie multiple queries together. I could see myself using it for something like a morning routine (tell me what time it is, give me weather info, read me the news, etc.)

~~~
Eyght
I tested the phrase "go to kyle's channel on twitch" and it actually went to
twitch.tv/kyle - which I found impressive.

------
krick
Google worries aside, judging from the preview it's pretty slow. I'm not a
super-fast typist, but these delays sure look like something that would
discourage me from actually using it. Maybe it's not even that it's slow, just
that the delays are super obvious somehow, with all these disruptive
animations and such.

------
asimilator
Speech to text in the cloud is a hard no from me. Especially if it’s Google’s
speech to text.

~~~
ve55
Don't worry, it says they asked Google to not save everyone's voices!

On a serious note, it does bother me how much Mozilla constantly uses Google,
even when they have their own solutions. They could easily choose not to,
especially with their massive budget, but often don't. They have their own
Voice API, but they use Google's. They have their own location API, but they
use Google's ('use my location' sends your info to Google in Firefox). They
have thrown Google analytics into browser components before, and used it on
their own websites.

~~~
iseanstevens
Agree, though maybe Google is giving them free/bartered compute credits?

------
mike_ivanov
I don't understand the utility of it. Yes, I can see how this might be
considered cool and hip, but... which of my problems as a user does it solve,
exactly?

~~~
a_bonobo
Speech-based control is almost mandatory for people with various disabilities
(the blind, those with hand-movement problems/disabilities) and the elderly, a
huge chunk of the population.

------
bennettfeely
So do I yell out my password to log in to websites?

~~~
jcims
I mean, who doesn’t?

------
causality0
Voice browsing on Windows is exactly what I don't need. I'd have a lot of use
for being able to search the internet by voice and have the browser read an
article to me while my phone is mounted to my dashboard. Without having to
configure my whole phone for visual impairment, that is.

------
greggman3
I actually tried this with Siri while cooking yesterday. It's not there yet,
but I asked "Hey, Siri ... read me the synopsis of the movie Adam's Rib" and
Siri proceeded to read a short synopsis of that movie. It worked on another
movie but made me choose one of 7. It failed on the third try: I tried another
movie, it gave me selections, and when I picked one with "read me the first
one" it just repeated the title instead of telling me the synopsis.

------
ozten
Is there a preference to have text to speech via another service provider?

------
hendersoon
Google charges 2.4 cents per minute for STT so there's no way Mozilla could
afford to offer this service if it actually got popular. I mean, that
obviously won't be an issue, but still.
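
Back of the envelope, with made-up usage numbers: a million users averaging
just two minutes a day would be 1,000,000 × 2 × $0.024 ≈ $48,000 per day, or
roughly $17.5M a year.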

~~~
kevingadd
Maybe the long-term plan is to run the trained model locally?

------
Aachen
> When you make a request using Firefox Voice, the browser captures the audio
> and uses cloud-based services to transcribe and then process the request.

Is it that hard to do local processing, either due to computational power or
storage requirements? Or is it just more convenient for them to do it this
way?

Edit: this comment in another subthread kind of answered the question:
[https://news.ycombinator.com/item?id=24098950](https://news.ycombinator.com/item?id=24098950)

If I'm drawing the right conclusion, it's a bit of both: hundreds of megabytes
of storage is fine for most people but not everyone, and while I probably
wouldn't listen to the latest and greatest artists (and binary diffs are a
thing, small additions aren't that large), it is convenient for devs to just
push it to a server and be done rather than pushing model updates to everyone
all the time.

Edit2:
[https://news.ycombinator.com/item?id=24096836](https://news.ycombinator.com/item?id=24096836)
Wait, what?! The data is all sent to Google? I was thinking of using this for
their sake (opting into using my data for Common Voice), but this is an
instant deal breaker.

------
miguelmota
The default keyboard shortcut wasn't working and it was opening a different
extension instead. I went to the voice extension settings and thought it was
bad UX how you have to type in the case-sensitive keyboard shortcut names
instead of just pressing the keys and having them be captured.

------
sradman
Recent HN thread _Thoughts on Voice Interfaces_ [1] about a blog post by one
of the Firefox Voice engineers.

[1]
[https://news.ycombinator.com/item?id=24040539](https://news.ycombinator.com/item?id=24040539)

------
madacol
No one here has mentioned this.

But I believe speech recognition will not take off until it understands
whispered speech.

Vocal cord strain makes current solutions unsustainable for continuous usage.

------
nsriv
The Google Recorder app on Pixel phones (and, I'm pretty sure, the general
Android release) does super accurate on-device transcription, for what it's
worth.

------
Mandatum
Dupe?
[https://news.ycombinator.com/item?id=23904846](https://news.ycombinator.com/item?id=23904846)

------
maps7
I am a Mozilla supporter so I am happy to support this.

------
swiley
I still haven't understood how any of this is an improvement on a shell.

------
skizziepop
Opera had this a decade ago. RIP

------
person_of_color
What's wrong with DeepSpeech?

~~~
dsteinman
DeepSpeech is too large to run as a browser extension.

~~~
person_of_color
Have they tried Nanonets?

~~~
remexre
Googled those, and it looks like some company selling remotely-running models,
which I think is what GP was referring to. Is there another technique that's
been SEO'd out by this company?

------
svnpenn
So let me get this straight. They broke global keyboard shortcuts, which
people could use to play/pause media on different websites, three years ago:

[https://bugzilla.mozilla.org/show_bug.cgi?id=1411795](https://bugzilla.mozilla.org/show_bug.cgi?id=1411795)

and instead of fixing that, they introduce this shit? Fuck Mozilla. I am
already using Waterfox; looks like that won't be changing any time soon.

~~~
mceachen
Unless they only have one engineer, adding a new feature is not at the expense
of fixing one bug.

~~~
ethanwillis
Yet somehow they haven't allocated 1/N engineers to fix a 3-year-old
accessibility bug.

~~~
chii
Whether they are non-profit or not, they still run like a business, and the
business decision to allocate resources are likely to use the same metrics as
any other businesses - maximize profit protential (however you define it).

It may just be that fixing accessibility bugs just doesn't produce much
"profit".

