
Future of DeepSpeech / STT after recent changes at Mozilla - trowngon
https://discourse.mozilla.org/t/future-of-deepspeech-stt-after-recent-changes-at-mozilla/66191
======
jarym
Maybe we should try to find a list of exactly what they are focussing on going
forward instead of the slow drip of things they’re cutting back on (servo,
MDN, DeepSpeech...)

It’s a sad, sad day when you have an organisation getting hundreds of millions
in funding and turning away from what it’s good at. In my eyes the decline has
begun, though it may not become apparent for a few years yet.

~~~
mrweasel
Cutting out DeepSpeech seems sensible to me, it’s out of place in the general
portfolio of products.

It would be nice if Mozilla could tell us what their focus is going to be, but
I doubt that Mozilla management know at this point.

At this point I’m somewhat concerned that Firefox will be irrelevant in five
years, and I don’t currently feel that Mozilla is communicating clearly that
they still care about Firefox. I assume they must, but it would be comforting
to know that Firefox is still at the core of Mozilla’s strategy.

~~~
Cybiote
> Cutting out DeepSpeech seems sensible to me, it’s out of place in the
> general portfolio of products.

I disagree precisely because of the point you make later: _"I’m somewhat
concerned that Firefox will be irrelevant in five years"_.

Functionality provided by deep learning is going to be an important component
of many types of software interactions going forward. The logistics of this
will be quite different from what we are used to in open source, with the need
to fund and coordinate compute, collect and handle data being a more vital
aspect compared to the past.

There are STT packages, some mentioned in this thread, that match or even
beat DeepSpeech, but none of them are as ergonomic. Accounting for the
value of time, this means it will be more cost effective to outsource such
capabilities to the cloud, which comes with trade-offs that are difficult to
appreciate in the short term:
[https://news.ycombinator.com/item?id=24236489](https://news.ycombinator.com/item?id=24236489)

I'd say DeepSpeech fits in the mold of Mozilla as a company providing
solutions to complicated software problems that are better at respecting the
user and their privacy.

In the old days, the most accurate TTS and STT models were built into the OS.
These days, you need to call into the cloud to get the best stuff. In [1],
Internet Archive complains about the quality of their OCR software. It's not
that OCR is so bad, it's that the best OCR is found on Google's and
Microsoft's computers. It's possible to cobble something together using open
source solutions like EasyOCR, Tesseract+OpenCV but that will only get you
part of the way there. What makes the cloud offerings so good is they have
enough resources to devote to pre-processing pipelines and architecture tweaks
and settings better able to handle edge cases. Most of the mass resides in
edge cases.

From my vantage, the future looks to be one of software as thin layers built
atop APIs which call into programs running on the servers of a handful of
companies. You might not think this a big deal, but these programs will be the
ones scanning the environment, writing the emails, completing the thoughts and
planning the calendars for the majority of humans.

[1] [https://blog.archive.org/2020/08/21/can-you-help-us-make-
the...](https://blog.archive.org/2020/08/21/can-you-help-us-make-the-19th-
century-searchable/)

~~~
posguy
Based on the testing I just did with Vosk, Mozilla DeepSpeech, Google Speech
to Text and Microsoft Azure, I disagree with your argument that SaaS has the
best quality results.

Mozilla DeepSpeech was definitely trailing the bleeding edge, but Vosk using
the vosk-model-en-us-daanzu-20200328 model produces very accurate results even
on uncommon words, similar in performance to Google & Microsoft (which has
generally better formatting than Google's STT).

Try it yourself:

Google: [https://cloud.google.com/speech-to-
text/](https://cloud.google.com/speech-to-text/) See "Put Speech-to-Text into
action" header

Microsoft: [https://azure.microsoft.com/en-us/services/cognitive-
service...](https://azure.microsoft.com/en-us/services/cognitive-
services/speech-to-text/#features) See "Upload File"

Vosk: [https://alphacephei.com/vosk/](https://alphacephei.com/vosk/)

Had Mozilla provided 4x to 8x more GPU resources and more staff, then their
STT would likely be competitive. Other small STT developers can iterate and
test much faster due to having more hardware at their disposal.
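
A comparison like the one above can be made less subjective by scoring each engine's transcript against a hand-checked reference. The sketch below uses word error rate (word-level edit distance divided by the reference length), which is a common STT metric but is an assumption here, not necessarily how the test above was run. The engine names and transcripts are made-up placeholders, not real output from any of the services mentioned.

```python
from typing import Dict, List, Tuple


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def rank_engines(reference: str, transcripts: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (engine, WER) pairs sorted from most to least accurate."""
    scored = [(name, word_error_rate(reference, text)) for name, text in transcripts.items()]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Placeholder data: replace with a real reference and real engine output.
    reference = "the quick brown fox jumps over the lazy dog"
    transcripts = {
        "engine_a": "the quick brown fox jumps over the lazy dog",
        "engine_b": "the quick brown box jumps over lazy dog",
    }
    for name, wer in rank_engines(reference, transcripts):
        print(f"{name}: WER {wer:.2%}")
```

WER ignores punctuation and capitalisation quality, so it would not capture the formatting difference between Google and Microsoft noted above.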

------
echelon
I'm not sure about Mozilla's efforts in STT, but they were lagging pretty far
in TTS. [1]

Google/Baidu, universities, and an assortment of Chinese/Japanese/Korean
social media companies (Line, etc.) are posting the most compelling TTS
research, models, and code. Mozilla's TTS system [2] is an amalgam of some of
these models, but it lags pretty far behind state of the art.

Mozilla should focus on getting additional revenue streams. We can help them
out by trying to get Congress / DOJ to strip Google of its ability to have and
maintain a browser with which they entrench their search and advertising moat.
I think they're clearly in antitrust/anticompetitive territory.

[1] I'm pretty familiar with this field as I wrote
[https://vo.codes](https://vo.codes) and
[https://trumped.com](https://trumped.com) TTS systems. Neither of those are
state of the art in terms of mean opinion score (MOS), but they're incredibly
efficient.

[2] [https://github.com/mozilla/TTS](https://github.com/mozilla/TTS)

~~~
nshm
That is understandable given that there was a single developer working on TTS.
It is hard to compete with big academic teams/industry players this way.

I also believe the Mozilla team was restricted by a lack of computing
resources. They had just a single 8-GPU server or so.

~~~
qchris
This is an area that I find unbelievably frustrating. A lack of computing
resources in the current day is kind of insane. You can buy an 8GB GPU for
<$1000. Even with the rest of the costs, the cost of hardware like this is a
drop in the bucket when your main office is housed in Mountain View!
Especially on a project that ends up being public-facing, these are missed
opportunities where a little can go a long way.

~~~
nmstoker
I take your point, but according to the release details on the repo it was not
8GB on one card but a server with 8 cards, each a Quadro RTX 6000 with 24GB,
and they're around £4k each currently, so the cost of the GPUs alone is £32k.

[https://github.com/mozilla/STT/releases/tag/v0.8.2](https://github.com/mozilla/STT/releases/tag/v0.8.2)

~~~
qchris
Ah, I see-- not an 8GB, 1-GPU server, but an 8-GPU server. That does make a
bit of a difference, changing the cost from a new workstation to functionally
a piece of capital equipment. Still, I'm not sure that my point about
equipment costs falls short--even at (call it) $40K, you're probably talking
less than 3 months of the company's all-in cost for the developer themself,
amortized over multiple years.
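
The arithmetic in this exchange can be written out as a rough sketch. The card count, per-card price, the rounded $40K figure and the "3 months" comparison come from the comments above; the three-year amortization period is an assumption picked for illustration, since the thread only says "multiple years".

```python
GPU_COUNT = 8                   # cards in the server, per the release notes cited above
GPU_PRICE_GBP = 4_000           # "around £4k each"
BUDGET_USD = 40_000             # the rounded "call it $40K" figure
MONTHS_OF_DEVELOPER_COST = 3    # "less than 3 months" of all-in developer cost
AMORTIZATION_YEARS = 3          # assumption: the thread only says "multiple years"


def gpu_total(count: int, unit_price: int) -> int:
    """Total price of the GPUs alone."""
    return count * unit_price


def implied_annual_developer_cost(budget: float, months: float) -> float:
    """All-in yearly developer cost at which `budget` equals `months` of that cost."""
    return budget * 12 / months


def yearly_hardware_cost(budget: float, years: float) -> float:
    """Straight-line amortization of the hardware budget."""
    return budget / years


if __name__ == "__main__":
    print(f"GPUs alone: £{gpu_total(GPU_COUNT, GPU_PRICE_GBP):,}")
    threshold = implied_annual_developer_cost(BUDGET_USD, MONTHS_OF_DEVELOPER_COST)
    print(f"Break-even all-in developer cost: ${threshold:,.0f} per year")
    per_year = yearly_hardware_cost(BUDGET_USD, AMORTIZATION_YEARS)
    print(f"Hardware per year over {AMORTIZATION_YEARS} years: ${per_year:,.0f}")
```

In other words, the "less than 3 months" claim holds whenever the developer's all-in cost is above $160K per year.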

------
taf2
I see a lot of what appears to be overreaction... going by the first part of
the announcement, it doesn’t sound like DeepSpeech is ending:

“Most of the technical changes were already landed, and we see no reason not
to ship it. We’ll be releasing 1.0 soon and encourage everyone to update their
applications”

So looks like at least 1.0 is near and still gonna happen... I know these seem
like dark times for Mozilla but I believe they will survive. As I recall the
decline of Netscape was a pretty dark time and out of that came Phoenix - er
Firefox and here we are today... I’m sure Mozilla and many of the great
projects will survive

------
no_wizard
I don’t know what is going to save Mozilla, really I don’t. I just wish there
was a way to “reach” them and discuss how we the internet community could come
to an agreement about what they could do to derive value we would pay for.

It’s not for a lack of trying on their part for sure, but it feels like just
using their browser isn’t all there is to it any more

~~~
floatingatoll
What would you personally pay Mozilla for?

~~~
echelon
Firefox, Rust, and privacy.

It'd be really awesome if they could develop a search engine or phone (I know
they tried) that had an open standards / web-compatible development kit.

I want an anti-Google / anti-Apple. Something we own and can extend. Something
that doesn't sell our data.

I'd also like to see Mozilla doing lobbying. Partnering with the EFF. We've
strayed so far from the bright and open Internet of the 90's and 00's. It's
depressing to think about how locked up and proprietary it's all become.

I'll buy Mozilla / Firefox merch. I'll pay a subscription.

edit: Talk to Shuttleworth. Fold Ubuntu in. I'll buy a Mozilla phone and a
Mozilla laptop.

~~~
feanaro
I feel bad for doing a "me too" comment, but you've _nailed_ exactly my
thoughts on the subject. I feel like Mozilla hasn't really tried something
like this. Every time it gets suggested, it quickly gets shot down (by other
internet commenters) as "can't be done" and "wouldn't generate nearly enough
money".

Well... maybe not with that CEO salary.

~~~
echelon
Mozilla can model itself after Microsoft somewhat.

Provide a development stack (they're experts at Web and Rust). Make themselves
the go-to shop for developers in that realm.

Sell them on an OS and editor with support. Partner with Ubuntu. Hell, I would
even reach out to Nadella and see if they'd be willing to work with Mozilla on
hedging against Google. Mac is becoming locked down and kind of unpleasant to
develop on/for. Mozilla could win this.

Block all the advertising and tracking. Build a Spotify-like news aggregation
service you can access from your Mozilla subscription.

Build an email service like Hey and a file backup service like Dropbox. It's
too bad Zoom bought Keybase, but perhaps Chris Coyne wants a new gig?

We should team up to beat FAAMG. Most of the FAAMG actors are actually quite
damaging to open source despite benefiting from it greatly.

~~~
no_wizard
This all sounds to me like capital-intensive businesses against entrenched
players, where even the not-so-average consumer would likely do no more than
pay lip service unless there was some secret sauce that made it more
compelling than the existing options.

They need a good out-of-the-park product in those markets to make any real
headway. Too idealistic.

My only thought on this is that they should pivot to be like Algolia: focus
on Firefox being a reference implementation browser and sell their expertise
to the other vendors, maybe. It’s one of the few verticals I can think of that
would work strategically without them having to pivot into things they have no
experience with.

~~~
qchris
Do they? I mean, most of these vendors are already competing, and unlike
Firefox, they're not necessarily competing for the average Joe, but technical
users who often have different priorities.

Those are also services that groups are used to paying for already, which
means if they could eat the start-up costs, even at a reduced scale, they
could make a profit at even a slight premium for things that they _already_ do
very well, and go from there.

------
eruleman
Does anyone know of other open-source projects in the speech-to-text space?
DeepSpeech was one of the most promising projects, especially the latest
versions...

~~~
albertzeyer
There are a lot of open source projects in this space. DeepSpeech is actually
one of the outsiders (it is not well represented in the academic
community), and also not quite competitive with other software (at least last
time I checked).

E.g. some very active projects are:

* Kaldi ([https://github.com/kaldi-asr/kaldi/](https://github.com/kaldi-asr/kaldi/)) obviously, probably the most famous one, and most mature one. For standard hybrid NN-HMM models and also all their more recent lattice-free MMI (LF-MMI) models / training procedure. This is also heavily used in industry (not just research).

* ESPnet ([https://github.com/espnet/espnet](https://github.com/espnet/espnet)), for all kind of end-to-end models, like CTC, attention-based encoder-decoder (including Transformer), and transducer models.

* Espresso ([https://github.com/freewym/espresso](https://github.com/freewym/espresso)).

* Google Lingvo ([https://github.com/tensorflow/lingvo](https://github.com/tensorflow/lingvo)). This is the open source release of Google's internal ASR system, and used by Google in production (their internal version of it, which is not too much different).

* NVIDIA OpenSeq2Seq ([https://github.com/NVIDIA/OpenSeq2Seq](https://github.com/NVIDIA/OpenSeq2Seq)).

* Facebook Fairseq ([https://github.com/pytorch/fairseq](https://github.com/pytorch/fairseq)). Attention-based encoder-decoder models mostly.

* Facebook wav2letter ([https://github.com/facebookresearch/wav2letter](https://github.com/facebookresearch/wav2letter)). ASG model/training.

* (RETURNN ([https://github.com/rwth-i6/returnn](https://github.com/rwth-i6/returnn)) and RASR ([https://github.com/rwth-i6/rasr](https://github.com/rwth-i6/rasr)), our own, although this is currently free for academic use only. It is used in production as well. Supports hybrid NN-HMM, CTC, end-to-end attention-based encoder-decoder, transducer, etc.)

And there are many more.

You will also find lots of ready-to-use trained models.

~~~
Bootwizard
Can you run audio files through any of these or do they only support audio
from microphones?

~~~
nmstoker
At the point where they take in input to process, audio that comes from a
microphone and audio that comes from a file are both basically just a series
of numbers, so there's no barrier in terms of feasibility.

Whether they're all set up to do that "off the shelf" is a different matter
but it should be fairly straightforward to add this to any that lack it and
because they're open-source anyone could do a bit of Googling etc and find
suitable code to adapt to do it. I know DeepSpeech definitely can take audio
from files directly as input as I've used it that way before, and I strongly
expect many (or possibly all) of the others could too.
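
To illustrate the "series of numbers" point, here is a minimal sketch using only Python's standard library. It assumes 16-bit mono PCM WAV input and builds a tiny in-memory tone so it runs without any audio file. The helper names are made up for this example, and no particular engine's API is shown.

```python
import array
import io
import math
import sys
import wave


def read_pcm16(source):
    """Return (sample_rate, samples) for a 16-bit mono WAV path or file-like object."""
    with wave.open(source, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch only handles 16-bit mono PCM")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())
    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return sample_rate, samples


def make_test_tone(sample_rate=16000, n_samples=160, freq=440.0):
    """Build a short sine tone as an in-memory WAV file (a stand-in for a real recording)."""
    tone = array.array(
        "h",
        (int(10000 * math.sin(2 * math.pi * freq * i / sample_rate)) for i in range(n_samples)),
    )
    if sys.byteorder == "big":
        tone.byteswap()
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(tone.tobytes())
    buf.seek(0)
    return buf


if __name__ == "__main__":
    rate, samples = read_pcm16(make_test_tone())
    # A microphone capture library would hand over the same kind of integer buffer.
    print(rate, len(samples), list(samples[:5]))
```

Whatever engine is used, the remaining work is mostly matching the sample rate and channel layout its model expects.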

------
ianlevesque
Between this and Servo I guess Mozilla is just giving up on relevance. That
really sucks.

~~~
marcinzm
That’s what got them in this mess in the first place, fifty pie in the sky
projects to be relevant instead of focusing on Firefox or just saving revenue
aggressively.

------
utunga
I work with Mozilla's DeepSpeech every day. Mozilla's STT is critical to the
survival of important indigenous languages throughout the world.

I sincerely hope we can help make this project continue and that Mozilla can
help us do that.

Ensuring indigenous languages have digital representation is essential to
their survival. Speech recognition and synthesis are a vital part of that.
Indigenous communities are often ignored by Big Tech because they bring little
financial value to their bottom lines, but financial bottom lines are not
everything. Culture is more important. Open source tools like DeepSpeech allow
communities to build the tools they need for themselves.

Māori have been working to help build tools for te reo Māori, and our project
is at the forefront of using open source tools like DeepSpeech to revitalize
the Māori language. The core of a good speech recognition system helps us in
many practical ways, such as improved transcription, support for
pronunciation, correct announcements in public transport, correct information
on maps and in many other ways. We may well continue to support and use
DeepSpeech if the project can continue.

But there are also many other projects in other countries in the world who may
follow on - such as the Kabyle people of Algeria who are using DeepSpeech, or
the Mohawk nation in North America who have been looking into it.

By the way we are working on our web presence but for now this quick one pager
gives some idea of the work we are doing -
[https://papareo.nz](https://papareo.nz).

~~~
nshm
Is your current data public? Do you have a sufficient amount of untranscribed
data (1000+ hours)? We could help you.

------
bananaface
Are they still collecting data for Common Voice, or is it both projects that
they're terminating?

~~~
nshm
CV is also in a not quite clear state:

[https://discourse.mozilla.org/t/mozilla-org-wide-updates-
imp...](https://discourse.mozilla.org/t/mozilla-org-wide-updates-impacts-on-
common-voice/65612)

------
awalton
Ugh this is a deep gut punch, as this is one of the most interesting recent
projects Mozilla was working on in my opinion.

------
_underfl0w_
Yikes. All this because they refuse to trim the fat at the C-level. A company
can't be profitable by only employing overhead. They'll all be forced to take
the ultimate pay cut when Mozilla closes up shop.

------
milofeynman
I was hoping DeepSpeech would lead to in home cloud-less "Alexas". Just ask me
for a subscription on it and productize it please.

~~~
nshm
There are many cloud-less Alexas already, a good one is rhasspy
[https://rhasspy.readthedocs.io/en/latest/](https://rhasspy.readthedocs.io/en/latest/),
it is not based on deepspeech though.

~~~
nmstoker
In the repo/docs, it suggests that DeepSpeech is an option for some languages
(English & German). Haven't tried it, but with recent(ish) performance
improvements in DS it can run on somewhat less powerful computers than used to
be the case.

Other options for similar assistants that can also use DS are Mycroft
([https://mycroft-ai.gitbook.io/docs/using-mycroft-
ai/customiz...](https://mycroft-ai.gitbook.io/docs/using-mycroft-
ai/customizations/stt-engine)) and DragonFire
([https://github.com/DragonComputer/Dragonfire](https://github.com/DragonComputer/Dragonfire))

------
markthethomas
DS is by far the easiest to use/most promising ASR library/kit/thing I’ve
used; really hoping it keeps going

------
badrabbit
Are there other foundations like Mozilla we can donate to, for initiatives
that are in the interest of the public? The Apache foundation is all I can
think of, but they focus on corporate-use projects.

------
Causality1
Really a shame. I find so many of Google's STT quirks infuriating enough I'd
love a robust alternative.

------
neolog
Where does it say DeepSpeech is on hold? I don't see that anywhere.

~~~
dang
Submitted title was "Mozilla to put DeepSpeech project on hold". We've
replaced that with the article title per this guideline: "_Please use the
original title, unless it is misleading or linkbait; don't editorialize._"
[https://news.ycombinator.com/newsguidelines.html](https://news.ycombinator.com/newsguidelines.html)

~~~
Vinnl
I guess that explains why I was under the impression the project was being
shuttered after reading the comments, but not the actual post yet.

The actual forum post just says they don't know anything about the future of
DeepSpeech yet, for those doing the same.

------
jonny383
Mozilla let politics take over its corporation to the point where it's
basically a far left extremist group now that's relying on semi-bribe funding
from Google.

There's no technology left anymore.

