
Offline voice AI within 512 KB of RAM [video] - kenarsa
https://www.youtube.com/watch?v=WadKhfLyqTQ
======
microtherion
"Any sufficiently advanced technology is indistinguishable from a rigged
demo." — James Klass

In particular, voice AI systems are by their nature not very discoverable,
which makes it hard to assess them quickly. What the video shows is a system
that can recognize some input sentences and speak some output sentences. It
does not show the size and flexibility of the command set, and it only gives a
vague impression of the speech output quality and recognition accuracy.

Technically, you could reproduce this demo with the system software that
shipped on a Mac Quadra 840AV in 1993.

~~~
kenarsa
This is a really good point. That is why we partially open-sourced our
technology, to enable unbiased third-party evaluation. You can run the exact
same demo on a Linux box or a Raspberry Pi (any variant) using what's
available in the project's GitHub repository:
[https://github.com/Picovoice/rhino](https://github.com/Picovoice/rhino)

We are in the process of open-sourcing a statistically significant benchmark
for this tech, but that will happen in 2019.

~~~
xena
I would love to try and play with this to recognize non-English (lojban). How
can I add/train voice samples to this?

------
hliyan
I suspect that within the next two years, as this type of thing (voice
recognition, voice synthesis, image recognition) becomes ubiquitous, it will
cease to be called AI and the bar will be raised further. Very soon,
applications such as Duplex [1] will become the minimum bar for AI.

[1] [https://ai.googleblog.com/2018/05/duplex-ai-system-for-
natur...](https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-
conversation.html)

~~~
solarkraft
I hope so. But investors are so hot to invest in "the future" that Markov
chains currently get called "AI", and that will probably last for another
quarter or two.

It's not as vaporware-y as blockchain, but I still see some potential for the
bubble to pop. At some point people will hopefully realize that not everything
that can be done with a neural network needs to be done with one (and tons of
data).

~~~
tree_of_item
It looks like you took the exact opposite inference from the parent than I
did: it's not about vaporware, it's about people's impossible standards for
what's considered "artificial intelligence".

The neural network stuff is anything but vaporware; it has delivered
incredible results. But people keep coming up with silly dismissals along the
lines of it not being "real AI".

~~~
hliyan
> but people keep coming up with silly dismissals along the lines of it not
> being "real AI".

That is not what I meant at all. For me, it is the nature of the field and has
been that way since the days I first learned it (the 1990s). Once a
reproducible algorithm or methodology is discovered to solve an AI problem, it
generally ceases to be an AI problem.

~~~
tree_of_item
Yeah, I actually agree with that. I wonder why you think our statements were
incompatible, because that is exactly the sort of thing I would have described
as a "silly dismissal". The problem was clearly an AI problem right up until
it was solved, and then suddenly it's "not real AI" and people talk about AI
being overhyped or whatever.

Imagine if people treated programming languages like that. People would get
excited about the idea of communicating with a computer, and then when you
finally build Python and show it to them, they say "but that's just parsing,
what about a real _language_?" The bar just keeps rising whenever you get
close to it. That's the sort of thing I meant by "silly dismissal".

------
01100011
So admittedly I know next to nothing about CNNs and such, but AFAIK, isn't
training the difficult, resource-intensive part of CNNs? Once you've figured
out the coefficients, I think you can implement the pattern matching with an
order of magnitude fewer resources, or less. You don't even need
high-precision arithmetic, right?

~~~
jdietrich
Yes, and it's a very important property of CNNs. You can do the hard
computational work of training in the data center, but get results from
inferencing at the edge. In a sense, you're using the trained network as an
energy storage system, shifting energy requirements away from power- and cost-
constrained devices.

Inferencing is also highly amenable to hardware optimisation, which we're
starting to see in the latest flagship mobile SoCs. I expect to see low-cost
microcontrollers with inference accelerators within the next couple of years.
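
As a concrete illustration of the low-precision point: quantize a trained
layer's float weights to int8 once, then run inference with integer
multiply-accumulates and a single rescale at the end. The weights and input
below are invented for the sketch, not taken from any real model.

```python
import numpy as np

# Made-up "trained" weights and one input vector, in float32.
w = np.array([0.42, -1.30, 0.07, 0.88], dtype=np.float32)
x = np.array([1.5, -0.2, 3.1, 0.4], dtype=np.float32)

def quantize(a):
    """One-time, per-tensor symmetric quantization to int8."""
    scale = np.abs(a).max() / 127.0
    return np.round(a / scale).astype(np.int8), scale

wq, w_scale = quantize(w)
xq, x_scale = quantize(x)

# Edge-side inference: int8 multiplies accumulated in int32,
# with a single float rescale at the very end.
acc = int(np.dot(wq.astype(np.int32), xq.astype(np.int32)))
y_quant = acc * w_scale * x_scale

y_float = float(np.dot(w, x))   # full-precision reference
print(y_float, y_quant)         # the two agree to within ~1% here
```

The expensive float arithmetic happens once, at quantization time; the device
only needs small integer MACs.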

------
adrianN
When we get voice-activated doodads that don't have to send my voice to the
mothership, I might finally get one.

~~~
fooker
Still have to shout "GoPro, take a photo" at a crowded location, drawing
attention. Yikes, no thanks!

I'll be interested when 'reading' thoughts without phoning home becomes a
reality.

~~~
loa-in-backup
That's a great opportunity to implement a throat microphone. Unfortunately,
GoPros don't have configurable audio input, AFAIK.

------
emcq
How does this compare to approaches like the ones MSR presented at NeurIPS
this year, which also do keyword detection on the Pi Zero or, impressively,
the M4F?

[https://dkdennis.xyz/](https://dkdennis.xyz/)

~~~
thegabriele
I don't know about your specific case, but a year ago I was able to stitch
together CMU Sphinx, a Pi Zero, a Bluetooth headset and Google voice
recognition into a working keyword detection + voice-to-text system (for
Italian).

------
janjongboom
See also uTensor
([https://github.com/uTensor/uTensor](https://github.com/uTensor/uTensor))
which scales down TensorFlow-trained models to something you can run on a
microcontroller. Currently working on hardware acceleration on MCUs with DSP
extensions.

------
d33
Makes me wonder how much AI we could fit into ZX Spectrum, an old Amiga or
80386...

~~~
emcq
Not much. Even a tiny ARM Cortex-M4, which could live in a hearing aid for a
week on battery, typically runs at 64 MHz with single-cycle MACs.

The Z80, I believe, was something like 4 cycles per instruction at only a
megahertz, so we would be talking at least 16x slower than an M4. You would
need something between a 486 and a Pentium to get close to the M4, and then
even further to get to the M7. If I remember correctly, you couldn't even
decode MP3s in real time until the faster 90+ MHz 486s.
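
Taking those nominal figures at face value (single-cycle MACs at 64 MHz versus
roughly 4 cycles per instruction at about 1 MHz), the gap is even wider than
16x; and since the Z80 has no hardware multiplier, a real multiply-accumulate
would cost many instructions, widening it further:

```python
# Back-of-the-envelope throughput, using the nominal figures above.
m4_macs_per_sec = 64_000_000        # 64 MHz, one MAC per cycle
z80_insns_per_sec = 1_000_000 / 4   # ~1 MHz, ~4 cycles per instruction

ratio = m4_macs_per_sec / z80_insns_per_sec
print(ratio)  # 256.0, so "at least 16x" is a comfortable lower bound
```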

~~~
solarkraft
These things are slow (by modern standards), yeah, but it was possible to get
the model this far down ... how far can we go?

~~~
bhouston
Must be a way to prune the NN with some reduction in quality. Probably a means
to reduce the quality until it fits on an Arduino and sounds like an Anonymous
video.
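
Magnitude pruning is one standard way to shrink a network at some cost in
quality (not necessarily what this demo uses); a minimal sketch with made-up
weights:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights (ties may zero extra)."""
    flat = np.abs(weights).ravel()
    k = int(flat.size * sparsity)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]  # k-th smallest magnitude
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.array([[0.9, -0.01, 0.3],
              [-0.02, 0.7, 0.05]])
wp = magnitude_prune(w, sparsity=0.5)
print(wp)  # the three smallest-magnitude weights (-0.01, -0.02, 0.05) are zeroed
```

The zeroed weights can then be stored sparsely or skipped at inference time,
which is where the memory and compute savings come from.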

There was speech synthesis on the Apple IIe in like 64 KB of RAM and a 1 MHz CPU.

~~~
jsjohnst
> There was speech synthesis on the Apple IIe in like 64 KB of RAM and a 1 MHz
> CPU.

Speech synthesis is far, far easier than parsing a human voice, especially
when it doesn’t need to sound realistic (as was the case back then).

~~~
thesz
The current approach to speech recognition is about as old (maximum likelihood
using WFSTs).

In the old papers on the problem, vocabulary size was capped at 64K words,
because nothing worked for bigger vocabularies.
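
The maximum-likelihood idea can be shown with a toy Viterbi decode over a
two-state HMM; real systems compose WFSTs over vastly larger state spaces, and
all the numbers here are invented for illustration:

```python
import numpy as np

log = np.log

# Toy HMM: state 0 = "silence", state 1 = "speech"; two observation symbols.
trans = np.array([[0.7, 0.3],   # P(next state | current state)
                  [0.2, 0.8]])
emit = np.array([[0.9, 0.1],    # P(observed symbol | state)
                 [0.2, 0.8]])
start = np.array([0.5, 0.5])
obs = [0, 1, 1, 0]              # observed symbol sequence

# Viterbi: find the maximum-likelihood hidden state path.
v = log(start) + log(emit[:, obs[0]])
back = []
for o in obs[1:]:
    scores = v[:, None] + log(trans)   # scores[i, j]: best path ending i -> j
    back.append(scores.argmax(axis=0))
    v = scores.max(axis=0) + log(emit[:, o])

# Trace the best path backwards from the most likely final state.
path = [int(v.argmax())]
for b in reversed(back):
    path.append(int(b[path[-1]]))
path.reverse()
print(path)  # [0, 1, 1, 0]: silence, speech, speech, silence
```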

------
imagine99
I'd love to tinker with that for my own home automation projects but couldn't
find any indication of whether it would be available and affordable for such a
use case. Any clues as to licensing fees for private/non-profit use?
Especially the offline aspect makes it very attractive, not only from a
privacy standpoint but also from a reliability one.

~~~
kenarsa
Hello. This is Alireza. I am the founder of Picovoice.

I totally understand the need to support the maker community. We do have
GitHub repositories for the engines demoed here, which allow you to use these
technologies to some extent (not the full set of capabilities). I am working
with our partners (both SoC and distribution) to come up with a maker-specific
product for evaluation and personal use. It will most probably be a HW/SW
product (i.e. a board that comes with our software). The product should allow
you to use the full set of features on that specific board. I am expecting
this to happen in 2019 and will disclose the information as I figure things
out.

~~~
raidicy
I would love to be able to use this for personal use. I have RSI, and WSR and
Vocola 3 work OK-ish for programming, for the most part. But they're locked to
my specific system. I would really like something embedded that I could take
to any computer, plug in or connect over WiFi, and dictate with decent
accuracy.

------
v_lisivka
This device is relatively powerful. I'm able to run Fedora with Wayland on a
similar device (the i.MX6SL EVK). DNF is slow the first time, so I need to
wait 20 minutes until all the data is parsed, but then it works fine.

~~~
kenarsa
The device you are referring to is quite different. I take it this is the
board you are using?

[https://www.arrow.com/en/reference-designs/imx6slevk-
imx-6so...](https://www.arrow.com/en/reference-designs/imx6slevk-
imx-6sololite-evaluation-kit-based-on-imx-6sololite-applications-
processor/cbce9d413872d15fe5417bc807cb583b)

It has an ARM Cortex-A9 with the NEON extension instead of an ARM Cortex-M7.
It is basically a different family of i.MX processors.

