
Machine learning can run on tiny, low-power chips - sebg
https://petewarden.com/2018/06/11/why-the-future-of-machine-learning-is-tiny/
======
symisc_devel
Correct. Powerful enough embedded devices are now de facto everywhere. We just
released an open source computer vision and machine learning library developed
initially for a French conglomerate specialized in IoT devices.

The library is cross-platform and supports real-time, multi-class object
detection and model training on embedded systems with limited computational
resources and on IoT devices.

[https://sod.pixlab.io](https://sod.pixlab.io)

[https://github.com/symisc/sod](https://github.com/symisc/sod)

~~~
exikyut
This reply is completely tangential to the focus/topic of your comment, but I
wanted to say: _THIS_ is _the model_ of how to do open source.

The developers get financial security while they're working so they can focus,
everyone is funded to sit in one place (sometimes), which makes for great
communication... and then everybody (society as a whole) gets to benefit.

If we don't figure out how to make computers write our programs for us within
the next 10 years, this is the development model of the future.

~~~
ujal
"If you wish to derive a commercial advantage by not releasing your
application under the GPLv3 or any other compatible open source license, you
must purchase a non-exclusive commercial SOD license. By purchasing a
commercial license, you do not longer have to release your application's
source code." \--

~~~
exikyut
I'd argue this is even better: anybody who plans to benefit commercially must
pass the fruits of their success back upstream.

------
oulipo
At Snips we are running all our Voice AI models on embedded devices (like a
Raspberry Pi 3) and we can also work on MCUs, and we believe that embedded ML
will be the preferred way to solve privacy and efficiency challenges in the
future (disclaimer: I'm a co-founder)

If you are interested, you can start building your own Voice AI for free and
make it run on embedded devices in under an hour:
[https://snips.ai](https://snips.ai)

~~~
adamdrake
Fully agreed. I think the privacy angle is particularly compelling, and doing
on-device analytics using models that have low memory requirements and
acceptable (although not academically impressive) accuracy will be the norm.

I wrote about the privacy perspective a bit:

[https://adamdrake.com/scalable-machine-learning-with-fully-anonymized-data.html](https://adamdrake.com/scalable-machine-learning-with-fully-anonymized-data.html)

and I recently gave a lecture more focused on the performance aspect:

[https://adamdrake.com/big-data-small-machine.html](https://adamdrake.com/big-data-small-machine.html)

The case for doing centralized data collection and model training seems to be
increasingly related to corporate greed and moat-building rather than actually
providing a good experience for users.

~~~
endymi0n
Technically, on-device processing is clearly the way forward (it's interesting
how Apple is currently pioneering the field in a way).

The pessimist in me already sees how three-letter agencies worldwide will
welcome this change in order to push their selectors down to the device as
well. Recording only the one percent of potentially relevant conversations
will make backdoors exponentially easier to hide in the background traffic,
and much lighter to process.

------
Protostome
One has to distinguish between training and inference when talking about
"machine learning". Training a model is a long and resource-intensive
process, even if transfer learning is used.

Inference is much less energy intensive and could be done on small chips.

Regardless, I'm not as certain as the author about the future of ML on small
devices. Some ML models are huge and need to be updated frequently, so there
is little sense in downloading them to small devices. In such cases, it
makes much more sense to send feature data to a remote server that can
generate a prediction within milliseconds, and then transmit that prediction
back to the device.
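
A minimal sketch of what that split can look like from the device side
(hypothetical message layouts of my own, not any particular protocol or
product): the constrained device ships only its extracted feature vector and
receives a small prediction message back, so bandwidth and on-device compute
both stay tiny.

```c
#include <stdint.h>

/* Hypothetical wire format for off-device inference: the device sends
 * extracted features rather than raw sensor data, and the server replies
 * with a class id and confidence within a few milliseconds. */
#define NUM_FEATURES 32

typedef struct {
    uint32_t device_id;                 /* which sensor node is asking      */
    uint32_t timestamp_ms;              /* when the features were captured  */
    float    features[NUM_FEATURES];    /* computed on-device               */
} inference_request_t;

typedef struct {
    uint16_t predicted_class;           /* index into a shared label table  */
    uint16_t flags;                     /* e.g. "model was updated" hint    */
    float    confidence;                /* filled in by the remote model    */
} inference_response_t;
```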

~~~
smrtinsert
The more I learn about machine learning, the more it looks like it's really
all about training. Once training is done and a model is available, it seems
ready to be commoditized to me. ML as a service seems like the only reasonable
way the industry can evolve.

~~~
otterley
And it's coming: [https://lobe.ai](https://lobe.ai)

------
zopf
I still think the defining moment for ML inference (and maybe even training!)
on embedded devices will come when there are viable special-purpose, low-power
ML chips.

As much as I hate to do this, I'm going to make a comparison to Bitcoin
mining.

Mining is all about optimizing hashes/joule to get the best ROI. We watched it
go from CPU -> GPU -> FPGA -> ASIC in the quest for efficiency.

In some ways, we're seeing the same thing in ML model training and inference. CPU
-> GPU -> TPU. We're even seeing some special-purpose coprocessors deployed,
as in the iPhone X. ([https://www.wired.com/story/apples-neural-engine-infuses-the-iphone-with-ai-smarts/](https://www.wired.com/story/apples-neural-engine-infuses-the-iphone-with-ai-smarts/))

But I think the final leap will come by going from digital execution to
application-specific analog computing. If you don't need high precision, you
can compute extremely quickly and efficiently using properly-configured analog
circuits.

IBM is working on this kind of system with their TrueNorth line
([https://techcrunch.com/2017/06/23/truenorth/](https://techcrunch.com/2017/06/23/truenorth/))

It hasn't been proven yet, but I think there is huge potential.

~~~
alfalfasprout
I remain unconvinced we'll see ASICs dominating inference. Part of the problem
is that even if we're just talking about neural networks, there's a variety of
architectures, activation functions, etc. to consider. At this stage, from my
own benchmarking, Nvidia is close enough to the TPU with the V100 card while
allowing much more flexibility in the software stack used.

For inference, GPUs are also pretty damn efficient since it's an
embarrassingly parallel task w/ minimal synchronization (no gradient updates
needed). Compared to ASICs, FPGAs are a far better choice since you can push
updates to accommodate new network architectures, activation functions, etc.
The TPU instead relies on a matrix-multiply unit, which supports more use
cases but won't be as performant on something like an RNN.

~~~
cbHXBY1D
I think Microsoft's experience with FPGAs for inference would agree with you.

Currently, they are only allowing external customers to use ResNet-50 with
their FPGA-enabled Azure ML.

------
telltruth
A few folks have been preaching this a lot, but my understanding is that
devices/MCUs are getting more powerful over time, so the need to specialize
for low-end devices should decrease, not increase. People use the argument in
the article to spawn large teams who do nothing but optimize for low-end
devices, assuming devices won't progress over time. I do wonder whether this
is a good use of their time and talent.

~~~
arkh
It is always a cost analysis: even if small, powerful processors are
available, should you use them?

If you're shipping a million units, a 30-cent difference per unit (300,000
across the run) can pay for a year of a good dev.

~~~
iainmerrick
Small slow processors are likely to have much lighter power requirements too.
Barring a breakthrough in battery tech or wireless power, that’s going to be
important for a long time for many applications (especially IoT).

------
siscia
> A few years ago my priority would have been convincing people that deep
> learning was a real revolution, not a fad, but there have been enough
> examples of shipping products that that question seems answered.

Exactly which _product_ is the author referring to? I am having a hard time
thinking of one, but maybe it's just me living in my bubble...

~~~
alexgmcm
I guess all the voice assistants use Deep Learning as it is the state of the
art in voice recognition and NLP.

I'm almost certain they all offload the processing to a server though.

~~~
haarts
There is the exception of [https://snips.ai/](https://snips.ai/).

I think that offloading the processing is not functionally required, but
having the data is valuable for the big corps.

------
baybal2
There is a showstopper:

Running a neural algo with an already-formed net is easy-peasy. Doing actual
learning on an MCU for anything serious is still impossible.

Learning can be run on commodity GPUs/DSPs, and they will not be that much
worse than dedicated hardware. But on the embedded side, a small, low-power
ASIC is the only thing that makes 99% voice recognition a possibility.

This is why I think that learning startups will not get far in comparison to
companies that use the results of that learning, which can be produced in DCs
on commodity hardware.

~~~
rbanffy
You can give the illusion of edge learning by shipping datasets to large
number-crunchers in the cloud and receiving updated nets back from them. That
even gives one the benefit of learning from the collective experience of
fellow devices.

I wonder if we could (of course we can, I'm wondering if someone already did
it) split the training workload across a number of small embedded devices with
their tiny NEON units and have them share the resulting trained models. Making
nodes self-coordinate the shared workload and assemble the results would be
interesting.
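
A minimal sketch of the aggregation step this would need (plain C, hedged:
simple unweighted averaging in the spirit of federated learning, not any
particular framework's API). Each node trains on its local data and reports
its weight vector; the coordinator averages them into the shared model that
then gets pushed back out.

```c
#include <stddef.h>

/* Hypothetical coordinator-side step for training split across devices:
 * average the weight vectors reported by each node into one shared model.
 * A real scheme would weight nodes by how much data they trained on. */
static void average_models(const float *const *node_weights, /* [num_nodes][n] */
                           size_t num_nodes, size_t n,
                           float *shared /* [n] */)
{
    for (size_t i = 0; i < n; i++) {
        float sum = 0.0f;
        for (size_t k = 0; k < num_nodes; k++)
            sum += node_weights[k][i];
        shared[i] = sum / (float)num_nodes;
    }
}
```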

------
carlmr
That Cyber-Hans thing is already happening in industry. It was already
happening 15 years ago. At least, that's the first time I saw something like
this: a device that would detect whether roof shingles had a defect by tapping
them acoustically. They previously had a Hans doing it, who would knock on
them and listen, and they replaced him with a Cyber-Hans.

In this case it wasn't a neural net; I think it was simple multiple linear
regression plus a Fourier transform.

~~~
B1FF_PSUVM
> find ... a defect by tapping it acoustically

Why train wheels used to be tapped with a hammer at platforms:
[https://en.wikipedia.org/wiki/Wheeltapper](https://en.wikipedia.org/wiki/Wheeltapper)

~~~
carlmr
That's really cool!

~~~
B1FF_PSUVM
I think there are movies, with train platform scenes, where you can see the
railway guy going by with a mallet, giving the wheels a light tap and
listening to the sound.

------
roymurdock
The current state of the art in embedded/IoT ML is to train ML algos in the
cloud on large datasets, then run them on gateway-class devices (usually
Linux/MSFT boxes, but this can get down to RPi levels of memory and compute).
Most companies today use Docker to package and deploy the models, hence the
need for a larger-footprint box. Check out AWS Greengrass, Azure Edge, and
Foghorn for examples.

------
jguimont
Of course it should... :P

Neurones are electrical but mostly chemical in how they work. The average
speed of a connection from one neurone is 0.1-0.5 m/s. So if you hit your toe
on a chair and you are 1.8 meters tall (and pardon my rough math/science
here), it would take almost 1 second to reach your brain (of course this is
why reflexes are handled close to the spine and not in the brain).

And now imagine the complex processing that is required to see/hear and
recognize something. It is done quite quickly, and yet the basic processing
unit of the brain is slow. One might think it is the massive parallelism of
the brain that makes this possible so quickly, but even then, if you think
about it, all that processing done in such a small amount of time cannot be
more than a thousand operations deep...

~~~
taneq
> The average speed of a connection from one neurone is 0.1-0.5 m/s.

You're about an order of magnitude out there, according to
[https://en.wikipedia.org/wiki/Nerve_conduction_velocity#Norm...](https://en.wikipedia.org/wiki/Nerve_conduction_velocity#Normal_conduction_velocities)

(And just for a simple cross-check, it doesn't take a second for you to
perceive sensations from your toes. Not even close.)

------
dvfjsdhgfv
The author has some very good points. Also, modern MCUs like STM32 are
powerful enough to run a whole big operating system like Linux while keeping
power usage relatively low and being as cheap as 8-bit MCUs, so using them for
ML tasks on different devices is a natural step forward.

~~~
ramzyo
Which? I'd be hard-pressed to find an MCU that a) can run Linux, unless it's
MMU-less Linux (e.g. uClinux) or your definition of MCU includes architectures
like the Cortex-A with MMUs; b) has the RAM needed to run Linux, unless
external SDRAM or similar is provided on the PCB; and c) is as cheap as an
8-bit MCU like an AVR.

If Cortex-A-class, the iMX6UL from NXP comes to mind for a) and b), but
there's no way it also addresses c).

~~~
dvfjsdhgfv
I meant uClinux running on a Cortex-M3/M4, but I really hope to run real
Linux on the STM32MP, which was recently added to the Linux kernel - the
actual hardware is not released yet, though.

~~~
ramzyo
Got it - thanks for the clarification. Any idea on the price point for the
STM32MP?

~~~
dvfjsdhgfv
I have no idea and I don't want to speculate; I just hope to be pleasantly
surprised with a price in the Cortex-M7, sub-€10 range.

------
bogomipz
The article states:

>"This makes deep learning applications well-suited for microcontrollers,
especially when eight-bit calculations are used instead of float, since MCUs
often already have DSP-like instructions that are a good fit."

Can someone shed some light on what the author means by "DSP-like
instructions"? What are the characteristics of DSP instructions? Is there
something that makes them unique compared to general-purpose CPUs or GPUs?

~~~
jononor
Here it means SIMD/vectorized operations. An Arm Cortex-M4F with DSP
extensions can do 4 operations on 8-bit data at once.

For traditional signal processing (FIR/IIR filters), a dedicated opcode for
multiply-accumulate is also common, since it is used so much.

Saturated add/subtract is another typical 'DSP' kind of feature.

EDIT: such opcodes exist on many CPUs/GPUs also, but we are talking about
sub-milliwatt-capable devices here.
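
As a rough, hedged illustration of what those opcodes accelerate (plain
portable C, not any vendor library): the inner loop of a quantized layer is an
8-bit multiply-accumulate into a wider accumulator, followed by a rescale and
saturate. On a DSP-extended core the multiply-accumulates map onto SMLAD-style
instructions and the clamp onto a saturating instruction, instead of taking
several plain ALU ops each.

```c
#include <stddef.h>
#include <stdint.h>

/* Clamp a 32-bit accumulator into the int8 range; the kind of thing a
 * saturating DSP opcode does in a single cycle. */
static int8_t sat_q7(int32_t x)
{
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* 8-bit dot product with a 32-bit accumulator: the core of a quantized
 * fully-connected or convolution layer. out_shift rescales the result
 * back into the int8 range (arithmetic shift assumed, as on Arm). */
static int8_t dot_q7(const int8_t *x, const int8_t *w, size_t n, int out_shift)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * (int32_t)w[i];   /* multiply-accumulate */
    return sat_q7(acc >> out_shift);            /* rescale and saturate */
}
```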

~~~
neiltan
With utensor.ai, you can probably try this out today. We are currently
working on integrating CMSIS-NN with uTensor. CMSIS-NN is a set of these MCU
SIMD-optimized functions.

------
dominotw
Slightly tangential question. I ride my electric uniwheel on the sidewalks,
but sidewalks in my city sometimes have huge potholes, so I have to constantly
watch for potholes so I don't trip and lose half of my teeth.

Is it possible for me to mount a camera on my uni that can see potholes 10
feet away and beep my headphones? I am not sure where to even start with this.

~~~
btrettel
As a cyclist, I'd be interested in such a technology too.

Unfortunately, despite the lip service many US cities give to cyclists, when
it comes to practical issues like road quality, cities tend to not care. Here
in Austin there are quite a few bike lanes/cycletracks that are so bad that I
refuse to use them. Usually it's a combination of poor visibility of cyclists
in the lane (making being hit by turning drivers more likely) and poor road
quality (e.g., chip seal resulting in some of these lanes basically being
gravel). I've seen it claimed that the city regularly cleans out this gravel,
but I can only recall a few times over the past 5 years when I thought the
gravel might have been removed. I don't need machine learning to tell me to
avoid these roads, but help with the potholes would be welcome.

------
viksit
How would the smaller units handle larger ops and convolutions, RNNs and
others? Even assuming custom chips, all the heat that is generated (which GPUs
use large fans and heat sinks to dissipate) has to be removed somehow. Won't
that be a problem?

~~~
nl
I don't know why you were downvoted.

There are no "larger" ops. However, things like RNNs can require more memory
to execute because of the longer chain of data they need to execute the
operations on.

As noted in the article you can alleviate this by halving the size of the
model at the cost of accuracy.

The heat in large GPUs is because of the large number of cores they have
operating simultaneously.
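
For a concrete feel for that size/accuracy trade-off, here is a hedged sketch
of simple post-training weight quantization (my own illustration, not the
article's exact procedure): storing int8 values plus one float scale per
tensor shrinks float32 weights by roughly 4x (float16 would halve them), and
the rounding introduced here is where the accuracy cost comes from.

```c
#include <math.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical post-training quantization of one weight tensor: keep int8
 * values plus a single float scale instead of float32 weights. Dequantize
 * later as w[i] ~= q[i] * scale. */
static float quantize_weights_q7(const float *w, int8_t *q, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float a = fabsf(w[i]);
        if (a > max_abs) max_abs = a;
    }
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;

    for (size_t i = 0; i < n; i++) {
        long v = lroundf(w[i] / scale);  /* this rounding is the accuracy loss */
        if (v >  127) v =  127;
        if (v < -127) v = -127;
        q[i] = (int8_t)v;
    }
    return scale;
}
```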

------
sacado2
Wow. Deep learning would certainly not be my technique of choice on
constrained architectures, but there are situations where you don't really
have viable alternatives right now, so I'm glad to see that's actually doable.

~~~
rbanffy
It really depends on what you call "constrained". For about £5 you can get a
Linux-powered RISC machine with 512MB of RAM, a GPU, and rich IO capabilities.
I have worked on large multi-user environments smaller than that powering
dozens of serial terminals on everyone's desktops. That's _a lot_ of compute
power.

What I wouldn't like to do is run the training part on such small devices. If
there is a good way to do incremental learning after you've trained your
model, so it could continuously fine-tune itself using the embedded hardware
on a reasonable power budget, I'd go for it.

And while you won't run large networks, you can probably get away with many
smaller, more specialized ones.

~~~
jononor
The price of compute is less and less constraining every year. However, if
running on battery, the energy budget can be severely constraining.

Also, people just end up wanting to do more. Real-time video at a decent
framerate is still challenging for sub-100 USD devices. When that's easy, it
will be time for real-time 3D data (LIDAR etc.).

------
caycep
I'm not as familiar with the principles, but is there convergence between the
principles of these chips and the neuromorphic chips proposed by Carver Mead?

------
tmaly
I remember reading a spec sheet for a chip-based neural network Intel had
developed in the '80s. Maybe we are just facing another AI Winter.

------
candiodari
It's funny how incredibly bad this news is. And it does seem like it's
correct.

> For example, the MobileNetV2 image classification network takes 22 million
> ops ... 110 microwatts, which a coin battery could sustain continuously for
> nearly a year.

So making a tiny mine that blows up if and only if it sees a particular person
(or worse, a particular race or ...) is now theoretically possible and
essentially a few hardware revisions away from being doable.

~~~
pvaldes
The idea is thought-provoking, but it would be another useless sink of tax
money.

1- If the mine targets people of some race, then it will attack your own
soldiers, local allies and spies of the same race.

2- Clothes and makeup are common to all human cultures. After a few strikes,
people would learn how to blend into the landscape and avoid being taken for a
target.

3- The system would need a sort of eye above the soil, detectable by human
eyes and software, or a sort of wifi, detectable with software.

4- This "eye" part would be vulnerable to dust, leaves and debris falling over
it, something that happens very quickly at soil level in deserts, snowy areas
and rainforests.

5- If the mine stays inactive until people of some colour appear, your enemies
could use a disguise to take it safely and reuse the weapon in their own army.

6- Such mines could be modified to target presidents, military high commands,
policemen or politicians, all easily distinguishable by their "feathers":
well-known badges, official uniforms... At that point, the project of a mine
aimed at VIPs would be shut down and deeply buried pretty fast.

~~~
XorNot
Or just mine an entire area of someone else's country and walk away, like we
do now.

The problem with most of these ideas is that if you're willing to do it, you
probably are willing to just shoot/explode/ethnically cleanse an area anyway.

The question as always is better framed as "what does this enable that they
couldn't do before?"

