
Deep Learning and Its Implications for Computer Architecture and Chip Design - godelmachine
https://arxiv.org/abs/1911.05289
======
londons_explore
Badly formatted paper with a hand-wavy abstract, no real focus, and buzzwords
dropped aplenty... I'll pass...

Oh - it's written by Jeff Dean, inventor of MapReduce, Bigtable, TensorFlow,
and practically a god... Yeah, I'll read it!

~~~
londons_explore
Read it. Worth a read, especially for those not closely following the machine
learning world.

The last section, focusing on a single large, sparsely activated model that
can accomplish thousands of different tasks by selecting among internal
'experts', interests me the most.
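
For anyone unfamiliar, the sparsely gated idea is roughly: a small gating
network scores a pool of experts and only the top few run for each input, so
compute grows with the number of experts actually used rather than the total.
A toy numpy sketch (the shapes, the softmax gate, and the top-k rule are my
own illustrative assumptions, not anything from the paper):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    def moe_layer(x, gate_w, experts, k=2):
        # Sparsely gated mixture of experts: only the top-k experts run per input.
        scores = softmax(x @ gate_w)              # one gate score per expert
        top_k = np.argsort(scores)[-k:]           # indices of the k best experts
        weights = scores[top_k] / scores[top_k].sum()
        # Compute scales with k, not with the total number of experts.
        return sum(w * experts[i](x) for w, i in zip(weights, top_k))

    # Toy usage: 4 experts, each just a random linear map.
    rng = np.random.default_rng(0)
    experts = [lambda x, W=rng.standard_normal((8, 8)): x @ W for _ in range(4)]
    gate_w = rng.standard_normal((8, 4))
    y = moe_layer(rng.standard_normal(8), gate_w, experts)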

I suspect this type of model isn't used much today simply because each company
using ML typically has only a few problems to solve. If someone like Google,
with far more varied problems to solve, can get this type of model to work and
demonstrate its effectiveness, I think it would be a big step towards
artificial general intelligence.

Jeff Dean has a lot of respect and influence inside Google, and his ideas tend
to get implemented. I'm looking forward to it!

~~~
sanxiyn
Sparsely activated multitask models are something of a hobby horse of Jeff
Dean's. The idea was published in 2017:
[https://arxiv.org/abs/1701.06538](https://arxiv.org/abs/1701.06538). My
assessment is that it is an intriguing but ultimately failed experiment, like
Geoffrey Hinton's capsule networks.

~~~
acollins1331
Capsule networks are not failed experiments! Where is this coming from? They
merely haven't been applied as widely as CNNs or FCNs, but there are lots of
papers out there where capsule networks outperform those architectures.

Source: my thesis using capsule networks for semantic segmentation of aerial
imagery

~~~
XuMiao
I like the capsule idea too. In some ways a capsule network is very similar to
a sparse attention network; it's just that the normalization is different.
Attention is normalized over the inputs, while capsules are normalized over
the outputs. Capsules can potentially yield much cleaner patterns, whereas the
patterns generated by attention networks can overlap. It's just that the
capsule formulation is much harder to solve.
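
Something like this toy numpy sketch, where the only difference is which axis
the softmax runs over (the shapes and the 'inputs'/'outputs' labels are my own
illustrative assumptions):

    import numpy as np

    def softmax(x, axis):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    # b[i, j]: affinity between input unit i and output unit j
    b = np.random.randn(5, 3)      # 5 inputs, 3 outputs

    # Attention-style: each output's weights are normalized over the inputs,
    # so several outputs can attend strongly to the same input (overlap).
    attn = softmax(b, axis=0)      # each column sums to 1

    # Capsule-routing-style: each input's coupling coefficients are normalized
    # over the outputs, so an input has to split its vote between outputs.
    caps = softmax(b, axis=1)      # each row sums to 1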

------
Veedrac
> Figure 2 shows this dramatic slowdown, where we have gone from doubling
> general-purpose CPU performance every 1.5 years (1985 through 2003) or 2
> years (2003 to 2010) to now being in an era where general purpose CPU
> performance is expected to double only every 20 years [Hennessy and
> Patterson 2017].

This isn't true. CPU performance has stagnated very recently due to Intel's
struggles with 10nm, but we look to be leaving that behind us. Even if we
weren't, it's still not relevant for ML, since within the last decade GPUs
_have_ improved by a factor of ~10, so the terrifying Figure 2 doesn't hold.

------
amelius
Does anyone happen to have a link to a paper or book describing the state of
the art in placement and routing algorithms? I'd like to read up on that
topic.

~~~
kernyan
These two books might be useful,

1) [https://www.oreilly.com/library/view/electronic-design-automation/9780123743640/](https://www.oreilly.com/library/view/electronic-design-automation/9780123743640/)

2) [https://www.crcpress.com/Electronic-Design-Automation-for-IC-Implementation-Circuit-Design-and/Lavagno-Markov-Martin-Scheffer/p/book/9781138586017](https://www.crcpress.com/Electronic-Design-Automation-for-IC-Implementation-Circuit-Design-and/Lavagno-Markov-Martin-Scheffer/p/book/9781138586017)

For the first book, see chapters 10-12 (on floorplanning, placement, and
routing). The end of chapter 11 points you to some literature surveys as well.
But the book itself is somewhat dated (published in 2009).

I haven't read the second book, but it's much more recent (published in 2018)
and also has chapters on placement and routing.

------
sorenn111
Potentially a noob question: with Moore's law slowing down, are there enough
specializations/hardware modifications available, like those mentioned in the
paper, for progress in ML to continue rapidly? Or will these advancements
simply forestall an inevitable asymptote?

~~~
retrac
It's little more than an educated guess on my part, but I figure there are
about two orders of magnitude of improvement in processing speed exploitable
with current processes, if a big-budget chip were designed specifically for ML
training. GPUs are architecturally far from optimal for the task.

You want something like a chip with a huge mesh of small independent cores
with their own local storage, quite possibly with non-digital circuits that
can very quickly approximate the functions with analog electronics, rather
than actually doing all of the calculations digitally. Some variation on that
is the approach both Intel and IBM have taken with their "neural chips" in the
last few years.

It seems that analog computers are finally getting their revenge.

~~~
solidasparagus
This doesn't seem to match my experience with ML and GPUs/ASICs.

The TPU is the main ML ASIC in use. A major goal of the original TPU design
seems to have been reducing the number of memory accesses. The other top-end
ML devices are NVIDIA's GPUs with Tensor Cores. Both of those chips are
designed around fast matrix multiplication, which right now seems to be the
most important operation in deep learning - see how RNNs have started to fall
out of favor relative to CNN-based networks with attention heads.
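
For a sense of why matmul throughput is the thing that matters: single-head
scaled dot-product attention, for example, is essentially two matrix
multiplications wrapped around a softmax. A toy numpy sketch (shapes are
arbitrary):

    import numpy as np

    def softmax(x):
        e = np.exp(x - x.max(axis=-1, keepdims=True))
        return e / e.sum(axis=-1, keepdims=True)

    def attention(Q, K, V):
        # Q, K, V: (seq_len, d). Nearly all the work is in the two matmuls.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (seq_len, seq_len)
        return softmax(scores) @ V                # (seq_len, d)

    Q = K = V = np.random.randn(128, 64)
    out = attention(Q, K, V)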

The TPU is not faster than NVIDIA's GPUs, but it is cheaper. Right now the
future seems to be cheaper ML devices designed to be horizontally scalable.

From the CPU perspective, it appears that the major ML effort is related to
vectorizing instructions via advanced instruction sets.

Everyone who creates silicon is focused very heavily on using smaller and
smaller numeric types - float16 is standard, and there is work being done on
even smaller integer-based types.
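
For a feel of the trade-off with these smaller types: bfloat16 keeps float32's
exponent range but drops most of the mantissa, while float16 keeps more
mantissa at the cost of a much narrower range. A quick numpy illustration (the
bit-masking below is my own rough emulation of bfloat16, not any library's
API):

    import numpy as np

    def to_bfloat16(x):
        # bfloat16 is essentially float32 with the low 16 mantissa bits dropped:
        # the same exponent range, but only ~3 significant decimal digits.
        bits = np.asarray(x, dtype=np.float32).view(np.uint32)
        return (bits & np.uint32(0xFFFF0000)).view(np.float32)

    x = np.array([3.14159265, 1.0e5], dtype=np.float32)
    print(x.astype(np.float16))   # 1e5 overflows float16's range to inf
    print(to_bfloat16(x))         # bfloat16 keeps the range but rounds coarsely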

I haven't seen any analog-based ML devices in use. Can you share an example?
Is there even a way to approximate the results of a matmul using analog
devices?

It's impossible to guess how much more speed we can get with current
approaches, but everything from the silicon to the networking stack to the
libraries to the network architectures is in its infancy, so I would expect
dramatic improvements in performance on a regular basis (though not as regular
as in other areas of software, because silicon development is slow).

~~~
sanxiyn
TPU v3 is rated 420 teraflops, while V100 GPU is rated 125 teraflops.

~~~
solidasparagus
What does TPU v3 mean there? A TPU v3-8? In that case you are comparing 8
cores / 4 chips to a single GPU, which hardly seems fair. It's hard to compare
across ASICs. In practice, the largest readily available units of compute seem
to be 8 V100s vs one 'Cloud TPU' (TPU v3-32). Those two have relatively
similar performance in practice (FLOPS seem to be a very poor way to compare
across ASICs), although the TPU is typically several times less expensive.

------
buboard
brain floating point ... cool name

I guess the brain's synaptic precision could go way lower, as low as 26
distinct synapse weights:
[https://elifesciences.org/articles/10778](https://elifesciences.org/articles/10778)
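
For scale, 26 distinguishable levels per synapse is only about 4.7 bits:

    import math
    print(math.log2(26))   # ~4.7 bits per synapse, far fewer than float16/bfloat16's 16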

> A particularly interesting research direction puts these three trends
> together, with a system running on large-scale ML accelerator hardware, with
> a goal of being able to train a model that can perform thousands or millions
> of tasks in a single model. Such a model might be made up of many different
> components of different structures,

yup, he is building a brain

~~~
quotemstr
> I guess the brain's synaptic precision could go way lower, as low as 26
> distinct synapse weights:
> [https://elifesciences.org/articles/10778](https://elifesciences.org/articles/10778)

Thanks for the link. Artificial neural networks go all the way down to binary
weights [1], although that approach doesn't seem like the most efficient one.
It's interesting how we're still seeing a ton of variability in ML
architectures: it suggests we haven't stumbled on the right area yet. It
reminds me of how early aviation had a huge diversity of aircraft designs, but
now, after a lot of optimization, we've settled on the one standard airliner
shape everyone uses everywhere.

[1] [https://arxiv.org/abs/1602.02830](https://arxiv.org/abs/1602.02830)
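
As I understand [1], the core trick is to keep real-valued weights for the
optimizer but binarize them to +1/-1 for the forward pass. A minimal sketch of
just that binarization step (not the paper's full training procedure):

    import numpy as np

    def binarize(w):
        # Weights constrained to +1/-1: multiplies reduce to sign flips and adds.
        return np.where(w >= 0, 1.0, -1.0)

    w_real = np.random.randn(4, 3)   # real-valued weights kept by the optimizer
    x = np.random.randn(4)
    y = x @ binarize(w_real)         # the forward pass uses the binarized copy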

~~~
buboard
A formal theory of deep learning is proving to be much more elusive than
avionics. Interesting times, though.

