
The Looming Battle Over AI Chips - poster123
https://www.barrons.com/articles/the-looming-battle-over-ai-chips-1524268924
======
oneshot908
If you're FB, GOOG, AAPL, AMZN, BIDU, etc, this strategy makes sense because
much like they have siloed data, they also have siloed computation graphs for
which they can lovingly design artisan transistors to make the perfect craft
ASIC. There's big money in this.

Or you can be like BIDU, buy 100K consumer GPUs, and put them in your
datacenter. In response, Jensen altered the CUDA 9.1 licensing agreement and
the EULA for Titan V such that you cannot deploy Titan V in a datacenter for
anything but mining cryptocurrency, and his company reserves the right to
audit your use of their SW and HW at any time to force compliance with
whatever rules Jensen pulled out of his butt that day after his morning weed.
And that's a shame, because there's no way any of these companies can beat
the !/$ of consumer GPUs, and NVDA is lying out of its a$$ to say you can't
do HPC on them.

But beyond NVDA shenanigans, I think it's incredibly risky to second-guess
those siloed computation graphs from the outside in the hopes of anything but
an acqui-hire for an internal effort. Things ended well for Nervana even if
their HW didn't ship in time, but when I see a 2018 company
([http://nearist.ai/k-nn-benchmarks-part-wikipedia](http://nearist.ai/k-nn-benchmarks-part-wikipedia))
comparing their unavailable PowerPoint processor to GPUs from 2013, and then
doubling down on doing so when someone rightly points out how stupid that is,
I see a beached fail whale in the making, not a threat to NVDA's Deepopoly.

~~~
lostgame
FYI, I think your comment is informative and I understood a lot of it, but
that's a shitton of acronyms for the uninitiated.

~~~
hueving
FB: Facebook

GOOG: Google

AAPL: Apple

AMZN: Amazon

BIDU: Baidu

ASIC: Application specific integrated circuit

GPU: Graphics processing unit

CUDA: Compute-unified device architecture

EULA: End-user license agreement

weed: marijuana

HPC: High-performance computing

NVDA: Nvidia

SW: software

HW: hardware

------
Nokinside
Nvidia will almost certainly respond to this challenge with its own
specialized machine-learning and inference chips. It's probably what Google,
Facebook, and others hope; forcing Nvidia to work harder is enough for them.

Developing a new high-performance microarchitecture for a GPU or CPU is a
complex task. A clean-sheet design takes 5-7 years, even for teams at Intel,
AMD, ARM, or Nvidia that have been doing it constantly for decades. This
includes optimizing the design for the process technology, yield, etc., and
integrating the memory architecture. Then there are economies of scale and
price points.

Nvidia's Volta microarchitecture design started in 2013; it launched in
December 2017.

AMD's Zen CPU architecture design started in 2012, and the CPU was out in 2017.

~~~
osteele
Google’s gen2 TPU was announced in May 2017 and available in beta in February
2018. That 2018.02 date is probably the appropriate comparison to Volta’s
2017.12 and Zen’s 2017 dates.

EDIT: I’m trying to draw a comparison between the availability dates (and
where the companies are now), not the start of production (and their
development velocity). Including the announcement date was probably a red
herring.

~~~
Nokinside
I'm aware.

Making a chip and making a competitive chip are two different things.

When Nvidia enters the market with a specialized chip, it's likely to be on a
completely different level in bandwidth, energy consumption, and price-per-FLOP
performance. They have so much more experience with this.

* [https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk...](https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view)

* [https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-acce...](https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/)

------
joe_the_user
_Nvidia, moreover, increasingly views its software for programming its chips,
called CUDA, as a kind of vast operating system that would span all of the
machine learning in the world, an operating system akin to what Microsoft
(MSFT) was in the old days of PCs._

Yeah, nVidia throwing its weight around by requiring that data centers pay
more to use cheap consumer gaming chips may turn out to backfire, and it
certainly has an abusive-monopoly flavor to it.

As I've researched the field, CUDA really does seem to provide considerable
value to the individual programmer. But maneuvers of this sort may show the
limits of that sort of advantage.

[https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/](https://www.theregister.co.uk/2018/01/03/nvidia_server_gpus/)

~~~
oneshot908
It probably won't. For every oppressive move NVDA has made so far, there has
been a swarm of low-information technophobe MBA sorts who eat their
computational agitprop right up; some of whom even fashion themselves data
scientists. More likely, NVDA continues becoming the Oracle of AI that
everyone needs and everyone hates.

~~~
arca_vorago
So is OpenCL dead? Because that's how everyone is talking. The tools you
choose, and their licensing, matter!

~~~
keldaris
OpenCL isn't dead; if you write your code from scratch, you can use it just
fine and match CUDA performance. In my experience, OpenCL has two basic
issues.

The first is the ecosystem. Nvidia went to great lengths to provide well
optimized libraries built on top of CUDA that supply things people care about
- deep learning stuff, dense as well as sparse linear algebra, etc. There's
nothing meaningfully competitive on the OpenCL side of things.

The second is the user friendliness of the API and the implementations. OpenCL
is basically analogous to OpenGL in terms of design: it's a verbose, annoying
C API with huge amounts of trivial boilerplate. By contrast, CUDA supports
most of the C++ convenience features relevant in this problem space, has
decent tools, IDE and debugger integration, etc.
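
To make the boilerplate point concrete, here's a complete CUDA SAXPY (my own
toy example, not anything from the thread): the launch is one line, and
unified memory removes the explicit copies. The equivalent OpenCL host code
needs platform, device, context, queue, program-compilation, and
kernel-argument plumbing before its first kernel can run.

    // saxpy.cu -- compile with: nvcc saxpy.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void saxpy(int n, float a, const float* x, float* y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;
        float *x, *y;
        // Unified memory: no explicit host/device copies needed for a demo.
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // one-line launch
        cudaDeviceSynchronize();

        printf("y[0] = %f\n", y[0]);  // expect 4.0
        cudaFree(x); cudaFree(y);
        return 0;
    }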

Neither of these issues is necessarily a dealbreaker if you're willing to
invest the effort, but choosing OpenCL over CUDA requires prioritizing
portability over user friendliness, available libraries, and tooling. As a
consequence, not many people choose OpenCL, and the dominance of CUDA
continues to grow. Unfortunately, I don't see that changing in the near
future.

------
deepnotderp
Do people think that nobody at nVidia has ever heard of specialized deep
learning processors?

1. Volta GPUs already have little matmul cores, basically a bunch of little
TPUs (see the sketch below).

2. The graphics-dedicated silicon is an extremely tiny portion of the die, a
trivial component (source: Bill Dally, nVidia chief scientist).

3. Memory access power and performance is the bottleneck (even in the TPU
paper), and it will only continue to get worse.
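
Re point 1: those matmul cores are Volta's tensor cores, exposed through
CUDA 9's WMMA API. A minimal sketch (my example; needs nvcc -arch=sm_70) of
one warp driving a 16x16x16 half-precision multiply-accumulate:

    #include <cstdio>
    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes one 16x16x16 D = A*B + C on the tensor cores,
    // the "little TPU" operation.
    __global__ void tiny_mma(const half* a, const half* b, float* c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

        wmma::fill_fragment(c_frag, 0.0f);      // C = 0
        wmma::load_matrix_sync(a_frag, a, 16);  // leading dimension 16
        wmma::load_matrix_sync(b_frag, b, 16);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
        wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
    }

    int main() {
        half *a, *b; float* c;
        cudaMallocManaged(&a, 256 * sizeof(half));
        cudaMallocManaged(&b, 256 * sizeof(half));
        cudaMallocManaged(&c, 256 * sizeof(float));
        for (int i = 0; i < 256; ++i) {
            a[i] = __float2half(1.0f); b[i] = __float2half(1.0f);
        }
        tiny_mma<<<1, 32>>>(a, b, c);  // exactly one warp
        cudaDeviceSynchronize();
        printf("c[0] = %f\n", c[0]);   // 16.0: row of ones dot column of ones
        return 0;
    }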

~~~
oneshot908
Never overestimate the intelligence of the decision makers at big bureaucratic
tech companies. Also, it is not in the best interest of any of them to be
reliant on NVDA or any other single vendor for any critical workload
whatsoever. Doubly not so for NVDA's mostly closed source and haphazardly
optimized libraries.

All that said, Bill Dally rocks, and NVDA is a hardened target. But the DL
frameworks have enormous performance holes once one stops running ResNet-152
and other popular benchmark graphs, in the same way that 3DMark performance is
not necessarily representative of actual gaming performance unless NVDA took
it upon themselves to make it so.

And since DL is such a dynamic field (just like game engines), I expect this
situation to persist for a very, very long time.

~~~
Dibes
> Never overestimate the intelligence of the decision makers at big
> bureaucratic tech companies.

See Google and anything chat-related after 2010.

------
alienreborn
Non paywall link: [https://outline.com/FucjTm](https://outline.com/FucjTm)

------
etaioinshrdlu
It would be interesting to try to emulate a many-core CPU as a GPU program and
then run an OS on it.

This sounds like a dumb idea, and it probably is. But consider a few things:

* NVIDIA GPUs have exceptional memory bandwidth, and memory can be a slow resource on CPU-based systems (perhaps limited more by latency than bandwidth).

* The clock speed isn't _that_ slow; it's in the GHz. Still, one's clocks per emulated instruction may not be great.

* You can still do pipelining, maybe enough to get the clocks-per-instruction down.

* Branch prediction can be done with ample resources. RNN-based predictors are a shoo-in.

* Communication between "cores" should be fast.

* A many-core emulated CPU might not do too badly for some workloads.

* It would have good SIMD support.

Food for thought.

~~~
Symmetry
Generally speaking, emulating special-purpose hardware in software slows
things down a lot, so I don't think that relying on a software branch
predictor is going to result in performance anywhere close to what you'd see
in, say, an ARM A53. And since you have to trade off clock cycles used in
your branch predictor against clock cycles in your main thread, I think it
would be a net loss. Remember that even though NVidia calls each execution
port a "Core", it can only execute one instruction across all of them at a
time. The advantage over regular SIMD is that each shader processor tracks
its own PC and only executes the broadcast instruction if it's appropriate,
allowing diverging control flows across functions in ways that normal
SIMD+mask would have a very hard time with except at the lowest level of a
compute kernel.
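
A minimal CUDA sketch of that divergence (my toy example, not from the
thread): the warp executes both branches below, serialized, with the
inappropriate half of the threads masked off each time.

    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ int path_a(int x) { return x * x; }
    __device__ int path_b(int x) { return x + 100; }

    // Every "Core" in the warp receives the same broadcast instruction;
    // divergent branches are serialized, not run in parallel.
    __global__ void diverge(int* out) {
        int i = threadIdx.x;
        if (i % 2 == 0)
            out[i] = path_a(i);   // odd threads sit idle here...
        else
            out[i] = path_b(i);   // ...and even threads sit idle here
    }

    int main() {
        int* out;
        cudaMallocManaged(&out, 32 * sizeof(int));
        diverge<<<1, 32>>>(out);  // one warp
        cudaDeviceSynchronize();
        printf("%d %d\n", out[0], out[1]);  // 0 101
        return 0;
    }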

That also means that you can really only emulate as many cores as the NVidia
card has streaming multiprocessors, not as many as it has shader processors
or "cores".

Also, it's true that GPUs have huge memory bandwidth, but they achieve that by
trading it off against memory latency. You can actually think of GPUs as
throughput-optimized compute devices and CPUs as latency-optimized compute
devices and not be very misled.

So I expect the single-threaded performance of an NVidia general-purpose
computer to be very low in cases where the memory and branch patterns aren't
obvious enough to be predictable by the compiler. Not unusably slow, but
something like the original Raspberry Pi.

Each emulated core would certainly have very good SIMD support, but at the
same time pretending that they're just SIMD would sacrifice the extra
flexibility that NVidia's SIMT model gives you.

~~~
joe_the_user
_Remember that even though NVidia calls each execution port a "Core" it can
only execute one instruction across all of them at a time._

There are clever ways around this limitation; see the links in my post in
this thread.

[https://news.ycombinator.com/item?id=16892107](https://news.ycombinator.com/item?id=16892107)

~~~
Symmetry
Those are some really clever ways to make sure that all the threads in your
program are executing the same instruction, but it doesn't get around the
problem. Thanks for linking that video, though.

~~~
joe_the_user
The key to the Dietz system (MOG) is that the native code the GPU runs is a
bytecode interpreter. The bytecode "instruction pointer", together with other
data, is just data in registers and memory that's interpreted by the
native-code interpreter. So for each thread, the instruction pointer can
point at a different command: the interpreter runs the same native
instructions, but the results are different. So effectively you are simulating
a general-purpose CPU running a different instruction on each thread. There
are further tricks required to make this efficient, of course. But you are
effectively running a different general-purpose instruction per thread (it
actually runs MIPS assembly, as I recall).
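
A toy version of the trick (my sketch; Dietz's MOG is far more sophisticated
and, as noted, targets MIPS code): the kernel below is a three-opcode stack
machine, and because the emulated PC is just register data, every hardware
thread can interpret a different program.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    enum Op { PUSH, ADD, HALT };
    struct Insn { int op, arg; };

    // All threads execute the same *native* dispatch loop, but each is at
    // a different *emulated* instruction: pc is just per-thread data.
    __global__ void vm(const Insn* progs, int prog_len, int* out) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int pc = tid * prog_len;              // this thread's own program
        int stack[8], sp = 0;
        while (true) {
            Insn in = progs[pc++];            // per-thread fetch
            if (in.op == PUSH)      stack[sp++] = in.arg;
            else if (in.op == ADD)  { --sp; stack[sp - 1] += stack[sp]; }
            else                    { out[tid] = stack[sp - 1]; return; }  // HALT
        }
    }

    int main() {
        const int prog_len = 4, nthreads = 2;
        Insn* progs; int* out;
        cudaMallocManaged(&progs, nthreads * prog_len * sizeof(Insn));
        cudaMallocManaged(&out, nthreads * sizeof(int));
        Insn p0[] = {{PUSH, 1}, {PUSH, 2}, {ADD, 0}, {HALT, 0}};
        Insn p1[] = {{PUSH, 40}, {PUSH, 2}, {ADD, 0}, {HALT, 0}};
        memcpy(progs, p0, sizeof(p0));
        memcpy(progs + prog_len, p1, sizeof(p1));
        vm<<<1, nthreads>>>(progs, prog_len, out);
        cudaDeviceSynchronize();
        printf("%d %d\n", out[0], out[1]);    // 3 42
        return 0;
    }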

~~~
etaioinshrdlu
This is more or less what I'm talking about. I wonder what possibilities lie
in applying the huge numerical compute available on a GPU to the predictive
parts of a CPU, such as memory-prefetch prediction, branch prediction, etc.

Not totally dissimilar to the thinking behind NetBurst, which seemed to be all
about having a deep pipeline and keeping it fed with quality predictions.

~~~
joe_the_user
I'm not sure if your idea in particular is possible, but who knows. There may
be fundamental limits to speeding up computation based on speculative
look-ahead, no matter how many parallel tracks you have, and it may run into
memory-throughput issues.

But take a look at the MOG code and see what you can do.

Check out H. Dietz' stuff. Links above.

------
BooneJS
Pretty soft article. General purpose processors no longer have the performance
or energy efficiency that’s possible at scale. Further, if you have a choice
to control your own destiny, why wouldn’t you choose to?

~~~
jacksmith21006
Great post. It is like mining going to ASICs. We have hit limits and you now
have to do your own silicon.

A perfect example is the Google new speech synthesis. Doing 16k samples a
second through a NN is not going to be possible without your own silicon.

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

Listen to the samples. Then think about the joules required to do it this way
versus the old way, and about trying to create a price-competitive product
with the improved results.
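
Rough back-of-envelope (my numbers, purely illustrative): an autoregressive
model like WaveNet runs a full forward pass once per audio sample. If one
pass costs on the order of 1e8 multiply-adds, then 16,000 samples a second
means roughly 16,000 x 1e8 = 1.6e12 ops/sec, sustained, for a single audio
stream, before you've served a second user. That's the scale of compute in
question.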

------
bogomipz
The article states:

>"LeCun and other scholars of machine learning know that if you were starting
with a blank sheet of paper, an Nvidia GPU would not be the ideal chip to
build. Because of the way machine-learning algorithms work, they are bumping
up against limitations in the way a GPU is designed. GPUs can actually degrade
the machine learning’s neural network, LeCun observed.

“The solution is a different architecture, one more specialized for neural
networks,” said LeCun."

Could someone explain to me what exactly the limitations of current GPUs, such
as those sold by Nvidia, are when used in machine learning/AI contexts? Are
these limitations only experienced at scale? If someone has resources or links
they could share regarding these limitations and better designs, I would
greatly appreciate it.

~~~
maffydub
I went to a talk from the CTO of Graphcore
([https://www.graphcore.ai/](https://www.graphcore.ai/)) on Monday. They are
designing chips targeted at machine learning. As I understood it, their
architecture comprises:

* lots of "tiles": small processing cores with collocated memory (essentially DSPs)

* a very high bandwidth (90TB/s!) switching fabric to move data between tiles

* "Bulk Synchronous Parallel" operation, meaning that the tiles do their work, then the switching fabric moves the data, and then we repeat
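
For intuition, here's a very loose CUDA analogue of one BSP superstep (my
analogy only; Graphcore's actual programming model is their own Poplar
framework): compute on local data, barrier, exchange, barrier, repeat.

    #include <cstdio>
    #include <cuda_runtime.h>

    // One block stands in for the chip; threads stand in for tiles.
    // Launch with n threads and n*sizeof(float) dynamic shared memory.
    __global__ void bsp_steps(float* data, int n, int steps) {
        extern __shared__ float tile[];   // "collocated memory"
        int i = threadIdx.x;
        tile[i] = data[i];
        __syncthreads();
        for (int s = 0; s < steps; ++s) {
            // Compute phase: read-only on the current state.
            float v = 0.5f * (tile[i] + tile[(i + 1) % n]);
            __syncthreads();              // everyone done computing
            tile[i] = v;                  // "exchange" phase
            __syncthreads();              // fabric done; next superstep
        }
        data[i] = tile[i];
    }

    int main() {
        const int n = 32, steps = 8;
        float* data;
        cudaMallocManaged(&data, n * sizeof(float));
        for (int i = 0; i < n; ++i) data[i] = (float)i;
        bsp_steps<<<1, n, n * sizeof(float)>>>(data, n, steps);
        cudaDeviceSynchronize();
        printf("data[0] = %f\n", data[0]);
        return 0;
    }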

The key challenge he pointed to was power, both in terms of getting energy in
(modern CPUs/GPUs draw current similar to your car's starter motor!) and
getting the heat out. Logic gates take a lot more power than RAM, so he argued
that collocating small chunks of RAM right next to your processing core was
much better from a power perspective (meaning you could then pack yet more
into your chip), as well as obviously being better from a performance
perspective.

[https://www.youtube.com/watch?v=Gh-Tff7DdzU](https://www.youtube.com/watch?v=Gh-Tff7DdzU) isn't quite the presentation I saw, but it has quite a lot of overlap.

Hope that helps!

~~~
bogomipz
Thanks for the detailed response and link! Cheers.

------
emcq
There is certainly a lot of hype around AI chips, but I'm very skeptical of
the reward. There are several technical concerns I have with any "AI" chip
that ultimately leave you with something more general purpose (and not really
an "AI" chip, but good at low precision matmul):

* For inference, how do you efficiently move your data to the chip? In general, most of the time is spent in matmul, and there are lots of exciting DSPs, mobile GPUs, etc. that require a fair amount of jumping through hoops to get your data to the ML coprocessor. If you're doing anything low latency, good luck, because you need tight control of the OS (or to bypass it entirely). Will this lead to a battle between chip makers? It seems more likely to be a battle between end-to-end platforms.

* For training, do you have an efficient data flow with distributed compute? For the foreseeable future, any large model (or small model with lots of data) needs to be distributed. Without good distributed computing, the bottlenecks that come from this limit the improvements from your new specialized architecture. Again, better chips don't really solve this; the solution comes from a platform. I've noticed many training loops have terrible GPU utilization, particularly with Tensorflow and V100s. Why does this happen? The GPU is so fast, but things like summary ops add CPU time, limiting perf. Bad data pipelines not actually pipelining transformations. Slow disks bottlenecking transfers. Not staging/pipelining transfers to the GPU (see the sketch after this list). And then there is a bit of an open question of how to best pipeline transfers from the GPU. Is there a simulator feeding data? Then you have a whole new can of worms to train fast.

* For your chip architecture, do you have the right abstractions to train the next architecture efficiently? Backprop trains some wonderful nets, but given the cost of a new chip ($50-100M) and the time it takes to build (18 months minimum), how confident are you that the chip will still be relevant to the needs of your teams? This generally points you towards something more general purpose, which may leave some efficiency on the table. Eventually you end up at a low-precision matmul core, which is the same thing everyone is moving towards or already doing, whether you call yourself a GPU, DSP, or TPU (which is quite similar to DSPs).
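
On the staging point: a minimal CUDA-streams sketch (my example, with
made-up sizes) of overlapping host-to-device copies with compute, which is
the kind of pipelining many training loops skip.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void work(float* d, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] = d[i] * 2.0f + 1.0f;  // stand-in for real compute
    }

    int main() {
        const int chunks = 4, n = 1 << 20;
        float* h; float* d[chunks]; cudaStream_t s[chunks];
        // Pinned host memory is what lets cudaMemcpyAsync actually overlap.
        cudaHostAlloc(&h, chunks * n * sizeof(float), cudaHostAllocDefault);
        for (int c = 0; c < chunks; ++c) {
            cudaMalloc(&d[c], n * sizeof(float));
            cudaStreamCreate(&s[c]);
        }
        // Copies and kernels in different streams overlap, so chunk c+1's
        // transfer runs while chunk c computes.
        for (int c = 0; c < chunks; ++c) {
            cudaMemcpyAsync(d[c], h + (size_t)c * n, n * sizeof(float),
                            cudaMemcpyHostToDevice, s[c]);
            work<<<(n + 255) / 256, 256, 0, s[c]>>>(d[c], n);
            cudaMemcpyAsync(h + (size_t)c * n, d[c], n * sizeof(float),
                            cudaMemcpyDeviceToHost, s[c]);
        }
        cudaDeviceSynchronize();
        printf("h[0] = %f\n", h[0]);  // transformed by the kernel
        return 0;
    }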

I'm an HPC/graphics engineer turned deep-learning engineer; I've worked with
GPUs since 2006 and neural-net chips since 2010 (before even AlexNet!), so
I'm a bit of an outlier here, having seen so many perspectives. From my point
of view, the computational fabric exists; we're just not using it well :)

~~~
justicezyx
Most top-tier tech companies all have working solutions for these. It's a
matter of turning them into products and moving the industry mindset.

~~~
jacksmith21006
There is nothing for the buyer to see. They are buying a service or
capability, and what silicon it runs on is neither here nor there to them.

A simple example is Google's new speech synthesis service. It is done using a
NN on their TPUs, but nobody needs to know any of that.

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

What the buyer knows is the cost and the quality of the service.

Now, Google had to do their own silicon to offer this, as otherwise the cost
would have been astronomical. The compute needed to do 16k samples a second
with a NN is enormous.

If I could not see it myself, I would say what Google did was not possible.

I just hope they share the details in a paper. If we can get to 16k cycles
per second through a NN at a reasonable cost, that opens up a lot of
interesting applications.

------
MR4D
Back in the day, there was the 386. And also the 387 coprocessor to handle the
tougher math bits.

Then came the 486, and the math coprocessor got integrated.

But during that time, the GPU split off. Companies like ATI and S3 began to
dominate, and anyone wanting a computer with decent graphics had one of these
chips in their computer.

Fast forward several years, and Intel would again bring specialized circuitry
back into its main chips, although this time for video.

Now we are just seeing the same thing again, but this time it's an offshoot of
the GPU instead of the CPU. It seems like the early 1990s again, but the
acronyms are different.

Should be fun to watch.

------
davidhakendel
Does anyone have a non-paywall but legitimate link to the story?

~~~
trimbo
Incognito -> search for headline -> click

~~~
bogomipz
Thank you for this tip. Out of curiosity, why does this trick work?

~~~
_delirium
Websites like this want traffic from Google. To get indexed by Googlebot they
have to show the bot the article text, and Google's anti-blackhat-SEO rules
mean that you have to show a human clicking through from Google the same text
that you show Googlebot. So they have to show people visiting through that
route the article text too.

------
Barjak
If I were better credentialed, I would definitely be looking to get into
semiconductors right now. It's an exciting time in terms of manufacturing
processes, and I think some of the most interesting and meaningful
optimization problems ever formulated come from semiconductor design and
manufacturing, not to mention the growing popularity of specialized hardware.

I would tell a younger version of myself to focus his education on some
aspect of the semiconductor industry.

------
mtgx
Alphabet has already made its AI chip; it's on its second generation already.

~~~
jacksmith21006
Plus Google has the data and the upper layers of the AI stack to keep well
ahead.

------
jacksmith21006
The new Google speech solution is a perfect example of why Google had to do
their own silicon.

Doing speech with 16k samples a second through a NN while keeping the cost
reasonable is really, really difficult.

The old way was far more power efficient; if you are going to use this new
technique, which gets you a far better result, and do it at a reasonable cost,
you have to go all the way down into the silicon.

Listen to the results here:

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

Now I am curious about the cost difference Google was able to achieve. It is
still going to cost more than the old way, but how close did Google come?

But my favorite new thing with these chips is the Jeff Dean paper.

[https://www.arxiv-vanity.com/papers/1712.01208v1/](https://www.arxiv-vanity.com/papers/1712.01208v1/)

Can't wait to see the cost difference using Google TPUs and this technique
versus traditional approaches.

Plus, this approach supports multi-core inherently. How would you ever do a
tree search with multiple cores?

Ultimately, to get the new applications, we need Google and others doing the
silicon. We are getting to extremes where the entire stack has to be tuned
together.

I think Google's vision for Lens is going to be a similar situation.

~~~
taeric
This somewhat blows my mind. Yes, it is impressive. However, the work that
Nuance and similar companies used to do is still competitive, just not getting
nearly the money and exposure.

I remember that over a decade ago they even had mood analysis they could apply
while listening to people. Far from new. Is it truly more effective or
efficient nowadays? Or is it just getting marketed by companies you've heard
of?

~~~
sanxiyn
It is truly better. Objective metrics (such as word error rate) don't lie. You
can argue whether it makes sense to use, say, 100x compute to get 2x less
error, but that's a different argument; I don't think anyone is really
disputing improved quality.

~~~
taeric
Do you have a good comparison point? And not, hopefully, comparing to what
they could do a decade ago. I'm assuming they didn't sit still. Did they?

I question whether it is just 100x compute. It feels like more, since
NaturallySpeaking and friends didn't hog the machine. Again, that was over a
full decade ago.

Moreover, the resources that Google has to throw at training are ridiculous:
well over 100x what was used to build the old models.

None of this is to say we should pack up and go back to a decade ago. I just
worry that we do the opposite, where we ignore progress that was made a decade
ago in favor of the new tricks alone.

~~~
jacksmith21006
The thing is it is not simply the training but the inference aspect would have
require an incredible amount of compute compared to the old way of doing it.

Hope Google will do a paper like they did with the Gen 1 TPUs. Would love to
see the difference in terms of joules per word spoke.

------
jacksmith21006
The dynamics of the chip industry have completely changed. It used to be that
a chip company like Intel sold its chips to a company like Dell, which then
sold the server with the chip to a business, which ran the chip and paid the
electric bill.

So the company that made the chip had no skin in the game with running the
chip or the cost of the electricity to run it.

Today we have massive clouds at Google and Amazon, and lowering the cost of
running their operations goes a long way, unlike in the days of the past.

This is why we will see more and more companies like Google create their own
silicon, which has already started and is well on its way.

Not only the TPUs: Google has also created its own network processors, having
quietly hired away the Lanai team years ago.

[https://www.informationweek.com/data-centers/google-runs-cus...](https://www.informationweek.com/data-centers/google-runs-custom-networking-chips/d/d-id/1324285)

Also this article helps explain why Google built the TPUs.

[https://www.wired.com/2017/04/building-ai-chip-saved-google-...](https://www.wired.com/2017/04/building-ai-chip-saved-google-building-dozen-new-data-centers/)

------
willvarfar
I just seem to bump into a paywall.

The premise from the title seems plausible, although NVIDIA seems to be
catching up again fast.

~~~
madengr
I was impressed enough with their CES demo to buy some stock. Isn't the Volta
at 15e9 transistors? It's at the point where only the big boys can play in
that field due to fab costs, unless it's disrupted by some totally new
architecture.

First time on HN I can read a paywalled article, as I have a Barron’s print
subscription.

~~~
twtw
21e9 transistors.

