
Tesla’s Neural Processor In The FSD Chip - rbanffy
https://fuse.wikichip.org/news/2707/inside-teslas-neural-processor-in-the-fsd-chip/
======
idclip
This is some pretty exciting stuff. I read the wikis about the autopilot
processing units and these ML accelerators, and it's honestly mind-bogglingly
exciting.

~~~
new_realist
It’s slightly less powerful than NVIDIA’s self-driving offering.

~~~
gibolt
For the energy consumption per computation, it is supposedly ~10x better. That
matters a lot when you have to account for a limited battery/range and
cooling/extra hardware adding complexity to the car.

~~~
dragontamer
Personally speaking, I never liked this line of thinking.

A Tesla has ~70 kW-hrs of energy, and its motor uses all of that energy in
5 hours (5 hours x 60 miles per hour == 300 miles). Just round numbers for an
estimate.

For those paying attention, that's roughly 14,000 watts of power (again: a round
estimate, but we're probably within an order of magnitude). Do you really care if
your computer makes you use 14,100 watts of power or 14,500 watts of power?
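
To make the round numbers explicit, here is the same arithmetic as a tiny Python sketch (the 500 W computer figure is a hypothetical load, not a number from the article):

    # Rough highway power budget, using the round numbers above.
    battery_wh = 70_000                       # ~70 kWh pack
    highway_hours = 300 / 60                  # 300 miles at 60 mph
    motor_watts = battery_wh / highway_hours  # ~14,000 W average draw
    compute_watts = 500                       # hypothetical FSD computer load
    share = compute_watts / (motor_watts + compute_watts)
    print(f"motor ~{motor_watts:.0f} W, computer is ~{share:.1%} of total draw")
    # -> motor ~14000 W, computer is ~3.4% of total draw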

~~~
steve_musk
Sure, if you’re driving only on the highway. What about driving around the city
at 5-20 mph? Tesla’s end goal is to have autopilot do everything, so if they
are serious about it, then it makes sense to consider the impact at non-highway
speeds.

Let’s say you are driving at an average speed of 20 mph: then by your
estimate you’re using ~4,666 watts. So if your chip draws 500 W, that's a
significant portion of the power budget.

~~~
dragontamer
> What about driving around the city

Autopilot is highway only for now. I'd expect the chip to be off, at least
with the current version, if you are on city roads.

But sure, if you're driving on city roads, let's really math it out. You have
70 kW-hrs and maybe you're going 20 mph. Let's call that 4,600 watts of power
usage. Let's say you have a 500-watt computer, for a total of 5,100 watts.

You can run your car for literally 13 hours at that power usage. 70 kW-hrs is
a lot of energy storage. A 4,600-watt power usage results in ~15 hours of
runtime instead.

Not a big change. And these are estimates only, I don't actually know what the
energy-consumption metrics are for city vs highway. But these "poor
calculations" are still good enough to realize that even a 500W computer is
well within the power-budget of a 70,000 Watt-hr battery pack.

----------

Absolute worst case scenario (for the computer): you are going 0 MPH. Parked
in a parking lot using only the computer. That means 100% of the power usage
goes into the computer. What happens with a 500W computer?

Well, 70 kW-hrs can last 140 hours at 500 W of usage; that's over a workweek:
over 5 days of continuous computer usage. Are we seriously worried about the
minuscule power usage of the car's computer system when a 70 kW-hr battery is
in the car?
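
Putting the same rough numbers in one place (the 4,600 W and 500 W figures are just the estimates from this thread, not measured values):

    # Runtime of a ~70 kWh pack under the scenarios discussed above.
    battery_wh = 70_000
    city_watts = 4_600   # the ~20 mph estimate from above
    compute_watts = 500  # hypothetical computer load
    print(f"city, no computer:     {battery_wh / city_watts:.1f} h")                    # ~15.2 h
    print(f"city, with computer:   {battery_wh / (city_watts + compute_watts):.1f} h")  # ~13.7 h
    print(f"parked, computer only: {battery_wh / compute_watts:.0f} h")                 # 140 h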

~~~
pests
Highway only? My roommate has one; I don't think he's ever not using autopilot,
even on side streets. He might only use the brake or accelerator a few times
in a whole day's driving.

~~~
dragontamer
> Traffic-Aware Cruise Control is primarily intended for driving on dry,
> straight roads, such as highways and freeways. It should not be used on city
> streets.

> Warning: Autosteer is intended for use only on highways and limited-access
> roads with a fully attentive driver. ... Do not use Autosteer on city
> streets, in construction zones, or in areas where bicyclists or pedestrians
> may be present.

[https://www.tesla.com/sites/default/files/model_s_owners_man...](https://www.tesla.com/sites/default/files/model_s_owners_manual_north_america_en_us.pdf)

~~~
pests
I agree it says not to, but I also eat raw cookie dough and I'm sure others do
too.

------
zapnuk
I'm not a hardware guy, so this question is rather naive.

How do they run programs on it?

With a PC+GPU system you can just run the Python + tensorflow/pytorch/etc.
code. All required drivers can be easily installed.

How do they do it with a custom co-processor?

~~~
traverseda
The co-processor has some kind of IO. On a GPU that IO is generally a PCIe
port of some kind.

Once you have the two machines able to talk to one another on the physical
level, you need to decide what kind of protocol they're going to use.

The simplest example of that whole flow is serial ports, which you may be
familiar with if you've ever worked with Arduino or microcontrollers or other
embedded systems. So basically the answer is "the same way any two computers
talk to one another": a protocol carried over some kind of physical medium. At
that point you have to create your own drivers, and your own way of compiling
code to run on the co-processor, but fundamentally it's no different from
loading an Arduino sketch onto a microcontroller.
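
As a concrete (if toy) illustration of that "two computers talking over a physical medium" point, here is roughly what host-side Python code driving an accelerator over a serial link might look like with pyserial. The device path, command bytes, and framing are all invented for the example; a real NPU sits behind PCIe and a DMA-capable driver instead:

    import serial  # pyserial

    # Open the physical link (made-up device path and baud rate).
    link = serial.Serial("/dev/ttyUSB0", baudrate=115200, timeout=1.0)

    def run_inference(input_tensor: bytes) -> bytes:
        # Hypothetical protocol: "load" command + length, then the payload,
        # then a "run" command, then read back a fixed-size result buffer.
        link.write(b"\x01" + len(input_tensor).to_bytes(4, "little"))
        link.write(input_tensor)
        link.write(b"\x02")
        return link.read(256)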

There are also solutions involving shared memory, where all the processors/co-
processors are connected to the same RAM chips. Those generally aren't
practical with x86 chips, although it's a popular way for baseband processors
to talk to the ARM chips that run cellphones. It's also ultimately the way
"symmetric multiprocessing" works in every computer that has more than one
core on its CPU. It's also probably how the "neural processor" CPU
communicates with all the custom silicon that actually enables this product.
It's likely they have a conventional microcontroller core that handles the
input stream from the host computer and puts that data into the RAM layout
required by the "NPU", which is what they've labeled the "front end". I haven't
really looked into their architecture in any depth though.

~~~
dragontamer
PCIe is probably the "standard" connection that I'd expect in a system today.
16x lanes of PCIe 3.0 will provide over 15 GBps of bandwidth at ~microsecond
latency. PCIe 3.0 as a protocol is complex enough to handle atomics these
days (atomic compare-and-swap), which allows for high-speed coordination and
collaboration, very much akin to CPU-to-CPU multi-core programming.

"Exotic" connections go beyond PCIe. You could theoretically share DDR4 RAM.
With this, its probably more natural to have "cache-coherent" interconnects.
Assuming the MESI protocol (aka: very much simplified), the coprocessor can
raise a section of RAM (64 bytes or so) into the "exclusive" state. As long as
the coprocessor is holding exclusive, the CPU will stall as it tries to read
that RAM. The co-processor can return the memory to the CPU control by setting
it to "Invalid" state. The CPU can then read (or write) to memory by setting
the memory to exclusive: under the CPU's control.

Communication to and from the co-processor in this case isn't "memory-mapped
IO", as much as it is "I/O over memory". :-). My explanation above is
simplified, but hopefully you get the gist.

A realistic example of a coherency fabric: something like POWER9's OpenCAPI
can communicate with NVidia GPUs at over 50 GBps, while sharing the same memory
space. (PCIe 3.0 and PCIe 4.0 have features to allow a similar effect at
slightly slower speeds. I would bet that PCIe 5.0 is working on some kind of
coherency fabric, but I don't actually know.)

The most exotic would probably be iGPUs (Intel) and APUs (AMD), which perform
this kind of cache-coherent communication but at the L3 cache level and below.
Since APUs and iGPUs share the same chip, they can send these messages back
and forth without ever leaving the chip. Just entirely within L3 cache itself.
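
To make those state transitions concrete, here is a toy Python model of the simplified hand-off described above (real MESI has more states and lives in hardware; this is only the gist):

    from enum import Enum

    class State(Enum):
        INVALID = "I"
        SHARED = "S"
        EXCLUSIVE = "E"
        MODIFIED = "M"

    class CacheLine:
        """One 64-byte-ish line shared between a CPU and a co-processor."""
        def __init__(self):
            self.state, self.owner = State.INVALID, None

        def acquire_exclusive(self, agent):
            # While the other agent holds the line Exclusive/Modified,
            # the caller has to stall and retry.
            if self.state in (State.EXCLUSIVE, State.MODIFIED) and self.owner != agent:
                return False
            self.state, self.owner = State.EXCLUSIVE, agent
            return True

        def release(self, agent):
            if self.owner == agent:
                self.state, self.owner = State.INVALID, None

    line = CacheLine()
    assert line.acquire_exclusive("coprocessor")  # NPU grabs the line to fill results
    assert not line.acquire_exclusive("cpu")      # a CPU read would stall here
    line.release("coprocessor")                   # set the line back to Invalid
    assert line.acquire_exclusive("cpu")          # now the CPU takes control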

~~~
traverseda
Very interesting. I've seen a similar idea proposed to enable shared-memory-
based RPC: instead of communicating over a socket, both processes agree to
"trade" a chunk of memory back and forth, using it as a buffer for RPC
messages.
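
A minimal sketch of that buffer-trading idea using Python's multiprocessing.shared_memory; the one-byte "ownership flag" layout is invented for the example, and a real implementation would use proper synchronization rather than polling:

    import time
    from multiprocessing import Process, shared_memory

    def worker(name):
        shm = shared_memory.SharedMemory(name=name)
        while shm.buf[0] != 1:          # wait until the buffer is handed to us
            time.sleep(0.001)
        request = bytes(shm.buf[1:6])
        shm.buf[1:6] = request.upper()  # "handle" the RPC message in place
        shm.buf[0] = 2                  # hand the buffer back to the caller
        shm.close()

    if __name__ == "__main__":
        shm = shared_memory.SharedMemory(create=True, size=64)
        p = Process(target=worker, args=(shm.name,))
        p.start()
        shm.buf[1:6] = b"hello"         # write the request payload
        shm.buf[0] = 1                  # flip ownership to the worker
        while shm.buf[0] != 2:          # wait for the reply
            time.sleep(0.001)
        print(bytes(shm.buf[1:6]))      # b'HELLO'
        p.join(); shm.close(); shm.unlink()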

~~~
dragontamer
It's becoming clear that DDR4 RAM is "just another I/O system".

RAM has always forced CPUs to stall: refresh events are the most common (RAM
is refreshing the voltage on all memory: it takes a long time and the CPU may
have to wait in rare cases). But modern multicore CPUs have to have this
coherency fabric to ensure that the different cores see memory in the correct
order.

It doesn't really matter if the CPU is stalling because of a memory refresh,
or a CPU-exclusive hold on some RAM... or even I/O pretending to be a CPU-core
holding some RAM. The CPU's method of attack remains the same.

-------

This is most clear with Intel's Optane DIMMs: persistent storage that operates
over the DDR4 protocol, pretending to be RAM. Slower than true DDR4, yes, but
fast enough that it wanted to move off of PCIe and pretend to be DDR4 RAM. I
expect more and more I/O to be "pretending to be RAM" to some extent, either
through a coherency fabric or maybe even just copying the DDR4 protocol (like
Optane).

------
person_of_color
Wow, Verilator hits the mainstream.

~~~
justicezyx
[https://www.veripool.org/wiki/verilator](https://www.veripool.org/wiki/verilator),
this?

I didn't see it mentioned in the article, though.

~~~
brandmeyer
> Because they did not have emulation ready early on, they resorted to using
> the open-source Verilator simulator which they say managed to run 50 times
> faster than commercial simulators. “We used Verilator extensively to prove
> our design is very good,” said Venkataramanan, Sr Director Autopilot
> Hardware at Tesla.

That's consistent with my experience as well. Simulating with a message-
passing simulator (i.e., most commercial simulators) isn't even remotely as fast
as using Verilator.

~~~
person_of_color
What's the technical reason Verilator can be faster?

~~~
brandmeyer
Strictly speaking, Verilog's machine model is one that passes messages around
between its processes. A given `always_ff` process with several statements
issues those statements one at a time in the abstract machine.

There are a few different simulator technologies out there.

A gate-level simulator renders the design into a bunch of SPICE nets and
discrete transistors. SPICE then turns the whole shebang into a gigantic
sparse matrix solver. This is the highest-fidelity model, and also the
slowest. A vendor who owns their own process node might do this. TSMC won't
let you see their transistor models or standard cell library. Maybe AMD gets
that much visibility. NGSPICE is a FOSS example of this type of simulator,
although I don't know of a netlister that ultimately starts from an HDL for
it.

A message-passing simulator provides a somewhat strict view of the Verilog
abstract machine. It executes the elements of each process more or less
sequentially, where each assignment issues a new message to the processes that
are sensitive to changes on the assigned variable. It's also dog slow. It might
seem reasonably accurate, but the quasi-sequential execution model can hide
real bugs in hardware that is actually executing in parallel. Icarus Verilog
is a FOSS example of this type of simulator.

Verilator gets faster in a couple of ways. First off, it only implements the
synthesizable subset of Verilog. Second, it assumes that all of the flip-flops
that are triggered by a clock edge are actually triggered immediately, and
cause downstream effects on combinatorial logic immediately. This isn't
exactly true, either, since both wire and transistor delays are really a thing.
But it is somewhat more true than the message-passing model. In reality all of
those transistors are acting simultaneously, after all. For the most part, if
you stick to synthesizable constructs, write your code in a reasonably high-
level fashion, and pay attention to the implementation tool's timing reports,
it ends up being true enough.

Or, in Verilator's own words:

> How can Verilator be faster than (name-the-commercial-big-3-simulator)?

> Generally, the implied part of the question is ``... with all of their
> manpower they can put into it.''

> Most commercial simulators have to be Verilog compliant, meaning event
> driven. This prevents them from being able to reorder blocks and make
> netlist-style optimizations, which are where most of the gains come from.

> Non-compliance shouldn't be scary. Your synthesis program isn't compliant,
> so your simulator shouldn't have to be -- and Verilator is closer to the
> synthesis interpretation, so this is a good thing for getting working
> silicon.
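
To make the "event driven vs. netlist-style" distinction a bit more concrete, here is a toy Python model of the two evaluation strategies on a one-bit full adder. It is not how either Verilator or a commercial simulator is actually implemented; it only shows the difference between scheduling work per signal change and sweeping a fixed, pre-ordered evaluation list once per step:

    from collections import defaultdict, deque

    # Netlist for a 1-bit full adder: output net -> (function, input nets).
    GATES = {
        "s1":   (lambda a, b: a ^ b, ("a", "b")),
        "c1":   (lambda a, b: a & b, ("a", "b")),
        "sum":  (lambda s1, cin: s1 ^ cin, ("s1", "cin")),
        "c2":   (lambda s1, cin: s1 & cin, ("s1", "cin")),
        "cout": (lambda c1, c2: c1 | c2, ("c1", "c2")),
    }
    INPUTS = ("a", "b", "cin")

    def event_driven(stimulus):
        """Verilog-style: evaluate only gates whose inputs changed, via an event queue."""
        nets = {n: 0 for n in (*INPUTS, *GATES)}
        fanout = defaultdict(list)
        for out, (_, ins) in GATES.items():
            for i in ins:
                fanout[i].append(out)
        for vector in stimulus:
            queue = deque()
            for name, value in vector.items():
                if nets[name] != value:
                    nets[name] = value
                    queue.extend(fanout[name])
            while queue:                       # per-event scheduling overhead lives here
                out = queue.popleft()
                fn, ins = GATES[out]
                new = fn(*(nets[i] for i in ins))
                if new != nets[out]:
                    nets[out] = new
                    queue.extend(fanout[out])
            yield nets["sum"], nets["cout"]

    def compiled(stimulus):
        """Verilator-style: evaluate everything once per step, in a fixed topological order."""
        nets = {n: 0 for n in (*INPUTS, *GATES)}
        order = ["s1", "c1", "sum", "c2", "cout"]   # sorted once, up front
        for vector in stimulus:
            nets.update(vector)
            for out in order:             # no queue, no sensitivity checks: a compiler
                fn, ins = GATES[out]      # can flatten this sweep into straight-line code
                nets[out] = fn(*(nets[i] for i in ins))
            yield nets["sum"], nets["cout"]

    stim = [{"a": a, "b": b, "cin": c} for a in (0, 1) for b in (0, 1) for c in (0, 1)]
    assert list(event_driven(stim)) == list(compiled(stim))   # same results either way

Both give the same answers; the event-driven loop may evaluate fewer gates, but it pays queue and sensitivity bookkeeping on every change, while the second loop is the kind of reordered, flattened sweep the Verilator FAQ above is talking about.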

~~~
kingosticks
I've used Modelsim my entire working life, and one part of that has been the
GUI debugger with its waveforms, breakpoints, watches, execution stepping,
etc. If Verilator is re-ordering and optimising blocks, is that all still
possible, or only for the particular subset it doesn't happen to optimise
away?

~~~
brandmeyer
Yes, it is possible. By default, only the top-level nets are exported from any
given module. However, you can annotate additional nets of interest for
export. There's also a mode whereby you can export everything, at a big hit in
speed.

Even when exporting everything, it's still much faster than Vivado's simulator,
though.

~~~
kingosticks
That's interesting, thanks. The new 'vopt' mode in Modelsim forces you to do
the same annotation thing, perhaps they are moving towards a Verilator style
approach. Sadly vopt mode doesn't work for some of our designs and actually
makes others slower so we have avoided adopting it (and are now stranded on an
old version of the tool).

------
magicfractal
Do you know where the presentation video is?

~~~
olgs
It’s the Autonomy Day presentation, about 1:20:00 in from the beginning.

I think a lot of the questions people have would be answered if they took the
time to go through the video.

[https://youtu.be/Ucp0TTmvqOE](https://youtu.be/Ucp0TTmvqOE)

~~~
George19
No, this is different. This presentation is from Hot Chips, which went much
deeper than Autonomy Day.

------
HNLurker2
How do I get the blue print?

------
bitL
Can it manage 4k @ 1000fps for self-driving inferencing? If so, it could solve
most of the perceptual problems we have these days with self-driving cars.

~~~
verall
What?

What about dynamic range, exposure control, color information? What happens
when the sun is on the horizon and then the car enters and exits a tunnel lit
with cool-white LEDs, so the entire scene briefly goes black and everything
changes colors?

There are tons of huge problems in this space for which these benchmarks
wouldn't be meaningful at all. 1000 fps and 4k might be some of the least
meaningful benchmarks I have ever heard someone discuss in the realm of
self-driving.

~~~
bitL
Doesn't that depend on hardware? If you have a proper HDR camera that can
handle overload of light and records enhanced spectrum without clipping
information (those exist), I can't see how does it affect the underlying
perceptual algorithms themselves and how would that make those problems
suddenly unsolvable. Having a very high sampling frequency allows you to do
reasonable decisions on millimeter scales instead of yard/meter ones with
lower sampling frequency; if you have some clever voting/heatmap/etc.
mechanism in place, you'd get higher certainty that likely scales well with
further increase of sampling frequency. So the perception problem is "solved"
because a simple increase of frequency eliminates the need of applying more
sophisticated methods and gets you to required accuracy demanded by safety
protocols.

~~~
verall
> proper HDR camera that can handle overload of light and records enhanced
> spectrum without clipping information (those exist)

_No_, they don't. CMOS sensors are consistently limited by readout speed
(bits per second off of the sensor), so FPS, resolution, and dynamic range are
fundamental tradeoffs. The bleeding edge of HDR, high-framerate, high-res
sensors is full of dirty hacks like PWL and hardware bracketing to achieve
what they do, and it is nothing like your quoted figures.

1000 fps limits your exposure time to 1 ms unless you are achieving that
1000 fps with sensor arrays, which adds tons of complexity even for still
sensors, let alone when trying to get a reasonable picture from a fast-moving
vehicle. There is a reason that, if you ever look at a high-framerate camera
setup, they are using something like 3+ studio lamps to throw enough light on
the subject.

This "high sampling frequency allows you to do decisions on millimeter scales"
is speculative nonsense, and does not even try to address camera spacial
frequency limitations that would prevent this.
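
For a rough sense of scale, a back-of-the-envelope readout estimate for the "4k @ 1000 fps" figure (the 12-bit depth is an assumed illustrative value, before any HDR bracketing, which would multiply it further):

    width, height, bits_per_pixel, fps = 3840, 2160, 12, 1000
    gbit_per_second = width * height * bits_per_pixel * fps / 1e9
    print(f"~{gbit_per_second:.0f} Gbit/s off the sensor")  # ~100 Gbit/s

That is far more than typical automotive camera links carry, which is exactly the readout-speed wall described above.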

I work on HDR CMOS cameras for the automotive market as my day job. I would
love a link to one of these enhanced spectrum cameras you speak of.

------
baybal2
Worth noting that there is nothing actually FSD-specific in the chip. It
could easily have been a cookie-cutter smartphone SoC, though it is
significantly more powerful than anything on the market in such a package as
far as regular SMP goes.

I'm sure that they are not sure themselves where their future developments
will go, so they just added a lot of everything.

~~~
pjscott
Huh? At least half the die area is special-purpose hardware for running
convolutional neural network inference, specialized for consistent latency on
small batches of images streaming from several independent HDR video sensor
channels. This chip is _very_ specialized for self-driving cars.

~~~
baybal2
Many SoCs these days come with a dedicated matrix multiplication unit.

