
Arm Announces Cortex-R82: First 64-Bit Real Time Processor - rbanffy
https://www.anandtech.com/show/16056/arm-announces-cortexr82-first-64bit-real-time-processor
======
ksec
The best part is actually what Peter Greenhalgh, VP of Tech at ARM, wrote in
the comment section.

>A real-time processor doesn't intrinsically process any differently from a
normal applications processor. What differentiates it is that it bounds
latencies and behaves deterministically. For example, rather than interrupt
latency on an applications CPU taking anywhere from 50-1,000 cycles (or
more), the interrupt latency can be bounded to under 40 cycles.

>Tightly Coupled Memories allow certain routines and data to be stored within
the CPU so there's never any chance that a cache eviction has taken place
which forces a fetch from DDR (or flash). In a phone or laptop, you don't
have routines that absolutely must be accessed in 5 cycles and can't wait
250 cycles for DDR. If you're controlling an HDD you can't have the read head
crashing off the spinning disk! Or, in automotive, the spark plug firing at
the wrong time.

>Latency and determinism can be very important!

>Not a R8 derivative.

I can't wait to see an SSD controller with this tech.
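
A rough sketch of what the TCM point above looks like in practice: on cores
with tightly coupled memory, the toolchain lets you pin specific routines and
tables into it so a fetch can never miss out to DDR. The section names and
the spark-advance example below are hypothetical; real section names come
from your particular part's linker script, not from ARM's core itself.

```c
/* Sketch: pinning a hot routine and its data into tightly coupled
 * memory (TCM). The ".tcm_code"/".tcm_data" section names are
 * illustrative -- check your linker script for the real ones. */
#include <stdint.h>

#define TCM_CODE __attribute__((section(".tcm_code")))
#define TCM_DATA __attribute__((section(".tcm_data")))

/* Lookup table the handler must reach in a fixed number of cycles. */
TCM_DATA static volatile uint16_t spark_advance_table[64];

/* Deterministic lookup: no cache miss is possible, since both the
 * instructions and the table live in TCM, never in an evictable cache. */
TCM_CODE uint16_t spark_advance(uint8_t rpm_index)
{
    return spark_advance_table[rpm_index & 63u];
}
```

On an application-class core the same code would sit in a cache that is
always allowed to evict it; placement in TCM is what makes a 5-cycle access
guarantee even possible.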

~~~
Klinky
What benefit do you believe this is going to bring SSDs? They are full of DRAM
buffers and utilize multiple simultaneous NAND channels to mask latency
inherent to the tech. It seems a realtime processor would bring little to the
table.

Edit: Well, looking into it a bit, a lot of SSDs already use realtime ARM
processors in their controllers. The main benefit here seems to be the
ability to address larger buffers going forward, plus access to more cores.

~~~
wtallis
Not having to work around the limitations of 32-bit addressing is a
convenience, but the industry has produced 4+ TB SSDs with 4+ GB of DRAM for
quite a while now without this core.

The real demand for a core like this in SSD controllers comes from doing more
computationally expensive work on the SSD controller than just providing a
block storage device abstraction. Stuff like embedding a full key-value
database on the SSD or providing more general-purpose compute capability is a
hot topic for enterprise storage. There's already at least one company
shipping an SSD controller for computational storage with a mix of realtime
cores and Cortex-A53 cores. The Cortex-R82 means such a chip could be more
homogeneous with just one type of ARM core.

------
Animats
So this is basically a CPU with an upper bound on the horrible cases. That's
useful for real time. A standard test for real time is to run a hard real time
OS like QNX, and have a simple program which receives an interrupt from an
input pin and restarts a high priority task that's waiting for the interrupt.
The high priority task turns on an output pin. You hook the input and output
pins to an oscilloscope. You want to see all the output spikes about the same
distance from the input spikes. You don't want to see output outliers way out
there, late. If you see that, it's not a hard real time system. You can do
this with a standalone program to test the CPU by itself, but you really need
to test it when the CPU is also doing other things.

Sources of CPU-level trouble include 1) rarely used CPU instructions that run
slow microcode, 2) cases where the pipeline needs a total flush and that's
slow, 3) the board manufacturer doing something in system management mode and
not telling you. Drivers locking out interrupts too long is the usual Linux
problem. It's assumed that all real time code is locked in memory; you do not
page real time processes.

~~~
jancsika
Is there any upper bound on latency for something to qualify as a realtime
system? For example, a system that is guaranteed to compute its output at the
end of each week?

~~~
wtallis
There's a lot of subjectivity in these definitions. But an overnight batch job
run once a week would usually not be considered a realtime task. The duration
of the job and the time resolution of its deadline is pretty large compared to
the CPU scheduler timeslices used by general-purpose operating systems, and
you probably don't need any special hardware support. You just need to
provision enough CPU time overall, and not allow the job to be starved by
higher-priority tasks consuming all the available CPU time.

A task would usually be considered realtime to some degree if it cannot
automatically get its work done on time even when competing with a background-
priority long-running job—for example, if your deadlines are operating on
similar or smaller timescales to the operating system's CPU scheduler. Then
you need to start taking extra measures to ensure adequate responsiveness, not
just adequate throughput.

But there's also the aspect of consequences of missing a deadline. Most
realtime tasks are concerned with millisecond or shorter timescales, but a job
that must be performed once every 108 minutes to prevent the end of the world
would also be reasonably considered a realtime task.

~~~
jancsika
Yeah, it would be nice to have a real-life example with a relatively lengthy
latency. Perhaps with delivering an organ for a transplant or something.

At least wrt soft realtime scheduling for audio, I think users/programmers
tend to get severely tripped up because the latencies are simply too small for
them to make a clear mental model of the underlying system. That, along with
their ineluctable desire for their code to "go faster" drives them to request
features or design algorithms which would have to travel "back to the future"
in order to work at all.

So it would be nice to decouple "ultra-low latency" from "realtime" to make it
clear that, say, you can't put the heart on ice _after_ the helicopter already
lifted off to fly it to the hospital.

------
ChuckMcM
I'm pretty excited by this announcement. Not for storage but for software
defined radios. If Xilinx upgrades their RFSoC (which is a Zynq UltraScale+
variant with built-in ADCs and DACs) from the current A53 cores to this core,
it will allow much more sophisticated baseband processing in software that is
currently done with FPGA gates in the fabric. And while reconfigurable FPGAs
are nice, software can change modes much more quickly than reconfiguring an
FPGA can.

~~~
dragontamer
Question: are GPUs a consideration in the SDR world?

I'm no expert on SDR, but it seems to me that SDR involves exactly the
high-bandwidth, parallel workloads that GPUs excel at. (GPUs can perform
Fourier transforms very efficiently.)

The only possible qualm would be the latency of a GPU. But I don't imagine
that SDR workloads are very latency sensitive? Or am I mistaken there?

GPU kernels are just software, but the parallel nature of GPUs means it's
better if most of the GPU is sharing the same code (there's very little L1
code cache per thread). So a GPU is far more flexible than an FPGA but
somewhat less flexible than a CPU (where different cores and threads can
execute very different code, with large per-core L1 caches).

~~~
ChuckMcM
Yes! There is a very active community using CUDA to do signal
processing[1].

Latency is an interesting thing; it's always part of the SDR pipeline since
you have filter delays and processing delays. Most digital streams are uni-
directional, so you get the bits out, just shifted by 'x' ns. Since the 'x'
is deterministic you can plan for it.

[1]
[https://github.com/rapidsai/cusignal](https://github.com/rapidsai/cusignal)
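
To make the deterministic 'x' concrete: for a linear-phase (symmetric) FIR
filter, the group delay is exactly (N-1)/2 samples, so the shift through the
filter is a fixed, plannable time. A minimal sketch (the function name is
mine):

```c
/* Group delay of a symmetric N-tap FIR filter at sample rate fs_hz,
 * in nanoseconds. For a linear-phase filter this delay is exact and
 * identical for every sample -- the deterministic 'x' above. */
double fir_group_delay_ns(int num_taps, double fs_hz)
{
    double delay_samples = (num_taps - 1) / 2.0;
    return delay_samples / fs_hz * 1e9;
}
```

For example, a 129-tap filter running at 10 MS/s delays every sample by
exactly 64 samples, i.e. 6.4 µs, every time.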

~~~
CamperBob2
That's pretty cool. I don't know anyone who does realtime signal processing
work in Python, though, least of all myself. Are there C bindings for all of
that stuff?

~~~
ChuckMcM
Nearly all of the "heavy lifting", as it were, is done by C or C++ libraries;
Python is just the 'plumbing' level. It is not dissimilar to using MATLAB,
where the drivers are all optimized code but the connection between them is
MATLAB.

This design pattern is common to nearly all SDR frameworks (Gnuradio,
Redhawk, Pothos, etc.). The "interconnect" between processing elements is
typically shared memory, and that is why it lends itself to GPU work as well.

That said, since my head is most comfortable thinking in C, I tend to write
stuff in C rather than Python :-).

------
gumby
> “real-time” processors which are used in high-performance real-time
> applications.

C’mon anandtech, you can do better than this: “real time” means deterministic,
not necessarily faster, and in fact often means _slower_.

As another poster pointed out, ARM’s VP of Tech posted a comment explaining
what real-time means. I don’t know why people jump from that to the idea it
would be faster.

~~~
doctoboggan
"high-performance" could just as easily be interpreted as faster response
time rather than faster clock speed. So in that way they are "faster".

~~~
gumby
But real-time systems don’t guarantee faster. I was a real-time engineer for
years.

~~~
swebs
Lower latency, not higher throughput.

~~~
gumby
Not necessarily even lower latency, just predictable.

------
tyingq
_" Another big change to the microarchitecture is the inclusion of an MMU,
which allows the Cortex-R82 to actually serve as a general-purpose CPU for a
rich operating system such as Linux."_

That's interesting, but they seem to be removing most of the differences
between the A and R series.

~~~
duskwuff
Not really. The performance-oriented A series can include -- and will continue
to include -- features which improve performance at the expense of
consistency, such as branch prediction and speculative execution. The R
series values consistent behavior over performance, so it won't have those
features.

~~~
the_duke
R82 is in-order, but does provide branch prediction. [1]

[1] [https://developer.arm.com/ip-
products/processors/cortex-r/co...](https://developer.arm.com/ip-
products/processors/cortex-r/cortex-r82)

------
parkerhoyes
The title is a bit ambiguous - this is ARM's first 64-bit real time processor,
but it's not the first one out there. All of SiFive's S and U series cores are
64-bit with hard real-time determinism. [0]

[0] [https://scs.sifive.com/core-designer/](https://scs.sifive.com/core-
designer/)

------
Matthias247
Realtime-capable hardware and software are super interesting. Building
something that has to fulfill a certain task in a certain amount of time,
rather than just providing a best-effort solution, is a very different
challenge. Therefore I'm always curious to learn about products and
solutions in that field.

However for this presentation I'm a bit puzzled by some of the marketing
slides:

- Optional NEON accelerates ML.

- Advanced machine learning support

Is that really a use-case for those processors? Can ML be hard-realtime
anyway? I feel like promoting those SIMD instructions for, e.g.,
erasure coding in storage use-cases, or signal processing in
Telco/Audio/Video use-cases, would be much closer to typical applications for
such a CPU.

Is it just me, or does the "ML" on these slides sound more like "we need to
join the hype" than an actual use-case?

- Optional MMU enables Linux and cloud-native software development

I agree - MMU is great, Linux support too. But what is cloud-native again? And
is something that runs on an SSD controller in a datacenter now cloud native?

~~~
nl
> Can ML be hard-realtime anyway?

Sure, why not? The number of instructions needed to run inference on a given
neural network is known, so I don't see any real issues here.

(Training is of course a different matter)
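
To illustrate why inference can be scheduled as hard real-time: a forward
pass is a fixed sequence of multiply-accumulates with constant loop bounds
and no data-dependent control flow, so its cycle count is the same for every
input. A toy fixed-point dense layer, with arbitrary example sizes (nothing
here is from the article):

```c
/* Toy Q15 fixed-point fully connected layer with ReLU. Loop bounds
 * are compile-time constants and there is no data-dependent branching,
 * so execution time is input-independent -- schedulable as hard RT. */
#include <stdint.h>

#define IN 4
#define OUT 3

void dense_relu_q15(const int16_t in[IN],
                    const int16_t w[OUT][IN],
                    const int16_t bias[OUT],
                    int16_t out[OUT])
{
    for (int o = 0; o < OUT; o++) {
        int32_t acc = bias[o];
        for (int i = 0; i < IN; i++)
            acc += ((int32_t)w[o][i] * in[i]) >> 15; /* Q15 multiply */
        /* ReLU as a select, not a branch on typical compilers. */
        out[o] = (int16_t)(acc > 0 ? acc : 0);
    }
}
```

(The per-term `>> 15` truncation is a simplification of how real fixed-point
kernels accumulate, but the timing argument is the same: every input walks
the identical instruction sequence.)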

> Is it just me, or does the "ML" on these slides sound more like "we need to
> join the hype" than an actual use-case?

You could build query-by-example indexes (for text, sound, images or videos)
in the storage hardware with this kind of capability. That'd be pretty
interesting.

Imagine something like Isilon, but instead of allowing Hadoop analytics on the
storage hardware it could do approximate nearest neighbour across images.

(This is actually a pretty good idea...)

~~~
wtallis
> You could build query-by-example indexes (for text, sound, images or videos)
> in the storage hardware with this kind of capability. That'd be pretty
> interesting.

I've seen demos of this kind of thing from companies like NGD Systems, who use
Cortex A53s in addition to realtime cores in their SSD controller. Part of the
selling point for offloading this from the CPU is that code running on the SSD
controller has much faster access to the dataset and can query it more
efficiently, unconstrained by the drive's host interface.

There are also more mundane use cases for ML inside an SSD, as heuristics to
predict data access patterns or media wearout trends, which can influence how
the SSD chooses to store data and tackle error correction.

------
anamax
The wheel of reincarnation turns again.

Back in mainframe days, "channel controllers", the subsystems that controlled
storage devices, were very programmable. At some points in time, it made sense
to off-load as much as possible from the CPU onto the channel controller. At
other points in time, the reverse was true.

For a couple of decades, we've had dumb storage devices. Now with some of the
NVMe stuff, we're moving back towards smart "storage".

Linux in an SSD is the same idea.

~~~
als0
Disk controllers have been "smart" for the last 15 years. There's just a
desire to run Linux instead of a barebones RTOS.

~~~
StillBored
A lot longer than 15 years. There were ISA SCSI controllers with micro-
controllers that could be programmed at some level. The Symbios SCRIPTS come
to mind, which were common on the 53Cxx controllers.

[https://www.manualslib.com/manual/96635/Lsi-Symbios-
Sym8953u...](https://www.manualslib.com/manual/96635/Lsi-Symbios-
Sym8953u.html?page=12)

This has continued on raid/fibre/nvmeof/etc boards.

------
TheMagicHorsey
What would be the advantage of running an RTOS on Cortex-R82 vs any other ARM
processor? Doesn't an RTOS give you the hard realtime capability through
software?

I feel I must be missing something critical in my understanding. I thought
RTOS in software was sufficient to get realtime processing.

~~~
tashbarg
Your software can only guarantee what the hardware provides/guarantees. If
your CPU can take a variable number of cycles to start processing an
interrupt, the RTOS can only guarantee the upper bound. A real-time CPU is
all about being faster and, more importantly, more deterministic, thereby
enabling the RTOS to guarantee faster reaction times.

If an interrupt can usually be processed in 50 cycles but very rarely will
take 500, the RTOS can’t schedule for the 50 cycles. If the real time cpu
guarantees 200 cycles, the RTOS can actually schedule with that. Depending on
the application, that can make a huge difference.
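
The arithmetic behind that, with an example clock (400 MHz is my figure, not
an R82 spec): the scheduler budgets against the guaranteed bound, not the
typical case, so an unbounded tail poisons the whole schedule even if it is
rare.

```c
/* Convert a guaranteed worst-case cycle bound into wall-clock time.
 * The RTOS schedules against this number; without a hardware bound
 * there is no valid number to schedule against. */
#include <stdint.h>

double worst_case_latency_us(uint64_t bound_cycles, double clk_hz)
{
    return (double)bound_cycles / clk_hz * 1e6;
}
```

E.g. a guaranteed 200 cycles at 400 MHz is 0.5 µs of worst-case latency,
which the RTOS can subtract from every deadline it promises.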

------
brundolf
For those like me who didn't know what a "Real Time Processor" was:
[https://en.m.wikipedia.org/wiki/Real-
time_computing](https://en.m.wikipedia.org/wiki/Real-time_computing)

(somebody correct me if this is wrong)

~~~
supernova87a
I'm also interested --

What's the hardware difference required to support real time? Is it some
dedicated parts of compute to support the queueing / prioritization of jobs?
And some additional ability to have the "master" must-always-work part be able
to interrupt or reset "optional" processes?

~~~
akiselev
Deterministic latency. That usually means no cache, no speculative execution,
and nothing else that can't guarantee it will complete in a bounded time.

Specifically, in many architectures interrupt handling code must be able to
yield _very_ quickly in order to continue receiving interrupts (or use
reentrant interrupts, which are a whole other mess). Even a cache lookup,
which might miss and have to wait hundreds or thousands of cycles for DDR
RAM, makes the system unable to guarantee that it will do what it needs to
do in time, like respond to some safety shutoff switch.

I think one of the things ARM did here is make a way for interrupt handling
code to stay in cache permanently along with some other determinism
guarantees.
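
The "yield quickly" pattern above, sketched out: the interrupt handler does
the bare minimum (record the event and return) and a lower-priority task does
the slow work. Names here are illustrative; on real hardware the ISR would be
registered with the vector table rather than called directly.

```c
/* Minimal-ISR / deferred-work pattern: keep the critical section tiny
 * and deterministic, push everything slow into task context. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static atomic_uint pending_events = 0;
static uint32_t events_handled = 0;

/* Runs with interrupts masked: just count the event and get out. */
void sensor_isr(void)
{
    atomic_fetch_add_explicit(&pending_events, 1, memory_order_relaxed);
}

/* Runs in thread/task context: can afford cache misses and DDR waits. */
bool process_pending(void)
{
    unsigned n = atomic_exchange_explicit(&pending_events, 0,
                                          memory_order_acquire);
    events_handled += n;  /* stand-in for the slow, non-critical work */
    return n > 0;
}
```

With TCM, the ISR and its counter can be pinned so even the tiny critical
section never takes a cache miss.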

~~~
supernova87a
Thanks! I guess when designing the accompanying software, then, the people
writing it probably spend as much (or more) time on what happens when the
expected/desired behavior fails as on when it succeeds?

------
bfrog
This could be an amazing chip. I imagine it's a replacement for the R5?

~~~
doctoboggan
The article talks about it like it's a replacement for the R8.

~~~
klysm
It's also called the R82 so I think that makes sense.

------
chmod600
Does real time mean that there is a minimum time an operation will take as
well as a maximum?

Asking because a minimum time could prevent some kinds of timing attacks.

------
Dork1234
Has anyone compared this to real-time capabilities of Intel hardware?

------
cordite
Is this just a native way to prevent interrupts on specific cores?

~~~
klysm
I believe it’s more determinism than just the interrupt timing.

------
person_of_color
32 bit versions are used in stuff like your WiFi chip.

------
TimSchumann
I wonder how you can get a dev kit for this.

------
fizixer
If I remember correctly, having a capable RTOS is much more important, and
RTOSes can be used on regular processors too [0].

I wonder what's so special about this processor that makes it better than
dozens of other hardware platforms (both microprocessors and microcontrollers)
on which embedded RTOSes are running and doing just fine.

Also if you have a 64-bit processor, but no 64-bit RTOS, you don't have much.

[0] [https://en.wikipedia.org/wiki/Comparison_of_real-
time_operat...](https://en.wikipedia.org/wiki/Comparison_of_real-
time_operating_systems)

~~~
rurban
You don't really need more than 2GB of RAM in RT; it's a huge burden. You
don't even need more than a few megabytes. Also, 64-bit pointers are twice as
slow as 32-bit pointers. Just think of randomly jumping around in your
terabyte of RAM: filling the cache lines alone costs 50 cycles, and your
budget is usually around 40-100. A 64-bit RTOS might be usable for image
detection (self-driving cars), but not much else. The CPU cannot be general
purpose; GPUs can do that much better. I don't really see the market for such
a thing.

~~~
easde
It's not uncommon to have RT workloads where the "hard" RT part can fit in a
small amount of tightly coupled memory, but the same processor is also used
for "soft" RT workloads that need a much larger amount of memory. 64-bit
addressing makes developing this kind of system a lot easier.

