
Jetson AGX Xavier - my123
https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-agx-xavier/
======
TaylorAlexander
I have one of these powering my open source four wheel drive robot. [1]

I've started doing machine learning experiments with it finally. (See [1] for
details)

There are a few tricks to getting the best performance. You want to convert your
neural network to run with NVIDIA's TensorRT library instead of plain
TensorFlow or PyTorch. TensorRT does all the optimized goodness that gets you
the most out of the hardware. Not all possible network operations can run in
TensorRT (though NVIDIA updates the framework regularly). This means some
networks can't be easily converted to something fully optimized for this
platform. Facebook's detectron2, for example, uses some operations that don't
readily convert. [2]

But then if you're new like me, you've got to find some code that will
ultimately produce something you can convert to TensorRT, and you also need
something that you can easily train. I've learned that training on your own
dataset is often non-obvious. A lot of example code shows how to use an
existing dataset but totally glosses over the specific label format those
datasets use. That means you've got to do some digging to figure out how to
make your own dataset load properly into the training code.
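
To give a flavor of the glue code I mean (everything here is hypothetical, not any particular framework's API): for segmentation-style training the loader usually just wants matched image/label pairs, so the first step is often as simple as pairing files by name:

```python
# Sketch: pair images with their label masks by filename stem.
# The directory layout and extensions are assumptions; adjust to
# whatever your training code expects.
from pathlib import Path

def pair_samples(image_dir, label_dir, img_ext=".jpg", lbl_ext=".png"):
    """Return (image_path, label_path) pairs, skipping images with no label."""
    labels = {p.stem: p for p in Path(label_dir).glob(f"*{lbl_ext}")}
    pairs = []
    for img in sorted(Path(image_dir).glob(f"*{img_ext}")):
        if img.stem in labels:
            pairs.append((img, labels[img.stem]))
    return pairs
```

The non-obvious part is usually what's *inside* the label files (class indices vs. colors, polygon JSON vs. masks), which is exactly what the example code glosses over.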

After trying a few different things, I've gotten some good results training
with Bonnetal [3]. I was able to make enough sense of its training code to
use my own dataset, and it looks like it will readily convert to TensorRT.
Then you load the converted network using NVIDIA's Deepstream library for
maximum pipeline efficiency [4].

The performance numbers for the AGX Xavier are very good, and I am hopeful I
will get my application fully operational soon enough.

[1] [https://reboot.love/t/new-cameras-on-rover/](https://reboot.love/t/new-cameras-on-rover/)

[2] [https://github.com/facebookresearch/detectron2/issues/192](https://github.com/facebookresearch/detectron2/issues/192)

[3] [https://github.com/PRBonn/bonnetal](https://github.com/PRBonn/bonnetal)

[4] [https://developer.nvidia.com/deepstream-sdk](https://developer.nvidia.com/deepstream-sdk)

~~~
m3at
Cool robot!

For dealing with layers not supported by TensorRT, you might want to try
exporting to ONNX instead and then using TVM [1] to compile your model for the
hardware. I have not used it on NVIDIA boards yet, but I had a good experience
on other less powerful ARM boards, and since the TVM docs show some examples
running on the Jetson TX I imagine the AGX Xavier is likely fine too.

[1] [https://tvm.apache.org/](https://tvm.apache.org/)

------
king_magic
My biggest complaint with the Jetson line is it's all ARM. Look, I get it. But
the developer experience is horrible. Building Docker containers for ARM
devices is a pain. Hell, building anything for a Jetson can be a pain unless
it's a pre-packaged NVIDIA thing - really not a fan of building things from
source. Add on top of that NVIDIA's very low-level documentation for pretty
much any tooling they ship, coupled with the difficulty of getting timely
engineering support (unless you want to post to one of their message boards
and hope you get an answer back in less than a week)... basically, it's really
rough to do anything seriously useful with Jetson hardware.

Second biggest complaint is deploying Jetsons in production environments. Dev
kits aren't production stable, so you either need to build your own carrier
board or find one pre-built, and frankly that's just a giant pain to do.

Third biggest complaint is having to flash Jetsons manually. Misery.

A production-ready x64 Jetson that you could order directly from NVIDIA would
be my dream. Add up all of the shortcomings and overhead of ARM Jetsons and
IMO you do not have a viable device for shipping AI solutions at scale.

~~~
0x8BADF00D
> Building Docker containers for ARM devices is a pain.

It's not so bad; you just need a beefy ARM machine to build the containers in
CI. It would be silly to build a Docker container on the Jetson itself. You
would never use an embedded device for compiles and builds, so why would you
build Docker containers on one?
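
That said, you can also cross-build ARM images with no ARM hardware at all via QEMU emulation and `docker buildx` (slow, but fine for CI). A hedged sketch - the base image is NVIDIA's L4T image, but the tag here is an assumption, so check the one matching your JetPack release:

```dockerfile
# Hypothetical Dockerfile for a Jetson-targeted image.
# nvcr.io/nvidia/l4t-base is NVIDIA's L4T base image; pick the tag
# that matches your JetPack version.
FROM nvcr.io/nvidia/l4t-base:r32.4.3
COPY myapp /usr/local/bin/myapp
CMD ["myapp"]
```

On the x86 build host you'd register QEMU once (e.g. `docker run --privileged --rm tonistiigi/binfmt --install arm64`) and then build with `docker buildx build --platform linux/arm64 -t myimage .`.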

~~~
linarism
Genuinely curious, what is an example of a beefy ARM machine?

~~~
pram
Anything with a ThunderX processor is maximum beefy. You can get on-demand
servers like that from places like Packet. AWS also has their own A1 instances
with lower core counts. These would all be good for cross compiling/builds.

Comedy answer: iPad Pro

~~~
p1necone
> Comedy answer: iPad Pro

I mean, the iPad Pro _does_ have a relatively beefy processor. If only you
could run arbitrary code on it.

~~~
BillinghamJ
A lot of instruction sets aren't there yet, but
[https://ish.app](https://ish.app) is doing a truly incredible job in this
regard.

Gives you a working Alpine Linux installation which you can download and
install packages for normally, all within the bounds of the normal Apple
sandbox, with decent enough performance.

It doesn't have SSE or MMX yet, so e.g. Go and Node aren't usable at this point.
But a shocking amount actually does work perfectly, so it's only a matter of
time as more instruction sets are implemented.

~~~
p1necone
It's pretty atrocious that you have to run an x86 emulator to use your iPad
"pro" for this stuff. This project is super cool though.

~~~
p1necone
Actually wouldn't a userspace ARM "emulator" be faster than an x86 one on an
ARM device? Or are you so far removed from running actual CPU instructions
that it doesn't matter?

------
jerryyu
Just built a gesture controlled robot[1] with the Xavier board.

We were able to run OpenPose [2] at 27 FPS, which we found was even faster than
running it on a K80 on an AWS p2.xlarge. It was a pain to install Caffe and all
the dependencies on an ARM processor, but it worked out eventually.

We were able to train and run TensorFlow 2 models quickly as well. Felt like
using an actual GPU at a fraction of the cost.

[1] [https://www.youtube.com/watch?v=AF8zmTaa17s](https://www.youtube.com/watch?v=AF8zmTaa17s)

[2] [https://github.com/CMU-Perceptual-Computing-Lab/openpose](https://github.com/CMU-Perceptual-Computing-Lab/openpose)

------
mchusma
The originally posted title was more helpful. The Jetson AGX Xavier has been
out for a couple of years, but it dropped in price from $999 to $699 and now
has double the RAM at 32 GB.

------
my123
For a moddable Arm platform with all batteries included, there isn't really an
alternative to this, especially at $699.

It's much stronger than an RPi and could fill the gap between the RPi and
Arm-based server platforms.

~~~
mappu
In terms of using it as an ARMv8 desktop workstation (with decent CPU
performance, real SATA / Ethernet / PCI-e connectors) - some other contenders
include the MACCHIATObin (quad A72) and Honeycomb LX2K (16-core A72, 750USD)
from Solid-Run.

------
fvv
I'm definitely not expert and probably this is a dumb question , but why smart
edge things like smart robot and not dumb edge with smart central brain ?
Anyway data are useful aggregated central;ly why not incorporate the brain
centrally too?

~~~
TaylorAlexander
Well, first some clarification - "edge" means "on robot" versus something in
the cloud. And the reason you do this is latency and connectivity.

I am designing a four wheel drive robot using the NVIDIA AGX Xavier [1] that
will follow trails on its own or follow the operator on trails. You don't want
your robot to lose cellular coverage and become useless. Even if you had
coverage, there would be significant data usage, as Rover uses four 4K cameras,
which is about 30 megapixels (actually they max out at 13 MP each, or 52 MP
total). Constantly streaming that to the cloud would be very expensive on a
metered internet connection. Even on a direct line the machine would saturate
many broadband connections. Of course you can selectively stream, but this
makes things more complicated.
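
Some back-of-the-envelope arithmetic makes the point (my assumptions: raw 8-bit pixels, 30 fps):

```python
# Rough raw data rate for four 13 MP cameras.
# Assumptions: 8-bit pixels (raw Bayer; RGB would be 3x this), 30 fps.
cameras = 4
pixels_per_camera = 13e6
bytes_per_pixel = 1
fps = 30

bytes_per_sec = cameras * pixels_per_camera * bytes_per_pixel * fps
gbits_per_sec = bytes_per_sec * 8 / 1e9
print(f"{gbits_per_sec:.1f} Gbit/s raw")  # ~12.5 Gbit/s
```

Even with heavy compression you're orders of magnitude above what a metered cellular link can handle.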

Latency is an issue. Imagine a self driving car that required a cloud
connection. It's approaching an intersection and someone on a bicycle falls
over near its path. Better send that sensor data to the cloud fast to
determine how to act!

On my Rover robot, the cameras stream directly into GPU memory, where the
frames can be processed using ML without ever being copied through the CPU.
It's super low latency and allows for robots that respond rapidly to their
environment. Imagine trying to make a ping-pong-playing robot with a cloud
connection.

I am also designing a farming robot. [2] We don't expect any internet
connection on farms!

[1] [https://reboot.love/t/new-cameras-on-rover/](https://reboot.love/t/new-cameras-on-rover/)

[2] [https://www.twistedfields.com/technology](https://www.twistedfields.com/technology)

Edit: Don’t forget security! Streaming high resolution sensors over the cloud
is a security nightmare.

~~~
epmaybe
This is a bit off topic, but I'm constantly looking at ways to efficiently
stream 4K cameras live to local displays as well as remote displays at the
highest framerate and resolution possible. How feasible would it be on the
xavier to stream 2 4k cameras and display them on at least 2 4k screens? Extra
points if you could do that and simultaneously upload to a streaming service,
such as twitch.

~~~
m463
I've wondered about this too.

I think the magic camera interconnect is CSI/CSI-2, and it's not really
flexible enough. You either have really short copper interconnects, or
unavailable fiber interconnects.

What would be cool is if CSI-to-Ethernet were a thing, either low-latency
put-it-on-the-wire or compressed. I don't know, maybe it is. But make it a
standard like RCA jacks.

~~~
manofmanysmiles
You can buy kits that send video over coax, including power, for somewhat
reasonable prices:

[https://leopardimaging.com/product/nvidia-jetson-cameras/nvi...](https://leopardimaging.com/product/nvidia-jetson-cameras/nvidia_agx_xavier_fpdlinkiii_camera_kits/li-ar0231-ti953-xavier/)

I haven't tried them, but I am considering them for a project.

------
weinzierl
They seem to offer a cheaper 8 GB model too, but unfortunately I see no price
for it. I'm curious how much it'll be because, as much as I'd like to toy
around with this, $699 is a little too much for just experimentation.

EDIT: The 8 GB _Module_ seems to be $679 here [1]. This makes the $699 for the
32 GB _Developer Kit_ seem like a steal. Still, too expensive for play; I guess
I'll stick with my Jetson Nanos for a while...

[1] [https://www.arrow.com/en/products/900-82888-0060-000/nvidia](https://www.arrow.com/en/products/900-82888-0060-000/nvidia)

~~~
fluffything
There is also the Jetson Nano kit, which costs ~120 EUR.

~~~
potiuper
Maxwell does not have unified memory, so you need custom code compared to the
latest generation, along with performance disadvantages.

~~~
fluffything
Unified memory is useful during development when porting the application, but
I don't know of any well-tuned applications that use it.

~~~
potiuper
Any "well-tuned" multi-threaded application requires unified memory for
simultaneous access to managed memory from the CPU and GPUs, and that is not
possible with compute capability lower than 6.0. This is because pre-Pascal
GPUs lack hardware page faulting, so coherence can't be guaranteed. On these
GPUs, an access from the CPU while a kernel is running will cause a
segmentation fault.

------
crelex
I develop on one of these every single day. I just use Visual Studio Code with
a custom launch.json and tasks.json that let me ssh into the Jetson, copy over
the code compiled for Linux, and attach a remote debugger. I like never even
touch the Jetson and all my builds and shit are pushed over... it's actually an
easy-as-shit development experience at this point. You have to kinda know what
you're doing... but it's all totally usable. The dev kit I have has been
running 415 days without any issues, running 5 custom programs I've written.
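
For anyone wanting to copy this setup, the launch.json side can look roughly like this (the program path, IP, port, and task name are all placeholders; this assumes the C/C++ extension locally and gdbserver running on the Jetson):

```json
{
  "version": "0.2.0",
  "configurations": [
    {
      "name": "Attach to Jetson (placeholder IP/paths)",
      "type": "cppdbg",
      "request": "launch",
      "program": "${workspaceFolder}/build/myapp",
      "cwd": "${workspaceFolder}",
      "MIMode": "gdb",
      "miDebuggerServerAddress": "192.168.1.42:2345",
      "preLaunchTask": "deploy-to-jetson"
    }
  ]
}
```

The `deploy-to-jetson` entry would be a tasks.json task that scp's the cross-compiled binary over and starts `gdbserver :2345 ./myapp` on the board.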

------
smg
The cheapest Volta GPUs I have seen so far cost over 2K for 12GB. Can the GPU
provided in this kit be used for training?

~~~
dchichkov
Yes, if your model is small enough or if you are fine-tuning a small number of
layers. TensorFlow 1.15 and 2.0 are available on the Xavier. I understand that
PyTorch could be built as well.

Note that the number of CUDA cores and the amount of memory available are
smaller compared to discrete Volta GPUs.

~~~
fizixer
You're saying it can do training for small models because of the presence of
the small (512-core) GPU? (Plus maybe some leftover control calculations on
the CPU.)

~~~
corysama
It's a low-wattage device. Its performance can't hold a candle to a last-gen
card that uses 10x the power.

------
fxtentacle
I wonder how you train for these. The biggest one has 32 GB of RAM and needs a
frozen inference graph converted to TensorRT, so one would need a GPU with
32 GB of RAM in addition to this to be able to train the network. AFAIK, NVIDIA
doesn't sell anything with that much RAM.

------
snek
The lineup continues to balloon in price. A lot of students would buy the TK
and TX models for robotics and whatnot, since they were only 200-300 bucks.

~~~
shmolyneaux
The Jetson Nano [0] seems like it would be better for students; it's only $99.

[0]: [https://developer.nvidia.com/embedded/jetson-nano-developer-...](https://developer.nvidia.com/embedded/jetson-nano-developer-kit)

------
farseer
Honest question, can something like this kill the market for embedded DSP
processors made by Texas Instruments or Analog Devices?

~~~
augustt
I mean, this is absurdly more powerful than those dedicated DSPs.

~~~
WJW
It's also easily 10x the price. It really matters what the application is and
how much processing power you need.

------
Symmetry
No details on the CPU. I wonder if it's going to be Carmel (via the Transmeta
lineage).

------
fizixer
Inference only. (So this is competing with Google TPUv1; a few years late and
way more expensive, but with more memory)

~~~
my123
It can do training with its GPU; it's not the fastest thing in the world though.

~~~
fizixer
So looks like it can do "checkbox training" (add an iota of training
capability just so you could check the box labelled "it does training").

Got it.

