Tesla Project Dojo Overview (mvdirona.com)
125 points by kungfudoi on Aug 23, 2021 | 89 comments



Nit: The person who presented the system is Ganesh Venkataramanan, not Genesh Venugopal, as stated in the article.


15 kW per tile, 12 tiles per rack… 180 kW per rack at 480V three-phase works out to roughly 217 amps per phase. Jesus. Good luck finding a rack that can support that. Either you build your own rack (and this could be Tesla’s plan, but I couldn’t find details on that), or you put maybe 3 tiles per rack. Hell, most GPU installations won’t put more than 4 GPU servers per rack for this very reason. I can’t imagine how heavy the wiring would be.
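
A quick sanity check on those numbers, as a sketch (assuming the 480 V is line-to-line, the usual US three-phase service; dividing 180 kW by 480 V directly gives 375 A, while the three-phase formula I = P / (√3 · V) gives about 217 A per phase; either way, far beyond a typical rack feed):

    import math

    TILE_KW = 15          # claimed power per training tile
    TILES_PER_RACK = 12   # ExaPOD: 120 tiles across 10 cabinets
    V_LL = 480.0          # volts, line-to-line

    p_rack = TILE_KW * 1000 * TILES_PER_RACK        # 180,000 W
    i_naive = p_rack / V_LL                         # 375 A (ignores the sqrt(3))
    i_per_phase = p_rack / (math.sqrt(3) * V_LL)    # ~217 A per phase

    print(f"{p_rack / 1000:.0f} kW/rack; {i_naive:.0f} A naive; "
          f"{i_per_phase:.0f} A per phase (three-phase)")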



Fitting an ExaFLOP in 10 cabinets is so powerful as to be absurd.


ExaFLOP != ExaFLOP


Not sure what the downvotes are for, but an ExaFLOP hasn't equaled an ExaFLOP for a long time.

An A100 is an ExaFLOP GPU, if you count INT8 ops as FLOPs.

Then you have the 10 different flavors of FP8, FP16, etc. and even the multiple flavors of FP32.

What people usually count as "ExaFLOP" is FP64, but this hardware can't even do FP32, much less FP64.

So yeah, an ExaFLOP != an ExaFLOP anymore, because these types of announcements never say "An ExaFLOP of what".

Everybody's brain has an infinite throughput of FP0 (a 0-bit FP format). That doesn't mean that your brain can compute faster than Dojo, even though infinite ExaFLOPs >>>> 1 ExaFLOP. The reason is that we are counting different things.

So what's the throughput of this hardware in actual FP64 ExaFLOPs? Zero; it doesn't support them. Not a very impressive marketing statement though. But what this means is that this hardware would be extremely bad at, e.g., solving linear systems of equations.


>So yeah, an ExaFLOP != an ExaFLOP anymore, because these types of announcements never say "An ExaFLOP of what".

I thought Tesla was pretty clear about which floating point formats they were talking about. When they first introduced the D1 they put up a slide with FLOPS measurements for FP16 and FP32. The impressive numbers (exaFLOP in ten cabinets for example) are based on FP16, of course.
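
For anyone who wants to see which precision gets you to "an ExaFLOP in ten cabinets", here's the arithmetic with the per-die figures from the AI Day slides (Tesla's claimed numbers, not measurements):

    # Figures from Tesla's AI Day slides (claims, not measurements).
    D1_BF16_TFLOPS = 362    # per die, BF16/CFP8
    D1_FP32_TFLOPS = 22.6   # per die, FP32
    DIES_PER_TILE = 25      # 25 x 362 ~= 9 PFLOPS per training tile
    TILES_PER_EXAPOD = 120  # 10 cabinets x 12 tiles

    dies = DIES_PER_TILE * TILES_PER_EXAPOD     # 3,000 D1 dies
    bf16 = dies * D1_BF16_TFLOPS / 1e6          # ~1.09 "ExaFLOPs" at BF16
    fp32 = dies * D1_FP32_TFLOPS / 1e6          # ~0.07 ExaFLOPs at FP32

    print(f"{dies} dies -> {bf16:.2f} EFLOPS BF16, {fp32:.3f} EFLOPS FP32")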


Tesla at their scale are not short of space. They can afford a few more square feet of tent or wherever they want to keep their computer.

I just don't see why volumetric efficiency matters at all for this computer.

All that matters is usable FLOPS per dollar. And if you put it in a cold place with very low electricity costs (e.g. Iceland), the electrical efficiency doesn't matter much either. All you care about is how cheaply you can make high-speed transistors.

The Dojo design looks like it really wasn't optimising for what matters at all.


Based on their presentation and design choices (and a tiny bit of experience), it looks to me like the limiting factor is actually bandwidth and latency, not flops or electricity or heating or space or even the cost of the transistors themselves. A ton of flops doesn't help if you can't feed them data to process.

The reason they gave for doing things like "knocking out the walls of cabinets" was bandwidth, not space.

It turns out that the highest-bandwidth solutions are also quite compact. One giant piece of silicon instead of many smaller pieces of silicon is great because the different parts of the giant piece of silicon can talk to each other very quickly with very high bandwidth. Short wires are better than long wires, because you have less noise, so you can fit in more signal (also less latency).

It's also just not clear to me how you imagine making things bigger would make them cheaper.


Locality has everything to do with performance. It's why your L1 cache is faster than your L2 cache.


I'm not a hardware person at all, but my understanding is that distributed computing systems are hard to scale due to latency, and a huge factor in latency at this scale is physical distance.

This was also mentioned in the AI Day presentation, and listed as a major factor in the design of Dojo.


It's a nice piece of hardware, reminiscent of old IBM water-cooled mainframes. But as a solution to self-driving, it's doubling down on a failed strategy. If they haven't been able to avoid hitting solid obstacles with what they've been doing, more data won't help.


Yes, more data will help. More data did help with all other Deep Learning problems, from Go to speech synthesis. GPT-3 is orders of magnitude better than GPT-2 because it has more data.

It's really Deep Learning 101.

Furthermore, it's a non-sequitur of an argument.

Dojo is a generic neural-network training chip. It can be used to speed up any and all Deep Learning problems. Self-driving is one such problem.

During AI Day Tesla also presented, in depth, the architecture of the Deep Learning network they use.

If you have a critique of their approach or can show how a competitor's approach is better, then do share with the class. If you don't, it's just lazy "Tesla bad because they solved a problem that no-one else solved".


1 - you should prove why you think that this problem is just like the other deep learning problems

2 - it's not true that "no-one else solved" this problem, if we define the problem as: have a self-driving vehicle that safely navigates its environment and does not endanger its passengers or the people outside. See the Waymo whitepapers: waymo.com/safety/performance-data

Ultimately, even if we agree that what Tesla is working on is orders of magnitude better than what's currently available from them, that might not cut it:

Let's assume that the current obstacle detection is 99% accurate... if the newer versions/improved models are 99.9%, 99.99%, etc. accurate...

we'll still have a 0.1%, 0.01%, 0.001%, etc. chance of an obstacle not being recognized. Tesla cars are a small fraction of those on the road, and there have been several deaths already. If millions of Teslas are sold, those small percentages will still mean a significant number of accidents and deaths caused by Tesla. If instead a Lidar reliably detects all obstacles, well in advance of the vehicle approaching them (and the car will stop/disable self-driving if the Lidar becomes inoperable, e.g. due to bad weather)... it would be irresponsible to persuade huge swaths of the population to let a machine drive the vehicle without providing the extra safety that a Lidar enables.
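
To put rough numbers on that argument (every figure below is hypothetical, chosen only to show how a small miss rate scales with fleet size):

    FLEET = 1_000_000       # hypothetical number of cars on the road
    EVENTS_PER_CAR = 100    # hypothetical critical obstacle encounters per car per year

    for accuracy in (0.99, 0.999, 0.9999, 0.99999):
        missed = FLEET * EVENTS_PER_CAR * (1 - accuracy)
        print(f"accuracy {accuracy * 100:.3f}% -> ~{missed:,.0f} missed obstacles per year")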


"If million of Teslas will be sold"

To date, Tesla has sold about 2 million cars.

Not sure how many of those have the HW3 "FSD capable" computer, but I'd guess 1.5 million.


But not all 1.5 million of those people were dumb enough to pay for the non-existent FSD


Even without FSD, the basic AutoPilot uses the same sensors and vision stack for TACC, collision warnings, and collision avoidance.


They're still collecting data for Tesla


I've worked in data science, and most of the time, "more data" is just more noise.

The guy from comma.ai says the same thing.


It's not just more data, it's more unique scenarios with more samples of each. More edge cases to train on; that's got to help when training more complex dimensions of the existing model.

They're not just going to bloat the dataset with straight driving.


You should not mix up the current "autopilot" with the software used in the "full self driving" software package. The traditional Autopilot relied mostly on radar, which is known to not work well with non-moving obstacles. As radar doesn't have much spatial resolution, it relies on the Doppler effect to separate moving obstacles (slower cars) from the background, which could be anything from a stopped car to trees or bridges at the roadside.

The FSD software package so far has been rolled out as a beta only, to a few selected users. It is based entirely on vision and aims to properly identify everything on the road. The collisions with solid obstacles all happened with the previous Autopilot version. In the new cars currently sold, the radar has been replaced by a pure vision implementation close to the FSD software, so it remains to be seen what level of safety can be achieved with that.
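
For a feel of the Doppler point above, a sketch assuming a typical 77 GHz automotive radar: relative speed shows up as a frequency shift in the return, and a stopped car closes on you at exactly the same rate as the bridges and trees, so it produces the same shift as the background clutter that gets filtered out:

    C = 3e8           # speed of light, m/s
    F_CARRIER = 77e9  # common automotive radar band, Hz

    def doppler_shift_hz(closing_speed_mps: float) -> float:
        # Two-way Doppler shift for a monostatic radar: f_d = 2 * v * f / c
        return 2 * closing_speed_mps * F_CARRIER / C

    # At 30 m/s (~108 km/h), a stopped car and a roadside bridge both close
    # at exactly 30 m/s, so they produce the same ~15.4 kHz shift; a slower
    # car ahead doing 25 m/s closes at only 5 m/s and stands out from clutter.
    for label, v in [("static world / stopped car", 30.0), ("slower car ahead", 5.0)]:
        print(f"{label}: closing {v} m/s -> {doppler_shift_hz(v) / 1e3:.1f} kHz")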


The Autopilot is also moving away from radar. In the US, new Model 3s and Model Ys don't have radar anymore.


That is what my last sentence was referring to. It will be interesting to see how it behaves.


Did you watch their presentation? They're definitely not trying to solve self-driving with just "more data"


All radar-based lane-keeping solutions have this problem. That is why Tesla is moving away from them. The current version of Autopilot no longer has the same issue. If you buy a Model Y or Model 3 in the US, it has no radar anymore.


> radar-based lane-keeping solutions

Nobody uses radar for lane keeping. It can't see the lane lines.


I am talking about the overall architecture of your typical highway-grade lane-keeping solution.

Pretty much every manufacturer now has one, and they always use a forward-facing camera and radar in combination. Of course the radar can't see lanes, but it's essential for seeing whether there are cars ahead and how fast they are going.

All of these solutions suffer from the same problem of the radar not being good at detecting stationary objects. That is where you have to do more with the camera data or add some form of lidar.

I think one car (an Audi, maybe) uses solid-state lidar. Tesla has switched to vision only for the most part. No company yet uses active lidar as far as I know.



Which manufacturer is Tesla using for their dojo chips?


TSMC.


Anyone ever wonder about the possibility of Tesla/Elon entering into semiconductor manufacturing?

I can't imagine this thought hasn't crossed some minds in their boardroom.


Even Apple doesn't do the actual manufacturing. I'm not sure adding another product line with crazy difficult production processes would be a good idea for Tesla.


But Tesla likes to bring production in-house, whereas Apple outsources everything possible. It would not surprise me in the slightest if Tesla is looking to build a fab at some point in the future


I don't think it's profitable to build a leading edge chip fab with only yourself as the customer. At least, nobody else is doing it. Even with Intel's huge sales volumes they are still trying to bring in other customers to use their fabs to share the cost.

So I think getting in the fab business for Tesla would mean trying to be a direct competitor with TSMC, Intel, etc. by making chips for a large number of customers.

That seems way down the list of Tesla's priorities, and of the best uses for hundreds of billions of dollars.


This was my initial reaction as well, but considering the scale of Tesla I thought I should validate it.

Intel reportedly produces roughly 10 million wafers per year [1]. Roughly 70 million new cars are sold per year [2], with Tesla currently accounting for 0.5 million of those [3], with roughly 50% YoY growth and a plan to continue that for multiple years [4].

Cars per wafer is a pretty unclear number: at 5, Tesla is 1% of Intel's volume (today); at 25, 0.2% of Intel's volume. It will take quite a while for Tesla to hit Intel's scale, but the automotive market as a whole might actually be pretty close to it if all new cars start including giant computers.

[1] 884k/month * 12 months/year, from https://www.eenewseurope.com/news/top-five-chip-makers-domin...

[2] https://www.statista.com/statistics/200002/international-car...

[3] https://backlinko.com/tesla-stats (or see [4] but then you have to add up quarters yourself)

[4] https://tesla-cdn.thron.com/static/ZBOUYO_TSLA_Q2_2021_Updat...
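
A minimal version of the arithmetic above, using the round numbers from [1] and [3] (cars-per-wafer is a made-up free parameter, as noted):

    INTEL_WAFERS_PER_YEAR = 10_000_000  # rough figure from [1]
    TESLA_CARS_PER_YEAR = 500_000       # rough figure from [3]

    for cars_per_wafer in (5, 25):
        wafers = TESLA_CARS_PER_YEAR / cars_per_wafer
        share = wafers / INTEL_WAFERS_PER_YEAR
        print(f"{cars_per_wafer} cars/wafer -> {wafers:,.0f} wafers/year "
              f"= {share:.1%} of Intel's output")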


Interesting to think about, but I still don't see it making business sense for Tesla, even if they follow some crazy optimistic curve and take half the car market.

There are already multiple competing companies whose sole job is to make the best fabs at high production rates.

Tesla doesn't even make most of their batteries in-house, which is way more core to their business of EV manufacturing. They partner with Panasonic, LG, CATL, etc. because they are in the business of building out manufacturing capacity for battery cells.


Musk's companies win (usually, not always) by finding niches with sleepy competition and doing great engineering work there.

The semi industry is really far from that.


Surely you mean chip design and not manufacturing.


They already do chip design.

You could s/chip/battery/ and have made the same incredulous statement 5 years ago

Or upholstery, or charging infrastructure, or casting equipment, etc.

Tesla has the capital to go as vertical as they want.


Tesla doesn't make their own casting equipment (yet?). They do custom-order the largest presses, built by Idra Group in Italy (owned by a Chinese company), though [1].

[1] https://en.wikipedia.org/wiki/Giga_Press


No - I mean manufacturing. At a certain level of demand it could make sense, especially if the classes of devices you seek to produce are heterogeneous in their engineering - aerospace vs phone vs car.

The other advantage of vertical integration through the manufacturing piece is turnaround time on new designs. It is a lot faster/easier to spin new test lots for speculative designs when everyone works for the same org.


They're apparently already in chip design, so surely he means manufacturing as the potential new business line?


Can someone explain like I'm five what the massive bandwidth is necessary for?


[My first attempt at an ELI5…] This is needed for massive ML models. GPT-3 for example has 175 billion parameters. The model training job simply won't fit in a single processor's memory by itself. To calculate these models, the bigger job has to be broken up into many smaller jobs across different processors. But, to complete the big job, all of those smaller jobs need to talk to one another from time to time. Massive bandwidth speeds up that crosstalk so that the training can happen faster.
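
A back-of-the-envelope for the "won't fit" part (175B is GPT-3's published parameter count; the ~16 bytes per parameter for mixed-precision Adam-style training is a common rule of thumb, not a GPT-3-specific figure):

    PARAMS = 175e9   # GPT-3 parameter count
    GPU_MEM_GB = 80  # a large accelerator's memory, for scale

    weights_gb = PARAMS * 2 / 1e9    # FP16 weights alone: ~350 GB
    training_gb = PARAMS * 16 / 1e9  # weights + grads + optimizer state: ~2.8 TB

    print(f"FP16 weights: {weights_gb:,.0f} GB ({weights_gb / GPU_MEM_GB:.1f}x one GPU)")
    print(f"Full training state: {training_gb:,.0f} GB ({training_gb / GPU_MEM_GB:.0f}x one GPU)")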


1.25MB/node is a lot of cache.


“Wide FSD beta in two weeks”, my ass.

Tesla is at least a decade away from the shit Elon is tweeting about.

Plan your next car purchases accordingly.

Sincerely, a Model Y owner


These tech presentations, especially by Karpathy, tend to make me feel overall optimistic about the progress Tesla is making in this space.

But then instead of the presentation ending with "And that's why we're confident in our 5-10 year roadmap to Level 5 Autonomy" they end with Elon tweeting that some beta version is going to roll out to customers in 2-4 weeks and the optimism all comes crashing back down again.


Yeah, the FSD promises are old and annoying. Some of us remember the ridiculous claims starting back in at least 2016. I agree with you that no one should buy this car because of "FSD".


Also: Tesla can make a nice car, but doesn't like doing so very often. Please look up the pre-delivery checklists.

Tesla is not like a regular manufacturer where you can walk around a vehicle and determine if you want to accept it. You have to actually test things - even the things that we've been doing right since the Model T. These things are built in tents by staff that is treated quite poorly, and you should be skeptical in proportion to that information.

Here is a good example: https://github.com/polymorphic/tesla-model-y-checklist

Although I noticed it's missing "check that the roof and trunk are in fact waterproof".

Getting any of these things fixed after you take delivery is a huge PITA.


Checklists like that make things sound much worse than they are. Common issues are really minor compared to the things listed there.

And, tbh, hang around some forums for the company that has been "doing right since the Model T" and you'll see some of the same things. Roof alignment in particular has been a little bit of a problem for the Mach E.

With both companies, there are some horror stories after delivery too.

With both companies, the vast majority of cars are great and end up with really happy customers.


The Model S Plaid just got amazing reviews from pretty much every reviewer.

Checklists make sense for all manufacturers.

> These things are built in tents by staff that is treated quite poorly, and you should be skeptical in proportion to that information.

People are obsessed with this tent. It's just a simple, stable structure that you can put up quickly. It changes nothing about the manufacturing line being in a building.

The tent line was set up by a guy with 30 years of manufacturing experience at BMW and other German carmakers. They knew what they were doing, and it's actually the cars from the tent that improved the quality problems they had early on.

> Getting any of these things fixed after you take delivery is a huge PITA.

Most of these things can be fixed by the mobile service; they can do it while you are not even there. Often they fix it while you work.


You're blowing things out of proportion lol. Teslas are extremely loved cars by their owners on average. I doubt they would be so highly rated if they had regular issues as you're suggesting.


"A decade" is just a random guess by some guy online but fine, I guess I will just switch to buying a different car with FSD instead...


You should sell your model Y. Not only does it sound like you don’t like it, you can probably turn a profit over your purchase price.


Contrary to popular belief it is quite possible to buy and enjoy a Tesla as a vehicle to drive without buying in to the Full Self Driving promise.


This website is not loading.


How ironic... James is VP and DE at AWS...


Any document on what they plan to do with all that heat afterwards? Warm water for the lavatories?


Tesla can generate a whole lot of static 3D world data, location-wise, that can be precomputed, so that the car's computer only has to augment it with dynamic data, and hence has a much lower compute requirement.


That's not the approach they are taking, though. There are no maps; only the trained NN is deployed to the vehicles. Making the system handle any unpredictable scenario without ever relying on precomputed maps is likely far safer.


> Making the system handle any unpredictable scenario without ever relying on precomputed maps is likely far safer.

Sounds rather counterintuitive to me. I feel far safer driving on my regular commute because I know where the potholes are, and where the blind junctions are despite the missing signs. I know exactly how fast I can go around the roundabout built with a ridiculous camber that in wet weather sends many people into the ditch every year, etc.

Knowing all this, I can bias my attention to the cars/people around me rather than the environment.


There are no 0-day maps for potholes. There are no 0-day maps for temporarily disabled traffic lights, road construction, accidents, etc.

It takes at least a decade to build up something like Google Maps, and a lot of the data is already out of date.


Yes, but a lot also rarely changes. I wonder if they ever considered a hybrid approach that uses some of the more stable map data as a default which can be overridden by the computer vision system.


I reckon Tesla will eventually do this, and also eventually incorporate LIDAR. But they're attacking the hardest problem first (in the sense of "no one knows how to do this").


Route not drive


Sure… so OP biases his attention to cars/people/any unexpected changes to the route…


They do use maps but not HD maps.


OK, I hope the black box is working. Is there any research on how two trained neural networks can be merged?


If we could do that, could we merge brains?


I am surprised that massive computation in one place is currently the best performing approach to this.

We are searching for an algorithm which turns camera data into a vector model, right?

This sounds very parallelizable.

Have a lot of computers travel the search space individually. When one finds an algorithm better than the current best solution, have it call the others: "Hey guys, this algo works better than what we got so far. Everyone iterate on this one now!"


Stochastic gradient descent is not embarrassingly parallel, because it's an iterative process where each step depends on the previous one. If you figure out a better optimization algorithm that is more parallelizable, the world will beat a path to your door.
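
For the "iterative process" point, a minimal sketch of plain SGD on a toy linear model: each step reads the weights written by the previous step, which is exactly the serial dependency in question.

    import numpy as np

    def grad(w, x, y):
        # Squared-error gradient for a linear model y_hat = w @ x.
        return 2 * (w @ x - y) * x

    def sgd(w, data, lr=0.01, steps=2000, seed=0):
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            x, y = data[rng.integers(len(data))]
            w = w - lr * grad(w, x, y)  # step t needs the w produced by step t-1
        return w

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -3.0])
    data = [(x, float(w_true @ x)) for x in rng.standard_normal((200, 2))]
    print(sgd(np.zeros(2), data))  # converges to ~[2, -3]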


Actually, nope! SGD is embarrassingly parallel when it's HOGWILD! https://arxiv.org/abs/1106.5730. Formally, the hogwild and serial implementations are indistinguishable. This has a lot of citations but it's such an easy trick that no door path beating seems to have happened. The "trick" is just to make model updates atomic, and then each SGD step can happen in parallel without coordination or locks.
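
A toy sketch of the lock-free scheme being described (Python threads under the GIL are a poor stand-in for the paper's shared-memory multicore setting, and the paper's guarantees rely on sparse updates, but the structure is the same: every worker reads and writes the shared weights with no locks):

    import threading
    import numpy as np

    w = np.zeros(2)  # shared weights; no lock anywhere

    def worker(data, lr=0.01, steps=2000, seed=0):
        global w
        rng = np.random.default_rng(seed)
        for _ in range(steps):
            x, y = data[rng.integers(len(data))]
            g = 2 * (w @ x - y) * x  # gradient from a possibly-stale read of w
            w = w - lr * g           # racy, unsynchronized update

    rng = np.random.default_rng(1)
    w_true = np.array([2.0, -3.0])
    data = [(x, float(w_true @ x)) for x in rng.standard_normal((200, 2))]

    threads = [threading.Thread(target=worker, args=(data,), kwargs={"seed": i})
               for i in range(4)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(w)  # still lands near [2, -3] despite the races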


Sorry, all Hogwild does is remove locks. It still needs constant global communication (global shared memory is assumed in the paper). That's not embarrassingly parallel.


As far as I know, hogwild doesn't really scale to very large systems and doesn't get used in practice, but maybe that's just because it's most useful on CPUs, which are inefficient for other reasons


And it is not like the model's performance is actually known at every step


So we build self driving software by creating one big brain that does gradient descent.

Would we have been more effective if we built other things like this? 400,000 people were involved in the Apollo program. Would it have been better if we had just one person with a really big brain? How about the Linux kernel and Wikipedia?

And why should I figure out a better algorithm? Shouldn't that be the job of the one guy with the biggest brain who comes up with everything?


Uhm it's a lot easier to make a single supercomputer a million times faster than it is to make a single brain a million times cleverer.


What is the difference?

Couldn't evolution have settled on bigger brains if they are an advantage? Why all that slow interpersonal communication if it is more efficient to have the combined thought processes inside a single brain?


> Couldn't evolution have settled on bigger brains if they are an advantage?

Biological brains are limited by other factors. Human brain size in particular is limited by factors like needing to fit through a human birth canal without wrecking an upright walking gait ([0]) and being able to be powered by a hunter-gatherer diet.

[0] - losing the upright walking/running gait would have limited certain ecological options (like persistence hunting). Instead, human gestation is shorter than it should be given our body size, when you compare it to other mammals.


I think you might be interested in learning about Amdahl's law:

https://en.wikipedia.org/wiki/Amdahl%27s_law

Some parts of an algorithm are parallelizable, but only up to a certain limit; and that's where the endgame bottleneck is: the last remaining 'in series' parts and the communication overhead for the parallelized parts.

Note that their Dojo system has absolutely massive networking equipment: that's for having a better endgame at Amdahl's law.
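
Plugging numbers into Amdahl's law makes the point vivid: if p is the parallelizable fraction of the work, the speedup on N workers is 1 / ((1 - p) + p/N), so even a 99%-parallel job tops out at 100x no matter how much hardware you throw at it:

    def amdahl(p: float, n: int) -> float:
        # p: parallelizable fraction of the work; n: number of workers.
        return 1.0 / ((1.0 - p) + p / n)

    for p in (0.90, 0.99, 0.999):
        row = ", ".join(f"N={n}: {amdahl(p, n):7.1f}x" for n in (10, 100, 10_000))
        print(f"p={p}: {row}")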

----

On another note, one can modify the algorithm itself so that it has more inherent parallelism. For example, DNA sequencing: you match short reads, then send them off to be reconciled together. The way to get more parallelism is to generate more matches (by lowering the threshold for matching). Having more matches means more communication cost, but better parallelism.

For gradient descent, one of the tricks is learning in batches of input data: the neural network doesn't learn from the freshest point of view, but the work can be done in parallel (also, the way to aggregate new knowledge is to average the weights, which must limit the amount of things learned).


Evolution doesn't care about brains. They're just an accidental byproduct that happens to improve our chances of survival.


Some things can be optimised in parallel, but other things can’t.

The example my father gave when I was a teen was: “nine women can’t make one baby in one month”.

> And why should I figure out a better algorithm?

They said “if”; if you do, you could get very, very rich.


This is of relevance:

"To get rid of the dependency on the radar sensor for the pilot, we generated over 10 billion labels across two and a half million clips. And so to do that, we had to skill our huge offline neural networks and our simulation engine across 1000s of GPUs, and just a little bit shy of 20,000 CPU cores. On top of that, we also included over 2000, actual autopilot full self driving computers in the loop with our simulation engine. And that's our smallest compute cluster.

So I'd like to give you some idea of what it takes to take our neural networks and move them in the car. And so the the two main constraints that we're working on there here are mostly latency and framerate, which are very important for safety, but also to get proper estimates of acceleration and velocity all of our surroundings.

And so the meat of the problem really is around the AI compiler that we write and extend here within the group that essentially maps the compute operations from a pytorch model, to a set of dedicated and accelerated pieces of hardware. And we do that by figuring out a schedule that's optimized for throughput while working on very severe SRAM constraints.

And so by the way, we're not doing that just on one engine, but across two engines on the autopilot computer. And the way we use those engines here at Tesla is such that, at any given time, only one of them will actually output control commands to the vehicle, while the other one is used as an extension of compute. But those rules are interchangeable, both on the hardware and software level.

So how do we very quickly together as a group to this AI development cycles? Well, first, we have been scaling our capacity to evaluate our software neural network dramatically over the past few years. And today, we are running over a million evaluations per week on any code change that the team is producing. And those evaluations run on over 3000 actual full self driving computers that are hooked up together in a dedicated cluster."


To farm out to cloud computing resources? Very expensive. It isn’t surprising to me that a company with a massive and known compute load would choose to pack that compute resource as efficiently as possible. Supercomputers are still a business.


Cloud computing is also like 100-200% surcharged, even after volume discounting. If you are not rapidly growing with uncertainty over resource requirements, you're going to get far better financial outcomes by taking your infra in-house. Example: Uber.


I'd slightly disagree with this - the real benefits from cloud are in the things you don't need to do. If you're just using the cloud as a normal datacentre (but in another location) you're not going to have a good time.

E.g. using a managed database as opposed to rolling your own saves on patching, management, etc.


If you're not rapidly growing and are paying millions in cloud compute, you should hire your own infra team and start managing your own database, so you can save millions of dollars.


ML training is a latency sensitive problem that benefits from locality: most of the learning algorithms wait for all parallel workers to send gradients back before updating weights. While workarounds exist, life is easier without them.
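
As a minimal sketch of that pattern (synchronous data-parallel training on a toy linear model): the gradient computations fan out, but everyone blocks at the averaging step, the all-reduce, so per-step time is set by the slowest worker plus the network round-trip:

    import numpy as np

    def shard_grad(w, X, y):
        # Mean-squared-error gradient over one worker's shard.
        return 2 * X.T @ (X @ w - y) / len(y)

    def sync_step(w, shards, lr=0.1):
        grads = [shard_grad(w, X, y) for X, y in shards]  # parallel part
        g = np.mean(grads, axis=0)  # all-reduce: every worker waits here
        return w - lr * g           # identical update applied everywhere

    rng = np.random.default_rng(0)
    w_true = np.array([2.0, -3.0])
    X = rng.standard_normal((400, 2))
    y = X @ w_true
    shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 simulated workers

    w = np.zeros(2)
    for _ in range(200):
        w = sync_step(w, shards)
    print(w)  # ~[2, -3]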


Why is this a necessary constraint? (Not trolling, just curious if this has been investigated.)

Our neurons don't have a global clock, and it doesn't seem to be a problem for us. My intuition is that as long as the input changes continuously rather than through random presentation, and at a rate where big changes happen at a lower order of magnitude than the average compute rate, it wouldn't matter all that much in terms of accuracy.





