Dojo Shows off Power

Tesla’s Dojo supercomputer has been in development since before last year’s AI Day, and oof, was there a lot to catch us up on this year.

When unveiled last year, Tesla’s Dojo project was explained as an AI training tool. Basically, the computer would run simulations based on real-world data, and use those to train the AI software that would let Tesla vehicles truly self-drive.

For this task, Tesla is still using huge GPU farms - racks and racks of Nvidia graphics cards, all churning through the data to train the AI. But Nvidia GPUs aren’t really built for this specific work. The GPUs are designed for use in a variety of situations, and so there are bottlenecks that slow down AI training that the average gamer would probably never notice.

For starters, communication slows everything down. The cards need to talk to their own memory. Then they need to pass data between cards in the same rack. Then they need to talk between racks! There’s a lot of data traffic, and while the current tech has been working - it’s working too slowly. Currently, Tesla says that training takes about a month. Waiting that long for results that might have to be adjusted and tested again is just too long.
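To get a feel for why those hops matter, here’s a toy Python sketch with completely made-up latency numbers (illustrative only, not Tesla’s or Nvidia’s real figures) that tallies how much of a single synchronization step is spent just moving data between cards and racks.

```python
# Toy model of where time goes during one sync step in a GPU farm.
# All costs are made-up, illustrative numbers, NOT real Nvidia/Tesla figures.

HOPS = {
    "card <-> its own memory": 1,       # relative cost units (hypothetical)
    "card <-> card (same rack)": 10,
    "rack <-> rack": 100,
}

def sync_cost(hops_per_step):
    """Print how much each communication layer contributes to one step."""
    total = sum(hops_per_step.values())
    for hop, cost in hops_per_step.items():
        print(f"{hop:<28} {cost:>4} units ({cost / total:.0%} of the step)")
    print(f"{'total':<28} {total:>4} units")

sync_cost(HOPS)
# The farther data has to travel, the more each step is dominated by
# communication instead of actual training math - the bottleneck Tesla
# is trying to remove.
```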

And so Tesla engineers started designing a purpose-built supercomputer with two goals in mind: density and scalability.

First of all, they designed the individual tiles - the chips that do the processing - to fit together differently. As Tesla explained it, part of the reason other chips hit a bottleneck is that power and data transfer happen in the same direction. With the new configuration, Dojo chips pass data laterally across the wafer they sit in, while power comes in vertically.
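The exact layout and link topology of a Dojo tile weren’t spelled out in the presentation, but here’s a minimal sketch, assuming an N x N grid of dies where data only ever moves to a lateral neighbor and power arrives from directly below, of why the two never compete for the same path.

```python
# Minimal sketch of the "data moves laterally, power comes in vertically" idea.
# The N x N grid and nearest-neighbor links are assumptions for illustration;
# the real Dojo tile layout isn't described in this level of detail.

N = 5  # hypothetical grid of dies on one tile

def lateral_neighbors(row, col):
    """Dies a given die can exchange data with: only its in-plane neighbors."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < N and 0 <= c < N]

# Power is modeled as a separate vertical feed per die, so it never
# shares a path with the data links above.
die = (2, 2)
print("data links for die", die, "->", lateral_neighbors(*die))
print("power feed for die", die, "-> vertical, from directly underneath")
```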

There’s a lot of much more technical work happening in the integration and construction of the Dojo racks themselves, but this density mindset is mirrored at every step of the process. By the time a full Dojo rack comes together, it has enough compute power to fully replace six Nvidia GPU racks.

At that rate, Dojo takes that one-month training time and reduces it to less than a week. Operations that take the normal GPU racks about 150 microseconds, Dojo can complete in just 5. Benchmark tests show the Dojo racks beating the Nvidia ones by more than four times when parsing real-world data. That’s incredible.
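Those figures come straight from the presentation; a quick back-of-the-envelope in Python shows how they relate (the “one month” and “under a week” durations are the rough values Tesla quoted, not precise measurements).

```python
# Back-of-the-envelope on the speedups Tesla quoted.

gpu_op_us, dojo_op_us = 150, 5           # microseconds per operation, as quoted
print(f"per-operation speedup: {gpu_op_us / dojo_op_us:.0f}x")   # 30x

gpu_training_days = 30                   # "about a month"
dojo_training_days = 7                   # "less than a week" (upper bound)
print(f"end-to-end training speedup: at least "
      f"{gpu_training_days / dojo_training_days:.1f}x")

# End-to-end gains are smaller than the raw per-operation gain because a
# training run isn't just those operations - which is also why the
# real-world benchmark lands around the 4x mark rather than 30x.
```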

And it Just. Keeps. Scaling.

Tesla engineers are working on a system for cross-cabinet data communication, so once a cabinet full of Dojo racks is grouped with other cabinets into a unit of 10, they can all talk to each other with as little bottlenecking as possible. Tesla is calling this configuration the ExaPod, because it produces one exaflop of compute power.

A flop is a floating point operation - basically a single calculation performed by a computer. To put an exaflop into context, imagine a billion people, each holding a billion calculators, all pressing the “equals” button at once. It’s a bit of an oversimplification, but it should give you an idea of what just one of these Dojo groups could handle.
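The calculator analogy is just 10^18 written out; a couple of lines of Python make the numbers explicit (the per-cabinet figure assumes the exaflop is spread evenly across the ExaPod’s 10 cabinets, which Tesla didn’t state outright).

```python
# One exaflop = 10**18 floating point operations per second.
people = 1_000_000_000          # a billion people
calculators = 1_000_000_000     # each holding a billion calculators
ops = people * calculators
print(ops == 10**18)            # True: that's the "exa" prefix

# Rough per-cabinet share, assuming the exaflop is split evenly
# across the ExaPod's 10 cabinets (an assumption, not a Tesla figure):
print(f"{ops / 10:.0e} flops per cabinet")   # 1e+17
```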

Tesla’s plan is to build the first full ExaPod by next year, which would replace the 72 GPU server cabinets they use right now - and more than double their training capacity. And they plan to build seven of these monsters at their Palo Alto facility in California.

This was probably the most impressive - but most technical - presentation at AI Day. The Dojo project will supercharge all of Tesla’s other AI work, and it seems like it can only get more and more powerful.
