I was invited to attend Tesla’s AI Day event, and it was an incredible learning experience. Not only did I get to see the Cybertruck, but I also came away knowing far more than when I arrived. AI really isn’t my field, but I believe that every experience provides an opportunity to learn something.
Many people experience Tesla only through its products, by interacting online with Elon Musk, Tesla, and one another, or by reading about it. I was in a unique position to experience it from a different perspective: being present and witnessing history being unveiled. The atmosphere was magnetic, and those attending were not just excited but eager to learn. One thing that stood out to me was how much the employees care about making a difference, and this could be felt at the event.
The invite from Tesla was a complete surprise. What I know about AI, in general, comes from the bits and pieces I’ve learned from covering it as a topic here. I don’t know Python, and math and science were things I struggled with in school, so I honestly felt kind of unqualified to be there, yet extremely grateful to have been included by the leader in AI technology.
Attending AI Day
I think I learned more about the impact of AI while at Tesla’s headquarters than I had in my entire life, and this is just the tip of the iceberg. An easier way to explain what we saw is as a behind-the-scenes look at what your car can do. Andrej Karpathy described this as the “under the hood components,” and the visual that popped into my head reflected just how much Tesla is changing the automotive industry with its technology. It also reflected what Elon Musk said at the beginning of the presentation.
“What we want to show today is that Tesla is much more than an electric car company; that we have deep AI activity, in hardware, on the inference level, in the training level,” Elon said as he opened the event. He also said that he thought, arguably, that Tesla was the leader in real-world AI. “Those of you who have seen the FSD Beta can appreciate the way that the neural net is learning to drive.” Elon explained that this was just one application of real-world AI.
In the vision component, Karpathy explained, Tesla is trying to design a neural network that processes the raw information. For Tesla, this comes from the 8 cameras positioned around the vehicle that send images which need to be processed in real time into the vector space, which is a three-dimensional representation of everything you need for driving.
For those who are not familiar with the term, a neural network (NN) is a network or circuit of neurons. There are two types of NNs: biological and artificial. In AI, the neurons are artificial and are also called nodes, and NNs are used to solve problems. Tesla’s NN focuses on predicting the vector space, not an image space. Another term to learn: a vector space is a set of objects called vectors, which may be added together and multiplied (“scaled”) by numbers called scalars, which are often real numbers but can be complex.
Tesla’s vector space is the three-dimensional space of lines, edges, curbs, traffic signs, traffic lights, cars, the positions and orientations of the cars, depths, velocity, etc. Karpathy showed video of the raw images that come into the stack, where the images are then processed into the vector space. You can see some of the vector space rendered on the display in the car.
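To make that definition a little more concrete, here is a minimal Python sketch of the two operations that define a vector space: adding vectors and scaling them. The three-component tuples and the helper names `add` and `scale` are my own illustrative assumptions, not anything Tesla showed.

```python
# Illustrative only: plain tuples standing in for 3D vectors.
def add(u, v):
    """Add two vectors component by component."""
    return tuple(a + b for a, b in zip(u, v))

def scale(c, v):
    """Multiply every component of a vector by the scalar c."""
    return tuple(c * a for a in v)

position = (1.0, 2.0, 0.0)  # hypothetical x, y, z values
offset = (0.5, -1.0, 0.0)   # hypothetical displacement

print(add(position, offset))  # (1.5, 1.0, 0.0)
print(scale(2.0, offset))     # (1.0, -2.0, 0.0)
```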
Tesla Is Literally Bringing Cars To Life With AI
Karpathy explained that Tesla was creating, in a sense, an animal.
“What I find kind of fascinating about this is we are effectively building a synthetic animal from the ground up. So the car can be thought of as an animal. It moves around, it senses the environment, acts autonomously and intelligently, and we are building all of the components from scratch and in-house.”
So, not only is Tesla building cars, but it is bringing them to life with artificial intelligence.
“When we designed the visual cortex of the car, we also wanted to design the neural architecture of how the information flows in the system.”
Karpathy then shared a demonstration of just how Tesla is bringing its vehicles to life with AI. I suggest you watch the replay a few times to fully absorb the information.
Tesla’s Dojo Supercomputer
Ganesh Venkataramanan, Tesla’s senior director of Autopilot hardware and the leader of the Dojo project, unveiled Tesla’s long-awaited Dojo supercomputer. I honestly don’t know what was more exciting: the reveal of Dojo or Venkataramanan’s joyous grin when he presented an actual Dojo training tile to us.
“It’s an honor to present this project on the behalf of the multidisciplinary Tesla team that is working on this project.”
He explained that Elon Musk asked the team to design a superfast training computer a few years ago and this was the birth of Project Dojo.
“There’s an insatiable demand for speed as well as capacity for neural network training.
“Our goal is to achieve best AI training performance and support all these larger, more complex models Andrej’s team is dreaming of, and be power efficient and cost effective at the same time. So we thought about how to build this and we came up with a distributed compute architecture. After all, all the training computers are distributed computers in one form or the other.”
He explained that the compute elements are connected with some type of network, and in Tesla’s case, it’s a two-dimensional network. However, the same holds for any network of CPUs, GPUs, or accelerators. He identified a common trend: it’s easy to scale compute, difficult to scale bandwidth, and extremely difficult to reduce latency.
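As a rough illustration of why latency is the hard part in a two-dimensional network, here is a small Python sketch of a 2D mesh in which each element talks only to its immediate neighbors. The 4×4 grid size and the simple neighbor-to-neighbor routing are my assumptions for the example; the presentation did not describe Dojo’s routing in this detail.

```python
def neighbors(x, y, width, height):
    """Return the in-bounds left/right/up/down neighbors of node (x, y)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < width and 0 <= j < height]

def hops(src, dst):
    """Minimum hop count between two mesh nodes (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

WIDTH, HEIGHT = 4, 4  # hypothetical mesh size

print(neighbors(0, 0, WIDTH, HEIGHT))  # a corner node has only two neighbors
print(hops((0, 0), (1, 0)))            # adjacent nodes: 1 hop
print(hops((0, 0), (3, 3)))            # opposite corners: 6 hops
```

Adding more nodes adds compute, but the farthest pair of nodes also gets farther apart, which is one way to picture why keeping communication local matters.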
How Tesla’s Philosophy and Design Point Is Reflected In Dojo
Venkataramanan explained that Tesla’s philosophy and design point catered to the challenges of the traditional limits.
“For Dojo, we envisioned a large compute plane filled with very robust compute elements packed with large pool of memory and interconnected with very high bandwidth and low latency fabric and in a 2D mesh format. And onto this, for extreme scale, the neural networks will be partitioned and mapped to extract different parallelism — model, graph, data parallelism.
“And then, a neural compiler will explore spatial and temporal localities such that it can reduce communication footprint to local zones and reduce global communications. And if we do that, our bandwidth utilization can keep scaling with the plane of compute that we desire out here.”
Venkataramanan then explained each of the aspects of Dojo, starting with the chip. The chips have compute elements and Tesla’s smallest entity of scale is called a training node, he explained. The reason for the size is that it performs better and doesn’t have any memory bottleneck issues. The team wanted to address bandwidth and latency as a primary optimization point and he explained just how the team achieved this.
They took the farthest distance that a signal could traverse in one cycle of a very high clock frequency, two gigahertz plus. Next, they drew a box around it and noted that this sets the smallest latency a signal can have at that frequency. Then they filled up the box with wires to create the highest bandwidth that you can feed the box with. After this, they added machine learning compute and a large pool of SRAM. The final addition was a programmable core to control it, which gave Tesla its high-performance training node.
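To get a feel for the scale of that box, here is a back-of-the-envelope Python sketch. It assumes exactly 2 GHz (the talk said “two gigahertz plus”) and uses the speed of light in a vacuum as a loose upper bound. Real on-chip signals travel far more slowly than that, and the actual box dimensions were not given, so treat the distance as a ceiling rather than Tesla’s number.

```python
CLOCK_HZ = 2e9                          # assumed: 2 GHz, the floor of "two gigahertz plus"
SPEED_OF_LIGHT_M_PER_S = 299_792_458    # vacuum speed of light, an upper bound only

period_s = 1 / CLOCK_HZ                 # time available in one clock cycle
max_distance_m = SPEED_OF_LIGHT_M_PER_S * period_s

print(f"clock period: {period_s * 1e9:.2f} ns")
print(f"light-speed upper bound per cycle: {max_distance_m * 100:.1f} cm")
```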
The training node is a 64-bit superscalar CPU optimized around matrix multiply units and vector SIMD. It supports 32-bit floating point (FP32), brain floating point 16 (BFP16), and a new format, configurable FP8 (CFP8). He added that it is backed by 1.25 megabytes of fast ECC-protected SRAM along with the low-latency, high-bandwidth fabric the team designed. A single node, the smallest entity of scale, has over 1 teraflop of compute.
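For readers wondering what the brain floating point format means in practice, here is a hedged Python sketch. The format keeps the top 16 bits of a 32-bit float (1 sign bit, 8 exponent bits, 7 fraction bits), so it covers the same range with less precision. This sketch simply truncates the lower bits, which is a simplification; hardware typically rounds instead. It says nothing about how Dojo implements the format, and the function names are my own.

```python
import struct

def to_bfloat16_bits(x):
    """Return the 16-bit pattern for x by truncating its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16

def from_bfloat16_bits(bits16):
    """Expand a 16-bit pattern back into a Python float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value

x = 3.14159
approx = from_bfloat16_bits(to_bfloat16_bits(x))
print(x, "->", approx)  # 3.14159 -> 3.140625
```

The lost digits are the price of halving the storage, which is the trade the lower-precision training formats make.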
The Training Node Architecture
He explained that it is a superscalar in-order CPU with four-wide scalar and two-wide vector pipes. It has four-way multithreading, which increases utilization by allowing compute and data transfer to happen simultaneously. It also has a custom instruction set architecture (ISA) with features such as transpose, gather, and link traversals. Venkataramanan explained that in the physical realm, the team made it extremely modular.
This allows the team to begin abutting the training nodes in any direction, which forms the compute plane that the team envisioned. When they joined together 354 of the training nodes, he explained, the array became capable of delivering over 362 teraflops of machine learning compute.
The high-bandwidth interconnect fabric runs at 10 terabytes per second per direction. Next, the team surrounded the compute array with 576 high-speed, low-power serializer/deserializer (SerDes) lanes. This enables extreme I/O bandwidth to come out of the chip. In comparison, this is over twice the bandwidth that comes out of today’s state-of-the-art networking switch chips, which are considered the gold standard for I/O bandwidth.
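A quick sanity check of those figures, sketched in Python: dividing the quoted 362 teraflops by 354 nodes should land near the “over 1 teraflop” per node mentioned earlier. Both inputs are the rounded numbers from the presentation, so the result is approximate.

```python
NODES_PER_CHIP = 354  # training nodes joined together, per the presentation
CHIP_TFLOPS = 362     # "over 362 teraflops", so treat this as a lower bound

per_node_tflops = CHIP_TFLOPS / NODES_PER_CHIP
print(f"{per_node_tflops:.3f} teraflops per node")  # 1.023 teraflops per node
```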
Altogether, this is Tesla’s training optimized chip, or D1 chip. The D1 chip, he explained, is manufactured in 7-nanometer technology. It has 50 billion transistors.
“This is a pure machine learning machine.”
With a beaming expression on his face, Venkataramanan pulled out a Tesla D1 chip for us to see. The D1 chip was entirely designed by Tesla’s team, and he described it as GPU-level compute with CPU-level flexibility and twice the network-chip-level I/O bandwidth.
Designing The System Around The D1
Venkataramanan explained that since D1 chips can seamlessly connect to one another without any glue, the team just started connecting them. The team connected 1,500 D1 chips together, which translates to more than 500,000 training nodes, to create Tesla’s compute plane. They didn’t stop there!
The team added Dojo interface processors on each end, a host bridge with high bandwidth, and more. Along with that, the interface processor enables Tesla to have a higher radix network connection.
“In order to achieve this compute plane, we had to come up with a new way of integrating these chips together.”
The Training Tile
The training tile is Tesla’s unit of scale for its system. It’s a groundbreaking integration of 25 known-good D1 dies onto a fan-out wafer process, integrated so tightly that it preserves the bandwidth between the dies. In addition, the team created a high-bandwidth, high-density connector that preserves the bandwidth coming out of the training tile.
“And this tile gives us nine petaflops of compute with a massive I/O bandwidth coming out of it. This, perhaps, is the biggest organic MCM in the chip industry. Multi-chip module.
“It was not easy to design this. There were no tools that existed.
“Our engineers came up with different ways of solving this. They created new methods to make this a reality.”
Next was to feed the new training tile with power and the engineers had to create a new way of doing that.
“Vertically, we created a custom voltage regulator module that could be reflowed directly onto this fan out wafer. So what we did out here is we got chip, package, and we brought PCB level technology reflow onto this fan out wafer technology. This is a lot of integration already out here, but we didn’t stop here.
“We integrated the entire electrical thermal mechanical pieces out here to form our training tile fully integrated. Interfacing with a 52-volt DC input. It’s unprecedented. This is an amazing piece of engineering. Our compute plane is completely orthogonal to power supply and cooling. That makes high bandwidth compute planes possible.”
“It’s a 9 petaflop training tile. This becomes our unit of scale for our system. And this, is real.”
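The 9 petaflop figure lines up with the chip numbers given earlier. Here is a short Python check, assuming 25 D1 chips per tile at the quoted 362 teraflops each. Both inputs are rounded figures from the presentation.

```python
CHIPS_PER_TILE = 25  # D1 chips integrated into one training tile
CHIP_TFLOPS = 362    # quoted compute per D1 chip, in teraflops

tile_pflops = CHIPS_PER_TILE * CHIP_TFLOPS / 1000  # 1 petaflop = 1,000 teraflops
print(f"{tile_pflops:.2f} petaflops per tile")     # 9.05 petaflops per tile
```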
I Witnessed History
I think that when Venkataramanan pulled out the Dojo 9 petaflop training tile, it marked a historic moment. Not only could I see the joy and pride on his face as he showed the world Tesla’s next-level compute plane, but I felt awe at being there and watching it unfold in front of me.
Tesla’s lead on legacy auto just increased by 9 petaflops.