Tesla CEO Elon Musk recently unveiled the company's Tesla Bot. The robot, code-named Optimus, shuffled across a stage, waved its hand, and pumped its arms in a slow-speed dance move. Musk predicts the robot could cost $20,000 within three to five years if all goes according to plan. But the question is, what can it actually do for us? Before we get into that, let's look at the main devices that drive the Tesla Bot.
Tesla Bot Actuators
Actuators are the main drive system for any robot. You could say a robot is nothing more than a PC with moving parts; in other words, a robot is a PC with actuators and sensors. Tesla has developed its own actuators for the Bot, using three types of rotary actuators and three types of linear actuators.
If you are wondering why Tesla didn't use standardized linear actuators like the FIRGELLI actuator, it's because the Bot has several constraints that force Tesla to develop its own systems: the robot has to be extremely lightweight, power-efficient, high in power density, and low in cost. Tesla has said it wants the Bot to retail for $20,000. That in itself is a tall order for something that's going to require 28 actuators (plus the hands), a powerful PC, lots of sensors, a battery pack that lasts more than a few hours, plus a strong skeleton to hold it all together.
Tesla Bot Linear Actuators
The linear actuators Tesla developed are highly tailored to one specific role, which means they would not be of much use in any application other than a robot. Tesla describes them as using a planetary roller screw, which is essentially a refinement of the ball-screw/lead-screw concept, and instead of a traditional brushed motor with a commutated armature they use a brushless motor design. The roller-screw drive is very efficient and uses less power, but it is also more expensive, and the brushless motor means the lifespan will be significantly longer and allows highly specific drive modes to be controlled in software.
The length of travel is only about 2 inches, and as the demonstration showed, the actuator lifted a 500 kg piano, which is a lot of weight. You may wonder why it needs to lift so much. The reason is that when the actuator is installed in a metal skeleton, its short stroke has to be amplified into the much larger motion of the limb it drives. If it's moving the leg of a robot, the leg needs to swing through roughly 150 degrees, so over a 2-foot limb the foot sweeps through an arc of about 3 feet. The human body, which has evolved over hundreds of thousands of years, lets us do this with our leg muscles, but getting a linear actuator to do it is no easy task. So the point is that even though the actuator can lift 500 kg over 2 inches, once it is connected to a lever the force is reduced significantly, depending on the leverage ratio, while the speed increases by the same factor, which makes for a useful trade-off.
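To put that trade-off in numbers, here is a rough back-of-the-envelope sketch in Python. Only the roughly 500 kg lift figure comes from Tesla's demo; the arm lengths and actuator speed below are assumptions chosen purely for illustration.

```python
# Rough illustration of the lever trade-off described above (not Tesla's
# actual geometry): a linear actuator pushes on a short input arm at the
# joint, and the limb acts as a much longer output arm from the same joint.

ACTUATOR_FORCE_N = 500 * 9.81    # ~500 kg lift from the piano demo, in newtons
ACTUATOR_SPEED_MS = 0.05         # assumed rod speed, metres per second
INPUT_ARM_M = 0.05               # assumed joint-to-actuator attachment distance
OUTPUT_ARM_M = 0.60              # assumed joint-to-foot distance (~2 ft)

leverage_ratio = OUTPUT_ARM_M / INPUT_ARM_M            # 12:1 in this example
joint_torque_nm = ACTUATOR_FORCE_N * INPUT_ARM_M       # torque produced at the joint
force_at_foot_n = ACTUATOR_FORCE_N / leverage_ratio    # force drops by the ratio...
speed_at_foot_ms = ACTUATOR_SPEED_MS * leverage_ratio  # ...while speed rises by it

print(f"leverage ratio: {leverage_ratio:.0f}:1")
print(f"joint torque:   {joint_torque_nm:.0f} N*m")
print(f"force at foot:  {force_at_foot_n:.0f} N (~{force_at_foot_n / 9.81:.0f} kg equivalent)")
print(f"speed at foot:  {speed_at_foot_ms:.2f} m/s")
```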
Tesla Bot Presentation
Here is what Tesla themselves had to say at the latest Bot presentation, given on September 30th, 2022:
Elon Musk presents: We've got some really exciting things to show you; I think you'll be pretty impressed. I do want to set some expectations with respect to our Optimus robot. As you know, last year it was just a person in a robot suit, but we've come a long way, and compared to that I think it's going to be very impressive. We're going to talk about the advancements in AI for full self-driving, as well as how they apply more generally to real-world AI problems like a humanoid robot, and even going beyond that. I think there's some potential that what we're doing here at Tesla could make a meaningful contribution to AGI, and I think Tesla is actually a good entity to do it from a governance standpoint, because we're a publicly traded company with one class of stock, and that means the public controls Tesla. I think that's actually a good thing: if I go crazy, you can fire me. That's important. Maybe I'm not crazy, I don't know. So we're going to talk a lot about our progress in AI and Autopilot, as well as the progress with Dojo, and then we're going to bring the team out and do a long Q&A so you can ask tough questions, whatever you'd like, existential questions, technical questions. We want to have as much time for Q&A as possible, so with that, let's get started.
Hey guys, I'm Milan, I work on Autopilot. And I'm Lizzie, a mechanical engineer on the project as well. Okay, should we bring out the Bot? This is the first time we've tried this robot without any backup support: no cranes, no mechanical mechanisms, no cables, nothing. Ready? Let's go. By the way, the Bot is running the same self-driving computer that runs in your Tesla cars. This is literally the first time the robot has operated without a tether, and it was on stage tonight. The robot can actually do a lot more than we just showed you; we just didn't want it to fall on its face, so we'll show you some videos now of the robot doing a bunch of other things which are less risky.
We wanted to show a little bit more of what we've done over the past few months with the Bot: just walking around and dancing on stage. Humble beginnings, but you can see the Autopilot neural networks running as-is, just retrained for the Bot directly on that new platform. That's my watering can. You can see a rendered view; that's the world the robot sees, and it's very clearly identifying objects, like the object it should pick up, and picking it up. We use the same process as we did for Autopilot: collect data and train neural networks that we then deploy on the robot. That's an example that illustrates the upper body a little bit more, something that we'll try to nail down to perfection over the next few months. This is a real station in the Fremont factory that it would be working at.
That's not the only thing we have to show today. What you saw was what we call Bumble C, our rough development robot built with semi-off-the-shelf actuators, but we've actually gone a step further than that already. The team has done an incredible job, and we actually have an Optimus bot with fully Tesla-designed actuators, battery pack, control system, everything. It wasn't quite ready to walk, but I think it will walk in a few weeks. We wanted to show you the robot, something that's actually fairly close to what will go into production, and show you all the things it can do, so let's bring it out.
We expect Optimus production unit one to have the ability to move all the fingers independently, with a thumb that has two degrees of freedom, so it has opposable thumbs on both the left and right hand and is able to operate tools and do useful things. Our goal is to make a useful humanoid robot as quickly as possible, and we've designed it using the same discipline that we use in designing the car, which is to say design it for manufacturing, such that it's possible to make the robot in high volume, at low cost, with high reliability. That's incredibly important. You've all seen very impressive humanoid robot demonstrations, and that's great, but what are they missing? They're missing a brain; they don't have the intelligence to navigate the world by themselves, and they're also very expensive and made in low volume. Optimus, by contrast, is designed to be an extremely capable robot made in very high volume, probably ultimately millions of units, and it is expected to cost much less than a car. I would say probably less than twenty thousand dollars would be my guess.
The potential for Optimus is, I think, appreciated by very few people. As usual, Tesla demos are coming in hot. The team has put in an incredible amount of work, working seven days a week, burning the 3 a.m. oil, to get to the demonstration today. I'm super proud of what they've done; they've really done a great job, and I'd just like to give a hand to the whole Optimus team. Now, there's still a lot of work to be done to refine Optimus and improve it; obviously this is just Optimus version one. That's really why we're holding this event: to convince some of the most talented people in the world, like you guys, to join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people. The potential really boggles the mind, because you have to ask, what is an economy? An economy is productive entities times productivity: capita times output per capita. At the point at which there is no limitation on capital, it's not clear what an economy even means; an economy becomes quasi-infinite. Taken to fruition, in the hopefully benign scenario, this means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it.

Obviously we want to make sure that transformation is a positive and safe one, but that's also why I think Tesla as an entity doing this, being a single class of stock, publicly traded, owned by the public, is very important and should not be overlooked. I think this is essential, because if the public doesn't like what Tesla is doing, the public can buy shares in Tesla and vote differently. This is a big deal. It's very important that I can't just do what I want; sometimes people think that, but it's not true. It's very important that the corporate entity that makes this happen is something that the public can properly influence, and so I think the Tesla structure is ideal for that.

Like I said, self-driving cars will certainly have a tremendous impact on the world. I think they will improve the productivity of transport by at least a half order of magnitude, perhaps an order of magnitude, perhaps more. Optimus, I think, has maybe a two order of magnitude potential improvement in economic output; it's not clear what the limit actually even is. But we need to do this in the right way, carefully and safely, and ensure that the outcome is one that is beneficial to civilization and one that humanity wants; this is extremely important, obviously. So I hope you will consider joining Tesla to achieve those goals. At Tesla we really care about doing the right thing, and we aspire not to pave the road to hell with good intentions; I think the road to hell is mostly paved with bad intentions, but every now and again there's a good intention in there, so we want to do the right thing. So consider joining us and helping make it happen. With that, let's move on to the next phase. Right on, thank you Elon.
All right, so you've seen a couple of robots today; let's do a quick timeline recap. Last year we unveiled the Tesla Bot concept, but a concept doesn't get us very far. We knew we needed a real development and integration platform to get real-life learnings as quickly as possible, so that robot that came out and did the little routine for you guys, we had that built within six months. We've been working on software integration and hardware upgrades in the months since then, but in parallel we've also been designing the next generation, this one over here.

This one is rooted in the foundation of the vehicle design process; we're leveraging all of the learnings we already have. Obviously there's a lot that's changed since last year, but there are a few things that are still the same. You'll notice we still have a really detailed focus on the true human form. We think that matters for a few reasons, and it's also fun: we spend a lot of time thinking about how amazing the human body is. We have this incredible range of motion and typically really amazing strength. A fun exercise: if you put your fingertip on the chair in front of you, you'll notice there's a huge range of motion in your shoulder and your elbow; without moving your fingertip you can move those joints all over the place. But the robot's main function is to do real, useful work, and it maybe doesn't necessarily need all of those degrees of freedom right away, so we've stripped it down to a minimum of 28 fundamental degrees of freedom, plus our hands in addition to that.
Humans are also pretty efficient at some things and not so efficient at others. For example, we can eat a small amount of food and sustain ourselves for several hours, which is great; but when we're just sitting around, no offense, we're kind of inefficient, just burning energy. On the robot platform, we're going to minimize that idle power consumption, drop it as low as possible, and that way we can just flip a switch and immediately the robot turns into something that does useful work.
So let's talk about this latest generation in some detail, shall we? On the screen here you'll see in orange the actuators, which we'll get to in a little bit, and in blue the electrical system. Now that we have our human-based research and our first development platform, we have both research and execution to draw from for this design. Again, we're using that vehicle design foundation, taking it from concept through design and analysis, and then build and validation. Along the way we're going to optimize for things like cost and efficiency, because those are critical metrics for eventually taking this product to scale. How are we going to do that? We're going to reduce our part count and the power consumption of every element possible. We're going to do things like reducing the sensing and the wiring at our extremities; you can imagine that a lot of mass in your hands and feet is going to be quite difficult and power-consumptive to move around. And we're going to centralize both our power distribution and our compute at the physical center of the platform.
In the middle of our torso, and actually it is the torso, we have our battery pack. This is sized at 2.3 kilowatt-hours, which is perfect for about a full day's worth of work. What's really unique about this battery pack is that it has all of the battery electronics integrated into a single PCB within the pack, so everything from sensing to fusing, charge management, and power distribution is all in one place. We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery: streamlined manufacturing, really efficient and simple cooling methods, battery management, and also safety. And of course we can leverage Tesla's existing infrastructure and supply chain to make it.
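A 2.3 kWh pack lasting a full day of work implies a fairly modest average power budget. Here is a quick sanity check of our own (not Tesla's spec); the shift lengths are assumed, and only the 2.3 kWh figure comes from the presentation.

```python
# Rough sanity check: average power draw implied by a 2.3 kWh pack
# lasting "a full day's worth of work" (shift lengths are assumed).

PACK_ENERGY_KWH = 2.3
for shift_hours in (8, 10, 16):
    avg_power_w = PACK_ENERGY_KWH * 1000 / shift_hours
    print(f"{shift_hours:2d} h shift -> ~{avg_power_w:.0f} W average draw")
# 8 h -> ~288 W, 10 h -> ~230 W, 16 h -> ~144 W
```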
Moving on to our brain: it's not in the head, but it's pretty close. Also in our torso we have our central computer. As you know, Tesla already ships full self-driving computers in every vehicle we produce. We want to leverage both the Autopilot hardware and software for the humanoid platform, but because it's different in requirements and form factor, we're going to change a few things. It's still going to do everything that a human brain does: processing vision data, making split-second decisions based on multiple sensory inputs, and also communications. To support communications, it's equipped with wireless connectivity as well as audio support, and it also has hardware-level security features, which are important to protect both the robot and the people around it. So now that we have our core, we're going to need some limbs on this guy, and we'd love to show you a little bit about our actuators and our fully functional hands as well. But before we do that, I'd like to introduce Malcolm, who's going to speak a little bit about our structural foundation for the robot. [Applause] Thank you.
Tesla has the capability to analyze highly complex systems, and it doesn't get much more complex than a crash. You can see here a simulated crash on a Model 3 superimposed on top of the actual physical crash, and it's actually incredible how accurate it is. Just to give you an idea of the complexity of this model: it includes every nut, bolt, and washer, every spot weld, and it has 35 million degrees of freedom. It's quite amazing, and it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world.

So can we utilize our capabilities and our methods from the automotive side to influence a robot? Well, we can make a model, and since we had crash software, we used the same software here: we can make the robot fall down. The purpose of this is to make sure that if it falls down, and ideally it doesn't, it suffers only superficial damage. We don't want to, for example, break the gearbox in its arms; that's the equivalent of a dislocated shoulder for a robot, difficult and expensive to fix. We want it to be able to dust itself off and get on with the job it's been given. We can also take the same model and drive the actuators using the input from a previously solved model, bringing it to life. This produces the motions for the tasks we want the robot to do: picking up boxes, turning, squatting, walking upstairs, whatever the set of tasks is, we can play them through the model. This one is showing just simple walking. From this we can compute the stresses in all the components, and that helps us optimize the components.
These are not dancing robots; this is actually the modal behavior, the first five modes of the robot. Typically when people make robots, they make sure the first mode is up in the high single figures, up towards 10 hertz. Why do they do this? To make the controls for walking easier: it's very difficult to walk if you can't guarantee where your foot is because it's wobbling around. That's okay if you want to make one robot, but we want to make thousands, maybe millions. We haven't got the luxury of making them from carbon fiber and titanium; we want to make them from plastic, and things are not quite so stiff, so we can't hit these high targets. I'll call them dumb targets; we've got to make the robot work at lower targets. Is that going to work? Well, if you think about it, and sorry about this, we're just bags of soggy jelly with bones thrown in; we're not high frequency. If I stand on one leg, I don't vibrate at 10 hertz. People operate at low frequency, so we know the robot can too; it just makes the controls harder. So we take the information from this, the modal data and the stiffness, and feed that into the control system, and that allows it to walk.
Changing tack slightly and looking at the knee: we can take some inspiration from biology and look at what the mechanical advantage of the knee is. It turns out it is represented quite well by a four-bar link, and it's quite non-linear. That's not surprising really, because if you think about bending your leg down, the torque on your knee is much higher when it's bent than when it's straight, so you'd expect a non-linear function, and in fact the biology is non-linear; this matches it quite accurately. So that's the representation; the knee is obviously not physically a four-bar link, as I said, but the characteristics are similar.

Me bending down isn't very scientific, though, so let's be a bit more scientific. We've played all the tasks through this graph: it shows bits of walking, squatting, the tasks I mentioned, with the torque seen at the knee plotted against the knee bend angle on the horizontal axis. This shows the requirement for the knee to do all these tasks, and then we put a curve through it, surfing over the tops of the peaks, and that curve says: this is what's required to make the robot do these tasks.

If we look at the four-bar link, that's the green curve, and it shows that the non-linearity of the four-bar link actually linearizes the force characteristic. What that really means is that it lowers the force, and that's what lets the actuator get away with the lowest possible force, which is the most efficient; we want to burn energy slowly. What's the blue curve? The blue curve is what you would get without a four-bar link, with just an arm sticking out of the leg and an actuator on it, a simple two-bar link. That's the best you could do with a simple two-bar link, and it would create much more force in the actuator, which would not be efficient. So what does that look like in practice? Well, as you'll see, it's very tightly packaged in the knee. The view goes transparent in a second, and you'll see the four-bar link there operating on the actuator; this determines the force and the displacements on the actuator. And now I'll pass you over to Konstantinos.
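The point about knee torque growing with bend angle is easy to see with a simplified static model. The sketch below is our own illustration, not Tesla's analysis: it assumes the supported weight acts straight down, offset from the knee by a moment arm that grows with the bend angle, and the load and limb length are made-up numbers.

```python
# Illustrative sketch (simplified, not Tesla's model): why knee torque grows
# as the leg bends. Treat the supported body weight as acting straight down,
# offset horizontally from the knee by an amount that grows with bend angle.

import math

BODY_WEIGHT_N = 700.0     # assumed supported load, roughly 71 kg
THIGH_LENGTH_M = 0.45     # assumed knee-to-hip length

for bend_deg in (5, 30, 60, 90):
    # Horizontal offset of the load line from the knee joint
    moment_arm = THIGH_LENGTH_M * math.sin(math.radians(bend_deg))
    knee_torque = BODY_WEIGHT_N * moment_arm
    print(f"bend {bend_deg:3d} deg -> knee torque ~ {knee_torque:5.1f} N*m")
# Nearly straight legs need little torque; a deep squat needs a lot,
# which is the non-linear requirement the four-bar linkage helps flatten.
```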
So, I would like to talk to you about the design process and the actuator portfolio in our robot. There are many similarities between a car and the robot when it comes to powertrain design; the most important things that matter here are energy, mass, and cost. We are carrying over most of our design experience from the car to the robot. In this particular case you see a car with two drive units, and the drive units are used to accelerate the car, to hit a 0-to-60 mph time, or to drive a city drive cycle. The robot, on the other hand, has 28 actuators, and it's not obvious what the tasks are at the actuator level. We have higher-level tasks, like walking, climbing stairs, or carrying a heavy object, which need to be translated into joint specs. Therefore we use our model, which generates the torque-speed trajectories for our joints, and those are subsequently fed into our optimization model and run through the optimization process.
This is one of the scenarios that the robot is capable of doing, which is turning and walking. When we have this torque-speed trajectory, we lay it over an efficiency map of an actuator, and along the trajectory we can generate the power consumption and the cumulative energy for the task versus time. This allows us to define the system cost for that particular actuator and place a single point into a cloud. Then we do this for hundreds of thousands of actuator candidates by solving on our cluster, and the red line denotes the Pareto front, which is the preferred area where we look for the optimum; the X denotes the preferred actuator design we picked for this particular joint. Now we need to do this for every joint: we have 28 joints to optimize, and we parse our cloud again for every joint spec, and the red X's this time denote the bespoke actuator designs for every joint. The problem here is that we have too many unique actuator designs, and even if we take advantage of symmetry, there are still too many. In order to make something mass-manufacturable, we need to reduce the number of unique actuator designs, so we run what we call a commonality study, in which we parse our cloud again, looking this time for actuators that simultaneously meet the performance requirements of more than one joint. The resulting portfolio is six actuators, shown in the color map in the middle figure.
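For readers unfamiliar with the term, a Pareto front is simply the set of candidate designs that no other candidate beats on every metric at once. Below is a minimal sketch of that selection step, our own illustration rather than Tesla's tooling; the two metrics and the random candidate cloud are assumptions.

```python
# Minimal sketch of a Pareto-front selection over a cloud of hypothetical
# actuator candidates scored on two axes (mass and task energy, lower is
# better on both). Illustration only, not Tesla's optimization pipeline.

import random

random.seed(0)
# Hypothetical candidates: (mass_kg, task_energy_wh)
candidates = [(random.uniform(1.0, 4.0), random.uniform(20.0, 80.0)) for _ in range(1000)]

def pareto_front(points):
    """Keep points not dominated by any other point (lower is better on both axes)."""
    front = []
    for p in points:
        dominated = any(q[0] <= p[0] and q[1] <= p[1] and q != p for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

front = pareto_front(candidates)
print(f"{len(front)} of {len(candidates)} candidate designs lie on the Pareto front")
```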
The actuators can also be viewed in this slide: we have three rotary and three linear actuators, all of which have a great output force, or torque, per unit mass. The rotary actuator in particular has a mechanical clutch integrated on the high-speed side, an angular contact ball bearing on the high-speed side, a cross roller bearing on the low-speed side, a strain wave gear as the gear train, three integrated sensors, and a bespoke permanent magnet machine. The linear actuator has planetary rollers and an inverted planetary screw as its gear train, which provides efficiency, compactness, and durability. In order to demonstrate the force capability of our linear actuators, we set up an experiment to test one at its limits, and I will let you enjoy the video.
So our actuator is able to lift a half-ton, nine-foot concert grand piano. And this is a requirement, not just something nice to have, because our muscles can do the same when they are directly driven: our quadriceps can do the same thing. It's just that the knee is an up-gearing linkage system that converts force into velocity at the end effector, our heels, for the purpose of giving the human body agility. That's one of the main things that is amazing about the human body. I'm concluding my part at this point, and I would like to welcome my colleague Mike, who is going to talk to you about hand design. Thank you very much. Thanks, Konstantinos.
So we just saw how powerful a human, and a humanoid, actuator can be; however, humans are also incredibly dexterous. The human hand has the ability to move at 300 degrees per second, it has tens of thousands of tactile sensors, and it has the ability to grasp and manipulate almost every object in our daily lives. For our robotic hand design, we were inspired by biology: we have five fingers and an opposable thumb. Our fingers are driven by metallic tendons that are both flexible and strong. We have the ability to complete wide-aperture power grasps while also being optimized for precision gripping of small, thin, and delicate objects.

So why a human-like robotic hand? The main reason is that our factories, and the world around us, are designed to be ergonomic. That means objects in our factory are graspable, but it also means that new objects we may have never seen before can be grasped by the human hand, and by our robotic hand as well. The converse is pretty interesting, because it says these objects are designed for our hand, instead of us having to make changes to our hand to accommodate a new object.

Some basic stats about our hand: it has six actuators and eleven degrees of freedom. It has an in-hand controller which drives the fingers and receives sensor feedback. Sensor feedback is really important for learning more about the objects we're grasping, and also for proprioception, which is the ability to recognize where our hand is in space. One important aspect of our hand is that it's adaptive; this adaptability essentially comes from complex mechanisms that allow the hand to adapt to the object being grasped. Another important part is that we have a non-backdrivable finger drive; this clutching mechanism allows us to hold and transport objects without having to keep the hand motors powered. You've just heard how we went about designing the Tesla Bot hardware; now we'll hand it off to Milan and our autonomy team to bring this robot to life. Thanks, Mike.
All right, so all those cool things we showed earlier in the video were made possible in just a matter of a few months, thanks to the amazing work we've done on Autopilot over the past few years. Most of those components ported quite easily over to the Bot's environment. If you think about it, we're just moving from a robot on wheels to a robot on legs, so some of those components are pretty similar, and some others require more heavy lifting. For example, our computer vision neural networks were ported directly from Autopilot to the Bot's situation; it's exactly the same occupancy network, which we'll talk about in a bit more detail later with the Autopilot team, that is now running on the Bot in this video. The only thing that really changed is the training data, which we had to recollect.

We're also trying to find ways to improve those occupancy networks using the work done on neural radiance fields, to get really great volumetric rendering of the Bot's environment, for example some machinery here that the Bot might have to interact with.

Another interesting problem to think about is indoor environments, mostly in the absence of a usable GPS signal: how do you get the Bot to navigate to its destination, say, to find its nearest charging station? We've been training more neural networks to identify high-frequency features, key points within the Bot's camera streams, and track them across frames over time as the Bot navigates through its environment, and we use those points to get a better estimate of the Bot's pose and trajectory within its environment as it's walking.

We also did quite some work on the simulation side. This is literally the Autopilot simulator, into which we've integrated the robot's locomotion code, and this is a video of the motion control code running in that simulator, showing the evolution of the robot's walk over time. As you can see, we started quite slowly in April and started accelerating as we unlocked more joints and deeper, more advanced techniques like arm balancing over the past few months. Locomotion is one component that's very different as we move from the car to the Bot's environment, so I think it warrants a little more depth, and I'd like my colleagues to talk about it now.
Hi everyone, I'm Felix, I'm a robotics engineer on the project, and I'm going to talk about walking. Seems easy, right? People do it every day; you don't even have to think about it. But there are some aspects of walking which are challenging from an engineering perspective. For example, physical self-awareness, which means having a good representation of yourself: what is the length of your limbs, what is the mass of your limbs, what is the size of your feet? All of that matters. Also, having an energy-efficient gait: you can imagine there are different styles of walking, and not all of them are equally efficient. Most important: keep balance, don't fall. And of course, coordinate the motion of all of your limbs together. Humans do all of this naturally, but as engineers and roboticists we have to think about these problems explicitly, and I'm going to show you how we address them in our locomotion planning and control stack.

We start with locomotion planning and our representation of the Bot; that means a model of the robot's kinematics, dynamics, and contact properties. Using that model and the desired path for the Bot, our locomotion planner generates reference trajectories for the entire system, meaning trajectories that are feasible with respect to the assumptions of our model. The planner currently works in three stages: it starts by planning footsteps and ends with the motion of the entire system. Let's dive a little deeper into how this works. In this video we see footsteps being planned over a planning horizon, following the desired path. We start from this and then add foot trajectories that connect these footsteps, using toe-off and heel strike just as humans do; this gives us a larger stride and less knee bend, for higher efficiency of the system. The last stage is finding a center-of-mass trajectory, which gives us a dynamically feasible motion of the entire system to keep balance. As we all know, plans are good, but we also have to realize them in reality, so let's see how we can do that.
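To give a flavour of what the first stage of such a planner does, here is a deliberately tiny sketch that spaces alternating footsteps along a straight desired path. It is a toy under assumed stride and stance-width values, not Tesla's locomotion planner.

```python
# Toy illustration of footstep planning along a straight path
# (assumed stride length and stance width; not Tesla's planner).

STRIDE_M = 0.5        # assumed forward distance between successive steps
STANCE_WIDTH_M = 0.2  # assumed lateral distance between left and right feet

def plan_footsteps(path_length_m: float):
    """Alternate left/right footsteps at a fixed stride along the x axis."""
    steps = []
    x, side = 0.0, 1  # side: +1 = left foot, -1 = right foot
    while x < path_length_m:
        steps.append(("L" if side > 0 else "R", round(x, 2), side * STANCE_WIDTH_M / 2))
        x += STRIDE_M
        side = -side
    return steps

for foot, x, y in plan_footsteps(2.0):
    print(f"{foot} foot at x={x:4.2f} m, y={y:+.2f} m")
```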
[Applause] Thank you, Felix. Hello everyone, my name is Anand, and I'm going to talk to you about controls. Let's take the motion plan that Felix just talked about and put it in the real world, on a real robot, and see what happens. It takes a couple of steps and falls down. Well, that's a little disappointing, but we are missing a few key pieces here which will make it work.

As Felix mentioned, the motion planner uses an idealized version of the robot and of the reality around it, and that is not exactly correct. It also expresses its intention through trajectories and wrenches, that is, the forces and torques it wants to exert on the world in order to locomote. Reality is way more complex than any simple model; the real robot is not simplified, it has vibrations, modes, compliance, sensor noise, and on and on. So what happens when you put the bot in the real world? The unexpected forces cause unmodeled dynamics which the planner doesn't know about, and that causes destabilization, especially for a system that is only dynamically stable, like a biped in locomotion.

So what can we do about it? We measure reality. We use sensors and our understanding of the world to do state estimation. Here you can see the attitude and pelvis pose, which is essentially the vestibular system in a human, along with the center-of-mass trajectory, being tracked while the robot walks in the office environment. Now we have all the pieces we need to close the loop: we use our better bot model, we use the understanding of reality we've gained through state estimation, and we compare what we want against what we estimate reality is doing to us, in order to add corrections to the behavior of the robot. Here the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright. The final point is that a robot that walks is not enough; we need it to use its hands and arms to be useful. So let's talk about manipulation.
[Applause]
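The "close the loop" idea described above is classic feedback control: estimate the state, compare it with the plan, and apply a correction proportional to the error. Here is a generic, heavily simplified sketch of that pattern, a PD correction on the center-of-mass position; the gains and numbers are invented for illustration and this is not Tesla's controller.

```python
# Generic feedback-correction sketch (illustrative only): compare the planned
# center-of-mass position with the estimated one and add a PD correction.

KP = 120.0   # assumed proportional gain
KD = 15.0    # assumed derivative gain

def correction(planned_pos, planned_vel, estimated_pos, estimated_vel):
    """Return an extra force command that pushes the state back toward the plan."""
    pos_error = planned_pos - estimated_pos
    vel_error = planned_vel - estimated_vel
    return KP * pos_error + KD * vel_error

# Example: the robot was nudged 3 cm behind the planned CoM position.
extra_force_n = correction(planned_pos=0.30, planned_vel=0.10,
                           estimated_pos=0.27, estimated_vel=0.05)
print(f"corrective force command: {extra_force_n:.1f} N")
```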
Hi everyone, my name is Eric, I'm a robotics engineer on Tesla Bot, and I want to talk about how we've made the robot manipulate things in the real world. We wanted to manipulate objects while looking as natural as possible, and also get there quickly, so we've broken this process down into two steps. The first is generating a library of natural motion references, or what we could call demonstrations, and the second is adapting these motion references online to the current real-world situation.

Let's say we have a human demonstration of picking up an object. We can get a motion capture of that demonstration, which is visualized right here as a set of keyframes representing the locations of the hands, the elbows, and the torso. We can map that to the robot using inverse kinematics, and if we collect a lot of these, we have a library to work with. But a single demonstration is not generalizable to the variation in the real world; for instance, this one would only work for a box in one very particular location. So we also run these reference trajectories through a trajectory optimization program, which solves for where the hand should be and how the robot should balance when the motion needs to be adapted to the real world. For instance, if the box is in this other location, our optimizer will create this trajectory instead.
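One simple way to picture the "adapt a demonstration to a new situation" step is to warp the recorded hand path so it ends at the new object location while keeping its overall shape. The sketch below does only that naive warp; Tesla's adaptation runs a full trajectory optimization with balance constraints, so treat this purely as an illustration with made-up coordinates.

```python
# Naive illustration of adapting a demonstrated hand path to a new target:
# blend in the offset between the old and new object positions, more and
# more strongly toward the end of the motion. (Not Tesla's optimizer.)

demo_path = [(0.0, 0.0, 0.9), (0.2, 0.1, 0.7), (0.4, 0.2, 0.5), (0.5, 0.3, 0.4)]
demo_target = demo_path[-1]            # where the box was during the demo
new_target = (0.45, 0.45, 0.40)        # hypothetical new box location

offset = tuple(n - d for n, d in zip(new_target, demo_target))
n = len(demo_path) - 1

adapted_path = [
    tuple(p + (i / n) * o for p, o in zip(point, offset))
    for i, point in enumerate(demo_path)
]
for point in adapted_path:
    print(tuple(round(c, 3) for c in point))
# The start of the motion is unchanged; the final keyframe lands on the new box.
```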
Next, Milan is going to talk about what's next for Optimus at Tesla. Thanks. Thanks, Eric.
Right, so hopefully by now you've got a good idea of what we've been up to over the past few months. We started with something that's usable, but it's far from being useful; there's still a long and exciting road ahead of us. I think the first thing, within the next few weeks, is to get Optimus at least at par with Bumble C, the other bot prototype you saw earlier, and probably beyond. We're also going to start focusing on a real use case at one of our factories, and really try to nail that down and iron out all the elements needed to deploy this product in the real world, like the ones I mentioned earlier: indoor navigation, graceful fall management, even servicing, all the components needed to scale this product up. I don't know about you, but after seeing what we've shown tonight, I'm pretty sure we can get this done within the next few months or years, make this product a reality, and change the entire economy. I would like to thank the entire Optimus team for the hard work over the past few months; I think it's pretty amazing that all of this was done in barely six or eight months. Thank you very much. [Applause]
Thank you. Hey everyone, I'm Ashok, I lead the Autopilot team alongside Milan. It's going to be so hard to top that Optimus section, but we'll try nonetheless. Every Tesla that has been built over the last several years, we think, has the hardware to make the car drive itself, and we have been working on the software to add higher and higher levels of autonomy. This time around last year we had roughly 2,000 cars driving our FSD Beta software; since then we have significantly improved the software's robustness and capability, and we have now shipped it to 160,000 customers as of today. [Applause]
This did not come for free; it came from the sweat and blood of the engineering team over the last year. For example, we trained 75,000 neural network models in just the last year; that's roughly a model every eight minutes coming out of the team. We then evaluate them on our large clusters, and we shipped 281 of those models that actually improved the performance of the car. This pace of innovation is happening throughout the stack: the planning software, the infrastructure, the tools, even hiring; everything is progressing to the next level.
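That "a model every eight minutes" figure is easy to check with simple arithmetic, assuming the 75,000 models are spread evenly over a full year:

```python
# Quick check of the quoted training cadence, assuming an even spread
# of 75,000 models over one calendar year.

MODELS_PER_YEAR = 75_000
minutes_per_year = 365 * 24 * 60
print(f"one model every {minutes_per_year / MODELS_PER_YEAR:.1f} minutes")  # ~7.0
# So "roughly a model every eight minutes" is in the right ballpark.
```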
The FSD Beta software is quite capable of driving the car. It should be able to navigate from parking lot to parking lot, handling city-street driving, stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns, and so on. All of this comes from the camera streams that go through our neural networks, which run on the car itself; it's not going back to a server or anything, it runs on the car and produces all the outputs that form the world model around the car, and the planning software drives the car based on that.
Today we'll go into a lot of the components that make up the system. The occupancy network acts as the base geometry layer of the system. This is a multi-camera video neural network that, from the images, predicts the full physical occupancy of the world around the robot; anything that's physically present, trees, walls, buildings, cars, whatever it may be, it predicts, along with its future motion. On top of this base level of geometry we have more semantic layers. In order to navigate the roadways we need the lanes, of course, but roadways have lots of different lanes that connect in all kinds of ways, so it's actually a really difficult problem for typical computer vision techniques to predict the set of lanes and their connectivity. So we reached all the way into language technologies and pulled in the state of the art from other domains, not just computer vision, to make this task possible. For vehicles, we need their full kinematic state in order to control for them.
All of this comes directly from neural networks: raw video streams go into the networks, pass through a lot of processing, and out comes the full kinematic state, positions, velocities, accelerations, jerk, all of that directly from the networks with minimal post-processing. That's really fascinating to me, because how is this even possible? What world do we live in, that this magic works, that these networks predict these higher-order derivatives of position, when people once thought we couldn't even detect these objects? My opinion is that it did not come for free; it required tons of data. We have sophisticated auto-labeling systems that churn through raw sensor data and run a ton of offline compute on the servers; it can take a few hours, running expensive neural networks, to distill the information into labels that train our in-car neural networks. On top of this, we also use our simulation system to synthetically create images, and since it's a simulation, we trivially have all the labels. All of this goes through a well-oiled data engine pipeline: we first train a baseline model with some data, ship it to the car, and see what the failures are. Once we know the failures, we mine the fleet for the cases where it fails, provide the correct labels, and add the data to the training set. This process systematically fixes the issues, and we do this for every task that runs in the car.
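The data engine described here is, at its core, a simple loop. The sketch below captures the shape of that loop with trivial stand-in stubs so it runs; every function here is a hypothetical placeholder, not Tesla's internal tooling.

```python
# Shape of the data-engine loop described above, with trivial stand-in stubs
# so the structure is runnable. All names are hypothetical placeholders.

def train(model, training_set):        return f"{model}+trained({len(training_set)})"
def deploy_to_fleet(model, fleet):     fleet["model"] = model
def collect_failure_cases(fleet):      return ["left-turn-at-dusk"]            # stub
def mine_fleet_for(failures):          return [f"clip-of-{f}" for f in failures]
def label(clips):                      return [(c, "corrected-label") for c in clips]

def data_engine_iteration(model, training_set, fleet):
    model = train(model, training_set)       # 1. train a baseline / updated model
    deploy_to_fleet(model, fleet)            # 2. ship it to the cars
    failures = collect_failure_cases(fleet)  # 3. see where it fails in the field
    clips = mine_fleet_for(failures)         # 4. mine the fleet for similar cases
    training_set.extend(label(clips))        # 5. label them and grow the training set
    return model, training_set

model, data, fleet = "baseline", [("clip-0", "label-0")], {}
for _ in range(3):                           # each pass systematically fixes more issues
    model, data = data_engine_iteration(model, data, fleet)
print(model, len(data))
```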
To train these new, massive neural networks, this year we expanded our training infrastructure by roughly 40 to 50 percent, which puts us at about 14,000 GPUs today across multiple training clusters in the United States. We also worked on our AI compiler, which now supports the new operations needed by those neural networks and maps them onto the best of our underlying hardware resources. Our inference engine today is capable of distributing the execution of a single neural network across the two independent systems-on-chip, essentially two independent computers, interconnected within the same self-driving computer. To make this possible we have to keep tight control over the end-to-end latency of this new system, so we deployed more advanced scheduling code across the full FSD platform.

All of these neural networks running in the car together produce the vector space, which is the model of the world around the robot, or the car. The planning system then operates on top of this, coming up with trajectories that avoid collisions and make smooth progress towards the destination, using a combination of model-based optimization plus a neural network that helps make the optimization really fast. Today we are really excited to present progress in all of these areas. We have the engineering leads standing by to come in and explain these various blocks, and they power not just the car; the same components also run on the Optimus robot that Milan showed earlier. With that, I welcome Paril to start talking about the planning section.
Hi all, I'm Paril Jain. Let's use this intersection scenario to dive straight into how we do planning and decision making in Autopilot. We are approaching this intersection from a side street, and we have to yield to all the crossing vehicles. Right as we are about to enter the intersection, the pedestrian on the other side decides to cross the road without a crosswalk. Now we need to yield to this pedestrian, yield to the vehicles from the right, and also understand the relationship between the pedestrian and the vehicle on the other side of the intersection. So there are a lot of these inter-object dependencies that we need to resolve in a quick glance, and humans are really good at this: we look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one.

So let's look at a few of the interactions the Autopilot system evaluated. We could have gone in front of this pedestrian with a very aggressive longitudinal and lateral profile; obviously we are then being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet. We could have moved forward slowly, shooting for a gap between the pedestrian and the vehicle from the right; again we are being a jerk to the vehicle coming from the right, but you should not outright reject this interaction, in case it is the only safe interaction available. Lastly, the interaction we ended up choosing: stay slow initially, find a reasonable gap, and then finish the maneuver after all the agents pass.
Now, evaluating all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents. For example, what is the longitudinal jerk required of the vehicle coming from the right when you assert yourself in front of it? Relying purely on collision checks against modular predictions will only get you so far, because you will miss out on a lot of valid interactions. This basically boils down to solving a multi-agent joint trajectory planning problem over the trajectories of the ego vehicle and all the other agents. However much you optimize, there's going to be a limit to how fast you can run this optimization; it will be close to the order of 10 milliseconds even after a lot of incremental approximations. For a typical crowded, unprotected left turn, say you have more than 20 objects, each with multiple different future modes; the number of relevant interaction combinations blows up, and the planner needs to make a decision every 50 milliseconds. So how do we solve this in real time?

We rely on a framework we call interaction search, which is basically a parallelized tree search over a set of maneuver trajectories. The state space here corresponds to the kinematic state of the ego vehicle, the kinematic states of the other agents, their nominal multimodal future predictions, and all the static entities in the scene. The action space is where things get interesting: we use a set of maneuver trajectory candidates to branch over interaction decisions, and also over incremental goals for a longer-horizon maneuver. Let's walk through this search quickly to get a sense of how it works.
We start with a set of vision measurements, namely lanes, occupancy, and moving objects; these get represented as sparse abstractions as well as latent features. We use this to create a set of goal candidates: lanes, again from the lanes network, or unstructured regions, which correspond to a probability mask derived from human demonstrations. Once we have a set of goal candidates, we create seed trajectories using a combination of classical optimization approaches and our neural network planner, again trained on data from the customer fleet. Once we have these seed trajectories, we use them to start branching on the interactions. We find the most critical interaction; in our case this is the interaction with respect to the pedestrian, whether we assert ourselves in front of it or yield to it. Obviously the option on the left is a high-penalty option and likely won't get prioritized, so we branch further into the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints. The search keeps flowing, branching on more interactions and branching on more goals.

Now, a lot of the tricks here lie in the evaluation of each node of the search. Inside each node, we initially started by creating trajectories using classical optimization approaches, where the constraints I described would be added incrementally, and this took close to one to five milliseconds per action. Even though that is a fairly good number, it does not scale when you want to evaluate more than a hundred interactions. So we ended up building lightweight, queryable networks that can run in the loop of the planner. These networks are trained on human demonstrations from the fleet as well as on offline solvers with relaxed time limits, and with this we were able to bring the runtime down to close to 200 microseconds per action.
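The value of those lightweight query networks becomes obvious when you multiply the per-action cost against the 50-millisecond decision budget quoted above. A quick, simplified check (single-threaded, ignoring the parallelism and pruning the search also relies on):

```python
# Back-of-the-envelope timing using the figures quoted in the talk:
# a 50 ms planning budget, ~100 candidate interactions to evaluate,
# classical optimization at ~1-5 ms per action vs ~0.2 ms for the
# lightweight neural evaluation.

BUDGET_MS = 50
N_INTERACTIONS = 100

for name, per_action_ms in [("classical (best case)", 1.0),
                            ("classical (worst case)", 5.0),
                            ("lightweight network", 0.2)]:
    total = per_action_ms * N_INTERACTIONS
    verdict = "fits" if total <= BUDGET_MS else "blows"
    print(f"{name:22s}: {total:6.0f} ms total -> {verdict} the {BUDGET_MS} ms budget")
# Only the sub-millisecond evaluation keeps a 100-interaction search real-time;
# in practice the search is also parallelized and aggressively pruned.
```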
Doing this alone is not enough, because you still have a massive search tree to go through, and you need to efficiently prune the search space, so you need to score each of these trajectories. A few of these scores are fairly standard: you do collision checks, and you do comfort analysis, what jerk and acceleration a given maneuver requires. The customer fleet data plays an important role here again: we run two sets of lightweight queryable networks, each augmenting the other. One is trained on interventions from the FSD Beta fleet, and gives a score for how likely a given maneuver is to result in an intervention over the next few seconds; the second is trained purely on human demonstrations, human driving data, and gives a score for how close a given selected action is to a human-driven trajectory. This scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes. The cool part about this architecture is that it lets us blend data-driven approaches, where you don't have to rely on a lot of hand-engineered costs, with physics-based checks that ground everything in reality.
Now, a lot of what I described was with respect to agents we can observe in the scene, but the same framework extends to objects behind occlusions. We use the video feeds from the eight cameras to generate the 3D occupancy of the world. The blue mask here corresponds to what we call the visibility region; it basically gets cut off at the first occlusion you see in the scene. We consume this visibility mask to generate what we call ghost objects, which you can see at the top left. If you model the spawn regions and the state transitions of these ghost objects correctly, and you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors. Now I'll pass it on to Phil to describe how we generate these occupancy networks.
Hey guys, my name is Phil, and I will share the details of the occupancy network we've built over the past year. This network is our solution for modeling the physical world in 3D around our cars; it is currently not shown in our customer-facing visualization, and what you'll see here is the raw network output from our internal dev tool. The occupancy network takes the video streams of all eight cameras as input and directly produces a single, unified volumetric occupancy in vector space: for every 3D location around the car it predicts the probability of that location being occupied. Also, since it has video context, it is capable of predicting obstacles that are occluded. For each location it also produces a set of semantics, such as curb, car, pedestrian, and road debris, color-coded here, and occupancy flow is also predicted for motion. Since the model is a generalized network, it does not have to classify static and dynamic objects explicitly; it is able to model random motions, such as the swerving trailer here. This network is currently running in all Teslas with FSD computers, and it is incredibly efficient: it runs about every 10 milliseconds on our neural accelerator.
So how does this work? Let's take a look at the architecture. First we rectify each camera's images using the camera calibration, and those images are given to the network. It's actually not the typical 8-bit RGB image; as you can see from the first image on top, we give the 12-bit raw photon-count image to the network. Since it has four more bits of information, it has 16 times better dynamic range, as well as reduced latency, since we no longer run the images through the ISP. We use a set of RegNets and BiFPNs as a backbone to extract image-space features. Next, we construct a set of 3D position queries, which, together with the image-space features as keys and values, are fed into an attention module. The output of the attention module is a set of high-dimensional spatial features. These spatial features are then aligned temporally using vehicle odometry to derive motion. Finally, these spatio-temporal features go through a set of deconvolutions to produce the final occupancy and occupancy-flow outputs. These are formed as a fixed-size voxel grid, which might not be precise enough for planning and control, so in order to get higher resolution we also produce per-voxel feature maps, which are fed into an MLP together with 3D spatial point queries to get occupancy and semantics at any arbitrary location.
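To make that stage order easier to follow, here is a deliberately toy PyTorch sketch that mirrors the flow just described: per-camera feature extraction, 3D position queries attending over image features, temporal fusion, then deconvolution to a voxel grid. The layer choices, sizes, and the GRU standing in for odometry-based alignment are our own simplifications, not Tesla's architecture.

```python
# Toy sketch of the described data flow (not Tesla's network). Layer sizes
# and the choice of modules are arbitrary placeholders.

import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    def __init__(self, feat_dim=64, grid=(16, 16, 4)):
        super().__init__()
        self.grid = grid
        n_queries = grid[0] * grid[1] * grid[2]          # one query per coarse voxel
        # Stand-in backbone for the RegNet/BiFPN image feature extractor
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Learned 3D position queries attend over image features from all cameras
        self.queries = nn.Parameter(torch.randn(n_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        # A GRU over time stands in for odometry-based temporal alignment
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Deconvolutions upsample the coarse grid to the output occupancy grid
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(feat_dim, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose3d(32, 1, 2, stride=2),
        )

    def forward(self, frames):                            # frames: (T, n_cams, 3, H, W)
        volumes = []
        for t in range(frames.shape[0]):
            feats = self.backbone(frames[t])              # (n_cams, D, h, w)
            kv = feats.flatten(2).permute(0, 2, 1).reshape(1, -1, feats.shape[1])
            fused, _ = self.attn(self.queries.unsqueeze(0), kv, kv)  # (1, n_queries, D)
            volumes.append(fused)
        seq = torch.cat(volumes, dim=0)                   # (T, n_queries, D)
        aligned, _ = self.temporal(seq.permute(1, 0, 2))  # fuse over time, per query
        vol = aligned[:, -1, :].T.reshape(1, -1, *self.grid)   # (1, D, 16, 16, 4)
        return torch.sigmoid(self.decoder(vol))           # (1, 1, 64, 64, 16) occupancy

net = ToyOccupancyNet()
occupancy = net(torch.rand(2, 8, 3, 64, 96))              # 2 time steps, 8 cameras
print(occupancy.shape)                                    # torch.Size([1, 1, 64, 64, 16])
```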
Now that we know the model better, let's look at another example. Here we have an articulated bus parked along the right side of the road, highlighted as an L-shaped group of voxels. As we approach, the bus starts to move: the front of the bus turns blue first, indicating the model predicts the front of the bus has non-zero occupancy flow, and as the bus keeps moving, the entire bus turns blue. You can also see that the network predicts the precise curvature of the bus. This is a very complicated problem for a traditional object detection network, because you'd have to decide whether to use one cuboid, or perhaps two, to fit the curvature; but for the occupancy network, since all we care about is the occupancy in the visible space, we're able to model the curvature precisely.

Besides the voxel grid, the occupancy network also produces a drivable surface. The drivable surface has both 3D geometry and semantics, which are very useful for control, especially on hilly and curvy roads. The surface and the voxel grid are not predicted independently; the voxel grid aligns with the surface implicitly. Here we are at a hill crest where you can see the 3D geometry of the surface being predicted nicely. The planner can use this information to decide, perhaps, to slow down more for the hill crest, and as you can also see, the voxel grid aligns with the surface consistently.
Besides the voxels and the surface, we're also very excited about the recent breakthroughs in neural radiance fields, or NeRFs. We're looking into both incorporating some NeRF-style features into occupancy network training, and using our network output as the input state for NeRFs. As a matter of fact, Ashok is very excited about this; it has been his personal weekend project for a while.

On these NeRFs: I think academia is building a lot of foundation models for language, using tons of large data sets, but for vision I think NeRFs are going to provide the foundation models for computer vision, because they are grounded in geometry, and geometry gives us a nice way to supervise these networks. It frees us from the requirement to define an ontology, and the supervision is essentially free, because you just have to differentiably render these images. So I think in the future this occupancy network idea, where images come in and the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed, is the future of computer vision. We're doing some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see this combined with one-shot prediction of volumetric occupancy. That's my personal bet.

So here's an early example result of a 3D reconstruction from our fleet data. Instead of focusing on getting perfect RGB reprojection in image space, our primary goal is to accurately represent the world in 3D space for driving, and we want to do this for all our fleet data around the world, in all weather and lighting conditions. Obviously this is a very challenging problem, and we're looking for you guys to help. Finally, the occupancy network is trained on a large auto-labeled data set, without any humans in the loop. With that, I'll pass it to Tim to talk about what it takes to train this network. Thanks, Phil.
[Applause] All right, hey everyone, let's talk about some training infrastructure. We've seen a couple of videos, you know, four or five, but I think and care and worry about a lot more clips than that. Looking at the occupancy network you just saw from Phil's videos, it takes 1.4 billion frames to train that network. If you had a hundred thousand GPUs, it would take one hour, but if you have one GPU, it would take a hundred thousand hours, and that is not a humane time period to wait for your training job to run; we want to ship faster than that.
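Those two numbers describe the same fixed pool of work divided across different numbers of GPUs, assuming near-perfect parallel scaling; the quick check below makes that explicit:

```python
# The quoted figures describe one fixed pool of work split across GPUs,
# assuming near-perfect parallel scaling.

TOTAL_GPU_HOURS = 100_000          # 1 GPU for 100,000 hours, per the talk
for n_gpus in (1, 1_000, 14_000, 100_000):
    hours = TOTAL_GPU_HOURS / n_gpus
    print(f"{n_gpus:7d} GPUs -> {hours:10.1f} hours (~{hours / 24:6.1f} days)")
# 100,000 GPUs -> 1 hour; with the ~14,000 GPUs quoted elsewhere in the talk,
# a full retrain of this network would come in at well under a day
# (in practice many jobs share those clusters).
```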
So that means you're going to need to go parallel, which means you need more compute, which means you're going to need a supercomputer. This is why we've built, in-house, three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and around 4,000 GPUs for auto-labeling. All these videos are stored in 30 petabytes of a distributed, managed video cache. You shouldn't think of our data sets as fixed, the way you might think of ImageNet with its million or so frames; you should think of them as a very fluid thing. We've got half a million of these videos flowing in and out of these clusters every single day, and we track 400,000 of these Python video instantiations every second. That's a lot of calls, and we need to capture all of them in order to govern the retention policies of this distributed video cache. Underlying all of this is a huge amount of infrastructure, all of which we build and manage in-house.
So you cannot just buy, you know, 40,000 GPUs and 30 petabytes of flash NVMe, put it all together, and go train; it actually takes a lot of work, and I'm going to go into a little bit of that. What you typically want to do is take your accelerator, which could be the GPU or Dojo which we'll talk about later, and because that's the most expensive component, that's where you want to put your bottleneck. That means every single part of your system is going to need to outperform this accelerator, and that is really complicated. That
means that your storage is going to need to have the size and the bandwidth to deliver all the data down into the nodes
these nodes need to have the right amount of CPU and memory capabilities to feed into your machine learning
framework this machine learning framework then needs to hand it off to your GPU and then you can start training but then you
need to do so across hundreds or thousands of GPUs in a reliable way, in lockstep, and in a way that's also fast, so you're also going to need an interconnect. Extremely complicated; we'll talk more about Dojo in a second. So first I want to take you through some
optimizations that we've done on our cluster so we're getting in a lot of videos and
video is very much unlike let's say training on images or text which I think is very well established video is quite
literally a dimension more complicated um and so that's why we needed to go end
to end from the storage layer down to the accelerator and optimize every single piece of that because we train on the photon count
videos that come directly from our Fleet we train on those directly we do not post process those at all
The way it's done is we seek exactly to the frames we selected for our batch and load those in, including the frames that they depend on, so these are your I-frames or your keyframes. We package those up, move them into shared memory, move them into a double buffer for the GPU, and then use the hardware decoder on the accelerator to actually decode the video. So we do that on the GPU natively, and this is all wrapped in a very nice Python PyTorch extension. Doing so unlocked more than a 30% training speed increase for the occupancy networks and freed up basically a whole CPU for other work.
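As a rough illustration of that loading path (not the actual extension), the sketch below seeks only the byte ranges a batch needs, stages the compressed packets in pinned host memory, and hands them to a GPU hardware decoder. Both `video_index` and `gpu_decode_hevc` are assumed helpers standing in for the real frame index and the NVDEC-style decoder binding.

```python
import torch

def load_batch(video_index, sample_ids, gpu_decode_hevc):
    """Load only the frames a batch needs, then decode them on the GPU.

    video_index: maps sample id -> (path, byte ranges of the target frame plus
                 the keyframes it depends on)                     [assumed helper]
    gpu_decode_hevc: hardware-decoder binding returning a frame tensor [assumed]
    """
    packets = []
    for sid in sample_ids:
        path, byte_ranges = video_index[sid]
        with open(path, "rb") as f:
            chunks = []
            for start, length in byte_ranges:        # seek exactly to what we need
                f.seek(start)
                chunks.append(f.read(length))
        packets.append(b"".join(chunks))

    # Stage the compressed packets in page-locked (pinned) host memory so the
    # copy to the GPU can be asynchronous, then decode on the hardware decoder.
    staged = [torch.frombuffer(bytearray(p), dtype=torch.uint8).pin_memory()
              for p in packets]
    device_bufs = [t.to("cuda", non_blocking=True) for t in staged]
    frames = [gpu_decode_hevc(buf) for buf in device_bufs]    # decoded on-GPU
    return torch.stack(frames)                                # (B, C, H, W)
```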
You cannot just do training with videos alone, of course; you need some kind of ground truth, and that is actually an interesting problem as well. The objective for storing your ground truth is that you want to get to the ground truth you need in the minimal number of file system operations, and load the minimal size of what you need, in order to optimize for aggregate cross-cluster throughput, because you should see a compute cluster as one big device with internally fixed constraints and thresholds. For this we rolled out a format that is native to us, called "small"; we use it for our ground truth, our feature cache and any inference outputs, so there are a lot of tensors in there. Just as a cartoon here, let's say this is the table that you want to store; this is how it would look rolled out on disk. What you do is take anything you'd want to index on, for example video timestamps, and put those all in the header, so that from your initial header read you know exactly where to go on disk. Then if you have any tensors, you try transposing the dimensions to put a different dimension last as the contiguous dimension, and also try different types of
compression; then you check which one was most optimal and store that one. This is actually a huge win: if you do feature caching of intermediate outputs from the machine learning network, just rotating the dimensions a little can get you up to a 20% increase in storage efficiency. Then when we store that, we also order the columns by size, so that all your small columns and small values are together; when you seek for a single value, you're likely to overlap with a read of more values which you'll use later, so you don't need to do another file system operation.
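To illustrate those layout ideas (indexable values in the header, the most compressible axis last, pick the best codec, small columns grouped together), here is a toy writer. It is not the actual format; the field names and framing are invented for the sketch.

```python
import io, json, zlib, lzma
import numpy as np

def write_record(path, index_values, columns):
    """Toy columnar writer mimicking the layout ideas described above."""
    blobs = {}
    for name, arr in columns.items():
        # Try each axis as the trailing (contiguous) one and two codecs,
        # then keep whichever encoding is smallest.
        candidates = []
        for axis in range(arr.ndim):
            raw = np.ascontiguousarray(np.moveaxis(arr, axis, -1)).tobytes()
            for codec, compress in (("zlib", zlib.compress), ("lzma", lzma.compress)):
                payload = compress(raw)
                candidates.append((len(payload), codec, axis, payload))
        _, codec, axis, payload = min(candidates)
        blobs[name] = dict(codec=codec, last_axis=axis, shape=list(arr.shape),
                           dtype=str(arr.dtype), payload=payload)

    # Order columns by size so one seek tends to pick up several small values at once.
    ordered = sorted(blobs.items(), key=lambda kv: len(kv[1]["payload"]))
    header, body, offset = {"index": index_values, "columns": []}, io.BytesIO(), 0
    for name, meta in ordered:
        header["columns"].append({"name": name, "offset": offset,
                                  "size": len(meta["payload"]), "codec": meta["codec"],
                                  "last_axis": meta["last_axis"],
                                  "shape": meta["shape"], "dtype": meta["dtype"]})
        body.write(meta["payload"])
        offset += len(meta["payload"])

    with open(path, "wb") as f:
        h = json.dumps(header).encode()          # everything indexable lives up front
        f.write(len(h).to_bytes(8, "little"))
        f.write(h)
        f.write(body.getvalue())
```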
So I could go on and on; I only touched on two projects that we have internally, but this is actually part of a huge continuous effort to optimize the compute that we have in-house. Aggregating all these optimizations, we now train our occupancy networks twice as fast just because the pipeline is twice as efficient, and if we add in a bunch more compute and go parallel, we can now train this in hours instead of days. And with that I'd like to hand it off to
the biggest user of compute John
Hi everybody, my name is John Emmons, I lead the Autopilot vision team. I'm going to cover two topics with you today: the first is how we predict lanes, and the second is how we predict the future behavior of other agents on the road. In the early days of Autopilot we modeled the lane detection problem as an image-space instance segmentation task. Our network was super simple though; in fact it was only capable of predicting lanes of a few different kinds of geometries. Specifically, it would segment the ego lane, it could segment adjacent lanes, and then it had some special casing for forks and merges. This simplistic modeling of the problem worked for highly structured roads like highways, but today we're trying to build a system that's capable of much more complex maneuvers, specifically we want to make left and right turns at intersections, where the road topology can be quite a bit more complex and diverse. When we try to apply this simplistic modeling of the
problem here, it just totally breaks down. Taking a step back for a moment, what we're trying to do here is predict the sparse set of lane instances and their connectivity, and what we want is a neural network that basically predicts this graph, where the nodes are the lane segments and the edges encode the
connectivities between these Lanes so what we have is our lane detection
neural network it's made up of three components in the first component we have a set of
convolutional layers attention layers and other neural network layers that encode the video streams from our eight
cameras on the vehicle and produce a rich visual representation
We then enhance this visual representation with coarse road-level map data, which we encode with a set of additional neural network layers that we call the lane guidance module. This map is not an HD map, but it
provides a lot of useful hints about the topology of lanes inside of intersections the lane counts on various roads and a set of other attributes that
help us the first two components here produced a
dense tensor that sort of encodes the world. But what we really want to do is convert this dense tensor into a sparse set of lanes and their connectivities. We approach this problem like an image captioning task, where the input is this dense tensor and the output text is predicted in a special language that we developed at Tesla for encoding lanes and their connectivities. In this language of lanes, the words and tokens are the lane positions in 3D space, and the ordering of the tokens and the predicted modifiers in the tokens encode the connectivity relationships between these lanes. By modeling the task as a language problem, we can capitalize on recent autoregressive architectures and techniques from the language community for handling the multimodality of the problem. We're not just solving the computer vision problem at Autopilot; we're also applying the state of the art in language modeling and machine learning more generally. I'm now going to dive into a little bit more detail on this language component.
What I have depicted on the screen here is a satellite image which sort of represents the local area around the vehicle. The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network. We start with a blank slate; we're going to make our first prediction here at this green dot. This green dot's position is encoded as an index into a coarse grid which discretizes the 3D world. Now, we don't predict this index directly, because it would be too computationally expensive to do so; there are just too many grid points, and predicting a categorical distribution over them has implications at both training time and test time. So instead, we discretize the world coarsely first: we predict a heat map over the possible locations, latch onto the most probable location, and then refine the prediction to get the precise point.
Now we know the position of this token, but we don't know its type. In this case it's the beginning of a new lane, so we encode it as a start token, and because it's a start token there are no additional attributes in our language. We then take the predictions from this first forward pass and encode them using a learned embedding, which produces a set of tensors that we combine together into what is actually the first word in our language of lanes; we add this to the first position in our sentence. We then continue this process by predicting the next lane point in a similar fashion. Now this lane point is not the beginning of a new lane, it's actually a continuation of the previous lane, so it's a continuation token type. And it's not enough just to know that this lane is connected to the previously predicted lane; we want to encode its precise geometry, which we do by regressing a set of spline coefficients. We then take this lane, encode it again, and add it as the next word in the sentence. We continue predicting these continuation lanes until we get to the
end of the prediction grid. We then move on to a different lane segment, so you can see that cyan dot there: it's not topologically connected to that pink point, it's actually forking off of that, sorry, that green point there, so it's got a fork type. Fork tokens actually point back to the previous tokens from which the fork originates, so you can see here the fork point predicted is actually index zero; it's referencing back to tokens it has already predicted, like you would in language. We continue this process over and over again until we've enumerated all of the tokens in the lane graph, and then the network predicts the end-of-sentence token.
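Stepping back, the decoding loop described here is greedy autoregressive generation with a spatial vocabulary. The sketch below shows only the control flow, under heavy assumptions: every head name (`coarse_heatmap`, `refine_offset`, `token_type`, `spline_head`, `fork_pointer`, `token_embed`) and the token-type constants are hypothetical stand-ins, since the real interfaces aren't public.

```python
import torch

def decode_lane_graph(model, scene_tensor, max_tokens=128):
    """Greedy sketch of a 'language of lanes' decoding loop (pseudocode in runnable form)."""
    sentence = []                    # embeddings of the words predicted so far
    tokens = []                      # (type, coarse index, precise xyz, extras)

    for _ in range(max_tokens):
        ctx = torch.stack(sentence).unsqueeze(0) if sentence else None

        heat = model.coarse_heatmap(scene_tensor, ctx)       # logits over the coarse grid
        idx = heat.argmax(dim=-1)                            # latch the most probable cell
        xyz = model.refine_offset(scene_tensor, ctx, idx)    # refine to a precise 3D point

        ttype = model.token_type(scene_tensor, ctx, idx).argmax(dim=-1).item()
        if ttype == model.END_OF_SENTENCE:
            break
        extras = None
        if ttype == model.CONTINUATION:                      # geometry via spline coefficients
            extras = model.spline_head(scene_tensor, ctx, idx)
        elif ttype == model.FORK:                            # fork points back at an earlier token
            extras = model.fork_pointer(scene_tensor, ctx, idx).argmax(dim=-1)

        tokens.append((ttype, idx, xyz, extras))
        sentence.append(model.token_embed(ttype, idx, xyz, extras))  # becomes the next "word"

    return tokens
```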
Yeah, I just want to note that the reason we do this is not just because we want to build something complicated; it almost feels like a Turing-complete machine here with neural networks. It's that we tried simple approaches, for example trying to just segment the lanes along the road or something like that, but the problem is when there's uncertainty, say you cannot see the road clearly and there could be two lanes or three lanes and you can't tell, a simple segmentation-based approach would just draw both of them, in kind of a 2.5-lane situation, and the post-processing algorithm would hilariously fail when the predictions are like that. Yeah, the problems don't end there; I mean, you need to predict these connectivities, like these connecting lanes inside of intersections, which is just not possible with the approach Ashok's mentioning. With overlaps like this, the segmentation would just go haywire, and even if you try very hard to put them on separate layers, it's just a really hard problem. Language just offers a really nice framework for modeling this, getting a sample from the posterior, as opposed to trying to do all of this in post-processing.
But this doesn't actually stop at just Autopilot, right John? This can be used for Optimus as well; I guess they wouldn't be called lanes, but you could imagine, sort of in this stage here, that you might have paths that encode the possible places that people could walk. Yeah, basically if you're in a factory or in a home setting, you can just ask the robot, okay, please take me to the kitchen, or please route to some location in the factory, and then we predict a set of pathways that would go through the aisles, take the robot, and say okay, this is how you get to the kitchen. It really gives us a nice framework to model these different paths, which simplifies the navigation problem for the downstream planner. All right, so ultimately what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network; there's no additional step here for simplifying these dense predictions into sparse ones, this is just the direct, unfiltered output of the network.
Okay, so I talked a little bit about lanes; I'm going to briefly touch on how we model and predict the future paths and other semantics of objects. I'm just going to go really quickly through two examples. In the video on the
right here we've got a car that's actually running a red light and turning in front of us what we do to handle
situations like this is we predict a set of short time Horizon future trajectories on all objects we can use
these to anticipate the dangerous situation here and apply whatever you know braking and steering action is required to avoid a collision
in the video on the right there's two vehicles in front of us the one on the left lane is parked apparently it's
being loaded unloaded I don't know why the driver decided to park there but the important thing is that our neural network predicted that it was stopped
which is the red color there um the vehicle in the other lane as you notice also is stationary but that one's
obviously just waiting for that red light to turn green so even though both objects are stationary and have zero velocity it's the semantics that is
really important here so that we don't get stuck behind that awkwardly parked car
Predicting all of these agent attributes presents some practical problems when trying to build a real-time system: we need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment; every millisecond really matters here. To minimize the inference latency, our neural network is split into two phases. In the first phase we identify locations in 3D space where agents exist; in the second stage we pull out tensors at those 3D locations, append additional data that's on the vehicle, and then do the rest of the processing. This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost.
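A minimal sketch of that two-phase split is below, assuming a bird's-eye-view feature tensor and two stub heads (`presence_head`, `per_agent_head`) that are not the real modules: a cheap head picks out the few cells where agents exist, and the expensive attribute heads run only on features gathered at those locations.

```python
import torch

def two_stage_agent_heads(dense_bev, vehicle_state, presence_head, per_agent_head,
                          threshold=0.5, k=32):
    """Phase 1: find agent locations cheaply. Phase 2: run heavy heads only there."""
    B, C, H, W = dense_bev.shape
    presence = torch.sigmoid(presence_head(dense_bev)).view(B, H * W)     # (B, H*W)

    # keep at most k confident locations per sample
    scores, flat_idx = presence.topk(k, dim=1)
    keep = scores > threshold

    feats = dense_bev.view(B, C, H * W).transpose(1, 2)                   # (B, H*W, C)
    gathered = torch.gather(feats, 1, flat_idx.unsqueeze(-1).expand(-1, -1, C))

    # append vehicle state to every selected agent feature, then run the heavy head
    state = vehicle_state.unsqueeze(1).expand(-1, k, -1)
    agent_inputs = torch.cat([gathered, state], dim=-1)
    attributes = per_agent_head(agent_inputs)      # trajectories, parked/stopped, ...
    return attributes, flat_idx, keep
```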
so putting it all together the autopilot Vision stack predicts more than just the geometry and kinematics of
the world it also predicts a rich set of semantics which enables safe and human-like driving
I'm now going to hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer. Thank you.
[Applause]
Hi everyone, I'm Sri. Today I'm going to give a glimpse of what it takes to run these FSD networks in the car and how we optimize for inference latency. Today I'm going to focus just on the FSD lanes network that John just talked about.
So when we started this track, we wanted to know if we could run this FSD lanes network natively on the TRIP engine, which is our in-house neural network accelerator that we built into the FSD computer. When we built this hardware we kept it simple and made sure it can do one thing ridiculously fast: dense dot products. But this architecture is autoregressive and iterative, where it crunches through multiple attention blocks in the inner loop, producing sparse points directly at every step. So the challenge here was: how can we do this sparse point prediction and sparse computation on a dense dot product engine? Let's see how we did this on TRIP. The network predicts a heat map of the most probable spatial locations of the point; we then do an argmax and a one-hot operation, which gives the one-hot encoding of the index of that spatial location. Next we need to select the embedding associated with this index from an embedding table that is learned during training. To do this on TRIP, we actually built a lookup table in SRAM and engineered the dimensions of this embedding such that we could achieve all of this with just matrix multiplication.
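The trick of doing a lookup with nothing but dense math is easy to show in miniature. The sizes below are invented, but the equivalence is exact: multiplying a one-hot row vector by the embedding table selects the corresponding row.

```python
import torch

# argmax -> one-hot -> (one_hot @ embedding_table) behaves exactly like a table lookup,
# using only the dense operations a dot-product engine is good at.
grid_cells, embed_dim = 1024, 256                       # made-up sizes
embedding_table = torch.randn(grid_cells, embed_dim)    # learned during training

heatmap_logits = torch.randn(1, grid_cells)             # network's spatial heat map
idx = heatmap_logits.argmax(dim=-1)                     # most probable location

one_hot = torch.zeros(1, grid_cells)
one_hot[0, idx] = 1.0

selected = one_hot @ embedding_table                    # (1, embed_dim), pure matmul
assert torch.allclose(selected, embedding_table[idx])   # identical to direct indexing
```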
Not just that, we also wanted to store this embedding into a token cache so that we don't recompute it on every iteration, but rather reuse it for future point predictions. Again, we pulled some tricks here where we did all these operations just on the dot product engine; it's actually cool that our team found creative ways to map all these operations onto the TRIP engine in ways that were not even imagined when this hardware was designed. But that's not the only thing we had to do to make this work: we also implemented a whole lot of operations and features to make this model compilable, to improve the int8 accuracy, and to optimize performance. All of these things helped us run this 75-million-parameter model at just under 10 milliseconds of latency, consuming just 8 watts of power.
But this is not the only architecture running in the car; there are so many other architectures, modules and networks we need to run. To give a sense of scale, there are about a billion parameters across all the networks combined, producing around 1,000 neural network signals, so we need to make sure we optimize them jointly, such that we maximize compute utilization and throughput and minimize latency. So we built a compiler just for neural networks that shares the structure of traditional compilers. As you can see, it takes the massive graph of neural nets, with 150k nodes and 375k connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices. Then we have a neural network linker, which shares the structure of a traditional linker, where we perform link-time optimization: there we solve an offline optimization problem with compute, memory and memory-bandwidth constraints, so that it comes up with an optimized schedule that gets executed in the car.
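As a very rough illustration of the "partition, then place" idea (and nothing like the real offline optimizer, which also reasons about memory and bandwidth), the sketch below cuts a DAG into subgraphs in topological order under a simple size budget and then assigns subgraphs to devices round-robin; all names and the policy are invented.

```python
from collections import defaultdict, deque

def partition_topologically(nodes, edges, max_nodes_per_subgraph):
    """Walk the DAG in topological order and start a new subgraph at a size budget."""
    indegree = defaultdict(int)
    children = defaultdict(list)
    for src, dst in edges:
        indegree[dst] += 1
        children[src].append(dst)

    ready = deque(n for n in nodes if indegree[n] == 0)
    subgraphs, current = [], []
    while ready:
        node = ready.popleft()
        current.append(node)
        if len(current) >= max_nodes_per_subgraph:
            subgraphs.append(current)
            current = []
        for child in children[node]:
            indegree[child] -= 1
            if indegree[child] == 0:
                ready.append(child)
    if current:
        subgraphs.append(current)
    return subgraphs

# A "linker" stage could then assign subgraphs to the two SoCs, e.g. round-robin:
def assign_to_devices(subgraphs, devices=("soc0", "soc1")):
    return {i: devices[i % len(devices)] for i in range(len(subgraphs))}
```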
On the runtime side, we designed a hybrid scheduling system which does heterogeneous scheduling on one SoC and distributed scheduling across both SoCs, to run these networks in a model-parallel fashion. To get 100 TOPS of compute utilization, we need to optimize across all the layers of software, right from tuning the network architecture and the compiler, all the way to implementing a low-latency, high-bandwidth RDMA link across both SoCs, and in fact going even deeper to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC. This is a lot of optimization at every level, to make sure we get the highest frame rate, as every millisecond counts here. And this is a visualization of the neural networks that are running in the car; this is our digital brain, essentially. As you can see, these operations are nothing but matrix multiplications and convolutions, to name a few of the real operations running in the car. To train this network with a billion parameters, you need a lot of labeled data, so Jurgen is going to talk about how we achieve this with the auto-labeling pipeline.
Thank you, thank you Sri. Hi everyone, I'm Jurgen Zhang and I lead geometric vision at Autopilot. So yeah, let's talk about auto labeling. We have several kinds of auto-labeling frameworks to support various types of networks, but today I'd like to focus on the lanes network here. To successfully train and generalize this network everywhere, we think we need tens of millions of trips, from probably one million intersections or even more. So then, how to do that? It is certainly achievable to source a sufficient amount of trips, because, as Tim explained earlier, we already have something like 500,000 trips cached per day. However, converting all that data into a training form is a very challenging technical problem. To solve this challenge we tried various ways of manual and auto labeling; from the first column to the second, and from the second to the third, each advance provided us nearly a 100x improvement in throughput. But still, we wanted an even better auto-labeling machine that can provide good quality, diversity and scalability. To meet all these requirements, despite the huge amount of engineering effort required, we've developed a new auto-labeling machine powered by multi-trip reconstruction. This can replace 5 million hours of manual labeling with just 12 hours on a cluster for labeling 10,000 trips.
So how we solved it: there are three big steps. The first step is high-precision trajectory and structure recovery by multi-camera visual-inertial odometry. Here, all the features, including the ground surface, are inferred from videos by neural networks, then tracked and reconstructed in the vector space. The typical drift rate of this trajectory in the car is about 1.3 centimeters per meter and 0.45 milliradians per meter, which is pretty decent considering its compact compute requirement. The recovered surfaces and road details are also used as strong guidance for the later manual verification step. This is also enabled in every FSD vehicle, so we get pre-processed trajectories and structures along with the trip data. The second step is multi-trip reconstruction, which is the big, core piece of this machine. The video shows how the previously shown trip is reconstructed and aligned with other trips, basically other trips from different people, not the same vehicle. This is done by multiple internal steps like coarse alignment, pairwise matching, joint optimization, and then further surface refinement. In the end a human analyst comes in and finalizes the label. All of these steps are fully parallelized on the cluster, so the entire process usually takes just a couple of hours.
The last step is actually auto-labeling the new trips. Here we use the same multi-trip alignment engine, but only between the pre-built reconstruction and each new trip, so it's much, much simpler than fully reconstructing all the clips altogether. That's why it only takes 30 minutes per trip to auto-label, instead of several hours of manual labeling, and this is also the key to the scalability of this machine: it easily scales as long as we have available compute and trip data. About 50 trips were newly auto-labeled from this scene, and some of them are shown here, from different vehicles. So this is how we capture and transform the space-time slices of the world into network supervision. Yeah, one thing I'd like to note is that Jurgen just talked about how we auto-label our lanes, but we have auto-labelers for almost every task that we do, including our planner, and many of these are fully automatic, with no humans involved. For example, for objects, their kinematics, their shapes, their futures, everything just comes from auto labeling, and the same is true for occupancy too. We have really just
built a machine around this yeah so if you can go back one slide not one more
It says "parallelized on cluster", and that sounds pretty straightforward, but it really wasn't. Maybe it's fun to share how something like this comes about. A while ago we didn't have any auto labeling at all, and then someone makes a script, it starts to work, it starts working better, until we reach a volume that's pretty high and we clearly need a solution. And so there were two other engineers in our team who were like, you know, that's an interesting thing. What we needed to do was build a whole graph of essentially Python functions that we need to run one after the other: first you pull the clip, then you do some cleaning, then you do some network inference, then another network inference, until you finally get this. But you need to do this at large scale, so I tell them we probably need to shoot for, you know, 100,000 clips per day, or like 100,000 items, that seems good. And so the engineers say, well, with a bit of Postgres and a bit of elbow grease we can do it. Fast forward a bit and we're now doing 20 million of these functions every single day. Again, we pull in around half a million clips, and on those we run a ton of functions, each of them in a streaming fashion. So that's kind of the back-end infra that's also needed to run not just training but also auto labeling. Yeah, it really is like a factory that produces labels, and like production lines, yield, quality, inventory, all of the same concepts that apply to the factory for our cars apply to this label factory. That's right.
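A minimal caricature of that "graph of Python functions" is sketched below: a fixed chain of stages applied to every clip in a streaming fashion, fanned out across workers. The real system is a tracked DAG with retries, retention policies and hundreds of thousands of function calls per second; the stage names here are invented.

```python
from concurrent.futures import ThreadPoolExecutor

# Invented stand-ins for the real pipeline stages.
def pull_clip(clip_id):        return {"id": clip_id, "raw": f"video://{clip_id}"}
def clean(clip):               clip["clean"] = True;            return clip
def run_network_a(clip):       clip["detections"] = [];         return clip
def run_network_b(clip):       clip["reconstruction"] = None;   return clip
def write_labels(clip):        return f"labels for clip {clip['id']}"

PIPELINE = (pull_clip, clean, run_network_a, run_network_b, write_labels)

def process(clip_id):
    result = clip_id
    for stage in PIPELINE:          # one clip flows through the whole chain
        result = stage(result)
    return result

if __name__ == "__main__":
    clip_ids = range(8)             # stand-in for ~half a million clips per day
    with ThreadPoolExecutor(max_workers=4) as pool:
        for label in pool.map(process, clip_ids):
            print(label)
```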
Okay, thanks. So, concluding this section, I'd like to share a few more examples that are challenging and interesting for the network, and probably even for humans. From the top, there are examples of missing lane lines, a foggy night, a roundabout, heavy occlusions by parked cars, and even a rainy night with raindrops on the camera lenses. These are challenging, but once their original scenes are fully reconstructed from other clips, all of them can be auto-labeled, so that our cars can drive even better through these challenging scenarios. So now let me pass the mic to David to learn more about how sim is creating a new world on top of these labels. Thank you.
Thank you. Again, my name is David and I'm going to talk about simulation. Simulation plays a critical role in providing data that is difficult to source and/or hard to label; however, 3D scenes are notoriously slow to produce. Take for example the simulated scene playing behind me, a complex intersection from Market Street in San Francisco: it would take two weeks for artists to complete, and for us that is painfully slow. However, I'm going to talk about using Jurgen's automated ground truth labels, along with some brand new tooling, to procedurally generate this scene and many like it in just five minutes. That's an amazing thousand times faster than before. So let's dive into how a scene like this is created. We start by piping the automated ground truth labels into our simulated world creator tooling inside the software Houdini. Starting with road boundary labels, we can generate a solid road mesh and re-topologize it with the lane graph labels; this helps inform important road details like crossroad slope and detailed material blending. Next we can use the lane line data to sweep geometry across its surface and project it onto the road, creating lane paint decals. Next, using median edges, we can spawn
island geometry and populate it with randomized foliage, which drastically changes the visibility of the scene. Now the outside world can be generated through a series of randomized heuristics: modular building generators create visual obstructions, while randomly placed objects like hydrants can change the color of the curbs, and trees can drop leaves below them, obscuring lines or edges. Next we can bring in map data to inform the positions of things like traffic lights or stop signs; we can trace along their normals to collect important information like the number of lanes and even get accurate street names on the signs themselves. Next, using the lane graph, we can determine lane connectivity and spawn directional road markings on the road and their accompanying road signs. And finally, with the lane graph itself we can determine lane adjacency and other useful metrics to spawn randomized traffic permutations inside our simulator. Again, this is all automatic, with no artists in the loop, and happens within minutes. And now this sets us up to do
some pretty cool things. Since everything is based on data and heuristics, we can start to fuzz parameters to create visual variations of a single ground truth. It can be as subtle as object placement and random material swapping, or as drastic as entirely new biomes or environment types like urban, suburban, or rural. This allows us to create infinite targeted permutations for the specific ground truths that we need more data for, and all of this happens with the click of a button. We can even take this one step further by altering the ground truth itself: say John wants his network to pay more attention to the directional road markings to better detect an upcoming captive left turn lane. We can procedurally alter our lane graph inside the simulator to create entirely new flows through this intersection, to help focus the network's attention on the road markings and create more accurate predictions. And this is a great example of how this tooling allows us to create new data that could never be collected from the real world.
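"Fuzzing parameters" is easy to picture as code. The sketch below mints appearance variations of one scene while leaving its ground-truth labels untouched; the parameter names, ranges and file paths are all invented for illustration.

```python
import random

BIOMES = ["urban", "suburban", "rural"]

def fuzz_scene(base_scene, seed):
    """Vary appearance, keep the labels: the lane graph and road mesh stay fixed."""
    rng = random.Random(seed)
    scene = dict(base_scene)                      # ground-truth references carried over as-is
    scene["biome"] = rng.choice(BIOMES)
    scene["foliage_density"] = rng.uniform(0.0, 1.0)
    scene["curb_color"] = rng.choice(["grey", "red", "yellow"])
    scene["parked_car_count"] = rng.randint(0, 12)
    scene["leaf_litter"] = rng.random() < 0.3     # occasionally obscure lane lines
    return scene

base = {"lane_graph": "labels/intersection_0001.json",
        "road_mesh": "meshes/intersection_0001.obj"}     # hypothetical paths
variations = [fuzz_scene(base, seed) for seed in range(1000)]   # 1,000 permutations in one go
```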
And the true power of this tool is in its architecture and how we can run all tasks in parallel to infinitely scale. So you saw the tile creator tool in action, converting the ground truth labels into their simulated counterparts. Next we can use our tile extractor tool to divide this data into geohash tiles, about 150 meters square in size. We then save out that data into separate geometry and instance files; this gives us a clean source of data that's easy to load and allows us to be rendering-engine agnostic for the future.
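The tiling and neighborhood lookup can be illustrated with a much simpler scheme than a real geohash; below is a sketch using a plain metric grid of roughly the quoted tile size, with arbitrary example coordinates, just to show how a 3x3 or 5x5 block around a hotspot would be resolved.

```python
import math

TILE_METERS = 150.0   # roughly the tile size quoted above

def tile_id(lat, lon):
    """Simplified square tiling used here in place of a real geohash."""
    meters_per_deg_lat = 111_320.0
    meters_per_deg_lon = 111_320.0 * math.cos(math.radians(lat))
    return (int(lat * meters_per_deg_lat // TILE_METERS),
            int(lon * meters_per_deg_lon // TILE_METERS))

def neighborhood(center_id, radius=1):
    """IDs for a (2*radius+1)^2 block of tiles, e.g. 3x3 or 5x5 around a hotspot."""
    cx, cy = center_id
    return [(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)]

# e.g. load the 5x5 block of cached tiles around a fleet hotspot (coordinates arbitrary)
center = tile_id(37.7936, -122.3952)
tiles_to_load = neighborhood(center, radius=2)
```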
Then, using a tile loader tool, we can summon any number of those cached tiles using a geohash ID. Currently we're doing about five-by-five or three-by-three tile sets, usually centered around fleet hotspots or interesting lane graph locations. The tile loader also converts these tile sets into uassets for consumption by Unreal Engine and gives you the finished product from what you saw in the first slide.
And this really sets us up for size and scale. As you can see on the map behind us, we can easily generate most of San Francisco's city streets, and this didn't take years or even months of work, but rather two weeks by one person. We can continue to manage and grow all this data using our PDG network inside the tooling; this allows us to throw compute at it and regenerate all these tile sets overnight, which ensures all environments are of consistent quality and features. That is super important for training, since new ontologies and signals are constantly released. And now, to come full circle: because we generated all these tile sets from ground truth data that contains all the weird intricacies of the real world, we can combine that with the procedural visual and traffic variety to create limitless targeted data for the network to learn from. And that concludes the sim section; I'll pass it to Kate to talk about how we can use all this data to improve Autopilot. Thank you.
thanks David hi everyone my name is Kate Park and I'm here to talk about the data engine which is the process by which we
improve our neural networks via data we're going to show you how we deterministically solve interventions
via data and walk you through the life of this particular clip in this scenario
autopilot is approaching a turn and incorrectly predicts that Crossing vehicle as stopped for traffic and thus
a vehicle that we would slow down for in reality there's nobody in the car it's just awkwardly parked we've built this
tooling to identify the mispredictions correct the label and categorize this
clip into an evaluation set this particular clip happens to be one of 126
that we've diagnosed as challenging parked cars at turns because of this
infra we can curate this evaluation set without any engineering resources custom
to this particular challenge case to actually solve that challenge case
requires mining thousands of examples like it and it's something Tesla can trivially do we simply use our data
sourcing infra request data and use the tooling shown previously to correct the
labels. By surgically targeting the mispredictions of the current model, we're only adding the most valuable examples to our training set. We surgically fixed 13,900 clips, and because those were examples where the current model struggles, we don't even need to change the model architecture: a simple weight update with this new valuable data is enough to solve the challenge case. So you see we no longer predict that crossing vehicle as stopped, shown in orange, but as parked, shown in red.
In academia we often see that people keep data constant, but at Tesla it's very much the opposite: we see time and again that data is one of the best, if not the most deterministic, levers for solving these interventions. We just showed you the data engine loop for one challenge case, namely these parked cars at turns, but there are many challenge cases even for the one signal of vehicle movement. We apply this data engine loop to every single challenge case we've diagnosed, whether it's buses, curvy roads, stopped vehicles, or parking lots, and we don't just add data once; we do this again and again to perfect the semantic. In fact, this year we updated our vehicle movement signal five times, and with every weight update trained on the new data we pushed our vehicle movement accuracy up and up.
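The loop itself is simple enough to write down. The sketch below is pseudocode in runnable form; `model`, `fleet` and `fix_label` are placeholders for tooling and infrastructure that isn't public, and the proportions are invented.

```python
# Sketch of the data-engine loop described above (all collaborators are stand-ins).
def data_engine_iteration(model, fleet, fix_label, train_set, eval_set, challenge):
    # 1. Diagnose: collect clips where the current model mispredicts this case.
    mispredicted = [clip for clip in fleet.sample(challenge) if model.misbehaves(clip)]

    # 2. Curate: correct the labels and grow a dedicated evaluation set.
    corrected = [fix_label(clip) for clip in mispredicted]
    eval_set.extend(corrected)

    # 3. Mine: request thousands of similar examples from the fleet, correct those too.
    mined = [fix_label(clip) for clip in fleet.request_similar(challenge, n=10_000)]

    # 4. Retrain: same architecture, just a weight update on the newly added data.
    train_set.extend(mined)
    model.finetune(train_set)

    # 5. Evaluate; repeat until this challenge case stops causing interventions.
    return model.accuracy(eval_set)
```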
This data engine framework applies to all our signals, whether they're 3D or multi-cam video, whether the data is human-labeled, auto-labeled or simulated, and whether it's an offline model or an online model. And Tesla is able to do this at scale because of the fleet advantage, the infra that our engineering team has built, and the labeling resources that feed our networks. To train on all this data we need a massive amount of compute, so I'll hand it off to Pete and Ganesh to talk about the Dojo supercomputing platform. Thank you. [Applause]
Thank you, thank you Kate. Thanks everybody, thanks for hanging in there, we're almost there. My name is Pete Bannon, I run the custom silicon and low-voltage teams at Tesla. And my name is Ganesh Venkataramanan, I run the Dojo program.
[Applause] Thank you. I'm frequently asked why a car company is building a supercomputer for training, and this question fundamentally misunderstands the nature of Tesla. At its heart, Tesla is a hardcore technology company. All across the company, people are working hard in science and engineering to advance the fundamental understanding and methods that we have available to build cars, energy solutions, robots, and anything else we can do to improve the human condition around the world. It's a super exciting thing to be a part of, and it's a privilege to run a very
small piece of it in the semiconductor group tonight we're going to talk a little bit about dojo and give you an
update on what we've been able to do over the last year but before we do that I wanted to give a little bit of
background on the initial design that we started a few years ago when we got started the goal was to provide a
substantial Improvement to the training latency for our autopilot team some of
the largest neural networks they trained today run for over a month which inhibits their ability to rapidly
explore Alternatives and evaluate them so you know a 30X speed up would be
really nice if we could provide it at a cost competitive and energy competitive way
To do that, we wanted to build a chip with a lot of arithmetic units that we could utilize at very high efficiency, and we spent a lot of time studying whether we could do that using DRAM and various packaging ideas, all of which failed. In the end, even though it felt like an unnatural act, we decided to reject DRAM as the primary storage medium for this system and instead focus on SRAM embedded in the chip. SRAM provides, unfortunately, a modest amount of capacity, but extremely high bandwidth and very low latency, and that enables us to achieve high utilization of the arithmetic units. That particular choice led to a whole bunch of other choices. For example, if you want to have virtual memory you need page tables, and they take up a lot of space; we didn't have space, so no virtual memory. We also don't have interrupts: the accelerator is a bare-bones, raw piece of hardware that's presented to a compiler, and the compiler is responsible for scheduling everything that happens in a deterministic way, so there's no need or even desire for interrupts in the system. We also chose to pursue model parallelism as a training methodology, which is not typical; most machines today use data parallelism, which consumes additional memory capacity, which we obviously don't have.
So all of those choices led us to build a machine that is pretty radically different from what's available today. We also had a whole bunch of other goals, and one of the most important was "no limits": we wanted to build a compute fabric that would scale in an unbounded way, for the most part. I mean, obviously there are physical limits now and then, but pretty much, if your model was too big for the computer, you'd just go buy a bigger computer; that's what we were looking for. Today, the way machines are packaged, there's a pretty fixed ratio of, for example, GPUs, CPUs, DRAM capacity and network capacity, and we really wanted to disaggregate all that, so that as models evolved we could vary the ratios of those various elements and make the system more flexible to meet the needs of the Autopilot team. Yeah, and it's so true, the no-limits philosophy was our guiding star
all the way; all of our choices were centered around that, to the point that we didn't want traditional data center infrastructure to limit our capacity to execute these programs at speed. That's why we vertically integrated our entire data center. By doing a vertical integration of the data center, we could extract new levels of efficiency; we could optimize power delivery, cooling, as well as system management across the whole data center stack, rather than doing it box by box and integrating those boxes into data centers. And to do this, we also wanted to integrate early, to figure out the limits of scale for our software workloads, so we integrated the Dojo environment into our Autopilot software very early, and we learned a lot of lessons. Today, Bill Chang will go over our hardware update as well as some of the challenges that we faced along the way, and Rajiv Kurian will give you a glimpse of our compiler technology as well as go over some of our cool results. There you go.
Thanks Pete, thanks Ganesh. I'll start tonight with a high-level vision of our system that will help set the stage for the challenges and the problems we're solving, and then also how software will leverage this for performance. Our vision for Dojo is to build a single unified accelerator, a very large one: software would see a seamless compute plane with globally addressable, very fast memory, all connected together with uniform high bandwidth and low latency. Now, to realize this we need to use density to achieve performance, and we leverage technology to get this density in order to break levels of hierarchy all the way from the chip to the scaled-out systems. Silicon technology has done this for decades: chips have followed Moore's law for density and integration to get performance scaling. A key step in realizing that vision was our training tile: not only can we integrate 25 dies at extremely high bandwidth, but we can scale that to any number of additional tiles by just connecting them together.
now last year we showcased our first functional training tile and at that time we already had workloads running on
it and since then the team here has been working hard and diligently to deploy
this at scale now we've made amazing progress and had a lot of Milestones along the way and of
course we've had a lot of unexpected challenges but this is where our fail fast
philosophy has allowed us to push our boundaries
Pushing density for performance presents all-new challenges. One area is power delivery: here we need to deliver power to our compute die, and this directly impacts our top-line compute performance, but we need to do it at unprecedented density; we need to be able to match our die pitch with a power density of almost one amp per millimeter squared. Because of the extreme integration, this needs to be a multi-tiered, vertical power solution, and because there's a complex, heterogeneous material stack-up, we have to carefully manage the material transitions, especially CTE. Why does the coefficient of thermal expansion matter in this case? CTE is a fundamental material property, and if it's not carefully managed, that stack-up would literally rip itself apart. We started this effort by working with vendors to develop this power solution, but we realized that we actually had to develop it in-house. To balance schedule and risk, we built quick iterations to support both our system bring-up and software development, and also to find the optimal design and stack-up that would meet our final production goals. In the end we were able to reduce CTE by over 50 percent and improve our performance by 3x over our initial version.
now needless to say finding this optimal material stack up while maximizing
performance at density is extremely difficult
We did have unexpected challenges along the way. Here's an example where pushing the boundaries of integration led to component failures. This started when we scaled up to larger and longer workloads, and intermittently a single site on a tile would fail. They started out as recoverable failures, but as we pushed to higher and higher power, these became permanent failures. To understand this failure, you have to understand why and how we build our power modules. Solving density at every level is the cornerstone of actually achieving our system performance; because our X-Y plane is used for high-bandwidth communication, everything else must be stacked vertically. This means all components other than our die must be integrated into our power modules, and that includes our clock, our power supplies, and also our system controllers. In this case the failures were due to losing clock output from our oscillators, and after an extensive debug we found that the root cause was vibration on the module from piezoelectric effects of nearby capacitors. Singing caps are not a new phenomenon, and in fact are very common in power design, but normally clock chips are placed in a very quiet area of the board and are often not affected by power circuits. Because we needed to achieve this level of integration, these oscillators had to be placed in very close proximity, and due to our switching frequency and the vibration resonance it created, out-of-plane vibration on our MEMS oscillator caused it to crack. The solution to this problem is a multi-pronged approach: we can reduce the vibration by using soft-terminal caps, we can update our MEMS part with a lower Q factor in the out-of-plane direction, and we can also update our switching frequency to push the resonance further away from these sensitive bands. In addition to the density at
the system level, we've been making a lot of progress at the infrastructure level. We knew that we had to re-examine every aspect of the data center infrastructure in order to support our unprecedented power and cooling density. We brought in a fully custom-designed CDU to support Dojo's dense cooling requirements, and the amazing part is we're able to do this at a fraction of the cost versus buying off the shelf and modifying it. And since our Dojo cabinet integrates enough power and cooling to match an entire row of standard IT racks, we needed to carefully design our cabinet and infrastructure together; we've already gone through several iterations of this cabinet to optimize it. Earlier this year we started load-testing our power and cooling infrastructure, and we were able to push it over two megawatts before we tripped our substation and got a call from the city. Now, last year we introduced only a couple of components of our system, the custom D1 die and the training tile, but we teased the ExaPOD as our end goal. We'll walk through the remaining parts of our system that are required to build
out this ExaPOD. The system tray is a key part of realizing our vision of a single accelerator: it enables us to seamlessly connect tiles together, not only within the cabinet but between cabinets. We can connect these tiles at very tight spacing across the entire accelerator, and this is how we achieve our uniform communication. This is a laminated bus bar that allows us to integrate very high power with mechanical and thermal support in an extremely dense integration; it's 75 millimeters in height and supports six tiles at 135 kilograms, the equivalent of three to four fully loaded high-performance racks. Next, we need to feed data to the training tiles. This is where we've developed the Dojo Interface Processor: it provides our system with high-bandwidth DRAM to stage our training data, and it provides full memory bandwidth to our training tiles using TTP, our custom