Tesla CEO Elon Musk recently unveiled the company's Tesla Bot. The robot, code-named Optimus, shuffled across a stage, waved its hand, and pumped its arms in a slow-speed dance move. Musk predicts the robot could cost $20,000 within three to five years if all goes according to plan. But the question is, what can it do for us? Before we get into that, let's look at the main devices that drive the Tesla Bot.
Tesla Bot Actuators
Actuators are the main drive system for any robot. You could say a robot is nothing more than a PC with moving parts, or in other words, a robot is a PC with actuators and sensors. Tesla has developed its own actuators for the Bot: it uses three types of rotary actuators and three types of linear actuators.

If you are wondering why Tesla didn't use standardized linear actuators like the FIRGELLI actuator, it's because they have several constraints that mean they have to develop their own systems: the robot ultimately needs to be lightweight, power efficient, high in power density, and low cost. Tesla has claimed it wants the Bot to retail for $20,000. That in itself is a tall order for something that's going to require 28 actuators, a powerful PC, lots of sensors, a battery pack that lasts more than a few hours, plus a strong skeleton to hold it all together.
Tesla Bot Linear Actuators

The linear actuators Tesla developed are highly tailored to a specific role, which means they would not be of much use in any application other than a robot. Tesla calls the drive a planetary roller screw, which is essentially a close cousin of the ball-screw/leadscrew design, and instead of a traditional brushed motor they use a brushless motor core. The roller-screw design is very efficient and uses less power, but it is also more expensive. The brushless drive means the lifespan will be significantly longer, and it allows highly specific drive modes controlled by software.

The length of travel is only about 2 inches, and as the picture showed, the actuator lifted a piano at 500 kg, which is a lot of weight. You may wonder why it needs to lift so much. That's because when the actuator is installed in a metal skeleton, its short travel has to be amplified into the stroke of whatever it's moving. If it's moving the leg of a robot, the leg needs to swing through roughly 150 degrees, so over a 2-foot limb the foot traces an arc of around 3 feet. The human body, which has evolved over hundreds of thousands of years, lets us do this with our leg muscles, but getting a linear actuator to do it is no easy task. The point is that even though the actuator can lift 500 kg over 2 inches, once it is connected to a lever, the output force drops significantly, depending on the leverage ratio, while the speed increases, which makes for a useful trade-off.
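To make that trade-off concrete, here is a minimal sketch, with purely illustrative numbers rather than Tesla's actual geometry, of how a short, high-force linear stroke turns into a long, lower-force swing at the end of a limb:

```python
import math

# Hypothetical numbers for illustration -- not Tesla's actual specifications.
actuator_force_n = 500 * 9.81     # ~500 kg lift capability expressed as force (N)
actuator_stroke_m = 0.05          # ~2 inches of linear travel (m)
moment_arm_m = 0.04               # lever arm where the actuator attaches near the joint (m)
limb_length_m = 0.6               # ~2 ft limb the joint has to swing (m)

# Torque the actuator can apply about the joint (simplified, ignoring angle effects).
joint_torque_nm = actuator_force_n * moment_arm_m

# Force available at the end of the limb, and the arc that the limb tip traces.
tip_force_n = joint_torque_nm / limb_length_m
joint_swing_rad = actuator_stroke_m / moment_arm_m      # crude small-angle approximation
tip_travel_m = joint_swing_rad * limb_length_m

print(f"Force at limb tip: {tip_force_n:.0f} N "
      f"(vs {actuator_force_n:.0f} N at the actuator)")
print(f"Tip travel: {tip_travel_m:.2f} m from a {actuator_stroke_m:.2f} m stroke")
```

The leverage ratio trades force for speed: the farther the load sits from the joint relative to the actuator's attachment point, the less force but the more travel and velocity you get at the tip.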
Tesla Bot Presentation
Here is what Tesla themselves had to say at the Bot presentation they gave on September 30th, 2022.
Elon Musk: We've got some really exciting things to show you; I think you'll be pretty impressed. I do want to set some expectations with respect to our Optimus robot. As you know, last year it was just a person in a robot suit, but we've come a long way, and compared to that, I think it's going to be very impressive. We're going to talk about the advancements in AI for full self-driving, as well as how they apply more generally to real-world AI problems like a humanoid robot, and even beyond that. I think there's some potential that what we're doing here at Tesla could make a meaningful contribution to AGI, and I think Tesla is actually a good entity to do it from a governance standpoint, because we're a publicly traded company with one class of stock, and that means that the public controls Tesla. I think that's actually a good thing: if I go crazy, you can fire me. This is important; maybe I'm not crazy, I don't know. So we're going to talk a lot about our progress in AI and Autopilot, as well as the progress with Dojo, and then we're going to bring the team out and do a long Q&A so you can ask tough questions, whatever you'd like — existential questions, technical questions. We want to have as much time for Q&A as possible, so with that, let's get into it.
Milan: Hey guys, I'm Milan, I work on Autopilot.

Lizzie: And I'm Lizzie, a mechanical engineer on the project as well. Okay, so should we bring out the Bot? This was the first time we tried the robot without any backup support — cranes, mechanical mechanisms, no cables, nothing. Let's see — ready? Let's go. The bot uses the same self-driving computer that runs in your Tesla cars, by the way, and this was literally the first time the robot has operated without a tether — it was on stage tonight. The robot can actually do a lot more than we just showed you; we just didn't want it to fall on its face. So we'll show you some videos now of the robot doing a bunch of other things, which are less risky.
We wanted to show a little bit more of what we've done over the past few months with the bot — just walking around and dancing on stage. Humble beginnings, but you can see the Autopilot neural networks running as-is, just retrained for the bot directly on that new platform. That's my watering can. You can see a rendered view: that's the world the robot sees, and it's very clearly identifying objects — this is the object it should pick up — and picking it up. We use the same process as we did for Autopilot: collect data and train neural networks that we then deploy on the robot. That's an example that illustrates the upper body a little bit more, something that we'll try to nail down over the next few months, I would say, to perfection. And this is really an actual station in the Fremont factory that it's working at.
Elon Musk: That's not the only thing we have to show today. What you saw was what we call Bumble C — that's our sort of rough development robot using semi-off-the-shelf actuators — but we've actually gone a step further than that already. The team has done an incredible job, and we actually have an Optimus bot with fully Tesla-designed actuators, battery pack, control system, everything. It wasn't quite ready to walk, but I think it will walk in a few weeks. We wanted to show you the robot — something that's actually fairly close to what will go into production — and show you all the things it can do. So let's bring it out.
What we expect to have in the Optimus production unit one is the ability to move all the fingers independently and move the thumb with two degrees of freedom, so it has opposable thumbs on both left and right hands and is able to operate tools and do useful things. Our goal is to make a useful humanoid robot as quickly as possible, and we've designed it using the same discipline that we use in designing the car, which is to say designing it for manufacturing, such that it's possible to make the robot in high volume at low cost with high reliability. That's incredibly important. You've all seen very impressive humanoid robot demonstrations, and that's great, but what are they missing? They're missing a brain — they don't have the intelligence to navigate the world by themselves — and they're also very expensive and made in low volume. Optimus, by contrast, is designed to be an extremely capable robot made in very high volume, probably ultimately millions of units, and it is expected to cost much less than a car.
I would say probably less than twenty thousand dollars would be my guess. The potential for Optimus is, I think, appreciated by very few people. As usual, Tesla demos are coming in hot. The team has put in an incredible amount of work, working seven days a week, burning the 3 a.m. oil, to get to the demonstration today. I'm super proud of what they've done — they've really done a great job — and I'd just like to give a hand to the whole Optimus team. There's still a lot of work to be done to refine Optimus and improve it; obviously this is just Optimus version one. That's really why we're holding this event: to convince some of the most talented people in the world, like you guys, to join Tesla and help make it a reality and bring it to fruition at scale, such that it can help millions of people. The potential really boggles the mind, because you have to ask: what is an economy? An economy is sort of productive entities times productivity — capital times output per capita. At the point at which there is not a limitation on capital, it's not clear what an economy even means; an economy becomes quasi-infinite. Taken to fruition, in the hopefully benign scenario, this means a future of abundance, a future where there is no poverty, where you can have whatever you want in terms of products and services. It really is a fundamental transformation of civilization as we know it. Obviously we want to make sure that transformation is a positive one, and safe. That's also why I think Tesla as an entity doing this — being a single class of stock, publicly traded, owned by the public — is very important and should not be overlooked. I think this is essential, because if the public doesn't like what Tesla's doing, the public can buy shares in Tesla and vote differently.

This is a big deal. It's very important that I can't just do what I want; sometimes people think that, but it's not true. It's very important that the corporate entity that makes this happen is something that the public can properly influence, and so I think the Tesla structure is ideal for that. Like I said, self-driving cars will certainly have a tremendous impact on the world. I think they will improve the productivity of transport by at least a half order of magnitude, perhaps an order of magnitude, perhaps more. Optimus, I think, has maybe a two order of magnitude potential improvement in economic output — it's not clear what the limit actually even is. But we need to do this in the right way; we need to do it carefully and safely, and ensure that the outcome is one that is beneficial to civilization and one that humanity wants. This is extremely important, obviously. I hope you will consider joining Tesla to achieve those goals. At Tesla we really care about doing the right thing; we always aspire to do the right thing and really not pave the road to hell with good intentions — I think the road to hell is mostly paved with bad intentions, but every now and again there's a good intention in there. So we want to do the right thing. Consider joining us and helping make it happen. With that, let's move on to the next phase. Right on, thank you Elon.
All right, so you've seen a couple of robots today; let's do a quick timeline recap. Last year we unveiled the Tesla Bot concept, but a concept doesn't get us very far. We knew we needed a real development and integration platform to get real-life learnings as quickly as possible. So that robot that came out and did the little routine for you guys — we had that built within six months, and we've been working on software integration and hardware upgrades in the months since then. In parallel, we've also been designing the next generation, this one over here. This guy is rooted in the foundation of the vehicle design process; we're leveraging all of those learnings that we already have.

Obviously there's a lot that's changed since last year, but there are a few things that are still the same. You'll notice we still have this really detailed focus on the true human form. We think that matters for a few reasons, but it's fun — we spend a lot of time thinking about how amazing the human body is. We have this incredible range of motion and typically really amazing strength. A fun exercise: if you put your fingertip on the chair in front of you, you'll notice there's a huge range of motion you have in your shoulder and your elbow, for example, without moving your fingertip — you can move those joints all over the place. But the robot's main function is to do real, useful work, and it maybe doesn't necessarily need all of those degrees of freedom right away. So we've stripped it down to a minimum of 28 fundamental degrees of freedom, and then of course our hands in addition to that.

Humans are also pretty efficient at some things and not so efficient at others. For example, we can eat a small amount of food and sustain ourselves for several hours — that's great — but when we're just kind of sitting around, no offense, we're kind of inefficient; we're just sort of burning energy. So on the robot platform, what we're going to do is minimize that idle power consumption, drop it as low as possible, so that we can just flip a switch and immediately the robot turns into something that does useful work.
So let's talk about this latest generation in some detail, shall we? On the screen here you'll see in orange our actuators, which we'll get to in a little bit, and in blue our electrical system. Now that we have our human-based research and our first development platform, we have both research and execution to draw from for this design. Again, we're using that vehicle design foundation, so we're taking it from concept through design and analysis, and then build and validation. Along the way we're going to optimize for things like cost and efficiency, because those are critical metrics for taking this product to scale eventually. How are we going to do that? Well, we're going to reduce our part count and the power consumption of every element possible. We're going to do things like reduce the sensing and the wiring at our extremities — you can imagine a lot of mass in your hands and feet is going to be quite difficult and power-consumptive to move around — and we're going to centralize both our power distribution and our compute to the physical center of the platform.
So in the middle of our torso — actually it is the torso — we have our battery pack, sized at 2.3 kilowatt-hours, which is perfect for about a full day's worth of work. What's really unique about this battery pack is that it has all of the battery electronics integrated into a single PCB within the pack, so everything from sensing to fusing, charge management, and power distribution is all in one place. We're also leveraging both our vehicle products and our energy products to roll all of those key features into this battery: streamlined manufacturing, really efficient and simple cooling methods, battery management, and also safety. And of course we can leverage Tesla's existing infrastructure and supply chain to make it.
Going on to our brain — it's not in the head, but it's pretty close. Also in our torso we have our central computer. As you know, Tesla already ships full self-driving computers in every vehicle we produce. We want to leverage both the Autopilot hardware and software for the humanoid platform, but because it's different in requirements and in form factor, we're going to change a few things first. It's still going to do everything that a human brain does: processing vision data, making split-second decisions based on multiple sensory inputs, and also communications. To support communications, it's equipped with wireless connectivity as well as audio support, and it also has hardware-level security features, which are important to protect both the robot and the people around the robot. So now that we have our core, we're going to need some limbs on this guy, and we'd love to show you a little bit about our actuators and our fully functional hands as well. But before we do that, I'd like to introduce Malcolm, who's going to speak a little bit about our structural foundation for the robot. [Applause]
Malcolm: Thank you. Tesla has the capability to analyze highly complex systems — it doesn't get much more complex than a crash. You can see here a simulated crash of a Model 3 superimposed on top of the actual physical crash; it's actually incredible how accurate it is. Just to give you an idea of the complexity of this model: it includes every nut, bolt, and washer, every spot weld, and it has 35 million degrees of freedom. It's quite amazing, and it's true to say that if we didn't have models like this, we wouldn't be able to make the safest cars in the world.

So can we utilize our capabilities and our methods from the automotive side for a robot? Well, we can make a model, and since we had crash software, we used the same software here — we can make it fall down. The purpose of this is to make sure that if it falls down (ideally it doesn't), there is only superficial damage. We don't want it to, for example, break the gearbox in its arms — that's the equivalent of a dislocated shoulder for a robot — difficult and expensive to fix. We want it to dust itself off and get on with the job it's been given.
We can also take the same model and drive the actuators using the input from a previously solved model, bringing it to life. This produces the motions for the tasks we want the robot to do — picking up boxes, turning, squatting, walking upstairs, whatever the set of tasks is — and we can play them through the model. This is showing just simple walking. We can compute the stresses in all the components, and that helps us to optimize the components.

These are not dancing robots; they are actually the modal behavior — the first five modes — of the robot. Typically when people make robots, they make sure the first mode is up around single figures, up towards 10 hertz. The reason they do this is to make the control of walking easier: it's very difficult to walk if you can't guarantee where your foot is because it's wobbling around. That's okay if you're making one robot, but we want to make thousands, maybe millions. We haven't got the luxury of making them from carbon fiber and titanium; we want to make them in plastic, and things are not quite so stiff, so we can't have these high targets — I'll call them dumb targets — we've got to make it work at lower targets. So is that going to work? Well, if you think about it — sorry about this — we're just bags of soggy jelly with bones thrown in. We're not high frequency: if I stand on one leg, I don't vibrate at 10 hertz. People operate at low frequency, so we know the robot can too; it just makes control harder. So we take the information from this — the modal data and the stiffness — and feed it into the control system, and that allows it to walk.
Changing tack slightly and looking at the knee, we can take some inspiration from biology and look at what the mechanical advantage of the knee is. It turns out it behaves quite similarly to a four-bar link, and that's quite non-linear. That's not surprising, really, because when you bend your leg down, the torque on your knee is much higher when it's bent than when it's straight, so you'd expect a non-linear function — and in fact the biology is non-linear, and this matches it quite accurately. The four-bar link is obviously not physically a four-bar link, as I said; the characteristics are similar.

Me bending down isn't very scientific, so let's be a bit more scientific. We've played all the tasks through this graph — walking, squatting, the tasks I mentioned — and it shows the torque seen at the knee against the knee bend on the horizontal axis. This is the requirement for the knee to do all these tasks, and then we put a curve through it, surfing over the top of the peaks; that says, this is what's required to make the robot do these tasks.

If we look at the four-bar link — that's the green curve — it shows that the non-linearity of the four-bar link actually linearizes the characteristic of the force. What that really means is that it lowers the force, which lets the actuator get away with the lowest possible force, which is the most efficient; we want to burn energy slowly. What's the blue curve? The blue curve is what you would get if we didn't have a four-bar link and just had an arm sticking out of the leg with an actuator on it — a simple two-bar link. That's the best you could do with a simple two-bar link, and it would create much more force in the actuator, which would not be efficient.

So what does that look like in practice? As you'll see, it's very tightly packaged in the knee. It will go transparent in a second, and you'll see the four-bar link there operating on the actuator; this determines the force and the displacements on the actuator.
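As a rough illustration of the idea — the geometry and numbers below are hypothetical placeholders, not Tesla's actual linkage — you can see how a knee torque requirement maps to a required actuator force through an angle-dependent mechanical advantage, and how a more favorable linkage lowers the peak force:

```python
import numpy as np

# Hypothetical knee torque requirement vs. knee bend angle (Nm) -- illustration only.
knee_angle_deg = np.linspace(0, 120, 121)
knee_torque_nm = 50 + 2.0 * knee_angle_deg           # torque grows as the knee bends

# Mechanical advantage = effective moment arm (m) of the actuator about the knee,
# i.e. how many Nm of knee torque you get per N of actuator force.
# A simple "two-bar" arm has a roughly constant moment arm; a four-bar linkage can be
# designed so the moment arm grows as the knee bends, flattening the force requirement.
two_bar_arm_m = np.full_like(knee_angle_deg, 0.04)
four_bar_arm_m = 0.03 + 0.0004 * knee_angle_deg      # grows from 30 mm to 78 mm

force_two_bar = knee_torque_nm / two_bar_arm_m
force_four_bar = knee_torque_nm / four_bar_arm_m

print(f"Peak actuator force, two-bar:  {force_two_bar.max():.0f} N")
print(f"Peak actuator force, four-bar: {force_four_bar.max():.0f} N")
```

The lower and flatter the required force curve, the smaller and more efficient the actuator can be, which is the point of the green curve in the presentation.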
And now I'll pass you over to Konstantinos.

Konstantinos: I would like to talk to you about the design process and the actuator portfolio in our robot. There are many similarities between a car and the robot when it comes to powertrain design; the most important things that matter here are energy, mass, and cost, and we are carrying over most of our design experience from the car to the robot. In the car's case, you have a vehicle with two drive units, and the drive units are used to accelerate the car — a 0 to 60 mph time — or to drive a city drive cycle. The robot, on the other hand, has 28 actuators, and it's not obvious what the tasks are at the actuator level. We have higher-level tasks, like walking, climbing stairs, or carrying a heavy object, which need to be translated into joint specs. Therefore we use our model to generate the torque-speed trajectories for our joints, which are subsequently fed into our optimization model and run through the optimization process.
This is one of the scenarios the robot is capable of doing — turning and walking. When we have this torque-speed trajectory, we lay it over the efficiency map of an actuator, and along the trajectory we can generate the power consumption and the cumulative energy for the task versus time. This allows us to define the system cost for that particular actuator and put a single point into the cloud. Then we do this for hundreds of thousands of actuator designs by solving on our cluster, and the red line denotes the Pareto front, which is the preferred region where we look for an optimum; the X denotes the preferred actuator design we picked for this particular joint. Now we need to do this for every joint — we have 28 joints to optimize — so we parse our cloud again for every joint spec, and the red X's this time denote the bespoke actuator designs for every joint. The problem here is that we have too many unique actuator designs, and even if we take advantage of symmetry, there are still too many to make something mass-manufacturable. We need to reduce the number of unique actuator designs, so we run a commonality study, in which we parse our cloud again, looking this time for actuators that simultaneously meet the joint performance requirements of more than one joint at the same time. The resulting portfolio is six actuators, shown in the color map in the middle figure.
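A minimal sketch of that selection idea — with made-up candidate data and costs, not Tesla's actual optimization — might look like this: score each candidate actuator, keep the Pareto-efficient ones, then greedily pick designs that cover as many joints as possible:

```python
import random

random.seed(0)

# Hypothetical candidate actuators: (mass_kg, energy_per_task_J, peak_torque_Nm).
candidates = [(random.uniform(1, 4), random.uniform(50, 300), random.uniform(20, 200))
              for _ in range(1000)]

# Hypothetical per-joint torque requirements for a handful of joints.
joint_requirements = {"hip": 180, "knee": 150, "ankle": 120, "shoulder": 90, "elbow": 60}

def pareto_front(points):
    """Keep non-dominated candidates: lower mass, lower energy, higher torque is better."""
    front = []
    for i, (m, e, t) in enumerate(points):
        dominated = any(m2 <= m and e2 <= e and t2 >= t and (m2, e2, t2) != (m, e, t)
                        for j, (m2, e2, t2) in enumerate(points) if j != i)
        if not dominated:
            front.append((m, e, t))
    return front

# Commonality study (greedy): pick designs from the front that each cover as many
# of the remaining joints as possible, shrinking the number of unique actuators.
front = pareto_front(candidates)
uncovered = dict(joint_requirements)
portfolio = []
while uncovered:
    best = max(front, key=lambda c: sum(c[2] >= req for req in uncovered.values()))
    covered = [j for j, req in uncovered.items() if best[2] >= req]
    if not covered:
        break  # no single remaining design meets the leftover joints
    portfolio.append(best)
    for j in covered:
        del uncovered[j]

print(f"{len(front)} Pareto-efficient designs; portfolio of {len(portfolio)} "
      f"covers {len(joint_requirements) - len(uncovered)} of {len(joint_requirements)} joints")
```

Tesla's actual study runs over torque-speed trajectories and efficiency maps for 28 joints; the sketch above only captures the "Pareto front, then commonality" shape of the argument.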
The actuators can also be viewed on this slide: we have three rotary and three linear actuators, all of which have a great output force, or torque, per unit mass. The rotary actuator in particular has a mechanical clutch integrated on the high-speed side, an angular-contact ball bearing on the high-speed side and a cross-roller bearing on the low-speed side; the gear train is a strain-wave gear, there are three integrated sensors, and it has a bespoke permanent-magnet machine. The linear actuator has planetary rollers and an inverted planetary screw as its gear train, which provides efficiency, compactness, and durability. To demonstrate the force capability of our linear actuator, we set up an experiment to test it at its limits, and I will let you enjoy the video.
So our actuator is able to lift a half-tonne, nine-foot concert grand piano. And this is a requirement, not just something nice to have, because our muscles can do the same when they are directly driven — our quadriceps can do the same thing. It's just that the knee is an up-gearing linkage system that converts that force into velocity at the end effector, for the purpose of giving the human body agility. This is one of the main things that is amazing about the human body. I'm concluding my part at this point, and I would like to welcome my colleague Mike, who's going to talk to you about hand design. Thank you very much.

Mike: Thanks, Konstantinos.
We just saw how powerful a human — and a humanoid — actuator can be; however, humans are also incredibly dexterous. The human hand has the ability to move at 300 degrees per second, it has tens of thousands of tactile sensors, and it has the ability to grasp and manipulate almost every object in our daily lives. For our robotic hand design we were inspired by biology: we have five fingers and an opposable thumb, and our fingers are driven by metallic tendons that are both flexible and strong. We have the ability to complete wide-aperture power grasps while also being optimized for precision gripping of small, thin, and delicate objects.

So why a human-like robotic hand? The main reason is that our factories and the world around us are designed to be ergonomic. That means objects in our factory are graspable, but it also means that new objects we may have never seen before can be grasped by the human hand — and by our robotic hand as well. The converse of that is pretty interesting: these objects are designed for our hand, instead of us having to make changes to our hand to accommodate a new object.

Some basic stats about our hand: it has six actuators and 11 degrees of freedom. It has an in-hand controller that drives the fingers and receives sensor feedback. Sensor feedback is really important for learning a little bit more about the objects we're grasping, and also for proprioception — the ability to recognize where our hand is in space. One of the important aspects of our hand is that it's adaptive; this adaptability comes essentially from complex mechanisms that allow the hand to adapt to the object being grasped. Another important part is that we have a non-backdrivable finger drive: this clutching mechanism allows us to hold and transport objects without having to keep the hand motors turned on.

You've just heard how we went about designing the Tesla Bot hardware; now we'll hand it off to Milan and our autonomy team to bring this robot to life. Thanks, Mike.
Milan: All right. All those cool things we showed earlier in the video were made possible in just a matter of a few months, thanks to the amazing work we've done on Autopilot over the past few years. Most of those components ported quite easily over to the bot's environment — if you think about it, we're just moving from a robot on wheels to a robot on legs — so some of those components are pretty similar, while others require more heavy lifting. For example, our computer vision neural networks were ported directly from Autopilot to the bot. It's exactly the same occupancy network — which we'll talk about in more detail later with the Autopilot team — that is now running on the bot in this video; the only thing that really changed is the training data, which we had to re-collect. We're also trying to find ways to improve those occupancy networks using work done on neural radiance fields, to get really great volumetric renderings of the bot's environment — for example, here, a machine that the bot might have to interact with.
Another interesting problem to think about is indoor environments, which mostly lack a usable GPS signal: how do you get the bot to navigate to its destination, say to find its nearest charging station? We've been training more neural networks to identify high-frequency features — keypoints — within the bot's camera streams and track them across frames over time as the bot navigates its environment, and we use those points to get a better estimate of the bot's pose and trajectory within its environment as it's walking.
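One classical way to see how tracked keypoints yield a pose estimate is the sketch below, which uses off-the-shelf OpenCV corner detection and optical flow as a stand-in for Tesla's learned keypoints; the file names and camera intrinsics are hypothetical:

```python
import cv2
import numpy as np

# Detect keypoints in one frame, track them into the next, then recover relative motion.
# (Tesla describes learned keypoint networks; this sketch substitutes classical OpenCV tools.)
prev = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)   # hypothetical frame files
curr = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

pts_prev = cv2.goodFeaturesToTrack(prev, maxCorners=500, qualityLevel=0.01, minDistance=7)
pts_curr, status, _ = cv2.calcOpticalFlowPyrLK(prev, curr, pts_prev, None)

good_prev = pts_prev[status.flatten() == 1].reshape(-1, 2)
good_curr = pts_curr[status.flatten() == 1].reshape(-1, 2)

# With camera intrinsics K, the relative camera motion between frames can be
# recovered from the matched keypoints (up to scale for a single camera).
K = np.array([[700.0, 0, 640], [0, 700.0, 360], [0, 0, 1]])  # hypothetical intrinsics
E, mask = cv2.findEssentialMat(good_curr, good_prev, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, good_curr, good_prev, K, mask=mask)
print("Relative rotation:\n", R, "\nRelative translation (unit scale):\n", t.ravel())
```

Chaining such frame-to-frame estimates over time is what gives the pose and trajectory described above.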
We also did quite some work on the simulation side. This is literally the Autopilot simulator, into which we've integrated the robot's locomotion code, and this is a video of the motion control code running in the simulator, showing the evolution of the robot's walk over time. As you can see, we started quite slowly in April and kept accelerating as we unlocked more joints and deeper, more advanced techniques like arm balancing over the past few months. Locomotion specifically is one component that's very different as we move from the car to the bot's environment, so I think it warrants a little more depth, and I'd like my colleagues to start talking about this now.
Felix: Hi everyone, I'm Felix, a robotics engineer on the project, and I'm going to talk about walking. Seems easy, right? People do it every day; you don't even have to think about it. But there are some aspects of walking which are challenging from an engineering perspective. For example, physical self-awareness — having a good representation of yourself: what is the length of your limbs, what is the mass of your limbs, what is the size of your feet; all of that matters. Also, having an energy-efficient gait: you can imagine there are different styles of walking, and not all of them are equally efficient. Most important: keep balance, don't fall. And of course, coordinate the motion of all of your limbs together. Humans do all of this naturally, but as engineers and roboticists we have to think about these problems explicitly, and I'm going to show you how we address them in our locomotion planning and control stack.

We start with locomotion planning and our representation of the bot — the model of the robot's kinematics, dynamics, and contact properties. Using that model and the desired path for the bot, our locomotion planner generates reference trajectories for the entire system, meaning trajectories that are feasible with respect to the assumptions of our model. The planner currently works in three stages: it starts by planning footsteps and ends with the motion of the entire system. Let's dive a little deeper into how this works. In this video you see footsteps being planned over the planning horizon, following the desired path. Starting from this, we then add foot trajectories that connect these footsteps using toe-off and heel strike, just as humans do; this gives us a larger stride and less knee bend, for higher efficiency of the system. The last stage is finding a center-of-mass trajectory, which gives us a dynamically feasible motion of the entire system to keep balance.
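A highly simplified sketch of that three-stage structure (footsteps, then foot trajectories, then a center-of-mass trajectory) is shown below; the step length, timing, and smoothing are hypothetical placeholders, not Tesla's planner:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Footstep:
    x: float      # forward position (m)
    y: float      # lateral position (m)
    t: float      # planned touchdown time (s)

def plan_footsteps(path_length_m, step_len=0.4, step_time=0.6, stance_width=0.2):
    """Stage 1: place alternating left/right footsteps along a straight path."""
    steps = []
    for i in range(int(path_length_m / step_len)):
        side = stance_width / 2 if i % 2 == 0 else -stance_width / 2
        steps.append(Footstep(x=(i + 1) * step_len, y=side, t=(i + 1) * step_time))
    return steps

def swing_foot_trajectory(start, end, apex_height=0.08, n=20):
    """Stage 2: connect consecutive footsteps with a swing arc (toe-off to heel strike)."""
    s = np.linspace(0.0, 1.0, n)
    xy = (1 - s)[:, None] * np.array(start) + s[:, None] * np.array(end)
    z = apex_height * np.sin(np.pi * s)      # simple arc standing in for toe-off/heel-strike
    return np.column_stack([xy, z])

def com_trajectory(steps, n=200):
    """Stage 3: a smoothed center-of-mass path that stays between the footsteps."""
    xs = np.interp(np.linspace(0, len(steps) - 1, n), range(len(steps)), [s.x for s in steps])
    ys = np.interp(np.linspace(0, len(steps) - 1, n), range(len(steps)), [s.y for s in steps])
    return np.column_stack([xs, 0.5 * ys])   # keep the CoM roughly centered laterally

steps = plan_footsteps(path_length_m=2.0)
swing = swing_foot_trajectory((steps[0].x, steps[0].y), (steps[2].x, steps[2].y))
com = com_trajectory(steps)
print(len(steps), "footsteps,", swing.shape, "swing samples,", com.shape, "CoM samples")
```

The real planner checks dynamic feasibility against the robot model at the last stage; this toy only illustrates the hierarchy of the three stages.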
As we all know, plans are good, but we also have to realize them in reality. Let's see how we can do this.
[Applause] Thank you, Felix.

Anand: Hello everyone, my name is Anand, and I'm going to talk to you about controls. Let's take the motion plan that Felix just talked about, put it in the real world on a real robot, and see what happens. It takes a couple of steps and falls down. Well, that's a little disappointing, but we are missing a few key pieces here which will make it work. As Felix mentioned, the motion planner is using an idealized version of itself and a version of the reality around it, and that is not exactly correct. It also expresses its intention through trajectories and wrenches — the forces and torques it wants to exert on the world to locomote. Reality is way more complex than any simple model, and the real robot is not that simplified: it has vibrations and modes, compliance, sensor noise, and on and on. So what does that do when you put the bot in the real world? The unexpected forces cause unmodeled dynamics, which the planner essentially doesn't know about, and that causes destabilization, especially for a system that is only dynamically stable, like bipedal locomotion.
So what can we do about it? We measure reality: we use sensors and our understanding of the world to do state estimation. Here you can see the attitude and pelvis pose — essentially the vestibular system in a human — along with the center-of-mass trajectory being tracked while the robot is walking in the office environment. Now we have all the pieces we need in order to close the loop: we use our better bot model, we use the understanding of reality that we've gained through state estimation, and we compare what we want against what we expect reality to be doing to us, in order to add corrections to the behavior of the robot. Here the robot certainly doesn't appreciate being poked, but it does an admirable job of staying upright.
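In spirit, that loop is classical feedback around the planned trajectory. A toy sketch — a generic PD-style correction on a single center-of-mass coordinate, with made-up gains and noise, not Tesla's controller — looks like this:

```python
import numpy as np

# Toy closed loop: track a planned center-of-mass position against a noisy
# "measured" state and feed back a correction. Gains and dynamics are
# illustrative placeholders only.
dt, steps = 0.01, 500
kp, kd = 40.0, 8.0

planned = np.sin(np.linspace(0, 2 * np.pi, steps)) * 0.05   # desired CoM sway (m)
pos, vel = 0.0, 0.0
errors = []

for k in range(steps):
    measured = pos + np.random.normal(0, 0.002)             # state estimate with sensor noise
    error = planned[k] - measured
    accel_cmd = kp * error - kd * vel                       # feedback correction
    disturbance = 0.3 if 200 <= k < 205 else 0.0            # a brief "poke"
    vel += (accel_cmd + disturbance) * dt
    pos += vel * dt
    errors.append(planned[k] - pos)

print(f"max tracking error: {max(abs(e) for e in errors) * 100:.1f} cm")
```

The production system does this over the full multi-body state with a proper dynamics model, but the shape is the same: estimate, compare to the plan, correct.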
The final point is that a robot that walks is not enough: we need it to use its hands and arms to be useful. Let's talk about manipulation.
[Applause]
Eric: Hi everyone, my name is Eric, a robotics engineer on Tesla Bot, and I want to talk about how we've made the robot manipulate things in the real world. We wanted to manipulate objects while looking as natural as possible, and also get there quickly. So what we've done is break this process down into two steps: first, generating a library of natural motion references — you could call them demonstrations — and then adapting these motion references online to the current real-world situation.

Let's say we have a human demonstration of picking up an object. We can get a motion capture of that demonstration, which is visualized here as a bunch of keyframes representing the locations of the hands, the elbows, and the torso. We can map that to the robot using inverse kinematics, and if we collect a lot of these, we have a library we can work with. But a single demonstration is not generalizable to the variation in the real world — for instance, this one would only work for a box in one very particular location. So what we've also done is run these reference trajectories through a trajectory optimization program, which solves for where the hand should be and how the robot should balance when it needs to adapt the motion to the real world. For instance, if the box is in this other location, our optimizer will create this trajectory instead.
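A very rough sketch of that two-step idea — a library of retargeted demonstrations plus an online adaptation step — is below; the two-link arm, costs, and optimizer are toy placeholders rather than Tesla's pipeline:

```python
import numpy as np
from scipy.optimize import minimize

# Toy 2-link planar arm standing in for the robot: joint angles -> hand position.
L1, L2 = 0.35, 0.30

def forward_kinematics(q):
    x = L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1])
    y = L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1])
    return np.array([x, y])

def inverse_kinematics(target, q0=np.array([0.3, 0.3])):
    """Map a demonstrated hand keyframe onto the robot (numerical IK)."""
    res = minimize(lambda q: np.sum((forward_kinematics(q) - target) ** 2), q0)
    return res.x

# Step 1: a "library" of hand keyframes from a mocap demonstration, retargeted via IK.
demo_hand_keyframes = [np.array([0.50, 0.10]), np.array([0.45, 0.25]), np.array([0.30, 0.40])]
reference_joints = [inverse_kinematics(p) for p in demo_hand_keyframes]

# Step 2: online adaptation -- the box moved, so shift the targets and re-solve while
# staying close to the demonstrated joints (a stand-in for trajectory optimization).
box_offset = np.array([-0.05, 0.05])
adapted_joints = [
    minimize(lambda q, p=p, qr=qr: np.sum((forward_kinematics(q) - (p + box_offset)) ** 2)
                                    + 0.1 * np.sum((q - qr) ** 2), qr).x
    for p, qr in zip(demo_hand_keyframes, reference_joints)
]
print(np.round(adapted_joints, 3))
```

The regularization term toward the reference joints is what keeps the adapted motion looking like the original human demonstration even as the target moves.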
Next, Milan is going to talk about what's next for Optimus.

Milan: Thanks, Eric. Right, so hopefully by now you guys have a good idea of what we've been up to over the past few months. We started with something that's usable, but it's far from being useful; there's still a long and exciting road ahead of us. I think the first thing, within the next few weeks, is to get Optimus at least at par with Bumble C — the other bot prototype you saw earlier — and probably beyond. We're also going to start focusing on a real use case at one of our factories, really try to nail that down, and iron out all the elements needed to deploy this product in the real world that I was mentioning earlier — indoor navigation, graceful fall management, even servicing — all the components needed to scale this product up. I don't know about you, but after seeing what we've shown tonight, I'm pretty sure we can get this done within the next few months or years, make this product a reality, and change the entire economy. I would like to thank the entire Optimus team for the hard work over the past few months; I think it's pretty amazing that all of this was done in barely six or eight months. Thank you very much. [Applause]
Ashok: Thank you. Hey everyone, I'm Ashok, I lead the Autopilot team alongside Milan. God, it's going to be so hard to top that Optimus section — we'll try nonetheless. Every Tesla that has been built over the last several years, we think, has the hardware to make the car drive itself, and we have been working on the software to add higher and higher levels of autonomy. This time around last year we had roughly 2,000 cars driving our FSD Beta software; since then we have significantly improved the software's robustness and capability, such that we have now shipped it to 160,000 customers as of today. [Applause] This did not come for free; it came from the sweat and blood of the engineering team over the last year. For example, we trained 75,000 neural network models in just the last year — that's roughly a model every eight minutes coming out of the team. We evaluate them on our large clusters, and we shipped 281 of those models that actually improved the performance of the car.
This pace of innovation is happening throughout the stack: the planning software, the infrastructure, the tools, even hiring — everything is progressing to the next level. The FSD Beta software is quite capable of driving the car: it should be able to navigate from parking lot to parking lot, handling city street driving, stopping for traffic lights and stop signs, negotiating with objects at intersections, making turns, and so on. All of this comes from the camera streams that go through our neural networks, which run on the car itself — it's not going back to a server or anything; it runs on the car and produces all the outputs that form the world model around the car, and the planning software drives the car based on that.

Today we'll go into a lot of the components that make up this system. The occupancy network acts as the base geometry layer of the system: this is a multi-camera video neural network that, from the images, predicts the full physical occupancy of the world around the robot — anything that's physically present: trees, walls, buildings, cars, what have you — along with their future motion. On top of this base level of geometry we have more semantic layers. In order to navigate the roadways we need the lanes, of course, but roadways have lots of different lanes and they connect in all kinds of ways, so it's actually a really difficult problem for typical computer vision techniques to predict the set of lanes and their connectivity. So we reached all the way into language technologies and pulled in the state of the art from other domains, not just computer vision, to make this task possible. For vehicles, we need their full kinematic state in order to control for them, and all of this comes directly from neural networks: raw video streams come into the networks, go through a lot of processing, and out comes the full kinematic state — positions, velocities, accelerations, jerk — with minimal post-processing. That's really fascinating to me, because how is this even possible? What world do we live in that this magic is possible — that these networks predict high-order derivatives of position when people thought we couldn't even detect these objects?
My opinion is that it did not come for free: it required tons of data. We have sophisticated auto-labeling systems that churn through raw sensor data and run a ton of offline compute on the servers — it can take a few hours — running expensive neural networks to distill the information into labels that train our in-car neural networks. On top of this, we also use our simulation system to synthetically create images, and since it's a simulation, we trivially have all the labels. All of this goes through a well-oiled data-engine pipeline, where we first train a baseline model with some data, ship it to the car, and see what the failures are; once we know the failures, we mine the fleet for the cases where it fails, provide the correct labels, and add that data to the training set. This process systematically fixes the issues, and we do this for every task that runs in the car.
To train these new massive neural networks, this year we expanded our training infrastructure by roughly 40 to 50 percent, which puts us at about 14,000 GPUs today across multiple training clusters in the United States. We also worked on our AI compiler, which now supports the new operations needed by those neural networks and maps them to the best of our underlying hardware resources. Our inference engine today is capable of distributing the execution of a single neural network across two independent systems-on-chip — essentially two independent computers interconnected within the self-driving computer. To make this possible we have to keep tight control of the end-to-end latency of the system, so we deployed more advanced scheduling code across the full FSD platform.

All of these neural networks running in the car together produce the vector space — the model of the world around the robot, or the car — and then the planning system operates on top of this, coming up with trajectories that avoid collisions, are smooth, and make progress towards the destination, using a combination of model-based optimization plus a neural network that helps make the optimization really fast. Today we are really excited to present progress in all of these areas. We have the engineering leads standing by to come up and explain these various blocks, and these power not just the car: the same components also run on the Optimus robot that Milan showed earlier. With that, I welcome Paril to start talking about the planning section.
Paril: Hi all, I'm Paril Jain. Let's use this intersection scenario to dive straight into how we do planning and decision-making in Autopilot. We are approaching this intersection from a side street, and we have to yield to all the crossing vehicles. Right as we are about to enter the intersection, the pedestrian on the other side of the intersection decides to cross the road without a crosswalk. Now we need to yield to this pedestrian, yield to the vehicles from the right, and also understand the relation between the pedestrian and the vehicle on the other side of the intersection. So there are a lot of inter-object dependencies that we need to resolve in a quick glance. Humans are really good at this: we look at a scene, understand all the possible interactions, evaluate the most promising ones, and generally end up choosing a reasonable one.

So let's look at a few of the interactions that the Autopilot system evaluated. We could have gone in front of this pedestrian with a very aggressive longitudinal and lateral profile — but then we are obviously being a jerk to the pedestrian, and we would spook the pedestrian and his cute pet. We could have moved forward slowly and shot for a gap between the pedestrian and the vehicle from the right — again, we are being a jerk to the vehicle coming from the right, but you should not outright reject this interaction, in case it is the only safe interaction available. Lastly, the interaction we ended up choosing: stay slow initially, find a reasonable gap, and then finish the maneuver after all the agents pass.

Now, evaluating all of these interactions is not trivial, especially when you care about modeling the higher-order derivatives for other agents — for example, what is the longitudinal jerk required of the vehicle coming from the right when you assert in front of it? Relying purely on collision checks with modular predictions will only get you so far, because you will miss out on a lot of valid interactions. This basically boils down to solving a multi-agent joint trajectory planning problem over the trajectories of the ego vehicle and all the other agents. However much you optimize, there's going to be a limit to how fast you can run this optimization problem — it will be close to the order of 10 milliseconds even after a lot of incremental approximations. For a typical crowded, unprotected left turn, say you have more than 20 objects, each object having multiple different future modes, and the number of relevant interaction combinations blows up. The planner needs to make a decision every 50 milliseconds, so how do we solve this in real time?
We rely on a framework we call interaction search, which is basically a parallelized tree search over a bunch of maneuver trajectories. The state space here corresponds to the kinematic state of the ego vehicle, the kinematic states of the other agents, their nominal multimodal future predictions, and all the static entities in the scene. The action space is where things get interesting: we use a set of maneuver trajectory candidates to branch over a bunch of interaction decisions, and also over incremental goals for a longer-horizon maneuver.

Let's walk through this tree search quickly to get a sense of how it works. We start with a set of vision measurements — namely lanes, occupancy, and moving objects — which get represented as sparse abstractions as well as latent features. We use these to create a set of goal candidates: lanes, from the lanes network, or unstructured regions, which correspond to a probability mask derived from human demonstrations. Once we have a bunch of these goal candidates, we create seed trajectories using a combination of classical optimization approaches and our neural network planner, again trained on data from the customer fleet. Once we have these seed trajectories, we use them to start branching on the interactions. We find the most critical interaction — in our case, the interaction with the pedestrian: whether we assert in front of it or yield to it. Obviously the option on the left is a high-penalty option, so it likely won't get prioritized; we branch further on the option on the right, and that's where we bring in more and more complex interactions, building this optimization problem incrementally with more and more constraints. The search keeps flowing, branching on more interactions and more goals.

Now, a lot of the tricks here lie in the evaluation of each node of the tree search. Inside each node, we initially created trajectories using classical optimization approaches, where the constraints I described would be added incrementally, and this would take close to one to five milliseconds per action. Even though that's a fairly good number, when you want to evaluate more than 100 interactions it does not scale. So we ended up building lightweight queryable networks that you can run in the loop of the planner. These networks are trained on human demonstrations from the fleet, as well as on offline solvers with relaxed time limits. With this, we were able to bring the runtime down to close to 200 microseconds per action.
Doing this alone is not enough, because you still have a massive tree search to go through, and you need to prune the search space efficiently. So you need to do scoring on each of these trajectories. A few of these scores are fairly standard: you do a bunch of collision checks and comfort analysis — what jerk and acceleration are required for a given maneuver. The customer fleet data plays an important role here again: we run two sets of lightweight queryable networks, each augmenting the other. One is trained on interventions from the FSD Beta fleet and gives a score for how likely a given maneuver is to result in an intervention over the next few seconds; the other is trained purely on human demonstrations — human driving data — and gives a score for how close a given selected action is to a human-driven trajectory. This scoring helps us prune the search space, keep branching further on the interactions, and focus the compute on the most promising outcomes. The cool part about this architecture is that it allows us to blend data-driven approaches, where you don't have to rely on a lot of hand-engineered costs, with physics-based checks that ground it in reality.

Now, a lot of what I described was with respect to the agents we can observe in the scene, but the same framework extends to objects behind occlusions. We use the video feeds from eight cameras to generate the 3D occupancy of the world; the blue mask here corresponds to what we call the visibility region — it basically gets blocked at the first occlusion you see in the scene. We consume this visibility mask to generate what we call ghost objects, which you can see on the top left. If you model the spawn regions and the state transitions of these ghost objects correctly, and you tune your control response as a function of their existence likelihood, you can extract some really nice human-like behaviors.
Now I'll pass it on to Phil, who will describe how we generate these occupancy networks.
Phil: Hey guys, my name is Phil, and I will share the details of the occupancy network we've built over the past year. This network is our solution for modeling the physical world in 3D around our cars. It is currently not shown in our customer-facing visualization; what you see here is the raw network output from our internal dev tool. The occupancy network takes the video streams of all eight of our cameras as input and produces a single unified volumetric occupancy in vector space: for every 3D location around the car, it directly predicts the probability of that location being occupied. And since it has video context, it is capable of predicting obstacles that are momentarily occluded. For each location it also produces a set of semantics, such as curb, car, pedestrian, and road debris, as color-coded here. Occupancy flow is also predicted for motion. Since the model is a generalized network, it does not distinguish static and dynamic objects explicitly, so it is able to model arbitrary motions, such as the swerving trailer here. This network is currently running in all Teslas with FSD computers, and it is incredibly efficient: it runs about every 10 milliseconds on our neural accelerator.
So how does this work? Let's take a look at the architecture. First we rectify each camera image using the camera calibration, and the rectified images are given to the network. It's actually not the typical 8-bit RGB image, as you can see from the first image on top: we give the network the 12-bit raw photon-count image. Since it has four more bits of information, it has 16 times better dynamic range, as well as reduced latency, since we don't run the ISP on board anymore. We use a set of RegNets and BiFPNs as a backbone to extract image-space features. Next we construct a set of 3D position queries, which, along with the image-space features as keys and values, are fed into an attention module. The output of the attention module is a set of high-dimensional spatial features. These spatial features are aligned temporally using vehicle odometry to derive motion. Last, these spatio-temporal features go through a set of deconvolutions to produce the final occupancy and occupancy flow outputs. These are formed as a fixed-size voxel grid, which might not be precise enough for planning and control, so in order to get higher resolution we also produce per-voxel feature maps, which are fed into an MLP together with 3D spatial point queries to get occupancy and semantics at any arbitrary location.
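A drastically simplified skeleton of that flow in PyTorch is sketched below; the shapes, channel counts, and module choices are placeholders meant only to illustrate the image-features, attention-over-3D-queries, and deconvolution-to-voxels structure, not Tesla's actual network:

```python
import torch
import torch.nn as nn

class ToyOccupancyNet(nn.Module):
    """Toy sketch: per-camera features -> cross-attention with 3D queries -> voxel grid."""
    def __init__(self, n_cams=8, d=64, grid=(16, 16, 4)):
        super().__init__()
        self.grid = grid
        self.backbone = nn.Sequential(                      # stand-in for RegNet/BiFPN
            nn.Conv2d(3, d, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(d, d, 3, stride=2, padding=1), nn.ReLU(),
        )
        n_queries = grid[0] * grid[1] * grid[2]
        self.queries = nn.Parameter(torch.randn(n_queries, d))  # learned 3D position queries
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.head = nn.Sequential(                           # stand-in for the deconvolution head
            nn.ConvTranspose3d(d, d // 2, 2, stride=2), nn.ReLU(),
            nn.Conv3d(d // 2, 1, 1),                          # occupancy logit per voxel
        )

    def forward(self, images):                                # images: (B, n_cams, 3, H, W)
        b, n, c, h, w = images.shape
        feats = self.backbone(images.flatten(0, 1))           # (B*n, d, h', w')
        feats = feats.flatten(2).transpose(1, 2)              # (B*n, h'*w', d)
        feats = feats.reshape(b, -1, feats.shape[-1])         # concat tokens from all cameras
        q = self.queries.unsqueeze(0).expand(b, -1, -1)       # (B, n_queries, d)
        fused, _ = self.attn(q, feats, feats)                 # cross-attention
        vox = fused.transpose(1, 2).reshape(b, -1, *self.grid)  # (B, d, X, Y, Z)
        return self.head(vox)                                 # (B, 1, 2X, 2Y, 2Z) occupancy logits

model = ToyOccupancyNet()
out = model(torch.randn(2, 8, 3, 128, 256))
print(out.shape)   # torch.Size([2, 1, 32, 32, 8])
```

The production network adds the temporal alignment via odometry, occupancy flow, semantics, and the per-voxel MLP for arbitrary point queries; this toy stops at the coarse occupancy grid.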
Now that we know the model better, let's take a look at another example. Here we have an articulated bus parked along the right side of the road, highlighted as an L-shaped group of voxels. As we approach, the bus starts to move: the front of the bus turns blue first, indicating that the model predicts the front of the bus has non-zero occupancy flow, and as the bus keeps moving, the entire bus turns blue. You can also see that the network predicts the precise curvature of the bus. This is a very complicated problem for a traditional object-detection network, since you have to decide whether to use one cuboid, or perhaps two, to fit the curvature; but for the occupancy network, since all we care about is the occupancy in the visible space, we're able to model the curvature precisely.

Besides the voxel grid, the occupancy network also produces a drivable surface. The drivable surface has both 3D geometry and semantics, which are very useful for control, especially on hilly and curvy roads. The surface and the voxel grid are not predicted independently; instead, the voxel grid aligns with the surface implicitly. Here we are at a hill crest, where you can see the 3D geometry of the surface being predicted nicely. The planner can use this information to decide, perhaps, that we need to slow down more for the hill crest, and as you can also see, the voxel grid aligns with the surface consistently.

Besides the voxels and the surface, we're also very excited about the recent breakthroughs in neural radiance fields, or NeRFs. We're looking into both incorporating some of these features into occupancy network training, as well as using our network output as the input state for NeRFs.
As a matter of fact, Ashok is very excited about this — it has been his personal weekend project for a while.

Ashok: On these NeRFs — I think academia is building a lot of foundation models for language, using tons of large datasets, but for vision, I think NeRFs are going to provide the foundation models for computer vision, because they are grounded in geometry. Geometry gives us a nice way to supervise these networks and frees us of the requirement to define an ontology, and the supervision is essentially free, because you just have to differentiably render these images. So I think in the future this occupancy network idea — where images come in and the network produces a consistent volumetric representation of the scene that can then be differentiably rendered into any image that was observed — is the future of computer vision. We're doing some initial work on it right now, but I think in the future, both at Tesla and in academia, we will see this combined with one-shot prediction of volumetric occupancy. That's my personal bet.

Phil: So here's an early example result of a 3D reconstruction from our fleet data. Instead of focusing on getting perfect RGB reprojection in image space, our primary goal here is to accurately represent the world in 3D for driving, and we want to do this for all our fleet data across the world, in all weather and lighting conditions. Obviously this is a very challenging problem, and we're looking for you guys to help. Finally, the occupancy network is trained on a large auto-labeled dataset, without any humans in the loop. And with that, I'll pass it to Tim to talk about what it takes to train this network. Thanks, Phil.
[Applause]

Tim: All right, hey everyone, let's talk about some training infrastructure. We've seen a couple of videos — four or five, I think — but we care about, and worry about, a lot more clips than that. Looking at just the occupancy network from Phil's videos: it takes 1.4 billion frames to train the network you just saw. If you had a hundred thousand GPUs, it would take one hour; if you have one GPU, it would take a hundred thousand hours. That is not a humane time period to wait for your training job to run — we want to ship faster than that — so that means you need to go parallel. You need more compute, which means you need a supercomputer. This is why we've built, in-house, three supercomputers comprising 14,000 GPUs, where we use 10,000 GPUs for training and around 4,000 GPUs for auto-labeling. All these videos are stored in 30 petabytes of a distributed, managed video cache. You shouldn't think of our datasets as fixed, the way you'd think of ImageNet with its million frames; you should think of them as a very fluid thing. We've got half a million of these videos flowing in and out of these clusters every single day, and we track 400,000 of these Python video instantiations every second. That's a lot of calls we need to capture in order to govern the retention policies of this distributed video cache. Underlying all of this is a huge amount of infrastructure, all of which we build and manage in-house.
You cannot just buy 14,000 GPUs and 30 petabytes of flash NVMe, put it all together, and go train. It actually takes a lot of work, and I'm going to go into a little bit of that. What you typically want to do is take your accelerator, which could be the GPU or Dojo (which we'll talk about later), and because that's the most expensive component, that's where you want your bottleneck to be. That means every single other part of your system needs to outperform the accelerator, and that is really complicated. Your storage needs the size and the bandwidth to deliver all the data down to the nodes; those nodes need the right amount of CPU and memory to feed your machine learning framework; the framework then hands it off to the GPU, and then you can start training. But you need to do so across hundreds or thousands of GPUs, in a reliable way, in lockstep, and in a way that's also fast, so you're also going to need an interconnect. It's extremely complicated; we'll talk more about Dojo in a second.
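As an illustration of the "keep the accelerator as the bottleneck" principle, here is what that looks like in a plain PyTorch input pipeline; clip_dataset and model are hypothetical stand-ins, and the specific loader settings are assumptions, not Tesla's configuration.

import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    clip_dataset,                 # assumed Dataset yielding decoded clips + labels
    batch_size=8,
    num_workers=16,               # enough CPU workers to hide storage latency
    pin_memory=True,              # page-locked host memory for fast host-to-device copies
    prefetch_factor=4,            # keep batches queued ahead of the GPU
    persistent_workers=True,
)

for clips, labels in loader:
    clips = clips.cuda(non_blocking=True)     # overlap the copy with compute
    labels = labels.cuda(non_blocking=True)
    loss = model(clips, labels)               # assumed model returning a loss
    loss.backward()                           # optimizer step elided for brevity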
But first, I want to take you through some of the optimizations we've done on our cluster. We're getting in a lot of videos, and video is very much unlike, say, training on images or text, which is very well established; video is quite literally a dimension more complicated. That's why we needed to go end to end, from the storage layer down to the accelerator, and optimize every single piece of it. Because we train on the photon-count videos that come directly from our fleet, we train on those directly; we do not post-process them at all.
The way it's done is we seek exactly to the frames we selected for our batch and load those in, including the frames they depend on, which are your I-frames or keyframes. We package those up, move them into shared memory, move them over to the GPU, and then use the hardware decoder on the accelerator to actually decode the video. So we do that natively on the GPU, all wrapped in a very nice Python PyTorch extension. Doing so unlocked a more than 30% training speed increase for the occupancy networks and freed up basically a whole CPU for other work.
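Here is a rough sketch of that decode path. The helpers (packet_index, to_shared_memory, hw_decode_on_gpu) are hypothetical placeholders standing in for the NVDEC-backed PyTorch extension described above.

def load_clip(video_path, wanted_frames, packet_index):
    """Sketch of the decode path: seek, stage, decode on the GPU.
    The helpers used here are illustrative placeholders, not Tesla's extension."""
    # 1) Seek only to the selected frames plus the keyframes (I-frames) they
    #    depend on, so we never read or decode the whole video.
    needed = sorted(set(wanted_frames) | packet_index.keyframes_for(wanted_frames))
    packets = packet_index.read_packets(video_path, needed)

    # 2) Stage the raw packets in shared memory and hand them to the GPU's
    #    hardware video decoder, so the CPU never touches decoded pixels.
    staged = to_shared_memory(packets)
    frames = hw_decode_on_gpu(staged)          # decoded frames stay on the GPU

    # 3) Drop the dependency frames; keep only what the batch asked for.
    wanted = set(wanted_frames)
    return [f for f, idx in zip(frames, needed) if idx in wanted]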
Of course, you cannot just do training with videos alone; you need some kind of ground truth, and that is an interesting problem as well. The objective for storing your ground truth is that you want to get to the ground truth you need in the minimal number of file-system operations, and load the minimal amount of data, in order to optimize for aggregate cross-cluster throughput, because you should see a compute cluster as one big device with internally fixed constraints and thresholds.
For this, we rolled out a format that is native to us, which we call "small". We use it for our ground truth, our feature cache, and any inference outputs, so there are a lot of tensors in there. As a cartoon: say this is the table you want to store; this is how it would be rolled out on disk. You take anything you'd want to index on, for example video timestamps, and put those all in the header, so that from your initial header read you know exactly where to go on disk. Then, for any tensors, you try transposing the dimensions to put a different dimension last as the contiguous dimension, and you also try different types of compression; you check which combination is most optimal and store that one. This is actually a huge win: if you are feature-caching the output of a machine learning network, rotating the dimensions around a little bit can get you up to a 20% increase in storage efficiency. When we store the data we also order the columns by size, so that all your small columns and small values sit together; when you seek for a single value, you're likely to overlap with a read of more values you'll use later, so you don't need to do another file-system operation.
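As a toy illustration of the kind of layout being described (and not the actual on-disk format Tesla uses), the sketch below puts everything you might index on into a header, tries a couple of layouts and codecs per tensor and keeps the smallest, and orders values by size:

import json, lzma, zlib
import numpy as np

def pack_tensor(t):
    """Try different contiguous layouts and codecs; keep whichever is smallest."""
    best = None
    for transposed in (False, True):
        data = t.T if transposed else t
        raw = np.ascontiguousarray(data).tobytes()
        for codec, compress in (("zlib", zlib.compress), ("lzma", lzma.compress)):
            blob = compress(raw)
            if best is None or len(blob) < len(best[1]):
                meta = {"codec": codec, "transposed": transposed,
                        "dtype": str(t.dtype), "shape": list(data.shape)}
                best = (meta, blob)
    return best

def write_records(path, records):
    """records: list of (timestamp, {name: ndarray}). Everything we index on
    (timestamps, offsets, sizes) goes in the header so one read tells the
    reader exactly where to seek on disk."""
    header, body = [], bytearray()
    for ts, tensors in records:
        entry = {"timestamp": ts, "tensors": {}}
        for name, tensor in sorted(tensors.items(),
                                   key=lambda kv: kv[1].nbytes):   # small values first
            meta, blob = pack_tensor(tensor)
            entry["tensors"][name] = {**meta, "offset": len(body), "size": len(blob)}
            body += blob
        header.append(entry)
    with open(path, "wb") as f:
        head = json.dumps(header).encode()
        f.write(len(head).to_bytes(8, "little"))   # fixed-size prefix -> single header read
        f.write(head)
        f.write(body)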
I could go on and on; I've just touched on two of the projects we have internally, but this is part of a huge, continuous effort to optimize the compute we have in-house. Accumulating and aggregating all of these optimizations, we now train our occupancy networks twice as fast simply because the pipeline is twice as efficient, and if we then add in a bunch more compute and go parallel, we can train this network in hours instead of days. With that, I'd like to hand it off to the biggest user of compute: John.
Hi everybody, my name is John Emmons and I lead the Autopilot vision team. I'm going to cover two topics with you today: first, how we predict lanes, and second, how we predict the future behavior of other agents on the road. In the early days of Autopilot, we modeled lane detection as an image-space instance segmentation task. Our network was super simple, though; in fact, it was only capable of predicting lanes of a few different kinds of geometries. Specifically, it would segment the ego lane, it could segment adjacent lanes, and it had some special casing for forks and merges. This simplistic modeling of the problem worked for highly structured roads like highways, but today we're trying to build a system capable of much more complex maneuvers; specifically, we want to make left and right turns at intersections, where the road topology can be quite a bit more complex and diverse. When we try to apply this simplistic modeling of the problem there, it just totally breaks down. Taking a step back for a moment, what we're trying to do is predict the sparse set of lane instances and their connectivity; what we want is a neural network that basically predicts this graph, where the nodes are lane segments and the edges encode the connectivity between those lanes.
So what we have is our lane detection neural network, which is made up of three components. In the first component, we have a set of convolutional layers, attention layers, and other neural network layers that encode the video streams from the eight cameras on the vehicle and produce a rich visual representation. We then enhance this visual representation with coarse, road-level map data, which we encode with a set of additional neural network layers we call the lane guidance module. This map is not an HD map, but it provides a lot of useful hints about the topology of lanes inside intersections, the lane counts on various roads, and a set of other attributes that help us. The first two components produce a dense tensor that sort of encodes the world, but what we really want is to convert this dense tensor into a sparse set of lanes and their connectivities.
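Structurally, the three components fit together roughly like the skeleton below; the layer sizes, the way the eight cameras are stacked, and the fusion of the map hints are all simplifications for illustration, not the actual Autopilot architecture.

import torch
import torch.nn as nn

class LaneDetector(nn.Module):
    """Skeleton of the three components described above (sizes are made up)."""
    def __init__(self, d=256):
        super().__init__()
        # 1) Visual encoder: convolution + downsampling over the eight camera streams.
        self.camera_encoder = nn.Sequential(
            nn.Conv2d(3 * 8, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(d, d, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # 2) Lane guidance module: encodes coarse, road-level map hints (not an HD map).
        self.lane_guidance = nn.Sequential(nn.Linear(32, d), nn.ReLU(), nn.Linear(d, d))
        # 3) Autoregressive decoder that turns the dense tensor into lane tokens
        #    (the decoding loop is sketched separately further below).
        self.decoder = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)

    def forward(self, cameras, map_hints, lane_tokens):
        visual = self.camera_encoder(cameras)           # (B, d, H, W) dense tensor
        guidance = self.lane_guidance(map_hints)        # (B, d) encoded map hints
        world = visual.flatten(2).transpose(1, 2)       # (B, H*W, d)
        world = world + guidance.unsqueeze(1)           # fuse the map hints in
        return self.decoder(lane_tokens, world)         # predict the "language of lanes"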
We approach this problem like an image captioning task, where the input is this dense tensor and the output text is predicted in a special language that we developed at Tesla for encoding lanes and their connectivities. In this language of lanes, the words and tokens are lane positions in 3D space, and the ordering of the tokens, together with predicted modifiers on the tokens, encodes the connective relationships between these lanes. By modeling the task as a language problem, we can capitalize on recent autoregressive architectures and techniques from the language community for handling the multimodality of the problem. So we're not just solving the computer vision problem at Autopilot; we're also applying the state of the art in language modeling, and in machine learning more generally. I'm now going to dive into a little bit more detail on this language component.
What I have depicted on the screen here is a satellite image, which represents the local area around the vehicle. The set of nodes and edges is what we refer to as the lane graph, and it's ultimately what we want to come out of this neural network. We start with a blank slate and make our first prediction at this green dot. The green dot's position is encoded as an index into a coarse grid which discretizes the 3D world. Now, we don't predict this index directly, because it would be too computationally expensive; there are just too many grid points, and predicting a categorical distribution over them has implications at both training time and test time. So instead we discretize the world coarsely first, predict a heatmap over the possible locations, and latch onto the most probable location; we then refine the prediction and get the precise point.
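The coarse-then-refine step can be sketched like this; offset_head and the tensor shapes are illustrative assumptions rather than the production heads.

import torch

def predict_point(heatmap_logits, offset_head, features):
    """Coarse-to-fine point prediction as described above.

    heatmap_logits: (N,) scores over a coarse grid covering the space around the car
    offset_head:    small network regressing a fine offset within the chosen cell
    features:       (N, d) per-cell features used for the refinement
    """
    probs = torch.softmax(heatmap_logits, dim=-1)   # distribution over coarse cells
    idx = probs.argmax(dim=-1)                      # latch onto the most probable cell
    offset = offset_head(features[idx])             # refine to a precise point in the cell
    return idx, offset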
Now we know the position of this token, but we don't know its type. In this case it's the beginning of a new lane, so we encode it as a start token, and because it's a start token there are no additional attributes in our language. We then take the predictions from this first forward pass and encode them using a learned embedding, which produces a set of tensors that we combine into what is effectively the first word in our language of lanes; we add it to the first position in our sentence. We then continue this process by predicting the next lane point in a similar fashion. Now, this lane point is not the beginning of a new lane; it's a continuation of the previous lane, so it gets a continuation token type. And it's not enough just to know that this lane is connected to the previously predicted lane; we want to encode its precise geometry, which we do by regressing a set of spline coefficients. We then take this lane point, encode it again, and add it as the next word in the sentence. We continue predicting these continuation tokens until we reach the end of the prediction grid, and then we move on to a different lane segment. You can see that cyan dot there: it's not topologically connected to the pink point; it's actually forking off of that green point. So it gets a fork type, and fork tokens point back to the previous token from which the fork originates. You can see here that the fork's pointer is index zero, so it's referencing back to tokens it has already predicted, just like you would in language. We continue this process over and over until we've enumerated all of the tokens in the lane graph, and then the network predicts the end-of-sentence token.
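Putting the walk-through together, a greedy decoding loop for this "language of lanes" might look roughly like the following; model.decode_step and the token fields are hypothetical names for the heads described above, not Tesla's interface.

import torch

START, CONTINUE, FORK, END = range(4)   # token types in the "language of lanes"

@torch.no_grad()
def decode_lane_graph(model, world_tensor, max_tokens=128):
    """Greedy sketch of the autoregressive decoding walked through above."""
    tokens = []
    for _ in range(max_tokens):
        step = model.decode_step(world_tensor, tokens)       # attends over prior tokens
        token_type = step["type_logits"].argmax().item()
        if token_type == END:                                # end-of-sentence token
            break
        token = {"type": token_type,
                 "cell": step["coarse_cell"].argmax().item(),   # coarse heatmap index
                 "offset": step["fine_offset"]}                 # refined position
        if token_type == CONTINUE:
            token["spline"] = step["spline_coeffs"]          # precise geometry to the previous point
        if token_type == FORK:
            token["parent"] = step["pointer_logits"].argmax().item()  # back-reference to an earlier token
        tokens.append(token)
    return tokens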
Yeah, I just want to note that the reason we do this is not because we want to build something complicated, even though it almost feels like a Turing-complete machine built out of neural networks. It's that we tried simpler approaches, for example just segmenting the lanes along the road, but the problem is that when there's uncertainty, say you cannot see the road clearly and there could be two lanes or three lanes and you can't tell, a simple segmentation-based approach will just draw both of them in a kind of 2.5-lane situation, and the post-processing algorithm would hilariously fail on predictions like that. Yeah, and the problems don't end there: you need to predict these connective lanes inside of intersections, which is just not possible with the approach Ashok is mentioning, which is why we had to upgrade to this. With overlaps like these, the segmentation would just go haywire, and even if you try very hard to put them on separate layers, it's still a really hard problem. Language just offers a really nice framework for modeling this and getting a sample from the posterior, as opposed to trying to do all of this in post-processing.
But this doesn't actually stop at Autopilot, right, John? This can be used for Optimus as well. I guess they wouldn't be called lanes, but you could imagine that in a setting like this you might have paths that encode the possible places people could walk. Yeah, basically, if you're in a factory or in a home setting, you can just ask the robot, okay, please walk to the kitchen, or please route to some location in the factory, and then we predict a set of pathways that go through the aisles and take the robot there: okay, this is how you get to the kitchen. It gives us a really nice framework to model these different paths, which simplifies the navigation problem for the downstream planner. All right, so ultimately what we get from this lane detection network is a set of lanes and their connectivities, which comes directly from the network; there is no additional step here for simplifying dense predictions into sparse ones. This is just the direct, unfiltered output of the network.
Okay, so I talked a little bit about lanes; I'm now going to briefly touch on how we model and predict the future paths and other semantics of objects, going quickly through two examples. In the video on the right here, we've got a car that's running a red light and turning in front of us. To handle situations like this, we predict a set of short-time-horizon future trajectories for all objects; we can use these to anticipate the dangerous situation and apply whatever braking and steering action is required to avoid a collision. In the second video, there are two vehicles in front of us. The one in the left lane is parked, apparently being loaded or unloaded; I don't know why the driver decided to park there, but the important thing is that our neural network predicted that it was stopped, which is the red color there. The vehicle in the other lane, as you can see, is also stationary, but that one is obviously just waiting for the red light to turn green. So even though both objects are stationary and have zero velocity, it's the semantics that really matter here, so that we don't get stuck behind that awkwardly parked car.
Predicting all of these agent attributes presents some practical problems when building a real-time system. We need to maximize the frame rate of our object detection stack so that Autopilot can quickly react to the changing environment; every millisecond really matters here. To minimize the inference latency, our neural network is split into two phases: in the first phase, we identify the locations in 3D space where agents exist; in the second phase, we pull out tensors at those 3D locations, append additional data from the vehicle, and then do the rest of the processing. This sparsification step allows the neural network to focus compute on the areas that matter most, which gives us superior performance for a fraction of the latency cost.
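Conceptually, the two-phase split looks something like this sketch; stage1, stage2, the top-k selection, and the tensor shapes are illustrative assumptions rather than the production design.

import torch

def two_stage_inference(stage1, stage2, scene_features, extra_vehicle_data, k=64):
    """Phase 1 proposes where agents exist; phase 2 only runs on those locations,
    so compute is focused where it matters most."""
    presence = stage1(scene_features)                      # (B, H, W) agent-presence scores
    topk = presence.flatten(1).topk(k, dim=1).indices      # the k most likely locations
    feats = scene_features.flatten(2)                      # (B, C, H*W)
    gathered = torch.gather(
        feats, 2, topk.unsqueeze(1).expand(-1, feats.shape[1], -1))   # (B, C, k)
    gathered = torch.cat([gathered, extra_vehicle_data], dim=1)       # append vehicle data
    return stage2(gathered)                                # kinematics + semantics per agent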
So, putting it all together, the Autopilot vision stack predicts more than just the geometry and kinematics of the world; it also predicts a rich set of semantics, which enables safe and human-like driving. I'm now going to hand things off to Sri, who will tell us how we run all these cool neural networks on our FSD computer. Thank you.
[Applause]
Hi everyone, I'm Sri. Today I'm going to give a glimpse of what it takes to run these FSD networks in the car, and how we optimize for inference latency. I'm going to focus just on the FSD lanes network that John just talked about. When we started this track, we wanted to know whether we could run this lanes network natively on the Trip engine, which is the in-house neural network accelerator we built into the FSD computer. When we built this hardware, we kept it simple and made sure it could do one thing ridiculously fast: dense dot products. But this architecture is autoregressive and iterative; it crunches through multiple attention blocks in the inner loop, producing sparse points directly at every step. So the challenge was: how can we do this sparse point prediction and sparse computation on a dense dot-product engine? Let's see how we did it on the Trip engine.
The network predicts a heatmap of the most probable spatial locations of the point. We then do an argmax and a one-hot operation, which gives the one-hot encoding of the index of that spatial location. Next, we need to select the embedding associated with this index from an embedding table learned during training. To do this on the Trip engine, we actually built a lookup table in SRAM and engineered the dimensions of the embedding such that we could achieve all of this with just matrix multiplication.
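The trick of doing a table lookup on an engine that only does dense dot products can be written as a one-hot matrix multiplication, as in this small sketch (shapes are assumed):

import torch

def lookup_as_matmul(heatmap_logits, embedding_table):
    """Express the sparse "select row i of the embedding table" as a dense dot
    product: one-hot(argmax) @ table. Equivalent to embedding_table[idx], but
    phrased entirely in terms of matrix multiplication."""
    idx = heatmap_logits.argmax(dim=-1)                                     # most probable location, (B,)
    one_hot = torch.nn.functional.one_hot(idx, heatmap_logits.shape[-1]).float()  # (B, N)
    return one_hot @ embedding_table                                        # (B, d)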
Not just that: we also wanted to store this embedding in a token cache, so that we don't recompute it on every iteration but instead reuse it for future point predictions. Again, we pulled some tricks here so that all of these operations run purely on the dot-product engine. It's actually cool that our team found creative ways to map all these operations onto the Trip engine, in ways that weren't even imagined when the hardware was designed. But that's not the only thing we had to do to make this work: we also implemented a whole lot of operations and features to make this model compilable, to improve the int8 accuracy, and to optimize performance. All of these things helped us run this 75-million-parameter model in just under 10 milliseconds of latency, consuming just 8 watts of power.
But this is not the only architecture running in the car; there are many other architectures, modules, and networks we need to run. To give a sense of scale, there are about a billion parameters across all the networks combined, producing around 1,000 neural network signals, so we need to optimize them jointly, such that we maximize compute utilization and throughput and minimize latency. So we built a compiler just for neural networks, which shares its structure with traditional compilers. It takes the massive graph of neural nets, with 150k nodes and 375k connections, partitions it into independent subgraphs, and compiles each of those subgraphs natively for the inference devices. Then we have a neural network linker, which shares its structure with a traditional linker and performs link-time optimization: we solve an offline optimization problem with compute, memory, and memory-bandwidth constraints, so that it comes up with an optimized schedule that gets executed in the car.
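As a toy version of the partitioning step (the real link-time optimization also has to cut within components and respect compute, memory, and bandwidth constraints), splitting the network graph into independent subgraphs can be as simple as finding its connected components:

from collections import defaultdict

def independent_subgraphs(nodes, edges):
    """Toy illustration: group nodes into connected components that can be
    compiled and scheduled independently on different inference devices."""
    parent = {n: n for n in nodes}
    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path compression
            n = parent[n]
        return n
    for a, b in edges:                      # union nodes joined by a connection
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for n in nodes:
        groups[find(n)].append(n)
    return list(groups.values())

# Hypothetical example: two networks that share no tensors end up in different
# subgraphs and can be placed on different devices by the linker and scheduler.
subgraphs = independent_subgraphs(
    ["lanes_encoder", "lanes_decoder", "occupancy_net"],
    [("lanes_encoder", "lanes_decoder")])
# -> [["lanes_encoder", "lanes_decoder"], ["occupancy_net"]]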
On the runtime side, we designed a hybrid scheduling system, which does heterogeneous scheduling on one SoC and distributed scheduling across both SoCs, to run these networks in a model-parallel fashion. To get to 100 TOPS of compute utilization, we need to optimize across all the layers of software, from tuning the network architecture and the compiler all the way to implementing a low-latency, high-bandwidth RDMA link across both SoCs, and in fact going even deeper, to understanding and optimizing the cache-coherent and non-coherent data paths of the accelerator in the SoC. This is a lot of optimization at every level, to make sure we get the highest frame rate, because every millisecond counts here. And this is a visualization of the neural networks running in the car; this is essentially our digital brain. As you can see, these operations are nothing but matrix multiplications and convolutions, to name a few of the real operations running in the car. To train a network with a billion parameters, you need a lot of labeled data, so Jurgen is going to talk about how we achieve this with the auto-labeling pipeline.
Thank you. Thank you, Sri. Hi everyone, I'm Jurgen Zhang, and I lead geometric vision at Autopilot. So, let's talk about auto-labeling. We have several kinds of auto-labeling frameworks to support the various types of networks, but today I'd like to focus on the lanes network. To successfully train and generalize this network to everywhere, we think we need tens of millions of trips, from probably a million intersections or even more. So how do we do that? It is certainly achievable to source a sufficient number of trips, because, as Tim explained earlier, we already have around 500,000 trips per day being cached. However, converting all of that data into a training form is a very challenging technical problem.
To solve this challenge, we tried various ways of manual and auto labeling. From the first column to the second, and from the second to the third, each advance gave us nearly a 100x improvement in throughput. But we still wanted an even better auto-labeling machine, one that provides good quality, diversity, and scalability. To meet all of these requirements, despite the huge amount of engineering effort involved, we developed a new auto-labeling machine powered by multi-trip reconstruction. It can replace 5 million hours of manual labeling with just 12 hours on a cluster for labeling 10,000 trips.
So how did we solve it? There are three big steps. The first step is high-precision trajectory and structure recovery by multi-camera visual-inertial odometry. Here, all the features, including the ground surface, are inferred from the videos by neural networks, then tracked and reconstructed in the vector space.
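To give a flavor of the reconstruction step, here is a classic linear triangulation of a single tracked feature from several calibrated camera poses; the production multi-camera visual-inertial odometry is of course far more sophisticated than this sketch.

import numpy as np

def triangulate(projection_matrices, pixels):
    """Recover one tracked feature's 3D position from its pixel observations in
    several calibrated views (linear DLT triangulation).

    projection_matrices: list of 3x4 camera projection matrices P = K [R | t]
    pixels:              list of (u, v) observations of the same feature
    """
    rows = []
    for P, (u, v) in zip(projection_matrices, pixels):
        rows.append(u * P[2] - P[0])      # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)           # least-squares solution is the last right singular vector
    X = vt[-1]
    return X[:3] / X[3]                   # homogeneous -> 3D point in world coordinates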
So the typical drift rate of this trajectory in the car is around 1.3 centimeters