Claims like these are made every six months. From personal experience I can say it’s very easy to overfit a huge CNN on a dataset and then observe that it performs well on a test set drawn from the same distribution, despite attempts to make the test set differ from training. Typically these models fail outright in the wild.
I have also spoken to Autopilot engineers at Tesla who confirm this suspicion. And this is a highly select team that works with Elon Musk on a weekly basis.
The fact of the matter is that robust depth estimation is not possible with just cameras, especially ones mounted inside the windshield. Remember that when it rains, the distribution shifts completely. Tesla addresses this by training two nets, so now you’re just overfitting to two distributions.
> robust depth estimation is not possible with just cameras
I can believe this statement, and I’m definitely put off by the false promises and missed autonomy timelines Tesla has announced over the past five years. But the more important question is: “is the amount of depth estimation needed for autonomous vehicles achievable with just cameras?” Intuitively the answer is yes, since that is how humans achieve an acceptable level of depth perception for the same task.
Our eyes are less than 3" apart and we achieve "acceptable" depth perception (at least at close range).
What if we had two cameras 12" apart? What about 60" apart (opposite corners of the windshield)? How much easier would that make the depth perception problem?
From a quick look at Tesla's Autopilot info, it doesn't seem like they are doing stereoscopic vision with their cameras. Why not? Too hard to do?
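For a rough sense of scale, here is a back-of-the-envelope sketch using the standard pinhole stereo relation Z = f * B / d (the ~1000 px focal length and 50 m range are illustrative assumptions, not any actual camera spec):

    # Back-of-the-envelope stereo geometry: Z = f * B / d, so the
    # disparity is d = f * B / Z and the depth error from one pixel of
    # disparity error is roughly Z^2 / (f * B).
    f_px = 1000.0  # focal length in pixels -- illustrative assumption
    Z = 50.0       # distance to the car ahead, meters

    for label, B in [("3 in", 0.076), ("12 in", 0.305), ("60 in", 1.524)]:
        d = f_px * B / Z           # disparity in pixels
        err = Z ** 2 / (f_px * B)  # meters of depth error per 1 px of disparity error
        print(f"{label:>5}: disparity {d:4.1f} px, ~{err:4.1f} m error per pixel")

At a human-eye baseline, one pixel of disparity error at 50 m works out to tens of meters of depth error; a windshield-wide baseline brings that down to roughly a meter and a half, so on paper a wider rig helps a lot.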
The distance between the cameras is really important. In the lab everything is fine-tuned before each deployment. In the wild anything can happen, and if that distance changes (it’s the base of the triangle you triangulate depth from), all your calculations are off (see the sketch below).
That’s why monocular depth estimation is more robust. Skydio drones do it as well.
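To put a number on that, here is a minimal sketch (all figures made up for illustration) of how an uncorrected shift in the baseline biases every depth estimate, since Z = f * B / d scales directly with whatever baseline you plug in:

    # If the rig flexes so the true baseline shifts but you keep using
    # the calibrated value, every recovered depth is scaled by
    # B_calibrated / B_actual.
    f_px = 1000.0         # focal length in pixels -- illustrative assumption
    B_calibrated = 1.524  # baseline measured in the lab, meters (~60 in)
    B_actual = 1.500      # baseline after the mounts shift, meters

    for Z_true in (10.0, 30.0, 50.0):
        d = f_px * B_actual / Z_true     # disparity the cameras actually see
        Z_est = f_px * B_calibrated / d  # depth computed with the stale baseline
        print(f"true {Z_true:4.1f} m -> estimated {Z_est:5.2f} m ({(Z_est - Z_true) / Z_true:+.1%})")

A couple of centimeters of flex in a ~1.5 m baseline skews every range by about 1.6 percent, and a relative rotation between the two cameras (not modeled here) is usually even more damaging, which is presumably part of why a monocular or self-calibrating approach looks attractive in the wild.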
Great summary, but what exactly is wrong with overfitting here? If the model has enough capacity and you train on every square foot of the Earth’s surface under different conditions, you shouldn’t have to worry about anything else, right? Also, you might be able to generate synthetic data pretty easily here, making it a good fit for GANs etc.