Camera crushes Lidar, claims startup (ieee.org)
62 points by pseudolus on Aug 24, 2023 | 171 comments



> To see how their system performed in these situations, NoDar conducted a series of tests on a remote airstrip in Maine with almost zero light pollution.

Isn't this the opposite of realistic?

> In broad daylight, NoDar’s setup generated 40 million 3D data points per second compared to the lidar’s 600,000. In extremely heavy rain the number of valid data points dropped by only around 30 percent, while for lidar the drop was roughly 60 percent. And in fog with visibility of roughly 45 meters they found that 70 percent of their distance measurements were still accurate, compared to just 20 percent for lidar.

These results sound very cherry picked.


Yeah, and having 600k high-quality lidar points is better than having 40 million noisy stereo points. Honestly, their depth map is full of holes and artifacts, even though a depth map is the easiest way to hide poor quality (since range inaccuracies only manifest as subtle hue changes).

Also not sure why they picked a 600k-points-per-second lidar instead of a 3-million-points-per-second one like an Ouster 128-beam unit.


> despite the fact that a depth map is the easiest way to hide poor quality

I did my PhD on stereo and LIDAR; also in industry building these things. This is something that really annoyed us about camera companies selling stereo systems: you cannot tell anything from a colormapped depth plot. It doesn't matter if it's grayscale or jet or a nice perceptually uniform one. Zed are really bad for this in their promo material.

Small errors might be really significant for reconstruction, but depth maps make it easy to hide errors, fuzzy bits, holes, discontinuities etc. Really you want to test against a known calibration object at a distance and present the reconstruction error, show the 3D reconstruction top-down (or in a way which lets you see how much depth variation there is) or compare to a simultaneous LIDAR capture which might be sparse but will be more accurate (absolute) at distance.


A lot of lidar companies also show rainbow-colored point clouds from a perspective close to that of the sensor. So silly...


I'm actually trying to objectively compare different stereo cameras, and I've been wondering if there is a "standard" calibration scene.

I've just been looking at the depth maps to try and figure out things like height accuracy, minimum separation between objects before they blob together, etc., but I'd like to know your thoughts on what a good method of comparison might be.


You can buy known objects (expensive, for metrology calibration) or nowadays you could 3D print test objects. A simple one is to set up a flat board with white noise printed on it and then measure depth noise as a function of distance (eg fit a plane).
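
A minimal sketch of that plane-fit idea (assuming you already have the board's reconstructed points as an Nx3 array from your own pipeline; numpy only):

  import numpy as np

  def plane_fit_noise(points):
      # Fit z = a*x + b*y + c to the board points by least squares and
      # return the RMS of the z residuals, i.e. the depth noise at this range.
      A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
      coeffs, *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
      residuals = points[:, 2] - A @ coeffs
      return np.sqrt(np.mean(residuals ** 2))

  # Repeat at several board distances to get noise as a function of range.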

The cameras usually aren't the problem. You just need two reasonably high quality machine vision cameras that are ideally hardware synced. There are geometric limits on how accurate you can be, related to the camera separation/baseline and how well you expect you can match at the sub-pixel level. 0.1-0.25 px would be considered decent. Normally you'd design the problem in reverse, e.g. what error do we need at the worst distance, field of view dictates lenses, etc. It can be very bespoke.
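
To make that reverse design concrete, here's a back-of-the-envelope sketch using the standard stereo relation z = f*B/d (the baseline, focal length and 0.2 px matching noise below are just placeholder numbers):

  def depth_error(z_m, baseline_m, focal_px, disparity_err_px=0.2):
      # Since z = f*B/d, a small disparity error dd maps to
      # dz ~= z^2 * dd / (f * B): error grows with the square of range.
      return (z_m ** 2) * disparity_err_px / (focal_px * baseline_m)

  # e.g. 0.12 m baseline, 1400 px focal length, 0.2 px matching noise
  for z in (5, 20, 50):
      print(z, "m ->", round(depth_error(z, 0.12, 1400), 2), "m error")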

Depth reconstruction is more reliant on matching algorithms and illumination. You can test stereo algorithms on benchmark datasets (The classics are Middlebury and KITTI). Illumination includes things like random dot projection or other artificial texture to aid matching + reconstruction.


>The other challenge for cameras is that, unlike lidar, which has its own light source, they rely on ambient light. That’s why they often struggle at night or in bad weather.

They were testing to disprove the idea that it needs ambient light. It's 'worst case scenario / point-making' testing, not realistic real-world testing.

Also, I'm not sure how you're deducing 'very cherry picked' data there. They've given a range of situations and some general results that look pretty good. I'd assume it's cherry picked inasmuch as it's not a full data set, but nothing about it exactly screams that something core is suspiciously missing. Are you expecting the system to fall apart in light rain, but excel in heavy rain?


"Camera crushes lidar" is marketing spin, because different types of sensors "win" for different scenarios.

You really want a sensor fusion strategy for devices making life-or-death decisions on your behalf.


The problem with sensor fusion is what do you do when the sensors disagree? You have to decide which sensor to trust. But LiDAR is subject to interference, and it can’t see lane markings, traffic lights, speed limit signs, or emergency vehicle lights. If you ignored camera data and just relied on LiDAR, you wouldn’t be able to drive safely for very long. If you choose to ignore the LiDAR, then the sensor fusion is just adding noise to your camera vision model of the world.

If you somehow do sensor fusion perfectly and the models never disagree, then why even have the LiDAR? At that point you’ve solved vision using cameras.


The idea of sensor fusion is to have some kind of error model for all sensors (and, for navigation, frequently also a dynamic model of the vehicle) so that you can weigh the various sensors differently based on their uncertainty. The most primitive incarnation of this would be a simple linear Kalman filter, but you can use similar concepts with more complex non-linear observation and dynamic models.


Expanding a bit for those who don't know how a Kalman filter (or any Bayesian recursive estimator, for that matter) works, here's the essential idea.

Each sensor is given an uncertainty model, that is, for example, a stochastic model that adds e.g. Gaussian noise to the "true" value it is measuring. Further, you have another model that describes the dynamics, e.g. equations from physics that tell you where the car will go if you know the current speed, position, etc.

1. The Kalman filter computes a probabilistic prediction of what it thinks is happening by using the dynamics model. That is, based on what it knows so far, where will the car (probably) be when the next measurement comes in?

2. When measurements from various sensors come in, the Kalman filter uses Bayes's theorem to compute a mean (posterior), in which each measurement is weighted by the probability that the measured value is correct (using the uncertainty models; "correct" here means "in agreement with the prediction"). In other words, sensors that are inaccurate (large variance) count less in the computation of the mean, while more accurate sensors are given more importance.

Once the means of the measured quantities are computed, they are used again in step 1 and the whole thing is repeated. As you can see, disagreements are accounted for by the inaccuracies of the sensors, and the process of performing a probabilistic weighted average resolves them. For the Kalman filter in particular, it can be shown (mathematically proven) that this process minimizes the variance (uncertainty) of the estimated quantities (which, btw, is an amazing result if you think about it).
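
For the curious, here's a bare-bones 1D version of those two steps: constant-velocity dynamics and two position sensors with different variances (all the numbers are purely illustrative):

  import numpy as np

  dt = 0.1
  F = np.array([[1, dt], [0, 1]])   # dynamics model: constant velocity (step 1)
  Q = np.diag([0.01, 0.1])          # process noise: how much we trust the dynamics
  H = np.array([[1.0, 0.0]])        # both sensors observe position only

  def predict(x, P):
      return F @ x, F @ P @ F.T + Q

  def update(x, P, z, r):
      # A sensor with large variance r gets a small gain K, so it counts less.
      S = H @ P @ H.T + r           # innovation covariance
      K = P @ H.T / S               # Kalman gain
      x = x + (K * (z - H @ x)).ravel()
      P = (np.eye(2) - K @ H) @ P
      return x, P

  x, P = np.array([0.0, 1.0]), np.eye(2)
  x, P = predict(x, P)
  x, P = update(x, P, z=0.12, r=0.04)   # accurate sensor, e.g. a lidar return
  x, P = update(x, P, z=0.30, r=1.00)   # noisy sensor, e.g. a coarse stereo point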


Thanks for the excellent summary on the Kalman filter! I admit I was too lazy to write any details.


That is the general idea, but now you've just descended to the next level of the iceberg. While it's true that sensor fusion is a Good Idea in lots of contexts, the whole paradigm rests on a fundamental assumption that you can model the problem domain. Perception pushes up against the boundaries of what Bayesian filtering can handle because even the most complex error models are hopelessly simplistic compared to the system generating your measurements. Your model would have to capture the bounds of uncertainty for what your stereo pair might say about the depth of any point in any scene it will ever see.

A simple example of why this gets complicated: if I have a point in one camera and a point in another camera and I know they correspond to the same real-world spatial point, I can calculate some distance, and the statistics of that calculation can be captured by a halfway-reasonable error model. But how did I know they corresponded to the same point in the first place? Well, because they look the same according to some image feature... or because some deep neural network told me so.. etc. There just aren't very good ways to model just how haywire ^that^ process can go. So at the end of the day, once you let this evil into your perception system, using statistics to blend your sensors together is undermined, and all of your precious covariances just turn into tuning knobs you can twiddle.

The dirty secret is that almost all robotic perception systems are hiding unprincipled, un-modelled heuristics in the data association process. This is kicked under the rug because it doesn't really fit into traditional estimation theoretic frameworks. In a lot of papers you'll see academics push it aside by just calling that the "front-end", which they brush aside as a little widget you put on the front. If you're lucky they'll do ablations across a couple different options.

Of course, this is just one level deeper down the iceberg. It goes far deeper. Even if you could model the statistics of a depth camera well, the statistics of "what are all the objects in your scene about to do" is another couple of orders of magnitude more un-modellable. Often engineers will do something like attach a "constant-velocity" model to the agents in a scene. Imagine trying to bin all of the reasons you might stop walking in a straight line into a bubble that describes how "noisy" that picture of the world is! Now you can begin to appreciate just how hopeless it is to explicitly model uncertainty in the world around us.


You're basically describing the no-free lunch theorem, but in practice we can define fairly robust generic models for certain things, like "physics". That's why optical mice work pretty well.

You can start to add unjustified assumptions and that'll make the world model weaker, yes. But starting with pretty basic assumptions like "you can segment an object from an series of images because each solid object will move on its own trajectory", or even more basic like "objects have edges" and then a few dozen samples per second, and suddenly you have a fairly robust way to detect things.

Same for predicting where something will go. If you can estimate an object's current velocity, acceleration, and jerk with reasonable precision, you don't really need a highly predictive heuristic for the world model.
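
Roughly, a plain kinematic extrapolation is often enough over the short horizons involved; a toy sketch (the state values below are made up):

  def predict_position(p, v, a, j, dt):
      # Extrapolate position dt seconds ahead from the estimated
      # velocity, acceleration and jerk (a truncated Taylor expansion).
      return p + v * dt + 0.5 * a * dt**2 + (1.0 / 6.0) * j * dt**3

  # e.g. a car 30 m ahead doing 15 m/s and braking gently, half a second out
  print(predict_position(p=30.0, v=15.0, a=-2.0, j=0.0, dt=0.5))  # ~37.25 m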

For decision making you need more robust heuristics, like "the car to my left has right of way at the stop sign", but you don't need that level of heuristic to identify that there is a car and that it isn't part of the pavement and that it is currently sitting still.


Indeed, concepts that get hyped, such as truly autonomous vehicles, are still herculean tasks, and I am convinced that we are not going to find any theoretical way to demonstrate the safety of those technologies anytime soon (especially machine learning / deep learning).

However, you should also consider that working with heuristics is pretty much all engineers do. We ought to solve problems, even if the theory is not there yet. A great example is how we got air travel long before we had any real understanding of the fluid dynamics happening around the fuselage (as it was generally computationally intractable). So, sometimes a simple epipolar camera model with noisy clouds around the subjects is sufficiently accurate for the task. The real problem IMHO is that the degree to which these rudimentary approximations are tested for safety is not nearly enough with respect to how critical they are in the whole system.

A while ago I stumbled upon this presentation on system safety [1] which had an interesting perspective coming from the aerospace industry. In aerospace they test the shit out of every component to make sure that a failure does not cause an airplane to crash. In comparison, Waymo, Uber and everyone else have done almost nothing in terms of testing for safety before putting out their products.

[1]: Richard Murray: "Can We Really Use Machine Learning in Safety Critical Systems?" https://youtu.be/Wi8Y---ce28?si=HsqgiLngdHojpYO9


I keep hearing this argument which makes no sense. I guess it comes from Tesla marketing because they keep repeating that and it rubs off?

On a technical level, two sensors are clearly better than one even if you just pick one in case of disagreement, but as others have said, Kalman filters and other more advanced techniques exist. There is a reason airplanes and spacecraft have had multiple redundant sensors like this for decades.

The argument only makes sense if you want to save money, but then say you are being cheap up front.


It's literally not even how "Sensor fusion" works, even in the most trivial example. As long as most errors are independent per sensor, you can combine them for a more confident result.

It is indeed Tesla marketing that posits otherwise, which is wrong, and Tesla fans eat it up.


Yeah the “disagreement between sensors” is largely overblown by Tesla when they were trying to rationalize cutting parts.

Somehow everyone else has figured it out, and even Tesla knows how to do it and have done it for years.

It’s pure bunk that’s used to cover for other decisions and now gets parroted around


I could believe that for radar (and lidar), but not for ultrasonic sensors, which cost close to zero.


I mean, this is the car company that decided to skip rain sensors for cost reasons.

I own a Tesla but I also acknowledge that Elon will claw every last dollar he can to increase margins even by a few cents.


Karpathy explains the reasoning on Lex's podcast[1]. This was after he left Tesla, but of course he's hardly impartial.

[1] https://youtu.be/cdiD-9MMpb0?feature=shared&t=5279


> On technical level two sensors are clearly better than one even if you just pick one in case of disagreement

Can you explain this? If you always pick the same one in case of disagreement, what is the purpose of the other sensor? You're not getting any additional information when they agree.


The issue is that more of these sensors cost a lot of money. Can you get to the point where it's a business and not a science experiment if each vehicle costs a massive amount? If you can get away with some sensors and not others, you have lower costs. Of course, if cost is not a factor, more sensors is better.


"The argument only makes sense if you want to save money, but then say you are being cheap up front."


Well, you aren't being cheap. If you can get to where you want to go without lidar, why would you want to spend more money on lidar? If you can't, it's a different matter.


Great! In some sense, then, you are agreeing rather than arguing ;P. The point was if they want to say that, they can say that, and the people in this thread (including me, FWIW) would sigh with casual acceptance; but, instead, they make the argument that it is somehow better NOT to have the lidar, as they supposedly are now (as opposed to a while back?...) claiming that it is so difficult to "fuse" the knowledge of the various sensors that you are better off picking only one.


sure, but I think Musk has said it multiple times as well.


Humans drive both with their eyes and ears, although Tesla marketing like to state otherwise.

We are perfectly capable of combining both inputs and acting accordingly. An AI system can easily learn how to combine visual and LIDAR input to make the right decision given the circumstances. In the end it is just a decision tree.


Is there any situation in which drivers disregard what they see and rely on sound instead? Sound informs what we look at, but we rely entirely on sight to drive.


If an inner tire on the dually truck or trailer I drive at work blows out, I may be completely unable to see anything if there's no smoke or tire debris in the mirrors. But I absolutely know what a blown tire sounds like and feels like, and will absolutely pull over and come to a stop for repairs.

Same as if I'm driving in winter and hit a patch of ice - visually, it looks identical to the rest of the snowy, icy roads here, but I will know if I feel and hear that I'm spinning my wheels or sliding sideways.

If I smell coolant, oil, or a belt, even if my eyes tell me my gauges disagree I'll be pulling over for those issues as well.

Conversely, if I've replaced a tire and the TPMS light says the removed tire in the cargo area has low pressure (duh, that's why I changed it) but I know the spare (without a TPMS transmitter installed) is good, or otherwise know that the idiot light is a false positive, I'll trust my other senses over my eyes.


* A car accelerating fast from your left

* Screeching brakes from the right

* Sounds of a rattling bicycle trying to undertake you

* An approaching emergency vehicle

* Other cars beeping their horn at you

Of course you will confirm such situations visually, but you are definitely using hearing in addition to sight.


Sirens from not-yet-seen emergency vehicles are the most basic example.


Disregard? Not entirely, but sometimes I will be driving my car with its top or a window open, and for a few seconds I can hear that there's another car next to me even though it's in my blind spot, so yeah, it does assist in driving.


Not just sound but also touch and balance.


Humans also use proprioception during driving to understand changes in gross velocity and vector and hand and foot positioning.


For me, my ears respond faster than my eyes in some cases. I think someone else pointed out that when there's a bicycle ringing its bell at your side, your ears will catch it first, then you use your vision to confirm it.


Don't you just make a conservative decision in that case? My eyes and ears disagree all the time, but if one of them says danger, that is the one I listen to.


I'm sure this example isn't new at all but it's the first time I've encountered the argument phrased this way and I find it incredibly compelling. Looks good, smells bad is another one (that probably happens more often).

It did get me thinking about different senses and how I prioritize them. Given that humans' strongest sense is visual, it's interesting to me that the priorities of what to trust seem opposite to my expectations. It seems to me that visual signals are the least "trusted" compared to the other senses. As if the logic goes, "my sense of smell is so bad that if it detects danger, it must be very dangerous".

Similarly, I'm sure dogs can smell rotting flesh long before meat is unsafe.


Even simpler: if one sensor tells you confidently there's something in your way, and the other doesn't, assume there's something in your way.


The problem in this particular case is that the sensor may be confidently telling you there's something in the way, but in reality it's just a plastic bag. Hard braking at 70mph on a highway as if you were about to hit a concrete wall is probably not a great outcome in this scenario if there's somebody behind you.


If your sensor is confidently telling you that a plastic bag is a concrete wall, maybe you shouldn't be using that sensor to begin with.


> Don't you just make a conservative decision in that case? My eyes and ears disagree all the time, but if one of them says danger that is the one I listen too.

In the context of vehicular autonomy it's more complicated than that because for example a radar sensor picking up an overhead sign or a truck parked on the shoulder as if it's an obstruction on the road is something you want to ignore when you're going 80 MPH on the highway rather than slamming on the brakes.

When you're the only vehicle on the road, stopping if anything goes wrong is always the safest idea. When you're one of hundreds of vehicles in a high speed flow of traffic stopping would put you and everyone else on the road at significantly greater risk.


I think you're misunderstanding sensor fusion. Any sensor fusion worth its salt functions closer to a weighted average, but instead of a linear combination, you're relying on the error model and overall system model to figure out the actual state from the variety of sensors.
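
In the simplest static case, that weighted average is just inverse-variance weighting; a toy sketch (the readings and variances below are made up):

  def fuse(measurements, variances):
      # Inverse-variance weighted average of independent estimates.
      # The fused variance ends up smaller than any individual one.
      weights = [1.0 / v for v in variances]
      fused = sum(w * m for w, m in zip(weights, measurements)) / sum(weights)
      return fused, 1.0 / sum(weights)

  # camera says 10.4 m (variance 0.5), lidar says 10.1 m (variance 0.05)
  print(fuse([10.4, 10.1], [0.5, 0.05]))   # ~10.13 m, variance ~0.045

Giving a failed sensor infinite variance is the "zero weight" case mentioned elsewhere in the thread.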


Where I think it probably starts working well is when you can fuse information from other cars and sensors on the road, and/or bicycles and even pedestrians.

Imagine a stop light with a lidar sensor that broadcasts that information. Car ahead broadcasting that it's stopped in the fast lane.


Think about it from a security and distributed systems perspective. How do you establish that the sensor data is good (sensor isn't failing), the data is trustworthy (not malicious), and ensure you have it when you need it? V2V and V2I take already hard problems like self-driving and add a bunch of other hard problems on top.

They're hardly worth discussing as serious solutions today.


LiDAR easily passes the “featureless white wall” test. That’s why I’ll be waiting.


>The problem with sensor fusion is what do you do when the sensors disagree? You have to decide which sensor to trust.

There are two cameras in this computer vision system. If one of them goes offline, is obscured, becomes dirty, or malfunctions you lose stereo vision and depth perception. So you'd obviously need 3+ cameras. And now we're right back at sensor fusion challenges again.


I mean this in the kindest way possible: you need to mix up your sources of information, as you have become blinded by lies from marketers and a cult of personality. Machines have been using redundant sensors since machines and sensors have existed, with great success.


Similarly to how ML vision is also a black box, there is no reason why its error handling can’t also be one.

If you hear a sound but see nothing, you do become much more aware of the general area afterwards; something like that should also be possible.


You remove one of the sensors and pretend the disagreement didn't happen?


Or you give it an infinite uncertainty; in most sensor fusion algorithms that would result in zero weight being given to that sensor in the fused observation.


And that leads to another good point: it's easier to predict which sensor is trustworthy if you have more than one type. Is it dark? Trust the lidar more. Raining? Trust the radar more. There is at least a rudimentary radar to sanity-check the optical sensors, right?

I can almost see the objection that multiple cameras are as good as one camera+lidar, but I think it's a mistake to trust any system that can't check itself for consistency across multiple bands. It doesn't take a very good radar to keep you from ramming a fire truck. In fact, whatever runs the cruise control's distance sensor should have been enough to prevent a bunch of the Tesla oopsies reported in the press. When tackling one of the hardest engineering problems faced by humankind, it seems stupid not to take advantage of all the data you can get.
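
Concretely, "trust the lidar more when it's dark" usually just means inflating the assumed measurement noise of the degraded sensor before fusing, so a variance-weighted fusion step automatically leans on the others (illustrative numbers only):

  BASE_VARIANCE = {"camera": 0.2, "lidar": 0.05, "radar": 0.5}

  def adjusted_variance(sensor, dark=False, heavy_rain=False):
      # Inflate a sensor's variance when conditions degrade it, so the
      # fusion step gives it less weight relative to the others.
      v = BASE_VARIANCE[sensor]
      if dark and sensor == "camera":
          v *= 10.0    # cameras degrade badly in low light
      if heavy_rain and sensor == "lidar":
          v *= 4.0     # lidar returns get noisier in heavy rain
      return v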


You are right. There are several reasons why you want to fuse (different) sensors: reducing uncertainty, and estimating biases and systematic errors. As you said, different sensors are affected by environmental conditions differently, so it's only reasonable to use all the information instead of relying on only one measurement method.


> The problem with sensor fusion is what do you do when the sensors disagree?

Like one sensor says you have a bus stopped in front of you and the other says it's all clear? And your choices are full steam ahead or prepare to not ram the apparent bus?


You really want what works best given the ms time budget in the context of your on-board compute.


If it’s a life or death decision, you should opt for the best strategy, not the cheapest.


Person you're replying to is talking about time budget offered by your computer, not monetary budget. Regardless everything is about optimization - the "best" option will almost always be so costly that no consumer will be able to afford it.


If the technology (i.e. self-driving) cannot be delivered safely at an affordable cost, then it probably shouldn't be delivered at all, no?


This isn't germane to the conversation at all, the budget under discussion is compute time.


If we can't fit the compute required to do it right in a car, we shouldn't be doing it at all. Their point is perfectly fine, regardless of WHAT KIND of "budget" is being discussed.

That being said, even automobiles make safety tradeoffs for cheapness or feasibility. However, we really shouldn't allow any tradeoffs for a completely unnecessary feature like "self driving". Imagine if wanting your car to have android auto or similar meant it couldn't use the lights, because a tradeoff was made.


Safely only means you're above a threshold value that is safe enough, but you can always be safer with sufficient further expense. Presumably your budget allows you to go beyond the absolute bare minimum, but it obviously won't get you to infinity either, so you optimize for the best option within your constraints.


Define safely given the current road hazards of human drivers.


“Time budget” is ultimately the same thing as the monetary budget. Spend more, get more ms time.

In a life or death situation, you should opt for the system that is more likely to keep you alive, not the one that costs less.


The best strategy is the one that reduces global annual traffic injuries the most.

That’s probably also the cheapest viable strategy.

If a sensor-fusion car cuts accidents per mile (vs human) by 100x, but can only be deployed on 100,000 cars a year, and a camera-only car kills 10x more than that per mile, but can be put on 10,000,000 cars a year for the same cost, the camera-only car will end up saving 10x more people than the “better” system.

(I exaggerated both the improvement ratio and cost ratio because I like multiplying by powers of ten)


If we pick different made up numbers, the answer is different.


It’s still a useful thought exercise in the context of ‘the perfect’ being the enemy of improving the death rate.


"Thought exercises" do not belong in a safety discussion. This isn't an 8th grade debate, it's a company putting people at risk to increase their valuation while claiming "it's for the greater good".


What? Thought exercises always belong in safety discussions. Where is this weird notion that "safety Uber Alles" is actually how safety works or even should work?

Going for absurd safety standards or expectations is self-defeating. Again, as the other anon said, a practical solution that helps improve safety without handwaving away material realities (cost, feasibility, adoption rates) is always better than a "safer" option that won't actually be used.

Obviously corporations try to make more money, but people also don't like buying more expensive cars.


I think what the poster above is going for is that the level of safety (as a total number of saved human lives) resulting from turning a (napkin or otherwise) calculation into a definitive technological choice is probably suboptimal. Typically we would want a safe process to include retroaction or "self-improving cycles". That is, not a single point of measurement or calculation is being used; instead we provision for future safety evolution of the system, and monitor the safety conditions by setting up regular measurements. So, what is considered safe is not any standard in itself, because as conditions and technologies change it may become outdated quickly, but rather the process to redefine that standard, so that we have confidence that our product is not only safe but that we can keep it safe in the long run (since we want to optimize the total number of lives saved, not a weekly or monthly death toll).

A calculation that leads you to underdesign a product's safety and leaves no room for that product's safety improvement, in terms of mechanical or electronic updates, is clearly not thought of as safe in that regard, regardless of the economies of scale or even short-term utilitarian goals (which would be expressed as: people spending money on a Tesla would be safer in the short run than using no automatic driving at all while waiting for a better product).

This is an important difference, and there is a societal choice to make here: do we (as a society) want to buy now, and potentially have regrets later (when the safety of the product degrades with time, causing it to also rack up a record of people's deaths), or do we want to proactively force a notion of safety onto cars that is more than just being good enough at an arbitrary point in time, so that we have more confidence in the long-term viability of that (societal) investment? As you can guess, I gravitate towards the latter, but of course it's a gradient, with several choices in between, because pushing that thinking to an extreme would lead to stagnation, which would not do anything in terms of improving safety, as you noted.


I explained how to reason about the trade off using 3rd grade math.

If the outcome of your safety discussion ends up suggesting a “safer”, “more expensive” solution that will definitely leave more people dead and injured than that analysis does, then something is seriously wrong.


> If a sensor-fusion car cuts accidents per mile

There is no robust proof that any self-driving system outperforms a well-trained driver.

We could take that money and invest it in advanced driving lessons.


Or hell, public infrastructure that allows everyone to make it home safely even if you are so drunk you can barely walk.


Who is “we”? Current self-driving research is almost entirely funded by private investors. We’re way past the days of it being a darpa science project.


When I buy a car, will I pay for the LiDAR, or will investors?


The best strategy is to hide under your mattress.


No it's not, have you heard of bed sores and muscle atrophy? You need to exercise a reasonable amount to minimize cardiovascular risk as well.


Cheapest being humans.


Not necessarily. Maybe currently, but perhaps not in the future.


That's mostly spin too. You want a strategy that works well enough, because that's the regime we live in currently. Other human drivers certainly aren't implementing a LIDAR-based sensor fusion strategy and you have to share the roads with them. Your algorithm for evaluating their safety is "how likely are they to kill you" and not an absolutist position on the specific equipment in their heads or cars, and you're clearly OK with that.

As far as LIDAR itself: sure, yeah, you get depth info out of it. But depth info is only part of the problem, and frankly it's clear at this point that it's one of the easiest parts. The hard parts are on the recognition side: not "is that pedestrian in your path" (easy), but "is that pedestrian going to step into the street or not" (hard). And that's a computer vision problem. You can use LIDAR output as vision input, sure, but it has no advantages.

Tesla was right, basically.


> But depth info is only part of the problem, and frankly it's clear at this point that it's one of the easiest.

That's proven false by the cars continuing to drive into stationary objects. This failure mode is not ambiguous.

Camera input is garbage for interpreting geometry, especially from very smooth or very discontinuous surfaces, and especially with the shitty low resolution and low dynamic range cameras they use, and especially with non-stereoscopic cameras with no motion freedom relative to the vehicle body. Lidar is a necessary crutch for working around the fact that, while hypothetical cameras that don't exist might work well, all available cameras are unsuitable for the purpose, and calculating multi-view geometry accurately costs time.

> The hard parts are in the recognition side: not "is that pedestrian in your path" (easy), but "is that pedestrian going to step into the street or not" (hard). And that's a computer vision problem.

Pedestrian motion is not strictly a vision challenge but a general category of environment understanding (mass, momentum, motion mechanics). Vision is only one possible input mode preliminary to modeling.


> That's proven false by the cars continuing to drive into stationary objects.

It has? This again gets to "are they safer than human drivers?", because the competition hits stationary objects all the time. If you have data let's discuss data, but "proven false" is, again, just spin.

> Pedestrian motion is not strictly a vision challenge but a general category of environment understanding

Semantic evasion. You agree that it's "not a problem solved by LIDAR", right? It needs a camera. You can use a LIDAR output as a (somewhat inferior) camera, but it's not providing any advantages.


> "But depth info is...one of the easiest [problems]" ... "are they safer than human drivers?" ... Semantic evasion.

It looks like you're jumping from "depth is easy with cameras" (demonstrated false) to "safer than humans anyway without it" (speculative and not demonstrated by anyone), so who here is evading? That they're safer is not demonstrated. That depth is easy with just cameras is demonstrated to be false by the continuing failures in the presence of extreme financial incentive to not have those failures.

> You agree that it's "not a problem solved by LIDAR", right? It needs a camera. You can use a LIDAR output as a (somewhat inferior) camera, but it's not providing any advantages.

The LIDAR addresses the part where all current cameras are unsuited to mapping physical world geometry under driving conditions. It's not one or the other, but you appear to be assuming an imaginary not-the-one-we-live-in reality where only one is needed because you assume that all available cameras aren't actually very bad. But they are all actually very bad. So we continue to need both for the indeterminate future until someone invents mechanically robust extreme fidelity stereoptic cameras with motion freedom independent from the vehicle body, which is what humans use.

Humans are unsafe predominantly because of inattention, not ability. Camera-only vehicles are unsafe because of camera ability before you even get to the attention part.

Tesla's repeated failures over the years (and your conviction toward what Tesla is doing regardless) demonstrate a dangerously erroneous belief that object identification is the first and most important step for path planning. But that's not how humans drive, and it's not how to drive safely. The vehicle should avoid driving into any space that isn't going to be open smooth road, period, so the most important step is mapping geometry. There are no cameras currently suited for that. This is not a theoretical limitation. Just a practical one. Becoming suitable with current cameras would require many more cameras with much more processing per frame, so if you're trying to save costs vs lidar, you won't.


> Other human drivers certainly aren't implementing a LIDAR-based sensor fusion strategy and you have to share the roads with them

humans are absolutely doing sensor fusion, brains are bayesian inference machines. do not underestimate the power of the visual system.

and no, the fact that the brain does it is not an argument in favor of LIDAR-less cars. the eyeball + visual cortex system is alien technology compared to our feeble models. beware the hubris of a man who has learned to classify golden retrievers.


> humans are absolutely doing sensor fusion, brains are bayesian inference machines. do not underestimate the power of the visual system.

Not in the sense meant in the upthread comment they aren't, no. We have two cameras and two microphones. The latter are limited to weak detection of horns and tire screeches and not much else, and the former are too close together to give stereoscopic depth information at traffic distances.

We have a camera, basically. We do lots of stuff with the camera, sure. But that's not sensor fusion.

We sure as hell don't have anything like LIDAR.


> the former are too close together to give stereoscopic depth information at traffic distances

They're only too close together to give very precise depth information, but they do still provide useful depth information. They also double the incoming light and SNR, which is why the average person performs better on visual acuity tests with both eyes than with only one or the other.

> We have a camera, basically. We do lots of stuff with the camera, sure. But that's not sensor fusion.

They're varifocal cameras with very good dynamic range that receive double the light input and that also have full freedom to move around, both rotationally and translationally, which provides, among other things, more depth information and better object boundary segmentation from controlled parallax and focus, and which involves the continuously varied activation of many different muscles and sensory nerves because they're attached to the extremely complex and sensitive proprioceptive structure called the rest of your body, which your brain fully uses as input when processing visual information. And we know that your brain uses this other information, because not having this other information causes reduced perception and motion sickness.

So, no, they're not just cameras. And, yes, we do sensor fusion.


How do you manage disagreements in fusion?


The old school way is to add logic based on the strengths and weaknesses of each sensor type. My example is not specific to automotive sensors (I haven't worked in the automotive sector, but I do have now-outdated experience in obstacle detection and ranging, along with avoidance algorithms).

Sonar sensors are most accurate at medium ranges, but they are notorious for detecting ghost objects that do not really exist. Infrared range sensors are more reliable but are only accurate at very short range. So when a sonar sensor detects an object 8.4 meters away, you use the infrared sensor to double check. If the infrared sensor says there's an object 9 meters away in the same direction, you assume the object is real but is actually 8.4 meters away. If the infrared sensor says the nearest object in that direction is 20 meters away, you assume the sonar sensor made something up.

If you have enough types of sensors, you can also use a "majority rule". If two of 3 sensor types agree, you assume the 3rd is an anomaly. Lidar is excellent for this because it is accurate across a very large range, so it tends to overlap with most of your other sensors. This increases the odds that when there is a disagreement, one of the agreeing sensors will be capable of accurately measuring the distance to the object.
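
A crude sketch of what that kind of hand-written cross-check and majority-rule logic can look like (the tolerance and the "which sensor wins" choices are just placeholders):

  def fuse_range(sonar_m, infrared_m, agree_tol=1.5):
      # Cross-check a sonar detection against infrared in the same direction.
      # If they roughly agree, keep sonar's range (more accurate at medium
      # range); if they don't, treat the sonar detection as a ghost.
      if infrared_m is None:
          return sonar_m                 # nothing to check against
      if abs(sonar_m - infrared_m) <= agree_tol:
          return sonar_m                 # object is real, trust sonar's range
      return None                        # likely a sonar ghost, discard it

  def majority_vote(readings, agree_tol=1.5):
      # With 3+ sensor types, keep readings corroborated by at least one
      # other sensor and drop the odd one out.
      keep = []
      for i, r in enumerate(readings):
          others = readings[:i] + readings[i + 1:]
          if any(abs(r - o) <= agree_tol for o in others):
              keep.append(r)
      return keep

  print(fuse_range(8.4, 9.0))             # 8.4: both agree, keep sonar's range
  print(majority_vote([8.4, 9.0, 20.0]))  # [8.4, 9.0]: the 20 m reading is dropped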


Thanks for this. This is what I come to HN for — to learn something outside of my field.

Do AI systems have the potential to weight or inform those transactions based on historical data, then? The “experienced” aspect of learning all the things that turned out to be true or false in previous comparisons or data decision points would seem to be the obvious missing piece, but I have never really understood the specifics.


Statistics. "Sensors disagree" is the EXPECTED result when you get a reading from multiple sensors, and the whole point of sensor fusion is that, if the sensors have independent error models, that disagreement IMPROVES your output.


That's the essence of fusion.


Kalman filters.


At least someone knows what they are talking about.


Kalman filters are like sensor fusion 101, and anyone who has attached more than one distance sensor to an Arduino has attempted it. It's not that unreasonable that the average person has no idea what sensor fusion is; what IS unreasonable is the damn head of self-driving at Tesla claiming that "what do you do when the sensors disagree" is even a valid question.



This article doesn’t address when cameras are blocked, which is the obvious issue with camera-only self driving. Teslas have crashed when cameras were blinded by the sun. Now throw in snow, rain, dust… Is that solvable with lots of cameras and different types? …Does it need to be solved?

Maybe the bigger question - anyone know the status of low cost lidar? Dozens of startups and larger companies were working on it 10 years ago, yet Lidar still costs “thousands” according to the article


There's something bizarre going on with lidar manufacturing/pricing. When I was looking into it for a specific application, it was much cheaper to buy entire made-in-China products containing the exact lidar module than it was to buy the part separately, even at bulk pricing.


This is very common with all kinds of components. There are economies of scale your vendor can achieve when they sell someone a million of the same thing. Also, a company buying a million of the same thing is going to pay the vendor a significant sum, even if they get all kinds of discounts, and that puts them in a much better negotiating position than you buying a single one.

A hobbyist buying a few units of a component, even at a significant margin, will net the producer peanuts. So it's not surprising they don't worry much about serving that market.


Yes, I'm aware of that. That's why I added the bit about bulk pricing.

In my case I was looking at buying quite a number of units, outside of a hobbyist application. In fact, I would say it was a higher number than the cheaper China-made products could possibly sell (different market sizes). It seemed to me that they didn't want to sell for any price really but would make an exception if they could really, really rip me off.


Maybe this explains why it's somehow too costly to put lidar in a $50,000 Tesla, but I can have it in my $200 Xiaomi vacuum.


As others have pointed out, lidar doesn't denote capability.

1D lidars that have a range of 8 meters indoors are quite cheap, under $15 in volume.

"2D" lidars, that is, ones measuring depth in a single plane, are generally a lot more costly. Not only that, they are bigger and eat more power. Again, indoor only.

3D lidars are more expensive still, and if you want them to work outdoors, even more.


The Xiaomi vacuum has a 1D laser rangefinder that is physically rotated in a 2D plane, which is much cheaper and simpler than a 3D LiDAR.


Not exactly, as explained by the other commenter, but it could help explain why your $200 Xiaomi has lidar but a $1500 Roomba does not.


You do know that lidars vary in resolution, and that a vacuum doesn't need anywhere near the resolution a car does?


What does a human driver do when their vision is obstructed?

1. Attempt to use the vehicle's built-in windscreen wiper to remove the obstruction.

2. Failing that, stop the car. Preferably before the vision gets so badly obstructed that the car cannot safely be brought to a stop. But stop the car even so.

3. Get out and clear the obstruction. Admittedly the AI will have trouble with this, but it is vanishingly rare anyway, and if the car is carrying passengers, this task can be given to the passengers.


When a human continues driving in inclement weather, and runs into the back of a van full of kids and kills all of them, we put them in jail for making a bad judgement call.

How do we handle the AI mowing over a pedestrian when it makes a bad judgement call? Right now, the status quo is that we do jack and shit, and I can't help but feel like that's not a good plan.


The same way we handle a failed brake system. Bad maintenance or bad design leads to the operator's or manufacturer's insurance paying.


Car brake systems actually have several built in redundancies, including an entire secondary system for backup emergency use.

What redundancies can you implement in a black box "AI" model?


Automatic emergency braking is already available, unless some moron disables it.


It’s an interesting conundrum, but in a full AI world the hope is that it’s so rare we don’t feel the need to be punitive at all and can chalk it up to bad luck and try to learn from it. Perhaps more similar to when an airplane crashes and people die.


This is whataboutism for ADAS.

People and ADAS have their own, different, and critical weaknesses. Neither is a panacea. (Which is why mass transit investment should be prioritized over scifi fantasy ADAS.)


Or walking, the most robust transportation option


Yeah, right. As if, when there's snowfall, we get out of the car and shoo the flakes away. Or scream at the sun to stop blinding us.

Humans have something called perception and cognition; we can make sense of things we don't see.

AFAIK we don't have cameras yet that can do that. We need better sensors.


The cameras are for perception and the AI is for cognition.


If we have to wait for "General" AI to have self driving cars, we should probably stop selling them today.


> Maybe the bigger question - anyone know the status of low cost lidar?

"Solid state" lidars would fit the fit bill for likely low cost lidar. They are probably 3-4 years out, and have been for the last 10 years.


iPhone 13 Pro has solid state LiDAR.


It was introduced in the 12 Pro models released in 2020.


This page from March 2023 says $1,000/unit for Lidar: https://www.sae.org/news/2023/03/adas-and-autonomous-vehicle...


> Teslas have crashed when cameras were blinded by the sun. Now throw in snow, rain, dust…

I used to think the more sensors the better, but after listening to George Hotz talk about it I can see the logic of focusing on the ambient spectrum in the visual and near-visual range. Of course, he will talk up his approach as the best, but here it is as best as I recall:

  1. more sensors ~= more signal
  2. more sensors means 
    a. longer processing pipeline for fusing data streams (timing, registration)
    b. more software, thus more surface area for defects
    c. decisions about response when 1 sensor modality fails
  3. visual range spectrum is 
    a. well adapted for environment
    b. has inexpensive and high quality sensors
    c. sufficient for humans so is sufficient to get to human-like driving by a computer
The answer to blocked cameras is:

  1. to have protocols to slow down and stop gracefully
  2. maintain enough of a spatial model of the vehicle surroundings to perform the above (Simultaneous Localization and Mapping, SLAM)
Both of the above are basically what humans do.


Our eyeballs are not cameras and have way more depth info from their function than just two arrays of pixels that you can derive parallax from, and all the claims that "humans only use their eyes" fundamentally ignore all the other parts we use, up to and including an intrinsic simulation of physics in our brain.


Yes, sure. Cameras and biological light sensing have different tradeoffs. My layperson's understanding is that the eye-brain neural pathway bandwidth is not theoretically sufficient for what we perceive, and so our brain is effectively running an ongoing simulation of the future a few milliseconds ahead of now and correcting based on sensory input.

The book "An Immense World: How Animal Senses Reveal the Hidden Realms Around Us" by Ed Yong [0] is really great for understanding how sensory input informs but isn't the same as a mental model of the world built into the operations of a living thing.

Likewise, ADAS and similar systems do not operate simply on what is sensed at any particular moment. Even ahead of things like being blinded by a sunset, there are occlusions when one object moves behind another and cannot be directly detected but can be inferred by an object model that predicts future positions given the earlier known velocity and acceleration. [1]

0. https://www.amazon.com/Immense-World-Animal-Senses-Reveal-eb...

1. Visual SLAM in dynamic environments based on object detection https://www.sciencedirect.com/science/article/pii/S221491472...


More than that, I mean eyes have more data than just what light is hitting their retinas. The work that the brain and neurons do to aim and focus your eyes at a distant object essentially solves several math problems that give you very direct distance info. Your brain knows that if the angular deviation of your eyes away from parallel is X to aim at an object, then it is ~Y distance away. It also knows that these muscles have to flex this much to focus on that object, which ALSO provides depth info to your brain. Solid-state image sensors cannot provide either of those datasets.

These two processes are actually why VR can be difficult on the eyes, because while the main way your brain senses depth is the parallax (the classic "binocular vision" way people think of), the sense of focus is telling your brain that everything is right in front of your eyes.
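
The geometry behind the vergence cue mentioned above is simple triangulation; a sketch using a typical ~6.5 cm interpupillary distance:

  import math

  def distance_from_vergence(vergence_deg, ipd_m=0.065):
      # Distance to the fixated point from the vergence angle, i.e. how far
      # the two eyes rotate inward from parallel to converge on it.
      half_angle = math.radians(vergence_deg) / 2.0
      return (ipd_m / 2.0) / math.tan(half_angle)

  print(distance_from_vergence(3.7))    # ~1 m away
  print(distance_from_vergence(0.37))   # ~10 m; the cue flattens out quickly with range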


The first rangefinder, mimicking this process mechanically, was invented in 1769. You’re essentially arguing for Lidar / sensor fusion.

Do you have any sources for this being a significant factor in human depth estimation? “Infinity” focus starts at 6 meters, yet we’re able to estimate much larger distances with great accuracy.


I looked up the history of the rangefinder and the work of Watt in the 1770s is kind of obscure. For one, he called it a “micrometer” [0] even though he also created something like what is called a micrometer today, only he called it an “end measuring machine.” Additional confusion comes from “telemeter” as an early term for a rangefinder. Only Watt was also there at the beginning of what we now call telemetry: “additions to his steam engines for monitoring from a (near) distance such as the mercury pressure gauge and the fly-ball governor.” [2]

Watt's micrometer, designed between 1770 and 1771, was what we would now call a 'rangefinder'. It was used for measuring distances, and was essential for his canal surveying work.

Adapted from a telescope, with adjustable cross-hairs in the eye-piece, it was particularly useful for measuring distances between hills or across water.

0. https://digital.nls.uk/scientists/biographies/james-watt/dis...

1. https://collection.sciencemuseumgroup.org.uk/objects/co59281...

2. https://en.wikipedia.org/wiki/Telemetry


You know cameras focus too.


It is known that in neurological vision there are also top-down pathways, where later-stage processing influences early-stage processing. Especially in poor conditions, the interpretation of images is based on an already established 3D model of the environment and information about one's own direction of movement.

It is nice that the system can generate 40 million 3D data points per second, but those points still need to be processed (and interpreted) in later-stage processing.


The cameras being 1.2 m apart is nice for accurate triangulation at larger distances, but in a crowded location with obstacles near the road, it could lead to substantial blind areas at closer distances. I also guess that in the dark, reflections of lights in puddles on the road could lead to a stark reduction in valid 3D points or even mismatches.


I thought a deranged camera had crushed a LIDAR and then claimed the life of a startup.


Long-baseline stereo matching trades away depth resolution at short ranges and requires wide-angle lenses, which result in an uneven distribution of data after correcting for distortion. The wider the lens, the more the distortion increases as you move away from the center of the lens, as pixels are no longer square; you also need a higher and higher resolution sensor to get good matches around the edges.

This makes it very hard and computationally expensive, because you need to search progressively larger areas of the image to do matching to get the same FOV at longer range. A large baseline is also more likely to suffer from occlusion (even self-occlusion) by objects in the foreground, resulting in holes in the depth map. On top of that, it's harder to keep the cameras rigid between calibrations (but I guess they figured this part out).

Instead of a depth map like they show, it would be much more useful if they showed a map of the reprojection error of a flat calibration target at the minimum and maximum depth ranges. Even if they are not aiming for geometric accuracy, it would give a much better idea of the actual performance of their system.

The problem with these stereo camera companies is they get obsessed with maximizing certain metrics when the fact is there is a fine balance of trade offs to maintain that is highly specific to the desired application and even scene. Sure, you can increase the baseline and figure out how to mitigate calibration but that creates many more problems to solve that only become apparent when you try to apply it to some application.

Anyone interested in long stereo baseline should check out this blog which has been around forever: https://www.elphel.com/blog/2017/09/long-range-multi-view-st...



I'd be curious to understand why they think the technology would work in fog. The article suggests it does, but the science behind how that worked wasn't explained (or I didn't get it).

Does anyone understand this? The problem with fog as I understand it is that it diffuses light, washing out vision. My understanding is that lidar does not bounce off fog, or at least not to the same degree.

Is my understanding there correct?


Clickbait as usual, it is one company's own claim.


So mud and dirt and dust and oil and antifreeze and all kinds of garbage sprayed up from the roads cover all parts of a car. Some people never wash their car if they don't want to.

What happens with these sensors? Can they detect physical interference? Will they refuse to let an assist mode of any kind activate if there is a malfunctioning sensor?

What kind of self-diagnostic do they run, how often and how fail-proof is it?


Volvo has had this solved for ages. Bring them back!

https://imgur.io/gallery/C3Kww


And when those break down and the owner doesn't fix it?

Self driving cars will require regular and stringent inspections of functionality, otherwise it's a time bomb.


This seems fairly trivial.

* Since safety regulations will be required for this, include a built-in self-test, with a refusal to engage self-driving until the test/sensors are deemed safe enough.

* Race cars use a spool of plastic film over their cameras, so a clean window is always ready. That's one option. There's the standard wiper. If you look around next time you're out, you'll see that most cars are very clean. So I think this would be a non-issue for most people.

* Since the self-test requirements may become more stringent over the years, cars that can't comply with the latest requirements can have an annual/semi-annual verification, like we do now for emissions of older cars. Maybe as part of the standard, thresholds for the self-test will need to be adjustable, to keep them in the "safe" range, and trigger earlier checks/cleaning.

* You have to compare all of this to having a human behind the wheel.


I am not sure if you live in the USA, but we have dozens of states without any car inspection at all. In my city alone there are hundreds if not thousands of cars and trucks that drive around without mufflers, with exhaust bypasses, or with headlights broken or off at night.

The Tesla Model 3 is one of the most popular-selling cars now, and that's just a hop and a skip away from someone hacking the self-driving to always be on and selling it cheap, like the exhaust bypass hack that's been done to tens of thousands of trucks on the road today. https://www.thedrive.com/news/inside-the-epas-messy-war-on-d...


Private individuals owning self-driving vehicles won't really be a thing in the longer term, it doesn't make economic sense. There are countless advantages to having a company or local government run and maintain a fleet.


> it doesn't make economic sense

It does, in the other parts of the world, where there are cars being made that exist outside of the luxury market.

You can still get a Nissan Versa for $16k. Adjusted for inflation, that's only $7k in 1990 dollars, which is cheaper than a Ford Escort from 1990, with 35% more horsepower.

The problem is that people's standards are ridiculous.


Both of my cars have some level of ADAS systems on them. Both have ways to have sprayers clean cameras or have the cameras in the path of the standard wipers. When they cannot sense well enough they will disallow activating their ADAS features.

What happens when the windshield gets so coated in junk that the driver can't drive?


The most performant AVs right now aren't using solely lidar. They have cameras and many other monitors in their sensor suites.


Would solid state Lidar really be that much more expensive than a CMOS camera if it were manufactured at scale?


It's extraordinary that the article mentions Waymo and Cruise in this context but not Tesla.


It seems Tesla is finally starting to get serious about L4 autonomy, but they don't have much to show for it yet. What's extraordinary is how many people have fallen for Elon's bullshit and remain committed to it in spite of it being obvious bullshit.

In any case, many journalists are still trying to spin it as cameras vs lidar when it's really single modality vision systems with no redundancy versus multi-modal vision systems that conventionally include camera as well as lidar.

Various Lidar startups have also made inroads in deriving vision from laser range finding technology. It's hard to speculate on what will win in the end, but nitpicking these things is a form of bike shedding. Ultimately it comes down to the software, which is a much bigger problem and not as straightforward to wrap one's brain around.


I assume the downvotes are about mentioning that one person that gets mentioned a lot? Factually this post is pretty accurate IMHO. As a human you can almost always extract "this could have been seen" from fused sensor data. That does not mean the computer can (some of this also comes from you judging post mortem and typically having additional information about the scene). Neither hand-made nor trained classifiers master the "long tail" that is needed for L4. This does not change with some percent improvement of one sensor.


Tesla doesn’t have autonomous vehicles. It has level 2 vehicles. Waymo and Cruise have level 4 vehicles.


>Waymo and Cruise have level 4 vehicles.

In a single small extremely constrained geofenced area using vehicles with hundreds of thousands of dollars worth of sensors each. Those companies are complete dead ends. They are not even remotely solving for the general case.


Geofenced is the definition of level 4. Tesla doesn’t have a level 3 car on the market.


>Geofenced is the definition of level 4. Tesla doesn’t have a level 3 car on the market.

And levels 3/4 are useless, better served by public transit. There's a reason the Japanese/Korean manufacturers are skipping them. You either have really good level 2 (i.e. LKAS+ACC), or you go to full level 5. And Waymo/Cruise will never solve for level 5 with their approach. Tesla (and Comma) are at least working on the general solution to it.


> And levels 3/4 are useless, better served by public transit.

Level 4 where the geofence is "all major cities and highways" would be immensely useful.

> There's a reason the Japanese/Korean manufacturers are skipping them.

https://www.caranddriver.com/news/a35729591/honda-legend-lev...

https://canada.autonews.com/technology/hyundai-very-close-ac...


Toyota partnered with Pony.ai and has a fleet of level 4 taxis operating in all major cities (e.g. Beijing, Shanghai and Shenzhen) in China. Honda has level 3 self-driving cars on the market in Japan that you can buy today. Kia is working on a level 3 car and has also announced level 4 taxi plans. Hyundai also has plans to add level 3 functionality to its cars.


You can take public transit to the airport. I’ll take a Waymo taxi.


Is this supposed to be an argument? Some sort of elitism about public transit?


How much does it cost to pay a human driver for 10 years?


I highly recommend listening to this podcast episode. Lex Fridman interviews Andrej Karpathy (former director of AI at Tesla): https://podcasts.apple.com/us/podcast/lex-fridman-podcast/id...

At 1:32:39, Andrej explains why Tesla removed radar and relies on vision. Part of it is that other sensors bloat and complicate the software needed to interpret the data coming in. What if they get conflicting signals? There may be calibration or manufacturing inconsistency to account for. You need to normalize that in your software, and the entropy quickly gets out of control. I find this explanation very compelling and feel like vision is the more "necessary" sensor anyway. After all, roadways are designed to be navigated using vision.


He "forgets" to make a point of the number one reason they don't use lidar: cost.


For lidar, surely, but what about the ultrasonic sensors?


If something is required for the core functioning of your product, that thing is not “bloat”. That’s like saying “It’s really hard to do security and authorization for this server I’m setting up, I’ll just call it bloat and leave everything open.”


If you're going to deal with multiple cameras, why wouldn't you at least have a 4x4 array of cameras, with a spacing of at least 10 cm? You could deal with small object occlusion, and gain some redundancy.


Sounds cool.

But these days, I’ve learned to wait for a working commercial prototype, before cheering.


I read the title to mean "a camera crushed a lidar sensor which caused a startup to go into administration"


Hasn't this already been settled after Tesla's claim?


> claim

Claims never settle anything.


Was it?


¿Por qué no los dos? (Why not both?)


and scissors cut paper


it'll also crush the kids they run over


If time-of-flight sensing made sense, we would have evolved that over millions of years - oh wait...


I'm actually interested in this, are you claiming humans do time of flight measurements?


My "we" was really animals in general, where there are several examples: bats, dolphins. But I have seen some academic papers on human echolocation over the years.


Oh, I thought you meant time of flight for light, e.g. lidar. That would be very surprising.



