Woah, that's a long time! What are the problems in computer vision that need to be solved?
if this is accurate, the current models give a hint of the issue. the car can't yet see very far, and any minor obstruction confuses the hell out of it. there are too few pixels to make out the road at a distance, and the car doesn't use any of the other clues humans do to figure out what the road is doing next - i.e. we can guess a corner from vertical signs, guard rails, trees and even hillsides - see this example: in red, the pixels that hint a Tesla at an incoming corner; in blue, those that a human can also use: https://i.imgur.com/CvntZuZ.png
the problem is that a camera, especially if it's not at a high vantage point, will have very few pixels to represent distant features.
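to put rough numbers on that, here's a back-of-the-envelope sketch for an idealized pinhole camera. the resolution and field of view are made-up illustrative values, not any particular car's specs:

```python
import math

def pixels_across(feature_width_m, distance_m, h_res=1280, h_fov_deg=90):
    """Approximate horizontal pixels a feature subtends at a given distance,
    for an idealized pinhole camera (illustrative resolution/FOV)."""
    # total scene width visible at that distance
    scene_width = 2 * distance_m * math.tan(math.radians(h_fov_deg) / 2)
    return feature_width_m / scene_width * h_res

# a 3.7 m lane at increasing distances:
for d in (10, 50, 100, 200):
    print(d, round(pixels_across(3.7, d), 1))
# 10 -> 236.8 px, 50 -> 47.4, 100 -> 23.7, 200 -> 11.8
```

so at 200 m an entire lane is about a dozen pixels wide under these assumptions - not much to infer road geometry from.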
Humans, while not perfect, are capable of making split-second decisions on incomplete data under a surprising range of conditions, because our brains are unmatched pattern-matching beasts feeding off of our experience.
Unless you're talking traditional computer security, which it doesn't seem like you are, these types of threats have not prohibited human drivers, despite the fact that humans are very susceptible to "adversarial attacks" while driving too. Whether it's putting carefully crafted stickers over a stop sign to confuse a CNN or yanking it out of the ground to get a human driver killed... you're talking about interfering with the operator of a moving vehicle, so what's the critical difference here?
> so what's the critical difference here?
If neural networks are deployed at scale in self driving cars, a single bug could trigger millions of accidents.
Printing an adversarial example on billboards would lead to crashes all around the country. Are we going to assume no one is going to try? (btw: real-world adversarial examples are easy to craft).
Like literally the article we’re commenting on. Image recognition systems in general are much more susceptible to errors in cases where humans wouldn’t even think twice.
And yes, “hey, a stop sign was here just yesterday” is also a situation for which humans are uniquely equipped, and computers aren’t.
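for a sense of how easy these attacks are to craft, here's a toy sketch of the fast gradient sign method (FGSM) on a made-up linear classifier - the weights and inputs are invented for illustration; real attacks target deep networks the same way, by nudging the input along the loss gradient:

```python
import numpy as np

# toy classifier: p(y=1|x) = sigmoid(w.x + b); weights invented for illustration
w = np.array([2.0, -3.0, 1.5])
b = 0.1

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def predict(x):
    return sigmoid(w @ x + b)

def fgsm(x, y, eps):
    """Perturb every input dimension by eps in the direction that
    increases the logistic loss: dL/dx = (p - y) * w."""
    grad = (predict(x) - y) * w
    return x + eps * np.sign(grad)

x = np.array([0.5, -0.2, 0.3])   # confidently classified as class 1
x_adv = fgsm(x, y=1, eps=0.4)    # small, structured perturbation
print(predict(x) > 0.5, predict(x_adv) > 0.5)  # True False
```

the perturbation is bounded per pixel (here per feature), which is why it can be small enough to look like noise to a human while still flipping the classifier's answer.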
Humans are LOUSY in almost all of those conditions as well as much less challenging ones--we engineer the road in an attempt to deal with human imperfections. The I-5/I-805 intersection in California used to have a very slight S-curve to it--there would be an accident at that point every single day. Signs. Warning markers. Police enforcement. NOTHING worked. They eventually just had to straighten the road.
Humans SUCK at driving.
Most humans have a time- and landmark-based memory of a path and they follow that. Any deviation from that memory and, boom, accident.
This is the problem I have with the current crop of self-driving cars. They are solving the wrong problem. Driving is two intertwined tasks--long-term pathing, which is mostly spatial memorization, and immediate response, which is mostly station keeping with occasional excursions into real-time changes.
Once they solve station-keeping, the pathing will come almost immediately afterward.
Ever notice how a bunch of stupid drivers playing with their phones tend to lock to the same speed and wind up abreast of one another? Ever notice how you feel compelled to start rolling forward at a light even when it is still red simply because the car next to you started moving? When in fog, you are paying attention to lane markers if you can see them, but you are also paying attention to what the tail lights ahead of you are doing.
All of that is "station keeping".
And it's normally extremely important to give it priority--generally even over external signals and markings (a green light is only relevant if the car in front of you moves). It's the kind of thing that prevents you from running into a barrier because everybody else is avoiding the barrier, too.
Of course, it's also what leads to 20 car pile ups, so it's not always good...
It's also not objectively a simpler problem. Humans are actually not particularly good at speech recognition, especially when talking to strangers and when they can't ask the speaker to repeat themselves. Consider how often you need subtitles to understand an unfamiliar accent, or reach for a lyrics sheet to understand what's being sung in a song. For certain tasks ASR may be approaching the noise floor of human speech as a communication channel.
Humans may not be particularly great at speech transcription, but they're phenomenal at speech recognition, because they can fill in any gaps in transcription from context and memory. At 95% accuracy, you're talking about a dozen errors per printed page. Any secretary that made that many errors in dictation, or a court reporter that made that many errors in transcribing a trial, would quickly be fired. In reality, you'd be hard pressed to find one obvious error in dozens of pages in a transcript prepared by an experienced court reporter. It is not uncommon in the legal field to receive a "proof" of a deposition transcript, comprising hundreds of pages, and have only a handful of substantive errors that need to be corrected. That is to say, whether or not the result is exactly what was said, it's semantically indistinguishable from what was actually said. (And that is why WER is a garbage metric--what matters is semantically meaningful errors.)
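to make the WER complaint concrete, here's a minimal word error rate implementation (standard word-level Levenshtein distance; the example sentences are invented) showing two transcripts with identical WER but very different semantic damage:

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edits to turn the first i ref words into the first j hyp words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + sub)  # substitution/match
    return dp[len(r)][len(h)] / len(r)

ref = "the witness did not sign the agreement"
# both hypotheses score 1 error in 7 words, i.e. identical WER:
print(wer(ref, "the witness did not sign the agreament"))  # harmless typo
print(wer(ref, "the witness did now sign the agreement"))  # meaning inverted
```

both errors count the same, yet one is a cosmetic misspelling and the other reverses the testimony - exactly the distinction WER can't see.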
The proof of the pudding is in the eating. If automatic speech recognition worked, people would use it. (After all, executives used to dictate letters to secretaries back in the day.) On the rare occasions you see people dictate something into Siri or Android, more often than not the results are hilarious.
Yes, Switchboard has problems (I've mentioned many of them here) but it was something that 1990s systems could be tuned for. You would see even more dramatic improvements when using newer test sets. A speech recognition system from the 1990s will fall flat on its face when transcribing (say) a set of modern YouTube searches. Most systems in those days also didn't even make a real attempt at speaker-independence, which makes the problem vastly easier.
Executives don't dictate as much any more because most of them learned to touch type.
Now it works great!
Maybe if it could be made to work well while whispering....