
> my most optimistic wild guess is 10-20 years

Woah, that's a long time! What are the problems in computer vision that need to be solved?


If this is accurate, the current models give a hint of the issue. The car can't yet see very far, and any minor obstruction confuses the hell out of it. There are too few pixels to make out the road at a distance, and the car doesn't use any of the other clues humans use to figure out what the road is doing next, i.e. we can guess a corner from vertical signs, guard rails, trees and even hillsides. See this example: in red, the pixels that hint to a Tesla that a corner is coming; in blue, those that a human can also use: https://i.imgur.com/CvntZuZ.png

The problem is that a camera, especially if it's not at a high vantage point, will have very few pixels to represent distant features.
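A rough way to put numbers on this, assuming an idealized pinhole camera looking horizontally over a flat road. The focal length and mounting heights here are made-up illustrative values, not taken from any real vehicle:

```python
def road_rows(focal_px: float, cam_height_m: float,
              near_m: float, far_m: float) -> float:
    """Image rows spanned by a flat road between near_m and far_m ahead.

    Assumes a pinhole camera with a horizontal optical axis: a ground
    point at distance d projects focal_px * cam_height_m / d rows below
    the horizon line.
    """
    if not 0 < near_m < far_m:
        raise ValueError("expected 0 < near_m < far_m")
    return focal_px * cam_height_m * (1.0 / near_m - 1.0 / far_m)


# Hypothetical numbers: 1000 px focal length, windshield-height camera
# (1.2 m) versus a roof-mast camera (3.0 m).
low_mount = road_rows(1000.0, 1.2, 100.0, 200.0)
high_mount = road_rows(1000.0, 3.0, 100.0, 200.0)
print(f"rows for 100-200 m at 1.2 m: {low_mount:.1f}")
print(f"rows for 100-200 m at 3.0 m: {high_mount:.1f}")
```

Under those assumptions, the whole stretch of road from 100 m to 200 m ahead fits in about 6 image rows at a 1.2 m mounting height and about 15 rows at 3 m, which is why vantage point matters so much.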

The autopilot behaves like an illegal street racer in that video (extremely tight turns) and this happened while it was only a few seconds away from colliding with a motorcycle.

As a side note: not all of the video was in autopilot mode; the icon changes to show where there was manual intervention.

Adversarial attacks. Extrapolating from incomplete data. Consistent performance in any/most lighting conditions: night, dusk, low hanging sun over an icy road. Fog. Any combination of the above.

Humans, while not perfect, are capable of making split-second decisions on incomplete data under a surprising range of conditions, based on the fact that our brains are an unmatched pattern-matching beast feeding off of our experience.

Adversarial Attacks? Like removing a speed limit sign? Painting over lane markers? Dropping bricks off of overpasses? Throwing down a spike strip? Sugar in the gas tank?

Unless you're talking about traditional computer security, which it doesn't seem like you are, these types of threats have not prohibited human drivers, despite the fact that humans are very susceptible to "adversarial attacks" while driving too. Whether it's putting carefully crafted stickers over a stop sign to confuse a CNN or yanking it out of the ground to get a human driver killed, you're talking about interfering with the operator of a moving vehicle... so what's the critical difference here?

I think the difference is that software tends to be much more fragile - and more predictable - than humans. Paint fake lane markers, and an autopilot might drive full speed into a wall because it trusts them; the next 5 cars with autopilots will all do the same thing. An attacker can verify that an autopilot will do this ahead of time. A human on the other hand will be more likely to notice that things are amiss - they can pick up on contextual clues, like the fresh paint, and the fact they've driven that road hundreds of times before and instantly notice the change, and the pile of burning self-driving cars.

I'm not sure why hackers prefer to hack large-scale computer systems rather than individual humans, but they do. So we have to protect neural networks against adversarial examples for the same reason we have to protect databases against SQL injection.

> so what's the critical difference here?

If neural networks are deployed at scale in self driving cars, a single bug could trigger millions of accidents.

Printing an adversarial example on billboards would lead to crashes all around the country. Are we going to assume no one is going to try? (btw: real world adversarial examples are easy to craft [1]).

[1] https://arxiv.org/pdf/1707.07397.pdf

> Adversarial Attacks? Like

Like literally the article we’re commenting on. Image recognition systems in general are much more susceptible to errors in cases where humans wouldn’t even think twice.

And yes, “hey, a stop sign was here just yesterday” is also a situation for which humans are uniquely equipped, and computers aren’t.

Adversarial attacks with low friction for the attacker, e.g. malicious software updates.

Do you mean OTA, or just any attack? I think non-autonomous vehicles would have the same concerns... or really any equipment with embedded computers.

I mean any attack where the attacker is not required to move away from the keyboard and can corrupt multiple vehicles in one go.

Could the keyboard be in a radio-equipped car?


Boy, are you an optimist.

Humans are LOUSY in almost all of those conditions as well as much less challenging ones--we engineer the road in an attempt to deal with human imperfections. The I-5/I-805 intersection in California used to have a very slight S-curve to it--there would be an accident at that point every single day. Signs. Warning markers. Police enforcement. NOTHING worked. They eventually just had to straighten the road.

Humans SUCK at driving.

Most humans have a time and landmark-based memory of a path and they follow that. Any deviation from that memory and boom accident.

This is the problem I have with the current crop of self-driving cars. They are solving the wrong problem. Driving is two intertwined tasks--long-term pathing, which is mostly spatial memorization, and immediate response, which is mostly station keeping with occasional excursions into real-time changes.

Once they solve station-keeping, the pathing will come almost immediately afterward.

Compared to the current generation of “self-driving cars” I’d say humans excel at driving.

You make a succinct point. Can you elaborate on the difference between Station Keeping and Lane Keeping, what is generally available now as LKAS?

"Station Keeping" is maintaining your position relative to the other cars around you--and it's what most people do when driving.

Ever notice how a bunch of stupid drivers playing with their phones tend to lock to the same speed and wind up abreast of one another? Ever notice how you feel compelled to start rolling forward at a light even when it is still red simply because the car next to you started moving? When in fog, you are paying attention to lane markers if you can see them, but you are also paying attention to what the tail lights ahead of you are doing.

All of that is "station keeping".

And it's normally extremely important to give it priority--generally even over external signals and markings (a green light is only relevant if the car in front of you moves). It's the kind of thing that prevents you from running into a barrier because everybody else is avoiding the barrier, too.

Of course, it's also what leads to 20-car pileups, so it's not always good...

It’s not a long time. Computer speech recognition, a far simpler problem, has barely advanced at all in 10-20 years. Siri is no better than Dragon Dictate was in the late 1990s. It’s possibly worse.

Yeah this is just completely wrong. Without getting into specific products, public test sets from the 1990s like Switchboard and WSJ are now at around human-level transcription accuracy rates; 20 years ago the state of the art was nowhere near that.

It's also not objectively a simpler problem. Humans are actually not particularly good at speech recognition, especially when talking to strangers and when they can't ask the speaker to repeat themselves. Consider how often you need subtitles to understand an unfamiliar accent, or reach for a lyrics sheet to understand what's being sung in a song. For certain tasks ASR may be approaching the noise floor of human speech as a communication channel.

I assume you're basing your claim on WER on data sets like Switchboard, which is a garbage metric: https://medium.com/descript/challenges-in-measuring-automati....

Humans may not be particularly great at speech transcription, but they're phenomenal at speech recognition, because they can fill in any gaps in transcription from context and memory. At 95% accuracy, you're talking about a dozen errors per printed page. Any secretary that made that many errors in dictation, or a court reporter that made that many errors in transcribing a trial, would quickly be fired. In reality, you'd be hard pressed to find one obvious error in dozens of pages in a transcript prepared by an experienced court reporter. It is not uncommon in the legal field to receive a "proof" of a deposition transcript, comprising hundreds of pages, and have only a handful of substantive errors that need to be corrected. That is to say, whether or not the result is exactly what was said, it's semantically indistinguishable from what was actually said. (And that is why WER is a garbage metric--what matters is semantically meaningful errors.)
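For anyone unfamiliar with the metric being argued about: WER is usually defined as the word-level edit distance (substitutions + deletions + insertions) between the reference and the recognizer output, divided by the number of reference words. A minimal sketch, assuming whitespace tokenization and no text normalization; the sample sentences and the 250-words-per-page figure are illustrative assumptions, not numbers from any benchmark:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # ref word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def errors_per_page(wer: float, words_per_page: int = 250) -> float:
    """Expected word errors on a page (page length is an assumption)."""
    return wer * words_per_page


# One wrong word out of five counts the same whether or not it
# changes the meaning of the sentence.
print(word_error_rate("they both like to sing", "they both like to sick"))
print(errors_per_page(0.05))
```

Note that the metric weights every word error equally: a dropped "uh" and a "sing" turned into "sick" each cost one error, which is the core of the complaint about it.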

The proof of the pudding is in the eating. If automatic speech recognition worked, people would use it. (After all, executives used to dictate letters to secretaries back in the day.) On the rare occasions you see people dictate something into Siri or Android, more often than not what you see is hilarious results.

That article is correct that WER has some problems, but it also correctly concludes that "Even WER’s critics begrudgingly admit its supremacy."

Yes, Switchboard has problems (I've mentioned many of them here) but it was something that 1990s systems could be tuned for. You would see even more dramatic improvements when using newer test sets. A speech recognition system from the 1990s will fall flat on its face when transcribing (say) a set of modern YouTube searches. Most systems in those days also didn't even make a real attempt at speaker-independence, which makes the problem vastly easier.

Executives don't dictate as much any more because most of them learned to touch type.

Oh come on. I remember playing with speech rec in the 90's and it was terrible.

Now it works great!

Siri, at least, is total garbage. Half the time I try to dictate a simple reminder, Siri botches it. (The other day, I tried to text my wife that both Maddie and Tae sing. Siri kept transcribing “sing” as “sick.”) Siri at least is no better than Dragon NaturallySpeaking was in the 1990s. The Windows 10 speech recognizer is somewhat better, but it’s still not usable (when was the last time you saw anybody use it?).

I don't have experience with Dragon or Siri, but Google Assistant has been improving at a noticeable pace and for me seems to recognize at least 90% correctly.

I think the biggest problem with speech recognition is that it annoys everyone around you. I would use it more often but I don't like being noisy...

Maybe if it could be made to work well while whispering....
