
So who's doing interesting applied work in scene understanding for AR?



I haven't paid close attention to AR in the last year. I did AR almost exclusively at my previous job, but I always really only wanted to do VR. My current job is 100% VR focused again, so I'm not completely up to date, though I still see things in passing on Twitter and such.

I think one of the highest-value things in progress right now is the work Apple is doing to combine feature detection and location. They get your rough location with GPS, stream a feature-point cloud to your device, then figure out your precise location from the camera view. Just having a reliable, centimeter-accurate position and orientation for a user out in the full, real world is going to be a huge enabling factor for AR applications.
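To make that concrete, here's a minimal sketch of the pose-refinement step, assuming you already have 2D-to-3D matches between camera keypoints and the streamed point cloud. The real pipeline's matching, outlier handling, and GPS/IMU fusion are far more involved; this function is my own illustration, not Apple's API.

    # Sketch: refine a GPS-coarse pose to centimeter level by solving
    # Perspective-n-Point against a known feature-point cloud.
    import numpy as np
    import cv2

    def refine_pose(points_3d, points_2d, camera_matrix):
        # points_3d: (N, 3) world coordinates from the streamed cloud
        # points_2d: (N, 2) matching keypoints in the camera frame
        # camera_matrix: 3x3 intrinsics for the device camera
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            points_3d.astype(np.float32),
            points_2d.astype(np.float32),
            camera_matrix, distCoeffs=None)
        if not ok:
            return None
        R, _ = cv2.Rodrigues(rvec)  # rotation vector -> 3x3 matrix
        return -R.T @ tvec, R       # camera position and orientation in world space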

Without it, the best thing you could do is turn-by-turn directions, because GPS precision and drift prevent you from doing anything in close proximity. And if it's not in close proximity, then you're really limited in the detail you can provide to people looking through their phones.

With it, you can start to do a lot more stuff. Art installations, public information kiosks, event-based things. And it's out in the world where other people will see people doing it. Exactly like how people got interested in Pokemon Go because they saw other people in the street playing it.

I think we're still a long way off from useful object recognition. I've seen a lot of concepts around brands wanting their appliances detected and giving users instruction manuals, repair manuals, or value-added services. The problem is, state-of-the-art object recognition can fairly reliably tell you "I see a refrigerator", but it can't tell you "this is a Whirlpool refrigerator", to say nothing of the specific model. So that kind of goes back to the funding problem. Whirlpool, GE, Frigidaire: they all want AR, but not if it's going to work with other brands. Same with basically every other product on the market. So object recognition is at about the same point that location tracking is, with regard to detail. It's going to take either an unrelated company going out on a limb to support multiple brands without the brands' involvement, or an unlikely development in object recognition that can reliably detect brands and models.
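For a sense of where that ceiling sits, here's what an off-the-shelf detector gives you today. I'm using torchvision's COCO-pretrained Faster R-CNN as a stand-in, but any similar model makes the same point: brand and model simply aren't in the label space.

    # The label space of a generic detector: categories, not brands.
    import torch
    from torchvision.models.detection import (
        fasterrcnn_resnet50_fpn, FasterRCNN_ResNet50_FPN_Weights)

    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights).eval()
    labels = weights.meta["categories"]  # COCO classes: "refrigerator",
                                         # "oven", ... no makes, no models

    with torch.no_grad():
        out = model([torch.rand(3, 480, 640)])[0]  # stand-in for a camera frame
    for idx, score in zip(out["labels"], out["scores"]):
        if score > 0.5:
            print(labels[int(idx)])  # at best: "refrigerator"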

Things that were just coming along when I stopped doing full-time AR work (hopefully they've gotten better in the last year):

Microsoft and IBM were doing a lot of great work on improving object detection, plus providing it as a service to be used in applications. That's another problem: most of the stuff you see is so research-grade that it's still years away from being productized, if it doesn't get canned first. But at least some of the high-level object recognition work is productized right now. It's slow, though, so you have to be smart in your UX about how you manage the queries. A neural network can tell you that it saw a cat in a picture you sent it a full second ago, but it can't tell you that it sees a cat right now in your video stream. If you can work with detecting things in still images, though, it could be usable.
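The usual pattern is something like the loop below: sample still frames, keep one request in flight, and treat the answer as "recently seen" rather than "in view right now". recognize_still() here is a hypothetical wrapper around whichever cloud vision API you're using; the rest is just the pacing.

    # Throttled still-image recognition: one query in flight at a time.
    import asyncio

    async def recognition_loop(grab_frame, recognize_still, on_result,
                               min_interval=1.0):
        loop = asyncio.get_running_loop()
        while True:
            started = loop.time()
            frame = grab_frame()                   # latest camera frame
            result = await recognize_still(frame)  # cloud round trip, ~1s
            on_result(result)                      # label is already stale:
                                                   # "seen recently", not "in view"
            elapsed = loop.time() - started
            await asyncio.sleep(max(0.0, min_interval - elapsed))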

PTC was doing some much simpler but still very interesting work with their Vuforia system, pivoting to value-added services on top of the native AR subsystems rather than just providing the AR subsystem itself. Vuforia was great 3-5 years ago when we didn't have any native AR subsystems, but Google and Apple have basically pulled the rug out from under them. Image-target tracking is both terrible and great: terrible because it's not very flexible, but great because it tells you something contextual about the user, namely that they have my image target in view. They also have a live-annotation system for spatial drawing in 3D that's really interesting for teleconferencing, except Vuforia is positioning it for industrial repair. Again, "who is paying for this" gets in the way.
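If you've never worked with image targets, the core idea fits in a few lines: match features between a known target image and the camera frame, then fit a homography. Vuforia's implementation is proprietary and far more robust; this sketch just shows what "they have my image target in view" means mechanically.

    # Bare-bones image-target detection: ORB features + RANSAC homography.
    import numpy as np
    import cv2

    orb = cv2.ORB_create(1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    def find_target(target_gray, frame_gray, min_matches=20):
        kp_t, des_t = orb.detectAndCompute(target_gray, None)
        kp_f, des_f = orb.detectAndCompute(frame_gray, None)
        if des_t is None or des_f is None:
            return None
        matches = matcher.match(des_t, des_f)
        if len(matches) < min_matches:
            return None  # target not in view
        src = np.float32([kp_t[m.queryIdx].pt for m in matches])
        dst = np.float32([kp_f[m.trainIdx].pt for m in matches])
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        return H  # maps target coordinates into the camera frame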

If appliance brands could get their collective heads out of their asses and accept that maybe a new paradigm requires a little flex and adaptation on their own part, I could see new products being developed that are visually easier to detect.

I also think all the work going into speech recognition and semantic understanding of text is super important for AR and VR. It's not just about making user interfaces for people to use these systems hands-free (though that's important, too, because there are a lot of scenarios where a user might not have free use of their hands). Having reliable, contextual information about what people are talking about in, say, a meeting, could enable virtual assistant technologies that aren't dumpster fires.

Similarly, reliable facial recognition would be a huge help for AR systems, in a lot of very obvious ways.

But facial recognition leads me to the unfortunate thought that we're going to run into some intractable problems in machine learning that will prevent the full, perfect future of AR. Even disregarding the moral hazard of selecting an appropriate training set, the problem is that ML-based techniques are inherently biased. That's the entire point: boiling down a corpus of data into a smaller model that can generate guesses at results. ML is not useful without the bias.

Bias is OK in some contexts (guessing at letters a user has drawn on a digitizer) and absolutely wrong in others (needlessly subjecting an innocent person to the judicial system and all of its current flaws). The difference comes down to four things: how easily one can correct for false positives/negatives, how easy it is to recognize false output, how the data and results relate to objective reality, and how destructive bad results may be.

Things like product suggestions or voice dictation work because, when we get a bad result, we can easily recognize and correct for it, often by just retrying. And part of why we can tell there's a problem at all is that the results link back to some notion of objective reality. In contrast, a NN that dreams up photos of dogs melting into a landscape has no impact on reality.

But facial recognition runs into so many problems here. If you're trying to detect a particular person's face, they don't have another face you can try to see if you get better results. If you don't know who the person you're trying to detect is (e.g. identifying a person from a photo), then you don't even know when the results are wrong, so you can't try for a different answer. And because you're bringing these results back to an action in objective reality, wrong answers have real impact on real people (e.g. identifying suspects from security camera photos).
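And the base rates make it worse. Even a very accurate matcher produces mostly false positives when the person you're looking for is rare in the population being scanned. The figures below are made up, but representative:

    # Base-rate arithmetic for a face matcher scanning a crowd.
    def precision(tpr, fpr, base_rate):
        true_pos = tpr * base_rate
        false_pos = fpr * (1 - base_rate)
        return true_pos / (true_pos + false_pos)

    # 99% true positive rate, 1% false positive rate,
    # 1-in-10,000 chance any given face is the target:
    print(precision(0.99, 0.01, 1e-4))  # ~0.0098, i.e. >99% of hits are wrong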

So yeah, I'm not too hopeful for AR. The tech is cool, and I certainly want to be able to have good AR tech. But some of the further-afield ideas about how the tech might enhance semantic understanding of the world... I think a lot of it is a pipe dream. I suspect the actually achievable maximum is strictly limited to entertainment and productivity.




