Hacker News
Ask HN: How do World Lenses work without depth sensor?
24 points by hackerews 9 days ago | 18 comments
Snap just released World Lenses and I'm wondering if CV experts can explain how they work: https://www.youtube.com/watch?v=K6x44v8prFA.

I was under the impression that phones needed depth and advanced motion tracking (eg see Tango https://developers.google.com/tango/overview/motion-tracking) to enable stuff like placing digital objects down and having them stick without moving around. Surprised Snapchat can now do that without needing additional sensors on the device.

How might that work?

You can use the accelerometer to figure out which way is "down". Then you can use CV techniques to identify the ground plane. Even if you get this a bit wrong, as long as you visually track some registration points, you get consistent positioning of your virtual object, which is probably the most important bit.
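
To make the idea concrete, here's a minimal sketch (my own illustration, not Snap's code; all names are invented). The accelerometer gives a gravity vector in the camera frame; if you treat the ground as a plane perpendicular to gravity at an assumed camera height, you can anchor a virtual object by intersecting a camera ray with that plane:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def intersect_ray_with_ground(ray_dir, cam_height, gravity):
    """Intersect a ray from the camera origin with the ground plane.

    gravity: accelerometer reading in the camera frame (points "down").
    cam_height: assumed height of the camera above the ground (metres).
    Returns the 3D anchor point in camera coordinates, or None if the
    ray never reaches the ground.
    """
    g_norm = dot(gravity, gravity) ** 0.5
    down = tuple(g / g_norm for g in gravity)   # unit "down" vector
    # Ground plane: all points p with dot(p, down) == cam_height
    denom = dot(ray_dir, down)
    if denom <= 1e-9:                           # ray points at or above the horizon
        return None
    t = cam_height / denom
    return tuple(t * d for d in ray_dir)

# Camera held 1.5 m up, gravity along +y, looking 45 degrees downward:
anchor = intersect_ray_with_ground((0.0, 1.0, 1.0), 1.5, (0.0, 9.81, 0.0))
```

As long as the tracked registration points keep the camera pose estimate consistent, that anchor point stays put even if the height or plane guess is slightly off.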

I noticed in the videos that nothing ever goes in front of the animated objects, so they're probably not identifying objects or trying to figure out object boundaries or anything like that. It's a simple, well-polished trick.

That would be my guess as well. If you actually try the feature, it doesn't map the planes in the image; it just assumes you're pointing it at flat ground.

Tango is the "correct" way to do this and will give essentially the most robust data back to whoever is looking. A hybrid approach combining SLAM on images with depth sensing will always be more robust.

But just plain old analysis of "parallax" (things closer move faster etc) can quickly give you a workable understanding of the general geometry of a space.
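
The parallax relationship is easy to see with a toy pinhole-camera example (my own illustration, not anyone's actual pipeline): for a sideways camera translation (the baseline) and a focal length f in pixels, the apparent shift of a point between two frames is inversely proportional to its depth, so points that "move faster" across the image are closer.

```python
def depth_from_parallax(focal_px, baseline_m, disparity_px):
    """Recover the depth (metres) of a point from its pixel disparity.

    Pinhole model: disparity = focal_px * baseline_m / depth,
    so depth = focal_px * baseline_m / disparity.
    """
    return focal_px * baseline_m / disparity_px

# A point that shifts 20 px under a 10 cm sideways move (f = 1000 px)
# is 1000 * 0.1 / 20 = 5 m away...
near = depth_from_parallax(1000.0, 0.1, 20.0)
# ...while a point that only shifts 5 px under the same move is 20 m away.
far = depth_from_parallax(1000.0, 0.1, 5.0)
```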

The new MR headsets like Magic Leap and HoloLens likely use both types of sensors (like the Kinect camera) to determine this stuff. But Snap just wants the widest push possible, and since it is not mounted on your eyes, the 'pixel stick' being off is not as much of an issue.

There's a lot you can do with just a single camera and an IMU. PTAM is just one example; there are other formulations, but it tends to work well without using a ton of CPU/GPU resources.
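
A drastically simplified toy of PTAM's tracking step (illustrative only, far from the real algorithm): given a small map of known points and their current image observations, estimate the camera pose that best explains them. Here the "pose" is just a sideways offset tx of a pinhole camera with focal length f, so the least-squares answer has a closed form:

```python
def estimate_tx(map_points, observations, f=1.0):
    """Estimate the camera's sideways offset tx from tracked map points.

    map_points: list of (X, Z) world points in the map.
    observations: projected image coordinate u for each point,
                  following the pinhole model u = f * (X - tx) / Z.
    """
    # Invert the projection per point: tx = X - u * Z / f, then average.
    estimates = [X - u * Z / f for (X, Z), u in zip(map_points, observations)]
    return sum(estimates) / len(estimates)

# Simulate observations of three mapped points from a camera at tx = 0.25:
pts = [(1.0, 2.0), (-1.0, 4.0), (0.5, 1.0)]
true_tx = 0.25
obs = [(X - true_tx) / Z for X, Z in pts]   # f = 1
tx = estimate_tx(pts, obs)                  # recovers ~0.25
```

Real PTAM estimates a full 6-DoF pose with iterative optimization and runs mapping in a separate thread, but the cheapness of the per-frame tracking step is the point here.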


There are a few things stacked on top of each other to make it happen, but this is a good start.

Combine this with GPS information and you can roughly "place it in the world" and make sure it doesn't clip the ground plane. After that, it is just pulling it up from your sparse 2D DB and dumping it into the scene for other cameras. If they really care, there'll be some data included from the placement camera to help localize it.
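
A hypothetical sketch of that "sparse 2D DB" idea (schema and names entirely invented; the source only speculates this exists): placements are stored with a lat/lon, and a viewing camera pulls any placements within some radius of its own GPS fix using a great-circle distance check.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in metres."""
    R = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R * math.asin(math.sqrt(a))

def placements_near(db, lat, lon, radius_m=50.0):
    """Return all stored placements within radius_m of the viewer's fix."""
    return [p for p in db if haversine_m(lat, lon, p["lat"], p["lon"]) <= radius_m]

db = [
    {"id": "placement-a", "lat": 40.7410, "lon": -73.9897},
    {"id": "placement-b", "lat": 40.7480, "lon": -73.9857},
]
nearby = placements_near(db, 40.7411, -73.9898)   # ~14 m from placement-a
```

GPS alone is only good to a few metres at best, which is why the comment's point about bundling extra data from the placement camera (for visual re-localization) matters.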

PTAM has been around for a while. There are newer techniques for monocular SLAM, but I've not heard of any that were "consumer ready". Any ideas on whether they are using a new semi-dense approach or improving an existing paper?

Imperial College London (a UK university) is currently making large leaps forward in modelling a room or environment with just a single camera and little computing power. Their basic methodology was:

- Pick "unique" points currently in view

- Track how they move as the camera moves

- Combine this data with the accelerometer to get an accurate movement reading of the phone.

- You can get the depth of any point by comparing two images and knowing the change in user position.

Simple algorithm, but their results were astonishingly good. Snap don't need to model the entire room, they just need to work out where these points are to keep the image appearing still.
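
The last step above is classic triangulation, which can be sketched in a couple of lines (my illustration, under a simplified 2D setup): once tracking plus the accelerometer tell you how far the phone moved, a tracked point's position follows from intersecting the two viewing rays.

```python
import math

def triangulate(baseline, angle1, angle2):
    """Locate a point seen from two camera positions a known distance apart.

    Cameras sit at x=0 and x=baseline on the x-axis, both facing +y.
    angle1/angle2: bearing to the point measured from the +y axis
    (radians, positive toward +x). Returns the point's (x, y).
    """
    t1, t2 = math.tan(angle1), math.tan(angle2)
    # Ray from cam 1: x = y * t1.  Ray from cam 2: x = baseline + y * t2.
    y = baseline / (t1 - t2)
    return (y * t1, y)

# Phone moved 0.2 m sideways between frames; the bearings to one tracked
# point put it near (1.0, 2.0) in front of the first camera position.
x, y = triangulate(0.2, math.atan(0.5), math.atan(0.4))
```

The smaller the baseline relative to the depth, the noisier this gets, which is one reason a phone waved around at arm's length works better for nearby ground than for distant scenery.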

There are several ways to do depth mapping with just a single camera. Structure from Motion and Multiple View Geometry are a couple. Of course, more sensors will usually provide better data. I imagine there are plenty of situations where this app will behave poorly (e.g. low light, scenes with little texture); they just don't show that in the demo.

Besides analysing the pixels directly, maybe their algorithm takes the vibration/movement of the live video into account to detect depth.

Close one eye. You can still estimate depth, can't you?

Now I have to figure out how I do that. I have always assumed that I am able to do so because I am familiar with the dimensions of the things typically in my field of view.

Familiarity is one factor, but there are others.



But I guess I find it extremely surprising that Google would go through years of working on Tango - software, hardware, developer community, etc, etc - if the tech was available on existing phones.


If you've got typical levels of binocular depth perception:

Close one eye. Slowly poke something 2-3 feet away. You can do it, but you might be off an inch or two with your distance estimation and be surprised when you actually make contact. Open both eyes and do something similar with another object. The added depth information should mean that you can tell where something is to a greater precision.

Same thing with Tango. In both cases, you can extract some information about the environment from a 2D image stream. Add in accelerometer, gyroscope, and depth-sensing hardware, and you can correct some of the edge cases while increasing precision even for the things that you could do with standard hardware.
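
One standard way to blend those sensors is a complementary filter (a generic technique, not anything Tango- or Snap-specific): the gyro gives smooth but drifting angle rates, while the accelerometer gives a noisy but drift-free tilt reference, and mixing the two cheaply corrects each sensor's weakness.

```python
def complementary_filter(angle, gyro_rate, accel_angle, dt, alpha=0.98):
    """One fusion step: integrate the gyro rate, then nudge the result
    toward the accelerometer's absolute tilt estimate."""
    return alpha * (angle + gyro_rate * dt) + (1 - alpha) * accel_angle

# The gyro reads a constant spurious bias of 0.5 deg/s while the phone is
# actually still and level (accelerometer says 0 deg). Pure integration
# would drift by 5 degrees over these 10 seconds; the fused estimate
# instead settles at a small bounded offset.
angle = 0.0
for _ in range(1000):
    angle = complementary_filter(angle, gyro_rate=0.5, accel_angle=0.0, dt=0.01)
```

With alpha = 0.98 the steady-state error from that bias is bias * dt * alpha / (1 - alpha), about a quarter of a degree here, instead of growing without bound.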

Why Google keeps pushing Tango is an excellent question - there are some good people in the group, but in general the performance and achievements have been abysmal. Considering that TensorFlow and Tango come from the same organization would produce serious cognitive dissonance if teh googz scale weren't taken into account. Yes, I worked with Tango for a little over a year, and then gave it to my son as a game platform. The future seems to be heading away from structured light toward pure visual mechanisms, possibly with LIDAR or ToF for a little analytical support.

I was looking into mobile SLAM use cases a while back, and my theory on why Google is pushing Tango instead of simply optimizing for current hardware is that they have the capital to develop for an infinitely long time horizon. It's a low-risk assumption to say eventually all phones will have Tango-class sensors, and this way Google can be the leader when the time comes.

Semi-dense visual odometry - that's what you need to search for.
