And before you do mundane data science on your structured data, you should figure out if there is a better way to get cleaner raw data, more data, as well as more accurate data.

For example, I predict stereo vision algorithms will die out soon, including deep-learning-assisted stereo vision. It's useful for now but not something to build a business around. Better time-of-flight depth cameras will be here soon enough. It's just basic physics. I worked on one for my PhD research. You can get pretty clean depth data with some basic statistics and no AI algorithm wizardry. We're just waiting for someone to take it to a fab, build driver electronics, and commercialize it.




Stereo vision is obviously highly effective in biology as it has independently evolved a great many times. Time-of-flight may be poised for a renaissance, but it scales badly and is active, not passive. Stereo vision, and its big brother light fields, are far more general and are certainly not going to "die out".


> Stereo vision is obviously highly effective in biology as it has independently evolved a great many times

This argument sounds like a second cousin of the "chemicals are bad, but if it is natural it is good" argument. Just because it evolved in nature doesn't mean it's optimal, or it's the best system under massively different constraints. And who knows what evolution would have thrown up after a few more billion years.


I think the point he was making is that it's effective under different environmental challenges and constraints, not that it's optimal. So it might still be useful to build into a robot that's supposed to explore an environment (think space and planets, for example). It may get relegated to academia if we find more optimal solutions for our robots doing our boring chores, though.


Human vision is active rather than passive as well. We really wouldn't be able to get around the world if we relied on passive inference from sensory stimuli to model our environment.


Yet humans produce terrible depth data.


Terrible for what purpose? Humans seem pretty good at throwing things to each other and catching them. I'm very bad at coming up with a good numeric estimate of linear size. As a fencer, I could never tell you how many inches between me and my opponent, how long his or my arms are, how tall he is, etc. I could definitely tell you which parts of our bodies are within reach of each other's arm extension, fleche, lunge, etc.


Terrible for the purpose of proving a general-purpose stereo machine vision system is practical.

The distance at which a stereo vision system can capture precise depths depends on the distance between the eyes and on the eyes' angular resolution. Human depth perception works well for things within about 10m, but when you get out to 20-40m humans get a lot less info from stereo vision.
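For intuition, here's a back-of-the-envelope sketch of why precision falls off: with the standard depth-from-disparity relation, depth error grows roughly with the square of distance. The baseline, focal length and disparity resolution below are made-up numbers loosely modelled on human eyes, not measurements of any real system.

    # Pinhole stereo: depth = focal_length * baseline / disparity.
    # All numbers are illustrative assumptions.
    baseline_m = 0.065        # assumed inter-eye distance
    focal_px = 1000.0         # assumed focal length in pixels
    disparity_step_px = 0.5   # assumed smallest resolvable disparity change

    def depth_uncertainty(depth_m):
        # differentiating depth = f*B/d gives error ~ depth^2 * delta_d / (f*B)
        return depth_m ** 2 * disparity_step_px / (focal_px * baseline_m)

    for d in (5, 10, 20, 40):
        print(d, "m ->", round(depth_uncertainty(d), 1), "m of depth uncertainty")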

When you get to that distance, humans seem to have a whole load of different tricks - shadows, rate of size change, recognising things of known size, perspective and so on. You can see a car and know how far it is even without stereo vision, because you know how big cars are, and how big lanes and road markings are. You can even see two red lights in the distance at night and work out whether they're the two corners of a car, or two motorbikes side-by-side and closer to you.

On the other hand, your basic general-purpose stereo machine vision system doesn't try to understand what it's looking at - you just identify 'landmarks' that can be matched in both images (high contrast features, corners etc) and measure the difference in angle from the two cameras. This is relatively simple and easy to understand!

For tasks that humans can do that involve depth perception of things more than ~40m away - flying a plane, for example, where most things are more than 40m away if you're doing it right! - nice simple stereo vision can't get the job done, because humans are actually using their other tricks.

Of course, despite this limitation stereo vision comes up a lot in nature - it's still a beneficial adaption, because most things in nature that will kill you do so from less than 10m away :)


> Of course, despite this limitation stereo vision comes up a lot in nature - it's still a beneficial adaption, because most things in nature that will kill you do so from less than 10m away :)

It's actually pretty rare for non-predatory animals to have good stereo vision. Most of them are optimized for a wide field of view instead, evolving eyes placed on either side of their head. Think rabbits, parrots, bison, trout, iguanas, etc.

https://en.wikipedia.org/wiki/Binocular_vision

https://www.quora.com/Why-have-most-animals-evolved-to-see-o...


It depends what you mean by "practical", though. If you mean "practical for use as a dense 3D reconstruction technique" then sure, it's pretty bad. If you mean "usable under an incredible range of lighting conditions and situations" then I'd say it's practical.

Edit: IMO binocular vision is probably more to do with redundancy than depth perception. If you damage or lose an eye, you can still operate at near full capacity. Losing vision 'in the wild' is a death sentence.


Humans produce good enough depth data.


What does it mean? I'd say the ability to perceive depth is pretty useful.


Actually brains eventually learn how to discern distance with one eye too, iirc.


Right, but in this case there is a trade-off: if you don't have the better data now, how long before you will? Should you start now with what you have (and gain a time advantage, at a cost), or wait for the better data (read: sensor/process/technology/people, etc.)?

My experience is that waiting for cleaner data is often like waiting for Godot and will often be a project killer (sometimes justifiably). This is a key issue at the moment in advanced ML: large, clean training sets ideal for supervised training are elusive, and the companies making real-world advances are pretty much all using available data (and semi-supervised techniques) rather than expensive, purpose-built training sets.


>Better time-of-flight depth cameras will be here soon enough. It's just basic physics. I worked on one for my PhD research. You can get pretty clean depth data with some basic statistics and no AI algorithm wizardry. We're just waiting for someone to take it to a fab, build driver electronics, and commercialize it.

You should talk to us at Leaflabs. Commercializing research-level tech in embedded electronics is what we do.


are you guys looking to hire anyone with compressed sensing/neuro-imaging experience? possibly someone with the username /mathperson lol


Neuro-imaging? Very probably. Send a resume and cover letter.


> For example, I predict stereo vision algorithms will die out soon, including deep-learning-assisted stereo vision

Maybe for regular "cameras", but parallax still gives the most accurate results for 3D reconstruction from satellite images, where it is a very lively area of research. The resolution and surface coverage of radar satellites are far worse than what you can obtain by stereo-matching optical images taken by the powerful telescopes on satellites. I guess it also makes a lot of sense in other fields like microscopy. Not all imaging happens indoors on commodity cameras!


However, as ma2rten eloquently describes, for the purpose of AI research it might be beneficial to bite the bullet and tackle the harder AI that's required when you only have access to stereo vision.

A lot of problems are not tackled on a fundamental level. Occlusion, context, proprioception, prediction, timing, attention, saliency, etc.

A simple rat has more intelligence than whatever is behind a dashcam, security cam or webcam.


You're ignoring form factor for optimized technology.

Yes, a quality TOF system would be great. However, good luck convincing consumers to adopt hardware with lidar on it. The Tango is having enough trouble on its own, and it does pretty well for a consumer system with IR.

Besides that, you can't do FTDT with laser systems, AFAIK. You need something that can capture unseen places, such as ultrasonics/HF - which I guess you could argue falls under TOF, but I haven't seen that work done.

In the end my money (literally!) is on the opposite of your approach, namely building better RGB systems, because there are already a trillion cameras deployed that we can extract data from.


Already happening. The activity in the TOF field is crazy.

The first Project Tango smartphone will be out soon.


First consumer Tango phone is already out!

http://shop.lenovo.com/us/en/smartphones/phab-series/phab2-p...

I think it first went on sale Nov. 1 but just started shipping initial batches recently.


I think the hidden tension in your triplet is a land mine. That is, getting more data is at odds with getting cleaner data for most folks. Adding accuracy? I hope you remember the difference between precision and accuracy. :)

(None of which takes from your point.)


Any links to your papers? How much better have things gotten over e.g. Kinect v2, which was pretty bulky and power hungry, and had active cooling? I'd imagine it has gotten a lot better.


ToF depth being single-POV, with depth inferred from successive frames?


A light source built into the camera is modulated in time. The camera uses this to infer depth. When the modulation is synced up right, the camera can see the light wave "travel" as it illuminates close objects first and then farther ones over successive frames. The camera isn't actually fast enough to capture the light wave traveling, but the timing between the shutter and the light source is shifted by nanoseconds each frame to accomplish this.

They can have problems operating outside because it is hard to make a light source brighter than the sun.
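If it helps, here's a rough sketch of the closely related continuous-wave version of this idea, where depth comes from the phase shift of the modulated light; the four sample values and the 20 MHz modulation frequency are just illustrative assumptions, not any particular camera's numbers.

    import math

    C = 299_792_458.0   # speed of light, m/s
    F_MOD = 20e6        # assumed modulation frequency

    def cw_tof_depth(s0, s90, s180, s270):
        # s0..s270: one pixel's brightness sampled at four phase offsets
        # of the modulated light source (the usual "four bucket" scheme)
        phase = math.atan2(s270 - s90, s0 - s180) % (2 * math.pi)
        return C * phase / (4 * math.pi * F_MOD)  # halve the round trip

    # a quarter-cycle phase shift at 20 MHz works out to about 1.9 m
    print(cw_tof_depth(5, 0, 5, 10))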


That's an older form of ranging. It's easier to do than pulse ranging, but you have to outshine ambient light at the color being used full time. The Swiss Ranger [1] is a good example of such a system. It's indoor only and short range. With a pulse system, you only have to outshine ambient light for a nanosecond, which is quite possible in sunlight.

I've been expecting good, cheap non-scanning laser distance imagers for a decade. In 2003, I went down to Advanced Scientific Concepts in Santa Barbara and saw the first prototype, as a collection of parts on an optical bench. Today ASC makes good units [2], but they cost about $100K. DoD and Space-X buy them. There's one on the Dragon spacecraft, for docking. That technology isn't inherently expensive, but requires custom semiconductors produced with non-standard processes such as InGaAs. Those cost too much in small volumes. There's been progress in coming up with designs that can be made in standard CMOS fabs.[3] When that hits production, laser rangefinders will cost like CMOS cameras.

[1] http://www.adept.net.au/cameras/Mesa/SR4000.shtml

[2] http://www.advancedscientificconcepts.com/products/overview....

[3] https://books.google.com/books?id=Op6NCwAAQBAJ&lpg=PA64


Hey, 3D vision system amateur here, but very interested to learn more!

Can anybody point me to some literature or reference materials about attempts to combine the inputs from multiple techniques simultaneously?

E.g. a device with conventional stereo cameras plus infrared cameras & emitters, which compares the resulting models from each input source/technique and actively re-adjusts the final depth estimate?

Is "sensor fusion" the right jargon to use in this context?

Or, even crazier, a control system which actively jitters the camera's pose to gain more information for points in the depth map with lower confidence scores / conflicting estimates?

But maybe such a setup is overly complex and yields minimal gains in mixed indoor & outdoor scenarios?


We have a setup that combines ToF, structured light and multiple colour cameras to reconstruct hands from the elbow down. Short version: it's a massive pain in the ass. In fact the setup really only works because we have a preconceived motion model (particular hand gestures) and have carefully arranged the scene to avoid interference. I'm unaware of a general solution where you can just throw more cameras in and get better scene data.

One neat thing you might want to look at, though: if all you have is structured light (i.e. Kinect v1), you can simply attach a vibrating motor to each emitter/receiver to avoid a lot of interference, per [0].

[0] https://wwwx.cs.unc.edu/~maimone/media/kinect_VR_2012.pdf


No, that would also qualify as "AI algorithm wizardry" as dheera puts it. Instead it's just a hardware sensor that measures the time at which light arrives at each pixel. Knowing the speed of light you can calculate the depth of each pixel.
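As a sketch of that calculation (illustrative numbers only, not tied to any particular sensor):

    C = 299_792_458.0  # speed of light, m/s

    def depth_from_arrival(round_trip_seconds):
        # the light travels out to the object and back, so halve the path
        return C * round_trip_seconds / 2

    print(depth_from_arrival(20e-9))  # a 20 ns round trip is roughly 3 m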


Oh I see, a physical "Z-Buffer".



