
Have We Forgotten about Geometry in Computer Vision? - AndrewKemendo
http://alexgkendall.com/computer_vision/have_we_forgotten_about_geometry_in_computer_vision/
======
arketyp
My impression is that the computer vision community has been suffering some
serious cognitive dissonance lately: they spent all these years mapping
problems to feature spaces of manageable dimensionality, backed by theory
saying that proper assumptions must be made to reduce the search space; and
then along come these deep nets, hardly tailored to the problems at all, and
they outperform algorithms with decades-old histories of fine-tuning.

Despite this, I don't think anyone disputes the potential of a good set of
assumptions. Instead, I think what deep learning has taught us is that we
should reconsider what these assumptions should be. While geometry might well
be the first kind of language a toddler learns to think in, this should
probably not be confused with the rigorous geometry of Euclid. Quite possibly
we have some spatial relationships, such as affine transformations, hard-
coded in our brains at birth, but this does not mean, for instance, that one
is therefore necessarily ever able to draw a house in correct perspective.

~~~
trevyn
Excellent point -- our brains and deep networks seem to learn a kind of "fuzzy
geometry" that is in some ways more robust than numerically correct geometry,
and this also allows us to spend more cycles on higher-level abstractions.

~~~
snovv_crash
My problem with our brains and NNs is exactly this "fuzzy geometry" you speak
of. Using precise geometry, if we have a pair of cameras equivalent in layout
to eyes, we can reconstruct a world with mm-level accuracy. That is something
that cannot be done with neural networks (artificial or biological); instead
you get a sort of semantic tagging of "green sofa here, table here, bed
there".

But if afterwards you want to use this model to know if the sofa will fit into
the alcove (or the car into the garage), the NN systems will be wildly
unreliable.
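The precise-geometry half of this is easy to make concrete. For an ideal
rectified stereo pair, depth follows from similar triangles: Z = f * B / d. A
minimal sketch (the focal length and baseline figures below are illustrative
assumptions, not measured values):

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth from an ideal rectified stereo pair: Z = f * B / d.

    focal_px     -- focal length in pixels
    baseline_m   -- distance between the two cameras in metres
    disparity_px -- horizontal pixel offset of the same point in the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

# Roughly eye-like geometry (illustrative numbers): ~6.4 cm baseline.
z = stereo_depth(focal_px=700.0, baseline_m=0.064, disparity_px=10.0)
print(round(z, 3))  # 4.48 (metres)
```

It also shows why the precision degrades: for a fixed disparity error, the
depth error grows roughly with Z squared, which is why stereo rigs are
mm-accurate only at close range.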

~~~
uf
It seems to me that humans can predict 3D spaces pretty well.

I've done a lot of construction work. After some practice you can estimate
quite precisely what fits where, and even how long something is. Our brains
seem to be able to "compute" such things despite all the difficulties of
constructing a coherent image of our surroundings in the first place.

Or think of moving a huge couch through the narrow stairway of your house:
you can predict how you need to turn it so it fits. Or think of truck drivers
who are able to maneuver their large vehicles to within cm accuracy, even
when they can't see parts of them (blind spots) and need to rely on their
mental model.

Did I misunderstand your comment? Could you elaborate what I am missing?

~~~
Sammi
Anecdotal evidence: I used to work at a construction materials depot when I
was younger. I remember how lost I was when I initially started; I couldn't
recognize any of the material sizes by looking at them, while all the old
timers just knew what size they were at a glance. Then, after only a couple
of months, suddenly I could too. Years later I'm still pretty good at
guessing lengths, while people around me seem to always be way off. I can
hold out my hands and make a foot, three feet, 1 cm, 0.5 cm, or 10 cm
between them pretty reliably, while other people seem to be way off.

------
andreyk
I have a feeling people are going to reply without reading this through and
assume the author poses a Deep Learning vs. Classic CV sort of argument, in a
deep-learning-is-overrated sort of way, whereas it seems to me he is merely
saying Deep Learning should be informed by Classic CV.

"I think we’re running out of low-hanging fruit, or problems we can solve with
a simple high-level deep learning API. Specifically, I think many of the next
advances in computer vision with deep learning will come from insights to
geometry."

And it's true. A lot of the low-hanging fruit has been picked, and stuff like
SLAM is not about to be done wholly by deep learning (probably). There are a
lot of problems that require more insight and analysis than 'throw a deep
network at a large dataset'. As he concludes:

"learning complicated representations with deep learning is easier and more
effective if the architecture can be structured to leverage the geometric
properties of the problem."

And thinking about what the architecture should be is basically the hard bit
in deep learning.

~~~
amelius
The sad parts, for me, are:

1. Thinking about a deep learning architecture is really a different kind of
research. It requires much guesswork, a lot of trial and error, and lots of
waiting for training cycles to complete. For me this type of work is much
less interesting than engineering a solution, even if the results are better.

2. Deep learning is quickly becoming a commodity. One can download libraries
to do almost anything with neural networks. I fear that deep learning will
become the new "web development" of the 90s. Everybody can do it (this is not
really true just yet, but I suspect we are rapidly approaching the point
where it is true).

~~~
dagw
2) would be awesome. If, for example, our hydrologists could easily take
their extensive domain knowledge and experiment with a layer of deep learning
on top of it, without having to go through the "high priests", that can only
be a good thing in my mind.

I might be very much an oddity among programmers, but I genuinely believe that
the more 'programming' that can be done by non-programmers, who actually
understand the domain they are trying to model, the better. If nothing else it
would free up more time for programmers to work on actual hard problems where
they have more to contribute.

~~~
amelius
My main concern is whether it deserves my time.

Investing in deep learning now costs considerable time and effort, whereas in
the near future when deep learning is a commodity, that investment gives no
real advantage.

~~~
gajjanag
Back when I took signal processing, I remember our professor making a very
insightful comment. I sadly don't remember his exact words, but it went
something like this.

"20 years back, everyone wanted CDs for music; today it is flash drives; the
upcoming thing is streaming music from the internet. Technologies come and
go. What I can say is that the Fourier transform will still be around 200
years from now as a valuable way of understanding the world."

I am of course in no real position to assert that this is a good idea (as I
am still in graduate school), but this is the reason why I gravitate towards
math and physics classes. Conditional on the assumption that we have a
meaningful civilization decades from now (and that none of the dangers
outlined, for instance, in the excellent book "Here Be Dragons: Science,
Technology and the Future of Humanity" by Olle Haggstrom cause its utter
collapse), I think it is a safe bet that "fundamental" topics like Maxwell's
equations, Fourier transforms, etc. will still be around, alive, and
fruitfully studied and applied across a variety of domains. Part of the
reason is simply historical: the Fourier transform has had over 200 years to
prove its worth time and again across a variety of disciplines.

Maybe deep learning is here to stay, maybe it isn't. I don't know, and I am
not willing to make bets either way. I am, however, willing to bet on a
time-proven framework of understanding, such as the examples above.

~~~
kefka
Indeed.

ML seems, in great part, to be "Magic Function Machines". People make
features that can be tracked, and more features, and more features. Then they
splatter them against the wall and hope for the best.

The best are things like "tell what type of bird it is"... but in the end, we
have no clue HOW it got those results. Just that Mystery Magic Function +
data = results.

I can certainly see a middle path of being able to train a learning function,
and then interrogating said function for the underlying things that make it
true. Once we understand the primitives, then we can make ideal functions that
do X, and do it well. Because right now, learning functions do it rather well
but with exceptional overhead. And they're hard to tune without recrunching
the whole dataset.

------
cmontella
I think there's still a lot of room for exploiting geometry. In particular,
it leads to very simple models that require little training and, most
importantly, are completely transparent in how they work. I did some work on
a robotic wheelchair that exploited the prevalence of pole-like objects in
the environment (trees, parking meters, street lamps) to localize. The model
just looked for cylindrical objects and matched them against a map that we
generated a priori.

Deep models are best suited to cases where you don't know what features are
important in your model. For us, it was straightforward with the radius and
orientation of the pole object, so a deep model would probably have been the
wrong approach.
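The map-matching step described above can be sketched in a few lines. To be
clear, this is not the paper's actual algorithm, just a hypothetical greedy
nearest-neighbour association between detected pole positions (assumed
already transformed into the map frame) and the a priori map:

```python
import numpy as np

def associate_landmarks(observed, mapped, max_dist=0.5):
    """Greedily match observed pole positions (N x 2, in map frame)
    to known map poles (M x 2). Returns (observed_idx, map_idx) pairs
    for matches closer than max_dist metres; each map pole is used once."""
    pairs = []
    used = set()
    for i, obs in enumerate(np.asarray(observed)):
        d = np.linalg.norm(np.asarray(mapped) - obs, axis=1)
        j = int(np.argmin(d))
        if d[j] < max_dist and j not in used:
            pairs.append((i, j))
            used.add(j)
    return pairs

# Two detections near two of three mapped poles:
mapped = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
observed = np.array([[0.1, 0.0], [4.9, 0.1]])
print(associate_landmarks(observed, mapped))  # [(0, 0), (1, 1)]
```

The appeal is exactly the transparency mentioned above: when a match is
wrong, you can point at the specific distance that caused it.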

~~~
trevyn
This is a good observation. My personal opinion is that using many smaller
networks as components of a larger, manually engineered system could be a
fruitful approach for these complex problems.

End-to-end deep learning is the holy grail -- and may ultimately happen -- but
I don't think computation in silicon semiconductors is going to cut it, and we
don't yet know what the next computational substrate will be.

Did you use a smallish convolutional model for extracting pole position,
orientation, and radius from image data, or another approach?

~~~
cmontella
If you're interested, here is a link to the initial paper:
[http://vader.cse.lehigh.edu/publications/fsr12_mon_per.pdf](http://vader.cse.lehigh.edu/publications/fsr12_mon_per.pdf)

------
trevyn
_However, these models are largely big black-boxes. There are a lot of things
we don’t understand about them._

I'm getting kind of sick of this "deep learning is a black box" trope, because
it's really not true anymore. Yes, it's a black box if you just use "some data
and 20 lines of code using a basic deep learning API" as mentioned in the
article. But if you spend some time understanding the architecture of networks
and read the latest literature, it's pretty clear how they function, how they
encode data, and how and why they learn what they do.

Because neural networks are so dramatically more effective than they used to
be, in so many domains, it's true that we don't yet have a good understanding
of optimal ways to build, train, and optimize networks. But that is _exactly_
why there is so much excitement -- because there is a lot to discover, and a
lot of progress that can be made quickly.

I agree with the author that fundamental physical and geometric approaches are
still relevant and useful, and have been somewhat ignored recently, but the
fact remains: If you and I as individuals want to maximize our personal
impact, and capture as much value as we can while working on interesting
problems, deep learning is an _excellent_ field in which to do that.

It's kind of like we just discovered a nice vein of gold, and the silver
miners are like "yeah, but we don't know much about that vein of gold and how
long it will last." Which is true, but in the meantime, there's a lot of easy
money to be made, and ultimately, both types of resources are important and
synergistic.

~~~
YeGoblynQueenne
>> I'm getting kind of sick of this "deep learning is a black box" trope,
because it's really not true anymore.

I know, right? Like, take this model I trained this morning. Here's the
parameters it learned:

[0.230948, 0.00000000014134, 0.1039402934, 0.000023001323, 0.00000000000005]

I mean, what's "black-box" about that, really? You can instantly:

(a) See exactly what the model is a representation of.

(b) Figure out what data was used to train it.

(c) Understand the connection between the training data and the learned model.

It's not like the model has reduced a bunch of unfathomably complex numbers
to another, equally unfathomable bunch. You can tell exactly what it's
doing - and, with some visualisation, it gets even better.

Because then it's a _curve_. Everyone _groks_ curves, right?

Right, you guys?

/s obviously.

~~~
josefx
Not to mention that the bug which causes it to detect a cat as a panda is
instantly visible. You really should change that 0.00000000014134 to a
0.000000000141339.

------
Eerie
Check this out

"Learning 3D Models from a Single Still Image"
[https://www.youtube.com/watch?v=bWbEsDbfayc](https://www.youtube.com/watch?v=bWbEsDbfayc)

"3-D Reconstruction from a Single Still Image"
[http://cs.stanford.edu/people/asaxena/reconstruction3d/](http://cs.stanford.edu/people/asaxena/reconstruction3d/)

"Make3D: Convert your still image into 3D model"
[http://make3d.cs.cornell.edu/](http://make3d.cs.cornell.edu/)

------
Stanleyc23
Not that forgotten, right? Isn't SLAM almost pure geometry? I don't think
stuff like LSD-SLAM or any structure from motion stuff has deep learning built
in.

~~~
AndrewKemendo
Correct about the current state of SLAM. However, the next wave of computer-
vision-focused tracking and mapping is heavily based on ML, though there
isn't a single combined system that gives robust SLAM/PTAM-level results yet.

~~~
logicallee
why would anyone downvote the parent comment (by andrewkemendo) given that
they seem to speak with the authority of an expert in the field. thanks for
your comment.

------
MichailP
This seems like a perfect thread to ask a question that has been itching me
for some time. Is anyone aware of deep learning approaches, or some
combination with geometry algorithms, for mesh generation? Something like
quad, hex, or tetrahedral meshing aided by deep learning?

~~~
froindt
Interested in where you're going with this. What are the advantages you are
seeking out? I've done some computational geometry with slice data from
meshes, but have been operating off of STL files, not a feature-based format.

~~~
MichailP
Just interested in simple mesh algorithms. I feel as if machine learning
could help, but almost no one is going in this direction. Higher-order quad
and hex meshing is not a solved problem, and progress there could help a
bunch of numerical physics algorithms.

------
state_less
I'd be keen on doing, or learning about, a mapping from a spatial tree to a
2D raster, with the inverse (dual) mapped back to a spatial tree via a deep
learning model, though I'm not sure how you'd represent the shapes and
transform matrices. Maybe some sort of matrix stream, like char2vec?

A nice property of such an experiment, you could generate your training data
via permutations of the spatial trees.

You may have to accept multiple valid answers, since one could correctly say,
the ball is to the left of the car, or the car is to the right of the ball.

------
antman
What is the cost/benefit in this? Is it appropriate to divert resources to
older techniques because we haven't yet figured out how to do it the new way?
Yes, we can score a few points, but that is an advancement only in an
academic setting.

We used to need PhDs to do simple computer vision applications, and object
recognition in a meaningful manner was a faraway dream. Now a child can make
an application on their Raspberry Pi, because the new way of doing things is
generalizable. You don't need to spend huge amounts of time redoing things
just to get to the basics and reach the state of the art as it was 20 years
ago.

Should we reintroduce the old ways for the quick wins, or should we divert
our research resources to trying to solve unsolved problems? GPUs will get
cheaper, and cloud GPUs will get cheaper.

So this is the state of things today. If somebody wants to advertise his
paper's submission to a conference, good for him, but it should not be
presented as an important advancement that should become the new way of
doing things. Because it isn't.

When we decide that it's feasible to send robots to the planets, or even
build robots on them, building chessboards to do rectification of the
cameras should not be one of their tasks.

Disclaimer: I had a horse in this race too. I was on the losing side of the
deep learning argument, I was wrong, and I got over it.

~~~
josefx
> What is the cost/benefit in this? Is it appropriate to divert resources to
> older techniques because we haven't yet figured out how to do it the new way?

I too would like to know why cars have wheels when helicopters have shown
that rotors work well. We should have abandoned wheels ages ago, since they
are clearly old and therefore inferior.

> Now a child can make an application on their Raspberry Pi, because the new
> way of doing things is generalizable.

And that app will have issues deciding if it sees a couch or a leopard. Or do
you expect a child to correctly train its neural net?

> GPUs will get cheaper, and cloud GPUs will get cheaper.

Why again do we have multiple algorithms for sort when a few nested for loops
would do? Maybe smart algorithms scale better than hardware ever could.

> building chessboards to do rectification of the cameras should not be one of
> their tasks.

And yet we have Mars probes with a built-in color table. Why do you hate
geometry?

~~~
antman
>I too would like to know why cars have wheels when helicopters have shown
that rotors work well. We should have abandoned wheels ages ago since they are
clearly old and with that inferior.

I said cost/benefit and your example is about an expensive way to do what a
car can do. You get some points for speed but lose big points on affordable
transportation. The world has settled on the car.

> And that app will have issues deciding if it sees a couch or a leopard. Or
> do you expect a child to correctly train its neural net?

It will train a neural net sooner than it will learn 3d computer vision.

> And yet we have Mars probes with a built-in color table. Why do you hate
> geometry?

I said that I was a skeptic about deep learning and in favor of the old
ways, so why do you claim that I hate geometry? Its inadequacy for the
current and possibly future state of computer vision technology is not an
expression of my feelings. It's a mere observation.

~~~
josefx
> I said cost/benefit and your example is about an expensive way to do what a
> car can do.

Sometimes a neural net running on a server rack full of GPUs is also
overkill, cost/benefit included. You get bonus points for buzzwords though.

> It will train a neural net sooner than it will learn 3d computer vision.

And you would trust it to run as required? I admire your courage. Unless the
person selecting the training data knew what they were doing, I wouldn't. I
certainly wouldn't trust a child to get it right without being trained
itself.

> I said that I was a skeptic about deep learning and in favor of the old
> ways, so why do you claim that I hate it?

You argued against "building chessboards" for basic calibration as if that
were in any way hard, or even necessary. Calibration of sensors on Mars is
already a solved problem. I don't understand why you would think otherwise.

------
krosaen
This reminded me of a cool paper I came across recently, Spatial Transformer
Networks [1], a good example of how knowledge of geometry helps frame the
problem more effectively, allowing the network to learn how to, e.g., rotate
objects into a canonical orientation before identifying them.

[1] [https://arxiv.org/abs/1506.02025](https://arxiv.org/abs/1506.02025)
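For anyone curious, the geometric core of that paper (a grid generator plus
a bilinear sampler) is small enough to sketch in plain numpy. This is only a
forward-pass illustration, without the localisation network that predicts
`theta` in the actual architecture:

```python
import numpy as np

def affine_grid(theta, H, W):
    """Sampling grid in normalized [-1, 1] coords for a 2x3 affine matrix theta."""
    ys, xs = np.meshgrid(np.linspace(-1, 1, H), np.linspace(-1, 1, W), indexing="ij")
    coords = np.stack([xs.ravel(), ys.ravel(), np.ones(H * W)])  # (3, H*W)
    grid = theta @ coords                                        # (2, H*W)
    return grid.reshape(2, H, W)

def bilinear_sample(img, grid):
    """Sample img (H, W) at grid locations with bilinear interpolation."""
    H, W = img.shape
    # Map normalized coords back to pixel coords.
    x = (grid[0] + 1) * (W - 1) / 2
    y = (grid[1] + 1) * (H - 1) / 2
    x0 = np.clip(np.floor(x).astype(int), 0, W - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, H - 2)
    dx, dy = x - x0, y - y0
    return (img[y0, x0] * (1 - dx) * (1 - dy)
            + img[y0, x0 + 1] * dx * (1 - dy)
            + img[y0 + 1, x0] * (1 - dx) * dy
            + img[y0 + 1, x0 + 1] * dx * dy)

# Identity transform reproduces the image exactly:
img = np.arange(16, dtype=float).reshape(4, 4)
theta_id = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
out = bilinear_sample(img, affine_grid(theta_id, 4, 4))
assert np.allclose(out, img)
```

Everything here is differentiable in `theta`, which is the whole trick: the
geometry lives in the architecture, and gradients flow through it.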

------
santaclaus
> In computer vision, geometry describes the structure and shape of the world.
> Specifically, it concerns measures such as depth, volume, shape, pose,
> disparity, motion or optical flow.

Isn't optimization geometry? DL is just shifting the requisite geometric
insights to a different level of abstraction. Take that knowledge of geometry,
revolutionize non-convex optimization, and let all fields that build on the
base abstraction benefit.

~~~
deepnotderp
DL's optimization is pretty much simple SGD. But I would suggest that the
type of optimization in deep learning is actually a lot more like geometry
than calculus. This was more of a hunch until recently, but results such as
the successful application of tropical geometry, and of less exotic
Riemannian geometry, to the analysis of DL loss surfaces are showing some
evidence for that hypothesis.

------
richard___
I have a hunch that an understanding of classical control will also lead to
much better results than the deep RL stuff people are doing for ML-based
control.

~~~
throwaway87423
Classical control often restricts itself to linear systems, and non-linear
control is next to useless for non-drift-free systems (if you can compute
the Lie algebra in the first place). SOS relaxations won't scale, and all
bets are off when dealing with contacts.

~~~
tnecniv
So what's your suggestion when dealing with any non-trivial system? All those
issues you mention are true but also active research areas.

------
taeric
This is making the odd assumption that our previous abstractions should play
well with our current abstractions of pixels.

My feeling is that the "perfect" abstraction of reality to geometry is
actually a very high-order function that we don't fully understand. An easy
example: parallel lines don't always appear parallel, even though that is a
common geometry affordance.

So, that our current toolset does not play well with our previous one should
not be that surprising. Would it be nice if they did? Yes. But it would also
be nice if Newtonian physics played well with quantum physics.

Why? Tough to answer. Not impossible, but above my ability to understand.

~~~
electronvolt
I think you're kind of missing the author's point.

> My feeling is that the "perfect" abstraction of reality to geometry is
> actually a very high order function that we don't fully understand.

You don't need a perfect abstraction of reality, though. All you need to do
is get _close enough_ to catch up with humans or outperform them, and you've
solved most of the hard problems in computer vision. Fortunately, at human
scales (the only ones you need to care about for computer vision), reality
behaves closely enough to a pure Euclidean space that you're fine. :)

The author's primary argument seems slightly more nuanced, too, than just
"It'd be nice if we could use old techniques".

Basically, they're claiming that if you can build geometric understanding
_into_ the ML model, you will get significantly better results than just
naively plugging and chugging away with raw data. That's an empirical claim
that can be validated by researchers: either it will give significantly
improved performance on well-defined problems (stereo vision, etc.) or it
won't. Vision is one of the research areas that has developed pretty good
benchmarks over the years. :)

~~~
taeric
But that is exactly my point. Geometric models are the old tools we had. The
new tools are ML Models. It would be nice if they both worked together. But
there is nothing to say that they should. Nor is it obvious that it would be
beneficial.

Your point that this is testable, though, is important. I fully agree with
that and was not intending to dismiss the idea. Just because I am not as
confident as the author, does not mean that I am right. :) (I'd accept that I
am likely not right.)

------
phkahler
>> For example, one of my favourite papers last year showed how to use
geometry to learn depth with unsupervised training.

I've been saying that LIDAR is a hack for some time. People don't need it and
neither should computers.

~~~
pasta
LIDAR is not a hack. It measures depth even when objects are not moving.

With vision we can estimate depth, but sometimes we need movement for that.

~~~
musesum
So, LIDAR is best solution for an autonomous vehicle when it is not moving?

~~~
pasta
I did not say that.

It might be the best solution when nothing is moving.

Our brains also cannot measure distance when there are no known reference
objects and everything is standing still.

~~~
musesum
That was a joke. Some people would laugh. Others down vote.

------
anjc
It isn't just a case of deep learning vs. stereo geometry; you can have many
cameras informing/improving depth analysis.

I've been out of the loop for a while, but I'd be mind-blown if people were
happy to rely on deep learning for depth reconstruction in domains
like self-driving vehicles, versus provably accurate, fault-tolerant systems
with multiple cameras + geometry.

And what are the hardware requirements typically like for DNN with high
resolution images for real time reconstruction? Are we well past the stage
where it's real time and accurate?

------
kordless
Who here visualizes, but does not have strong spatial skills?

------
osi
I was just reading [https://blog.openai.com/adversarial-example-
research/](https://blog.openai.com/adversarial-example-research/) and it
struck me that an understanding of geometry might help. For the examples and
scenarios cited, the geometries of the two items would be different (sometimes
markedly, sometimes less, like washer-vs-safe).

------
tmsldd
If you can afford the data and computing power, deep learning is the way to
go; you don't need to think much to get the thing done quickly. But in my
eyes, having a suitable mathematical description of a particular problem
always gives you useful insights into the problem, its variables, and their
relationships, and this is actually what most of us are looking for.

------
martijn_himself
Apologies for asking a computer-vision-related but off-topic question: does
anyone know where to start if I wanted to track moving objects (players) in
a (sports) video? Is there a 'general purpose' open-source library out there
anyone would recommend, or is this not 'trivial' to implement (I presume
none of this is trivial stuff)?

~~~
adrianN
Have you looked at OpenCV?

~~~
martijn_himself
I have only just started looking at the problem and (coincidentally) found
this thread on the front page of HN so I couldn't resist asking the question!

I will look into OpenCV. Is it the go-to library for these kinds of problems?

~~~
tnecniv
Pretty much, but it has some warts (the matrix library is notably bad imo).

How easy the task is depends on the nature of your game. If your players have
a red outline and that's the only red in the scene, no big deal.
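For that easy case (the players being the only red in the scene), the whole
"tracker" can be a colour threshold plus a centroid. A sketch in plain numpy
rather than OpenCV; the threshold values are arbitrary assumptions you would
tune per video:

```python
import numpy as np

def track_red_blob(frame_rgb, min_red=120, margin=50):
    """Return the (row, col) centroid of strongly-red pixels, or None.

    frame_rgb -- H x W x 3 uint8 image. A pixel counts as "red" when its
    R channel is bright and dominates both G and B by at least `margin`.
    """
    r = frame_rgb[..., 0].astype(int)
    g = frame_rgb[..., 1].astype(int)
    b = frame_rgb[..., 2].astype(int)
    mask = (r > min_red) & (r - g > margin) & (r - b > margin)
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    return float(ys.mean()), float(xs.mean())

# A synthetic frame with one red patch:
frame = np.zeros((10, 10, 3), dtype=np.uint8)
frame[3:5, 6:8] = [200, 10, 10]
print(track_red_blob(frame))  # (3.5, 6.5)
```

Per-frame centroids can then feed a simple smoother (e.g. a Kalman filter)
to get trajectories; for real footage with occlusions and varied lighting,
OpenCV's built-in trackers and background subtractors are the more robust
route.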

------
osdf
There are several nice references within the article. I think the gvnn, Neural
Network Library for Geometric Computer Vision,
([https://arxiv.org/abs/1607.07405](https://arxiv.org/abs/1607.07405)) should
also be mentioned.

------
real-hacker
I am wondering whether a properly structured deep net can capture these
geometry features with a few layers, if these features are relevant to the
problem.

Is this reflecting our human arrogance, that we simply know better than AI
neural networks?

~~~
faragon
Sure it can. The problem is validating that the net really knows geometry,
and is not just working on problems similar to the ones in the training set.
Unless the training can be validated, i.e. formally proven that after some
specific training the net "knows" geometry in a way that covers all cases,
you'll have to complement neural nets with classic computer vision, no
matter how good you think your model is.

------
ouid
"not many problems that aren't solved by deep learning"

The problems that deep learning doesn't solve aren't solved by anything else
either. There's still lots of computer vision that we cannot do.

------
latently
"However, these models are largely big black-boxes. There are a lot of things
we don’t understand about them."

This describes geometry as well as it describes deep learning.

~~~
aqsalose
Err, no. A model is a "black box" if the only things we have are the input
and the output, with little intuition about how the model produces the
output from the input. We have spent at least a couple of thousand years
studying geometry; we know geometry quite well.

Let me demonstrate with a stupidly simple geometric model.

Suppose (for the sake of the argument) that we have simple image input,
consisting only of simple solid geometrical structures. Say, solid 2-D
circles of one color on a background of a different color.

From high school geometry, we know everything there is to know about a
circle once we know its location on the x-y plane and its radius. We could
easily come up with a parametric model for fitting circles to the pixel data
of circular objects. (For example, we could minimize the 2-norm difference
between the data image and the image corresponding to a set of circles
[x_i, y_i, r_i], i = 1..n.) This kind of descriptive parametric model would
be particularly _easy_ to understand: the model structure consists of
nothing but representations of circles! (Of course, it wouldn't be a
particularly interesting model; it would apply only to simple images
consisting of circles.)
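To make the parametric idea concrete, here is a sketch of one standard
variant: instead of the pixel-wise 2-norm model described above, an
algebraic (Kasa) least-squares fit of a single circle to extracted boundary
points. Every parameter of the result is directly interpretable as a centre
coordinate or a radius:

```python
import numpy as np

def fit_circle(pts):
    """Kasa least-squares circle fit to an (N, 2) array of boundary points.

    Writes the circle as x^2 + y^2 + D*x + E*y + F = 0, solves for
    (D, E, F) as a linear least-squares problem, then recovers the
    centre (cx, cy) and radius r.
    """
    x, y = pts[:, 0], pts[:, 1]
    A = np.column_stack([x, y, np.ones_like(x)])
    b = -(x ** 2 + y ** 2)
    D, E, F = np.linalg.lstsq(A, b, rcond=None)[0]
    cx, cy = -D / 2.0, -E / 2.0
    r = np.sqrt(cx ** 2 + cy ** 2 - F)
    return cx, cy, r

# 50 exact samples of a circle centred at (2, -1) with radius 3:
t = np.linspace(0, 2 * np.pi, 50, endpoint=False)
pts = np.column_stack([2 + 3 * np.cos(t), -1 + 3 * np.sin(t)])
print(fit_circle(pts))  # close to (2.0, -1.0, 3.0)
```

And crucially, when the fit is wrong, you can inspect exactly which
residuals are to blame, which is the transparency being argued for here.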

Alternatively, we could work the mathematics out a bit more and come up with
something like the Hough transform to find circular shapes. Still nothing
mysterious about it:
[https://en.wikipedia.org/wiki/Circle_Hough_Transform](https://en.wikipedia.org/wiki/Circle_Hough_Transform)

However, my point is: we could also train a neural network to find circles
in the images of our example. It might be good at it. But understanding how
the circle representations are encoded in the final trained network
certainly would not be as easy as in our nice parametric model.

Some realistic applications of "simple" geometric models would be active
contours / snakes (
[https://en.wikipedia.org/wiki/Snake_(computer_vision)](https://en.wikipedia.org/wiki/Snake_\(computer_vision\))
) or (stretching the meaning of the word 'geometry') the various traditional
edge detection algorithms that have been around for a long time.

Or read the post, in which the author describes how they utilized a
projective geometry model to account for camera position and orientation, or
for stereoscopic images. We know how the geometry of stereoscopic vision
works: we don't need to waste resources training a network to learn an
inscrutable model for it.

Deep learning is useful when we need models for things complicated enough
that we don't know how to model them (for example, a model that tells us "is
there a dog in this image?").

~~~
latently
In my opinion you are overconfident in the foundations of mathematics. Like
deep learning models, math works. Why and how does it work? That's open to
interpretation in both cases. In both cases, we don't have a complete
understanding, and it is that lack of complete understanding that makes
something a black box.

------
duality
The fact that convolutional neural nets are used in vision is significant. The
convolutional structure encodes the geometry.
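That claim is easy to demonstrate: convolution is translation-equivariant,
so shifting the input shifts the feature map by the same amount. A
one-dimensional numpy sketch:

```python
import numpy as np

# An impulse "image" and a small filter.
x = np.zeros(20)
x[5] = 1.0
k = np.array([1.0, 2.0, 3.0])

y1 = np.convolve(x, k, mode="full")              # response to the impulse
y2 = np.convolve(np.roll(x, 3), k, mode="full")  # response to the shifted impulse

# The feature map shifts with the input: translation equivariance.
assert np.allclose(np.roll(y1, 3), y2)
```

This weight sharing across positions is arguably geometry baked into the
architecture, and part of why convnets need far less data than fully
connected nets for vision tasks.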

