It's interesting how many papers are published in this space, yet they all tend to use the same few images (the webcam chessboard, the two buildings, and the diagrams). I've recently been looking into how to reliably stitch video frames into "orthographs" (a similar problem is faced by aerial surveys, and there is a lot of good work on this from the drone community) and have read probably two dozen recent papers spanning homography, photogrammetry, feature detection, SfM, NeRF, and segmentation; most of them reuse these diagrams and at least some of these images.
Maybe the world would benefit from some more well-documented, openly licensed training/validation data?
I think it's intentional. A lot of creators of computer display technology (physical) and computer-vision-based algorithms use a "standard image set" to present their invention's performance against a common benchmark, like the image of "Lena" (1). There are also standard benchmark images of a parrot, a teapot, a bowl of fruit, and a rabbit that I see often.
This makes sense, and we see Lenna and the Utah teapot everywhere for things like 3D rendering and compression. There are equivalent datasets for evaluating CV models; you see them benchmarked against COCO a lot, for example.
I'm still surprised to see all the same images and diagrams in the *explanation* part of so many papers. It feels like the early word-processor days, with everyone using the same clip art.
I was trying to reconstruct multi-position "panoramas" from video recorded on a phone as someone walks, like getting one big photo of a street (but I'm indoors, essentially mapping walls). This 2011 talk has some good examples, although they benefit from the camera being relatively far from the subject, so standard feature detectors like SIFT and ORB can do the job well. http://graphics.cs.cmu.edu/courses/15-463/2011_fall/Lectures...
Most phones have an accelerometer and other sensors, so I'm exploring whether those can be used to determine the phone's movement between frames accurately enough to help me stitch the footage back together. When the camera is relatively close to the subject, the perspective changes so quickly that matching detected features and fitting a transform with something like RANSAC really struggles.
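For concreteness, the baseline that struggles looks roughly like this (a minimal sketch assuming OpenCV's Python bindings; the frame filenames are placeholders, not from a real project):

    # ORB features + brute-force matching + RANSAC homography between two
    # consecutive frames. Filenames are placeholders.
    import cv2
    import numpy as np

    a = cv2.imread("frame_a.jpg", cv2.IMREAD_GRAYSCALE)
    b = cv2.imread("frame_b.jpg", cv2.IMREAD_GRAYSCALE)

    orb = cv2.ORB_create(nfeatures=2000)
    kp_a, des_a = orb.detectAndCompute(a, None)
    kp_b, des_b = orb.detectAndCompute(b, None)

    # Hamming distance suits ORB's binary descriptors; cross-checking
    # prunes asymmetric matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des_a, des_b), key=lambda m: m.distance)

    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

    # RANSAC rejects outlier matches while fitting the homography. Up close,
    # parallax between frames is large and the inlier ratio collapses.
    H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    print("inlier ratio:", inlier_mask.sum() / len(matches))

My hope is that the gyro/accelerometer readings can at least provide a rough prior for the inter-frame motion before that last step, but that's exactly the part I'm still figuring out.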
I'd voraciously consume any good links you have; I'm happily over my head on this and learning/iterating at every turn. I think I've accidentally given myself a relatively hard problem because of the constraints.
> Multiple View Geometry in Computer Vision, Richard Hartley and Andrew Zisserman, [117] (some sample chapters are available here, CVPR Tutorials are available here)
Heh, about 10 years ago I read that book and figured out a few things:
* how to triangulate the 3D position of an object from a few 2D pictures taken from known camera poses (i.e. known intrinsic and extrinsic matrices); there's a sketch at the end of this comment
* how to instruct my friend to make a simple 3D rendering "engine"
Nice to see the same stuff distilled into an article.
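For anyone curious about that first bullet, the triangulation step boils down to a small linear system. Here's a minimal sketch in Python/NumPy using the standard DLT formulation; the camera matrices and point below are made-up toy values, not real calibration data:

    # DLT triangulation sketch in the spirit of Hartley & Zisserman.
    # P1, P2 are 3x4 projection matrices (K [R | t]); all numbers are
    # made-up toy values.
    import numpy as np

    def triangulate(P1, P2, x1, x2):
        # Each view contributes two rows to A via the constraint x ~ P X.
        A = np.vstack([
            x1[0] * P1[2] - P1[0],
            x1[1] * P1[2] - P1[1],
            x2[0] * P2[2] - P2[0],
            x2[1] * P2[2] - P2[1],
        ])
        # The homogeneous 3D point is the right singular vector of A with
        # the smallest singular value.
        X = np.linalg.svd(A)[2][-1]
        return X[:3] / X[3]

    # Toy setup: identical intrinsics, second camera shifted along x.
    K = np.array([[800.0, 0, 320], [0, 800, 240], [0, 0, 1]])
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0], [0]])])

    # Project a known 3D point into both cameras, then recover it.
    X_true = np.array([0.2, -0.1, 5.0, 1.0])
    x1 = P1 @ X_true; x1 = x1[:2] / x1[2]
    x2 = P2 @ X_true; x2 = x2[:2] / x2[2]
    print(triangulate(P1, P2, x1, x2))  # ~ [0.2, -0.1, 5.0]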