A dumb reason computer vision apps aren’t working: Exif Orientation (medium.com/ageitgey)
599 points by ageitgey on Oct 9, 2019 | 167 comments

Getting image and video orientation correct, which should be trivial, is decidedly not. In writing PhotoStructure [1], I discovered

1) Firefox respects the Orientation tag using a standard [2] CSS property, but Chrome doesn't [3]

2) That videos sometimes use "Rotation", not "Orientation" (and Rotation is encoded in degrees, not the crazy EXIF rotation enum). Oh, and some manufacturers use "CameraOrientation" instead, just for the LOLs [4]

3) That the embedded images in JPEGs, RAW, and videos sometimes are rotated correctly, and sometimes they are not, depending on the Make and Model of the camera that produced the file. If Orientation or Rotation say anything other than "not rotated," you can't trust what's in that bag of bits.

[1] https://blog.photostructure.com/introducing-photostructure/

[2] https://www.w3.org/TR/css-images-4/#image-notation

[3] https://bugs.chromium.org/p/chromium/issues/detail?id=158753

[4] https://exiftool-vendored.js.org/interfaces/makernotestags.h... (caution, large page!)
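For readers who haven't met the "EXIF rotation enum" from point 2: the Orientation tag takes one of eight values, and each one names the flip or rotation a viewer has to apply before display. A minimal Python sketch, using a nested list as a stand-in for pixel rows (the function names are mine and don't come from any library):

```python
def rotate_cw(grid):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def transpose(grid):
    """Swap rows and columns (mirror across the main diagonal)."""
    return [list(row) for row in zip(*grid)]


def apply_exif_orientation(grid, orientation):
    """Return the display-ready grid for EXIF Orientation values 1-8."""
    if orientation == 1:  # already upright
        return [list(row) for row in grid]
    if orientation == 2:  # mirrored left to right
        return [list(reversed(row)) for row in grid]
    if orientation == 3:  # rotated 180 degrees
        return rotate_cw(rotate_cw(grid))
    if orientation == 4:  # mirrored top to bottom
        return [list(row) for row in grid[::-1]]
    if orientation == 5:  # transposed
        return transpose(grid)
    if orientation == 6:  # needs a 90 degree clockwise turn
        return rotate_cw(grid)
    if orientation == 7:  # mirrored across the anti-diagonal
        return rotate_cw(rotate_cw(transpose(grid)))
    if orientation == 8:  # needs a 90 degree counter-clockwise turn
        return rotate_cw(rotate_cw(rotate_cw(grid)))
    raise ValueError(f"unknown EXIF orientation: {orientation}")
```

Phones mostly produce 1, 3, 6 and 8; the mirrored values are rarer but still show up.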

And after all that, editing (simply rotating) the photo in different applications has different results. Some change only the Orientation tag, others change the actual data, some seem to change both so it's still incorrect when opened in other viewers; and then there's the embedded thumbnail (but that is rarely used), so the result is a mess.

I'm interested in your PhotoStructure application, just subscribed to the beta!

When you rotate a photo or video in PhotoStructure, you can have it persist rotation/orientation by updating the file directly (PhotoStructure uses exiftool under the hood), but it's not the default out of concern for unknown bugs that may invalidate the original file in some way.

By default it just writes the new orientation to an XMP or MIE sidecar. The downside of this approach is that most applications don't respect sidecars.

> That videos sometimes use "Rotation", not "Orientation" (and Rotation is encoded in degrees, not the crazy EXIF rotation enum)

Not technically degrees. MP4 encodes it as a matrix. (Now, there are matrices corresponding to degree transformations...)

I haven't seen that: can you give me an example?

(The output of `exiftool -Rotation example.mp4` would be great!)

It's been a long time, but I once wrote code to parse and rewrite ['rotate'] the matrices at a previous job.

The "Matrix Structure" is in the 'mvhd' atom.


* https://developer.apple.com/library/archive/documentation/Qu...

* https://developer.apple.com/library/archive/documentation/Qu...

[Some of the best documentation for MP4 that I've seen on the web is from Apple, since it grew out of QuickTime.]
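To make the "matrix, not degrees" point concrete, here is a rough Python sketch that decodes a 36-byte QuickTime-style matrix and recovers a rotation angle from it. It assumes the layout in the Apple documentation: nine big-endian 32-bit values in the order a, b, u, c, d, v, x, y, w, with u, v, w stored as 2.30 fixed point and the rest as 16.16. The function names are made up for illustration, and the sign convention (and which atom's matrix a given player honors) should be checked against real files.

```python
import math
import struct


def decode_matrix(raw: bytes):
    """Decode a 36-byte matrix into floats (a, b, u, c, d, v, x, y, w)."""
    if len(raw) != 36:
        raise ValueError("expected exactly 36 bytes")
    values = struct.unpack(">9i", raw)
    # a, b, c, d, x, y are 16.16 fixed point; u, v, w are 2.30 fixed point.
    scales = [1 << 16, 1 << 16, 1 << 30] * 3
    return [value / scale for value, scale in zip(values, scales)]


def rotation_degrees(raw: bytes) -> float:
    """Angle of the rotation part of the matrix, in the range [0, 360)."""
    a, b = decode_matrix(raw)[:2]
    # For a pure rotation, a = cos(theta) and b = sin(theta).
    return math.degrees(math.atan2(b, a)) % 360
```

An identity matrix comes back as 0 degrees; a matrix with a = 0, b = 1, c = -1, d = 0 comes back as 90.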

This has caused me quite a bit of trouble with my image gallery as well. I've taken to just using exiftool to remove all rotation from my photos, which breaks half of them, and then manually hard-rotating them using ImageMagick, which is technically not a lossless operation if I'm understanding it correctly.

I really don't understand why it was decided that most photo viewing applications would honor EXIF rotation, but web browsers would not.

Use jpegtran, it can rotate jpg images losslessly.


Make sure you rtfm carefully.

Many years ago I accidentally deleted all the metadata from many images that needed rotating.

> -copy comments

> Copy only comment markers. This setting copies comments from the source file but discards any other data that is inessential for image display.

> The default behavior is -copy comments.

Argh! Thanks! So also my "lossless" transformations .. weren't.
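Given the manual excerpt quoted above, a small Python wrapper that always passes `-copy all` avoids the metadata loss. This is only a sketch: the function names are hypothetical, and it assumes a `jpegtran` binary on the PATH that supports the `-copy`, `-perfect`, `-rotate` and `-outfile` options.

```python
import subprocess


def jpegtran_rotate_command(src, dst, degrees):
    """Build a jpegtran command line that rotates and keeps all metadata."""
    if degrees not in (90, 180, 270):
        raise ValueError("jpegtran rotates by 90, 180 or 270 degrees only")
    return [
        "jpegtran",
        "-copy", "all",   # keep EXIF and other markers, not just comments
        "-perfect",       # fail rather than trim partial edge blocks
        "-rotate", str(degrees),
        "-outfile", dst,
        src,
    ]


def rotate_losslessly(src, dst, degrees):
    """Run jpegtran and raise if it reports an error."""
    subprocess.run(jpegtran_rotate_command(src, dst, degrees), check=True)
```

One caveat: with `-copy all` the Orientation tag is copied unchanged too, so a viewer that honors it would rotate the already-rotated pixels again unless the tag is reset separately.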

For those interested in how EXIF is treated in other apps, a dated but informative article from 2012: https://www.daveperrett.com/articles/2012/07/28/exif-orienta...

They should just put a gravity vector in the image.

I’m pretty sure iPhones do put it in. You should exif dump an iPhone XS image. INSANE what’s in there.

Go on...

This would (maybe) help with a common thing my wife does when taking videos: Start recording in portrait mode, then realize you did that, and rotate the phone 90 degrees to get widescreen video (but without restarting the recording).

When you play it back on a phone (with auto-orientation mode on), starting from holding the phone in portrait mode (as you normally do):

* it starts playing back as portrait, which looks fine

* the video rotates (because the camera was physically rotated), so now you're watching a widescreen video that's 90 degrees off

* Your natural reaction is to flip the phone 90 degrees to make down "down" again, but this changes the phone into widescreen mode, and because it thinks it's playing a portrait-style video, it changes to portrait-in-widescreen mode, and now the video is again tilted 90 degrees but 1/3 the size with huge black bars on either side

If you play it back on a computer/TV, you get the same end result: a widescreen video that's rotated 90 degrees, and 1/3 the size with huge black bars on either side.

Can't help you with the video-taking technique but when playing back the videos, if you hold the phone so that the video matches the screen, then rotate it so the screen is upwards, you can then spin it so it looks the right way up as long as you keep the screen vertical enough.

In these cases I just enable rotation lock.

Haha! I guess one has to cut the video and rotate manually one part?

Might as well put in acceleration and rotation while we're at it.

Pitch and yaw

You're joking, but many smartphones encode gyroscope readings in the metadata.

And uptime seconds (!?)

And estimated distance to subject, AGPS information, GPS acquisition time, depth field metadata, current battery level, operating system version, ...

We can use both sets of names, better yet, mix them!

Ah, the famous ropitchaw vector!

Won't help when filming in the ISS. Microgravity or not, I would like to view the result in the same orientation as the crewmember who filmed it.

And how would the camera know without gravity?

“You can’t trust that bag of bits.” That made me giggle.

Have you submitted a post just about PhotoStructure yet?

I've been looking for a replacement for Google Photos. It's the only thing keeping me on Google products at this point.

Shameless plug: JPEG Autorotate 3 allows you to both view the orientation of the raw JPEG data and preview how it looks with the orientation data applied. :)


Wow, your description of the problem of photo organisation is exactly. I've signed up for the beta!

One of my pet peeves in ML/stats/data science is people who hardly look at their data. Unless there are privacy reasons not to, then you really need to look at some data. You'll learn so much more from looking at a few hundred samples than you will from different metrics. You'll get a feel for how complex the problem is, or whether something simple will do. Check your assumptions. You might even realize that your images are sideways.

As Karpathy said:

1. Become one with the data.

The first step to training a neural net is to not touch any neural net code at all and instead begin by thoroughly inspecting your data. This step is critical. I like to spend copious amount of time (measured in units of hours) scanning through thousands of examples, understanding their distribution and looking for patterns. Luckily, your brain is pretty good at this. One time I discovered that the data contained duplicate examples. Another time I found corrupted images / labels. I look for data imbalances and biases. I will typically also pay attention to my own process for classifying the data, which hints at the kinds of architectures we’ll eventually explore. As an example - are very local features enough or do we need global context? How much variation is there and what form does it take? What variation is spurious and could be preprocessed out? Does spatial position matter or do we want to average pool it out? How much does detail matter and how far could we afford to downsample the images? How noisy are the labels?

source: http://karpathy.github.io/2019/04/25/recipe/

It's not directly related to this topic, but my favorite example of this sentiment is Anscombe's Quartet [0], four sets of datapoints that have (almost) the same statistical values, but a very obviously different layout when simply viewed together on a graph.

Plus an animated version that includes a T-Rex [1].

[0] https://en.m.wikipedia.org/wiki/Anscombe%27s_quartet

[1] https://www.autodeskresearch.com/publications/samestats

Off topic, but each time I accidentally open the Wikipedia mobile layout on the desktop I'm amazed at how much better it is. Is there a way to always redirect to the mobile version?

Yes, if you're on firefox I've been using that extension for quite a while and it works just fine: https://addons.mozilla.org/en-US/firefox/addon/mobile-wikipe...

If you are logged in, the following will do it, I think: Go to Preferences, then Appearance and change the skin to “MinervaNeue”.

That’s what my first numeric mentor taught me. You have to look at raw data. The first q he’d ask any programmer was “did you look at the data, the actual data?”. He was a PhD in Physics and his approach really sticked with me.

But it’s not always straightforward to “look at data”

And also at the results and the intermediate steps. Visualize everything. Don't just look at the final evaluation metric but dive in and see where things go wrong.

Do just a few bad predictions skew your score? What does the best prediction look like? What does the worst look like?

Are all your results just shifted by 2 pixels to the left due to some bug? Are there mislabeled examples in the test set? Etc. etc.

Stuck. Everything else is perfect idiomatic English.


Just looking at data and descriptive statistics is one of the first things a person is taught in machine learning, data science and statistics coursework. It’s a major skill in the field that is emphasized all the time.

Practitioners frequently do cursory data analysis and data exploration to gain insight into the data, corner cases and which modeling approaches are plausible.

Just to give some examples, Bayesian Data Analysis (Gelman et al), Data Analysis Using Regression and Multilevel/Hierarchical Models (Gelman and Hill), Doing Bayesian Data Analysis (Kruschke), Deep Learning (Goodfellow, Bengio, Courville), Pattern Recognition and Machine Learning (Bishop) and the excellent practical blog post [0] by Karpathy all list graphical data checking, graphical goodness of fit investigation, descriptive statistics and basic data exploration as critical parts of the model building workflow.

If you are seeing people produce models without this, it’s likely because companies try to have engineers do this work, or hire from bootcamps or other sources that don’t produce professional statisticians with grounding in proper professional approach to these problems.

When people mistakenly think models are commodities you can copy paste from some tutorials, and don’t require real professional specialization, then yes, you get this kind of over-engineered outcome with tons of “modeling” that’s disconnected from the actual data or stakeholder problem at hand.

[0]: https://karpathy.github.io/2019/04/25/recipe/

The article explains why it's practically impossible to see that (check your assumptions! :P )

It’s easy to print out the dimensions of the image when it’s loaded. I always log the size (pixels, rows, etc) of any data I load.
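Logging the size only helps if you know what size to expect. A tiny Python sketch of the comparison, assuming you already have the stored pixel dimensions and the EXIF Orientation value (the function name is hypothetical):

```python
def displayed_size(stored_width, stored_height, orientation):
    """Size a viewer would show, given stored size and EXIF Orientation."""
    # Orientations 5-8 involve a quarter turn, so width and height swap.
    if orientation in (5, 6, 7, 8):
        return stored_height, stored_width
    return stored_width, stored_height
```

If the loader reports 4032x3024 while every photo viewer shows the picture as portrait, `displayed_size(4032, 3024, 6)` giving `(3024, 4032)` is the hint that the pixels were never rotated.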

Well, after you've read in your data, read them back out and display them on the screen.

This actually is generally true for any computer programming debugging. You need to look at the data as it appears to your program, to catch bugs in your preprocessing pipelines.

I've tried working like that and it makes a massive difference. To be able to visualize all the intermediate stages is valuable because they are basically never right until you are able to isolate and refine them individually.


A critical piece of wisdom I find myself repeating to younger students, handed down from an advisor: "always look at the pixels". (Sometimes replacing pixels with voxels/lidar points/meshes, etc.)

Parallel issue: people who fit complex models when simple descriptive statistics would be more useful/a better fit.

This is a byproduct of hiring bootcamp grads or tasking a modeling project to engineers who read some tutorials. People think they can scan a few Jupyter notebooks and then professionally solve statistics problems.

People wonder why it’s hard and expensive to hire ML engineers... because they actually solve these problems with craft. Meaning, they systematically grow understanding of the data, start with simple models, and have well articulated reasons explaining cases when complexity is justified.

It's the machine learning version of people who use hadoop when they could have used command-line utilities.

Yes, thank you. This can't be stated enough or loudly enough! You absolutely have to manually check, then check it again -- even with privacy issues, find a way to be non-identifiable and check it. Data is such a huge factor in whether your project will fail or not, yet too many people don't give it the respect (and dare I say love) it deserves.

For one of my projects, I looked at and manually labeled thousands of pictures, yet this issue never came up until people were actually using pictures coming from their smartphones.

somehow, the Allegory of the Cave comes to mind... ;)

Yes. It's a method I use:

Did you look at the actual data?


Then they are wrong. Find a way to cheaply visualize data/state and call me back when you're done.

100%. One of the most common mistakes too.

I ran into this headache while writing a web app.

I made an image upload widget that provided a preview, and when users selected the "take a picture" option on their phones, I showed the preview with a blob link and CSS background-image property. The images were showing sideways on some phones.

I looked at the EXIF data of those photos and of course Orientation: 90 showed up.

It was easy to fix on the backend when processing the images but I struggled to do it in a performant way on the front-end. One solution involved reading the EXIF data with JS and rotating with canvas and .toBlob(), but proved too slow for large 10MB photos as it blocks the main UI thread.

One thing I thought of is just reading the orientation and then using CSS transforms to rotate the preview, but I never got around to trying it.
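The CSS-transform idea boils down to a lookup from Orientation value to a `rotate()` function. Sketched in Python for consistency with the other examples in this thread (in a real upload widget the same table would live in the front-end code). The names are hypothetical, and only the pure-rotation values are covered; mirrored orientations return `None` so the caller can fall back to another approach.

```python
# CSS rotate() treats positive angles as clockwise.
CSS_ROTATION_DEGREES = {1: 0, 3: 180, 6: 90, 8: 270}


def css_transform_for_orientation(orientation):
    """Return a CSS transform value for the preview, or None if unsupported."""
    degrees = CSS_ROTATION_DEGREES.get(orientation)
    if degrees is None:
        return None
    return "none" if degrees == 0 else f"rotate({degrees}deg)"
```

Two things to watch for: a quarter turn swaps the element's visual width and height, so the layout around the preview needs to account for that, and a browser that already applies EXIF orientation by itself would end up rotating twice.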

This shows up even in basic websites! My partner, who is an artist, ended up with some of her portfolio broken seemingly at random in different browsers---and for some of it, "up" wasn't visually obvious and she'd rotated some of the images in apps in addition (Preview, Adobe things, etc.), so there wasn't a good way to just change everything. Ended up having to do a ton of work to strip EXIF data on an image-by-image basis.

Basically entirely because of this ridiculous issue where chrome refuses to respect exif tags.[1]

[1] https://stackoverflow.com/questions/42401203/chrome-image-ex...

Squoosh.app, a webapp from Google for compressing and resizing images, has an article about using WebAssembly for exactly this purpose: https://developers.google.com/web/updates/2019/02/hotpath-wi...

"In squoosh we wrote a JavaScript function that rotates an image buffer by multiples of 90 degrees. While OffscreenCanvas would be ideal for this, it isn't supported across the browsers we were targeting, and a little buggy in Chrome."

Wait, I know it's a big company, but you're at the same company...

Your second approach is almost exactly what I've done in an iPhone video editing app I've been working on to deal with video previews.

Rather than reencode the video just to preview it, I can just apply the same transforms (scale, rotation, translation) to the view that's displaying the preview. I then mask that view with another view of the same size so it doesn't go outside the edges.

Of course I still need to encode the video with those transforms if I want it to show up in their camera roll later.

Try web workers! I've been doing all kinds of crazy expensive spatial data validation in many parallel threads. It's amazing. I think off screen canvas in workers is gaining browser support recently.

If you don't need to actually modify the image data you shouldn't be doing it at all. The correct solution is to specify a rotational transform so the hardware will do the heavy lifting on the GPU where this kind of computation belongs.

> Try web workers!

NO, please do not try web workers. I don't want any more running on my device than absolutely necessary.

That ship has sailed, if you don't want people running arbitrary code on your device, you should disable JavaScript

Funny enough, I do that

You don't want to use all your cores to run things faster?

It would be nice if they only made things faster. In practice it seems like they use all my cores to do more things, things that I don't want or care about them doing.

Nope. I want things to look plain and simple. I want to use fewer cores for longer battery life.

The faster a CPU finishes, the faster the CPU can sleep, which is how you save battery life. The more cores you use the lower the CPU frequency can be, which saves power, since frequency increases do not use power linearly.

I think it goes without saying that javascript workers have nothing to do with the design and layout of a webpage.

> The faster a CPU finishes, the faster the CPU can sleep, which is how you save battery life.


> The more cores you use the lower the CPU frequency can be, which saves power, since frequency increases do not use power linearly.

More cores in use has little to do with frequency and more to do with heat. More heat means more thermal throttling which lowers frequency. Lower frequency means that the CPU doesn't sleep sooner.

> I think it goes without saying that javascript workers have nothing to do with the design and layout of a webpage.

Yup. That's exactly why I don't want them. Why should I execute something which doesn't, and shouldn't, have anything to do with rendering page content?

Don't get me wrong, I'm fine with using more cores if it's actually beneficial. But every use I've ever seen for a web/service/javascript worker has always been user hostile by taking what should be done on a server and offloading it onto the user's device instead.

Using all your cores for the same workload would mean it finishes faster or finishes in the same time with significantly lower frequency. It saves power and heat. Your example would mean using more cores for the same amount of time, which makes no sense in this comparison.

Do you also buy single core computers to save power?

It is clear that what you are saying has nothing to do with javascript features at all and just boils down to not liking bloated web pages.

Was rotating with CSS transforms an option?

That way you'd only need to read the EXIF data, but wouldn't need to go all the way to rotating the pixels yourself.

Have run into this myself when building the same "preview upload" feature. It's annoying that it's not something that is just handled by the browser, it feels like it should be a supported feature of the "img" tag or something.

> One solution involved reading the EXIF data with JS and rotating with canvas and .toBlob()

At my previous job I did the same thing, although I never noticed a significant slowdown. I also made the file size smaller since we wanted to have predictable upload times and mitigate excessive usage of storage space.

The other reason was that the EXIF data was weird on some devices and the back end library didn't rotate them correctly.

I had this problem with the artist open studio website.

Oddly it seems like it wasn't as big a problem until cell phones. ( I honestly couldn't tell which way is up for some images, being abstract.)

We ended up using a JS library called "croppie", but I'm not sure it helps with large images.

As we speak I need a file upload widget that will reject some images based on exif. What’s your recommendation for client side js exif extraction?

ImageMagick will "apply" the exif rotation to the image data for you.

    convert input -auto-orient output

Or in place:

    mogrify -auto-orient *

Even outside of machine vision this issue has caused me all sorts of headaches. Especially when switching between tools that honor or ignore the tag.

"But the tricky part is that your camera doesn’t actually rotate the image data inside the file that it saves to disk."

Cameraphones are so powerful these days. I don't understand why the app can't simply include a behavior setting to always flip the original pixels.

It might not be up to the app? If the phone has hardware-accelerated JPEG compression, then potentially the image will already be compressed in the 'wrong' orientation before the app gets its hands on it. So rotating the image data could involve re-compressing the image again, leading to quality loss. Or if you choose to get the raw sensor data instead, and rotate the data before doing the compression yourself, you might lose out on the hw acceleration entirely.

(I've not developed any camera apps, so this is just a guess!)

> So rotating the image data could involve re-compressing the image again, leading to quality loss.

With JPEG, no, 90 degree rotations can be accomplished in a lossless manner.

See jpegtran from the libjpeg library. A version of its manpage is here:



"... It can also perform some rearrangements of the image data, for example turning an image from landscape to portrait format by rotation.

jpegtran works by rearranging the compressed data (DCT coefficients), without ever fully decoding the image. Therefore, its transformations are lossless: there is no image degradation at all, which would not be true if you used djpeg followed by cjpeg to accomplish the same conversion. ..."

Because the data sources are unrelated. The camera sensor is hooked up to the ISP which is hooked up to a hardware JPEG encoder. This is necessary in order to get those hyper fast shots off.

You'll notice that an orientation sensor is nowhere in that list. So what happens is the camera hardware spits out a JPEG. The app then combines it with the orientation sensor & produces the EXIF headers. It could choose to decode, rotate, re-encode, but that's slow (~100ms) and hurts shot-to-shot latency. And it loses quality. And, hey, since everything supports EXIF orientation anyway, why bother?

> It could choose to decode, rotate, re-encode

Or it could simply rotate without decoding or re-encoding, which has the added advantage of being lossless.

Obviously it's still added processing time and (probably more importantly) development time, so it's generally not worth bothering, however it's important to point out that JPEG rotation can (in the case of 90 degree increments) be done losslessly.

The phone is accurately recording the image, as well as the orientation. The bug happens when the EXIF information is stripped. Sure, phone apps could add an option to physically rotate the image, as a bug workaround, but it's not surprising that they don't do this.

It is surprising, because it seems like such an obvious fix to the problem of stripped EXIF data. Couldn't be that hard to implement a user setting which tells the app to rotate the file itself before saving.

This is a major oversight for the companies who develop camera apps. The major ones even have whole teams dedicated to that single app.

It would be trivial to rotate a bitmap with no loss of quality, but you can’t rotate an already compressed jpg without changing the image itself, which reduces quality.

Assuming the image size is a multiple of the block size (8x8, 16x8, or 16x16 depending on the chroma subsampling), lossless rotation of JPEGs is possible, e.g. with jpegtran:


Really interesting, good find.
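The block-size condition mentioned above can be checked before attempting a rotation. A minimal Python sketch, using the three block sizes the parent comment lists; the subsampling labels and the function name are my own, and the check is only the sufficient condition stated there (jpegtran's actual requirement per transform may be less strict).

```python
# Block (MCU) width and height in pixels for common chroma subsampling modes.
MCU_SIZE = {
    "4:4:4": (8, 8),
    "4:2:2": (16, 8),
    "4:2:0": (16, 16),
}


def fits_whole_blocks(width, height, subsampling="4:2:0"):
    """True if both dimensions are exact multiples of the block size."""
    mcu_width, mcu_height = MCU_SIZE[subsampling]
    return width % mcu_width == 0 and height % mcu_height == 0
```

A 4032x3024 photo passes for all three modes; a 1000x750 image fails for all of them because neither 750 nor (for the 16-pixel modes) 1000 divides evenly.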

Funny, it looks like a common headache.

Lots of other complaints in this thread and also something I have encountered on Android with a small side feature allowing users to upload some pictures.

I'm not sure how good your algorithm is if it can only work in a specific orientation. A slightly tilted image can already cause problems for such an algorithm and many people have a hard time getting a picture with only a few degrees of rotation.

It would probably be better to deal with this in an elegant way, e.g. set up the algorithm to work regardless of orientation. This seems like a (mostly) solved problem: https://d4nst.github.io/2017/01/12/image-orientation/

You might not want rotation invariance in your algorithm. Our own perceptual system is not rotation-invariant, most obviously for faces in this illusion: https://en.wikipedia.org/wiki/Thatcher_effect

Forcing rotation in the training phase would be a good way to regularize the network. But of course that will lower the accuracy.

That’s not an option if you’re trying to detect left-arrows vs right-arrows, on street signs for ex.

Depends. Based on the linked article, the picture can be completely upside down if it was taken in landscape mode. The article I linked is capable of rotating it upright regardless. After rotating to the right orientation, left and right are suddenly perfectly workable.

Of course using exif data for such rotations is easier, but a tilted picture of a tilted sign can create a lot of tilt that human vision copes with fine but an orientation dependent network cannot.

If the algorithm fails with a rotated image, then I would claim it's a bad algorithm that is over-fit and not at all generalising what it has learnt. What about slight rotation? Where does it start failing, and would a human?

Also, making an algorithm for detecting rotated images should be easy if it affects the results so much.

Yea, there’s some truth to that, but then we wouldn’t get these gems:

Is it a duck or is it a rabbit?


Topic of image orientation always reminds me about this article: https://www.daveperrett.com/articles/2012/07/28/exif-orienta...

One of the common image processing libraries does take EXIF rotation into account: OpenCV [1]. I would tend to use that over manually rotating using the code from the article. Beware, though, that OpenCV cannot open quite as many formats as Pillow; most notably, it cannot open GIFs due to patent issues. You can get a prebuilt wheel of OpenCV by pip installing the package opencv-python [2].

[1] https://docs.opencv.org/trunk/d4/da8/group__imgcodecs.html#g...

[2] https://pypi.org/project/opencv-python/

There's a big photography site I use that treats the EXIF inconsistently. I rotate the image in my editor to work on it, then save and upload. In some contexts it looks OK, but in other contexts the site rotates the image again and it's wrong. I don't want to strip the EXIF because it has interesting information such as the camera model and lens, and the exposure settings. My editor doesn't correct the EXIF rotation setting, so I have no choice but to use a utility to strip that single value from the file before I upload it.

CNNs are translation and scale invariant thanks mostly to the pooling operation. Good data augmentation (rotating images, for example) would have built a model more robust to this effect.

This is functionally a coordinate system problem. Thankfully, it's pretty easy here. Just wait until we start getting more models for 3D data. I worked with 3D data for the longest time, and it was incredibly painful — different libraries can use wildly different coordinate systems (e.g. I've seen the up-direction be +z, -z, +y, and -y). At that point, it's nontrivially difficult to even figure out what the right way to convert between coordinate systems is.

Train a model that recognizes L, R, or U sideways images to reorient them all. Run it once.
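The training data for such a reorientation model can be generated for free from images already known to be upright. A rough Python sketch, with nested lists standing in for pixels and hypothetical function names; the label is how far the upright image was turned clockwise.

```python
def rotate_cw(grid):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def rotation_training_pairs(upright_images):
    """Yield (image, degrees) pairs covering all four quarter turns."""
    pairs = []
    for image in upright_images:
        rotated = [list(row) for row in image]
        for quarter_turns in range(4):
            pairs.append((rotated, quarter_turns * 90))
            rotated = rotate_cw(rotated)
    return pairs
```

A classifier trained on these pairs predicts one of four classes (0, 90, 180, 270), and undoing the predicted turn gives the upright image.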

Some websites also strip EXIF data from user submitted content but do not rotate the image to match the now-missing information about which side is up.

Why don’t you train your classifiers to be orientation-agnostic? My brain certainly had no problem recognizing the geese on the sideways image...

Data augmentation can have unwanted consequences. For example, horizontal or vertical flipping, what could be more harmless? You can still recognize stuff when it's upside-down, can't you? It's a great data augmentation... Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.
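One way to keep flip augmentation while avoiding the arrow problem is to flip the label together with the pixels. A minimal Python sketch; the label names and the mapping table are made up for illustration, and a real dataset would need its own list of mirror-sensitive classes.

```python
# Classes whose meaning changes under a horizontal mirror (illustrative only).
MIRRORED_LABEL = {
    "left_arrow": "right_arrow",
    "right_arrow": "left_arrow",
    "b": "d",
    "d": "b",
}


def hflip_example(pixels, label):
    """Mirror an image left to right and fix up the label if it depends on it."""
    flipped = [list(reversed(row)) for row in pixels]
    return flipped, MIRRORED_LABEL.get(label, label)
```

A goose stays a goose, but a flipped "left_arrow" comes back labeled "right_arrow" instead of silently poisoning the training set.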

> Unless, of course, your dataset happens to involve text or symbols in any way, such as the numbers '6' and '9', or left/right arrows in traffic signs, or such obscure letters as 'd' and 'b'.

If your dataset consists of nothing but isolated 'd's and 'p's in unknown orientation, you won't be able to classify them correctly because that is an impossible task. But it would be more common for your dataset to consist of text rather than isolated letters, and in the context of surrounding text it's easy to determine the correct orientation, and therefore to discriminate 'd' from 'p'.

So it's not a problem, except when it is. Good to know.

Incidentally, how does that work for mirroring, when all that surrounding text gets mirrored too? (Consider the real example of the lolcats generated by Nvidia's StyleGAN, where the text captions are completely wrong, and will always be wrong, because it looks like Cyrillic - due to the horizontal dataflipping StyleGAN has enabled by default...)

Interestingly, your brain actually isn't orientation agnostic: http://nancysbraintalks.mit.edu/video/what-you-can-learn-stu...

Why make it harder than it has to be?

I'd say you're not making the problem "harder" - rather, requiring that object detection be orientation-agnostic makes the problem exactly as hard as the problem actually is. Allowing the network to train only on images with a known, fixed, (correct) orientation makes the object detection too _easy_, so the network results will likely fail if you feed it any real-world data.

i.e. you should be training your image application with skew/stretch/shrink/rotation/color-palette-shift/contrast-adjust/noise-addition/etc. applied to all training images if you want it to be useful for anything other than getting a high top-N score on the validation set.

A goose is still a goose, even if not viewed from the "normal" orientation.

Yeah... this article really looks like blame-shifting to me. I'm imagining a future wherein we have bipedal murderbots, but we're still training AI with "properly oriented" images... a bot trips over a rock, and starts blasting away at mis-identified objects.

Well, even humans find it more difficult to process upside-down faces or objects. There's power in assumptions that are correct 99% of the time.

Regardless, the article isn't really shifting blame, in so much as explaining what's happening in the real world, with the real tools. The tools don't care about EXIF. Consumer software uses EXIF to abstract over reality. A lot of people playing with ML don't know about either.

You may not know whose face it is, but you do know that it is a face, right? It's like saying you can tell someone is in front of you when he's standing up in clear view, but the moment he's lying on a couch in the centre of your vision, you can't tell with certainty whether the thing on the couch is a human being?

I could tell, but I'm under the impression that my 4-month-old kid still has problems with that, or at least had when she was 2 months old.

I think this is closer to the performance we should expect of current neural models - a few months old child, not an adult. NNs may be good at doing stuff similar to what some of our sight apparatus does, but they're missing additional layers of "higher-level" processing humans employ. That, and a whole lot of training data.

Bonus round:

(1) Find the ImageNet (ILSVRC2012) images possessing EXIF orientation metadata.

(2) Of those images, find which ones have the "correct" EXIF orientation.

Has anyone done this yet? I'd be curious to know if their dataset is "cleaned" or not.

Spoiler, ish:

The last time I measured ImageNet JPEGs with EXIF orientation metadata, the number of affected images was actually quite small (< 100, out of a dataset of 1.28M). There are also some duplicates, but altogether it seems fairly "clean."
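For anyone who wants to repeat that kind of audit without extra dependencies, here is a rough sketch of reading the Orientation tag (0x0112) straight out of a JPEG's bytes. The function names are made up for illustration, and the parser is deliberately simplified: it only looks at IFD0 of the first Exif APP1 segment and gives up on anything unusual.

```python
import struct


def _orientation_from_tiff(tiff):
    """Scan IFD0 of a TIFF/Exif block for tag 0x0112 (Orientation)."""
    if tiff[:2] == b"II":
        endian = "<"            # little-endian ("Intel") byte order
    elif tiff[:2] == b"MM":
        endian = ">"            # big-endian ("Motorola") byte order
    else:
        return None
    if len(tiff) < 8:
        return None
    (ifd_offset,) = struct.unpack(endian + "I", tiff[4:8])
    if ifd_offset + 2 > len(tiff):
        return None
    (count,) = struct.unpack(endian + "H", tiff[ifd_offset:ifd_offset + 2])
    for i in range(count):
        start = ifd_offset + 2 + 12 * i      # each IFD entry is 12 bytes
        entry = tiff[start:start + 12]
        if len(entry) < 12:
            return None
        (tag,) = struct.unpack(endian + "H", entry[:2])
        if tag == 0x0112:
            # Orientation is a SHORT stored in the first 2 bytes of the value field.
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value
    return None


def read_exif_orientation(data):
    """Return the EXIF Orientation value from JPEG bytes, or None if absent."""
    if data[:2] != b"\xff\xd8":              # missing SOI marker: not a JPEG
        return None
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            return None                      # lost sync; give up
        marker = data[pos + 1]
        if marker == 0xDA:                   # start of scan: metadata is over
            return None
        (seg_len,) = struct.unpack(">H", data[pos + 2:pos + 4])
        body = data[pos + 4:pos + 2 + seg_len]
        if marker == 0xE1 and body[:6] == b"Exif\x00\x00":
            return _orientation_from_tiff(body[6:])
        pos += 2 + seg_len
    return None
```

To audit a dataset you would call `read_exif_orientation(open(path, "rb").read())` for each file and count the results that are neither `None` nor `1`.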

When I built a consumer site that processed tons of photos I ran into this all the time. Ended up doing it all myself by parsing the exif data and doing the rotation. Also ended up writing some pretty extensive resize code that worked much better than what was built-in and more like the great scaling effects you see in Preview.app.
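The "doing the rotation" half is mostly a lookup table over the eight EXIF orientation values. A sketch on a plain list-of-rows image, so it runs without any imaging library (the helper names are illustrative; with Pillow you would more likely reach for `ImageOps.exif_transpose`):

```python
def flip_lr(img):
    return [row[::-1] for row in img]


def flip_tb(img):
    return [list(row) for row in img[::-1]]


def transpose(img):
    return [list(col) for col in zip(*img)]


def rot90_cw(img):
    return [list(col) for col in zip(*img[::-1])]


def rot90_ccw(img):
    return [list(col) for col in zip(*img)][::-1]


def rot180(img):
    return [row[::-1] for row in img[::-1]]


# Transform to apply to the *stored* pixels so they display upright,
# keyed by the EXIF Orientation value.
_FIXES = {
    1: lambda img: [list(row) for row in img],   # already upright
    2: flip_lr,                                  # mirrored horizontally
    3: rot180,                                   # upside down
    4: flip_tb,                                  # mirrored vertically
    5: transpose,                                # flipped across the main diagonal
    6: rot90_cw,                                 # needs 90 degrees clockwise
    7: lambda img: rot180(transpose(img)),       # flipped across the anti-diagonal
    8: rot90_ccw,                                # needs 90 degrees counter-clockwise
}


def apply_orientation(img, orientation):
    """Return an upright copy of img; unknown values are treated as 'no change'."""
    fix = _FIXES.get(orientation, _FIXES[1])
    return fix(img)


upright = [[1, 2, 3],
           [4, 5, 6]]
stored_sideways = [[3, 6], [2, 5], [1, 4]]   # what a camera might store with Orientation=6
print(apply_orientation(stored_sideways, 6) == upright)
```

Note that values 5 to 8 swap width and height, which is one more thing downstream code tends to forget.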

It always bothers me that many libraries still use nearest-neighbor to resize images instead of some kind of Lanczos filtering.
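For reference, the Lanczos kernel itself is tiny. Below is a rough one-dimensional sketch (an image resize applies the same thing along rows and then columns); `resample_1d` and its edge-clamping behaviour are illustrative choices of mine, not how any specific library does it.

```python
import math


def lanczos_kernel(x, a=3):
    """Lanczos window: sinc(x) * sinc(x / a) for |x| < a, else 0."""
    if x == 0:
        return 1.0
    if abs(x) >= a:
        return 0.0
    px = math.pi * x
    return a * math.sin(px) * math.sin(px / a) / (px * px)


def resample_1d(samples, new_len, a=3):
    """Resample a 1-D signal to new_len points with a Lanczos filter."""
    n = len(samples)
    scale = n / new_len
    stretch = max(scale, 1.0)          # widen the kernel when downscaling
    support = a * stretch
    out = []
    for i in range(new_len):
        center = (i + 0.5) * scale - 0.5   # output sample position in input coordinates
        lo = math.floor(center - support)
        hi = math.ceil(center + support)
        total = 0.0
        weight_sum = 0.0
        for j in range(lo, hi + 1):
            w = lanczos_kernel((j - center) / stretch, a)
            src = samples[min(max(j, 0), n - 1)]   # clamp indices at the edges
            total += w * src
            weight_sum += w
        out.append(total / weight_sum)     # normalize so flat areas stay flat
    return out


print(resample_1d([0, 0, 0, 10, 10, 10, 0, 0, 0, 0, 0, 0], 4))
```

Nearest-neighbor would just pick one input sample per output sample, which is why it aliases so badly when shrinking.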

Hi Adam, if you read this, you have a small mistake.

Your quote: "This tells the image viewer program that the image needs to be rotated 90 degrees counter-clockwise before being displayed on screen"

should be:

"This tells the image viewer program that the image needs to be rotated 90 degrees clockwise before being displayed on screen". Cheers.


>Most Python libraries for working with image data like numpy, scipy, TensorFlow, Keras, etc, think of themselves as scientific tools for serious people who work with generic arrays of data. They don’t concern themselves with consumer-level problems like automatic image rotation — even though basically every image in the world captured with a modern camera needs it.

What a snotty attitude. The tools are already complex enough without taking on the responsibility of parsing the plethora of ways a JPEG can be "rotated". This thread is a testament to the non-triviality of the issue, and I certainly don't want a matrix manipulation or machine learning library to bloat up and have opinions on how to load JPEGs just so someone careless out there can save a couple of lines.

If your computer vision application can't even recognize which way up a picture like the ones in the article is, then surely you have bigger problems than EXIF orientation.

As others have said: what about pictures that are simply not aligned with the horizon?

All the top architectures are not rotation-invariant. I think some of the blame lies in how successful CNNs are (as Hinton claims).

I ran into this issue while developing https://www.faceshapeapp.com, where users upload a photo and a face detector is run on it. Shortly after launching, about 10% of users complained about rotated images. After some debugging, I discovered that it was due to EXIF rotation.

You could parse the exif and rotate the image using canvas, but thankfully, there's already a JS library which does it for you: https://github.com/blueimp/JavaScript-Load-Image.

Would it really be too much of a burden for phones to just save photos at the correct orientation now? I understand the hardware limitations that were present in 2007, but surely these can't still be a factor?

Is there any good reason to save a photo in an orientation that does not match the orientation that the device is being held in? Shouldn’t up be up? If that results in a NxM photo, save it NxM. If it results in a MxN photo save it that way!

The only edge case is a camera pointed straight up or straight down. Or a camera in space.

Correct orientation is relative to the photographer, not to the ground. Most of the time the photographer is in the usual, vertical orientation, but sometimes you really want to take a shot at an angle, and you have to fight with your phone to do it correctly. I really dislike these kinds of "convenience" optimizations.

Coming back to ancient software design principles:

> The user is always right

Let them have a setting. It's really that easy.

> Or a camera in space.

Or freefall

Would it really be too much of a burden for python image libraries to decode jpegs to the correct orientation? I understand lack of knowledge about exif in the 2000s, but surely these days that can't still be a problem?

Isn't it better to train the model that way though? In fact it might be good to load every image in different orientations to add to the training data.

Only goes to show how far we have to go still when computers don't even tilt their heads automatically when given rotated input.

Well, if your input image is a JPEG, the information is stored in a tag, and you can remove it in pretty much any language. There's also a C++ library for it: https://github.com/mayanklahiri/easyexif

I should add that this very problem also exists with video. What makes it worse is that "smartphone" video apps started regularly using video orientation metadata years before desktop video apps even seemed to acknowledge that it was a thing.

It's the simplest way to do it and has so few downsides that it's not going to change until someone invents a googly eye camera that rotates with the phone.

As an alternative to the python code in the article, there's this command-line tool: https://linux.die.net/man/1/exiftran

I just could not stop laughing reading this. It reminds me that my professor once showed us a slide of the coastline of Africa, which none of us recognized until he rotated it to the correct orientation.

Who thought Exif Orientation was a good idea? It's a needless complication of the file format. It's not like iPhones (or any other cameras) lack the computing power to rotate the image before saving it.

Wikipedia says EXIF was released in 1995 (https://en.wikipedia.org/wiki/Exif). If you were shooting with a DSLR, say 6 megapixels with 8 bits per channel (3 bytes per pixel), the raw output would be 18MB in size (https://en.wikipedia.org/wiki/Kodak_DCS_400_series). In order to rotate this raw 6MP image you would need 36MB of RAM (input and output buffers of the same size, non-overlapping). Then, after the rotation, you could perform the JPEG compression, so that the rotation is lossless. Finally, you could store the JPEG image to disk.

36MB of RAM just for raw image buffers would have been quite expensive in 1995. Simply tagging some extra data onto the image to say which orientation it should be presented in takes almost no extra memory or processing within the camera; some big desktop PC could then easily rotate the decompressed JPEG to perform a "lossless" rotation after the fact (i.e., decompress the JPEG in the wrong orientation, rotate, present to user).

Technically, you wouldn't need a full 18MB for the output buffer so long as you perform the JPEG compression in-line with the rotation and are willing to deal with slicing the image into swaths. So in theory you could get away with something like a 1MB output buffer, but then your rotation time would depend on your JPEG timing and you couldn't take another picture with the main raw buffer until rotation and JPEG compression were both complete. It's a tradeoff, time versus memory.
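The arithmetic behind those figures, spelled out. The 3000x2000 sensor dimensions and the 16-row swath height are assumptions picked only to make the numbers concrete, and the 18MB figure relies on each pixel taking three 8-bit channels.

```python
WIDTH, HEIGHT = 3000, 2000   # hypothetical 6-megapixel sensor
BYTES_PER_PIXEL = 3          # assumes 8 bits for each of R, G and B

frame_bytes = WIDTH * HEIGHT * BYTES_PER_PIXEL
two_buffers = 2 * frame_bytes            # separate input and output buffers

# Swath approach: after a 90-degree rotation each output row is HEIGHT pixels
# wide, so the output buffer only needs to hold a few of those rows at a time.
SWATH_ROWS = 16                          # hypothetical swath height
swath_bytes = HEIGHT * SWATH_ROWS * BYTES_PER_PIXEL

print(f"one frame:   {frame_bytes / 1e6:.1f} MB")
print(f"two buffers: {two_buffers / 1e6:.1f} MB")
print(f"one swath:   {swath_bytes / 1e3:.1f} kB")
```

With these assumed numbers a swath comes out well under the roughly 1MB mentioned above, so the input frame buffer dominates either way.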

Maybe the first step before doing image classification or object detection is to run an image rotation detector ;)

Sure, that would reduce the problem space by one, but that ain't the reason CV doesn't work.

Some object detection AI systems actually do rotate the images in 4 to 8 ways, like in fast.ai. When trained with such input data, the AI detects the object in almost all orientations.
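The "4 to 8 ways" are the four 90-degree rotations plus their mirror images. A small dependency-free sketch of generating all eight from one image (this is not fast.ai's API, just an illustration with made-up helper names):

```python
def rotate_cw(img):
    """Rotate a list-of-rows image 90 degrees clockwise."""
    return [list(col) for col in zip(*img[::-1])]


def mirror(img):
    """Flip a list-of-rows image left to right."""
    return [row[::-1] for row in img]


def eight_orientations(img):
    """Return the 4 rotations of img, each with and without a horizontal flip."""
    views = []
    current = [list(row) for row in img]
    for _ in range(4):
        views.append(current)
        views.append(mirror(current))
        current = rotate_cw(current)
    return views


for view in eight_orientations([[1, 2], [3, 4]]):
    print(view)
```

For labels that depend on position (bounding boxes, keypoints), the same transform has to be applied to the annotations as well.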

Sounds like the training sets should include more rotated samples.

that or train models to be rotationally invariant.

Surely someone can create a neural network that reorients the image, even if the Exif orientation is wrong (JPEG lets you do this without re-encoding). Sounds like it should be a very simple problem for the vast majority of "regular" images.

Or how about just using the damn EXIF information to orient it correctly, as the article outlines? The actual article is far less interesting than what the title seems to imply. While the title suggests a problem with computer vision, this is more of a programmer logic error.

There are many times my phone detects I'm holding it horizontally while taking a picture, but I'm actually trying to take a vertical picture, and vice versa.

For those images, the exif information is technically correct, but actually wrong.

EXIF info is often missing or wrong. Photos that have been passed around, going through various services, apps, scanners, screenshots, etc. along the way, frequently have their EXIF info stripped (e.g., for privacy when exported or uploaded), written incorrectly originally (e.g., you scan an old B&W print, put it in sideways to fit in the scanner, and the scanner invisibly includes EXIF data assuming it was correctly oriented), or rewritten incorrectly (e.g., stripped EXIF is replaced with default orientation).

If exif information is incorrect, that's just programmer error at the firmware/library level.

exif is a bit of a swamp, and may well be missing entirely.

I've encountered plenty of pictures with incorrect EXIF orientation, mostly caused by holding the phone at a weird angle.

For example, if I hold the phone parallel with the ground (aimed at the floor), what is the correct EXIF orientation?

JPEG lossless rotation can only be done if the dimensions (height and width in px) of the image are multiples of MCU (typically 8x8 for 4:4:4 or 16x16 for 4:2:0).
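As a quick sanity check, the rule as stated above can be written as a one-liner. This is the conservative version (both dimensions must be MCU multiples); the function name and the subsampling table are illustrative, and real tools may be more permissive for particular transforms.

```python
# MCU sizes in pixels (width, height) for the two subsampling modes mentioned above.
MCU_SIZES = {"4:4:4": (8, 8), "4:2:0": (16, 16)}


def lossless_rotation_possible(width, height, subsampling="4:2:0"):
    """True if both dimensions are whole multiples of the MCU size."""
    mcu_w, mcu_h = MCU_SIZES[subsampling]
    return width % mcu_w == 0 and height % mcu_h == 0


print(lossless_rotation_possible(4032, 3024))            # both divisible by 16
print(lossless_rotation_possible(1920, 1080))            # 1080 is not a multiple of 16
print(lossless_rotation_possible(1920, 1080, "4:4:4"))   # but it is a multiple of 8
```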

For the uninitiated, MCU is not Marvel Cinematic Universe as the search results might try to confuse you. It's Minimum Coded Units, and a nice explanation is over here:


Edit: the link also closed a forgotten open loop for me about the meaning of an annoying error message from the Windows XP era. Thanks for making me look it up!

ImpulseAdventure is THE place when I look for reference related to anything about JPEG.

He also has a very useful utility (I normally use it to check DCT table) called JPEGSnoop [1].

[1] https://www.impulseadventure.com/photo/jpeg-snoop.html

Sure, that's a good enough excuse. I look forward to defeating government facial recognition algorithms by holding my head at an angle.

Better (in this case) would be to use algorithms that take advantage of rotational symmetry. What do you do when the phone is at 45 degrees?

Also known as "programmer error".

What value do you think this comment contributed? The entire article clearly describes it as a common mistake and is trying to raise awareness so people stop repeating it.

I think it contributed the warning that the title is clickbait. It's very clear from the discussion in the thread that people who've just read the title are discussing computer vision and machine learning strategies to solve this problem, when the article describes an almost elementary programming issue.


Nonsense. Image recognition engines are very capable of detection even at the most extreme angles. Sure, it won’t rotate the image for you, but it will certainly tell you there is a goose in it. What tricks image recognition technology more than anything is lighting.

Let me clarify and say Google's IR engine is capable. I actually ran the first set of tests developed around image recognition technology. I worked at Neven Vision, which got acquired by Google in 2006.

This article is all wrong.

At the core, there is something that is converting this jpeg or misc encoded data into raw encoded data, and this process MUST account for the orientation.

Either the app is reading the image, converting it before passing it to the CV/ML/AI library, and this conversion step needs to respect this tag, and either transfer the tag or apply it to the transformed object; OR, the CV/ML/AI library is getting in encoded image data, and it needs to check for this tag.

Those are the two options, either the CV/ML/AI library sees the tag, and should consider it, or it doesn't, and the library that is stripping it away shouldn't be doing that

This article helps people use the libraries they have, and it does so correctly. If you want to fix the libraries, submit a PR, don't complain about this article.
