Real-world dynamic programming: seam carving (avikdas.com)
412 points by akdas on June 26, 2019 | hide | past | favorite | 66 comments

I once held a workshop where people implemented this (and other image algorithms). The end result can be seen here [0], and the tasks here [1] (but in Norwegian). You can click the "run seam carving" button to watch it unfold step by step.

[0]: https://matsemann.github.io/image-workshop/ [1]: https://github.com/Matsemann/image-workshop

This was a homework assignment in Robert Sedgewick's Algorithms course on Coursera!


Great class and fun assignment!

Wasn't the first assignment about detecting percolation using the union-find data structure? IIRC, Sedgewick's class doesn't cover dynamic programming directly at all.

Seam carving is the second assignment of Algorithms Part II. I also highly recommend Princeton's Algorithms course on Coursera.

Ah, I haven't taken Part II of the course. Probably not a bad idea to start :)

It feels like those images of the energy could have been improved by plotting the log of the energy. It would allow us to see changes at the low end where the decisions are being made.

I'm toying around learning the algorithm. Thought I might share a log plot of the energy. https://imgur.com/a/YlrS7aV
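For anyone who wants to try the same visualization, here's a minimal sketch (illustrative only: the image is random noise, and the dual-gradient energy is one common choice, not necessarily the article's exact formula):

```python
import numpy as np

# Hypothetical sketch: dual-gradient energy of a grayscale image, taken
# to a log scale so the low-energy regions (where the seam decisions are
# actually made) stay visible. Random noise stands in for a real image.
img = np.random.rand(64, 64)

# Finite-difference gradients along each axis.
dy, dx = np.gradient(img)
energy = dx**2 + dy**2

# log1p avoids log(0) in perfectly flat regions.
log_energy = np.log1p(energy)
print(log_energy.shape)  # (64, 64)
```

Feeding `log_energy` to your plotting library of choice instead of `energy` is the only change needed.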

This brings to mind the Viterbi algorithm, which calculates the most likely sequence of events in a hidden Markov model (a Markov model itself is just a state machine on a graph with weighted edges, where each weight on a state transition represents the probability of that transition).

It's essentially the same algorithm: you can eliminate any sequence that leads to the same state as one already in the solution but with lower probability. The number of paths would be exponential, but the ability to eliminate them keeps it polynomial. That really brings out the beauty of dynamic programming.
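To make the elimination concrete, here's a toy Viterbi sketch (the two-state weather model and every probability in it are illustrative, not from any real dataset):

```python
# Minimal Viterbi sketch. At each step we keep only the best-probability
# path ending in each state -- the elimination described above -- so the
# work is O(T * S^2) instead of exponential in the number of steps T.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Keep only the most probable predecessor of state s.
            p, path = max(
                (best[prev][0] * trans_p[prev][s] * emit_p[s][o], best[prev][1])
                for prev in states
            )
            new_best[s] = (p, path + [s])
        best = new_best
    return max(best.values())  # (probability, most likely state sequence)

# Toy two-state weather model with umbrella observations.
states = ["rain", "sun"]
start_p = {"rain": 0.5, "sun": 0.5}
trans_p = {"rain": {"rain": 0.7, "sun": 0.3}, "sun": {"rain": 0.3, "sun": 0.7}}
emit_p = {"rain": {"umbrella": 0.9, "none": 0.1},
          "sun": {"umbrella": 0.2, "none": 0.8}}
prob, path = viterbi(["umbrella", "umbrella", "none"], states,
                     start_p, trans_p, emit_p)
print(path)  # ['rain', 'rain', 'sun']
```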

I just literally published an article on that yesterday :) https://avikdas.com/2019/06/24/dynamic-programming-for-machi...

Agreed, it's basically the same algorithm.

One really nice application of the Viterbi algorithm is map-matching, i.e. taking a GPS trail and "snapping" it to the road network to determine the actual route taken. It's difficult because GPS trails are often sparse and noisy, so you can't do something simple like "project onto the closest road segment". If you take the "states" of the model to be road segments and derive the transition probabilities from the shortest route between segments, you can apply Viterbi and get a very accurate result most of the time. Of course, calculating shortest routes involves Dijkstra, another famous dynamic programming algorithm.

Cool! Has anyone written this up and made cool visualizations? It's one of those many things I'd like to do but don't have time for :)

Two of the more well-known routing engines have this capability and have written about it on their blogs with some visualisations:

- OSRM (the main routing engine behind openstreetmap.org) https://blog.mapbox.com/matching-gps-traces-to-a-map-7373019...

- Graphhopper https://www.graphhopper.com/blog/2016/08/08/releasing-the-ma...

Unfortunately it's not a simple project to get started on because you already need to have a way to import road network data (e.g. from openstreetmap), build an internal graph model, do some geographic indexing and perform fast many-to-many routing calculations.

Yep - look up 'Hidden Markov Map Matching through Noise and Sparseness' by Newson and Krumm.

It calculates the final probability by combining 'emission' probabilities (the probability that a GPS observation came from a particular road) with 'transition' probabilities (given that one observation was on a particular road, the probability that the next is on some other road segment). By combining these two, the final probability incorporates both the nearness of the GPS signals to the roads and the connectivity of the road network itself.
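A hedged sketch of how those two terms combine, using the functional forms from the paper but working in log space (sigma and beta are tuning parameters; all numbers below are made up for illustration, not the paper's calibrated values):

```python
# Illustrative map-matching scores. Emission: Gaussian in the distance
# from the GPS fix to the candidate road. Transition: exponential in the
# difference between the straight-line and on-network distances between
# consecutive fixes. Normalisation constants are dropped since we only
# compare candidates.

SIGMA = 5.0   # assumed GPS noise (metres)
BETA = 2.0    # assumed transition tuning parameter (metres)

def emission_log_prob(gps_to_road_m):
    # Nearer candidate roads are more likely.
    return -0.5 * (gps_to_road_m / SIGMA) ** 2

def transition_log_prob(great_circle_m, route_m):
    # Direct routes (route distance close to straight-line distance)
    # are more likely than roundabout ones.
    return -abs(route_m - great_circle_m) / BETA

# A candidate close to the fix and reachable by a direct route scores
# higher than a distant candidate needing a big detour.
direct = emission_log_prob(3.0) + transition_log_prob(100.0, 105.0)
detour = emission_log_prob(20.0) + transition_log_prob(100.0, 400.0)
print(direct > detour)  # True
```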

I've found the formulas applied in this paper are good in practice only if the GPS updates are relatively frequent.

I've also found that it can be tricky to map "convoluted" routes that don't necessarily go simply from point A to point B.

If you don't mind me asking, roughly what frequency threshold have you found the algorithm to perform badly above, and are you aware of any algorithms or formulae which perform better in these situations?

It should be in seconds. The problem is that the paper assumes the 'great circle' (straight-line) distance between two points is almost the same as the 'route' distance between them, with an exponential probability distribution.

This means that if the path between two points is not simple (e.g. it goes around a corner), the probability drops off very quickly. If the time between measurements is in minutes, this heuristic is pretty useless (and you should really use a log scale for your numbers!)

edit: this is actually shown in figure 8 of the paper where they explore different 'sampling periods'

edit 2: I have not explored other methods yet, but it would probably make sense to start by deriving the formula the way they do, by exploring ground-truth data.

edit 3: I just noticed that my comments are largely repeating what you're saying - sorry!

Ah, that rings a bell now. You can vary a parameter they call "beta" to allow for more convoluted routes, and I think a larger value gives a little leeway for less frequent fixes.

Agreed, the log scale is really important to avoid arithmetic underflow =] I believe OSRM and Graphhopper both do it that way. In my implementation I've flipped from thinking of measurement/transition "probabilities" to "disparities", and I choose the final route that has the least disparity from the trail. It seems to handle trails with around a 30-60s frequency over a 5-10hr period with decent accuracy.

Actually, beta is less useful than that! I think it represents the median difference between the two distances; it's not a tolerance (at least as far as I can recall after experimenting with tuning this value).

As with you, I have found that it still often gives OK results with slower frequencies, as long as the transition probabilities are on roughly the same scale as each other for a particular observation pair. However, it means that there's no point trying to 'tune' using the gamma and beta parameters.

A lot of problems on sequences are amenable to dynamic programming. Two other famous ones are Smith-Waterman and Needleman-Wunsch for finding optimal local and global alignments of sequences, a typical problem in genetics.

The Levenshtein distance, which finds the fewest edits needed to transform one string into another, is also a classic example.
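A quick sketch of that DP in its classic row-by-row form:

```python
# Levenshtein distance: dist[i][j] is the fewest edits (insert, delete,
# substitute) turning the first i characters of `a` into the first j
# characters of `b`. Keeping only one row at a time mirrors the
# seam-carving trick of only needing the previous line of the table.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```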

There's also the approach that calculates the energy of the resulting image as opposed to the seam being removed, which allows the seams to pass through objects where doing so will minimize artifacts.

Paper: http://www.faculty.idc.ac.il/arik/SCWeb/vidret/index.html

GitHub: https://github.com/axu2/improved-seam-carving

Thanks for sharing - it's kind of sad that nobody ever seems to know about the significantly better forward energy version. Especially since it's such a minor tweak.

Cool to see this popping up again. It always impresses if you haven't seen it before and is a cool algorithm to work through.

The original paper was discussed on slashdot and back at that time I was inspired to build a little GUI around an open source algorithm implementation to play with my Qt skills.

It allows you to shrink, expand and "mask out" regions you don't want touch etc.

Still available on Google Code archive:


If you like DP imaging applications like this, this old Microsoft Research technical report is neat: it uses DP to merge frames from two webcams placed left and right to synthesize a view in the middle, like having a webcam in the middle of your monitor. The DP is interesting because it has penalties set up assuming planar content, because faces are pretty flat and in front of the cameras. Link: https://www.microsoft.com/en-us/research/publication/efficie...

Why are DP problems so popular for interviews? I am doing leetcode now and they seem to be everywhere.

I'd honestly just walk out if a company asks this stuff, yet the actual work is maintaining a CRUD app.

They are a very good test of whether the programmer can solve complex problems (while still being simple enough to solve in an interview setting).

I remember implementing this for a class years ago, and then the professor suggested doing the inverse to try to expand the image width. The idea was you would duplicate the lowest energy seam... but all that did was create a lot of repeats of the same seam.

I never did finish that weird idea, but I probably needed to try something like increasing the energy of the chosen seam (and its duplicate)... I may try that again, just because I'm curious what would happen.

The original seam carving paper discussed expanding too: https://youtu.be/6NcIJXTlugc?t=56


> Figure 8: Seam insertion: finding and inserting the optimum seam on an enlarged image will most likely insert the same seam again and again as in (b). Inserting the seams in order of removal (c) achieves the desired 50% enlargement (d). Using two steps of seam insertions of 50% in (f) achieves better results than scaling (e). In (g), a close view of the seams inserted to expand figure 6 is shown.

That's actually done in the paper! You basically choose the lowest energy seam, duplicate it, then blacklist it and duplicate the next lowest energy seam (to prevent repeatedly duplicating the same seam). The results are quite good.

The approach from the original paper is to remove seams like you are decreasing the size, which provides you with a set of exclusive seams that can be added to the original image (with some mapping logic). This produces an output without repeating the same seam repeatedly.

This is also known as "liquid rescale", and there is (was?) a GIMP plugin for it. It was last updated around 2013. After that, the developer was hired by Adobe to work on Photoshop.

Yes, Gimp had it as a plugin first - before Photoshop.

Great. Now adapt this to video (:

A friend and I once tried this at a hackathon. If you do it naively like we did, frame-by-frame, you will find that the optimal seam to carve shifts around a lot as the lighting changes subtly.

This results in characters in frame apparently moonwalking around. It's a really cool effect, but not good for actually resizing a film. Might make for a cool bit of a music video.

   Might make for a cool bit of a music video
I was thinking the same!

What if we also had an energy term for the difference in interframe x-position along the seam?

Or maybe doing histogram equalisation for pairs of adjacent frames before the seam cutting might help.

This same algorithm could be applied in three dimensions (X, Y, time) such that the seam connects along adjacent pixels through time as well as the Y axis. The seam here would be a 2D plane, rather than a 1D line as it is for a single image.

If you visualize a video as a stack of frames, then the seam would cut through the stack, like cutting a cake with a knife, following the minimum-energy path through the stack.

You'd do this by modifying the recurrence relation to add a term for the energy of pixels in the previous frame as well as the previous line in the current frame.

I don't think it's necessarily so simple, and I think the multi-frame case blows up the complexity of the algorithm.

In the single frame case, each pixel is a 0 dimensional point, and for each pixel, you evaluate the 3 adjacent pixels in the row above. Then to find the seam you just pick the lowest energy pixel at the edge of the frame and follow the thread up. So the total runtime of this algorithm is O(pixels).
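For concreteness, the single-frame DP just described can be sketched like this (random energy stands in for a real image; this is illustrative, not the article's code):

```python
import numpy as np

# cost[y][x] = cheapest vertical seam ending at (y, x), built from the
# three neighbours in the row above, then recovered by backtracking
# from the cheapest pixel on the bottom edge.

def find_seam(energy):
    h, w = energy.shape
    cost = energy.copy()
    for y in range(1, h):
        for x in range(w):
            lo, hi = max(x - 1, 0), min(x + 2, w)
            cost[y, x] += cost[y - 1, lo:hi].min()
    # Backtrack: pick the lowest-energy bottom pixel, follow the thread up.
    seam = [int(cost[-1].argmin())]
    for y in range(h - 2, -1, -1):
        x = seam[-1]
        lo, hi = max(x - 1, 0), min(x + 2, w)
        seam.append(lo + int(cost[y, lo:hi].argmin()))
    return seam[::-1]  # one x-coordinate per row, top to bottom

energy = np.random.rand(6, 8)
print(find_seam(energy))
```

Each pixel does constant work (three comparisons), so the whole thing is O(pixels), as the parent says.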

In the multi-frame case, if you want the video to be totally smooth, you have to think higher-dimensionally.

In the multi frame case, each seam is a 1 dimensional list of pixels, and for each seam, you evaluate the $HUGE_NUMBER of adjacent seams in the previous frame.

That is, the 2D case's runtime is proportional to the image height times the number of adjacent pixels, which is a small constant (3). In the 3D case, the runtime is proportional to the number of frames times the number of adjacent seams, which is massive.

I can imagine some heuristic optimizations that would allow you to guide your search, though. For instance, you could significantly downsample the video in both pixels and framerate and solve that, and then use that low-resolution solution to constrain and prune an approximate high-resolution search.

I think this is the correct way to conceptualize generalizing this algorithm to moving video. I also think it's a fundamentally bad approach to resizing video.

The problem is that humans watching the video will build a 3D mental model of the scene; any transform that modifies multiple frames must do so in such a way as to maintain global spatial consistency, or it will look odd. Seam carving is too primitive to do this. You will have shots where (for instance) someone is walking obliquely away from a building towards the camera, and yet making no headway because the algorithm is carving away the (low energy, yet perceptually indispensable) pixels that separate them - the result will be that they appear to be walking in place (and growing).

I have to say, that sounds like it might be an awesome effect!

Apparently seam carving is already used to produce surreal memes:


That would get pretty funky when the scene changes.

For what it's worth, that was exactly what they did:


See this comment in this thread:


(HN really should make it easier to make "anchor" links, so people don't have to load a new page just to see a sub-thread. That probably took me almost 5 minutes, which is like a quarter of the "anti-procrastination" budget with the default settings. It should have taken 10 seconds.)

I noticed they don't provide any images with human faces.

Since the human visual system is so highly sensitive to faces, I think the best approach here would be to apply a facial detection algorithm, then boost the energy for the regions where faces are detected.

Basically you'd apply a heuristic that because faces are so special to the human visual system, the perceived energy of a face is higher than the pixels would otherwise indicate.

This would make seams avoid altering any faces in the scene.
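A sketch of that idea, assuming face bounding boxes come from some external detector (the boxes and the penalty constant below are made up for illustration):

```python
import numpy as np

# Add a large constant to the energy map inside each detected face box,
# so the DP routes seams around those regions.

FACE_PENALTY = 1e6  # assumed to dwarf typical gradient energies

def boost_faces(energy, face_boxes):
    boosted = energy.copy()
    for (x, y, w, h) in face_boxes:  # (left, top, width, height)
        boosted[y:y + h, x:x + w] += FACE_PENALTY
    return boosted

energy = np.random.rand(100, 100)
# Pretend a detector reported one face at (30, 20), 25x25 pixels.
boosted = boost_faces(energy, [(30, 20, 25, 25)])
print(boosted[25, 40] > 1e5)  # True: inside the box is heavily penalised
```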

> apply a facial detection algorithm, then boost the energy for the regions where faces are detected.

Great intuition! The original paper actually goes into this, and they come up with that solution.

This solution is a special case of allowing the user to apply positive and negative penalties as they wish. The latter allows targeted object removal.

The algorithm definitely has issues with faces.

One part I didn't get into is artificial weights. The original paper discusses "painting" certain regions of the image as high energy, driving seams away from those regions. This ties into faces when the paper suggests face detection to automatically paint faces with high energy, avoiding these issues in the first place!

The basic algorithm doesn't work well with faces, but you can use facial recognition to add a large penalty to carving through regions with faces. So it can be extended pretty easily to handle this.

Probably intentional, since the seam carving algorithm has a tendency to make faces look distorted: http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15463-f1...

"Human faces? You can't handle human faces!"


Any good set of super hard dynamic programming problems to practice? (I mean way harder than leetcode/hackerrank etc.)

You can look at competitive programming sites; they are a lot more serious than leetcode/hackerrank.



Hackerrank has some fairly no-joke problems on it. Is https://www.hackerrank.com/challenges/decibinary-numbers/pro... too easy? I enjoyed that one a while back.

I think I have a list of curated DP problems bookmarked somewhere, I'll see if I can track it down.

Energy markets used to use Lagrangian Relaxation and Dynamic Programming to find the least-cost dispatch. Some still do, but the larger ones (all the ones in the US) use Mixed-Integer Linear Programming and Linear Programming, since those engines can handle much larger problems and still solve them.

The problem is that you'd need to get your hands on a large dataset and none of the true ones are public. I think some of the national labs have synthetic models of various sizes to play with. Seriously cool field.

Really interesting, thanks for sharing!

I believe there's a typo here: "the time complexity would still be 𝑂(𝑊)" should be "the space complexity would still be 𝑂(𝑊)".

Thanks so much for pointing that out! I've updated the article.
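For readers wondering why O(W) space suffices: each DP row depends only on the row above, so a sketch like this (illustrative, not the article's code) gets by with W-length buffers, as long as you only need the minimum seam cost rather than the seam itself:

```python
# Rolling-row seam-cost DP: replace the full H x W cost table with a
# single row that is rebuilt as we sweep downward. Recovering the seam
# path would need the full table (or a second pass), but the minimum
# cost alone needs only O(W) space.

def min_seam_cost(energy):  # energy: list of H rows, each W wide
    prev = energy[0][:]
    for row in energy[1:]:
        prev = [
            row[x] + min(prev[max(x - 1, 0):x + 2])
            for x in range(len(row))
        ]
    return min(prev)

print(min_seam_cost([[1, 2, 3],
                     [4, 1, 5],
                     [6, 2, 1]]))  # 3 (the seam 1 -> 1 -> 1)
```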

Tangential - the turbulent water looks for me like the large scale structure of the Universe.

Isn't there a more generic deep-learning approach (i.e. one with fewer assumptions) to this problem?

It wouldn't really have fewer assumptions. In fact, it probably would have more. We just wouldn't know what they are. Classical image analysis is still interesting and valuable because you can construct an algorithm based on desired properties without having to have a large, labelled training set beforehand, and because it's computationally much less expensive.

> It wouldn't really have fewer assumptions. In fact, it probably would have more.

It depends on how you look at it. A deep learning approach is supposedly more generic. Therefore I suppose the assumptions would be dynamic instead of fixed.

Assumptions are NOT dynamic. Once we have chosen a "loss" function (or whatever fancy name you want to give the objective function), you've already made a choice. There are never dynamic assumptions. A classic example: using an L2 loss in pixel space essentially assumes a Gaussian likelihood, which is in principle kind of goofy, but hey, it works. Although, as alluded to earlier, it is hard to understand the space induced by the architectural assumptions (and many other moving parts).

I like to think it this way - effectively, deep learning provides learned priors from data for a downstream task whereas the manual way comes from expert knowledge without the learning part.

There are plenty of fixed assumptions within deep learning. Off the top of my head: (1) Loss function (2) Pooling layers (which hard code invariances)
