Image Background Removal (lyst.com)
308 points by Peroni on Mar 4, 2014 | 74 comments



Frankly, I'm not very impressed. This is a very well-known problem in image processing. People usually call it background subtraction or, in its more general form, image segmentation.

There are many algorithms in this field that work really well. A Google search of the major research conferences (ICCV, ICIP, SIGGRAPH, etc.) will give you the latest and greatest of these algorithms. You'll also find good (image segmentation) work if you limit the search to csail.mit.edu.

If you want your uploader to help you, you can also go for one of the supervised image segmentation algorithms. Otherwise you'll need an unsupervised algorithm.

Given that the authors also want to detect what is in the image, this might be helpful for them:

http://people.csail.mit.edu/mrub/papers/ObjectDiscovery-cvpr...

This guy is also doing some great work in that field:

http://www.engr.uconn.edu/~cmli/


My thinking exactly.

The really crazy thing is that they seem to be reinventing the wheel so that they can lean on the optimized code a graphics library provides. I think that's flawed thinking to begin with. In my experience with graphics programming, a reasonably optimized direct implementation tends to beat the hell out of a nine-step filter chain, no matter how good your graphics library. Even if they used the same algorithm, a direct implementation could condense steps 2, 3, and 4 into a single convolution [edit: whoops, no you can't; Sobel has a non-convolution step].

But more importantly, they've tied their hands by limiting themselves to a tiny set of operations. Combing the computer vision literature would have been a better use of time than trying to chain filters together.

In fact, some background subtraction algorithms effectively do as an intermediate step what Lyst wanted as an end result.


Background subtraction operates on video streams, not still images, so this is incorrect.


"Image segmentation" is usually the term for doing this with still images, and OpenCV provides a couple of functions for it. They're not perfect, but they're probably more effective than what this article describes.


I agree. I was pointing out that background subtraction is not the same task.


okay, this is image segmentation that attempts to estimate and remove the 'background' segments using some domain specific assumptions. This is generally referred to, in common language, as 'background subtraction'.


I will not debate common language use of 'background subtraction' because I have not discussed it with laypeople before, but what I can say is that if you were trying to implement what the article is trying to implement and were trying to find relevant literature, then searching for background subtraction would not turn up anything useful.


> The really crazy thing is that they seem to be reinventing the wheel so that they can lean on the optimized code a graphics library provides. I think that's flawed thinking to begin with.

Actually I disagree with you, even though what you say is correct in principle. On any given day, the vast majority of my workload revolves around working with other people’s code. So I may spend 7 hours trying to get a new library to compile or figure out why I’m getting an exception or how to get out of dependency hell and only 1 hour getting “real work done”.

For me, it’s exhausting to find a new library or API that loosely does what I need, only to find that I have to install a new language, new framework, new compiler, or even new package manager to use it. Developers have a tendency to copy other developers (even when the “normal” way of doing things is not ideal), so many libraries have no binary that I can test, and in fact no example of usage or up to date documentation.

Then there are subtleties with new libraries such as speed or memory usage. So perhaps a library that does exactly what you want runs at 1 frame every N seconds while the highly optimized function in a mainstream graphics library runs at many hundreds or even thousands of frames per second by utilizing concurrency or the graphics card.

So in fact when it’s all said and done, I tend to think more in transformations. I ask myself if I can express a solution slightly differently if it allows me to use an existing tool, then encapsulate it in a black box that has the same inputs and outputs as my ideal solution. Then my frustration is that other developers don’t seem to think this way. “Make one tool that does one thing well” has become the mantra and driven us into this fragmented ecosystem.

This is just an aside, but: The only truly general purpose language that has a syntax that doesn’t make me want to club myself over the head is probably MATLAB, but unfortunately their licenses are too expensive for me. So I have high hopes for Octave, and after that, maybe NumPy. So maybe one point we could agree on is that we shouldn’t need a graphics “library” in the first place. If we had a good mainstream concurrent language, then many of these algorithms become one paragraph code snippets and run with speed comparable to C or Java.

Edit: I wanted to give a concrete example. Low level languages like C are overly verbose in their implementations, by 10 or 100 times usually, because they try to leverage the wrong metaphors. Notice how compactly concepts like image compression can be expressed with the right ones:

http://www.mathworks.com/help/images/discrete-cosine-transfo...


Yes, reinventing the wheel is okay when you have a simple problem. But the article describes a complicated problem, and the presented solution delivers pretty poor results. In this specific case, I am pretty sure that looking for an existing solution would be the wiser choice. I saw two separate solutions to this problem at a recent graphics conference alone.

(Tip: if you have trouble compiling something you found on Github, contact the author and offer $200 to walk you through the installation. Might save a lot of frustration)


I agree that using optimized libraries is the better choice when those libraries do something close to what you are trying to do, but this algorithm is Rube Goldberg-esque.

Also, OpenCV includes functions for image segmentation. If they couldn't use that, it would have been nice for the article to at least touch on why.


> but this algorithm is Rube Goldberg-esque

I think that's being a bit unkind; it's simply clear that the authors are not experienced with computer vision and/or image processing in general. They approached the problem from their domain and found the solution that worked for them. If I had gotten to the point that I wanted to do a domain-specific image segmentation heuristic, I would certainly build it up from a series of filters and image morphology steps. If the performance (both accuracy and speed) was satisfactory, I don't know why you would invest in optimization at that point. Also, from experience I know that implementing algorithms from papers is a slow and often painful process, as the original authors generally don't publish reference code, and if they do it's probably in Matlab. And if they don't have someone adept at image processing, their potential for success would likely be low.

> Also, OpenCV includes functions for image segmentation. If they couldn't use that, it would have been nice for the article to at least touch on why.

I agree that a quick "Here's what we tried that's readily available" would have been good for others to learn from, because for many the general solutions would be sufficient.


There will be follow-up posts addressing many of the valid points raised in these comments.


Everything I can quickly think of in OpenCV would tend to include too much background without user interaction unless similar transformations were performed anyway, so I assume if they had thought of it, that's why they ignored OpenCV.


You should give Node.js and npm a shot. It's one of the best examples of practical modularization. It's very easy and quick to test a new package, and while the documentation is often rough, the main readme on GitHub almost always has an example to show what you want to do. (Obviously Node.js is only suited for certain applications, and it's not great with graphics in particular).


I work in the same market as Lyst, and can say for a fact that the variation in the images we get from clients, even from renowned fashion houses, is huge. By variation I mean compression, quality, clarity of the subject, and the list just goes on...

A good algorithm would surely solve part of the problem, but given how non-uniform the imagery is, a great deal of resources (engineering skills + computing power) would likely be required. Given Lyst's recent VC rounds maybe they can afford that, but I surely cannot.

So I went for the poor-man option: https://gist.github.com/mvsantos/5554663

(I'm definitely bookmarking the link you suggested and also the one suggested by sjtgraham.)


Funny - I'm in the same domain and I have found that product images are generally high quality and somewhat uniform. You can usually find the product placed front and centre and with few distractions (Most images have either white or light grey background).


Renowned fashion houses overall do a great job, except when they add gradient or shadows to mask stuff. But the bulk of the problem, in my case, comes from large department stores when they blend furniture or accessories in fashion photoshoots making them complex and distracting from the main image subject.


The negation step seems superfluous. The edge detection filter can't detect edges in a regular photo but can in a negative?


The reason given is that the algorithm (Sobel) looks for light-to-dark transitions. There are two such transitions in the positive image: one from the background to the edge, and another from the edge into the object. Negating the object leaves only one light-to-dark gradient on the edge of the object.


Yeah, but that reason is nonsense. The Sobel operator uses convolution to approximate the magnitude of the gradient. It doesn't "look for" anything.

And since it's the magnitude of the gradient that Sobel finds, Sobel(Invert(img)) is mathematically equivalent to Sobel(img). The invert step is essentially an expensive noop.
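
As a quick sanity check of that claim, here is a minimal pure-Python sketch. It is not the article's GraphicsMagick pipeline; the helper names `sobel_magnitude` and `invert` and the 5x5 test image are made up for illustration. It computes the Sobel gradient magnitude of a small image and of its negative, then compares the two:

```python
# Sobel kernels for the horizontal and vertical derivatives.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(img):
    """Gradient magnitude at each interior pixel of a 2D list of numbers."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(1, h - 1):
        row = []
        for x in range(1, w - 1):
            # Apply each 3x3 kernel to the neighbourhood of (y, x).
            gx = sum(KX[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            gy = sum(KY[j][i] * img[y + j - 1][x + i - 1]
                     for j in range(3) for i in range(3))
            row.append((gx * gx + gy * gy) ** 0.5)
        out.append(row)
    return out


def invert(img, white=255):
    """Photographic negative of an 8-bit greyscale image."""
    return [[white - p for p in row] for row in img]


# Hypothetical test image: a bright square with a darker centre.
image = [
    [10, 10, 10, 10, 10],
    [10, 200, 200, 200, 10],
    [10, 200, 90, 200, 10],
    [10, 200, 200, 200, 10],
    [10, 10, 10, 10, 10],
]

print(sobel_magnitude(image) == sobel_magnitude(invert(image)))  # True
```

The kernels sum to zero, so inverting the image only flips the sign of each directional response and the magnitude is unchanged. A filter that clamps the signed responses before combining them could behave differently; this sketch does not model that.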


I don't think background subtraction means what you think it does. Background subtraction refers to the subtraction of neighboring frames of a video sequence in order to find moving objects in a video sequence. So, yes segmentation is a more general and difficult problem than background subtraction, but it is in no way relevant to the task described in the article.


Image segmentation is exactly what they are trying to do. You can quibble about the exact definition of "background subtraction", but it doesn't change the fact that they are reinventing the wheel over a solved problem.


Right, image segmentation is the correct terminology. Background subtraction is incorrect. I don't think it's quibbling to point that out.


The corpus of product images at Lyst seems to be quite diverse. How many of those algorithms are good at very generalised segmentation, i.e. equally good at segmenting boat shoes on decking as they are say a tank in a field? What are your favourite papers on the subject?


One of the coolest approaches I've seen does some cool inference on fully connected Conditional Random Fields via high-dimensional filtering. Amazing results.

http://vladlen.info/publications/efficient-inference-in-full...


at least get your facts and terminology straight before saying things like this.


This was posted on HN a while back, really impressive: http://clippingmagic.com/


I've tried it with the sneakers sample but the results are disappointing. It had a hard time differentiating between the brown of the background and the welt.


I thought it worked ok? http://clippingmagic.com/images/4953782/public/86ede827aa2fa...

You need a couple of extra markings, but that's life on images with background colors close to the foreground colors.

(I'm one of the creators of clippingmagic.com)


Image segmentation is one of the most studied problems in computer vision. A really good lecture series on this topic is by Prof. Daniel Cremers: http://www.youtube.com/watch?v=fpw26tpHGr8. In one of the talks, he describes methods that were popular in the '80s and that are pretty much what the author developed.


Sounds like a saliency detection algorithm would be appropriate as a pre-processing step prior to edge detection or thresholding. The choice of algorithm may depend on the image characteristics; e.g., [1] utilizes a global center-surround and works well for large and distinct foregrounds, while [2] uses an FFT and works better for noisier backgrounds. There is a lot of literature on saliency detection, segmentation, etc.

[1] http://infoscience.epfl.ch/record/135217/files/1708.pdf

[2] http://www.klab.caltech.edu/~xhou/papers/cvpr07.pdf


This might be of interest to people looking for a more human-guided approach, where rough edges are defined by a human:

http://www.alphamatting.com/


Came here to post the same thing (current researcher in this field). Although you could actually automate the labelling by building a classifier for the type of object you wish to extract.


Another automated background removal method, and the one that worked best for me on complex backgrounds, was the Graphcut algorithm by Microsoft Research. It gives the best results and has an implementation in OpenCV.


Yep, that works remarkably well.

I think it is called GrabCut - http://research.microsoft.com/en-us/um/cambridge/projects/vi...


The advantage of GrabCut compared to the parent article is that it takes both color difference and similarity into account rather than just difference. This is done by combining all that information into a very cleverly-laid-out Markov Random Field where making an energy-minimizing graph cut roughly corresponds to finding a better pixel classification.

The disadvantage compared to the parent is that it's a semi-supervised algorithm: it requires a bounding box to be drawn around the desired image. If you had a bunch of very similar images you could pre-generate the Gaussian Mixture Model that GrabCut expects and turn it into a supervised learning algorithm, but then you'd lose the major innovation of GrabCut compared to the graph cut methods discovered before it: you would not be able to re-run with the newest "best-guess" and be guaranteed that the energy would monotonically decrease. It would also fail spectacularly if it encountered something it hadn't seen before.


Another thing to consider is that GrabCut is patent-encumbered for commercial use -- it relies on the mechanism described in Zabih, Boykov, and Veksler's "Fast Approximate Energy Minimization via Graph Cuts," which has been patented in the US since 2004 (Patent No. 6,744,923).


I'm still looking for a white to transparent filter. This program fails to do that properly, if I look at the shadow in the transparent examples with the boot.

Any recommendations other than keeping Photoshop CS5 around just to use KillWhite [1]? (Why Adobe axed Pixel Bender[2] is still just beyond me.)

MathMap [3] for GIMP looks promising for creating pixel filters, but I haven't tried it yet. Being able to do this on the command line would be nice, but isn't a requirement.

[1]http://mikes3d.com/extra/scripting-plugins/killwhite/

[2]http://www.adobe.com/devnet/pixelbender.html

[3]http://www.complang.tuwien.ac.at/schani/mathmap/


In standard gimp this functionality is available in Colors - Color to Alpha. Maybe I'm misunderstanding what you're looking for.


Thanks for the hint, I just tried it and it's not behaving as I need it to.

Expected behavior in HSL color space would be that anything with the same Hue and Saturation value (with optional adjustable thresholds) as a selected color gets assigned an alpha value of 1 - lightness, where a lightness of 1 is white and an alpha of 1 is opaque.

KillWhite seems to do exactly this: http://mikes3d.com/extra/wp-content/uploads/2010/07/apple.pn...

I guess by now, it's easiest for me to re-implement this as a filter for HTML5 2DCanvas.


Yes white (the default, or any color you set) to transparent is definitely 'Colour to alpha'. Works great and you can repeat for multiple colours in the same image.


Imagemagick is good for command-line processing. http://www.imagemagick.org/Usage/masking/#alpha


If you download the Pixel Bender version of the KillWhite plugin, you can read the source. It looks pretty simple to implement; it does just a little work while in the HSV color space.


Just dived into that too, looks pretty straight-forward.

Here is the comparison between KillWhite's method and gimp's color to alpha:

http://i.imgur.com/FGJE97r.png

Note that it adds transparency pretty much everywhere on the apple and alters it too much. (Reflection on apple is gone after blending with new background)


I'm about 98% certain:

Killwhite is identical to Gimp's "Color to Alpha" if you use white as source color.

In their example, it leaves a grey color background with a little alpha. They started with a brighter blue, and it was dimmed by their resulting image.

I'm guessing that in your example, you used the background grey as the input color, so it took the background to clear, then you used a color sample of Killwhite's background to use as the background of yours. ...but I may be wrong.

edit: I am playing with it, and if you use white as the color, then put the result on the white background, it's identical to the original. If you do the same with the grey color, you get the same results. So "Color to Alpha" is the same, it just allows you to pick a color.


They are not identical.

If this [1] is the actual code it looks like it's operating in RGB color space.

Which explains why it extracts the lowest value color component as alpha from every pixel and not just from those with low saturation.

    alpha1 = (1 - a1) / (c1);

This is done three times, for each RGB channel once. c is the selected color (in this case white with c1, c2, c3 = 1)

a are the channels of the current pixel (a1, a2, a3)

Later the highest alpha (lowest transparency) is kept for the output.

Which results in the effect that all highlights, regardless of their saturation, are made transparent.

Meanwhile the core function of KillWhite is executed in HSV color space:

    Alpha = 1 - (Value - Saturation)

Here Saturation reduces transparency! This is not the case in color to alpha.
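
To see where the two formulas quoted above agree and where they diverge, here is a small sketch that evaluates both on a few colours. It transcribes the formulas as written in this comment, not the actual GIMP or KillWhite source; the function names, the sample colours and the clamp to [0, 1] are assumptions.

```python
import colorsys


def color_to_alpha_white(r, g, b):
    """Alpha per the RGB formula above with white selected (c1 = c2 = c3 = 1).

    Each channel yields (1 - a) / 1, and the highest alpha is kept.
    """
    return max(1 - r, 1 - g, 1 - b)


def killwhite_alpha(r, g, b):
    """Alpha = 1 - (Value - Saturation), computed in HSV.

    Clamping the result to [0, 1] is an assumption of this sketch.
    """
    _, s, v = colorsys.rgb_to_hsv(r, g, b)
    return min(1.0, max(0.0, 1 - (v - s)))


# Hypothetical sample colours, channels in the range 0..1.
samples = {
    "white": (1.0, 1.0, 1.0),
    "mid grey": (0.5, 0.5, 0.5),
    "dark red": (0.5, 0.25, 0.25),
}
for name, rgb in samples.items():
    print(name, color_to_alpha_white(*rgb), killwhite_alpha(*rgb))
```

With these formulas, white and neutral greys get the same alpha from both, while the darker saturated red stays fully opaque under the HSV formula but becomes partly transparent under the RGB one.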

[1] https://git.gnome.org/browse/gimp/tree/plug-ins/common/color...

[2] http://stackoverflow.com/a/14915403/731179

if there is need for further discussion I would suggest switching to stackexchange. Adding to

http://superuser.com/questions/348167/how-can-i-remove-all-w... or http://graphicdesign.stackexchange.com/questions/13073/alpha...

or open something new


I am working on a similar problem and came up with similar methodologies. I also used the Canny filter, which worked quite well (http://en.wikipedia.org/wiki/Canny_edge_detector). However, it all broke down with complex patterns in the foreground/background. I'm currently trying to incorporate more structural information about the foreground objects. Learning algorithms are probably another option here? I'm just not sure how to frame the problem for a learning algorithm.


I've worked on this problem quite a bit. The hard thing to accept is that there can be large areas in an image where there isn't a significant difference between the foreground and background. In other words, the raw data doesn't contain enough information to detect an edge there and no segmentation algorithm will find what is not there in the data. Humans see edges "semantically". We know there should be one there and so we "imagine" one. I work on image segmentation and recognition and my code cycles between those two goals. I segment and look for noisy and smooth edges, then I postulate where there should be a smooth edge (for example, if you recognize a smooth edge "parallel" to a noisy edge, check if the noisy one could be smoothed "similar" to the other one). The end result is that I can separate the foreground from the background but I don't try or hope to recover the edge that a human would find. My segmented images have the uncanny valley problem. The edges are correct in some areas and slightly off in other areas. The image looks real and fake :)


I had this issue for a site a few months ago and took the flood fill approach, also making a simpler mode available. The algorithm could use a few tweaks, but it worked quite well for images with only white backgrounds, and it can easily be tweaked to deduce background colors.

https://github.com/PCaponetti/image-background-removal
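
For readers who haven't seen the flood fill approach, a minimal sketch follows. It is not the code from the repository linked above; the function name `remove_white_background`, the `tolerance` parameter and the tiny test image are assumptions made for illustration. The idea is to start from every near-white pixel on the image border and spread through neighbouring near-white pixels, marking them transparent:

```python
from collections import deque


def remove_white_background(pixels, tolerance=10):
    """Return an alpha mask (0 = background, 255 = foreground).

    pixels is a 2D list of (r, g, b) tuples with channels in 0..255.
    Only near-white pixels connected to the border are cleared.
    """
    h, w = len(pixels), len(pixels[0])

    def near_white(y, x):
        return all(255 - c <= tolerance for c in pixels[y][x])

    mask = [[255] * w for _ in range(h)]
    queue = deque()

    # Seed the fill with every near-white pixel on the border.
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and near_white(y, x):
                mask[y][x] = 0
                queue.append((y, x))

    # Breadth-first spread through 4-connected near-white neighbours.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and mask[ny][nx] == 255 and near_white(ny, nx)):
                mask[ny][nx] = 0
                queue.append((ny, nx))
    return mask


W, D = (255, 255, 255), (40, 40, 40)
picture = [
    [W, W, W, W, W],
    [W, D, D, D, W],
    [W, D, W, D, W],  # white highlight enclosed by the product
    [W, D, D, D, W],
    [W, W, W, W, W],
]
for row in remove_white_background(picture):
    print(row)
```

Unlike a global "white to transparent" threshold, the enclosed white pixel in the middle stays opaque because the fill never reaches it from the border.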


There is nothing innovative here. And with the kind of images that they are considering, you could simply use a Canny edge detector and take the outermost contour to be the region of interest. As the previous comments have said, this is a well-tackled problem. They should have done a literature survey.


Usually automated image processing is tough without a priori info about the image, or some human input.

Adding to these methods, a 'holes' fill algorithm can be very helpful. This would probably be used in place of your Alpha Mask.

The process can be run multiple times with different parameters for better results. For example, doing edge detect->threshold->edge detect (with different parameters) can help remove some of the smaller artifacts you see in the images.

If you're able to have some manual input, simply clicking the object of interest or doing a rough outline can be a great addition to an algorithm, especially in avoiding the problem seen in Global Threshold with Complex Background.


There's a really good paper from 2011 which does quite a good job in image segmentation with very noisy background[1]. Their algorithm is biologically motivated and analyses contrast changes in different parts of the image as this is exactly how we separate interesting parts in images from uninteresting parts.

[1] http://vecg.cs.ucl.ac.uk/Projects/SmartGeometry/contrast_sal...


You really need to model what is going on at the edge boundary. For an individual pixel, how much of the contribution is from the background, and how much is from the object? Pixel values are the result of a smoothing operation over each pixel's field of view (which is actually probably a little bit bigger than the image dimensions suggest).

When you zoom into an image it's clear that the edges of objects can affect pixel intensities a few pixels away (blur). So you have to reverse that, which can be tricky. Image people like Matlab, so that's where the premade solution will be found.
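
One common way to write down "how much is background and how much is object" is the compositing model I = alpha * F + (1 - alpha) * B. The comment does not name this model, so treat the sketch below as an assumption; it also assumes the pure foreground colour F and background colour B are already known, which in practice is the hard part. The function name `estimate_alpha` and the sample colours are hypothetical.

```python
def estimate_alpha(observed, foreground, background):
    """Least-squares alpha for one pixel under I = alpha*F + (1 - alpha)*B.

    All three arguments are (r, g, b) tuples. The result is clamped to [0, 1].
    """
    diff_fb = [f - b for f, b in zip(foreground, background)]
    diff_ib = [i - b for i, b in zip(observed, background)]
    denom = sum(d * d for d in diff_fb)
    if denom == 0:
        raise ValueError("foreground and background colours are identical")
    # Project (I - B) onto (F - B) to find the mixing fraction.
    alpha = sum(a * b for a, b in zip(diff_ib, diff_fb)) / denom
    return min(1.0, max(0.0, alpha))


# An edge pixel that is 30% red product and 70% white backdrop.
product, backdrop = (200, 30, 30), (255, 255, 255)
mixed = tuple(0.3 * f + 0.7 * b for f, b in zip(product, backdrop))
print(round(estimate_alpha(mixed, product, backdrop), 3))  # 0.3
```

Recovering a fractional alpha like this, rather than a hard in/out decision per pixel, is what lets a cut-out blend cleanly onto a new background.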


RemoveBackground in the Wolfram Language...

http://reference.wolfram.com/language/ref/RemoveBackground.h...



I have conducted research in exactly this. Their results would likely be much more accurate if they applied Canny edge detection instead of the simpler Sobel edge detection.


Canny edge detection uses Sobel in most cases as a first or second step (after blurring) and adds a non-maximum suppression + hysteresis thresholding - You do most probably know that, just wanted to clarify that it's not really "instead".
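
The hysteresis thresholding stage mentioned here can be sketched in a few lines. This is a simplified illustration, not OpenCV's implementation; it assumes the gradient magnitudes (e.g. from Sobel, after non-maximum suppression) are already available as a 2D list, and the function name `hysteresis` and the thresholds `low` and `high` are hypothetical.

```python
from collections import deque


def hysteresis(magnitude, low, high):
    """Keep strong pixels (>= high) plus weak pixels (>= low) linked to them."""
    h, w = len(magnitude), len(magnitude[0])
    edges = [[False] * w for _ in range(h)]
    queue = deque()

    # Strong responses are accepted immediately and used as seeds.
    for y in range(h):
        for x in range(w):
            if magnitude[y][x] >= high:
                edges[y][x] = True
                queue.append((y, x))

    # Grow from the seeds into 8-connected weak responses.
    while queue:
        y, x = queue.popleft()
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                ny, nx = y + dy, x + dx
                if (0 <= ny < h and 0 <= nx < w and not edges[ny][nx]
                        and magnitude[ny][nx] >= low):
                    edges[ny][nx] = True
                    queue.append((ny, nx))
    return edges


# Hypothetical gradient magnitudes: one strong pixel, a weak run attached
# to it, and an isolated weak pixel on the right.
grad = [
    [0, 0, 0, 0, 0],
    [90, 40, 40, 0, 40],
    [0, 0, 0, 0, 0],
]
print(hysteresis(grad, low=30, high=80)[1])
# [True, True, True, False, False]
```

The weak pixels attached to the strong one survive, while the isolated weak pixel is dropped; a single Sobel threshold cannot make that distinction.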


This is awesome, love to see a well thought out image processing algorithm used for such a useful application.


I would agree with you if the image processing algorithm were actually well thought out. As it is, I fear that this article is encouraging some very bad ways of thinking about problems.

First off, they don't mention searching for an off the shelf solution to the problem. If OpenCV's built-in GrabCut or Watershed filter wouldn't work for them, they should explain why.

Secondly, they don't examine the literature for existing approaches to the problem. Sometimes you won't find what you're looking for in those approaches, but in that case, the problem with existing approaches will inform how you decide to tackle the problem.

Finally, they solve the problem by building a filter chain, but they don't seem to actually understand what the filters are doing. They say that the Sobel operator "looks for light to dark transitions". This is completely incorrect. The Sobel operator does nothing more than estimate the magnitude of the gradient of the image. Furthermore, based on this misunderstanding, they invert the image before applying the Sobel filter - a step which does literally nothing.


Small suggested edit:

> it normally has moved more background than product

Should probably be:

> it normally has removed more background than product


This seems like a great problem to tackle with deep learning.


I'm curious what you mean by this, could you elaborate?


It's a new game in town: whenever someone says "big data", you reply "deep learning", someone says "mobile", you say "real time", someone says "sharing", you say "social", etc ...


Hello. Greetings. Brrrrrrring!!! http://www.youtube.com/watch?v=KTc3PsW5ghQ


Unsupervised learning using deep neural networks, e.g. Google made a face detector using only unlabelled images that trained itself[1]. Seems like a similar approach could be applied to image segmentation, i.e. this problem.

[1] http://static.googleusercontent.com/media/research.google.co...


Hmm, I think the difficulty would then be how you train the neural network, I'm not sure that'd be an easy task.

I think you'd have to apply a fair amount of pre-processing anyway before you passed it over to the ANN.


I think the difficulty is in the computing power required to train the net, but there has been some progress in that area lately http://techblog.netflix.com/2014/02/distributed-neural-netwo...


You're both right. There are challenges in training a deep-learning neural network, and that training requires a lot of processing power.

We open sourced a pretty cool standalone machine in Java that addresses those issues about a week ago. Looking for feedback...

http://deeplearning4j.org


DeepLearning4j author here. I'd just like to add that despite training neural networks being hard, they are great for understanding data if trained right.

There are a lot of innovations in image processing wrt neural nets specifically. The right neural network can learn everything from scene detection to simple object recognition.

I would highly recommend taking a look at the neural nets course on Coursera to understand some of the use cases.


You could probably do some neat things with Google Glass. Feed the computer the images the human pays attention to all day, and let the computer extrapolate objects by how a person's vision tracks them.


If by "all day" you mean "a couple of hours" that it takes to run down the batteries by capturing and processing live video.


I'm not so sure ML/DL can effectively calculate the contours of an image. ML's strength is in regression and classification (eg, differentiating between handbags and shoulder bags), but there are more appropriate tools for image segmentation and foreground extraction.


Interesting... Thanks for the disclosure of the code. Now if only NSA was willing to open source their image/video manipulation software...

When researching this, did you not find any good premade solutions, or were they simply too highly priced?


We didn't find any pre-made solutions which fit with our requirements, which were: reasonable results in the general case and speed.

We used GraphicsMagick and the pgmagick package to integrate into our code base and because GraphicsMagick is crazy fast.



