I think it is key to remember that the accessibility of huge amounts of labeled data is behind most of this. ImageNet is 1.2TB (~1.2 million images!). Convolutional neural nets have been around for a long, long time (early 90s? or maybe late 80s) and they just needed more data (and a few training tricks :) ).
As we enter the world of "big data", more and more companies are locking data of this size (or bigger) away in closed vaults. This is a competitive edge in business, but it also strangles the ability of academics to do "real world" research, which is often a huge criticism of academia at large.
Academic research can have a role in tech R&D if given a chance, so I hope that companies will open their data to private researchers to continue this kind of improvement - even if under secrecy clauses or some other safeguard.
All that said, the technological improvements have been astounding, and I hope this is only the beginning! There have been some incredible results using related techniques in NLP for machine translation recently, and speech has always been a great playpen.
If anyone is interested in playing with these things in Python, another researcher and I have recently created a library for using these types of neural networks as a black box WITHOUT ANY TRAINING, in a similar style to scikit-learn. See [1] for more. We support only a few nets currently, but plan to integrate support for caffe and pylearn2 in the near future.
Great work guys! Are these going to only be computer vision nets?
I think the huge problem with most deep learning frameworks out there is that they are either hard to use or limited in scope.
There's nothing wrong with that per se, but I'd be curious to see what you guys intend. It's great that you guys are giving an sklearn interface to it.
Support is planned for audio and hopefully text - I am working on building a million song dataset to recreate the work Sander Dieleman did for Spotify, and may have access to the weights of a trained speech network in October! So yes, feature extraction from other domains should be on the horizon.
We are specifically trying to make it easy to say: I want to transform an image (or audio, etc.) using a pretrained net. Download the weights, extract the features for me, and give me the feature vectors so I can do something with those. This seems to be really, really, really hard in all the tools I have used and usually involves training yourself, which is not very useful for things on the level of ImageNet.
Of course, having nice examples and good docs is one of the great parts of scikit-learn, and is actually one of the things I have been working on most recently. Our docs aren't to that level yet, but I hope they can be one day.
DeCAF (precursor to Caffe) and OverFeat binaries were really kind of the first in this regard (about 1 year ago now), but IMO one of the limitations is that they lose interaction with the rest of the Python ecosystem for data munging, and simple algorithms for exploration. By wrapping the weights, we hope to leverage the support of the Python ML ecosystem easily, while still being able to use the power of these networks.
Right now the most compelling use case is as part of a scikit-learn pipeline, e.g. make_pipeline(OverfeatTransformer, LinearSVC) or whatever. Feed images in, get predictions out. I am also working on a demo of "writing your own twitter bot" similar to https://twitter.com/id_birds , written by Daniel Nouri. I like sloths, so it will definitely be a slothbot.
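To make that concrete, here is roughly what I mean - a minimal sketch only, since the import path and exact call signatures here are from memory and may differ from the released package, so check the docs rather than trusting this verbatim:

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC
    # Import path may differ slightly in the released package.
    from sklearn_theano.feature_extraction import OverfeatTransformer

    # Stand-ins for real data: a small stack of RGB images and integer labels.
    X = np.random.randint(0, 255, size=(10, 231, 231, 3)).astype('float32')
    y = np.array([0, 1] * 5)

    # Case 1: just get feature vectors out of the pretrained net (no training at all).
    features = OverfeatTransformer().transform(X)

    # Case 2: use the frozen net as the front of a scikit-learn pipeline;
    # only the linear SVM on top gets fit.
    clf = make_pipeline(OverfeatTransformer(), LinearSVC())
    clf.fit(X, y)
    predictions = clf.predict(X)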
I also hope to support recurrent architectures from Groundhog (https://github.com/lisa-groundhog/GroundHog) as several researchers here at the LISA lab have been using it to get pretty amazing results in NLP and audio, both of which are potential targets in the future. If we can leverage their work, it would be a very nice way for people to immediately play with SOTA architectures in different applications.
In any case, just loading in weights and extracting image features easily is nice, and was a benefit for both Michael and myself in a research project this summer.
Great to hear! That's exactly what I'm trying to replicate as well. I'm mainly trying to do it for industry myself. Not a lot of people like making this stuff for the JVM ecosystem (understandable of course... I love Python as well)
I agree about caffe as well. The Python ecosystem is amazing and should be leveraged, which also increases adoption.
As I said before, being able to do this at scale for people whose data is stored on the JVM should help make it more accessible to a lot of people.
Re: Twitter bot. This looks really cool!
I'll be keeping an eye on developments here. Good stuff!
It can work the other way too. Ironically the academics who created ImageNet restrict who can download it and don't allow it to be used for commercial use.
Well, they have no choice. Because technically the copyright of each image is still held by the people who took the images (or in some cases the people in the images). Are weights of a trained network based on ImageNet a derived work, a cesspool of millions of copyright claims? What about networks trained to approximate another ImageNet network?
There is a lot of legal gray area here unfortunately, so the signing of the license to work with the images seems like a CYA move to me.
'Are weights of a trained network based on ImageNet a derived work, a cesspool of millions of copyright claims?'
I'd argue not. The success of deep learning in vision seems to be in acquiring allocentric representations of objects. Copyright protects the expression and not the concept. Parameter weights describe the concept, not particular instantiations of it.
This is what I am hoping too - but derived works are a sticky subject. Imagine a scenario where someone gets hold of one of these "data vaults" from a large company, then trains a network and throws away the training data. You are still holding the "essence" of their datastore, even without the actual data.
I guess we won't really know until someone goes to court over it.
Convolutional neural nets have been around for a long, long time (early 90s? or maybe late 80s) and they just needed more data (and a few training tricks :) ).
The state of the art has become a lot better in terms of training deep (multi-layer) neural networks. The naive approach (start at a "random" weight setting, use gradient descent) that works on a convex error surface fails catastrophically on deep, complex neural nets. (Shallow neural net training is a non-convex problem as well, but seems to be "less non-convex" in practice.) In the past 10 years, people have become better at getting past that.
Theoretically, only one hidden layer is necessary for neural nets to be universal. Thus, for a long time, most research focused on single-layer networks because those were "good enough" to model any mathematical function. The problem is that convergence, for single-layer nets, can be very slow (especially given that you're often doing stochastic gradient descent when working with large data sets). Single-layer nets are often very difficult to audit. There's a lot of "cross talk" where unrelated features are mapped to the same "space" in the network. So you have a "black box" that is hard to interpret.
Deep neural nets (which is what ML researchers call "deep learning", until MBAs start abusing the latter term, as they have with "big data") have come back into style over the past few years, due to recent research into how to make them actually perform, and findings about superior convergence with the right conditions. In deep nets, there's often a problem of signals either amplifying or vanishing as they propagate throughout the net. The former leads to saturation (the neural net moves very slowly from a suboptimal but flat place on the error surface) and the latter is too linear and unlikely to pick up interesting features. Even now, making deep neural nets not sensitive to initial starting conditions is an unsolved problem, but there's been a lot of progress.
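That amplify-or-vanish behavior is easy to see in a toy simulation (a rough sketch, not tied to any framework; the layer width and weight scales here are arbitrary):

    import numpy as np

    rng = np.random.RandomState(0)
    n, depth = 100, 10
    x = rng.normal(size=n)

    for scale in (0.01, 1.0):
        h = x.copy()
        print("weight scale", scale)
        for layer in range(depth):
            W = rng.normal(0.0, scale, size=(n, n))
            h = np.tanh(W.dot(h))
            # Small weights: activations shrink toward 0 layer by layer (signal vanishes).
            # Large weights: tanh saturates near +/-1 (flat, slow-moving error surface).
            print(layer, np.abs(h).mean())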
I would hazard the guess that the convolutional technique is a lot more useful in deep neural networks than it is in single-hidden-layer neural nets.
Hi, I am a research engineer in Yann LeCun's group at Facebook. I hate to seem to be picking a fight, but you're a prominent poster here, and this comment seems likely to garner a fair amount of attention. Unfortunately almost every sentence you've written betrays a subtle misunderstanding of the space, and the totality is quite misleading.
> The naive approach (start at a "random" weight setting, use gradient descent) that works on a convex error surface fails catastrophically on deep, complex neural nets. (Shallow neural net training is a non-convex problem as well, but seems to be "less non-convex" in practice.)
Starting at a random (no scare quotes needed) point in weight space and SGD'ing is exactly what Alex Krizhevsky, and all of the derivative convnets over the last two years, did and do. It works just fine; I sit at work doing it all day long. You need to have enough data to train on, big enough models, and enough flops to train the big models on the big data before your interns' grandchildren die. We have all of the above now. Aside: even single-layer neural networks do not have convex error surfaces; convexity, and funky error surface geometry, is not a relevant distinction between shallow and deep nets. There have been no magical optimization breakthroughs, it's still SGD with the same herbs and spices that were used in the 90's (momentum, e.g.).
> The problem is that convergence, for single-layer nets, can be very slow (especially given that you're often doing stochastic gradient descent when working large data sets).
"Stochastic" gradient descent just means doing lots of weight updates per epoch. Ceteris paribus, training on large, redundant data sets, stochastic converges faster than batch because it gets to consider more points in the weight space than batch per pass over the data. The problem with single-layer neural nets is not that they converge slowly; the problem is that the layer size needs to grow exponentially with the task size. Single-layer neural nets' universal approximation power is thus not of great practical consequence. The power of deep nets is the power of composition: f(g(h(x))) is a strictly more powerful model than f(x) holding the number of parameters constant.
> Even now, making deep neural nets not sensitive to initial starting conditions is an unsolved problem, but there's been a lot of progress.
You just initialize with a Gaussian ball around zero and explore whatever valley in the error surface you happen to be in. Works 100% dandy.
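Concretely, something like this (the 0.01 standard deviation is a typical hand-picked value for a layer of this size, not a magic constant):

    import numpy as np

    rng = np.random.RandomState(0)
    n_in, n_out = 4096, 4096
    # "Gaussian ball around zero": small-variance normal draws, biases start at zero.
    W = rng.normal(loc=0.0, scale=0.01, size=(n_in, n_out)).astype('float32')
    b = np.zeros(n_out, dtype='float32')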
> I would hazard the guess that the convolutional technique is a lot more useful in deep neural networks than it is in single-hidden-layer neural nets.
It doesn't really make sense to talk about a "single layer convolutional net", because if you only have a single layer, and all you can do is convolve with it, then the output of your net will necessarily be a big pile of filtered versions of the input image. Unless your task is specifically to learn a target set of filters, it would make no sense to have a single layer convnet.
From that paper, it seems like it's just that deep nets are easier to train to find state of the art solutions than shallow nets. It's not that shallow nets are inherently incapable of performing as well. Shallow nets can mimic/approximate a given well trained deep net and preserve almost all of the accuracy. So it's not the case that a good solution to the task doesn't exist in the set of hypotheses spanned by a reasonably sized shallow net; it's just that people don't know how to effectively find the right parameters for the shallow nets.
So far this is only shown on TIMIT and MNIST, which are pretty trivial datasets, so it may be dataset dependent.
One of the authors is giving a talk at INRIA in France this month, and the abstract mentioned CIFAR10 (possibly unpublished results). If they have managed to compress a CIFAR10 network, that is a strong indicator that ImageNet networks could be compressed in the same way... but no one has published any results in this regard to my knowledge.
However this is an active area of research for me, and I hope to explore it more soon. I think there are better ways to approximate than this, but this work at least shows that it may be possible.
> Aside: even single-layer neural networks do not have convex error surfaces;
Single layers can indeed be made to have convex error surfaces fairly easily. One can do so by matching the error/loss function with the squashing/link function. What some old NN folks got wrong was mixing up square loss with the logistic function - that is an unhealthy mix. If one were to use KL divergence instead of square loss, then one would indeed have a convex loss function; in fact this would be nothing but logistic regression. One can push this idea further: with any choice of a monotonic squashing function one can derive a 'matching' loss that gives you a convex objective. Classical statisticians know this and call it by a different name: canonical generalized linear models. I am not from that tribe; mine is more ML, so we might call it minimizing a Bregman loss.
Just so that it's clear, I am talking about single-layer networks, not single-hidden-layer networks; there are plenty of cases where the former is useful.
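To spell out one concrete instance of the matching idea (binary targets y_i, sigmoid squashing function, single layer with weights w):

    \sigma(z) = \frac{1}{1 + e^{-z}}

    L_{square}(w) = \sum_i \left( \sigma(w^\top x_i) - y_i \right)^2
        (square loss + sigmoid: the unhealthy mix, non-convex in w)

    L_{match}(w) = -\sum_i \left[ y_i \log \sigma(w^\top x_i) + (1 - y_i) \log\left(1 - \sigma(w^\top x_i)\right) \right]
        (the matching log loss: convex in w, i.e. plain logistic regression)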
> There have been no magical optimization breakthroughs
It is arguable whether Hessian free methods, contrastive divergence or auto-encoder based training methods qualify as 'breakthroughs', but they have definitely invigorated researchers in this broad area and expanded their capabilities.
> It is arguable whether Hessian free methods, contrastive divergence or auto-encoder based training methods qualify as 'breakthroughs', but they have definitely invigorated researchers in this broad area and expanded their capabilities.
The scientific revolution in computer vision right now is due to deep convnets, trained in a supervised way using backprop and SGD. All of the systems we're talking about that have started blowing away records are members of this family, and were trained this way. If Alex Krizhevsky had not entered ImageNet 2012, we would not be having this conversation (in part because I probably would never have gotten curious enough about it to leave my home territories of systems and programming languages). Second order methods, unsupervised pre-training, RBMs, graphical models, etc. etc. etc. were exciting for those inside the field, definitely provided encouragement to those optimistic about deep models, and still might prove important, but they have had little impact and visibility to skeptical people outside, in the way that entering a computer vision competition and murdering all the computer vision systems did.
Well, HF training was a pretty big deal IMO. It definitely saved my bacon in training some recurrent nets - much easier to get working and/or to recover from bad optimization, but pretty slow.
The SGD we use today actually has some strong ties back to that second order optimization work - see some papers by Ilya Sutskever relating a special form of momentum back to second order methods like HF. http://www.cs.utoronto.ca/~ilya/pubs/2013/1051_2.pdf His dissertation covers this at some length as well.
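For reference, the two momentum updates compared in that line of work, writing \mu for the momentum coefficient and \epsilon for the learning rate:

    v_{t+1}      = \mu v_t - \epsilon \nabla f(\theta_t)               (classical momentum)
    \theta_{t+1} = \theta_t + v_{t+1}

    v_{t+1}      = \mu v_t - \epsilon \nabla f(\theta_t + \mu v_t)     (Nesterov momentum)
    \theta_{t+1} = \theta_t + v_{t+1}

The only difference is where the gradient is evaluated - the "lookahead" point - which is what ties it back to the second order view.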
Using Adagrad, Adadelta, etc. isn't really SGD as it was back in 2012, and this year's entrant "GoogLeNet" basically halved the error again using these and other tricks (we think) - which is even more impressive considering that going from 11% to 6.7% is a HUGE increase in difficulty. Just my 2 cents.
However, there is a good reason the colloquial name for these things is "Alexnets"... that work was truly incredible, and I don't think it has stopped behind the doors of Google.
Hi, I am a research engineer in Yann LeCun's group at Facebook.
Really? Let's chat offline. I'm michael.o.church at gmail.
I'm not looking for a new job right now but I've been in this business for long enough to know that that can change at any time. If nothing else, I'd love to have lunch the next time I'm in New York (am I correct in assuming that you are in NYC?) and get the kind of intellectual ass-kicking you get when you meet someone who actually knows this sort of field at a deep level.
I'm also getting to the point where I have to decide whether I want to go into "real AI"-- and be a small fish again-- or take the big-fish/smaller-pond path of management (I'm in very early discussions about the MD/Data-Sci position at a fast-growing HK/Sing hedge fund, which probably scares the shit out of guys like you who actually know this stuff, as opposed to traders who read two papers, think they understand them better than they do, and build trading systems.) I'm afraid that if I take the executive/finance track, I might get even farther away from the deep-knowledge/R&D space.
I'm only 31, so I'm not afraid to take the Real AI route and be a small fish in a large/badass pond again. In fact, I'd prefer it, even though the winds seem to be taking me the other way. (Finance/management would be the big-fish/small-pond route, since my data science/ML understanding is well into the top 1% in that world and, again, I'm not offended in the least if you say that that scares you. It scares me. R&D people like you get deep knowledge; guys like me in the private sector-- "private sector", here, meaning startups and finance but not R&D labs-- spend about 85% of our time fighting political battles and self-promoting and rarely have the time to learn anything as deeply as we should.)
I hate to seem to be picking a fight
Don't worry. You're not. It's great to hear from someone who actually gets to use this stuff at work. Thanks for taking the time.
Unfortunately almost every sentence you've written betrays a subtle misunderstanding of the space
I understand the space theoretically, but I'll readily admit that I have, compared to you, almost no real-world experience. I've built neural nets for a few small problems, but nothing at the scale you have.
Perhaps the issues I'm raising are completely theoretical and pose no problem in practice.
Starting at a random (no scare quotes needed) point in weight space
I put "random" in quotes because it's not always clear how to sample a useful "random" point for initialization. There's no such thing as a uniformly random point in R^n, of course, so you need to choose a distribution a priori like U[0,1]^n or N(0, I_n). This seems to pose no problem (even while it's nowhere near the "correct" weights, and we both know that individual weights have no independent meaning in neural nets) if there's a heterogeneity in the scales of the inputs, but can be a problem if you have large scale variations.
If one of your inputs ranges from 0 to 1000 and another ranges from 0 to 0.001, then those "random" (scale-agnostic) weight-initialization distributions actually begin with a 10^6:1 bias favoring the former input. Of course, this is a trivial example and scale normalization is as old as dirt, but I think the point (that useful "random" initialization is not so easily defined, especially when you have deep and messy network topologies in which signals tend to vanish or amplify) is sound.
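A tiny version of what I mean, using the toy numbers from above:

    import numpy as np

    rng = np.random.RandomState(0)
    x = np.array([1000.0, 0.001])       # two inputs on wildly different scales
    w = rng.normal(0.0, 1.0, size=2)    # scale-agnostic "random" init
    print(np.abs(w * x))                # the first input dominates the pre-activation, roughly 10^6 : 1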
When you transform the space (e.g. feature extraction, scale normalization, adjustment for multicollinearity) a scale-agnostic distribution like U[0,1]^n or N(0, I_n) becomes dramatically different in terms of what it actually means, relative to the data. The fact that these pre-training techniques seem to be effective if not necessary (at least, the people who I read swear by them) seems to indicate that, at least for some problems, this is a real issue.
With a small number of layers, you still need some randomness to not arrive at the (trivial and useless) stationary point you get from w = 0-- because the hidden nodes don't differentiate-- and then SGD with momentum is enough to get you to a good local minimum. However, it doesn't seem that initialization (beyond "random enough to differentiate the hidden nodes") becomes a major concern for shallower nets.
If I understand correctly, it's when you have 6+ layers (and certain categories of neural nets, like recurrent neural nets, are effectively much deeper) that you start to have these initialization issues, because activation values vanish or grow (to saturation) exponentially in the depth of the network and a bad initial point can leave the network in a borked state (e.g. saturation) where the training performs very badly.
Putting "random" in quotes was an attempt to say, "hey, picking a 'random' point in a useful way is not always trivial, because you still have to choose a sampling distribution a priori" but it was late, I am jet-lagged from a trip to Asia, etc., so maybe I didn't express it well.
Aside: even single-layer neural networks do not have convex error surfaces; convexity, and funky error surface geometry
It's correct that single-layer neural nets are non-convex. My understanding, and correct me if I'm wrong, is that with shallower nets, the "broad == deep" bet (that the best local minima will have the largest basins of attraction) is usually correct, and that this breaks down when the nets are very deep (as with recurrent nets, since BPTT is effectively "unrolling" an RNN into a very deep BPNN). I can't even begin to visualize the error surface of a 10+ layer deep neural net, so if that understanding is wrong, please correct me.
(Broad == deep, the contention that the better local minima are more likely to have larger basins, implies that you're likely to get the optimal local minima with a few initial samples. What you wouldn't want is for all the good local minima to have tiny basins-- to be narrow but deep-- because you'd be unlikely to hit with your initial sampling.)
The problem with single-layer neural nets is not that they converge slowly; the problem is that the layer size needs to grow exponentially with the task size. Single-layer neural nets' universal approximation power is thus not of great practical consequence. The power of deep nets is the power of composition: f(g(h(x))) is a strictly more powerful model than f(x) holding the number of parameters constant.
Thanks. This makes a lot of sense. Yay for composition. (I still have a ways to go in ML; my expertise is in functional programming/language design.)
You just initialize with a Gaussian ball around zero and explore whatever valley in the error surface you happen to be in. Works 100% dandy.
Here's the paper I had in mind when I wrote that comment: http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pd... . If I'm misunderstanding the lessons of it, or if the paper is just wrong, please correct me. What I've taken from it is that deep neural network training is quite sensitive to initial starting point, hence the successes of pre-training. To pre-train is, effectively, to change the meaning of "Gaussian ball" (or the like) relative to the data.
That also gets to why I put "random" in quotes. If you do pre-training, feature extraction, et al (which seem to be necessary for many problems but, again, correct me if I'm wrong) then the Gaussian unit ball in the weight space for the (transformed) data is an entirely different set. Even with linear transforms (e.g. scale normalization) this is true.
But again, you've actually used this stuff in your day job and I haven't yet (though I hope to, in my next gig) so I'll defer to your judgment as to whether this is actually an issue. Am I making sense, at least?
It doesn't really make sense to talk about a "single layer convolutional net"
That was my sense, too. It was 11 at night and I didn't want to commit to saying "single-layer convolutional nets are never useful", so I reduced my certainty in what I was saying to "I would hazard the guess ... a lot more useful ...". Generally, when I'm fighting Pacific levels of jet lag and it's after dark, "I can't see how it would work" does not justify "It cannot work".
Hey, nice to make your acquaintance, and thanks for offering your contact info. Mine can be backed out of my HN profile as well. I actually work in Menlo Park; the AI group is split across New York and Menlo Park, with a very small European contingent for now.
WRT the Glorot and Bengio paper, it’s true that there was a lot of excitement surrounding unsupervised pre-training of DNNs, but this mostly preceded the current wave of successes. The big differences between the architectures that are working on image processing today and that paper are:
1. Moar data. The datasets this paper was looking at were on the order of 10^5 or so images. 10^6 is a different ballgame.
2. Convolution. Sharing weights really is special. This means there are far fewer parameters to learn in the early parts of the network, and so pre-training seems less necessary.
3. ReLU activations. The survey of activation functions uses only smoothly differentiable ones whose gradients get tiny as you are far from zero. ReLU has fewer problems with the gradient getting tiny or huge at idiosyncratic points, and also has the virtue of sparsifying the gradients as you backprop (since anything that landed in the negative tail has zero gradient) - see the toy sketch below.
So yeah, we really do do entirely unpretrained learning of low-level features, straight from RGB values between 0 and 256, and it works! Isn't that cool??
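A toy illustration of point 3 (nothing framework specific, just comparing gradient magnitudes):

    import numpy as np

    z = np.linspace(-6, 6, 7)
    sigmoid = 1.0 / (1.0 + np.exp(-z))
    sigmoid_grad = sigmoid * (1.0 - sigmoid)  # at most 0.25, and tiny far from zero
    relu_grad = (z > 0).astype(float)         # exactly 0 or 1: sparse, never saturates

    print(sigmoid_grad)
    print(relu_grad)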
That was a big deal, and I think pretty much eliminated greedy layerwise pretraining in the "we have plenty of data, but can't generalize well" case. Good initialization rules help too, but are mostly heuristic and problem dependent to my knowledge.
The last few slides have a kind of "survey list" to get up to speed with modern deep learning approaches for images. I also put the slides on github at http://github.com/kastnerkyle/EuroScipy2014 , which hopefully preserves the hyperlinks where speakerdeck does not.
It's interesting that, in the image showing the classes easiest and hardest for the algorithm, all the easy ones are animals and all the difficult ones are human-made artifacts.
Does nature tend to create forms which are easily detected by relatively simple neural networks? The evolutionary explanation could be that this allows animals with primitive neural systems to more easily distinguish other members of their species.
I think it's because these human-made objects have no single form. For instance "letteropener" describes the function of the object, not the form. Compare that to "red fox", which is always gonna look more or less the same.
Yes, but the network is still differentiating between select breeds which means it is learning traits of the animal which are unique to the breed. And training at a higher level is perfectly doable, say "fox" instead of the fox breed.
Ultimately, if you are able to look at an image and say "letter opener" there are features which differentiate this from a knife/can opener/whatever - these are the exact things a convolutional neural network should be able to use (in theory), and this has nothing to do with the label, which is typically unimportant as long as it is unique and accurate.
We could flip all the labels around and still get unique answers - the network is just learning a mapping from input -> some integer, and I would argue the variance in dog breeds and lighting in natural scenes is much trickier than the angle/shape of a letter opener.
I still think it comes down to the composition of this particular dataset. Augmenting this with images scraped from online stores would be very interesting as it is fairly trivial to get huge numbers of images for anything that is typically sold online - I think Google is way ahead on this one!
It's impossible to tell from the examples given in the article, but I wouldn't be surprised if the same classifier that gets 100% on "Blenheim Spaniel" and "Flat-coated Retriever" gets less than 100% on "Dog".
It's a question of how visually coherent the category you're trying to learn is. From a purely visual perspective, the first two categories are relatively tightly bunched in the state space, whereas "dog" covers a diffuse cloud of appearances whose total range might even encompass the area where many non-dog animals also lie. Humans may rely on some additional semantic knowledge about different kinds of animal to produce an accurate classification. It's not entirely unlike how determining the meaning of the words in the phrase "eats shoots and leaves" can't be done reliably without incorporating contextual clues such as whether we were just talking about pandas or a murder in a restaurant.
There may also be issues around how distinct the categories are from each other. A couple years ago yours truly picked up a letter opener off the table and used it to spread butter on his toast, much to the amusement of his hosts.
In practical use, you can simply search for anything in the "dog" subclass using the WordNet hierarchy... so there is no loss in accuracy unless you have confusion across the search groups! We actually support this in sklearn-theano - if you plug in 'cat.n.01' and 'dog.n.01' for an OverfeatLocalizer we return all matched points in that subgroup.
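Roughly like this (a sketch only - the import path and argument names are from memory, so check the sklearn-theano docs for the real signatures):

    import numpy as np
    # Import path may differ slightly in the released package.
    from sklearn_theano.feature_extraction import OverfeatLocalizer

    image = np.random.randint(0, 255, size=(400, 600, 3)).astype('float32')  # stand-in for a real photo

    # Match anything under the 'dog.n.01' subtree of the WordNet hierarchy.
    localizer = OverfeatLocalizer(match_strings=['dog.n.01'])
    points = localizer.predict(image)  # (x, y) locations that matched the requested synset(s)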
In general, if you misclassify "dog" for a fixed architecture you will most certainly misclassify "Blenheim Spaniel" and "Flat-coated Retriever" - the two other classes are subsets of the first. The "eats shoots and leaves" sentence is analogous to a "zoomed in" picture of fur - we don't know what it is but we are pretty sure what it isn't! This is still useful, and would already get most of the way there for large numbers of fur colors/patterns.
I think the concerns you have are more important at training time, but I have not seen a scenario where it has mattered very much. In general having good inference about these nets is really hard, but I think your initial thought about "dog space" ties in nicely to a post by Christopher Olah (http://christopherolah.wordpress.com/2014/04/09/neural-netwo...) - maybe you will find it interesting?
And yes it becomes really fascinating to extend your last thought to "optical illusions" and other tricks of the mind - even our own processing has paths that are easily deceived and sometimes flat out wrong... so it is no surprise when something far inferior and less powerful also has trouble :)
The tiger [it's a leopard] and stingray [some other ray?] are wrong, but the system is 100% certain they're right; seems quite a big error considering the apparent accuracy of the other labels.
Isn't it contextual - flat-coated retriever, well-done, but how good is it at picking one out of a pile of images of black animals, panthers, house cats?
It could also be that there are more labeled examples of animals, or that the translation/rotation of an animal is less disruptive than the same transform to a car or other object.
I know for sure that ImageNet has a huge amount of animals in it, down to very select sub-breeds, while the object categories are usually at higher levels.
I'd say it's an entropy thing. The principle of "correlation of parts" means images of animals have low entropy relative to, say, machines. Intuitively a machine can contain an almost arbitrary set of shapes and components, whereas animal forms are more constrained.
See my earlier answer - I think that it is all about the data. And most things humans design have a similar "correlation of parts" due to the human preference for symmetry. Not all the images are face on headshots - lots of running/action, off centered, etc. ImageNet is a very "real" dataset in that sense.
Entropy isn't the right term here - low entropy would imply something about the compressibility of each image based only on whether it was a machine or a natural object, which I don't think is the case.
Well, if the effect is just experimental error, I agree that no explanation is needed.
I was using entropy in the information theoretic sense, as you might assign a measure of entropy in bits to each character in a language. If part of an animal is "less surprising" I'd say it contributed less to the entropy of the whole thing. Maybe that's too woolly.
There is no way you could identify the whole machine based on seeing a small part if an identical part recurs in many different machines. It's not holographic in the way animals are.
Chicken, egg. Maybe these things are easy because they are crucial to survival (and everything that was bad at recognizing these signs died), not due to being easier for simple visual processes.
Both chicken and egg at the same time. Evolution's going to favor the solution that involves the least costly adaptations. That probably means simpler markings will be favored because it's less costly for the prey species to evolve them and it's less costly for the predator species to evolve an instinct to avoid them.
> The evolutionary explanation could be that this allows animals with primitive neural systems to more easily distinguish other members of their species.
It's probably the other way around. Neural systems optimize for natural objects.
But isn't the neural network (like the algorithm described in the article) a mathematical concept? Are the computer-based neural networks so closely modeled after biological equivalents that they would inherit such an optimization?
(As you can tell, I know nearly nothing about neural networks, but curious to learn more...)
Digital circuits? I mean, it is just some matrix multiplies and a nonlinearity (then stack to the moon). No "circuitry" is really involved until you get into recurrent networks, and even then that is just feedback. Not quite sure what you mean here.
There have been experiments trying to encode information the way the brain does, they just haven't worked very well (or at least not as well).
There is equivalency with PCA (well ZCA, a modified form of PCA) in cat and monkey brains, and likely others as well. See Sejnowski and Bell in [1]
Also, PCA is an affine transform, so there is no reason it couldn't be incorporated/learned by the net itself. In fact, I think most nets these days eschew PCA/ZCA when they have sufficient data support.
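For reference, the PCA/ZCA relationship is just a rotation back into the original coordinates - a rough numpy sketch:

    import numpy as np

    def zca_whiten(X, eps=1e-5):
        # X: (n_samples, n_features), mean-centered per feature below.
        X = X - X.mean(axis=0)
        cov = np.dot(X.T, X) / X.shape[0]
        d, E = np.linalg.eigh(cov)                  # cov = E diag(d) E^T
        X_pca = np.dot(X, E) / np.sqrt(d + eps)     # PCA whitening
        return np.dot(X_pca, E.T)                   # rotate back into input coordinates: ZCA

    X_white = zca_whiten(np.random.rand(100, 16))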
To clarify for others, this type of neural network has nothing to do with the brain. A neural network is really a "universal function approximator", and I actually prefer to call it as such. Our goal is to learn the best possible mapping of input -> label, through whatever means necessary. It turns out that learning hierarchies of features helps from both a learning aspect and a computational point of view. But a sufficiently wide single layer could do the same thing in theory.
I have no idea. But I'd wager that cause and effect in natural systems has more to do with visual systems adapting to shapes than shapes of entire animals adapting to other species' visual systems.
But isn't the neural network (like the algorithm described in the article) a mathematical concept?
It is. It's a family of models that (a) can be expressed compactly using linear algebra and (b) can represent a large class of mathematical functions. (In fact, neural nets can approximate any continuous function arbitrarily well.) "Learning" is typically some variety of gradient descent (on the "error surface" defined by the training set) in the space of parameters.
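A minimal sketch of both points - two matrix multiplies, a nonlinearity, and plain gradient descent on squared error (the sizes and step size here are arbitrary choices, not anything canonical):

    import numpy as np

    rng = np.random.RandomState(0)
    # Fit y = sin(x) on [-3, 3] with a single hidden layer of tanh units.
    X = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(X)

    n_hidden, lr = 20, 0.1
    W1 = rng.normal(0, 0.5, (1, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.5, (n_hidden, 1)); b2 = np.zeros(1)

    for step in range(10000):
        h = np.tanh(X.dot(W1) + b1)          # hidden layer
        pred = h.dot(W2) + b2                # linear output
        err = pred - y                       # gradient of 0.5 * sum(err**2) wrt pred
        # Backprop (chain rule) through the two layers.
        dW2 = h.T.dot(err); db2 = err.sum(axis=0)
        dh = err.dot(W2.T) * (1 - h ** 2)    # tanh derivative
        dW1 = X.T.dot(dh); db1 = dh.sum(axis=0)
        for p, g in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
            p -= lr * g / len(X)

    print(np.abs(pred - y).mean())  # much smaller than at initialization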
Are the computer-based neural networks so closely modeled after biological equivalents that they would inherit such an optimization?
In my opinion, no. Biological neural networks are, in fact, a lot more complicated. They have time behavior, changing network topologies, and chemical influences (neurotransmitters) that play a major role. There's still a lot we don't know about them.
I disagree with GP's contention. While the biological neural network was an inspiration for this class of mathematical models, and convolutional behavior (in ANNs, a regularization, or mechanism for reducing the dimensionality of the parameter space, sacrificing training-set performance but often improving test-set performance) may be used in our visual cortex, artificial neural nets are quite different and, mathematically, most varieties are quite simple.
Not to denigrate the techniques used here, but it is interesting as a computer vision researcher (face recognition in my case) how important good labelled training and testing sets are. Some of my successes over the years have come more from figuring out where to get good data than good computer vision techniques.
Not only large datasets, but the elasticity of a community to revisit old ideas from a new angle. Computer vision has been very good in this regard overall, though there were some dark times (https://www.facebook.com/yann.lecun/posts/10152034328862143).
Can you ever generate artificial training and test data?
For instance, suppose I would like to be able to take a photo of a chess game and turn that into a diagram of the position. I have no idea where I would get natural photos of thousands or millions of chess games to use for training and testing a chess piece identifying vision system.
Could I instead make 3D models of common chess set designs, and then generate and render photorealistic images of chess positions to use for training and test data for the vision system?
Probably, but it might not be able to work very well with different conditions, like a picture of a chessboard in central park vs a chessboard in a library. This could probably be solved with more renders, and even if it can't, the renders would probably be a good starting point
You still need some natural data, otherwise the network will probably overfit on regularities in your CG engine that don't exist in the real world. It might do that anyway, even with natural data.
There are two irresponsible sentences in the article, "In other words, it is not going to be long before machines significantly outperform humans in image recognition tasks." and "Or put another way, It is only a matter of time before your smartphone is better at recognizing the content of your pictures than you are.".
The irresponsibility is in seeing the existence of a technique to solve a more complex version of a toy problem (e.g. find the location in this photo that exactly matches pattern x), and inferring that the same technique, given more power, will exhibit super-human behaviour.
It's a reasonable claim that, in the task as described, the technique outperforms humans but that's about as exciting a claim as saying how much quicker the latest supercomputer is at arithmetic than a human. The point being that human object recognition isn't about labelling a scene with nouns, but somehow instinctively knowing the relevant objects for a situation and, if required, the appropriate situation-specific noun.
I therefore request the final sentence of the article be rewritten as "Or put another way, it may only be a matter of time, funding and motivation before your smartphone (with equivalent computing resources of the likes of Google Inc. and University of Tokyo) can ascribe one or more nouns from a set of size of order 1000 to regions of a photo, given that huge amounts of pre-processing and man hours have been dedicated to the pre-processing of that exact set of nouns to create a training set, and that you take the photo in similar lighting conditions as the training set, don't apply any filters, and that the objects referred to by the appropriate nouns are neither small nor thin, better than you.".
It is just extrapolating the error rate reduction over the last few years. Spam filters have become better than moderators in labeling spam only in the last decade or so.
When computers first started to become faster than mathematicians this was really a breakthrough. The same is happening now with object and speech recognition.
The computer successfully completes a task. That this is not how humans intuitively approach the same tasks is irrelevant to the accomplishment. What if the results were only half as good, but the system behaved more like humans - whom would that satisfy?
The state of the art is capable of detecting far more than 1000 objects, does not need labeled data, is robust to changes in light and does not care about the camera used. No preprocessing of the data is needed; features are generated automatically (preprocessing the target labels is a bit silly BTW).
So yes, in the very near future, algorithms will be better security guards than well... security guards.
My point is that extrapolating error rate reduction only applies to this tightly defined task.
You can only make claims about machines being better at "general" pattern recognition when we make progress on the issue that's stopped all Cognitivist General AI projects dead, which is that of situational awareness.
Arithmetic operations, spam detection and the task described in the article have a much smaller, and static, problem space than most human activities. You can demonstrably already knock up an automated-barrier style security guard. However, I'd argue that there does not exist an algorithm or appropriately weighted n-layer network that can handle all the ambiguity, countermeasures and ill-defined or contradictory situations that human security guards, or even just their object recognition capabilities, handle largely instinctively.
Do you think that computers are better at chess than humans? If yes, how does this relate to pattern recognition. If not, what makes someone or something better at chess, while still losing against a computer? Is that a beautiful move? Tactics? Irrational sacrifices to cause confusion?
Do you think that a machine's situational awareness can not achieve or surpass the level of a human? If not, what is holding the machines back?
Why do you think that instinct works better to create more rational, consistent and correct predictions? Are 100 security guards better than a single security guard at dealing with ambiguities? Do you think an algorithm to detect fights, drug dealers, and pickpockets from street cams can not exist? What if a NN could detect these cases faster and flag this to a human security guard for action/no-action.
It needed training on 1TB of labeled images in the first place. Arguably it can be used to transfer that knowledge to other tasks with a much smaller amount of labeled samples but still requires supervision.
Google trained a NN on unlabeled Youtube stills. It was able to detect/group/cluster pics of cats without ever seeing a label. This still needs supervision to teach the NN that whatever name it created for this cluster, us humans call this "cats".
If the error rate gets low enough, a NN could start labeling pics.
Finally, recent work has shown that running a dictionary through an image search engine can yield high quality labeled images automatically.
Aside: Thank you for contributing to sklearn. Really feel like I am standing on the shoulders of giants when I use that library.
Can any machine vision researchers recommend a survey/overview paper or two that would make a good technical introduction to these particular techniques?
What I see in the OP, via my Web browser Firefox 27.0.1, is just an ad I can't make go away in the middle of the screen. So, without some special effort, say, grab and parse the HTML, I can't read the content. Anyone else have this issue?
Thanks. I couldn't see a button for closing the window. Apparently somehow asking for higher magnification of the Firefox window didn't give higher magnification of the ad window; thus I didn't see the close button.
Thanks.
With your feedback, I tried again and did see the big X close button outside of the main part of the ad window. The thing did close.
[1] http://sklearn-theano.github.io/