This release is absolutely phenomenal. As liuliu mentioned in the release notes, most of the pre-existing implementations are either not that good, closed source, or built on closed datasets. This is, seemingly, the first good open source deep learning classifier out there! There is a whole cottage industry of commercial deep learning image classification services springing up (like http://www.ersatzlabs.com/), but this will surely make the technology even more accessible. I can't wait to use it in some of my side projects!
I think this is a great thing to put out into the community, and I hope it gains widespread adoption.
I do have a few things to comment on here:
The architectures themselves have never been closed source (hence the publications). Frankly, I think Alex didn't release his implementation because, until TITAN came along, you needed two GTX 580 GPUs to do distributed training. That is unique (and difficult) to set up and support! Just "putting the code out there" is not useful to many researchers, and the architecture is already in the paper for anyone to implement, extend, or modify. You will get bombarded with support requests even if you just say "this is provided as is", and many researchers don't want to deal with that, IMO.
Data is EVERYTHING for training these architectures, and the ImageNet dataset itself is still semi-closed and hard to get. You have to sign up and agree to their license terms if you want to download the original images for research. I think you can get web links to the original data, which you could then ostensibly crawl yourself, but that gets murky from a copyright perspective.
It will also take a very, very long time to wget all 1.2TB of ImageNet. My download took about 45 days, so keep that in mind!
Most people just do not have the hardware to train a network of this magnitude. CIFAR-10 is a much more appropriate dataset for consumer-grade hardware and non-lab-funded work, IMO. I think this is one of the reasons most advances on this front have simply released the trained network parameters as a "black-box" preprocessor - it is much more useful for general tasks on normal laptops and PCs! See http://fastml.com/yesterday-a-kaggler-today-a-kaggle-master-...
I personally used a simple wrapper around DeCAF plus an sklearn logistic regression to get a pretty good score (96%) in the Kaggle Cats vs. Dogs competition on a Lenovo T400, and I think this is where the commercial usefulness will go as well. The training + data needed to get the "magic numbers" is probably going to be the KFC/Coke trade-secret formula of data science in the coming years. Plowing through a bunch of floats in a particular format and then applying simple classifiers is good enough for many standard computer vision tasks, and could possibly be fit into a very small form factor. Think FPGAs...
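Roughly, that pipeline looks like the sketch below: run each image through a pre-trained net, keep an intermediate layer's activations as a feature vector, and fit a plain linear classifier on top. The feature extraction step is stubbed out with random vectors here (the real thing would wrap DeCAF or any other pre-trained network), so treat it as an illustration of the idea rather than working DeCAF code:

    # Features from a pre-trained convnet + a simple linear classifier.
    # The features here are random stand-ins; in practice they would come
    # from an intermediate layer of DeCAF or another pre-trained network.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 4096)).astype(np.float32)  # stand-in for 4096-d activations
    y = rng.integers(0, 2, size=2000)                      # cat = 0, dog = 1

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))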
What is the advantage of using this instead of OverFeat for natural images? I have been operating on the assumption that the strange license of OverFeat is not a problem, since you are using someone's software to feed in input, and generate a bunch of floating point values.
Are the floating point outputs of a program then "tainted" by the license of the code as well? If not, why not use the state of the art, which handles both localization AND classification?
I can see using ccv for custom datasets, though I am also assuming the 6GB TITAN requirement for GPU use is only for ImageNet, and not for a custom dataset? TITAN GPUs are pretty expensive...
I am very impressed by the result, as I have not been able to even approach OverFeat with my own work (pylearn2 + theano) thus far. Will definitely be experimenting with this in the future. Thanks!
From my understanding, you cannot deploy OverFeat in production because its license is only for research and evaluation purposes.
You still need a TITAN for any reasonable custom dataset (with non-trivial data). The current CPU implementation doesn't support dropout; therefore, you can only play with the CIFAR-10 dataset (./bin/cifar-10.c) if you don't have a GPU.
This is a preliminary implementation; I do plan to finish up the CPU training part to be on par with the GPU in subsequent releases.
So you are storing the entire dataset on the GPU then, to speed up processing? Or is there support for a "minibatch" mode that sends chunks at a time to be processed?
Or is this more an issue with the model size of the "Krizhevsky net" being on the order of 6GB?
Also, does this mean you managed to get a 2D convolutional kernel optimized for Kepler architectures? If so, that is awesome! Alex's code is still only optimized for Fermi architectures if I recall correctly.
Not really the whole dataset. For ImageNet, I have a mini_batch size of 256, and I need to allocate the whole network on the GPU for this mini_batch (which is 256 times the neurons in the network), plus the parameters (about 200MiB, times 3 for updates and momentum). Also, to speed up certain operations the data needs to be reshaped, and there is 500MiB of scratch space just for that purpose. In total, I am using close to 6GiB of GPU memory. You can probably get down to 4GiB of memory if the batch size is 128.
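To put rough numbers on it (this backs out the per-image activation size from the ~6GiB total, so it is a ballpark, not a measurement):

    MIB = 1024 ** 2
    GIB = 1024 ** 3

    params_with_state = 200 * MIB * 3        # parameters + updates + momentum
    scratch = 500 * MIB                      # reshaping scratch space
    total_at_256 = 6 * GIB                   # "close to 6GiB" at batch size 256

    activations_at_256 = total_at_256 - params_with_state - scratch
    per_image = activations_at_256 / 256     # roughly 20 MiB of activations per image

    total_at_128 = per_image * 128 + params_with_state + scratch
    print(total_at_128 / GIB)                # ~3.5 GiB, consistent with the 4GiB estimate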
The code is not optimized to the extreme. I optimized it to the point of being able to finish in a reasonable time (9 days for 100 epochs). The convolutional kernel is parametrized (with templates and some macro tricks, the forward and backward propagation convolution kernels are parametrized into hundreds of small functions), and the best parameters are chosen with a mini benchmark at the beginning of the training process.
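The selection logic is roughly like this Python sketch (the real thing is C/CUDA driven by templates and macros; the tiling/unroll parameters here are made up purely to illustrate the idea):

    import time

    def make_conv_variant(tile_w, tile_h, unroll):
        # Stand-in for one specialized convolution kernel instantiation;
        # the dummy workload just gives each variant something to time.
        def conv(batch):
            return sum(x * tile_w * tile_h * unroll for x in batch)
        return conv

    candidates = [make_conv_variant(tw, th, u)
                  for tw in (4, 8, 16) for th in (4, 8) for u in (1, 2, 4)]

    def pick_fastest(variants, sample_batch, repeats=5):
        # Time each variant on a representative mini-batch and keep the best.
        best, best_time = None, float("inf")
        for variant in variants:
            start = time.perf_counter()
            for _ in range(repeats):
                variant(sample_batch)
            elapsed = time.perf_counter() - start
            if elapsed < best_time:
                best, best_time = variant, elapsed
        return best

    fastest = pick_fastest(candidates, sample_batch=list(range(1024)))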
So... duh. You would need > 1.2 TB to fit all of ImageNet on the card :). Thanks for clarifying, and pardon my brain lapse! Also, thanks for putting this out there - if I get some time I may send some pull requests your way. Awesome stuff.
You definitely need a 1.2TB SSD to train on the complete ImageNet dataset. The data is loaded into GPU memory only one batch (256 images) at a time, but that loading will be the bottleneck if you use a rotational disk.
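The loading pattern is essentially the one below: keep only the current batch (plus a couple of prefetched ones) in host memory and overlap disk reads with GPU work. This is a generic Python illustration of the pattern, not the actual ccv loader (which is plain C):

    import queue
    import threading

    def prefetch_batches(batch_paths_iter, load_batch, depth=2):
        # Yield decoded batches while a worker thread loads `depth` batches
        # ahead, so slow disk I/O overlaps with training on the current batch.
        q = queue.Queue(maxsize=depth)
        sentinel = object()

        def worker():
            for paths in batch_paths_iter:
                q.put(load_batch(paths))   # e.g. read + decode 256 JPEGs
            q.put(sentinel)

        threading.Thread(target=worker, daemon=True).start()
        while True:
            batch = q.get()
            if batch is sentinel:
                break
            yield batch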
As a user of and contributor to Caffe [1], I have to take the opportunity to plug it here. Like the CCV classifier linked, Caffe is fully open-source [2], has a downloadable state-of-the-art model pre-trained on ImageNet [3], and scripts/documentation that make it very easy to compute features using our pre-trained model or other models [4].
Unlike the linked CCV release (unless I'm misinformed -- haven't actually tried it, please correct me if I say anything inaccurate), Caffe supports completely customizable architectures via a configuration language, fully supports training [5], finetuning [6], and inference (feature extraction/classification) in these customizable architectures, and seamlessly runs on both CPU and GPU.
Caffe is also very fast; twice as fast at CPU feature computation as its predecessor DeCAF, and faster than cuda-convnet at training/testing ImageNet architectures on a Titan/K40 GPU.
The linked CCV release does mention Caffe, but quickly dismisses it due to the license. It's true that our pre-trained model [3] is licensed only for non-commercial use, but ALL of the Caffe code is BSD-licensed, including the exact script we used to train said model. So if you're a commercial entity, using Caffe for feature extraction/classification from a state-of-the-art network is a matter of purchasing a $1000 GPU (NVIDIA Titan -- I'm assuming you own a computer), downloading the ImageNet dataset, and waiting about a week for training to converge. This will buy you the ability to adapt the classifier to YOUR visual classification problem by finetuning [6], rather than being stuck with the particular 1000 categories the pre-trained model knows about.
This is a preliminary implementation, but it is a complete one that includes both training and testing code. The big difference is that ccv is a general computer vision library, while Caffe is an artificial neural network library. That does mean quite a few different ways of approaching things; for example, ccv's implementation does allow you to specify the network topology, but it doesn't have an implementation of a local non-weight-sharing layer (because CIFAR-10 and ImageNet don't need that type of layer).
You can also chop off the last fully connected layer and train an SVM on top of it with ccv; I actually plan to do exactly what you guys did with that and train on the VOC 2012 dataset.
All in all, ccv 0.6 is a preliminary implementation of a convnet, but it is important for a library that claims to be "modern" to contain such an implementation. Providing the pre-trained data model with a liberal license (so that you can fine-tune on top of it for your own classification problem) is also aligned with ccv's goals.
I hadn't seen the detailed documentation - thanks so much for the acknowledgments there!
And thanks for correcting me about CCV's support for custom architectures and training -- I'd just assumed that it wasn't supported since it wasn't mentioned in the post, but I guess this was more of a marketing decision as most users are probably just interested in feature extraction/classification from the pretrained net. :) I would argue that GPU support is pretty necessary for training modern network architectures a la Krizhevsky to be remotely practical, though.
I apologize if I came off as overly competitive or derisive, this is obviously very nice work and it seems like an attractive option for many users. Always happy to see deep learning made more accessible and open!
On a sort-of-related note: anyone know of a good FOSS library for making a reverse image search? My current method is just to use findimgdupes, but this is rather slow and hard to script around. (Ideally, my desktop with my main private image storage would have a simple web interface that works like tineye.)
You are probably looking for image similarity tools. I've been playing with the IMGSeek tools [1] for a few months in my spare time. My hobby project is to create a reverse image search engine.
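If you just want something simple to script around, a perceptual hash gets you surprisingly far: hash every image once, store the hashes, and at query time return images within a small Hamming distance. Here is a difference-hash (dHash) sketch in Python; this is a generic technique, not how imgSeek itself works:

    from PIL import Image

    def dhash(path, hash_size=8):
        # 64-bit difference hash: compare each pixel to its right neighbour
        # on a downscaled grayscale version of the image.
        img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
        pixels = list(img.getdata())
        bits = 0
        for row in range(hash_size):
            for col in range(hash_size):
                left = pixels[row * (hash_size + 1) + col]
                right = pixels[row * (hash_size + 1) + col + 1]
                bits = (bits << 1) | (left > right)
        return bits

    def hamming(a, b):
        return bin(a ^ b).count("1")

    # Two images are likely near-duplicates if hamming(dhash(a), dhash(b)) <= ~10.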
It depends - some algorithms exploit correlations between pixels to reduce the computational load. This is bad if the Z channel is not strongly correlated with the others (the RGB channels are very strongly correlated in natural images). Since depth usually shouldn't be correlated with color, this might cause some issues. Some experiments along these lines were done in "Learning Feature Representations with K-means" by A. Coates and A. Ng (http://www.stanford.edu/~acoates/papers/coatesng_nntot2012.p...); a rough sketch of that style of pipeline is below.
Give it a try, but also look at the underlying assumptions of your algorithm if it performs poorly.
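For reference, this is roughly the shape of the k-means feature-learning pipeline from that paper (the patch data and shapes here are made up; the paper normalizes and ZCA-whitens patches, and plain PCA whitening stands in for that here):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def learn_kmeans_dictionary(patches, n_centroids=100):
        # patches: (n_patches, patch_dim), e.g. flattened 8x8 patches with
        # however many channels you have (RGB, RGBD, ...).
        patches = patches - patches.mean(axis=1, keepdims=True)  # per-patch mean removal
        whitener = PCA(whiten=True).fit(patches)                 # decorrelates dimensions/channels
        km = KMeans(n_clusters=n_centroids, n_init=10).fit(whitener.transform(patches))
        return whitener, km.cluster_centers_

    # Stand-in data: 5000 random "patches" of 8*8*4 = 256 dimensions.
    rng = np.random.default_rng(0)
    whitener, centroids = learn_kmeans_dictionary(rng.normal(size=(5000, 256)))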