I'd love to hear more about what you tried specifically. I'm considering doing this myself, and I was thinking of building a very large labeled dataset of 3d rendered images using the LDraw parts library and training on that. I could include hundreds of images per part by using different viewing angles, zoom levels, focus, etc in the rendering process. Did you try anything like that?
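For what it's worth, the viewpoint sweep described there is easy to enumerate up front. A minimal stdlib-only sketch (the function name and parameter ranges are my own assumptions, not anything from an actual pipeline) that produces the (azimuth, elevation, zoom) settings a renderer would consume for each LDraw part:

```python
import itertools

def camera_sweep(n_azimuths=12, n_elevations=4, zooms=(0.8, 1.0, 1.2)):
    """Enumerate (azimuth, elevation, zoom) camera settings so each
    part can be rendered from many viewpoints."""
    azimuths = [360.0 * i / n_azimuths for i in range(n_azimuths)]
    # keep the camera between 15 and 75 degrees above the turntable
    elevations = [15.0 + 60.0 * j / (n_elevations - 1)
                  for j in range(n_elevations)]
    return list(itertools.product(azimuths, elevations, zooms))

poses = camera_sweep()
print(len(poses))  # 12 * 4 * 3 = 144 rendered views per part
```

At 144 views per part, a 1000-part library already yields ~144K labeled images before any lighting or color variation is added.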
After endless messing around I finally bit the bullet and trained a neural net. It went from 0 to 100 in a few weeks and is rapidly becoming more usable now.
The feature detection code may get a second life though: as a meta-data vector to be embedded into the net. But only if it is really necessary.
I'm quite curious though if you can get your method to work, especially for the parts that are very rare and rare colors.
I was assuming that at minimum I'd need to do a lot of filtering in order to get the camera images and renders into a state where they are similar enough to work for training.
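One common way to narrow that gap (my assumption, not something confirmed in this thread) is to go the other direction: degrade the clean renders until they resemble camera frames, rather than cleaning up the camera images. A stdlib-only sketch operating on grayscale pixel grids:

```python
import random

def box_blur(img, radius=1):
    """Cheap box blur over a grayscale image (list of rows of ints),
    approximating the lens softness of a real camera frame."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - radius), min(h, y + radius + 1))
                    for i in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) // len(vals)
    return out

def add_noise(img, sigma=8):
    """Add Gaussian sensor-style noise so renders stop being 'too clean',
    clamping each pixel back into the 0-255 range."""
    return [[min(255, max(0, p + int(random.gauss(0, sigma)))) for p in row]
            for row in img]
```

Applying a random blur radius and noise level per rendered image is a crude form of domain randomization; a real pipeline would likely also jitter lighting, color balance, and background.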
Any chance that you'll be releasing source code for this project and/or your labeled dataset?
Yes, but not yet. It needs to get a lot better before I'm going to stamp my name on it as a release. Right now it is rather embarrassing from a code quality point of view: it has been ripped apart and put back together several times now, and every time it gets a lot better, but we're not there yet.
Just so I understand the process correctly: did you manually sort some pieces to get a labeled training set, feed those through the machine, train the NN with that, then manually correct the errors when sorting unknown pieces, add all those pictures to the same training set, and finally run the full training again? How many labeled images do you need to start getting acceptable performance? Are you training the NN continuously with every new image, or from scratch with an increasing data set?
Do you think a stereo camera would improve the classification in a meaningful way, or maybe a second camera from a different angle?
Yes, but that cycle repeats every day. So the training never really stops, it just runs at night and the machine runs during the day. Today it sorted close to 10K parts and those images will now be added to the training set and then I'll start the training overnight so tomorrow morning my error rate should be much better than it was today and so on.
> How many labeled images do you need to start getting acceptable performance?
Good question! Answer: I don't really know, but judging by how fast the error rate is improving, somewhere between 100 and 200 per 'class', so that will be 200K images or so by the time it is done with the 1000 most commonly found parts.
> Are you training the NN continuously with every new image, or from scratch with an increasing data set?
From scratch with every expanded set. I suspect that's the better way, but I have no proof. My intuition is that it is hard to make a neural net learn something entirely new that it has not seen before, and every day totally new stuff gets added. So I re-train all the way from noise.
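That daily cycle can be sketched roughly as follows (all names here are hypothetical; `train_from_scratch` stands in for whatever training routine is actually used):

```python
def nightly_retrain(dataset, todays_images, train_from_scratch):
    """Grow the labeled set with today's corrected sorter output,
    then retrain from random initialization ("from noise") rather
    than fine-tuning yesterday's weights."""
    dataset = dataset + todays_images    # keep every earlier day's images too
    model = train_from_scratch(dataset)  # fresh weights every night
    return model, dataset
```

The trade-off: retraining from scratch costs a full training run per night, but sidesteps the difficulty of getting an already-trained net to absorb entirely new classes.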
> Do you think a stereo camera would improve the classification in a meaningful way, or maybe a second camera from a different angle?
You're getting close to the secret sauce :)
I guess my lack of knowledge in the field shines through. Continuous learning is apparently under active research at the moment, and this blog post about it is less than two months old, so your intuition was right.
If I were to guess the secret sauce I'd say that a mirror might be involved. Is depth information not worth the trouble for these kinds of classification problems?
You might be right there :)
> Is depth information not worth the trouble for these kinds of classification problems?
Yes, it would be, but there's much more to it than that. Also keep in mind that some parts are almost transparent, and no matter what background color you come up with, there will be a bunch of LEGO parts that match it.
Colored strobes might also help separate out pieces of different colors, although I expect that would be overkill.