So is this 98% accuracy on stained images from the dataset? With my limited knowledge, I thought CNNs were data hungry. 98% on ~400 images seems fairly impressive, but I wonder how well it will perform on unseen images.
Tangentially, there seems to be little news on new open-source datasets for anything. I saw Google release a few some time back. Do research companies only care about advancing techniques so they can apply all that public research to their private datasets? Or is there genuinely no way to create these datasets? For example, is it financially/technologically impossible to make an "ImageNet" of cells? (Or maybe a lot of datasets are coming out and I am just unaware of them.)
It is very expensive and time consuming to create a vast, properly labeled cell-image dataset. In general you need more than two pathologists to confirm the cell types (they sometimes disagree, and you usually take the majority vote); that almost never happens with cat images ;) There is also a multitude of acquisition modalities for image capture in microscopy, different stains for the same cell types, etc., and simple RGB cameras are actually considered fairly low-tech for this kind of operation.
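Concretely, the aggregation step is just a per-cell majority vote across annotators. A minimal sketch in Python (the pathologist label lists and cell types here are hypothetical):

    from collections import Counter

    def majority_vote(labels):
        """Return the most common label; ties fall to the first seen."""
        return Counter(labels).most_common(1)[0][0]

    # Three pathologists labeling the same three cells:
    annotations = [
        ["lymphocyte", "lymphocyte", "monocyte"],   # pathologist A
        ["lymphocyte", "neutrophil", "monocyte"],   # pathologist B
        ["lymphocyte", "neutrophil", "basophil"],   # pathologist C
    ]
    consensus = [majority_vote(votes) for votes in zip(*annotations)]
    print(consensus)  # ['lymphocyte', 'neutrophil', 'monocyte']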
PS: I am no deep learning expert (I use more 'traditional' ML), but as you pointed out, ~400 images for these techniques can be a recipe for overfitting disaster.
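For what it's worth, the quickest sanity check for that is comparing training accuracy against held-out accuracy. A minimal scikit-learn sketch, using a toy digits dataset truncated to ~400 samples as a stand-in (the classifier choice is arbitrary):

    from sklearn.datasets import load_digits
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    # Mimic a ~400-image dataset
    X, y = load_digits(return_X_y=True)
    X, y = X[:400], y[:400]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    print("train acc:", clf.score(X_tr, y_tr))  # often near 1.0 on tiny data
    print("test acc: ", clf.score(X_te, y_te))  # the number that actually matters

A large gap between the two numbers is the overfitting signature.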
Would it be worthwhile to train on something like images of rat cells and then transfer learn to human cells? The author of the article tried it with ImageNet and it didn't work out, but I wonder about the viability of that technique with non-human cells.
Well, the same principle as with VGGNet applies here too. If the rat images differ 'significantly' (whatever that means; I am a research engineer, not a pathologist), then you will have nothing to transfer. Maybe it would be more fruitful to try to transfer via a huge amount of artificially generated cell images (there are toolboxes for that, and it's not just linear transformations like rotation, etc.), blended with some subset of VGGNet (or similar) trained only on 'circular' objects.
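A minimal sketch of that frozen-base transfer setup in Keras, assuming 224x224 RGB inputs and a hypothetical four cell types; swapping the ImageNet weights for weights pretrained on synthetic/'circular' images would be the experiment:

    from tensorflow.keras import layers, models
    from tensorflow.keras.applications import VGG16

    # Pretrained convolutional base, frozen so only the new head trains
    base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
    base.trainable = False

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),                    # helps on tiny datasets
        layers.Dense(4, activation="softmax"),  # hypothetical 4 cell types
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    # model.fit(train_images, train_labels, validation_data=(val_images, val_labels))

If the low-level filters transfer at all, this trains with far fewer labeled images than training VGGNet from scratch.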
As an off-topic aside, there has been very suspicious behavior with this submission. It has been submitted at least 6 times in 3 days, mostly by the same people (https://news.ycombinator.com/from?site=athelas.com), and I know the OP of this submission submitted it earlier today and deleted it after it didn't get upvotes.
Don't do that. At the least, it isn't worth it for just a blog post.
Yes, deleting stories like that isn't allowed (see https://news.ycombinator.com/newsfaq.html) and we penalize accounts that do it. Deletion is for things that shouldn't have been submitted in the first place.
This particular submission is one that a moderator saw (independently) and put in the second-chance queue described at https://news.ycombinator.com/item?id=11662380. We might not have done that if we'd seen the previous ones, but as long as the community is interested in the story it can stay up now.