If one instead used an RNN (recurrent neural network), particularly one with LSTM cells, it could take as input the sequence of all photos from a business, and the output of the model would then be the sequence of labels, similar to how translation models work.
Of course, another problem could then be that the ordering of the labels is unrelated to the order of the photos for a business, but there is probably some way to handle this: either data synthesis (training on multiple permutations of the data so the model learns to ignore label ordering) or sorting the labels in some consistent fashion the model can learn.
That's really interesting. I have another 2000-ish-word part two* I'm hoping to put out around the end of the week with some ideas for other things to try. Most of my ideas for next steps revolved around tweaking the data loader to output composite images, or tweaking the loss function to be more forgiving of false negatives. This is an alternative I hadn't considered.
You would still need 34+ layers of ResNet to detect the features, I think, but perhaps the fully connected classification layers could be replaced with an RNN. You would probably ignore all of the outputs aside from the last one, so that part of the ordering wouldn't matter. Input ordering might have an effect, though.
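A minimal sketch of that idea in PyTorch, assuming each photo has already been reduced to a feature vector (the 512-d feature dimension, hidden size, and label count are made-up placeholders, and `PhotoSequenceLabeler` is a hypothetical name):

```python
import torch
import torch.nn as nn

# Hypothetical sketch: each business is a fixed-order sequence of per-photo
# feature vectors, and the model predicts one multi-label logit vector for
# the whole business, reading only the LSTM's last time step.
class PhotoSequenceLabeler(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256, n_labels=9):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_labels)

    def forward(self, feats):            # feats: (batch, n_photos, feat_dim)
        out, _ = self.lstm(feats)        # out: (batch, n_photos, hidden_dim)
        return self.head(out[:, -1])     # keep only the last step's output

model = PhotoSequenceLabeler()
logits = model(torch.randn(4, 7, 512))   # 4 businesses, 7 photos each
print(logits.shape)                      # torch.Size([4, 9])
```

Because only the last output feeds the classifier, label ordering drops out of the problem entirely; input (photo) ordering still matters, as noted above.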
Assuming you don't object, I'm going to add this to the "extension ideas" at the end of my part 2 (with a hat tip, of course).
Of course the next problem would be figuring out how to actually accomplish this with the fast.ai library...
* I was originally on the fence about just publishing the entire 5000-ish words as a single post, but Ulysses actually started to have trouble with it and that pushed me over the edge.
I would instead try separate inputs for each image, followed by an aggregation layer. I think a maximum-pooling approach makes the most sense for most of the labels: for certain labels, a single photo seems to carry all the relevant information. When there are five interior images and one showing a patio, you want the latter to determine the label "outdoor dining".
For other characteristics, such as "elegant" (I'm making up these labels because I read the post yesterday and don't remember the specifics), average pooling might be a better fit.
In fact, a simple concatenation might work well, allowing the model to learn these things on its own. With differing numbers of images per restaurant that wouldn't work, though. Instead, one might try both MAX and AVG pooling, followed by concatenation. Or, if available, play around with other types of averaging, such as trimming outliers.
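The MAX-plus-AVG variant could look something like this, assuming per-photo feature vectors are already available (dimensions and the `PoolAggregator` name are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the aggregation idea: pool a variable number of
# per-photo feature vectors per business with both MAX and AVG, then
# concatenate the two pooled vectors and classify.
class PoolAggregator(nn.Module):
    def __init__(self, feat_dim=512, n_labels=9):
        super().__init__()
        self.head = nn.Linear(feat_dim * 2, n_labels)  # max + avg concatenated

    def forward(self, feats):                  # feats: (n_photos, feat_dim)
        mx = feats.max(dim=0).values           # strongest evidence per feature
        avg = feats.mean(dim=0)                # overall tendency per feature
        return self.head(torch.cat([mx, avg]))

agg = PoolAggregator()
out = agg(torch.randn(7, 512))   # one business with 7 photos
print(out.shape)                 # torch.Size([9])
```

Since both pooling operations collapse the photo dimension, this handles any number of images per restaurant, which plain concatenation would not.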
On another point: I was slightly put off by the criticism of the data structure. That structure follows directly from the problem being solved and isn't some oversight making life more complicated than necessary. To work with such data, I tend to throw it into a Rails application and export whatever format I need (although I'd use a Python solution if I were more familiar with them).
Yes, you would still need a good image classifier like ResNet-34 to actually extract usable features, which could then be used as input to an RNN model.
An important point if you actually try to implement such a model: don't train the image classifier together with the RNN. That would probably make training unnecessarily slow, with the end result of either not making a dent in the weights of the ResNet or simply overfitting the relatively small and unsuitable training set that Yelp provides for this competition. Instead, use the ResNet classifier in a preprocessing step to transform the images into the activations of a later layer of the ResNet, and use those activations as the input to the RNN model. Note that when training the RNN you should persist these activations, as you would otherwise run forward propagation through the ResNet multiple times for each image unnecessarily. In this manner you get the benefit of the ResNet, which is great at extracting features from images, and the ability to train the RNN model much more quickly.