
Nick (the author) reflects on this: "The data needs to be merged into a format which can be used to train a neural network. Solving this leads to the second, much bigger issue: many of the resulting label to image mappings are inappropriate...".

If one instead uses an RNN (recurrent neural network), particularly an LSTM, it can take the sequence of all photos from a business as input and produce the sequence of labels as output, similar to how translation models work. Of course, another problem could then be that the ordering of the labels is unrelated to the order of the photos for a business, but there is probably some way to handle this, either by data synthesis (multiple permutations of the training data, so the model learns to ignore label ordering) or by sorting the labels in a consistent fashion that the model can learn.

Hi, Nick here.

That's really interesting. I have another 2000-ish word part two* that I'm hoping to put out around the end of the week, with some ideas for other things to try. Most of my ideas for next steps revolved around tweaking the data loader to output composite images, or tweaking the loss function to be more forgiving of false negatives. This is an alternative I hadn't considered.

You would still need 34+ layers of ResNet to detect the features, I think, but perhaps the fully connected classification layers could be replaced with an RNN. You would probably ignore all of the outputs aside from the last one, so that part of the ordering wouldn't matter. Input ordering might have an effect, though.
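To make the "ignore everything but the last output" part concrete, here is a rough sketch. It uses a plain (Elman) RNN in NumPy rather than an LSTM, the weights are random and untrained, and the sizes (8 features per photo, 4 labels) are made up for illustration. The per-photo feature vectors stand in for whatever the ResNet would produce.

```python
import numpy as np

def rnn_last_output(features, w_in, w_rec, w_out, b_out):
    """Run a plain RNN over per-photo features and read out only the final step."""
    hidden = np.zeros(w_rec.shape[0])
    for x in features:  # one recurrent step per photo
        hidden = np.tanh(w_in @ x + w_rec @ hidden)
    # Only the final hidden state is projected to labels;
    # outputs for the intermediate steps are never computed.
    logits = w_out @ hidden + b_out
    return 1.0 / (1.0 + np.exp(-logits))  # independent sigmoid per label

rng = np.random.default_rng(0)
n_features, n_hidden, n_labels = 8, 16, 4  # hypothetical sizes
w_in = rng.normal(scale=0.1, size=(n_hidden, n_features))
w_rec = rng.normal(scale=0.1, size=(n_hidden, n_hidden))
w_out = rng.normal(scale=0.1, size=(n_labels, n_hidden))
b_out = np.zeros(n_labels)

photos = rng.normal(size=(5, n_features))  # stand-in for ResNet features of 5 photos
probs = rnn_last_output(photos, w_in, w_rec, w_out, b_out)
```

Because the result is one vector of per-label probabilities, there is no output ordering to worry about. The order in which the photos are fed in can still change the final hidden state.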

Assuming you don't object, I'm going to add this to the "extension ideas" at the end of my part 2 (with a hat tip, of course).

Of course the next problem would be figuring out how to actually accomplish this with the fast.ai library...

* I was originally on the fence about just publishing the entire 5000-ish words as a single post, but Ulysses actually started to have trouble with it and that pushed me over the edge.

The LSTM idea is good, but the ordering problem stems from the LSTM's assumption that the input is ordered in some way (e.g. words in a text, or frames in a video).

I would instead try separate inputs for each image, followed by an aggregation layer. I think a maximum-pooling approach makes most sense for most of the labels: there seem to be some photos for each restaurant providing all the information regarding certain labels. When there are five interior images, and one showing a patio, you want the latter to determine the label "outdoor dining".

For other characteristics, such as "elegant" (I'm making up these labels because I read the post yesterday and don't remember the specifics) average pooling might be a better fit.

In fact, a simple concatenation might work well, allowing the model to learn these things on its own. With differing numbers of images per restaurant that wouldn't work, though. Instead, one might try both MAX and AVG pooling, followed by concatenation. Or, if available, play around with other types of averaging, such as trimming outliers.
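A minimal sketch of the MAX + AVG idea, assuming each photo has already been turned into a fixed-length feature vector. The function name and the array shapes are my own, not anything from the post.

```python
import numpy as np

def pool_business_features(image_features: np.ndarray) -> np.ndarray:
    """Collapse an (n_images, n_features) array into one fixed-length vector.

    Max pooling keeps the strongest evidence from any single photo;
    average pooling summarizes the overall impression across photos.
    """
    if image_features.ndim != 2 or image_features.shape[0] == 0:
        raise ValueError("expected a non-empty (n_images, n_features) array")
    max_pooled = image_features.max(axis=0)
    avg_pooled = image_features.mean(axis=0)
    return np.concatenate([max_pooled, avg_pooled])

# Two restaurants with different numbers of photos, 4 made-up features each.
few_photos = np.random.rand(3, 4)
many_photos = np.random.rand(11, 4)
pooled_few = pool_business_features(few_photos)
pooled_many = pool_business_features(many_photos)
```

Both results have length 8 regardless of how many photos went in, so a classifier on top sees a fixed-size input and can learn per label whether the max half or the average half is more useful.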

On another point: I was slightly put off by the criticism of the data structure. That structure follows directly from the problem being solved and isn't some oversight making life more complicated than necessary. To work with such data, I tend to throw it into a Rails application and export whatever format I need (although I'd use a Python solution if I were more familiar with the language).

Actually, the LSTM model doesn't assume anything about ordering itself, but it could overfit to the ordering, depending on how much training data is available, and in this case we know that the order of the input and the order of the output are not directly related. One way to make the LSTM model more robust and less prone to overfitting would be to make a number of copies of each training example, where each copy is mutated by changing the order of the images and tags.
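Roughly what I mean by the permuted copies, as a sketch. The function name, the photo ids and the label values are all made up for illustration.

```python
import random

def permuted_copies(image_ids, labels, n_copies, seed=0):
    """Return n_copies of one training example with the images and the
    labels shuffled independently, so no fixed ordering can be memorized."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        shuffled_images = list(image_ids)
        shuffled_labels = list(labels)
        rng.shuffle(shuffled_images)
        rng.shuffle(shuffled_labels)
        copies.append((shuffled_images, shuffled_labels))
    return copies

# One business with three photos and three labels (hypothetical values).
augmented = permuted_copies(["photo_a", "photo_b", "photo_c"], [1, 3, 5], n_copies=4)
```

Every copy contains the same photos and the same labels, only in a different order, so the only thing that varies between copies is the thing the model should learn to ignore.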

Could you clarify what you mean by "the criticism of the data structure"? I don't entirely follow your meaning, but I suspect there might be something in the post I need to update/clarify.

Glad to inspire; go ahead and use this idea. By the way, my full name is Kaveh Hadjari.

Yes, you would still need a good image classifier like ResNet 34 to extract usable features, which could then be used as input to an RNN model.

An important point if you try to actually implement such a model: don't train the image classifier together with the RNN. That would probably make the training unnecessarily slow, with the end result of either not making a dent in the weights of the ResNet or simply overfitting the relatively small and inappropriate training set that Yelp provides for this competition. Instead, use the ResNet classifier in a preprocessing step to transform the images into the activations of a later layer in the ResNet, and use those as the input for the RNN model. Note that when training the RNN you should persist these activations, as you would otherwise need to run forward prop through the ResNet multiple times for each image. In this manner you get the benefit of the ResNet, which is great at extracting features from images, and the ability to train the RNN model much more quickly.
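A sketch of the "compute once, reuse every epoch" part. The extractor here is a fake stand-in for a frozen ResNet forward pass, and the cache lives in memory; for a real dataset you would more likely write the activations to disk. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Features = List[float]

class ActivationCache:
    """Run the frozen feature extractor once per image and reuse the result."""

    def __init__(self, extract: Callable[[str], Features]):
        self._extract = extract
        self._cache: Dict[str, Features] = {}

    def get(self, image_id: str) -> Features:
        if image_id not in self._cache:
            self._cache[image_id] = self._extract(image_id)
        return self._cache[image_id]

    def sequence_for(self, image_ids: Sequence[str]) -> List[Features]:
        """Feature sequence for one business, ready to feed to the RNN."""
        return [self.get(image_id) for image_id in image_ids]

# Stand-in for the ResNet: records each call so the savings are visible.
calls = []

def fake_resnet_features(image_id: str) -> Features:
    calls.append(image_id)
    return [float(len(image_id)), 1.0]

cache = ActivationCache(fake_resnet_features)
for _epoch in range(3):
    cache.sequence_for(["photo_1", "photo_2"])
```

After three "epochs" the extractor has only been called twice, once per photo, which is the whole point: the expensive forward pass is paid once and the RNN trains on the stored activations.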
