One of the fantastic qualities of embedding-based language models is that they provide a view on a semantic space that can be used quantitatively in almost any downstream language task. As a conversational intelligence company, Frame has many products that are enhanced by having a high-quality, domain-specific language model to build on: tagging, sentiment, topic extraction, keywords, summarization, etc. Best of all, these products can be iterated on in parallel! Improvements in a language model’s representation of a body of text should improve all downstream tasks without modification.
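To make that concrete, here’s a minimal sketch of the shared-encoder pattern in PyTorch. Everything in it (the toy encoder, the head names and sizes) is hypothetical, not Frame’s actual stack; the point is just that every head consumes the same embedding, so a better encoder lifts all of them without any changes to the heads themselves.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Stand-in for a pretrained language model that maps text to a vector."""
    def __init__(self, vocab_size=30000, dim=400):
        super().__init__()
        # A toy encoder: mean-pooled token embeddings. A real one would be
        # an LSTM or transformer language model.
        self.embed = nn.EmbeddingBag(vocab_size, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)  # (batch, dim) document embedding

encoder = TextEncoder()

# Independent downstream heads all share the same embedding; improving the
# encoder improves every head, with no modification to the heads.
sentiment_head = nn.Linear(400, 2)   # positive / negative
topic_head     = nn.Linear(400, 50)  # 50 hypothetical topics

tokens = torch.randint(0, 30000, (8, 120))  # batch of 8 docs, 120 tokens each
z = encoder(tokens)
sentiment_logits = sentiment_head(z)
topic_logits = topic_head(z)
```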
The same thing is true for computer vision models. A core deep network is typically trained either as a dual embedding against associated text on a search-ranking objective, or to predict tags or labels. The resulting network may be of limited use on its original training task, but extracting the activations of some deep layer turns it into an excellent embedding model.
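As a hedged illustration (not Frame’s actual pipeline), here is how that extraction typically looks with a stock torchvision classifier, assuming a recent torchvision: drop the final classification layer and treat the penultimate activations as the embedding.

```python
import torch
import torchvision.models as models

# Load a ResNet-50 pretrained as an ImageNet classifier.
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # drop the classification head
backbone.eval()

with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)  # a batch of preprocessed images
    embeddings = backbone(images)          # (4, 2048) deep-layer features
```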
You then start automatically encoding your entire image collection, and every incoming image, into that embedding space, and rely on it as a lingua franca on which to base all sorts of companion models: object detection, face recognition, gender/age/ethnicity prediction, spam detection, aesthetic/composition appraisal, caption generation, style transfer, and so on.
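A sketch of what one of those companion models might look like, assuming the embeddings are precomputed and stored; the spam labels and dimensions here are made up for illustration, but this is why the shared embedding pays off: each new model is a small, cheap layer on top.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend these were produced once by the shared embedding model above
# and stored alongside the image collection.
X = np.random.randn(1000, 2048)          # stored image embeddings
y = np.random.randint(0, 2, size=1000)   # hypothetical spam labels

# A companion model is then just a lightweight classifier over embeddings.
spam_clf = LogisticRegression(max_iter=1000).fit(X, y)
is_spam = spam_clf.predict(X[:5])
```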
In this work we didn’t explore classification performance characteristics. I suspect the misclassifications at lower levels of domain data would revolve around the ways language usage in reviews differs from common English. “Blockbuster” may have generally negative or neutral sentiment in a Wikipedia-based language model, perhaps most often referring to the failed rental chain; in the context of movie reviews, “blockbuster” is almost universally positive.
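One way to probe that hunch, sketched below with gensim: compare what “blockbuster” sits near in a general-purpose vector space versus one trained on reviews. The file paths are hypothetical stand-ins for wherever the two saved models live; only the measurement itself is the point.

```python
from gensim.models import KeyedVectors

# Hypothetical paths to two previously saved embedding spaces.
general = KeyedVectors.load("wiki_vectors.kv")
reviews = KeyedVectors.load("movie_review_vectors.kv")

for name, kv in [("general", general), ("reviews", reviews)]:
    # Nearest neighbors reveal what sense of the word each space encodes.
    print(name, kv.most_similar("blockbuster", topn=5))
    print(name, "similarity to 'hit':", kv.similarity("blockbuster", "hit"))
```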
I think in this case the alignment between the unlabeled domain data and the language task supports a convergence in language-task performance. One argument for continuing to prefer the ULM+domain model is that it is likely more generally capable if you stay in the same domain but switch to a task less directly related to your unlabeled data. I haven’t seen any research that directly speaks to that intuition, so it’s a good area for further study.
Yes, great insight! The choice to focus on sentiment here was mainly to align with fast.ai’s original research, hopefully maximizing the generalizability and accessibility of the results.
Internally we have used these results to improve a broad set of language tasks; hopefully we’ll be able to publish on those in the coming months as well.