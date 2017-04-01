They start with:
> Our L1-regularized model matches multichannel CNN performance with only 11 labeled examples, and state-of-the-art CT-LSTM Ensembles with 232 examples.
Hmm, that sounds pretty impressive. But then later you read:
> We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs
Wait, what? How did "232 examples" transform into "82 million"??
OK, I get it: they pretrained the network on the 82M reviews, and then trained the last layer to do the sentiment analysis. But you can't honestly claim that you did great with just 232 examples!
The main interesting thing is that none of the Amazon data was labeled, while the 232 labeled examples were.
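To make the "pretrain on unlabeled data, then fit a small classifier on 232 labeled examples" split concrete, here is a minimal sketch of that last step only. It is not OpenAI's code. `make_features` is a hypothetical stand-in for the frozen mLSTM features: it fabricates synthetic vectors in which feature 0 tracks the label and the rest is noise. The 20-dimensional size, the penalty strength, and the proximal-gradient solver are also my assumptions, picked to keep the example dependency-free.

```python
import math
import random

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink w toward zero by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_features(n, dim, rng):
    """Hypothetical stand-in for frozen features from a pretrained model.
    Feature 0 carries the sentiment signal; all other features are noise."""
    X, y = [], []
    for _ in range(n):
        label = rng.choice([0, 1])
        row = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        row[0] = (2.0 if label else -2.0) + rng.gauss(0.0, 0.5)
        X.append(row)
        y.append(label)
    return X, y

def fit_l1_logreg(X, y, lam=0.05, lr=0.1, steps=500):
    """L1-regularized logistic regression via proximal gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * dim, 0.0
        for row, label in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, row)) + b) - label
            for j in range(dim):
                grad_w[j] += err * row[j] / n
            grad_b += err / n
        # Gradient step on the log loss, then the L1 shrinkage step.
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the bias is left unpenalized
    return w, b

def predict(w, b, row):
    return 1 if sum(wi * xi for wi, xi in zip(w, row)) + b > 0 else 0

rng = random.Random(0)
X_train, y_train = make_features(232, 20, rng)  # the small labeled set
X_test, y_test = make_features(500, 20, rng)
w, b = fit_l1_logreg(X_train, y_train)
accuracy = sum(predict(w, b, r) == t for r, t in zip(X_test, y_test)) / len(y_test)
nonzero = sum(1 for wj in w if wj != 0.0)
print(f"test accuracy: {accuracy:.3f}, nonzero weights: {nonzero} of {len(w)}")
```

The point of the toy: when the features already separate the classes, a handful of labels is enough to find the useful direction, and the L1 penalty pushes the weights on uninformative features toward zero. All the expensive work is in producing the features, which is exactly what the 82M unlabeled reviews paid for.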
As noted in my approach, I think there is more to char-by-char architecture than just tossing a super large LSTM at it, both in terms of performance and accuracy. It's an unexplored field with good potential payoffs. In light of the OpenAI article, I'll certainly take a look at the neuron data if possible.
(As an aside, since it was not mentioned in the original article, I am assuming this is the Amazon review dataset used: http://jmcauley.ucsd.edu/data/amazon/)
Also, the synthetic text they generated was surprisingly realistic, despite being generic.
If I were perusing a dozen reviews I probably wouldn't have spotted the AI-generated ones in the crowd.
It would be interesting to see how it performed for other NLP tasks. I'd be pretty interested to see how many neurons it uses to attempt something like stance detection.
> Data-parallelism was used across 4 Pascal Titan X gpus to speed up training and increase effective memory size. Training took approximately one month.
Every time I look at something like this I find a line like that and go: "ok, that's nice... I'll wait for the trained model".
