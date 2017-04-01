Hacker News
Unsupervised sentiment neuron
I don't know, but this seems a bit hyped in places.

They start with:

> Our L1-regularized model matches multichannel CNN performance with only 11 labeled examples, and state-of-the-art CT-LSTM Ensembles with 232 examples.

Hmm, that sounds pretty impressive. But then later you read:

> We first trained a multiplicative LSTM with 4,096 units on a corpus of 82 million Amazon reviews to predict the next character in a chunk of text. Training took one month across four NVIDIA Pascal GPUs

Wait, what? How did "232 examples" transform into "82 million"??

OK, I get it: they pretrained the network on the 82M reviews, and then trained the last layer to do the sentiment analysis. But you can't honestly claim that you did great with just 232 examples!

Is this correct? My sense from the article is that they did all the training unsupervised, and then checked one of the recurrent units for a correlation with sentiment.
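
For what it's worth, "checking a unit for a correlation with sentiment" could look roughly like the sketch below. It is not the authors' code: the activations are synthetic stand-ins for the model's final hidden states, unit 3 is deliberately built to follow the label, and the helper names (`pearson`, `most_sentiment_like_unit`) are made up for illustration.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def most_sentiment_like_unit(activations, labels):
    """Return (index, correlation) of the unit that tracks the labels most strongly."""
    n_units = len(activations[0])
    scores = [pearson([row[j] for row in activations], labels) for j in range(n_units)]
    best = max(range(n_units), key=lambda j: abs(scores[j]))
    return best, scores[best]

# Synthetic data (an assumption, not real model output):
# 200 "reviews", 8 hidden units, unit 3 follows the label, the rest are noise.
rng = random.Random(0)
labels = [rng.choice([0, 1]) for _ in range(200)]
activations = []
for y in labels:
    row = [rng.gauss(0, 1) for _ in range(8)]
    row[3] = (1.0 if y else -1.0) + rng.gauss(0, 0.3)
    activations.append(row)

unit, corr = most_sentiment_like_unit(activations, labels)
print(unit, round(corr, 3))
```

The labels are only used here to inspect a model that was already trained without them, which is the distinction the thread is arguing about.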

I think it's a fair claim. Labelled data is very hard to come by compared to unlabelled data. Being able to get a highly accurate model with only a small amount of labelled data is a very sought after and practical property.

Thanks for the feedback — added context to that sentence to make it more clear!

The main interesting thing is that none of the Amazon data was labeled, while the 232 labeled examples were.
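
A rough sketch of that split, assuming the pretrained network is frozen and only a small L1-regularized logistic regression is fit on the labeled examples. The features below are synthetic stand-ins for hidden states, and the optimizer (proximal gradient descent) is my choice for a self-contained example, not necessarily what the authors used.

```python
import math
import random

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(v, t):
    """Proximal step for the L1 penalty: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def fit_l1_logreg(X, y, l1=0.1, lr=0.5, steps=300):
    """Proximal gradient descent for L1-regularized logistic regression."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            gb += err / n
            for j in range(d):
                gw[j] += err * xi[j] / n
        # Gradient step on the log loss, then shrink weights (bias is not penalized).
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Synthetic "frozen features" (assumption): 16 units, only unit 5 carries sentiment.
rng = random.Random(1)
y = [i % 2 for i in range(40)]  # 40 labeled examples
X = []
for label in y:
    row = [rng.gauss(0, 1) for _ in range(16)]
    row[5] = (1.0 if label else -1.0) + rng.gauss(0, 0.3)
    X.append(row)

w, b = fit_l1_logreg(X, y)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) > 0.5 else 0 for xi in X]
accuracy = sum(p == t for p, t in zip(preds, y)) / len(y)
nonzero = sum(1 for v in w if v != 0.0)
print(nonzero, round(accuracy, 2))
```

The L1 penalty is what pushes most weights to exactly zero, so a classifier like this ends up leaning on a handful of units when one of them already encodes the label.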

To further clarify: does unlabeled mean "we didn't use sentiment data" or "we were only trying to predict the next character given the prior characters"? The Amazon data does come with associated 1-5 star ratings. Were those used or not?

We did not use the star ratings.
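
To make "unlabeled" concrete: in next-character prediction the training target comes from the text itself. A minimal sketch, assuming fixed-length context windows for simplicity (the actual model is recurrent and the function name here is hypothetical):

```python
def next_char_pairs(text, context_len):
    """Yield (context, target) pairs where target is the character right after context."""
    for i in range(context_len, len(text)):
        yield text[i - context_len:i], text[i]

review = "Great phone, love it."
pairs = list(next_char_pairs(review, 5))
print(pairs[0])    # ('Great', ' ')
print(pairs[-1])   # ('ve it', '.')
print(len(pairs))  # 16
```

Nothing in these pairs refers to a star rating or any other annotation; the review text is the only input.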

That's what I thought, and that makes this all the more interesting!

Very interesting. This reminds me of the 2012 paper co-authored by Andrew Ng: Building High-level Features Using Large Scale Unsupervised Learning

As luck would have it, I published a blog post this week about a new approach to the character-by-character deep learning architecture with a slight twist: using pretrained character embeddings so the model does not have to learn them: http://minimaxir.com/2017/04/char-embeddings/

As noted in my approach, I think there is more to char-by-char architecture than just tossing a super large LSTM at it, both in terms of performance and accuracy. It's an unexplored field with good potential payoffs. In light of the OpenAI article, I'll certainly take a look at the neuron data if possible.

(As an aside, since it was not mentioned in the original article, I am assuming this is the Amazon review dataset used: http://jmcauley.ucsd.edu/data/amazon/)

This is a great name for a band :-). That said, I found the paper really interesting. I tend to think about LSTM systems as series expansions, and using that as an analogy I don't find it unusual that you can figure out the dominant (or first) coefficient of the expansion and that it has a really strong impact on the output.

Character prediction is a curious way to train a sentiment analyzer. I wouldn't have expected it to work so well. Fascinating that it has.

Also the synthetic text they generated was surprisingly realistic, despite being generic.

If I were perusing a dozen reviews I probably wouldn't have spotted the AI-generated ones in the crowd.

We are getting better and better at automatic text generation. I wonder who will own the copyright on AI-generated text, comments, songs, etc.

Impressive, the abstraction NNs can achieve from just character prediction. Do the other systems they compare to also use 82M Amazon reviews for training? Seems disingenuous to claim "state-of-the-art" and "less data" if they haven't.

So char-by-char models are the next Word2Vec then. Pretty impressive results.

It would be interesting to see how it performed for other NLP tasks. I'd be pretty interested to see how many neurons it uses to attempt something like stance detection.

Data-parallelism was used across 4 Pascal Titan X gpus to speed up training and increase effective memory size. Training took approximately one month.
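
For anyone unfamiliar with the term, data parallelism means each GPU computes gradients on its own slice of the batch and the results are averaged. A toy sketch on a 1-D linear model, run sequentially on the CPU (the model, function names, and worker count are illustrative assumptions, not the paper's setup):

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def data_parallel_grad(w, batch, n_workers):
    """Split the batch into equal shards, compute one gradient per shard, then average.

    Assumes len(batch) is divisible by n_workers.
    """
    shard = len(batch) // n_workers
    shards = [batch[i * shard:(i + 1) * shard] for i in range(n_workers)]
    grads = [grad_mse(w, s) for s in shards]  # in practice, one shard per GPU
    return sum(grads) / n_workers

batch = [(float(x), 3.0 * x) for x in range(1, 9)]  # 8 examples, true w = 3
full = grad_mse(1.0, batch)
split = data_parallel_grad(1.0, batch, n_workers=4)
print(full, split)  # -102.0 -102.0
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so the gain is in wall-clock time and in the memory available per step, not in the math.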

Every time I look at something like this I find a line like that and go: "ok that's nice... I'll wait for the trained model".

moved that needle I guess

