Hacker News
Auto-Generating Clickbait with Recurrent Neural Networks (larseidnes.com)
251 points by lars on Oct 13, 2015 | 64 comments

If I could feed this an article and have it generate headlines based on the text of that article (and they were any good), there is a solid chance I would pay real money for that service.

Headlines are an absolute pain, and as the article says, they're decidedly unoriginal most of the time. I can't see an obvious reason that an AI would be much worse at creating them than a human.

Instead of generating random articles:

  1. Generate click bait headlines
  2. Write suitable copy for them
  3. ???
  4. Profit
Where 3, of course, is "build ad network"

I know of at least one company which actually does this (tests headlines before writing the actual articles).

I've always wondered if sites do this the other way around. A site invests a lot in the content it creates, so you would think that, to get the most out of it, they would serve it with different headlines to different demographics. Knowing from Facebook which headlines you have clicked through in the past could indicate how they should write future headlines to get you again.

Oh, they definitely do it the other way around extensively.

In the testing phase, what happens when you click the headline?

Sometimes it just loads a random article, sometimes it gives an error page.

And part of 2. involves automatically farming out the writing to places like Fiverr.

1a. Split-test. A lot.
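
For the split-testing step, here is a minimal sketch of how two headline variants might be compared, using a standard two-proportion z-test. The click and view counts are made up for illustration, and the function names are my own.

```python
import math

def two_proportion_z(clicks_a, views_a, clicks_b, views_b):
    """z statistic for the difference in click-through rate between A and B."""
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    return (p_a - p_b) / se

def two_sided_p(z):
    """Two-sided p-value under a standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))

# Invented numbers: headline A got 120 clicks in 1000 views, B got 90 in 1000.
z = two_proportion_z(120, 1000, 90, 1000)
print(f"z = {z:.2f}, p = {two_sided_p(z):.3f}")
```

A small p-value suggests the difference is unlikely to be noise, but with many variants tested at once the threshold would need to be stricter.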

That's absolutely easy:

<Random Unexpected Person> did <random unexpected action>, you wouldn't believe it!

There are many, many headline formulae in the world (some of my favourites were written in the mid 1920s, in John Caples' book "Tested Advertising Methods") but they still take time to iterate through and hone for each article.

I'd like a robot to do that for me, please :)
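
A fill-in-the-blanks formula of the kind discussed above can be sketched in a few lines. The lists of people and actions are hypothetical placeholders, and this is plain template filling, not the RNN from the article.

```python
import random

# Hypothetical placeholder lists, invented for illustration.
PEOPLE = ["A Retired Librarian", "Your Dentist", "A Local Mayor"]
ACTIONS = ["Adopted 40 Llamas", "Quit Email Forever", "Built A Boat From Soda Cans"]

def formula_headline(rng):
    """Fill the <person> did <action> template with random choices."""
    person = rng.choice(PEOPLE)
    action = rng.choice(ACTIONS)
    return f"{person} {action}, You Wouldn't Believe It!"

rng = random.Random(0)  # seeded so repeated runs give the same headlines
for _ in range(3):
    print(formula_headline(rng))
```

Iterating and honing per article, which is the time-consuming part, is exactly what this does not do.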

I like the notion of swamping the Internet with fake click-bait headlines, to dilute the attractiveness of this (to me, odious) form.

Give me sincere, honest news and discussion, or else shut up.

Unfortunately, someone out there must really have a craving for "weird old tricks" and "shocking conclusions".

It's a sort of race-to-the-bottom, least common denominator effect.

Maybe someone will write a browser extension that filters out obvious click-bait headlines. Now that would be clever!

People don't have a craving for this kind of crap, that is they don't actively search for it. It works by exploiting the brain. It's the publishing equivalent to junk food. We know it's awful. We know it's bad for us. But we struggle to not consume it because it's cheap and it pings our reward systems.

Actually, I think that the clickbait junk makes us think that it will ping our reward systems. For me, at least, it doesn't really reward me very well (even in a junk food way).

Maybe this means that the real clickbait trash is training me not to click on it, so I don't need the fake to do so?

I cured myself of clickbait headlines after I clicked on a few and learned to expect no content on the other side. It's a simple association, really. You click on something X-y, you get no reward, you learn not to waste time on X-looking things.

For you, certainly. But the proof of the pudding is in the tasting. Clickbait drives an insane amount of traffic and shows no sign of slowing down. Cosmo has been successfully using "clickbait" titles on the cover of their magazines for decades.

Of course. I'm not denying the effectiveness of this technique, just providing an n=1 data point. Maybe my personal idiosyncrasies make me immune to that particular type of traffic-driving technique (I have no doubts I'm vulnerable to other methods).

Part of our reward systems is what was initially named the "pleasure centre".

When the "pleasure centre" in the brain was first identified, it was so named because stimulating it was thought to cause pleasure: rodents given the choice between stimulating it and other activities would stimulate the pleasure centre even over eating.

But as it turns out, the main function of stimulating this area is strong cravings and compulsion. You may get some pleasure from giving in to the cravings, but the cravings are independent of whether or not there's a "real" reward at the end of it.

Reminds me of a comment I read a few weeks back when an ex-drug addict was describing how the anticipation of using drugs was often more rewarding than the use itself. Which explains the pleasures in drug use rituals.

I've just banned myself from visiting many of the worst offenders. Though it's getting really hard, any more, to find sites that won't sink to that level.

Couldn't this trained RNN also be used to evaluate the "clickbait-ness" of article titles (rather than generate new ones)?

Tl;dr: No. Wrong output format, wrong training set, wrong input.

To create a classifier that does that, you'd need a labeled set - i.e. someone would have to go through and say "this headline is 3 clickbaits. This other headline is 8 clickbaits". You could also sort between clickbaity and non-clickbaity, but that would still require manual work.

You could get that programmatically through a few different means, but you'd need a lot more than just headlines.

It also probably wouldn't be a good idea to use an RNN - it doesn't suit the data format well. It'd be better to use a neural network (non-recurrent) or logistic regression with the entire headline as input.

Fortunately, it'll converge on a good solution a LOT faster - fewer parameters to tune + simpler output = fewer examples needed to figure out what's going on - so you might be able to get something that has plausible levels of accuracy with a day or two of set labeling (estimate brought to you by my ass).
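
To make the logistic regression suggestion concrete, here is a minimal sketch that takes the entire headline as a bag of words. The eight labeled headlines are invented toy data standing in for the hand-labeled set, and the training loop is plain stochastic gradient descent written out by hand rather than any particular library's API.

```python
import math
from collections import defaultdict

# Toy hand-labeled examples, invented for illustration (1 = clickbait).
LABELED = [
    ("you won't believe what this cat did", 1),
    ("10 weird tricks doctors hate", 1),
    ("this one weird trick will shock you", 1),
    ("what happens next will shock you", 1),
    ("senate passes annual budget bill", 0),
    ("central bank holds interest rates steady", 0),
    ("city council approves new bus routes", 0),
    ("researchers publish study on soil erosion", 0),
]

def features(headline):
    """Bag of words over the entire headline, lowercased."""
    return set(headline.lower().split())

def predict(weights, bias, headline):
    """Probability that the headline is clickbait."""
    z = bias + sum(weights[w] for w in features(headline))
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, epochs=200, lr=0.5):
    """Stochastic gradient descent on the log loss."""
    weights, bias = defaultdict(float), 0.0
    for _ in range(epochs):
        for headline, label in examples:
            error = predict(weights, bias, headline) - label
            bias -= lr * error
            for w in features(headline):
                weights[w] -= lr * error
    return weights, bias

weights, bias = train(LABELED)
print(round(predict(weights, bias, "this weird trick will shock doctors"), 2))
print(round(predict(weights, bias, "council approves budget bill"), 2))
```

With a real labeled set you would hold out some headlines to check accuracy; eight examples only show the mechanics.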

>Unfortunately, someone out there must really have a craving for "weird old tricks" and "shocking conclusions".

This problem seems analogous to the old mystery of Viruses Spontaneously Self-Constructing On People's Computers. "How did you get all these viruses on your computer?" "I didn't do anything, it just happened." "Okay, well be really careful what you click on." "I am careful!"

> Give me sincere, honest news and discussion, or else shut up.

There are plenty of sources for what you desire; it just isn't what's popular... is that a problem?

Yes, but even respectable news and information websites now include clickbait (Outbrain and other "sponsored" content). I've seen it on WSJ, NYT, and other sites even when I'm paying $10-$15 a month.

Could this RNN model perhaps be used to filter clickbait headlines from HN automatically? Perhaps one could perform some sort of backward beam search to figure out how likely it is that a particular headline would have been produced by it. If there are words in a headline that the model doesn't know, one could perhaps just let it replace them with ones that it knows.
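
One way to sketch the scoring idea: any language model assigns a probability to a word sequence, so a headline can be scored by its average per-word log-probability. The example below uses an add-one-smoothed bigram model as a stand-in for the RNN (a swapped-in, much simpler technique), a three-line invented corpus, and an `<unk>` token for words the model has never seen.

```python
import math
from collections import Counter

# Tiny invented corpus standing in for the clickbait training set.
CORPUS = [
    "you won't believe what happens next",
    "what happens next will shock you",
    "this trick will shock you",
]

START, END, UNK = "<s>", "</s>", "<unk>"

def tokens(headline, vocab):
    """Lowercase, split, and replace words the model has never seen."""
    words = [w if w in vocab else UNK for w in headline.lower().split()]
    return [START] + words + [END]

vocab = {w for line in CORPUS for w in line.split()}
bigrams, unigrams = Counter(), Counter()
for line in CORPUS:
    seq = tokens(line, vocab)
    unigrams.update(seq[:-1])          # counts of each context word
    bigrams.update(zip(seq, seq[1:]))  # counts of each adjacent pair

def avg_log_prob(headline):
    """Mean per-word log-probability with add-one smoothing."""
    seq = tokens(headline, vocab)
    size = len(vocab) + 2  # possible next tokens: vocabulary plus </s> and <unk>
    total = 0.0
    for prev, cur in zip(seq, seq[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + size))
    return total / (len(seq) - 1)

print(avg_log_prob("what happens next will shock you"))
print(avg_log_prob("senate passes annual budget bill"))
```

A filter would then flag headlines scoring above some threshold, which would have to be tuned on real data.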

Now if we can just teach AI to get sidetracked reading all this content we'd also prevent Judgement Day.

SkyNet: (speaking to self?) "Unleash hell on humans. Launch all missiles."

SkyNet: (responding to self?) "Not now, not now. Let me finish this article on John Stamos's belly button."


I really find RNNs to be pretty cool. When they are combined with a natural human tendency to see patterns they are hilarious. So perhaps we need to update our million monkeys hypothesis to a million RNNs with typewriters coming up with all the works of Shakespeare.

Shakespearean RNN http://cs.stanford.edu/people/karpathy/char-rnn/shakespear.t...

Surprisingly convincing if viewed as excerpts rather than a play.

Now to find some English teachers to try to interpret what Shakespeare meant by some of those lines!

Nice! I've wanted to do something like this for a while, too, but haven't had the time yet.

What's interesting to me, from a research point of view, is the degree of nuance the network uncovers for the clickbait. We all know that <person> is going to be doing <intriguing action>, but for each person these actions are slightly different. The sentence completions for "Barack Obama Says..." are mainly politics related while "Kim Kardashian Says..." involve Kim commenting on herself.

So it might not really understand what it's saying, but it captures the fact those two people will tend to produce different headlines.

Neat idea: what if we tried the same thing with headlines from the New York Times (or maybe a basket of newspapers)? We would likely find that the Clickbait RNN's vision of Obama is a lot different from the Newspaper RNN's Obama. Teasing apart the differences would likely give you a lot more insight into how the two readerships view the president than any number of polls would.

What surprises me most is that the headlines seem not to be much better than your average Markov chain output.

I think this is for three main reasons:

1. You can do really well with a simple grammar

2. You only need short output

3. Lack of training data

There's not an incredibly rich structure to extract, and with short outputs the weirdness doesn't compound and cycles aren't as likely. A common small dataset for playing with RNNs is all of Shakespeare which is somewhere in the region of 1M words.

However, this is still fun and interesting!

> 3. Lack of training data

> [...]

> There's not an incredibly rich structure to extract, and with short outputs the weirdness doesn't compound and cycles aren't as likely. A common small dataset for playing with RNNs is all of Shakespeare which is somewhere in the region of 1M words.

He does state that the network is trained with 2M headlines, meaning ~5-20M words. That should be enough.

I would have thought that an RNN would somehow work better. It would be interesting to see a direct comparison of fake Hacker News headlines generated with Markov chains versus an RNN.

True, I had managed to miss that, although it's working on 200-dimensional vectors rather than single letters as in the small Shakespeare dataset. That feels like it might make it harder to train. I've personally found more problems dealing with GloVe vectors compared to the word2vec ones, but I don't have any hard data for that.
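
For reference, the Markov chain side of such a comparison is small enough to sketch here. This is a first-order, word-level chain; the four training headlines are invented, and a real comparison would use the same corpus as the RNN.

```python
import random
from collections import defaultdict

# Invented sample headlines; a real run would use the full scraped corpus.
HEADLINES = [
    "you won't believe what this senator said",
    "what this senator said will shock you",
    "this startup will shock you",
    "you won't believe this startup",
]

def build_chain(headlines):
    """Map each word to the list of words that followed it in training."""
    chain = defaultdict(list)
    for headline in headlines:
        words = ["<s>"] + headline.split() + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            chain[prev].append(cur)
    return chain

def generate(chain, rng, max_words=12):
    """Walk the chain from the start token until the end token or the cap."""
    word, out = "<s>", []
    while len(out) < max_words:
        word = rng.choice(chain[word])
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

chain = build_chain(HEADLINES)
rng = random.Random(0)  # seeded for repeatability
for _ in range(3):
    print(generate(chain, rng))
```

Because each word depends only on the one before it, longer outputs drift and loop, which is the weakness short headlines mostly hide.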

This was an enjoyable article. There is an obvious extension which is to mturk the results and feed the mturk data back into the net. Just give the turkers 5 headlines and ask them which they would click first, repeat a hundred times per a thousand turkers or whatever.
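
As a rough sketch of the bookkeeping for that feedback loop: each trial records which headlines were shown and which one was picked, and a headline's score is the fraction of its showings in which it was chosen. The trial data and the function name are invented; how the scores get fed back into the network is left open.

```python
from collections import Counter

def preference_scores(trials):
    """trials: list of (shown_headlines, chosen_headline) pairs."""
    shown, chosen = Counter(), Counter()
    for options, pick in trials:
        shown.update(options)
        chosen[pick] += 1
    # Fraction of appearances in which each headline was the one clicked.
    return {h: chosen[h] / shown[h] for h in shown}

# Invented trials with short stand-in headline names.
trials = [
    (["a", "b", "c"], "a"),
    (["a", "b"], "b"),
    (["a", "c"], "a"),
]
for headline, score in sorted(preference_scores(trials).items()):
    print(headline, round(score, 2))
```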

Years ago I considered applying for DoD grant money to implement something reminiscent of all this for military propaganda. That went approximately nowhere, not even past the first steps. Someone else should try this (insert obvious famous news network joke here, although I was serious about the proposal). To save time, I'll point out why I never got beyond the earliest steps: there is a vaguely infinite pool of clickbaitable English speakers on the turk, but the pool of bilingual Arabic (or whatever) speakers with good taste in pro-USA propaganda is extremely small. The tech side was easy to scale, but the mandatory human side simply couldn't scale enough to make the output realistically anything but a joke.

> The training converges after a few days of number crunching on a GTX980 GPU. Let’s take a look at the results.

Stupid question: why is the GPU important here? I would have thought this was more of a CPU task..??

(then again, as I typed this I remembered that bitcoin farming is supposed to be GPU intensive so I'm guessing the "why" for that is the same as this)

A lot of this kind of work ends up being repetitive, like multiplying together two matrices that have a few thousand entries each. These are the sorts of things that GPUs do very well, because they can do them on a massively parallel scale. GPUs also tend to have more memory bandwidth, which helps with the kinds of workloads where a CPU would get bogged down in its memory cache.

GPUs are really good at parallel tasks such as calculating the color of every pixel on the screen, or doing the same operation on a large dataset. According to Newegg, the GTX980 has 2048 CUDA cores (parallel processing cores) that run at ~1266 MHz, as opposed to a nice CPU, which might have 4 cores that run at 4 GHz. In other words, if you want to manipulate a whole bunch of things in one way in parallel, you can program it to use the GPU effectively; if you want to manipulate one thing a whole bunch of ways in series, the CPU is your best bet.
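
To see why matrix multiplication parallelizes so well, note that every entry of the result depends only on one row of the first matrix and one column of the second. A plain-Python sketch (it runs serially, and is only meant to show the independence, not GPU code):

```python
def cell(a, b, i, j):
    """One output entry: dot product of row i of a and column j of b."""
    return sum(a[i][k] * b[k][j] for k in range(len(b)))

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    rows, cols = len(a), len(b[0])
    # No cell reads another cell's result, so each (i, j) could be
    # handed to its own GPU thread and computed at the same time.
    return [[cell(a, b, i, j) for j in range(cols)] for i in range(rows)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```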

(note: this is massively oversimplified)

Coarse rule-of-thumb: running on GeForce class GPUs you can get up to 5x, maaaybe 10x the performance per dollar as compared to a top-line CPU. That assumes your problem scales well on GPUs; many problems don't. The GTX980 is actually a great performer. For Tesla class systems like the K40 it's a lot closer to equal with the CPU on performance/$ (they're not much faster than the GTX980 but a lot more expensive). But you can get an edge with the Teslas when you start comparing multi-GPU clusters to multi-CPU clusters, since with GPUs you need less of the super-expensive interconnect hardware. (You're not going to put GTX cards in a cluster, you'd have massive reliability problems.)

IMHO, the guys showing 100x speedups on GPUs are Doing It Wrong; they use a poor implementation on the CPU, use just one CPU core, consider a very synthetic benchmark, or a bunch of other tricks.

Getting this error:

    Error: 500 Internal Server Error

    Sorry, the requested URL 'http://clickotron.com/' caused an error:

    Internal Server Error

    IOError(24, 'Too many open files')

    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 862, in _handle
        return route.call(**args)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 1732, in wrapper
        rv = callback(*a, **ka)
      File "server.py", line 69, in index
        return template('index', left_articles=left_articles, right_articles=right_articles)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3595, in template
        return TEMPLATES[tplid].render(kwargs)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3399, in render
        self.execute(stdout, env)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3386, in execute
        eval(self.co, env)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 189, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3344, in co
        return compile(self.code, self.filename or '<string>', 'exec')
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 189, in __get__
        value = obj.__dict__[self.func.__name__] = self.func(obj)
      File "/usr/local/lib/python2.7/dist-packages/bottle.py", line 3350, in code
        with open(self.filename, 'rb') as f:
    IOError: [Errno 24] Too many open files: '/home/ubuntu/clickotron/views/index.tpl'

That'll teach you to do disk I/O on every page render.

Yep. Have a separate process cache a few up and simply cp over the active one to be served as a big bag of immutable bits. Bonus points for using a CDN.
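
A simpler in-process variant of the same idea is to read each template from disk once and serve it from memory afterwards. This is only a sketch: `load_template` and `render` are hypothetical helpers, not Bottle's API, and `str.format` stands in for a real template engine.

```python
import functools
import os
import tempfile

@functools.lru_cache(maxsize=None)
def load_template(path):
    """Read a template from disk once; later calls return the cached text."""
    with open(path, "r", encoding="utf-8") as f:  # handle is closed on exit
        return f.read()

def render(path, **values):
    """Fill {name} placeholders; a stand-in for a real template engine."""
    return load_template(path).format(**values)

# Demo with a throwaway file; the template content is made up.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.tpl")
    with open(path, "w", encoding="utf-8") as f:
        f.write("<h1>{title}</h1>")
    print(render(path, title="Hello"))
    os.remove(path)  # file is gone, but the cached text is still served
    print(render(path, title="Again"))
```

The trade-off is that edits to the template are not picked up until the process restarts or the cache is cleared with `load_template.cache_clear()`.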

I can't stop laughing at these. Check out the Click-o-tron site: http://clickotron.com/

My favorite: "residents can't remember if they lost their wine at the same time." [1]

[1] http://clickotron.com/article/5588/residents-cant-remember-i...

I used a simpler technique (character level language modelling) to come up with an Australian real estate listing generator: http://electronsoup.net/realtybot

This is pre-generated, not live, for performance reasons. There are a few hundred thousand items though, so the effect is similar.

The data source is several tens of thousands of real estate listings that I scraped and parsed.

This is simply brilliant.

(Ranking algorithm baked into a stored procedure notwithstanding. [ducks])

I am not sure how much I would give credit to the idea that the neural network 'gets' anything as it is written in the article.

> Yet, the network knows that the Romney Camp criticizing the president is a plausible headline.

I am pretty certain that the network does not know any of this and instead just happens to be understood by us as making sense.

Life Is About A Giant White House Close To A Body In These Red Carpet Looks From Prince William’s Epic ‘Dinner With Johnny'

from the article would be a good counterexample of the neural network "getting" anything.

If you're an algorithm "White House", "Prince William" and "Dinner With Johnny" is to "Red Carpet" as "Romney" is to "Camp" and "Bad President".

tldr; guy uses rnn lstm to create link bait site.

hopes crowd sourcing will filter out nonsense.


Site down. Did HN readers crash the server? Everything old is new again (slashdot effect)?

"Tips From Two And A Half Men : Getting Real" is great. Some of the generated titles are incredible.

I can't understand the first two-layer RNN, which, according to the author, optimized the word vectors.

It says:

During training, we can follow the gradient down into these word vectors and fine-tune the vector representations specifically for the task of generating clickbait, thus further improving the generalization accuracy of the complete model.

How do you follow the gradient down into these word vectors?

If word vectors are the input of the network, don't we only train the weights of the network? How come the input vectors get optimized during the process?
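
One common way to read the quoted passage: the word vectors sit in a lookup table that is treated as just another set of trainable parameters, so backpropagation computes a gradient for the looked-up row as well as for the weights. Here is a toy sketch with a single linear unit and squared error; the two-dimensional vectors, the words, and the target are all made up, and this is not the article's actual network.

```python
# Word vectors live in a lookup table that is itself a trainable parameter.
embeddings = {"obama": [0.1, -0.2], "kim": [0.3, 0.4]}  # made-up 2-d vectors
weights = [0.5, -0.5]                                   # the "network"

def forward(word):
    """Output of a single linear unit applied to the word's vector."""
    vec = embeddings[word]
    return sum(w * x for w, x in zip(weights, vec))

def train_step(word, target, lr=0.1):
    """One gradient step on squared error, updating weights AND the word's vector."""
    vec = embeddings[word]
    err = forward(word) - target         # dLoss/dOutput for loss = 0.5 * err**2
    grad_w = [err * x for x in vec]      # gradient with respect to the weights
    grad_x = [err * w for w in weights]  # gradient with respect to the input vector
    for i in range(len(weights)):
        weights[i] -= lr * grad_w[i]
        vec[i] -= lr * grad_x[i]         # only the row for this word changes

print("before:", embeddings["obama"], forward("obama"))
train_step("obama", target=1.0)
print("after: ", embeddings["obama"], forward("obama"))
```

The chain rule does not stop at the first layer's weights; it gives a gradient for the input too, and since the input here is a table row rather than fixed data, that row can be nudged like any other parameter.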

Missed opportunity for HN headline.

This program generates random clickbait headlines. You won't believe what happens next. You'll love #7.

Reminds me of Headline Smasher [0].

Some pretty fun ones there but it doesn't use RNNs. It just merges existing headlines.

[0]: http://www.headlinesmasher.com/best/all

Great tutorial. Been looking to do something like this for a while. Bookmarked!

I think this one is my favorite:

Life Is About — Or Still Didn’t Know Me

The "top" article in "clickotron.com" is "New President Is 'Hours Away' From Royal Pregnancy" :)

Your main site is down. Bottle can't handle serving files scalably or something? Point is, it broke.

That was exactly the problem, bottle+gevent serving static files. It's moved behind nginx now. (But you might have to wait for a DNS propagation before you get to the new server.)

Interesting blog post, but the site is down. How much traffic do you get from HN?

500 Internal Server Error on the site where you could upvote em.

Working on it:) It's getting a bit more traffic than expected at the moment.
