Text Summarization with TensorFlow (googleblog.com)
414 points by runesoerensen on Aug 24, 2016 | 69 comments

Allegedly (according to HP labs anyway) my 15-year-old software[0] was state-of-the-art at this on the CNN dataset[1].

But my thing was very heuristic-based, and couldn't generate new sentences like this can. I'm pretty impressed - I'd say some of the machine-generated summaries are better than the human ones.

[0] http://classifier4j.sourceforge.net/ (yes, Sourceforge! Shows how old it is!!)

[1] http://dl.acm.org/citation.cfm?id=2797081

This is way too difficult to get right, even for humans.

I dream of a not-so-smart news summarization engine that will not try to rewrite the news, but only pick up all the numbers and quotations, then present them in a table of who-said-what and how-many-what, along with the title.

This would put an end to filler-based journalism.
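A minimal sketch of the "pick up all the numbers and quotations" step, assuming plain regular expressions are good enough for a first pass. The function name and both patterns are made up for illustration, and attributing each quote to a speaker (the "who" in who-said-what) is not attempted here.

```python
import re

# Hypothetical patterns for illustration; real news text needs far more care
# (curly quotes, spelled-out numbers, dates, nested quotations, ...).
QUOTE_RE = re.compile(r'"([^"]+)"')
NUMBER_RE = re.compile(r"\$?\d+(?:,\d{3})*(?:\.\d+)?%?")

def extract_facts(text):
    """Return (quotations, numbers) found in text, in order of appearance."""
    quotes = QUOTE_RE.findall(text)
    numbers = NUMBER_RE.findall(text)
    return quotes, numbers
```

Calling `extract_facts` on an article body gives two lists that could be laid out as the table rows, next to the title.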

I wish you wouldn't be so dismissive of journalism and journalists. What they provide is not filler. Controversial though this opinion may be around here, there is serious value in having an actual carbon-based life form -- one who has spent years or decades covering whatever beat -- provide context and insight for the quotes and data. That they have become a dying breed spells real trouble for our civic life.

The journalists you describe are so few and far between that there needs to be a different term for them. The vast majority of the 'news' out there isn't anything close to what you described.

Agreed, in an age where Twitter storms are worthy of front-page news.

The system is not set up to reward them; if there were genuine demand we might see more of them.

The unfortunate truth is that clickbait and bite-sized arguments are what people want. Trying to solve it from the top down is a lost cause.

The HN crowd does not need context and insight, they can just google a few keywords and then skim a wikipedia article to achieve the expertise necessary to argue with others on a public forum...

I think you described how modern journalism works. Except skimming Wikipedia is often optional.

Are you serious? Journalists have templates for news articles they just fill with some new data every time statistics numbers are released or a politician speaks.

> This would put an end to filler-based journalism.

No, it might put an end to the filler-producing journalists; the so-called journalism would still get produced, albeit by a bot.

The real journalists (in terms of a better differentiation) would then be even more drowned out in an ever-growing desert of CGH (computer generated headlines).

Put an end to filler-based journalism? That would be nice, but don't you think that they will just evolve around your trick?

$6,000 for the Gigaword dataset they used to train...


There's not even a way to buy it as an individual who's not part of an organization! They don't even state this as a possibility.

Funny how Google is paying for data

The code can be used to train on other data. All you really need is a collection of news articles. I think there are some free ones available.

This dataset was only used to benchmark against other published results. It was first proposed in https://arxiv.org/abs/1509.00685.

There's SMMRY (http://smmry.com). I don't recall it being very smart -- that is, it doesn't rewrite sentences. It extracts the most important/relevant ones to basically shorten an article. It's definitely useful.
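For anyone curious what that kind of extractive approach looks like, here is a minimal sketch, assuming sentences are ranked by the average frequency of their words in the article. This is not SMMRY's actual algorithm; the stopword list, function names, and scoring rule are all assumptions for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from any library).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it",
             "that", "for", "on", "by", "be", "can", "as"}

def split_sentences(text):
    # Naive split on sentence-ending punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]

def summarize(text, n=2):
    """Return the n highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average word frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(ranked)]
```

Nothing is rewritten here: the output is always a subset of the input sentences, which is the key difference from the abstractive model in the post.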

Wow, I like this.

As an example, here is the Google article summarized by SMMRY.


Research Blog: Text summarization with TensorFlow Being able to develop Machine Learning models that can automatically deliver accurate summaries of longer text can be useful for digesting such large amounts of information in a compressed form, and is a long-term goal of the Google Brain team.

One approach to summarization is to extract parts of the document that are deemed interesting by some metric and join them to form a summary.

Above we extract the words bolded in the original text and concatenate them to form a summary.

It turns out for shorter texts, summarization can be learned end-to-end with a deep learning technique called sequence-to-sequence learning, similar to what makes Smart Reply for Inbox possible.

In this case, the model reads the article text and writes a suitable headline.

In those tasks training from scratch with this model architecture does not do as well as some other techniques we're researching, but it serves as a baseline.

We hope this release can also serve as a baseline for others in their summarization research.

FYI, their about section also spells out their algorithm: http://smmry.com/about

This is what the tldr; bot on reddit uses. It's pretty darn good most of the time, and useful.

Here's my approach when I built my text-summary app with TensorFlow's SyntaxNet. SyntaxNet (Parsey) gives the part of speech for each word and a parse tree showing which less-important words/phrases describe higher-level words. In "She sells sea shells, down by the sea shore", SyntaxNet tags (down by the sea shore) as a lower-level phrase describing "sells", so it can be removed from the sentence. Removing adjectives and prepositional phrases easily gives us simpler sentences. Next, we find key words (central to sentences) for the news article based on n-grams, and then score the key sentences in which they appear. Use MIT ConceptNet for common-sense linking of nouns, the most likely relations between them, and similar words based on vectors. Finally, generate the article summary from the grammatically simple sentences.
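A rough sketch of the pruning step, assuming a dependency parse is already available. The parse below is written by hand for illustration (it is not SyntaxNet output), and the relation labels and the `prune` helper are hypothetical.

```python
# Each token: (index, word, head_index, relation). Head -1 marks the root.
# Hand-written parse for illustration only; not produced by SyntaxNet.
PARSE = [
    (0, "She", 1, "nsubj"),
    (1, "sells", -1, "ROOT"),
    (2, "sea", 3, "nn"),
    (3, "shells", 1, "dobj"),
    (4, "down", 1, "advmod"),
    (5, "by", 4, "prep"),
    (6, "the", 8, "det"),
    (7, "sea", 8, "nn"),
    (8, "shore", 5, "pobj"),
]

def prune(parse, drop_relations):
    """Drop every token whose relation is in drop_relations, plus its subtree."""
    heads = {idx: head for idx, _, head, _ in parse}
    dropped = {idx for idx, _, _, rel in parse if rel in drop_relations}

    def is_dropped(idx):
        # Walk up toward the root; a token goes if it or any ancestor is dropped.
        while idx != -1:
            if idx in dropped:
                return True
            idx = heads[idx]
        return False

    return " ".join(word for idx, word, _, _ in parse if not is_dropped(idx))
```

With this parse, `prune(PARSE, {"advmod", "prep"})` keeps only the core clause. Which relations are safe to drop is exactly the hard part, so the set passed in is a judgment call rather than a rule.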

My question is how well the trained models interpret human meaning in joined sentences. I discovered that by simplifying sentences you lose the original meaning when the grammatically low-importance word is central to the meaning. "Clinton may be the historically first nominee, who is a woman, from the Dem or GOP party to win presidency" means something very different if you remove the "who is a woman". I am also interested in how it makes sense to join up nouns/entities across sentences. This will produce the wrong meaning unless you are building the human meaning structures like in ConceptNet by learning from the article itself, as opposed to using pretrained models based on grammar or word vectors in Gigaword.

My future work is using a tf-idf style approach for deciding the key words in a sentence, which I would recommend over relative grammar/vectors. In the example in your blog post ("australian wine exports hit record high in september") you left out that it's 52.1 million liters; but if the article went on to mention or attach importance to that number, by comparing it to past records or giving the price and so on, you would see that the "52.1 million liters" phrase in this one sentence has a higher score relative to the collection of all sentences. As opposed to probabilistic word cherry-picking based on prior data, this approach will enable you to extract named entities and phrases and build sentences from phrases in any sentence that grammatically refer to them.
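A minimal tf-idf sketch of that scoring idea, under one reading of it: each article is a document, and a term repeated in this article but rare in the rest of the collection scores high. The function names and the toy corpus in the usage note are assumptions for illustration, and single words stand in for phrases.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(documents):
    """Return one {term: score} dict per document.

    tf  = count of the term in the document / document length
    idf = log(number of documents / number of documents containing the term)
    """
    docs = [tokenize(d) for d in documents]
    n = len(docs)
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return scores
```

For a toy collection like `["wine exports hit record liters liters", "football team hit record", "team wins match"]`, the repeated, article-specific term "liters" gets the top score in the first document, while "hit" and "record" are discounted because they also appear elsewhere.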

> a parse tree showing which less-important words/phrases describe higher-level words

Things lower on a parse tree aren't less important than things higher. It just represents a dependency relationship.

You're pointing out what's already obvious. You still need some way to find what's "less important", which is what the topic is all about, like by using grammar dependency or keywords infrequency.

I'm not trying to be hostile, I just don't understand what you mean.

> You still need some way to find what's "less important", which is what the topic is all about, like by using grammar dependency or keywords infrequency.

I have some experience in this area[1]. I found keyword frequency worked quite well.

[1] https://news.ycombinator.com/item?id=12356133

Can I see the code of what you wrote? I am interested in learning tensorflow and syntaxnet.

Does this include the state of the network in its already-trained state? It looks like we need to train it with the $6000 dataset if we want to get good results like those mentioned. Is it possible for its state to be saved to disk and restored (so we don't all need a copy of the dataset)?

The Hainan example in the article is especially impressive: the generated summary uses completely different expressions compared to the source text, yet it is spot-on. Of course those are probably cherry-picked results, but still. As a side note, it would be interesting to see how the algorithm performs with longer sources.

> We've observed that due to the nature of news headlines, the model can generate good headlines from reading just a few sentences from the beginning of the article.

This illustrates the importance of taking the trouble to understand a domain before trying to model it. I was taught in journalism class that the first paragraph of a newspaper article should summarize the story, the next 3-5 paragraphs summarize again, then the rest of the article fills in the details. Not only do the authors spend time discovering what should have been known from the outset, they reverse cause and effect. The model can generate good headlines due to the nature of newspaper writing, not due to the nature of headlines.
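That structure is why a "lead" baseline is worth comparing against: just return the opening sentences. A minimal sketch, assuming a naive sentence split; the function name is made up for illustration.

```python
import re

def lead_summary(article, n_sentences=1):
    """Return the first n_sentences of the article as its summary."""
    # Naive split on sentence-ending punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", article.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    return " ".join(sentences[:n_sentences])
```

If a learned model cannot beat this on news text, it has probably learned little beyond where journalists put the summary.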

That reminds me. Whatever happened to http://summly.com?

Acquired by Yahoo a few years ago.


also: wasn't it just a frontend to an external summarization service (that's what i recall)

I don't believe so, no. They did most of the original NLP through a contracted team, according to Wikipedia: https://en.wikipedia.org/wiki/Nick_D%27Aloisio#Summly.

got it! remembered it wrong - thx!

> we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries

I was hoping for some more insight on this.

Because when looking at the examples given, I wonder if we really need machine learning to summarize single sentences? Just by cutting all adjectives, replacing words with abbreviations and multiple words with potential category terms, we should get similar results. Maybe it's just a start, or did I miss anything?

Next one could turn this into a comment generator, generating comments in the style and personality of any HN commenters with an adequate corpus.

The subredditsimulator subreddit does this for both articles and comments and is restricted to bots.

It is still mostly inane garbage but the content and gems have improved steadily over the past year or so. Interesting experiment in any case.

That would only be accurate if you assume most people read the article.

The table is nice, but I'd like to see examples where it performs poorly as well.

Author of post here. I'd say most of the examples generated from the best model were good. However we chose examples that were not too gruesome, as news can be :)

We encourage you to try the code and see for yourself.

How does the model deal with dangling anaphora[1]? I wrote a summarizer for Spanish following a recent paper as a side project, and it looks as if I'll need a month of work to solve the issue.

[1] That is, the problem of selecting a sentence such as "He approved the motion" and then realising that "he" is now undefined.

We're not "selecting" sentences as an extractive summarizer might. The sentences are generated.

As for how does the model deal with co-reference? There's no special logic for that.

Wouldn't it suffice to do a coreference pass before extracting sentences? Obviously you'll compound coref errors with the errors in your main logic, but that seems somewhat unavoidable.

I am working on this in my kbsportal.com NLP demo. With accurate coreference substitutions (e.g., substituting a previous NP like 'San Francisco' for 'there' in a later sentence, substituting full previously mentioned names for pronouns, etc.) extractive summarization should provide better results, and my intuition is that this preprocessing should help abstractive summarization also.

That is inter-sentence logic? Even humans have trouble with such ambiguity for certain cases.

In the post you mentioned that

>>"In those tasks training from scratch with this model architecture does not do as well as some other techniques we're researching, but it serves as a baseline."

Can you elaborate a little on that? Is the training the problem or is the model just not good at longer texts?

Any chance some trained model will be released?

Any hints on how to integrate the whole document for summarization? ;)

I've seen copynet, where you do seq2seq but also have a copy mechanism to copy rare words from the source sentence to the target sentence.
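A rough sketch of the general copy idea, assuming the decoder already produces a vocabulary distribution, attention weights over the source, and a gate `p_gen`. This is a simplified generate/copy mixture, not CopyNet's exact formulation, and all names here are made up for illustration.

```python
def mix_copy_distribution(p_vocab, attention, source_tokens, p_gen):
    """Blend a generation distribution with a copy distribution.

    p_vocab:       dict word -> probability from the decoder's vocabulary softmax
    attention:     attention weights, one per source token (assumed to sum to 1)
    source_tokens: source words aligned with the attention weights
    p_gen:         probability of generating from the vocabulary instead of copying
    """
    final = {word: p_gen * p for word, p in p_vocab.items()}
    for token, weight in zip(source_tokens, attention):
        # Copy mass goes to the source word, even if it is out of vocabulary.
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

The point is that a rare source word like "novell" can get non-zero probability even when the vocabulary softmax has no entry for it.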

Is it hard to get the code up and running on Google Cloud? Does TensorFlow come as a service?

Agreed, it seems they really hand-picked some shining examples for this post, and it would have been more interesting to see the full spectrum of when it works and when it doesn't. Perhaps the README in the GitHub repo is a bit more honest in terms of representativeness. It only has 4 examples, but one of them is an interesting failure:

article: novell inc. chief executive officer eric schmidt has been named chairman of the internet search-engine company google .

human: novell ceo named google chairman

machine: novell chief executive named to head internet company


(It's still great that it beats all competition on the ROUGE score, of course.)
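For context on what ROUGE measures, here is a simplified ROUGE-1 (unigram overlap) sketch, assuming lowercase whitespace tokenization and none of the stemming or stopword options of the official toolkit. The function name is made up for illustration.

```python
from collections import Counter

def rouge_1(reference, candidate):
    """Return (recall, precision, f1) based on clipped unigram overlap."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((ref & cand).values())
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1
```

On the pair above, the human and machine headlines share only "novell" and "named", so recall is 2/5 = 0.4 and precision is 2/8 = 0.25. "ceo" versus "chief executive" counts for nothing, which shows how little the metric rewards paraphrase.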

I don't see that as a failure. It did produce a sentence that is shorter, grammatical (though "named to head" is a bit weird) and essentially true — calling Google an "internet company" would make sense in its early days (back when Google would be prefixed by "internet search-engine company").

I didn't think it was a failure either until I realized that I was letting future knowledge leak into the past! There is more than one internet company so upon reading that headline, given that it must be a novel event, my question would be: "Which Company?". Now I have to read the article until I find out. The human summary is better because I don't have to ask that question.

Most humans would interpret "novell chief executive named to head internet company" to mean "novell is an internet company and its chief executive just became its head" which is incorrect (and a little nonsensical since the CEO already is in charge).

That's pretty interesting. It's taken me 5 readings of that sentence, including once out loud, to get your reading of it.

I thought the generated summary was really, really good. But I knew that Novell wasn't considered an internet company, so it wasn't until I made myself ignore that knowledge that I could see the other reading.

Oh, that's funny. I was thinking it was. I guess it's a software company but not an internet company.

I think the holiday pay example is more glaring. Seen in isolation, I would be confused as to what on earth it was getting at. Furthermore, the summary is no good. The abstract isn't either. My summary would be: British Gas continues to fight EU court's decision that commission be included in holiday pay. "Continues" is used here to emphasize that the case is not yet over.

On the other hand, the football summary is exemplary; better than the provided abstract.

IMO at least the second example shown is already poor, or at least not much better than what sites like SMMRY[1] have been providing for years.

> hainan to curb spread of diseases

That sentence pretty much conveys no useful information - every city wants to "curb spread of diseases", so what has actually changed? The news here is about restriction on livestock, and even a student journalist would be expected to do better than this headline.

To be clear I'm excited about the idea and believe machine learning has much better potential for enormous refinements compared to SMMRY's method (as described by them[2]), I just don't think it's as "done" as a lot of people here seem to assume it to be.

[1] http://smmry.com/ [2] http://smmry.com/about

This may be the wrong context, but if I am starting out learning machine learning using the Stanford course or some other one, is TensorFlow a good candidate to look into? Or does it only contain the advanced algorithms?

TF is great once you start doing neural networks. It does have support for other things too, but there are better documented frameworks.

Start with Scikit-Learn or R.

Once you start doing neural network stuff, start with Keras on top of TF.

I remember there was a Hacker News comment I read once where the user created an emacs plugin to do just this. It would generate a single-sentence summary of whatever text was input into it.

Edit: Hey, found it! https://github.com/mck-/oneliner

Probably not as sophisticated but does the business. Nice bit of work done on this.

The following appears in the README.md

    # Run the eval. Try to avoid running on the same matchine as training.

Why should one avoid that?


I would love to use this kind of tool and look for gotchas or key hidden elements in legal documents. It's not exactly summarisation, but very useful. Something like bubbling up the important elements of the fine print.

I seem to recall someone used Microsoft's AutoSummarize feature to reduce and reduce classical works of literature to a few lines. The results were pretty hilarious, but I can't find it now.

Yep. That's the one. Thanks.

I wonder what it'd be like on a novel, say something like 'pride and prejudice', would it be able to essentially summarise the plot or would it end up like 'movie plots explained badly'

Either way, this is great research with a ton of real world applications!

> Although this task serves as a nice proof-of-concept, we started looking at more difficult datasets where reading the entire document is necessary to produce good summaries. In those tasks training from scratch with this model architecture does not do as well as some other techniques we’re researching, but it serves as a baseline.

That would suggest that this method doesn't work well for long documents.

A student in a course I TA'd in college used Microsoft Word's built-in summarization to reduce books to ten words. The results were hilarious.


by classifying the emotional arcs for a filtered subset of 1,737 stories from Project Gutenberg's fiction collection, we find a set of six core trajectories which form the building blocks of complex narratives. We strengthen our findings by separately applying optimization, linear decomposition, supervised learning, and unsupervised learning


It'll be more remarkable if a machine can read a play script and summarize the plot.

I doubt most humans can even do this well.
