
Build your own summary tool - shlomib
http://thetokenizer.com/2013/04/28/build-your-own-summary-tool/
======
MojoJolo
Automatic summarization is my MS thesis topic. In my research, there are two
types of summary: abstraction and extraction. I agree with the post:
abstraction, or paraphrasing the text, is still a holy grail in automatic
summarization. On the other hand, extraction, which just lifts the most
important sentences from the text, is the method used by Summly (mentioned in
the post) and also in my algorithm. The problem with extraction is that the
extracted sentences sometimes don't connect well with each other.

There are also some features that are commonly considered in automatic
summarization: the title or headline of the document; sentence position, i.e.
where the sentence is located in the text (introduction, body, or conclusion);
sentence length, i.e. how many words are in the sentence; and lastly, keyword
frequency, i.e. how many times each word appears in the text. There are other
features too, but I think those four are the most important.

The way you compute those feature scores, the stop words you filter, and some
constants will all affect the output.
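A rough sketch of how those four features might be combined (the stop-word
list, weights, and normalizations here are invented for illustration; any real
system would tune them):

```python
import re
from collections import Counter

# Invented stop-word list, purely for illustration.
STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}

def tokenize(text):
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]

def score_sentences(title, sentences):
    """Score each sentence with the four features mentioned above."""
    title_words = set(tokenize(title))
    freq = Counter(w for s in sentences for w in tokenize(s))
    scores = []
    for i, sent in enumerate(sentences):
        words = tokenize(sent)
        if not words:
            scores.append(0.0)
            continue
        # 1. overlap with the title/headline
        title_score = len(title_words & set(words)) / max(len(title_words), 1)
        # 2. position: earlier sentences score higher (intros often summarize)
        position_score = 1.0 - i / len(sentences)
        # 3. length: cap at an arbitrary "ideal" of 20 words
        length_score = min(len(words), 20) / 20
        # 4. average corpus frequency of the sentence's words
        keyword_score = sum(freq[w] for w in words) / len(words)
        scores.append(title_score + position_score + length_score + keyword_score)
    return scores
```

Taking the top-scoring sentences then gives you an extractive summary.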

Lastly, to evaluate your summary, you may want to use the ROUGE evaluation
toolkit (<http://www.berouge.com>). It needs a reference summary created by a
human as a comparison, and it determines the quality of your summary based on
precision, recall, and F-score. ROUGE has different methods to evaluate a
summary: ROUGE-L considers the longest common subsequence, ROUGE-W adds
weights, and there are more that I don't remember.
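To get a feel for what ROUGE reports, here is a toy ROUGE-1-style computation
(unigram overlap only; this is not the official toolkit, which does much more):

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Toy unigram-overlap precision/recall/F-score (not the official ROUGE)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped matching-unigram count
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_score
```

So a candidate whose words all appear in the reference gets perfect precision,
but low recall if it misses most of the reference.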

That's it. I'm really a fan of automatic summarization and also hoping to
create a good algorithm for abstraction.

Edit: sorry for the lack of links and references; it's hard to add those on
mobile.

~~~
shlomib
I totally agree! There is a huge gap between my naive algorithm and a real
working one. The idea of this post was just to introduce the “automatic
summarization world” to those who aren't familiar with it at all.

~~~
MojoJolo
Yeah, but the intersection formula you used is not in my algorithm. Maybe I
can use it to improve mine. ;) Cheers!

Excellent post by the way. I'm really happy seeing posts about automatic
summarization.

------
drakaal
Why not use a well-tested, fast, off-the-shelf summary API?
<https://www.mashape.com/stremor/>

We do all the hard work, and even do the content extraction from the page.

And yes I have ties to this product, I'm the CTO of Stremor.

No, that doesn't bias me against you building your own. I just know that we
have 7.5 million words in our code and need more to get it perfect. Most
people don't have time to assign traits to that many words. (WordNet has about
350k.)

Edit:

Also, you have to deal with sentence disambiguation. That is hard, which is
why NLTK and CoreNLP struggle with it.

Multiword nouns and pronouns mean that you can't just use tokens, because "the
President", "President Obama", and "He" can all refer to the same entity. You
have to resolve that when summarizing.
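A naive way to see the problem: collapse known aliases to one canonical token
before counting frequencies. The alias table below is hand-made for this one
example; real coreference resolution has to infer these mappings from context
with a full NLP pipeline.

```python
import re

# Hand-made alias table, purely for illustration; a real system must
# infer these mappings from context.
ALIASES = {
    "the president": "obama",
    "president obama": "obama",
    "he": "obama",  # only safe once the antecedent is already known
}

def normalize(text):
    """Rewrite known aliases to a canonical token before counting."""
    text = text.lower()
    for alias, canonical in ALIASES.items():
        text = re.sub(r"\b" + re.escape(alias) + r"\b", canonical, text)
    return text
```

Without this step, a frequency-based summarizer counts "the President",
"President Obama", and "He" as unrelated tokens and underweights the entity.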

~~~
emidln
As with most of the API services for text analytics, sending confidential data
out to third parties is simply not acceptable. I run into this almost daily
where I work. Contractually, we are not allowed to send our clients' data to a
fly-by-night startup, nor to an established 800-pound gorilla.

The other part is that often in text analytics, the training corpus matters
significantly (also this seems less so with summaries at first glance).

~~~
drakaal
Then you license it for on-site use.

The training corpus doesn't matter one bit. The type of document does
(report, op-ed, news, thesis, fiction, etc.).

We even do summaries based on a query: summarize a 1000-page report looking
only at things related to surgeries, or IT.

We have hundreds of traits you can mix and match. Want to read the Snows of
Kilimanjaro with only the parts that are directly about winter, or foreshadow
winter themes? We can do that.

Lord of the Rings only when encountering evil. We can do that.

~~~
emidln
Licensing runs into another hurdle: how much value does a given feature give
you? Is the feature valuable enough to justify custom licensing costs? Would
100 lines of Python/NLTK give you enough value for your target market?

I'm researching this right now for theme clustering. There are some products
available for purchase/license that provide reasonable results. The question
is whether the difference between those reasonable results and 100 lines of
Python/Clojure sitting on top of Storm is enough to justify the cost of the
license. That comes down to how much value our 100-line knockoff gets our
clients versus the extras a full-featured solution would bring. In the few
times I've done the comparison before, it has not been worth buying.

Edit: I want to make it clear I'm not saying build it yourself 100%, or that
building it yourself will give you a reasonable facsimile in all (maybe any)
instances with NLTK/OpenNLP/etc. If summarizing is a core feature of your
product, by all means, license away. What I'm specifically referring to are
situations where I'd like to add a feature, but it's highly unlikely that the
single feature alone is going to drive a sale. In those circumstances,
building it yourself with NLTK or the like is extremely attractive.

~~~
drakaal
The better question, if your clients care so much about privacy, is the risk
that you shorten three sentences in such a way that

"My client, Mr. Smith, would only kill in self-defense. On the other hand, Mr.
Jones is a cold-blooded killer. He has killed dozens of people."

becomes "My client, Mr. Jones is a cold-blooded killer" because of sentence
parsing.

Or

"Smith would only kill in self-defense. He has killed dozens of people."
because you used keyword density.

------
samsnelling
Excellent post on the basics! As you mentioned in your post, the real work
begins once you have the algorithm in place.

Scraping is a challenge, as Readability, Boilerpipe, and the like are only so
effective.

Sentence parsing is extremely tricky: you need support for surnames, URLs,
ellipses, quotations, and all the other weird language things.
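As a taste of why it's tricky, here's a minimal splitter that protects a few
known abbreviations before splitting on end punctuation. The abbreviation list
is just a stand-in; real text needs many more cases:

```python
import re

# A tiny, incomplete abbreviation list; a real splitter needs far more,
# plus handling for URLs, ellipses, and quotations.
ABBREVIATIONS = ["Mr.", "Mrs.", "Dr.", "e.g.", "i.e."]

def split_sentences(text):
    """Split on sentence-ending punctuation, protecting known abbreviations."""
    # Hide abbreviation periods behind placeholder tokens.
    for i, abbr in enumerate(ABBREVIATIONS):
        text = text.replace(abbr, f"<ABBR{i}>")
    # Split after ., !, or ? followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text)
    # Restore the abbreviations.
    restored = []
    for part in parts:
        for i, abbr in enumerate(ABBREVIATIONS):
            part = part.replace(f"<ABBR{i}>", abbr)
        restored.append(part)
    return restored
```

Without the protection step, "Mr. Smith arrived." would be split into two
bogus sentences at the period in "Mr.".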

This is the simplest implementation of TextRank I've seen, and a lot of credit
to the author for writing it in such a way. If anyone is interested, you can
find out more about TextRank in the academic paper that introduced it, which
explains it quite nicely (<http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf>).
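For the curious, a bare-bones TextRank-style sketch: sentence similarity
roughly as defined in the paper, plus a simple power iteration. A real
implementation would filter stop words and handle many more edge cases:

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity normalized by sentence lengths, as in the paper."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if len(w1) < 2 or len(w2) < 2:
        return 0.0
    return len(w1 & w2) / (math.log(len(w1)) + math.log(len(w2)))

def textrank(sentences, damping=0.85, iterations=50):
    """Rank sentences by iterating the weighted PageRank update."""
    n = len(sentences)
    # Pairwise similarity graph (no self-loops).
    sim = [[similarity(a, b) if i != j else 0.0
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out_weight = [sum(row) for row in sim]  # total outgoing weight per node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n) if sim[j][i] and out_weight[j]
            )
            for i in range(n)
        ]
    return scores
```

Sentences that share vocabulary with many other sentences end up with the
highest scores and become the summary.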

Note: just released <http://summary.io>

~~~
MojoJolo
Hi, I just saw your <http://summary.io>. I also have trouble extracting the
text from websites. I used Boilerpipe, but it doesn't perform well for me. I
then changed to Diffbot (<http://diffbot.com/>), which extracts the text much
better, but there are still some minor problems. May I ask what you are using?

An additional question: how do you extract the images in the text?

Btw, your <http://summary.io> is almost the same as my <http://readborg.com/>
(sorry, login required and it's still invitation only).

~~~
shlomib
As I wrote in the post, I use Goose (<https://github.com/xgdlm/python-goose>)
in my product. Goose also returns the URL of the main image! (sometimes it
even includes its dimensions)

------
jbrooksuk
I ported this to Node.js and created a module called node-summary. It doesn't
quite bring back the same results, but it's still summarizing nicely.

Check it out on GitHub <https://github.com/jbrooksuk/node-summary> or NPM
registry <https://npmjs.org/package/node-summary>

------
ismaelc
Hi guys, I turned this into an API:
<https://www.mashape.com/ismaelc/summarizer-tool#!documentation>
I will blog about it tomorrow. There is also a list of 50 Machine Learning
APIs here: <http://bit.ly/mlapis>

------
jasallen
Shouldn't the intersection function be "how many tokens in common" rather than
just "how many tokens"? Otherwise I'm missing something. Going to read the
code for that function now, but in the article that wasn't clicking for me.

edit: the code was easier than expected. Yes, it's how many in common, as it's
just using Python's set 'intersection' function.
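For reference, the scoring in the post boils down to something like this (my
paraphrase, not the author's exact code):

```python
def sentences_intersection(sent1, sent2):
    """Count of tokens two sentences share, normalized by their average size."""
    s1, s2 = set(sent1.lower().split()), set(sent2.lower().split())
    if not s1 or not s2:
        return 0.0
    # Set intersection gives the tokens in common.
    return len(s1 & s2) / ((len(s1) + len(s2)) / 2)
```

Each sentence's rank is then the sum of its intersection scores against every
other sentence.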

~~~
shlomib
Yes, you're right! I wrote "how many common tokens we have"; maybe it should
be "how many tokens we have in common"...

------
jnazario
I've been using libots (<http://libots.sourceforge.net/>) for many years now
in various projects, and it suffices in many of the cases I encounter.

------
swah
"tl;dr" should do this until some human writes a summary :) (and then A/B test
against the human version...)

Another idea would be to help humans write the summary by picking important
phrases.

