There are also some features commonly considered in automatic summarization: the title or headline of the document; sentence position, i.e. where the sentence is located in the text (introduction, body, or conclusion); sentence length, i.e. how many words are in the sentence; and lastly, keyword frequency, i.e. how often the words appear in the text. There are other features too, but I think those four are the most important.
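To make that concrete, here's a rough sketch of how those four features could be combined into a sentence score. The function name, weights, and feature formulas are all illustrative assumptions, not from any particular paper:

```python
def score_sentence(sentence, position, total_sentences, title_words, word_freq):
    """Toy sentence scorer combining the four features described above."""
    words = sentence.lower().split()
    # Title/headline: fraction of title words that appear in the sentence.
    title_score = len(set(words) & title_words) / len(title_words) if title_words else 0.0
    # Position: sentences near the start (introduction) score higher here.
    position_score = 1.0 - position / total_sentences
    # Length: word count, clipped so very long sentences don't dominate.
    length_score = min(len(words), 20) / 20
    # Keyword frequency: average corpus frequency of the sentence's words.
    freq_score = sum(word_freq.get(w, 0) for w in words) / len(words) if words else 0.0
    return title_score + position_score + length_score + freq_score
```

A real system would weight these terms (and probably normalize the frequency feature), but even this crude sum captures the intuition.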
How you compute those feature scores, which stop words you remove, and the constants you choose will all affect the output.
Lastly, to evaluate your summary, you may want to use the ROUGE evaluation toolkit (http://www.berouge.com). It needs a reference summary created by a human for comparison, and it determines the quality of your summary based on precision, recall, and F-score. ROUGE has different methods for evaluating a summary: ROUGE-L considers the longest common subsequence, ROUGE-W adds weights to that, and there are several more that I don't remember.
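The precision/recall/F-score idea can be shown with a minimal ROUGE-1 (unigram overlap) sketch. This is only the core calculation; the real toolkit also handles stemming, stop-word removal, multiple references, and the n-gram/LCS variants mentioned above:

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1: unigram overlap between candidate and reference summaries."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_score
```

For example, `rouge_1("the cat sat", "the cat sat on the mat")` gives perfect precision but only 0.5 recall, since the candidate covers half the reference's words.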
That's it. I'm really a fan of automatic summarization and also hoping to create a good algorithm for abstraction.
Edit: sorry for the lack of links and references, it's hard to do those in mobile.
Excellent post by the way. I'm really happy seeing posts about automatic summarization.
I hadn't really heard of Wavii before their acquisition, so I don't know what they really did. Sorry.
We do all the hard work. We even do the content extraction from the page.
And yes I have ties to this product, I'm the CTO of Stremor.
No, that doesn't bias me against you building your own. I just know that we have 7.5 million words in our code and I need more to get it perfect. Most people don't have time to assign traits to that many words. (WordNet has about 350k.)
You also have to deal with sentence disambiguation. That is hard, which is why NLTK and CoreNLP suck at it.
Polyword nouns and pronouns mean that you can't just use tokens, because "the President", "President Obama", and "He" are all the same "token". You have to rectify that when summarizing.
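As a toy illustration of that rectification step, here is a hand-written alias table mapping surface mentions to one canonical entity. The table itself is hypothetical; a real system would derive it with coreference resolution rather than by hand:

```python
# Hypothetical alias table: every surface mention maps to one canonical entity.
# In practice this mapping comes from coreference resolution, not a hand list.
ALIASES = {
    "the president": "president obama",
    "he": "president obama",
}

def canonical(mention):
    """Map a mention to its canonical entity; fall back to the lowercased form."""
    return ALIASES.get(mention.lower(), mention.lower())
```

Once mentions are canonicalized, "the President", "President Obama", and "He" all count as the same entity when you tally frequencies for summarization.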
The other part is that in text analytics the training corpus often matters significantly (though this seems less so with summaries, at first glance).
The training corpus doesn't matter one bit. The type of document does (report, op-ed, news, thesis, fiction, etc.).
We even do summaries based on a query: summarize a 1000-page report looking only at things related to surgeries, or to IT.
We have hundreds of traits you can mix and match. Want to read the Snows of Kilimanjaro with only the parts that are directly about winter, or foreshadow winter themes? We can do that.
Lord of the Rings only when encountering evil. We can do that.
I'm researching this right now for theme clustering. There are some products available for purchase/license that provide reasonable results. The question is whether the difference between these reasonable results and 100 lines of Python/Clojure sitting on top of Storm is enough to justify the cost of the license. That question comes down to how much value our 100-line knockoff gets our clients versus the extras that a full-featured solution would bring. In the few times I've done the comparison before, it has not been worth buying.
Edit: I want to make it clear I'm not saying build it yourself 100%, or that building it yourself will give you a reasonable facsimile in all (maybe any) instances with NLTK/OpenNLP/etc. If summarizing is a core feature of your product, by all means, license away. What I'm specifically referencing are situations where maybe I'd like to add a feature, but it's highly unlikely that the single feature alone is going to drive a sale. In those circumstances, building it yourself with NLTK or the like is extremely attractive.
"My client, Mr. Smith would only kill in self-defense. On the other hand Mr. Jones is a cold blooded killer. He has killed dozens of people"
Becomes "My client, Mr. Jones is a cold blooded killer" because of sentence parsing.
Or "Smith would only kill in self-defense. He has killed dozens of people." because you used keyword density.
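That second failure mode is easy to reproduce. The sketch below scores sentences by raw keyword hits with no coreference, and for a query about Smith and killing it selects exactly the misleading pair, because nothing links "He" back to Mr. Jones (the query set and scoring function here are my own illustration):

```python
sentences = [
    "My client, Mr. Smith would only kill in self-defense.",
    "On the other hand Mr. Jones is a cold blooded killer.",
    "He has killed dozens of people.",
]

def keyword_score(sentence, keywords):
    """Count words starting with any query keyword (crude stemming by prefix)."""
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return sum(w.startswith(k) for w in words for k in keywords)

# Query: sentences about Smith and killing.
query = {"smith", "killed"}
ranked = sorted(sentences, key=lambda s: keyword_score(s, query), reverse=True)
summary = ranked[:2]
# summary pairs the Smith sentence with "He has killed dozens of people." --
# the damning pronoun sentence, which actually refers to Mr. Jones.
```

With a coreference step (mapping "He" to Mr. Jones), the third sentence would no longer attach itself to Smith.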
Or can I access it just through mashape perhaps?
If you have further questions, feel free to email us: support at stremor dot com.
Scraping is a challenge, as Readability, Boilerpipe, and so on are only so effective.
Sentence parsing is extremely tricky: you need support for surnames, URLs, ellipses, quotations, and all the other weird language things.
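To show why naive splitting fails, here's a sketch that splits on sentence-final punctuation but refuses to break after a small abbreviation list. The list is an illustrative stand-in; a production splitter needs far more cases (URLs, ellipses, initials, decimal numbers, quotations, ...):

```python
import re

# Illustrative abbreviation list -- real splitters need hundreds of cases.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}

def split_sentences(text):
    """Split on ., !, ? followed by whitespace, but never after a known abbreviation."""
    pieces = re.split(r"(?<=[.!?])\s+", text)
    sentences, buffer = [], ""
    for piece in pieces:
        buffer = f"{buffer} {piece}".strip() if buffer else piece
        last_word = buffer.split()[-1].lower()
        if last_word not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences
```

A plain split on `". "` would cut "Mr. Smith met Dr. Jones." into three fragments; the abbreviation check keeps it intact.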
This is the simplest implementation of TextRank I've seen, and a lot of credit goes to the author for writing it in such a way. If anyone is interested, you can find out more about TextRank from the academic paper, which explains it quite nicely (http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf).
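For readers who want the gist without the paper: TextRank builds a graph whose nodes are sentences, weights edges by normalized word overlap, and runs PageRank-style power iteration to score each sentence. This is my own compact sketch of that idea (following Mihalcea & Tarau's similarity formula), not the linked implementation:

```python
import math

def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with a minimal TextRank: overlap graph + power iteration."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)

    def similarity(i, j):
        # Word overlap, normalized by sentence lengths (as in the paper).
        overlap = len(words[i] & words[j])
        norm = math.log(len(words[i]) + 1) + math.log(len(words[j]) + 1)
        return overlap / norm if norm else 0.0

    weights = [[similarity(i, j) if i != j else 0.0 for j in range(n)]
               for i in range(n)]
    out = [sum(row) for row in weights]  # total outgoing weight per sentence

    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - damping) + damping * sum(
                      weights[j][i] / out[j] * scores[j]
                      for j in range(n) if out[j] > 0)
                  for i in range(n)]
    return scores
```

A summary is then just the top-k sentences by score, re-emitted in document order. Sentences that share words with many others accumulate score; a sentence with no overlap at all bottoms out at `1 - damping`.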
Note: just released http://summary.io
An additional question: how do you extract the images in the text?
Btw, your http://summary.io is almost the same as my http://readborg.com/ (sorry, login required and it's still invitation only).
Check it out on GitHub https://github.com/jbrooksuk/node-summary or NPM registry https://npmjs.org/package/node-summary
Edit: the code was easier to read than expected. Yes, it's how many words the sentences have in common, as it's just using Python's 'intersection' function.
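That similarity measure is literally one line. A minimal sketch of what "how many in common" means here (the function name is mine):

```python
def words_in_common(a, b):
    """Number of distinct words shared by two sentences, via set intersection."""
    return len(set(a.lower().split()).intersection(set(b.lower().split())))
```

So "the cat sat" and "the dog sat" share two words, and that count is the edge weight between the two sentences.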
Another idea would be to help humans write the summary by picking important phrases.