Build your own summary tool (thetokenizer.com)
131 points by shlomib 1633 days ago | 22 comments



Automatic summarization is the topic of my MS thesis. In the research literature there are two types of summarization: abstraction and extraction. I agree with the post: abstraction, i.e. paraphrasing the text, is still the holy grail of automatic summarization. Extraction, on the other hand, which just lifts the most important sentences out of the text, is the method used by Summly (mentioned in the post) and also by my algorithm. The problem with extraction is that the extracted sentences sometimes don't connect well with each other.

There are also some standard features used in automatic summarization: the title or headline of the document; sentence position, i.e. where the sentence is located in the text (introduction, body, or conclusion); sentence length, i.e. how many words are in the sentence; and lastly keyword frequency, i.e. how often a word appears in the text. There are other features too, but I think those four are the most important.

How you compute those feature scores, which stop words you remove, and the constants you choose will all affect the output.
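
To make that concrete, here is a rough sketch of how those four features could be combined into a single sentence score. This is just an illustration, not my thesis algorithm, and the weights are made-up placeholders:

    import re
    from collections import Counter

    STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to", "is", "it"}

    def tokenize(text):
        return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]

    def score_sentences(title, sentences, weights=(0.3, 0.2, 0.1, 0.4)):
        """Combine title overlap, position, length, and keyword frequency.
        The weights are arbitrary placeholders, not tuned values."""
        title_words = set(tokenize(title))
        freqs = Counter(w for s in sentences for w in tokenize(s))
        top = max(freqs.values()) if freqs else 1
        n = len(sentences)
        scores = []
        for i, sentence in enumerate(sentences):
            words = tokenize(sentence)
            if not words:
                scores.append(0.0)
                continue
            title_score = len(title_words & set(words)) / float(len(title_words) or 1)
            position_score = 1.0 - float(i) / n            # earlier sentences score higher
            length_score = min(len(words), 20) / 20.0      # cap very long sentences
            keyword_score = sum(freqs[w] for w in words) / float(len(words) * top)
            w1, w2, w3, w4 = weights
            scores.append(w1 * title_score + w2 * position_score +
                          w3 * length_score + w4 * keyword_score)
        return scores

Pick the top-scoring sentences (in document order) and you have a crude extractive summary.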

Lastly, to evaluate your summary you may want to use the ROUGE evaluation toolkit (http://www.berouge.com). It needs a reference summary created by a human for comparison, and it determines the quality of your summary in terms of precision, recall, and F-score. ROUGE has different methods of evaluating a summary: ROUGE-L considers the longest common subsequence, ROUGE-W adds weighting, and there are more which I don't remember.
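
For intuition, ROUGE-N is essentially n-gram overlap between your system summary and the human reference. A toy ROUGE-1 calculation looks roughly like this (the real toolkit adds stemming, stop-word handling, and many more variants):

    from collections import Counter

    def rouge_1(candidate, reference):
        """Toy ROUGE-1: clipped unigram overlap between candidate and reference."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum((cand & ref).values())                    # clipped matches
        precision = float(overlap) / max(sum(cand.values()), 1)
        recall = float(overlap) / max(sum(ref.values()), 1)
        f_score = 2 * precision * recall / (precision + recall) if overlap else 0.0
        return precision, recall, f_score

    print(rouge_1("the cat sat on the mat", "a cat was sitting on the mat"))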

That's it. I'm really a fan of automatic summarization and also hoping to create a good algorithm for abstraction.

Edit: sorry for the lack of links and references, it's hard to add those on mobile.


I totally agree! There is a huge gap between my naive algorithm and a real working one. The idea of this post was just to introduce the “automatic summarization world” to those who aren't familiar with it at all.


Yeah. But the intersection formula you used is not in my algorithm. Maybe I can use it to improve my algorithm. ;) Cheers!

Excellent post by the way. I'm really happy seeing posts about automatic summarization.


You may be interested in checking out an automatic summarizer I developed this summer. It uses numerous techniques to weight sentences for extraction - https://github.com/shanedownfall/CNGLSummarizer


I did not know that the method used by Summly was publicly known (though I knew they licensed it from SRI, so chances for that were high). Is the algorithm used by Wavii also known?


Hi, I don't know the exact method used by Summly, but based on their output they are just doing extraction. How they extract those sentences, I don't know.

I hadn't really heard of Wavii before their acquisition, so I don't know what they actually do. Sorry.


Why not use a well-tested, fast, off-the-shelf summary API? https://www.mashape.com/stremor/

We do all the hard work. We even do the content extraction from the page.

And yes I have ties to this product, I'm the CTO of Stremor.

No, that doesn't bias me against you building your own. I just know that we have 7.5 million words in our code, and I need more to get it perfect. Most people don't have time to assign traits to that many words. (WordNet has about 350k.)

Edit:

Also, you have to deal with sentence boundary disambiguation. That is hard, which is why NLTK and CoreNLP suck at it.
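
For example, abbreviations and quotes are exactly where naive splitters fall over. Even NLTK's Punkt model (which you have to download separately) only gets you part of the way on text like this:

    import nltk
    # nltk.download("punkt")  # one-time download of the Punkt sentence model
    from nltk.tokenize import sent_tokenize

    text = ('Dr. Smith met Sen. Jones at 5 p.m. on Monday. '
            'They discussed the U.S. budget... and then left. '
            '"Was it productive?" she asked.')

    for s in sent_tokenize(text):
        print(repr(s))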

Multi-word nouns and pronouns mean that you can't just use tokens, because "the President", "President Obama" and "he" are all the same entity but different "tokens". You have to resolve that when summarizing.
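
To illustrate with a hand-rolled alias table (purely hypothetical; a real system would use proper coreference resolution), merging mentions before counting frequency changes the picture completely:

    from collections import Counter

    # Hypothetical, hand-built alias table for one document.
    ALIASES = {
        "the president": "barack obama",
        "president obama": "barack obama",
        "he": "barack obama",   # only safe if the context actually supports it
    }

    def normalize_mention(mention):
        return ALIASES.get(mention.lower(), mention.lower())

    mentions = ["The President", "President Obama", "He", "Congress"]
    print(Counter(normalize_mention(m) for m in mentions))
    # Counter({'barack obama': 3, 'congress': 1})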


As with most API services for text analytics, sending confidential data out to third parties is simply not acceptable. I run into this almost daily where I work. Contractually, we are not allowed to send our clients' data to a fly-by-night startup, nor to an established 800-pound gorilla.

The other part is that in text analytics the training corpus often matters significantly (though this seems less of an issue with summaries, at first glance).


Then you license it for on-site use.

The training corpus doesn't matter one bit. The type of document does. (Report, op-ed, news, thesis, fiction, etc.)

We even do query-based summarization: summarize a 1000-page report looking only at things related to surgeries, or to IT.

We have hundreds of traits you can mix and match. Want to read The Snows of Kilimanjaro with only the parts that are directly about winter, or that foreshadow winter themes? We can do that.

The Lord of the Rings with only the parts where evil is encountered? We can do that.


Licensing it runs into another hurdle: how much value does a given feature give you? Is the feature valuable enough to justify custom licensing costs? Would 100 lines of Python/NLTK give you enough value for your target market?

I'm researching this right now for theme clustering. There are some products available for purchase/license that provide reasonable results. The question is whether the difference between those reasonable results and 100 lines of Python/Clojure sitting on top of Storm is enough to justify the cost of the license. That question comes down to how much value our 100-line knockoff gets our clients versus the extras that a full-featured solution would bring. The few times I've done the comparison before, it has not been worth buying.

Edit: I want to make it clear I'm not saying build it yourself 100%, or that building it yourself with NLTK/OpenNLP/etc. will give you a reasonable facsimile in all (maybe any) instances. If summarization is a core feature of your product, by all means, license away. What I'm specifically referring to are situations where I'd like to add a feature, but it's highly unlikely that this single feature alone is going to drive a sale. In those circumstances, building it yourself with NLTK or the like is extremely attractive.


The better question is: if your clients care so much about privacy, is it worth the risk that you shorten three sentences in such a way that

"My client, Mr. Smith would only kill in self-defense. On the other hand Mr. Jones is a cold blooded killer. He has killed dozens of people"

Becomes "My client, Mr. Jones is a cold blooded Killer" Because of sentence parsing.

Or

"Smith would only kill in self-defense. He has killed dozens of people." Because you used keyword density.


Is there someone at Stremor I should specifically talk to? When I looked at your site recently, referred by a Forbes article, I got the impression your summarization technology had become a closed solution (available only within your own products).

Or can I just access it through Mashape, perhaps?


Yes, our summarization technology is now available as an API at the Mashape link in drakaal's post.

If you have further questions, feel free to email us: support at stremor dot com.


Excellent post on the basics! As you mentioned in your post, the real work begins once you have the algorithm in place.

Scraping is a challenge, as Readability, boilerpipe, and so on are only so effective.

Sentence parsing is extremely tricky: support for surnames, URLs, ellipses, quotations, and all the other weird language things.

This is the simplest implementation of TextRank I've seen, and a lot of credit goes to the author for writing it in such a way. If anyone is interested, you can learn more about TextRank from the academic paper, which explains it quite nicely (http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Mihalcea.pdf).
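
For anyone curious, the core of TextRank is small: build a sentence similarity graph and run PageRank-style iterations over it. A rough sketch (with a simplified overlap similarity rather than the paper's exact log-normalized formula):

    import re
    from itertools import combinations

    STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "and", "to", "is"})

    def content_words(sentence):
        return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS}

    def textrank(sentences, d=0.85, iterations=30):
        """Rank sentences with a simplified TextRank over a word-overlap graph."""
        n = len(sentences)
        sets = [content_words(s) for s in sentences]
        sim = [[0.0] * n for _ in range(n)]
        for i, j in combinations(range(n), 2):
            if sets[i] and sets[j]:
                sim[i][j] = sim[j][i] = float(len(sets[i] & sets[j])) / (len(sets[i]) + len(sets[j]))
        scores = [1.0] * n
        for _ in range(iterations):
            new_scores = []
            for i in range(n):
                rank = 0.0
                for j in range(n):
                    if j != i and sim[j][i] > 0.0:
                        rank += sim[j][i] * scores[j] / sum(sim[j])
                new_scores.append((1 - d) + d * rank)
            scores = new_scores
        return sorted(range(n), key=lambda i: scores[i], reverse=True)  # best first

The top-ranked sentence indices, re-sorted into document order, give you the extractive summary.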

Note: just released http://summary.io


Hi, I just saw your http://summary.io. I also have a challenge extracting the text from websites. I used boilerpipe, but it didn't perform well for me. I then switched to Diffbot (http://diffbot.com/). It extracts the text much better, but there are still some minor problems. May I know what you are using?

An additional question: how do you extract the images from the text?

Btw, your http://summary.io is almost the same as my http://readborg.com/ (sorry, login required and it's still invitation only).


As I wrote in the post, I use Goose (https://github.com/xgdlm/python-goose) in my product. Goose also returns the URL of the main image! (sometimes it even includes its dimensions)
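
For anyone who hasn't tried it, basic usage looks roughly like this (going by the python-goose README; attribute names may differ between versions, and the URL below is just a placeholder):

    from goose import Goose

    g = Goose()
    article = g.extract(url='http://example.com/some-article')  # placeholder URL

    print(article.title)
    print(article.cleaned_text[:200])    # main body text
    if article.top_image:
        print(article.top_image.src)     # URL of the main image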


I ported this to Node.js and created a module called node-summary. It doesn't quite bring back the same results, but it's still summarizing nicely.

Check it out on GitHub https://github.com/jbrooksuk/node-summary or NPM registry https://npmjs.org/package/node-summary


Hi guys, I turned this into an API - https://www.mashape.com/ismaelc/summarizer-tool#!documentati... I will blog about it tomorrow. There is also a list of 50 Machine Learning APIs here - http://bit.ly/mlapis


Shouldn't the intersection function be "how many tokens in common" rather than just "how many tokens"? Otherwise I'm missing something. Going to read the code for that function now, but in the article it wasn't clicking for me.

edit: the code was easier than expected. Yes, it's how many in common, as it's just using Python's set 'intersection' function.


Yes, you're right! I wrote "how many common tokens we have"; maybe it should be "how many tokens we have in common"...
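
For anyone reading along, the idea is just set intersection over the two sentences' tokens, normalized by their average length; something along these lines (a sketch, not necessarily the exact code from the post):

    def sentences_intersection(sent1, sent2):
        """Score two sentences by how many tokens they have in common,
        normalized by their average length."""
        s1 = set(sent1.lower().split())
        s2 = set(sent2.lower().split())
        if not s1 or not s2:
            return 0.0
        return len(s1 & s2) / ((len(s1) + len(s2)) / 2.0)

    print(sentences_intersection("the cat sat on the mat", "a cat on a red mat"))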


I've been using libots - http://libots.sourceforge.net/ - for many years now in various projects, and it suffices in many of the cases I encounter.


"tl;dr" should do this until some human writes a summary :) (and then A/B test against the human version...)

Another idea would be to help humans write the summary by picking important phrases.



